Problem 82: Beyond Identity Testing
Suggested by | Clément Canonne |
---|---|
Source | FOCS 2017 |
Short link | https://sublinear.info/82 |
Given access to i.i.d. samples from two unknown probability distributions and q over a discrete domain (say [n]=\{1,\dots,n\}) and distance parameter \varepsilon\in(0,1], the equivalence (also known as closeness) testing problem asks to distinguish w.h.p. between (i) p=q and (ii) \operatorname{d}_{\rm TV}(p,q)>\varepsilon. (The identity problem is the analogue when q is fixed and explicitly known beforehand.)
Equivalence up to a permutation would then be the variant where one must test whether p and q are equal up to relabeling of the elements: (i) \exists \pi\in\mathcal{S}_n s.t. p=q\circ\pi vs. (ii) \forall \pi\in\mathcal{S}_n, \operatorname{d}_{\rm TV}(p,q\circ \pi)>\varepsilon. Results of Valiant and Valiant [ValiantV-11] imply that this question has sample complexity \Theta\left(\frac{n}{\varepsilon^2\log n}\right).
The most general question then is:
- Given a fixed class \mathcal{F} of functions from [n] to [m], distinguish between (i) \exists \pi\in\mathcal{F} s.t. p=q\circ\pi vs. (ii) \forall \pi\in\mathcal{F}, \operatorname{d}_{\rm TV}(p,q\circ \pi)>\varepsilon.
In particular, one can study how the sample complexity depends on \mathcal{F}, or what it is for some classes of interest (e.g., n=m for \mathcal{F} a subgroup of the symmetric group \mathcal{S}_n; or m\ll n and \mathcal{F} being a class of “coarsenings,” capturing whether p and q are the same distribution but with a different discretization/binning).