Problem 83: Instance-Specific Hellinger Testing

From Open Problems in Sublinear Algorithms
Suggested by Clément Canonne
Source FOCS 2017
Short link https://sublinear.info/83

Given the full description of a fixed distribution q over a discrete domain (say [n]=\{1,\dots,n\}), as well as access to i.i.d. samples from an unknown probability distribution p over [n] and a distance parameter \varepsilon\in(0,1], the identity testing problem asks to distinguish w.h.p. between (i) p=q and (ii) \operatorname{d}_{\rm TV}(p,q)>\varepsilon.

The sample complexity of this question as a function of n and \varepsilon is by now fully understood: \Theta(\sqrt{n}/\varepsilon^2) samples are necessary and sufficient, with the worst-case lower bound obtained by taking q to be the uniform distribution on [n]. Valiant and Valiant [ValiantV-14] showed an instance-specific bound on this problem, where the sample complexity \Psi_{\rm TV} depends only on \varepsilon and the (known) distribution q instead of n: namely, \Psi_{\rm TV}(q,\varepsilon) = \Theta\left(\max\left( \frac{\Phi(q,\Theta(\varepsilon))}{\varepsilon^2}, \frac{1}{\varepsilon}\right)\right) samples are necessary and sufficient, where \Phi is the functional obtained by taking the 2/3-pseudonorm of the vector of probabilities of q after removing both the largest element and \varepsilon total mass from the smallest elements: \Phi(q,\varepsilon) = \lVert q^{-\max}_{-\varepsilon} \rVert_{2/3}. Using different techniques, Blais, Canonne, and Gur [BlaisCG-17] then established a similar instance-specific bound with respect to a different functional, the K-functional \kappa between the \ell_1 and \ell_2 spaces: \Psi_{\rm TV}(q,\varepsilon)=\Omega\left({\kappa_q(1-\Theta(\varepsilon))}/{\varepsilon}\right), \qquad \Psi_{\rm TV}(q,\varepsilon)=O\left({\kappa_q(1-\Theta(\varepsilon))}/{\varepsilon^2}\right).
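To make the functional \Phi concrete, here is a short Python sketch that computes it for an explicitly given q. The function name and the greedy rule used to strip \varepsilon mass from the tail are illustrative assumptions for this sketch, not taken from [ValiantV-14].

```python
import numpy as np

def phi(q, eps):
    """Sketch of the Valiant-Valiant functional Phi(q, eps): the
    2/3-pseudonorm of q after removing the single largest probability
    and eps total mass from the smallest ones (greedy trimming)."""
    q = np.sort(np.asarray(q, dtype=float))[::-1]  # sort descending
    q = q[1:]                                      # drop the largest element
    # strip up to eps total mass from the tail (smallest entries)
    mass, i = 0.0, len(q)
    while i > 0 and mass + q[i - 1] <= eps:
        i -= 1
        mass += q[i]
    q = q[:i]
    # 2/3-pseudonorm: (sum_j q_j^{2/3})^{3/2}
    return np.sum(q ** (2.0 / 3.0)) ** 1.5

# For the uniform distribution on [n], Phi is on the order of sqrt(n),
# recovering the worst-case Theta(sqrt(n)/eps^2) bound.
n = 10_000
print(phi(np.ones(n) / n, 0.1))
```

On the uniform distribution the trimming removes one element plus roughly \varepsilon n of the smallest, and the 2/3-pseudonorm of the remainder is \Theta(\sqrt{n}), matching the worst-case bound quoted above.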

Now, consider the exact same problem, but with the total variation distance \operatorname{d}_{\rm TV}(p,q) replaced by the Hellinger distance \operatorname{d}_{\rm H}(p,q) = \frac{1}{\sqrt{2}}\lVert\sqrt{p}-\sqrt{q}\rVert_2\,. Results of Daskalakis, Kamath, and Wright [DaskalakisKW-18] show that the worst-case sample complexity remains \Theta(\sqrt{n}/\varepsilon^2). Moreover, owing to the quadratic relation between Hellinger and total variation distances (\operatorname{d}_{\rm H}^2 \leq \operatorname{d}_{\rm TV} \leq \sqrt{2}\,\operatorname{d}_{\rm H}), both instance-specific bounds mentioned above apply, though with a possibly quadratic gap between upper and lower bounds in terms of \varepsilon, leading to the following bounds on the instance-specific sample complexity \Psi_{\rm H} of Hellinger identity testing: \Psi_{\rm TV}(q,\varepsilon) \leq \Psi_{\rm H}(q,\varepsilon) \leq \Psi_{\rm TV}(q,\varepsilon^2).
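The quadratic relation between the two distances can be checked numerically; the following minimal Python sketch (helper names are ours, not from the literature) computes both distances on random distributions and verifies \operatorname{d}_{\rm H}^2 \leq \operatorname{d}_{\rm TV} \leq \sqrt{2}\,\operatorname{d}_{\rm H}.

```python
import numpy as np

def tv(p, q):
    # total variation distance: half the L1 distance
    return 0.5 * np.abs(p - q).sum()

def hellinger(p, q):
    # d_H(p, q) = (1/sqrt(2)) * ||sqrt(p) - sqrt(q)||_2
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

rng = np.random.default_rng(0)
for _ in range(100):
    # random pairs of distributions over [100]
    p = rng.dirichlet(np.ones(100))
    q = rng.dirichlet(np.ones(100))
    dh, dtv = hellinger(p, q), tv(p, q)
    # standard sandwich: d_H^2 <= d_TV <= sqrt(2) * d_H
    assert dh ** 2 <= dtv <= np.sqrt(2) * dh
```

This sandwich is exactly what yields the \Psi_{\rm TV}(q,\varepsilon) \leq \Psi_{\rm H}(q,\varepsilon) \leq \Psi_{\rm TV}(q,\varepsilon^2) bounds: a Hellinger tester at distance \varepsilon is at least as hard as a TV tester at distance \varepsilon, and a TV tester at distance \Theta(\varepsilon^2) suffices for it.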

What is the right dependence on \varepsilon of \Psi_{\rm H}?

Note that in both instance-specific bounds obtained for \Psi_{\rm TV}, there are (simple) examples of q for which \varepsilon ends up in the exponent, so this quadratic gap is not innocuous even for constant \varepsilon.