# Problem 83: Instance-Specific Hellinger Testing

Suggested by | Clément Canonne |
---|---|

Source | FOCS 2017 |

Short link | https://sublinear.info/83 |

Given the full description of a fixed distribution $q$ over a discrete domain (say $[n]=\{1,\dots,n\}$), as well as access to i.i.d. samples from an unknown probability distributions $p$ over $[n]$ and distance parameter $\varepsilon\in(0,1]$, the identity testing problem asks to distinguish w.h.p. between (i) $p=q$ and (ii) $\operatorname{d}_{\rm TV}(p,q)>\varepsilon$.

The sample complexity of this question as a function of $n$ and $\varepsilon$ is fully understood by now: $\Theta(\sqrt{n}/\varepsilon^2)$ are necessary and sufficient, the worst-case lower bound following from taking $q$ to be the uniform distribution on $[n]$. Valiant and Valiant [ValiantV-14] shown an *instance-specific* bound on this problem, where the sample complexity $\Psi_{\rm TV}$ now only depends on $\varepsilon$ and the (massive) parameter $q$ instead of $n$: namely, that
$$\Psi_{\rm TV}(q,\varepsilon) = \Theta\left(\max\left( \frac{\Phi(q,\Theta(\varepsilon))}{\varepsilon^2}, \frac{1}{\varepsilon}\right)\right)$$
samples were necessary and sufficient, where $\Phi$ is the functional defined by taking the $2/3$-pseudonorm of the vector of probabilities of $q$, once both the biggest element and $\varepsilon$ total mass of the smallest elements had been removed:
$
\Phi(q,\varepsilon) = \lVert q^{-\max}_{-\varepsilon} \rVert_{2/3}
$. Using different techniques, Blais, Canonne, and Gur [BlaisCG-17] then established a similar instance-specific bound, with regard to a different functional, the "K-functional $\kappa$ between $\ell_1$ and $\ell_2$ spaces:"
$
\Psi_{\rm TV}(q,\varepsilon)=\Omega\left({\kappa_p(1-\Theta(\varepsilon))}/{\varepsilon}\right), \Psi_{\rm TV}(q,\varepsilon)=O\left({\kappa_p(1-\Theta(\varepsilon))}/{\varepsilon^2}\right)
$.

Now, consider the exact same problem, but replacing the total variation $\operatorname{d}_{\rm TV}(p,q)$ by the *Hellinger distance*
$$
\operatorname{d}_{\rm H}(p,q) = \frac{1}{\sqrt{2}}\lVert\sqrt{p}-\sqrt{q}\rVert_2\,.
$$
Results of Daskalakis, Kamath, and Wright [DaskalakisKW-18] show that the *worst-case* sample complexity remains $\Theta(\sqrt{n}/\varepsilon^2)$. Moreover, due to the quadratic dependence between Hellinger and total variation distances, both instance-specific bounds mentioned above apply, yet with possibly a quadratic gap between upper and lower bounds in terms of $\varepsilon$: leading to bounds on the instance-specific sample complexity $\Psi_{\rm H}$ of Hellinger identity testing of
$$\Psi_{\rm TV}(q,\varepsilon) \leq \Psi_{\rm H}(q,\varepsilon) \leq \Psi_{\rm TV}(q,\varepsilon^2).$$

What is the right dependence on $\varepsilon$ of $\Psi_{\rm H}$?

*Note that in both instance-specific bounds obtained for $\Psi_{\rm TV}$, there exist (simple) examples of $q$ where $\varepsilon$ ends up *in the exponent,* so this quadratic gap is not innocuous even for constant $\varepsilon$.*