Open Problems:83 - Revision history

Krzysztof Onak: cleaning the header

2017-11-08T14:54:34Z

cleaning the header

Krzysztof Onak: Krzysztof Onak moved page Waiting:Instance-optimal Hellinger testing to Open Problems:83 without leaving a redirect: The problem is ready for publication

2017-11-08T14:53:43Z

Krzysztof Onak moved page Waiting:Instance-optimal Hellinger testing to Open Problems:83 without leaving a redirect: The problem is ready for publication

Krzysztof Onak: Small fix.

2017-11-08T14:53:09Z

Small fix.

Krzysztof Onak: Moving a dot.

2017-11-08T06:23:56Z

Moving a dot.

Ccanonne at 21:24, 23 October 2017

2017-10-23T21:24:15Z

Ccanonne at 18:48, 20 October 2017

2017-10-20T18:48:25Z

Ccanonne: Created page with "{{Header |title=Instance-optimal Hellinger testing |source=focs17 |who=Clément Canonne }} Given the full description of a fixed distribution $q$ over a discrete domain (say $..."

2017-10-20T18:46:25Z

Created page with "{{Header |title=Instance-optimal Hellinger testing |source=focs17 |who=Clément Canonne }} Given the full description of a fixed distribution $q$ over a discrete domain (say $..."

New page

{{Header
|title=Instance-optimal Hellinger testing
|source=focs17
|who=Clément Canonne
}}
Given the full description of a fixed distribution $q$ over a discrete domain (say $[n]=\{1,\dots,n\}$), as well as access to i.i.d. samples from an unknown probability distribution $p$ over $[n]$ and a distance parameter $\varepsilon\in(0,1]$, the identity testing problem asks to distinguish w.h.p. between (i) $p=q$ and (ii) $\operatorname{d}_{\rm TV}(p,q)>\varepsilon$.

The sample complexity of this question as a function of $n$ and $\varepsilon$ is fully understood by now: $\Theta(\sqrt{n}/\varepsilon^2)$ samples are necessary and sufficient, the worst-case lower bound following from taking $q$ to be the uniform distribution on $[n]$. Valiant and Valiant {{cite|ValiantV-14}} showed an ''instance-optimal'' bound on this problem, where the sample complexity $\Psi_{\rm TV}$ now depends only on $\varepsilon$ and the (massive) parameter $q$ instead of $n$: namely, that
$$\Psi_{\rm TV}(q,\varepsilon) = \Theta\left(\max\left( \frac{\Phi(q,\Theta(\varepsilon))}{\varepsilon^2}, \frac{1}{\varepsilon}\right)\right)$$
samples are necessary and sufficient, where $\Phi$ is the functional defined by taking the $2/3$-pseudonorm of the vector of probabilities of $q$, once both the largest element and $\varepsilon$ total mass of the smallest elements have been removed:
$$
\Phi(q,\varepsilon) = \lVert q^{-\max}_{-\varepsilon} \rVert_{2/3}\,.
$$
Using different techniques, Blais, Canonne, and Gur {{cite|BlaisCG-17}} then established a similar instance-optimal bound, with regard to a different functional, the ''K-functional'' $\kappa_q$ between $\ell_1$ and $\ell_2$ spaces:
$$
\Psi_{\rm TV}(q,\varepsilon)=\Omega\left(\frac{\kappa_q(1-\Theta(\varepsilon))}{\varepsilon}\right), \qquad \Psi_{\rm TV}(q,\varepsilon)=O\left(\frac{\kappa_q(1-\Theta(\varepsilon))}{\varepsilon^2}\right).
$$
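
To make the functional $\Phi$ concrete, here is a minimal Python sketch of the informal description above. The function names are hypothetical, and the removal rule (greedily discarding the smallest elements as long as the total discarded mass stays at most $\varepsilon$) is one assumed reading of "$\varepsilon$ total mass of the smallest elements"; the precise convention in {{cite|ValiantV-14}} may differ in tie-breaking or constants.

```python
def two_thirds_pseudonorm(v):
    """Return (sum_i v_i^(2/3))^(3/2) for a vector of nonnegative reals."""
    return sum(x ** (2.0 / 3.0) for x in v) ** 1.5


def phi(q, eps):
    """Sketch of Phi(q, eps): the 2/3-pseudonorm of q after removing the
    largest element and (assumed convention) the smallest elements whose
    total mass is at most eps."""
    probs = sorted(q)      # ascending order
    probs = probs[:-1]     # remove the single largest element
    removed = 0.0
    start = 0
    # Greedily discard the smallest elements while the discarded mass stays <= eps.
    while start < len(probs) and removed + probs[start] <= eps:
        removed += probs[start]
        start += 1
    return two_thirds_pseudonorm(probs[start:])


uniform = [0.01] * 100     # uniform distribution on a domain of size n = 100
print(two_thirds_pseudonorm(uniform))
print(phi(uniform, 0.105))
```

For the uniform distribution on $[n]$, the $2/3$-pseudonorm before any removal is $(n\cdot n^{-2/3})^{3/2}=\sqrt{n}$, which matches the $\sqrt{n}$ factor in the worst-case bound.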

Now, consider the exact same problem, but replacing the total variation $\operatorname{d}_{\rm TV}(p,q)$ by the ''Hellinger distance''
$$
\operatorname{d}_{\rm H}(p,q) = \frac{1}{\sqrt{2}}\lVert\sqrt{p}-\sqrt{q}\rVert_2\,.
$$
Results of Daskalakis, Kamath, and Wright {{cite|DaskalakisKW-18}} show that the ''worst-case'' sample complexity remains $\Theta(\sqrt{n}/\varepsilon^2)$. Moreover, due to the quadratic dependence between Hellinger and total variation distances ($\operatorname{d}_{\rm H}^2 \leq \operatorname{d}_{\rm TV} \leq \sqrt{2}\operatorname{d}_{\rm H}$), both instance-optimal bounds mentioned above apply, yet with possibly a quadratic gap between upper and lower bounds in terms of $\varepsilon$. This leads to the following bounds on the instance-optimal sample complexity $\Psi_{\rm H}$ of Hellinger identity testing:
$$\Psi_{\rm TV}(q,\varepsilon) \leq \Psi_{\rm H}(q,\varepsilon) \leq \Psi_{\rm TV}(q,\varepsilon^2).$$
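
The relation between the two distances can be checked numerically. The following Python sketch (function names and the example distributions are illustrative assumptions, not part of the original problem statement) computes both distances with the normalization used above and verifies the standard inequality $\operatorname{d}_{\rm H}^2 \leq \operatorname{d}_{\rm TV} \leq \sqrt{2}\operatorname{d}_{\rm H}$ on one pair of distributions.

```python
import math


def tv_distance(p, q):
    """Total variation distance: half the l1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def hellinger_distance(p, q):
    """Hellinger distance: (1/sqrt(2)) * || sqrt(p) - sqrt(q) ||_2."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)


# Illustrative distributions over a domain of size 3 (arbitrary choice).
p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

d_tv = tv_distance(p, q)
d_h = hellinger_distance(p, q)

# The quadratic relation between the two distances.
assert d_h ** 2 <= d_tv <= math.sqrt(2) * d_h
print(d_tv, d_h)
```

In particular, $\operatorname{d}_{\rm H}(p,q)>\varepsilon$ only guarantees $\operatorname{d}_{\rm TV}(p,q)>\varepsilon^2$, which is why running a total variation tester as a black box costs $\Psi_{\rm TV}(q,\varepsilon^2)$ samples.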

What is the ''right'' dependence on $\varepsilon$ of $\Psi_{\rm H}$?

''Note that in both instance-optimal bounds obtained for $\Psi_{\rm TV}$, there exist (simple) examples of $q$ where $\varepsilon$ ends up ''in the exponent,'' so this quadratic gap is not innocuous even for constant $\varepsilon$.''

@@ Line 21: / Line 21: @@
 $$
 Results of Daskalakis, Kamath, and Wright {{cite|DaskalakisKW-18}} show that the ''worst-case'' sample complexity remains $\Theta(\sqrt{n}/\varepsilon^2)$.  Moreover, due to the quadratic dependence between Hellinger and total variation distances, both instance-specific bounds mentioned above apply, yet with possibly a quadratic gap between upper and lower bounds in terms of $\varepsilon$: leading to bounds on the instance-specific sample complexity $\Psi_{\rm H}$ of Hellinger identity testing of
-$$\Psi_{\rm TV}(q,\varepsilon) \leq \Psi_{\rm H}(q,\varepsilon) \leq \Psi_{\rm TV}(q,\varepsilon^2)$$.
+$$\Psi_{\rm TV}(q,\varepsilon) \leq \Psi_{\rm H}(q,\varepsilon) \leq \Psi_{\rm TV}(q,\varepsilon^2).$$
 What is the right dependence on $\varepsilon$ of $\Psi_{\rm H}$?
 ''Note that in both instance-specific bounds obtained for $\Psi_{\rm TV}$, there exist (simple) examples of $q$ where $\varepsilon$ ends up ''in the exponent,'' so this quadratic gap is not innocuous even for constant $\varepsilon$.''

@@ Line 1: / Line 1: @@
 {{Header
 |source=focs17
 |who=Clément Canonne

← Older revision	Revision as of 14:53, 8 November 2017
(No difference)

@@ Line 6: / Line 6: @@
 Given the full description of a fixed distribution $q$ over a discrete domain (say $[n]=\{1,\dots,n\}$), as well as access to i.i.d. samples from an unknown probability distributions $p$ over $[n]$ and distance parameter $\varepsilon\in(0,1]$, the  identity testing problem asks to distinguish w.h.p. between (i) $p=q$ and (ii) $\operatorname{d}_{\rm TV}(p,q)>\varepsilon$.
-The sample complexity of this question as a function of $n$ and $\varepsilon$ is fully understood by now: $\Theta(\sqrt{n}/\varepsilon^2)$ are necessary and sufficient, the worst-case lower bound following from taking $q$ to be the uniform distribution on $[n]$. Valiant and Valiant {{cite|ValiantV-14}} shown an ''instance-specific'' bound on this problem, where the sample complexity $\Psi_{\rm TV}$ now only depends on $n$ and the (massive) parameter $q$ instead of $n$: namely, that
+The sample complexity of this question as a function of $n$ and $\varepsilon$ is fully understood by now: $\Theta(\sqrt{n}/\varepsilon^2)$ are necessary and sufficient, the worst-case lower bound following from taking $q$ to be the uniform distribution on $[n]$. Valiant and Valiant {{cite|ValiantV-14}} shown an ''instance-specific'' bound on this problem, where the sample complexity $\Psi_{\rm TV}$ now only depends on $\varepsilon$ and the (massive) parameter $q$ instead of $n$: namely, that
 $$\Psi_{\rm TV}(q,\varepsilon) = \Theta\left(\max\left( \frac{\Phi(q,\Theta(\varepsilon))}{\varepsilon^2}, \frac{1}{\varepsilon}\right)\right)$$
 samples were necessary and sufficient, where $\Phi$ is the functional defined by taking the $2/3$-pseudonorm of the vector of probabilities of $q$, once both the biggest element and $\varepsilon$ total mass of the smallest elements had been removed:

@@ Line 6: / Line 6: @@
 Given the full description of a fixed distribution $q$ over a discrete domain (say $[n]=\{1,\dots,n\}$), as well as access to i.i.d. samples from an unknown probability distributions $p$ over $[n]$ and distance parameter $\varepsilon\in(0,1]$, the  identity testing problem asks to distinguish w.h.p. between (i) $p=q$ and (ii) $\operatorname{d}_{\rm TV}(p,q)>\varepsilon$.
-The sample complexity of this question as a function of $n$ and $\varepsilon$ is fully understood by now: $\Theta(\sqrt{n}/\varepsilon^2)$ are necessary and sufficient, the worst-case lower bound following from taking $q$ to be the uniform distribution on $[n]$. Valiant and Valiant {{cite|ValiantV-14}} shown an ''instance-optimal'' bound on this problem, where the sample complexity $$\Psi_{\rm TV}$$ now only depends on $n$ and the (massive) parameter $q$ instead of $n$: namely, that
+The sample complexity of this question as a function of $n$ and $\varepsilon$ is fully understood by now: $\Theta(\sqrt{n}/\varepsilon^2)$ are necessary and sufficient, the worst-case lower bound following from taking $q$ to be the uniform distribution on $[n]$. Valiant and Valiant {{cite|ValiantV-14}} shown an ''instance-optimal'' bound on this problem, where the sample complexity $\Psi_{\rm TV}$ now only depends on $n$ and the (massive) parameter $q$ instead of $n$: namely, that
 $$\Psi_{\rm TV}(q,\varepsilon) = \Theta\left(\max\left( \frac{\Phi(q,\Theta(\varepsilon))}{\varepsilon^2}, \frac{1}{\varepsilon}\right)\right)$$
 samples were necessary and sufficient, where $\Phi$ is the functional defined by taking the $2/3$-pseudonorm of the vector of probabilities of $q$, once both the biggest element and $\varepsilon$ total mass of the smallest elements had been removed:
@@ Line 20: / Line 20: @@
 \operatorname{d}_{\rm H}(p,q) = \frac{1}{\sqrt{2}}\lVert\sqrt{p}-\sqrt{q}\rVert_2\,.
 $$
-Results of Daskalakis, Kamath, and Wright {{cite|DaskalakisKW-18}} show that the ''worst-case'' sample complexity remains $\Theta(\sqrt{n}/\varepsilon^2)$.  Moreover, due to the quadratic dependence between Hellinger and total variation distances, both instance-optimal bounds mentioned above apply, yet with possibly a quadratic gap between upper and lower bounds in terms of $\varepsilon$:
+Results of Daskalakis, Kamath, and Wright {{cite|DaskalakisKW-18}} show that the ''worst-case'' sample complexity remains $\Theta(\sqrt{n}/\varepsilon^2)$.  Moreover, due to the quadratic dependence between Hellinger and total variation distances, both instance-optimal bounds mentioned above apply, yet with possibly a quadratic gap between upper and lower bounds in terms of $\varepsilon$: leading to bounds on the instance-optimal sample complexity $\Psi_{\rm H}$ of Hellinger identity testing of
 $$\Psi_{\rm TV}(q,\varepsilon) \leq \Psi_{\rm H}(q,\varepsilon) \leq \Psi_{\rm TV}(q,\varepsilon^2)$$.
-What is the ''right'' dependence on $\varepsilon$ of $\Psi_{\rm H}$?
+What is the right dependence on $\varepsilon$ of $\Psi_{\rm H}$?
 ''Note that in both instance-optimal bounds obtained for $\Psi_{\rm TV}$, there exist (simple) examples of $q$ where $\varepsilon$ ends up ''in the exponent,'' so this quadratic gap is not innocuous even for constant $\varepsilon$.''