Difference between revisions of "Open Problems:96"

From Open Problems in Sublinear Algorithms
Jump to: navigation, search
Line 4: Line 4:
 
|who=Clément Canonne
 
|who=Clément Canonne
 
}}
 
}}
Given a distance parameter $\varepsilon\in(0,1]$, i.i.d. samples from an unknown distribution $p$ and a (known) reference distribution $q$, both over $[n] = \{1,\dots,n\}$, the identity testing question asks for the minimum number of samples sufficient to distinguish, with probability at least $2/3$, between (i) $p=q$ and (ii) $\operatorname{d}_{\rm TV}(p,q) > \varepsilon$. ($\operatorname{d}_{\rm TV}$ here denotes the total variation distance.) This question is by now fully resolved, with $\Theta(\sqrt{n}/\varepsilon^2)$ samples being necessary and sufficient {{Cite|Paninski-08|ValiantV-14}}.
+
Given a distance parameter $\varepsilon\in(0,1]$, i.i.d. samples from an unknown distribution $p$, and a (known) reference distribution $q$, both over $[n] = \{1,\dots,n\}$, the identity testing question asks for the minimum number of samples sufficient to distinguish, with probability at least $2/3$, between (i) $p=q$ and (ii) $\operatorname{d}_{\rm TV}(p,q) > \varepsilon$. ($\operatorname{d}_{\rm TV}$ here denotes the total variation distance.) This question is by now fully resolved, with $\Theta(\sqrt{n}/\varepsilon^2)$ samples being necessary and sufficient {{Cite|Paninski-08|ValiantV-14}}.
  
 
However, consider the following variant: given a (fixed) family $\mathcal{F}$ of functions from $[n]$ to $[m]$, and a reference distribution $q$ over $[m]$, distinguish between (i) there exists $f\in \mathcal{F}$, $p = q\circ f$, and (ii) $\min_{f\in\mathcal{F}} \operatorname{d}_{\rm TV}(p,q\circ f) > \varepsilon$.
 
However, consider the following variant: given a (fixed) family $\mathcal{F}$ of functions from $[n]$ to $[m]$, and a reference distribution $q$ over $[m]$, distinguish between (i) there exists $f\in \mathcal{F}$, $p = q\circ f$, and (ii) $\min_{f\in\mathcal{F}} \operatorname{d}_{\rm TV}(p,q\circ f) > \varepsilon$.
  
This ''$\mathcal{F}$-identity testing'' question includes the identity testing one as special case by setting $m=n$ and $\mathcal{F}$ to be the singleton containing the identity function. One can also take $m=n$ and $\mathcal{F}$ to be the class of all permutations, to test "identity up to relabeling" (a problem whose sample complexity is, from previous work of Valiant and Valiant, known to be $\Theta(n/(\varepsilon^2\log n))$ (see Corollary 11.30 of {{Cite|Goldreich-17}}).
+
This ''$\mathcal{F}$-identity testing'' question includes the identity testing one as special case by setting $m=n$ and $\mathcal{F}$ to be the singleton containing the identity function. One can also take $m=n$ and $\mathcal{F}$ to be the class of all permutations, to test “identity up to relabeling” (a problem whose sample complexity is, from previous work of Valiant and Valiant, known to be $\Theta(n/(\varepsilon^2\log n))$ (see Corollary 11.30 of {{Cite|Goldreich-17}}).
  
'''Question:''' For a fixed $m$, and $\mathcal{F}$ the family of all partitions of $[n]$ into $m$ consecutive intervals, what is the sample complexity of  $\mathcal{F}$-identity testing, as a function of $n,\varepsilon,m$?
+
'''Question:''' For a fixed $m$, and $\mathcal{F}$ the family of all partitions of $[n]$ into $m$ consecutive intervals, what is the sample complexity of  $\mathcal{F}$-identity testing, as a function of $n$, $\varepsilon$, and $m$?
  
''Note: this corresponds to testing whether $p$ is a ``refinement'' of the coarse distribution $q$; or, equivalently, if $p$ ad $q$ are the same, up to the precision of the measurements.''
+
'''Note:''' This corresponds to testing whether $p$ is a ''refinement'' of the coarse distribution $q$; or, equivalently, if $p$ and $q$ are the same, up to the precision of the measurements.

Revision as of 01:56, 25 August 2019

Suggested by Clément Canonne
Source WOLA 2019
Short link https://sublinear.info/96

Given a distance parameter $\varepsilon\in(0,1]$, i.i.d. samples from an unknown distribution $p$, and a (known) reference distribution $q$, both over $[n] = \{1,\dots,n\}$, the identity testing question asks for the minimum number of samples sufficient to distinguish, with probability at least $2/3$, between (i) $p=q$ and (ii) $\operatorname{d}_{\rm TV}(p,q) > \varepsilon$. ($\operatorname{d}_{\rm TV}$ here denotes the total variation distance.) This question is by now fully resolved, with $\Theta(\sqrt{n}/\varepsilon^2)$ samples being necessary and sufficient [Paninski-08,ValiantV-14].

However, consider the following variant: given a (fixed) family $\mathcal{F}$ of functions from $[n]$ to $[m]$, and a reference distribution $q$ over $[m]$, distinguish between (i) there exists $f\in \mathcal{F}$, $p = q\circ f$, and (ii) $\min_{f\in\mathcal{F}} \operatorname{d}_{\rm TV}(p,q\circ f) > \varepsilon$.

This $\mathcal{F}$-identity testing question includes the identity testing one as special case by setting $m=n$ and $\mathcal{F}$ to be the singleton containing the identity function. One can also take $m=n$ and $\mathcal{F}$ to be the class of all permutations, to test “identity up to relabeling” (a problem whose sample complexity is, from previous work of Valiant and Valiant, known to be $\Theta(n/(\varepsilon^2\log n))$ (see Corollary 11.30 of [Goldreich-17]).

Question: For a fixed $m$, and $\mathcal{F}$ the family of all partitions of $[n]$ into $m$ consecutive intervals, what is the sample complexity of $\mathcal{F}$-identity testing, as a function of $n$, $\varepsilon$, and $m$?

Note: This corresponds to testing whether $p$ is a refinement of the coarse distribution $q$; or, equivalently, if $p$ and $q$ are the same, up to the precision of the measurements.