Difference between revisions of "Open Problems:82"

From Open Problems in Sublinear Algorithms
Jump to: navigation, search
m (Krzysztof Onak moved page Waiting:Beyond identity testing to Open Problems:82: The problem is ready for publication)
m (Making parentheses in an equaition nicer)
 
(One intermediate revision by the same user not shown)
Line 1: Line 1:
 
{{Header
 
{{Header
|title=Beyond identity testing
 
 
|source=focs17
 
|source=focs17
 
|who=Clément Canonne
 
|who=Clément Canonne
 
}}
 
}}
Given access to i.i.d. samples from two unknown probability distributions $p,q$ over a discrete domain (say $[n]=\{1,\dots,n\}$) and distance parameter $\varepsilon\in(0,1]$, the ''equivalence'' (also known as ''closeness'') testing problem asks to distinguish w.h.p. between (i) $p=q$ and (ii) $\operatorname{d}_{\rm TV}(p,q)>\varepsilon$. (The ''identity'' problem is the analogue when $q$ is fixed and explicitly known beforehand.)
+
Given access to i.i.d. samples from two unknown probability distributions $p$ and $q$ over a discrete domain (say $[n]=\{1,\dots,n\}$) and distance parameter $\varepsilon\in(0,1]$, the ''equivalence'' (also known as ''closeness'') testing problem asks to distinguish w.h.p. between (i) $p=q$ and (ii) $\operatorname{d}_{\rm TV}(p,q)>\varepsilon$. (The ''identity'' problem is the analogue when $q$ is fixed and explicitly known beforehand.)
  
Equivalence up to a permutation would then be the variant where one must test whether $p,q$ are equal ''up to relabeling of the elements'':
+
Equivalence up to a permutation would then be the variant where one must test whether $p$ and $q$ are equal ''up to relabeling of the elements'':
(i) $\exists \pi\in\mathcal{S}_n$ s.t. $p=q\circ\pi$ vs. (ii) $\forall \pi\in\mathcal{S}_n$, $\operatorname{d}_{\rm TV}(p,q\circ \pi)>\varepsilon$. Results of Valiant and Valiant {{cite|ValiantV-11}} imply that this question has sample complexity $\Theta(\frac{n}{\varepsilon^2\log n})$.
+
(i) $\exists \pi\in\mathcal{S}_n$ s.t. $p=q\circ\pi$ vs. (ii) $\forall \pi\in\mathcal{S}_n$, $\operatorname{d}_{\rm TV}(p,q\circ \pi)>\varepsilon$. Results of Valiant and Valiant {{cite|ValiantV-11}} imply that this question has sample complexity $\Theta\left(\frac{n}{\varepsilon^2\log n}\right)$.
  
The most general question then is
+
The most general question then is:
 
: Given a fixed class $\mathcal{F}$ of functions from $[n]$ to $[m]$, distinguish between (i) $\exists \pi\in\mathcal{F}$ s.t. $p=q\circ\pi$ vs. (ii) $\forall \pi\in\mathcal{F}$, $\operatorname{d}_{\rm TV}(p,q\circ \pi)>\varepsilon$.
 
: Given a fixed class $\mathcal{F}$ of functions from $[n]$ to $[m]$, distinguish between (i) $\exists \pi\in\mathcal{F}$ s.t. $p=q\circ\pi$ vs. (ii) $\forall \pi\in\mathcal{F}$, $\operatorname{d}_{\rm TV}(p,q\circ \pi)>\varepsilon$.
  
In particular, one can study how the sample complexity depends on $\mathcal{F}$, or what it is for some classes of interest (e.g., $n=m$ for $\mathcal{F}$ a subgroup of the symmetric group $\mathcal{S}_n$; or $m\ll n$ and $\mathcal{F}$ being a class of "coarsenings," capturing whether $p,q$ are the same distribution but with a different discretization/binning).
+
In particular, one can study how the sample complexity depends on $\mathcal{F}$, or what it is for some classes of interest (e.g., $n=m$ for $\mathcal{F}$ a subgroup of the symmetric group $\mathcal{S}_n$; or $m\ll n$ and $\mathcal{F}$ being a class of “coarsenings,” capturing whether $p$ and $q$ are the same distribution but with a different discretization/binning).

Latest revision as of 06:10, 8 November 2017

Suggested by Clément Canonne
Source FOCS 2017
Short link https://sublinear.info/82

Given access to i.i.d. samples from two unknown probability distributions $p$ and $q$ over a discrete domain (say $[n]=\{1,\dots,n\}$) and distance parameter $\varepsilon\in(0,1]$, the equivalence (also known as closeness) testing problem asks to distinguish w.h.p. between (i) $p=q$ and (ii) $\operatorname{d}_{\rm TV}(p,q)>\varepsilon$. (The identity problem is the analogue when $q$ is fixed and explicitly known beforehand.)

Equivalence up to a permutation would then be the variant where one must test whether $p$ and $q$ are equal up to relabeling of the elements: (i) $\exists \pi\in\mathcal{S}_n$ s.t. $p=q\circ\pi$ vs. (ii) $\forall \pi\in\mathcal{S}_n$, $\operatorname{d}_{\rm TV}(p,q\circ \pi)>\varepsilon$. Results of Valiant and Valiant [ValiantV-11] imply that this question has sample complexity $\Theta\left(\frac{n}{\varepsilon^2\log n}\right)$.

The most general question then is:

Given a fixed class $\mathcal{F}$ of functions from $[n]$ to $[m]$, distinguish between (i) $\exists \pi\in\mathcal{F}$ s.t. $p=q\circ\pi$ vs. (ii) $\forall \pi\in\mathcal{F}$, $\operatorname{d}_{\rm TV}(p,q\circ \pi)>\varepsilon$.

In particular, one can study how the sample complexity depends on $\mathcal{F}$, or what it is for some classes of interest (e.g., $n=m$ for $\mathcal{F}$ a subgroup of the symmetric group $\mathcal{S}_n$; or $m\ll n$ and $\mathcal{F}$ being a class of “coarsenings,” capturing whether $p$ and $q$ are the same distribution but with a different discretization/binning).