Problem 5: Characterizing Sketchable Distances

From Open Problems in Sublinear Algorithms
Jump to: navigation, search
Suggested by Sudipto Guha and Piotr Indyk
Source Kanpur 2006
Short link

Some of the early successes in developing algorithms for the data stream model related to estimating $L_p$ norms [FeigenbaumKSV-02,Indyk-00,AlonMS-99] and the “Hamming norm” $L_0$ [CormodeDIM-03]. What other distances, or more generally “measures of dissimilarity,” can be approximated in the data stream model? Do all sketchable distances essentially arise as norms, specially, if deletions are allowed? Note that the set similarity distance (symmetric difference over union) can be estimated in the streaming model in the absence of deletions [BroderCFM-00].

Recent work provides some preliminary results [GuhaIM-07]. Let $f=(f_1, \ldots, f_n)$ and $g=(g_1, \ldots, g_n)$ be two frequency vectors defined by a stream in the usual way. Consider a distance $d(f,g)= \sum_i \phi(f_i,g_i)$ where $\phi:\mathbb N \times \mathbb N \rightarrow \mathbb R^+$ and $\phi(x,x)=0$. If there exist $a,b,c\in \mathbb N$ such that \[ \max \left ( \frac{\phi(a+c,a)}{\phi(b+c,b)} , \frac{\phi(a,a+c)}{\phi(b,b+c)} \right )> \alpha^2\] then it can be shown that any one-pass $\alpha$-approximation of $d(f,g)$ requires $\Omega(n)$ space where the stream defining $f$ and $g$ has length $O(n(a+b+c))$. Similar results hold for multiple-pass algorithms and for probabilistic divergences of the form $d(f,g)= \sum_i \phi(p_i,q_i)$ where $p_i=f_i/L_1(f)$ and $q_i=g_i/L_1(g)$. These results suggest that for a distance $d$ to be sketchable, $d(x,y)$ needs to be some function of $x-y$. In particular, they show that multiplicative approximation of all $f$-divergences and Bregman divergences, such as Kullback-Leibler and Hellinger, requires $\Omega(n)$ space with $L_1$ and $L_2^2$ being notable exceptions.


It was shown that for norms small sketches are equivalent to good embeddings into $\ell_{1 - \varepsilon}$ [AndoniKR-14].