Problem 4: Deterministic Summary Structures

From Open Problems in Sublinear Algorithms
Jump to: navigation, search
Suggested by Sumit Ganguly
Source Kanpur 2006
Short link https://sublinear.info/4

Given a stream of elements of the form $(i,\delta)$ where $i\in [n]$ and $\delta\in \{-1,1\}$ define the frequency of an element to be $f_i=\sum_{(i,\delta)} \delta$. We wish to find estimates $\hat{f}_i$ for each $f_i$ such that \[|\hat{f}_i-f_i|\leq \epsilon L_1\] where $L_1=\sum_i |f_i|$. The Count-Min algorithm is a randomized $O(\epsilon^{-1} \log (mn) \log \delta^{-1})$-space algorithm that returns such estimates with probability $1-\delta$ [CormodeM-05]. This is nearly optimal as the space lower bound is $O(\epsilon^{-1} \log (m) \log \epsilon n)$ [Ganguly-06].

However, in practice it is desirable to have deterministic algorithms rather than randomized algorithms. Using a deterministic collection of primes [Muthukrishnan-06], Ganguly [Ganguly-06] devised a deterministic $O(\phi^{-2}\epsilon^{-1} \log^2(mn))$-space algorithm that returned all items $i$ with $|f_i|\geq \phi L_1$ and no $j$ satisfying $|f_j|\leq (1-\epsilon)\phi L_1$. While this algorithm has the advantage of being deterministic, it uses more space than the Count-Min algorithm. Does there exist a deterministic algorithm that uses the same amount of space as Count-Min? Such an algorithm would lead to space-efficient algorithms for a range of problems including hierarchical heavy hitters, estimating inner product sizes, approximately optimal $B$-bucket histograms etc. Unfortunately, we conjecture that no such algorithm exists. Either an algorithm or lower bound would be very interesting.