Problem 3: $L_\infty$ Estimation
| Suggested by | Graham Cormode |
| --- | --- |
| Source | Kanpur 2006 |
| Short link | https://sublinear.info/3 |
One of the earliest results in data streaming is that approximating $L_\infty$ of a stream of values requires space proportional to the dimensionality of the stream. The hard case used to prove this is when most items in the stream have frequency 1, so that approximating $L_\infty$ is equivalent to testing whether any item has frequency two or higher. However, a variation of this problem is routinely studied under the name "heavy hitters." Here, the lower bound is avoided by asking to find all items whose frequencies exceed some fixed fraction $\phi$ of the total stream length, while tolerating approximation error $\epsilon$. The resulting space bounds are polynomial in $1/\phi$ or $1/\epsilon$. A side effect of these algorithms is to estimate $L_\infty$ of the stream with additive error proportional to $\epsilon$ times the $L_1$ or $L_2$ norm of the stream. Let the stream consist of items specified in $\log m$ bits. For insert-only streams, the best space bound is $O(\epsilon^{-1}(\log m + \log L_1))$ [MisraG-82,MetwallyAA-05]; for computing on the difference between two streams, the bound is $O(\epsilon^{-1}\log m (\log m + \log L_1))$ [CormodeM-05a]. These algorithms approximate the $L_\infty$ distance in the sense above, and additionally identify a set of items which contribute significantly to the distance.
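To make the insert-only guarantee concrete, here is a minimal sketch of the Misra-Gries algorithm [MisraG-82] with $k$ counters; with $k = \lceil 1/\epsilon \rceil$, each reported count undercounts the true frequency by at most $L_1/(k+1) \le \epsilon L_1$, so any $\phi$-heavy hitter with $\phi > \epsilon$ survives in the summary. The function name and interface are illustrative, not from the sources cited above.

```python
def misra_gries(stream, k):
    """Misra-Gries heavy-hitters summary with k counters.

    Returns a dict mapping items to estimated frequencies. Each
    estimate undercounts the true frequency by at most L1/(k+1),
    where L1 is the stream length (a standard guarantee of the
    algorithm; this sketch is illustrative).
    """
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            # Each such step "charges" one occurrence of k+1 distinct
            # items, which bounds the total undercount.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

For example, on a stream of length 95 with $k=4$, any item occurring more than $95/5 = 19$ times is guaranteed to appear in the summary.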
The open question is whether it is possible to approximate $L_\infty$ with additive error in terms of $\epsilon$ times $L_1$ or $L_2$ using less space. In particular, is it possible to reduce the dependency on $m$, since $m$ is not needed in the output? One possible direction is to analyze data structures such as the Count-Min sketch, from which item frequencies can be estimated and in which $m$ does not occur in the (word) space complexity [CormodeM-05].[1]
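The Count-Min sketch mentioned above can be sketched as follows: a table of $d = O(\log 1/\delta)$ rows and $w = O(1/\epsilon)$ columns, where a point query returns the minimum counter over the rows and overestimates the true frequency by at most $\epsilon L_1$ with probability $1-\delta$. The hash construction below is a simplified illustration, not the exact scheme of [CormodeM-05]; class and parameter names are my own.

```python
import random

class CountMinSketch:
    """Simplified Count-Min sketch (illustrative, not the exact
    construction of [CormodeM-05]).

    A point query never underestimates the true count, and
    overestimates it by at most eps * L1 with probability
    at least 1 - delta, for width ~ e/eps and depth ~ ln(1/delta).
    """

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]
        # One (a, b) pair per row for hashing modulo a Mersenne prime;
        # a stand-in for the pairwise-independent hashes in the paper.
        self.prime = (1 << 31) - 1
        self.hashes = [(rng.randrange(1, self.prime), rng.randrange(self.prime))
                       for _ in range(depth)]

    def _bucket(self, row, x):
        a, b = self.hashes[row]
        return ((a * hash(x) + b) % self.prime) % self.width

    def update(self, x, count=1):
        for r in range(self.depth):
            self.table[r][self._bucket(r, x)] += count

    def estimate(self, x):
        # Minimum over rows: a one-sided (over-)estimate of the count.
        return min(self.table[r][self._bucket(r, x)] for r in range(self.depth))
```

Note that estimating $L_\infty$ from this structure still requires taking a maximum over candidate items, which is where the dependency on $m$ (or on tracking a candidate set) tends to reappear.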
Notes
- ↑ Formally, $\log m$ does affect the bit space complexity in two places: the data structure consists of $O(\log 1/\delta)$ hash functions whose specification requires $O(\log m)$ bits, and of $O(\epsilon^{-1} \log 1/\delta)$ counters which in the worst case may count up to the $L_1$ norm of the whole stream; the latter may perhaps be addressed by using approximate counters.