Problem 2: Quantiles

From Open Problems in Sublinear Algorithms
Jump to: navigation, search
Suggested by Graham Cormode
Source Kanpur 2006
Short link

The problem of tracking the quantiles (median and generalizations thereof) of a distribution produced by a stream has attracted significant study over the last decade [MankuRL-98,MankuRL-99,GreenwaldK-01,GilbertKMS-02,CormodeM-05,ShrivastavaBAS-04,GuhaM-06]. For deterministic algorithms on insert only streams, two algorithms obtain the best (and incomparable) space bounds: $O(\epsilon^{-1} \log \epsilon N)$ words [GreenwaldK-01] and $O(\epsilon^{-1}\log U )$ words [ShrivastavaBAS-04], where $U$ is the size of the domain from which the input is drawn.

The Greenwald-Khanna algorithm [GreenwaldK-01] is simple to implement, and works on streams of items drawn from arbitrary domains. However, the analysis is rather involved; moreover, attempts to modify the analysis for different situations (say, weighted input items, merging summaries together, giving different guarantees to different ranges etc.) lead to heuristics at best, which may no longer have strict guarantees and known bad cases. The $q$-digest algorithm [ShrivastavaBAS-04] is much simpler to analyze and more amenable to variations, meaning that several generalizations and alternatives have been proposed [HershbergerSST-04,CormodeKMS-06]. However, it carries with it a factor of $\log U$, meaning that the universe has to be known, making it impractical for tracking quantiles over streams of floating point values, or strings.

This leads to some interlinked open questions:

  1. What is the optimal space bound for an algorithm to compute quantiles of a data stream? Is $O(\epsilon^{-1})$ words achievable?
  2. Can the Greenwald-Khanna algorithm, or a variation thereof, submit to a simpler analysis which will allow generalizations of the algorithm to be more easily proposed and studied?