Problem 1: Fast $L_1$ Difference

From Open Problems in Sublinear Algorithms
Jump to: navigation, search
Suggested by Graham Cormode
Source Kanpur 2006
Short link

In data streaming, the focus is often on the space complexity of solving particular problems. It turns out that, in practice, when processing massive streams online, time efficiency is just as important, if not more so, than space usage. For many aggregates, such as $L_2$, $F_0$, quantiles, heavy hitters and so on, not only are the best known solutions optimal or nearly optimal in space, they also turn out to be very time efficient. Indeed, for many problems it seems that some solutions are known which require very little time to process each update in the stream. One notable exception is the problem of computing the $L_1$ difference between two vectors specified by streams. The well-known way to do this involves using 1-stable distributions (the Cauchy distribution), and tracking the inner product of each vector with a pseudo-random vector whose entries are each drawn from a Cauchy distribution. However, to get sufficient accuracy requires tracking a large number of independent inner-products, which means each update can be quite costly.

The main open question therefore is to study the time complexity of $L_1$ difference computations. Two possible directions suggest themselves:

  • The algorithms of Indyk and Woodruff [IndykW-05], and simplifications by Bhuvanagiri et al. [BhuvanagiriGKS-06] give improved bounds for $F_k$ computations, $k>2$, based on estimating large frequencies individually and removing; this approach has been extended to quantities such as entropy [BhuvanagiriG-06]. Can it also apply to $L_1$?
  • Recent work [LiHC-06] has studied sparse random projections for $L_2$. Follow up work [Li-06] has extended this to sparse projections using stable distributions. What time bounds does this imply for $(\epsilon,\delta)$-approximation of $L_1$ distance?

A more general open question arises. So far, there has been considerable success in proving space lower bounds for data stream computations using tools from communication complexity and cell probe model. Is it possible to give non-trivial time lower bounds for update cost (either worst case or amortized) on data streams? Note that the difference between an $O(1)$ and $O(\epsilon^{-2} \log^3 n)$ algorithm for processing each update in a stream translates into the difference between an $O(n)$ and $O(n\epsilon^{-2} \log^3 n)$ algorithm, which might be considered only a small difference in traditional algorithms.