Ads, Impressions, and Statistics
1. Consider the problem of reporting statistics about online ads to the advertisers that pay for them. Advertisers may wish to know how many unique people matching some demographic conditions have seen an ad at a particular time range. There are many ads and many demographic conditions and time ranges; and the service provider should be able to accurately answer any possible query over ads, demographics, and time. In other words, it’s an OLAP cube problem where the aggregate is distinct count rather than sum. How can one construct a sketch that will provide estimates with "good" errors that solves this with, say, $O(1)$ time complexity per query? Loosely speaking, "good" here means every query for an ad (without demographic conditions) has good relative error and any query that has a high count should have good relative error.
2. Suppose a company tracks an important high level metric (e.g., total network traffic) that is the sum of a metric over many discrete categories (e.g., country, device, network provider, etc.). When that metric changes in an unexpected way, it wishes to figure out what combinations of categories are driving that change. How can one find a small set of combination of categories that explains most of the change and can be described succinctly? Can this be done in real-time so that a streaming sketch can keep track of the appropriate approximate aggregates and the search for a succinct description of category can be done in "reasonable" time?