12-08-2011, 11:36 AM
Abstract.
We present novel space and time-efficient algorithms for finding frequent items over general update streams. Our algorithms are based on a novel adaptation of the popular dyadic intervals method for finding frequent items. The algorithms improve upon existing algorithms in both theory and practice.
1 Introduction
There is a growing class of applications in areas of business and scientific data processing that continuously monitor large volumes of rapidly arriving data for detecting user-programmed scenarios, some of which may encode anomaly and exception conditions or desirable conditions. Although a deep analysis of the data can be done, it is both space and time consuming. Data streaming systems are designed to give fast, but possibly approximate answers to a class of queries while processing the input data in an online fashion. For example, consider a satellite data processing system where continuous and voluminous weather data has to be rapidly processed to give a forewarning of an emerging climate phenomenon. While deep analysis is possible, often, an early warning capability is very desirable, which though approximate, could then be used to trigger a deeper analysis. As another example, consider a biological experiment scenario where there are sensors attached to many biological subjects whose data is being continuously transmitted to a central server. Monitoring extremal aggregate conditions over to sensor readings are often useful indicators in such scenarios. Central to the success of data streaming systems are highly space and timeefficient algorithms that can summarize input data streams while processing them in an online fashion. In this paper, we present novel algorithms for data stream processing in the same vein, specifically considering general data streams. In the general stream model, each input record indicates arbitrary insertions or deletions of an item, where, an item may be an IP-address, stock ticker, sensorid, etc.. In this model, the sum of aggregate insertions (positive) and deletions (negative) for each item over the course of the stream may be either positive or negative. We address the problem of finding frequent items over general data streams.
Download full report
http://googleurl?sa=t&source=web&cd=2&ve...-SSDBM.pdf&ei=oMJETomGDoftrQfU6emIBA&usg=AFQjCNFkK2fUMDQaU0U9FPPIEvG0Jtx5gQ&sig2=ZuXkr4FzWOK85A-ky176hA