12-08-2011, 10:53 AM
Abstract.
We present a randomized procedure named Hierarchical Sampling from Sketches (HSS) that can be used for estimating a class of functions over the frequency vector f of update streams of the form (S) = Pn i=1 (|fi|).We illustrate this by applying the HSS technique to design nearly space-optimal algorithms for estimating the pth moment of the frequency vector, for real p _ 2 and for estimating the entropy of a data stream. 3
1 Introduction
A variety of applications in diverse areas, such as, networking, database systems, sensor networks, web-applications, share some common characteristics, namely, that data is generated rapidly and continuously, and must be analyzed in real-time and in a single-pass over the data to identify large trends, anomalies, user-defined exception conditions, etc.. Furthermore, it is frequently sufficient to continuously track the “big picture”, or, an aggregate view of the data. In this context, efficient and approximate computation with bounded error probability is often acceptable. The data stream model presents a computational model for such applications, where, incoming data is processed in an online fashion using sub-linear space. 1.1 The data stream model A data stream S is viewed as a sequence of records of the form (pos, i, v), where, pos is the index of the record in the sequence, i is the identity of an item in [1, n] = {1, . . . , n}, and v is the change to the frequency of the item. v > 0 indicates an insertion of multiplicity v, while v < 0 indicates a corresponding deletion. The frequency of an item i, denoted by fi, is the sum of the changes to the frequency of i since the inception of the stream.
Download full report
http://googleurl?sa=t&source=web&cd=1&ve...%2Fhss.pdf&ei=Z7dETpeUL8iGrAeQ_qTXAw&usg=AFQjCNGWvY-RrKeLs_OPHNYnTPAfdYnaXw&sig2=TzwDEJD_NBSE7jsJsGiPZg