12-08-2011, 10:32 AM
Abstract.
Space-economical estimation of the p-th frequency moments, defined as F_p = Σ_{i=1}^n |f_i|^p, for p > 0, is of interest in estimating all-pairs distances in a large data matrix [14], in machine learning, and in data stream computation. Random sketches, formed by the inner product of the frequency vector f_1, ..., f_n with a suitably chosen random vector, were pioneered by Alon, Matias and Szegedy [1], and have since played a central role in estimating F_p and in data stream computation in general. The concept of p-stable sketches, formed by the inner product of the frequency vector with a random vector whose components are drawn from a p-stable distribution, was proposed by Indyk [11] for estimating F_p, for 0 < p < 2, and has been further studied by Li [13]. In this paper, we consider the problem of estimating F_p, for 0 < p < 2. A disadvantage of the stable-sketch technique and its variants is that they require O(1/ε²) inner products of the frequency vector with dense vectors of stable (or nearly stable [14, 13]) random variables to be maintained, which makes each stream update quite time-consuming. We present algorithms for estimating F_p, for 0 < p < 2, that do not require stable sketches or their approximations. Our technique is elementary in nature, in that it uses simple randomization in conjunction with well-known summary structures for data streams, such as the COUNT-MIN sketch [7] and the COUNTSKETCH structure [5]. Our algorithms require space Õ(1/ε^(2+p)) to estimate F_p to within 1 ± ε factors, and expected time O(log F_1 log(1/ε)) to process each stream update. Thus, our technique trades an O(1/ε^p) factor in space for much more efficient processing of stream updates. We also present a stand-alone iterative estimator for F_1.
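For context on the baseline the paper improves upon, here is a minimal sketch (not the paper's algorithm) of Indyk-style stable sketches for the case p = 1, where the stable distribution is the standard Cauchy and F_1 = Σ|f_i| can be read off as the median of the absolute sketch values, since median(|Cauchy|) = 1. The function names and the sketch count k are illustrative assumptions; note how each estimate needs a dense inner product per sketch, which is exactly the per-update cost the paper's technique avoids.

```python
import math
import random

def cauchy():
    # Standard Cauchy (1-stable) variate via inverse transform: tan(pi*(U - 1/2))
    return math.tan(math.pi * (random.random() - 0.5))

def f1_estimate(freq, k=800, seed=0):
    """Estimate F_1 = sum |f_i| with k Cauchy sketches.

    Each sketch y_j = <f, c_j>, with c_j a dense vector of Cauchy variates.
    By 1-stability, |y_j| is distributed as F_1 * |Cauchy|, and since the
    median of |Cauchy| is 1, the median of |y_1|, ..., |y_k| estimates F_1.
    """
    random.seed(seed)
    sketches = []
    for _ in range(k):
        # One dense inner product per sketch -- this is the O(1/eps^2)
        # per-update cost the paper trades space to avoid.
        y = sum(f * cauchy() for f in freq)
        sketches.append(abs(y))
    sketches.sort()
    return sketches[k // 2]
```

For example, on the frequency vector [3, -1, 4, 1, 5] with F_1 = 14, the median estimate concentrates near 14 as k grows; the accuracy-versus-space trade is governed by k ≈ 1/ε².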
1 Introduction
Recently, there has been an emergence of monitoring applications in diverse areas including network traffic monitoring, network topology monitoring, sensor networks, financial market monitoring, and web-log monitoring. In these applications, data is generated rapidly and continuously, and must be analyzed efficiently, in real-time and in a single-pass over the data to identify large trends, anomalies, user-defined exception conditions, and so on. In many of these applications, it is often required to continuously track the “big picture”, or an aggregate view of the data, as opposed to a detailed view of the data. In such scenarios, efficient approximate computation is often acceptable.
Download full report
http://googleurl?sa=t&source=web&cd=1&ve...ndom04.pdf&ei=p7NETpSQBIXnrAeqs4GKCw&usg=AFQjCNF4bINQaGweVG9AEGK7HHUiQbdqaw&sig2=-xOxrel1DMT2LHMrH6_KOw