JoinDistinct Aggregate Estimation over Update Streams
#1

ABSTRACT
There is growing interest in algorithms for processing and query- ing continuous data streams (i.e., data that is seen only once in a _xed order) with limited memory resources. Providing (perhaps approximate) answers to queries over such streams is a crucial re- quirement for many application environments; examples include large IP network installations where performance data from dif- ferent parts of the network needs to be continuously collected and analyzed. The ability to estimate the number of distinct (sub)tuples in the result of a join operation correlating two data streams (i.e., the cardinality of a projection with duplicate elimination over a join) is an important requirement for several data-analysis sce- narios. For instance, to enable real-time tra_c analysis and load balancing, a network-monitoring application may need to esti- mate the number of distinct (source, destination) IP-address pairs occurring in the stream of IP packets observed by router R1, where the source address is also seen in packets routed through a di_erent router R2. Earlier work has presented solutions to the individual problems of distinct counting and join-size estimation (without duplicate elimination) over streams. These solutions, however, are fundamentally di_erent and extending or combining them to handle our more complex \Join-Distinct" estimation problem is far from obvious. In this paper, we propose the _rst space-e_cient algorithmic solution to the general Join-Distinct estimation problem over continuous data streams (our techniques can actually handle general update streams comprising tuple dele- tions as well as insertions). Our estimators are probabilistic in nature and rely on novel algorithms for building and combining a new class of hash-based synopses (termed \JD sketches") for individual update streams. We demonstrate that our algorithms can provide low error, high-con_dence Join-Distinct estimates using only small space and small processing time per update. In fact, we present lower bounds showing that the space usage of our estimators is within small factors of the best possible for the Join-Distinct problem. Preliminary experimental results verify the e_ectiveness of our approach.
1. INTRODUCTION
Query-processing algorithms for conventional Database Management Systems (DBMS) rely on (possibly) several passes over a collection of static data sets in order to produce _Work done while the authors were with Bell Labs. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PODS 2005 June 1315, 2005, Baltimore, Maryland. Copyright 2005 ACM 1595930620/ 05/06 : : : $5.00. an accurate answer to a user query. For several emerging ap- plication domains, however, updates to the data arrive on a continuous basis, and the query processor needs to be able to produce answers to user queries based solely on the observed stream and without the bene_t of several passes over a static data image. As a result, there has been a urry of recent work on designing e_ective query-processing algorithms that work over continuous data streams to produce results online while guaranteeing (1) small memory footprints, and (2) low processing times per stream item [1, 6, 11]. Such algorithms typically rely on summarizing the data stream(s) involved in concise synopses that can be used to provide approximate answers to user queries along with some reasonable guaran- tees on the quality of the approximation. In their most general form, real-life data streams are ac- tually update streams; that is, the stream is a sequence of updates to data items, comprising data-item deletions as well as insertions. 1 Such continuous update streams arise naturally, for example, in the network installations of large Internet service providers, where detailed usage information (SNMP/RMON packet-ow data, active VPN circuits, etc.) from di_erent parts of the underlying network needs to be continuously collected and analyzed for interesting trends. The processing of such streams follows, in general, a dis- tributed model where each stream (or, part of a stream) is observed and summarized by its respective party (e.g., the element-management system of an individual IP router) and the resulting synopses are then collected (e.g., periodically) at a central site, where queries over the entire collection of streams can be processed [11]. This model is used, for exam- ple, in Lucent's Interprenet and Cisco's NetFlow products for IP network monitoring. Clearly, there are several forms of queries that users or applications may wish to pose (online) over such contin- uous update streams; examples include join or multi-join aggregates [1, 6], norm and quantile estimation [2, 14, 16], or histogram and wavelet computation [13, 12]. Estimat- ing the number of distinct (sub)tuples in the result of an equi-join operation correlating two update streams (i.e., the cardinality of a projection with duplicate elimination over a join) is one of the fundamental queries of interest for sev- eral data-analysis scenarios. As an example, a network- management application monitoring active IP-sessions may wish to correlate the active sessions at routers R1 and R2 by posing a query such as: \What is the number of dis- tinct (source, destination) IP-address pairs seen

Download full report
http://googleurl?sa=t&source=web&cd=1&ve...y%2Fjd.pdf&ei=WbZETuTNLofJrQfA0aSIBA&usg=AFQjCNHz6cGrTUqlBGSecBNoswHjDLwPvQ&sig2=EHOzKdtF8Zk4y8bKpAm80g
Reply

Important Note..!

If you are not satisfied with above reply ,..Please

ASK HERE

So that we will collect data for you and will made reply to the request....OR try below "QUICK REPLY" box to add a reply to this page
Popular Searches: how to update ghriss details, samsung b 7000 firmware update, karmosanastan update newspaper com, automatic time update vista, karmasangsthan headline update, aggregate planning by transportation method, update,

[-]
Quick Reply
Message
Type your reply to this message here.

Image Verification
Please enter the text contained within the image into the text box below it. This process is used to prevent automated spam bots.
Image Verification
(case insensitive)

Possibly Related Threads...
Thread Author Replies Views Last Post
  SoC Estimation of Rechargeable Batteries full report smart paper boy 1 1,513 04-02-2013, 05:20 PM
Last Post: santosh_bangaru
  over-under voltage cut-off with ON-Time delay PROJECT REPORT project topics 3 5,358 01-12-2012, 12:23 PM
Last Post: seminar details
  Over / Under Line Voltage Protection for Electrical Appliances smart paper boy 1 2,423 12-10-2012, 01:05 PM
Last Post: seminar details
  OVER/UNDER VOLTAGE PROTECTION OF ELECTRICAL APPLIANCES smart paper boy 1 2,666 12-10-2012, 01:05 PM
Last Post: seminar details
  BANDWIDTH ESTIMATION FOR IEEE 802.11 BASED ADHOC NETWORK computer science crazy 6 5,066 15-02-2012, 02:01 PM
Last Post: dtanvi
  Data Centralization Over Networks eproject 4 4,271 14-02-2012, 02:58 PM
Last Post: seminar paper
  Automatic Over Speed Detector seminar projects crazy 4 5,856 04-02-2012, 10:22 AM
Last Post: seminar addict
  Broadband over Power Line (BPL) smart paper boy 1 1,115 19-01-2012, 11:44 AM
Last Post: seminar addict
  Estimating Entropy over Data Streams smart paper boy 0 629 12-08-2011, 11:49 AM
Last Post: smart paper boy
  Processing Set Expressions over Continuous Update Streams smart paper boy 0 805 12-08-2011, 11:27 AM
Last Post: smart paper boy

Forum Jump: