Stream-based Data Management
Characteristics of Data Streams
Data Streams
Data streams—continuous, ordered, changing, fast, huge amount
Traditional DBMS—data stored in finite, persistent data sets
Characteristics
Huge volumes of continuous data, possibly infinite
Fast changing and requires fast, real-time response
The data stream model fits today's data processing needs well
Random access is expensive, so algorithms make a single linear scan (each element is seen only once)
Store only the summary of the data seen thus far
Most stream data are low-level or multi-dimensional in nature and need multi-level, multi-dimensional processing
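The "single scan, store only a summary" idea can be sketched as follows. This is a minimal illustration, not something taken from the slides: it keeps a running count, mean and variance using Welford's method, and the names RunningSummary and summarize are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningSummary:
    """Constant-size summary of every value seen so far (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Each element is looked at exactly once and then discarded.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 until at least one value has arrived.
        return self.m2 / self.count if self.count else 0.0


def summarize(stream):
    """Consume an iterable in one pass and return its summary."""
    summary = RunningSummary()
    for x in stream:
        summary.update(x)
    return summary
```

The summary occupies three numbers no matter how long the stream runs, which is the point of the characteristic above.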
Stream Data Applications
Telecommunication calling records
Business: credit card transaction flows
Network monitoring and traffic engineering
Financial market: stock exchange
Engineering & industrial processes: power supply & manufacturing
Sensor, monitoring & surveillance: video streams
Security monitoring
Web logs and Web page click streams
Massive data sets (even when stored, random access is too expensive)
Data Streams vs. Data Sets
Data Sets: Data Streams:
Updates infrequent
Using Traditional Database
Data Streams Paradigm
DBMS versus DSMS
• DBMS: persistent relations
• DSMS: transient streams (and persistent relations)
Challenges of Stream Data Processing
Multiple, continuous, rapid, time-varying, ordered streams
Main memory computations
Queries are often continuous
Evaluated continuously as stream data arrives
Answer updated over time
Queries are often complex
Beyond element-at-a-time processing
Beyond relational queries (scientific, data mining, OLAP)
Multi-level/multi-dimensional processing and data mining
Most stream data are low-level or multi-dimensional in nature
Processing Stream Queries
Query types
One-time query vs. continuous query (being evaluated continuously as stream continues to arrive)
Predefined query vs. ad-hoc query (issued on-line)
Unbounded memory requirements
For real-time response, main memory algorithm should be used
Memory requirement is unbounded if a tuple may have to be joined with future tuples
Approximate query answering
With bounded memory, it is not always possible to produce exact answers
High-quality approximate answers are desired
Data reduction and synopsis construction methods
Sketches, random sampling, histograms, wavelets, etc.
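Of the synopsis methods listed, random sampling is the easiest to show in a few lines. The sketch below uses reservoir sampling (Algorithm R), chosen here only as an illustration; the function name and the rng parameter are assumptions, not part of the slides.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir.
            sample.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample
```

Memory stays at k items regardless of stream length, so approximate answers can be computed from the sample when exact ones would need unbounded storage.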
Methods for Approximate Query Answering
Sliding windows
Only over sliding windows of recent stream data
Approximation but often more desirable in applications
Batched processing, sampling and synopses
Batched if update is fast but computing is slow
Compute periodically, not very timely
Sampling if update is slow but computing is fast
Compute using sample data, but not good for joins, etc.
Synopsis data structures
Maintain a small synopsis or sketch of data
Good for querying historical data
Blocking operators, e.g., sorting, avg, min, etc.
An operator is blocking if it cannot produce its first output until it has seen the entire input
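A sliding window is one way to turn a blocking operator such as avg into one that produces output on every arrival. The following is a minimal sketch assuming a count-based window (the last `size` elements); the class name SlidingWindowAverage is hypothetical.

```python
from collections import deque


class SlidingWindowAverage:
    """Average over the most recent `size` elements of a stream."""

    def __init__(self, size: int):
        if size <= 0:
            raise ValueError("size must be positive")
        self.size = size
        self.window = deque()
        self.total = 0.0

    def add(self, x: float) -> float:
        # Admit the new element, then evict the oldest one if the window is full.
        self.window.append(x)
        self.total += x
        if len(self.window) > self.size:
            self.total -= self.window.popleft()
        # An answer is available immediately, without waiting for end of input.
        return self.total / len(self.window)
```

Usage: feeding 1, 2, 3, 4, 5 into a window of size 3 yields 1.0, 1.5, 2.0, 3.0, 4.0. A time-based window would evict by timestamp instead of by count.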
Projects on DSMS (Data Stream Management System)
Research projects and system prototypes
STREAM (Stanford): A general-purpose DSMS
Cougar (Cornell): sensors
Aurora (Brown/MIT): sensor monitoring, dataflow
Hancock (AT&T): telecom streams
Niagara (OGI/Wisconsin): Internet XML databases
OpenCQ (Georgia Tech): triggers, incr. view maintenance
Tapestry (Xerox): pub/sub content-based filtering
Telegraph (Berkeley): adaptive engine for sensors
Tradebot (tradebot.com): stock tickers & streams
Tribeca (Bellcore): network monitoring
Streaminer (UIUC): new project for stream data mining
Stream Data Mining vs. Stream Querying
Stream mining—A more challenging task
It shares most of the difficulties with stream querying
Patterns are hidden and more general than querying
It may require exploratory analysis
Not necessarily continuous queries
Stream data mining tasks
Multi-dimensional on-line analysis of streams
Mining outliers and unusual patterns in stream data
Clustering data streams
Classification of stream data
Challenges for Mining Unusual Patterns in Data Streams
Most stream data are low-level or multi-dimensional in nature and need multi-level/multi-dimensional (ML/MD) processing
Analysis requirements
Multi-dimensional trends and unusual patterns
Capturing important changes at multi-dimensions/levels
Fast, real-time detection and response
Summary
Stream data analysis: A rich and largely unexplored field
Current research focus in database community: DSMS system architecture, continuous query processing, supporting mechanisms
Stream data mining and stream OLAP analysis
Powerful tools for finding general and unusual patterns
Largely unexplored: current studies only touched the surface
Many exciting issues remain for further study
A promising one: Multi-level, multi-dimensional analysis and mining of stream data
What Is a Continuous Query?
A query that is issued once and then logically runs continuously.
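One way to picture "issued once, run continuously" is a generator that is set up a single time and yields an updated answer whenever a new tuple arrives. This is only a sketch; continuous_count and its dictionary-shaped tuples are assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator, Tuple


def continuous_count(
    stream: Iterable[dict], predicate: Callable[[dict], bool]
) -> Iterator[Tuple[dict, int]]:
    """Standing query: yields each matching tuple with the updated match count."""
    matches = 0
    for tup in stream:
        # The predicate is re-evaluated on every arrival; the query never "finishes"
        # unless the stream itself ends.
        if predicate(tup):
            matches += 1
            yield tup, matches
```

A one-time query would instead scan a stored data set and return a single final answer.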
Special Challenges
Timely online answers even for rapid data streams
Fast access to large portions of data
Processing of multiple streams simultaneously
Making Things Concrete
Database = two streams of mobile call records
Outgoing(call_ID, caller, time, event)
Incoming(call_ID, callee, time, event)
event = start or end
Query language = SQL
FROM clauses can refer to streams and/or relations
Query 1 (self-join)
Find all outgoing calls longer than 2 minutes
SELECT O1.call_ID, O1.caller
FROM Outgoing O1, Outgoing O2
WHERE O2.time - O1.time > 2
AND O1.call_ID = O2.call_ID
AND O1.event = 'start'
AND O2.event = 'end'
Result requires unbounded storage
Can provide result as data stream
Can output after 2 min, without seeing end
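The remark that Query 1 can output after 2 minutes without seeing the end event can be sketched like this. It assumes events arrive as (call_id, caller, ts, event) tuples in non-decreasing time order, with ts in minutes and event equal to "start" or "end"; the function name long_calls and this tuple layout are assumptions for illustration.

```python
from collections import OrderedDict


def long_calls(events, threshold=2.0):
    """Yield (call_id, caller) as soon as a call is known to exceed `threshold`."""
    open_calls = OrderedDict()  # call_id -> (caller, start_ts), oldest start first
    for call_id, caller, ts, event in events:
        # Any call still open that started more than `threshold` ago already
        # qualifies: its eventual end can only be later than the current time.
        while open_calls:
            oldest_id, (oldest_caller, start_ts) = next(iter(open_calls.items()))
            if ts - start_ts <= threshold:
                break
            del open_calls[oldest_id]
            yield oldest_id, oldest_caller
        if event == "start":
            open_calls[call_id] = (caller, ts)
        else:
            # A call that ended before crossing the threshold is simply dropped.
            open_calls.pop(call_id, None)
```

Under these assumptions the state holds only calls that are still open and younger than the threshold, rather than every start tuple ever seen.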
Query 2 (join)
Pair up callers and callees
SELECT O.caller, I.callee
FROM Outgoing O, Incoming I
WHERE O.call_ID = I.call_ID
Can still provide result as data stream
Requires unbounded temporary storage …
… unless streams are near-synchronized
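A symmetric hash join is one common way to evaluate a join like Query 2 over two streams. The sketch below is a simplification under a stated assumption: each call_ID appears exactly once on each stream (for example, only start events are fed in), so a tuple can be discarded once it has found its partner. The function name and the ("out" / "in", call_id, party) event layout are hypothetical.

```python
def symmetric_hash_join(events):
    """Pair callers with callees; `events` yields (side, call_id, party) tuples."""
    pending = {"out": {}, "in": {}}   # unmatched tuples, keyed by call_id
    other = {"out": "in", "in": "out"}
    for side, call_id, party in events:
        # Probe the opposite stream's table for a waiting partner.
        match = pending[other[side]].pop(call_id, None)
        if match is None:
            # No partner yet: buffer this tuple until the other side arrives.
            pending[side][call_id] = party
        else:
            caller, callee = (party, match) if side == "out" else (match, party)
            yield caller, callee
```

When the two streams are near-synchronized, partners arrive close together and the pending tables stay small; if one stream lags arbitrarily, they can grow without bound, which is the storage problem noted above.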
Query 3 (group-by aggregation)
Total connection time for each caller
SELECT O1.caller, sum(O2.time - O1.time)
FROM Outgoing O1, Outgoing O2
WHERE O1.call_ID = O2.call_ID
AND O1.event = 'start'
AND O2.event = 'end'
GROUP BY O1.caller
Cannot provide result in (append-only) stream.
Alternatives:
• Output stream with updates
• Provide current value on demand
• Keep answer in memory
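The first alternative, an output stream with updates, might look like the sketch below. It assumes the same (call_id, caller, ts, event) tuple layout as the Outgoing stream with event equal to "start" or "end"; the function name total_connection_time is hypothetical.

```python
from collections import defaultdict


def total_connection_time(events):
    """Yield (caller, running_total) every time one of that caller's calls ends."""
    started = {}                 # call_id -> (caller, start_ts) for open calls
    totals = defaultdict(float)  # caller -> total connection time so far
    for call_id, caller, ts, event in events:
        if event == "start":
            started[call_id] = (caller, ts)
        elif call_id in started:
            start_caller, start_ts = started.pop(call_id)
            totals[start_caller] += ts - start_ts
            # Each record supersedes the caller's previous total, so the output
            # is a stream of updates rather than an append-only result.
            yield start_caller, totals[start_caller]
```

The totals dictionary also covers the other two alternatives: it is the answer kept in memory, and reading it gives the current value on demand.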
Conclusions
Conventional DBMS technology is inadequate
We need to reconsider all aspects of data management and processing in the presence of data streams