[attachment=14179]
Data Warehouse
What is Datawarehouse
“A data warehouse is a collection of business information, derived directly from operational systems and some external data sources. Its specific purpose is to support business decisions, not business operations”
Characteristics of Datawarehouse
Various development Stages of Datawarehouse
Business Case Assessment
Enterprise Infrastructure Evaluation
Project Planning
Finalize Project requirement
Data Analysis
Prototyping
Database Design
Various development Stages of Datawarehouse
ETL Framework Design
ETL Package Development
BI Application Development
Data Validation
Implementation
Release Evaluation
Data Modeling
WHAT IS A DATA MODEL?
A data model is an abstraction of some aspect of
the real world (system).
WHY A DATA MODEL?
Helps to visualise the business
A model is a means of communication.
Models help elicit and document requirements.
Models reduce the cost of change.
Model is the essence of DW architecture based on which DW will be implemented
What do we want to do with the data?
Model depends on what kind of data analysis we want to do:
Different Data Analysis Techniques
Query and reporting
Display Query Results
Multidimensional analysis
Analyze data content by looking at it in different perspectives
Data mining
discover patterns and clustering attributes in data
Levels Of Modeling
Conceptual modeling
Describe data requirements from a business point of view without technical details
Logical modeling
Refine conceptual models
Data structure oriented, platform independent
Physical modeling
Detailed specification of what is physically implemented using specific technology
Conceptual Model
A conceptual model shows data through business eyes.
All entities which have business meaning.
Important relationships
Few significant attributes in the entities.
Few identifiers or candidate keys.
Logical Model
Replaces many-to-many relationships with associative entities.
Defines a full population of entity attributes.
May use non-physical entities for domains and sub-types.
Establishes entity identifiers.
Has no specifics for any RDBMS or configuration.
Physical Model
A Physical data model may include
Referential Integrity
Indexes
Views
Alternate keys and other constraints
Tablespaces and physical storage objects.
What needs to be modeled duringa data warehouse project
STAGING AREA
YES ! (maybe multiple data models are required)
ODS
YES !
DATAWAREHOUSE/DATAMART
YES!
Data Modeling - Techniques
Modeling techniques
E-R Modeling
Dimensional Modeling
Implementation and modeling styles
Modeling versus implementation
Modeling: describe what should be built to non-technical folks
Implementation: describe what is actually built to technical folks
Implementation and modeling styles
Relational modeling
Use for implementation
Difficult to understand by non-technical folks
Dimensional modeling
Use for modeling during analysis and design phases
Can be implemented using other modeling styles e.g. object-oriented, relational
Limitations of E-R Modeling
Poor Performance
Tend to be very complex and difficult to navigate.
Dimensional Modeling
Dimensional modeling uses three basic concepts : measures, facts, dimensions.
Is powerful in representing the requirements of the business user in the context of database tables.
Focuses on numeric data, such as values counts, weights, balances and occurrences.
Dimensional modeling
Must identify
Business process to be supported
Grain (level of detail)
Dimensions
Facts
Conventions used in Dimensional modeling
Facts
Measures(Variables)
Dimensions
Dimension members
Dimension hierarchies
Facts
A fact is a collection of related data items, consisting of measures and context data.
Each fact typically represents a business item, a business transaction, or an event that can be used in analyzing the business or business process.
Facts are measured, “continuously valued”, rapidly changing information. Can be calculated and/or derived.
Fact Table
A table that is used to store business information (measures) that can be used in mathematical equations.
Quantities
Percentages
Prices
Calculated Values
Dimensions
A dimension is a collection of members or units of the same type of views.
Dimensions determine the contextual background for the facts.
Dimensions represent the way business people talk about the data resulting from a business process, e.g., who, what, when, where, why, how
Dimension Table
Table used to store qualitative data about fact records
Who
What
When
Where
Why
Dimension data should be
verbose, descriptive
complete
no misspellings, impossible values
indexed
equally available
documented ( metadata to explain origin, interpretation of each attribute)
Dimensional model
Visualize a dimensional model as a CUBE (hypercube because dimensions can be more than
3 in number)
Operations for OLAP
Drill Down :Higher level of detail
Roll Up: summarized level of data
(The navigation path is determined by hierarchies within dimensions.)
Slice: cuts through the cube. Users can focus on specific perspectives
Dice: rotates the cube to another perspective (change the dimension)
Drill down …. Roll up
Slice and Dice
Dimensions
Collection of members or units of the same type of views.
determine the contextual background for the facts.
the parameters over which we want to perform OLAP (eg. Time, Location/region, Customers)
Member is a distinct name to determine data item’s position (eg. Time - Month, quarter)
Hierarchy arrange members into hierarchies or levels
Hierarchies
Aggregates
Aggregate Tables are pre-stored summarized tables… created at a higher level of granularity across any or all of the dimensions.
If the existing granularity is Day wise sales, then creating a separate month wise sales table is an example of Aggregate Table.
The use of such aggregates is the single most effective tool the data warehouse designer has to improve query performance.
Usage of Aggregates can increase the performance of Queries by several times.
Measures
A measure is a numeric attribute of a fact, representing the performance or behaviour of the business relative to dimensions.
The actual numbers are called as variables.
eg. sales in money, sales volume, quantity supplied, supply cost, transaction amount
A measure is determined by combinations of the members of the dimensions and is located on facts.
Types of Facts
Additive
Able to add the facts along all the dimensions
Discrete numerical measures eg. Retail sales in $
Semi Additive
Snapshot, taken at a point in time
Measures of Intensity
Not additive along time dimension eg. Account balance, Inventory balance
Added and divided by number of time period to get a time-average
Non Additive
Numeric measures that cannot be added across any dimensions
Intensity measure averaged across all dimensions eg. Room temperature
Textual facts - AVOID THEM
Common structures for Data Marts
enormalize!
Star
Single fact table surrounded by denormalized dimension tables
The fact table primary key is the composite of the foreign keys (primary keys of dimension tables)
Fact table contains transaction type information.
Many star schemas in a data mart
Easily understood by end users, more disk storage required
Example of Star Schema
Common structures for Data Marts
enormalize!
Snowflake
Single fact table surrounded by normalized dimension tables
Normalizes dimension table to save data storage space.
When dimensions become very very large
Less intuitive, slower performance due to joins
May want to use both approaches, especially if supporting multiple end-user tools.
Example of Snow flake schema
Keys …
Primary Keys
uniquely identify a record
Foreign Keys
primary key of another table referred here
Surrogate Keys
system-generated key for dimensions
key on its own has no meaning
integer key, less space
More Keys …
Smart Keys
primary key out of various attributes of dimension
AVOID THEM!
Join to Fact table should be on single surrogate key
Production Keys
DO NOT USE Production defined attributes
Business may reuse/change them - DW cannot!
Basic Dimensional Modeling Techniques
Slowing changing Dimensions
Rapidly changing Small Dimensions
Large Dimensions
Rapidly changing Large Dimensions
Degenerate Dimensions
Junk Dimensions
Slowly Changing Dimensions
A dimension is considered a Slowly Changing Dimension when its attributes remain almost constant over time, requiring relatively minor alterations to represent the evolved state.
Types of SCD
Type I – Retains the recent updated Value
Type II – Entire history of the dimensional data is maintained.
Version
Flag
Date
Type III – Only Current and one Previous value get maintained.
The Time Dimension
Time_key
day_of_week
day_number_in_month
day_number_overall
week_number_in_year
month
quarter
fiscal_period
holiday_flag
weekday_flag
last_day_in_month_flag
season
event
Time Dimension
An exclusive Time dimension is required because the SQL date semantics and functions cannot generate several important attributes required for analytical purposes.
Attributes like weekdays, weekends, fiscal period, holidays, season cannot be generated by SQL statements.
Moreover SQL date stamps occupy more space largely increasing the size of the fact table.
Joins on such SQL generated date-stamps are costly decreasing the query speed significantly.
Time Dimension
The holiday flag and season attributes are useful for holiday VS non-holiday analysis and season business analysis.
Event attribute is needed to record special days like strike days, etc..
Data Modeling for Data Warehouse
12 Steps :
1. Study ER 2.Evaluate and Analyse
3. Review Dimension 4. Add Time Dimension
5. Identify Facts 6. Granularity
7. Merge Facts 8. Review Facts
9. Name Facts 10. Size the model
11. Record Metadata 12. Validate mode
ETL Overview
Extraction
Source Systems (Multiple Source Systems)
Flat files, Excel, Legacy Systems, RDBMS etc.
Frequency of Extraction
Staging Area (If any? How many?)
Most Transformations from Source to Staging
Cleansing and Data Quality
Data integrity, De-duplication, completeness, correctness
Transformation
Usage of tools
Reusability of Transformations
Reusability of Mappings
Different tools
Informatica
Oracle Warehouse Builder
ETI
Sagent
PL/SQL scripts
Open Source Tools
Loading
Loading Frequency
Optimized Loading
Indexing
Partitioning
Aggregation
Sum
Average
Max
Update Strategy
Error Handling
STAGING AREA - Some Clarity
Staging Area
optional
to cleanse the source data
Accepts data from different sources
Data model is required at staging area
Multiple data models may be required for parking different sources and for transformed data to be pushed out to warehouse
ODS - Some Clarity
Operational Data Store
Optional
Granular, detailed level data
May feed warehouse (eg when warehouse is aggregated)
Usually a relational model
May keep data for a smaller time period than warehouse
Data Warehouse Architecture
DW Architecture
Architecture Choices depend on
Current infrastructure
Business environment
Desired management and control structure
resources
commitment …..
Data Warehouse/data mart
Types of Data Warehouse
Enterprise Data Warehouse
Data Mart
Enterprise data warehouse
Contains data drawn from multiple operational systems
Supports time- series and trend analysis across different business areas
Can be used as a transient storage area to clean all data and ensure consistency
Can be used to populate data marts
Can be used for everyday and strategic decision making
Data Mart
Logical subset of enterprise data warehouse
Organized around a single business process
Based on granular data
May or may not contain aggregates
Object of analytical processing by the end user.
Less expensive and much smaller than a full blown corporate data warehouse.
DW Implementation Approaches
Top Down
Bottom-up
Combination of both
Choices depend on:
current infrastructure
resources
architecture
ROI
Implementation speed
Top Down Implementation
Bottom Up Implementation
OLAP
What is OLAP??????
OLAP - On Line Analytical Processing
OLAP enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information.
OLAP transforms raw data so that it reflects the real dimensionality of the enterprise as understood by the user.
Data Warehousing vs. OLAP
OLAP focuses on
Data transformed into information that meets the end-user’s analytical requirements
Data modeling and computation processes is consistent
OLTP and DW provides the source data whereas, OLAP turns that data into information.
OLAP - Functionality
OLAP functionality is characterized by
Dynamic multi-dimensional analysis of consolidated enterprise data supporting end user analytical and navigational activities.
Calculations and modeling applied across dimensions, through hierarchies and/or across members
Trend analysis over sequential time periods
Slicing subsets for on-screen viewing
Drill -down to deeper levels of consolidation
Reach-through to underlying detail data
Rotation to new dimensional comparisons in the viewing area
OLAP - Functionality
OLAP is implemented in a multi-user client/server mode and offers consistently rapid response to queries, regardless of database size and complexity.
OLAP helps the user synthesize enterprise information through comparative, personalized viewing, as well as through analysis of historical and projected data in various "what -if" data model scenarios.
OLAP Functional Requirements
Fast Access and Calculations
Speed is critical to maintain an analyst’s train of thought.
An analyst needs to navigate throughout the data which requires aggregations, or roll-ups.
Powerful Analytical Capabilities
There is more complicated calculations to OLAP than simple aggregations, or roll-ups.
OLAPFunctional Requirements
Flexibility
viewing: graphs, charts, row or columns
definitions: format of numbers, name changes
analysis: Sales analyze data differently than marketing
interfaces: section wise,report look