Data Warehouse and Data Sources
Definition
Data Warehouse:
A subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision-making process.
Data Warehousing:
The process of constructing and using a data warehouse.
Subject-oriented
The data in the data warehouse are organized so that all the data elements relating to the same real-world event or object are linked together.
Non-volatile
Data in the data warehouse are never overwritten or deleted. Once committed, the data are static, read-only, and retained for future reporting.
Integrated
The data warehouse contains data from most or all of an organization's operational systems and these data are made consistent.
Time-variant
The time horizon for the data warehouse is significantly longer than that of operational systems.
Data Sources
Sources that feed data into the data warehouse.
Typical formats: plain text files, relational databases, other types of database, Excel files, etc.
Typical content: inventory data, marketing data, systems data; web server logs with user browsing data; internal market research data; third-party data, such as census data, demographics data, or survey data.
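Data is Dynamic
Operational systems, unlike the data warehouse, work with data that changes continuously. Their characteristics:
Read/write access
Repetitive processing
Transaction driven
Application oriented
Used by clerical staff for day-to-day operations
Normalized data model (ER model)
Must be optimized for writes and small queries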
Advantages of Data Warehousing
Potential high return on investment
Competitive advantage
Increased productivity of corporate decision makers
Problems of Data Warehousing
Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
High maintenance
Long-duration projects
Complexity of integration
Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse.
Static extract = capturing a snapshot of the source data at a point in time.
Incremental extract = capturing changes that have occurred since the last static extract.
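The difference between the two extract types can be sketched in a few lines of Python. This is only an illustration: the row layout (`order_id`, `amount`, `last_modified`), the sample values, and the function names are assumptions, and it presumes the source system records a last-modified timestamp on each row.

```python
from datetime import datetime

# Hypothetical source rows: (order_id, amount, last_modified)
SOURCE_ROWS = [
    (1, 120.0, datetime(2024, 1, 5)),
    (2, 75.5, datetime(2024, 1, 20)),
    (3, 300.0, datetime(2024, 2, 2)),
]

def static_extract(rows):
    """Static extract: a snapshot of the chosen source data at a point in time."""
    return list(rows)

def incremental_extract(rows, since):
    """Incremental extract: only rows changed after the previous extract."""
    return [row for row in rows if row[2] > since]

snapshot = static_extract(SOURCE_ROWS)
changes = incremental_extract(SOURCE_ROWS, since=datetime(2024, 1, 15))
```

Here `snapshot` holds all three rows, while `changes` holds only the rows modified after 15 January 2024.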
Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality.
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies.
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data.
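A small rule-based sketch of the scrub step follows. It deliberately uses simple hand-written rules rather than pattern recognition or AI techniques, and it covers only a few of the listed tasks: reformatting, erroneous dates, missing data, duplicate data, and error logging. The field names (`name`, `order_date`) and the sample records are assumptions.

```python
from datetime import datetime

def scrub(records):
    """Return (clean_rows, error_log) for a list of dict records."""
    clean, errors, seen = [], [], set()
    for rec in records:
        # Reformatting: collapse whitespace and normalize capitalization.
        name = " ".join(rec.get("name", "").split()).title()
        # Erroneous dates: reject anything that is not a real calendar date.
        try:
            date = datetime.strptime(rec.get("order_date", ""), "%Y-%m-%d").date()
        except ValueError:
            errors.append((rec, "erroneous date"))
            continue
        # Missing data: a record without a name cannot be loaded.
        if not name:
            errors.append((rec, "missing name"))
            continue
        # Duplicate data: keep only the first record per (name, date).
        key = (name, date)
        if key in seen:
            errors.append((rec, "duplicate"))
            continue
        seen.add(key)
        clean.append({"name": name, "order_date": date})
    return clean, errors

raw = [
    {"name": "  alice   smith ", "order_date": "2024-01-05"},
    {"name": "Alice Smith", "order_date": "2024-01-05"},  # duplicate after cleanup
    {"name": "Bob Jones", "order_date": "2024-02-30"},    # impossible date
    {"name": "", "order_date": "2024-03-01"},             # missing name
]
clean, errors = scrub(raw)
```

Rejected records are logged with a reason instead of being silently dropped, which corresponds to the error detection/logging item above.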
Transform = convert data from the format of the operational system to the format of the data warehouse.
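As an illustration of the transform step, the sketch below maps one operational row to a warehouse row. Both layouts are hypothetical: the operational fields (`sold_on`, `region_cd`, `amount_cents`), the code table, and the warehouse fields (`date_key`, `region`, `amount_usd`) are assumptions chosen only to show a date conversion, a code lookup, and a unit conversion.

```python
from datetime import datetime

# Hypothetical code table used by the operational system.
REGION_CODES = {"NA": "North America", "EU": "Europe"}

def transform(op_row):
    """Map one operational row (assumed layout) to a warehouse row."""
    sold = datetime.strptime(op_row["sold_on"], "%d/%m/%Y").date()
    return {
        # Integer date key in YYYYMMDD form.
        "date_key": sold.year * 10000 + sold.month * 100 + sold.day,
        # Decode the region code into a readable name.
        "region": REGION_CODES.get(op_row["region_cd"], "Unknown"),
        # Convert cents to currency units.
        "amount_usd": round(op_row["amount_cents"] / 100, 2),
    }

row = transform({"sold_on": "05/03/2024", "region_cd": "NA", "amount_cents": 129950})
```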
Load/Index = place transformed data into the warehouse and create indexes.
Refresh mode: bulk rewriting of
Update mode: only changes in
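A minimal sketch of loading and indexing, using Python's built-in `sqlite3` module as a stand-in for a real warehouse database. The table name `sales_fact`, its columns, and the index name are assumptions made for the example; it shows a plain load followed by index creation and does not model the refresh or update modes.

```python
import sqlite3

def load_and_index(conn, rows):
    """Place transformed rows into a fact table, then index it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact ("
        "date_key INTEGER, region TEXT, amount_usd REAL)"
    )
    conn.executemany(
        "INSERT INTO sales_fact (date_key, region, amount_usd) VALUES (?, ?, ?)",
        rows,
    )
    # Index the column that reporting queries are assumed to filter on.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_sales_date ON sales_fact (date_key)"
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load_and_index(
    conn,
    [(20240305, "North America", 1299.5), (20240306, "Europe", 80.0)],
)
row_count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
index_names = [r[1] for r in conn.execute("PRAGMA index_list('sales_fact')")]
```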
OLAP provides summary data and generates rich calculations. It answers questions like "How do sales of mutual funds in North America for this quarter compare with sales a year ago? What can we predict for sales next quarter? What is the trend as measured by percent change?"
Data mining discovers hidden patterns in data. Data mining operates at a detail level instead of a summary level. Data mining answers questions like "Who is likely to buy a mutual fund in the next six months, and what are the characteristics of these likely buyers?"
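The OLAP-style question about percent change can be worked through on toy data. The sketch below first rolls detail rows up to a summary level (region by quarter) and then compares one quarter with the same quarter a year earlier. All figures and names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical detail rows: (region, quarter, sales amount)
detail = [
    ("North America", "2023-Q1", 120.0),
    ("North America", "2023-Q1", 80.0),
    ("North America", "2024-Q1", 150.0),
    ("North America", "2024-Q1", 100.0),
    ("Europe", "2024-Q1", 60.0),
]

def summarize(rows):
    """Roll detail rows up to total sales per (region, quarter)."""
    totals = defaultdict(float)
    for region, quarter, amount in rows:
        totals[(region, quarter)] += amount
    return dict(totals)

def percent_change(current, previous):
    """Trend measured as percent change relative to the earlier period."""
    return (current - previous) / previous * 100.0

totals = summarize(detail)
change = percent_change(
    totals[("North America", "2024-Q1")],
    totals[("North America", "2023-Q1")],
)
print(f"{change:.1f}%")  # prints 25.0%
```

A data mining task would instead work on the individual detail rows, for example to characterize likely buyers, rather than on these aggregated totals.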