Data Warehouse and Data Sources

A data warehouse is a subject-oriented, integrated collection of historical data used to support management decision making. It contains data from multiple sources cleaned and transformed into a consistent format. The ETL process extracts data from source systems, scrubs the data to improve quality, transforms it to the warehouse format, and loads it into the data warehouse where it is indexed. OLAP provides summary views and calculations on warehouse data to answer questions like comparisons and trends over time, while data mining discovers hidden patterns in detailed data to predict likely customer behavior.

Uploaded by

Anshul Mehrotra
Copyright
© Attribution Non-Commercial (BY-NC)

Presented by Amit (11012), Anand (11013), Anshul (11014)

Definition

Data Warehouse:
A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes

Data Warehousing:
The process of constructing and using a data warehouse

Subject-oriented
The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together.

Non-volatile
Data in the data warehouse are never overwritten or deleted once committed; the data are static, read-only, and retained for future reporting.

Integrated
The data warehouse contains data from most or all of an organization's operational systems, and these data are made consistent.

Time-variant
The time horizon for the data warehouse is significantly longer than that of operational systems.

Data Sources

Sources that feed data into the data warehouse: plain text files, relational databases, other types of databases, Excel files, etc.

Many different types of data can be a data source:

Operations, such as sales data, HR data, product data, inventory data, marketing data, systems data.
Web server logs with user browsing data.
Internal market research data.
Third-party data, such as census data, demographics data, or survey data.

Operational Systems vs Data Warehousing Systems


OPERATIONAL                                         DATA WAREHOUSE
Holds current data                                  Holds historic data
Data is dynamic                                     Data is largely static
Read/write access                                   Read-only access
Repetitive processing                               Ad hoc complex queries
Transaction driven                                  Analysis driven
Application oriented                                Subject oriented
Used by clerical staff for day-to-day operations    Used by top managers for analysis
Normalized data model (ER model)                    Denormalized data model (dimensional model)
Must be optimized for writes and small queries      Must be optimized for queries involving a large portion of the warehouse

Advantages of Data Warehousing

Potential high return on investment
Competitive advantage
Increased productivity of corporate decision makers

Problems with Data Warehousing

Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
High maintenance
Long duration projects
Complexity of integration

The ETL Process


ETL = Extract, Transform, and Load

Capture (extract)
Scrub (data cleansing)
Transform
Load and Index


Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
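
The difference between a static and an incremental extract can be sketched in a few lines of Python. This is a minimal illustration, not a real extraction tool: the `SOURCE_ROWS` data, the `modified` timestamp field, and both function names are assumptions made up for the example.

```python
from datetime import datetime

# Hypothetical source rows; the "modified" field is an assumed
# last-changed timestamp maintained by the source system.
SOURCE_ROWS = [
    {"id": 1, "amount": 100.0, "modified": datetime(2024, 1, 5)},
    {"id": 2, "amount": 250.0, "modified": datetime(2024, 1, 20)},
    {"id": 3, "amount": 75.0, "modified": datetime(2024, 2, 2)},
]


def static_extract(rows):
    """Return a snapshot (copy) of every source row at this point in time."""
    return [dict(r) for r in rows]


def incremental_extract(rows, since):
    """Return copies of only the rows changed after the previous extract."""
    return [dict(r) for r in rows if r["modified"] > since]


snapshot = static_extract(SOURCE_ROWS)
changes = incremental_extract(SOURCE_ROWS, since=datetime(2024, 1, 15))
print(len(snapshot), [r["id"] for r in changes])
```

The incremental extract depends on the source being able to say what changed; a timestamp column is only one possible way to do that.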


Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality

Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
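
A few of the scrub tasks listed above (reformatting, duplicate removal, error logging for missing data) can be shown with plain rules. This sketch deliberately uses simple string handling rather than pattern recognition or AI techniques, and the `name` and `city` fields and the `scrub` function are hypothetical.

```python
def scrub(records):
    """Normalise text fields, drop duplicates, and log rows with missing data."""
    cleaned, errors, seen = [], [], set()
    for rec in records:
        # Reformat: collapse extra whitespace and use consistent capitalisation.
        name = " ".join((rec.get("name") or "").split()).title()
        city = (rec.get("city") or "").strip().title()
        if not name or not city:
            errors.append(rec)  # error logging: keep the rejected row for review
            continue
        key = (name, city)
        if key in seen:  # duplicate data: keep only the first occurrence
            continue
        seen.add(key)
        cleaned.append({"name": name, "city": city})
    return cleaned, errors


raw = [
    {"name": "  alice  SMITH", "city": "delhi"},
    {"name": "Alice Smith", "city": "Delhi "},
    {"name": "Bob Rao", "city": ""},
]
cleaned, errors = scrub(raw)
print(cleaned, errors)
```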

Transform = convert data from format of operational system to format of data warehouse
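
As a rough illustration of the transform step, the sketch below converts one operational record into a warehouse-style row. The field names (`cust_id`, `sale_dt`, `amt`), the day/month/year date format, and the target layout are all assumptions chosen for the example.

```python
from datetime import datetime


def transform(op_row):
    """Convert one operational record (all strings) to a typed warehouse row."""
    sale_date = datetime.strptime(op_row["sale_dt"], "%d/%m/%Y").date()
    return {
        "customer_key": int(op_row["cust_id"]),      # string id -> integer key
        "sale_date": sale_date.isoformat(),          # consistent ISO date format
        "year": sale_date.year,                      # derived time attributes
        "quarter": (sale_date.month - 1) // 3 + 1,
        "amount": round(float(op_row["amt"]), 2),
    }


row = transform({"cust_id": "0042", "sale_dt": "15/08/2023", "amt": "1999.50"})
print(row)
```

Deriving `year` and `quarter` at this stage is one way to support the time-based comparisons a warehouse is queried for.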


Load/Index = place transformed data into the warehouse and create indexes

Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to data warehouse
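
The two load modes can be contrasted with an in-memory SQLite database standing in for the warehouse. This is a sketch under assumptions: the `sales_fact` table, its columns, and the function names are invented for illustration, and update mode is shown here as appending new rows only.

```python
import sqlite3


def create_target(conn):
    """Create the hypothetical target table and an index on it."""
    conn.execute(
        "CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, amount REAL)"
    )
    conn.execute("CREATE INDEX idx_sales_amount ON sales_fact (amount)")


def refresh_load(conn, rows):
    """Refresh mode: bulk rewrite of the whole target table."""
    with conn:
        conn.execute("DELETE FROM sales_fact")
        conn.executemany(
            "INSERT INTO sales_fact (sale_id, amount) VALUES (?, ?)", rows
        )


def update_load(conn, new_rows):
    """Update mode: write only the rows that changed in the source."""
    with conn:
        conn.executemany(
            "INSERT INTO sales_fact (sale_id, amount) VALUES (?, ?)", new_rows
        )


conn = sqlite3.connect(":memory:")
create_target(conn)
refresh_load(conn, [(1, 100.0), (2, 250.0)])
update_load(conn, [(3, 75.0)])
rows = conn.execute(
    "SELECT sale_id, amount FROM sales_fact ORDER BY sale_id"
).fetchall()
print(rows)
```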


OLAP provides summary data and generates rich calculations. OLAP answers questions like "How do sales of mutual funds in North America for this quarter compare with sales a year ago? What can we predict for sales next quarter? What is the trend as measured by percent change?"

Data mining discovers hidden patterns in data. Data mining operates at a detail level instead of a summary level. Data mining answers questions like "Who is likely to buy a mutual fund in the next six months, and what are the characteristics of these likely buyers?"
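
The OLAP-style question about this quarter versus a year ago comes down to summarizing detail rows and computing a percent change. The sketch below shows that arithmetic only; the `SALES` figures are made-up sample data and the function names are hypothetical, not part of any OLAP product.

```python
from collections import defaultdict

# Made-up detail rows: (region, year, quarter, amount)
SALES = [
    ("North America", 2023, 1, 400.0),
    ("North America", 2023, 1, 600.0),
    ("North America", 2024, 1, 700.0),
    ("North America", 2024, 1, 500.0),
    ("Europe", 2024, 1, 300.0),
]


def summarize(rows):
    """Roll detail rows up to one total per (region, year, quarter)."""
    totals = defaultdict(float)
    for region, year, quarter, amount in rows:
        totals[(region, year, quarter)] += amount
    return dict(totals)


def year_over_year_change(totals, region, year, quarter):
    """Percent change against the same quarter one year earlier."""
    current = totals[(region, year, quarter)]
    previous = totals[(region, year - 1, quarter)]
    return (current - previous) * 100.0 / previous


totals = summarize(SALES)
change = year_over_year_change(totals, "North America", 2024, 1)
print(f"{change:.1f}% change versus the same quarter a year ago")
```

With the sample data, the totals are 1000.0 and 1200.0, so the change works out to 20 percent.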
