Modern Data Architecture
By W H Inmon
OCTOBER 2015
TABLE OF CONTENTS
DATA WAREHOUSE
ENTER BIG DATA
DO YOU NEED A DATA WAREHOUSE WHEN YOU HAVE BIG DATA?
AN ARCHITECTURE
A TECHNOLOGY
HARMONIOUS COEXISTENCE
REPETITIVE/NON-REPETITIVE DATA
THE “GREAT DIVIDE”
DATA MODELING
TEXTUAL DISAMBIGUATION
CONTEXT ENRICHED BIG DATA
TWO KINDS OF DATA IN THE DATA WAREHOUSE
A NEW TYPE OF ANALYTICAL PROCESSING
REPETITIVE DATA/DATA WAREHOUSE INTERFACE
ARCHIVAL DATA TO BIG DATA
DOING ANALYTICS
DATA MARTS AND THE DIMENSIONAL MODEL
WHAT ABOUT MODELING?
THE SYSTEM OF RECORD
THE REMAINING ISSUES
DATA WAREHOUSE
The data warehouse is an established concept and discipline that is discussed in books, conferences
and seminars. Indeed, data warehouses are a standard feature of modern corporations.
Corporations use data warehouses to make business decisions every day. In short, the data
warehouse represents “conventional wisdom” and is a standard part of the corporate
infrastructure.
First off, what is a data warehouse? From the beginning, the accepted definition has been that a
data warehouse is a:
Subject oriented
Integrated
Time variant
Non-volatile
collection of data in support of management’s decisions. This definition is widely quoted as the
answer to what a data warehouse is. (See BUILDING THE DATA WAREHOUSE, John Wiley,
originally published 1991.)
ENTER BIG DATA
The definition of Big Data is not quite as clear. Indeed, there are different interpretations of
what is meant by “Big Data”. For the purposes of this paper, the following definition of Big
Data will be used.
Big Data:
Encompasses very large volumes of data
Is stored on affordable storage
Is stored in an unstructured manner
Is managed using the “Roman census” technique (sketched below).
(For an in-depth discussion of this definition, refer to the book BIG DATA – A PRIMER FOR THE
DATA SCIENTIST, Elsevier, 2014.)
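The “Roman census” technique deserves a brief illustration: rather than bringing all of the data to a central point, the processing is sent out to where the data resides, and only small results travel back to be totaled centrally. Below is a minimal Python sketch of that idea, using local worker processes as stand-ins for distributed storage nodes; the partition contents and field names are invented for the example.

```python
# Minimal sketch of the "Roman census" idea: the work is sent to the data
# (each worker handles its own partition locally) and only small summary
# results travel back to be totaled centrally.
from multiprocessing import Pool

def count_large_sales(partition):
    """Process one partition where it lives; return only a summary count."""
    return sum(1 for record in partition if record["amount"] > 10.00)

if __name__ == "__main__":
    # Stand-ins for blocks of data that would live on separate nodes.
    partitions = [
        [{"amount": 4.95}, {"amount": 19.99}],
        [{"amount": 12.50}, {"amount": 3.25}, {"amount": 41.00}],
    ]
    with Pool(processes=2) as pool:
        per_partition = pool.map(count_large_sales, partitions)
    print("total records over $10.00:", sum(per_partition))   # 3
```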
AN ARCHITECTURE
A data warehouse is an architecture. Building and maintaining a data warehouse requires
discipline. A data warehouse can be stored on a variety of media. The essence of a data
warehouse is the integrity of its data; another way of thinking of a data warehouse is as a
single version of the truth. The data that enters a data warehouse is carefully crafted and
vetted. The data found in a data warehouse is used for the most basic decisions the
corporation makes.
Traditionally the data entering a data warehouse is integrated by means of technology called
“ETL” (extract/transform/load). Data typically starts off in an application and is recast into a
singular, integrated corporate format when it is placed inside a data warehouse.
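As a small illustration of the recasting that ETL performs, here is a hypothetical Python sketch. The application record layout and the target “corporate format” are invented for the example; in practice this work is done by dedicated ETL tooling.

```python
# ETL sketch: extract records as the application produced them, transform
# them into a single integrated corporate format, and load the result.
from datetime import datetime

def extract(application_records):
    """Extract: pull the records in whatever shape the application uses."""
    return list(application_records)

def transform(record):
    """Transform: recast the record into the integrated corporate format
    (consistent keys, numeric amounts, and ISO dates)."""
    return {
        "customer_id": record["cust"].strip().upper(),
        "amount_usd": float(record["amt"]),
        "sale_date": datetime.strptime(record["dt"], "%m/%d/%Y").date().isoformat(),
    }

def load(warehouse_table, rows):
    """Load: place the vetted, integrated rows into the warehouse."""
    warehouse_table.extend(rows)

warehouse_table = []
source = [{"cust": " ab123 ", "amt": "19.99", "dt": "09/14/2015"}]
load(warehouse_table, [transform(r) for r in extract(source)])
print(warehouse_table)
# [{'customer_id': 'AB123', 'amount_usd': 19.99, 'sale_date': '2015-09-14'}]
```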
A TECHNOLOGY
Big Data, on the other hand, is a technology. Big Data is capable of storing very large amounts
of data. Big Data is a physical medium: in Big Data there are storage mechanisms that cause
data to be written and then retrieved when desired.
Consider the difference between the time of day and a Rolex. The time of day is the time of
day regardless of what a Rolex says, and one Rolex may show one time while another Rolex
shows another.
The same difference exists between an architecture and a technology. You can put a data
warehouse on Big Data or on standard storage technology; it is still a data warehouse wherever
it is located. Equally, you can put data that is not a data warehouse on Big Data or on standard
storage technology.
There is no competition between Big Data and a data warehouse. They are entirely different
things.
HARMONIOUS COEXISTENCE
Despite the confusion sown by Big Data vendors, there is a need to understand how Big Data
and the data warehouse can coexist. From an architectural standpoint, there needs to be a
“big picture” that outlines how Big Data and the data warehouse can coexist and work together
in a harmonious and constructive manner.
REPETITIVE/NON-REPETITIVE DATA
The general architecture figure shows several major architectural features. The first major
architectural feature is that Big Data is divided into two major subdivisions – repetitive
occurrences of data and non-repetitive occurrences of data.
Repetitive occurrences of data consist of data where the same structure of data is repeated
many times. There are many examples of repetitive data. Typical repetitive data
consists of log tape records, telephone call detail records, click stream data, metering
data, meteorological data, and so forth. In repetitive data, the same structure of data occurs
over and over again. In many cases repetitive data is machine-generated or produced by
analog processing.
Non-repetitive data also has many examples. Some examples of non-repetitive data include
email, call center conversations, survey comments, help desk conversations, warranty claim
data, and so forth. In non-repetitive data it is only an accident if the same data or the same
structure of data ever occurs twice. In almost every case, non-repetitive data is text-based
data generated by the written or the spoken word.
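The distinction is easy to see in miniature. In the hypothetical sketch below, every call detail record shares the same fixed structure and only the values change, while the call center notes share no structure at all; the field names and values are invented for illustration.

```python
# Repetitive data: the same record structure occurs over and over again;
# only the values differ from one occurrence to the next.
from dataclasses import dataclass

@dataclass
class CallDetailRecord:
    caller: str
    callee: str
    duration_seconds: int

repetitive = [
    CallDetailRecord("303-555-0101", "212-555-0187", 642),
    CallDetailRecord("303-555-0101", "415-555-0114", 58),
    CallDetailRecord("720-555-0190", "212-555-0187", 1311),
]

# Non-repetitive data: free-form text where the same content or structure
# occurring twice is only an accident.
non_repetitive = [
    "Customer called about a cracked screen and wants a warranty swap.",
    "Caller unhappy with billing; promised a callback from a supervisor.",
]
```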
DATA MODELING
One of the interesting differences between repetitive data and non-repetitive data is in terms
of how the data is modeled. Repetitive data is typically modeled by an ERD (entity relationship
diagram) data model. Non-repetitive data is modeled in an entirely different manner by the
usage of taxonomies and ontologies.
With an ERD the designer is free to change the data to fit the model. But with taxonomies and
ontologies, the base data NEVER changes. As a consequence, if there is a need to make
changes, it is the taxonomy or ontology that changes, not the base data.
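That point can be shown with a tiny, hypothetical taxonomy sketch in Python: the base text is read but never altered, and a reclassification is accomplished by editing the taxonomy alone. The terms and categories below are invented for illustration.

```python
# A toy taxonomy: specific terms map upward to broader categories.
# To change the classification, this mapping is edited; the base data
# itself is never modified.
taxonomy = {
    "sedan": "automobile",
    "pickup": "automobile",
    "motorcycle": "vehicle",
}

def classify(base_text, taxonomy):
    """Return (term, category) pairs; the base text is treated as read-only."""
    matches = []
    for word in base_text.lower().split():
        word = word.strip(".,;:")
        if word in taxonomy:
            matches.append((word, taxonomy[word]))
    return matches

note = "Customer traded a sedan for a pickup."
print(classify(note, taxonomy))   # [('sedan', 'automobile'), ('pickup', 'automobile')]
```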
Both types of data models can be (and usually should be) built generically. There is very little
difference between the models built by different corporations in the same industry. As a
consequence, generic models – at least as a starting point – are strongly advised.
TEXTUAL DISAMBIGUATION
Non-repetitive data is typically handled and managed by passing it through technology known
as “textual disambiguation”. The non-repetitive data is read, reformatted and – more
importantly – contextualized. In order to make any sense out of non-repetitive data, its
context must be established. The job of textual disambiguation is to derive and identify the
context of non-repetitive data. In many cases the context of the non-repetitive data is MORE
important than the data itself. In any case, non-repetitive data cannot be used for decision
making until its context has been established.
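Textual disambiguation is a substantial technology in its own right, but the shape of its output can be sketched: raw text goes in, and structured rows that carry both the data and its derived context come out. The sketch below is a drastically simplified, hypothetical stand-in for that process, with an invented rule set; it is not a description of the actual mechanism.

```python
# Simplified stand-in for contextualization: read raw non-repetitive text
# and emit structured rows of (document, term, derived context) that could
# be loaded into a relational data warehouse.
CONTEXT_RULES = {
    "refund": "customer dissatisfaction",
    "overdraft": "account problem",
    "fracture": "injury report",
}

def contextualize(doc_id, raw_text):
    rows = []
    for word in raw_text.lower().split():
        word = word.strip(".,;:!?")
        if word in CONTEXT_RULES:
            # The derived context travels with the data itself.
            rows.append({"doc": doc_id, "term": word,
                         "context": CONTEXT_RULES[word]})
    return rows

print(contextualize("email-0417", "Please issue a refund before my overdraft hits."))
# [{'doc': 'email-0417', 'term': 'refund', 'context': 'customer dissatisfaction'},
#  {'doc': 'email-0417', 'term': 'overdraft', 'context': 'account problem'}]
```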
It is possible that there is so much contextualized data that it cannot be sent to the data
warehouse environment because of the sheer volume of data. If, however, the contextualized
data is sent to the classic data warehouse, the processing that takes place on it can be done
with standard analytical tools such as Tableau, Qlik, Business Objects, SAS, Excel, and so forth.
One of the really nice things about the two types of data in the data warehouse is that because
all the data arrives in a structured relational format, the data can be freely mixed and matched,
and joins and analysis across the different data types can be done.
The interface between repetitive data in Big Data and the data warehouse involves two kinds of
processing. The first is filtering: the reading and selection of records that are then sent to the
data warehouse.
The second kind of processing is distillation. Distillation is similar to filtering except distillation
requires that further processing be done before the records are sent to the data warehouse. A
simple example of distillation might be the counting of records that have been selected. For
example, the distillation process may simply count the number of sales of items greater than
$10.00 for each Wal-Mart store for the month of September 2015.
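A minimal Python sketch of the two kinds of processing, using the store example above, may make the difference concrete; the record layout and store identifiers are invented for illustration.

```python
# Filtering: select the repetitive records of interest and pass them on as-is.
# Distillation: process the selected records further (here, count per store)
# and pass on only the distilled result.
from collections import Counter

sales = [
    {"store": "store-114", "amount": 12.49, "month": "2015-09"},
    {"store": "store-114", "amount": 3.99,  "month": "2015-09"},
    {"store": "store-207", "amount": 27.00, "month": "2015-09"},
    {"store": "store-207", "amount": 14.25, "month": "2015-08"},
]

# Filtering: sales over $10.00 in September 2015, sent along unchanged.
filtered = [s for s in sales
            if s["amount"] > 10.00 and s["month"] == "2015-09"]

# Distillation: only the per-store counts go to the data warehouse.
distilled = Counter(s["store"] for s in filtered)
print(dict(distilled))   # {'store-114': 1, 'store-207': 1}
```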
The results of both the distillation and the filtering of Big Data are placed in the data warehouse.
Usually the results are placed in a separate part of the data warehouse, since their basis is not
the structured, transaction-based data normally found there.
Note that the process of filtering and distilling repetitive data can become quite
involved. Usually the complications come from handling the volume of data that is
needed for the analysis. In some cases, there is an enormous amount of data that must be
processed. In other cases, the characteristics of the data being sought are ambiguous or not
clearly defined.
DOING ANALYTICS
Analytics can be done all over the landscape. Classic analytical processing of transaction-based
data is done in the data warehouse as it has always been done. Nothing has changed there.
But now analytics on contextualized data can be done, and that form of analytics is new and
novel. Most organizations have not been able to base decision making on unstructured textual
data before. And there is a new form of analytics that is possible in the data warehouse, which
is the possibility of blended analytics. Blended analytics is analytics done using a blend of
structured transactional data and unstructured contextualized data.
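Because both kinds of data land in the warehouse in relational form, a blended query is nothing exotic. The sketch below uses an in-memory SQLite database as a hypothetical stand-in for the warehouse; the table and column names are assumptions made for the example.

```python
# Blended analytics sketch: join structured transactional data with
# contextualized rows derived from unstructured text, inside one warehouse.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales   (customer_id TEXT, amount REAL);
    CREATE TABLE context (customer_id TEXT, derived_context TEXT);
    INSERT INTO sales   VALUES ('C001', 250.0), ('C002', 75.5);
    INSERT INTO context VALUES ('C001', 'customer dissatisfaction');
""")

# Revenue at risk: spend by customers whose text shows dissatisfaction.
rows = con.execute("""
    SELECT s.customer_id, SUM(s.amount) AS spend
    FROM sales s
    JOIN context c ON c.customer_id = s.customer_id
    WHERE c.derived_context = 'customer dissatisfaction'
    GROUP BY s.customer_id
""").fetchall()
print(rows)   # [('C001', 250.0)]
```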
But there are many other forms of analytics that are possible as well. There is the possibility of
doing analytics inside the repetitive data Big Data environment. This is where NoSQL analytical
processing is a possibility. And another form of analytics is analytics of context-enriched Big
Data: a certain portion of the Big Data environment consists of context-enriched data, which
can produce its own analytical results as well.
Each of these different forms of analytical processing produces its own unique results.
Interestingly, the data model, the dimensional model, the taxonomy and the ontology are all
closely related yet still different. They are like blood-related siblings in a family. Look at a
group of siblings and you see similar skin color and similar noses, mouths and eyes, and at the
same time you see the individual differences that each sibling has. They are all clearly from the
same family, and at the same time they are all still unique individuals.
But then the more important question arises – can we achieve integrity of data across the
architectural landscape? The answer is a resounding yes. By using a consistent modeling
strategy across all types of data, you can establish the foundation for data integrity.
References:
BUILDING THE DATA WAREHOUSE, John Wiley & Sons – the original book on data warehousing.
DATA ARCHITECTURE – A PRIMER FOR THE DATA SCIENTIST, Elsevier, 2014 – a complete
description of data architecture.
THE DATA WAREHOUSE TOOLKIT, John Wiley & Sons – a guide to dimensional modeling and the building
of data marts.
Sponsored by:
Embarcadero Technologies, Inc. is a leading provider of award-winning tools for application developers
and database professionals so they can design systems right, build them faster and run them better,
regardless of their platform or programming language. ER/Studio is the company’s flagship data
architecture solution that combines business-driven data modeling and collaboration in a multi-platform
environment. ER/Studio is a registered trademark of Embarcadero Technologies. To learn more, please
visit http://www.embarcadero.com.