
BIG DATA ANALYTICS – UNIT I

INTRODUCTION
Big data is the term for a collection of data sets so large and complex that it becomes difficult to
process using on-hand database management tools or traditional data processing applications.
• The challenges include capture, curation, storage, search, sharing, transfer, analysis, and
visualization.
• The trend to larger data sets is due to the additional information derivable from analysis of a
single large set of related data, as compared to separate smaller sets with the same total amount
of data, allowing correlations to be found to "spot business trends, determine quality of research,
prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic
conditions.”

Definition of Big Data:


Big data is high-volume, high-velocity, and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and decision making.
Big data refers to datasets whose size is typically beyond the storage capacity of, and too
complex for, traditional database software tools.
Big data is anything beyond the human and technical infrastructure needed to support storage,
processing, and analysis.


Volume (Scale)


Variety (Complexity)
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
• Social Network, Semantic Web (RDF), …
• Streaming Data
• You can only scan the data once
• A single application can be generating/collecting many types of data
• Big Public Data (online, weather, finance, etc.)


Velocity (Speed)
• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions → missing opportunities
• Examples
• E-Promotions: Based on your current location, your purchase history, what you like → send
promotions right now for the store next to you
• Healthcare monitoring: sensors monitoring your activities and body → any abnormal
measurements require immediate reaction


Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the
collected data in a timely manner and in a scalable fashion
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Mobile devices (tracking all objects all the time)
Sensor technology and networks (measuring all kinds of data)


WHAT IS BIG DATA?


According to Gartner, the definition of Big Data: "Big data is high-volume, high-velocity, and
high-variety information assets that demand cost-effective, innovative forms of information
processing for enhanced insight and decision making."
This definition clearly answers the “What is Big Data?” question – Big Data refers to complex
and large data sets that have to be processed and analyzed to uncover valuable information that
can benefit businesses and organizations.
However, there are certain basic tenets of Big Data that will make it even simpler to answer what
is Big Data:
 It refers to a massive amount of data that keeps on growing exponentially with time.
 It is so voluminous that it cannot be processed or analyzed using conventional data
processing techniques.
 It includes data mining, data storage, data analysis, data sharing, and data visualization.
 The term is an all-comprehensive one including data, data frameworks, along with the
tools and techniques used to process and analyze the data.
CHARACTERISTICS OF DATA:
As depicted in Figure 1.2, data has three key characteristics:
1. Composition: The composition of data deals with the structure of data, that is, the sources of
data, the granularity, the types, and the nature of data as to whether it is static or real-time
streaming.
2. Condition: The condition of data deals with the state of data, that is, "Can one use this data as
is for analysis?" or "Does it require cleansing for further enhancement and enrichment?"
3. Context: The context of data deals with "Where has this data been generated?", "Why was this
data generated?", "How sensitive is this data?", "What are the events associated with this data?",
and so on.
Small data (data as it existed prior to the big data revolution) is about certainty. It is about known
data sources; it is about no major changes to the composition or context of data.

Most often we have answers to queries like why this data was generated, where and when it was
generated, exactly how we would like to use it, what questions this data will be able to answer,
and so on. Big data is about complexity: complexity in terms of multiple and unknown datasets,
in terms of exploding volume, in terms of the speed at which the data is being generated and the
speed at which it needs to be processed, and in terms of the variety of data (internal or external,
behavioural or social) that is being generated.
TYPES OF DIGITAL DATA
Types of Big Data
Now that we are on track with what is big data, let’s have a look at the types of big data:
a) Structured
By structured data, we mean data that can be processed, stored, and retrieved in a fixed format.
It refers to highly organized information that can be readily and seamlessly stored in and
accessed from a database by simple search-engine algorithms. For instance, the employee table
in a company database will be structured: the employee details, their job positions, their salaries,
etc., will be present in an organized manner.
b) Unstructured
Unstructured data refers to the data that lacks any specific form or structure whatsoever. This
makes it very difficult and time-consuming to process and analyze unstructured data. Email is an
example of unstructured data. Structured and unstructured are two important types of big data.
c) Semi-structured
Semi-structured data is the third type of big data. It contains both of the formats mentioned
above, that is, structured and unstructured data. To be precise, it refers to data that, although not
classified under a particular repository (database), contains vital information or tags that
segregate individual elements within the data.
Unstructured data:
This is the data which does not conform to a data model or is not in a form which can be used
easily by a computer program.
 About 80-90% of the data in an organization is in this form; for example, memos, chat rooms,
PowerPoint presentations, images, videos, letters, research reports, white papers, the body of an
email, etc.
Semi-structured data: This is the data which does not conform to a data model but has some
structure.
However, it is not in a form which can be used easily by a computer program;
 for example, XML, markup languages like HTML, etc. Metadata for this data is available
but is not sufficient.
 Structured data: This is the data which is in an organized form (e.g., in rows and columns)
and can be easily used by a computer program. Relationships exist between entities of data, such
as classes and their objects. Data stored in databases is an example of structured data.
Approximate Percentage Distribution of Digital Data


Structured Data
 This is the data which is in an organized form (e.g., in rows and columns) and can be easily
used by a computer program.
 Relationships exist between entities of data, such as classes and their objects.
 Data stored in databases is an example of structured data.
Sources of Structured Data
 If the data is highly structured, one can look at leveraging any of the available RDBMSs
 [Oracle Corp. — Oracle, IBM — DB2, Microsoft — Microsoft SQL Server, EMC —
Greenplum, Teradata — Teradata, MySQL (open source), PostgreSQL (advanced open source),
etc.] to house it.
 These databases are typically used to hold transaction/operational data generated and collected
by day-to-day business activities. In other words, the data of On-Line Transaction Processing
(OLTP) systems is generally quite structured.


Ease of Working with Structured Data


The ease is with respect to the following:
 Insert/update/delete: The Data Manipulation Language (DML) operations provide the
required ease of data input, storage, access, processing, analysis, etc.
 Security: How does one ensure the security of information? Encryption and tokenization
solutions are available to warrant the security of information throughout its lifecycle.
Organizations are able to retain control and maintain compliance adherence by ensuring that
only authorized individuals are able to decrypt and view sensitive information.
 Indexing: An index is a data structure that speeds up the data retrieval operations (primarily
the SELECT DML statement) at the cost of additional writes and storage space, but the benefits
that ensue in search operations are worth the additional writes and storage space.
 Scalability: The storage and processing capabilities of the traditional RDBMS can be easily
scaled up by increasing the horsepower of the database server (increasing the primary
and secondary or peripheral storage capacity, the processing capacity of the processor, etc.).
Transaction processing: RDBMS has support for the Atomicity, Consistency, Isolation, and
Durability (ACID) properties of a transaction. A minimal sketch follows this list.
 Atomicity: A transaction is atomic, meaning that either it happens in its entirety or not at
all.
 Consistency: The database moves from one consistent state to another consistent state. In
other words, if the same piece of information is stored at two or more places, they are in
complete agreement.
 Isolation: The resource allocation to the transaction happens such that the transaction gets the
impression that it is the only transaction happening in isolation.
 Durability: All changes made to the database during a transaction are permanent, and that
accounts for the durability of the transaction.
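
As a concrete illustration of the DML, indexing, and ACID ideas above, here is a minimal
sketch using Python's built-in sqlite3 module; the employee table and its contents are invented
for illustration, not taken from the text.

import sqlite3

# In-memory database; any RDBMS would behave similarly for this sketch.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Structured data: a fixed schema of rows and columns (hypothetical employee table).
cur.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

# An index speeds up SELECTs on name at the cost of extra writes and storage.
cur.execute("CREATE INDEX idx_employee_name ON employee (name)")

# DML inserts wrapped in a transaction: either both rows commit or neither does (atomicity).
try:
    cur.execute("INSERT INTO employee VALUES (1, 'Alice', 50000)")
    cur.execute("INSERT INTO employee VALUES (2, 'Bob', 45000)")
    conn.commit()          # durable once committed
except sqlite3.Error:
    conn.rollback()        # nothing persists if any statement fails

print(cur.execute("SELECT name, salary FROM employee WHERE name = 'Alice'").fetchall())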

Semi-structured Data
 This is the data which does not conform to a data model but has some structure.
 However, it is not in a form which can be used easily by a computer program.
 Examples: emails, XML, markup languages like HTML, etc. Metadata for this data is available
but is not sufficient.
It has the following features:
 It does not conform to the data models that one typically associates with relational databases
or any other form of data tables.
 It uses tags to segregate semantic elements.
 Tags are also used to enforce hierarchies of records and fields within data.
 There is no separation between the data and the schema.
 The amount of structure used is dictated by the purpose at hand.
 In semi-structured data, entities belonging to the same class, even when grouped together, need
not necessarily have the same set of attributes.
 And even if they have the same set of attributes, the order of the attributes may not be the
same; for all practical purposes, the order is not important anyway.
Sources of Semi-structured Data
 Amongst the sources of semi-structured data, the front runners are "XML" and "JSON".
 XML: eXtensible Markup Language (XML) is hugely popularized by web services developed
utilizing the Simple Object Access Protocol (SOAP) principles.



 JSON: JavaScript Object Notation (JSON) is used to transmit data between a server and a
web application.
 JSON is popularized by web services developed utilizing the Representational State Transfer
(REST) principles - an architectural style for creating scalable web services.
 MongoDB (an open-source, distributed, NoSQL, document-oriented database) and Couchbase
(originally known as Membase; an open-source, distributed, NoSQL, document-oriented
database) store data natively in JSON format.
An example of HTML is as follows:
<HTML>
<HEAD>
<TITLE>Place your title here</TITLE>
</HEAD>
<BODY BGCOLOR="FFFFFF">
<CENTER><IMG SRC="clouds.jpg" ALIGN="BOTTOM"></CENTER>
<HR>
<a href="http://bigdatauniversity.com">Link Name</a>
<H1>This is a Header</H1>
<H2>This is a sub Header</H2>
Send me mail at <a href="mailto:[email protected]">[email protected]</a>.
<P>A new paragraph!
<P><B>A new paragraph!</B>
<BR><B><I>This is a new sentence without a paragraph break, in bold italics.</I></B>
<HR>
</BODY>
</HTML>
Sample JSON document
{
  "_id": 9,
  "BookTitle": "Fundamentals of Business Analytics",
  "AuthorName": "Seema Acharya",
  "Publisher": "Wiley India",
  "YearofPublication": "2011"
}
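
As a brief illustration of working with such semi-structured records, here is a sketch using
Python's standard json module; the second record is invented to show that documents need not
share a schema.

import json

# The sample document above, as a JSON string (keys quoted per the JSON standard).
doc = '''{
  "_id": 9,
  "BookTitle": "Fundamentals of Business Analytics",
  "AuthorName": "Seema Acharya",
  "Publisher": "Wiley India",
  "YearofPublication": "2011"
}'''

record = json.loads(doc)           # parse the text into a Python dict
print(record["BookTitle"])         # access a tagged element directly

# Semi-structured: another record may carry different attributes without a schema change.
record2 = {"_id": 10, "BookTitle": "Big Data Analytics", "Edition": "2nd"}
print(json.dumps(record2))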
Unstructured Data
 This is the data which does not conform to a data model or is not in a form which can be used
easily by a computer program.
 About 80–90% data of an organization is in this format.
 Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research
reports, white papers, the body of an email, etc.
Sources of Unstructured Data

Issues with the Terminology "Unstructured Data"

How to Deal with Unstructured Data?

 Today, unstructured data constitutes approximately 80% of the data that is being generated in
any enterprise.

 Data Mining:
 First, we deal with large data sets.
 Second, we use methods at the intersection of artificial intelligence, machine learning,
statistics, and database systems to unearth consistent patterns in large data sets and/or systematic
relationships between variables.
 It is the analysis step of the "knowledge discovery in databases" process.
A few popular data mining algorithms are as follows:
 Association rule mining:
 It is also called "market basket analysis" or "affinity analysis".

UNIT I Page 16
BIG DATA ANALSIS

 It is used to determine "what goes with what?"


 It is about, when you buy a product, what other product you are likely to purchase
with it.
 For example, if you pick up bread at the grocery, are you likely to pick up eggs or cheese to go
with it? A minimal sketch of this computation follows.
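
The sketch below computes the support and confidence of the rule bread → eggs over a handful
of invented baskets; the items and numbers are purely illustrative.

# Hypothetical market baskets (each set is one customer's purchase).
baskets = [
    {"bread", "eggs", "milk"},
    {"bread", "cheese"},
    {"bread", "eggs"},
    {"milk", "cheese"},
]

def support(itemset):
    """Fraction of baskets containing every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

# Rule: bread -> eggs
sup = support({"bread", "eggs"})                    # P(bread and eggs)
conf = sup / support({"bread"})                     # P(eggs | bread)
print(f"support={sup:.2f}, confidence={conf:.2f}")  # support=0.50, confidence=0.67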
Regression analysis:
 It helps to predict the relationship between two variables.
 The variable whose value needs to be predicted is called the dependent variable, and the
variables which are used to predict the value are referred to as the independent variables. A
small worked sketch follows.
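
A minimal sketch of simple linear regression with one independent variable, using the
closed-form ordinary-least-squares formulas; the data points are invented for illustration.

# Invented observations: x = independent variable, y = dependent variable.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares: slope = cov(x, y) / var(x).
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(f"y = {slope:.2f} * x + {intercept:.2f}")   # close to y = 2x
print("prediction at x=6:", slope * 6 + intercept)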
Collaborative filtering:
 It is about predicting a user's preference or preferences based on the preferences of a group of
users.
 For example, take a look at the table on the next slide.
 We are looking at predicting whether User 4 will prefer to learn using videos or is a textual
learner, depending on one or a couple of his or her known preferences.
 We analyze the preferences of similar user profiles and, on the basis of that, predict that User 4
will also like to learn using videos and is not a textual learner. A toy sketch of this idea follows.
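
A toy sketch of the User 4 example. Since the original preference table is not reproduced here,
the matrix below is invented (1 = likes the format, 0 = does not, None = unknown); the most
similar user's known value is taken as the prediction.

# Rows: users; columns: (videos, text, quizzes). None marks an unknown preference.
prefs = {
    "User 1": [1, 0, 1],
    "User 2": [1, 0, 0],
    "User 3": [0, 1, 1],
    "User 4": [None, 0, 0],   # we want to predict User 4's 'videos' preference
}

def similarity(a, b):
    """Count of matching known preferences between two users."""
    return sum(x == y for x, y in zip(a, b) if x is not None and y is not None)

target = prefs["User 4"]
# Pick the user most similar to User 4 on the overlapping known attributes.
best = max((u for u in prefs if u != "User 4"),
           key=lambda u: similarity(prefs[u], target))
predicted_videos = prefs[best][0]
print(best, "is most similar; predicted 'videos' preference:", predicted_videos)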
Text Analytics or Text Mining: Compared to the structured data stored in relational databases,
text is largely unstructured, amorphous, and difficult to deal with algorithmically.
 Text mining is the process of gleaning high-quality and meaningful information (through the
devising of patterns and trends by means of statistical pattern learning) from text.
 It includes tasks such as text categorization, text clustering, sentiment analysis, concept/entity
extraction, etc. A bare-bones sketch of one such task follows.
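
A bare-bones sketch of one text-mining task named above, sentiment analysis, using a tiny
hand-made word lexicon; both the lexicon and the sentences are invented for illustration (real
systems learn such patterns statistically).

import re

# Invented micro-lexicon; production systems derive such weights statistically.
lexicon = {"good": 1, "great": 2, "bad": -1, "terrible": -2}

def sentiment(text):
    """Sum lexicon scores over the lowercased word tokens of the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(t, 0) for t in tokens)

print(sentiment("The service was great, the food was good."))   # 3
print(sentiment("Terrible delays and bad support."))             # -3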
Natural language processing (NLP): It is related to the area of human-computer interaction. It
is about enabling computers to understand human or natural language input.
 Noisy text analytics: It is the process of extracting structured or semi-structured information
from noisy unstructured data such as chats, blogs, wikis, emails, message boards, text messages,
etc. The noisy unstructured data usually comprises one or more of the following: spelling
mistakes, abbreviations, acronyms, nonstandard words, missing punctuation, missing letter case,
filler words such as "uh", "um", etc. A small cleanup sketch is given after this list.
 Manual tagging with metadata: This is about tagging manually with adequate metadata to
provide the requisite semantics to understand unstructured data.
 Part-of-speech tagging: It is also called POS, POST, or grammatical tagging. It is the
process of reading text and tagging each word in the sentence as belonging to a particular part of
speech such as "noun", "verb", "adjective", etc.
 Unstructured Information Management Architecture (UIMA): It is an open-source
platform from IBM. It is used for real-time content analytics. It is about processing text and
other unstructured data to find the latent meaning and relevant relationships buried therein.
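
A small sketch of the noisy-text cleanup step described above: expanding a few known
abbreviations, restoring letter case, and dropping filler words. The abbreviation table and the
message are invented; real pipelines build such tables from labeled data.

import re

# Invented expansion table and filler list.
abbrev = {"u": "you", "thx": "thanks", "msg": "message"}
fillers = {"uh", "um"}

def clean(noisy):
    """Normalize a noisy chat-style message into plainer text."""
    tokens = re.findall(r"[a-zA-Z']+", noisy.lower())
    kept = [abbrev.get(t, t) for t in tokens if t not in fillers]
    sentence = " ".join(kept)
    return sentence[:1].upper() + sentence[1:]   # restore initial letter case

print(clean("uh thx for ur msg"))   # "Thanks for ur message"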
EVOLUTION OF BIG DATA:
1970s and before was the era of mainframes. The data was essentially primitive and structured.
Relational databases evolved in the 1980s and 1990s; that era was one of data-intensive
applications. The World Wide Web (WWW) and the Internet of Things (IoT) have led to an
onslaught of structured, unstructured, and multimedia data. Refer to Table 1.1.


THE HISTORY OF BIG DATA


Although the concept of big data itself is relatively new, the origins of large data sets go back to
the 1960s and '70s when the world of data was just getting started with the first data centers and
the development of the relational database.
Around 2005, people began to realize just how much data users generated through Facebook,
YouTube, and other online services. Hadoop (an open-source framework created specifically to
store and analyze big data sets) was developed that same year. NoSQL also began to gain
popularity during this time.
The development of open-source frameworks, such as Hadoop (and more recently, Spark) was
essential for the growth of big data because they make big data easier to work with and cheaper
to store. In the years since then, the volume of big data has skyrocketed. Users are still
generating huge amounts of data—but it’s not just humans who are doing it.
With the advent of the Internet of Things (IoT), more objects and devices are connected to the
internet, gathering data on customer usage patterns and product performance. The emergence of
machine learning has produced still more data.
While big data has come far, its usefulness is only just beginning. Cloud computing has
expanded big data possibilities even further. The cloud offers truly elastic scalability, where
developers can simply spin up ad hoc clusters to test a subset of data.
Benefits of Big Data and Data Analytics
 Big data makes it possible for you to gain more complete answers because you have more
information.
 More complete answers mean more confidence in the data—which means a completely
different approach to tackling problems.
Some examples of Big Data:
There are some examples of Big Data Analytics in different areas such as retail, IT
infrastructure, and social media.
Retail: As mentioned earlier, Big Data presents many opportunities to improve sales and
marketing analytics.


An example of this is the U.S. retailer Target. After analyzing consumer purchasing behavior,
Target's statisticians determined that the retailer made a great deal of money from three main
life-event situations.
Marriage, when people tend to buy many new products
Divorce, when people buy new products and change their spending habits
Pregnancy, when people have many new things to buy and have an urgency to buy them.
The analysis helped Target manage its inventory, knowing that there would be demand for
specific products and that the demand would likely vary by month over the coming nine- to
ten-month cycles.
IT infrastructure: The MapReduce paradigm is an ideal technical framework for many Big Data
projects, which rely on large data sets with unconventional data structures.
One of the main benefits of Hadoop is that it employs a distributed file system, meaning it can
use a distributed cluster of servers and commodity hardware to process large amounts of data. A
single-machine sketch of the MapReduce idea follows.
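
A minimal in-process illustration of the MapReduce paradigm mentioned above: a word count
expressed as map, shuffle, and reduce phases. Hadoop distributes these phases across a cluster;
this single-machine sketch only shows the shape of the computation, and the documents are
invented.

from collections import defaultdict

documents = ["big data needs big tools", "data tools for big data"]

# Map phase: each document emits (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key, as the framework would between nodes.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)   # {'big': 3, 'data': 3, 'needs': 1, 'tools': 2, 'for': 1}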
Some of the most common examples of Hadoop implementations are in the social media space,
where Hadoop can manage transactions, give textual updates, and develop social graphs among
millions of users.
Twitter and Facebook generate massive amounts of unstructured data and use Hadoop and its
ecosystem of tools to manage this high volume.
Social media: It represents a tremendous opportunity to leverage social and professional
interactions to derive new insights.
LinkedIn represents a company in which data itself is the product. Early on, LinkedIn founder
Reid Hoffman saw the opportunity to create a social network for working professionals.
As of 2014, LinkedIn has more than 250 million user accounts and has added many additional
features and data-related products, such as recruiting, job seeker tools, advertising, and InMaps,
which show a social graph of a user's professional network.
DEFINITION OF BIG DATA:
Big data is high-volume, high-velocity, and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and decision making.
Big data refers to datasets whose size is typically beyond the storage capacity of, and too
complex for, traditional database software tools.
Big data is anything beyond the human and technical infrastructure needed to support storage,
processing, and analysis.
It is data that is big in volume, velocity, and variety. Refer to Figure 1.3.
Variety: Data can be structured data, semi-structured data, or unstructured data. Data stored in a
database is an example of structured data. HTML data, XML data, email data, and CSV files are
examples of semi-structured data. PowerPoint presentations, images, videos, research reports,
white papers, the body of an email, etc. are examples of unstructured data.
Velocity: Velocity essentially refers to the speed at which data is being created in real time. We
have moved from simple desktop applications like payroll applications to real-time processing
applications.
Volume: Volume can be in terabytes or petabytes or zettabytes.
Gartner Glossary Big data is high-volume, high-velocity and/or high-variety information assets
that demand cost-effective, innovative forms of information processing that enable enhanced
insight and decision making.
For the sake of easy comprehension, we will look at the definition in three parts. Refer Figure
1.4.
Part I of the definition: "Big data is high-volume, high-velocity, and high-variety information
assets" talks about voluminous data (humongous data) that may have great variety (a good mix
of structured, semi-structured, and unstructured data) and will require a good speed/pace for
storage, preparation, processing, and analysis.
Part II of the definition: "cost effective, innovative forms of information processing" talks about
embracing new techniques and technologies to capture (ingest), store, process, persist, integrate
and visualize the high-volume, high-velocity, and high-variety data.
Part III of the definition: "enhanced insight and decision making" talks about deriving deeper,
richer and meaningful insights and then using these insights to make faster and better decisions
to gain business value and thus a competitive edge.


Data → Information → Actionable intelligence → Better decisions → Enhanced business value

CHARACTERISTICS OF BIG DATA


Back in 2001, Gartner analyst Doug Laney listed the 3 V's of Big Data: Variety, Velocity, and
Volume. Let's discuss these characteristics of big data.
These characteristics, even in isolation, are enough to indicate what big data is. Let's look at
them in depth:
a) Variety
Variety of Big Data refers to structured, unstructured, and semi-structured data that is gathered
from multiple sources. While in the past data could only be collected from spreadsheets and
databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio,
social media posts, and so much more. Variety is one of the important characteristics of big data.

b) Velocity
Velocity essentially refers to the speed at which data is being created in real time. In a broader
perspective, it comprises the rate of change, the linking of incoming data sets at varying speeds,
and activity bursts.
c) Volume
Volume is one of the characteristics of big data. We already know that Big Data indicates huge
'volumes' of data being generated on a daily basis from various sources like social media
platforms, business processes, machines, networks, human interactions, etc. Such large amounts
of data are stored in data warehouses. This brings us to the end of the characteristics of big data.
NEED OF BIG DATA
Why Big data?
The more data we have for analysis, the greater will be the analytical accuracy and the greater
will be the confidence in our decisions based on these analytical findings. The analytical
accuracy will lead to a greater positive impact in terms of enhancing operational efficiencies,
reducing cost and time, and originating new products and new services while optimizing
existing services. Refer to Figure 1.6.

Why is Big Data Important?


The importance of big data does not revolve around how much data a company has but how a
company utilizes the collected data. Every company uses data in its own way; the more
efficiently
a company uses its data, the more potential it has to grow. The company can take data from any
source and analyze it to find answers which will enable:
1. Cost Savings: Some tools of Big Data like Hadoop and Cloud-Based Analytics can bring cost
advantages to business when large amounts of data are to be stored and these tools also help in
identifying more efficient ways of doing business.
2. Time Reductions: The high speed of tools like Hadoop and in-memory analytics can easily
identify new sources of data, which helps businesses analyze data immediately and make quick
decisions based on the learning.


3. Understand the market conditions: By analyzing big data you can get a better understanding
of current market conditions. For example, by analyzing customers’ purchasing behaviors, a
company can find out the products that are sold the most and produce products according to this
trend. By this, it can get ahead of its competitors.
4. Control online reputation: Big data tools can do sentiment analysis. Therefore, you can get
feedback about who is saying what about your company. If you want to monitor and improve the
online presence of your business, then, big data tools can help in all this.
5. Using Big Data Analytics to Boost Customer Acquisition and Retention
The customer is the most important asset any business depends on. There is no single business
that can claim success without first having to establish a solid customer base. However, even
with a customer base, a business cannot afford to disregard the high competition it faces. If a
business is slow to learn what customers are looking for, then it is very easy to begin offering
poor quality products. In the end, loss of clientele will result, and this creates an adverse overall
effect on business success. The use of big data allows businesses to observe various customer
related patterns and trends. Observing customer behavior is important to trigger loyalty.
6. Using Big Data Analytics to Solve Advertisers' Problems and Offer Marketing Insights
Big data analytics can help change all business operations. This includes the ability to match
customer expectations, change the company's product line, and, of course, ensure that the
marketing campaigns are powerful.
7. Big Data Analytics As a Driver of Innovations and Product Development
Another huge advantage of big data is the ability to help companies innovate and redevelop their
products.
CHALLENGES WITH BIG DATA
Refer to Figure 1.5 (high-volume, high-velocity, high-variety information assets; cost-effective,
innovative forms of information processing; enhanced insight and decision making). Following
are a few challenges with big data:

Data volume: Data today is growing at an exponential rate. This high tide of data will continue
to rise. The key questions are:
"Will all this data be useful for analysis?",
"Do we work with all this data or a subset of it?",
"How will we separate the knowledge from the noise?", etc.
Storage: Cloud computing is the answer to managing infrastructure for big data as far as
cost-efficiency, elasticity, and easy upgrading/downgrading are concerned. This further
complicates the decision to host big data solutions outside the enterprise.
Data retention: How long should one retain this data? Some data may be required for long-term
decisions, but some data may quickly become irrelevant and obsolete.
Skilled professionals: In order to develop, manage and run those applications that generate
insights, organizations need professionals who possess a high-level proficiency in data sciences.
Other challenges: Other challenges of big data are with respect to capture, storage, search,
analysis, transfer and security of big data.
Visualization: Big data refers to datasets whose size is typically beyond the storage capacity of
traditional database software tools. There is no explicit definition of how big the data set should
be for it to be considered big data. Data visualization (computer graphics) is becoming popular
as a separate discipline, and there are very few data visualization experts.
DIFFERENCE BETWEEN TRADITIONAL DATA AND BIG DATA
1. Traditional data: Traditional data is the structured data that is majorly maintained by all
types of businesses, from very small to big organizations. In a traditional database system, a
centralized database architecture is used to store and maintain the data in a fixed format or fields
in a file. For managing and accessing the data, Structured Query Language (SQL) is used.
2. Big data: We can consider big data an upper version of traditional data. Big data deals with
data sets too large or complex to manage in traditional data-processing application software. It
deals with large volumes of structured, semi-structured, and unstructured data. Volume,
Velocity, Variety, Veracity, and Value refer to the 5 V's characteristics of big data. Big data not
only refers to a large amount of data; it refers to extracting meaningful insight by analyzing
huge amounts of complex data sets.
The difference between Traditional data and Big data is as follows:

Traditional Data | Big Data
Traditional data is generated at the enterprise level. | Big data is generated outside the enterprise level.
Its volume ranges from gigabytes to terabytes. | Its volume ranges from petabytes to zettabytes or exabytes.
Traditional database systems deal with structured data. | Big data systems deal with structured, semi-structured, and unstructured data.
Traditional data is generated per hour or per day or more. | Big data is generated more frequently, mainly per second.
The traditional data source is centralized, and it is managed in centralized form. | The big data source is distributed, and it is managed in distributed form.
Data integration is very easy. | Data integration is very difficult.
A normal system configuration is capable of processing traditional data. | A high system configuration is required to process big data.
The size of the data is very small. | The size is larger than the traditional data size.
Traditional database tools are required to perform any database operation. | Special kinds of database tools are required to perform any database or schema-based operation.
Normal functions can manipulate data. | Special kinds of functions can manipulate data.
Its data model is strict schema based, and it is static. | Its data model is a flat schema based, and it is dynamic.
Traditional data is stable, with known interrelationships. | Big data is not stable, with unknown relationships.
Traditional data is in manageable volume. | Big data is in huge volume which becomes unmanageable.
It is easy to manage and manipulate the data. | It is difficult to manage and manipulate the data.
Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc. | Its data sources include social media, device data, sensor data, video, images, audio, etc.

Big Data vs Data Warehouse


Big Data has become the reality of doing business for organizations today. There is a boom in
the amount of structured as well as raw data that floods every organization daily. If this data is
managed well, it can lead to powerful insights and quality decision making.
Big data analytics is the process of examining large data sets containing a variety of data types
to discover knowledge in databases, identify interesting patterns, and establish relationships to
solve problems and reveal market trends, customer preferences, and other useful information.
Companies and businesses that implement Big Data Analytics often reap several business
benefits. Companies implement Big Data Analytics because they want to make more informed
business decisions.
A data warehouse (DW) is a collection of corporate information and data derived from
operational systems and external data sources. A data warehouse is designed to support business
decisions by allowing data consolidation, analysis, and reporting at different aggregate levels.
Data is populated into the data warehouse through the processes of extraction, transformation,
and loading (ETL tools). Data analysis tools, such as business intelligence software, access the
data within the warehouse. A compact sketch of the ETL flow is given below.
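
A compact sketch of the extract-transform-load flow described above, with a CSV string
standing in for an operational source and an in-memory SQLite table standing in for the
warehouse; both stand-ins and the sales figures are invented for illustration.

import csv, io, sqlite3

# Extract: read rows from an operational source (a CSV string stands in here).
source = "region,amount\nNorth,100\nSouth,250\nNorth,50\n"
rows = list(csv.DictReader(io.StringIO(source)))

# Transform: cast types and aggregate to the reporting grain (total per region).
totals = {}
for r in rows:
    totals[r["region"]] = totals.get(r["region"], 0) + int(r["amount"])

# Load: populate the warehouse table that BI tools would later query.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales_by_region (region TEXT, total INTEGER)")
dw.executemany("INSERT INTO sales_by_region VALUES (?, ?)", totals.items())
print(dw.execute("SELECT * FROM sales_by_region").fetchall())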
