UNIT I notes
INTRODUCTION
Big data is the term for a collection of data sets so large and complex that it becomes difficult to
process using on-hand database management tools or traditional data processing applications.
• The challenges include capture, curation, storage, search, sharing, transfer, analysis, and
visualization.
• The trend to larger data sets is due to the additional information derivable from analysis of a
single large set of related data, as compared to separate smaller sets with the same total amount
of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."
Volume (Scale)
• Data volume is growing at an exponential rate; it is now measured in terabytes, petabytes and even zettabytes
Variety (Complexity)
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
• Social Network, Semantic Web (RDF), …
• Streaming Data
• You can only scan the data once
• A single application can be generating/collecting many types of data
• Big Public Data (online, weather, finance, etc.)
Velocity (Speed)
• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions mean missing opportunities
• Examples
• E-Promotions: Based on your current location, your purchase history and what you like, send promotions right now for the store next to you
• Healthcare monitoring: sensors monitoring your activities and body; any abnormal measurement requires an immediate reaction
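To make the velocity idea concrete, here is a minimal Python sketch (not part of the original notes) of the healthcare-monitoring example above: a hypothetical stream of heart-rate readings is processed the moment each reading arrives, and any abnormal value triggers an immediate reaction. The reading range and thresholds are illustrative assumptions.

import random
import time

def sensor_stream(n_readings=10):
    # Hypothetical stand-in for a continuous real-time feed of heart-rate readings.
    for _ in range(n_readings):
        yield random.randint(50, 140)   # beats per minute
        time.sleep(0.1)                 # readings keep arriving while we process

def monitor(stream, low=60, high=120):
    # Process each reading as it arrives; a late decision would miss the event.
    for bpm in stream:
        if bpm < low or bpm > high:
            print(f"ALERT: abnormal reading {bpm} bpm - immediate reaction required")
        else:
            print(f"normal reading: {bpm} bpm")

monitor(sensor_stream())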
Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Mobile devices (tracking all objects all the time)
Sensor technology and networks (measuring all kinds of data)
According to Gartner's definition, "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."
This definition clearly answers the “What is Big Data?” question – Big Data refers to complex
and large data sets that have to be processed and analyzed to uncover valuable information that
can benefit businesses and organizations.
However, there are certain basic tenets of Big Data that make it even simpler to answer what Big Data is:
It refers to a massive amount of data that keeps on growing exponentially with time.
It is so voluminous that it cannot be processed or analyzed using conventional data
processing techniques.
It includes data mining, data storage, data analysis, data sharing, and data visualization.
The term is an all-comprehensive one, including the data itself and the data frameworks, along with the tools and techniques used to process and analyze the data.
CHARACTERISTICS OF DATA:
As depicted in Figure 1.2, data has three key characteristics:
1. Composition: The composition of data deals with the structure of data, that is, the sources of data, the granularity, the types, and the nature of data as to whether it is static or real-time streaming.
2. Condition: The condition of data deals with the state of data, that is, "Can one use this data as is for analysis?" or "Does it require cleansing for further enhancement and enrichment?"
3. Context: The context of data deals with "Where has this data been generated?", "Why was this data generated?", "How sensitive is this data?", "What are the events associated with this data?" and so on.
Small data (data as it existed prior to the big data revolution) is about certainty. It is about known data sources; it is about no major changes to the composition or context of data. Most often we have answers to queries like why this data was generated, where and when it was generated, exactly how we would like to use it, what questions this data will be able to answer, and so on. Big data is about complexity: complexity in terms of multiple and unknown datasets, in terms of exploding volume, in terms of the speed at which the data is being generated and the speed at which it needs to be processed, and in terms of the variety of data (internal or external, behavioural or social) that is being generated.
TYPES OF DIGITAL DATA
Types of Big Data
Now that we are on track with what is big data, let’s have a look at the types of big data:
a) Structured
Structured data is one of the types of big data. By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that
can be readily and seamlessly stored and accessed from a database by simple search engine
algorithms. For instance, the employee table in a company database will be structured, as the employee details, their job positions, their salaries, etc. will be present in an organized manner.
b) Unstructured
Unstructured data refers to the data that lacks any specific form or structure whatsoever. This
makes it very difficult and time-consuming to process and analyze unstructured data. Email is an
example of unstructured data. Structured and unstructured are two important types of big data.
c) Semi-structured
Semi-structured is the third type of big data. Semi-structured data pertains to data containing both of the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that, although it has not been classified under a particular repository (database), contains vital information or tags that segregate individual elements within the data. This brings us to the end of the types of data.
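As a quick illustration (not part of the original notes), the same employee information can be held in all three forms; the names and values below are made up.

import json

# Structured: fixed fields, ready to sit in a database table row.
employee_row = {"emp_id": 101, "name": "Asha", "job": "Analyst", "salary": 55000}

# Semi-structured: keys (tags) give some structure, but fields can vary per record.
employee_json = json.loads('{"emp_id": 102, "name": "Ravi", "skills": ["SQL", "Hadoop"]}')

# Unstructured: free text such as the body of an email, with no fixed schema.
employee_email = "Hi team, Ravi joined today as an analyst and will look after the Hadoop cluster."

print(employee_row["salary"], employee_json.get("skills"), len(employee_email.split()))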
Unstructured data:
This is the data which does not conform to a data model or is not in a form which can be used
easily by a computer program.
About 80-90% of an organization's data is in this form; for example, memos, chat rooms, PowerPoint presentations, images, videos, letters, research papers, white papers, the body of an email, etc.
Semi-structured data: This is the data which does not conform to a data model but has some
structure.
However, it is not in a form which can be used easily by a computer program;
for example, emails, XML, markup languages like HTML, etc. Metadata for this data is available but is not sufficient.
Structured data: This is the data which is in an organized form (e.g., in rows and columns)
and can be easily used by a computer program. Relationships exist between entities of data, such as classes and their objects. Data stored in databases is an example of structured data.
Approximate Percentage Distribution of Digital Data
Structured Data
This is the data which is in an organized form (e.g., in rows and columns) and can be easily
used by a computer program.
Relationships exist between entities of data, such as classes and their objects.
Data stored in databases is an example of structured data.
Sources of Structured Data
If the data is highly structured, one can look at leveraging any of the available RDBMSs [Oracle Corp. — Oracle, IBM — DB2, Microsoft — Microsoft SQL Server, EMC — Greenplum, Teradata — Teradata, MySQL (open source), PostgreSQL (advanced open source), etc.] to house it.
These databases are typically used to hold transaction/operational data generated and collected
by day-to-day business activities. In other words, the data of the On-Line Transaction Processing
(OLTP) systems are generally quite structured.
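As a small, hedged sketch of how such structured/OLTP-style data looks in practice, the snippet below uses SQLite from the Python standard library as a stand-in for the RDBMS products listed above; the employee table and its values are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")          # in-memory database, enough for the sketch
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, job TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee (emp_id, name, job, salary) VALUES (?, ?, ?, ?)",
    [(1, "Asha", "Analyst", 55000.0), (2, "Ravi", "Developer", 62000.0)],
)
conn.commit()

# Because the format is fixed, simple SQL retrieves exactly the organized fields we want.
for name, salary in conn.execute("SELECT name, salary FROM employee WHERE salary > 50000 ORDER BY name"):
    print(name, salary)
conn.close()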
Relational (OLTP) databases guarantee the ACID properties of transactions, and durability is one of them.
Durability: All changes made to the database during a transaction are permanent, and that accounts for the durability of the transaction.
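A brief sketch of this transactional behaviour, again using SQLite from the Python standard library (the account table and amounts are made up): changes become permanent only when the transaction commits, and an error rolls the whole transaction back.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (acc_id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 1000.0), (2, 500.0)])
conn.commit()                                # once committed, the rows are durable

try:
    with conn:                               # one transaction: commit on success, rollback on error
        conn.execute("UPDATE account SET balance = balance - 200 WHERE acc_id = 1")
        conn.execute("UPDATE account SET balance = balance + 200 WHERE acc_id = 2")
except sqlite3.Error:
    print("transfer failed; nothing was changed")

print(list(conn.execute("SELECT * FROM account")))   # both updates persisted together
conn.close()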
Semi-structured Data
This is the data which does not conform to a data model but has some structure.
However, it is not in a form which can be used easily by a computer program.
Examples: emails, XML, markup languages like HTML, etc. Metadata for this data is available but is not sufficient.
It has the following features:
It does not conform to the data models that one typically associates with relational databases
or any other form of data tables.
It uses tags to segregate semantic elements.
Tags are also used to enforce hierarchies of records and fields within data.
There is no separation between the data and the schema.
The amount of structure used is dictated by the purpose at hand.
In semi-structured data, entities belonging to the same class and grouped together need not necessarily have the same set of attributes. And even if they do have the same set of attributes, the order of the attributes may not be the same, which for all practical purposes is not important either.
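The short JSON example below (an illustration, not from the notes) shows these features: both records describe customers, yet they do not share the same set of attributes, and the keys (tags) carry the structure instead of a fixed schema.

import json

records = json.loads("""
[
  {"name": "Asha", "email": "asha@example.com", "city": "Pune"},
  {"name": "Ravi", "phone": "0000000000", "interests": ["cricket", "music"]}
]
""")

for rec in records:
    # Entities of the same class, yet with different attributes and orderings.
    print(rec["name"], "->", sorted(rec.keys()))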
Sources of Semi-structured Data
Amongst the sources for semi-structured data, the front runners are XML and JSON.
XML: eXtensible Markup Language (XML) is hugely popularized by web services developed
utilizing the Simple Object Access Protocol (SOAP) principles.
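As a small illustrative sketch (the XML document itself is made up), the standard-library parser below shows how tags segregate the semantic elements and enforce the hierarchy of records and fields.

import xml.etree.ElementTree as ET

doc = """
<employees>
  <employee id="1"><name>Asha</name><job>Analyst</job></employee>
  <employee id="2"><name>Ravi</name></employee>
</employees>
"""

root = ET.fromstring(doc)
for emp in root.findall("employee"):
    # The tags mark out each field; missing fields are simply absent, not NULL columns.
    print(emp.get("id"), emp.findtext("name"), emp.findtext("job", default="(not given)"))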
An example of this is the U.S. retailer Target. After analyzing consumer purchasing behavior,
Target's statisticians determined that the retailer made a great deal of money from three main
life-event situations.
Marriage, when people tend to buy many new products
Divorce, when people buy new products and change their spending habits
Pregnancy, when people have many new things to buy and an urgency to buy them.
This analysis helped Target to manage its inventory, knowing that there would be demand for specific products and that it would likely vary by month over the coming nine- to ten-month cycles.
IT infrastructure: The MapReduce paradigm is an ideal technical framework for many Big Data projects, which rely on large data sets with unconventional data structures.
One of the main benefits of Hadoop is that it employs a distributed file system, meaning it can
use a distributed cluster of servers and commodity hardware to process large amounts of data.
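To show the idea rather than Hadoop itself, here is a toy, single-machine word-count sketch of the MapReduce pattern (map, then shuffle/group, then reduce); the input lines are made up, and a real Hadoop job would distribute these phases across the cluster.

from collections import defaultdict

def map_phase(line):
    # map: emit (key, value) pairs, here (word, 1)
    for word in line.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    # reduce: combine every value emitted for one key
    return word, sum(counts)

lines = ["big data needs new tools", "hadoop processes big data in parallel"]

# shuffle/group: gather all values emitted for the same key
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

result = dict(reduce_phase(word, counts) for word, counts in grouped.items())
print(result["big"], result["data"])    # -> 2 2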
Some of the most common examples of Hadoop implementations are in the social media space,
where Hadoop can manage transactions, give textual updates, and develop social graphs among
millions of users.
Twitter and Facebook generate massive amounts of unstructured data and use Hadoop and its
ecosystem of tools to manage this high volume.
Social media: It represents a tremendous opportunity to leverage social and professional interactions to derive new insights.
LinkedIn represents a company in which data itself is the product. Early on, LinkedIn founder Reid Hoffman saw the opportunity to create a social network for working professionals.
As of 2014, LinkedIn has more than 250 million user accounts and has added many additional features and data-related products, such as recruiting, job seeker tools, advertising, and InMaps, which show a social graph of a user's professional network.
DEFINITION OF BIG DATA:
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Big data refers to datasets whose size is typically beyond the storage capacity of, and too complex for, traditional database software tools.
Big data is anything beyond the human & technical infrastructure needed to support storage,
processing and analysis.
It is data that is big in volume, velocity and variety. Refer to Figure 1.3.
Variety: Data can be structured data, semi-structured data or unstructured data. Data stored in a database is an example of structured data. HTML data, XML data, email data and CSV files are examples of semi-structured data. PowerPoint presentations, images, videos, research papers, white papers, the body of an email, etc. are examples of unstructured data.
Velocity: Velocity essentially refers to the speed at which data is being created in real time. We have moved from simple desktop applications like payroll applications to real-time processing applications.
Volume: Volume can be in Terabytes or Petabytes or Zettabytes.
Gartner Glossary Big data is high-volume, high-velocity and/or high-variety information assets
that demand cost-effective, innovative forms of information processing that enable enhanced
insight and decision making.
For the sake of easy comprehension, we will look at the definition in three parts. Refer Figure
1.4.
Part I of the definition: "Big data is high-volume, high-velocity, and high-variety information
assets" talks about voluminous data (humongous data) that may have great variety (a good mix
of structured, semi-structured, and unstructured data) and will require a good speed/pace for storage, preparation, processing and analysis.
Part II of the definition: "cost effective, innovative forms of information processing" talks about
embracing new techniques and technologies to capture (ingest), store, process, persist, integrate
and visualize the high-volume, high-velocity, and high-variety data.
Part III of the definition: "enhanced insight and decision making" talks about deriving deeper,
richer and meaningful insights and then using these insights to make faster and better decisions
to gain business value and thus a competitive edge.
Data → Information → Actionable intelligence → Better decisions → Enhanced business value
a) Variety
While data once came mostly from spreadsheets and databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio, social media posts, and much more. Variety is one of the important characteristics of big data.
b) Velocity
Velocity essentially refers to the speed at which data is being created in real time. In a broader perspective, it comprises the rate of change, the linking of incoming data sets arriving at varying speeds, and activity bursts.
c) Volume
Volume is one of the characteristics of big data. We already know that Big Data indicates huge 'volumes' of data being generated on a daily basis from various sources such as social media platforms, business processes, machines, networks, human interactions, etc. Such large amounts of data are stored in data warehouses. This brings us to the end of the characteristics of big data.
NEED OF BIG DATA
Why Big data?
The more data we have for analysis, the greater will be the analytical accuracy and the greater will be the confidence in our decisions based on these analytical findings. Greater analytical accuracy will lead to a greater positive impact in terms of enhancing operational efficiencies, reducing cost and time, originating new products and new services, and optimizing existing services. Refer to Figure 1.6.
3. Understand the market conditions: By analyzing big data you can get a better understanding
of current market conditions. For example, by analyzing customers’ purchasing behaviors, a
company can find out the products that are sold the most and produce products according to this
trend. By this, it can get ahead of its competitors.
4. Control online reputation: Big data tools can do sentiment analysis. Therefore, you can get feedback about who is saying what about your company. If you want to monitor and improve the online presence of your business, big data tools can help with all of this (a simple sentiment-scoring sketch is given after this list).
5. Using Big Data Analytics to Boost Customer Acquisition and Retention
The customer is the most important asset any business depends on. There is no single business
that can claim success without first having to establish a solid customer base. However, even
with a customer base, a business cannot afford to disregard the high competition it faces. If a
business is slow to learn what customers are looking for, then it is very easy to begin offering
poor quality products. In the end, loss of clientele will result, and this creates an adverse overall
effect on business success. The use of big data allows businesses to observe various customer-related patterns and trends. Observing customer behavior is important to trigger loyalty.
6. Using Big Data Analytics to Solve Advertisers' Problems and Offer Marketing Insights
Big data analytics can help change all business operations. This includes the ability to match customer expectations, change the company's product line and, of course, ensure that the marketing campaigns are powerful.
7. Big Data Analytics As a Driver of Innovations and Product Development
Another huge advantage of big data is the ability to help companies innovate and redevelop their
products.
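As promised under point 4 above, here is a deliberately naive, keyword-based sentiment sketch; real big data tools use far more sophisticated models, and the word lists and posts below are made up purely for illustration.

POSITIVE = {"good", "great", "love", "excellent", "fast"}
NEGATIVE = {"bad", "poor", "hate", "slow", "broken"}

def sentiment(post):
    # Score a post by counting positive and negative keywords it contains.
    words = set(post.lower().replace(",", " ").split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

posts = [
    "Love the new delivery service, really fast",
    "Support was slow and the app is broken",
]
for post in posts:
    print(sentiment(post), "-", post)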
CHALLENGES WITH BIG DATA
Refer to Figure 1.5. Following are a few challenges with big data:
Data volume: Data today is growing at an exponential rate, and this high tide of data will continue to rise. The key questions are:
"Will all this data be useful for analysis?"
"Do we work with all of this data or only a subset of it?"
"How will we separate the knowledge from the noise?", etc.
Storage: Cloud computing is the answer to managing infrastructure for big data as far as cost-efficiency, elasticity and easy upgrading/downgrading are concerned. This further complicates the decision to host big data solutions outside the enterprise.
Data retention: How long should one retain this data? Some data may be required for long-term decisions, but some data may quickly become irrelevant and obsolete.
Skilled professionals: In order to develop, manage and run those applications that generate
insights, organizations need professionals who possess a high-level proficiency in data sciences.
Other challenges: Other challenges of big data are with respect to capture, storage, search,
analysis, transfer and security of big data.
Visualization: Big data refers to datasets whose size is typically beyond the storage capacity of traditional database software tools. There is no explicit definition of how big the data set should be for it to be considered big data. Data visualization (computer graphics) is becoming popular as a separate discipline, and there are very few data visualization experts.
DIFFERENCE BETWEEN TRADITIONAL DATA AND BIG DATA
1. Traditional data: Traditional data is the structured data that is majorly maintained by all types of businesses, from very small to big organizations. In a traditional database system, a centralized database architecture is used to store and maintain the data in a fixed format or fields in a file. For managing and accessing the data, Structured Query Language (SQL) is used.
2. Big data: We can consider big data an upper version of traditional data. Big data deals with data sets that are too large or complex to manage in traditional data-processing application software. It deals with large volumes of structured, semi-structured and unstructured data. Volume, Velocity, Variety, Veracity and Value are the 5 V characteristics of big data. Big data not only refers to a large amount of data; it also refers to extracting meaningful information by analyzing huge amounts of complex data sets.
The difference between traditional data and big data is as follows:
Traditional data is generated per hour or per day or more, whereas big data is generated far more frequently, mainly per second.
The traditional data source is centralized and it is managed in a centralized form, whereas the big data source is distributed and it is managed in a distributed form.
The traditional data model is a strict, schema-based and static one, whereas the big data model is a flat schema and dynamic.
Traditional data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc., whereas big data sources include social media, device data, sensor data, video, images, audio, etc.