WELCOME TO THE BIG DATA ERA!
Big Data Introduction
AGENDA
• DATA and Big Data
• Facts About Big Data
• Big Data Source
• Big Data Case Study
• Characteristics Of Big Data
• Big Data System Requirement
• Data processing Using Traditional System
DATA AND BIG DATA
Anything that is raw and unprocessed is called data.
DATA >> processed >> INFORMATION
Example: a mobile phone has limited storage and RAM. If 10 TB of data arrives, the existing infrastructure cannot process it.
Any data that is beyond the processing capacity of the existing system is called Big Data.
WHAT IS BIG DATA?
Big data refers to the large volume of structured and unstructured data. The
analysis of big data leads to better insights for business.
FACTS ABOUT BIG DATA
• Personalized Marketing
• Recommendation Engines
• Sentiment Analysis
• Personalized Marketing Based on Consumer Behavior
• Genome Sequencing
• Sensors Data
PERSONALIZED MARKETING
By collecting data, creating target personas, anticipating the future needs of customers, and understanding how they have interacted with you, personalized marketing allows your business to craft and deliver messages that are relatable and relevant to the target audience.
RECOMMENDATION ENGINES
A recommendation engine is a data-filtering tool that uses machine learning algorithms to recommend the most relevant items to a particular user or customer.
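As an illustrative sketch only (not how any production engine works), a minimal item-based recommender can score candidate items by how often they co-occur with items a user has already consumed. All data and function names below are invented for the example:

```python
from collections import Counter
from itertools import permutations

def recommend(histories, user_items, k=2):
    """Recommend up to k items that co-occur most often with the user's items."""
    cooc = Counter()  # (item_a, item_b) -> how often the pair appears together
    for history in histories:
        for a, b in permutations(set(history), 2):
            cooc[(a, b)] += 1
    scores = Counter()
    for liked in user_items:
        for (a, b), n in cooc.items():
            if a == liked and b not in user_items:
                scores[b] += n
    return [item for item, _ in scores.most_common(k)]

# Hypothetical viewing histories
histories = [["A", "B", "C"], ["A", "B"], ["B", "C"], ["A", "C"]]
print(recommend(histories, {"A"}))  # items most often watched alongside A
```

Real recommendation engines use far richer signals (ratings, embeddings, context), but the core idea of scoring items by similarity to past behavior is the same.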
SENTIMENT ANALYSIS
• Sentiment analysis studies the subjective information in an expression, that is, the opinions, appraisals, emotions, or attitudes towards a topic, person, or entity. Expressions can be classified as positive, negative, or neutral.
• Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.
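To make the positive/negative/neutral classification concrete, here is a deliberately naive lexicon-based scorer in Python. The word lists are invented for the sketch; real sentiment systems use trained NLP models, not a tiny hand-made lexicon:

```python
# Toy sentiment lexicons (illustrative only)
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}

def sentiment(text):
    """Classify text by counting positive vs. negative lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great show"))    # positive
print(sentiment("I hate this bad ending"))    # negative
```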
PERSONALIZED MARKETING BASED ON CONSUMER BEHAVIOR
GENOME SEQUENCING
Genome sequencing is a laboratory method used to determine the entire genetic makeup of a specific organism or cell type. This method can be used to find changes in areas of the genome. These changes may help scientists understand how specific diseases, such as cancer, form.
SENSOR DATA
BIG DATA SOURCE
Machine-Generated Data
Big Data Generated By People
Organization-Generated Data
MACHINE-GENERATED DATA
BIG DATA GENERATED BY PEOPLE
ORGANIZATION-GENERATED DATA
DATA IN ZETTABYTES
BIG DATA: CASE STUDY
•Netflix is one of the largest providers of commercial streaming video in
the US with a customer base of over 29 million.
•It receives a huge volume of behavioral data.
• When do users watch a show?
• Where do they watch it?
• On which device do they watch the show?
• How often do they pause a program?
• How often do they re-watch a program?
• Do they skip the credits?
• What are the keywords searched?
BIG DATA: CASE STUDY
• Traditionally, the analysis of such data was done using a computer algorithm designed to produce a correct solution for any given instance.
• As the data started to grow, groups of networked computers were employed to do the analysis; such setups are known as distributed systems.
CHARACTERISTICS OF BIG DATA
Big Data is commonly characterized by its "Vs": Volume, Velocity, Variety, and Veracity.
DATA PROCESSING USING TRADITIONAL SYSTEMS
AGENDA
• File System, DFS
• HDFS and its importance
• Data Processing Using Hadoop
• History of Hadoop
• Traditional Database Systems vs. Hadoop
FILE SYSTEM
DISTRIBUTED SYSTEMS
A distributed system is a model in which components located on networked
computers communicate and coordinate their actions by passing messages.
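A minimal sketch of this definition, using nothing beyond the Python standard library: two components that share no state and coordinate purely by passing messages through queues (threads stand in for networked machines in this toy):

```python
import threading
import queue

def worker(inbox, outbox):
    """A 'node' that acts only on messages it receives."""
    while True:
        msg = inbox.get()
        if msg is None:           # agreed-upon shutdown message
            break
        outbox.put(msg * 2)       # do some work and send the result back

inbox, outbox = queue.Queue(), queue.Queue()
node = threading.Thread(target=worker, args=(inbox, outbox))
node.start()

for n in [1, 2, 3]:
    inbox.put(n)                  # coordination happens only via messages
inbox.put(None)
node.join()

results = [outbox.get() for _ in range(3)]
print(results)                    # [2, 4, 6]
```

In a real distributed system, the queues would be network sockets or an RPC layer between separate machines, but the principle is the same: no shared memory, only messages.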
HOW DOES A DISTRIBUTED SYSTEM WORK?
CHALLENGES OF DISTRIBUTED SYSTEMS
• High chances of system failure
• Limited bandwidth
• High programming complexity
Hadoop is used to overcome these challenges!
HDFS FILE SYSTEM AND ITS IMPORTANCE
INTRODUCTION TO BIG DATA AND HADOOP
• What Is Hadoop?
Hadoop is a framework that allows distributed processing of large datasets across clusters of computers using simple programming models.
Doug Cutting created Hadoop and named it after his son's yellow toy elephant. It was inspired by technical papers published by Google.
CHARACTERISTICS OF HADOOP
• Scalable: Supports both horizontal and vertical scaling
• Reliable: Stores copies of the data on different machines and is resistant to hardware failure
• Flexible: Stores a lot of data and enables you to use it later
• Economical: Ordinary computers can be used for data processing
TRADITIONAL DATABASE SYSTEMS VS. HADOOP
• Traditional systems: Data is stored in a central location and sent to the processor at run time.
• Hadoop: The program goes to the data. Hadoop initially distributes the data to multiple systems and later runs the computation wherever the data is located.
• Traditional systems: Cannot be used to process and store large amounts of data (big data).
• Hadoop: Works better when the data size is big; it can process and store large amounts of data easily and effectively.
• Traditional systems: A traditional RDBMS manages only structured and semi-structured data; it cannot manage unstructured data.
• Hadoop: Can process and store a variety of data, whether structured or unstructured.
AGENDA
• Hadoop Introduction
• Importance of HDFS Architecture
• Hadoop Ecosystem
• HDFS Architecture
• Hadoop Setup
HADOOP CORE COMPONENTS
COMPONENTS OF HADOOP ECOSYSTEM
• HDFS is the storage layer of Hadoop, suitable for distributed storage and processing.
• It provides file permissions, authentication, and streaming access to file system data.
• HDFS can be accessed through the Hadoop command line interface.
COMPONENTS OF HADOOP ECOSYSTEM
• HBase is a NoSQL (non-relational) database that stores data in HDFS.
• It supports high volumes of data and high throughput.
• It is used when you need random, real-time read/write access to your big data.
COMPONENTS OF HADOOP ECOSYSTEM
• Sqoop is a tool designed to transfer data between Hadoop and relational database servers.
• It is used to import data from relational databases such as Oracle and MySQL to HDFS, and to export data from HDFS to relational databases.
COMPONENTS OF HADOOP ECOSYSTEM
• Flume is a distributed service for ingesting streaming data, suited for event data from multiple systems.
• It has a simple and flexible architecture based on streaming data flows.
• It is robust and fault tolerant and has tunable reliability mechanisms.
• It uses a simple, extensible data model.
COMPONENTS OF HADOOP ECOSYSTEM
• Spark is an open-source cluster computing framework that supports machine learning, business intelligence, streaming, and batch processing.
• Spark solves similar problems as Hadoop MapReduce does but has a fast in-memory approach and a clean functional-style API.
COMPONENTS OF HADOOP ECOSYSTEM
• Hadoop MapReduce is a framework that processes data. It is the original Hadoop processing engine, which is primarily Java-based.
• It is based on the map and reduce programming model.
• It has extensive and mature fault tolerance.
• Hive and Pig are built on the MapReduce model.
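The map and reduce model can be sketched in plain Python as a single-process toy (not Hadoop's Java API): a map step emits (word, 1) pairs, a shuffle groups them by key, and a reduce step sums each group — the classic word-count example.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the values for each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big", "data big"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2}
```

In Hadoop, the map and reduce phases run in parallel on different machines close to the data, and the framework handles the shuffle over the network; the logic per phase is the same.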
COMPONENTS OF HADOOP ECOSYSTEM
• Once the data is processed, it is analyzed using an open-source high-level dataflow system called Pig.
• Pig converts its scripts to Map and Reduce code, reducing the effort of writing complex MapReduce programs.
• Ad-hoc operations like Filter and Join, which are difficult to perform in MapReduce, can be done easily using Pig.
COMPONENTS OF HADOOP ECOSYSTEM
• Impala is an open-source high-performance SQL engine that runs on the Hadoop cluster.
• It is ideal for interactive analysis and has very low latency, which can be measured in milliseconds.
• Impala supports a dialect of SQL, so data in HDFS is modeled as a database table.
COMPONENTS OF HADOOP ECOSYSTEM
• Hive is an abstraction layer on top of Hadoop that executes queries using MapReduce.
• It is preferred for data processing, ETL (Extract, Transform, Load), and ad hoc queries.
COMPONENTS OF HADOOP ECOSYSTEM
• Cloudera Search is Cloudera's near-real-time access product that enables non-technical users to search and explore data stored in or ingested into Hadoop and HBase.
• Cloudera Search is a fully integrated data processing platform. It uses the flexible, scalable, and robust storage system included with CDH, or Cloudera's Distribution including Hadoop.
BIG DATA PROCESSING
LEARNING OBJECTIVES
• Discuss the Hadoop Distributed File System (HDFS)
• Explain HDFS architecture and components
• Describe YARN and its features
• Explain YARN architecture
WHY HDFS?
In the traditional system, storing and retrieving volumes of data had three major issues:
1. Cost: $10,000 to $14,000 per terabyte
2. Speed: Search and analysis are time-consuming
3. Reliability: Fetching data is difficult
WHY HDFS?
HDFS resolves all major issues of the traditional file system:
1. Cost: HDFS offers zero licensing and support costs.
2. Speed: Hadoop clusters can read/write terabytes of data per second.
3. Reliability: HDFS copies the data multiple times.
WHAT IS HDFS?
HDFS is a distributed file system that provides access to data across Hadoop
clusters.
It manages and supports analysis of very large volumes of Big Data.
CHARACTERISTICS OF HDFS
• HDFS has high fault-tolerance
• HDFS has high throughput
• HDFS is economical
HDFS STORAGE
• HDFS stores files as a number of blocks.
• Data is divided into blocks of 128 MB each (by default).
• Each block is replicated to a few separate computers.
• Metadata keeps information about the blocks and their replication; it is stored in the NameNode.
[Diagram: a very large data file split into blocks B1–B4, distributed and replicated across DataNodes A–E, with the NameNode holding the metadata]
HDFS ARCHITECTURE AND COMPONENTS
HDFS ARCHITECTURE
It is also known as the master and slave architecture.
• The NameNode (master) is responsible for accepting jobs from clients.
• The NameNode's metadata stores the block locations and their replication.
• A file is split into one or more blocks, which are stored and replicated in the slave nodes (DataNodes).
• Data required for an operation is loaded and segregated into chunks of data blocks.
[Diagram: the NameNode (master) maintains the Edit Log and FsImage together with a Secondary NameNode; its file system metadata maps blocks to DataNodes (e.g., DN1: A,C); DataNodes 1–N (slaves) hold the blocks]
HDFS COMPONENTS
The main components of HDFS:
• NameNode
• Secondary NameNode
• File system
• Metadata
• DataNode
HDFS COMPONENTS
NAMENODE
The NameNode server is the core component of an HDFS cluster.
It maintains and executes file system namespace operations such as opening, closing, and renaming of files and directories that are present in HDFS.
The NameNode is a single point of failure.
[Diagram: the NameNode's file system metadata maps blocks to DataNodes 1–N]
HDFS COMPONENTS
NAMENODE: OPERATION
The NameNode maintains two persistent files:
• A transaction log called the Edit Log
• A namespace image called the FsImage
[Diagram: at startup, the NameNode retrieves the Edit Log and updates the FsImage with the Edit Log information]
HDFS COMPONENTS
SECONDARY NAMENODE
The Secondary NameNode server is responsible for maintaining a backup of the NameNode server. It maintains the Edit Log and namespace image information in sync with the NameNode server.
There can be only one Secondary NameNode server in a cluster. It cannot be treated as a disaster recovery server (it only partially restores the NameNode server in case of failure).
[Diagram: the master NameNode and Secondary NameNode maintain the Edit Log and FsImage above the slave DataNodes of the HDFS cluster]
HDFS COMPONENTS
FILE SYSTEM
HDFS exposes a file system namespace and allows user data to be stored in files.
The file system supports operations such as create, remove, move, and rename.
[Diagram: a directory tree rooted at /, with subdirectories such as /Dir 1 and /Dir 1.1 and files such as File A and File B]
HDFS COMPONENTS
METADATA
HDFS metadata is the structure of HDFS directories and files in a tree.
It includes attributes of directories and files, such as ownership, permissions, quotas, and replication factor.
HDFS COMPONENTS
DATANODE
The DataNode is a multiple-instance server.
It is responsible for storing and maintaining the data blocks. It also retrieves the blocks when asked by clients or the NameNode.
[Diagram: a client performs metadata operations against the NameNode (e.g., /home/foo/data, replication factor 3) and block operations against DataNodes 1–5 arranged in racks]
DATA BLOCK SPLIT
Data block split is an important process of the HDFS architecture. Each file is split into one or more blocks, and the blocks are stored and replicated in DataNodes.
[Diagram: a file split into blocks b1–b3, with the blocks distributed across DataNodes 1–4 managed by the NameNode]
By default, each file block is 128 MB.
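As a quick illustrative calculation (128 MB is HDFS's default block size; the helper function is our own sketch, not a Hadoop API), the number of blocks for a file is a ceiling division of the file size by the block size:

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size

def num_blocks(file_size_mb):
    """Number of HDFS blocks needed; the last block may be smaller."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

print(num_blocks(500))  # 4 blocks: 128 + 128 + 128 + 116 MB
print(num_blocks(128))  # 1 block
```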
BLOCK REPLICATION ARCHITECTURE
Block replication refers to creating copies of a block on multiple DataNodes. Usually, the data is split into parts, such as part-0 and part-1.
[Diagram: the NameNode/JobTracker submits Job 1 over blocks B1–B3 to DataNode server 1; because the blocks are replicated to DataNode server 2, the job can be resubmitted there on failure]
REPLICATION METHOD
• Each file is split into a sequence of blocks (of the same size, except the last one).
• Blocks are replicated for fault-tolerance.
• The block replication factor is usually configured at the cluster level (can also be done at the file
level).
• The NameNode receives a heartbeat and a block report from each DataNode in the cluster.
• A block report lists the blocks on a DataNode.
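To make the block-report idea concrete, here is a toy sketch (the data structures are invented, not Hadoop code) of a NameNode tallying replicas from DataNode block reports and flagging blocks that fall below the replication factor:

```python
from collections import Counter

REPLICATION_FACTOR = 3

# Hypothetical block reports: DataNode -> list of block IDs it holds
block_reports = {
    "dn1": ["b1", "b2"],
    "dn2": ["b1", "b3"],
    "dn3": ["b1", "b2", "b3"],
}

def under_replicated(reports, factor=REPLICATION_FACTOR):
    """Return blocks with fewer replicas than the configured factor."""
    replicas = Counter()
    for blocks in reports.values():
        replicas.update(blocks)
    return sorted(b for b, n in replicas.items() if n < factor)

print(under_replicated(block_reports))  # ['b2', 'b3'] — only 2 replicas each
```

In real HDFS, this tally is what lets the NameNode schedule re-replication when a DataNode's heartbeat stops arriving.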
WHAT IS A RACK?
A rack is a collection of machines that are physically located in a single place/data center and connected through a network.
In Hadoop, a rack is a physical collection of slave machines put together at a single location for data storage.
RACK AWARENESS IN HADOOP
• In large Hadoop clusters, to reduce network traffic while reading/writing HDFS files, the NameNode chooses DataNodes on the same rack as, or on a rack near, the client making the read/write request.
• The NameNode obtains this rack information by maintaining the rack ID of each DataNode.
• This concept of choosing closer DataNodes based on rack information is called Rack Awareness in Hadoop.
REPLICATION AND RACK AWARENESS IN HADOOP
The topology of the replicas is critical to ensure the reliability of HDFS. Usually, data is replicated three times. The suggested replication topology is as follows:
• The first replica is placed on the same node as that of the client.
• The second replica is placed on a different rack from that of the first replica.
• The third replica is placed on the same rack as that of the second one, but on a different node.
[Diagram: a client writes block B1 through the NameNode; replicas of B1 are placed on nodes across Rack 1 and Rack 2]
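The three-replica placement rule can be sketched as a small Python function. This is a simplification of HDFS's actual placement policy, and the cluster layout and node names are invented for the example:

```python
import random

def place_replicas(racks, client_rack, client_node):
    """Pick 3 replica nodes per the suggested topology."""
    first = client_node                                  # replica 1: same node as the client
    other_rack = random.choice([r for r in racks if r != client_rack])
    second = random.choice(racks[other_rack])            # replica 2: a different rack
    third = random.choice([n for n in racks[other_rack]  # replica 3: same rack as replica 2,
                           if n != second])              # but a different node
    return [first, second, third]

# Hypothetical cluster: rack id -> nodes on that rack
racks = {"rack1": ["r1n1", "r1n2", "r1n3"],
         "rack2": ["r2n1", "r2n2", "r2n3"]}
print(place_replicas(racks, "rack1", "r1n1"))
```

This layout survives the loss of a whole rack (one replica is always elsewhere) while keeping two replicas rack-local to limit cross-rack write traffic.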
REPLICATION AND RACK AWARENESS: EXAMPLE
The diagram illustrates a Hadoop cluster with three racks; each rack contains multiple nodes. R1N1 represents Node 1 on Rack 1, and so on. The NameNode decides which DataNode belongs to which rack.
[Diagram: blocks B1, B2, and B3 replicated across the nodes of three racks under the NameNode]
INTRODUCTION TO YARN (YET ANOTHER RESOURCE NEGOTIATOR)
WHAT IS YARN: CASE STUDY
Yahoo was the first company to embrace Hadoop, and it became a trendsetter within the Hadoop ecosystem.
In late 2012, Yahoo struggled to handle iterative and stream processing of data on the Hadoop infrastructure due to MapReduce limitations.
Both iterative and stream processing were important for Yahoo in facilitating its move from batch computing to continuous computing.
How could this issue be solved?