
Introduction: MapReduce

• The concept of MapReduce was pioneered by Google.


• The original paper titled "MapReduce: Simplified Data Processing on Large
Clusters" was written by Jeffrey Dean and Sanjay Ghemawat, and it was
published in 2004.
• In the paper, they introduced the MapReduce programming model and
described its implementation at Google for processing large-scale data across
distributed clusters.
• MapReduce became a fundamental framework for distributed computing and
played a significant role in the development of big data technologies.
• While Google introduced the concept, the open-source Apache Hadoop project
later implemented its own version of MapReduce, making it accessible to a
broader community of developers and organizations.

Prerequisites that can help you grasp MapReduce more effectively


1. Programming Languages:

• Proficiency in a programming language is crucial.


• Java is commonly used in the Hadoop ecosystem, and many MapReduce
examples are written in Java.
• Knowledge of Python can also be useful.

2. Distributed Systems:

• Understanding the basics of distributed computing is essential.


• Familiarize yourself with concepts like nodes, clusters, parallel processing, and
the challenges associated with distributed systems.
3. Hadoop Ecosystem:

• MapReduce is often associated with the Hadoop framework.


• Therefore, it's helpful to have a basic understanding of Hadoop and its
ecosystem components, such as HDFS (Hadoop Distributed File System) and
YARN (Yet Another Resource Negotiator).

4. Basic Understanding of Big Data:

• MapReduce is commonly used in the context of big data processing.


• It's beneficial to have a foundational understanding of what constitutes "big
data," the challenges associated with large datasets, and the motivation behind
distributed computing for big data.

5. Linux/Unix Commands:

• Many big data platforms, including Hadoop, are typically deployed on Unix-
like systems.
• Familiarity with basic command-line operations in a Unix environment can be
helpful for interacting with Hadoop clusters.

6. SQL (Structured Query Language):

• If you are planning to use tools like Apache Hive, which provides a SQL-like
interface for querying data in Hadoop, a basic understanding of SQL can be
beneficial.

7. Concepts of Data Storage and Retrieval:

• Understanding how data is stored and retrieved in a distributed environment is crucial.
• Concepts like sharding, replication, and indexing are relevant.

8. Algorithmic and Problem-Solving Skills:

• MapReduce involves breaking down problems into smaller tasks that can be
executed in parallel.
• Strong algorithmic and problem-solving skills are valuable for designing
efficient MapReduce jobs.
Explanation
Q: Describe MapReduce Execution steps with a neat diagram 12 M

• MapReduce is a programming model and processing technique designed for processing and generating large datasets that can be parallelized across a distributed cluster of computers.
• A job means a MapReduce program.
• Each job consists of several smaller units, called MapReduce tasks.
• The basic idea behind MapReduce is to divide a large computation into smaller tasks that can be performed in parallel across multiple nodes in a cluster.

In a MapReduce job:

1. The data is split into smaller chunks, and a "map" function is applied to each
chunk independently.
2. The results are then shuffled and sorted, and a "reduce" function is applied to
combine the intermediate results into the final output.

The MapReduce programming approach allows for efficient processing of large datasets in a distributed computing environment.
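To make this flow concrete before turning to Hadoop itself, here is a minimal single-process simulation of the map, shuffle/sort, and reduce steps in plain Java. This is illustrative only; in a real MapReduce job these steps run in parallel across a cluster, and the class name MiniMapReduce is just a placeholder.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
    public static void main(String[] args) {
        // The input, already split into smaller chunks.
        String[] chunks = {"hello Hadoop hi Hadoop", "Hello MongoDB hi Cassandra Hadoop"};

        // Map: emit an intermediate pair (word, 1) for every word in every chunk.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String chunk : chunks) {
            for (String word : chunk.split("\\s+")) {
                intermediate.add(Map.entry(word, 1));
            }
        }

        // Shuffle and sort: group the intermediate values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce: combine each key's list of values into a final count.
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = group.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(group.getKey() + "\t" + sum);
        }
    }
}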
JobTracker and TaskTracker

• MapReduce consists of a single master JobTracker and one slave TaskTracker per cluster node.
• The master is responsible for scheduling the component tasks of a job onto the slaves, monitoring them, and re-executing failed tasks.
• The slaves execute the tasks as directed by the master.
• The MapReduce framework operates entirely on (key, value) pairs.
• The framework views the input to a task as a set of (key, value) pairs and produces a set of (key, value) pairs as the output of the task, possibly of different types: map: (k1, v1) → list(k2, v2), and reduce: (k2, list(v2)) → list(k3, v3).

Map-Tasks

A map task means a task that implements the map() function, which runs user application code for each key-value pair (k1, v1).

• Key k1 is drawn from a set of keys and maps to a group of data values.
• Value v1 is a large string read from the input file(s).
• The output of map() is either zero pairs (when no values are found) or intermediate key-value pairs (k2, v2).
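As an illustration, a minimal Hadoop Mapper for word counting might look like the following. This is a sketch against the standard org.apache.hadoop.mapreduce API; the class name WordCountMapper and the tokenization rule are illustrative choices, not part of the framework.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Maps (k1, v1) = (byte offset, line of text) to (k2, v2) = (word, 1).
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line on whitespace and emit (word, 1) for each word.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}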
Reduce Task

• Refers to a task which takes the intermediate output (k2, v2) from the map as input and combines those data pieces into a smaller set of data (optionally pre-aggregated by a combiner).
• The reduce task is always performed after the map task.

Key-Value Pair

Each phase (Map phase and Reduce phase) of MapReduce has key-value pairs as
input and output.
Data should be first converted into key-value pairs before it is passed to the Mapper,
as the Mapper only understands key-value pairs of data.

Key-value pairs in Hadoop MapReduce are generated as follows:

• InputSplit - Defines a logical representation of the data and presents the split data for processing by an individual map().
• RecordReader - Communicates with the InputSplit and converts the split into records, which are in the form of key-value pairs in a format suitable for reading by the Mapper.
• RecordReader uses TextInputFormat by default for converting data into key-value pairs.
• RecordReader communicates with the InputSplit until the entire file is read.
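For illustration, the sketch below shows a driver explicitly selecting TextInputFormat (since it is the default, the call is normally omitted); the class name InputFormatDemo is a hypothetical placeholder. With this format, the key handed to each map() call is the line's byte offset within the file and the value is the text of the line.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws IOException {
        // With TextInputFormat (the default), the RecordReader delivers each
        // record as (LongWritable byte offset of the line, Text line contents).
        Job job = Job.getInstance(new Configuration(), "input-format-demo");
        job.setInputFormatClass(TextInputFormat.class);
    }
}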

Grouping by Key

• When a map task completes, the shuffle process aggregates (combines) all the Mapper outputs by grouping the key-value pairs of the Mapper output on the key, appending each value v2 to a list of values.
• A "Group By" operation on the intermediate keys creates the list(v2) for each key.

Shuffle and Sorting Phase

• All pairs with the same group key (k2) are collected and grouped together, creating one group for each key.
• The shuffle output is thus a list of (k2, list(v2)) pairs, and a different subset of the intermediate key space is assigned to each reduce node.

Reduce Tasks

• A reduce task implements reduce(), which takes the shuffled and sorted Mapper output, grouped by key as (k2, list(v2)), and is applied in parallel to each group.
• The reduce function iterates over the list of values associated with a key and produces outputs such as aggregations and statistics.
• The reduce function emits zero or more key-value pairs (k3, v3) to the final output file. Reduce: (k2, list(v2)) → list(k3, v3)
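A matching Hadoop Reducer for word counting might look like this. Again a sketch against the standard org.apache.hadoop.mapreduce API; the class name WordCountReducer is an illustrative choice.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduces (k2, list(v2)) = (word, [1, 1, ...]) to (k3, v3) = (word, count).
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Iterate over the list of values for this key and aggregate.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}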
MapReduce Implementation

• MapReduce is a programming model and processing technique for handling large datasets in a parallel and distributed fashion.
• The word count problem is a classic example of a task that can be solved using MapReduce.
• The following is a mathematical representation of the MapReduce algorithm for the word count problem.
Example:

Step 1: Input Document:

D="hello Hadoop, hi Hadoop, Hello MongoDB, hi Cassandra Hadoop"

Step 2: Map Function:

The Map function processes each word in the document and emits key-value pairs
where the key is the word, and the value is 1 (indicating the count).

Map("hello”) →{("hello",1)},

Map("Hadoop”) →{("Hadoop",1), ("Hadoop",1), ("Hadoop",1)},

Map("hi”) →{("hi",1), ("hi",1)}, …

Step 3: Shuffle and Sort (Grouping by Key):

Group and sort the intermediate key-value pairs by key.

("hello", [1]), ("Hadoop", [1,1,1]), ("hi", [1,1]), …

Step 4: Reduce Function:

The Reduce function takes each unique key and the list of values and calculates the
sum.

Reduce ("hello", [1]) →{("hello",1)},

Reduce ("Hadoop", [1,1,1]) →{("Hadoop",3)},

Reduce ("hi", [1,1]) →{("hi",2)}, …

Step 5: Final Output:

{("hello",1), ("Hadoop",3), ("hi",2), ("Hello",1), ("MongoDB",1), ("Cassandra",1)}
