
Introduction: MapReduce

• The concept of MapReduce was pioneered by Google.


• The original paper titled "MapReduce: Simplified Data Processing on Large
Clusters" was written by Jeffrey Dean and Sanjay Ghemawat, and it was
published in 2004.
• In the paper, they introduced the MapReduce programming model and
described its implementation at Google for processing large-scale data across
distributed clusters.
• MapReduce became a fundamental framework for distributed computing and
played a significant role in the development of big data technologies.
• While Google introduced the concept, the open-source Apache Hadoop project
later implemented its own version of MapReduce, making it accessible to a
broader community of developers and organizations.

Prerequisites that can help you grasp MapReduce more effectively


1. Programming Languages:

• Proficiency in a programming language is crucial.


• Java is commonly used in the Hadoop ecosystem, and many MapReduce
examples are written in Java.
• Knowledge of Python can also be useful.

2. Distributed Systems:

• Understanding the basics of distributed computing is essential.


• Familiarize yourself with concepts like nodes, clusters, parallel processing, and
the challenges associated with distributed systems.
3. Hadoop Ecosystem:

• MapReduce is often associated with the Hadoop framework.


• Therefore, it's helpful to have a basic understanding of Hadoop and its
ecosystem components, such as HDFS (Hadoop Distributed File System) and
YARN (Yet Another Resource Negotiator).

4. Basic Understanding of Big Data:

• MapReduce is commonly used in the context of big data processing.


• It's beneficial to have a foundational understanding of what constitutes "big
data," the challenges associated with large datasets, and the motivation behind
distributed computing for big data.

5. Linux/Unix Commands:

• Many big data platforms, including Hadoop, are typically deployed on Unix-
like systems.
• Familiarity with basic command-line operations in a Unix environment can be
helpful for interacting with Hadoop clusters.

6. SQL (Structured Query Language):

• If you are planning to use tools like Apache Hive, which provides a SQL-like
interface for querying data in Hadoop, a basic understanding of SQL can be
beneficial.

7. Concepts of Data Storage and Retrieval:

• Understanding how data is stored and retrieved in a distributed environment is crucial.
• Concepts like sharding, replication, and indexing are relevant.

8. Algorithmic and Problem-Solving Skills:

• MapReduce involves breaking down problems into smaller tasks that can be
executed in parallel.
• Strong algorithmic and problem-solving skills are valuable for designing
efficient MapReduce jobs.
Explanation
Q: Describe MapReduce Execution steps with a neat diagram 12 M

• MapReduce is a programming model and processing technique designed for processing and generating large datasets that can be parallelized across a distributed cluster of computers.
• A job means a MapReduce program.
• Each job consists of several smaller units, called MapReduce tasks.
• The basic idea behind MapReduce is to divide a large computation into smaller tasks that can be performed in parallel across multiple nodes in a cluster.

In a MapReduce job:

1. The data is split into smaller chunks, and a "map" function is applied to each
chunk independently.
2. The results are then shuffled and sorted, and a "reduce" function is applied to
combine the intermediate results into the final output.

The MapReduce programming approach allows for efficient processing of large datasets in a distributed computing environment.
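To make this flow concrete before turning to Hadoop itself, here is a minimal single-process simulation of the map, shuffle/sort, and reduce steps in plain Java. This is illustrative only; in a real MapReduce job these steps run in parallel across a cluster, and the class name MiniMapReduce is just a placeholder.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
    public static void main(String[] args) {
        // The input, already split into smaller chunks.
        String[] chunks = {"hello Hadoop hi Hadoop", "Hello MongoDB hi Cassandra Hadoop"};

        // Map: emit an intermediate pair (word, 1) for every word in every chunk.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String chunk : chunks) {
            for (String word : chunk.split("\\s+")) {
                intermediate.add(Map.entry(word, 1));
            }
        }

        // Shuffle and sort: group the intermediate values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce: combine each key's list of values into a final count.
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = group.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(group.getKey() + "\t" + sum);
        }
    }
}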
JobTracker and TaskTracker

• MapReduce consists of a single master JobTracker and one slave TaskTracker per cluster node.
• The master is responsible for scheduling the component tasks of a job onto the slaves, monitoring them, and re-executing failed tasks.
• The slaves execute the tasks as directed by the master.
• The MapReduce framework operates entirely on (key, value) pairs.
• The framework views the input to a task as a set of (key, value) pairs and produces a set of (key, value) pairs as the output of the task, possibly of different types: map: (k1, v1) → list(k2, v2), and reduce: (k2, list(v2)) → list(k3, v3).

Map-Tasks

A map task means a task that implements the map() function, which runs user application code for each key-value pair (k1, v1).

• Key k1 is drawn from a set of keys and maps to a group of data values.
• Value v1 is a large string read from the input file(s).
• The output of map() is either zero pairs (when no values are found) or intermediate key-value pairs (k2, v2).
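As an illustration, a minimal Hadoop Mapper for word counting might look like the following. This is a sketch against the standard org.apache.hadoop.mapreduce API; the class name WordCountMapper and the tokenization rule are illustrative choices, not part of the framework.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Maps (k1, v1) = (byte offset, line of text) to (k2, v2) = (word, 1).
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line on whitespace and emit (word, 1) for each word.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}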
Reduce Task

• Refers to a task which takes the intermediate output (k2, v2) from the map as input and combines those data pieces into a smaller set of data (optionally pre-aggregated by a combiner).
• The reduce task is always performed after the map task.

Key-Value Pair

Each phase (Map phase and Reduce phase) of MapReduce has key-value pairs as
input and output.
Data should be first converted into key-value pairs before it is passed to the Mapper,
as the Mapper only understands key-value pairs of data.

Key-value pairs in Hadoop MapReduce are generated as follows:

• InputSplit - Defines a logical representation of the data and presents the split data for processing by an individual map().
• RecordReader - Communicates with the InputSplit and converts the split into records, which are in the form of key-value pairs in a format suitable for reading by the Mapper.
• RecordReader uses TextInputFormat by default for converting data into key-value pairs.
• RecordReader communicates with the InputSplit until the entire file is read.
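For illustration, the sketch below shows a driver explicitly selecting TextInputFormat (since it is the default, the call is normally omitted); the class name InputFormatDemo is a hypothetical placeholder. With this format, the key handed to each map() call is the line's byte offset within the file and the value is the text of the line.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws IOException {
        // With TextInputFormat (the default), the RecordReader delivers each
        // record as (LongWritable byte offset of the line, Text line contents).
        Job job = Job.getInstance(new Configuration(), "input-format-demo");
        job.setInputFormatClass(TextInputFormat.class);
    }
}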

Grouping by Key

• When a map task completes, the shuffle process aggregates (combines) all the Mapper outputs by grouping the key-value pairs of the Mapper output on the key, appending each value v2 to a list of values.
• A "Group By" operation on the intermediate keys creates the list(v2) for each key.

Shuffle and Sorting Phase

• All pairs with the same group key (k2) are collected and grouped together, creating one group for each key.
• The shuffle output is thus a list of (k2, list(v2)) pairs, and a different subset of the intermediate key space is assigned to each reduce node.

Reduce Tasks

• A reduce task implements reduce(), which takes the shuffled and sorted Mapper output, grouped by key as (k2, list(v2)), and is applied in parallel to each group.
• The reduce function iterates over the list of values associated with a key and produces outputs such as aggregations and statistics.
• The reduce function emits zero or more key-value pairs (k3, v3) to the final output file. Reduce: (k2, list(v2)) → list(k3, v3)
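A matching Hadoop Reducer for word counting might look like this. Again a sketch against the standard org.apache.hadoop.mapreduce API; the class name WordCountReducer is an illustrative choice.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduces (k2, list(v2)) = (word, [1, 1, ...]) to (k3, v3) = (word, count).
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Iterate over the list of values for this key and aggregate.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}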
MapReduce Implementation

• MapReduce is a programming model and processing technique for handling large datasets in a parallel and distributed fashion.
• The word count problem is a classic example of a task that can be solved using MapReduce.
• The following is a mathematical representation of the MapReduce algorithm for the word count problem.
Example:

Step 1: Input Document:

D="hello Hadoop, hi Hadoop, Hello MongoDB, hi Cassandra Hadoop"

Step 2: Map Function:

The Map function processes each word in the document and emits key-value pairs
where the key is the word, and the value is 1 (indicating the count).

Map("hello”) →{("hello",1)},

Map("Hadoop”) →{("Hadoop",1), ("Hadoop",1), ("Hadoop",1)},

Map("hi”) →{("hi",1), ("hi",1)}, …

Step 3: Shuffle and Sort (Grouping by Key):

Group and sort the intermediate key-value pairs by key.

("hello", [1]), ("Hadoop", [1,1,1]), ("hi", [1,1]), …

Step 4: Reduce Function:

The Reduce function takes each unique key and the list of values and calculates the
sum.

Reduce ("hello", [1]) →{("hello",1)},

Reduce ("Hadoop", [1,1,1]) →{("Hadoop",3)},

Reduce ("hi", [1,1]) →{("hi",2)}, …

Step 5: Final Output:

{("hello",1), ("Hadoop",3), ("hi",2), ("Hello",1), ("MongoDB",1), ("Cassandra",1)}
