Apache Hadoop Developer Training PDF
-Navneet Sharma
Mr. Som Shekhar Sharma
Velocity
High-frequency data, such as stock market ticks
Variety
Structured and unstructured data
Challenges In Big Data
Complex
No proper understanding of the underlying data
Storage
How to accommodate a large amount of data on a single
physical machine
Performance
How to process a large amount of data efficiently and
effectively so as to increase performance
Traditional Applications
Data is transferred between the Application Server and the
Database over the Network
Challenges in Traditional Application- Part1
Network
We don't get a dedicated network line from the application to
the database
All the company's traffic has to go through the same line
Data
Cannot control the production of data
Data is bound to increase
The format of the data changes very often
The database size will increase
Statistics Part1
Assuming the network bandwidth is 10 MBps
Data transfer becomes a bottleneck if the number of users
is increased
New Approach - Requirements
Supporting Partial failures
Recoverability
Data Availability
Consistency
Data Reliability
Upgrading
Supporting partial failures
Should not shut down the entire system if a few machines
are down
Should result in graceful degradation of performance
Recoverability
If machines/components fail, their tasks should be taken up
by other working components
Data Availability
What makes Hadoop special?
Highly reliable and efficient storage system
MR framework (MapReduce)
Designed for scaling in terms of performance
Overview Of Hadoop Processes
Processes running on Hadoop:
NameNode
Secondary NameNode
DataNode
TaskTracker (used by the MapReduce framework)
JobTracker (used by the MapReduce framework)
Hadoop Process contd
Two masters:
NameNode, aka the HDFS master
If it is down, you cannot access HDFS
JobTracker, aka the MapReduce master
If it is down, you cannot run MapReduce jobs, but you can still access
HDFS
Overview Of HDFS
NameNode is the single point of contact
Holds the metadata of the files stored in
HDFS
If it fails, HDFS is inaccessible
Hey, Hadoop is written in Java, and I am purely from a C++ background;
how can I use Hadoop for my big data problems?
Hadoop Pipes
Well, how about Python, Scala, Ruby, etc. programmers? Does Hadoop
support all of these?
Hadoop Streaming
RDBMS and Hadoop
Hadoop is not a database
$HADOOP_HOME (/usr/local/hadoop)
conf: mapred-site.xml, core-site.xml, hdfs-site.xml, masters, slaves, hadoop-env.sh
bin: start-all.sh, stop-all.sh, start-dfs.sh, etc.
logs: all the log files for the corresponding processes will be created here
lib: all the 3rd party jar files are present; you will be requiring them while
working with the HDFS API
conf Directory
Place for all the configuration files
All the Hadoop-related properties need to go into one
of these files
mapred-site.xml
core-site.xml
hdfs-site.xml
bin directory
Place for all the executable files
Install ssh
yum install openssh-server openssh-clients
chkconfig sshd on
service sshd start
Installing Java
Download Sun JDK ( >=1.6 ) 32-bit for Linux
export JAVA_HOME=PATH_TO_YOUR_JAVA_HOME
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/home/training/hadoop-1.0.3
export PATH=$PATH:$HADOOP_HOME/bin
Edit $HADOOP_HOME/conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>
</configuration>
Edit $HADOOP_HOME/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
</property>
</configuration>
Edit $HADOOP_HOME/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/training/hadoop-temp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
</property>
</configuration>
Edit $HADOOP_HOME/conf/hadoop-env.sh
export JAVA_HOME=PATH_TO_YOUR_JDK_DIRECTORY
Note
No need to change your masters and slaves files as you
are installing Hadoop in pseudo mode / single node
Edit $HADOOP_HOME/conf/mapred-site.xml (multi-node / cluster setup)
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>IP_OF_MASTER:54311</value>
</property>
</configuration>
Edit $HADOOP_HOME/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
</property>
</configuration>
Edit $HADOOP_HOME/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/training/hadoop-temp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://IP_OF_MASTER:54310</value>
</property>
</configuration>
NOTE
All the configuration files have to be the same across all
the machines
Processes running on the Master / NameNode machine:
NameNode, JobTracker, Secondary NameNode, DataNode, TaskTracker
Processes running on each slave machine:
DataNode, TaskTracker
Hadoop Set up in Production
Master machine: NameNode, JobTracker, Secondary NameNode
Each slave machine: DataNode, TaskTracker
Important Configuration properties
In this module you will learn
fs.default.name
mapred.job.tracker
hadoop.tmp.dir
dfs.block.size
dfs.replication
fs.default.name
Value is hdfs://IP:PORT [hdfs://localhost:54310]
NameSpace IDs
The dfs and mapred directories are created under hadoop.tmp.dir
The NameSpace ID has to be the same on the NameNode and the DataNodes
You will get an "Incompatible NameSpace ID" error if
there is a mismatch
The DataNode will not come up
NameSpace IDs contd
Every time the NameNode is formatted, a new namespace ID
is allocated (hadoop namenode -format)
Q2. If you have 10 machines, each of size 1TB, and you have
utilized the entire capacity of the machines for HDFS, then
what is the maximum file size you can put on HDFS?
1TB
10TB
3TB
4TB
HDFS Shell Commands
In this module you will learn
How to use HDFS Shell commands
Command to list the directories and files
Command to put files on HDFS from local file system
Command to get files from HDFS to the local file system
Displaying the contents of a file
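For reference, the standard forms of these shell commands look like the
following (the paths shown are only placeholders):
hadoop fs -ls /user/training
hadoop fs -put localfile.txt /user/training/
hadoop fs -get /user/training/file.txt .
hadoop fs -cat /user/training/file.txt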
What is HDFS?
It's a layered or virtual file system on top of the local file
system
Does not modify the underlying file system
Accessing HDFS through command line
Remember it is not a regular file system
Another variant:
This will give the list of blocks which the file "sample"
is made of and their locations
DataNode
TaskTracker runs on the slaves / DataNodes
How HDFS is designed?
For storing very large files - scaling in terms of storage
Hadoop is efficient for storing large files but not efficient for storing
small files
Small files means the file size is less than the block size
The NameNode tries to recover the lost blocks from the failed machine and
bring the replication factor back to normal (more on this later)
Batch Mode Processing
The client puts a big file on the Hadoop cluster
Batch Mode Processing contd
The client wants to analyze this file
Normal Replication factor
Along with the data, a checksum is also shipped for verifying
the data integrity
If a replica is corrupt, the client intimates the NameNode and tries to get the data
from another DataNode
HDFS API
Accessing the file system
Require 3 things:
Configuration object
Path object
FileSystem instance
Hadoop's Configuration
Encapsulates client and server configuration
Use the Configuration class to access the file system
The Configuration object determines how you want to access
the Hadoop cluster:
Local File System
Pseudo Mode
Cluster Mode
Hadoop's Path
A file on HDFS is represented using Hadoop's Path
object
A Path is similar to an HDFS URI such as
hdfs://localhost:54310/user/dev/sample.txt
FileSystem API
General FileSystem API:
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
Accessing the file system contd..
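A minimal sketch of tying the three pieces together to read a file from HDFS;
the file path and pseudo-mode URI are just the example values used earlier in
this deck:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();               // picks up *-site.xml from the classpath
        conf.set("fs.default.name", "hdfs://localhost:54310");  // pseudo mode, as configured earlier
        FileSystem fs = FileSystem.get(conf);                   // FileSystem instance for that cluster
        InputStream in = null;
        try {
            in = fs.open(new Path("/user/dev/sample.txt"));     // Path object pointing to a file on HDFS
            IOUtils.copyBytes(in, System.out, 4096, false);     // dump the file contents to stdout
        } finally {
            IOUtils.closeStream(in);
        }
    }
}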
KEY, (Val1, Val2, Val3, ..., Valn)
Integer -> IntWritable
Long -> LongWritable
Float -> FloatWritable
Byte -> ByteWritable
String -> Text
Double -> DoubleWritable
Input Format
Before running the job on the data residing on HDFS,
you need to tell what kind of data it is:
Is the data textual data?
Is the data binary data?
Text Input Format: key = offset of the line within the file, value = entire line till \n
Key Value Text Input Format: key = part of the record till the first delimiter, value = remaining record after the first delimiter
Example:
Text Input Format contd
Internally every line is associated with an offset
Key Value
0 Hello, how are you?
1 Hey I am fine?How about you?
2 This is plain text
3 I will be using Text Input Format
How Input Split is processed by mapper?
Input split by default is the block size (dfs.block.size)
Key Value
Hello 2
you 2
I 2
Word Count Mapper
Assuming one input split
Input to the map function:
Map(1, "Hello, how are you?") ===> (Hello,1) (how,1) (are,1) (you,1)
Map(2, "Hello, I am fine? How about you?") ===> (Hello,1) (I,1) (am,1) ...
Pseudo Code
Map (inputKey,InputValue)
{
Break the inputValue into individual words;
For each word in the individual words
{
write (word ,1)
}
}
Word Count Reducer
Reducer will receive the intermediate key and its list of
values
Each mapper processes its input (e.g. "Hello World") and emits intermediate
key-value pairs such as (Hello,1) and (World,1)
Partitioning and sorting on the intermediate key-value pairs (IKV) happens
on each mapper
The partitions are then shuffled/copied to the reducers
At the reducer (e.g. Reduce 1): sorting and grouping of the intermediate
key-value pairs
Important feature of map reduce job
The intermediate output key and value generated by the
map phase are stored on the local file system of the machine
where the map task is running
The intermediate key-value pairs are stored as a sequence file (binary
key-value pairs)
@Override
public void map(LongWritable inputKey, Text inputVal, Context context)
        throws IOException, InterruptedException {
    String line = inputVal.toString();
    String[] splits = line.split("\\W+");
    for (String outputKey : splits) {
        context.write(new Text(outputKey), new IntWritable(1));
    }
}
Writing Mapper Class
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable inputKey, Text inputVal, Context context)
            throws IOException, InterruptedException {
        String line = inputVal.toString();
        String[] splits = line.split("\\W+");
        for (String outputKey : splits) {
            context.write(new Text(outputKey), new IntWritable(1));
        }
    }
}
Your Mapper class should extend from the Mapper class:
Mapper<LongWritable, Text, Text, IntWritable>
First TypeDef: input key type, given by the input format you use
Second TypeDef: input value type, given by the input format
Third TypeDef: output key type which you emit from the mapper
Fourth TypeDef: output value type which you emit from the mapper
Writing Mapper Class
public class WordCountMapper extends Mapper<LongWritable, Text, Text,
IntWritable>{
@Override
public void map(LongWritable inputKey,Text inputVal,Context context)
{
Third argument: using this Context object you will emit the output key-value
pairs
Writing Mapper Class
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable inputKey, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();                           // Step 1
        String[] splits = line.split("\\W+");                     // Step 2
        for (String outputKey : splits) {                         // Step 3
            context.write(new Text(outputKey), new IntWritable(1));
        }
    }
}
Step 1: Take the String object from the input value
Step 2: Split the string object obtained in Step 1 into individual
words and take them in an array
Step 3: Iterate through each word in the array and emit the individual word as
the key and 1 (of type IntWritable) as the value
Writing Reducer Class
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context output)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        output.write(key, new IntWritable(sum));
    }
}
Writing Reducer Class
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context output)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        output.write(key, new IntWritable(sum));
    }
}
Your Reducer class should extend from the Reducer class:
Reducer<Text, IntWritable, Text, IntWritable>
First TypeDef: input key type, given by the output key of the map output
Second TypeDef: input value type, given by the output value of the map output
Third TypeDef: output key type which you emit from the reducer
Fourth TypeDef: output value type which you emit from the reducer
Writing Reducer Class
The reducer will get a key and the list of its values
Example: Hello {1,1,1,1,1,1,1,1,1,1}
for (IntWritable val : values) {
    sum += val.get();
}
output.write(key, new IntWritable(sum));
Driver Class
Step 6: Specify the reducer output key and output value class
The input key and value to the reducer are determined by the map output key
and map output value class respectively, so there is NO need to specify them
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
Driver Class contd
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.waitForCompletion(true);
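Putting these steps together, a minimal sketch of a complete WordCount driver;
the class names follow the mapper/reducer examples above, and the input/output
paths are taken from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");          // Hadoop 1.x style used in this deck
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}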
Usage of Tool Runner
Allows you to specify configuration options from the
command line
Can specify distributed cache properties
Can be used for tuning map reduce jobs
Flexibility to increase or decrease the reducer tasks
without changing the code
Can run the existing MapReduce job to utilize the local file
system instead of HDFS
Tool Runner contd
After the previous step, you have to override the run
method
The run method is where actual driver code goes
-D property=value
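A minimal sketch of a driver wired through ToolRunner (the class name is only
illustrative); any -D property=value options passed on the command line end up
in the Configuration returned by getConf():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountTool extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();        // already populated with -D property=value options
        Job job = new Job(conf, "word count");
        // ... same job set-up as in the plain driver above ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new WordCountTool(), args));
    }
}

It could then be launched with something like
hadoop jar wordcount.jar WordCountTool -D mapred.reduce.tasks=5 input output
(jar and path names are placeholders).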
Problem statement:
From a list of files or documents, map each word to the
list of files in which it has appeared
Output
word = List of documents in which this word has
appeared
Indexing Problem contd
File A: "This is cat / Big fat hen"
Output from the mapper (File A): This:File A, is:File A, cat:File A, Big:File A, fat:File A, hen:File A
File B: "This is dog / My dog is fat"
Output from the mapper (File B): This:File B, is:File B, dog:File B, My:File B, dog:File B, is:File B, fat:File B
Final Output: This:File A,File B; is:File A,File B; cat:File A; fat:File A,File B
Indexing problem contd
Mapper
For each word in the line, emit(word,file_name)
Reducer
Remember, for each word, the list of all file names will be
coming to the reducer
Emit(word, file_name_list)
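A minimal sketch of such a mapper and reducer (class names are illustrative;
the file name is taken from the input split):

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // name of the file this split belongs to
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            for (String word : value.toString().split("\\W+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), new Text(fileName));
                }
            }
        }
    }

    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text word, Iterable<Text> files, Context context)
                throws IOException, InterruptedException {
            // collect distinct file names, preserving first-seen order
            Set<String> fileNames = new LinkedHashSet<String>();
            for (Text file : files) {
                fileNames.add(file.toString());
            }
            StringBuilder fileList = new StringBuilder();
            for (String name : fileNames) {
                if (fileList.length() > 0) {
                    fileList.append(",");
                }
                fileList.append(name);
            }
            context.write(word, new Text(fileList.toString()));
        }
    }
}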
Average Word Length Problem
Consider the records in a file
Problem Statement
Calculate the average word length for each starting character
Output:
Character: Average word length
H (3+3+5) /3 = 3.66
I 2/1 = 2
T 5/1 =5
D 4/1 = 4
Average Word Length contd
Mapper
For each word in a line,
emit(firstCharacterOfWord,lengthOfWord)
Reducer
You will get a character as the key and a list of values
corresponding to the word lengths
For each value in the list of values:
Calculate the sum and also count the number of values
Emit(character, sum/count)
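A minimal sketch of this mapper and reducer (class names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageWordLength {

    public static class LengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String word : value.toString().split("\\W+")) {
                if (!word.isEmpty()) {
                    // emit (first character of the word, length of the word)
                    context.write(new Text(word.substring(0, 1)), new IntWritable(word.length()));
                }
            }
        }
    }

    public static class LengthReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
        @Override
        public void reduce(Text firstChar, Iterable<IntWritable> lengths, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            long count = 0;
            for (IntWritable len : lengths) {
                sum += len.get();
                count++;
            }
            // average word length for words starting with this character
            context.write(firstChar, new DoubleWritable((double) sum / count));
        }
    }
}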
Hands On
Module 5
In this module you will learn
What is combiner?
setup/cleanup method in mapper / reducer
Passing the parameters to mapper and reducer
Distributed cache
Counters
Hands On
Combiner
A large number of mappers running will produce large
amounts of intermediate data
This data needs to be passed to the reducer over the
network
Lots of network traffic
Shuffling/copying the mapper output to the machine
where the reducer will run will take a lot of time
Combiner contd
Similar to reducer
Runs on the same machine as the mapper task
Runs the reducer code on the intermediate output of the
mapper
Thus minimizing the intermediate key-value pairs
Combiner runs on intermediate output of each mapper
Advantages
Minimize the data transfer across the network
Speed up the execution
Reduces the burden on reducer
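Because summing counts is associative and commutative, the word-count reducer
itself can typically be reused as the combiner; in the driver this is a single
extra line:

job.setCombinerClass(WordCountReducer.class);   // combiner runs per-mapper on the intermediate output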
Combiner contd
Setup/cleanup method
String searchWord;
public void setup(Context context) {
searchWord = context.getConfiguration().get(WORD);
}
}
Distributed Cache
The MR framework will copy the files to the local
file system of the slave nodes before executing the task on that node
After the task is completed, the files are removed from the local file
system
Distributed Cache contd
Following can be cached
Text data
Jar files
.zip files
.tar.gz files
.tgz files
Distributed Cache contd
First Option: From the driver class
Second Option: From the command line
Using Distributed Cache-First Option
@Override
public void setup(Context context) throws IOException {
this.files = DistributedCache.
getCacheFiles(context.getConfiguration());
Path path = new Path(files[0]);
//do something
}
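On the driver side, the file is registered with DistributedCache before the job
is submitted; a minimal sketch (the HDFS path is a placeholder, and this sits
inside a main/run method that declares throws Exception):

// in the driver, before calling job.waitForCompletion()
DistributedCache.addCacheFile(new URI("/user/training/lookup.txt"), job.getConfiguration());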
Using Distributed Cache-Second Option
You can send the files from the command line
Your driver should have implemented ToolRunner
Example:
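A typical invocation then looks something like the following (jar, class, and
path names are placeholders):
hadoop jar myjob.jar MyDriver -files /local/path/lookup.txt input_dir output_dir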
Data has 3 fields only: lastName, firstName, and empId
You would like to group the data by lastName, and
sorting should be done on both lastName and firstName
Writable and WritableComparable
Hadoop uses its own serialization mechanism for
transferring the intermediate data over the network
Fast and compact
Hadoop does not use Java serialization
Implementing Custom Values contd
public class PointWritable implements Writable {
    private IntWritable xcoord = new IntWritable();   // x coordinate, initialized so readFields can populate it
    private IntWritable ycoord = new IntWritable();   // y coordinate
    @Override
    public void readFields(DataInput in) throws IOException {
        // read the fields in the same order you have defined/written them
        xcoord.readFields(in);
        ycoord.readFields(in);
    }
    @Override
    public void write(DataOutput out) throws IOException {
        xcoord.write(out);
        ycoord.write(out);
    }
}
Implementing Custom Values contd
public class PointWritable implements Writable {
    private IntWritable xcoord;   // x coordinate
    private IntWritable ycoord;   // y coordinate
}
Implementing Custom Keys
public class Person implements WritableComparable<Person>
{
    private Text lastName = new Text();    // initialized so readFields can populate them
    private Text firstName = new Text();
    @Override
    public void readFields(DataInput in) throws IOException {
        lastName.readFields(in);
        firstName.readFields(in);
    }
    @Override
    public void write(DataOutput out) throws IOException {
        lastName.write(out);
        firstName.write(out);
    }
Implementing Custom Keys contd
@Override
public int compareTo(Person other) {
    int cmp = lastName.compareTo(other.getLastName());
    if(cmp != 0) {
        return cmp;
    }
    return firstName.compareTo(other.getFirstName());
}
}
Implementing Custom Keys contd
public class Person implements WritableComparable<Person> {
    private Text lastName;
    private Text firstName;
    @Override
    public void readFields(DataInput in) throws IOException { ... }
    @Override
    public void write(DataOutput out) throws IOException { ... }
    @Override
    public int compareTo(Person other) { ... }
}
Key = World
Hash code = 31 (let's say)
The key "World" and its list of values will go to reducer 31 % 3 = 1
(assuming 3 reducers)
Implementing Custom Partitioner
Recap:
For better load balancing of key value pairs, you should
consider more than 1 reducer
@Override
public int getPartition(Person outputKey, Text outputVal, int numOfReducer) {
//making sure that keys having same last name goes to the same reducer
//because partition is being done on last name
return Math.abs(outputKey.getLastName().hashCode()*127)%numOfReducer;
}
}
Custom Partitioner contd
public class PersonPartitioner extends Partitioner<Person, Text> {
    @Override
    public int getPartition(Person outputKey, Text outputVal, int numOfReducer) {
        // making sure that keys having the same last name go to the same reducer
        // because partitioning is being done on the last name
        return Math.abs(outputKey.getLastName().hashCode() * 127) % numOfReducer;
    }
}
A custom Partitioner should extend from the Partitioner class
The type arguments to the Partitioner class represent the mapper output key
and mapper output value class
In the current scenario the map output key is the custom key
The custom key implements the WritableComparable interface
Custom Partitioner contd
public class PersonPartitioner extends Partitioner<Person, Text> {
    @Override
    public int getPartition(Person outputKey, Text outputVal, int numOfReducer) {
        // making sure that keys having the same last name go to the same reducer
        // because partitioning is being done on the last name
        return Math.abs(outputKey.getLastName().hashCode() * 127) % numOfReducer;
    }
}
Override the getPartition method
First argument is the map output key
Second argument is the map output value
Using Custom Partitioner in Job
job.setPartitionerClass(CustomPartitioner.class);
Hands-On
Refer hands-on document
Assignment
Modify your WordCount MapReduce program to
generate 26 output files, one for each character
Each output file should consist only of words starting with
that character
Implementing Custom Input Format
Built-in input formats available:
Text input format
Key value text input format
Sequence file input format
NLine input format, etc.
@Override
public RecordReader<Text, IntWritable> createRecordReader(InputSplit
input, TaskAttemptContext arg1) throws IOException, InterruptedException
{
return new WikiRecordReader();
}
}
Implementing custom input format contd
public class WikiProjectInputFormat extends
FileInputFormat<Text,IntWritable>{
}
Implementing custom input format contd
public class WikiProjectInputFormat extends
FileInputFormat<Text,IntWritable>{
@Override
public RecordReader<Text, IntWritable> createRecordReader(InputSplit
input, TaskAttemptContext arg1) throws IOException,
InterruptedException
}
Implementing custom input format contd
public class WikiProjectInputFormat extends FileInputFormat<Text,IntWritable> {
    @Override
    public RecordReader<Text, IntWritable> createRecordReader(InputSplit input,
            TaskAttemptContext arg1) throws IOException, InterruptedException {
        return new WikiRecordReader();
    }
}
WikiRecordReader is the custom record reader which needs to be
implemented
Implementing Record Reader
public WikiRecordReader() {
    lineReader = new LineRecordReader();
}
@Override
public void initialize(InputSplit input, TaskAttemptContext context)
        throws IOException, InterruptedException {
    lineReader.initialize(input, context);
}
Implementing Record Reader contd
Initialize the lineReader; the lineReader takes the input split
public WikiRecordReader() {
    lineReader = new LineRecordReader();
}
@Override
public void initialize(InputSplit input, TaskAttemptContext context)
        throws IOException, InterruptedException {
    lineReader.initialize(input, context);
}
Implementing Record Reader contd
This function provides the input key-value pairs to the map function,
one pair at a time
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!lineReader.nextKeyValue()) {
        return false;
    }
    Text value = lineReader.getCurrentValue();
    String[] splits = value.toString().split(" ");
    if (splits[0].equals("en")) {
        lineKey = new Text(splits[1]);
        lineValue = new IntWritable(Integer.parseInt(splits[2]));
    } else {
        lineKey = null;
        lineValue = null;
    }
    return true;
}
Custom comparator
protected EmployeeSortComparator() {
    super(Employee.class);
}
Custom comparator contd..
@Override
public int compare(WritableComparable o1, WritableComparable o2) {
Employee e1 = (Employee) o1;
Employee e2 = (Employee) o2;
int comparison =
e1.getFirstName().compareTo(e2.getFirstName());
if(comparison == 0) {
return e1.getLastName().compareTo(e2.getLastName());
}
return comparison;
}
job.setSortComparatorClass(EmployeeSortComparator.class);
Secondary Sorting
Motivation: Sort the values for each of the keys
Reminder: Keys are sorted by default
So now the values also need to be sorted
Input: John Smith, John Rambo, Gary Kirsten, John McMillan, John Andrews, Bill Rod, Tim Southee
After sorting: Bill Rod, Gary Kirsten, John Andrews, John McMillan, John Rambo, John Smith, Tim Southee
Mapper
Output :
(Gary#Kirsten:Kirsten) (John#Rambo:Rambo) (John#Smith:Smith)
@Override
public int compare(WritableComparable o1, WritableComparable o2) {
Employee e1 = (Employee) o1;
Employee e2 = (Employee) o2;
int cmp = e1.getFirstName().compareTo(e2.getFirstName());
if(cmp != 0) {
return cmp;
}
return e1.getLastName().compareTo(e2.getLastName());
}
@Override
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
..
}
Steps: Custom Partitioner
Mappers emit <Custom Key, value>
All the keys having the same natural key (first name) must
go to the same reducer. So, create a custom Partitioner
which partitions on the natural key.
public class SecondarySortPartitioner extends Partitioner<Employee,Text>
{
@Override
public int getPartition(Employee emp, Text lastName, int
numOfReducer) {
return (emp.getFirstName().hashCode() & Integer.MAX_VALUE ) %
numOfReducer;
}
}
Steps: Grouping Comparator
Output from the mappers gets collected at the reducer
The output is sorted on the composite key using the comparator
defined earlier
Example input for reducer phase
@Override
public int compare(WritableComparable o1, WritableComparable o2) {
Employee e1 = (Employee) o1;
Employee e2 = (Employee) o2;
return e1.getFirstName().compareTo(e2.getFirstName());
}
Steps: Driver class
How to make use of these pieces in the MR job:
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setMapperClass(SecondarySortMapper.class);
job.setReducerClass(SecondarySortReducer.class);
job.setMapOutputKeyClass(Employee.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setPartitionerClass(SecondarySortPartitioner.class);
job.setSortComparatorClass(SecondarySortComparator.class);
job.setGroupingComparatorClass(GroupingComparator.class);
Steps: Mapper
@Override
public void map(Text arg0, Text arg1, Context context) throws IOException {
}
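A possible shape for the missing map body, assuming KeyValueTextInputFormat
hands the mapper the first name as the key and the last name as the value, and
that Employee has a (String, String) constructor (both are assumptions):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SecondarySortMapper extends Mapper<Text, Text, Employee, Text> {
    @Override
    public void map(Text firstName, Text lastName, Context context)
            throws IOException, InterruptedException {
        // composite key carries the natural key (first name) plus the sort field (last name)
        Employee compositeKey = new Employee(firstName.toString(), lastName.toString()); // hypothetical constructor
        context.write(compositeKey, lastName);
    }
}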
Steps: Reducer
@Override
public void reduce(Employee emp, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    for (Text lastName : values) {
        context.write(new Text(emp.getFirstName().toString()), lastName);
    }
}
Hands On
Refer hands-on document
Joins
In this module you will learn
What are joins?
What is map side join?
What is reduce side join?
Hands on
What are joins?
Joining means joining two data sets on a common key
Output should be
Name ID Country
James 01 Australia
Siddharth 11 India
Suman 23 United States
Joins contd
MapReduce provides two types of join
Map Side join
Reduce Side join
Both kinds of join do the join, but they differ in the way it is
done
Map side join is done in the mapper phase. No reducer is
required
Reduce side join requires a reducer
Map Side Join
Use this join if one of the data sets you are joining can
fit into memory
In the previous example, if the CountryTable can fit into
memory, then do the joining during the mapper phase
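A minimal sketch of such a map-side join, assuming the small CountryTable has
been shipped via the distributed cache as comma-separated "id,country" lines
and the big dataset has "name,id" records (all of these details are
assumptions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Map<String, String> countryById = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        // load the small lookup table (shipped via the distributed cache) into memory
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split(",");          // "id,country"
            countryById.put(parts[0], parts[1]);
        }
        reader.close();
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(","); // "name,id"
        String country = countryById.get(fields[1]);   // in-memory lookup, no reducer needed
        if (country != null) {
            context.write(new Text(fields[0]), new Text(fields[1] + "," + country));
        }
    }
}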
Map Side join contd
This join is faster
The hash table is in memory, and for every record just a lookup
is required
Use when the dataset is small (can fit in memory)
It is not scalable
Memory is a constraint
What if the CountryTable size is 100GB?
Reduce side join
Use case : Joining two datasets sharing one common
field in reducer phase.
Motivation : Map side join relies heavily on memory,
so when datasets are very large, Reduce side join is the
only option.
Example:
Employee dataset - mapper M1 emits: { 13 : "42 John 13" }
Location dataset - mapper M2 emits: { 13 : "New York" }
Now, join the data in Reducer (all values for the same key will be
passed to the same reducer)
Is there any problem here ?
Reduce side join contd..
Problem: For larger datasets, one value needs to be
joined with several other values (and the values are not
sorted!). Ex: location with employee
If you have 1000 blocks then 1000 map tasks will run
If you have 50 machines and each machine is dual core,
then a total of 100 (50 machines * 2 cores) tasks can run in
parallel
Scheduling 1000 maps will be time consuming
You can merge the files and make them into bigger files
Can run more than one reducer, but the output will not
Compression
Compressing the intermediate key value pairs from the
mapper
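A minimal sketch of turning this on in the driver, using the Hadoop 1.x
property names (the gzip codec is just an example choice):

// in the driver, before submitting the job
Configuration conf = job.getConfiguration();
conf.setBoolean("mapred.compress.map.output", true);                    // compress the map output
conf.setClass("mapred.map.output.compression.codec",
              GzipCodec.class, CompressionCodec.class);                 // codec to use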
Requirement:
Need for a higher-level abstraction on top of MapReduce
To avoid dealing with the low-level stuff involved in MapReduce
In this module you will learn
What is the motivation for Hive and Pig?
Hive basics
Pig basics
Hive
Data warehousing tool on top of Hadoop
hive> SHOW DATABASES;
Tuple:
Rows / Records
Bags:
Unordered Collection of tuples
Relation:
The result of a Pig operator
Generally the output of an operator is stored in a relation
Note: when you are defining the column names, you need to specify the
data types of the columns, else NULL will be stored
For doing a left outer join, use the LEFT OUTER keyword
Joins contd
DESCRIBE
EXPLAIN
ILLUSTRATE
REGISTER
User Defined Functions (UDF)
DEFINE
DataMining: Top 10 English sites from wiki data