What is Hadoop Streaming?

Hadoop MapReduce was originally built for Java, which limited its accessibility to developers who work in other languages. To address this, Hadoop introduced Streaming, a utility that enables writing MapReduce programs in any language that can read standard input and write standard output, such as Python, Bash or Perl.

Hadoop Streaming, available since version 0.14.1, allows external scripts to be used as Mapper and Reducer tasks. These scripts process input from STDIN and produce output to STDOUT, enabling non-Java programs to participate fully in Hadoop’s distributed data processing.

Use Cases of Hadoop Streaming

Suitable for developers preferring Python, Perl, Bash or other non-Java languages
Enables reuse of existing legacy scripts in MapReduce workflows
Facilitates rapid prototyping of data processing tasks using simple scripts
Supports development of custom mappers and reducers for non-standard or binary data formats

Data Flow in Hadoop Streaming

Hadoop Streaming processes key-value pairs through external mapper and reducer scripts using standard input and output.

Let’s walk through how data flows in a typical Hadoop Streaming job:

1. Input Reader / Format

Hadoop reads the input data using an InputFormat class.

The data is split into <key, value> pairs.
These pairs are passed to the Mapper script.

2. Mapper Stream

Each input pair is sent to an external Mapper script via STDIN.
The script can be written in any language that supports standard input/output.
It processes the input and writes output to STDOUT in the form of intermediate <key, value> pairs.
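
As an illustration of the mapper stage, here is a minimal word-count mapper sketch in Python. The word-count task and the names map_line and main are assumptions chosen for the example, not part of Hadoop; the tab between key and value is the Streaming default separator.

```python
#!/usr/bin/env python3
"""Word-count mapper sketch for Hadoop Streaming (illustrative only)."""
import sys


def map_line(line):
    """Yield one (word, 1) pair per whitespace-separated token in the line."""
    for word in line.split():
        yield word.lower(), 1


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Input records arrive on STDIN, one per line.
    for line in stdin:
        for key, value in map_line(line):
            # Emit intermediate pairs as "key<TAB>value" lines on STDOUT.
            stdout.write(f"{key}\t{value}\n")


if __name__ == "__main__":
    main()
```

Keeping the per-line logic in map_line makes it possible to test the mapper without a cluster, for example by passing in-memory streams to main.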

3. Intermediate Key-Value Pairs

Hadoop automatically collects all intermediate output.
It shuffles, sorts and groups values by key across all Mappers.
Sorted data is now ready for Reducers.
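
The effect of this stage can be imitated on a single machine. The sketch below is only a local stand-in for what Hadoop does across the cluster; the function name shuffle_and_group is hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_group(pairs):
    """Sort (key, value) pairs by key and collect the values of each key."""
    # Sorting brings equal keys next to each other, as the shuffle/sort does.
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


if __name__ == "__main__":
    mapper_output = [("hadoop", 1), ("hello", 1), ("hadoop", 1)]
    for key, values in shuffle_and_group(mapper_output):
        print(key, values)
```

In a real job the pairs come from many mapper tasks on different nodes, and each reducer receives only the keys assigned to its partition.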

4. Reducer Stream

The grouped intermediate data is passed to an external Reducer script via STDIN.
The script receives the pairs as lines sorted by key, one pair per line, and must detect where the group for each key begins and ends.
The output, written via STDOUT, contains final <key, value> results.
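
A matching word-count reducer might look like the sketch below. Because Streaming delivers sorted "key<TAB>value" lines rather than ready-made lists of values, the script tracks the current key itself. The names reduce_lines and main are assumptions for illustration, and the code assumes well-formed input with integer values.

```python
#!/usr/bin/env python3
"""Word-count reducer sketch for Hadoop Streaming (illustrative only)."""
import sys


def reduce_lines(lines):
    """Sum the values of consecutive lines that share the same key."""
    current_key = None
    total = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition("\t")
        if key != current_key:
            # A new key starts: emit the finished group, then reset.
            if current_key is not None:
                yield current_key, total
            current_key = key
            total = 0
        total += int(value)
    # Emit the last group, if any input was seen.
    if current_key is not None:
        yield current_key, total


def main(stdin=sys.stdin, stdout=sys.stdout):
    for key, total in reduce_lines(stdin):
        stdout.write(f"{key}\t{total}\n")


if __name__ == "__main__":
    main()
```

This approach relies on the input being sorted by key; if the same key could appear in two separate runs, it would be emitted twice.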

5. Output Format

Final output pairs from the Reducer are collected by Hadoop.
They are written to HDFS using the configured OutputFormat (usually plain text files).

Running a Streaming Job in Hadoop

To run a Hadoop Streaming job, use the hadoop jar command along with the hadoop-streaming.jar file. This lets you plug in external scripts as the Mapper and Reducer, even if they are written in languages like Python, Bash or Perl.

Let’s see an example of how to run a streaming job using Python scripts:

hadoop jar /path/to/hadoop-streaming.jar \
-input /data/input.txt \
-output /data/output \
-mapper mymapper.py \
-reducer myreducer.py \
-file mymapper.py \
-file myreducer.py

What the Command Does

Runs a MapReduce job using external Python scripts.
-input: Specifies the input data stored in HDFS.
-output: Directory to store the final output (must not already exist).
-mapper: Python script used to process the input data.
-reducer: Python script used to process grouped key-value pairs.
-file: Uploads the mapper and reducer scripts to Hadoop nodes (newer Hadoop releases report -file as deprecated in favor of the generic -files option).

Internal Workflow

Hadoop passes each line of input to mymapper.py via STDIN.
The mapper emits key-value pairs using STDOUT.
Hadoop shuffles and groups the data by key.
Grouped data is passed to myreducer.py via STDIN.
The reducer outputs final results via STDOUT, which are written to /data/output.
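
Before submitting a job, the same workflow can be rehearsed locally. The sketch below chains a mapper, a sort and a reducer over plain text lines; the functions mapper, reducer and run_local are hypothetical stand-ins for mymapper.py, the shuffle phase and myreducer.py, and the word-count logic is an assumption for the example.

```python
from itertools import groupby


def mapper(lines):
    """Emit a "word<TAB>1" line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of adjacent lines that share a key."""
    keyed = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_local(lines):
    """Run map, sort and reduce in one process on a small input."""
    intermediate = sorted(mapper(lines))  # stands in for shuffle and sort
    return list(reducer(intermediate))


if __name__ == "__main__":
    for result in run_local(["b a", "a"]):
        print(result)
```

A local run like this only checks the script logic; it does not exercise input splits, partitioning or multiple reducers.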

Useful Hadoop Streaming Options

Option	Description
-input	Input path for the mapper
-output	Output path after reduce phase
-mapper	Command or script to run as the mapper
-reducer	Command or script to run as the reducer
-file	Upload mapper/reducer script to all compute nodes
-inputformat	Custom InputFormat class
-outputformat	Custom OutputFormat class
-partitioner	Defines how keys are divided among reducers
-combiner	Reduce logic applied locally to map output before the shuffle (local mini-reduce)
-verbose	Enables detailed logs
-numReduceTasks	Number of reducer tasks
-mapdebug, -reducedebug	Scripts to run on task failure (for debugging)
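
To show how several of these options combine, the sketch below assembles a streaming command line as a Python list. The helper build_streaming_command and its parameters are hypothetical; only the option names are taken from the table above, and the paths are the placeholders used earlier.

```python
def build_streaming_command(jar, input_path, output_path, mapper, reducer,
                            combiner=None, num_reduce_tasks=None,
                            verbose=False, files=()):
    """Return the argument list for a 'hadoop jar' streaming invocation."""
    cmd = [
        "hadoop", "jar", jar,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]
    if combiner is not None:
        cmd += ["-combiner", combiner]
    if num_reduce_tasks is not None:
        cmd += ["-numReduceTasks", str(num_reduce_tasks)]
    if verbose:
        cmd.append("-verbose")
    # Ship each listed script to the compute nodes.
    for path in files:
        cmd += ["-file", path]
    return cmd


if __name__ == "__main__":
    command = build_streaming_command(
        "/path/to/hadoop-streaming.jar", "/data/input.txt", "/data/output",
        "mymapper.py", "myreducer.py",
        combiner="myreducer.py", num_reduce_tasks=2,
        files=["mymapper.py", "myreducer.py"],
    )
    print(" ".join(command))
```

The resulting list could be handed to subprocess.run on a machine where the hadoop command is installed. Reusing the reducer as the combiner is valid here only because the example assumes an associative, commutative operation such as summing counts.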