Hadoop MapReduce was originally built for Java, which limited its accessibility to developers familiar with other languages. To address this, Hadoop introduced Streaming, a utility that lets you write MapReduce programs in any language that can read from standard input and write to standard output, such as Python, Bash or Perl.
Hadoop Streaming, available since version 0.14.1, allows external scripts to be used as Mapper and Reducer tasks. These scripts process input from STDIN and produce output to STDOUT, enabling non-Java programs to participate fully in Hadoop’s distributed data processing.
Use Cases of Hadoop Streaming
- Suitable for developers preferring Python, Perl, Bash or other non-Java languages
- Enables reuse of existing legacy scripts in MapReduce workflows
- Facilitates rapid prototyping of data processing tasks using simple scripts
- Supports development of custom mappers and reducers for non-standard or binary data formats
Data Flow in Hadoop Streaming
Hadoop Streaming processes key-value pairs through external mapper and reducer scripts using standard input and output. Let’s see how data flows through each stage in the diagram below.

Let’s walk through how data flows in a typical Hadoop Streaming job:
1. Input Reader / Format
- Hadoop reads the input data using the configured InputFormat class.
- The data is split into <key, value> pairs.
- These pairs are passed to the Mapper script.
2. Mapper Stream
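- Each input pair is sent to an external Mapper script via STDIN, one record per line. With the default text input format, the script receives only the content of each line, not its byte-offset key.
- The script can be written in any language that supports standard input/output.
- It processes the input and writes output to STDOUT in the form of intermediate <key, value> pairs. By default, the text before the first tab on each output line is the key and the rest is the value.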
3. Intermediate Key-Value Pairs
- Hadoop automatically collects all intermediate output.
- It shuffles the pairs across the cluster and sorts them by key, so that all values for a given key reach the same Reducer.
- The sorted data is then ready for the Reducers.
4. Reducer Stream
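- The sorted intermediate data is passed to an external Reducer script via STDIN, one key-value pair per line.
- The script processes each run of lines that share a key. Unlike a Java Reducer, it has to detect for itself where one key ends and the next begins.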
- The output, written via STDOUT, contains final <key, value> results.
5. Output Format
- Final output pairs from the Reducer are collected by Hadoop.
- They are written to HDFS using the configured OutputFormat (usually plain text files).
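
The five stages above can be imitated in a single Python process, which may help when reasoning about what each script sees. This is only a sketch under stated assumptions: the word-count logic and the function names (`read_input`, `mapper`, `shuffle_and_sort`, `reducer`, `run_job`) are illustrative and not part of Hadoop, and a plain `sorted` call stands in for the distributed shuffle.

```python
from itertools import groupby


def read_input(text):
    """Stage 1: mimic a text InputFormat, yielding (byte offset, line) pairs."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))


def mapper(line):
    """Stage 2: emit 'word<TAB>1' lines, as a script would print to STDOUT."""
    return [f"{word.lower()}\t1" for word in line.split()]


def shuffle_and_sort(records):
    """Stage 3: order intermediate lines by key (text before the first tab)."""
    return sorted(records, key=lambda record: record.split("\t", 1)[0])


def reducer(sorted_records):
    """Stage 4: sum the values in each run of lines that share a key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_job(text):
    """Stage 5 is represented by returning the final lines as a list."""
    intermediate = []
    for _offset, line in read_input(text):  # the mapper sees only the line
        intermediate.extend(mapper(line))
    return list(reducer(shuffle_and_sort(intermediate)))


if __name__ == "__main__":
    for line in run_job("Hadoop streams data\ndata flows through Hadoop\n"):
        print(line)
```

Because the reducer relies on identical keys being adjacent, skipping the sort step would give wrong totals for any key that appears in more than one place.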
Running a Streaming Job in Hadoop
To run a Hadoop Streaming job, use the hadoop jar command along with the hadoop-streaming.jar file. This lets you plug in external scripts as the Mapper and Reducer, even if they’re written in languages like Python, Bash or Perl.
Let’s see an example of how to run a streaming job using Python scripts:
hadoop jar /path/to/hadoop-streaming.jar \
-input /data/input.txt \
-output /data/output \
-mapper mymapper.py \
-reducer myreducer.py \
-file mymapper.py \
-file myreducer.py
What the Command Does
- Runs a MapReduce job using external Python scripts.
- -input: Specifies the input data stored in HDFS.
- -output: Directory to store the final output (must not already exist).
- -mapper: Python script used to process the input data.
- -reducer: Python script used to process grouped key-value pairs.
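- -file: Ships the mapper and reducer scripts to the Hadoop nodes that run the tasks. Recent Hadoop releases report -file as deprecated in favor of the generic -files option.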
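
The article does not show the contents of mymapper.py, so the following is a hypothetical version that counts words. It is a minimal sketch, assuming Python 3 is available on the worker nodes and that the script is executable; the shebang line is what allows it to be named directly in -mapper, and `map_line` is an illustrative helper name.

```python
#!/usr/bin/env python3
"""Hypothetical mymapper.py: a word-count mapper for Hadoop Streaming."""
import sys


def map_line(line):
    """Return one 'word<TAB>1' record per whitespace-separated word."""
    return [f"{word.lower()}\t1" for word in line.split()]


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop Streaming feeds input records to the script line by line on STDIN.
    for line in stdin:
        for record in map_line(line):
            # Each STDOUT line is one intermediate pair: key, tab, value.
            stdout.write(record + "\n")


if __name__ == "__main__":
    main()
```

Keeping the per-line logic in its own function makes the mapper easy to unit-test without a cluster.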
Internal Workflow
- Hadoop passes each line of input to mymapper.py via STDIN.
- The mapper emits key-value pairs using STDOUT.
- Hadoop shuffles the pairs and sorts them by key.
- The sorted pairs are passed to myreducer.py via STDIN, one per line.
- The reducer outputs final results via STDOUT, which are written to /data/output.
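
A matching reducer might look like the sketch below. It is hypothetical, since the article does not list myreducer.py: it assumes the mapper emitted tab-separated `word<TAB>count` lines with integer counts, and it uses `itertools.groupby` to find where one key ends and the next begins, which only works because the input arrives sorted by key.

```python
#!/usr/bin/env python3
"""Hypothetical myreducer.py: sums counts per key for Hadoop Streaming."""
import sys
from itertools import groupby


def parse(line):
    """Split a 'key<TAB>value' line into (key, int(value))."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, int(value)


def reduce_lines(lines):
    """Yield 'key<TAB>total' for each run of consecutive identical keys."""
    pairs = (parse(line) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(count for _, count in group)}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Input lines are already sorted by key when they reach the reducer.
    for record in reduce_lines(stdin):
        stdout.write(record + "\n")


if __name__ == "__main__":
    main()
```

Before submitting a job, scripts like these can usually be checked locally by piping a sample file through the mapper, the Unix sort command, and the reducer in turn, which approximates the shuffle on a single machine.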
Useful Hadoop Streaming Options
| Option | Description |
|---|---|
| -input | Input path for the mapper |
| -output | Output path after reduce phase |
| -mapper | Command or script to run as the mapper |
| -reducer | Command or script to run as the reducer |
| -file | Upload mapper/reducer script to all compute nodes |
| -inputformat | Custom InputFormat class |
| -outputformat | Custom OutputFormat class |
| -partitioner | Defines how keys are divided among reducers |
| -combiner | Reduce logic applied to each mapper's output before the shuffle (a local mini-reduce) |
| -verbose | Enables detailed logs |
| -numReduceTasks | Number of reducer tasks |
| -mapdebug, -reducedebug | Scripts to run on task failure (for debugging) |
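
To show how several of these options combine, the sketch below assembles a command line in Python. The helper `build_streaming_command` is hypothetical and not part of Hadoop; it only uses flags listed in the table, and reusing the reducer script as the combiner is an assumption that holds for associative operations such as summing counts.

```python
import shlex


def build_streaming_command(jar, input_path, output_path, mapper, reducer,
                            files=(), combiner=None, num_reduce_tasks=None):
    """Assemble a 'hadoop jar ...' argument list from the options above."""
    cmd = ["hadoop", "jar", jar,
           "-input", input_path,
           "-output", output_path,
           "-mapper", mapper,
           "-reducer", reducer]
    if combiner is not None:
        cmd += ["-combiner", combiner]
    if num_reduce_tasks is not None:
        cmd += ["-numReduceTasks", str(num_reduce_tasks)]
    for path in files:
        cmd += ["-file", path]  # one -file flag per script to ship
    return cmd


cmd = build_streaming_command(
    jar="/path/to/hadoop-streaming.jar",
    input_path="/data/input.txt",
    output_path="/data/output",
    mapper="mymapper.py",
    reducer="myreducer.py",
    files=["mymapper.py", "myreducer.py"],
    combiner="myreducer.py",   # assumption: summing is safe to pre-aggregate
    num_reduce_tasks=2,
)
print(shlex.join(cmd))  # shlex.join requires Python 3.8 or later
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine where the hadoop command is installed.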