This project is a custom implementation of the MapReduce framework. MapReduce is a programming model that facilitates parallel and distributed processing of very large data sets. Hadoop is the most popular implementation of MapReduce.
If your task was to count the frequency of each unique word in a 5 GB file using a single computer, how much time would it require? The MapReduce framework allows us to leverage multiple computing nodes to accomplish this much, MUCH faster.
If you want to briefly understand the underlying concepts of this framework, you may go through this article.
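To make the model concrete, here is a minimal single-process sketch of the word-count task in C++. The function names `map_phase` and `reduce_phase` are illustrative only, not this framework's actual API; in the real system the two phases run on separate Mapper and Reducer Nodes.

```cpp
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Map phase: emit a (word, 1) pair for every word in the assigned chunk.
std::vector<std::pair<std::string, int>> map_phase(const std::string& chunk) {
    std::vector<std::pair<std::string, int>> pairs;
    std::istringstream in(chunk);
    std::string word;
    while (in >> word) pairs.emplace_back(word, 1);
    return pairs;
}

// Reduce phase: sum the counts emitted for each distinct word.
std::map<std::string, int> reduce_phase(
        const std::vector<std::pair<std::string, int>>& pairs) {
    std::map<std::string, int> counts;
    for (const auto& p : pairs) counts[p.first] += p.second;
    return counts;
}

int main() {
    auto pairs = map_phase("the quick fox the fox");
    for (const auto& kv : reduce_phase(pairs))
        std::cout << kv.first << ": " << kv.second << '\n';
    // prints: fox: 2, quick: 1, the: 2
}
```

The point of the split is that `map_phase` can run on many nodes at once, each over its own chunk, while `reduce_phase` only needs the intermediate pairs, not the original file.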
(0) Client uploads input file(s) to FileServer
(1) Client initiates task request to Master
(2) Master accesses meta data of the input file from the FileServer
(3) Master assigns chunks of input file(s) to Mapper Nodes (a sketch of this chunk split appears after this list)
(4) Mapper Node(s) download only the part(s) of the input file(s) assigned to them
(5) Mapper Nodes finish processing and upload their resulting files to FileServer
(6) Mapper Node(s) inform Master that their tasks are done and the resulting files are uploaded to the FileServer
(7) Master assigns tasks to Reducer Node(s)
(8) Reducer Node(s) download files relevant to the assigned task
(9) Reducer Node(s) upload their resulting files to FileServer
(10) Reducer Node(s) inform Master that their tasks are done
(At this point Master assigns one Reducer Node the task of aggregating the resulting files of all Reducer Nodes)
(11) Master informs the client that processing is done and the output file is present on the FileServer
(12) Client downloads output file from the FileServer
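To illustrate step (3), below is a rough sketch of how the Master might split an input file evenly across mappers by byte ranges. The `ChunkAssignment` struct and `split_input` function are assumptions for illustration, not the repository's actual types.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical description of one mapper's assignment (illustrative only;
// the repository's actual message format may differ).
struct ChunkAssignment {
    std::string file_name;  // input file on the FileServer
    std::size_t offset;     // byte offset where this mapper starts reading
    std::size_t length;     // number of bytes assigned to this mapper
};

// Split an input file of file_size bytes evenly across num_mappers,
// giving the remainder bytes to the last mapper, as in step (3).
std::vector<ChunkAssignment> split_input(const std::string& file_name,
                                         std::size_t file_size,
                                         std::size_t num_mappers) {
    std::vector<ChunkAssignment> chunks;
    std::size_t base = file_size / num_mappers;
    std::size_t offset = 0;
    for (std::size_t i = 0; i < num_mappers; ++i) {
        std::size_t len = (i + 1 == num_mappers) ? file_size - offset : base;
        chunks.push_back({file_name, offset, len});
        offset += len;
    }
    return chunks;
}

int main() {
    for (const auto& c : split_input("input.txt", 1000, 3))
        std::cout << c.file_name << " offset=" << c.offset
                  << " length=" << c.length << '\n';
    // input.txt offset=0 length=333
    // input.txt offset=333 length=333
    // input.txt offset=666 length=334
}
```

A production splitter would additionally align chunk boundaries to record boundaries (e.g. newlines) so that no word is cut in half between two mappers.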
make all                                  # build all binaries
./fs_server                               # start the FileServer
./master_server <master_IP> <log_file>    # start the Master
./mapper_node <master_IP> <mapper_IP>     # start a Mapper Node (repeat per mapper)
./reducer_node <master_IP> <reducer_IP>   # start a Reducer Node (repeat per reducer)
./dummy_client <master_IP>                # run the sample client
Spawn as many mappers/reducers as needed. Two sample tasks are implemented: word count and inverted document index. The framework implementation can be tested on these two tasks; the syntax for invoking them can be found in the dummy_client.cpp file.
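For reference, an inverted document index maps each term to the set of documents that contain it. The sketch below shows the reduce-side idea only; the `invert` helper and its types are illustrative, not this repository's implementation.

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Reduce-side idea of an inverted index: collect, for each word,
// the set of documents the mappers saw it in.
std::map<std::string, std::set<std::string>> invert(
        const std::vector<std::pair<std::string, std::string>>& pairs) {
    std::map<std::string, std::set<std::string>> index;
    for (const auto& p : pairs) index[p.first].insert(p.second);
    return index;
}

int main() {
    auto index = invert({{"fox", "doc1"}, {"fox", "doc3"}, {"quick", "doc1"}});
    for (const auto& kv : index) {
        std::cout << kv.first << ":";
        for (const auto& doc : kv.second) std::cout << ' ' << doc;
        std::cout << '\n';
    }
    // fox: doc1 doc3
    // quick: doc1
}
```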