- March 27, 2017 at 10:55 am #3087
Generally, the MapReduce paradigm is based on sending the map and reduce programs to the computers where the actual data resides, rather than moving the data to the program.
During a MapReduce job, Hadoop sends Map and Reduce tasks to appropriate servers in the cluster.
The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes that hold the data on their local disks, which reduces network traffic.
After completing a given task, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
Inputs and Outputs (Java Perspective)
The MapReduce framework operates on key-value pairs; that is, the framework views the input to the job as a set of key-value pairs and produces a set of key-value pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework, and hence they are required to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
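To make the Writable requirement concrete, here is a minimal sketch of a custom key. So that it compiles without Hadoop on the classpath, it declares local stand-in interfaces that mirror the method signatures of Hadoop's org.apache.hadoop.io.Writable and WritableComparable; in a real job you would import the Hadoop interfaces instead. The YearStationKey class, its fields, and the roundTrip helper are hypothetical names made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;

// Local stand-ins (assumption): same method shapes as Hadoop's Writable
// and WritableComparable, declared here only to keep the sketch self-contained.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

interface WritableComparable<T> extends Writable, Comparable<T> {}

// Hypothetical composite key: a year plus a station id.
class YearStationKey implements WritableComparable<YearStationKey> {
    int year;
    String station = "";

    // No-arg constructor so an empty instance can be created before readFields.
    YearStationKey() {}

    YearStationKey(int year, String station) {
        this.year = year;
        this.station = station;
    }

    // Serialize the fields in a fixed order.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeUTF(station);
    }

    // Deserialize the fields in the same order they were written.
    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        station = in.readUTF();
    }

    // Sort order used for keys: by year first, then by station id.
    @Override
    public int compareTo(YearStationKey other) {
        int byYear = Integer.compare(year, other.year);
        return byYear != 0 ? byYear : station.compareTo(other.station);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof YearStationKey)) return false;
        YearStationKey other = (YearStationKey) o;
        return year == other.year && station.equals(other.station);
    }

    @Override
    public int hashCode() {
        return Objects.hash(year, station);
    }

    // Serialize a key to bytes and read it back into a fresh instance.
    static YearStationKey roundTrip(YearStationKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            YearStationKey copy = new YearStationKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class WritableDemo {
    public static void main(String[] args) {
        YearStationKey original = new YearStationKey(2017, "ST-01");
        YearStationKey copy = YearStationKey.roundTrip(original);
        System.out.println(copy.equals(original));
        System.out.println(original.compareTo(new YearStationKey(2018, "ST-00")) < 0);
    }
}
```

The write and readFields methods must handle the fields in the same order, and compareTo defines the order in which keys are sorted. A value class would only need the Writable part.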
Both the input and output format of a MapReduce job are in the form of key-value pairs:
(Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).
Map: input <k1, v1>, output list(<k2, v2>)
Reduce: input <k2, list(v2)>, output list(<k3, v3>)
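The type flow above can be sketched in plain Java without a cluster. This is only an in-memory illustration, not Hadoop's Mapper/Reducer API: the MapFn and ReduceFn interfaces, the MiniMapReduce class, and the word-count example are all hypothetical names chosen for the sketch. It assumes k2 is Comparable so a TreeMap can stand in for the framework's sort-and-group step.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Map: <k1, v1> -> list(<k2, v2>)   (hypothetical interface, not Hadoop's Mapper)
interface MapFn<K1, V1, K2, V2> {
    List<Map.Entry<K2, V2>> map(K1 key, V1 value);
}

// Reduce: <k2, list(v2)> -> list(<k3, v3>)   (hypothetical interface, not Hadoop's Reducer)
interface ReduceFn<K2, V2, K3, V3> {
    List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);
}

class MiniMapReduce {
    // Runs map on every input pair, groups the intermediate pairs by k2
    // in sorted key order, then runs reduce once per distinct k2.
    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<Map.Entry<K3, V3>> run(
            List<Map.Entry<K1, V1>> input,
            MapFn<K1, V1, K2, V2> mapper,
            ReduceFn<K2, V2, K3, V3> reducer) {
        Map<K2, List<V2>> groups = new TreeMap<>();
        for (Map.Entry<K1, V1> record : input) {
            for (Map.Entry<K2, V2> pair : mapper.map(record.getKey(), record.getValue())) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        List<Map.Entry<K3, V3>> output = new ArrayList<>();
        for (Map.Entry<K2, List<V2>> group : groups.entrySet()) {
            output.addAll(reducer.reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    // Word count: <offset, line> -> <word, 1> -> <word, total>
    static List<Map.Entry<String, Integer>> wordCount(List<String> lines) {
        List<Map.Entry<Long, String>> input = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            input.add(new SimpleEntry<>(offset, line));
            offset += line.length() + 1;
        }
        MapFn<Long, String, String, Integer> mapper = (key, line) -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(new SimpleEntry<>(word, 1));
                }
            }
            return out;
        };
        ReduceFn<String, Integer, String, Integer> reducer = (word, counts) -> {
            int sum = 0;
            for (int c : counts) {
                sum += c;
            }
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            out.add(new SimpleEntry<>(word, sum));
            return out;
        };
        return run(input, mapper, reducer);
    }
}

class MapReduceDemo {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("deer bear river", "car car river", "deer car bear");
        // prints [bear=2, car=3, deer=2, river=2]
        System.out.println(MiniMapReduce.wordCount(lines));
    }
}
```

Here k1 is a Long offset, v1 is a line of text, k2 and k3 are words, and v2 and v3 are counts, which shows how the input and output types of a job can differ.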