Rupal Choudhary

Implementation of Map-Side Join and Reduce-Side Join in Pig

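Map-side join:
A map-side join can be used to join the outputs of several jobs that had the same number of reducers, the same keys, and output files that are not splittable, for example because each file is no bigger than one HDFS block or is compressed with a non-splittable codec such as gzip. Each input must also be sorted by the join key, so that a mapper can merge the matching partitions directly, without a shuffle or reduce phase. The org.apache.hadoop.mapred.join.CompositeInputFormat class can be used to run such a join.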
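The merge that each mapper performs can be sketched in plain Python. This is only a simulation of the idea under stated assumptions, not Hadoop or CompositeInputFormat code: the function names (merge_join, map_side_join) and the sample data are hypothetical, and each inner list stands in for one sorted, non-splittable output file.

```python
from typing import Iterator, List, Tuple

Record = Tuple[str, str]  # (join key, value)


def merge_join(left: List[Record], right: List[Record]) -> Iterator[Tuple[str, str, str]]:
    """Inner-join two partitions that are already sorted by key, in one forward pass."""
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of right-hand records that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Pair every left-hand record for the key with that run.
            while i < len(left) and left[i][0] == lkey:
                for k in range(j, j_end):
                    yield (lkey, left[i][1], right[k][1])
                i += 1
            j = j_end


def map_side_join(left_parts: List[List[Record]], right_parts: List[List[Record]]):
    """Simulate one map task per pair of matching partitions; no shuffle is involved."""
    if len(left_parts) != len(right_parts):
        raise ValueError("both inputs must have the same number of partitions")
    out = []
    for lpart, rpart in zip(left_parts, right_parts):
        out.extend(merge_join(lpart, rpart))
    return out


# Hypothetical sample data: both datasets were written by two "reducers" with the
# same partitioner (even keys in partition 0, odd keys in partition 1) and are
# sorted by key within each partition.
users_parts = [[("2", "bob"), ("4", "dan")], [("1", "alice"), ("3", "carol")]]
orders_parts = [[("2", "book"), ("2", "pen")], [("1", "lamp"), ("5", "desk")]]

joined = map_side_join(users_parts, orders_parts)
```

The sketch only works because partition n of one dataset holds exactly the same key range as partition n of the other, which is what the "same number of reducers, same keys" requirement provides.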

Reduce-side join:
Reduce-side joins are simpler than map-side joins because the input datasets do not need to be structured in any particular way. They are less efficient, however, because both datasets have to go through the MapReduce shuffle phase, in which the records with the same key are brought together in the reducer. The secondary sort technique can also be used to control the order of the records.

*The map output key for each dataset being joined has to be the join key, so that matching records reach the same reducer.
*Each dataset has to be tagged with its identity in the mapper, so that the reducer can tell the datasets apart and process them accordingly.
*In each reducer, the values from both datasets for the keys assigned to that reducer are available to be processed as required.
*A secondary sort needs to be done to guarantee the order in which the values are sent to the reducer.
*If the input files are in different formats, separate mappers are needed, and the MultipleInputs class has to be used in the driver to add each input and associate it with its own mapper.
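
The points above can be sketched as a small plain-Python simulation. It is not Hadoop API code: the two mapper functions stand in for the separate mappers that MultipleInputs would register, the crc32-based partitioner stands in for Hadoop's partitioner, and all names, tags and sample records are hypothetical.

```python
import zlib
from itertools import groupby

USERS, ORDERS = 0, 1  # hypothetical dataset tags; the lower tag sorts first


def map_users(line):
    """Mapper for the comma-separated users file: user_id,name."""
    user_id, name = line.split(",")
    return ((user_id, USERS), name)


def map_orders(line):
    """Mapper for the tab-separated orders file: order_id, user_id, item."""
    _order_id, user_id, item = line.split("\t")
    return ((user_id, ORDERS), item)


def shuffle(pairs, num_reducers=2):
    """Partition on the join key only, then sort each partition by (join key, tag)."""
    partitions = [[] for _ in range(num_reducers)]
    for (join_key, tag), value in pairs:
        index = zlib.crc32(join_key.encode()) % num_reducers
        partitions[index].append(((join_key, tag), value))
    for part in partitions:
        # Secondary sort: within a join key, user records come before order records.
        part.sort(key=lambda kv: kv[0])
    return partitions


def reduce_partition(part):
    """Group by join key and pair every user record with every order record."""
    out = []
    for join_key, group in groupby(part, key=lambda kv: kv[0][0]):
        names = []
        for (_, tag), value in group:
            if tag == USERS:
                names.append(value)  # buffered; guaranteed to arrive first
            else:
                out.extend((join_key, name, value) for name in names)
    return out


# Hypothetical sample input in two different formats.
user_lines = ["1,alice", "2,bob", "3,carol"]
order_lines = ["o1\t2\tbook", "o2\t1\tlamp", "o3\t2\tpen", "o4\t5\tdesk"]

pairs = [map_users(l) for l in user_lines] + [map_orders(l) for l in order_lines]
joined = [row for part in shuffle(pairs) for row in reduce_partition(part)]
```

Because the tag is part of the sort key but not of the partitioning or grouping key, the reducer only has to buffer the user records for a key, and the order records can be streamed past them.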