Rupal Choudhary

Performance tuning in Pig

Some solutions for performance tuning in Pig are described below:

1) When there are a lot of small input files, each smaller than the default data split size, you need to enable the combination of data splits and set the minimum size of the combined data splits to a multiple of the default data split size, usually 256 MB, 512 MB, or even more.
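As a sketch, these settings can go at the top of the Pig script; pig.splitCombination and pig.maxCombinedSplitSize are the actual Pig property names, while the 512 MB value is just one reasonable choice:

```pig
-- Combine many small input files into larger splits.
-- pig.splitCombination is on by default in recent Pig versions;
-- pig.maxCombinedSplitSize is given in bytes (here 512 MB).
SET pig.splitCombination true;
SET pig.maxCombinedSplitSize 536870912;
```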

2) Generally, the mappers-to-reducers ratio is about 4:1. A Pig job with 400 mappers usually should not have more than 100 reducers. If your job has combiners implemented (using group … by and foreach together, without other operators in between), that ratio can be as high as 10:1—that is, for 100 mappers, using 10 reducers is sufficient in that case.
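As a sketch, reducer parallelism can be set either globally or per operator; the relation names and the figure of 100 reducers below are hypothetical:

```pig
-- Target roughly a 4:1 mapper-to-reducer ratio, e.g. ~400 map tasks.
-- default_parallel applies to all reduce-side operators in the script.
SET default_parallel 100;

-- Alternatively, set parallelism per operator with the PARALLEL clause:
grpd = GROUP logs BY user_id PARALLEL 100;
```

The PARALLEL clause overrides default_parallel for that operator, which is useful when one GROUP produces far fewer keys than the others.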

3) On the job history or JobTracker portal, check the actual execution time of the MapReduce tasks. If the majority of completed mappers or reducers were very short-lived, like under 10-15 seconds, and especially if reducers ran even faster than mappers, that is a good indication that we have used too many reducers and/or mappers.
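The heuristic above can be sketched as a small check; the task durations, threshold, and fraction below are hypothetical (e.g. copied by hand from the job history UI):

```python
# Sketch: flag a job whose completed tasks were mostly short-lived
# (under ~15 seconds), a sign of over-allocated mappers or reducers.

def mostly_short_lived(durations_sec, threshold=15, fraction=0.5):
    """Return True if more than `fraction` of tasks ran under `threshold` seconds."""
    short = sum(1 for d in durations_sec if d < threshold)
    return short > fraction * len(durations_sec)

# Hypothetical reducer run times in seconds; most finished in well under 15 s,
# so the job likely asked for more reducers than it needed.
reducer_times = [6, 8, 5, 120, 7, 9]
print(mostly_short_lived(reducer_times))  # → True
```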

4) Check the number of output part-* files in the job’s output directory in HDFS. If that number is greater than our planned reducer number (each reducer is supposed to generate one output file), it indicates that our planned reducer number somehow got overridden and we have used more reducers than we meant to.
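This check can be sketched as follows; the local temporary directory is a hypothetical stand-in for the HDFS output directory (e.g. fetched with hadoop fs -get), and the planned reducer count is made up:

```python
# Sketch: compare the number of part-* output files against the planned
# reducer count; a mismatch means the parallelism setting was overridden.
import os
import tempfile

def count_part_files(path):
    """Count the part-* files in a job output directory."""
    return sum(1 for f in os.listdir(path) if f.startswith("part-"))

# Hypothetical local copy of the job output: three part files.
out_dir = tempfile.mkdtemp()
for name in ("part-r-00000", "part-r-00001", "part-r-00002"):
    open(os.path.join(out_dir, name), "w").close()

planned_reducers = 2  # hypothetical PARALLEL / default_parallel setting
actual = count_part_files(out_dir)
if actual > planned_reducers:
    print(f"{actual} part files but only {planned_reducers} reducers planned: "
          "the parallelism setting was overridden somewhere")
```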

5) Check the size of the output part-* files. If the planned reducer number is 50 but some of the output files are empty, that is a good indicator that we have over-allocated reducers for our job and wasted the cluster’s resources.
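A minimal sketch of this check, again using a hypothetical local copy of the output directory in place of HDFS:

```python
# Sketch: list zero-byte part-* files; any hits suggest that some reducers
# received no data, i.e. reducers were over-allocated.
import os
import tempfile

def empty_part_files(path):
    """Return the names of zero-byte part-* files under `path`."""
    return sorted(f for f in os.listdir(path)
                  if f.startswith("part-")
                  and os.path.getsize(os.path.join(path, f)) == 0)

# Hypothetical output: one reducer produced data, one produced nothing.
out_dir = tempfile.mkdtemp()
open(os.path.join(out_dir, "part-r-00000"), "w").close()   # empty output
with open(os.path.join(out_dir, "part-r-00001"), "w") as f:
    f.write("user1\t42\n")                                 # non-empty output

print(empty_part_files(out_dir))  # → ['part-r-00000']
```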