April 2, 2017 at 9:39 pm #3137
MD SAJID AKHTAR (Participant)
Differences between Old API and New API
–> The new API defines Mapper and Reducer as classes, whereas the old API defined them as interfaces.
–> The old API lives in the org.apache.hadoop.mapred package, while the new API is in org.apache.hadoop.mapreduce.
–> In the new API, the reduce() method receives its values as a java.lang.Iterable, whereas in the old API the same method receives a java.util.Iterator.
–> In the old API, job control was done through JobClient. In the new API, it is done through the Job class.
April 3, 2017 at 1:23 pm #3152
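On the Iterable versus Iterator difference listed in the API comparison above: the practical effect is mostly in how the reduce() body loops over its values. The following is only a sketch in plain Java. ReduceStyles, sumOldStyle and sumNewStyle are made-up names for illustration and are not the Hadoop classes or signatures.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for the two reduce() styles; NOT the real Hadoop API.
class ReduceStyles {

    // Old-API style: values arrive as a java.util.Iterator, so an explicit while loop is needed.
    static int sumOldStyle(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // New-API style: values arrive as a java.lang.Iterable, so a for-each loop works.
    static int sumNewStyle(Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> counts = Arrays.asList(1, 1, 1);
        System.out.println(sumOldStyle(counts.iterator())); // prints 3
        System.out.println(sumNewStyle(counts));            // prints 3
    }
}
```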
Limitations of Map Reduce
Here are some cases where MapReduce is not a good fit:
1. Iteration: when the same data has to be processed again and again.
2. When a fast response is needed: MapReduce is batch-oriented, so producing the output takes time.
3. Not every problem is easy to express as an MR program.
4. OLTP workloads: MR is not suitable for a large number of short online transactions.
April 3, 2017 at 1:33 pm #3153
Describe the Combiner class
The Combiner class is used to reduce the size of the map output. It is an optional class, used for performance tuning. It runs on the nodes where the map tasks run. It is supplied by the developer, and the driver class sets it on the job.
April 4, 2017 at 9:54 am #3164
1) Mapper maps input key/value pairs to a set of intermediate key/value pairs.
2) Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.
3) The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
4) Output pairs are collected with calls to context.write(WritableComparable, Writable).
5) Applications can use Counters to report their statistics.
6) All intermediate values associated with a given output key are subsequently grouped by the framework and passed to the Reducer to determine the final output. Users can control the grouping by specifying a Comparator via Job.setGroupingComparatorClass(Class).
April 4, 2017 at 11:24 am #3166
Write the difference between a logical split and a physical split
Physical split = it is permanent and happens on storage (the HDFS blocks a file is stored in).
Logical split = it is temporary and happens during processing (the InputSplits handed to the map tasks).
April 4, 2017 at 11:27 am #3167
What is the combiner class?
The combiner class runs between the mapper class and the reducer class to reduce the volume of data transferred between map and reduce.
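A rough way to see the saving is to count records before and after a local combine step. The sketch below simulates a word-count mapper and combiner in plain Java. CombinerSketch, map and combine are hypothetical names, and nothing here uses the actual Hadoop API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of a word-count combiner; names are illustrative only.
class CombinerSketch {

    // Map step: emit one (word, 1) pair for every word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Combine step: sum the counts per word locally, before anything is shuffled.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapped = map("to be or not to be");
        Map<String, Integer> combined = combine(mapped);
        System.out.println(mapped.size());   // prints 6: records without a combiner
        System.out.println(combined.size()); // prints 4: records after combining
    }
}
```

The sum used here is safe to apply locally because addition is associative and commutative. An operation without those properties could not be combined this way.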
It is an optional class used for performance tuning.
April 5, 2017 at 11:27 am #3172
Compilation stages of Pig: logical plan and physical plan.
Logical and physical plans are created during the execution of a Pig script. Pig scripts are checked by an interpreter. The logical plan is produced after basic parsing and semantic checking, and no data processing takes place while it is created. For each line in the Pig script, a syntax check is performed on the operators and a logical plan is created. If an error is encountered in the script, an exception is thrown and execution ends; otherwise, each statement in the script gets its own logical plan.
A logical plan contains the collection of operators in the script but does not contain the edges between the operators.
After the logical plan is generated, execution moves to the physical plan, which describes the physical operators Apache Pig will use to execute the script. A physical plan is more or less a series of MapReduce jobs, but it carries no reference to how it will be executed in MapReduce. During the creation of the physical plan, the cogroup logical operator is converted into three physical operators: Local Rearrange, Global Rearrange, and Package. Load and store functions usually get resolved in the physical plan.
April 5, 2017 at 11:42 am #3177
Performance tuning in Pig
Some approaches to performance tuning in Pig are listed below:
1) When there are a lot of small input files, smaller than the default data split size, enable the combination of data splits and set the minimum size of the combined data splits to a multiple of the default data split size, usually 256 or 512MB or even more.
2) Generally, the mappers-to-reducers ratio is about 4:1. A Pig job with 400 mappers usually should not have more than 100 reducers. If your job has combiners implemented (using group … by and foreach together without other operators in between), that ratio can be as high as 10:1; that is, for 100 mappers, 10 reducers are sufficient.
3) On the job history or JobTracker portal, check the actual execution time of the MapReduce tasks. If the majority of completed mappers or reducers were very short-lived, say under 10-15 seconds, and if the reducers ran even faster than the mappers, that is a sign that we used too many reducers and/or mappers.
4) In the job’s output directory in HDFS, check the number of output part-* files. If that number is greater than our planned reducer number (each reducer is supposed to generate one output file), our planned reducer number somehow got overridden and we used more reducers than we meant to.
5) Check the size of the output part-* files. If the planned reducer number is 50 but some of the output files are empty, that is a good indicator that we over-allocated reducers for the job and wasted the cluster’s resources.
April 5, 2017 at 11:49 am #3183
Implementation of map-side join and reduce-side join in Pig.
Map side join:
A map-side join can be used to join the outputs of several jobs that had the same number of reducers, the same keys, and output files that are not splittable, which means the output files should not be bigger than the HDFS block size. This can be achieved with the org.apache.hadoop.mapred.join.CompositeInputFormat class.
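The reason those preconditions matter is that sorted, identically partitioned inputs can be joined by a simple merge inside the map task, with no shuffle. The sketch below shows that merge step in plain Java for a single pair of partitions. MergeJoinSketch and mergeJoin are hypothetical names, the sketch does not use CompositeInputFormat itself, and it assumes each key appears at most once per input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the merge behind a map-side join; names are illustrative only.
class MergeJoinSketch {

    // Inner-joins two inputs that are already sorted by key.
    // Each record is a String[]{key, value}; keys are assumed unique within an input.
    static List<String> mergeJoin(List<String[]> left, List<String[]> right) {
        List<String> joined = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp == 0) {
                // Same key on both sides: emit the joined record and advance both.
                joined.add(left.get(i)[0] + "\t" + left.get(i)[1] + "\t" + right.get(j)[1]);
                i++;
                j++;
            } else if (cmp < 0) {
                i++; // left key is smaller, so it has no match on the right
            } else {
                j++; // right key is smaller, so it has no match on the left
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(
                new String[]{"1", "alice"}, new String[]{"2", "bob"}, new String[]{"4", "dave"});
        List<String[]> orders = Arrays.asList(
                new String[]{"2", "book"}, new String[]{"3", "pen"}, new String[]{"4", "lamp"});
        for (String row : mergeJoin(users, orders)) {
            System.out.println(row); // keys 2 and 4 match
        }
    }
}
```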
Reduce side join:
Reduce-side joins are simpler than map-side joins because the input datasets do not need to be structured in any particular way. They are less efficient, however, because both datasets have to go through the MapReduce shuffle phase, where records with the same key are brought together in the reducer. We can also use the secondary sort technique to control the order of the records.
* The key of the map output, for each dataset being joined, has to be the join key, so that matching records reach the same reducer.
* Each dataset has to be tagged with its identity in the mapper, to help differentiate between the datasets in the reducer so they can be processed accordingly.
* In each reducer, the data values from both datasets, for the keys assigned to that reducer, are available to be processed as required.
* A secondary sort needs to be done to ensure the ordering of the values sent to the reducer.
* If the input files are of different formats, we need separate mappers, and we need to use the MultipleInputs class in the driver to add the inputs and associate each one with its specific mapper.
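
The tagging and grouping steps above can be sketched in plain Java. This is only a simulation under stated assumptions: ReduceSideJoinSketch and its methods are made-up names, a TreeMap stands in for the shuffle, and the tags "U" and "O" (for a hypothetical users dataset and orders dataset) are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join; not the Hadoop API.
class ReduceSideJoinSketch {

    // Mapper stand-in: key each record by the join key and tag the value with its source.
    static void mapTagged(String tag, String[][] records, Map<String, List<String>> shuffle) {
        for (String[] rec : records) {
            shuffle.computeIfAbsent(rec[0], k -> new ArrayList<>()).add(tag + ":" + rec[1]);
        }
    }

    // Reducer stand-in: split one key's values by tag, then pair every user with every order.
    static List<String> reduce(String key, List<String> values) {
        List<String> users = new ArrayList<>();
        List<String> orders = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("U:")) {
                users.add(v.substring(2));
            } else {
                orders.add(v.substring(2));
            }
        }
        List<String> out = new ArrayList<>();
        for (String u : users) {
            for (String o : orders) {
                out.add(key + "\t" + u + "\t" + o);
            }
        }
        return out;
    }

    // Runs both "mappers", groups by join key, then calls the reducer once per key.
    static List<String> join(String[][] users, String[][] orders) {
        Map<String, List<String>> shuffle = new TreeMap<>(); // keys sorted, as after a shuffle
        mapTagged("U", users, shuffle);
        mapTagged("O", orders, shuffle);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffle.entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] users = {{"1", "alice"}, {"2", "bob"}};
        String[][] orders = {{"2", "book"}, {"2", "pen"}, {"3", "lamp"}};
        for (String row : join(users, orders)) {
            System.out.println(row); // only key 2 appears in both datasets
        }
    }
}
```

This reducer buffers both sides of a key in memory. With a secondary sort that delivers one dataset's records first, only that side would need to be buffered.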