PIG question

  • #3212

    manideep

    1) When we run Pig in local mode, will it convert the query into MR or not?
    A: Yes, Pig still compiles the query into MapReduce jobs in local mode; they are simply executed on the local machine by Hadoop's LocalJobRunner instead of on a cluster.
    Note that local mode does not support parallel mapper execution with Hadoop 0.20.x and 1.0.0, because the LocalJobRunner in those Hadoop versions is not thread-safe.
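    As a quick illustration, the execution mode is selected with the -x flag when launching Pig; the script itself does not change. A minimal sketch, where students.txt and students.pig are hypothetical names used only for illustration:

    -- run locally (uses Hadoop's LocalJobRunner):  pig -x local students.pig
    -- run on the cluster:                          pig -x mapreduce students.pig

    -- students.pig: the same script compiles to MR jobs in either mode
    students = LOAD 'students.txt' USING PigStorage('\t') AS (name:chararray, marks:int);
    passed   = FILTER students BY marks >= 35;
    DUMP passed;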

    2) Limitations of Pig?
    – Storage layer limitation: Pig is a scripting language and does not have its own storage layer; it runs on Hadoop after being compiled into MR jobs, just like hand-written MapReduce jobs.
    – Pig runs as a dataflow pipeline, with no statement-level conditionals such as if..then (see the sketch after this list).
    – Slower than hand-written MapReduce, since it requires a compilation step.
    – Cannot handle unstructured data well.
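    On the conditionals point: Pig has no if..then statement for control flow, although the bincond operator (condition ? value1 : value2) can express a per-record conditional value inside an expression. A minimal sketch, with grades.txt as a hypothetical input:

    grades = LOAD 'grades.txt' AS (name:chararray, marks:int);
    -- no IF statement exists, but bincond yields a conditional value per record
    labelled = FOREACH grades GENERATE name, (marks >= 35 ? 'pass' : 'fail') AS result;
    DUMP labelled;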

    3) How to achieve performance tuning in Pig?
    a. Combiner
    The Pig combiner is an optimizer that is invoked when the statements in your scripts are arranged in certain ways. Whenever possible, make sure the combiner is used, as it frequently yields an order-of-magnitude improvement in performance (see the sketch after this list).
    b. Memory Management
    Pig allocates a fixed amount of memory to store bags and spills to disk as soon as the memory limit is reached. This is very similar to how Hadoop decides when to spill data accumulated by the combiner.
    c. Multi-Query Execution
    With multi-query execution, Pig processes an entire script or a batch of statements at once.
    d. Optimization Rules
    Pig supports various optimization rules. By default, optimization and all optimization rules are turned on. To turn off optimization, use: pig -optimizer_off [opt_rule | all]
    e. Performance Enhancers
    Pig supports various optimization rules, which are turned on by default.
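    A minimal sketch of a combiner-friendly script: a non-nested FOREACH placed directly after a GROUP, projecting only the group key and an algebraic function such as COUNT, lets Pig apply the combiner map-side (clicks.txt is a hypothetical input used only for illustration):

    clicks  = LOAD 'clicks.txt' AS (user:chararray, url:chararray);
    grouped = GROUP clicks BY user;
    -- COUNT is algebraic, so partial counts can be computed by the combiner
    counts  = FOREACH grouped GENERATE group AS user, COUNT(clicks) AS n;
    STORE counts INTO 'click_counts';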

    4) How to implement a map-side join or reduce-side join in Pig?
    Map-side join:
    A map-side join between large inputs works by performing the join before the data reaches the map function. For this to work, though, the inputs to each map must be partitioned and sorted in a particular way. Each input data set must be divided into the same number of partitions, and it must be sorted by the same key (the join key) in each source. All the records for a particular key must reside in the same partition. This may sound like a strict requirement (and it is), but it actually fits the description of the output of a MapReduce job.
    A map-side join can be used to join the outputs of several jobs that had the same number of reducers, the same keys, and output files that are not splittable, which means the output files should not be bigger than the HDFS block size. Using the org.apache.hadoop.mapred.join.CompositeInputFormat class we can achieve this.

    Reduce-side join:
    Reduce-side joins are simpler than map-side joins, since the input data sets need not be structured in any particular way. But they are less efficient, as both data sets have to go through the MapReduce shuffle phase; the records with the same key are brought together in the reducer. We can also use the secondary sort technique to control the order of the records. (See the Pig sketch after this answer.)
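    In Pig Latin itself, the default JOIN is a reduce-side (shuffle) join, while USING 'replicated' performs a map-side fragment-replicate join that loads the smaller, right-most input into memory on every mapper. A minimal sketch, assuming two hypothetical inputs orders.txt and customers.txt, with customers being the smaller one:

    orders    = LOAD 'orders.txt'    AS (cust_id:int, amount:double);
    customers = LOAD 'customers.txt' AS (cust_id:int, name:chararray);

    -- reduce-side join: both inputs are shuffled on the join key
    joined_rs = JOIN orders BY cust_id, customers BY cust_id;

    -- map-side (fragment-replicate) join: the second, smaller input is
    -- replicated into memory on each mapper, so no reduce phase is needed
    joined_ms = JOIN orders BY cust_id, customers BY cust_id USING 'replicated';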

    5) How does the physical translator work at the time of compilation of a Pig query?
    The Pig Latin program compilation pipeline is:
    Pig Latin program -> query parser -> semantic checking -> logical optimizer -> logical-to-physical translator -> physical-to-MR translator -> MapReduce launcher -> creates a job JAR to be submitted to the Hadoop cluster.
    When the coder loads the script file from the file system (local/HDFS), the interpreter checks each line of the code for operators. If any error is found, it ends the program; otherwise a logical plan is generated for the program. A logical plan is generated for each line of the script, so the overall plan grows considerably, because each statement has its own logical plan.
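    You can inspect these stages yourself: Pig's EXPLAIN operator prints the logical, physical, and MapReduce plans for a relation. A minimal sketch, with data.txt as a hypothetical input:

    data    = LOAD 'data.txt' AS (k:chararray, v:int);
    grouped = GROUP data BY k;
    sums    = FOREACH grouped GENERATE group, SUM(data.v);
    -- prints the logical plan, the physical plan, and the MapReduce plan
    EXPLAIN sums;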

    6) Compilation stage: optimized logical plan? physical plan?
    Pig goes through several steps when a Pig Latin script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan.

    *Optimized logical plan:
    The logical plan describes the logical operators that have to be executed by Pig; the logical optimizer then rewrites this plan (for example, pushing filters closer to the load) to produce the optimized logical plan.

    *Physical plan:
    The physical plan describes the physical operators that are needed to execute the script.

    #3297

    Roopa P Karkera

    1) When we run Pig in local mode, will it convert the query into MR or not?
    Yes, it will convert the query into MapReduce jobs.

    4) Limitations of Pig

    a) Pig is a scripting language and does not have its own storage layer; it runs on Hadoop after being compiled into MR jobs, just like hand-written MapReduce jobs.

    b) Not as effective as Spark when we feed data via JSON.

    c) Pig runs as a dataflow pipeline, with no statement-level conditionals such as if..then.

    5) Compilation stage
    Pig goes through several steps when a Pig Latin script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan.

    *Optimized logical plan:
    The logical plan describes the logical operators that have to be executed by Pig; the logical optimizer then rewrites this plan to produce the optimized logical plan.

    *Physical plan:
    The physical plan describes the physical operators that are needed to execute the script.

    6) How to achieve performance tuning in Pig?

    1. Combiner
    The Pig combiner is an optimizer that is invoked when the statements in your scripts are arranged in certain ways. The notes below describe when the combiner is and is not used. Whenever possible, make sure the combiner is used, as it frequently yields an order-of-magnitude improvement in performance.

    1.1. When the Combiner is Used
    The combiner is generally used in the case of a non-nested FOREACH where all projections are either expressions on the group column or expressions on algebraic UDFs.

    1.2. When the Combiner is Not Used
    The combiner is generally not used if there is any operator that comes between the GROUP and FOREACH statements in the execution plan. Even if the statements are next to each other in your script, the optimizer might rearrange them. (A sketch of both cases follows.)
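    A minimal sketch of both cases, with webhits.txt as a hypothetical input. In the first pipeline the FOREACH follows the GROUP directly, so the combiner can fire; in the second, another operator (a FILTER on the grouped bags, chosen here as an illustrative assumption) sits between them and generally blocks the combiner:

    hits  = LOAD 'webhits.txt' AS (url:chararray, bytes:long);
    byurl = GROUP hits BY url;

    -- combiner used: FOREACH directly follows GROUP, projecting only the
    -- group key and the algebraic UDF SUM
    totals = FOREACH byurl GENERATE group, SUM(hits.bytes);

    -- combiner generally NOT used: a FILTER on the grouped bags comes
    -- between the GROUP and the FOREACH in the execution plan
    big     = FILTER byurl BY COUNT(hits) > 100;
    totals2 = FOREACH big GENERATE group, SUM(hits.bytes);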

    2. Memory Management
    Pig allocates a fixed amount of memory to store bags and spills to disk as soon as the memory limit is reached. This is very similar to how Hadoop decides when to spill data accumulated by the combiner.
    The amount of memory allocated to bags is determined by pig.cachedbag.memusage; the default is set to 20% (0.2) of available memory. Note that this memory is shared across all large bags used by the application (see the sketch below).
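    The property can be set per script. A minimal sketch lowering the bag memory share to 10% (the value is an illustrative assumption, not a recommendation):

    -- give large bags 10% of available memory instead of the default 20%
    set pig.cachedbag.memusage '0.1';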

    3. Multi-Query Execution
    With multi-query execution, Pig processes an entire script or a batch of statements at once (see the sketch below).
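    A minimal sketch of where this pays off: both STORE statements below share the same LOAD, and multi-query execution lets Pig read the input once for both outputs (employees.txt and the output paths are hypothetical):

    emps   = LOAD 'employees.txt' AS (name:chararray, dept:chararray, salary:double);
    highly = FILTER emps BY salary > 100000.0;
    lowly  = FILTER emps BY salary <= 100000.0;
    -- with multi-query execution both branches are computed in one pass over emps
    STORE highly INTO 'out/high_paid';
    STORE lowly  INTO 'out/low_paid';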

    4. Optimization Rules
    Pig supports various optimization rules. By default, optimization and all optimization rules are turned on. To turn off optimization, use:

    pig -optimizer_off [opt_rule | all]

    5. Performance Enhancers
    5.1. Use Optimization
    Pig supports various optimization rules, which are turned on by default.

    Reference : https://pig.apache.org/docs/r0.9.1/perf.pdf

    7) How to implement a map-side join or reduce-side join in Pig?
    Map-side join

    A map-side join between large inputs works by performing the join before the data reaches the map function. For this to work, though, the inputs to each map must be partitioned and sorted in a particular way. Each input data set must be divided into the same number of partitions, and it must be sorted by the same key (the join key) in each source. All the records for a particular key must reside in the same partition. This may sound like a strict requirement (and it is), but it actually fits the description of the output of a MapReduce job.

    A map-side join can be used to join the outputs of several jobs that had the same number of reducers, the same keys, and output files that are not splittable, which means the output files should not be bigger than the HDFS block size. Using the org.apache.hadoop.mapred.join.CompositeInputFormat class we can achieve this.

    Reduce-side join

    Reduce-side joins are simpler than map-side joins, since the input datasets need not be structured in any particular way. But they are less efficient, as both datasets have to go through the MapReduce shuffle phase; the records with the same key are brought together in the reducer. We can also use the secondary sort technique to control the order of the records.

    How is it done?
    – The key of the map output, for the datasets being joined, has to be the join key, so they reach the same reducer.
    – Each dataset has to be tagged with its identity in the mapper, to help differentiate between the datasets in the reducer so they can be processed accordingly.
    – In each reducer, the data values from both datasets, for the keys assigned to that reducer, are available to be processed as required.
    – A secondary sort needs to be done to ensure the ordering of the values sent to the reducer.
    – If the input files are of different formats, we would need separate mappers, and we would need to use the MultipleInputs class in the driver to add the inputs and associate the specific mapper with each.
    In Pig, the sorted-input variant of the map-side join is available as a merge join (see the sketch below).
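    Pig's merge join is the analogue of the sorted map-side join described above: both inputs must already be sorted on the join key, and the join then runs map-side without a shuffle. A minimal sketch, assuming hypothetical pre-sorted files sorted_a.txt and sorted_b.txt:

    a = LOAD 'sorted_a.txt' AS (id:int, x:chararray);
    b = LOAD 'sorted_b.txt' AS (id:int, y:chararray);
    -- merge join: both relations must be pre-sorted on id
    j = JOIN a BY id, b BY id USING 'merge';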
