This topic contains 2 replies, has 3 voices, and was last updated by suchitra mohanty 1 year, 7 months ago.

  • #1918 Reply

    faizan0607
    Participant

    Questions –
    1. What is the difference between a Mapper and a Map Task?
    2. What is the anatomy of a MapReduce job?
    3. When would we define a split size greater than the block size?
    4. What is the difference between the old API and the new API?
    5. What are shuffling and sorting?
    ———————————————————————————————————————————————

    Answers –
    1. Mapper is one of the classes used in a MapReduce program.
    The main task of the Mapper class is to read data from the input location and, based on the input format, generate key-value pairs; this intermediate output is stored on the local machine.

    Map Task – It is one of the entities the JobTracker creates to process the data. Each Map Task works on one input split (by default one block) at a time.
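
    As a rough illustration (not from the original post), here is a minimal word-count style Mapper using the new API; the class name WordCountMapper and the whitespace tokenization are just assumptions for the example.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // key = byte offset of the line in the file, value = the line itself
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit an intermediate key-value pair
                }
            }
        }
    }
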
    ———————————————————————————————————————————————

    2. Anatomy of MapReduce –
    1. The MapReduce program runs the job.
    2. The Job Client sends a request to the JobTracker, which assigns a unique Job ID.
    3. The Job Client copies the job resources and computes the logical input splits.
    4. The Job Client's work ends here, when it submits the job to the JobTracker (see the driver sketch after this list).
    5. The JobTracker initializes the job and assigns tasks to the TaskTrackers; it creates two kinds of entities, Map Tasks and Reduce Tasks. A Map Task processes only one split at a time, so multiple instances of the Map Task are needed, and their number is determined by the logical splits.
    6. The JobTracker then retrieves the input splits from HDFS.
    7. The TaskTrackers notify the JobTracker by sending heartbeats.
    8. Each TaskTracker retrieves the job resources, such as the path of the jar files.
    9. The TaskTracker then launches the tasks, and the JobTracker monitors the progress reports.
    10. The MapReduce program runs, and upon completion of the job the JobTracker notifies the client.
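
    The driver sketch below shows the client side of this flow with the new API: creating the job, packaging the resources, and submitting it while polling for progress. It is illustrative only; WordCountDriver, WordCountMapper and WordCountReducer are hypothetical names taken from the sketches in this thread.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");            // the client asks the framework for a new job
            job.setJarByClass(WordCountDriver.class);                 // job resources (the jar) are copied to the cluster
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));     // input splits are computed from this path
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // submit the job and poll its progress until completion
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
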
    ———————————————————————————————————————————————

    3. We define a split size greater than the block size when we want fewer, larger map tasks. This increases the split size, but at the cost of data locality, because a split may then span blocks stored on different nodes.
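
    For example (a minimal sketch, assuming a 128 MB HDFS block size; the class name SplitSizeExample is made up), the minimum split size can be raised in the driver so that each map task covers more than one block:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "large-split example");
            // Ask for splits of at least 256 MB even though blocks are 128 MB.
            // A map task then reads blocks that may sit on other nodes,
            // so the split is larger but data locality suffers.
            FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
        }
    }
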

    4. OLD API-
    a. In the old API, Mapper and Reducer are interfaces (they still exist alongside the new API).
    b. The old API can still be found in the org.apache.hadoop.mapred package.
    c. JobConf, OutputCollector, and Reporter objects are used to communicate with the MapReduce system.
    d. Mapper execution can be controlled by writing a MapRunnable, but no
    equivalent exists for reducers.

    NEW API-
    a. The new API uses Mapper and Reducer as abstract classes,
    so a method (with a default implementation) can be added to an
    abstract class without breaking old implementations of the class.
    b. The new API is in the org.apache.hadoop.mapreduce package.
    c. A “context” object is used to communicate with the MapReduce system.
    d. The new API allows both mappers and reducers to control the execution
    flow by overriding the run() method (an old-API mapper sketch for contrast follows this list).
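
    For contrast, a minimal old-API mapper might look like the sketch below (illustrative only; OldApiWordCountMapper is a made-up name). Note the Mapper interface, the MapReduceBase base class, and the OutputCollector/Reporter pair instead of a Context object.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class OldApiWordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    output.collect(word, ONE);   // old API: collect() instead of context.write()
                }
            }
        }
    }
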
    ———————————————————————————————————————————————

    5. Shuffling is the process by which the intermediate output of the mappers is transferred across the network to the reducers. Sorting arranges the intermediate keys so that all values belonging to the same key end up together. Together, these two processes present each reducer with a unique key and the list of values for that key.
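
    A minimal new-API reducer sketch (illustrative; WordCountReducer is a made-up name matching the mapper sketch above): after the shuffle and sort, each reduce() call receives one unique key together with the iterable list of all values emitted for that key.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {   // all values for this key, gathered by the shuffle
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
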

    #1935 Reply

    chinni
    Participant

    1a) MAPPER: The purpose of the mapper is to organize the data in preparation for the processing done in the reduce phase. By default, the value is a data record and the key is generally the offset of the data record from the beginning of the data file.
    A Mapper maps input key/value pairs to a set of intermediate key/value pairs.

    MAP TASK: The MapTask class is the runtime task that runs the mapper over one input split and serializes the map output.

    2) Anatomy of MapReduce:
    a) First, a job is submitted to the Hadoop cluster.
    b) The job is then initialized in Hadoop.
    c) The initialized job is split into tasks.
    d) The JobTracker assigns the tasks to TaskTrackers.
    e) The tasks are then executed in a distributed environment, and the progress and status of the job are tracked.
    f) The execution process continues until all tasks are completed.

    4) NEW API (application programming interface):
    a) The new API uses Mapper and Reducer as abstract classes,
    so a method (with a default implementation) can be added to an
    abstract class without breaking old implementations of the class.
    b) Job control is done through the Job class in the new API.
    c) The new API is in the org.apache.hadoop.mapreduce package.
    d) A “context” object is used to communicate with the MapReduce system.

    OLD API:
    a) In the old API, Mapper and Reducer are interfaces (they still exist alongside the new API).
    b) The old API can still be found in the org.apache.hadoop.mapred package.
    c) Job control was done through JobClient
    (which does not exist in the new API).
    d) JobConf, OutputCollector, and Reporter objects are used to communicate with the MapReduce system (a JobClient sketch follows this list).
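
    A rough sketch of old-API job control through JobClient (illustrative only; OldApiDriver is a made-up name, and OldApiWordCountMapper refers to the old-API mapper sketched earlier in this thread):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class OldApiDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(OldApiDriver.class);
            conf.setJobName("old-api word count");
            conf.setMapperClass(OldApiWordCountMapper.class);
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);   // old API: JobClient submits the job and waits for it
        }
    }
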

    #1945 Reply

    suchitra mohanty
    Participant

    1) MAPPER - Mapper is a class which is used in a MapReduce program.

    MAP TASK - The JobTracker creates two kinds of entities; one of them is the Map Task, and one Map Task executes one block at a time.

    2) Anatomy of MapReduce:
    a) First, a client sends a request to the JobTracker.
    b) The job is then submitted to the Hadoop cluster.
    c) The job is split into tasks.
    d) The tasks are assigned to the TaskTrackers.
    e) The tasks execute until all of them are complete.
