MapReduce questions
  • #3075 Reply

    manideep
    Participant

    1) Difference between old API and new API
    ## The new API is in the org.apache.hadoop.mapreduce package (and subpackages). The old API can still be found in org.apache.hadoop.mapred.
    ## The new API favors abstract classes over interfaces, since these are easier to evolve. This means that you can add a method to an abstract class without breaking old implementations of the class. The Mapper and Reducer interfaces in the old API are abstract classes in the new API.
    ## The new API makes extensive use of context objects that allow the user code to communicate with the MapReduce system. The new Context, for example, essentially unifies the role of the JobConf, the OutputCollector, and the Reporter from the old API.
    ## In both APIs, key-value record pairs are pushed to the mapper and reducer, but in addition, the new API allows both mappers and reducers to control the execution flow by overriding the run() method.
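To make the contrast concrete, here is a minimal sketch of the same trivial mapper written against both APIs (the class names OldApiMapper and NewApiMapper are illustrative; each simply emits its input line with a count of 1):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Old API (org.apache.hadoop.mapred): Mapper is an interface; output goes through
// OutputCollector and progress/status reporting through Reporter.
class OldApiMapper extends MapReduceBase
        implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        output.collect(new Text(value.toString()), new IntWritable(1));
    }
}

// New API (org.apache.hadoop.mapreduce): Mapper is a class; the single Context object
// takes over the roles of JobConf, OutputCollector and Reporter.
class NewApiMapper
        extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(value.toString()), new IntWritable(1));
    }
}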

2. What is shuffling and sorting? How does it work? Give a real-time example.
## Shuffle: MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers the map output to the reducer input is known as the shuffle.
Sort: Sorting happens at several stages of a MapReduce program, so it can occur in both the map and the reduce stage.
Input data is given to the mapper in chunks as (K1, V1) pairs. The output of the mapper is a list of [K2, V2]. Before passing the data as input to the reducer, the framework shuffles and groups it by key and passes [K2, List[V2]] to the reducer.
For example, we have multiple records of temperature for different years and months, and we are trying to find the maximum temperature recorded in each year.
    2009 march 27
    2009 may 38
    2009 dec 36
    2010 jan 30
    2010 sep 42
    2011 mar 29
    2011 apr 39
This input data is passed to the mapper as (k1, v1) tuples; here the key will be the year, since we need to find the maximum temperature recorded in each year.
The output of the mapper is a list of (k2, v2) pairs such as [2009,27][2009,38][2009,36][2010,30]. This data is shuffled according to the key and given as input to the reducer as (k2, list[v2]). The shuffled data given as input is:
2009 [27,38,36]
2010 [30,42]
2011 [29,39]
In the reducer we scan the values for each year and keep the maximum, so the output is a list of (k3, v3) pairs, e.g. (2009, 38), (2010, 42), (2011, 39). A minimal mapper/reducer sketch for this example follows.
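A minimal sketch of this max-temperature job in the new API, assuming whitespace-separated year/month/temperature lines (the class names MaxTempMapper and MaxTempReducer are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: K1 = line offset, V1 = "year month temperature"; emits K2 = year, V2 = temperature
class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().trim().split("\\s+");
        if (fields.length == 3) {
            context.write(new Text(fields[0]), new IntWritable(Integer.parseInt(fields[2])));
        }
    }
}

// Reducer: receives K2 = year with List[V2] = all temperatures for that year after shuffle/sort,
// and emits K3 = year, V3 = maximum temperature
class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable t : temps) {
            max = Math.max(max, t.get());
        }
        context.write(year, new IntWritable(max));
    }
}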

2b) Secondary sorting:

Secondary sort is used when the values within each key group need to come out sorted by a value, not just by the key, and that ordering must stay stable across consecutive runs.
A solution for secondary sorting involves doing multiple things. First, instead of simply emitting the stock symbol as the key from the mapper, we need to emit a composite key, a key that has multiple parts: the stock symbol and the timestamp. Recall that the data flow for an M/R job is as follows.

(K1, V1) -> Map -> (K2, V2)
(K2, List[V2]) -> Reduce -> (K3, V3)

Here we use:
a. Composite key comparator
The composite key comparator is where the secondary sorting takes place. It compares composite keys by symbol ascending and then by timestamp descending (see the sketch after this list). Notice that we sort on both symbol and timestamp: all components of the composite key are considered.
b. Natural key grouping comparator
The natural key grouping comparator groups values together according to the natural key. Without this component, each K2={symbol,timestamp} and its associated V2=price would be treated as its own group in the reducer. Notice that here we only consider the natural key.
c. Natural key partitioner
The natural key partitioner uses the natural key to partition the data among the reducers. Again, note that here we only consider the natural key.
d. The MR job
Once we define the Mapper, the Reducer, the natural key grouping comparator, the natural key partitioner, the composite key comparator, and the composite key, we can configure the job in Hadoop's new M/R API as sketched below.
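A minimal sketch of these pieces, assuming a price value (DoubleWritable) and a simple StockKey composite key; all class and method names here are illustrative, not a fixed API:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key: natural key = symbol, secondary part = timestamp.
class StockKey implements WritableComparable<StockKey> {
    private String symbol = "";
    private long timestamp;

    public String getSymbol() { return symbol; }
    public long getTimestamp() { return timestamp; }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(symbol);
        out.writeLong(timestamp);
    }
    public void readFields(DataInput in) throws IOException {
        symbol = in.readUTF();
        timestamp = in.readLong();
    }
    public int compareTo(StockKey other) {
        int cmp = symbol.compareTo(other.symbol);
        return cmp != 0 ? cmp : -Long.compare(timestamp, other.timestamp);
    }
}

// (a) Composite key comparator: sorts by symbol ascending, then timestamp descending.
class CompositeKeyComparator extends WritableComparator {
    protected CompositeKeyComparator() { super(StockKey.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        StockKey k1 = (StockKey) a;
        StockKey k2 = (StockKey) b;
        int cmp = k1.getSymbol().compareTo(k2.getSymbol());
        return cmp != 0 ? cmp : -Long.compare(k1.getTimestamp(), k2.getTimestamp());
    }
}

// (b) Natural key grouping comparator: groups reducer input by symbol only.
class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() { super(StockKey.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((StockKey) a).getSymbol().compareTo(((StockKey) b).getSymbol());
    }
}

// (c) Natural key partitioner: partitions by symbol only, so every record for a
// symbol reaches the same reducer.
class NaturalKeyPartitioner extends Partitioner<StockKey, DoubleWritable> {
    @Override
    public int getPartition(StockKey key, DoubleWritable value, int numPartitions) {
        return (key.getSymbol().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// (d) Wiring the pieces into the Job (new API):
//   job.setPartitionerClass(NaturalKeyPartitioner.class);
//   job.setSortComparatorClass(CompositeKeyComparator.class);
//   job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);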

3. Write the algorithm of an MR program.
    MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce stage.
Map stage: The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.

    Reduce stage : This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.
    During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
    The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with the data on local disks, which reduces the network traffic.
    After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.

4. What is a mapper job and a reducer job?
    Mapper:

    Mapper maps input key/value pairs to a set of intermediate key/value pairs.

    Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.

    The Hadoop Map/Reduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.

    Reducer:

    Reducer reduces a set of intermediate values which share a key to a smaller set of values.

    The number of reduces for the job is set by the user via JobConf.setNumReduceTasks(int).

    Overall, Reducer implementations are passed the JobConf for the job via the JobConfigurable.configure(JobConf) method and can override it to initialize themselves. The framework then calls reduce(WritableComparable, Iterator, OutputCollector, Reporter) method for each <key, (list of values)> pair in the grouped inputs. Applications can then override the Closeable.close() method to perform any required cleanup.

    Reducer has 3 primary phases: shuffle, sort and reduce.
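A hedged skeleton of the old-API Reducer described above, showing the configure(), reduce() and close() hooks; the class name SumReducer and the summing logic are illustrative:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Old-API reducer skeleton: configure() for setup, reduce() per grouped key, close() for cleanup.
public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void configure(JobConf job) {
        // one-time initialization, e.g. reading job parameters from the JobConf
    }

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }

    @Override
    public void close() throws IOException {
        // cleanup, e.g. closing any side files opened in configure()
    }
}

The number of reduce tasks is then set on the JobConf, e.g. conf.setNumReduceTasks(4).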

5) What is the application of the OutputCollector class? What is a logical split? In which scenario do we make the logical split greater than the physical split? In which scenario do we make the logical split less than the physical split?

The output from the Map function is stored in temporary intermediate files. These files are handled transparently by Hadoop, so in a normal scenario the programmer does not have access to them; we can only review the logs for the respective job, where you will find a log file for each map task.
To control where the temporary files are generated, and to have access to them, you have to create your own OutputCollector class.

There are two kinds of splitting:
1. Physical split: happens at the time of storing the data [storage] and is permanent.
2. Logical split: happens when processing the data and is temporary.

The logical split should be greater than or equal to the block size, i.e. the physical split size [recommended].
The split size is computed as max(minimum split size, min(maximum split size, block size)).

    #3076 Reply

    Roopa
    Participant

    Module 3 : MapReduce :
    1a) Difference b/w old API and New API

a) The new API uses Mapper and Reducer as classes, so a method (with a default implementation) can be added without breaking old implementations of the class.
The old API used Mapper and Reducer as interfaces (the old interfaces still exist in the org.apache.hadoop.mapred package).
b) The new API is in the org.apache.hadoop.mapreduce package.
The old API can still be found in org.apache.hadoop.mapred.
c) The new API allows both mappers and reducers to control the execution flow by overriding the run() method.
In the old API, controlling the mapper was possible by writing a MapRunnable, but no equivalent exists for reducers.
d) Job control is done through the Job class in the new API.
Job control was done through JobClient in the old API.
e) Job configuration is done through the Configuration class, via some of the helper methods on Job.
In the old API, a JobConf object was used for job configuration; JobConf is an extension of the Configuration class:
java.lang.Object
  extended by org.apache.hadoop.conf.Configuration
    extended by org.apache.hadoop.mapred.JobConf
f) In the new API, the reduce() method passes values as a java.lang.Iterable.
In the old API, the reduce() method passes values as a java.util.Iterator.
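A minimal sketch of the job-control difference (a skeleton only; the class name JobControlComparison is illustrative, and each job would still need its mapper/reducer and input/output paths set before it could actually run):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;

public class JobControlComparison {
    // Old API: configuration and submission go through JobConf and JobClient.
    static void runOldApiJob() throws Exception {
        JobConf conf = new JobConf(JobControlComparison.class);
        conf.setJobName("old-api-job");
        JobClient.runJob(conf);          // submits the job and waits for completion
    }

    // New API: configuration via Configuration, job control via the Job class.
    static void runNewApiJob() throws Exception {
        Configuration c = new Configuration();
        Job job = Job.getInstance(c, "new-api-job");
        job.setJarByClass(JobControlComparison.class);
        job.waitForCompletion(true);     // submits the job and waits for completion
    }
}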

2) What is shuffle and sorting? How does it work? Example in real time.
Sorting and shuffling are responsible for producing, for each unique key, the list of its values.
Bringing together all the values that share a key is known as sorting.
And the process by which the intermediate output of the mapper is sorted and sent across to the reducers is known as shuffling.

Input to the mapper is given as (K1, V1); the output will be a list of [K2, V2].
Before passing the data as input to the reducer, the framework shuffles and groups it by key and passes [K2, List[V2]] to the reducer.
For example, we have multiple records of temperature for different years and months, and we are trying to find the maximum temperature recorded in each year.
    2009 march 27
    2009 may 38
    2009 dec 36
    2010 jan 30
    2010 sep 42
    2011 mar 29
    2011 apr 39
This input data is passed to the mapper as (k1, v1) tuples; here the key will be the year, since we need to find the maximum temperature recorded in each year.
The output of the mapper is a list of (k2, v2) pairs such as [2009,27][2009,38][2009,36][2010,30]. This data is shuffled according to the key and given as input to the reducer as (k2, list[v2]). The shuffled data given as input is:
2009 [27,38,36]
2010 [30,42]
2011 [29,39]
In the reducer we scan the values for each year and keep the maximum, so the output is a list of (k3, v3) pairs, e.g. (2009, 38), (2010, 42), (2011, 39).

2c) Types of sorting? Secondary sorting?
Secondary sort is used when the values within each key group need to come out sorted by a value, not just by the key, and that ordering must stay stable across consecutive runs.
A solution for secondary sorting involves doing multiple things. First, instead of simply emitting the stock symbol as the key from the mapper, we need to emit a composite key, a key that has multiple parts: the stock symbol and the timestamp. Recall that the data flow for an M/R job is as follows.

(K1, V1) -> Map -> (K2, V2)
(K2, List[V2]) -> Reduce -> (K3, V3)

Here we use:
a. Composite key comparator
The composite key comparator is where the secondary sorting takes place. It compares composite keys by symbol ascending and then by timestamp descending (see the comparator sketch earlier in this thread). Notice that we sort on both symbol and timestamp: all components of the composite key are considered.
b. Natural key grouping comparator
The natural key grouping comparator groups values together according to the natural key. Without this component, each K2={symbol,timestamp} and its associated V2=price would be treated as its own group in the reducer. Notice that here we only consider the natural key.
c. Natural key partitioner
The natural key partitioner uses the natural key to partition the data among the reducers. Again, note that here we only consider the natural key.
d. The MR job
Once we define the Mapper, the Reducer, the natural key grouping comparator, the natural key partitioner, the composite key comparator, and the composite key, we can configure the job in Hadoop's new M/R API.

3) Write the algorithm of an MR program
The flow of a MapReduce program goes this way:
Input -> Splitting -> Mapping -> Shuffling -> Reducing -> Final Result
This happens with the help of three classes:
Mapper class: input, splitting, mapping
Reducer class: shuffling, reducing, final result
Driver class: public static void main

1) Input:
Data that is passed from the client machine for processing.
It can be in one of the below formats:
-> Text input format
-> Key-value text input
-> N-line input
-> Sequence file

    2)Splitting :
Data is split line by line and stored in (key, value) format, where the key is the logical address (offset) of the record and the value is the entire record for processing:
    (K1,V1)

    3)Mapping : List(k2,v2)
The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.

    4)Shuffling and Sorting : K2,List(v2)
Takes the output from the mapper, i.e. List(k2,v2), and shuffles and sorts the data according to the given algorithm.

    5)Reducing :
    This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.

    During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
    The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with the data on local disks, which reduces the network traffic.
    After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.

4) What is a mapper job and a reducer job?

Mapper job: responsible for reading data from the input location and, based on the input type, generating key-value pairs, that is, the intermediate output, on the local machine.
The four basic type parameters of a mapper are LongWritable, Text, Text and IntWritable. The first two represent the input parameters and the second two represent the intermediate output parameters.

Reducer job: responsible for processing the intermediate output received from the mapper and generating the final output.
The four basic type parameters of a reducer are Text, IntWritable, Text, IntWritable. The first two represent the intermediate output parameters and the second two represent the final output parameters. A short sketch of these declarations follows.
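A minimal sketch of those type parameters in the new API (the class names TokenMapper and TokenReducer are illustrative; the empty bodies simply inherit the identity behaviour):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<K1, V1, K2, V2>: input key/value types, then intermediate output key/value types
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> { }

// Reducer<K2, V2, K3, V3>: intermediate key/value types, then final output key/value types
class TokenReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }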

5a) Write the application of the OutputCollector class. How do you customize the OutputCollector class?

The output from the Map function is stored in temporary intermediate files. These files are handled transparently by Hadoop, so in a normal scenario the programmer does not have access to them; we can only review the logs for the respective job, where you will find a log file for each map task.
To control where the temporary files are generated, and to have access to them, you have to create your own OutputCollector class.

    Interface OutputCollector<K,V>

    Collects the <key, value> pairs output by Mappers and Reducers.
    OutputCollector is the generalization of the facility provided by the Map-Reduce framework to collect data output by either the Mapper or the Reducer i.e. intermediate outputs or the output of the job.

    Method Summary:
    void collect(K key, V value)
    Adds a key/value pair to the output.

    Method Detail :
    void collect(K key,V value)
    throws IOException
    Adds a key/value pair to the output.
    Parameters:
    key – the key to collect.
value – the value to collect.
    Throws:
    IOException
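As an illustration of taking control of the collected pairs, here is a hedged sketch of a wrapper implementing the old-API OutputCollector interface that records every pair before delegating to the framework's collector (the class name LoggingOutputCollector is hypothetical):

import java.io.IOException;
import org.apache.hadoop.mapred.OutputCollector;

// Wrapper OutputCollector: delegates to the real collector while also
// recording each pair somewhere we control (here, simply standard output).
class LoggingOutputCollector<K, V> implements OutputCollector<K, V> {
    private final OutputCollector<K, V> delegate;

    LoggingOutputCollector(OutputCollector<K, V> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void collect(K key, V value) throws IOException {
        System.out.println(key + "\t" + value);   // our own copy of the pair
        delegate.collect(key, value);             // normal framework handling
    }
}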

5b) What is a logical split?
Basically there are two kinds of splitting:
1) Physical split:
Happens at the time of storing the data and it is permanent.
The client, before sending the request to the NameNode, splits the data into a number of smaller chunks based on the specified block size.
Default block size: 64 MB

2) Logical split:
Happens at the time of processing the data and it is temporary.

    Split size depends on following parameters:
    mapred.min.split.size – The smallest valid size in bytes for a file split
    mapred.max.split.size – The largest valid size in bytes for a file split
    dfs.block.size – The size of the block

The logical split should be greater than or equal to the block size, i.e. the physical split size [recommended]. The actual split size is computed as max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size)).
    Example:

Suppose you have a file of 100 MB and the HDFS default block size is 64 MB; the file will be chopped into 2 splits and occupy two HDFS blocks.
Now, if you have a MapReduce program to process this data but have not specified an input split size, then the number of blocks (2) will be taken as the number of input splits, and two mappers will be assigned to the job. But suppose you specify a split size of, say, 100 MB in your MapReduce program; then both blocks will be considered as a single split and one mapper will be assigned to the job.

Now suppose you specify a split size of, say, 25 MB; then there will be 4 input splits for the MapReduce program and 4 mappers will be assigned to the job. The sketch below shows how these split sizes can be set.
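A minimal sketch of setting the split-size bounds in the new API, assuming a Job object configured elsewhere (the class name SplitSizeConfig is illustrative):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    static void configureSplits(Job job) {
        // Split size is computed as max(minSplitSize, min(maxSplitSize, blockSize)).

        // Larger logical splits: a 100 MB minimum makes the whole 100 MB file
        // (2 physical blocks of 64 MB) a single split, i.e. 1 mapper.
        FileInputFormat.setMinInputSplitSize(job, 100L * 1024 * 1024);

        // Smaller logical splits: a 25 MB maximum would instead give
        // 4 splits, i.e. 4 mappers, for the same file.
        // FileInputFormat.setMaxInputSplitSize(job, 25L * 1024 * 1024);
    }
}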

    6) Write a case for partitioner
    A partitioner works like a condition in processing an input dataset.
    The partition phase takes place after the Map phase and before the Reduce phase.

    A partitioner partitions the key-value pairs of intermediate Map-outputs.
    It partitions the data using a user-defined condition, which works like a hash function.
The total number of partitions is the same as the number of reducer tasks for the job.

    Case for Partitioner :
Consider an Employee dataset with the following fields:
Employee (ID, Name, Age, Gender, Salary).
We have to write an application to process the input dataset to find the highest-salaried employee by gender in different age groups (for example, below 20, between 21 and 30, above 30).

    Map Tasks :
The map task accepts the key-value pairs as input, since we have the text data in a text file. The input for this map task is as follows.

Input – The key is the logical address (offset) of the record in the file, and the value is the entire record.

    Method – The operation of this map task is as follows –
    Read the value (record data), which comes as input value from the argument list in a string.
    Using the split function, separate the gender and store in a string variable.
    Send the gender information and the record data value as output key-value pair from the map task to the partition task.
    Repeat all the above steps for all the records in the text file.

    Output – We will get the gender data and the record data value as key-value pairs.

    Partitioner Task
    The partitioner task accepts the key-value pairs from the map task as its input.
    Partition implies dividing the data into segments. According to the given conditional criteria of partitions, the input key-value paired data can be divided into three parts based on the age criteria.

    Age less than or equal to 20
    Age Greater than 20 and Less than or equal to 30.
    Age Greater than 30

Output – The whole set of key-value pairs is segmented into three collections of key-value pairs, and the reducer works individually on each collection. A sketch of such a partitioner follows.
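A minimal sketch of an age-based partitioner for this case, assuming the mapper emits (gender, record) pairs and the record value is the comma-separated line "ID,Name,Age,Gender,Salary" (the class name AgePartitioner is illustrative):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partitions (gender, record) pairs into three age-based partitions.
// Assumes the record value is "ID,Name,Age,Gender,Salary" with Age at index 2.
public class AgePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        String[] fields = value.toString().split(",");
        int age = Integer.parseInt(fields[2].trim());

        if (numReduceTasks == 0) {
            return 0;                  // no reducers: everything goes to one partition
        }
        if (age <= 20) {
            return 0;                  // partition 0: age <= 20
        } else if (age <= 30) {
            return 1 % numReduceTasks; // partition 1: 20 < age <= 30
        } else {
            return 2 % numReduceTasks; // partition 2: age > 30
        }
    }
}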

    #3108 Reply

    Roopa
    Participant

    Module 3:
5c) In which scenario do we set the logical split greater than the physical split?
The split size of the input file is a concern for the developer while writing the MR program. If the physical split breaks records into parts, so that the MR program receives incomplete records for processing, the result could be wrong. In that case the logical split size needs to be increased so that the input file is split into meaningful units of processing to pass to the mapper.

5d) In which scenario do we set the logical split less than the physical split?
When each record is much smaller than the block size, so that every block contains many complete records and no record crosses a block boundary, a block can be processed correctly even if it is divided further; the developer may then choose a logical split size smaller than the block size, for example to get more mappers and better parallelism, while the MR job still generates the desired output.
Another scenario is where the record size is equal to the block size.

1b) Write down an MR program in the new API

package PackageDemo;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Mapper: emits (word, 1) for every comma-separated token of the input line
    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    // Reducer: sums the counts collected for each word
    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }

    // Driver: configures and submits the job
    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);
        Job j = Job.getInstance(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }
}

    7)How to control the mapper output?
The mapper output can be controlled using the below classes in MapReduce:
1) Partitioner:
A partitioner partitions the key-value pairs of the intermediate map outputs.
It partitions the data using a user-defined condition, which works like a hash function.
The total number of partitions is the same as the number of reducer tasks for the job.

2) Combiner
Combiners are used to increase the efficiency of a MapReduce program. They aggregate the intermediate map output locally on the individual mapper nodes.
Combiners can help reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative.
The execution of the combiner is not guaranteed; Hadoop may or may not execute it, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution. A short wiring sketch follows.
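A minimal sketch of wiring these into a job, reusing the AgePartitioner sketched above and the ReduceForWordCount class from the WordCount program in this thread; the two setters are shown side by side only to illustrate the calls (in a real job the partitioner and combiner must match that job's map output types), and the wrapper class MapOutputControl is illustrative:

import org.apache.hadoop.mapreduce.Job;

public class MapOutputControl {
    static void configure(Job job) {
        // Custom partitioner decides which reduce task each map output key/value goes to.
        job.setPartitionerClass(AgePartitioner.class);

        // Combiner aggregates map output locally before the shuffle; reusing the
        // reducer is valid only for commutative and associative operations (sum, max, ...).
        job.setCombinerClass(ReduceForWordCount.class);

        // The number of partitions equals the number of reduce tasks.
        job.setNumReduceTasks(3);
    }
}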

8. In which scenario does the logical split vary?

A block is the physical representation of the data.
A split is the logical representation of the data present in a block, and it is the part of the input processed by a single map: each split is processed by a single map task.

For example, if the data is 128 MB and the block size is 64 MB (default):
Case 1 – input split size [64 MB] = block size [64 MB], number of map tasks: 2
Case 2 – input split size [32 MB] < block size [64 MB], number of map tasks: 4
Case 3 – input split size [128 MB] > block size [64 MB], number of map tasks: 1

Data splitting happens based on file offsets. The goal of splitting the file and storing it in different blocks is parallel processing.

    9)In which machine does shuffle run mapper or reducer?
    Reducer

    10.In which machine does partitioner run mapper or reducer?
    Mapper

11) What is spill buffer data?

Spilling buffer data means writing the buffered data (the in-memory cache of map output) to the local physical disk to empty the buffer when it reaches about 80% occupancy. If this spilling is not done in time, the data in the buffer may be overwritten by upcoming mapper output.
The amount of memory available for this buffer is set by mapreduce.task.io.sort.mb.
The spill happens at least once, when the mapper finishes its input, because the output of the mapper must be sorted and saved to disk for the reducer to read and process to generate the final output.
The final output is then written back to HDFS.
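A minimal sketch of tuning these settings programmatically, assuming the Hadoop 2.x property names (older releases used io.sort.mb and io.sort.spill.percent; the class name SpillTuning and the values are illustrative):

import org.apache.hadoop.conf.Configuration;

public class SpillTuning {
    static Configuration tuned() {
        Configuration conf = new Configuration();
        // Size in MB of the in-memory buffer that holds map output before it is spilled to local disk.
        conf.set("mapreduce.task.io.sort.mb", "256");
        // Fraction of the buffer that may fill before a background spill starts (default 0.80, i.e. 80%).
        conf.set("mapreduce.map.sort.spill.percent", "0.80");
        return conf;
    }
}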

12) Where is the shuffle, partition and sort output stored?

The mapper output is stored on the local file system (disk) of the node that ran the map task, which keeps the intermediate data close to where it was produced, in line with the idea of data locality. We can control where it goes with the below property in mapred-site.xml.

The local directory where MapReduce stores intermediate data files:

    <property>
    <name>mapred.local.dir</name>
    <value>/var/lib/hadoop-0.20/cache/mapred/local</value>
    <final>true</final>
    </property>

17. Distributed cache (joining) and its application, with example
This is useful when our input word-frequency files are NOT sorted and one of the two files is small enough to fit in memory. In this case you can ship the small file through the distributed cache and load it in memory as a hash table, Map<String, Integer>. Each mapper then streams the larger input file as key-value pairs and looks up the values of the smaller file in the hash map (see the sketch after the pros and cons below).

    Pros: Efficient, linear complexity based on largest input set size. Does not require reducer.

    Cons: Requires one of the inputs to fit in memory.
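A minimal sketch of this map-side join via the distributed cache in the new API, assuming Hadoop 2.x symlinks the cached file into the task working directory; the file path, the tab-separated format and the class name MapSideJoinMapper are all assumptions for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, Integer> smallTable = new HashMap<String, Integer>();

    // Driver side (new API), before submission (hypothetical path):
    //   job.addCacheFile(new URI("/data/small-word-freq.txt"));

    @Override
    protected void setup(Context context) throws IOException {
        // The cached file is localized for the task; load it once into an in-memory hash table.
        URI[] cacheFiles = context.getCacheFiles();
        String localName = new Path(cacheFiles[0].getPath()).getName();
        BufferedReader reader = new BufferedReader(new FileReader(localName));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t");       // assumed format: word<TAB>frequency
            smallTable.put(parts[0], Integer.parseInt(parts[1]));
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Stream the large file and look up each word in the in-memory table.
        String[] parts = value.toString().split("\t"); // assumed format: word<TAB>frequency
        Integer freqFromSmallFile = smallTable.get(parts[0]);
        if (freqFromSmallFile != null) {
            context.write(new Text(parts[0]), new Text(parts[1] + "\t" + freqFromSmallFile));
        }
    }
}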

18. Difference between map-side join and reduce-side join

Map-side join:
The inputs to each map must be partitioned and sorted in a specific way. Each input dataset must be divided into the same number of partitions, and it must be sorted by the same key (the join key) in each source.

All the records for a particular key must reside in the same partition; this is mandatory. A map-side join can be used to join the outputs of several jobs that had the same number of reducers, the same keys, and output files that are no bigger than the HDFS block size.

Reduce-side join:
Reduce-side joins are simpler than map-side joins, since the input datasets do not need to be structured in a particular way, but they are less efficient because both datasets have to go through the MapReduce shuffle phase. The records with the same key are brought together in the reducer. We can also use the secondary sort technique to control the order of the records.

19) How does MR read an XML file? Write the property to read an XML file.
conf.set("START_TAG_KEY", "<employee>");
conf.set("END_TAG_KEY", "</employee>");

    Classroom Question :

20) A combiner is running on a single data node. How many times can it be called?
Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all.
In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer. One can think of combiners as "mini-reducers" that run on the output of the mappers, prior to the shuffle and sort phase.

    #3118 Reply

    Manideep

    6) Write a case for partitioner …
The partitioner divides the intermediate data according to the number of reducers, so that all the data in a single partition is processed by a single reducer; each partition is processed by exactly one reducer. If the job has reducers, the partitioner is invoked automatically on the map output.

A partitioner works like a condition in processing an input dataset. The partition phase takes place after the Map phase and before the Reduce phase.
    It partitions the data using a user-defined condition, which works like a hash function.

    Case for Partitioner :
Consider an Employee dataset with the following fields:
Employee (ID, Name, Age, Gender, Salary).
We have to write an application to process the input dataset to find the highest-salaried employee by gender in different age groups (for example, below 20, between 21 and 30, above 30).

The map task accepts the key-value pairs as input, since we have the text data in a text file. The input for this map task is as follows.
Input – The key is the logical address (offset) of the record in the file, and the value is the entire record.
    Output – We will get the gender data and the record data value as key-value pairs.

    The partitioner task accepts the key-value pairs from the map task as its input.
    Partition implies dividing the data into segments. According to the given conditional criteria of partitions, the input key-value paired data can be divided into three parts based on the age criteria.

    Age less than or equal to 20
    Age Greater than 20 and Less than or equal to 30.
    Age Greater than 30

    Output – The whole data of key-value pairs are segmented into three collections of key-value pairs. The Reducer works individually on each collection.

    7)How to control the mapper output?
The mapper output is stored on the local file system of each individual mapper node. This is typically a temporary directory location, which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
We can handle the mapper output using combiner and partitioner classes.
When you set the number of reducers to zero, no reducers will be executed, and the output of each mapper will be stored as a separate file on HDFS.

8) In which scenario does the logical split vary?
There are two kinds of splitting:
1. Physical split: happens at the time of storing the data [storage] and is permanent.
2. Logical split: happens when processing the data and is temporary.

The logical split should be greater than or equal to the block size, i.e. the physical split size [recommended].
The split size is computed as max(minimum split size, min(maximum split size, block size)).

A split is basically used to control the number of mappers in a Map/Reduce program. If you have not defined any input split size in the Map/Reduce program, then the default HDFS block split will be taken as the input split.
Suppose you have a file of 100 MB and the HDFS default block size is 64 MB.
Default: the data will be chopped into 2 splits and occupy 2 blocks; 2 mappers will be assigned to the job.
Split size of, say, 100 MB: both blocks will be considered as a single split for the Map/Reduce processing and 1 mapper will be assigned to the job.
Split size of, say, 25 MB: there will be 4 input splits for the Map/Reduce program and 4 mappers will be assigned to the job.

9) In which machine does the shuffle run, mapper or reducer?
Reducer (the reduce tasks fetch the map outputs during the copy phase of the shuffle).

10. In which machine does the partitioner run, mapper or reducer?
Mapper (the partition is computed on the map side, as the map output is written).

11) What is spill buffer data?
Spilling buffer data means writing the buffered map output to the local physical disk to empty the buffer when it reaches about 80% occupancy. If this spilling is not done in time, the data in the buffer may be overwritten by upcoming mapper output.
The amount of memory available for this buffer is set by mapreduce.task.io.sort.mb.

12) Where is the shuffle, partition and sort output stored?
After the map phase, the mapper output is stored temporarily as intermediate data on the local file system of the mapper node.
The directory for these temporary files is configured via mapred.local.dir in mapred-site.xml (see the property shown earlier in this thread).
The Hadoop framework merges and sorts this intermediate data and then passes it to the reduce function.
The framework deletes this temporary data from the local file system after the job completes.

13) Write the conf for the XML loader.
For XML files we have to create our own custom input format for processing with MapReduce jobs.
a) We need the XML parsing classes, e.g. import javax.xml.stream.XMLInputFactory;
b) Create an XmlInputFormat class for reading the XML file by extending the TextInputFormat class.
c) In the driver class we need to define the root-node tags:
Configuration conf = new Configuration();
conf.set("START_TAG_KEY", "<employee>");
conf.set("END_TAG_KEY", "</employee>");
job.setInputFormatClass(XmlInputFormat.class); // need to add the XML input format class to the job
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
d) In the Mapper class we need to iterate through the node list:
InputStream is = new ByteArrayInputStream(value.toString().getBytes());
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(is);
doc.getDocumentElement().normalize();
NodeList nList = doc.getElementsByTagName("employee");

for (int temp = 0; temp < nList.getLength(); temp++) {
    Node nNode = nList.item(temp);
    if (nNode.getNodeType() == Node.ELEMENT_NODE) {
        Element eElement = (Element) nNode;
        String id = eElement.getElementsByTagName("id").item(0).getTextContent();
        String name = eElement.getElementsByTagName("name").item(0).getTextContent();
        String gender = eElement.getElementsByTagName("gender").item(0).getTextContent();
        // System.out.println(id + "," + name + "," + gender);
        context.write(new Text(id + "," + name + "," + gender), NullWritable.get());
    }
}

15) How to monitor MR job progress apart from the terminal?
We can use the command-line interface to manage and display jobs, history and logs.
We can also use the JobTracker and TaskTracker web UIs to track the status of a launched job or to check the history of previously run jobs.

16) Distributed cache (joining) and its application, with example
This is useful when our input word-frequency files are NOT sorted and one of the two files is small enough to fit in memory. In this case you can ship the small file through the distributed cache and load it in memory as a hash table, Map<String, Integer>. Each mapper then streams the larger input file as key-value pairs and looks up the values of the smaller file in the hash map (see the sketch earlier in this thread).
    Pros: Efficient, linear complexity based on largest input set size. Does not require reducer.
    Cons: Requires one of the inputs to fit in memory.

17) Difference between map-side join and reduce-side join

Map-side join:
The inputs to each map must be partitioned and sorted in a specific way. Each input dataset must be divided into the same number of partitions, and it must be sorted by the same key (the join key) in each source.

All the records for a particular key must reside in the same partition; this is mandatory. A map-side join can be used to join the outputs of several jobs that had the same number of reducers, the same keys, and output files that are no bigger than the HDFS block size.

Reduce-side join:
Reduce-side joins are simpler than map-side joins, since the input datasets do not need to be structured in a particular way, but they are less efficient because both datasets have to go through the MapReduce shuffle phase. The records with the same key are brought together in the reducer. We can also use the secondary sort technique to control the order of the records.
