Viewing 14 posts - 1 through 14 (of 14 total)

    What is the process of Profiling in MapReduce?

    On every host of the cluster, a special jar containing a javaagent (a piece of code used to instrument programs running on the JVM) is deployed, and the agent is embedded in every JVM process that runs on that machine.

    The profiler’s javaagent gathers stack traces from each JVM process 100 times per second and sends that information to a dedicated host running a NoSQL database (InfluxDB).

    After the stack traces have been collected while a distributed application runs, a set of scripts is run against the database to extract data about class and method execution and to visualize it as a flame graph.
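
    The sampling loop described above can be sketched in plain Java. This is a standalone illustration, not the actual agent: the class and method names are made up, and a real agent would ship each sample to InfluxDB rather than tally it locally.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a sampling profiler: capture the stack traces of all
// threads at a fixed interval and tally how often each frame is on top.
// A real javaagent would send these samples to a store such as InfluxDB.
public class SamplingSketch {
    public static Map<String, Integer> sample(int rounds, long intervalMs)
            throws InterruptedException {
        Map<String, Integer> topFrameCounts = new HashMap<>();
        for (int i = 0; i < rounds; i++) {
            for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
                if (stack.length > 0) {
                    String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                    topFrameCounts.merge(top, 1, Integer::sum);
                }
            }
            Thread.sleep(intervalMs); // 100 samples/second would use a 10 ms interval
        }
        return topFrameCounts;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(sample(5, 10).keySet());
    }
}
```

    Flame graphs are then built from how often each stack appears across all samples.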


    What is the distributed cache and what are its applications?

    The distributed cache works with small data files or code libraries that may need to be accessed by all nodes in the cluster. Because the data is small, the whole cache can fit on a single machine.
    There are two ways to put data into the cache and access it:

    1) Generic options parser :- In this case we specify the cache files as a list of URLs passed as a command-line argument. The files specified via URL must already be present in the filesystem, and they are made accessible to every machine in the cluster.

    2) Distributed cache API :- In this case we call methods in the driver class to name the files that should be shipped to all nodes. DistributedCache.addCacheFile() adds a file, and getLocalCacheFiles() retrieves the list of local paths on the task side.

    It is mainly used for application side data, such as lookup tables that would otherwise reside in a database.
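
    A minimal driver-side sketch of the API route named above, assuming Hadoop is on the classpath; the file path here is hypothetical, and newer Hadoop versions expose the same idea via Job.addCacheFile():

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class CacheDriverSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Driver side: register a file that already exists in the filesystem.
        // The path below is made up for illustration.
        DistributedCache.addCacheFile(new URI("/lookup/countries.txt"), conf);

        // Task side (e.g. in Mapper.setup): list the localized copies.
        Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    }
}
```

    This is configuration only; the framework handles copying the file to each node before the tasks start.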

    Sanchita Sen

    Difference between Old API and New API
    1. Old: configuration is done through JobConf (mapper class, reducer class, input file). New: configuration is done through the Job class (mapper, reducer).
    2. Old: present in org.apache.hadoop.mapred. New: present in org.apache.hadoop.mapreduce.
    3. Old: works with Tool and ToolRunner. New: does not work with those classes.
    4. Old: Mapper and Reducer are interfaces. New: Mapper and Reducer are classes.


    What are the limitations of MapReduce?

    1. MapReduce has no real-time processing.
    2. When a quick response is needed, MapReduce is not a good choice.
    3. It is a poor fit when processing requires a lot of data to be shuffled over the network.
    4. It is not very easy to implement each and every problem as a MapReduce program.

    Difference between map-side join and reduce-side join?

    A map-side join between large inputs works before the data reaches the map function.
    A map-side join happens when the file size is smaller than the cache.
    Each input must be divided into the same number of partitions and must be sorted by the same key.

    A reduce-side join happens when the file size is greater than the cache.
    The input datasets need not be structured.
    It is less efficient because the datasets have to go through the shuffle phase, and the records with the same key are brought together in the reducer.
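
    The map-side case, where the small input fits in memory and the join happens without any shuffle, can be simulated in plain Java (the class name and table contents below are made up for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Map-side join sketch: the small side is loaded into a hash map (in
// Hadoop this is what the distributed cache enables), and each record
// of the large side is joined as it streams past -- no shuffle needed.
public class MapSideJoinSketch {
    public static List<String> join(Map<String, String> small, List<String[]> large) {
        List<String> out = new ArrayList<>();
        for (String[] record : large) {          // record = {key, value}
            String match = small.get(record[0]);
            if (match != null) {
                out.add(record[0] + "," + record[1] + "," + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> cities = new HashMap<>();
        cities.put("1", "Kolkata");
        cities.put("2", "Pune");
        List<String[]> orders = List.of(new String[]{"1", "books"},
                                        new String[]{"2", "pens"});
        System.out.println(join(cities, orders)); // [1,books,Kolkata, 2,pens,Pune]
    }
}
```

    A reduce-side join cannot use this shortcut: both inputs are shuffled so that equal keys meet at the same reducer, which is what makes it slower.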


    In which scenario does the logical split vary?
    It varies depending upon the format of the input file.
    For a text input file, a split happens whenever a newline character appears.

    Neha Bhakat

    Write a use case for the Partitioner
    The partitioner runs in the intermediate step between the map and reduce tasks. It works like a condition applied to the intermediate dataset: the number of partitions equals the number of reducers, so the partitioner divides the data according to the number of reducers.
    A partitioner partitions the key-value pairs of the intermediate map outputs. It partitions the data using a user-defined condition, which works like a hash function. If we do not define a partitioner, the default HashPartitioner is used.
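
    The default HashPartitioner behaviour mentioned above boils down to the following computation, shown here as a plain-Java sketch of the same logic (class name made up):

```java
// Sketch of Hadoop's default HashPartitioner logic: mask off the sign
// bit of the key's hash, then take it modulo the number of reducers,
// so every occurrence of a key lands on the same reducer.
public class HashPartitionSketch {
    public static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition.
        System.out.println(partition("hadoop", 4) == partition("hadoop", 4)); // true
    }
}
```

    A custom partitioner replaces this hash with any user-defined condition, e.g. routing by a field of the key.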


    How can we manage the MapReduce process other than from the terminal?

    We can manage MapReduce processes using the Cloudera Manager monitoring tool instead of the terminal.

    Neha Bhakat

    Write the application of the OutputCollector class
    The OutputCollector class is provided by the MapReduce framework to collect the data output by either the Mapper or the Reducer, i.e. the intermediate outputs.


    How does MapReduce read an XML file?

    conf.set("xmlinput.start", " ");
    conf.set("xmlinput.end", " ");

    We define the string for the XML start tag in xmlinput.start and the string for the XML end tag in xmlinput.end, then set XmlInputFormat as the input format class.


    Where is the output of shuffle, partition, and sort sorted?
    Ans. It is sorted on the reducer side.

    Sanchita Sen

    Limitations of MapReduce
    1. MapReduce is a disk-based process.
    2. It works only for batch processing.
    3. It consumes considerable time for processing.

    Neha Bhakat

    What is a Mapper job and a Reducer job?
    Mapper and Reducer are the two functions of the MapReduce framework.
    The Mapper maps input key/value pairs to a set of intermediate key/value pairs (k2, v2).

    The Reducer reduces the set of intermediate values to produce the final output.
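
    The map and reduce steps can be illustrated with a word count simulated in plain Java (no Hadoop classes; the class name is made up):

```java
import java.util.HashMap;
import java.util.Map;

// Word count, simulated in one process: the "map" step emits a
// (word, 1) pair per word, and the "reduce" step sums the values
// that share the same intermediate key.
public class WordCountSketch {
    public static Map<String, Integer> mapReduce(String line) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : line.split("\\s+")) {   // map: emit (word, 1)
            counts.merge(word, 1, Integer::sum);   // reduce: sum per key
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(mapReduce("big data big cluster").get("big")); // 2
    }
}
```

    In real MapReduce the two steps run as separate distributed tasks, with the shuffle delivering all values of one key to a single reducer.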


    What is split Buffer?
    Ans. In MapReduce we have two types of splitting: 1. physical split, 2. logical split.
    Physical split: the input is split into blocks depending on the block size;
    the default size is 64 MB (128 MB in newer Hadoop versions).
    Logical split: it depends upon the input format of the input file. If it is a text file,
    a split occurs whenever a newline character appears.
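
    The newline-based logical split for text input behaves like this plain-Java sketch (class name made up; real Hadoop also handles records that straddle physical block boundaries):

```java
import java.util.Arrays;
import java.util.List;

// Logical split of text input: each record ends at a newline,
// regardless of where the physical block boundaries fall.
public class LogicalSplitSketch {
    public static List<String> records(String fileContents) {
        return Arrays.asList(fileContents.split("\n"));
    }

    public static void main(String[] args) {
        System.out.println(records("line one\nline two\nline three").size()); // 3
    }
}
```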
