April 2, 2017 at 9:56 pm #3138
1. What is the process of profiling in MapReduce?
On every host of the cluster, a special JAR file containing a javaagent is added (a javaagent is a piece of code used to instrument a program running on the JVM). The agent is embedded in every JVM process that runs on that machine.
The profiler's javaagent gathers stack traces from each JVM process 100 times per second and sends that information to a dedicated host running a NoSQL database (InfluxDB).
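The sampling step can be pictured with a small plain-Java sketch. This is not the profiler's actual agent: the class name StackSamplerSketch and the method sample() are hypothetical, and the sketch only counts which method is on top of each thread's stack instead of sending anything to a database.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sampler: illustrates periodic stack sampling only.
final class StackSamplerSketch {

    // Takes `samples` snapshots of all live threads, `intervalMillis` apart,
    // and counts how often each method is the top frame of a stack.
    static Map<String, Integer> sample(int samples, long intervalMillis) {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
                if (stack.length == 0) {
                    continue; // no Java frames available for this thread right now
                }
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                hits.merge(top, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag and stop
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // A 10 ms pause between samples gives roughly 100 samples per second.
        Map<String, Integer> hits = sample(20, 10L);
        hits.forEach((method, count) -> System.out.println(count + "\t" + method));
    }
}
```

A real agent would ship each sample to the collector host rather than keep counts in memory.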
After the distributed application has run and the stack traces have been collected, a set of scripts is run against the database to extract data about class or method execution and to visualize it as a flame graph.
April 2, 2017 at 10:01 pm #3139
What is the distributed cache and its application?
The distributed cache handles small data files or code libraries that need to be accessed by all nodes in the cluster. Because the data is small, the whole cache fits on a single machine, so each node can hold its own local copy.
There are two ways to put data into the cache and access it:
1) Generic option parser: the cache files are passed as a list of URLs in a command-line argument (the -files option of GenericOptionsParser). The files specified by URL must already be present in the filesystem and be accessible from every machine in the cluster.
2) Distributed cache API: in the Driver class we call a method to register the files that should be sent to all nodes. DistributedCache.addCacheFile() adds a file, and getLocalCacheFiles() retrieves the list of local paths.
It is mainly used to hold application data that would otherwise reside in a database.
April 2, 2017 at 10:12 pm #3140
Difference between the old API and the new API
Old API                                         New API
1. JobConf configures the mapper and reducer    1. Job configures the mapper and reducer
2. Lives in org.apache.hadoop.mapred            2. Lives in org.apache.hadoop.mapreduce
3. Usually run through Tool and ToolRunner      3. Can also be run through Tool and ToolRunner
4. Mapper and Reducer are interfaces            4. Mapper and Reducer are classes
April 2, 2017 at 10:14 pm #3141
What are the limitations of MapReduce?
April 2, 2017 at 10:15 pm #3142
- MapReduce does not support real-time processing.
- When a quick response is needed, MapReduce is not a good choice.
- It is a poor fit when processing requires a lot of data to be shuffled over the network.
- It is not very easy to implement everything as a MapReduce program.
Difference between map-side join and reduce-side join?
A map-side join between large inputs is performed before the data reaches the map function.
A map-side join can be used when the file is smaller than the cache.
Each input must be divided into the same number of partitions and must be sorted by the same key.
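For the small-file case, the idea can be sketched in plain Java without any Hadoop classes. This is only an illustration under assumptions: the class MapSideJoinSketch, the record layouts ("deptId,deptName" and "empName,deptId"), and the sample data are all hypothetical. The small side is loaded into memory once, as if read from the cache, and each record of the large side is joined as it passes through the map step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a map-side join against a small cached file.
final class MapSideJoinSketch {

    // smallSide lines look like "deptId,deptName"; largeSide lines like "empName,deptId".
    static List<String> join(List<String> smallSide, List<String> largeSide) {
        // Load the small input into memory once, keyed by the join key.
        Map<String, String> lookup = new HashMap<>();
        for (String line : smallSide) {
            String[] fields = line.split(",", 2);
            lookup.put(fields[0], fields[1]);
        }
        // "Map" step: join each large-side record without any shuffle.
        List<String> joined = new ArrayList<>();
        for (String line : largeSide) {
            String[] fields = line.split(",", 2);
            String deptName = lookup.get(fields[1]);
            if (deptName != null) { // inner join: drop records with no match
                joined.add(fields[0] + "," + deptName);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> departments = Arrays.asList("10,Sales", "20,HR");
        List<String> employees = Arrays.asList("asha,10", "ravi,30", "meera,20");
        join(departments, employees).forEach(System.out::println);
    }
}
```

Because the lookup happens inside the map step, no record has to travel through the shuffle to meet its match.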
A reduce-side join is used when the file size is greater than the cache.
The input datasets need not be structured.
It is less efficient because the datasets have to go through the shuffle phase, and records with the same key are brought together in the reducer.
April 2, 2017 at 10:16 pm #3143
In which scenario does the logical split vary?
It varies depending on the format of the input file.
For a text input file, the split falls wherever a newline character occurs.
April 2, 2017 at 10:21 pm #3144
Write a use case for the Partitioner
The partitioner runs in the intermediate step between the map and reduce tasks. It works like a condition applied while processing the dataset. The number of partitions equals the number of reducers; that is, the partitioner divides the data according to the number of reducers.
A partitioner partitions the key-value pairs of the intermediate map outputs. It partitions the data using a user-defined condition, which works like a hash function. If we do not define a partitioner, HashPartitioner is used by default.
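The default behaviour can be sketched in plain Java. Hadoop's HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; the class HashPartitionSketch below is a hypothetical stand-alone copy of that formula and does not use the Hadoop API.

```java
// Hypothetical stand-alone version of the default hash partitioning rule.
final class HashPartitionSketch {

    // Masking with Integer.MAX_VALUE clears the sign bit, so a negative
    // hashCode still yields a partition number in [0, numReduceTasks).
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4; // assumed number of reduce tasks
        for (String key : new String[] {"hadoop", "map", "reduce", "hadoop"}) {
            System.out.println(key + " -> partition " + getPartition(key, reducers));
        }
    }
}
```

Equal keys always land in the same partition, which is what lets a single reducer see every value for a given key.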
April 2, 2017 at 10:24 pm #3145
How can we manage MapReduce processes other than from the terminal?
Besides the terminal, we can manage MapReduce processes using the Cloudera Manager monitoring tool.
April 2, 2017 at 10:28 pm #3146
Write the application of the OutputCollector class
OutputCollector (an interface in the old API) is provided by the MapReduce framework to collect the data output by either the Mapper or the Reducer, i.e. either intermediate outputs or the output of the job.
April 4, 2017 at 9:11 am #3155
How does MR read an XML file?
We define a string for the XML start tag.
We define a string for the XML end tag.
We set the XmlInputFormat class as the input format.
April 4, 2017 at 9:36 am #3156
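Where is the output of shuffle, partition, and sort sorted?
Ans. It is sorted in the Reducer: the map outputs fetched during the shuffle are merged into sorted order on the reduce side before the reduce function runs.
April 4, 2017 at 9:39 am #3158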
Limitations of MapReduce
1. MapReduce is a disk-based process.
2. It is designed for batch processing.
3. Processing a job takes considerable time.
April 4, 2017 at 9:41 am #3159
What are the Mapper job and the Reducer job?
Mapper and Reducer are the two functions of the MapReduce framework.
The Mapper maps input key/value pairs to a set of intermediate key/value pairs (k2, v2).
The Reducer reduces a set of intermediate values that share a key to produce the final output.
April 4, 2017 at 9:46 am #3161
What is split Buffer?
Ans. In MapReduce there are two types of splitting: 1. physical split, 2. logical split.
Physical split: the input is split into blocks depending on the block size.
The default block size is 64 MB in Hadoop 1.x; 128 MB is the default in Hadoop 2.x and the commonly used standard.
Logical split: it depends on the input format of the input file. If it is a text file, the split falls wherever a newline character occurs.
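The two kinds of split can be compared with a small plain-Java sketch. It is a simplification under assumptions: the class SplitSketch is hypothetical, sizes are counted in characters rather than bytes, and each line is assigned to the block in which it starts, which approximates but does not reproduce Hadoop's exact boundary handling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of physical blocks versus line-aligned logical splits.
final class SplitSketch {

    // Physical split: number of fixed-size blocks needed to hold the file.
    static long blockCount(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize; // ceiling division
    }

    // Logical split (simplified): every line goes whole to the block in which
    // its first character falls, so a record is never cut in two.
    static List<List<String>> logicalSplits(String text, int blockSizeChars) {
        List<List<String>> splits = new ArrayList<>();
        long blocks = blockCount(text.length(), blockSizeChars);
        for (long i = 0; i < blocks; i++) {
            splits.add(new ArrayList<>());
        }
        int lineStart = 0;
        while (lineStart < text.length()) {
            int newline = text.indexOf('\n', lineStart);
            int lineEnd = (newline == -1) ? text.length() : newline;
            splits.get(lineStart / blockSizeChars).add(text.substring(lineStart, lineEnd));
            lineStart = lineEnd + 1; // continue after the newline character
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println("blocks for a 200 MB file: " + blockCount(200 * mb, 128 * mb));
        // The second line starts in block 0 and runs past the 8-character boundary.
        System.out.println(logicalSplits("aaaa\nbbbbbb\ncc\n", 8));
    }
}
```

In the sample text the physical boundary after 8 characters falls inside the second line, yet the logical split keeps that line whole.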