Q1. What is the job of Mapper and Reducer?
Ans:- The mapper’s job is to process the input data. Generally the input data is in the form of a file or directory stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data in the form of intermediate key-value pairs.
The Reducer stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
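The division of labour above can be illustrated with a toy word-count sketch in plain Python (no Hadoop required): the mapper turns each input line into (word, 1) pairs, the framework groups the pairs by key, and the reducer processes each group. The function names and sample lines are illustrative, not Hadoop API.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Sum all the counts that the shuffle phase grouped under one word."""
    return (word, sum(counts))

# Simulate the framework: feed the "input file" to the mapper line by line,
# then group the intermediate pairs by key.
lines = ["the quick brown fox", "the lazy dog"]
groups = defaultdict(list)
for line in lines:
    for word, one in mapper(line):
        groups[word].append(one)

# Call the reducer once per key; its output would be written back to HDFS.
result = dict(reducer(w, c) for w, c in groups.items())
print(result)  # e.g. {'the': 2, 'quick': 1, ...}
```

In real Hadoop the grouping step is done by the framework between the map and reduce phases; user code supplies only the two functions.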
Q2. What is an output collector? How does it control the mapper output? Write the application of the OutputCollector class.
Ans:- The output from the mapper or reducer is stored and processed internally and is then transferred to the next stage; these intermediate locations and files cannot be accessed externally. The output collector collects the data emitted by either the mapper or the reducer: user code hands each pair to the OutputCollector via its collect(key, value) method, and the framework controls where those pairs are stored and how they reach the next stage. In this way the key-value pairs generated while processing the mapper's input splits are collected and managed by the output collector.
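A minimal Python stand-in can illustrate the role the answer describes: the mapper never writes its output anywhere itself; it only hands pairs to a collector object, and the framework that owns the collector decides where they go. This class is a toy illustration, not Hadoop's actual OutputCollector.

```python
class ToyOutputCollector:
    """Toy stand-in for Hadoop's OutputCollector: it buffers emitted pairs
    so the framework, not the user code, controls where the output goes."""
    def __init__(self):
        self.pairs = []

    def collect(self, key, value):
        self.pairs.append((key, value))

def map_fn(offset, line, collector):
    # User map code emits pairs only through the collector.
    for word in line.split():
        collector.collect(word, 1)

out = ToyOutputCollector()
map_fn(0, "hadoop stores hadoop data", out)
print(out.pairs)  # [('hadoop', 1), ('stores', 1), ('hadoop', 1), ('data', 1)]
```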
Q3. What are shuffle and sort? How do they work? Give a real-time example. Write a custom shuffling algorithm.
Ans:- After the Map phase and before the beginning of the Reduce phase there is a handoff process known as shuffle and sort. Here, data from the mapper tasks is prepared and moved to the nodes where the reducer tasks will run. When a mapper task completes, its results are sorted by key, partitioned if there are multiple reducers, and then written to disk. Sorting saves time for the reducer, helping it easily recognize when a new reduce call should start: it simply starts a new reduce call when the next key in the sorted input differs from the previous one.
A reducer can run multiple reduce tasks. The merging and sorting of the fetched map outputs are performed locally by each reducer for its own input data.
Shuffling and then sorting a deck of playing cards is a real-time analogy for shuffle and sort.
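The partition-then-sort behaviour described above can be sketched in a few lines of Python. The default scheme hashes each key to pick a reducer (as Hadoop's HashPartitioner does); a "custom shuffling algorithm" just means swapping in a different partition function. The data and function names here are illustrative.

```python
def partition(key, num_reducers):
    # Default scheme: hash the key to choose a reducer.
    # A custom shuffle replaces this function with other routing logic.
    return hash(key) % num_reducers

mapper_output = [("b", 1), ("a", 1), ("c", 1), ("a", 1)]
num_reducers = 2

# Route every intermediate pair to one of the reducer partitions.
partitions = {r: [] for r in range(num_reducers)}
for key, value in mapper_output:
    partitions[partition(key, num_reducers)].append((key, value))

# Sort each partition by key, so a reducer can detect key boundaries by
# simply noticing when the key changes.
for r in partitions:
    partitions[r].sort()

print(partitions)
```

Note that equal keys always land in the same partition, which is what guarantees each reducer sees all values for its keys.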
Q4. What are the external daemons running in standalone mode?
Ans:- In standalone mode there are no external daemons running; Hadoop runs as a single Java process on the local machine.
Q5. How to modify the heartbeat and block report time interval of data node?
Ans:- The heartbeat and block report time intervals can be modified through configuration properties in hdfs-site.xml, for example (values shown are the defaults):
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>
  <description>Determines datanode heartbeat interval in seconds.</description>
</property>
<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>300000</value>
  <description>Determines when machines are marked dead</description>
</property>
In the same way, the block report interval can be changed via the dfs.blockreport.intervalMsec property.
Q6. What is FSimage and what type of metadata does it store?
Ans:- An FSimage file contains the complete state of the file system at a point in time. Every file system modification is assigned a unique, monotonically increasing transaction ID. An FSimage file represents the file system state after all modifications up to a specific transaction ID.
On the master node, the NameNode maintains its metadata in two files: the FSimage and the edit log. The FSimage is loaded when the Hadoop system starts, and it contains the directory structure of the cluster and the layout of the stored data. Afterwards, every transaction that occurs is recorded in the edit log.
After loading the FSimage, the NameNode holds in memory the whole picture of where data is stored.
1. As transactions come in, the information is stored in the edit log.
2. Periodically (by default every hour), the checkpoint node/secondary NameNode retrieves the logs, merges them with the latest FSimage, and keeps the result as a checkpoint. At this point the NameNode has the image in memory, the edit logs are emptied, and the latest checkpoint is stored as an image on the secondary NameNode/checkpoint node.
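The two-step cycle above can be modelled with a toy simulation: a dictionary stands in for the FSimage snapshot, a list of transactions stands in for the edit log, and a checkpoint replays the log into the image. The paths, operations, and data structures here are illustrative, not HDFS internals.

```python
# Toy model of NameNode metadata: the fsimage is a snapshot, the edit log
# records every change since that snapshot.
fsimage = {"/data/file1": 3}  # path -> replication factor
edit_log = [("create", "/data/file2", 3),
            ("delete", "/data/file1", None)]

def checkpoint(image, log):
    """Replay edit-log transactions into the image, as the secondary
    NameNode/checkpoint node does, and return the new image plus an
    emptied log."""
    image = dict(image)  # work on a copy of the snapshot
    for op, path, repl in log:
        if op == "create":
            image[path] = repl
        elif op == "delete":
            image.pop(path, None)
    return image, []

fsimage, edit_log = checkpoint(fsimage, edit_log)
print(fsimage)  # {'/data/file2': 3}
```

Replaying a compact log into a periodic snapshot is much cheaper at restart than replaying every transaction since the cluster was created, which is the point of checkpointing.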
Q7. If 2 TB of data is given, what is the maximum expected metadata generated?
Ans:- The metadata generated for 2 TB of data is tiny compared to the data itself. The NameNode keeps on the order of 150 bytes per file, directory, and block object, so 2 TB of data typically produces only a few megabytes of metadata at most.
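A back-of-the-envelope calculation makes this concrete, assuming the commonly cited rule of thumb of roughly 150 bytes of NameNode memory per object and the default 128 MB block size; both figures are approximations, not exact values.

```python
# Rough estimate of NameNode metadata for 2 TB of data.
data_size = 2 * 1024**4        # 2 TB in bytes
block_size = 128 * 1024**2     # 128 MB blocks (default in recent Hadoop)
bytes_per_object = 150         # rule-of-thumb memory cost per block object

num_blocks = data_size // block_size           # 16384 blocks
metadata_bytes = num_blocks * bytes_per_object
print(metadata_bytes / 1024**2)  # ~2.3 MB for the block objects alone
```

Per-file and per-directory objects add a little more, but the total stays in the low-megabyte range unless the data is split into a huge number of small files.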
Q8. How does using a combiner save work in a job?
Ans:- When a MapReduce job runs on a large dataset, the Hadoop mapper generates large chunks of intermediate data that are passed on to the Hadoop reducer for further processing, which can lead to massive network congestion. The MapReduce framework therefore offers an optional function known as a ‘combiner’, which can play a crucial role in reducing that congestion; in fact, the combiner is also termed a ‘mini-reducer’. Its primary job is to process the output of a mapper before it is passed to a reducer. Often the combiner and the reducer use the same code, which is safe when the reduce function is commutative and associative (as in summing counts). The combiner collects the mapper’s output, aggregates records with the same key together, and transfers the smaller result onward.
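The saving is easy to demonstrate with a toy count: the records a mapper emits versus the records left after a per-mapper combiner has pre-aggregated equal keys. The sample data is illustrative; in word-count-style jobs the combiner is the same summing logic as the reducer.

```python
from collections import Counter

# Raw intermediate output of one mapper: 5 records would cross the network.
mapper_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)]

# A combiner with reducer-style summing logic pre-aggregates equal keys
# locally, before the shuffle ships anything to the reducers.
combined = list(Counter(key for key, _ in mapper_output).items())

print(len(mapper_output), "->", len(combined))  # 5 -> 2 records shipped
print(combined)  # [('a', 3), ('b', 2)]
```

The reducer still produces the same final totals; the combiner just moves part of the aggregation onto the map side so fewer records travel across the network.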