1. What is HeartBeat In Hadoop?

Ans: When an input file is loaded onto the Hadoop cluster, the file is sliced into blocks, and these blocks are distributed among the nodes of the cluster.

Now the Job Tracker and Task Tracker come into the picture. To process the data, the Job Tracker assigns tasks to the Task Trackers. Suppose that while processing is going on, one DataNode in the cluster goes down. The NameNode must know that this DataNode is down; otherwise it cannot continue processing using the replicas. To keep the NameNode aware of the status (active/inactive) of each DataNode, every DataNode sends a "Heart Beat Signal" periodically (every 3 seconds by default); if no heartbeat arrives for 10 minutes, the node is treated as dead. This mechanism is called the HEART BEAT MECHANISM.

Based on these Heart Beat Signals, the Job Tracker assigns tasks only to Task Trackers that are active. If a Task Tracker fails to send a signal within the timeout (10 minutes by default), the Job Tracker treats it as inactive and looks for an idle one to assign the task to. If there are no idle Task Trackers, the Job Tracker must wait until one becomes idle.
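The bookkeeping described above can be sketched as follows. This is an illustrative model only, not Hadoop's actual implementation; the class name, node names, and timestamps are all hypothetical, and the 600-second timeout mirrors the 10-minute default mentioned above.

```python
import time

DEAD_AFTER_SECONDS = 600  # 10 minutes with no heartbeat => node treated as dead

class HeartbeatMonitor:
    """Toy monitor: remembers when each node last checked in."""

    def __init__(self, timeout=DEAD_AFTER_SECONDS):
        self.timeout = timeout
        self.last_seen = {}  # node name -> timestamp of last heartbeat

    def heartbeat(self, node, now=None):
        # Record a heartbeat; `now` lets the example use fake clocks.
        self.last_seen[node] = time.time() if now is None else now

    def active_nodes(self, now=None):
        # A node is active only if it heartbeated within the timeout window.
        now = time.time() if now is None else now
        return [n for n, t in self.last_seen.items() if now - t < self.timeout]

monitor = HeartbeatMonitor()
monitor.heartbeat("datanode-1", now=0)
monitor.heartbeat("datanode-2", now=500)
# At t=650, datanode-1 has been silent for 650s (> 600) and is dropped.
print(monitor.active_nodes(now=650))  # ['datanode-2']
```

The scheduler side works the same way: the Job Tracker consults the active list before assigning a task.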

2. What Are The Side Data Distribution Techniques?

Ans: Side data refers to extra, static, small data that a MapReduce job needs in order to do its work. The main challenge is making the side data available on the node where each map task executes. Hadoop provides two side data distribution techniques.

Using Job Configuration

An arbitrary key/value pair can be set in the job configuration. This is a useful technique for small files; the suggested size of data to keep in the configuration object is a few kilobytes, because the configuration object is read by the Job Tracker, the Task Trackers, and every new child JVM, so large values add overhead on every front. Apart from this, side data requires serialization if it has a non-primitive encoding.

Distributed Cache

Rather than serializing side data in the job configuration, it is preferable to distribute datasets using Hadoop’s distributed cache mechanism. This provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run. To save network bandwidth, files are normally copied to any particular node once per job.
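A map-side join against cached side data can be sketched in a streaming-style mapper. In a real job the file would be shipped with the job (for example via the streaming `-files` option) and read from the task's working directory; here the file contents, record layout, and function names are all hypothetical, and an in-memory file stands in for the cached copy.

```python
import io

def load_side_data(fileobj):
    """Parse 'code<TAB>name' lines into a lookup dict, loaded once per task."""
    lookup = {}
    for line in fileobj:
        code, name = line.rstrip("\n").split("\t")
        lookup[code] = name
    return lookup

def mapper(records, lookup):
    """Emit (country name, 1) for each record's country code."""
    for rec in records:
        code = rec.split(",")[1]
        yield lookup.get(code, "UNKNOWN"), 1

# Simulate the cached file the distributed cache would place on the node.
side = load_side_data(io.StringIO("IN\tIndia\nUS\tUnited States\n"))
out = list(mapper(["order1,IN", "order2,US", "order3,XX"], side))
print(out)  # [('India', 1), ('United States', 1), ('UNKNOWN', 1)]
```

The point of the pattern is that the (small) side data is loaded once per task, while the (large) main input streams through the mapper record by record.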

3. What Is Shuffling In Map Reduce?

Ans: The process of moving map outputs to the reducers is known as shuffling. A different subset of the intermediate key space is assigned to each reduce node; these subsets (known as "partitions") are the inputs to the reduce tasks. Each map task may emit (key, value) pairs to any partition; all values for the same key are always reduced together regardless of which mapper is their origin. Therefore, the map nodes must all agree on where to send each piece of the intermediate data.

4. List Hadoop’s three configuration files?

Ans: core-site.xml, hdfs-site.xml, and mapred-site.xml.

5. What Is a “Map” In Hadoop?

Ans: The Map is responsible for reading data from the input location and, based on the input type, generating key/value pairs that form the intermediate output on the local machine.

6. What Is a “Reducer” In Hadoop?

Ans: The Reducer is responsible for processing the intermediate output received from the Mapper and generating the final output.

7. What Are The Parameters Of Mappers And Reducers?

Ans: We have 4 basic parameters of a Mapper:

1. LongWritable (input key)
2. Text (input value)
3. Text (output key)
4. IntWritable (output value)

We have 4 parameters of a Reducer:

1. Text (input key)
2. IntWritable (input value)
3. Text (output key)
4. IntWritable (output value)

These are the types used in the classic word-count job; they vary with the job's input format and output types.

8. How can we change the split size if our commodity hardware has less storage space?

Ans: If our commodity hardware has less storage space, we can change the split size by writing a ‘custom splitter’. Hadoop offers this customization, which can be invoked from the main method.

9. Can we rename the output file?

Ans: Yes, we can rename the output file by implementing the MultipleOutputFormat class.

10. What is Streaming?

Ans: Streaming is a feature of the Hadoop framework that allows us to write MapReduce programs in any programming language that can read standard input and produce standard output. It could be Perl, Python, or Ruby, and need not be Java. However, customization in MapReduce can only be done using Java and not any other programming language.

11. What is a Combiner?

Ans: A ‘Combiner’ is a mini-reducer that performs the local reduce task. It receives the input from the mapper on a particular node and sends its output on to the reducer. Combiners improve the efficiency of MapReduce by reducing the amount of data that must be sent to the reducers.

12. What is the difference between an HDFS Block and an Input Split?

Ans: An HDFS Block is the physical division of the data, while an Input Split is the logical division of the data.
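The distinction shows up when a record crosses a block boundary: HDFS cuts the file at fixed byte offsets regardless of record boundaries, while an input split extends to the end of the straddling record so no record is processed half on one node and half on another. The sketch below is illustrative; the tiny block size and record contents are made up (real HDFS blocks are typically 128 MB).

```python
def physical_blocks(data, block_size):
    """HDFS slices the file at fixed byte offsets, even mid-record."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def logical_splits(data, block_size):
    """An input split stretches to the end of the record that crosses
    the block boundary, so every split holds whole records."""
    splits, start = [], 0
    while start < len(data):
        end = min(start + block_size, len(data))
        nl = data.find("\n", end - 1)  # extend to the next record boundary
        end = len(data) if nl == -1 else nl + 1
        splits.append(data[start:end])
        start = end
    return splits

data = "rec-one\nrec-two\nrec-three\n"
blocks = physical_blocks(data, 10)   # 3 blocks; records are cut mid-way
splits = logical_splits(data, 10)    # 2 splits; every record stays whole
```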

13. What happens in a TextInputFormat?

Ans: In TextInputFormat, each line in the text file is a record. The key is the byte offset of the line and the value is the content of the line. For instance, key: LongWritable, value: Text.

14. What do you know about KeyValueTextInputFormat?

Ans: In KeyValueTextInputFormat, each line in the text file is a ‘record’. The first separator character divides each line: everything before the separator is the key, and everything after it is the value. For instance, key: Text, value: Text.
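The two record-parsing rules above can be contrasted on the same input. This is a conceptual model of how each format produces records, not Hadoop's RecordReader code; the function names and sample data are ours, and the tab separator matches KeyValueTextInputFormat's default.

```python
def text_input_format(data):
    """TextInputFormat: key = byte offset of the line, value = the line."""
    records, offset = [], 0
    for line in data.splitlines(True):  # keep newlines to count bytes
        records.append((offset, line.rstrip("\n")))
        offset += len(line)
    return records

def key_value_text_input_format(data, separator="\t"):
    """KeyValueTextInputFormat: split each line at the first separator."""
    records = []
    for line in data.splitlines():
        key, _, value = line.partition(separator)
        records.append((key, value))
    return records

sample = "alpha\tfirst\nbeta\tsecond\n"
print(text_input_format(sample))       # [(0, 'alpha\tfirst'), (12, 'beta\tsecond')]
print(key_value_text_input_format(sample))  # [('alpha', 'first'), ('beta', 'second')]
```

Same file, two different key/value pairings: byte offsets versus the text on either side of the first tab.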

15. What is MapReduce?

Ans: It is a framework, or programming model, that is used for processing large data sets over clusters of computers using distributed programming.