November 3, 2014 at 1:51 pm #1032
(1)What is the difference between the Secondary NameNode, the Checkpoint Node and the Backup Node?
The Secondary NameNode is a poorly named component of Hadoop.
The NameNode stores the HDFS filesystem information in a file named fsimage. Updates to the file system (adding or removing blocks) do not update the fsimage file; instead they are logged to a file, so the I/O is fast, append-only streaming rather than random file writes. When restarting, the NameNode reads the fsimage and then applies all the changes from the log file to bring the filesystem state up to date in memory. This process takes time.
The Secondary NameNode's job is not to be a standby for the NameNode, but only to periodically read the filesystem changes log and apply the changes to the fsimage file, thus bringing it up to date. This allows the NameNode to start up faster next time.
The Backup Node in Hadoop is an extended Checkpoint Node that performs checkpointing and also supports online streaming of file system edits.
Its advantage over the Checkpoint Node is that the namespace in its main memory is always in sync with the primary NameNode, since it maintains an up-to-date in-memory copy of the file system namespace.
A Checkpoint Node creates checkpoints by downloading the fsimage and edit log files from the active primary NameNode, merging the two locally, and saving the new image on its local file system.
So checkpoint creation on a Backup Node, which does not need to download these files, will always be faster than on a Checkpoint Node.
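To make the merge step concrete, here is a minimal Python sketch of the idea, not Hadoop's actual code. The dictionary "image", the tuple-based edit log and the function names apply_edits and checkpoint are all assumptions made up for illustration.

```python
def apply_edits(fsimage, edit_log):
    """Replay logged operations on top of a namespace snapshot."""
    namespace = dict(fsimage)  # copy, so the old image is left untouched
    for op, path, blocks in edit_log:
        if op == "add":
            namespace[path] = blocks
        elif op == "remove":
            namespace.pop(path, None)
    return namespace


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a new image and return it with an empty log."""
    return apply_edits(fsimage, edit_log), []


fsimage = {"/a.txt": ["blk_1"]}
edit_log = [("add", "/b.txt", ["blk_2", "blk_3"]), ("remove", "/a.txt", None)]
new_image, new_log = checkpoint(fsimage, edit_log)
print(new_image)
```

After a checkpoint the log is empty, so a restart only has to load the image instead of replaying a long list of edits.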
(3)What is shuffling in MapReduce?
MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers the map outputs to the reducers as inputs is known as the shuffle.
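A rough Python sketch of what the shuffle achieves: map outputs are partitioned by key, and each reducer sees its keys in sorted order with the values grouped. The partitioning rule (sum of character codes modulo the number of reducers) is a toy assumption chosen to be deterministic; it is not Hadoop's partitioner.

```python
from collections import defaultdict


def shuffle(map_outputs, num_reducers):
    """Partition (key, value) pairs by key, then sort and group each partition."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_outputs:
        # Toy partitioner (an assumption): character-code sum modulo reducers
        index = sum(ord(c) for c in key) % num_reducers
        partitions[index][key].append(value)
    # Each reducer receives its keys in sorted order
    return [sorted(p.items()) for p in partitions]


map_outputs = [("pear", 1), ("apple", 1), ("lime", 1), ("pear", 1)]
reducer_inputs = shuffle(map_outputs, 2)
for i, pairs in enumerate(reducer_inputs):
    print(i, pairs)
```

All values for a given key end up at the same reducer, which is what lets the reducer compute a result per key.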
(5)Can we change the file cached by DistributedCache?
No. DistributedCache tracks cached files by their timestamps, so a cached file should not be changed during job execution.
(6)What if the JobTracker machine is down?
The JobTracker is a single point of failure from the execution point of view: if it goes down, all currently active jobs stop.
(7)Can we deploy the JobTracker on a machine other than the NameNode?
Yes, in production it is highly recommended. For self-study and development you may set it up according to your needs.
(8)What are the four modules that make up the Apache Hadoop framework?
• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
(9)Which modes can Hadoop be run in? List a few features for each mode.
• Standalone, or local mode, which is one of the least commonly used environments. When it is used, it’s usually only for running MapReduce programs. Standalone mode lacks a distributed file system, and uses a local file system instead.
• Pseudo-distributed mode, which runs all daemons on a single machine. It is most commonly used in QA and development environments.
• Fully distributed mode, which is most commonly used in production environments. Unlike pseudo-distributed mode, fully distributed mode runs all daemons on a cluster of machines rather than a single one.
(10)Where are Hadoop’s configuration files located?
Hadoop’s configuration files can be found inside the conf sub-directory.
(11)List Hadoop’s three configuration files.
The three main configuration files are core-site.xml, hdfs-site.xml and mapred-site.xml.
(12)What are “slaves” and “masters” in Hadoop?
NameNode, Secondary NameNode and JobTracker are masters; DataNodes and TaskTrackers are slaves.
(13)How many datanodes can run on a single Hadoop cluster?
Hadoop slave nodes contain only one datanode process each.
(14)What is job tracker in Hadoop?
Job tracker is used to submit and track jobs in MapReduce.
(15)How many job tracker processes can run on a single Hadoop cluster?
Only one JobTracker process can run on a single Hadoop cluster (just as only one DataNode process runs on each slave node). The JobTracker runs in its own Java virtual machine process. If the JobTracker goes down, all currently active jobs stop.
(16)What sorts of actions does the job tracker process perform?
• Client applications send the job tracker jobs.
• Job tracker determines the location of data by communicating with Namenode.
• Job tracker finds task tracker nodes that have open slots at or near the data.
• Job tracker submits the job to task tracker nodes.
• Job tracker monitors the task tracker nodes for signs of activity. If there is not enough activity, job tracker transfers the job to a different task tracker node.
• Job tracker receives a notification from task tracker if the job has failed. From there, job tracker might submit the job elsewhere, as described above. If it doesn’t do this, it might blacklist either the job or the task tracker.
(17)How does job tracker schedule a job for the task tracker?
When a client application submits a job to the job tracker, the job tracker looks for a task tracker with a free slot, preferably on the server that hosts the datanode holding the data, and schedules the task there.
(18)What does the mapred.job.tracker command do?
Despite the wording of the question, mapred.job.tracker is a configuration property rather than a command. It specifies the host and port at which the job tracker runs.
(19)What is “PID”?
PID stands for Process ID.
(20)What is “jps”?
jps lists the Java processes running on the machine, so it is used to check whether your task tracker, job tracker, datanode and Namenode are working.
(21)Is there another way to check whether Namenode is working?
Besides the jps command, you can also use: /etc/init.d/hadoop-0.20-namenode status.
(22)How would you restart Namenode?
To restart Namenode, you could either write:
• sudo hdfs
• /etc/init.d/ha, press enter, then /etc/init.d/hadoop-0.10-namenode start
and then press Enter, or you could simply run stop-all.sh followed by start-all.sh.
(23)What is “fsck”?
fsck stands for File System Check.
(24)What is a “map” in Hadoop?
A map reads data from an input location, and outputs a key value pair according to the input type.
(25)What is a “reducer” in Hadoop?
A reducer collects the output generated by the mapper, processes it, and creates a final output of its own.
(26)What are the parameters of mappers and reducers?
The four parameters for mappers are:
• LongWritable (input)(K1)
• Text (input)(V1)
• Text (intermediate output)(K2)
• IntWritable (intermediate output)(V2)
The four parameters for reducers are:
• Text (intermediate output)(K2)
• IntWritable (intermediate output)(V2)
• Text (final output)(K3)
• IntWritable (final output)(V3)
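The type parameters above match a word-count style job. As a hedged illustration in plain Python (no Hadoop classes involved, and the function names mapper, reducer and run_job are made up here), the K1/V1, K2/V2 and K3/V3 roles look roughly like this:

```python
from itertools import groupby


def mapper(offset, line):
    """(K1 = byte offset, V1 = line) -> pairs of (K2 = word, V2 = 1)."""
    for word in line.split():
        yield word, 1


def reducer(word, counts):
    """(K2 = word, iterable of V2) -> (K3 = word, V3 = total)."""
    return word, sum(counts)


def run_job(lines):
    intermediate = []
    offset = 0
    for line in lines:
        intermediate.extend(mapper(offset, line))
        offset += len(line.encode("utf-8")) + 1  # +1 for the newline
    intermediate.sort(key=lambda kv: kv[0])  # stand-in for the shuffle
    return [reducer(word, (count for _, count in group))
            for word, group in groupby(intermediate, key=lambda kv: kv[0])]


print(run_job(["a b a", "b c"]))
```

The integer offset plays the part of LongWritable, the strings play the part of Text, and the counts play the part of IntWritable.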
(27)Is it possible to rename the output file, and if so, how?
We have not covered this yet.
(28)List the network requirements for using Hadoop.
• Secure Shell (SSH) for launching server processes
• Password-less SSH connection
(29)Which port does SSH work on?
SSH works on the default port number, 22.
(30)What is streaming in Hadoop?
Streaming is a feature that lets engineers write MapReduce code in any language, as long as that programming language can read from standard input and write to standard output. Even though Hadoop is Java-based, the chosen language doesn't have to be Java. It can be Perl, Ruby, etc. If you want to use customization in MapReduce, however, Java must be used.
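For example, a streaming mapper can be an ordinary script that reads lines and writes tab-separated key/value lines. The sketch below is an assumption-laden illustration: the function name stream_mapper is made up, and in-memory streams stand in for standard input and output so it can be run anywhere. In a real streaming script you would pass sys.stdin and sys.stdout instead.

```python
import io


def stream_mapper(stdin, stdout):
    """Read lines from stdin and emit one 'word<TAB>1' line per word."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")


# In-memory streams stand in for standard input and standard output here
out = io.StringIO()
stream_mapper(io.StringIO("to be or\nnot to be\n"), out)
print(out.getvalue(), end="")
```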
(31)What is the difference between Input Split and an HDFS Block?
InputSplit and HDFS Block both refer to the division of data, but InputSplit handles the logical division while HDFS Block handles the physical division.
(32)What does the file hadoop-metrics.properties do?
The hadoop-metrics.properties file controls reporting in Hadoop.
(33)Name the most common Input Formats defined in Hadoop? Which one is default?
TextInputFormat is the Hadoop default.
(34)What is the difference between TextInputFormat and KeyValueInputFormat class?
TextInputFormat: Reads lines of text files and provides the byte offset of the line as the key to the mapper and the actual line as the value.
KeyValueInputFormat (the class is named KeyValueTextInputFormat in Hadoop): Reads a text file and parses lines into key/value pairs. Everything up to the first tab character is sent as the key to the mapper and the remainder of the line is sent as the value.
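The difference is easy to see on a small sample. This is a Python sketch of the two record styles described above, not the Hadoop classes themselves; text_records and key_value_records are hypothetical helper names.

```python
def text_records(data: bytes):
    """Yield (byte offset, line) pairs, as described for TextInputFormat."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def key_value_records(data: bytes, separator="\t"):
    """Yield (key, value) pairs split at the first separator in each line."""
    for _, line in text_records(data):
        key, _, value = line.partition(separator)
        yield key, value


data = b"k1\tv1\tx\nk2\tv2\n"
print(list(text_records(data)))
print(list(key_value_records(data)))
```

Note that only the first tab splits the line, so the value of the first record still contains a tab.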
(35)What is InputSplit in Hadoop?
When a Hadoop job is run, it splits the input files into chunks and assigns each chunk to a mapper to process. Each such chunk is called an InputSplit.
(36)How is the splitting of file invoked in Hadoop framework
It is invoked by the Hadoop framework, which runs the getSplits() method of the InputFormat class (like FileInputFormat) defined by the user.
(37)Consider this scenario: in an M/R system,
– HDFS block size is 64 MB
– Input format is FileInputFormat
– We have 3 files of size 64 KB, 65 MB and 127 MB
How many input splits will be made by the Hadoop framework?
Hadoop will make 5 splits as follows:
– 1 split for the 64 KB file
– 2 splits for the 65 MB file
– 2 splits for the 127 MB file
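The count above follows from a simplified "one split per started block" model, which can be checked with a few lines of Python. This is only that simplified model; the real FileInputFormat also takes configured minimum and maximum split sizes into account and may treat a small trailing remainder differently, so actual counts on a cluster can differ. The name count_splits is made up for this sketch.

```python
KB = 1024
MB = 1024 * KB
BLOCK_SIZE = 64 * MB


def count_splits(file_size, split_size=BLOCK_SIZE):
    """Number of splits if every started block becomes one split."""
    return -(-file_size // split_size)  # ceiling division on integers


file_sizes = [64 * KB, 65 * MB, 127 * MB]
per_file = [count_splits(size) for size in file_sizes]
print(per_file, sum(per_file))
```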
(38)What is the purpose of RecordReader in Hadoop?
The InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the Input Format.
(41)What is JobTracker?
JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster.
(42)What are some typical functions of Job Tracker?
The following are some typical tasks of JobTracker:-
– Accepts jobs from clients
– It talks to the NameNode to determine the location of the data.
– It locates TaskTracker nodes with available slots at or near the data.
– It submits the work to the chosen TaskTracker nodes and monitors progress of each task by receiving heartbeat signals from Task tracker.
(43)What is TaskTracker?
TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.
(44)What is the relationship between Jobs and Tasks in Hadoop?
One to many: one job is broken down into one or many tasks in Hadoop.
(46)Suppose Hadoop spawned 100 tasks for a job and one of the task failed. What will Hadoop do?
It will restart the task again on some other TaskTracker and only if the task fails more than four (default setting and can be changed) times will it kill the job.
(47)Hadoop achieves parallelism by dividing the tasks across many nodes, so it is possible for a few slow nodes to rate-limit the rest of the program and slow it down. What mechanism does Hadoop provide to combat this?
Speculative execution.
(48)How does speculative execution work in Hadoop?
The JobTracker makes different TaskTrackers process the same input. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.
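The "first copy to finish wins" rule can be sketched in a few lines of Python. The tracker names, the finish times and the function resolve_speculation are all invented for illustration; this says nothing about how Hadoop decides when to launch a speculative copy.

```python
def resolve_speculation(attempts):
    """Pick the first attempt to finish; the rest are to be abandoned.

    attempts maps a (hypothetical) TaskTracker name to its finish time in seconds.
    """
    winner = min(attempts, key=attempts.get)
    abandoned = sorted(t for t in attempts if t != winner)
    return winner, abandoned


attempts = {"tracker-a": 42.0, "tracker-b": 17.5, "tracker-c": 60.0}
winner, abandoned = resolve_speculation(attempts)
print(winner, abandoned)
```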
(49)Using command line in Linux, how will you
– See all jobs running in the Hadoop cluster
– Kill a job?
hadoop job -list
hadoop job -kill jobID
(50)What is Hadoop Streaming?
(52)What is Distributed Cache in Hadoop?
Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.
Using the distributed cache is much faster than having each task read the file from HDFS. The file is copied to the task trackers at the start of the job, and whether a task tracker then runs 10 or 100 mappers or reducers, they all use the same local copy. On the other hand, if you put code in the MR job to read the file from HDFS, every mapper will try to access it from HDFS, so a TaskTracker that runs 100 map tasks will read this file 100 times from HDFS. HDFS is also not very efficient when used like this.
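The saving is simple arithmetic, shown here as a small Python sketch. The tracker names and task counts are made-up numbers, and the sketch assumes one cached copy per task tracker that runs at least one task.

```python
def hdfs_reads_without_cache(tasks_per_tracker):
    """Every task reads the file from HDFS itself."""
    return sum(tasks_per_tracker.values())


def copies_with_cache(tasks_per_tracker):
    """One local copy per tracker that runs at least one task (assumption)."""
    return sum(1 for n in tasks_per_tracker.values() if n > 0)


tasks = {"tracker-a": 100, "tracker-b": 10, "tracker-c": 0}
print(hdfs_reads_without_cache(tasks), copies_with_cache(tasks))
```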