March 4, 2015 at 6:10 pm #1260
(1)What is the difference between the Secondary NameNode (a poorly named component of Hadoop), the Checkpoint node, and the Backup node?
The NameNode is the master of an HDFS cluster; it manages the file system namespace and regulates access to files. It maintains the namespace and the metadata for all files and directories in HDFS, holding this information in files on its local disk. It persists two files: the namespace image (FsImage) and the edit log.
The Secondary NameNode is deprecated; its role is now played by the Checkpoint node, which Hadoop versions from 2.0 onward support.
However, the Secondary NameNode and the Backup node are not the same. The Backup node performs the same checkpointing, and does one thing more than the Secondary/Checkpoint node: it maintains an up-to-date copy of the FsImage in memory (RAM), always synchronized with the NameNode, so there is no need to copy the FsImage and edit log from the NameNode.
Because the Backup node keeps the up-to-date namespace in RAM, the Backup node and the NameNode should have the same amount of RAM.
(2)What are the Side Data Distribution Techniques?
(3)What is shuffling in map reduce?
Shuffling is the process by which intermediate data from the mappers is transferred to the reducers. Each reducer receives one or more keys and their associated values, with keys distributed across the reducers (ideally for a balanced load).
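The key-routing and grouping described above can be sketched in a few lines. This is a toy in-memory model for illustration only, not the Hadoop implementation; the function name and structure are invented here:

```python
from collections import defaultdict

# Toy model of the shuffle: each mapper (key, value) pair is routed to
# one reducer by hashing the key, then values are grouped per key and
# the keys are sorted before the reduce phase sees them.
def shuffle(mapper_outputs, num_reducers):
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapper_outputs:
        r = (hash(key) & 0x7FFFFFFF) % num_reducers  # non-negative index
        partitions[r][key].append(value)
    # each reducer sees its keys in sorted order
    return [sorted(p.items()) for p in partitions]

pairs = [("cat", 1), ("dog", 1), ("cat", 1), ("emu", 1)]
reducer_inputs = shuffle(pairs, 2)
grouped = {k: vs for part in reducer_inputs for k, vs in part}
```

Note that all values for a given key always land in the same partition, which is the guarantee the real shuffle provides.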
(4)What is partitioning?
(5)Can we change the file cached by Distributed Cache
(6)What if job tracker machine is down?
(7)Can we deploy job tracker other than name node?
(8)What are the four modules that make up the Apache Hadoop framework?
Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users’ applications; and
Hadoop MapReduce – a programming model for large scale data processing.
(9)Which modes can Hadoop be run in? List a few features for each mode.
• Standalone, or local mode, which is one of the least commonly used environments. When it is used, it’s usually only for running MapReduce programs. Standalone mode lacks a distributed file system, and uses a local file system instead.
• Pseudo-distributed mode, which runs all daemons on a single machine. It is most commonly used in QA and development environments.
• Fully distributed mode, which is most commonly used in production environments. Unlike pseudo-distributed mode, fully distributed mode runs all daemons on a cluster of machines rather than a single one.
(10)Where are Hadoop’s configuration files located?
Hadoop’s configuration files can be found inside the conf sub-directory of the installation (etc/hadoop in Hadoop 2.x).
(11)List Hadoop’s three configuration files.
core-site.xml, hdfs-site.xml, and mapred-site.xml.
(12)What are “slaves” and “masters” in Hadoop?
In Hadoop, the slaves file lists the hosts that run task tracker and data node daemons; the masters file lists the host(s) that run the Secondary NameNode.
(13)How many Name nodes can run on a single Hadoop cluster?
Only one Name node process can run on a single Hadoop cluster. The file system goes offline if this Name node goes down.
(14)What is job tracker in Hadoop?
Job tracker is used to submit and track jobs in Map Reduce.
(15)How many job tracker processes can run on a single Hadoop cluster?
Like the Name node, there can be only one job tracker process running on a single Hadoop cluster. The job tracker runs in its own Java virtual machine process. If the job tracker goes down, all currently active jobs stop.
(16)What sorts of actions does the job tracker process perform?
Client applications send the job tracker jobs.
Job tracker determines the location of data by communicating with Name node.
Job tracker finds task tracker nodes that have open slots for the data.
Job tracker submits the job to task tracker nodes.
Job tracker monitors the task tracker nodes for signs of activity. If there is not enough activity, job tracker transfers the job to a different task tracker node.
Job tracker receives a notification from task tracker if the job has failed. From there, job tracker might submit the job elsewhere, as described above. If it doesn’t do this, it might blacklist either the job or the task tracker.
(17)How does job tracker schedule a job for the task tracker?
When a client application submits a job to the job tracker, the job tracker looks for a task tracker with a free slot, preferably on the server that holds the relevant data node, and schedules the task there.
(18)What does the mapred.job.tracker property do?
The mapred.job.tracker property specifies the host and port at which the JobTracker runs; MapReduce clients use it to locate the JobTracker. If it is set to the default value, local, jobs run in a single in-process runner.
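In classic (MRv1) Hadoop this property is set in mapred-site.xml; a typical entry looks like the fragment below, where the hostname and port are placeholders:

```xml
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker-host:9001</value>
</property>
```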
(19)What is “PID”?
PID stands for Process ID.
(20)What is “JPS”?
jps is a JVM command that lists running Java processes; it is used to check whether your task tracker, job tracker, data node, and Name node daemons are running.
(21)Is there another way to check whether Name node is working?
Besides the JPS command, you can also use: /etc/init.d/hadoop-0.20-namenode status.
(22)How would you restart Name node?
To restart the Name node, you could either run:
/etc/init.d/hadoop-0.20-namenode stop
and then:
/etc/init.d/hadoop-0.20-namenode start
or simply run stop-all.sh followed by start-all.sh.
(23)What is “fsck”?
fsck stands for File System Check; in Hadoop, the hadoop fsck command reports on the health of HDFS files and blocks.
(24)What is a “map” in Hadoop?
In Hadoop, a map is one phase of solving a query over HDFS data. A map reads data from an input location and outputs key/value pairs according to the input type.
(25)What is a “reducer” in Hadoop?
In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.
(26)What are the parameters of mappers and reducers?
The four parameters for mappers (in the classic word-count example) are:
• LongWritable (input key)
• Text (input value)
• Text (intermediate output key)
• IntWritable (intermediate output value)
The four parameters for reducers are:
• Text (intermediate input key)
• IntWritable (intermediate input value)
• Text (final output key)
• IntWritable (final output value)
(27)Is it possible to rename the output file, and if so, how?
Yes, it is possible to rename the output file by utilizing a multiple-output format class (for example, MultipleOutputs).
(28)List the network requirements for using Hadoop.
Secure Shell (SSH) for launching server processes
Password-less SSH connection
(29)Which port does SSH work on?
SSH works on the default port number, 22.
(30)What is streaming in Hadoop?
As part of the Hadoop framework, streaming is a feature that lets engineers write Map Reduce code in any language, as long as that language can read from standard input and write to standard output. Even though Hadoop is Java-based, the chosen language doesn’t have to be Java. It can be Perl, Ruby, etc. To customize Hadoop’s internals in Map Reduce, however, Java must be used.
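As a sketch of the streaming contract, here is a minimal word-count pair in Python. The function names are illustrative; in a real streaming job each function would live in its own script, reading lines from sys.stdin and printing its output, and be passed via the -mapper and -reducer options:

```python
def mapper(lines):
    # emit one tab-separated (word, 1) record per token
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(lines):
    # streaming delivers lines sorted by key; sum counts per word
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)
```

The tab-separated key/value convention and the sorted input to the reducer are exactly what the streaming utility guarantees.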
(31)What is the difference between Input Split and an HDFS Block?
Input Split and HDFS Block both refer to the division of data, but Input Split handles the logical division while HDFS Block handles the physical division.
(32)What does the file hadoop-metrics.properties do?
The hadoop-metrics.properties file controls reporting in Hadoop.
(33)Name the most common Input Formats defined in Hadoop? Which one is default?
The most common Input Formats defined in Hadoop are TextInputFormat, KeyValueInputFormat, and SequenceFileInputFormat.
TextInputFormat is the Hadoop default.
(34)What is the difference between TextInputFormat and keyValueInputFormat class?
TextInputFormat: reads lines of text files and provides the byte offset of the line as the key and the line’s contents as the value to the mapper.
KeyValueInputFormat: reads text files and parses each line into a key/value pair. Everything up to the first tab character is sent as the key to the mapper and the remainder of the line is sent as the value.
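The per-line split rule that KeyValueInputFormat applies can be sketched as follows (an illustrative re-implementation, not the Hadoop class itself):

```python
# Split a line at the first separator; everything before it is the key,
# everything after it is the value. A line with no separator becomes a
# key with an empty value, matching KeyValueInputFormat's behavior.
def split_key_value(line, separator="\t"):
    if separator in line:
        key, _, value = line.partition(separator)
    else:
        key, value = line, ""
    return key, value
```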
(35)What is Input Split in Hadoop?
When a Hadoop job runs, it splits input files into chunks and assigns each split to a mapper for processing. Each such chunk is called an Input Split.
(36)How is the splitting of file invoked in Hadoop Framework?
It is invoked by the Hadoop framework by running the getSplits() method of the InputFormat class (such as FileInputFormat) configured by the user.
(37)Consider this case scenario: in an M/R system,
– HDFS block size is 64 MB
– Input format is FileInputFormat
– We have 3 files of size 64 KB, 65 MB and 127 MB
How many input splits will be made by the Hadoop framework?
Hadoop will make 5 splits as follows:
– 1 split for the 64 KB file
– 2 splits for the 65 MB file
– 2 splits for the 127 MB file
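The counting above amounts to a ceiling division of each file's size by the split size. The sketch below reproduces that arithmetic with the split size fixed at the 64 MB block size; note that real FileInputFormat also applies a slack factor to the last split, which this simplification ignores:

```python
import math

# One split per block-sized chunk; files smaller than a block
# still get one split of their own.
def num_splits(file_size_bytes, block_size=64 * 1024 * 1024):
    return max(1, math.ceil(file_size_bytes / block_size))

MB = 1024 * 1024
sizes = [64 * 1024, 65 * MB, 127 * MB]   # 64 KB, 65 MB, 127 MB
total = sum(num_splits(s) for s in sizes)
```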
(38)What is the purpose of Record Reader in Hadoop?
The InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.
(39)After the Map phase finishes, the Hadoop framework does “Partitioning, Shuffle and sort”. Explain what happens in this phase?
It is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine for all of its output (key, value) pairs which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same.
After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.
Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.
(40)If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to the reducer?
The default partitioner computes a hash value for the key and assigns the partition based on this result.
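Hadoop's default HashPartitioner computes partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The sketch below mirrors that rule in Python, re-implementing Java's String.hashCode so the arithmetic matches what the Java partitioner would compute for string keys:

```python
# Java's String.hashCode: h = 31*h + ch over the characters,
# in signed 32-bit arithmetic.
def java_string_hashcode(s):
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

# HashPartitioner rule: mask to a non-negative value, then take the
# remainder modulo the number of reduce tasks.
def default_partition(key, num_reduce_tasks):
    return (java_string_hashcode(key) & 0x7FFFFFFF) % num_reduce_tasks
```

Because the hash depends only on the key, every occurrence of a key, from any mapper, lands in the same partition.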
(41)What is Job Tracker?
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.
1. Client applications submit jobs to the Job tracker.
2. The JobTracker talks to the NameNode to determine the location of the data
3. The JobTracker locates TaskTracker nodes with available slots at or near the data
4. The JobTracker submits the work to the chosen TaskTracker nodes.
5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
6. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
7. When the work is completed, the JobTracker updates its status.
8. Client applications can poll the JobTracker for information.
The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.
(42)What are some typical functions of Job Tracker?
1. Accepts jobs from clients
2. It talks to the NameNode to determine the location of the data.
3. It locates TaskTracker nodes with available slots at or near the data.
4. It submits the work to the chosen TaskTracker nodes and monitors progress of each task by receiving heartbeat signals from Task tracker.
(43)What is Task Tracker?
A Task Tracker is a node in the cluster that accepts tasks – Map, Reduce and Shuffle operations – from a Job Tracker.
(44)What is the relationship between Jobs and Tasks in Hadoop?
(46)Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?
(47)Hadoop achieves parallelism by dividing the tasks across many nodes; it is possible for a few slow nodes to rate-limit the rest of the program and slow it down. What mechanism does Hadoop provide to combat this?
Speculative execution.
(48)How does speculative execution work in Hadoop?
(49)Using command line in Linux, how will you
– See all jobs running in the Hadoop cluster
On Linux: $ bin/hadoop job -list
– Kill a job?
On Linux: $ bin/hadoop job -kill <job-id>
(50)What is Hadoop Streaming?
Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.