#1262

rhiddhiman
Participant

(101)What is a rack?
A rack is a collection of nodes (machines) that are physically mounted together and connected to the same network switch.

(102)On what basis data will be stored on a rack?
When the client is ready to load a file into the cluster, the content of the file is divided into blocks. The client then consults the NameNode and gets 3 DataNodes for every block of the file, which indicates where the block and its replicas should be stored. While placing the replicas on the DataNodes, the key rule followed is “for every block of data, two copies will exist in one rack and the third copy in a different rack“. This rule is known as the “Replica Placement Policy“.

(103)Do we need to place 2nd and 3rd data in rack 2 only?
Yes. Placing replicas on a second rack protects the data not only against DataNode failure but also against the failure of an entire rack.

(104)What if rack 2 and the DataNode in rack 1 both fail?
If both rack 2 and the DataNode in rack 1 fail, there is no way to recover that data. To avoid such situations, we need to replicate the data more than the default three times; this is done by increasing the replication factor, which is set to 3 by default.

(105)What is a Secondary Namenode? Is it a substitute for the Namenode?
The Secondary NameNode is a helper daemon in HDFS. It periodically merges the NameNode's edit log with the fsimage (checkpointing) and keeps a copy of this checkpointed metadata. If the NameNode fails, the checkpointed metadata can be used to bring up a new NameNode, although edits made after the last checkpoint may be lost.
The Secondary NameNode is not a substitute for the NameNode.

(106)What is the difference between Gen1 and Gen2 Hadoop with regards to the Namenode?
In Gen 1 Hadoop, the NameNode is a single point of failure. Gen 2 Hadoop introduces an Active and Passive NameNode structure (HDFS High Availability): if the Active NameNode fails, the Passive NameNode takes over.

(107)What is ‘Key value pair’ in HDFS?
A key-value pair is the intermediate data generated by the mappers and sent to the reducers for generating the final output.

(108)What is the difference between MapReduce engine and HDFS cluster?
HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. The MapReduce engine is the programming framework used to retrieve and analyze that data.

(109)Is map like a pointer?
No, a map is not like a pointer; it is a task that processes input records into intermediate key-value pairs, not a reference to data.

(110)Do we require two servers for the Namenode and the datanodes?
Yes, we need different servers for the NameNode and the DataNodes. The NameNode requires a high-end, reliable machine because it stores the location details of all the files kept on the various DataNodes, whereas the DataNodes can run on low-cost commodity hardware.

(111)Why are the number of splits equal to the number of maps?
The number of maps is equal to the number of input splits because each input split must be processed into key-value pairs by exactly one map task.

(112)Is a job split into maps?
No, a job is not split into maps. Splits are created for the input file, which is placed on the DataNodes in blocks. For each split, one map task is created.

(113)Which are the two types of ‘writes’ in HDFS?
There are two types of writes in HDFS: posted and non-posted. A posted write is “write and forget”, without worrying about the acknowledgement; it is similar to our traditional Indian post. In a non-posted write, we wait for the acknowledgement; it is similar to today’s courier services. Naturally, a non-posted write is more expensive than a posted write, though both kinds of write are asynchronous.

(114)Why ‘Reading‘ is done in parallel and ‘Writing‘ is not in HDFS?
Reading is done in parallel because it lets us access the data faster. We do not perform the write operation in parallel because it might result in data inconsistency: if two nodes write to the same file in parallel, the first node does not know what the second node has written and vice versa, so it becomes ambiguous which data should be stored and accessed.

(115)Can Hadoop be compared to NOSQL database like Cassandra?
Though NoSQL is the closest technology that can be compared to Hadoop, it has its own pros and cons. There is no distributed file system in NoSQL databases. Hadoop is not a database; it is a file system (HDFS) plus a distributed programming framework (MapReduce).

(116)How can I install Cloudera VM in my system?

(117)What is a Task Tracker in Hadoop? How many instances of Task Tracker run on a hadoop cluster?
A TaskTracker is a MapReduce daemon in the cluster that accepts tasks – Map, Reduce and Shuffle operations – from the JobTracker.
One TaskTracker runs on each slave node, so the number of TaskTracker instances equals the number of slave nodes in the cluster.

(118)What are the four basic parameters of a mapper?
The four basic parameters of a mapper (for example, in WordCount) are LongWritable, Text, Text and IntWritable. The first two represent the input key/value types and the second two represent the intermediate output key/value types.
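
For illustration, a minimal WordCount-style mapper sketch (the class name is made up) showing how those four types appear as the generic parameters of the new-API Mapper:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> = <LongWritable, Text, Text, IntWritable>
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // input key = byte offset of the line, input value = the line itself
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate output: (word, 1)
            }
        }
    }
}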

(119)What is the input type/format in MapReduce by default?
TextInputFormat. It treats each line of the input as a record: the key is the byte offset of the line (LongWritable) and the value is the line itself (Text).

(120)Can we do online transactions(OLTP) using Hadoop?
No. Hadoop is designed for batch processing of large data sets, not for low-latency online transactions (OLTP).

(121)Explain how HDFS communicates with the Linux native file system
HDFS is not a kernel-level file system; it runs in user space on top of the Linux native file system. Each DataNode stores every HDFS block as an ordinary file on its local file system (see question 151).

(122)What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?
JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop.
Only one JobTracker instance runs on a Hadoop cluster.

(123)What is the InputFormat ?
The InputFormat describes the input specification of a job: it validates the input, splits the input files into logical InputSplits (one per map task) and provides the RecordReader used to read records from each split.

(124)What is the InputSplit in map reduce software?
An InputSplit is a logical chunk of the input data that is processed by a single map task. By default its size is the HDFS block size, 64MB.

(125)What is a IdentityMapper and IdentityReducer in MapReduce ?
– org.apache.hadoop.mapred.lib.IdentityMapper implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the Mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default.
– org.apache.hadoop.mapred.lib.IdentityReducer performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the Reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default.

(126)How does the JobTracker schedule a task?
When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.

(127)When are the reducers started in a MapReduce job?
In a MapReduce job, reducers do not start executing the reduce method until all the map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available, but the programmer-defined reduce method is called only after all the mappers have finished.

(128)On what concept does the Hadoop framework work?
It works on the concept of MapReduce: data is processed in parallel by map tasks and the results are aggregated by reduce tasks.

(129)What is a DataNode? How many instances of DataNode run on a Hadoop Cluster?
A DataNode is an HDFS daemon in a Hadoop cluster, responsible for storing the actual data blocks. One DataNode runs on each slave node, so any number (N) of DataNode instances can run on a Hadoop cluster.

(130)What other technologies have you used in hadoop domain?

(131)How does the NameNode handle DataNode failures?
By replication: the NameNode detects a failed DataNode through missed heartbeats and re-replicates the blocks that were stored on it to other DataNodes.

(132)How many Daemon processes run on a Hadoop system?
Five: NameNode, Secondary NameNode and JobTracker on the master side, and DataNode and TaskTracker on the slave side.

(133)What is the configuration of a typical slave node in a Hadoop cluster?

(134) How many JVMs run on a slave node?
A slave node runs at least two daemon JVMs, one for the DataNode and one for the TaskTracker; in addition, the TaskTracker spawns a separate JVM for every task instance it runs (see question 143).

(135)How will you make changes to the default configuration files?
a. We go to the “conf” sub-directory under the Hadoop directory
b. Open the configuration files and edit the following into them
sudo gedit core-site.xml

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

sudo gedit hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

sudo gedit mapred-site.xml

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>

sudo gedit hadoop-env.sh

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67

(136)Can I set the number of reducers to zero?
Yes, setting the number of reducers to zero is a valid configuration in Hadoop. When you set the number of reducers to zero, no reducers are executed and the output of each mapper is stored in a separate file on HDFS. [This is different from the normal case, where the number of reducers is greater than zero and the mappers’ output (intermediate data) is written to the local file system (NOT HDFS) of each mapper slave node.]
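
As an illustration (not part of the original answer), a minimal map-only driver might look like the sketch below; it relies on the default identity Mapper and TextInputFormat, the class name is invented, and Job.getInstance() assumes the Hadoop 2.x API (on Hadoop 1.x, new Job(conf, name) is used instead):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only example");
        job.setJarByClass(MapOnlyJob.class);
        job.setNumReduceTasks(0);                   // zero reducers: mapper output goes straight to HDFS
        // no setMapperClass(): the default (identity) Mapper passes records through unchanged
        job.setOutputKeyClass(LongWritable.class);  // key type produced by TextInputFormat
        job.setOutputValueClass(Text.class);        // value type produced by TextInputFormat
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}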

(137)What is the default port that the JobTracker listens on?
50030 (the default HTTP port of the JobTracker web UI).

(138)“Unable to read options file” while trying to import data from MySQL to HDFS (Sqoop). – Narendra

(139)What problems have you faced when you are working on Hadoop code?

(140)How would you modify that solution to count only the number of unique words in all the documents?

(141)What is the difference between a Hadoop database and Relational Database?
Hadoop stores and processes both structured and unstructured data, while a relational database stores only structured data.

(142)How are HDFS blocks replicated?
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file: an application can specify the number of replicas of a file, and the replication factor can be specified at file creation time and changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. HDFS uses a rack-aware replica placement policy: in the default configuration there are 3 copies of each data block, with 2 copies stored on DataNodes in one rack and the 3rd copy on a different rack.

(143)What is a Task instance in Hadoop? Where does it run?
Task instances are the actual MapReduce tasks that run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this ensures that a process failure does not take down the TaskTracker itself. Each Task Instance runs in its own JVM process, and there can be multiple task instances running on a slave node at the same time, based on the number of slots configured on the TaskTracker. By default, a new task instance JVM process is spawned for each task.

(144)What is the meaning of replication factor?
The replication factor denotes how many copies of each HDFS block are stored in the cluster. By default, the replication factor in a Hadoop cluster is 3.

(145)If reducers do not start before all mappers finish, why does the progress of a MapReduce job show something like Map(50%) Reduce(10%)? Why is the reducers’ progress percentage displayed when the mappers are not finished yet?
Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes into account this data transfer performed by the reduce process, so the reduce progress starts showing up as soon as any intermediate key-value pair from a mapper is available to be transferred. Although the reducer progress is updated, the programmer-defined reduce method is called only after all the mappers have finished.

(146)How does the client communicate with HDFS?
The client communicates with HDFS using the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives. Client applications can then talk directly to a DataNode, once the NameNode has provided the location of the data.
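
As a sketch of that interaction (the path and class name are made up), reading a file through the HDFS Java API looks roughly like this; FileSystem asks the NameNode for block locations and then streams the bytes from the DataNodes:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/hadoop/sample.txt");      // hypothetical path
        try (FSDataInputStream in = fs.open(file)) {          // NameNode is consulted for block locations
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {      // the actual bytes come from the DataNodes
                System.out.println(line);
            }
        }
        fs.close();
    }
}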

(147)Which object can be used to get the progress of a particular job
JobClient or the Web UI.

(148)What is next step after Mapper or MapTask?
Shuffling and Sorting

(149)What are the default configuration files that are used in Hadoop?
core-site.xml
hdfs-site.xml
mapred-site.xml
hadoop-env.sh

(150)Does MapReduce programming model provide a way for reducers to communicate with each other? In a MapReduce job can a reducer communicate with another reducer?
No, the MapReduce programming model does not provide a way for reducers to communicate with each other. Reducers run in isolation.

(151)What is HDFS Block size? How is it different from traditional file system block size?
In HDFS, data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size, and each block is replicated multiple times (3 times by default), with the replicas stored on different nodes. HDFS utilizes the local file system underneath, storing each HDFS block as a separate file. The main difference from a traditional file system is size: HDFS blocks are orders of magnitude larger than the few-kilobyte blocks of a traditional file system, which suits large sequential reads and writes.

(152)What is SPF?
SPF here stands for single point of failure. The JobTracker is the single point of failure for MapReduce processing in a Hadoop 1 cluster; if it fails, all running jobs stop.

(153)Where do you specify the Mapper implementation?
In the job configuration: using JobConf.setMapperClass in the old API or Job.setMapperClass in the new API.

(154)What is a NameNode? How many instances of NameNode run on a Hadoop Cluster?
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept, but it does not store the data of these files itself. Only one NameNode process runs on any Hadoop cluster, in its own JVM; in a typical production cluster it runs on a separate machine. The NameNode is a single point of failure for the HDFS cluster: when the NameNode goes down, the file system goes offline. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.

(155)Explain the core methods of the Reducer?
The API of Reducer is very similar to that of Mapper: there is a run() method that receives a Context containing the job’s configuration, as well as interfacing methods that return data from the reducer back to the framework. The run() method calls setup() once, reduce() once for each key assigned to the reduce task, and cleanup() once at the end. Each of these methods can access the job’s configuration data by using Context.getConfiguration().
As in Mapper, any or all of these methods can be overridden with custom implementations. If none of these methods are overridden, the default reducer operation is the identity function; values are passed through without further processing.
The heart of the Reducer is its reduce() method. This is called once per key; the second argument is an Iterable which returns all the values associated with that key.
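
A minimal reducer sketch (WordCount-style types, invented class name) showing where setup(), reduce() and cleanup() fit:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void setup(Context context) {
        // called once before any reduce() call; the job configuration is
        // available via context.getConfiguration()
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {   // all values that share this key
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total);
    }

    @Override
    protected void cleanup(Context context) {
        // called once after the last reduce() call
    }
}
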
(156)What is Hadoop framework?
Apache Hadoop is an open-source software framework, written in Java, for the distributed storage and distributed processing of very large data sets (Big Data) on computer clusters built from commodity hardware.

(157)Is it possible to provide multiple input to Hadoop? If yes then how can you give multiple directories as input to the Hadoop job
Yes. The FileInputFormat class provides methods (addInputPath / setInputPaths) to add multiple directories as input to a Hadoop job, and MultipleInputs additionally allows a different InputFormat and Mapper per path.
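
A driver sketch (the paths and class name are invented, Hadoop 2.x style Job.getInstance) showing multiple input directories being added:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiInputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multiple input directories");
        job.setJarByClass(MultiInputDriver.class);
        // each call adds one more directory (or file) to the job's input
        FileInputFormat.addInputPath(job, new Path("/data/logs/2014"));
        FileInputFormat.addInputPath(job, new Path("/data/logs/2015"));
        // equivalent single call with a comma-separated list:
        // FileInputFormat.setInputPaths(job, "/data/logs/2014,/data/logs/2015");
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

When each directory needs its own InputFormat or Mapper, org.apache.hadoop.mapreduce.lib.input.MultipleInputs.addInputPath can be used instead.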

(158)How would you tackle counting words in several text documents?
Using the WordCount example provided with Hadoop: the mapper emits a (word, 1) pair for every word it reads and the reducer sums the counts for each word (see question 171).

(159)How does the master-slave architecture work in Hadoop?

(160)How would you tackle calculating the number of unique visitors for each hour by mining a huge Apache log? You can use post processing on the output of the MapReduce job.

(161)How did you debug your Hadoop code ?

(162)How will you write a custom partitioner for a Hadoop job?
To have Hadoop use a custom partitioner you have to do at least the following three things (a sketch follows after the list):
– Create a new class that extends the Partitioner class
– Override the getPartition method
– In the wrapper that runs the MapReduce job, either add the custom partitioner to the job programmatically using Job.setPartitionerClass, or add it to the job as a configuration entry (if your wrapper reads from a config file or Oozie)
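
A sketch of such a partitioner (the class name and the partitioning rule are invented for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions <= 1) {
            return 0;                              // single reducer: everything goes to partition 0
        }
        String k = key.toString();
        boolean firstHalf = !k.isEmpty() && Character.toLowerCase(k.charAt(0)) <= 'm';
        // keys starting with a..m go to the first partition, the rest to the last one
        return firstHalf ? 0 : numPartitions - 1;
    }
}

It would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class).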

(163)How can you add the arbitrary key-value pairs in your mapper?
You can set arbitrary (key, value) pairs of configuration data in your Job, e.g. with Job.getConfiguration().set("myKey", "myVal"), and then retrieve this data in your mapper with Context.getConfiguration().get("myKey"). This is typically done in the Mapper's setup() method.
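
For example (the configuration key "filter.keyword" and the class name are made up for this sketch), the driver sets a value and the mapper reads it back in setup():

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private String keyword;

    @Override
    protected void setup(Context context) {
        // driver side: job.getConfiguration().set("filter.keyword", "ERROR");
        keyword = context.getConfiguration().get("filter.keyword", "");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        if (line.toString().contains(keyword)) {
            context.write(line, NullWritable.get());   // keep only the matching lines
        }
    }
}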

(164)What is a DataNode?
A DataNode is an HDFS daemon in the Hadoop cluster that stores the actual blocks of data.

(165)What are combiners? When should I use a combiner in my MapReduce Job?
Combiners are used to increase the efficiency of a MapReduce program. They aggregate the intermediate map output locally on the individual mapper nodes, which reduces the amount of data that needs to be transferred to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed: Hadoop may or may not execute it, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner being executed.

(166)How is a Mapper instantiated in a running job?
The job is first submitted to the JobTracker, which assigns it a job ID and initializes it. The framework then creates one map task per input split, and the Mapper class configured for the job is instantiated inside each map task when that task runs.

(167)Which interface needs to be implemented to create Mapper and Reducer for the Hadoop?
org.apache.hadoop.mapreduce.Mapper
org.apache.hadoop.mapreduce.Reducer
(In the old org.apache.hadoop.mapred API these are interfaces; in the new org.apache.hadoop.mapreduce API they are base classes that you extend.)

(168)What happens if you don’t override the Mapper methods and keep them as they are?
If the Mapper methods are not overridden, the default implementations are used: map() simply writes each input key/value pair to the output unchanged (identity behaviour), and setup()/cleanup() do nothing.

(169)What does a Hadoop application look like, i.e. what are its basic components?
A typical Hadoop MapReduce application consists of a driver (which configures and submits the Job), a Mapper class, a Reducer class (optional, see question 136), and the input/output formats and paths.

(170)What is the meaning of speculative execution in Hadoop? Why is it important?
Speculative execution is a way of coping with slow individual machines. In large clusters where hundreds or thousands of machines are involved, there may be machines which do not perform as fast as the others, and a single slow machine may delay the whole job. To avoid this, speculative execution in Hadoop can run multiple copies of the same map or reduce task on different slave nodes; the results from the first copy to finish are used.

(170)What are the restrictions on the key and value classes?
The key and value classes have to be serialized by the framework; to make them serializable, Hadoop provides the Writable interface. In addition, the key of a map output must be comparable so that it can be sorted, hence the key class has to implement the WritableComparable interface as well.

(171)Explain the WordCount implementation via Hadoop framework ?
• We will count the words in all the input files; the flow is as below.
• Input: assume there are two files, each containing the sentence “Hello World Hello World” (file 1 and file 2).
• Mapper: there is one mapper per file.
For the given sample input, the first map outputs: <Hello, 1> <World, 1> <Hello, 1> <World, 1>
The second map outputs: <Hello, 1> <World, 1> <Hello, 1> <World, 1>
• Combiner/sorting (this is done for each individual map), so the output looks like this:
The output of the first map: <Hello, 2> <World, 2>
The output of the second map: <Hello, 2> <World, 2>
• Reducer: it sums up the above output and generates the result below
<Hello, 4> <World, 4>
• Output: the final output would look like
Hello 4 times, World 4 times
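
A driver sketch that would wire this flow together (class names match the mapper and reducer sketches shown under questions 118 and 155; input and output paths are taken from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // emits (word, 1)
        job.setCombinerClass(WordCountReducer.class);   // optional local per-map aggregation
        job.setReducerClass(WordCountReducer.class);    // sums the counts per word
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The reducer class doubles as the combiner here because summing counts is commutative and associative (see question 165).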

(172)What does the Mapper do?
The Mapper takes input records (read from HDFS) and processes them into an intermediate output in the form of key-value pairs, which is then fed as input to the Reducer.

(173)What is MapReduce?
MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

(174)Explain the Reducer’s Sort phase?
The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged (It is similar to merge-sort).

(175)What are the primary phases of the Reducer?
Shuffle, Sort and Reduce

(176)Explain the Reducer’s reduce phase?
In this phase the reduce(MapOutKeyType, Iterable, Context) method is called for each pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via Context.write(ReduceOutKeyType, ReduceOutValType). Applications can use the Context to report progress, set application-level status messages and update Counters, or just indicate that they are alive. The output of the Reducer is not sorted.

(177)Explain the shuffle?
Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.

(178)What happens if number of reducers are 0?
The output would be generated only by the mapper and the reducer phase would be omitted.

(179)How many Reducers should be configured?
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapreduce.tasktracker.reduce.tasks.maximum).
With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing. Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.

(180)What is Writable & WritableComparable interface?
-org.apache.hadoop.io.Writable is a Java interface. Any key or value type in the Hadoop Map-Reduce framework implements this interface. Implementations typically implement a static read(DataInput) method which constructs a new instance, calls readFields(DataInput) and returns the instance.
-org.apache.hadoop.io.WritableComparable is a Java interface. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface. WritableComparable objects can be compared to each other using Comparators.
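
As a sketch (the class name and fields are invented), a small custom key type implementing WritableComparable looks like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class YearMonthKey implements WritableComparable<YearMonthKey> {

    private final IntWritable year = new IntWritable();
    private final Text month = new Text();

    public void set(int y, String m) {
        year.set(y);
        month.set(m);
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        year.write(out);
        month.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization
        year.readFields(in);
        month.readFields(in);
    }

    @Override
    public int compareTo(YearMonthKey other) {                 // defines the sort order of keys
        int cmp = year.compareTo(other.year);
        return cmp != 0 ? cmp : month.compareTo(other.month);
    }

    @Override
    public int hashCode() {                                     // used by the default HashPartitioner
        return 31 * year.get() + month.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof YearMonthKey)) {
            return false;
        }
        YearMonthKey other = (YearMonthKey) o;
        return year.equals(other.year) && month.equals(other.month);
    }
}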

(181)What is the Hadoop MapReduce API contract for a key and value Class?
The Key must implement the org.apache.hadoop.io.WritableComparable interface.
The value must implement the org.apache.hadoop.io.Writable interface.

(182)Where is the Mapper output (intermediate key-value data) stored?
On the local file system of each node that runs a map task (not in HDFS).

(183)What is the difference between HDFS and NAS ?

(184)What is Distributed Cache in Hadoop?
Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.

(185)Have you ever used Counters in Hadoop. Give us an example scenario?

(186)What is the main difference between Java and C++?
Java is a purely object-oriented programming language, whereas C++ is not (it also supports procedural programming).

(187)What alternate way does HDFS provide to recover data in case a NameNode, without backup, fails and cannot be recovered?
The metadata can be recovered from the checkpoint kept by the Secondary NameNode (see question 105), although any changes made after the last checkpoint are lost.

(188)What is the use of Context object?
The Context object lets the Mapper and Reducer interact with the rest of the Hadoop framework: it gives access to the job’s configuration, it is used to emit output via write(), and it can report progress, update Counters and set application-level status messages.

(189)What is the Reducer used for?
Reducer reduces a set of intermediate values which share a key to a (usually smaller) set of values.
The number of reduces for the job is set by the user via Job.setNumReduceTasks(int).

(190)What is the use of Combiner?
It is an optional component or class, which can be specified via Job.setCombinerClass(ClassName) to perform local aggregation of the intermediate outputs; this helps to cut down the amount of data transferred from the Mapper to the Reducer.

(191)Explain the input and output data format of the Hadoop framework?
The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. See the flow mentioned below:
(input) <k1, v1> -> map -> <k2, v2> -> combine/sorting -> <k2, v2> -> reduce -> <k3, v3> (output)

(192)What is compute and Storage nodes?
Compute Node: This is the computer or machine where your actual business logic will be executed.
Storage Node: This is the computer or machine where the file system resides to store the data being processed.
In most of the cases compute node and storage node would be the same machine.

(193)What is the NameNode?
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

(194)How does the Mapper’s run() method work?
The Mapper.run() method calls setup() once, then map(KeyInType, ValInType, Context) for each key/value pair in the InputSplit for that task, and finally cleanup().
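
Conceptually, run() is roughly equivalent to the following simplified sketch (the actual Hadoop source adds try/finally handling in newer releases):

// simplified sketch of org.apache.hadoop.mapreduce.Mapper.run()
public void run(Context context) throws IOException, InterruptedException {
    setup(context);                       // once, before the first record
    while (context.nextKeyValue()) {      // iterate over the records of this task's InputSplit
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);                     // once, after the last record
}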

(195)What is the default replication factor in HDFS?
3

(196)Is it possible for a job to have 0 reducers?
Yes

(197)How many maps are there in a particular Job?
It depends on the number of input splits: there is one map task per input split.

(198)How many instances of JobTracker can run on a Hadoop Cluster?
1

(199)How can we control particular key should go in a specific reducer?
Using a custom Partitioner

(200)What is the typical block size of an HDFS block?
64 MB is the default, but 128 MB is commonly recommended in production.

(201)What do you understand about Object Oriented Programming (OOP)? Use Java examples.

(202)What are the main differences between versions 1.5 and version 1.6 of Java?

(203)Describe what happens to a MapReduce job from submission to output?
a. The client submits the job.
b. The RecordReader takes the input, converts it into key-value pairs and gives them as input to the Mapper.
c. The Mapper processes the data and produces an intermediate output (key-value pairs), which is stored on the local file system.
d. The intermediate output goes through shuffling and sorting and is provided to the Reducer.
e. The Reducer produces the final output, which is stored in HDFS.

Prwatech