Hadoop Interview Questions and Answers

  • date 28th October, 2019 |
  • by Prwatech |



Are you looking for Hadoop interview questions? Whether you are a fresher or an experienced professional, you have landed in the right place. The Prwatech team, India’s leading Big Data training institute, has collected the top-rated Hadoop interview questions and answers of 2019 to help you crack any type of interview.


Here is the list of Top Rated 50 Hadoop interview questions and answers

If you dream of becoming a certified Hadoop developer, don’t just dream about it: achieve it with the world-class trainers, each with 15+ years of experience, of India’s leading Hadoop training institute.


How is Hadoop different from other parallel computing systems?

Hadoop is a framework built around a distributed file system that lets you store and process massive amounts of data on a cluster of machines while handling data redundancy. Its primary benefit is that, because the data is stored across many nodes, it can be processed in a distributed manner. Thanks to data locality, each node processes the data stored on it rather than moving the data to the processing unit. In an RDBMS, on the other hand, you can query data in real time, but it is not efficient for processing huge amounts of data stored in tables, records, and columns.


Explain the difference between Name Node, Checkpoint Name Node, and Backup Node.


The NameNode stores the metadata of HDFS. The state of HDFS is stored in a file called fsimage, which is the base of the metadata. During runtime, modifications are only written to a log file called edits. On the next start-up of the NameNode, the state is read from fsimage, the changes from edits are applied to it, and the new state is written back to fsimage. After this, edits is cleared and is ready for new log entries.

A Checkpoint Node was introduced to solve this drawback of the NameNode. The changes are only written to edits and are not merged into the previous image during runtime. If the NameNode runs for a while, edits gets huge and the next start-up takes even longer, because more changes have to be applied to determine the latest state of the metadata.

The Checkpoint Node periodically fetches fsimage and edits from the NameNode and merges them. The resulting state is called a checkpoint. It then uploads the result to the NameNode.

Backup Node

The Backup Node provides almost the same functionality as the Checkpoint Node but is synchronized with the NameNode. It doesn’t need to fetch the changes periodically because it receives a stream of file system edits from the NameNode. It holds the current state in memory and just needs to save this to an image file to create a new checkpoint.
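The merge of fsimage and edits described above can be pictured with a small sketch. This is a toy model that uses only the JDK: the namespace is a set of paths and the edit log is a list of CREATE/DELETE operations. The class and method names are hypothetical and are not Hadoop classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Toy model of a checkpoint (illustrative only, not Hadoop code).
final class CheckpointSketch {

    // One entry of the hypothetical edit log.
    static final class Edit {
        final String op;
        final String path;

        Edit(String op, String path) {
            this.op = op;
            this.path = path;
        }
    }

    // Replays the edit log on a copy of the old image and returns the merged state.
    static TreeSet<String> merge(TreeSet<String> fsimage, List<Edit> edits) {
        TreeSet<String> merged = new TreeSet<>(fsimage);
        for (Edit e : edits) {
            if (e.op.equals("CREATE")) {
                merged.add(e.path);
            } else if (e.op.equals("DELETE")) {
                merged.remove(e.path);
            } else {
                throw new IllegalArgumentException("unknown op: " + e.op);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeSet<String> fsimage = new TreeSet<>(Arrays.asList("/data/a.txt", "/data/b.txt"));
        List<Edit> edits = new ArrayList<>();
        edits.add(new Edit("CREATE", "/data/c.txt"));
        edits.add(new Edit("DELETE", "/data/a.txt"));

        TreeSet<String> checkpoint = merge(fsimage, edits);
        edits.clear(); // after the merge, the log starts empty again
        System.out.println(checkpoint);
    }
}
```

The longer the edit list, the longer the replay takes, which is why merging periodically keeps start-up time short.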


What are the most common input formats in Hadoop?

The most common input formats in Hadoop are TextInputFormat (the default, for plain text files), KeyValueTextInputFormat, and SequenceFileInputFormat. In terms of file formats, Hadoop supports Text, Parquet, RC, ORC, Sequence, and others, and the text file format is the default. Depending on the business requirement, you can use different file formats. ORC and Parquet are columnar formats, so use them if you want to process the data by column. If you want to process data by row, you can work with the Avro file format. Parquet, ORC, and Avro come with built-in compression and consume less space than plain text files.


What is a Sequence File in Hadoop?

A SequenceFile is a binary file format that consists of serialized key-value pairs and serves as a container for data stored in HDFS. MapReduce can store data in this format during the processing of MapReduce tasks.


What is the role of a Job Tracker in Hadoop?

The JobTracker is the master daemon in the Hadoop 1.x architecture. In YARN it is replaced by the ResourceManager and ApplicationMaster. It receives requests from the MapReduce client, distributes the work to the different TaskTracker nodes, and collects the status of the ongoing tasks from the TaskTrackers.


What is the use of a Record Reader in Hadoop?

RecordReader interacts with the InputSplit (created by InputFormat) and converts the splits into the form of key-value pairs that are suitable for reading by the Mapper.


What is Speculative Execution in Hadoop?

In Hadoop, speculative execution is a process in which, if a task is taking much longer than expected, the master node starts another instance of the same task on a different node. Whichever of the two instances finishes first is accepted, and the other one is killed.
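The "first attempt to finish wins" idea can be sketched with the JDK's ExecutorService. This is only an analogy under simplifying assumptions: both attempts start at the same time here, whereas Hadoop launches the second attempt only after it notices that the first is slow. The class name, attempt names, and sleep times are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Analogy for speculative execution using plain JDK concurrency (not Hadoop code).
final class SpeculativeExecutionSketch {

    // Builds an attempt that sleeps for the given time and then reports its name.
    static Callable<String> attempt(String name, long millis) {
        return () -> {
            Thread.sleep(millis);
            return name;
        };
    }

    // Runs all attempts concurrently and returns the result of the first to finish.
    // invokeAny cancels the attempts that are still running when it returns.
    static String firstToFinish(List<Callable<String>> attempts) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(attempts.size());
        try {
            return pool.invokeAny(attempts);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        String winner = firstToFinish(Arrays.asList(
                attempt("original attempt (slow node)", 2000),
                attempt("speculative attempt", 50)));
        System.out.println("accepted: " + winner);
    }
}
```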


How can you debug the Hadoop code?

First, check the list of MapReduce jobs currently running. Next, check whether any orphaned jobs are running; if so, you need to determine the location of the ResourceManager (RM) logs.

1. Run “ps -ef | grep -i ResourceManager” and look for the log directory in the displayed result. Find the job ID in the displayed list and check whether there is an error message associated with that job.
2. On the basis of the RM logs, identify the worker node that was involved in the execution of the task.
3. Now log in to that node and run the command “ps -ef | grep
4. Examine the NodeManager log. The majority of errors come from the user-level logs for each MapReduce job.


How to configure Replication Factor in HDFS?

Open the hdfs-site.xml file, which is inside the conf/ folder of the Hadoop installation directory (etc/hadoop/ in Hadoop 2.x and later). Change the value of the dfs.replication property to the integer you want to set as the replication factor, e.g. 2, 3, 4, or 5.

You can also change the replication factor at runtime on a per-file or per-directory basis using the Hadoop FS shell:

$ hadoop fs -setrep -w 3 /my/file_name

$ hadoop fs -setrep -w 3 /my/directory_name


Which companies use Hadoop?

List of the top companies using Apache Hadoop:

1. Wipro Ltd
2. Marks and Spencer
3. Royal Bank of Scotland
4. Royal Mail
5. British Airways


How is Hadoop related to Big Data? Describe its components.

Apache Hadoop is an open-source software framework written in Java. It is primarily used for the storage and processing of large sets of data, better known as big data. It comprises several components that allow the storage and processing of large data volumes in a clustered environment. The two main components are the Hadoop Distributed File System and the MapReduce programming model.


Hadoop, as a whole, consists of the following parts:


Hadoop Distributed File System

Abbreviated as HDFS, it is a file system similar to many existing ones, but it is also a virtual file system. There is one notable difference from other popular file systems: when we move a file into HDFS, it is automatically split into smaller pieces called blocks. These blocks are then replicated, by default on three different servers, so that a copy is still available under unforeseen circumstances. This replication count isn’t hard-set and can be chosen as per requirements.

Hadoop MapReduce

MapReduce is the programming aspect of Hadoop that allows the processing of large volumes of data. It breaks down requests into smaller requests, which are then sent to multiple servers. This makes use of the scalable CPU power of the cluster.


HBase is a layer that sits atop HDFS and is developed in the Java programming language. HBase primarily has the following aspects:

  • Highly scalable
  • Fault tolerant

Every row in HBase is identified by a key. The number of columns is not fixed; instead, columns are grouped into column families.


Zookeeper is a centralized service that maintains

1. Configuration information
2. Naming information
3. Synchronization information

Besides these, Zookeeper is also responsible for group services and is used by HBase. It is also useful for MapReduce programs.

Solr/Lucene is a search engine. Its libraries are developed by Apache and took over 10 years to reach their present robust form.

Programming Languages
There are basically two programming languages that are identified as original Hadoop programming languages,


Besides these, a few other programming languages can be used for writing programs, namely C, JAQL, and Java. We can also use SQL directly to interact with the database, although that requires standard JDBC or ODBC drivers.


Define HDFS and YARN, and talk about their respective components.


Hadoop Distributed File System (HDFS)
HDFS is a distributed file system that provides access to data across Hadoop clusters. A cluster is a group of computers that work together. Like other Hadoop-related technologies, HDFS is a key tool that manages and supports the analysis of very large volumes of data, on the scale of petabytes and zettabytes.

HDFS Components

The main components of HDFS are,
2.Secondary Namenode
3.File system

HDFS Command Line

The following are a few basic HDFS commands.

To copy the file prwatech.txt from the local disk to the user's directory, type the command line:

$ hdfs dfs -put prwatech.txt prwatech.txt

This will copy the file to /user/username/prwatech.txt

To get a directory listing of the user's home directory, type the command line:

$ hdfs dfs -ls

To create a directory called testing under the user's home directory, type the command line:

$ hdfs dfs -mkdir testing

To delete the directory testing and all of its contents, type the command line:

$ hdfs dfs -rm -r testing


What is YARN?

YARN stands for Yet Another Resource Negotiator. YARN is a resource manager created by separating the processing engine and the management function of MapReduce. It monitors and manages workloads, maintains a multi-tenant environment, manages the high-availability features of Hadoop, and implements security controls. Before 2012, users could write MapReduce programs in languages such as Java, Python, and Ruby. They could also use Pig, a language used to transform data. No matter which language was used, its implementation depended on the MapReduce processing model.

In May 2012, with the release of Hadoop version 2.0, YARN was introduced. You are no longer limited to the MapReduce framework, as YARN supports multiple processing models in addition to MapReduce, such as Spark. Other features of YARN include significant performance improvements and a flexible execution engine.

The three important elements of the YARN architecture are:

1. Resource Manager
2. Application Master
3. Node Managers

Resource Manager

The ResourceManager, or RM, of which there is usually one per cluster, is the master server. The ResourceManager knows the location of the DataNodes and how many resources they have. This information is referred to as rack awareness. The RM runs several services, the most important of which is the Resource Scheduler, which decides how to assign resources.

Application Master,
The Application Master is a framework-specific process that negotiates resources for a single application, that is, a single job or a directed acyclic graph of jobs, which runs in the first container allocated for the purpose.
Each Application Master requests resources from the Resource Manager and then works with containers provided by Node Managers.


What is the purpose of the JPS command in Hadoop?

jps (Java Virtual Machine Process Status tool) is a command used to check whether all the Hadoop daemons, such as the NameNode, Secondary NameNode, DataNode, ResourceManager, and NodeManager, are running on the machine.


Why do we need Hadoop for Big Data Analytics?

The primary function of Hadoop is to facilitate fast analytics on huge sets of unstructured data. In other words, Hadoop is all about handling “big data.” So the first question to ask is whether that is the kind of data you are working with. Secondly, does your data require real-time, or close to real-time, analysis? Where Hadoop excels is in allowing large datasets to be processed quickly. Another consideration is the rate at which your data storage requirements are growing. A big advantage of Hadoop is that it is extremely scalable. You can add new storage capacity simply by adding server nodes to your Hadoop cluster. In theory, a Hadoop cluster can be expanded almost indefinitely as needed, using low-cost commodity server and storage hardware.

If your business faces the combination of huge amounts of data, along with a much less than huge storage budget, Hadoop may well be the best solution for you.


Explain the different features of Hadoop.


Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel. Unlike a traditional relational database system (RDBMS) that can’t scale to process large amounts of data, Hadoop enables businesses to run applications on thousands of nodes involving many thousands of terabytes of data.

Varied Data Sources,

Hadoop accepts a variety of data. Data can come from a range of sources like email conversation, social media, etc. and can be of the structured or unstructured form. Hadoop can derive value from diverse data. Hadoop can accept data in a text file, XML file, images, CSV files, etc.


Hadoop is an economical solution because it uses a cluster of commodity hardware to store data. Commodity hardware means cheap machines, so the cost of adding nodes to the cluster is low. With erasure coding, Hadoop 3.0 has only 50% storage overhead, as opposed to 200% with the default three-way replication in Hadoop 2.x. Fewer machines are needed to store the data because the redundant data decreases significantly.


Hadoop enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data. This means businesses can use Hadoop to derive valuable business insights from data sources such as social media, email conversations.  Hadoop can be used for a wide variety of purposes, such as log processing, recommendation systems, data warehousing, market campaign analysis, and fraud detection.

Fast :

Hadoop’s storage method is based on a distributed file system that basically ‘maps’ data wherever it is located on a cluster. The tools for data processing are often on the same servers where the data is located, resulting in much faster data processing. If you’re dealing with large volumes of unstructured data, Hadoop can efficiently process terabytes of data in minutes and petabytes in hours.

Performance :

Hadoop, with its distributed processing and distributed storage architecture, processes huge amounts of data at high speed. In 2008, a Hadoop cluster even won the terabyte sort benchmark. Hadoop divides the input data file into a number of blocks and stores these blocks over several nodes. It also divides the task that the user submits into sub-tasks, which are assigned to the worker nodes holding the required data and run in parallel, thereby improving performance.


In Hadoop 3.0 fault tolerance is provided by erasure coding. For example, 6 data blocks produce 3 parity blocks by using erasure coding technique, so HDFS stores a total of these 9 blocks. In the event of failure of any node the data block affected can be recovered by using these parity blocks and the remaining data blocks.
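The storage overhead figures follow from simple arithmetic, sketched below with plain JDK code. The class and method names are hypothetical; the numbers (three replicas, 6 data blocks plus 3 parity blocks) are the ones used in the text.

```java
// Storage overhead arithmetic for replication versus erasure coding (illustrative only).
final class StorageOverhead {

    // Extra storage, as a percentage of the original data, when every block is kept N times.
    static double replicationOverheadPercent(int replicationFactor) {
        return (replicationFactor - 1) * 100.0;
    }

    // Extra storage, as a percentage of the original data, for a data+parity erasure coding layout.
    static double erasureCodingOverheadPercent(int dataBlocks, int parityBlocks) {
        return parityBlocks * 100.0 / dataBlocks;
    }

    public static void main(String[] args) {
        System.out.println("3x replication: " + replicationOverheadPercent(3) + "% overhead");
        System.out.println("6 data + 3 parity: " + erasureCodingOverheadPercent(6, 3) + "% overhead");
    }
}
```

With three replicas, two extra copies of every block are stored, hence 200%. With 6 data blocks and 3 parity blocks, only 3 extra blocks are stored per 6, hence 50%.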

Highly Available:

In Hadoop 2.x, the HDFS architecture has a single active NameNode and a single standby NameNode, so if the active NameNode goes down, there is a standby NameNode to count on. Hadoop 3.0 supports multiple standby NameNodes, making the system even more highly available, as it can continue functioning even if two or more NameNodes crash.

Resilient to failure,

A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual node, that data is also replicated to other nodes in the cluster, which means that in the event of failure, there is another copy available for use.

 Low Network Traffic,

In Hadoop, each job submitted by the user is split into a number of independent sub-tasks and these sub-tasks are assigned to the data nodes thereby moving a small amount of code to data rather than moving huge data to code which leads to low network traffic.

High Throughput:

Throughput means job done per unit time. Hadoop stores data in a distributed fashion which allows using distributed processing with ease. A given job gets divided into small jobs that work on chunks of data in parallel thereby giving high throughput.

Ease of use,

The Hadoop framework takes care of parallel processing. MapReduce programmers do not need to worry about achieving distributed processing; it is done automatically at the backend.

Compatibility :

Most emerging Big Data technologies, such as Spark and Flink, are compatible with Hadoop. Their processing engines work over Hadoop as a backend, i.e. we use Hadoop as the data storage platform for them.

Multiple Languages Supported,

Developers can code using many languages on Hadoop like C, C++, Perl, Python, Ruby, and Groovy.


What are the Edge Nodes in Hadoop?

Edge nodes are the interface between the Hadoop cluster and the outside network. For this reason, they’re sometimes referred to as gateway nodes. Most commonly, edge nodes are used to run client applications and cluster administration tools. They’re also often used as staging areas for data being transferred into the Hadoop cluster. As such, Oozie, Pig, Sqoop, and management tools such as Hue and Ambari run well there. The figure shows the processes you can run on the Edge nodes.

Edge nodes are often overlooked in Hadoop hardware architecture discussions. This situation is unfortunate because edge nodes serve an important purpose in a Hadoop cluster, and they have hardware requirements that are different from master nodes and slave nodes. In general, it’s a good idea to minimize deployments of administration tools on master nodes and slave nodes to ensure that critical Hadoop services like the NameNode have as little competition for resources as possible.

The figure shows two edge nodes, but for many Hadoop clusters, a single edge node would suffice. Additional edge nodes are most commonly needed when the volume of data being transferred in or out of the cluster is too much for a single server to handle.


What are the five V’s of Big Data?

Big Data was originally defined by the “3Vs”, but now there are “5Vs” of Big Data, which are also termed the characteristics of Big Data:

1. The name ‘Big Data’ itself is related to a size that is enormous.
2. Volume is a huge amount of data.
3. To determine the value of data, size of data plays a very crucial role.
4. If the volume of data is very large then it is actually considered as a ‘Big Data’. This means whether a particular data can actually be considered as a Big Data or not, is dependent upon the volume of data.
5. Hence while dealing with Big Data it is necessary to consider a characteristic ‘Volume’.

Example: In the year 2016, the estimated global mobile traffic was 6.2 exabytes (6.2 billion GB) per month. Also, by the year 2020, we will have almost 40,000 exabytes of data.


1. Velocity refers to the high speed of accumulation of data.
2. In Big Data velocity data flows in from sources like machines, networks, social media, mobile phones, etc.
3. There is a massive and continuous flow of data. This determines the potential of the data, i.e. how fast it is generated and processed to meet demand.
4. Sampling data can help in dealing with issues like velocity.

Example: More than 3.5 billion searches are made on Google per day. Also, the number of Facebook users is increasing by approximately 22% year over year.


1. Variety refers to the nature of data, which can be structured, semi-structured, or unstructured.
2. It also refers to heterogeneous sources.
3. Variety is basically the arrival of data from new sources, both inside and outside of an enterprise.
4. Structured data: This is organized data. It generally refers to data that has a defined length and format.
5. Semi-structured data: This is semi-organized data. It is generally a form of data that does not conform to a formal data structure. Log files are examples of this type of data.
6. Unstructured data: This is unorganized data. It generally refers to data that doesn’t fit neatly into the traditional row and column structure of a relational database. Texts, pictures, videos, etc. are examples of unstructured data that can’t be stored in the form of rows and columns.


1. Veracity refers to inconsistencies and uncertainty in data; that is, the available data can sometimes get messy, and its quality and accuracy are difficult to control.
2. Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources.
3. Example: Data in bulk could create confusion, whereas too little data could convey only half or incomplete information.


1. After taking the four V’s into account, there comes one more V, which stands for Value. A bulk of data with no value is of no good to the company unless you turn it into something useful.
2. Data in itself is of no use or importance; it needs to be converted into something valuable in order to extract information. Hence, you can state that Value is the most important of the 5 Vs.



What is fsck?

fsck is a system utility used to check the consistency of a file system in Unix-like operating systems, including Linux, where it can also repair inconsistencies. It is run with the ‘fsck’ command and is the equivalent of ‘CHKDSK’ in Microsoft Windows. HDFS provides its own fsck command (hdfs fsck), which reports problems such as missing or under-replicated blocks but, unlike the Unix tool, does not repair them.


What are the main differences between NAS (Network-attached storage) and HDFS?

HDFS is the primary storage system of Hadoop. It is designed to store very large files on a cluster of commodity hardware, distributing the blocks of each file across the machines. Network-attached storage (NAS) is a file-level computer data storage server that provides data access to a heterogeneous group of clients.


What is the Command to format the Name Node?

$ hadoop namenode -format


Which hardware configuration is most beneficial for Hadoop jobs?

Dual-processor or dual-core machines with 4 to 8 GB of RAM and ECC memory are ideal for running Hadoop operations. However, the hardware configuration varies with the project-specific workflow and process flow and needs to be customized accordingly.


What happens when two users try to access the same file in the HDFS?


As you know, HDFS stands for Hadoop Distributed File System. HDFS strictly follows the Write Once Read Many principle, also known as WORM. This means that only one client can write to a file at a time, but reads can happen concurrently.


What is the difference between “HDFS Block” and “Input Split”?

An InputSplit is a logical reference to data, which means it doesn't contain any data itself. It is only used during data processing by MapReduce. An HDFS block is the physical location where the actual data is stored. Both are configurable, each in its own way. All blocks of a file are the same size except the last block, which can be the same size or smaller. By default, the split size is approximately equal to the block size. A split does not always line up exactly with one block, for example when a record spans two blocks.
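A short sketch of the arithmetic behind blocks and splits, using only the JDK. The class and method names are hypothetical. blockSizes shows that only the last block of a file can be smaller than the block size, and splitSize mirrors the max(minSize, min(maxSize, blockSize)) rule that Hadoop's FileInputFormat applies, which is why the split size equals the block size unless the minimum or maximum is changed.

```java
import java.util.Arrays;

// Block and split arithmetic (illustrative only, not Hadoop code).
final class BlockLayout {

    // Returns the size of every block of a file; only the last one may be smaller.
    static long[] blockSizes(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        int count = (int) ((fileSize + blockSize - 1) / blockSize);
        long[] sizes = new long[count];
        for (int i = 0; i < count; i++) {
            sizes[i] = Math.min(blockSize, fileSize - i * blockSize);
        }
        return sizes;
    }

    // Split size as a function of the block size and the configured minimum and maximum.
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 300 MB file with 128 MB blocks: two full blocks and one 44 MB block.
        System.out.println(Arrays.toString(blockSizes(300 * mb, 128 * mb)));
        // With default-like limits the split size equals the block size.
        System.out.println(splitSize(128 * mb, 1, Long.MAX_VALUE));
    }
}
```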


Explain the difference between Hadoop and RDBMS.

Query response: In an RDBMS, query response time is immediate. Hadoop takes much longer to respond, so there is latency due to batch processing.

Data size: An RDBMS is useful when we have gigabytes of data, but once the data grows to terabytes or petabytes, Hadoop is very useful for processing it.

Structure of data: An RDBMS is best suited to structured data only, whereas Hadoop can store and process structured, semi-structured, and unstructured data.

Scaling: An RDBMS typically allows only vertical scaling, whereas Hadoop scales both vertically and horizontally, so Hadoop gives better performance in this case.

Updates: In an RDBMS we can read and write many times; Hadoop follows WORM, so we write only once and read many times.

Cost: Hadoop is open source, whereas many RDBMS products are licensed and have to be purchased.


What are the configuration parameters in a “Map Reduce” program?

1.Input location of Jobs in the distributed file system.
2.Output location of Jobs in the distributed file system.
3.The input format of data
4.The output format of data.
5.The class which contains the map function.
6.The class which contains the reduce function


What are the different configuration files in Hadoop?



How is NFS different from HDFS?

NFS (Network File System): A protocol that allows clients to access files over the network. NFS clients allow files to be accessed as if they resided on the local machine, even though they reside on the disk of a networked machine.

HDFS (Hadoop Distributed File System): A file system that is distributed amongst many networked computers or nodes. HDFS is fault-tolerant because it stores multiple replicas of files on the file system; the default replication level is 3.

The major difference between the two is Replication/Fault Tolerance. HDFS was designed to survive failures. NFS does not have any fault tolerance built-in.


What is Map Reduce? What is the syntax you use to run a Map Reduce program?

Map-Reduce is a processing technique and a programming model for distributed computing, implemented in Java. A Map-Reduce algorithm consists of two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).

The reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the sequence of the name Map-Reduce implies, the reduce task is always performed after the map job.

The major advantage of Map-Reduce is that it is easy to scale data processing over multiple computing nodes.

Under the Map-Reduce model, the data processing primitives are known as mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the Map-Reduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a single cluster is merely a configuration change. This simple scalability is what has attracted many programmers to implement the Map-Reduce model.


Map-Reduce Job:

(Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3>(Output).
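To make the key/value flow concrete, here is a minimal single-process sketch of the same pipeline using only the JDK, with word count as the example. It is not Hadoop code: the class name and the in-memory grouping step that stands in for the shuffle are assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (illustrative only).
final class MiniMapReduce {

    // map: <offset, line> -> list of <word, 1>
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // reduce: <word, [1, 1, ...]> -> count for that word
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map on every line, groups the pairs by key (the "shuffle"), then reduces each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1;
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("big data big cluster", "data node")));
    }
}
```

In a real cluster the map calls run on the nodes that hold the input blocks and the grouping happens over the network; the shape of the data at each stage is the same.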


package hadoop;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ProcessUnits {

    // Mapper class: emits <year, last column> for each tab-separated input line
    public static class E_EMapper extends MapReduceBase implements
            Mapper<LongWritable, /* Input key type */
                   Text,         /* Input value type */
                   Text,         /* Output key type */
                   IntWritable>  /* Output value type */ {

        // Map function
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            String lasttoken = null;
            StringTokenizer s = new StringTokenizer(line, "\t");
            String year = s.nextToken();

            // Walk to the last token on the line, which holds the average
            while (s.hasMoreTokens()) {
                lasttoken = s.nextToken();
            }

            int avgprice = Integer.parseInt(lasttoken);
            output.collect(new Text(year), new IntWritable(avgprice));
        }
    }

    // Reducer class: keeps only the values above the threshold
    public static class E_EReduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {

        // Reduce function
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int maxavg = 30;
            int val = Integer.MIN_VALUE;

            while (values.hasNext()) {
                if ((val = values.next().get()) > maxavg) {
                    output.collect(key, new IntWritable(val));
                }
            }
        }
    }

    // Main function
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ProcessUnits.class);

        // Job settings (assumed here; adjust the name and classes to your job)
        conf.setJobName("process_units");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(E_EMapper.class);
        conf.setReducerClass(E_EReduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}



What are the different file permissions in HDFS for files or directory levels?

The Hadoop Distributed File System (HDFS) uses a specific permission model for files and directories. The following user levels are used in HDFS:

1. Owner
2. Group
3. Others

For each of these user levels, the following permissions apply:

1. read (r)
2. write (w)
3. execute (x)

These permissions work differently for files and directories.

For files,

  • The r permission is required to read the file
  • The w permission is required to write to the file

For directories,

  • The r permission is required to list the contents of the directory
  • The w permission is required to create or delete files or directories in it
  • The x permission is required to access a child of the directory
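
The owner/group/others levels and the r/w/x permissions are commonly written as three octal digits, such as 754. The sketch below, which uses only the JDK and a hypothetical class name, decodes such a mode into the familiar rwx notation.

```java
// Decodes an octal permission mode into rwx notation (illustrative only).
final class PermissionBits {

    // Converts a mode such as 0754 into a string such as "rwxr-xr--".
    static String toSymbolic(int mode) {
        if (mode < 0 || mode > 0777) {
            throw new IllegalArgumentException("mode must be between 000 and 777 (octal)");
        }
        StringBuilder sb = new StringBuilder();
        // Owner bits come first, then group, then others.
        for (int shift = 6; shift >= 0; shift -= 3) {
            int bits = (mode >> shift) & 7;
            sb.append((bits & 4) != 0 ? 'r' : '-');
            sb.append((bits & 2) != 0 ? 'w' : '-');
            sb.append((bits & 1) != 0 ? 'x' : '-');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSymbolic(0754)); // owner rwx, group r-x, others r--
        System.out.println(toSymbolic(0644)); // owner rw-, group r--, others r--
    }
}
```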

How to restart all the daemons in Hadoop?

1.Use the command to stop all the daemons at a time


3.then use the command to start all the stopped daemons at the same time.



What is the use of jps command in Hadoop?

The “jps” command is used to identify which Hadoop daemons are running. It lists all the Hadoop daemons running on the machine, i.e. NameNode, NodeManager, ResourceManager, DataNode, etc.

Explain the process that overwrites the replication factors in HDFS.

There are different ways to override the replication factor, depending on the requirement:

1. To override the replication factor on a per-file basis, use the Hadoop FS shell:
   [user@localhost ~]$ hadoop fs -setrep -w 3 /path/to/my/file
2. To override the replication factor of all the files under a directory:
   [user@localhost ~]$ hadoop fs -setrep -R -w 3 /path/to/my/directory
3. To override it via code, you can do the following:
   Configuration conf = new Configuration();
   conf.set("dfs.replication", "1");
   Job job = Job.getInstance(conf);


What will happen with a Name Node that doesn’t have any data?

A NameNode does not store file data itself. It holds the file system tree and the metadata for all the files and directories present in the system.


How Is Hadoop CLASSPATH essential to start or stop Hadoop daemons?

The classpath consists of a list of directories containing the jar files needed to start or stop the daemons.


Why is HDFS only suitable for large data sets and not the correct tool to use for many small files?

1. HDFS lacks the ability to efficiently support random reads of small files.
2. A small file in HDFS is one that is significantly smaller than the HDFS block size (128 MB by default).
3. If we store a huge number of small files, HDFS cannot handle them efficiently.
4. HDFS works best with a small number of large files for storing large datasets.
5. It is not suitable for a large number of small files.
6. A large number of small files overloads the NameNode, since it stores the namespace of HDFS in memory.
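
A rough estimate shows why small files put pressure on the NameNode. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes of NameNode memory per namespace object (file or block); that figure is an approximation, not a measured value, and the class and method names are hypothetical.

```java
// Back-of-the-envelope NameNode memory estimate (illustrative only).
final class NameNodeMemory {

    // Rule-of-thumb assumption: roughly 150 bytes of heap per file or block object.
    static final long BYTES_PER_OBJECT = 150;

    // Number of blocks needed for one file of the given size.
    static long blocksFor(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Estimated heap for fileCount files of equal size: one object per file plus one per block.
    static long estimateHeapBytes(long fileCount, long fileSize, long blockSize) {
        long objects = fileCount * (1 + blocksFor(fileSize, blockSize));
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        long gb = 1L << 30;
        long blockSize = 128 * mb;
        // The same 1 TB of data stored as 1 MB files versus 1 GB files.
        System.out.println("1 MB files: " + estimateHeapBytes(1_048_576L, mb, blockSize) + " bytes");
        System.out.println("1 GB files: " + estimateHeapBytes(1_024L, gb, blockSize) + " bytes");
    }
}
```

Under this assumption, storing 1 TB as one-megabyte files needs a few hundred megabytes of NameNode memory, while storing it as one-gigabyte files needs only about one megabyte.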


Why do we need Data Locality in Hadoop? Explain

Data locality ensures that the Map-Reduce task is moved to the DataNode that holds the data to be processed. This ensures that small computation code (KBs) is transferred across the network rather than huge data (GBs, TBs), which in turn results in better utilization of network resources and less time required for performing Map-Reduce tasks.


DFS can handle a large volume of data then why do we need the Hadoop framework?

A DFS can handle large volumes of data, but the Hadoop framework is what helps to process that data. The large data is divided into multiple blocks and stored on different commodity hardware machines.

What do you understand by Rack Awareness in Hadoop?

Rack awareness is knowledge of the cluster topology or, more specifically, of how the different DataNodes are distributed across the racks of a Hadoop cluster. The importance of this knowledge rests on the assumption that DataNodes inside the same rack have more bandwidth and lower latency between them, whereas two DataNodes in separate racks have comparatively less bandwidth and higher latency.

