Tagged: Hadoop Basics
- March 16, 2017 at 8:41 pm #3051
1. What algorithm does the NN (NameNode) use to allocate blocks to different DataNodes?
When a client asks the NameNode to store data on the DataNode cluster, the NameNode decides which DataNodes to allocate based on the following three factors:
1) Nearest location - the DataNode closest to the client (in network terms) is preferred, so that the data is easy to access.
2) Network traffic - if network traffic to the nearest location is high, priority goes to the second-nearest location.
3) Data redundancy - if a DN already holds a replica of the data, the next replica is stored at the next-nearest location, again taking network traffic into account.
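The factors above can be made concrete with a small sketch of rack-aware placement for three replicas, following the default policy described in the HDFS architecture documentation for recent releases: first replica on the writer's own node, second on a node in a different rack, third on another node in that same remote rack. The function name, the topology dictionary, and the node names are hypothetical, and the sketch picks the first candidate in sorted order where real HDFS chooses randomly and also weighs node load and free space.

```python
def place_replicas(writer, topology):
    """Pick three DataNodes for one block (toy model, not HDFS code).

    topology maps a rack name to the list of DataNode names on that rack.
    Assumption: the writer itself runs on a DataNode.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    if writer not in rack_of:
        raise ValueError("this sketch assumes the writer runs on a DataNode")

    # Replica 1: the writer's own node (nearest location).
    first = writer

    # Replica 2: a node on a different rack, so one rack failure cannot lose the block.
    remote = [n for n in sorted(rack_of) if rack_of[n] != rack_of[first]]
    if not remote:
        raise ValueError("need at least two racks")
    second = remote[0]

    # Replica 3: a different node on the same rack as replica 2 (limits cross-rack traffic).
    same_remote_rack = [n for n in remote if rack_of[n] == rack_of[second] and n != second]
    if not same_remote_rack:
        raise ValueError("need at least two nodes on the remote rack")
    third = same_remote_rack[0]

    return [first, second, third]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", topology))  # ['dn1', 'dn3', 'dn4']
```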
2. Which function is called to split the user data into blocks?
HDFS is designed to support very large files.
Applications that are compatible with HDFS are those that deal with large data sets.
A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.
Splitting files this way keeps the data distributed across the cluster, so that it can be accessed easily and in parallel.
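Default block size: 64 MB (Hadoop 1.x; the default is 128 MB from Hadoop 2.x onward)
Minimum block size normally used: 64 MB (the enforced lower limit is configurable through dfs.namenode.fs-limits.min-block-size)
Increasing the block size: use a multiple of 64 MB, i.e. 64 × N MB
Typical production block size: 128 MB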
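As a rough illustration of the chopping described above, the sketch below computes the (offset, length) of each block for a given file size. It only models the arithmetic; the function name is hypothetical and is not the HDFS client routine that actually performs the split.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return a list of (offset, length) pairs, one per block.

    Every block is block_size bytes long except possibly the last one,
    which holds whatever data is left over.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# A 200 MB file with 64 MB blocks: three full blocks and one 8 MB block.
for offset, length in split_into_blocks(200 * MB):
    print(f"offset={offset // MB} MB, length={length // MB} MB")
```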
3. How do you modify the heartbeat and block report time intervals for the DataNode?
The heartbeat and block report time intervals can be changed by editing the corresponding properties in hdfs-site.xml, for example:
dfs.heartbeat.interval: determines the DataNode heartbeat interval in seconds.
Default value: 3
dfs.namenode.heartbeat.recheck-interval (heartbeat.recheck.interval in older releases): determines when machines are marked dead.
In the same way, the interval for the block report can be changed with the property below.
dfs.blockreport.intervalMsec: determines the DataNode block report interval, in milliseconds.
Default value: 21600000 (6 hours)
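For reference, here is a sketch of how such entries typically look in hdfs-site.xml, wrapped in a short script that parses the fragment and converts the values. The property names dfs.heartbeat.interval and dfs.blockreport.intervalMsec are the ones used by Hadoop 2.x and later; treat them as an assumption if you run an older release, and check the hdfs-default.xml that ships with your version.

```python
import xml.etree.ElementTree as ET

# Assumed hdfs-site.xml fragment (Hadoop 2.x+ property names, default values).
HDFS_SITE_FRAGMENT = """\
<configuration>
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value>
    <description>Determines datanode heartbeat interval in seconds.</description>
  </property>
  <property>
    <name>dfs.blockreport.intervalMsec</name>
    <value>21600000</value>
    <description>Determines block reporting interval in milliseconds.</description>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return {property name: value} for every <property> element."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


props = read_properties(HDFS_SITE_FRAGMENT)
heartbeat_s = int(props["dfs.heartbeat.interval"])
blockreport_h = int(props["dfs.blockreport.intervalMsec"]) / 3_600_000  # ms -> hours
print(f"heartbeat every {heartbeat_s} s, block report every {blockreport_h} h")
```

The DataNodes need a restart to pick up changed values.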
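4. What is the fsimage and what type of metadata does it store?
An fsimage file contains the complete state of the file system at a point in time. It is a snapshot of the file system as of the last checkpoint (for example, the one taken when the NameNode started).
The metadata stored in the fsimage includes:
List of files and directories
List of blocks for each file
File attributes, e.g. modification and access times, replication factor, etc.
The list of DataNodes for each block is not stored in the fsimage; the NameNode rebuilds it in memory from the block reports sent by the DataNodes.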
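5. If 2 TB of data is given, what is the maximum amount of metadata expected to be generated?
If you know the input file sizes, divide them by the block size to get the number of blocks, then multiply by the metadata cost per block; this gives the approximate size of the metadata. A commonly quoted rule of thumb is about 150 bytes of NameNode memory per block, so 2 TB stored in 64 MB blocks is 32,768 blocks, or roughly 5 MB of block metadata (file and directory entries add a little more).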
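The estimate can be worked through in a few lines. This is only a back-of-the-envelope sketch: the 150 bytes per block is the commonly quoted rule of thumb for NameNode memory, not a measured value, and the function name is hypothetical.

```python
MB = 1024 ** 2
TB = 1024 ** 4


def estimate_block_metadata(data_bytes, block_size, bytes_per_block=150):
    """Return (number of blocks, approximate metadata bytes).

    bytes_per_block is an assumed rule-of-thumb figure; the real cost
    depends on the Hadoop version and on file and directory counts.
    """
    num_blocks = -(-data_bytes // block_size)  # ceiling division
    return num_blocks, num_blocks * bytes_per_block


for block_mb in (64, 128):
    blocks, meta = estimate_block_metadata(2 * TB, block_mb * MB)
    print(f"{block_mb} MB blocks: {blocks} blocks, about {meta / MB:.2f} MB of metadata")
```

Doubling the block size halves the number of blocks and therefore the block metadata, which is one reason larger blocks are preferred for very large data sets.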
6. Describe the life cycle of the SNN (Secondary NameNode) in production.
SNN is a misnomer. It is not a backup for the NN, although its checkpoints help to recover the NN after a crash.
-> The Secondary NameNode periodically fetches the file system metadata from the NameNode and writes it to disk.
-> It is responsible for combining the edit logs with the fsimage from the NameNode.
-> It gets the edit logs from the NameNode at regular intervals and applies them to the fsimage.
-> Once it has the new fsimage, it copies it back to the NameNode.
-> The NameNode uses this fsimage on its next restart, which reduces the startup time.
The Secondary NameNode's job is not to be a secondary to the NameNode, but only to periodically read the log of file system changes and apply them to the fsimage file, bringing it up to date. This allows the NameNode to start up faster next time.
Unfortunately, despite its name, the Secondary NameNode service is not a standby NameNode. Specifically, it does not offer HA for the NameNode.
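The checkpoint cycle described above can be sketched as a toy model: the fsimage is a dictionary of paths, the edit log is a list of operations, and a checkpoint replays the log onto a copy of the image. The operation names and data layout are hypothetical; real fsimage and edit log files are binary formats with many more operation types.

```python
def apply_edits(fsimage, edits):
    """Return a new namespace with every logged edit applied in order."""
    merged = dict(fsimage)
    for op, path, *args in edits:
        if op == "create":
            merged[path] = {"replication": args[0]}
        elif op == "delete":
            merged.pop(path, None)
        elif op == "set_replication":
            merged[path] = {**merged[path], "replication": args[0]}
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


def checkpoint(fsimage, edits):
    """Merge the edit log into the fsimage; return (new fsimage, emptied edit log)."""
    return apply_edits(fsimage, edits), []


fsimage = {"/a": {"replication": 3}}
edits = [("create", "/b", 3), ("set_replication", "/a", 2), ("delete", "/b")]

new_image, remaining_edits = checkpoint(fsimage, edits)
print(new_image)        # {'/a': {'replication': 2}}
print(remaining_edits)  # []
```

After a checkpoint the NameNode only has to replay the edits logged since then, which is why restarts get faster.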
7. What metadata does the NN hold in memory?
The NN holds the metadata of each file: name, size, owner, group, permissions, block size, etc.
8. If a DN stops working, how are the blocks of the dead DN moved to the active DataNodes?
-> Every DN sends a heartbeat to the NN every 3 seconds, and the NN acknowledges it.
-> If the NN stops receiving this signal from a DN for longer than the configured timeout (about 10.5 minutes with default settings), the DN is declared dead.
-> In that case, the NN schedules the blocks that were stored on the dead node to be re-replicated to other nodes, copying them from the surviving replicas.
-> The important point here is that a new replica is always placed on a different DataNode from the ones that already hold that block.
-> Separately from heartbeats, every DataNode periodically sends a block report listing the blocks it holds (every 6 hours by default).
-> After receiving a block report, the NN updates its metadata.
-> This communication goes through the RPC port configured for the NameNode.
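How long the NN waits before declaring a DN dead can be computed from two settings. The sketch below uses the formula the NameNode applies in Hadoop 2.x and later (twice the recheck interval plus ten heartbeat intervals); the function name is hypothetical, and the defaults are those of dfs.heartbeat.interval and dfs.namenode.heartbeat.recheck-interval.

```python
def dead_node_timeout_seconds(heartbeat_interval_s=3, recheck_interval_ms=300_000):
    """Seconds without a heartbeat after which a DataNode is considered dead.

    timeout = 2 * recheck interval + 10 * heartbeat interval
    """
    timeout_ms = 2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s
    return timeout_ms / 1000


timeout = dead_node_timeout_seconds()
print(f"{timeout} s = {timeout / 60} minutes")  # 630.0 s = 10.5 minutes
```

Lowering either setting makes failure detection faster, at the cost of more false alarms when a node is merely slow.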
9. In which cases is the block size changed?
-> It usually depends on the input data. To maximize throughput for a very large input file, use very large blocks (the default is 64 MB; it can be increased to 128 MB, 256 MB, etc.). For smaller files, on the other hand, a smaller block size is better.
-> In short: large files, large blocks; small files, small blocks.
-> The block size can be set when a file is written by using the "dfs.block.size" parameter (named dfs.blocksize from Hadoop 2.x onward). It overrides the default block size configured in hdfs-site.xml.
-> Consider the example below.
The default block size in Hadoop is 64 MB, but it can be changed as required. For example, if we need to store 68 MB of data, two blocks are created:
The first block is 64 MB.
The second block is 4 MB.
Note: the remaining 60 MB of the second block is not wasted. An HDFS block only occupies as much disk space as the data it holds, so the second block takes up 4 MB rather than a full 64 MB.
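To put numbers on the note above, the sketch below compares the disk space actually consumed with what would be consumed if every block were padded to the full block size. It assumes the default replication factor of 3 (dfs.replication); both function names are hypothetical.

```python
MB = 1024 * 1024


def actual_usage(file_size, replication=3):
    """Disk space consumed across the cluster: only the real data, once per replica."""
    return file_size * replication


def padded_usage(file_size, block_size=64 * MB, replication=3):
    """What usage would be if every block were padded to block_size (HDFS does not do this)."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    return num_blocks * block_size * replication


size = 68 * MB
print(actual_usage(size) // MB, "MB actually used")      # 204 MB
print(padded_usage(size) // MB, "MB if blocks were padded")  # 384 MB
```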