Forum: HDFS AND CLUSTER QUESTIONS

This topic contains 2 replies, has 2 voices, and was last updated by Roopa 3 months ago.

#2993

Module 1 : 4th Feb 2017

    1a. Algorithm for Name Node to allocate block on different Data Node….
A:- The NameNode is the master of HDFS; it maintains and manages the blocks present on the DataNodes. For any given file, the NameNode keeps the list of blocks and their locations, and from that information the file can be reconstructed. The NameNode maintains the metadata about every file stored in the system. When a client writes data, the NameNode responds with the DataNodes on which to store the blocks, and the client then transfers the blocks directly to those DataNodes.
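To see how the NameNode has actually placed the blocks of a file across DataNodes, the fsck tool can be used (the path below is only an example):
Command : bin/hadoop fsck /user/ubuntu/sample.txt -files -blocks -locations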

    1b. Write the function called for splitting the user data into blocks
A:- HDFS uses a physical split, which divides the user data into multiple blocks. For example, 1 TB of data is divided into multiple blocks of 64 MB each by default.
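As a rough worked example with the default 64 MB block size (the final block may be smaller):
1 TB = 1024 * 1024 MB = 1,048,576 MB
1,048,576 MB / 64 MB = 16,384 blocks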

    2. How to modify the heartbeat and block report time interval of data node?
A:- We can modify the heartbeat and block report intervals using the properties dfs.heartbeat.interval and dfs.blockreport.intervalMsec. The defaults live in hdfs-default.xml; overrides are normally placed in hdfs-site.xml.

3. What is fsimage and what type of metadata does it store?
A:- An fsimage file contains the complete state of the file system at a point in time. The metadata maintained by the NameNode includes the name of the file, size, owner, group, permissions, block size, etc., and the complete metadata is stored as a snapshot in the fsimage.

4. If 2 TB of data is given, what is the maximum expected metadata that will be generated?
A:- Metadata is data that describes other data and its characteristics, and per file/block it is only on the order of kilobytes. If you know the input file sizes, divide them by the block size and multiply by roughly 120 KB; this gives the approximate size of the metadata.
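Following that rule of thumb (the 120 KB per-block figure above is only an estimate), a rough calculation for 2 TB with 64 MB blocks:
2 TB = 2 * 1024 * 1024 MB = 2,097,152 MB
2,097,152 MB / 64 MB = 32,768 blocks
32,768 blocks * 120 KB ≈ 3.75 GB of metadata (approximate)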

5. Write the lifecycle of SNN in production.
A:- The Secondary NameNode helps speed up NameNode restarts by taking over the responsibility of merging the edit logs with the fsimage from the NameNode.
-> It gets the edit logs from the NameNode at regular intervals and applies them to the fsimage
-> Once it has a new fsimage, it copies it back to the NameNode
-> The NameNode will use this fsimage for the next restart, which reduces the startup time

6. What metadata does the NN hold in cache memory?
A:- The NN holds the metadata of each file: name, size, owner, group, permissions, block size, etc.

7. If any DN stops working, how do the blocks of the dead DN move to the active DataNodes?
A:- The NameNode detects this condition by the absence of Heartbeat messages: it marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. The blocks that were on the dead node are then re-replicated from the surviving replicas onto other DataNodes so that the replication factor is maintained.

8. Write the condition: in which cases is the block size changed?
A:- A file will have fewer blocks if the block size is larger. This can make it possible for the client to read/write more data without interacting with the NameNode, and it also reduces the metadata size on the NameNode, reducing NameNode load. It depends on the input data: to maximize throughput for a very large input file, use very large blocks (128/256 MB); for smaller files, a smaller block size (64 MB) is better.

#2994

    Roopa P Karkera

    Module 1: Assignment

1 Algorithm for NN to allocate blocks on different DataNodes
-> The client sends a request to the NameNode to store data on the DataNode cluster.
The NameNode uses the below 3 factors to decide the allocation of DataNodes:
1) Nearest location - the DataNode nearest to the requesting client, for easy access to the data
2) Network traffic - if network traffic is high, it gives priority to the second-nearest location
3) Data redundancy - if a DN already holds a replica of the data, the data is stored in the next-nearest location, again taking network traffic into account

    2 Write the function called for splitting the user data into blocks
    Physical Split :
    -> HDFS is designed to support very large files.
    -> Applications that are compatible with HDFS are those that deal with large data sets.
    -> A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.
-> This process mainly maintains the distributed environment for easy access to the data
Default block size : 64 MB
Minimum block size : 64 MB
Increased block size : 64 MB * N (i.e. multiples of 64 MB)
Production block size : 128 MB

    3 How to modify the heartbeat and block report time interval of data node
The heartbeat and block report time intervals can be modified by overriding the corresponding properties in hdfs-site.xml, for example:
    <property>
    <name>dfs.heartbeat.interval</name>
    <value>2</value>
    <description>Determines datanode heartbeat interval in seconds.</description>
    </property>
Default value is 3 (seconds)

<property>
<name>dfs.heartbeat.recheck.interval</name>
<value>600000</value>
<description>Recheck interval in milliseconds; DataNodes without recent heartbeats are marked dead (default 300000 ms = 5 minutes).</description>
</property>
In the same way, we can change the block report interval with the below property.

<property>
<name>dfs.blockreport.intervalMsec</name>
<value>43200000</value>
<description>Determines the DataNode block report interval in milliseconds (the example value is 12 hours).</description>
</property>

Default value is 21600000 ms (6 hours)

4 What is fsimage and what type of metadata does it store?
An fsimage file contains the complete state of the file system at a point in time; it is the snapshot of the filesystem loaded when the NameNode starts.
Metadata stored in the fsimage includes:
-> The list of files and directories in the filesystem
-> The list of blocks for each file
-> File attributes, e.g. creation time, replication factor, permissions, etc.
(The block-to-DataNode mapping is not persisted in the fsimage; the NameNode rebuilds it from DataNode block reports.)
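As an illustration of inspecting this metadata (assuming a Hadoop 2.x or later installation, where the Offline Image Viewer is available; the fsimage file name below is just an example), the fsimage can be dumped to readable XML:
Command : hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml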

5 If 2 TB is given, what is the maximum expected metadata that will be generated?

If you know the input file sizes, divide them by the block size and multiply by approximately 128 KB; this gives the approximate size of the metadata.

6 Write the lifecycle of SNN in production

SNN is a misnomer: it is not a backup for the NN, but it helps recover the NN quickly in case of a crash.
-> The Secondary NameNode periodically reads the filesystem metadata from the NameNode and writes it to disk (the local file system).
-> It is responsible for combining the edit logs with the fsimage from the NameNode.
-> It gets the edit logs from the NameNode at regular intervals and applies them to the fsimage.
-> Once it has a new fsimage, it copies it back to the NameNode.
-> The NameNode will use this fsimage for the next restart, which reduces the startup time.
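The checkpoint frequency of the SNN is configurable. A hedged sketch using the Hadoop 1.x property names in core-site.xml (newer releases use dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.dir instead; the directory path is only an example):
<property>
<name>fs.checkpoint.period</name>
<value>3600</value>
<description>Seconds between two checkpoints performed by the SNN.</description>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>/home/hadoop/dfs/namesecondary</value>
<description>Local directory where the SNN stores the merged fsimage (example path).</description>
</property>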

7 What metadata does the NN hold in cache memory?
The NN holds the metadata of each file: name, size, owner, group, permissions, block size, etc.

8 If any DN stops working, how do the blocks of the dead DN move to the active DataNodes?
-> Every DN sends a heartbeat to the NN every 3 seconds (heartbeat acknowledgement).
-> If the NN stops receiving this signal for long enough, the DN is declared dead.
-> The data stored on the node declared dead is then re-replicated onto other nodes.
-> The important point is that replication always places copies on different DataNodes.
-> Every 10th heartbeat (i.e. roughly every 30 seconds in this scheme), the DataNode also sends a block report.
-> The NN updates its metadata after receiving this block report.
-> This communication happens through the port defined for the NameNode.
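For reference, the commonly quoted formula for when the NN actually marks a DN dead (assuming the 3-second heartbeat and 5-minute recheck interval defaults mentioned in question 3 above) is:
timeout = 2 * heartbeat recheck interval + 10 * heartbeat interval
        = 2 * 5 minutes + 10 * 3 seconds
        = 10 minutes 30 seconds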

9 Write the condition: in which cases is the block size changed?
-> Usually it depends on the input data. If we want to maximize throughput for a very large input file, we use very large blocks (the default is 64 MB; you can increase it to 128 MB, 256 MB, etc.). On the other hand, for smaller files a smaller block size is better.
-> So the rule of thumb is: larger files, larger blocks; smaller files, smaller blocks.
-> We can do this with the "dfs.block.size" parameter when the file is written; it overrides the default block size configured in hdfs-site.xml (see the command sketch after this answer).
-> Consider the below example.
The default block size in Hadoop is 64 MB, but we can modify it as per our requirement. For example, if we need to store 68 MB of data, then two blocks will be created.

The size of the first block will be 64 MB.
The size of the second block will be 4 MB.

Note: The remaining 60 MB of the second block will not be wasted, because Hadoop only allocates the space actually needed; it creates a block holding just the 4 MB of data, avoiding unnecessary waste of space.
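As a hedged example of overriding the block size for a single write (the file and directory names are placeholders; on Hadoop 2.x the property is called dfs.blocksize):
Command : hadoop fs -D dfs.block.size=134217728 -put largefile.txt /user/ubuntu/input/
Here 134217728 bytes = 128 MB.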

#3106

    Roopa

    Module 2 : Clusters

1 What external daemons are running in standalone mode?
In standalone mode there are no external daemons running.
Everything runs in a single JVM.
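For illustration, in standalone mode a MapReduce job runs in that single JVM and reads/writes the local filesystem directly (the jar name and paths below are examples for a Hadoop 1.2.0 download; adjust to your version):
Command : bin/hadoop jar hadoop-examples-1.2.0.jar wordcount input/ output/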

    2 Set Standalone cluster

Step 1: Extract the Ubuntu 12.04 VM archive.
Step 2: Open VMPlayer and click on "Open a Virtual Machine".
Step 3: Browse to the location on your system where the extracted Ubuntu image is present, then select the ubuntu.vmx file and click OK.
Step 4: Click on "Play Virtual Machine".
Step 5: After the virtual machine starts, click on the Dashboard.
Step 6: In the Dashboard, type "terminal" and open the Terminal.
Step 7: Update the repository
Command : sudo apt-get update
(For this VM image, the username is "username" and the password is "password", if prompted.)
Step 8: Install Java using the below command.
Command : sudo apt-get install openjdk-6-jdk
After installing Java, check the Java version:
Command : java -version
Step 9: Install openssh-server by using the below command
Command : sudo apt-get install openssh-server
Step 10: Download hadoop-1.2.0.tar.gz and extract it by using the below command.
Command : tar -xvf hadoop-1.2.0.tar.gz
After extracting, we can go to the Downloads folder using the cd command
    Step11: Edit core-site.xml
    Using below command
    Command : sudo gedit hadoop-1.2.0/conf/core-site.xml
    We can edit the property here
    <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
    </property>
Here, instead of localhost we can also give the machine's IP address.

    Step12: Edit hdfs-site.xml
    Use below command for open file
    Command: sudo gedit hadoop-1.2.0/conf/hdfs-site.xml

    <property>
    <name>dfs.replication</name>
    <value>1</value>
    </property>
    <property>
    <name>dfs.permissions</name>
    <value>false</value>
    </property>

dfs.replication - the replication factor is set to 1 for this single-node setup (the default is 3)

    Step13: Edit mapred-site.xml
    Use below command to open this file
    Command : sudo gedit hadoop-1.2.0/conf/mapred-site.xml

    <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
    </property>

    Step 14:
    Get your IP Address:
    Command:ifconfig
    Command: sudo gedit /etc/hosts
    Create a ssh key:
Command: ssh-keygen -t rsa -P ""
    Moving the key to authorized key:
Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
    Configuration:
    Add java_home in hadoop-env.sh file:
    Command: sudo gedit hadoop-1.2.0/conf/hadoop-env.sh
    Type:export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386

    Step 15:
    Format the name node
    Command: bin/hadoop namenode -format
    Start the Namenode, Datanode
    Command: bin/start-dfs.sh
    Start the task tracker and job tracker
    Command: bin/start-mapred.sh
    To check Hadoop started correctly
    Command: jps
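If everything started correctly, jps should list roughly the following daemons for this Hadoop 1.x setup (process IDs omitted; they will differ per machine):
NameNode
DataNode
SecondaryNameNode
JobTracker
TaskTracker
Jps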

3 How to control the daemons/jobs running on a DN?
To start all daemons:
sh start-all.sh

To start or stop the DataNode daemon individually:
    ./hadoop-daemon.sh start datanode
    ./hadoop-daemon.sh stop datanode

4 How to define the trash path?

The default trash path is /user/<username>/.Trash

    HDFS trash is just like the Recycle Bin in Windows operating systems. Its purpose is to prevent you from unintentionally deleting something.
    You can enable this feature by setting this property:
    “fs.trash.interval” with a number greater than 0 in core-site.xml.
    After the trash feature is enabled, when you remove something from HDFS by using the rm command, files or directories will not be wiped out immediately; instead, they will be moved to a trash directory (/user/${username}/.Trash, for example).
    hadoop fs -rm -r /tmp/5gb
    15/09/01 20:34:48 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 360 minutes, Emptier interval = 0 minutes.
    Moved: ‘hdfs://hdpnn/tmp/5gb’ to trash at: hdfs://hdpnn/user/ambari-qa/.Trash/Current

Deletion interval: specifies how long (in minutes) a checkpoint is kept before it expires and is deleted.
It is the value of fs.trash.interval. The NameNode runs a thread to periodically remove expired checkpoints from the file system.
Emptier interval: specifies how long (in minutes) the NameNode waits before running the thread that manages checkpoints.
The NameNode deletes checkpoints that are older than fs.trash.interval and creates a new checkpoint from /user/${username}/.Trash/Current.
This frequency is determined by the value of fs.trash.checkpoint.interval, and it must not be greater than the deletion interval.
This ensures that in each emptier window there are one or more checkpoints in the trash.

    Example:
    fs.trash.interval = 360 (deletion interval = 6 hours)
    fs.trash.checkpoint.interval = 60 (emptier interval = 1 hour)
    This causes the NameNode to create a new checkpoint every hour and to delete checkpoints that have existed longer than 6 hours.
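A hedged sketch of the corresponding core-site.xml entries for the values above (fs.trash.checkpoint.interval is optional; if left at 0 it defaults to the deletion interval):
<property>
<name>fs.trash.interval</name>
<value>360</value>
<description>Minutes a deleted file stays in the trash before it is permanently removed.</description>
</property>
<property>
<name>fs.trash.checkpoint.interval</name>
<value>60</value>
<description>Minutes between trash checkpoints; must not exceed fs.trash.interval.</description>
</property>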

To empty the trash:
    hadoop fs -expunge
    This command causes the NameNode to permanently delete files from the trash that are older than the threshold, instead of waiting for the next emptier window.
    It immediately removes expired checkpoints from the file system.

To bypass the trash:
    hadoop fs -rm -skipTrash /path/to/permanently/delete
    This bypasses the trash and removes the files immediately from the file system.
