This topic contains 7 replies, has 7 voices, and was last updated by  ujjwal kumar vivek 1 year, 7 months ago.

Viewing 8 posts - 1 through 8 (of 8 total)
  • Author
    Posts
  • #1869 Reply

    somu s
    Member

    HDFS:-
    1. Name the most common input formats defined in Hadoop. Which one is the default?

    2. Consider a case scenario: in an M/R system, the HDFS block size is 64 MB, the input format is FileInputFormat, and there are 3 files of size 64 KB, 65 MB and 127 MB. How many input splits will Hadoop create?
    3. What are some typical functions of Job Tracker?

    4. Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?

    5. What is Hadoop Streaming?
    6. What is Distributed Cache in Hadoop?
    7. Is it possible to have Hadoop job output in multiple directories? If yes, how?
    8. What will a Hadoop job do if you try to run it with an output directory that is already present? Will it overwrite it, warn you and continue, or throw an exception and exit?
    9. How will you write a custom partitioner for a Hadoop job?
    10. How did you debug your Hadoop code?

    #1891 Reply

    sweety jain
    Participant

    ans:1) The most common input formats in Hadoop are:
    1. Text input format
    2. Key value input format
    3. Sequence file input format
    The default input format is the text input format.
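
    A minimal sketch of how the input format is picked on a job, assuming the org.apache.hadoop.mapreduce API (the job name is only an example); if nothing is set, TextInputFormat is used:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

    public class InputFormatExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "input-format-example"); // example job name
            // TextInputFormat is the default; set another class to override it.
            job.setInputFormatClass(KeyValueTextInputFormat.class);
        }
    }
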
    ans:3) Typical functions of the JobTracker:
    a) Client applications submit jobs to the Job tracker.

    b)The JobTracker talks to the NameNode to determine the location of the data

    c)The JobTracker locates TaskTracker nodes with available slots at or near the data

    d) The JobTracker submits the work to the chosen TaskTracker nodes.

    e)When the work is completed, the JobTracker updates its status.

    f)Client applications can poll the JobTracker for information.

    ans:4) It will restart the task on some other TaskTracker, and only if the task fails more than 4 times (a default setting that can be changed) will it kill the job.
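
    A minimal sketch of changing that default, assuming the classic MRv1 property names (newer releases spell them mapreduce.map.maxattempts / mapreduce.reduce.maxattempts):

    import org.apache.hadoop.conf.Configuration;

    public class RetryConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Allow each map/reduce task up to 8 attempts before the job is failed
            // (the default is 4).
            conf.setInt("mapred.map.max.attempts", 8);
            conf.setInt("mapred.reduce.max.attempts", 8);
        }
    }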

    #1893 Reply

    chinni
    Participant

    1a) The most common input formats defined in Hadoop are:
    Text input format
    Key value input format
    Sequence file input format
    Text input format is the Hadoop default.

    2A) The input format is FileInputFormat.
    We have 3 files of size 64 KB, 65 MB and 127 MB.

    3A) The JobTracker performs some typical functions. They are:
    1. It takes the jobs from clients.
    2. It communicates with the NameNode to determine the location of the data.
    3. It locates the TaskTracker nodes nearest to the data.
    4. It submits the work to the chosen TaskTracker nodes and monitors the progress of each task by receiving heartbeat signals from the TaskTrackers.

    #1896 Reply

    chinni
    Participant

    4a) It will restart the task on some other TaskTracker, and only if the task fails more than four times (a default setting that can be changed) will it kill the job.

    5a)Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.

    7a) Yes, by using the MultipleOutputs class.

    10a) There can be several ways of doing this, but the most common are:
    1. By using counters.
    2. The web interface provided by the Hadoop framework.

    #1898 Reply

    1> Text input format is the default in Hadoop.
    2> The input format is FileInputFormat and we have 3 files of size 64 KB, 65 MB and 127 MB.

    Hadoop creates 5 splits:
    1 split for the 64 KB file
    2 splits for the 65 MB file (64 MB + 1 MB)
    2 splits for the 127 MB file (64 MB + 63 MB)
    since each split is bounded by the 64 MB block size.

    3> 1. It accepts jobs from clients.
    2. It talks to the NameNode to determine the location of the data.
    3. It assigns slots on the TaskTracker nodes nearest to the data.
    4> If a task fails, it will be restarted on another TaskTracker; after 4 failed attempts Hadoop will kill the job.

    5> Hadoop streaming is a utility that allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

    6>Distributed Cache is a facility provided by the MapReduce framework to cache files needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.

    7> Yes, by using the MultipleOutputs class.

    8>The Hadoop job will throw an exception and exit.

    9>

    10> There can be several ways of doing this, but the most common are:
    By using counters.
    The web interface provided by Hadoop framework.

    #1915 Reply

    faizan0607
    Participant

    Answers :
    1. The most common input formats defined in Hadoop are:
    i. Text input format
    ii. Key value text input format
    iii. NLine input format
    iv. Sequence file input format
    The default one is the text input format.

    3. Typical functions of Job Tracker
    i. Job Tracker accepts the job given by the client.
    ii. Job Tracker communicates with the Name Node to get the location of the data.
    iii. Job Tracker then finds the Task Trackers, close to the corresponding Data Nodes, which have slots available.
    iv. Job Tracker then assigns the work to those Task Trackers.
    v. After the work is done the Job Tracker updates the status.

    4. The task will be given to a different Task Tracker; if the task keeps on failing four times, then Hadoop will kill the job.

    5. Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer. Reference – “http://wiki.apache.org/hadoop/HadoopStreaming”

    6. Distributed Cache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by applications.

    Applications specify the files, via urls (hdfs:// or http://) to be cached via the JobConf. The DistributedCache assumes that the files specified via urls are already present on the FileSystem at the path specified by the url and are accessible by every machine in the cluster.
    The framework will copy the necessary files on to the slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves.

    Reference – https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/filecache/DistributedCache.html
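
    A minimal sketch of using that facility from a driver, assuming the old org.apache.hadoop.filecache.DistributedCache class referenced above; the HDFS paths are only examples:

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheExample {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(CacheExample.class);
            // The file must already exist on HDFS; every task node gets a local
            // copy before the job's tasks start ("#lookup" is the local link name).
            DistributedCache.addCacheFile(new URI("/user/demo/lookup.txt#lookup"), conf);
            // Archives (zip/tar/jar) are un-archived on the slaves after copying.
            DistributedCache.addCacheArchive(new URI("/user/demo/dict.zip"), conf);
        }
    }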

    7. Yes, with the help of the MultipleOutputs class.

    8. If we try to run a Hadoop job with an output directory that is already present, an exception is thrown and the job exits.

    #1916 Reply

    Paresh sahare
    Participant

    HDFS:-
    1. Name the most common Input Formats defined in hadoop? Which one is default?
    Ans :-
    1.TextInputFormat
    2.KeyValueInputFormat
    3.SequenceFileInputFormat
    TextInputFormat is the Hadoop default.

    2. Consider case scenario: In M/R system, – HDFS block size is 64 MB
    Ans :- The input format is FileInputFormat. With a 64 MB block size and files of 64 KB, 65 MB and 127 MB, Hadoop will create 5 input splits.
    sources :- https://books.google.co.in/books?id=6BmkBwAAQBAJ&pg=PT292&lpg=PT292&dq=HDFS+block+size+is+64+MB&source=bl&ots=79JN1X5OOn&sig=SxNF8oKwtjqibqsoDLCSRlXtSh4&hl=en&sa=X&ved=0CEYQ6AEwB2oVChMIsuGL1b_kyAIVCQiOCh2_HgfF#v=onepage&q=HDFS%20block%20size%20is%2064%20MB&f=false

    3. What are some typical functions of Job Tracker?
    Ans :- The following are some typical tasks of JobTracker:-
    Accepts jobs from clients
    It talks to the NameNode to determine the location of the data.
    It locates TaskTracker nodes with available slots at or near the data.
    It submits the work to the chosen TaskTracker nodes and monitors progress of each task by receiving heartbeat signals from Task tracker.

    sources http://hadoopbigdata.h2kinfosys.com/some-typical-functions-of-job-tracker/

    4. Suppose Hadoop spawned 100 tasks for a job and one of the task failed. What will Hadoop do?
    Ans :- It will restart the task on some other TaskTracker, and only if the task fails more than four times (a default setting that can be changed) will it kill the job.
    sources http://www.wiziq.com/blog/31-questions-for-hadoop-developers/

    5. What is Hadoop Streaming?
    Ans :- Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows
    you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
    sources https://hadoop.apache.org/docs/r1.2.1/streaming.html

    6. What is Distributed Cache in Hadoop?
    Ans :-DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by applications.
    sources :- https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/filecache/DistributedCache.html

    7. Is it possible to have Hadoop job output in multiple directories? If yes, how?
    Ans:-The MultipleOutputs class simplifies writing output data to multiple outputs
    sources https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
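
    A minimal sketch of that class in a reducer, assuming the org.apache.hadoop.mapreduce.lib.output.MultipleOutputs API from the link above; the named output "errors" and the key/value types are only illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class MultiOutReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        // Registered in the driver with:
        // MultipleOutputs.addNamedOutput(job, "errors", TextOutputFormat.class, Text.class, LongWritable.class);
        private MultipleOutputs<Text, LongWritable> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            if (key.toString().startsWith("ERR")) {
                // Goes to the "errors" named output, under the errors/ sub-directory.
                mos.write("errors", key, new LongWritable(sum), "errors/part");
            } else {
                context.write(key, new LongWritable(sum));
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }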

    8. What will a Hadoop job do if you try to run it with an output directory that is already present? Will it
    Ans :- The options are: overwrite it, warn you and continue, or throw an exception and exit.
    The Hadoop job will throw an exception and exit.
    sources http://www.wiziq.com/blog/31-questions-for-hadoop-developers/
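
    A common workaround, sketched below on the assumption that the driver is allowed to delete a stale output directory before submitting the job (the path comes from the command line here):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CleanOutputDir {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path output = new Path(args[0]);
            // FileOutputFormat refuses to run against an existing directory,
            // so remove it (recursively) before submitting the job.
            if (fs.exists(output)) {
                fs.delete(output, true);
            }
        }
    }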

    9. How will you write a custom partitioner for a Hadoop job?
    Ans:-
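    A minimal sketch of one way to do it, assuming the org.apache.hadoop.mapreduce API: extend Partitioner, override getPartition, and register the class on the job. The comma-separated key layout and the class name are only illustrative.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes each record to a reducer based on the first field of the key.
    public class FirstFieldPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String firstField = key.toString().split(",")[0];
            // Mask the sign bit so the partition index is never negative.
            return (firstField.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }
    // In the driver: job.setPartitionerClass(FirstFieldPartitioner.class);
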
    10. How did you debug your Hadoop code?
    Ans :- There can be several ways of doing this, but the most common are:
    1) Counters: counters are lightweight objects in Hadoop that allow you to keep track of system progress in both the map and reduce stages of processing.
    By default, Hadoop defines a number of standard counters in "groups"; these show up in the JobTracker webapp, giving you information such
    as "Map input records", "Map output records", etc. The guide linked below shows how to programmatically manipulate counters and is up to date as of Hadoop 0.20.1.
    sources http://lintool.github.io/Cloud9/docs/content/counters.html
    2) The web interface provided by the Hadoop framework.
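
    A minimal sketch of a counter being incremented from a mapper, assuming the org.apache.hadoop.mapreduce API; the enum and its counter names are only illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        // Custom counters show up as their own group in the JobTracker web UI.
        public enum Records { GOOD, MALFORMED }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().trim().isEmpty()) {
                context.getCounter(Records.MALFORMED).increment(1);
                return;
            }
            context.getCounter(Records.GOOD).increment(1);
            context.write(value, new LongWritable(1));
        }
    }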

    #1927 Reply

    ujjwal kumar vivek
    Participant

    Ans 1:>> The common input formats in Hadoop are:
    1) Text input format
    2) Sequence file input format
    3) Key value input format
    By default the input format is the text input format.

    Ans 3:>> The functions of the job tracker are as follows:
    1) It communicates with the namenode to determine the location of the data stored on the datanodes.
    2) It assigns tasks to the task trackers.
    3) It monitors the progress of each task by receiving heartbeat signals from the task trackers.

    Ans 4:>> The job tracker will assign the task to another task tracker. The job tracker will by default make four attempts to perform the task; after that it will kill the job.

    Ans 5:>> Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
    https://wiki.apache.org/hadoop/HadoopStreaming

    Ans 6:>> DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by applications. The framework copies the necessary files to the slave nodes before any tasks for the job are executed there.
    https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/filecache/DistributedCache.html
