Forum

    #1258

    rhiddhiman
    Participant

    (1)What is the difference between Secondary Namenode, Checkpoint Namenode & Backup Node?
    The Secondary Namenode is a poorly named component of Hadoop: it is not a standby Namenode. It periodically merges the Namenode's fsimage with the edit log (checkpointing) so that the edit log does not grow without bound. The Checkpoint Namenode is its more cleanly designed replacement, performing the same periodic checkpointing and uploading the merged image back to the Namenode. The Backup Node goes further: besides checkpointing, it keeps an up-to-date, in-memory copy of the file system namespace that is always synchronized with the Namenode.
    (2)What are the Side Data Distribution Techniques?
    Side data can be defined as extra read-only data needed by a job to process the main dataset. The challenge is to make side data available to all the map or reduce tasks, which are spread across the cluster, in a convenient and efficient fashion. The two main techniques are placing the data in the job configuration and using the Distributed Cache.

    (3)What is shuffling in MapReduce?
    MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers the map outputs to the reducers as inputs is known as SHUFFLING.

    (4)What is partitioning?
    Partitioning is the process of deciding which reducer receives each intermediate key-value pair; the division is made on the basis of the likes of a hash of the key, timestamps and other similar attributes.

    (5)Can we change the file cached by DistributedCache?
    No. The framework assumes that cached files are not modified while the job is executing; changing them can lead to inconsistent results.

    (6)What if job tracker machine is down?
    If the JobTracker goes down, the MapReduce service stops and all running jobs are halted. The JobTracker is the single point of failure of the Hadoop MapReduce service.

    (7)Can we deploy job tracker other than name node?
    Yes, in a production cluster the JobTracker runs on a separate machine.

    (8)What are the four modules that make up the Apache Hadoop framework?
    Hadoop Common: The common utilities that support the other Hadoop modules.
    Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
    Hadoop YARN: A framework for job scheduling and cluster resource management.
    Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

    (9)Which modes can Hadoop be run in? List a few features for each mode.
    Standalone: There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.
    Pseudo-Distributed: The Hadoop daemons run on the local machine, thus simulating a cluster on a small scale.
    Fully Distributed: The Hadoop daemons run on a cluster of machines.

    (10)Where are Hadoop’s configuration files located?
    Inside the “conf” directory under the main Hadoop installation directory.

    (11)List Hadoop’s three configuration files.
    core-site.xml
    hdfs-site.xml
    mapred-site.xml

    (12)What are “slaves” and “masters” in Hadoop?
    Slaves are the daemons in a Hadoop cluster that store the data and its replicas and carry out the processing: DataNode, Secondary NameNode and TaskTracker.
    Masters are the daemons in a Hadoop cluster that monitor the data, its replication and the processing requests: NameNode and JobTracker.

    (13)How many datanodes can run on a single Hadoop cluster?
    There is no fixed limit; any number (N) of DataNodes can run in a single Hadoop cluster.

    (14)What is job tracker in Hadoop?
    JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop.

    (15)How many job tracker processes can run on a single Hadoop cluster?
    One

    (16)What sorts of actions does the job tracker process perform?
    Client applications submit jobs to the JobTracker.
    The JobTracker determines the location of the data by communicating with the NameNode.
    The JobTracker finds TaskTracker nodes that have open slots at or near the data.
    The JobTracker submits the job to those TaskTracker nodes.
    The JobTracker monitors the TaskTracker nodes for signs of activity. If a TaskTracker does not show enough activity, the JobTracker transfers its work to a different TaskTracker node.
    The JobTracker receives a notification from a TaskTracker if a task has failed. It may then resubmit the task elsewhere, as described above; if it does not, it may blacklist either the job or the TaskTracker.

    (17)How does job tracker schedule a job for the task tracker?
    When a client application submits a job to the JobTracker, the JobTracker looks for a TaskTracker with an empty slot on the server that hosts the relevant DataNode and schedules the task there.

    (18)What does the mapred.job.tracker property do?
    It is a configuration property (not a command) that specifies the host and port at which the MapReduce JobTracker runs. If set to “local”, jobs are run in-process as a single map and reduce task.

    (19)What is “PID”?
    Process Identifier

    (20)What is “jps”?
    jps (JVM Process Status) is a JDK tool that lists the running Java processes, so it can be used to check which Hadoop daemons are up.

    (21)Is there another way to check whether Namenode is working?
    Besides the jps command, you can also use: /etc/init.d/hadoop-0.20-namenode status.

    (22)How would you restart Namenode?
    To restart the Namenode, you can either switch to the hdfs user (su - hdfs) and run the service script, i.e.
    • /etc/init.d/hadoop-0.20-namenode stop
    • /etc/init.d/hadoop-0.20-namenode start
    or simply run stop-all.sh followed by start-all.sh from the Hadoop bin directory.

    (23)What is “fsck”?
    fsck stands for File System Check. The hadoop fsck command reports the health of the HDFS file system, for example missing or under-replicated blocks.

    (24)What is a “map” in Hadoop?
    “Map” is the first phase of a MapReduce job. The map task reads data from the input location and, based on the input format, generates key-value pairs, that is, an intermediate output stored on the local machine.

    (25)What is a “reducer” in Hadoop?
    “Reducer” is another phase of a MapReduce job. The reduce task processes the intermediate output received from the mappers and generates the final output.

    (26)What are the parameters of mappers and reducers?
    The four basic parameters of a mapper are LongWritable, Text, Text and IntWritable. The first two are the input key/value types and the second two are the intermediate output key/value types.
    The four basic parameters of a reducer are Text, IntWritable, Text and IntWritable. The first two are the intermediate output key/value types and the second two are the final output key/value types. (These are the types used in the classic WordCount example; in general they can be any Writable types.)
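
    To make this concrete, here is a minimal WordCount-style sketch of how those four type parameters appear in the class declarations (class and field names are illustrative, not from the original answer):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: input key/value, intermediate key/value.
    public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                word.set(token);
                context.write(word, ONE);  // emit intermediate (Text, IntWritable) pair
            }
        }
    }

    // Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: intermediate key/value, final key/value.
    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));  // emit final (Text, IntWritable) pair
        }
    }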

    (27)Is it possible to rename the output file, and if so, how?
    Yes, for example by using the MultipleOutputs class (or a custom output format), which lets you control the names of the output files.

    (28)List the network requirements for using Hadoop.
    SSH – Secure Shell between the nodes (ideally password-less), which the Hadoop control scripts use to launch the server processes.

    (29)Which port does SSH work on?
    22

    (30)What is streaming in Hadoop?
    As part of the Hadoop framework, streaming is a feature that lets developers write MapReduce jobs in any language, as long as that language can read from standard input and write to standard output. Even though Hadoop is Java-based, the chosen language doesn’t have to be Java. It can be Perl, Ruby, etc.

    (31)What is the difference between Input Split and an HDFS Block?
    Though both refer to dividing the data, an Input Split is a logical division of the input that is processed by a single mapper, whereas an HDFS Block is the physical division of the data on disk.

    (32)What does the file hadoop-metrics.properties do?
    The hadoop-metrics.properties file controls reporting in Hadoop.

    (33)Name the most common Input Formats defined in Hadoop? Which one is default?
    TextInputFormat — Default
    KeyValueInputFormat
    SequenceFileInputFormat
    NLineInputFormat

    (34)What is the difference between TextInputFormat and KeyValueInputFormat class?
    Both Input Formats treat each line in a text file as a record. In TextInputFormat the key is the byte offset of the line and the value is the content of the line. In KeyValueInputFormat the first separator character (tab by default) divides each line: everything before the separator is the key and everything after it is the value.

    (35)What is InputSplit in Hadoop?
    When a Hadoop job is run, it splits the input files into chunks and assigns each split to a mapper to process. Such a chunk is called an InputSplit.

    (36)How is the splitting of files invoked in the Hadoop framework?
    It is invoked by the Hadoop framework by running the getSplits() method of the InputFormat class (such as FileInputFormat) defined by the user.

    (36)Consider case scenario: In M/R system,
    – HDFS block size is 64 MB
    – Input format is FileInputFormat
    – We have 3 files of size 64KB, 65MB and 127MB
    (37)How many input splits will be made by the Hadoop framework?
    5 – one split for the 64KB file, two splits for the 65MB file and two splits for the 127MB file.

    (38)What is the purpose of RecordReader in Hadoop?
    The InputSplit defines a slice of work but does not describe how to access it. The RecordReader loads the data from its source and converts it into key-value pairs suitable for reading by the Mapper.

    (39)After the Map phase finishes, the Hadoop framework does “Partitioning, Shuffle and sort”. Explain what happens in this phase?

    (40)If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to the reducer?
    The default partitioner computes a hash value for the key and assigns the partition based on this result.
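
    As a rough sketch, the default hash partitioner behaves essentially like this (the class name is illustrative):

    import org.apache.hadoop.mapreduce.Partitioner;

    // Mask off the sign bit of the key's hash, then take it modulo the number
    // of reduce tasks; every occurrence of a key therefore lands in the same partition.
    public class HashStylePartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }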

    (41)What is JobTracker?
    Job Tracker is a Hadoop service that assigns Task Trackers with MapReduce tasks.

    (42)What are some typical functions of Job Tracker?
    – Client applications submit jobs to the Job tracker.
    – The JobTracker talks to the NameNode to determine the location of the data
    – The JobTracker locates TaskTracker nodes with available slots at or near the data
    – The JobTracker submits the work to the chosen TaskTracker nodes.
    – The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
    – A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
    – When the work is completed, the JobTracker updates its status.

    (43)What is TaskTracker?
    A TaskTracker is a node in the cluster that accepts tasks – Map, Reduce and Shuffle operations – from a JobTracker.
    Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.
    The TaskTracker spawns separate JVM processes to do the actual work; this ensures that a process failure does not take down the TaskTracker itself. The TaskTracker monitors these spawned processes, capturing their output and exit codes. When a process finishes, successfully or not, the TaskTracker notifies the JobTracker. The TaskTrackers also send heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.

    (44)What is the relationship between Jobs and Tasks in Hadoop?
    One-to-many: a single job is split into many tasks (map and reduce tasks).

    (46)Suppose Hadoop spawned 100 tasks for a job and one of the task failed. What will Hadoop do?
    Hadoop will assign the task again to another TaskTracker. If the task fails more times than the configured maximum number of attempts (4 by default), the task is killed and the whole job is marked as failed.

    (47)Hadoop achieves parallelism by dividing the tasks across many nodes; it is possible for a few slow nodes to rate-limit the rest of the program and slow down the program. What mechanism does Hadoop provide to combat this?
    Speculative execution.

    (48)How does speculative execution work in Hadoop?
    If a task is running noticeably slower than the others, Hadoop speculatively launches duplicate copies of that task on other slave nodes; the result from whichever copy finishes first is used and the remaining copies are killed (see question 170 below).

    (49)Using command line in Linux, how will you
    – See all jobs running in the Hadoop cluster: hadoop job -list
    – Kill a job: hadoop job -kill <jobID>

    (50)What is Hadoop Streaming?
    Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer. Using the streaming system you can develop working Hadoop jobs with extremely limited knowledge of Java. At its simplest, your development task is to write two shell scripts that work well together, let’s call them shellMapper.sh and shellReducer.sh. On a machine that doesn’t even have Hadoop installed you can get first drafts of these working by writing them to work in this way:
    cat someInputFile | shellMapper.sh | shellReducer.sh > someOutputFile
    With streaming, Hadoop basically becomes a system for making pipes from shell scripting work (with some fudging) on a cluster. There’s a strong logical correspondence between the Unix shell scripting environment and Hadoop streaming jobs. The same example run with Hadoop has somewhat less elegant syntax, but this is what it looks like:
    hadoop jar hadoop-streaming.jar -input /dfsInputDir/someInputData -file shellMapper.sh -mapper shellMapper.sh -file shellReducer.sh -reducer shellReducer.sh -output /dfsOutputDir/myResults
    The real place the logical correspondence breaks down is that in a one-machine scripting environment shellMapper.sh and shellReducer.sh each run as a single process and data flows directly from one process to the other. With Hadoop, the shellMapper.sh file is sent to every machine in the cluster that has data chunks, and each such machine runs its own chunks through its shellMapper.sh process. The output from those scripts is not reduced on each of those machines; instead, the output is sorted so that different lines from the various map tasks are streamed across the network to the machines where the reduce(s) can be performed.

    (51)What is the characteristic of streaming API that makes it flexible run MapReduce jobs in languages like Perl, Ruby, Awk etc.?
    Hadoop Streaming allows the use of arbitrary programs for the Mapper and Reducer phases of a MapReduce job: both mappers and reducers receive their input on stdin and emit output (key, value) pairs on stdout.

    (52)What is Distributed Cache in Hadoop?
    Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.

    (53)Is it possible to provide multiple input to Hadoop? If yes then how can you give multiple directories as input to the Hadoop job?
    Yes, the FileInputFormat class provides methods (addInputPath(), which can be called repeatedly, and addInputPaths()) to add multiple directories as input to a Hadoop job.
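
    For illustration, a minimal driver sketch using those FileInputFormat methods (the paths and job name are made up, and the mapper/reducer/output settings are omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class MultiInputDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "multi-input-example");
            // Each call adds one more input directory to the same job.
            FileInputFormat.addInputPath(job, new Path("/data/logs/2014-01"));
            FileInputFormat.addInputPath(job, new Path("/data/logs/2014-02"));
            // addInputPaths() accepts a comma-separated list of paths.
            FileInputFormat.addInputPaths(job, "/data/extra/a,/data/extra/b");
            // ... set mapper, reducer and output path as usual, then submit ...
        }
    }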

    (54)Is it possible to have Hadoop job output in multiple directories? If yes, how?
    Yes, by using the MultipleOutputs class.

    (55)What will a Hadoop job do if you try to run it with an output directory that is already present? Will it
    – Overwrite it
    – Warn you and continue
    – Throw an exception and exit
    Throw an exception and exit

    (56)How can you set an arbitrary number of mappers to be created for a job in Hadoop?
    It cannot be set directly; the number of map tasks is determined by the number of input splits (mapred.map.tasks is only a hint to the framework).

    (57)How can you set an arbitrary number of Reducers to be created for a job in Hadoop?
    You can either do it programmatically by calling the setNumReduceTasks() method of the JobConf class (or Job in the new API), or set it as a configuration property (mapred.reduce.tasks).
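
    For illustration, a small sketch of both approaches using the old JobConf API (the class name and the value 10 are arbitrary):

    import org.apache.hadoop.mapred.JobConf;

    public class ReducerCountExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf(ReducerCountExample.class);
            // Programmatically request 10 reduce tasks ...
            conf.setNumReduceTasks(10);
            // ... or set the same thing as a configuration property.
            conf.setInt("mapred.reduce.tasks", 10);
        }
    }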

    (58)How will you write a custom partitioner for a Hadoop job?
    To have Hadoop use a custom partitioner you have to do at minimum the following three things:
    – Create a new class that extends the Partitioner class
    – Override its getPartition() method
    – In the wrapper that runs the MapReduce job, either add the custom partitioner to the job programmatically using the setPartitionerClass() method, or add it to the job as a configuration setting (if your wrapper reads from a config file or Oozie).
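
    A minimal sketch of such a custom partitioner (the class name and the first-letter routing rule are made up for illustration; it would be registered in the driver with setPartitionerClass()):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes keys to reducers by the first letter of the key.
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            if (key.getLength() == 0) {
                return 0;
            }
            char first = Character.toLowerCase(key.toString().charAt(0));
            return first % numReduceTasks;
        }
    }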

    (59)How did you debug your Hadoop code?
    There can be several ways of doing this, but the most common ones are:
    – Using counters.
    – Using the web interface provided by the Hadoop framework.
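
    As an illustration of the counters approach, a sketch of a mapper that tallies malformed records with a custom counter instead of failing (the counter names and the comma-separated record format are hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CountingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        enum RecordQuality { GOOD, MALFORMED }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length < 2) {
                // Skip the bad record but keep a tally; the totals appear in the
                // job's counter report (web UI and job client output).
                context.getCounter(RecordQuality.MALFORMED).increment(1);
                return;
            }
            context.getCounter(RecordQuality.GOOD).increment(1);
            context.write(new Text(fields[0]), new IntWritable(1));
        }
    }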

    (60)What is BIG DATA?
    Big Data is a term which is used to describe exponential growth of both structured and unstructured data.

    (61)Can you give some examples of Big Data?
    Shopping websites like Amazon, eBay and Flipkart manage huge quantities of data; these are examples of Big Data.

    (62)Can you give a detailed overview about the Big Data being generated by Facebook?
    As of December 31, 2012, there are 1.06 billion monthly active users on facebook and 680 million mobile users. On an average, 3.2 billion likes and comments are posted every day on Facebook. 72% of web audience is on Facebook. And why not! There are so many activities going on facebook from wall posts, sharing images, videos, writing comments and liking posts, etc. In fact, Facebook started using Hadoop in mid-2009 and was one of the initial users of Hadoop.

    (63)According to IBM, what are the three characteristics of Big Data?
    According to IBM, the three characteristics of Big Data are:
    Volume: e.g. Facebook generating 500+ terabytes of data per day.
    Velocity: e.g. analyzing 2 million records each day to identify the reason for losses.
    Variety: images, audio, video, sensor data, log files, etc.

    (64)How Big is ‘Big Data’?
    With time, data volume is growing exponentially. Earlier we used to talk about megabytes or gigabytes, but now we talk about data volume in terms of terabytes, petabytes and even zettabytes! Global data volume was around 1.8ZB in 2011 and is expected to be 7.9ZB in 2015. It is also estimated that global information doubles every two years.

    (65)How analysis of Big Data is useful for organizations?
    Effective analysis of Big Data provides a lot of business advantage as organizations will learn which areas to focus on and which areas are less important. Big data analysis provides some early key indicators that can prevent the company from a huge loss or help in grasping a great opportunity with open hands! A precise analysis of Big Data helps in decision making! For instance, nowadays people rely so much on Facebook and Twitter before buying any product or service. All thanks to the Big Data explosion.

    (66)Who are ‘Data Scientists’?
    Data scientists are gradually replacing business analysts and data analysts. Data scientists are experts who find solutions to analyze data. Just as we have web analysts, we have data scientists who have good business insight into how to handle a business challenge. Sharp data scientists are not only involved in dealing with business problems, but also in choosing the relevant issues that can bring value to the organization.

    (67)What are some of the characteristics of Hadoop framework?
    The Hadoop framework is written in Java. It is designed to solve problems that involve analyzing large data sets (e.g. petabytes). The programming model is based on Google’s MapReduce and the storage infrastructure is based on Google’s distributed file system (GFS). Hadoop handles large files/data throughput and supports data-intensive distributed applications. Hadoop is scalable, as more nodes can easily be added to it.

    (68)Give a brief overview of Hadoop history.
    In 2002, Doug Cutting created an open-source web crawler project (Nutch). In 2003–2004, Google published the GFS and MapReduce papers. In 2006, Doug Cutting started the open-source MapReduce and HDFS project that became Hadoop. In 2008, Yahoo ran a 4,000-node Hadoop cluster and Hadoop won the terabyte sort benchmark. In 2009, Facebook launched SQL support for Hadoop.

    (69)Give examples of some companies that are using Hadoop structure?
    A lot of companies are using the Hadoop structure such as Cloudera, EMC, MapR, Hortonworks, Amazon, Facebook, eBay, Twitter, Google and so on.

    (70)What is the basic difference between traditional RDBMS and Hadoop?
    RDBMS deals with structured data whereas Hadoop deals with both structured and unstructured data

    (71)What is structured and unstructured data?
    Structured Data: resides in fixed fields, i.e. in a traditional row-and-column (tabular) format.
    Unstructured Data: data that does not reside in a traditional row-and-column format.

    (72)What are the core components of Hadoop?
    HDFS (Name Node, Data Node, Secondary Name Node) and MapReduce (Job Tracker and Task Tracker)

    (73)What is HDFS?
    Hadoop Distributed File System: a distributed file-system that stores data on a cluster of machines

    (74)What are the key features of HDFS?
    HDFS is highly fault-tolerant, with high throughput, suitable for applications with large data sets, streaming access to file system data and can be built out of commodity hardware.

    (75)What is Fault Tolerance?
    Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there is no chance of getting the data back present in that file. To avoid such situations, Hadoop has introduced the feature of fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations also. So even if one or two of the systems collapse, the file is still available on the third system.

    (76)Replication causes data redundancy, then why is it pursued in HDFS?
    HDFS works with commodity hardware (systems with average configurations) that has high chances of crashing at any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates and stores data in different places. Any data on HDFS gets stored at at least 3 different locations. So, even if one of them is corrupted and another is unavailable for some time for any reason, data can still be accessed from the third one. Hence, there is no chance of losing the data. This replication factor helps us attain the Hadoop feature called fault tolerance.

    (77)Since the data is replicated thrice in HDFS, does it mean that any calculation done on one node will also be replicated on the other two?
    Since there are 3 nodes, when we send the MapReduce programs, calculations will be done only on the original data. The master node will know which node exactly has that particular data. In case, if one of the nodes is not responding, it is assumed to be failed. Only then, the required calculation will be done on the second replica.

    (78)What is throughput? How does HDFS get a good throughput?
    Throughput is the amount of work done in a unit time. It describes how fast the data is getting accessed from the system and it is usually used to measure performance of the system. In HDFS, when we want to perform a task or an action, then the work is divided and shared among different systems. So all the systems will be executing the tasks assigned to them independently and in parallel. So the work will be completed in a very short period of time. In this way, the HDFS gives good throughput. By reading data in parallel, we decrease the actual time to read data tremendously.

    (79)What is streaming access?
    As HDFS works on the principle of ‘Write Once, Read Many‘, the feature of streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data but how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data is more important than the time taken to fetch a single record from the data.

    (80)What is a commodity hardware? Does commodity hardware include RAM?
    Commodity hardware is an inexpensive system that is not of high quality or high availability. Hadoop can be installed on any average commodity hardware; we don’t need supercomputers or high-end hardware to work with Hadoop. Yes, commodity hardware includes RAM, because there will be some services running in RAM.

    (81)What is a metadata?
    Metadata is the information about the data stored in datanodes such as location of the file, size of the file and so on.

    (82)Why do we use HDFS for applications having large data sets and not when there are lot of small files?
    With lots of small files, seek time increases and storing and serving the metadata for all the small files consumes far more NameNode resources than the metadata for the same amount of data stored as large files.

    (83)What is a daemon?
    A daemon is a process or service that runs in the background. In general, we use this word in the UNIX environment. The equivalent of a daemon in Windows is a “service” and in DOS a “TSR”.

    (84)Is Namenode machine same as datanode machine as in terms of hardware?
    It depends upon the cluster you are trying to create. The Hadoop VM can be there on the same machine or on another machine. For instance, in a single node cluster, there is only one machine, whereas in the development or in a testing environment, Namenode and datanodes are on different machines.

    (85)What is a heartbeat in HDFS?
    A heartbeat is a signal indicating that a node is alive. A DataNode sends heartbeats to the NameNode and a TaskTracker sends heartbeats to the JobTracker. If the NameNode or JobTracker does not receive heartbeats, it decides that there is some problem in the DataNode or that the TaskTracker is unable to perform the assigned task.

    (86)Are Namenode and job tracker on the same host?
    No, in a production cluster they run on different machines.

    (87)What is a ‘block’ in HDFS?
    A ‘block’ is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB, in contrast to the block size of 8192 bytes in Unix/Linux. Files in HDFS are broken down into block-sized chunks, which are stored as independent units. HDFS blocks are large compared to disk blocks, particularly to minimize the cost of seeks. If a particular file is 50 MB, will the HDFS block still consume 64 MB as the default size? No, not at all! 64 MB is just a unit where the data will be stored. In this particular situation, only 50 MB will be consumed by an HDFS block and 14 MB will be free to store something else. It is the master node that does data allocation in an efficient manner.

    (88)What are the benefits of block transfer?
    A file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block rather than a file simplifies the storage subsystem. Blocks provide fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.

    (89)If we want to copy 10 blocks from one machine to another, but another machine can copy only 8.5 blocks, can the blocks be broken at the time of replication?
    In HDFS, blocks cannot be broken down. Before copying the blocks from one machine to another, the Master node will figure out what is the actual amount of space required, how many block are being used, how much space is available, and it will allocate the blocks accordingly.

    (90)How indexing is done in HDFS?
    Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS will keep on storing the last part of the data which will say where the next part of the data will be. In fact, this is the base of HDFS.

    (91)If a data Node is full how it’s identified?
    When data is stored in datanode, then the metadata of that data will be stored in the Namenode. So Namenode will identify if the data node is full.

    (92)If datanodes increase, then do we need to upgrade Namenode?
    While installing the Hadoop system, the Namenode is determined based on the size of the cluster. Most of the time we do not need to upgrade the Namenode because it does not store the actual data, but just the metadata, so such a requirement rarely arises.

    (93)Are job tracker and task trackers present in separate machines?
    Yes, job tracker and task tracker are present in different machines. The reason is job tracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.

    (94)When we send a data to a node, do we allow settling in time, before sending another data to that node?
    Yes

    (95)Does hadoop always require digital data to process?
    Yes

    (96)On what basis Namenode will decide which datanode to write on?
    As the Namenode has the metadata (information) related to all the data nodes, it knows which datanode is free.

    (97)Doesn’t Google have its very own version of DFS?
    Yes, GFS (Google File System)

    (98)Who is a ‘user’ in HDFS?
    A user is like you or me, who has some query or who needs some kind of data.

    (99)Is client the end user in HDFS?
    No, Client is an application which runs on your machine, which is used to interact with the Namenode (job tracker) or datanode (task tracker).

    (100)What is the communication channel between client and namenode/datanode?
    The client communicates with the Namenode and Datanodes through the Hadoop HDFS API (RPC over TCP), not SSH.

    #1262

    rhiddhiman
    Participant

    (101)What is a rack?
    Rack is a collection of nodes (machines).

    (102)On what basis data will be stored on a rack?
    When the client is ready to load a file into the cluster, the content of the file will be divided into blocks. Now the client consults the Namenode and gets 3 datanodes for every block of the file which indicates where the block should be stored. While placing the datanodes, the key rule followed is “for every block of data, two copies will exist in one rack, third copy in a different rack“. This rule is known as “Replica Placement Policy“.

    (103)Do we need to place 2nd and 3rd data in rack 2 only?
    Yes, to avoid DataNode failure.

    (104)What if rack 2 and datanode fails?
    If both rack 2 and the datanode in rack 1 fail, there is no chance of getting the data back. To avoid such situations, we need to replicate the data more times than the default of three. This can be done by changing the value of the replication factor, which is set to 3 by default.

    (105)What is a Secondary Namenode? Is it a substitute to the Namenode?
    Secondary NameNode is a slave daemon in HDFS, it holds the backup of the MetaData of the NameNode. In case the NameNode fails the backup of the MetaData stored in the Secondary NameNode is used to create another NameNode.
    Secondary NameNode is not a substitute for the NameNode.

    (106)What is the difference between Gen1 and Gen2 Hadoop with regards to the Namenode?
    In Gen 1 Hadoop, Namenode is the single point of failure. In Gen 2 Hadoop, we have what is known as Active and Passive Namenodes kind of a structure. If the active Namenode fails, passive Namenode takes over the charge.

    (107)What is ‘Key value pair’ in HDFS?
    Key value pair is the intermediate data generated by the maps and sent to the reducers for generating the final output.

    (108)What is the difference between MapReduce engine and HDFS cluster?
    HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. Map Reduce Engine is the programming module which is used to retrieve and analyze data.

    (109)Is map like a pointer?
    No, Map is not like a pointer.

    (110)Do we require two servers for the Namenode and the datanodes?
    Yes, we need different servers for the Namenode and the Datanodes. This is because the Namenode requires a highly reliable, high-configuration system, as it stores information about the location of all the files stored in different Datanodes, whereas the Datanodes can run on low-configuration systems.

    (111)Why are the number of splits equal to the number of maps?
    The number of maps is equal to the number of input splits because we want the key and value pairs of all the input splits.

    (112)Is a job split into maps?
    No, a job is not split into maps. A split is created for the file. The file is placed on the Datanodes in blocks. For each split, a map is needed.

    (113)Which are the two types of ‘writes’ in HDFS?
    There are two types of writes in HDFS: posted and non-posted write. Posted Write is when we write it and forget about it, without worrying about the acknowledgement. It is similar to our traditional Indian post. In a Non-posted Write, we wait for the acknowledgement. It is similar to the today’s courier services. Naturally, non-posted write is more expensive than the posted write. It is much more expensive, though both writes are asynchronous.

    (114)Why ‘Reading‘ is done in parallel and ‘Writing‘ is not in HDFS?
    Reading is done in parallel because by doing so we can access the data fast. But we do not perform the write operation in parallel. The reason is that if we perform the write operation in parallel, then it might result in data inconsistency. For example, you have a file and two nodes are trying to write data into the file in parallel, then the first node does not know what the second node has written and vice-versa. So, this makes it confusing which data to be stored and accessed.

    (115)Can Hadoop be compared to NOSQL database like Cassandra?
    Though NoSQL is the closest technology that can be compared to Hadoop, it has its own pros and cons. There is no DFS in NoSQL. Hadoop is not a database; it is a filesystem (HDFS) plus a distributed programming framework (MapReduce).

    (116)How can I install Cloudera VM in my system?

    (117)What is a Task Tracker in Hadoop? How many instances of Task Tracker run on a hadoop cluster?
    A TaskTracker is a MapReduce daemon in the cluster that accepts tasks – Map, Reduce and Shuffle operations – from a JobTracker.
    Depending on the number of jobs N number of TaskTracker instances can run on a Hadoop cluster.

    (118)What are the four basic parameters of a mapper?
    The four basic parameters of a mapper are LongWritable, Text, Text and IntWritable. The first two represent the input parameters and the second two represent the intermediate output parameters.

    (119)What is the input type/format in MapReduce by default?
    TextInputFormat

    (120)Can we do online transactions(OLTP) using Hadoop?
    No

    (121)Explain how HDFS communicates with Linux native file system

    (122)What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?
    JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop.
    One

    (123)What is the InputFormat ?
    The InputFormat defines how the input files are split into InputSplits and read as key-value pairs; it is the mechanism for providing input to a MapReduce job.

    (124)What is the InputSplit in map reduce software?
    An InputSplit is a logical representation of a block of input work for a map task. By default its size corresponds to an HDFS block (64MB).

    (125)What is a IdentityMapper and IdentityReducer in MapReduce ?
    – org.apache.hadoop.mapred.lib.IdentityMapper implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the Mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default.
    – org.apache.hadoop.mapred.lib.IdentityReducer performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the Reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default.

    (126)How JobTracker schedules a task?
    When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.

    (127)When are the reducers started in a MapReduce job?
    In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available, but the programmer-defined reduce method is called only after all the mappers have finished.
    If reducers do not start before all mappers finish, then why does the progress on a MapReduce job show something like Map(50%) Reduce(10%)? Why is a reducer progress percentage displayed when the mappers are not finished yet?
    Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes into account the data transfer done by the reduce process, therefore the reduce progress starts showing up as soon as any intermediate key-value pair for a mapper is available to be transferred to a reducer. Though the reducer progress is updated, the programmer-defined reduce method is called only after all the mappers have finished.

    (128)On What concept the Hadoop framework works?
    It works on the concept of MapReduce.

    (129)What is a DataNode? How many instances of DataNode run on a Hadoop Cluster?
    DataNode is a HDFS daemon in a Hadoop cluster which is responsible for storing data. N number of DataNodes can run on a Hadoop cluster.

    (130)What other technologies have you used in hadoop domain?

    (131)How NameNode Handles data node failures?
    By replication: when a DataNode fails (stops sending heartbeats), the NameNode re-replicates the blocks that were stored on it to other DataNodes.

    (132)How many Daemon processes run on a Hadoop system?
    5 – NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker.

    (133)What is configuration of a typical slave node on Hadoop cluster?

    (134) How many JVMs run on a slave node?
    1

    (135)How will you make changes to the default configuration files?
    a. We go to the “conf” sub-directory under the Hadoop directory
    b. We open the configuration files and edit them, for example:
    sudo gedit core-site.xml

    <configuration>
    <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    </property>
    </configuration>

    sudo gedit hdfs-site.xml

    <configuration>
    <property>
    <name>dfs.replication</name>
    <value>1</value>
    </property>
    </configuration>

    sudo gedit mapred-site.xml

    <configuration>
    <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
    </property>
    </configuration>

    sudo gedit hadoop-env.sh

    # The java implementation to use. Required.
    export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67

    (136)Can I set the number of reducers to zero?
    Yes, it can be set to zero; setting the number of reducers to zero is a valid configuration in Hadoop. When you set the reducers to zero, no reducers will be executed, and the output of each mapper will be stored in a separate file on HDFS. [This is different from the case when reducers are set to a number greater than zero, where the mappers’ output (intermediate data) is written to the local file system (NOT HDFS) of each mapper slave node.]

    (137)Whats the default port that jobtrackers listens ?
    50030

    (138)Unable to read the options file while I tried to import data from MySQL to HDFS. – Narendra

    (139)What problems have you faced when you are working on Hadoop code?

    (140)How would you modify that solution to only count the number of unique words in all the documents?

    (141)What is the difference between a Hadoop database and Relational Database?
    Hadoop database stores both structured and unstructured data while relational database stores only structured data.

    (142)How the HDFS Blocks are replicated?
    HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. HDFS uses rack-aware replica placement policy. In default configuration there are total 3 copies of a datablock on HDFS, 2 copies are stored on datanodes on same rack and 3rd copy on a different rack.

    (143)What is a Task instance in Hadoop? Where does it run?
    Task instances are the actual MapReduce tasks that run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this ensures that a process failure does not take down the TaskTracker. Each Task Instance runs in its own JVM process. There can be multiple task instance processes running on a slave node, based on the number of slots configured on the TaskTracker. By default, a new Task Instance JVM process is spawned for each task.

    (144)what is meaning Replication factor?
    The replication factor is a number that denotes how many copies of each data block are kept in HDFS. By default the replication factor in a Hadoop cluster is 3.

    (145)If reducers do not start before all mappers finish, then why does the progress on a MapReduce job show something like Map(50%) Reduce(10%)? Why is a reducer progress percentage displayed when the mappers are not finished yet?
    Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes into account the data transfer done by the reduce process, therefore the reduce progress starts showing up as soon as any intermediate key-value pair for a mapper is available to be transferred to a reducer. Though the reducer progress is updated, the programmer-defined reduce method is called only after all the mappers have finished.

    (146)How the Client communicates with HDFS?
    The client communication with HDFS happens using the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives. Client applications can then talk directly to a DataNode, once the NameNode has provided the location of the data.

    (147)Which object can be used to get the progress of a particular job
    JobClient or the Web UI.

    (148)What is next step after Mapper or MapTask?
    Shuffling and Sorting

    (149)What are the default configuration files that are used in Hadoop?
    core-site.xml
    hdfs-site.xml
    mapred-site.xml
    hadoop-env.sh

    (150)Does MapReduce programming model provide a way for reducers to communicate with each other? In a MapReduce job can a reducer communicate with another reducer?
    No, MapReduce programming model does not allow reducers to communicate with each other. Reducers run in isolation.

    (151)What is HDFS Block size? How is it different from traditional file system block size?
    In HDFS, data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64MB or 128MB in size. Each block is replicated multiple times; the default is to replicate each block 3 times, and replicas are stored on different nodes. HDFS utilizes the local file system to store each HDFS block as a separate file. The HDFS block size is much larger than a traditional file system block size (typically a few KB), mainly to minimize the cost of seeks.

    (152)What is an SPF (single point of failure)?
    The JobTracker is the single point of failure in a Hadoop MapReduce cluster: if it fails, processing stops. (Likewise, the NameNode is the single point of failure of HDFS.)

    (153)Where do you specify the Mapper Implementation?
    In the job driver, via JobConf.setMapperClass() (old API) or Job.setMapperClass() (new API).

    (154)What is a NameNode? How many instances of NameNode run on a Hadoop Cluster?
    The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself. There is only one NameNode process running on any Hadoop cluster. The NameNode runs in its own JVM process, and in a typical production cluster it runs on a separate machine. The NameNode is a single point of failure for the HDFS cluster: when the NameNode goes down, the file system goes offline. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.

    (155)Explain the core methods of the Reducer?
    The API of Reducer is very similar to that of Mapper: there’s a run() method that receives a Context containing the job’s configuration as well as interfacing methods that return data from the reducer itself back to the framework. The run() method calls setup() once, reduce() once for each key associated with the reduce task, and cleanup() once at the end. Each of these methods can access the job’s configuration data by using Context.getConfiguration().
    As in Mapper, any or all of these methods can be overridden with custom implementations. If none of these methods are overridden, the default reducer operation is the identity function; values are passed through without further processing.
    The heart of the Reducer is its reduce() method. This is called once per key; the second argument is an Iterable which returns all the values associated with that key.
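
    For illustration, a sketch of a reducer overriding these lifecycle methods (the configuration property example.threshold and the filtering logic are made up):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class LifecycleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private int threshold;

        @Override
        protected void setup(Context context) {
            // Called once per reduce task, before any reduce() call.
            threshold = context.getConfiguration().getInt("example.threshold", 0);
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Called once per key with all the values for that key.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            if (sum >= threshold) {
                context.write(key, new IntWritable(sum));
            }
        }

        @Override
        protected void cleanup(Context context) {
            // Called once at the end; release per-task resources here.
        }
    }
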
    (156)What is Hadoop framework?
    Apache Hadoop is a set of algorithms (an open-source software framework written in Java) for distributed storage and distributed processing of very large data sets (Big Data) on computer clusters built from commodity hardware.

    (157)Is it possible to provide multiple input to Hadoop? If yes then how can you give multiple directories as input to the Hadoop job
    Yes, the input format class provides methods to add multiple directories as input to a Hadoop job.

    (158)How would you tackle counting words in several text documents?
    Using the WordCount class provided by Hadoop.

    (159)How does the master/slave architecture work in Hadoop?
    The NameNode and JobTracker are the masters: they manage the metadata and schedule the work. The DataNodes and TaskTrackers are the slaves: they store the data and execute the tasks.

    (160)How would you tackle calculating the number of unique visitors for each hour by mining a huge Apache log? You can use post processing on the output of the MapReduce job.

    (161)How did you debug your Hadoop code ?

    (162)How will you write a custom partitioner for a Hadoop job?
    To have Hadoop use a custom partitioner you have to do at minimum the following three things:
    – Create a new class that extends the Partitioner class
    – Override its getPartition() method
    – In the wrapper that runs the MapReduce job, either add the custom partitioner to the job programmatically using the setPartitionerClass() method, or add it to the job as a configuration setting (if your wrapper reads from a config file or Oozie).

    (163)How can you add the arbitrary key-value pairs in your mapper?
    You can set arbitrary (key, value) pairs of configuration data in your Job, e.g. with Job.getConfiguration().set("myKey", "myVal"), and then retrieve this data in your mapper with Context.getConfiguration().get("myKey"). This kind of functionality is typically done in the Mapper’s setup() method.
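
    A small sketch of that pattern, reusing the myKey/myVal names from the answer above (the class names are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ConfigPassingExample {
        // Driver side: store the value in the job configuration.
        public static Job buildJob() throws Exception {
            Job job = Job.getInstance(new Configuration(), "config-passing");
            job.getConfiguration().set("myKey", "myVal");
            return job;
        }

        // Mapper side: read it back once in setup().
        public static class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
            private String myVal;

            @Override
            protected void setup(Context context) {
                myVal = context.getConfiguration().get("myKey");
            }
        }
    }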

    (164)what is a datanode?
    DataNode is a HDFS daemon of Hadoop cluster which is used for storage of data.

    (165)What are combiners? When should I use a combiner in my MapReduce Job?
    Combiners are used to increase the efficiency of a MapReduce program. They aggregate the intermediate map output locally, on the individual mappers’ outputs, which can reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed: Hadoop may or may not execute it, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner’s execution.

    (166)How Mapper is instantiated in a running job?
    The job first goes to the JobTracker, which assigns it a job ID and initializes it; the framework then instantiates the Mapper class for each map task of the running job.

    (167)Which interface needs to be implemented to create Mapper and Reducer for the Hadoop?
    org.apache.hadoop.mapreduce.Mapper
    org.apache.hadoop.mapreduce.Reducer

    (168)What happens if you don’t override the Mapper methods and keep them as they are?
    The default implementations are used, which behave like the IdentityMapper: each input key-value pair is passed through unchanged to the output.

    (169)What does a Hadoop application look like, i.e. what are its basic components?
    At minimum it consists of a driver (job configuration) class, a Mapper and, usually, a Reducer.

    (170)What is the meaning of speculative execution in Hadoop? Why is it important?
    Speculative execution is a way of coping with individual machine performance. In large clusters, where hundreds or thousands of machines are involved, there may be machines which are not performing as fast as others. This may delay a full job because of only one machine not performing well. To avoid this, speculative execution in Hadoop can run multiple copies of the same map or reduce task on different slave nodes. The results from the first node to finish are used.

    (170)What are the restrictions on the key and value classes?
    The key and value classes have to be serializable by the framework. To make them serializable, Hadoop provides the Writable interface. As in Java itself, the key of a map must be comparable, hence the key class has to implement one more interface, WritableComparable.

    (171)Explain the WordCount implementation via the Hadoop framework?
    We will count the words in all the input files, with the flow as below:
    • Input: assume there are two files, each containing the sentence “Hello World Hello World” (file 1 and file 2).
    • Mapper: there is one mapper per file.
    For the given sample input, the first map outputs: < Hello, 1> < World, 1> < Hello, 1> < World, 1>
    The second map outputs: < Hello, 1> < World, 1> < Hello, 1> < World, 1>
    • Combiner/Sorting (this is done for each individual map), so the output looks like this:
    The output of the first map: < Hello, 2> < World, 2>
    The output of the second map: < Hello, 2> < World, 2>
    • Reducer: it sums up the above output and generates the final output:
    < Hello, 4> < World, 4>
    • Output: the final output would look like
    Hello 4
    World 4
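
    For illustration, a minimal driver that wires this flow together, assuming mapper and reducer classes like the ones sketched under question (26) above (class names and paths are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);   // per-map aggregation, reusing the reducer
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }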

    (172)What does the Mapper do?
    It takes the input records produced by the RecordReader and processes them into an intermediate output in the form of key-value pairs, which is fed as input to the Reducer.

    (173)What is MapReduce?
    MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

    (174)Explain the Reducer’s Sort phase?
    The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged (It is similar to merge-sort).

    (175)What are the primary phases of the Reducer?
    Shuffle, Sort and Reduce

    (176)Explain the Reducer’s reduce phase?
    In this phase the reduce(MapOutKeyType, Iterable, Context) method is called for each pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via Context.write(ReduceOutKeyType, ReduceOutValType). Applications can use the Context to report progress, set application-level status messages and update Counters, or just indicate that they are alive. The output of the Reducer is not sorted.

    (177)Explain the shuffle?
    Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.

    (178)What happens if number of reducers are 0?
    The output would be generated only by the mapper and the reducer phase would be omitted.

    (179)How many Reducers should be configured?
    The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapreduce.tasktracker.reduce.tasks.maximum).
    With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing. Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.

    (180)What is Writable & WritableComparable interface?
    -org.apache.hadoop.io.Writable is a Java interface. Any key or value type in the Hadoop Map-Reduce framework implements this interface. Implementations typically implement a static read(DataInput) method which constructs a new instance, calls readFields(DataInput) and returns the instance.
    -org.apache.hadoop.io.WritableComparable is a Java interface. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface. WritableComparable objects can be compared to each other using Comparators.

    (181)What is the Hadoop MapReduce API contract for a key and value Class?
    The Key must implement the org.apache.hadoop.io.WritableComparable interface.
    The value must implement the org.apache.hadoop.io.Writable interface.
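
    For illustration, a sketch of a custom key type satisfying this contract (the YearMonthKey class and its fields are made up):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // A composite key: serializable by the framework (write/readFields) and
    // comparable (compareTo), as required when it is used as a map output key.
    public class YearMonthKey implements WritableComparable<YearMonthKey> {
        private int year;
        private int month;

        public YearMonthKey() { }                     // no-arg constructor required by Hadoop

        public YearMonthKey(int year, int month) {
            this.year = year;
            this.month = month;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(year);
            out.writeInt(month);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            year = in.readInt();
            month = in.readInt();
        }

        @Override
        public int compareTo(YearMonthKey other) {
            int cmp = Integer.compare(year, other.year);
            return cmp != 0 ? cmp : Integer.compare(month, other.month);
        }

        @Override
        public int hashCode() {                       // used by the default HashPartitioner
            return year * 31 + month;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof YearMonthKey)) return false;
            YearMonthKey k = (YearMonthKey) o;
            return year == k.year && month == k.month;
        }
    }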

    (182)Where is the Mapper Output (intermediate kay-value data) stored ?
    Local File System

    (183)What is the difference between HDFS and NAS ?

    (184)What is Distributed Cache in Hadoop?
    Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.

    (185)Have you ever used Counters in Hadoop. Give us an example scenario?

    (186)What is the main difference between Java and C++?
    Java is a purely object-oriented programming language, whereas C++ is not.

    (187)What alternate way does HDFS provides to recover data in case a Namenode, without backup, fails and cannot be recovered?

    (188)What is the use of Context object?
    The Context object lets a Mapper or Reducer interact with the rest of the Hadoop framework: it gives access to the job configuration, is used to emit output via write(), and allows tasks to report progress and update counters.

    (189)What is the Reducer used for?
    Reducer reduces a set of intermediate values which share a key to a (usually smaller) set of values.
    The number of reduces for the job is set by the user via Job.setNumReduceTasks(int).

    (190)What is the use of Combiner?
    It is an optional component or class, which can be specified via Job.setCombinerClass(ClassName), to perform local aggregation of the intermediate outputs. This helps to cut down the amount of data transferred from the Mapper to the Reducer.

    (191)Explain the input and output data formats of the Hadoop framework?
    The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. See the flow mentioned below:
    (input) <k1, v1> -> map -> <k2, v2> -> combine/sorting -> <k2, v2> -> reduce -> <k3, v3> (output)

    (192)What is compute and Storage nodes?
    Compute Node: This is the computer or machine where your actual business logic will be executed.
    Storage Node: This is the computer or machine where your file system reside to store the processing data.
    In most of the cases compute node and storage node would be the same machine.

    (193)what is namenode?
    The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

    (194)How does the Mapper’s run() method work?
    The Mapper.run() method calls setup() once, then map(KeyInType, ValInType, Context) for each key/value pair in the InputSplit for that task, and finally cleanup().

    (195)what is the default replication factor in HDFS?
    3

    (196)It can be possible that a Job has 0 reducers?
    Yes

    (197)How many maps are there in a particular Job?
    Depends on the number of input splits.

    (198)How many instances of JobTracker can run on a Hadoop Cluser?
    1

    (199)How can we control particular key should go in a specific reducer?
    Using a custom Partitioner

    (200)what is the typical block size of an HDFS block?
    64MB is the default, but in production 128MB is often recommended.

    (201)What do you understand about Object Oriented Programming (OOP)? Use Java examples.

    (202)What are the main differences between versions 1.5 and version 1.6 of Java?

    (203)Describe what happens to a MapReduce job from submission to output?
    a. HDFS Client submits a job
    b. Record Reader takes the input and converts it to key-value pairs and gives it as input to the Mapper
    c. Mapper processes the data and gives an intermediate value (in key-value pairs) as output which is stored in the local file system.
    d. The intermediate value goes through shuffling and sorting and is provided to the Reducer
    e. The final output is given by the Reducer and is stored in HDFS.

    #1263

    sivatejakumarreddy
    Participant

    Q1. Name the most common Input Formats defined in Hadoop? Which one is
    default?
    The most common Input Formats defined in Hadoop are:
    – TextInputFormat
    – KeyValueInputFormat
    – SequenceFileInputFormat
    TextInputFormat is the Hadoop default.

    Q2. What is the difference between TextInputFormat and KeyValueInputFormat
    class?
    TextInputFormat: It reads lines of text files and provides the offset of the line as key to
    the Mapper and actual line as Value to the mapper.
    KeyValueInputFormat: Reads text files and parses lines into key/value pairs. Everything
    up to the first tab character is sent as the key to the Mapper and the remainder of the
    line is sent as the value to the Mapper.

    Q3. What is InputSplit in Hadoop?
    When a Hadoop job is run, it splits the input files into chunks and assigns each split to a
    mapper to process. Such a chunk is called an InputSplit.

    Q4. How is the splitting of files invoked in the Hadoop framework?
    It is invoked by the Hadoop framework by running the getSplits() method of the
    InputFormat class (such as FileInputFormat) configured for the job.

    Q5. Consider this case scenario: in an M/R system,
    – HDFS block size is 64 MB
    – Input format is FileInputFormat
    – We have 3 files of size 64 KB, 65 MB and 127 MB
    How many input splits will be made by the Hadoop framework?
    Hadoop will make 5 splits as follows:
    – 1 split for the 64 KB file
    – 2 splits for the 65 MB file
    – 2 splits for the 127 MB file

    Q6. What is the purpose of RecordReader in Hadoop?
    The InputSplit has defined a slice of work, but does not describe how to access it. The
    RecordReader class actually loads the data from its source and converts it into (key,
    value) pairs suitable for reading by the Mapper. The RecordReader instance is defined
    by the Input Format.

    Q7. After the Map phase finishes, the Hadoop framework does “Partitioning,
    Shuffle and sort”. Explain what happens in this phase?
    Partitioning: It is the process of determining which reducer instance will receive which
    intermediate keys and values. Each mapper must determine, for all of its output (key,
    value) pairs, which reducer will receive them. It is necessary that for any key,
    regardless of which mapper instance generated it, the destination partition is the same.
    Shuffle: After the first map tasks have completed, the nodes may still be performing
    several more map tasks each. But they also begin exchanging the intermediate outputs
    from the map tasks to where they are required by the reducers. This process of moving
    map outputs to the reducers is known as shuffling.
    Sort: Each reduce task is responsible for reducing the values associated with several
    intermediate keys. The set of intermediate keys on a single node is automatically sorted
    by Hadoop before they are presented to the Reducer.

    Q8. If no custom partitioner is defined in Hadoop then how is data partitioned
    before it is sent to the reducer?
    The default partitioner computes a hash value for the key and assigns the partition
    based on this result.

    Q9. What is a Combiner?
    The Combiner is a ‘mini-reduce’ process which operates only on data generated by a
    mapper. The Combiner will receive as input all data emitted by the Mapper instances on
    a given node. The output from the Combiner is then sent to the Reducers, instead of the
    output from the Mappers.

    Q10. What is JobTracker?
    JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster.

    Q11. What are some typical functions of Job Tracker?
    The following are some typical tasks of JobTracker:-
    – Accepts jobs from clients
    – It talks to the NameNode to determine the location of the data.
    – It locates TaskTracker nodes with available slots at or near the data.
    – It submits the work to the chosen TaskTracker nodes and monitors progress of each
    task by receiving heartbeat signals from Task tracker.

    Q12. What is TaskTracker?
    TaskTracker is a daemon in the cluster that accepts tasks (Map, Reduce and Shuffle
    operations) from a JobTracker.

    Q13. What is the relationship between Jobs and Tasks in Hadoop?
    One job is broken down into one or many tasks in Hadoop.

    Q14. Suppose Hadoop spawned 100 tasks for a job and one of the task failed.
    What will Hadoop do?
    It will restart the task again on some other TaskTracker and only if the task fails more
    than four (default setting and can be changed) times will it kill the job.

    Q15. Hadoop achieves parallelism by dividing the tasks across many nodes, so it is
    possible for a few slow nodes to rate-limit the rest of the program and slow it down.
    What mechanism does Hadoop provide to combat this?
    Speculative Execution.

    Q16. How does speculative execution work in Hadoop?
    JobTracker makes different TaskTrackers process the same input. When tasks complete,
    they announce this fact to the JobTracker. Whichever copy of a task finishes first
    becomes the definitive copy. If other copies were executing speculatively, Hadoop tells
    the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then
    receive their inputs from whichever Mapper completed successfully, first.

    Q17. Using command line in Linux, how will you
    – See all jobs running in the Hadoop cluster
    – Kill a job?
    hadoop job -list
    hadoop job -kill <jobID>

    Q18. What is Hadoop Streaming?
    Streaming is a generic API that allows programs written in virtually any language to be
    used as Hadoop Mapper and Reducer implementations.

    Q19. What characteristic of the Streaming API makes it flexible enough to run
    MapReduce jobs in languages like Perl, Ruby, Awk etc.?
    Hadoop Streaming allows arbitrary programs to be used for the Mapper and Reducer
    phases of a MapReduce job by having both Mappers and Reducers receive their input
    on stdin and emit output (key, value) pairs on stdout.

    Q20. What is Distributed Cache in Hadoop?
    Distributed Cache is a facility provided by the MapReduce framework to cache files
    (text, archives, jars and so on) needed by applications during execution of the job. The
    framework will copy the necessary files to the slave node before any tasks for the job
    are executed on that node.

    Q21. What is the benefit of Distributed Cache? Why can't we just have the file in
    HDFS and have the application read it?
    Distributed Cache is much faster: it copies the file to every TaskTracker once, at the
    start of the job. If that TaskTracker then runs 10 or 100 Mappers or Reducers, they all
    use the same local copy. If instead the code read the file directly from HDFS inside the
    MR job, a TaskTracker running 100 map tasks would read the file from HDFS 100 times.
    HDFS is also not very efficient when used like this.

    Q.22 What mechanism does the Hadoop framework provide to synchronise changes
    made in the Distributed Cache during runtime of the application?
    This is a tricky question. There is no such mechanism. Distributed Cache by design is
    read only during the time of Job execution.

    Q23. Have you ever used Counters in Hadoop? Give us an example scenario.
    Anybody who claims to have worked on a Hadoop project is expected to use counters.
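
    A typical scenario is counting malformed records while parsing input. A minimal sketch, assuming tab-separated lines (the class name, enum and field layout here are hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ParseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      enum RecordQuality { GOOD, MALFORMED }   // each enum value becomes a counter

      private final IntWritable one = new IntWritable(1);

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length < 2) {
          context.getCounter(RecordQuality.MALFORMED).increment(1);  // shows up in the job UI
          return;                                                    // skip the bad record
        }
        context.getCounter(RecordQuality.GOOD).increment(1);
        context.write(new Text(fields[0]), one);
      }
    }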

    Q24. Is it possible to provide multiple input to Hadoop? If yes then how can you
    give multiple directories as input to the Hadoop job?
    Yes, FileInputFormat provides methods (for example addInputPath and addInputPaths)
    to add multiple directories as input to a Hadoop job.

    Q25. Is it possible to have Hadoop job output in multiple directories? If yes, how?
    Yes, by using the MultipleOutputs class.

    Q26. What will a Hadoop job do if you try to run it with an output directory that is
    already present? Will it
    – Overwrite it
    – Warn you and continue
    – Throw an exception and exit
    The Hadoop job will throw an exception and exit.

    Q27. How can you set an arbitrary number of mappers to be created for a job in
    Hadoop?
    You cannot set it.

    Q28. How can you set an arbitrary number of Reducers to be created for a job in
    Hadoop?
    You can either do it programmatically, by calling the setNumReduceTasks method on the
    JobConf (or Job) class, or set it as a configuration setting (mapred.reduce.tasks).

    Q29. How will you write a custom partitioner for a Hadoop job?
    To have Hadoop use a custom partitioner you will have to do at least the following
    three things:
    – Create a new class that extends the Partitioner class
    – Override the getPartition method
    – In the wrapper that runs the MapReduce job, either add the custom partitioner to the
    job programmatically using the setPartitionerClass method, or add it to the job as a
    config setting (if your wrapper reads from a config file or Oozie). A minimal sketch of
    such a partitioner follows.
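
    A minimal sketch using the org.apache.hadoop.mapreduce API, assuming Text keys and IntWritable values; the routing rule by first letter is purely illustrative:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        if (k.isEmpty()) {
          return 0;
        }
        char first = Character.toUpperCase(k.charAt(0));
        if (first >= 'A' && first <= 'Z') {
          return (first - 'A') % numPartitions;                     // same key always lands on the same reducer
        }
        return (k.hashCode() & Integer.MAX_VALUE) % numPartitions;  // fallback, like the default hash partitioner
      }
    }
    // In the driver: job.setPartitionerClass(FirstLetterPartitioner.class);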

    Q30. How did you debug your Hadoop code?
    There can be several ways of doing this but most common ways are:-
    – By using counters.
    – The web interface provided by Hadoop framework.

    Q31. Did you ever build a production process in Hadoop? If yes, what was the
    process when your Hadoop job failed for some reason?
    It is an open-ended question, but candidates who have written a production job
    should talk about some kind of alert mechanism, for example an email alert or a
    monitoring system with good alerting for errors, since unexpected data can very
    easily break the job.

    #1273 Reply

    rhiddhiman
    Participant

    (204)What mechanism does the Hadoop framework provide to synchronize changes made in the Distributed Cache during runtime of the application?

    (205)Did you ever build a production process in Hadoop? If yes, what was the process when your Hadoop job failed for some reason?

    (206)Did you ever run into a lopsided (skewed) job that resulted in an out-of-memory error? If yes, how did you handle it?

    (207)What is HDFS ? How it is different from traditional file systems?
    HDFS, the Hadoop Distributed File System, is responsible for storing huge data on the cluster. This is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
    HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
    HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
    HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.

    (208)What is the benefit of Distributed Cache? Why can't we just have the file in HDFS and have the application read it?
    Distributed Cache is much faster: it copies the file to every TaskTracker once at the start of the job, and every Mapper or Reducer that runs on that TaskTracker (whether 10 or 100 of them) uses the same local copy. If instead the code read the file directly from HDFS inside the MR job, a TaskTracker running 100 map tasks would read the file from HDFS 100 times, and HDFS is not very efficient when used like this.

    (209)How JobTracker schedules a task?
    repeated

    (210)How many Daemon processes run on a Hadoop system?
    5 – NameNode, Secondary NameNode, DataNode, JobTracker, TaskTracker

    (211)What is configuration of a typical slave node on Hadoop cluster? How many JVMs run on a slave node?
    Repeated

    (212)What is configuration of a typical slave node on Hadoop cluster? How many JVMs run on a slave node?
    Repeated

    (213)What is the difference between HDFS and NAS ?
    Repeated

    (214)How NameNode Handles data node failures?
    Repeated

    (215)Does MapReduce programming model provide a way for reducers to communicate with each other? In a MapReduce job can a reducer communicate with another reducer?
    Repeated

    (216)Where is the Mapper output (intermediate key-value data) stored?
    Repeated

    (217)What are combiners? When should I use a combiner in my MapReduce Job?
    Repeated

    (218)What is an IdentityMapper and IdentityReducer in MapReduce?

    (219)When are the reducers started in a MapReduce job?

    (220)If reducers do not start before all mappers finish, then why does the progress of a MapReduce job show something like Map(50%) Reduce(10%)? Why is the reducers' progress percentage displayed when the mappers are not finished yet?
    Repeated

    (221)What is HDFS Block size? How is it different from traditional file system block size?
    Repeated

    (222)How the Client communicates with HDFS?
    Repeated

    (223)What is NoSQL?
    A NoSQL (often interpreted as Not only SQL) database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability.

    (224)We already have SQL, so why NoSQL?
    SQL can only be used to query relational databases, which have a well-defined schema, so when we need to query unstructured or schemaless data, NoSQL comes into play.
    In today's world almost everything produces data (social networking, online transactions and so on). With SQL it is hard to manage such data and still get high performance, which is why NoSQL comes into the picture.

    (225)What is the difference between SQL and NoSQL?
    SQL is strictly for structured data while NoSQL can manage unstructured data as well.

    (226)Does NoSQL follow the relational DB model?

    (227)Why would NoSQL be better than using a SQL Database? And how much better is it?

    (228)What do you understand by Standalone (or local) mode?
    There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.

    (229)What is Pseudo-distributed mode?
    The Hadoop daemons run on the local machine, thus simulating a cluster on a small scale.

    (230)What does /var/hadoop/pids do?
    It stores the process IDs of the Hadoop daemons.

    (231)Pig for Hadoop – Give some points?
    Pig was originally developed at Yahoo!.
    Pig is a platform for the scripting language Pig Latin, which plays a role similar to SQL.
    It works with structured, semi-structured and unstructured data.
    It works faster with structured data.
    It does not need to be installed on the Hadoop cluster; it is only required on the client/user machine.
    It is written in Java.
    Higher level of abstraction, fewer lines of code, faster results.
    We use Pig for:
    Time-sensitive data loads
    Processing many data sources (e.g. Google, forums)
    Analytic insight through sampling (e.g. Facebook page insights)

    (232)Hive for Hadoop – Give some points?
    Hive is a data warehousing package, often described as the SQL of Hadoop.
    It is good for log analysis.
    It is specifically designed to work with structured data.
    It can work with semi-structured and unstructured data as well, but that is not its preferred use case.
    Update and delete functionality was not originally available in Hive, but update support has been added from version 0.13 onwards.

    (233)File permissions in HDFS?

    (234)What are ODBC and JDBC connectivity in Hive?
    The Hive ODBC driver is a software library that implements the Open Database Connectivity (ODBC) API standard for the Hive database management system, enabling ODBC-compliant applications to interact (ideally seamlessly) with Hive through a standard interface.
    The Hive JDBC driver is a similar software library, but JDBC can only be used by Java applications to connect to the Hive server.

    (235)What is Derby database?
    Apache Derby (previously distributed as IBM Cloudscape) is an RDBMS developed by the Apache Software Foundation that can be embedded in Java programs and used for online transaction processing. A Derby database contains dictionary objects such as tables, columns, indexes, and jar files. A Derby database can also store its own configuration information.

    (236)What is Schema on Read and Schema on Write?
    Schema on Read (used in Big Data):
    First we load the data into HDFS; the schema is only interpreted when we begin to read the data.
    Schema on Write (used in traditional RDBMS):
    Create a schema for a table, then load the data into it. If the input schema changes, we need to drop the previous table and create a new one with the new schema. This is fine for small amounts of data but takes too much time when working with large data sets.

    (237)What infrastructure do we need to process 100 TB data using Hadoop?

    (238)What is Internal and External table in Hive?
    Hive keeps a relational database (the metastore) on the master node to track state. For instance, when you run CREATE TABLE FOO(foo string) LOCATION 'hdfs://tmp/'; the table schema is stored in that database. If you have a partitioned table, the partitions are stored there as well (this allows Hive to list partitions without scanning the filesystem to find them). This is the metadata.
    When we drop an internal (managed) table, Hive drops both the data and the metadata.
    When we drop an external table, Hive only drops the metadata; it becomes ignorant of that data but does not touch the data itself.

    (239)What is Small File Problem in Hadoop?
    When loading and processing a large number of small files in Hadoop, the NameNode needs a lot of memory to store the metadata of each file, and the seek time across many small files is also high. A large number of small files is therefore not a good fit for a Hadoop cluster.

    (240)How does a client read/write data in HDFS?

    (241)What should be the ideal replication factor in Hadoop?
    3

    (242)What is the optimal block size in HDFS?
    64MB

    (243)Explain Metadata in Namenode
    MetaData: The metadata consists of all the details about a particular file, such as filename, size, storage location and file type.

    (244)How to enable recycle bin or trash in Hadoop
    To enable the trash feature and set the time delay for trash removal, set the fs.trash.interval property in core-site.xml to the delay in minutes. For example, if you want users to have 24 hours (1,440 minutes) to restore a deleted file, set fs.trash.interval to 1440 in core-site.xml.

    (245)What is the difference between int and IntWritable?
    int in Java is a 32-bit signed two's-complement integer primitive.
    IntWritable is Hadoop's wrapper for it: it implements the Writable and WritableComparable interfaces (WritableComparable extends Comparable), which the MapReduce framework relies on. The comparison is used when the framework sorts keys before the reduce phase, and Writable lets the value be serialized to local disk and across the network.
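
    A small illustration of the difference (nothing cluster-specific here, just the wrapper type; class name is hypothetical):

    import org.apache.hadoop.io.IntWritable;

    public class IntWritableDemo {
      public static void main(String[] args) {
        int primitive = 42;                                  // plain Java 32-bit primitive
        IntWritable writable = new IntWritable(primitive);   // Hadoop-serializable, comparable wrapper
        writable.set(writable.get() + 1);                    // mutable, typically reused for every record
        System.out.println(writable);                        // prints 43
      }
    }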

    (246)How to change Replication Factor (For below cases):

    (247)In Map Reduce why map write output to Local Disk instead of HDFS?
    Mapper output is intermediate data, not the final output, so storing it in HDFS would waste resources (3x replication plus metadata stored in the NameNode). To avoid that unnecessary overhead it is written to the local file system (LFS).

    (248)Rack awareness of Namenode

    (249)Hadoop the definitive guide (2nd edition) pdf

    (250)What is bucketing in Hive?
    Hive organizes tables into partitions, a way of dividing a table into coarse-grained parts based on the value of a partition column, such as date. Using partitions can make it faster to do queries on slices of the data.
    Tables or partitions may further be subdivided into buckets, to give extra structure to
    the data that may be used for more efficient queries. For example, bucketing by user
    ID means we can quickly evaluate a user-based query by running it on a randomized
    sample of the total set of users.

    (251)What is Clustering in Hive?

    (252)What type of data we should put in Distributed Cache? When to put the data in DC? How much volume we should put in?
    Any data that we intend to share across all nodes in the cluster can be put in Distributed Cache. It is in read only mode.

    (253)What is Distributed Cache?
    Repeated

    (254)What is the Partitioner in Hadoop? Where does it run, mapper or reducer?
    The partitioner class determines which partition a given (key, value) pair will go to. The default partitioner computes a hash value for the key and assigns the partition based on this result.
    It runs on the map side, when the map output is written out; if there are no Reducers, no partitioning takes place.

    (255)Why does each task run in a new JVM instead of a new Java thread?

    (256)How to write a Custom Key Class?
    To write a Custom Key Class, we need to implement WritableComparable Interface.
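
    A minimal sketch of such a key, assuming a hypothetical composite key of (userId, timestamp); Hadoop calls write/readFields to serialize it and compareTo when sorting keys before the reduce phase, which also relates to (257) below:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class UserTimeKey implements WritableComparable<UserTimeKey> {
      private long userId;
      private long timestamp;

      public UserTimeKey() { }                        // required no-arg constructor

      public UserTimeKey(long userId, long timestamp) {
        this.userId = userId;
        this.timestamp = timestamp;
      }

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeLong(userId);
        out.writeLong(timestamp);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        userId = in.readLong();
        timestamp = in.readLong();
      }

      @Override
      public int compareTo(UserTimeKey other) {       // sort by user, then by time
        int cmp = Long.compare(userId, other.userId);
        return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
      }

      @Override
      public int hashCode() {                         // used by the default HashPartitioner
        return (int) (userId * 163 + timestamp);
      }

      @Override
      public boolean equals(Object o) {
        if (!(o instanceof UserTimeKey)) return false;
        UserTimeKey k = (UserTimeKey) o;
        return userId == k.userId && timestamp == k.timestamp;
      }
    }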

    (257)What is the utility of using Writable Comparable (Custom Class) in Map Reduce code?

    (258)What are Input Format, Input Split & Record Reader and what they do?
    Input Format: The InputFormat defines how to read data from a file into the Mapper instances. Hadoop comes with several implementations of InputFormat; some work with text files and describe different ways in which the text files can be interpreted.
    Input Split: A logical division of the input data; each split is processed by a single map task.
    Record Reader: The first stage of a MapReduce job; it reads input from the source and converts it into key-value pairs, which in turn act as the input to the Mapper.

    (259)Why we use IntWritable instead of Int? Why we use LongWritable instead of Long?

    (260)How to enable Recycle bin in Hadoop?
    Repeated

    (261)If data is present in HDFS and RF is defined, then how can we change Replication Factor?
    We can change the replication factor on a per-file basis using the Hadoop FS shell:
    hadoop fs -setrep -w 3 /my/file
    Alternatively, we can change the replication factor of all the files under a directory:
    hadoop fs -setrep -R -w 3 /my/dir
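
    The same change can also be made programmatically through the FileSystem API (a sketch; the path is hypothetical and the configuration is picked up from the classpath):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());   // uses core-site.xml / hdfs-site.xml on the classpath
        fs.setReplication(new Path("/my/file"), (short) 3);    // per-file replication factor
        fs.close();
      }
    }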

    (262)How we can change Replication factor when Data is on the fly?

    (262)mkdir: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /user/hadoop/inpdata. Name node is in safemode.

    (263)What does Hadoop do in Safe Mode?
    When a Hadoop cluster starts, the NameNode first loads its file system image and edit log (the checkpoint) into memory and then waits for the DataNodes to start up and report their blocks. Until a sufficient proportion of blocks has been reported, the NameNode stays in Safe Mode: the file system can be read, but no writes or other modifications are allowed.

    (264)What should be the ideal replication factor in Hadoop Cluster?
    3

    (265)Heartbeat for Hadoop.
    A heartbeat is a periodic signal sent by each DataNode to the NameNode (and by each TaskTracker to the JobTracker) to indicate that it is alive and working; if heartbeats stop arriving, the node is considered dead.

    (266)What are the considerations when doing hardware planning for the master nodes in a Hadoop architecture?

    (267)When should a Hadoop archive be created?

    (268)What factors are taken into account before the block size is chosen?

    (269)In which location does the NameNode store its metadata, and why?
    The metadata is kept in RAM for fast access and processing; it is also persisted on disk as the fsimage and edit log.

    (270)Should we use RAID in Hadoop or not?
    RAID is generally not needed on DataNodes, because HDFS already replicates blocks across nodes (plain JBOD disks are preferred there); RAID is sometimes used on the NameNode to protect its metadata.

    (271)How blocks are distributed among all data nodes for a particular chunk of data?

    (272)How to enable Trash/Recycle Bin in Hadoop?
    Repeated

    (273)What is a Hadoop archive?

    (274)How to create a Hadoop archive?

    (275)How we can take Hadoop out of Safe Mode
    hadoop dfsadmin -safemode leave

    (276)What is safe mode in Hadoop?
    Repeated

    (277)Why is MapReduce intermediate (map) output written to local disk?
    Repeated

    (278)When does Hadoop enter Safe Mode?
    Repeated

    (279)Data node block size in HDFS, why 64MB?
    64 MB is the default block size in the Apache Hadoop distribution.
    One reason for such a large block size is to minimize the proportion of time spent on disk seeks relative to data transfer.
    MapReduce jobs also execute more efficiently on large blocks of 64 MB or more.

    (280)What is Non DFS Used?
    Non DFS Used is any data in the filesystem of the DataNode(s) that is not in dfs.data.dir. This includes log files, MapReduce shuffle output and local copies of data files.

    (281)Virtual Box & Ubuntu Installation

    (282)What is Rack awareness?
    Rack awareness: taking a node's physical location (rack) into account while scheduling tasks and allocating storage.

    (283)On what basis does the NameNode distribute blocks across the DataNodes?
    Repeated

    (284)What is Output Format in hadoop?

    (285)How to write data in Hbase using flume?

    (286)What is difference between memory channel and file channel in flume?

    (287)How to create a table in Hive for a JSON input file?

    (288)What is speculative execution in Hadoop?
    Repeated

    (289)What is a Record Reader in hadoop?
    Repeated

    (290)How to resolve the following error while running a query in hive: Error in metadata: Cannot validate serde

    (291)What is difference between internal and external tables in hive?
    Repeated

    (292)What is Bucketing and Clustering in Hive?
    Repeated

    (293)How to enable/configure the compression of map output data in hadoop?

    (294)What is InputFormat in hadoop?
    Repeated

    (295)How to configure hadoop to reuse JVM for mappers?

    (296)What is difference between split and block in hadoop?
    Split is a logical division of data whereas block is the physical division of data.

    (297)What is Input Split in hadoop?
    Repeated

    (298)How can one write custom record reader?

    (299)What is balancer? How to run a cluster balancing utility?

    (300)What is version-id mismatch error in hadoop?

    (301)How to handle bad records during parsing?

    (302)What is identity mapper and reducer? In which cases can we use them?

    (303)What are reduce-only jobs?

    (304)What is crontab? Explain with suitable example.

    (305)Safe-mode exceptions

    (306)What is the meaning of the term “non-DFS used” in Hadoop web-console?
    Repeated

    (307)What is AMI

    (308)Can we submit the mapreduce job from slave node?
    No

    (309)How to resolve small file problem in hdfs?
    Repeated

    (310)How to overwrite an existing output file during execution of mapreduce jobs?
    If we want to overwrite the existing output, we need to override the Hadoop OutputFormat class (old mapred API shown below) so that the existence check on the output directory is skipped:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileAlreadyExistsException;
    import org.apache.hadoop.mapred.InvalidJobConfException;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.mapreduce.security.TokenCache;

    public class OverwriteOutputDirOutputFile<K, V> extends TextOutputFormat<K, V> {

      @Override
      public void checkOutputSpecs(FileSystem ignored, JobConf job)
          throws FileAlreadyExistsException, InvalidJobConfException, IOException {
        // Ensure that the output directory is set (required when there are reducers)
        Path outDir = getOutputPath(job);
        if (outDir == null && job.getNumReduceTasks() != 0) {
          throw new InvalidJobConfException("Output directory not set in JobConf.");
        }
        if (outDir != null) {
          FileSystem fs = outDir.getFileSystem(job);
          // normalize the output directory
          outDir = fs.makeQualified(outDir);
          setOutputPath(job, outDir);
          // get delegation token for the outDir's file system
          TokenCache.obtainTokensForNamenodes(job.getCredentials(),
              new Path[] { outDir }, job);
          // the existence check is deliberately left commented out so that an
          // existing output directory no longer makes the job fail
          /* if (fs.exists(outDir)) {
            throw new FileAlreadyExistsException("Output directory " + outDir +
                " already exists");
          } */
        }
      }
    }
    and need to set this as part of job configuration.
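
    For example, in the driver (old mapred API) this could look like the sketch below; OverwriteDriver is a placeholder class name:

    import org.apache.hadoop.mapred.JobConf;

    public class OverwriteDriver {                                  // hypothetical driver class
      public static void main(String[] args) {
        JobConf conf = new JobConf(OverwriteDriver.class);
        conf.setOutputFormat(OverwriteOutputDirOutputFile.class);   // wire in the OutputFormat above
        // ... set mapper/reducer, input/output paths, then submit with JobClient.runJob(conf)
      }
    }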

    (311)What is difference between reducer and combiner?
    A Combiner is a mini reducer that performs a local reduce task. Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. The combiner function runs on the map output, and its output becomes the reducer's input; in short, the combiner is used for network optimization. If the mappers generate a large number of output records, a combiner is worth using, but:
    1) One constraint a Combiner has, unlike a Reducer, is that its input/output key and value types must match the output types of your Mapper.
    Example: with job.setMapOutputKeyClass(Text.class) and job.setCombinerClass(IntSumReducer.class), writing context.write(NullWritable.get(), result) inside IntSumReducer would be wrong; it should write a Text key, e.g. context.write(key, result).
    2) Combiners can only be used on functions that are commutative (a.b = b.a) and associative {a.(b.c) = (a.b).c}. This also means a combiner may operate only on a subset of your keys and values, or may not execute at all, yet the output of the program must remain the same.
    3) Reducers can get data from multiple Mappers as part of the partitioning process; a Combiner only gets its input from one Mapper.
    The combiner is not a replacement for the Reducer, but it should be used when the requirements allow it.

    (311)What do you understand by node redundancy, and does it exist in a Hadoop cluster?

    (312)How do you proceed to write your first MapReduce program?

    (313)How to change replication factor of files already stored in HDFS
    Repeated

    (314)java.io.IOException: Cannot create directory, while formatting namenode

    (315)How can one set space quota in Hadoop (HDFS) directory

    (316)How can one increase replication factor to a desired value in Hadoop?
    We can set dfs.replication to the desired value in hdfs-site.xml (this affects newly created files), and use hadoop fs -setrep to change the replication factor of files that already exist.
