#1273

rhiddhiman
Participant

(201)What do you understand about Object Oriented Programming (OOP)? Use Java examples.

(202)What are the main differences between version 1.5 and version 1.6 of Java?

(203)Describe what happens to a MapReduce job from submission to output.
a. The client submits the job.
b. The Record Reader takes the input, converts it into key-value pairs and passes them as input to the Mapper.
c. The Mapper processes the data and produces intermediate output (again as key-value pairs), which is stored on the local file system.
d. The intermediate output goes through the shuffle and sort phase and is handed to the Reducer.
e. The Reducer produces the final output, which is stored in HDFS. A minimal sketch of the Mapper and Reducer stages is given below.
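As an illustration of steps (b) to (e), here is a minimal word-count sketch; the class names are hypothetical and the newer org.apache.hadoop.mapreduce API is assumed.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: receives (byte offset, line) pairs from the Record Reader and emits
// intermediate (word, 1) pairs, which are spilled to the local file system.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: receives (word, [1, 1, ...]) groups after shuffle and sort and
// writes the final (word, count) pairs to HDFS.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}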

(204)What mechanism does the Hadoop framework provide to synchronize changes made to the Distributed Cache during the runtime of an application?

(205)Have you ever built a production process in Hadoop? If yes, what was the procedure when your Hadoop job failed for any reason?

(206)Have you ever run into a lopsided (skewed) job that resulted in an out-of-memory error? If yes, how did you handle it?

(207)What is HDFS? How is it different from traditional file systems?
HDFS, the Hadoop Distributed File System, is responsible for storing very large data sets on the cluster. It is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.

(208)What is the benefit of the Distributed Cache? Why can't we just keep the file in HDFS and have the application read it?
Because the Distributed Cache is much faster for this use case. It copies the file to every TaskTracker once, at the start of the job. If that TaskTracker then runs 10 or 100 Mappers or Reducers, they all share the same local copy. On the other hand, if the job reads the file directly from HDFS, every Mapper accesses it from HDFS, so a TaskTracker running 100 map tasks reads the file from HDFS 100 times. HDFS is also not very efficient when used in this way.
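A minimal sketch of typical Distributed Cache usage with the classic org.apache.hadoop.filecache API; the HDFS path and the class names are hypothetical.

import java.io.File;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {

    // In the driver: register a file that already lives in HDFS (hypothetical path).
    public static void addLookupFile(Configuration conf) throws Exception {
        DistributedCache.addCacheFile(new URI("/user/hadoop/lookup.txt"), conf);
    }

    // In the Mapper: read the local copy once per task in setup(), not once per record.
    public static class CacheMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (localFiles != null && localFiles.length > 0) {
                File lookup = new File(localFiles[0].toString());
                // ... load the lookup data into memory here ...
            }
        }
    }
}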

(209)How does the JobTracker schedule a task?
Repeated

(210)How many Daemon processes run on a Hadoop system?
5 – NameNode, Secondary NameNode, DataNode, JobTracker, TaskTracker

(211)What is the configuration of a typical slave node on a Hadoop cluster? How many JVMs run on a slave node?
Repeated

(212)What is the configuration of a typical slave node on a Hadoop cluster? How many JVMs run on a slave node?
Repeated

(213)What is the difference between HDFS and NAS?
Repeated

(214)How does the NameNode handle DataNode failures?
Repeated

(215)Does MapReduce programming model provide a way for reducers to communicate with each other? In a MapReduce job can a reducer communicate with another reducer?
Repeated

(216)Where is the Mapper output (intermediate key-value data) stored?
Repeated

(217)What are combiners? When should I use a combiner in my MapReduce Job?
Repeated

(218)What are IdentityMapper and IdentityReducer in MapReduce?

(219)When are the reducers started in a MapReduce job?

(220)If reducers do not start before all mappers finish, why does the progress of a MapReduce job show something like Map(50%) Reduce(10%)? Why is the reducers' progress percentage displayed when the mappers have not finished yet?
Repeated

(221)What is HDFS Block size? How is it different from traditional file system block size?
Repeated

(222)How does the client communicate with HDFS?
Repeated

(223)What is NoSQL?
A NoSQL (often interpreted as Not only SQL) database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability.

(224)We already have SQL, so why NoSQL?
SQL can only be used to query a relational database with a well-defined schema, so when we need to query unstructured or schemaless data, NoSQL comes into play.
In today's world almost everything produces data (social networking, online transactions and so on). With SQL alone it is tough to manage that data and still deliver high performance, which is where NoSQL comes into the picture.

(225)What is the difference between SQL and NoSQL?
SQL is strictly for structured data while NoSQL can manage unstructured data as well.

(226)Does NoSQL follow the relational DB model?

(227)Why would NoSQL be better than using a SQL Database? And how much better is it?

(228)What do you understand by Standalone (or local) mode?
There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.

(229)What is Pseudo-distributed mode?
The Hadoop daemons run on the local machine, thus simulating a cluster on a small scale.

(230)What does /var/hadoop/pids do?
It stores the process IDs of the Hadoop daemons.

(231)Pig for Hadoop – Give some points?
PIG is a product of YAHOO.
PIG is a platform to use the scripting language PIG LATIN which is similar to SQL.
Works with Structured, Semi-structured and Unstructured Data.
Works faster with structured data.
Not required to install in Hadoop cluster only required on the user machine.
Written in JAVA
Abstraction level more, less lines of code, fast results.
We use PIG for
Time Sensitive Data Loads
Processing Many Data Sources (eg: google, forums)
Analytic Insight Through Sampling (eg: facebook page insights)

(232)Hive for Hadoop – Give some points?
Hive is a complete data warehousing package, often described as the SQL of Hadoop.
It is good for log analysis.
It is specifically designed to work with structured data.
It can handle semi-structured and unstructured data as well, but that is not its preferred use.
Update and delete functionality was traditionally not available in Hive, but update support was added after version 0.13.

(233)File permissions in HDFS?

(234)What is ODBC and JDBC connectivity in Hive?
The Hive ODBC driver is a software library that implements the Open Database Connectivity (ODBC) API standard for Hive, enabling ODBC-compliant applications to interact (ideally seamlessly) with Hive through a standard interface.
The Hive JDBC driver is a similar library for the JDBC standard; it lets Java applications connect to Hive, typically through HiveServer2.
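As an illustration, a Java client can talk to Hive over JDBC roughly as follows; HiveServer2 with the org.apache.hive.jdbc.HiveDriver class is assumed, and the host, port, database and table names are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver (HiveServer2).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical connection details: HiveServer2 on localhost:10000, "default" database.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();

        // Run a simple query and print the first column of each row.
        ResultSet rs = stmt.executeQuery("SELECT * FROM sample_table LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        con.close();
    }
}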

(235)What is Derby database?
Apache Derby (previously distributed as IBM Cloudscape) is an RDBMS developed by the Apache Software Foundation that can be embedded in Java programs and used for online transaction processing. A Derby database contains dictionary objects such as tables, columns, indexes and jar files, and it can also store its own configuration information.

(236)What is Schema on Read and Schema on Write?
Schema on Read (used in Big Data systems):
First we load the data into HDFS; the schema is interpreted only when we begin to read the data.
Schema on Write (used in traditional RDBMS):
We create a schema for a table and then load the data into it. If the input schema changes, we need to drop the previous table and create a new one with the new schema. This is fine for small amounts of data but takes too much time when working with large data sets.

(237)What infrastructure do we need to process 100 TB of data using Hadoop?

(238)What are internal and external tables in Hive?
Hive keeps a relational database (the metastore) on the master node to track state. For instance, when you run CREATE TABLE FOO(foo string) LOCATION 'hdfs://tmp/'; the table schema is stored in that database. If you have a partitioned table, the partitions are stored there as well (this allows Hive to list partitions without going to the filesystem to find them). This is the metadata.
When we drop an internal (managed) table, Hive drops both the data and the metadata.
When we drop an external table, Hive drops only the metadata. Hive is then ignorant of that data, but the data itself is not touched.

(239)What is the small file problem in Hadoop?
When loading and processing a large number of small files in Hadoop, the NameNode needs a lot of memory to store the metadata of each file, and the seek overhead is also higher for many small files. So a large number of small files is not a good fit for a Hadoop cluster.

(240)How does a client read/write data in HDFS?

(241)What should be the ideal replication factor in Hadoop?
3

(242)What is the optimal block size in HDFS?
64MB

(243)Explain Metadata in Namenode
MetaData: the metadata consists of all the details about a particular file – file name, size, storage location, type of file and the like.

(244)How to enable the recycle bin or trash in Hadoop?
To enable the trash feature and set the delay before trash removal, set the fs.trash.interval property in core-site.xml to the delay in minutes. For example, if you want users to have 24 hours (1,440 minutes) to restore a deleted file, set fs.trash.interval to 1440 in core-site.xml.

(245)What is the difference between int and IntWritable?
int is Java's 32-bit signed two's-complement integer.
IntWritable is Hadoop's box type for int. It implements the Writable and WritableComparable interfaces, which the MapReduce framework relies on: the comparison methods are used when the framework sorts keys before the reduce phase, and the Writable methods serialize and deserialize the value, for example when intermediate results are written to the local disk.
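A small sketch of what the Writable/WritableComparable contract looks like in practice; the in-memory byte-array round trip below is only for illustration.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

import org.apache.hadoop.io.IntWritable;

public class IntWritableDemo {
    public static void main(String[] args) throws Exception {
        IntWritable out = new IntWritable(42);

        // Writable.write(): serialize the value to a compact binary form, as the
        // framework does when spilling map output to the local disk.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        out.write(new DataOutputStream(bytes));

        // Writable.readFields(): deserialize it back on the receiving side.
        IntWritable in = new IntWritable();
        in.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        // compareTo(): used by the framework when sorting keys before the reduce phase.
        System.out.println(in.get() + " " + in.compareTo(new IntWritable(100)));
    }
}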

(246)How to change Replication Factor (For below cases):

(247)In MapReduce, why does the map write its output to the local disk instead of HDFS?
Mapper output is not the final output, so storing it in HDFS would waste resources (3x replication plus extra metadata in the NameNode). To avoid that unnecessary overhead, the intermediate output is written to the local file system.

(248)Rack awareness of Namenode

(249)Hadoop the definitive guide (2nd edition) pdf

(250)What is bucketing in Hive?
Hive organizes tables into partitions, a way of dividing a table into coarse-grained parts based on the value of a partition column, such as date. Using partitions can make it faster to do queries on slices of the data.
Tables or partitions may further be subdivided into buckets, to give extra structure to the data that may be used for more efficient queries. For example, bucketing by user ID means we can quickly evaluate a user-based query by running it on a randomized sample of the total set of users.

(251)What is clustering in Hive?

(252)What type of data should we put in the Distributed Cache? When should we put data in it? How much volume should we put in?
Any data that we intend to share across all nodes in the cluster can be put in the Distributed Cache. The cached data is read-only.

(253)What is Distributed Cache?
Repeated

(254)What is the Partitioner in Hadoop? Where does it run, the mapper or the reducer?
The partitioner class determines which partition (and therefore which reducer) a given (key, value) pair will go to. The default partitioner computes a hash value for the key and assigns the partition based on that result.
It runs on the map side, as part of the map task, before the shuffle. If there are no reducers, no partitioner is used.
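A minimal custom partitioner sketch against the org.apache.hadoop.mapreduce API; the key/value types are just an example, and the logic mirrors the default hash partitioner.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides, on the map side, which reduce partition each (key, value) pair goes to.
public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the result is always a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
// It would be registered in the driver with job.setPartitionerClass(HashLikePartitioner.class)
// and only takes effect when the job has at least one reducer.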

(255)Why a new JVM instead of a new Java thread?

(256)How to write a custom key class?
To write a custom key class, we need to implement the WritableComparable interface.
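A sketch of such a custom key class; the fields (a user id and a timestamp) are purely illustrative.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// A composite key holding a user id and a timestamp (illustrative fields).
public class UserTimeKey implements WritableComparable<UserTimeKey> {
    private long userId;
    private long timestamp;

    public UserTimeKey() { }   // no-arg constructor required by the framework

    public UserTimeKey(long userId, long timestamp) {
        this.userId = userId;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        out.writeLong(userId);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization
        userId = in.readLong();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(UserTimeKey other) {                  // used when sorting keys
        int cmp = Long.compare(userId, other.userId);
        return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
    }

    @Override
    public int hashCode() {                                     // used by the default partitioner
        return (int) (userId * 31 + timestamp);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof UserTimeKey)) return false;
        UserTimeKey k = (UserTimeKey) o;
        return userId == k.userId && timestamp == k.timestamp;
    }
}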

(257)What is the utility of using WritableComparable (a custom class) in MapReduce code?

(258)What are InputFormat, InputSplit and RecordReader, and what do they do?
InputFormat: defines how data is read from a file into the Mapper instances. Hadoop comes with several implementations of InputFormat; some work with text files and describe different ways in which the text can be interpreted.
InputSplit: a logical division of the input data; each split is processed by a single map task.
RecordReader: the first stage of a map task; it reads the input from its split and converts it into key-value pairs, which in turn act as the input to the Mapper.
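For instance, the input format is wired into the job in the driver; with TextInputFormat, the Record Reader supplies each Mapper with (byte offset, line) pairs. A minimal sketch, assuming the Hadoop 2.x style Job API and a hypothetical input path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo");
        job.setJarByClass(InputFormatDemo.class);

        // TextInputFormat divides the input files into InputSplits and supplies a
        // record reader that feeds each Mapper (LongWritable offset, Text line) pairs.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/user/hadoop/input")); // hypothetical path
    }
}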

(259)Why do we use IntWritable instead of int? Why do we use LongWritable instead of long?

(260)How to enable Recycle bin in Hadoop?
Repeated

(261)If data is present in HDFS and the replication factor is defined, how can we change the replication factor?
We can change the replication factor on a per-file basis using the Hadoop FS shell.
hadoop fs -setrep -w 3 /my/file
Alternatively, we can change the replication factor of all the files under a directory.
hadoop fs -setrep -R -w 3 /my/dir

(262)How can we change the replication factor when data is on the fly?

(262)mkdir: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /user/hadoop/inpdata. Name node is in safemode.

(263)What does Hadoop do in Safe Mode?
When a Hadoop cluster starts, the NameNode loads the file system state from the fsimage and edit log and then waits for the DataNodes to report their blocks. Until a sufficient proportion of blocks has been reported, the NameNode stays in Safe Mode. While Hadoop is in Safe Mode, HDFS is read-only: only reading data is possible and no writing can be done at that time.

(264)What should be the ideal replication factor in Hadoop Cluster?
3

(265)Heartbeat in Hadoop.
The heartbeat is a mechanism that maintains coordination between the NameNode and its DataNodes: each DataNode periodically sends a heartbeat signal to tell the NameNode that it is alive and functioning.

(266)What are the considerations when doing hardware planning for the master node in a Hadoop architecture?

(267)When should a Hadoop archive be created?

(268)What factors are taken into account when deciding the block size?

(269)In which location does the NameNode store its metadata, and why?
The metadata is kept in the NameNode's RAM for faster access and processing; it is also persisted to disk as the fsimage and edit log.

(270)Should we use RAID in Hadoop or not?
RAID is generally not needed on DataNodes, because HDFS already replicates blocks across nodes; it can still be useful on the NameNode to protect the metadata.

(271)How are blocks distributed among the DataNodes for a particular chunk of data?

(272)How to enable Trash/Recycle Bin in Hadoop?
Repeated

(273)What is a Hadoop archive?

(274)How to create a Hadoop archive?

(275)How can we take Hadoop out of Safe Mode?
hadoop dfsadmin -safemode leave

(276)What is safe mode in Hadoop?
Repeated

(277)Why is the MapReduce map output written to the local disk?
Repeated

(278)When does Hadoop enter Safe Mode?
Repeated

(279)DataNode block size in HDFS – why 64 MB?
64 MB is the default block size in the Apache Hadoop distribution.
One reason to have a 64 MB block size is to minimize disk seek time relative to transfer time.
MapReduce jobs can also be executed more efficiently on large blocks such as 64 MB or more.

(280)What is "Non DFS Used"?
Non DFS Used is any data on a DataNode's filesystem that is not stored under dfs.data.dir. This includes log files, MapReduce shuffle output and local copies of data files; roughly, Non DFS Used = Configured Capacity - DFS Used - DFS Remaining.

(281)Virtual Box & Ubuntu Installation

(282)What is rack awareness?
Rack awareness means taking a node's physical location (its rack) into account when scheduling tasks and allocating storage.

(283)On what basis does the NameNode distribute blocks across the DataNodes?
Repeated

(284)What is Output Format in hadoop?

(285)How to write data to HBase using Flume?

(286)What is the difference between the memory channel and the file channel in Flume?

(287)How to create a table in Hive for a JSON input file?

(288)What is speculative execution in Hadoop?
Repeated

(289)What is a Record Reader in hadoop?
Repeated

(290)How to resolve the following error while running a query in hive: Error in metadata: Cannot validate serde

(291)What is difference between internal and external tables in hive?
Repeated

(292)What is Bucketing and Clustering in Hive?
Repeated

(293)How to enable/configure the compression of map output data in hadoop?

(294)What is InputFormat in hadoop?
Repeated

(295)How to configure hadoop to reuse JVM for mappers?

(296)What is difference between split and block in hadoop?
Split is a logical division of data whereas block is the physical division of data.

(297)What is Input Split in hadoop?
Repeated

(298)How can one write custom record reader?

(299)What is balancer? How to run a cluster balancing utility?

(300)What is version-id mismatch error in hadoop?

(301)How to handle bad records during parsing?

(302)What is identity mapper and reducer? In which cases can we use them?

(303)What are reduce-only jobs?

(304)What is crontab? Explain with suitable example.

(305)Safe Mode exceptions

(306)What is the meaning of the term “non-DFS used” in Hadoop web-console?
Repeated

(307)What is AMI

(308)Can we submit a MapReduce job from a slave node?
No

(309)How to resolve small file problem in hdfs?
Repeated

(310)How to overwrite an existing output directory during execution of a MapReduce job?
If we want to overwrite the existing output, we need to override Hadoop's OutputFormat class, for example:
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileAlreadyExistsException;
import org.apache.hadoop.mapred.InvalidJobConfException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapreduce.security.TokenCache;

public class OverwriteOutputDirOutputFile<K, V> extends TextOutputFormat<K, V> {

    // Same as the stock checkOutputSpecs(), except that the
    // "output directory already exists" check is disabled.
    @Override
    public void checkOutputSpecs(FileSystem ignored, JobConf job)
            throws FileAlreadyExistsException, InvalidJobConfException, IOException {
        // Ensure that the output directory is set
        Path outDir = getOutputPath(job);
        if (outDir == null && job.getNumReduceTasks() != 0) {
            throw new InvalidJobConfException("Output directory not set in JobConf.");
        }
        if (outDir != null) {
            FileSystem fs = outDir.getFileSystem(job);
            // normalize the output directory
            outDir = fs.makeQualified(outDir);
            setOutputPath(job, outDir);
            // get delegation tokens for the outDir's file system
            TokenCache.obtainTokensForNamenodes(job.getCredentials(),
                    new Path[] { outDir }, job);
            // existence check deliberately commented out so the job can overwrite old output
            /* if (fs.exists(outDir)) {
                throw new FileAlreadyExistsException("Output directory " + outDir
                        + " already exists");
            } */
        }
    }
}
We then need to set this class as the job's output format in the job configuration, e.g. conf.setOutputFormat(OverwriteOutputDirOutputFile.class) with the old mapred API.

(311)What is the difference between a reducer and a combiner?
A combiner is a mini reducer that performs a local reduce on the map output. Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between the map and reduce tasks. The combiner runs on the map output, and the combiner's output becomes the reducers' input; in short, the combiner is a network optimization. If the map phase generates a large amount of output, a combiner is worth using, with the following constraints:
1) Unlike a reducer, a combiner's input and output key and value types must match the output types of your Mapper. For example, with job.setMapOutputKeyClass(Text.class) and job.setCombinerClass(IntSumReducer.class), the combiner must not emit context.write(NullWritable.get(), result); it has to emit a Text key, e.g. context.write(key, result).
2) Combiners can only be used for functions that are commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c). The framework may apply the combiner to only a subset of your keys and values, or may not run it at all, and the output of the program must remain the same either way.
3) Reducers can receive data from multiple mappers as part of the partitioning process; a combiner gets its input from only one mapper.
A combiner is not a replacement for the reducer; use it only where these requirements hold. A short driver sketch follows below.
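A driver sketch of wiring in a combiner, assuming a word-count style job in which the sum is commutative and associative, so the reducer class can double as the combiner; the mapper and reducer class names reuse the hypothetical sketch shown earlier in this list, and the Hadoop 2.x Job API is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CombinerDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
        job.setJarByClass(CombinerDemo.class);

        job.setMapperClass(WordCountMapper.class);      // emits (Text, IntWritable)
        // The combiner runs locally on the map output; because the sum is both
        // commutative and associative, the reducer class can be reused here.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}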

(311)What do you understand by node redundancy, and does it exist in a Hadoop cluster?

(312)How would you proceed to write your first MapReduce program?

(313)How to change replication factor of files already stored in HDFS
Repeated

(314)java.io.IOException: Cannot create directory, while formatting namenode

(315)How can one set a space quota on a Hadoop (HDFS) directory?

(316)How can one increase the replication factor to a desired value in Hadoop?
We can change the default replication factor by setting the dfs.replication property in the configuration file hdfs-site.xml.
