HADOOP QUESTIONS:
What is Monetized Analytics in Hadoop?
Monetized analytics helps businesses make important and better decisions, and helps earn revenue. However, Big Data analytics can also be used to derive revenue beyond the insights it provides: you might be able to obtain a unique data set that is valuable to other companies.

What is the Sqoop merge tool?
The Sqoop merge tool works hand in hand with the incremental import's last-modified mode. Each import creates a new file, so if you want to keep the table data together in one file, you use the merge tool.

What is the Oozie Activity in Hadoop?
An Oozie activity is any possible entity that can be tracked in Oozie functional subsystems and Hadoop jobs. The Oozie SLA defines, stores, and tracks the desired SLA information for any Oozie activity.

What are Ensemble Methods in Hadoop?
Ensemble methods refer to the process of generating multiple models and combining them to solve a specific problem. The process we follow in an ensemble method is quite similar to what we follow in day-to-day life: we take opinions from different experts before arriving at a final decision.
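As an illustrative sketch (in Python, not part of any Hadoop API), the simplest way to combine models is majority voting; the label lists below are hypothetical model outputs, not real classifiers:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the predictions of several models by taking the
    most common label -- the 'opinions from different experts'."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical models vote on the same input;
# the ensemble returns the majority label.
votes = ["spam", "ham", "spam"]
print(majority_vote(votes))  # -> spam
```

Bagging and boosting follow the same idea, differing only in how the individual models are trained and weighted.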

What is sharding in Hadoop?
Database sharding can be defined as a partitioning scheme for large databases distributed across multiple servers, and it enables new levels of database performance and scalability. It divides a database into smaller parts called "shards" and replicates those across a number of distributed servers.
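A minimal sketch of the routing side of sharding (hypothetical key and shard count, not tied to any particular database): a stable hash of the record key decides which shard a row lives on, so the same key always routes to the same server.

```python
import hashlib

def shard_for(key, num_shards):
    """Map a record key to a shard index using a stable hash,
    so the same key always lands on the same shard."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Rows for a hypothetical user id are spread across 4 shards.
print(shard_for("user-42", 4))
```

Real systems add replication of each shard and a lookup service so shards can be moved between servers without rehashing every key.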

What is polyglot persistence in Hadoop?
Polyglot persistence is a Big Data term applied to applications that use several core database technologies. A polyglot approach is often used to solve a complex problem by breaking it into fragments and applying a different database modelling technique to each.

What is the RecordReader in Hadoop?
The input split defines a unit of work in a MapReduce program, but it does not describe how to access that unit of work. The RecordReader class loads all required data from its source and converts it into key/value pairs that can be read by the mapper.

What is the Visualization Layer in Hadoop?
The visualization layer handles the task of interpreting and visualizing Big Data. Visualization can be described as viewing a piece of information from different perspectives and interpreting it in different manners.

Can you give us some more details about SSH communication between the masters and the slaves?
SSH is a password-less secure communication channel through which data packets are sent to the slaves. The packets are sent in a defined format. SSH is used not only between masters and slaves but also between any two hosts.

What is the CAP Theorem in Hadoop?
In the case of distributed databases, the three important aspects of the CAP theorem are Consistency (C), Availability (A), and Partition tolerance (P). In quorum terms, consistency relates to the number of nodes that should respond to a read request before it is considered a successful operation; availability relates to the number of nodes that should respond to a write request before it is considered successful; and partition tolerance relates to the number of nodes across which the data is replicated or copied.
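The read/write quorum counts described above are often summarized by the rule R + W > N: with N replicas, a read quorum of R and a write quorum of W overlap whenever their sum exceeds N, so a read always sees the latest write. A small sketch of that check (illustrative only, not any database's API):

```python
def is_strongly_consistent(n, r, w):
    """With N replicas, R nodes answering a read, and W nodes
    acknowledging a write, R + W > N guarantees that every read
    quorum overlaps the most recent write quorum."""
    return r + w > n

# N=3 replicas: R=2, W=2 overlaps (consistent); R=1, W=1 does not.
print(is_strongly_consistent(3, 2, 2))  # True
print(is_strongly_consistent(3, 1, 1))  # False
```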

Differentiate between FileSink and FileRollSink?
The major difference is that the HDFS File Sink writes events into the Hadoop Distributed File System (HDFS), whereas the File Roll Sink stores events in the local file system.

What is BloomMapFile used for?
The BloomMapFile is a class that extends MapFile. It uses dynamic Bloom filters to provide quick membership tests for keys, and it is used in the HBase table format.
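To illustrate the underlying idea (a toy Python sketch, not the Hadoop implementation): a Bloom filter hashes each key to a few bit positions; a lookup can say "definitely absent" or "probably present", with possible false positives but no false negatives.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: quick membership tests with
    possible false positives but no false negatives."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions from independent-ish hashes.
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means definitely absent; True means probably present.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("row-key-1")
print(bf.might_contain("row-key-1"))  # True
print(bf.might_contain("missing"))    # almost certainly False
```

This is why BloomMapFile lookups can skip disk reads for keys that are not in the file.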

How often do you need to reformat the namenode?
Never. The namenode needs to be formatted only once, in the beginning. Reformatting the namenode will lead to loss of all the data on the entire file system.
The namenode is formatted only once: formatting creates the directory structure for the file system metadata and creates a namespaceID for the entire file system.

What is the Hive DDL in Hadoop?
Data Definition Language (DDL) is used to describe data and data structures of a database. Hive has its own DDL, similar to SQL DDL, which is used for creating, altering, and dropping databases, tables, and other objects in a database.

What is the role of “Zookeeper” in a Hadoop cluster?
Ans: The purpose of Zookeeper is cluster management. Zookeeper helps you achieve coordination between Hadoop nodes. Zookeeper also helps to:
a. Manage configuration across nodes
b. Implement reliable messaging
c. Implement redundant services
d. Synchronize process execution

What is the Oozie SLA in Hadoop?
The Oozie SLA specifies the quality of an Oozie application in measurable terms. The SLA can be determined after taking the business requirements and the nature of the software into consideration.

What is the checkpoint in Hadoop?
As part of checkpoint functionality, the Backup node maintains the current state of all the HDFS block metadata in memory, just like the namenode. If you are using the Backup node, you cannot run the Checkpoint node, and there is no need to do so, because the checkpointing process is already being taken care of.
The Checkpoint node is the replacement for the Secondary NameNode.

Execution of Asynchronous Actions in Oozie?
Any asynchronous action in the Hadoop cluster can be executed in the form of Hadoop MapReduce jobs, which makes Oozie scalable. When you use Hadoop to perform processing/computation tasks triggered by a workflow action, the workflow job must wait until these tasks complete before moving to the next node in the workflow.

What are the Oozie recovery capabilities in Hadoop?
Oozie can recover workflow jobs in two ways. First, when an action starts successfully, Oozie applies the MapReduce retry mechanisms for recovery. On the other hand, if an action fails to start, Oozie uses other recovery techniques according to the nature of the failure.

What is the Oozie Bundle in Hadoop?
The Oozie bundle is a top-level abstraction: a bundle or set of coordinator applications. The Oozie bundle enables a user to start, stop, suspend, resume, or rerun jobs at the bundle level, providing better operational control over the set of coordinator applications.

What are SPLIT and FLATTEN in Pig?
The SPLIT operator partitions a given relation into two or more relations.
The FLATTEN operator is used for un-nesting bags and tuples.
The FLATTEN operator is syntactically similar to a user-defined function statement.
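To illustrate in Python (rather than Pig Latin) what un-nesting means: FLATTEN turns a field that holds a bag into one output row per inner element, roughly like this sketch with hypothetical data:

```python
def flatten(rows):
    """For each (key, bag) row, emit one (key, element) row per
    element of the bag -- a rough analogue of Pig's FLATTEN."""
    for key, bag in rows:
        for element in bag:
            yield (key, element)

rows = [("alice", ["a", "b"]), ("bob", ["c"])]
print(list(flatten(rows)))
# [('alice', 'a'), ('alice', 'b'), ('bob', 'c')]
```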

What is the Oozie Coordinator in Hadoop?
The Oozie coordinator is used to specify the conditions for a workflow in the form of predicates. It triggers the execution of the workflow at a specified time, at regular intervals, or on the basis of the available data.

What is the difference between static and dynamic partitions in Hive?
Static partition: the name of the partition is hard-coded in the insert statement.
Dynamic partition: Hive automatically determines the partition based on the value of the partition field.

How do we include native libraries in YARN jobs?
Ans: By using the -Djava.library.path option on the command line, or by setting LD_LIBRARY_PATH in the .bashrc file.

How does Hadoop's CLASSPATH play a vital role in starting or stopping Hadoop daemons?
Ans: The classpath contains the list of directories with the jar files required to start/stop the daemons.
Ex: HADOOP_HOME/share/hadoop/common/lib contains all the common utility jar files.

How can you set an arbitrary number of reducers to be created for a job in Hadoop?

Ans: You can either do it programmatically, using the setNumReduceTasks method of the JobConf class, or set it up as a configuration setting.

How do you overwrite the replication factor?

Ans: There are a few ways to do this; see the illustrations below.

hadoop fs -setrep -w 5 -R hadoop-test

hadoop fs -Ddfs.replication=5 -cp hadoop-test/test.csv hadoop-test/test_with_rep5.csv

MapReduce jobs are failing on a cluster that was just restarted. They worked before the restart. What could be wrong?

Ans: The cluster is in safe mode. The administrator needs to wait for the namenode to exit safe mode.
Depending on the data size, data replication will take some time. The Hadoop cluster still needs to copy data around, and if the data size is big enough, it is not uncommon for replication to take from a few minutes to a few hours.

How would a Hadoop administrator deploy the various components of Hadoop in production?

Ans: Deploy the namenode and jobtracker on the master node, and deploy datanodes and tasktrackers on multiple slave nodes.
There is a need for only one namenode and one jobtracker on the system. The number of datanodes depends on the available hardware.

What are the hardware requirements for a Hadoop cluster (primary and secondary namenodes and datanodes)?

Ans: There are no special requirements for datanodes. However, the namenodes require a specified amount of RAM to store the filesystem image in memory. Based on the design of the primary and secondary namenodes, the entire filesystem information is stored in memory, so both namenodes need enough memory to hold the entire filesystem image.

What are Operationalized Analytics and Monetized Analytics?
Operationalized analytics means making analytics an integral part of the business process. For instance, an insurance company can use a model to predict the probability of a claim being fraudulent.

Monetized analytics helps businesses make important and better decisions, and helps earn revenue.

What is a custom RecordReader in Hadoop?
The RecordReader class generates key/value pairs from the data within the boundaries created by the input split. Within the input file, the split has a start and a corresponding end; the start is the byte position at which the RecordReader begins generating key/value pairs, and the end is where it stops.
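A toy, non-Hadoop sketch of the idea: given a byte buffer and a split's start/end offsets, a line-oriented reader emits (byte offset, line) pairs for the records inside its split. For simplicity this sketch assumes splits are aligned to record boundaries; a real RecordReader also handles records that straddle split edges.

```python
def read_records(data: bytes, start: int, end: int):
    """Emit (byte_offset, line) pairs for every newline-terminated
    record whose first byte falls in [start, end)."""
    pos = start
    while pos < end:
        nl = data.find(b"\n", pos)
        if nl == -1:
            # Trailing record with no newline terminator.
            yield (pos, data[pos:])
            break
        yield (pos, data[pos:nl])
        pos = nl + 1

data = b"one\ntwo\nthree\n"
print(list(read_records(data, 0, len(data))))
# [(0, b'one'), (4, b'two'), (8, b'three')]
```

The offsets play the role of the keys and the lines the values, which is essentially what Hadoop's LineRecordReader hands to the mapper.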

What is Business Importance in Hadoop?
The results thus obtained must always be placed in a business context as a final step of validation. Let's assume that an executive is 99 percent confident that a change in a process would result in a 10 percent hike in revenue.

What is the Monitoring Layer in Hadoop?
The monitoring layer consists of a number of monitoring systems. These systems remain automatically aware of all the configurations and functions of the different operating systems and hardware.

What are Containers in Hadoop?
A container is simply a set of physical resources on a single node. A container consists of memory, CPU cores, and disks. Depending upon the resources in a node, a node can have multiple containers, each assigned to a specific ApplicationMaster.

Mention the key components of HBase?
Zookeeper: does the coordination work between the client and the HBase Master
HBase Master: monitors the region servers
RegionServer: monitors the regions
Region: contains the in-memory data store (MemStore) and the HFile
Catalog Tables: consist of ROOT and META

What is the SSH action in Oozie?
The SSH action invokes a specified shell script located on an Oozie server node (not on HDFS).

What is Terasort?
TeraSort is a popular benchmark that measures the amount of time to sort one terabyte of randomly distributed data on a given computer system. It is commonly used to measure MapReduce performance of an Apache™ Hadoop® cluster
More on this : https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/examples/terasort/package-summary.html

What ports does HBase use?
HBase runs the master and its informational HTTP server at 60000 and 60010 respectively, and the regionservers at 60020 with their informational HTTP server at 60030.

Explain how the Hadoop classpath plays a vital role in stopping or starting Hadoop daemons?
The classpath consists of a list of directories containing the jar files needed to stop or start the daemons.

Which are the three main hdfs-site.xml properties?

The three main hdfs-site.xml properties are:

1. dfs.name.dir, which gives you the location where the metadata will be stored and where the DFS is located – on disk or on a remote location.

2. dfs.data.dir, which gives you the location where the data is going to be stored.

3. fs.checkpoint.dir, which is for the secondary Namenode.

What are Edge nodes in Hadoop?
Edge nodes are the interface between the Hadoop cluster and the outside network. For this reason, they are sometimes referred to as gateway nodes. Most commonly, edge nodes are used to run client applications and cluster administration tools.
What are active and passive “Namenodes”?

In Hadoop-2.x, we have two Namenodes – Active “Namenode” and Passive “Namenode”. Active “Namenode” is the “Namenode” which works and runs in the cluster. Passive “Namenode” is a standby “Namenode”, which has similar data as active “Namenode”. When the active “Namenode” fails, the passive “Namenode” replaces the active “Namenode” in the cluster. Hence, the cluster is never without a “Namenode” and so it never fails.

How can HDFS be accessed as a standard filesystem?
WebDAV is a set of extensions to HTTP that supports editing and updating files. On most operating systems, WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.

What is Fault Tolerance in Hadoop?
Cloud computing provides fault tolerance by offering uninterrupted services to customers, especially in cases of component failure. The responsibility of handling the workload is shifted to other components of the cloud.

What is the background of Yarn?
Apache Hadoop is an ecosystem used for processing large amounts of data through the MapReduce data processing model, which was originally proposed by Google. Hadoop supports distributed processing of large amounts of data through the core MapReduce processing mechanism.

What are Intelligent Keys in Hadoop?
Because the data stored in HBase is ordered by row key, and the row key is the only native index provided by the system, careful, intelligent design of the row key can make a huge difference.

What are the kick-off time and the Bundle action in Oozie?
Kick-off time refers to the time at which a bundle application starts. A Bundle action refers to the start of a coordinator job of a coordinator application by the Oozie server.

What is Anomaly identification in Hadoop?
It refers to the identification of anomalies: the identification of an event that shows a difference between the actual observation and what you expected in your data.

What is Backward Compatibility in Yarn?
Yarn is backward compatible, which means that code developed using MapReduce can run on Yarn (Hadoop 2) without any, or with only minor, changes. This is a very important feature, as applications developed using MapReduce usually cater to a large user base and run on widespread distributed systems.

What is polling in Oozie?
Sometimes, the client is unable to invoke the callback URL to report task completion for some reason, such as a transient network failure. In this case, Oozie uses a polling mechanism: Oozie itself polls the task to determine whether it has completed.

What is data mining in Hive?
Hive is a batch-oriented, data-warehousing layer created on the basic elements of Hadoop, such as HDFS and MapReduce. This layer plays an important role in the mining of Big Data. Hive offers a simple SQL-like implementation called HiveQL to SQL users, without losing access to mappers and reducers.

What is the public cloud in Hadoop?
A cloud that is owned and managed by a company other than the one using it (which can be either an individual user or a company) is known as a public cloud. In this cloud, there is no need for the organizations (customers) to control or manage the resources; they are administered by a third party.

What are compute and storage nodes?
Compute node: the computer or machine where your actual business logic is executed.
Storage node: the computer or machine where your file system resides to store the data being processed. In most cases, the compute node and the storage node are the same machine.

Explain how you can check whether the Namenode is working, besides using the jps command?
Besides using the jps command, to check whether the Namenode is working you can also use /etc/init.d/hadoop-0.20-namenode status.

What is the benefit of the distributed cache? Why can't we just have the file in HDFS and have the application read it?
The distributed cache is much faster. It copies the file to all task trackers at the start of the job, so if a task tracker then runs 10 or 100 mappers or reducers, they all use the same local copy from the distributed cache. On the other hand, if you put code in the MR job to read the file from HDFS, then every mapper will try to access it from HDFS; if a task tracker runs 100 map tasks, it will try to read this file 100 times from HDFS. HDFS is also not very efficient when used like this.

What is the private cloud in Hadoop?
A cloud that remains entirely in the ownership of the organization using it is known as a private cloud. In other words, this cloud computing infrastructure is solely designed for a single organization and cannot be accessed by other organizations. However, the organization may allow this cloud to be used by its employees, partners, and customers.

What are IdentityMapper and IdentityReducer in MapReduce?
org.apache.hadoop.mapred.lib.IdentityMapper implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default value.
org.apache.hadoop.mapred.lib.IdentityReducer performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default value.

What is predictive analysis in Hadoop?
Consider a case where a customer files a claim for the insurance money on a car, asserting that it was destroyed in a fire. However, the customer's records on file indicate that most of the valuable items were removed from the car prior to the fire. This might indicate that the car was torched on purpose.

What is Localytics?
Localytics is a marketing and analytics platform for mobile and web apps, developed by Localytics, based in Boston. It supports cross-platform and web-based applications, as well as push messaging, business analytics, and acquisition campaign management.

What is the difference between Hive and HBase?
Hive allows most SQL queries, but HBase does not allow SQL queries directly.
Hive does not support record-level update, insert, and delete operations on a table, but HBase does.
Hive is a data warehouse framework, whereas HBase is a NoSQL database.
Hive runs on top of MapReduce, whereas HBase runs on top of HDFS. Note: Hive supports update and delete in its latest versions.
