Hive Interview Questions and Answers
Are you looking for the best interview questions on Hive, or hunting for the best platform that provides a list of top-rated Hive interview questions for experienced candidates? Then stop hunting and follow Best Hadoop Training Institute for this list of top-rated Hadoop Hive interview questions and answers, useful for both freshers and experienced candidates.
Are you hungry to become a certified pro Hadoop Developer? Then ask your industry-certified, experienced Hadoop trainer for more detailed information. Don't just dream of becoming a pro developer; achieve it by learning the Hadoop course under a world-class trainer. Follow the below-mentioned Hadoop Hive interview questions with answers to crack any type of interview you face.
Data modification in Hive?
Ans : Data modification is basically done in several ways:
1. INSERT: insertion is used to insert records (values) into the table. The table is first created as follows, and the INSERT itself is sketched after the example:
Syntax :
CREATE TABLE <table_name>(column_name_1 datatype, column_name_2 datatype, ...,
column_name_n datatype) row format delimited fields terminated by ',' stored as textfile;
Example:
create table earth(name STRING, id INT, nametype STRING, recclass STRING, mass INT, Fall_h STRING, year TIMESTAMP, reclat BIGINT, reclong BIGINT, Geolocation STRING) row format delimited fields terminated by ',' stored as textfile;
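The CREATE statement above only defines the table; a minimal sketch of the INSERT step itself (assuming Hive 0.14 or later and the earth table just created, with illustrative values) is:
INSERT INTO TABLE earth VALUES ('Aachen', 1, 'Valid', 'L5', 21, 'Fell', '1880-01-01 00:00:00', 50, 6, '(50.775, 6.08333)');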
♦ UPDATE: update is used to update records in an existing table
Syntax :
Update <table_name> set <column_name> = <value> where <condition>;
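Note that UPDATE (and DELETE) only work on transactional tables, i.e. tables stored as ORC with TBLPROPERTIES ('transactional'='true'). A sketch with illustrative values, assuming such a table:
Example : Update earth set name = 'Aachen_new' where id = 1;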
Managed table and External table?
ANS : Hive tables are basically of two types:
♦ Managed table: used to store and process datasets in the default Hive directory
Path: /user/hive/warehouse/prwatech.db/earth
For creating a managed table, we simply declare a CREATE statement along with the file format:
Syntax :
create table earth(name STRING, id INT, nametype STRING, recclass STRING, mass INT, Fall_h STRING, year TIMESTAMP, reclat BIGINT, reclong BIGINT, Geolocation STRING) row format delimited fields terminated by ',' stored as textfile;
After creating the table, the data needs to be loaded into it; the syntax to load the data is:
Syntax : Load data local inpath '/home/cloudera/Desktop/Meteorite_Landings.csv' overwrite into table earth;
If you drop a managed table, both the metastore entry of the table and the data of the table are deleted.
♦ External table:
If we use an external table, we have to add the EXTERNAL keyword and specify the path of the table during creation:
Syntax :
create external table example_customer(customer STRING, firstname STRING, lastname STRING, age INT, profession STRING) row format delimited
fields terminated by ',' LOCATION '/user/cloudera/external';
If we drop an external table, the metastore entry of the table is deleted but the physical data remains.
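To check whether an existing table is managed or external, one option (a sketch, using the earth table above) is:
hive > DESCRIBE FORMATTED earth;
The Table Type field in the output shows MANAGED_TABLE or EXTERNAL_TABLE.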
What is SerDe and its application?
Ans: Whenever a file needs to be read from or written to a Hive table, a SerDe (Serializer/Deserializer) is the interface that tells Hive how to process the data of that table. When a select query is performed on a table whose file arrives through TextInputFormat, the RecordReader takes the InputFormat's output and converts it into key-value pairs until the last line of the file. The RecordReader then returns one record at a time; that record is the actual serialized record, and it has to be converted into a Row representation for the end-user. Whenever a select query is performed, that Row is passed to Serde.deserialize() with the help of the ObjectInspector class for the entire Row; there is also an ObjectInspector for each field, so the ObjectInspector maps each field, deserializes it, and shows it to the end-user.
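A SerDe is declared at table-creation time. A minimal sketch using the built-in OpenCSVSerde (the table name and columns here are illustrative; OpenCSVSerde treats all columns as strings):
Syntax : create table csv_demo(name STRING, id STRING) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' stored as textfile;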
Object Inspector?
ANS : An ObjectInspector is basically a class that works together with the SerDe's serialize() and deserialize() methods to perform serialization and deserialization of a Row of the table.
What are the different factors to achieve performance tuning in Hive?
ANS : Some factors to achieve performance tuning in Hive are:
♦ Enable compression: keeping the data in the Hive table in a compressed format such as Gzip or Bzip2, rather than as a plain text file, provides better performance than working with uncompressed data (see the set commands after this list).
♦ Optimized joins: we can easily improve the performance of joins by enabling auto-convert map joins and enabling skew joins (see the set commands after this list). The auto map join is a very powerful and useful feature when joining a small table with a big table: if we enable it, the small table stores its data in the local cache of every node and is joined with the big table in the map phase. This provides two advantages. First, loading the small table into the cache saves read time on every data node.
Second, it avoids a skew join in Hive, since the join is already done in the map phase for every block of the data.
♦ Enable the Tez execution engine: running a Hive query on the MapReduce engine gives less performance than the Tez execution engine. To enable Tez execution in Hive:
Syntax : hive > set hive.execution.engine=tez;
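The compression and join optimizations mentioned above are enabled the same way, through set commands. A sketch of commonly used properties (values are illustrative):
hive > set hive.exec.compress.output=true;
hive > set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
hive > set hive.auto.convert.join=true;
hive > set hive.optimize.skewjoin=true;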
What is the execution engine?
ANS : The execution engine is basically the component that provides the platform for running Hive operations, such as a Hive query against a table. We can run two types of engines: MapReduce and Tez.
By default, the Hive execution engine is MapReduce. To set it to Tez:
Syntax : hive > Set hive.execution.engine=tez;
What is the Primary key?
ANS: The primary key is a constraint which is basically used to enforce that a unique value is inserted into the table. It uniquely identifies the records in the table.
Syntax : <column_name> <data_type> PRIMARY KEY
Example :
Create table employee
(
Eid varchar(20) primary key,
Name varchar(20),
Age int
);
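The example above is generic SQL. In Hive itself (from version 2.1), primary keys are informational only and must be declared DISABLE NOVALIDATE; a sketch of the equivalent Hive table:
Create table employee
(
Eid varchar(20),
Name varchar(20),
Age int,
primary key (Eid) disable novalidate
);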
What is Vectorization?
ANS: Vectorization is used in Hive to improve query optimization and to reduce CPU usage, typically for filtering, aggregation, joins, and scans. It basically processes a batch of 1024 rows at a time rather than individual rows of the table.
To enable vectorization:
hive > set hive.vectorized.execution.enabled=true;
What is ORC & REGEX?
ANS: ORC stands for Optimized Row Columnar; it is a file format through which we can persist our data. Among all the other formats, ORC is referred to as the best for processing because it compresses, i.e. reduces, the size of the data by up to 75%. Compared with the Sequence File, Text File, and RC File formats, ORC shows better performance. REGEX in Hive usually refers to the RegexSerDe, which parses each input row by matching it against a regular expression.
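A minimal sketch of persisting a table in ORC (the table name is illustrative, reusing the earth table from above):
Example : create table earth_orc stored as orc as select * from earth;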
In which cases are bucketing and partitioning used individually?
ANS : Partitioning is done on a table to optimize query performance. When we have large data sets and need to run a query against them, it takes a long time to scan and filter out the required data. To reduce the CPU time, we partition the table so that the records are split into multiple partitions. Hence, when we write a query to get data from the table, only the required partition of the table is queried.
Bucketing
Bucketing is used to provide equally sized divisions within the partitions of a table. Suppose we have a large data set and partition the table on some fields, but after partitioning the partition sizes do not match expectations and remain huge. To overcome this issue, Hive provides the bucketing concept, which basically allows the user to divide the table into a more manageable format.
Syntax : create table txnrecsByCat(txnno INT, txndate STRING, cusno INT, amount DOUBLE, product STRING, city STRING, state STRING, spendby STRING) partitioned by (category STRING) clustered by (state) INTO 10 buckets row format delimited fields terminated by ',' stored as textfile;
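On older Hive versions (before 2.0), bucketing also had to be enforced explicitly before loading data into a bucketed table; a sketch:
hive > set hive.enforce.bucketing=true;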
What are Hive limitations?
ANS: Some of the limitations of Hive are:
♦ It does not allow the user to insert, update, and delete records at the row level.
It only provides the option to drop the table; deleting individual records won't work, because behind the scenes Hive works with files in HDFS.
♦ Hive takes less time to load data into a table because of the "schema on read" property, but it takes longer when querying data from the table because the data is verified against the schema at query time.
♦ It leads to performance degradation when using the ACID transaction properties introduced in Hive 0.14.
♦ It does not support triggers.
How to define the distributed cache memory size in a map-side join?
ANS: A map-side join is used in Hive to speed up query execution when multiple tables are involved in a join: the small table is stored in memory (the distributed cache) and the join is done in the map phase of the MapReduce job, which is faster than a normal join since no reducers are necessary.
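The size threshold for the small table that goes into the distributed cache is defined through Hive properties; a sketch with the common default (size in bytes):
hive > set hive.auto.convert.join=true;
hive > set hive.mapjoin.smalltable.filesize=25000000;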
How to add a column to an existing table?
ANS : We can easily modify an existing table to add a column.
Syntax : ALTER TABLE <table_name> ADD COLUMNS (<column_name> <data_type>);
Example : ALTER TABLE employee ADD COLUMNS (Marital_status varchar(20));
Can we implement bucketing on multiple columns?
ANS: Yes, we can easily implement bucketing on multiple columns of a table where needed, because bucketing is used within the partitions of a table to keep the data in a more manageable form; the clustered by clause simply takes a comma-separated list of columns.
Syntax :
create table txnrecsByCat(txnno INT, txndate STRING, cusno INT, amount DOUBLE, product STRING, city STRING, state STRING, spendby STRING) partitioned by (category STRING) clustered by (state, city) INTO 10 buckets row format delimited fields terminated by ',' stored as textfile;
Where are bad records stored in bucketing?
ANS: This is basically the situation where, after a text file (data) is loaded into a Hive table, one of the tuples in the table is found to be null, i.e. the data is not present in the text file. Such records are considered bad records, and they are kept in a separate bucket.