Hive interview questions and answers
Hive interview questions and answers: are you looking for the best interview questions on Hive, or hunting for the best platform that provides a list of top-rated Hive interview questions and answers? Then stop hunting and follow the Best Big Data Training Institute for this list of top-rated Hive interview questions and answers, which is useful for both freshers and experienced candidates.
We, Prwatech, India's leading Hadoop training institute, have listed some of the best top-rated Hive interview questions and answers that interviewers commonly ask candidates nowadays. Follow the questions below and crack any kind of interview easily.
Are you eager to become a certified Hadoop developer? Then ask your industry-certified, experienced Hadoop trainer for more detailed information. Don't just dream of becoming a pro developer; achieve it by learning the Hadoop course under a world-class trainer. Follow the Hive interview questions for Hadoop admins below to crack any type of interview you face.
Q1. What is Apache Hive?
Ans: Apache Hive is a data warehouse tool used to process structured data in the Hadoop ecosystem. It is built on top of Hadoop to summarize Big Data, and it makes querying and analyzing that data easy.
Q2: Hive Parquet File Format
Ans: Parquet is a column-oriented binary file format. Parquet is highly efficient for large-scale queries, especially queries that scan particular columns within a table. Parquet tables support compression codecs such as Snappy and gzip; Snappy is currently the default.
Create a Parquet table by specifying the 'STORED AS PARQUET' option at the end of a CREATE TABLE command.
Hive Parquet File Format Example
Below is the Hive CREATE TABLE command with the storage format specification:
CREATE TABLE parquet_table
(column_specs)
STORED AS PARQUET;
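For illustration, here is a minimal sketch of a concrete Parquet table; the table and column names below are made up and not part of the original answer:
-- Hypothetical employees table stored in Parquet format
CREATE TABLE employees_parquet (
emp_id INT,
name STRING,
salary DOUBLE
)
STORED AS PARQUET;
-- An existing text-format table could then be copied into it, for example:
-- INSERT OVERWRITE TABLE employees_parquet SELECT emp_id, name, salary FROM employees_text;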
Q3. In which scenarios do we use Pig, and in which Hive?
Ans: Pig: It is a tool used to take highly unstructured data and convert it into a meaningful form, for example taking randomly generated logs and converting them into a comma-separated format where each field means something.
A Pig script runs a MapReduce job on the dataset and converts it into another dataset. So, if you don't want to write a MapReduce job and your need is basic enough to be handled without a complex MapReduce job, you can go ahead with Pig to convert gibberish into some sensible format.
Hive: Uses HQL, which is similar to SQL. It allows you to perform SQL-like actions on data sets stored in Hadoop, data in HBase (via a connector), or any other data. Again, if you plan to use Hive, the data should be structured in some way.
Q4: How do you analyze web logs using the regex SerDe in Hive?
Use double backslashes ('\\') and '.*' at the end (it's important!):
CREATE EXTERNAL TABLE access_log (
`ip` STRING,
`time_local` STRING,
`method` STRING,
`uri` STRING,
`protocol` STRING,
`status` STRING,
`bytes_sent` STRING,
`referer` STRING,
`useragent` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='^(\\S+) \\S+ \\S+ \\[([^\\[]+)\\] "(\\w+) (\\S+) (\\S+)" (\\d+) (\\d+) "([^"]+)" "([^"]+)".*'
)
STORED AS TEXTFILE
LOCATION '/tmp/access_logs/';
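Once the table is created, the raw log files can be queried directly. As an illustrative sketch (this query is not part of the original answer):
-- Count requests per HTTP status code from the parsed access log
SELECT status, COUNT(*) AS requests
FROM access_log
GROUP BY status
ORDER BY requests DESC;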
Q5. How do you convert CSV data into Avro for loading into Hive?
1. Create a Hive table stored as TEXTFILE
USE test;
CREATE TABLE csv_table (
Student_id INT,
Subject_id INT,
Marks INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
2. Load csv_table with student.csv data
LOAD DATA LOCAL INPATH '/path/to/student.csv' OVERWRITE INTO TABLE test.csv_table;
3. Create another Hive table using AvroSerDe
CREATE TABLE avro_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.literal'='{
"namespace": "com.rishav.avro",
"name": "student_marks",
"type": "record",
"fields": [ { "name":"student_id","type":"int"}, { "name":"subject_id","type":"int"}, { "name":"marks","type":"int"}]
}');
4. Load avro_table with data from csv_table
INSERT OVERWRITE TABLE avro_table SELECT student_id, subject_id, marks FROM csv_table;
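To verify the conversion, a quick illustrative check can be run:
-- Confirm the Avro-backed table now holds the CSV rows
SELECT * FROM avro_table LIMIT 5;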
Q6. What is the difference between LOAD and INSERT in Hive?
Ans:
Load: Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move data files into locations corresponding to Hive tables.
Insert: Query results can be inserted into tables by using the INSERT clause.
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] …) select_statement FROM from_statement;
With LOAD, all the data in the file is copied into the table; with INSERT, you can put data based on some condition.
INSERT OVERWRITE will overwrite any existing data in the table or partition, while INSERT INTO will append to the table or partition, keeping the existing data.
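As a minimal sketch contrasting the two (the table and file names are made up for illustration):
-- LOAD: move the file as-is into the table location, no transformation
LOAD DATA LOCAL INPATH '/tmp/sales_2020.csv' INTO TABLE sales_staging;
-- INSERT: write query results, optionally filtered, into another table
INSERT INTO TABLE sales_clean
SELECT * FROM sales_staging WHERE amount IS NOT NULL;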
Q7: How do you implement word count in Hive?
Ans:
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_count AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
SELECT * FROM word_count;
Q8: What is a Hive UDF, what types of UDFs exist, what is a GenericUDF, and which function do you override when you write a UDF?
Ans: Hive supports regular UDFs, UDAFs (user-defined aggregate functions), and UDTFs (user-defined table-generating functions). For a regular UDF (a class extending org.apache.hadoop.hive.ql.exec.UDF) you override the evaluate() method. A GenericUDF (extending org.apache.hadoop.hive.ql.udf.generic.GenericUDF) can handle complex and non-primitive types and overrides initialize(), evaluate(), and getDisplayString().
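Registering and calling a custom UDF from HiveQL looks roughly like this; the jar path, class name, and function name are hypothetical:
-- Register the jar containing the compiled UDF class (hypothetical path and class)
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';
-- Use it like any built-in function
SELECT my_lower(name) FROM employees;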
Q9: Can you write a Hive query to remove duplicate records from a table?
Ans: Yes. One option is to use DISTINCT in a SELECT query and overwrite the table with the de-duplicated result.
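A minimal sketch, assuming a table named my_table (the name is illustrative):
-- Rewrite the table with only distinct rows
INSERT OVERWRITE TABLE my_table
SELECT DISTINCT * FROM my_table;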
Q10. Explain the ORC file format in Hive.
Ans: Using the ORC file format is one of the Hive optimization techniques: choosing an appropriate file format for the data drastically increases query performance. ORC stands for Optimized Row Columnar, which means data can be stored in a more optimized way than with other file formats.
To be more specific, ORC reduces the size of the original data by up to 75%, so data processing speed also increases. Compared to the Text, Sequence, and RC file formats, ORC shows better performance. It stores rows of data in groups, called stripes, along with a file footer. Therefore, when Hive processes the data, the ORC format improves performance.
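A minimal illustrative sketch of an ORC table (the table and columns are made up):
-- Store the table in ORC format to get columnar storage and compression
CREATE TABLE employees_orc (
emp_id INT,
name STRING,
salary DOUBLE
)
STORED AS ORC;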
Q11: Explain Vectorization in Hive.
Ans: Vectorization in Hive is an optimization technique used to improve the performance of operations such as scans, aggregations, filters, and joins. It works by performing them in batches of 1024 rows at once instead of a single row at a time.
This feature was introduced in Hive 0.13. It significantly improves query execution time and is easily enabled with two parameter settings:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Q12: What is the difference between external and internal (managed) tables in Hive?
Ans: Here is the key difference between an external table and managed table:
1. If one drops a managed table, the metadata information along with the table data is deleted from the Hive warehouse directory.
2. On the contrary, in the case of an external table, Hive just deletes the metadata information regarding the table and leaves the table data present in HDFS untouched.
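A minimal sketch of an external table definition (the path and columns are illustrative):
-- Dropping this table removes only the metadata; the files under LOCATION remain
CREATE EXTERNAL TABLE logs_ext (
ip STRING,
url STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/';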
Q13: Where is the mapper's intermediate data stored in Hadoop?
Ans: It is stored in the temp directory that you specify in core-site.xml (a Hadoop configuration file). The contents of the directory are deleted once the MapReduce execution is over. The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node.
Q14: When should we use bucketing in Hive?
Ans: A map-side join requires the data belonging to a unique join key to be present in the same partition. But what about the cases where your partition key differs from the join key? In those cases, you can perform a map-side join by bucketing the table on the join key.
Bucketing also makes the sampling process more efficient and therefore allows us to decrease the query time.
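A minimal sketch of a bucketed table (the names and bucket count are illustrative):
-- Bucket on the join key so map-side joins and sampling work efficiently
CREATE TABLE orders_bucketed (
order_id INT,
customer_id INT,
amount DOUBLE
)
CLUSTERED BY (customer_id) INTO 8 BUCKETS;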
Q15: When should we use dynamic partitioning in Hive?
Ans: In static partitioning, every partition needs to be loaded with an individual Hive statement, which is not feasible for a large number of partitions as it would require writing a lot of Hive statements.
In that scenario, dynamic partitioning is suggested, as we can create as many partitions as needed with a single Hive statement.
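A minimal sketch of a dynamic-partition insert (the table and column names are illustrative):
-- Enable dynamic partitioning for this session
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
-- Hive creates one partition per distinct country value in the result
INSERT OVERWRITE TABLE sales_partitioned PARTITION (country)
SELECT order_id, amount, country FROM sales_staging;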
Q16: When should we partition a Hive table?
Ans: Without partitioning, Hive reads all the data in a directory and applies the query filters to it. This is slow and expensive since all the data has to be read. Partitioning is performed to reduce query execution time in Hive. Suppose you have data in terabytes or gigabytes and you want to filter it on specific columns; without partitioning, even a simple query with a WHERE clause makes Hive read the entire dataset.
When we run queries on large tables, this takes a lot of time and becomes a bottleneck. Partitioning overcomes this issue by distributing the table data on those specific columns in HDFS, which allows better query execution performance.
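A minimal sketch of a partitioned table (the names are illustrative):
-- Queries that filter on country read only the matching partition directories
CREATE TABLE sales_partitioned (
order_id INT,
amount DOUBLE
)
PARTITIONED BY (country STRING);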
Q17: What are the analytical functions in Hive?
Ans: sum(), count(), min(), max(), lead(), lag(), first_value(), last_value(), row_number(), rank(), dense_rank()
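A minimal sketch using a few of these functions (the table and columns are illustrative):
-- Rank employees by salary within each department
SELECT name, dept, salary,
rank() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
FROM employees;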
Q18: What is SerDe in Apache Hive?
Ans: SerDe is an acronym for Serializer/Deserializer. Hive uses the SerDe interface for IO; it handles both serialization and deserialization in Hive and interprets the results of deserialization as individual fields for processing.
Q19: Can a table be renamed in Hive?
Ans: Yes, using:
ALTER TABLE table_name RENAME TO new_name;
Q20: How do you check if a particular partition exists?
Ans: With the following query, we can check whether a particular partition exists in a table:
SHOW PARTITIONS table_name
PARTITION (partitioned_column='partition_value');
Q21: Where does the data of a Hive table get stored?
Ans: By default, Hive table data is stored in the HDFS directory /user/hive/warehouse. We can change this by setting the hive.metastore.warehouse.dir configuration parameter in hive-site.xml.
Q22: What is Hive Metastore?
Ans: The Hive Metastore is used to store metadata such as information about Hive databases, tables, partitions, and much more.
Q23: Why does Hive not store metadata information in HDFS?
Ans: Hive stores metadata information in the metastore using an RDBMS. We use an RDBMS to achieve low latency, because HDFS read/write operations are time-consuming.
Q24: Is it possible to change the default location of a managed table?
Ans: Yes, by using the LOCATION clause in the CREATE TABLE statement we can change the default location of a Hive managed table.
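A minimal sketch (the path is illustrative):
-- Managed table whose data lives outside the default warehouse directory
CREATE TABLE my_managed_table (
id INT,
name STRING
)
LOCATION '/user/custom/warehouse/my_managed_table';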
Q25: How Hive distributes the rows into buckets?
Ans: It uses a hash partitioner. Hive selects the bucket number for a row using the formula: hash_function(bucketing_column) modulo (num_of_buckets). The hash_function depends on the column data type.
For an integer data type, the hash_function is:
hash_function(int_type_column) = value of int_type_column.
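For example (an illustrative calculation, not from the original answer): with 4 buckets and an integer bucketing column user_id, a row with user_id = 11 goes to bucket 11 mod 4 = 3, while a row with user_id = 8 goes to bucket 8 mod 4 = 0.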
Q26: What are the different types of joins in Hive?
Ans: There are 4 different types of joins in HiveQL –
Inner join: A simple join.
Left Outer Join: All the rows from the left table are returned even if there are no matches in the right table.
Right Outer Join: Here all the rows from the right table are returned even if there are no matches in the left table.
Full Outer Join: Combines the records of both the left and right outer tables.
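A minimal sketch of one of these joins (the table and column names are illustrative):
-- Keep every customer, attaching order details where a match exists
SELECT c.customer_id, c.name, o.order_id
FROM customers c
LEFT OUTER JOIN orders o
ON c.customer_id = o.customer_id;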
Q27: What are the different components of a Hive architecture?
Ans: There are several components in the Hive architecture, such as:
User Interface – It calls the execute interface of the driver. The driver then creates a session handle for the query and sends the query to the compiler to generate an execution plan for it.
Metastore – It sends the metadata to the compiler for the execution of the query on receiving a metadata request.
Compiler – It generates the execution plan, which is a DAG of stages where each stage is either a metadata operation, a map or reduce job, or an operation on HDFS.
Execution Engine – It executes the plan by managing the dependencies between these stages and submitting each of them to the relevant component.
Q28: When should we use SORT BY instead of ORDER BY?
Ans: The SORT BY clause performs sorting using multiple reducers, whereas ORDER BY uses only one reducer, which becomes a bottleneck for large outputs.
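A minimal sketch contrasting the two (the table and column are illustrative):
-- Total ordering of all output, funneled through a single reducer
SELECT * FROM sales ORDER BY amount;
-- Sorted within each reducer's output only, but scales much better
SELECT * FROM sales SORT BY amount;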
Q29: Explain Hive Avro File format
Ans: Hive AVRO File Format
AVRO is an open-source project that provides data serialization and data exchange services for Hadoop. You can exchange data between the Hadoop ecosystem and a program written in any programming language. Avro is one of the popular file formats in Big Data Hadoop-based applications.
Create an AVRO file by specifying the ‘STORED AS AVRO’ option at the end of a CREATE TABLE Command.
Hive AVRO File Format Example
Below is the Hive CREATE TABLE command with storage file format specification:
CREATE TABLE avro_table
(column_specs)
STORED AS AVRO;
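For illustration, a concrete sketch with made-up columns (not part of the original answer):
-- Hypothetical employees table stored in Avro format
CREATE TABLE employees_avro (
emp_id INT,
name STRING,
salary DOUBLE
)
STORED AS AVRO;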