Hive interview questions and answers: are you looking for the best interview questions on Hive? Or hunting for the best platform that provides a list of top-rated Hive interview questions and answers? Then stop hunting and follow the Best Big Data Training Institute for a list of top-rated Hive interview questions and answers that are useful for both freshers and experienced candidates.
We at Prwatech, India's leading Hadoop training institute, have listed some of the best top-rated Hive interview questions and answers that interviewers most often ask candidates nowadays. So follow the below-mentioned Hive interview questions and crack any kind of interview easily.
Are you hungry to become a certified pro Hadoop developer? Then ask your industry-certified, experienced Hadoop trainer for more detailed information. Don't just dream of becoming a pro developer; achieve it by learning the Hadoop course under a world-class trainer. Follow the below-mentioned Hive interview questions for Hadoop admins to crack any type of interview you face.
Ans: Hive is a data warehouse tool used to process structured data in the Hadoop ecosystem. It is built on top of Hadoop to summarize Big Data, and it makes querying and analyzing that data easy.
Ans: Parquet is a column-oriented binary file format. It is highly efficient for large-scale queries, especially queries that scan particular columns within a table. A Parquet table can use compression codecs such as Snappy or gzip; Snappy is currently the default.
Create a Parquet table by specifying the 'STORED AS PARQUET' option at the end of a CREATE TABLE command.
Hive Parquet File Format Example
Below is the Hive CREATE TABLE command with the storage format specification:
CREATE TABLE parquet_table
(column_specs)
STORED AS PARQUET;
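For instance, a minimal sketch of a Parquet-backed table with Snappy compression; the table name, columns, and the parquet.compression table property are illustrative assumptions, not taken from the original:
CREATE TABLE sales_parquet (
order_id INT,
amount DOUBLE,
order_date STRING)
STORED AS PARQUET
-- assumes the parquet.compression table property is honored by your Hive version
TBLPROPERTIES ('parquet.compression'='SNAPPY');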
Ans: Pig: It is a tool used to take highly unstructured data and convert it into a meaningful form, for example taking randomly generated logs and converting them into a comma-separated format where each field means something.
What a Pig script does is run a MapReduce job on the dataset and convert it into another dataset. So, if you don't want to write a MapReduce job yourself, and your need is basic enough to be handled without a complex MapReduce job, you can go ahead with Pig to convert gibberish into a sensible format.
Hive: Uses HQL, which is similar to SQL. It allows you to work on datasets stored in Hadoop, data in HBase (via a connector), or other data sources, and perform SQL-like actions on that data. Again, if you plan to use Hive, the data should be structured in some way.
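A minimal sketch of such a SQL-like action in Hive, assuming a hypothetical structured table named web_logs with a status column:
SELECT status, COUNT(*) AS hits
FROM web_logs
GROUP BY status;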
Use doubled backslashes ('\\') in the regex and '.*' at the end (it's important!):
CREATE EXTERNAL TABLE access_log (
`ip` STRING,
`time_local` STRING,
`method` STRING,
`uri` STRING,
`protocol` STRING,
`status` STRING,
`bytes_sent` STRING,
`referer` STRING,
`useragent` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='^(\\S+) \\S+ \\S+ \\[([^\\[]+)\\] "(\\w+) (\\S+) (\\S+)" (\\d+) (\\d+) "([^"]+)" "([^"]+)".*'
)
STORED AS TEXTFILE
LOCATION '/tmp/access_logs/';
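As a quick illustrative check (not part of the original example), the parsed fields can then be queried like any other columns:
SELECT status, COUNT(*) AS requests
FROM access_log
GROUP BY status;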
1. Create a Hive table stored as a text file
USE test;
CREATE TABLE csv_table (
Student_id INT,
Subject_id INT,
Marks INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
2. Load csv_table with student.csv data
LOAD DATA LOCAL INPATH '/path/to/student.csv' OVERWRITE INTO TABLE test.csv_table;
3. Create another Hive table using AvroSerDe
CREATE TABLE avro_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.literal'='{
"namespace": "com.rishav.avro",
"name": "student_marks",
"type": "record",
"fields": [ { "name":"student_id","type":"int"}, { "name":"subject_id","type":"int"}, { "name":"marks","type":"int"}]
}');
4. Load avro_table with data from csv_table
INSERT OVERWRITE TABLE avro_table SELECT student_id, subject_id, marks FROM csv_table;
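As an illustrative follow-up (not part of the original steps), the Avro-backed table can be queried like any other Hive table, using the columns from the schema above:
SELECT subject_id, AVG(marks) AS avg_marks
FROM avro_table
GROUP BY subject_id;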
Ans:
Load: Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move data files into locations corresponding to Hive tables.
Insert: Query results can be inserted into tables by using the INSERT clause.
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] …) select_statement FROM from_statement;
With LOAD, all the data in the file is copied into the table as-is; with INSERT, you can insert data selectively based on some condition.
INSERT OVERWRITE will overwrite any existing data in the table or partition, while INSERT INTO will append to the table or partition, keeping the existing data.
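A short sketch of the difference; the table and file names here are hypothetical:
-- LOAD copies/moves the file into the table's storage location as-is (no filtering)
LOAD DATA LOCAL INPATH '/path/to/orders.csv' INTO TABLE orders_raw;
-- INSERT lets you select, filter, and transform with a query first
INSERT INTO TABLE orders_filtered
SELECT * FROM orders_raw WHERE amount > 0;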
Ans: CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_count AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
SELECT * FROM word_count;
Ans: The evaluate() method; a custom Hive UDF class must override evaluate().
Ans: We have options like DISTINCT to use in a SELECT query.
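For example, to return each customer only once (table and column names are hypothetical):
SELECT DISTINCT customer_id FROM orders;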
Ans: ORC file format (Hive optimization technique): using an appropriate file format based on the data will drastically increase query performance. For improving query performance, the ORC file format is best suited. ORC stands for Optimized Row Columnar, which means we can store data in a more optimized way than with other file formats.
To be more specific, ORC can reduce the size of the original data by up to 75%, so data processing speed also increases. Compared to the Text, Sequence, and RC file formats, ORC shows better performance. It stores rows of data in groups called stripes, along with a file footer. Therefore, when Hive processes data in the ORC format, performance improves.
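A minimal sketch of an ORC-backed table; the table name, columns, and the orc.compress property are illustrative assumptions:
CREATE TABLE events_orc (
event_id BIGINT,
event_type STRING,
event_time TIMESTAMP)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');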
Ans: Vectorization in Hive (Hive optimization technique): to improve the performance of operations such as scans, aggregations, filters, and joins, we use vectorized query execution. It works by processing rows in batches of 1024 at a time instead of a single row at a time.
This feature was introduced in Hive 0.13. It significantly improves query execution time and is easily enabled with two parameter settings:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Ans: Here is the key difference between an external table and a managed table:
1. In the case of a managed table, if one drops the table, the metadata information along with the table data is deleted from the Hive warehouse directory.
2. On the contrary, in the case of an external table, Hive just deletes the metadata information regarding the table and leaves the table data present in HDFS untouched.
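A short sketch of the two cases; the table names and the HDFS path are hypothetical:
-- managed table: DROP TABLE removes both the metadata and the data files
CREATE TABLE managed_logs (line STRING);
-- external table: DROP TABLE removes only the metadata; files under LOCATION stay in HDFS
CREATE EXTERNAL TABLE external_logs (line STRING)
LOCATION '/data/external_logs/';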
Ans: It is stored in the temp directory specified in core-site.xml (a Hadoop configuration file). The contents of that directory are deleted once the MapReduce execution is over. The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node.
Ans: A map-side join requires the data belonging to a unique join key to be present in the same partition. But what about cases where your partition key differs from the join key? In those cases, you can still perform a map-side join by bucketing the tables on the join key.
Bucketing also makes sampling more efficient and therefore helps decrease query time.
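A hedged sketch of bucketing two tables on the join key and enabling the bucket map join; the table names, columns, and bucket count are illustrative assumptions:
CREATE TABLE orders_bucketed (order_id INT, customer_id INT, amount DOUBLE)
CLUSTERED BY (customer_id) INTO 32 BUCKETS;
CREATE TABLE customers_bucketed (customer_id INT, name STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS;
-- enable the bucketed map-side join optimization
SET hive.optimize.bucketmapjoin = true;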
Ans: In static partitioning, every partition needs to be loaded with an individual Hive statement, which is not feasible for a large number of partitions because it requires writing a lot of Hive statements.
In that scenario, dynamic partitioning is suggested, as we can create any number of partitions with a single Hive statement.
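A minimal sketch of a dynamic-partition insert; the table and column names are hypothetical:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
-- a single statement creates one partition per distinct country value
INSERT OVERWRITE TABLE sales_partitioned PARTITION (country)
SELECT order_id, amount, country FROM sales_staging;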
Ans: Without partitioning, Hive reads all the data in a directory and applies the query filters to it. This is slow and expensive, since all the data has to be read. Partitioning is done to reduce the time taken for query execution in Hive. Suppose you have data in terabytes or gigabytes and you want to filter it on specific columns; the problem without partitioning is that even a simple query with a WHERE clause makes Hive read the entire dataset.
When we run such queries on large tables, they take a long time and become a bottleneck. Partitioning overcomes this issue by splitting the table data by those specific columns into separate HDFS directories, which allows better query execution performance.
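An illustrative sketch of a partitioned table and a query that touches only one partition; the names are hypothetical:
CREATE TABLE page_views (user_id BIGINT, url STRING)
PARTITIONED BY (view_date STRING);
-- only the files under the view_date='2024-01-01' partition directory are read
SELECT COUNT(*) FROM page_views WHERE view_date = '2024-01-01';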
Ans: sum(), count(), min(), max(), lead(), lag(), first_value(), last_value(), row_number(), rank(), dense_rank()
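As a short illustration of one of these analytic functions, a ranking query over the student_marks columns used earlier (the query itself is an assumption, not from the original):
SELECT student_id, subject_id, marks,
rank() OVER (PARTITION BY subject_id ORDER BY marks DESC) AS subject_rank
FROM avro_table;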
Ans: SerDe is an acronym for Serializer/Deserializer. Hive uses the SerDe interface for IO; it handles both serialization and deserialization in Hive and also interprets the results of serialization as individual fields for processing.
Ans: Apply ALTER TABLE table_name RENAME TO new_name;
Ans: With the following query, we can check whether a particular partition exists in a table:
SHOW PARTITIONS table_name
PARTITION(partitioned_column='partition_value');
Ans: By default, Hive tables are stored in the HDFS directory /user/hive/warehouse. We can change this by setting the hive.metastore.warehouse.dir configuration parameter in hive-site.xml.
Ans: We use the metastore to store metadata such as information about Hive databases, tables, partitions, and much more.
Ans: Hive stores metadata information in the metastore using an RDBMS. We use an RDBMS to achieve low latency, because HDFS read/write operations are time-consuming.
Ans: By using the LOCATION '' clause, we can change the default location of a Hive managed table.
Ans: It uses a hash partitioner. Hive selects the bucket number for a row using the formula hash_function(bucketing_column) modulo (num_of_buckets), where hash_function depends on the column data type.
For example, the hash_function for an integer data type is simply:
hash_function(int_type_column) = value of int_type_column.
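For instance, assuming a table bucketed into 4 buckets on an INT column, a row whose bucketing column holds the value 11 would land in bucket 11 mod 4 = 3 (an illustrative calculation).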
Ans: There are 4 different types of joins in HiveQL –
Inner join: A simple join.
Left Outer Join: All the rows from the left table are returned even if there are no matches in the right table.
Right Outer Join: Here all the rows from the right table are returned even if there are no matches in the left table.
Full Outer Join: Returns all records from both the left and right tables, combining the results of both.
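An illustrative example of one of these joins; the tables and columns are hypothetical:
SELECT c.customer_id, c.name, o.order_id
FROM customers c
LEFT OUTER JOIN orders o
ON (c.customer_id = o.customer_id);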
Ans: There are several components of the Hive architecture, such as:
User Interface: It calls the execute interface of the driver. The driver then creates a session handle for the query and sends the query to the compiler to generate an execution plan for it.
Metastore: It sends the metadata to the compiler for the execution of the query on receiving a metadata request.
Compiler: It generates the execution plan, which is a DAG of stages where each stage is either a metadata operation, a map or reduce job, or an operation on HDFS.
Execution Engine: It manages the dependencies between these stages and submits each of them to the relevant component.
Ans: The SORT BY clause performs sorting using multiple reducers, so the data is sorted only within each reducer. On the other hand, ORDER BY sorts all of the data through a single reducer, and this can become a bottleneck.
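A short illustrative comparison, assuming a hypothetical page_views table:
-- ORDER BY: total ordering, forced through a single reducer
SELECT * FROM page_views ORDER BY user_id;
-- SORT BY: data is sorted only within each reducer, so multiple reducers can run in parallel
SELECT * FROM page_views SORT BY user_id;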
Ans: Hive AVRO File Format
Avro is an open-source project that provides data serialization and data exchange services for Hadoop. You can exchange data between the Hadoop ecosystem and programs written in any programming language. Avro is one of the popular file formats in Big Data Hadoop-based applications.
Create an Avro table by specifying the 'STORED AS AVRO' option at the end of a CREATE TABLE command.
Hive AVRO File Format Example
Below is the Hive CREATE TABLE command with storage file format specification:
CREATE TABLE avro_table
(column_specs)
STORED AS AVRO;