This topic contains 0 replies, has 1 voice, and was last updated by Manideep 2 months, 1 week ago.
- April 15, 2017 at 11:19 am #3233
1)difference between manage table and external table?
For External Tables ,Hive does not move the data into its warehouse directory. If the external table is dropped, then the table metadata is deleted but not the data.
For Internal tables , Hive moves data into its warehouse directory. If the table is dropped, then the table metadata and the data will be deleted.
###External Tables –
External table stores files on the HDFS server but tables are not linked to the source file completely.
If you delete an external table the file still remains on the HDFS server.
As an example if you create an external table called “table_test” in HIVE using HIVE-QL and link the table to file “file”, then deleting “table_test” from HIVE will not delete “file” from HDFS.
External table files are accessible to anyone who has access to HDFS file structure and therefore security needs to be managed at the HDFS file/folder level.
Meta data is maintained on master node and deleting an external table from HIVE, only deletes the metadata not the data/file.
Stored in a directory based on settings in hive.metastore.warehouse.dir, by default internal tables are stored in the following directory “/user/hive/warehouse” you can change it by updating the location in the config file .
Deleting the table deletes the metadata and data from master-node and HDFS respectively.
Internal table file security is controlled solely via HIVE. Security needs to be managed within HIVE, probably at the schema level (depends on organization).
There are two ways you can handle incremental data in hive:
a.You can use partition on the basis of the date for the table in which the incremental will be loaded.
b. You can use a temporary table in which before inserting the incremental data (suppose day by day in this case) in the main
table you have to insert the data per day into the temporary table. Then, you can run the following query which will
insert the data per day from the temporary table to the main table-
“insert into table main_table select * from temporary_table;”
3)manage table or external table in production?
When you drop an internal table, it drops the data, and it also drops the metadata.
When you drop an external table, it only drops the meta data. That means hive is ignorant of that data now. It does not touch
the data itself.
In Production, we go for the external table, since Hive does not move the data into its warehouse directory.
4)what is SerDe & its application
SerDe is a Library which is built-in to the Hadoop API.
Hive uses Files systems like HDFS or any other storage (FTP) to store data, data here is in the form of tables
SerDe – Serializer, Deserializer instructs hive on how to process a record (Row). Hive enables semi-structured (XML, Email,
etc) or unstructured records (Audio, Video, etc) to be processed also.
Serialise is done on write, the structured data is serialised into a bit/byte stream for storage. On read, the data is
deserialised from the bit/byte storage format to the structure required by the reader.
eg: Hive needs structures that look like rows and columns but hdfs stores the data in bit/byte blocks, so serialise on write,
deserialise on read.
Reference : https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java
5)what are the different factors to achive performance tuning in HIVE?
a) enable compression in hive:compress map/reduce output.Compression techniques significantly reduce the intermediate data
volume, which internally reduces the amount of data transfers between mappers and reducers.
b) Optimizing joins by enabling auto convert joins and optimization of skew joins
AutoMapJoins :useful feature when joining bigtable with small table. This is achieved by first loading small table
into cache will save read time on data node and then join the bigtable in map phase.
We can enable skew joins by setting property in hive-site.xml
c)partitioning table to improve query performance on large table. It allows data to store in separate sub directory under
Reference : https://www.qubole.com/blog/big-data/hive-best-practices/
tables can be created in two ways: internal tables (manage tables), external table
By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A
Creating Hive Tables:
CREATE TABLE pokes (foo INT, bar STRING); //creates a table called pokes with two columns
CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING); //creates a table called invites with two columns and
a partition column called ds.The partition column is a virtual column.
“base_table” that will include any delimited files located in HDFS under the ‘/user/hive/base_table’ directory
CREATE TABLE base_table (
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘,’
shows an external Hive table “incremental_table” that will include any delimited files with incremental change records,
located in HDFS under the ‘/user/hive/incremental_append’ directory:
CREATE EXTERNAL TABLE incremental_table (
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘,’
Vectorized query execution improves performance of operations like scans, aggregations, filters and joins, by performing them
in batches of 1024 rows at once instead of single row each time.
This feature significantly improves query execution time, and is easily enabled with two parameters settings:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
8) can we change the file format once table is created?
Once we store data in hdfs we cannot change the content of the file.
The table definition exists independent from the data, so that, if the table is dropped, the HDFS folders and files remain in
their original state.