This topic contains 1 reply, has 1 voice, and was last updated by  Roopa P Karkera 5 months, 3 weeks ago.

    #3301

    Roopa P Karkera

    1. Data modification in HIVE

    ALTER TABLE name RENAME TO new_name
    ALTER TABLE name ADD COLUMNS (col_spec[, col_spec …])
    ALTER TABLE name DROP [IF EXISTS] PARTITION (partition_spec) — note that most Hive versions have no direct DROP COLUMN; columns are removed with REPLACE COLUMNS instead
    ALTER TABLE name CHANGE column_name new_name new_type
    ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec …])

    The UPDATE statement was introduced in Hive 0.14. It works only on tables that support ACID transactions (stored as ORC with transactional properties enabled).

    Syntax is:

    UPDATE tablename SET column = value [, column = value …] [WHERE expression]
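    As a sketch, assuming a hypothetical `employee` table with hypothetical columns, the statements above could be used like this:

    ```sql
    -- Rename a table and add a column (table/column names are hypothetical)
    ALTER TABLE employee RENAME TO employee_info;
    ALTER TABLE employee_info ADD COLUMNS (dept STRING COMMENT 'Department name');

    -- Change a column's name and type
    ALTER TABLE employee_info CHANGE salary salary_inr BIGINT;

    -- UPDATE requires an ACID (transactional, ORC-backed) table in Hive 0.14+
    UPDATE employee_info SET salary_inr = salary_inr + 1000 WHERE dept = 'HR';
    ```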

    2. Incremental data

    There are two ways you can handle incremental data in Hive:
    1. Partition the target table by date, and load each increment into its own partition.
    2. Use a temporary (staging) table: before inserting the increment (say, one day's data) into the main table, load that day's data into the temporary table. Then run the following query, which inserts the day's data from the temporary table into the main table:
    "insert into table main_table select * from temporary_table;"
    After that, truncate the temporary table so that it is ready for the next day's data.
    These steps are usually maintained in a script.
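    A minimal sketch of the staging-table pattern, assuming hypothetical table names `main_table` and `staging_table` and a hypothetical input path:

    ```sql
    -- Load one day's increment into the staging table (path is hypothetical)
    LOAD DATA INPATH '/data/incoming/2020-01-01/' OVERWRITE INTO TABLE staging_table;

    -- Append the increment to the main table
    INSERT INTO TABLE main_table SELECT * FROM staging_table;

    -- Empty the staging table for the next day's load
    TRUNCATE TABLE staging_table;
    ```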

    When you design the data, create the logical partitions in hive tables.

    If your data is a time series, you can partition by year and month.

    For geography based data, create country or region level partitions.


    Partition :

    Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date, city, and department.
    Using partition, it is easy to query a portion of the data.

    Tables or partitions are sub-divided into buckets, to provide extra structure to the data that may be used for more efficient querying.
    Bucketing works based on the value of hash function of some column of a table.
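    For example (a sketch with hypothetical table and column names), a table can be partitioned and bucketed at creation time:

    ```sql
    -- Partition by country and bucket rows by employee id into 32 buckets
    CREATE TABLE employee_part (
      eid INT,
      name STRING,
      salary BIGINT
    )
    PARTITIONED BY (country STRING)
    CLUSTERED BY (eid) INTO 32 BUCKETS
    STORED AS ORC;
    ```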

    3. Managed table or external table in production

    For External Tables –

    An external table stores its files on the HDFS server, but the table is not completely tied to the source files.
    If you delete an external table, the files still remain on the HDFS server.
    As an example, if you create an external table called "table_test" in Hive using HiveQL and link it to the file "file", then deleting "table_test" from Hive will not delete "file" from HDFS.
    External table files are accessible to anyone who has access to the HDFS file structure, so security needs to be managed at the HDFS file/folder level.
    Metadata is maintained on the master node, and deleting an external table from Hive deletes only the metadata, not the data/files.

    For Managed tables -
    Managed tables are stored in a directory based on the hive.metastore.warehouse.dir setting; by default this is "/user/hive/warehouse". You can change it by updating the location in the config file.
    Deleting the table deletes the metadata and data from master-node and HDFS respectively.
    Internal table file security is controlled solely via HIVE. Security needs to be managed within HIVE, probably at the schema level (depends on organization).
    Hive may have internal (managed) or external tables. This is a choice that affects how data is loaded, controlled, and managed.

    Use EXTERNAL tables when:

    The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
    Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schema (tables or views) at a single data set or if you are iterating through various possible schema.
    Hive should not own the data or control settings, directories, etc.; another program or process may handle those things.
    You are not creating a table based on an existing table (AS SELECT).

    Use MANAGED tables when:
    The data is temporary.
    You want Hive to completely manage the life-cycle of the table and data.
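    The difference can be sketched as follows (paths and names are hypothetical):

    ```sql
    -- Managed table: Hive owns the data under hive.metastore.warehouse.dir;
    -- DROP TABLE removes both the metadata and the files.
    CREATE TABLE logs_managed (line STRING);

    -- External table: Hive only records the location; DROP TABLE removes
    -- the metadata but leaves the files at /data/logs untouched.
    CREATE EXTERNAL TABLE logs_external (line STRING)
    LOCATION '/data/logs';
    ```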

    4. What is SerDE and its application?
    SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO. The interface handles both serialization and deserialization and also interpreting the results of serialization as individual fields for processing.
    A SerDe allows Hive to read in data from a table, and write it back out to HDFS in any custom format. Anyone can write their own SerDe for their own data formats.
    https://cwiki.apache.org/confluence/display/Hive/SerDe
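    A SerDe is attached to a table with the ROW FORMAT SERDE clause. As a sketch (table and column names hypothetical), using the OpenCSVSerde that ships with Hive:

    ```sql
    -- Parse CSV input with a built-in SerDe instead of the default LazySimpleSerDe
    CREATE TABLE csv_table (col1 STRING, col2 STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
      "separatorChar" = ",",
      "quoteChar"     = "\""
    )
    STORED AS TEXTFILE;
    ```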

    5. Object inspector?

    ObjectInspector helps us to look into the internal structure of a complex object. A (probably configured) ObjectInspector instance stands for a specific type and a specific way to store the data of that type in the memory.
    For a native Java object, we can directly access the internal structure through member fields and methods. ObjectInspector is a way to delegate that functionality away from the object, so that we have more control over the behavior of those actions.
    An efficient implementation of ObjectInspector should rely on a factory, so that we can make sure the same ObjectInspector has only one instance.
    That also makes sure the hashCode() and equals() methods of java.lang.Object work directly for ObjectInspector as well.

    From the ObjectInspector javadoc:

    ObjectInspector.Category getCategory() — an ObjectInspector must implement one of the following interfaces depending on what getCategory() returns: PRIMITIVE: PrimitiveObjectInspector; LIST: ListObjectInspector; MAP: MapObjectInspector; STRUCT: StructObjectInspector.
    String getTypeName() — returns the name of the data type that is inspected by this ObjectInspector.

    https://hive.apache.org/javadocs/r0.10.0/api/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspector.html

    6. What are the different factors to achieve performance tuning in Hive?

    Write good SQL:
    SQL is a powerful declarative language. Like other declarative languages, there is more than one way to write a SQL statement. Although each statement’s functionality is the same, it may have strikingly different performance characteristics.

    De-normalizing data:
    Normalization is a standard process used to model your data tables with certain rules to deal with redundancy of data and anomalies. In simpler words, if you normalize your data sets, you end up creating multiple relational tables which can be joined at the run time to produce the results.
    Joins are expensive and difficult operations to perform, and they are one of the common reasons for performance issues.
    Because of that, it’s a good idea to avoid highly normalized table structures because they require join queries to derive the desired metrics.

    Use ApacheTEZ:

    Hive can use the Apache Tez execution engine instead of the venerable Map-reduce engine.
    If it is not turned on by default in your environment, switch to Tez by adding the following setting at the beginning of your Hive query:
    set hive.execution.engine=tez;

    Use Vectorization:

    Vectorized query execution improves the performance of operations like scans, aggregations, filters, and joins by performing them in batches of 1024 rows at once instead of a single row at a time. Introduced in Hive 0.13, this feature significantly improves query execution time and is easily enabled with two parameter settings:
    set hive.vectorized.execution.enabled = true;
    set hive.vectorized.execution.reduce.enabled = true;

    Use ORC File:

    Hive supports ORCfile, a new table storage format that sports fantastic speed improvements through techniques like predicate push-down, compression and more.
    Using ORCFile for every HIVE table is extremely beneficial to get fast response times for your HIVE queries.
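    For instance (a sketch, table and column names hypothetical), a table is stored as ORC simply by declaring the format:

    ```sql
    -- Store the table as ORC with Snappy compression
    CREATE TABLE employee_orc (
      eid INT,
      name STRING,
      salary BIGINT
    )
    STORED AS ORC
    TBLPROPERTIES ("orc.compress" = "SNAPPY");
    ```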

    https://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/

    7. What is execution engine?
    Execution Engine :
    The Hive execution engine is the bridge between the HiveQL process engine and MapReduce. It processes the query and generates the same results as MapReduce would, using the MapReduce paradigm.

    Hive can use the Apache Tez execution engine instead of the venerable Map-reduce engine.
    If it is not turned on by default in your environment, switch to Tez by adding the following setting at the beginning of your Hive query:
    set hive.execution.engine=tez;

    9. Sample table

    Create Table Statement :

    CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name

    [(col_name data_type [COMMENT col_comment], …)]
    [COMMENT table_comment]
    [ROW FORMAT row_format]
    [STORED AS file_format]

    Ex :
    CREATE TABLE IF NOT EXISTS employee (eid INT, name STRING, salary STRING, destination STRING)
    COMMENT 'Employee details'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    Load Data Statement :

    LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
    [PARTITION (partcol1=val1, partcol2=val2 …)]

    LOCAL is an optional identifier that specifies a path on the local file system rather than HDFS.
    OVERWRITE is optional; it replaces any existing data in the table.
    PARTITION is optional.

    LOAD DATA LOCAL INPATH '/home/user/sample.txt' -- file that holds the employee details
    OVERWRITE INTO TABLE employee;

    10. Vectorization

    Vectorized query execution improves the performance of operations like scans, aggregations, filters, and joins by performing them in batches of 1024 rows at once instead of a single row at a time. Introduced in Hive 0.13, this feature significantly improves query execution time and is easily enabled with two parameter settings:
    set hive.vectorized.execution.enabled = true;
    set hive.vectorized.execution.reduce.enabled = true;

    11. ORC and REGEX
    ORC :
    The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data.
    It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.
    Compared with RCFile format, for example, ORC file format has many advantages such as:
    a single file as the output of each task, which reduces the NameNode’s load
    Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
    light-weight indexes stored within the file
    skip row groups that don’t pass predicate filtering
    seek to a given row
    block-mode compression based on data type
    run-length encoding for integer columns
    dictionary encoding for string columns
    concurrent reads of the same file using separate RecordReaders
    ability to split files without scanning for markers
    bound the amount of memory needed for reading or writing
    metadata stored using Protocol Buffers, which allows addition and removal of fields
    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

    12.can we change the file format once table is created?

    Yes, with a caveat: ALTER TABLE name SET FILEFORMAT format changes the declared format, but it only updates the table metadata; existing data files are not converted, so old data must be rewritten in the new format (for example with INSERT OVERWRITE, or a CREATE TABLE ... AS SELECT into a new table). Note also that the table definition exists independently of the data, so if an external table is dropped, the HDFS folders and files remain in their original state.
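    As a sketch (table names hypothetical), the declared format can be changed in place with ALTER TABLE, or the data can be rewritten into a table created with the new format:

    ```sql
    -- Changes metadata only; existing data files are NOT converted
    ALTER TABLE employee SET FILEFORMAT ORC;

    -- Safer: rewrite the data into a table declared with the new format
    CREATE TABLE employee_orc STORED AS ORC AS
    SELECT * FROM employee;
    ```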

    13. In which cases do we use bucketing and partitioning individually? What are their limitations?

    Hive Partitioning:

    Partitioning divides a large amount of data into multiple slices based on the value of one or more table columns.
    Example: suppose we are dealing with a large employee table and often run queries with WHERE clauses that restrict the results to a particular country or department.
    For a faster query response, the Hive table can be PARTITIONED BY (country STRING, dept STRING).

    Pros:
    Distribute execution load horizontally
    Faster query execution when a partition holds a low volume of data.

    Cons:
    Possibility of too many small partition creations – too many directories.
    Effective only for low-volume data within a given partition; queries such as GROUP BY over a high volume of data still take a long time to execute.

    Hive Bucketing:

    Bucketing decomposes data into more manageable or equal parts.
    With partitioning, there is a possibility of creating many small partitions based on column values. With bucketing, you restrict the data to a fixed number of buckets, defined in the table creation script.

    Pros:
    Because each bucket holds a roughly equal volume of data, map-side joins will be quicker.
    Faster query response, as with partitioning.

    Cons:
    You can define the number of buckets during table creation, but loading the data so that each bucket actually receives an equal volume has to be handled by the programmer.
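    A sketch of loading a bucketed table (table and column names hypothetical); in older Hive versions, hive.enforce.bucketing must be set so the insert actually produces one file per bucket:

    ```sql
    CREATE TABLE emp_bucketed (eid INT, name STRING)
    CLUSTERED BY (eid) INTO 8 BUCKETS
    STORED AS ORC;

    -- In Hive versions before 2.0 this must be set so the insert
    -- respects the declared bucket count
    SET hive.enforce.bucketing = true;

    INSERT OVERWRITE TABLE emp_bucketed
    SELECT eid, name FROM employee;
    ```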

    https://www.linkedin.com/pulse/hive-partitioning-bucketing-examples-gaurav-singh

    #3314

    Roopa P Karkera

    NASA Project :

    1.create database prwatech;

    2.show databases;

    3.use prwatech;

    4. hadoop dfs -copyFromLocal '/home/cloudera/Desktop/NASA/Meteorite_Landings.csv' /

    5.
    CREATE TABLE meteorite_landings (name string, id int, nametype string, recclass string, mass_g int, fall string, year string, reclat string, reclong string, GeoLocation string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    TBLPROPERTIES ("skip.header.line.count"="1");

    [I/P :
    name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation (address),GeoLocation (zip),GeoLocation (city),GeoLocation,GeoLocation (state) ]

    6. LOAD DATA INPATH '/Meteorite_Landings.csv' OVERWRITE INTO TABLE meteorite_landings;

    1) Number of recordings grouped by recclass.

    select recclass,count(recclass) from meteorite_landings group by recclass;

    2) Heaviest and Lightest Meteorite landing recorded.

    select max(mass_g),min(mass_g) from meteorite_landings;

    3) Average mass of meteorites:
    select avg(mass_g) from meteorite_landings;
    6261.665678524374

Reply To: HIVE-Assignment