
Hive Questions and Answers

Q.1 How is data modified in Hive?

ANS : Data modification in Hive can be done in several ways:

♦ INSERT : used to insert records (values) into a table.

Syntax (first create the target table) :

CREATE TABLE <table_name>(column_name_1 datatype, column_name_2 datatype, ...... column_name_n datatype) row format delimited fields terminated by ',' stored as textfile;

Example:

create table earth(name STRING, id INT, nametype STRING, recclass STRING, mass INT, Fall_h STRING, year TIMESTAMP, reclat BIGINT, reclong BIGINT, Geolocation STRING) row format delimited fields terminated by ',' stored as textfile;

♦ UPDATE : used to update records in an existing table

 

Syntax :

UPDATE <table_name> SET column_name = 'value' WHERE column_name = 'value';
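A minimal sketch using the earth table above; the values are made-up sample data, and note that UPDATE (and DELETE) only work on tables created with ACID/transactional properties in Hive 0.14 and later:

INSERT INTO TABLE earth VALUES ('Aachen', 1, 'Valid', 'L5', 21, 'Fell', '1880-01-01 00:00:00', 50, 6, '(50.775, 6.08333)');

UPDATE earth SET recclass = 'L6' WHERE id = 1;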

Q.2 Managed table and External table?

ANS : Hive tables are basically of two types:

♦ Managed table : stores and processes datasets in the default Hive warehouse directory.

Path : /user/hive/warehouse/prwatech.db/earth

To create a managed table we simply use a CREATE statement along with the file format.

Syntax :

create table earth(name STRING, id INT, nametype STRING, recclass STRING, mass INT, Fall_h STRING, year TIMESTAMP, reclat BIGINT, reclong BIGINT, Geolocation STRING) row format delimited fields terminated by ',' stored as textfile;

After creating the table, the data needs to be loaded into it; the syntax for loading the data is:

Syntax : LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/Meteorite_Landings.csv' OVERWRITE INTO TABLE earth; (LOCAL because the file is on the local file system rather than in HDFS)

If you drop a managed table, both its metastore entry and its data are deleted.

♦ External table :

For an external table we use the EXTERNAL keyword and specify the table's location at creation time.

Syntax :

create external table example_customer(customer STRING, firstname STRING, lastname STRING, age INT, profession STRING) row format delimited fields terminated by ',' LOCATION '/user/cloudera/external';

If we drop an external table, only its metastore entry is deleted; the physical data remains.
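To check which kind of table you have created, the table type can be inspected (a small illustrative check on the earth table above):

hive > DESCRIBE FORMATTED earth;
-- the output contains either "Table Type: MANAGED_TABLE" or "Table Type: EXTERNAL_TABLE"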

Q.3 What is SerDe and its application?

ANS : SerDe (Serializer/Deserializer) is an interface that tells Hive how to read records from, and write records to, the files behind a table. When a SELECT query is run on a table stored in HDFS, the InputFormat (for example TextInputFormat) and its RecordReader read the file and turn it into key-value records, one per line, until the last line of the file. Each record returned by the RecordReader is still in serialized form, so Hive passes it to SerDe.deserialize(), which, with the help of an ObjectInspector for the whole row and one for each field, maps the fields and converts the record into the row that is shown to the end user. On write, the reverse path goes through SerDe.serialize().
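As a sketch, a built-in SerDe can be chosen when creating a table; here the OpenCSVSerde is used to parse CSV files (the table and column names are illustrative, and this SerDe reads every column as a string):

CREATE TABLE earth_csv(name STRING, id STRING, recclass STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;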

Q.4 Object Inspector?

ANS : ObjectInspector is a class (in fact a Hive interface) that works together with the SerDe's serialize() and deserialize() methods: it describes the internal structure of a row of the table and of each of its fields so that the SerDe can navigate and convert them.

Q.5 What are the different factors to achieve performance tuning in Hive?

ANS : Some factors to achieve performance tuning in Hive are :

♦ Enable compression : keeping the table data in a compressed format such as Gzip or Bzip2 gives better performance than keeping it uncompressed.

♦ Optimized joins : we can easily improve join performance by enabling auto-convert map joins and enabling skew joins. The auto map join is a very powerful and useful feature when joining a small table with a big table: if we enable it, the small table is stored in the local cache of every node and joined with the big table in the map phase. This gives two advantages: loading the small table into the cache saves read time on every data node, and it avoids skew in the join, since the join is already done in the map phase for every block of data.

♦ Enable Tez execution engine : running a Hive query on the MapReduce engine gives lower performance than the Tez execution engine. To enable Tez execution in Hive:

Syntax : hive > set hive.execution.engine=tez;
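A minimal sketch of the corresponding session settings (the property names are the standard Hive ones; the values shown are common choices):

hive > SET hive.exec.compress.output=true;        -- compress query output
hive > SET hive.auto.convert.join=true;           -- enable auto-convert map joins for small tables
hive > SET hive.optimize.skewjoin=true;           -- handle skewed join keys
hive > SET hive.execution.engine=tez;             -- run on the Tez engine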

Q.6 What is execution engine?

ANS : The execution engine is the component that provides the platform for running Hive operations, such as Hive queries on a table. We can run Hive on two types of engines: MapReduce and Tez.

By default the Hive execution engine is MapReduce. To set it to Tez:

Syntax : hive > set hive.execution.engine=tez;

 Q.8 What is Primary key?

ANS : A primary key is a constraint which is used to enforce unique values in a column of the table; it uniquely identifies the records in the table.

Syntax : <column_name> <data_type> PRIMARY KEY

Example :

Create table employee

(

Eid varchar(20) primary key,

Name varchar(20),

Age int

);

Q.9 What is Vectorization?

ANS : Vectorization is used in Hive to improve query performance and to reduce CPU usage, typically for filtering, aggregation, joins and scans. It scans 1024 rows at a time rather than individual rows of the table.

To enable vectorization :

hive > SET hive.vectorized.execution.enabled=true;

Q.10 What is ORC & REGEX?

ANS : ORC stands for Optimized Row Columnar; it is a file format in which we can persist our data. Among the other formats ORC is regarded as the best for processing because it compresses the data, reducing its size by up to 75%. Compared with the Sequence file, Text file and RC file formats, ORC shows better performance. REGEX in Hive refers to the RegexSerDe, which parses each line of raw text into columns using a regular expression supplied as a SerDe property.
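As a sketch, an existing table can be copied into ORC format with a CREATE TABLE AS SELECT (the table names reuse the earth example above):

CREATE TABLE earth_orc STORED AS ORC AS SELECT * FROM earth;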

Q.11 In which cases are bucketing and partitioning used individually?

ANS : Partitioning is done on a table to optimize query performance. When we have large data sets and need to run queries against them, it takes a long time to scan and filter the required data. To reduce the CPU time we partition the table, so that the records are split into multiple partitions; then, when we write a query against the table, only the required partitions are scanned.

Bucketing

Bucketing is used to produce partitions of roughly equal size. Suppose we have a large data set and we partition the table on some fields, but after partitioning the partition sizes do not match expectations and remain huge. To overcome this issue Hive provides the bucketing concept: it allows the user to divide the table into more manageable parts.

Syntax : create table txnrecsByCat(txnno INT, txndate STRING, cusno INT, amount DOUBLE, product STRING, city STRING, state STRING, spendby STRING) partitioned by (category STRING) clustered by (state) INTO 10 buckets row format delimited fields terminated by ',' stored as textfile;
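A minimal sketch of loading such a partitioned, bucketed table from a plain staging table; the staging table name and its columns are assumptions, and on Hive 1.x bucketing also has to be enforced explicitly:

hive > SET hive.exec.dynamic.partition.mode=nonstrict;   -- allow dynamic partition values
hive > SET hive.enforce.bucketing=true;                  -- needed on Hive 1.x; always on from Hive 2.x
hive > INSERT OVERWRITE TABLE txnrecsByCat PARTITION (category)
       SELECT txnno, txndate, cusno, amount, product, city, state, spendby, category FROM txn_staging;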

Q.12 What are Hive limitations?

ANS : Some of the limitations of Hive are :

♦ It does not allow the user to insert, update or delete records at the row level. The only option it provides is to drop the table; deleting individual records won't work, because behind the scenes Hive works with files in HDFS.

♦ Hive takes less time to load data into a table because of its "schema on read" property, but it takes longer when querying the data, because the data is verified against the schema at query time.

♦ It leads to performance degradation when using the ACID/transaction features introduced in Hive 0.14.

♦ It does not support triggers.

Q.13 How to define distributed cache memory size in map side join?

ANS : A map-side join is used in Hive to speed up query execution when multiple tables are involved in a join: the small table is stored in memory (the distributed cache) and the join is done in the map phase of the MapReduce job. Such joins are faster than normal joins since no reducers are necessary.
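The size of the table that is treated as "small" (and therefore cached for a map join) is controlled by Hive settings; a sketch with the commonly used properties (the values shown are the usual defaults):

hive > SET hive.auto.convert.join=true;                            -- convert eligible joins to map joins
hive > SET hive.mapjoin.smalltable.filesize=25000000;              -- max size in bytes of the small table (about 25 MB)
hive > SET hive.auto.convert.join.noconditionaltask.size=10000000; -- combined small-table size allowed for a map-join task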

Q.14 How to add a column to an existing table?

ANS : We can easily modify an existing table to add a column.

Syntax : ALTER TABLE <table_name> ADD COLUMNS (<column_name> <data_type>);

Example : ALTER TABLE employee ADD COLUMNS (marital_status VARCHAR(20));

Q.15 Can we implement multiple columns in bucketing?

ANS : Yes, we can easily implement bucketing on multiple columns of a table where needed, because bucketing is used along with partitioning to keep the data in a more manageable form.

Syntax (here the table is bucketed on the two columns state and city) :

create table txnrecsByCat(txnno INT, txndate STRING, cusno INT, amount DOUBLE, product STRING, city STRING, state STRING, spendby STRING) partitioned by (category STRING) clustered by (state, city) INTO 10 buckets row format delimited fields terminated by ',' stored as textfile;

Q.16 Where will bad records be stored in bucketing?

ANS : This is the situation where, after a text file of data is loaded into the Hive table, one of the tuples is found to be null, i.e. that data was not present in the text file. Such records are considered bad records and are kept in a separate bucket.

 

Hadoop-Hive script with Oozie

♦ Create a dataset

♦ Start hive on terminal

♦ After opening Hive, create the database cricket_team, go to cricket_team and create the table ind_player

 

 

♦ Create table ind_player

♦ Load data into the table ind_player

♦ Write a Hive script and save it with the .hql extension (a sample script is shown after these steps)

♦ Save the .hql file in HDFS

♦ Go to hive/conf and copy hive-site.xml in HDFS

 

♦ Go to oozie dashboard

Go to Hue → Workflow → Editors → Workflow → Create

♦ Drag Hive in "Drop your action here"

♦ Put the Script file and Hive XML file path and click on save.

♦ Then submit

♦ Save and Run

♦ After Successful run, you will get this type of screen
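As a sketch, the Hive script saved with the .hql extension in the steps above might look like this; the column list and the HDFS data path are assumptions:

create database if not exists cricket_team;
use cricket_team;
create table if not exists ind_player(name STRING, role STRING, matches INT, runs INT) row format delimited fields terminated by ',' stored as textfile;
load data inpath '/user/cloudera/ind_player.csv' into table ind_player;
select * from ind_player;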

Hadoop-HBase Testcase #3

♦ Description: The basic objective of this project is to create a database for IPL player and their stats using HBase in such a way that we can easily extract data for a particular player on the basis of column in particular columnar family. Using this technique we can easily sort and extract data from our database using particular column as reference.

Create an HBase table for IPL player stats in such a way that it includes the following column families:

  1. Batsman
  2. Bowlers
  3. All-rounders

And add the following columns in each column family as per the query:

  1. Player id
  2. Jersey number
  3. Name
  4. DOB
  5. Handed
  6. Matches played
  7. Strike Rate
  8. Total Runs
  9. Country

Some of the use cases to extract data are mentioned below:

  1. Players info from particular country
  2. Players info with particular Jersey Number
  3. Run scored by player between particular player id
  4. Players info which are left handed
  5. Players from particular country which are left handed
  6. Players from particular country with their strike rate
  7. Total match played for particular jersey number
  8. Info of player using DOB
  9. Total run of player using Player ID
  10. Runs scored by Right Handed players

 

Solution :

♦ Create Ipl stats Database :

♦ Move the above database in HDFS :

♦ Create a table with following the columnar family in HBase :

♦ Move bulk data from HDFS into HBase (a sketch of the commands is shown after these steps) :

♦ Data after loading into HBase :
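As a rough sketch of the table creation and bulk load steps above, using the HBase shell and the ImportTsv tool; the HDFS path and the exact column mapping are assumptions:

create 'iplstat', 'Batsman', 'Bowlers', 'All-rounders'

$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,Batsman:name,Batsman:jersy_no,Batsman:dob,Batsman:handed,Batsman:matches_played,Batsman:strike_rate,Batsman:total_runs,Batsman:country iplstat /user/cloudera/iplstat.csv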

TEST CASE :

♦ 1 : Players info from particular country : Sorting player on the basis of their respective country .

Command : scan 'iplstat',{FILTER=>"SingleColumnValueFilter('Batsman','country',=,'binary:India')"}

 

♦ 2 : Players info with particular Jersey Number :Finding info about Players on the basis of their jersey Number

Command : scan 'iplstat',{FILTER=>"SingleColumnValueFilter('Batsman','jersy_no',=,'binary:18')"}

♦ 3 : Run scored by player between particular player id :Sorting players on the basis of run scored using player id

Command : scan 'iplstat',{COLUMN=>'Batsman:total_runs',LIMIT=>9,STARTROW=>'7'}

♦ 4 : Players info which are left handed : Finding player info on the basis of their batting hand

Command : scan 'iplstat',{FILTER=>"SingleColumnValueFilter('Batsman','handed',=,'binary:Left-handbat')"}

♦ 5 : Players from particular country which are left handed :Finding info about a player on the basis of their country and Left-handed batting

Command : scan 'iplstat',{FILTER=>"SingleColumnValueFilter('Batsman','country',=,'binary:India') AND SingleColumnValueFilter('Batsman','handed',=,'binary:Left-handbat')"}

 

♦ 6 : Players from particular country with their strike rate :Finding strike rate of a players on the basis of their country

Command : scan 'iplstat',{FILTER=>"SingleColumnValueFilter('Batsman','country',=,'binary:India') AND QualifierFilter(=,'binary:strike_rate')"}

 

♦ 7 : Total matches played for a particular jersey number : Finding the total matches played by a player on the basis of their jersey number

Command : scan 'iplstat',{FILTER=>"SingleColumnValueFilter('Batsman','jersy_no',=,'binary:7') AND QualifierFilter(=,'binary:matches_played')"}

 

 

♦ 8 : Info of player using DOB :Finding information about player on the basis of their DOB (Date of Birth)

Command : scan 'iplstat',{FILTER=>"SingleColumnValueFilter('Batsman','dob',=,'binary:12/12/1981')"}

♦ 9 : Total runs of a player using the player name : Using the player name to find the total runs scored by him.

Command : scan 'iplstat',{FILTER=>"SingleColumnValueFilter('Batsman','name',=,'binary:VKohli') AND QualifierFilter(=,'binary:total_runs')"}

 

♦ 10 : Runs scored and name of all Right Handed players :Finding the run scored and name for all right handed players.

Command : scan 'iplstat',{FILTER=>"SingleColumnValueFilter('Batsman','handed',=,'binary:Right-handbat') AND MultipleColumnPrefixFilter('name','total_runs')"}

PIG Questions and Answers

Q1 : What is pig?

Answer: Pig is an Apache open-source project which runs on Hadoop and provides an engine for parallel data flow on Hadoop. It includes a language called Pig Latin for expressing these data flows. It offers operations such as join, sort and filter, as well as the ability to write user-defined functions (UDFs) for processing, reading and writing data. Pig uses both HDFS and MapReduce, i.e. storage and processing.

Q2 : What is difference between pig and sql?

Answer: Pig Latin is a procedural version of SQL. Pig certainly has similarities with SQL, but more differences. SQL is a query language in which the user asks a question in query form: SQL states the answer that is wanted but does not say how to compute it. If a user wants to perform multiple operations on tables, they have to write multiple queries and use temporary tables for storing intermediate results; SQL supports subqueries, but many SQL users find subqueries confusing and difficult to form properly, and using them creates an inside-out design where the first step in the data pipeline is the innermost query. Pig is designed with a long series of data operations in mind, so there is no need to write the data pipeline as an inverted set of subqueries or to worry about storing data in temporary tables.

Q3 : How Pig differs from MapReduce

Answer: In MapReduce, the group-by operation is performed on the reducer side, while filtering and projection can be implemented in the map phase. Pig Latin provides standard operations similar to MapReduce, such as order by, filter and group by. We can analyse a Pig script, understand its data flows and find errors early. Pig Latin is also much cheaper to write and maintain than Java code for MapReduce.

Q4 : What are the execution modes of Pig?

Answer: Local mode: Pig operations are executed in a single JVM. MapReduce mode: execution is done on the Hadoop cluster.

 

Q5 : Differentiate Between Piglatin And Hiveql?

Answer : Some of the difference are :

  • It is necessary to specify the schema in HiveQL, whereas it is optional in PigLatin.
  • HiveQL is a declarative language, whereas PigLatin is procedural.
  • HiveQL follows a flat relational data model, whereas PigLatin has nested relational data model.

 

Q6 : What Is The Usage Of Foreach Operation In Pig Scripts?

Answer : The FOREACH operation in Apache Pig is used to apply a transformation to each element in a data bag, so that the corresponding action is performed to generate new data items.

Syntax : FOREACH data_bagname GENERATE exp1, exp2;
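A small illustrative sketch; the relation name, fields and input file are assumptions:

students = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, course:chararray, fee:int);
fees = FOREACH students GENERATE name, fee * 2 AS double_fee;
DUMP fees;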

 

Q7 : What is the difference between Group and COGROUP?

Answer : The GROUP operator is used to group the data in a single relation, while COGROUP groups two or more relations at once, combining aspects of GROUP and JOIN.
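A minimal sketch of the difference; the relations, fields and input files are assumptions:

A = LOAD 'students.txt' USING PigStorage(',') AS (id:int, name:chararray);
B = LOAD 'courses.txt' USING PigStorage(',') AS (id:int, course:chararray);
G = GROUP A BY id;              -- groups a single relation
C = COGROUP A BY id, B BY id;   -- groups two relations on the same key
DUMP C;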

 

Q8 : What do you mean by UNION and SPLIT operator?

Answer: By using the UNION operator we can merge the contents of two or more relations, and the SPLIT operator is used to divide a single relation into two or more relations.

 

Q9 : What Is A Udf In Pig?

Answer : If the in-built operators do not provide some functions then programmers can implement those functionalities by writing user defined functions using other programming languages like Java, Python, Ruby, etc. These User Defined Functions (UDF’s) can then be embedded into a Pig Latin Script.

 

Q10 : why should we use ‘filters’ in pig scripts?

Answer: Filters are similar to the WHERE clause in SQL; a filter contains a predicate. If that predicate evaluates to true for a given record, the record is passed down the pipeline; otherwise it is not. Predicates can contain comparison operators such as ==, >=, <= and !=; == and != can also be applied to maps and tuples.

A = LOAD 'inputs' AS (name, address);

B = FILTER A BY name matches 'Prwatech.*';

 

 

 

Hadoop-HBase Testcase #2

♦ Description: The basic objective of this project is to create a student database using HBase in such a way that we can easily extract data for a particular student on the basis of the column in particular columnar family. Using this technique we can easily sort and extract data from our database using a particular column as a reference.

 

♦ Create an HBase table for student query in such a way that it includes following columnar family :

  1. Weekdays_query
  2. Weekend_query
  3. Weekdays_demo
  4. Weekend_demo
  5. Weekdays_joining
  6. Weekend_joining

 

♦ And add the following column in each column family as per query :

  1. Student id
  2. Date of Query (DOQ)
  3. Name
  4. Contact number
  5. Email id
  6. Course
  7. Location
  8. Status

♦ Some of the use cases to extract data are mentioned below:

  1. Total number of Query for Particular date: Sorting student on the basis of the DOQ.
  2. Name of student between particular id: Sorting Student on the basis of id and course.
  3. Student details using particular student Id: Sorting Student on the basis of student Id.
  4. Student details for the particular course : Sorting Student on the basis of their choice of course.
  5. Student details using status: Sorting Student on the basis of their status.
  6. Name of student between particular id for a particular course: Sorting Student on the basis of id and course.
  7. Student enrolled for the demo: Sorting Student on the basis of demo scheduled.
  8. Student not confirmed for the session: Sorting Student on the basis of the status of not joining or no response.
  9. Student query for Pune/Bangalore location: Sorting Student on the basis of location.
  10. Student name and email info using student ID: Sorting Student on the basis of missing email info.

 

     Solution :

 

♦ Create student database:

♦ Move the above database in HDFS :

♦ Create a table with following the columnar family in HBase :

♦ Move bulk data from HDFS into HBase :

 

♦ Data after loading into HBase :

 

♦ Using HBase Shell :

TEST CASE :

1: Total number of Query for Particular date: Sorting student on the basis of the DOQ.

Command: scan 'Prwatech',{FILTER=>"SingleColumnValueFilter('Weekdays_query','DOQ',=,'binary:1stFeb')"}

 

♦ 2: Name of student between particular id: Sorting Student on the basis of id and course

Command : scan 'Prwatech',{COLUMN=>'Weekdays_query:name',LIMIT=>7, STARTROW=>'3'}

 

♦ 3: Student details using particular student Id: Sorting Student on the basis of student Id.

Command : scan 'Prwatech',{FILTER=>"(PrefixFilter('4'))"}

♦ 4: Student details for the particular course: Sorting Student on the basis of their choice of course.

Command : scan 'Prwatech',{FILTER=>"SingleColumnValueFilter('Weekdays_query','course',=,'binary:Hadoop')"}

 

♦ 5: Student details using status: Sorting Student on the basis of their status.

Command : scan 'Prwatech',{FILTER=>"SingleColumnValueFilter('Weekdays_query','status',=,'binary:Joining')"}

 

♦ 6: Name of student between particular id for a particular course: Sorting Student on the basis of id and course.

Command: scan 'Prwatech',{COLUMN=>'Weekdays_query:course',LIMIT=>7, STARTROW=>'3',FILTER=>"ValueFilter(=,'binary:Hadoop')"}

 

 

♦ 7: Student enrolled for the demo: Sorting Student on the basis of demo scheduled.

Command: scan 'Prwatech',{FILTER=>"SingleColumnValueFilter('Weekdays_query','status',=,'binary:Need Demo')"}

 

♦ 8: Student not confirmed for Session: Sorting Student on the basis of the status of not joining or no response.

Command: scan 'Prwatech',{FILTER=>"SingleColumnValueFilter('Weekdays_query','status',=,'binary:Not confirm')"}

 

♦ 9: Student query for Pune/Bangalore location: Sorting Student on the basis of location.

Command: scan 'Prwatech',{FILTER=>"SingleColumnValueFilter('Weekdays_query','location',=,'binary:Pune')"}

 

♦ 10: Student name and email info using student ID: Sorting Student on the basis of missing email info.

Command : scan 'Prwatech',{COLUMN=>'Weekdays_query',STARTROW=>'3',LIMIT=>5,FILTER=>"MultipleColumnPrefixFilter('name','emailid')"}

 

♦ Using Eclipse :

Case 1 : Total number of Query for Particular date: Sorting student on the basis of the DOQ.

package com.test.ap;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanTable {

    public static void main(String[] args) throws IOException {

        // Instantiating the Configuration class
        Configuration config = HBaseConfiguration.create();

        // Instantiating the HTable class
        HTable table = new HTable(config, "Prwatech");

        // Instantiating the Scan class and filtering on Weekdays_query:DOQ = '1stFeb'
        Scan scan = new Scan();
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
                Bytes.toBytes("Weekdays_query"), Bytes.toBytes("DOQ"),
                CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("1stFeb")));
        scan.setFilter(filter);

        // Reading values from the scan result
        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            for (KeyValue kv : result.raw()) {
                System.out.println("KV: " + kv + ", Value: " + Bytes.toString(kv.getValue()));
            }
        }

        // Closing the scanner and the table
        scanner.close();
        table.close();
    }
}

Case 2 : Name of student between particular id: Sorting Student on the basis of id and course.

package com.test.ap;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.InclusiveStopFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class Gaprow {

    public static void main(String[] args) throws IOException {

        // Instantiating the Configuration class
        Configuration config = HBaseConfiguration.create();

        // Instantiating the HTable class
        HTable table = new HTable(config, "Prwatech");

        // Scan rows with ids from '4' up to and including '7'
        Filter filter = new InclusiveStopFilter(Bytes.toBytes("7"));
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("4"));
        scan.setFilter(filter);

        // Reading values from the scan result
        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            for (KeyValue kv : result.raw()) {
                System.out.println("KV: " + kv + ", Value: " + Bytes.toString(kv.getValue()));
            }
        }

        // Closing the scanner and the table
        scanner.close();
        table.close();
    }
}

Case 3 : Student details using particular student Id: Sorting Student on the basis of student Id.

package com.test.ap;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class Rowfilter {

    public static void main(String[] args) throws IOException {

        // Instantiating the Configuration class
        Configuration config = HBaseConfiguration.create();

        // Instantiating the HTable class
        HTable table = new HTable(config, "Prwatech");

        // Keep only rows whose row key starts with '4'
        Filter filter = new PrefixFilter(Bytes.toBytes("4"));
        Scan scan = new Scan();
        scan.setFilter(filter);

        // Reading values from the scan result
        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            for (KeyValue kv : result.raw()) {
                System.out.println("KV: " + kv + ", Value: " + Bytes.toString(kv.getValue()));
            }
        }

        // Closing the scanner and the table
        scanner.close();
        table.close();
    }
}

Case 4: Student details for the particular course : Sorting Student on the basis of their choice of course.

package com.test.ap;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanTable {

    public static void main(String[] args) throws IOException {

        // Instantiating the Configuration class
        Configuration config = HBaseConfiguration.create();

        // Instantiating the HTable class
        HTable table = new HTable(config, "Prwatech");

        // Instantiating the Scan class and filtering on Weekdays_query:course = 'Hadoop'
        Scan scan = new Scan();
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
                Bytes.toBytes("Weekdays_query"), Bytes.toBytes("course"),
                CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("Hadoop")));
        scan.setFilter(filter);

        // Reading values from the scan result
        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            for (KeyValue kv : result.raw()) {
                System.out.println("KV: " + kv + ", Value: " + Bytes.toString(kv.getValue()));
            }
        }

        // Closing the scanner and the table
        scanner.close();
        table.close();
    }
}

Case 5 : Student details using status: Sorting Student on the basis of their status.

package com.test.ap;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanTable {

    public static void main(String[] args) throws IOException {

        // Instantiating the Configuration class
        Configuration config = HBaseConfiguration.create();

        // Instantiating the HTable class
        HTable table = new HTable(config, "Prwatech");

        // Instantiating the Scan class and filtering on Weekdays_query:status = 'Joining'
        Scan scan = new Scan();
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
                Bytes.toBytes("Weekdays_query"), Bytes.toBytes("status"),
                CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("Joining")));
        scan.setFilter(filter);

        // Reading values from the scan result
        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            for (KeyValue kv : result.raw()) {
                System.out.println("KV: " + kv + ", Value: " + Bytes.toString(kv.getValue()));
            }
        }

        // Closing the scanner and the table
        scanner.close();
        table.close();
    }
}

Case 6 : Name of student between particular id for a particular course: Sorting Student on the basis of id and course.

package com.test.ap;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.InclusiveStopFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class Multifilter {

    public static void main(String[] args) throws IOException {

        // Instantiating the Configuration class
        Configuration config = HBaseConfiguration.create();

        // Instantiating the HTable class
        HTable table = new HTable(config, "Prwatech");

        // Filter 1: keep cells whose value contains "Hadoop"
        List<Filter> filters = new ArrayList<Filter>();
        Filter filter1 = new ValueFilter(CompareFilter.CompareOp.EQUAL, new SubstringComparator("Hadoop"));
        filters.add(filter1);

        // Filter 2: stop the scan at row id '7' (inclusive)
        Filter filter2 = new InclusiveStopFilter(Bytes.toBytes("7"));
        filters.add(filter2);

        // Start scanning from row id '3'; both filters must pass
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("3"));
        FilterList fl = new FilterList(FilterList.Operator.MUST_PASS_ALL, filters);
        scan.setFilter(fl);

        // Reading values from the scan result
        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            for (KeyValue kv : result.raw()) {
                System.out.println("KV: " + kv + ", Value: " + Bytes.toString(kv.getValue()));
            }
        }

        // Closing the scanner and the table
        scanner.close();
        table.close();
    }
}

Case 7 : Student enrolled for the demo: Sorting Student on the basis of demo scheduled.

package com.test.ap;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanTable {

    public static void main(String[] args) throws IOException {

        // Instantiating the Configuration class
        Configuration config = HBaseConfiguration.create();

        // Instantiating the HTable class
        HTable table = new HTable(config, "Prwatech");

        // Instantiating the Scan class and filtering on Weekdays_query:status = 'Need Demo'
        Scan scan = new Scan();
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
                Bytes.toBytes("Weekdays_query"), Bytes.toBytes("status"),
                CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("Need Demo")));
        scan.setFilter(filter);

        // Reading values from the scan result
        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            for (KeyValue kv : result.raw()) {
                System.out.println("KV: " + kv + ", Value: " + Bytes.toString(kv.getValue()));
            }
        }

        // Closing the scanner and the table
        scanner.close();
        table.close();
    }
}

Case 8 : Student not confirmed for the session: Sorting Student on the basis of the status of not joining or no response.

package com.test.ap;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanTable {

    public static void main(String[] args) throws IOException {

        // Instantiating the Configuration class
        Configuration config = HBaseConfiguration.create();

        // Instantiating the HTable class
        HTable table = new HTable(config, "Prwatech");

        // Instantiating the Scan class and filtering on Weekdays_query:status = 'Not confirm'
        Scan scan = new Scan();
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
                Bytes.toBytes("Weekdays_query"), Bytes.toBytes("status"),
                CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("Not confirm")));
        scan.setFilter(filter);

        // Reading values from the scan result
        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            for (KeyValue kv : result.raw()) {
                System.out.println("KV: " + kv + ", Value: " + Bytes.toString(kv.getValue()));
            }
        }

        // Closing the scanner and the table
        scanner.close();
        table.close();
    }
}

Case 9 : Student query for Pune/Bangalore location: Sorting Student on the basis of location.

package com.test.ap;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanTable {

    public static void main(String[] args) throws IOException {

        // Instantiating the Configuration class
        Configuration config = HBaseConfiguration.create();

        // Instantiating the HTable class
        HTable table = new HTable(config, "Prwatech");

        // Instantiating the Scan class and filtering on Weekdays_query:location = 'Pune'
        Scan scan = new Scan();
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
                Bytes.toBytes("Weekdays_query"), Bytes.toBytes("location"),
                CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("Pune")));
        scan.setFilter(filter);

        // Reading values from the scan result
        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            for (KeyValue kv : result.raw()) {
                System.out.println("KV: " + kv + ", Value: " + Bytes.toString(kv.getValue()));
            }
        }

        // Closing the scanner and the table
        scanner.close();
        table.close();
    }
}

 

 

 

HDFS Questions and Answers – II

Q.1 Algorithm for NN to allocate a block on different Data Node?

Ans : The NameNode uses a nearest-neighbour algorithm to allocate blocks on different nodes: it directs each block to the DataNode nearest to the client.

 

Q.2 Write the functions called for splitting the user data into blocks?

Ans : The split() method is used to break the data into multiple chunks (blocks) to be allocated to DataNodes in the cluster. Suppose the file Abc.txt is split into six chunks called A1, B2, C3, D4, E5 and F6. Sequentially, for a chunk such as C3, the HDFS client requests the location of a DataNode from the NameNode, and in response the NameNode sends the location of the first (nearest) DataNode for storing the data.

Q.3 How to modify the heartbeat and block report time interval of Data node?

Ans: We can easily modify the heartbeat of the DataNodes in the cluster by editing the Hadoop configuration file.

File name: hdfs-site.xml. This file is required for setting up the Hadoop environment; it also configures the NameNode, secondary NameNode and DataNodes.

Inside the file there is a parameter called "dfs.heartbeat.interval"; just modify its value according to the requirement.

We can likewise change the block report interval in the "hdfs-site.xml" file through the parameter "dfs.blockreport.intervalMsec"; modify it according to the requirement.
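A minimal sketch of the corresponding entries in hdfs-site.xml (the values shown are the usual defaults, given here only as an example):

<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>            <!-- heartbeat interval in seconds -->
</property>
<property>
  <name>dfs.blockreport.intervalMsec</name>
  <value>21600000</value>     <!-- block report interval in milliseconds (6 hours) -->
</property>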

Q.4 what is fsImage and which type of metadata it store?

Ans: fsImage is one of the directories of the NameNode; it stores the configuration of the data, its replication factor, the block report and the metadata about where the data is stored.

Q.5 if 2TB data given, what is the max expected Meta data will generate?

Ans : 2 TB = 2 x 1024 x 1024 MB = 2,097,152 MB; 2,097,152 MB / 128 MB per block = 16,384 blocks of metadata.

 

Q.6 Write the Lifecycle of SNN in production?

Ans : The secondary NameNode is used to mitigate the single point of failure: if the NameNode fails due to some network issue, the entire cluster stops working, so to provide a backup for the NameNode, the secondary NameNode takes checkpoints of the Hadoop file system metadata. The NameNode keeps two structures, the edit logs and the fsImage. The edit log entries are merged and copied into the fsImage, and when the cluster restarts the NameNode replays these files to update its metadata. So the secondary NameNode takes checkpoints of the Hadoop file system; it does not play the role of the NameNode in the cluster.

Q.7 if any DN stop working, how the blocks of dead DN will move to the active DN?

Ans : When a DataNode is detected as dead, the NameNode re-creates its blocks on other DataNodes from the remaining replicas. It first looks for the nearest DataNodes with enough free space to hold the replicas of the dead node; if no such space is found, it falls back to an empty DataNode.

Flume Questions and Answers

Q1 : What is Apache Flume?

Apache Flume is used when we need to efficiently and reliably collect, aggregate and transfer large amounts of data from one or more sources to a centralized data store. It can ingest any kind of data, including log data, event data, network data, social-media-generated data, email messages, message queues, etc., since data sources are customizable in Flume.

Q2 : What are the Basic Features of flume ?

  • A data collection service for Hadoop : Using Flume, we can get the data from multiple servers immediately into Hadoop.

 

  • For distributed systems: Along with the log files, Flume is also used to import huge volumes of event data produced by social networking sites like Facebook and Twitter, and e-commerce websites like Amazon and Flipkart.

 

  • Open source: It is open-source software; it does not require any licence key for its activation.

 

  • Scalable: Flume can be scaled horizontally.

 

Q3 : What are some applications of Flume ?

Assume a web application wants to analyze customer behaviour from current activity. This is where Flume comes in handy: it extracts the data and moves it to Hadoop for analysis.

Flume is used to move the log data generated by application servers into HDFS at a higher speed.

 

Q4 : What is an Agent?

A process that hosts flume components such as sources, channels and sinks, and thus has the ability to receive, store and forward events to their destination.

Q5 : What is a channel?

It stores events; events are delivered to the channel via sources operating within the agent. An event stays in the channel until a sink removes it for further transport.
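As a minimal sketch, an agent and its source, channel and sink are wired together in the Flume properties file; the agent name (a1) and component names here are arbitrary:

a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat          # listen for events on a TCP port
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1          # the source writes events into the channel

a1.channels.c1.type = memory         # in-memory channel (events are lost if the agent dies)

a1.sinks.k1.type = logger            # the sink removes events from the channel and logs them
a1.sinks.k1.channel = c1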

 

Q6 : Does Apache Flume provide support for third party plug-ins?

Yes. Apache Flume has a plug-in-based architecture, so it can load data from external sources and transfer it to external destinations, which is why most data analysts use it.

Q7 : Does Apache Flume support third-party plugins ?

Yes, Flume has a 100% plugin-based architecture: it can load and ship data from external sources to external destinations that live separately from Flume, which is why most big-data analysts use this tool for streaming data.

Q8 : What’s FlumeNG?

FlumeNG is essentially a real-time loader for streaming your data into Hadoop; it stores the data in HDFS and HBase. FlumeNG improves on the original Flume, so it is the version to get started with.

Q9 : How do you handle agent failures?

If a Flume agent goes down, all flows hosted on that agent are aborted; once the agent is restarted, the flows resume. If the channel is set up as an in-memory channel, all events stored in the channels when the agent went down are lost, but channels set up as file channels or other durable channels will continue processing events from where they left off.

Q10 : Can Flume distribute data to multiple destinations?

Answer: Yes, it supports multiplexing flow: an event flows from one source to multiple channels and multiple destinations. This is achieved by defining a flow multiplexer.

 

 

Sqoop Questions and Answers

Q1: The first and best function of Sqoop?

Ans : Sqoop can import individual tables or entire databases into HDFS. The data is stored in the native directories and files in the HDFS file system.

Q2: Why Sqoop uses mapreduce in import/export operations?

Ans : Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

Q3: What is the data loading or import in Sqoop?

Ans : Sqoop can load data directly into Hive tables, creating the HDFS files and the Hive metadata automatically in the background.

Q4: Sqoop imports data into three kinds of data storage what are those?

Hive Tables

HDFS files

Hbase (HBase is an open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable)

Q5: What is Apache Sqoop?

Ans : Apache Sqoop is a tool used for transferring data between Apache Hadoop clusters and relational databases.

Sqoop was originally developed by Cloudera. The name ‘Sqoop’ is a short form for ‘SQL-to-Hadoop’.

Sqoop can import full or partial tables from a SQL database into HDFS in a variety of formats. Sqoop can also export data from HDFS to a SQL database.

 

Q6: What is the basic command-line syntax for using Apache Sqoop?

 

Ans : Apache Sqoop is a command-line utility that has various commands to import data, export data, list data, etc. These commands are called tools in Sqoop. Following is the basic command-line syntax for using Apache Sqoop.
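A sketch of the generic form; the list-tables example reuses the connection details from Q7 below:

$ sqoop <tool-name> [tool-arguments]

Example : $ sqoop help
$ sqoop list-tables --connect jdbc:mysql://myhostname/interviewgrid --username myusername --password mypassword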

 

Q7: How do you import data from a single table ‘customers’ into HDFS directory ‘customerdata’?

Ans : You can import the data from a single table using the tool or command 'import --table'. You can use the option '--warehouse-dir' to import the data into the 'customerdata' HDFS directory.

$ sqoop import --table customers \
  --connect jdbc:mysql://myhostname/interviewgrid \
  --username myusername --password mypassword \
  --warehouse-dir /customerdata

 

Using the Sqoop command, how can we control the number of mappers?

We can control the number of mappers by passing the --num-mappers parameter in the sqoop command. The --num-mappers argument controls the number of map tasks, which is the degree of parallelism used. Start with a small number of map tasks and increase it gradually, because choosing too high a number of mappers may degrade performance on the database side.

Syntax: -m, --num-mappers <n>
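For example, reusing the import from Q7 with four parallel map tasks (a sketch):

$ sqoop import --table customers \
  --connect jdbc:mysql://myhostname/interviewgrid \
  --username myusername --password mypassword \
  --warehouse-dir /customerdata -m 4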

Q8 :  How Sqoop can be used in a Java program?

Answer: The Sqoop jar should be included in the classpath of the Java program. After this the Sqoop.runTool() method must be invoked, and the necessary parameters should be passed to Sqoop programmatically, just as on the command line.

Q9 : What is a sqoop metastore?

Answer: It is a tool using which Sqoop hosts a shared metadata repository. Multiple users and/or remote users can define and execute saved jobs (created with sqoop job) defined in this metastore.

Clients must be configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.

Q10: What is sqoop-merge and explain its uses?

Answer:
Sqoop merge is a tool which combines two different datasets into a single, latest-version dataset by overwriting the entries in the older version with the new files. A flattening process happens while merging the two datasets, which preserves the data without any loss and with efficiency and safety. In order to perform this operation, a merge key is specified with the "--merge-key" argument.
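A sketch of a merge invocation; the HDFS paths are assumptions, and the jar and class normally come from an earlier sqoop codegen or import run:

$ sqoop merge --new-data /user/cloudera/customers_new \
  --onto /user/cloudera/customers_old \
  --target-dir /user/cloudera/customers_merged \
  --jar-file customers.jar --class-name customers \
  --merge-key id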

 

HDFS Questions and Answers

Q.1 :  Algorithm for Name Node to allocates block on different Data Nodes ?

Ans: The NameNode basically uses a nearest-neighbour algorithm to allocate blocks on different nodes: it sends the data to the nearest DataNode. For example:

Explanation:

As seen in the diagram above, a cluster has a NameNode with DataNodes residing in different locations, and a client from another location (called Andhra Pradesh) sends some files to be stored on the DataNodes. The NameNode then has the responsibility of finding the DataNode nearest to the client so that the operation is performed in less time. Here we have three locations: New Delhi, Mumbai and Chennai. By the nearest-neighbour algorithm, Chennai is nearer to the client than the others, so the NameNode sends that DataNode's location to the client for storing its data.

Q2 : Write the function called Splitting  the user data into blocks ?

Ans : The split() method is used to break the data into multiple chunks (blocks) to be allocated to DataNodes in the cluster. Suppose the file Abc.txt is split into six chunks called A1, B2, C3, D4, E5 and F6. Sequentially, for a chunk such as C3, the HDFS client requests the location of a DataNode from the NameNode, and in response the NameNode sends the location of the first (nearest) DataNode for storing the data.

Q3 :How to modify heartbeat and block report time interval of the data node .?

Ans : The heartbeat is used to check the status (active or inactive) of the DataNodes in the cluster in order to perform the different operations. Each DataNode regularly sends a heartbeat report to the NameNode, every 3 seconds by default. If a DataNode is inactive, the operation assigned by the NameNode will not take place.

We can easily modify the heartbeat of the DataNodes in the cluster by editing the Hadoop configuration file.

File name : hdfs-site.xml. This file is required for setting up the Hadoop environment; it also configures the NameNode, secondary NameNode and DataNodes.

Inside the file there is a parameter called "dfs.heartbeat.interval"; just modify its value according to the requirement.

Block Report

The block report is used to send the status of the different blocks held by a DataNode; every 10th heartbeat is referred to as a block report, and it basically contains information such as the replication factor and the block names.

We can easily change the block report interval in the "hdfs-site.xml" file through the parameter "dfs.blockreport.intervalMsec"; modify it according to the requirement.

Q4 : What is fsImage and which type of metadata it store ? 

Ans : fsImage is one of the directories of the NameNode; it stores the configuration of the data, its replication factor, the block report and the metadata about where the data is stored.

 

Q5 :If 2Tb data is given what is the max expected metadata is generated ?

Ans : We have 2 TB of data, i.e. 2 x 1024 GB = 2048 GB; 2048 GB x 1024 = 2,097,152 MB; 2,097,152 MB / 128 MB per block = 16,384 blocks.

 

Q6 : Write the Life cycle of SNN in production ?

Ans : The secondary NameNode is used to mitigate the single point of failure: if the NameNode fails due to some network issue, the entire cluster stops working, so to provide a backup for the NameNode, the secondary NameNode takes checkpoints of the Hadoop file system metadata. The NameNode contains two directories, the edit logs and the fsImage. The edit log entries are merged and copied into the fsImage, and when the cluster restarts the NameNode replays these files to update its metadata. So the secondary NameNode takes checkpoints of the Hadoop file system; it does not play the role of the NameNode in the cluster.

 

Q7 : If any DN stop working , how the blocks of dead DN will move to the active DN ?

Ans : In a cluster, if any DataNode stops working, its blocks are simply re-replicated onto other active DataNodes in the cluster to maintain the replication factor. The replication factor never stays reduced, because the number of replicas decided when the data was split into blocks (chunks) is restored.

Hadoop-PIG Introduction

Apache Pig is a platform used for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. It supports many relational features, making it easy to join, group and aggregate data.

Pig has many things in common with ETL tools, if those ETL tools ran over many servers simultaneously. Apache Pig is included in the Hadoop ecosystem. It is a frontrunner for extracting, loading and transforming various forms of data: unlike many traditional ETL tools, which are good at structured data, Hive and Pig were created to load and transform unstructured, structured and semi-structured data into HDFS.

Because both Pig and Hive make use of MapReduce, they might not be as fast at non-batch-oriented processing. Some open-source tools attempt to address this limitation, but the problem still exists.

♦ ETL : it will extract data from sources A, B and C, but instead of transforming it first, you first load the raw data into a database or HDFS.

Often the loading process requires no schema; the data can remain in the repository, unprocessed, for a long time. When the data is needed, someone builds a schema, transforms the data and determines how to analyze it. That person might then load the new, transformed data onto another platform such as Apache HBase.

♦ VENDOR:

1. Yahoo: one of the heaviest users of Hadoop (core and Pig), it runs 40% of all its Hadoop jobs with Pig.

2. Twitter: also a well-known user of Pig.

♦ Components of PIG:

  1. A high-level data processing language called Pig Latin.

  2. A compiler that compiles and runs Pig Latin scripts in a choice of evaluation mechanisms. The main evaluation mechanism is Hadoop; Pig also supports a local mode for development purposes.

 

♦ Data Flow Language:

We write Pig Latin programs as a sequence of steps where each step is a single high-level data transformation. The transformations support relational-style operations such as filter, union, group and join.

Even though the operations are relational in style, Pig Latin remains a data flow language. A flow language is friendlier to programmers who think in terms of algorithms, which are more naturally expressed by data and control flows. On the other hand, a declarative language such as SQL is better suited to simply stating the desired result.

Data Type :

“Pig eats anything”. Input data can come in any format. Popular formats, such as tab-delimited text files, are natively supported. Users can add functions as well. Pig doesn't require metadata or a schema on the data, but it can take advantage of them if they are provided.

Pig can operate on data that is relational, nested, semi-structured or unstructured. Pig supports complex data types such as bags and tuples that can be nested to form fairly sophisticated data structures.

 

♦ Running PIG (PIG Shell):

Grunt: used to enter Pig commands manually, for ad hoc data analysis and interactive cycles of program development. The Grunt shell also supports file utility commands such as ls and cp.

Script: large Pig programs, or ones that will be run repeatedly, are run from a script file.

 

♦ Pig Latin Data Model

The data model of Pig Latin is fully nested and it allows complex non-atomic data types such as map and tuple.

Example data set : (Prwatech1,Hadoop,20000,Pune)   (Prwatech2,Python,30000,Bangalore)   (Prwatech3,AWS,12000,Pune)

 

Atom

Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.

Example − ‘Prwatech1’ or ‘Hadoop’

Tuple

A record that is formed by an ordered set of fields is known as a tuple, the fields can be of any type. A tuple is similar to a row in a table of RDBMS.

Example − (Prwatech1, Hadoop)

Bag

A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.

Example − {(Prwatech1,Hadoop ), (Prwatech2, Python)}

A bag can be a field in a relation; in that context, it is known as inner bag.

Example − (Prwatech1, Hadoop, {(20000, Pune)})

Map

A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value might be of any type. It is represented by ‘[]’

Example − [name#Prwatech1, Course#Hadoop]

Relation

A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
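A short illustrative Pig Latin sketch using the example data set above; the file name is an assumption:

courses = LOAD 'prwatech_courses.txt' USING PigStorage(',') AS (student:chararray, course:chararray, fee:int, city:chararray);
by_city = GROUP courses BY city;     -- a relation: one tuple per city holding an inner bag of course tuples
avg_fee = FOREACH by_city GENERATE group AS city, AVG(courses.fee) AS average_fee;
DUMP avg_fee;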
