Hadoop PIG Introduction

  • date 7th April, 2019 |
  • by Prwatech |

Hadoop Pig Introduction - Components and Commands of PIG

  Hadoop Pig Introduction, Welcome to the world of Hadoop PIG Tutorials. In these Tutorials, one can explore Introduction to Hadoop PIG,Components and Commands of PIG. Learn More advanced Tutorials on introduction of Hadoop PIG from India’s Leading Hadoop Training institute which Provides Advanced Hadoop Course for those tech enthusiasts who wanted to explore the technology from scratch to advanced level like a Pro.   We Prwatech the Pioneers of Hadoop Training offering advanced certification course and Hadoop PIG Introduction to those who are keen to explore the technology under the World-class Training Environment.  

Apache Hadoop Pig Introduction

  Apache Pig is a platform which is used to analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. It supports many relational features making it easy to Join, group and aggregates data.   PIG has many things in common with ETL, if those ETL tools run over on many server simultaneously. Apache Pig is included in the Hadoop ecosystem. It is a front runner for extracting, loading and transforming various forms of data unlike, many traditional ETL tools which are good at structure data, HIVE and PIG are created to load and transform unstructured, structured and semi-structured data into HDFS.   Both PIG and HIVE make use of MapReduce function, they might not be as fast at doing non batch oriented processing. Some open source tool attempt to inform this limitation, but the problem still exists.   Hadoop Pig Introduction

Components of PIG

   ETL : It will extract data from sources A, B, and C but instead of transforming it first, you first load the row data into a database or HDFS.   Often the loading process requires no schema the data can remain in the repository, unprocessed for a longtime. When the data is needed someone builds a schema, transform the data and determine how to analyze data. That person might load the new , transformed data onto another platform such as Apache HBase.  


   1. Yahoo: One of the heaviest user of Hadoop (core and Pig), runs 40% of all its Hadoop jobs with PIG.   Hadoop Pig Introduction   2. Twitter: It is also well known user of PIG  

Commands of PIG:

  1. A high level data processing language called Pig Latin.
  2. A compiler that compiles and runs Pig Latin scripts in a choice of evaluation mechanism. The main evaluation mechanism is Hadoop. Pig also supports a local mode for development purposes.
  Hadoop Pig Introduction  

Data Flow Language:

  We write Pig Latin Programs in a sequence of steps where each step is a single high-level data transformation. The transformation support relational style operations, such as filter, union, group and Join.   Even though the operations are relational in style, Pig Latin remains a data flow language. A flow language is friendlier to programmers who think in terms of algorithms, which are more natural of expressed by the data and the control flows. On the other hand, a declarative language such as SQL is preferred to just state the result.  

Data Type :

  “Pig eats anything”. Input data can come in any format. Popular format such as tab-delimited text files, are natively supported. Users can add functions as well. Pig doesn’t require meta data or schema on data, but it can take advantage of them if they are provided.   Pig can operate on data that is relational , nested ,semi-structured or unstructured . Pig supports complex data types such as bags and tuples that can be nested to form a fairly sophisticated data structure.  

Running PIG (PIG Shell):

  Grunt: To enter PIG commands manually. It is used for ad hoc data analysis and performs interactive cycles of program development. The Grunt shell also supports file utility common ds such as ls and cp.   Script: Large Pig programs or ones that will be running repeatedly are run in script file.   Hadoop Pig Introduction  

Pig Latin Data Model

  The data model of Pig Latin is fully nested and it allows complex non-atomic datatypes such as map and tuple.   Example Data set : {Prwatech1,Hadoop,20000,Pune}   {Prwatech2,Python,30000,Bangalore}   {Prwatech3,AWS,12000,Pune}  


  Any single value in Pig Latin, irrespective of their data, type is known as an Atom. It is stored as string and can be used as string and number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.   Example − ‘Prwatech1’ or ‘Hadoop’  


  A record that is formed by an ordered set of fields is known as a tuple, the fields can be of any type. A tuple is similar to a row in a table of RDBMS.   Example − (Prwatech1, Hadoop)  


  A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type. Example − {(Prwatech1,Hadoop ), (Prwatech2, Python)} A bag can be a field in a relation; in that context, it is known as inner bag. Example − {Prwatech1, Hadoop, {20000, Pune,}}  


  A map (or data map) is a set of key-value pairs. The key needs to be of type char array and should be unique. The value might be of any type. It is represented by ‘[]’   Example − [name#Prwatech1, Course#Hadoop]  


A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).      

Quick Support

image image