- March 6, 2017 at 1:06 pm #2997
Run Pig commands from Hue.
Hue is the open source web UI that makes Apache Hadoop easier to use.
To run Pig commands, use the Pig editor in Hue: write the commands in the editor and run them from the action tab of the Hue editor.
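When we run Pig in local mode, will it convert the query to MR or not?
Yes, but nothing runs on a cluster. In local mode (pig -x local) Pig still compiles the script into MapReduce jobs, and executes them in a single JVM through Hadoop's local job runner, reading and writing the local file system. MapReduce mode is the default; for it, Hadoop must be running and the files should be stored in HDFS.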
How does the physical translator work when a Pig query is compiled?
Pig goes through several steps when it converts a Pig Latin script into MapReduce jobs. After basic parsing and semantic checking, it produces a logical plan, which describes the logical operators that Pig has to execute. From this, Pig produces a physical plan, which describes the physical operators needed to execute the script. The physical plan is then compiled into a MapReduce plan that assigns those operators to map and reduce stages.
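One way to look at these plans is Pig's EXPLAIN operator. The snippet below is only a sketch: the input file visits.txt and its field names are made up for illustration.

```pig
-- Hypothetical input file and field names, for illustration only.
visits  = LOAD 'visits.txt' AS (user:chararray, url:chararray);
by_user = GROUP visits BY user;
counts  = FOREACH by_user GENERATE group AS user, COUNT(visits) AS n;

-- Prints the logical, physical, and MapReduce plans for the alias.
EXPLAIN counts;
```

Running this in the Grunt shell prints the plans without executing the job.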
Limitations of Pig
Pig scripts are generally slower than a fully tuned Hadoop MapReduce program.
Pig has problems dealing with unstructured data such as images, video, audio, ambiguously delimited text, and log data.
Pig cannot cope well with poorly designed XML or JSON, or with flexible schemas.
On the other hand, Pig has two advantages over hand-written MapReduce:
When using Pig to execute jobs, Hadoop developers need not worry about version mismatches.
There is very little room for the developer to introduce Java-level bugs when coding in Pig.
How to achieve performance tuning in Pig?
1.Use Optimization
Pig supports various optimization rules, which are turned on by default. Become familiar with these rules.
2.Use Types
If types are not specified in the load statement, Pig assumes the type double for numeric computations. Much of the time your data would fit a smaller type, such as integer or long. Specifying the real type speeds up arithmetic computation, and it has the additional advantage of early error detection.
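As a small illustration of declaring types, the two versions below differ only in the load schema. The file name myfile and the fields t, u, v are placeholders, not names from a real data set.

```pig
-- Untyped: t and u are treated as double when they are added.
A = LOAD 'myfile' AS (t, u, v);
B = FOREACH A GENERATE t + u;

-- Typed: the addition is done on integers, and bad values surface earlier.
A_typed = LOAD 'myfile' AS (t: int, u: int, v);
B_typed = FOREACH A_typed GENERATE t + u;
```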
3.Project Early and Often
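Pig does not (yet) determine when a field is no longer needed and drop the field from the row. If a query loads fields that are never used after a join or group, project them away yourself as early as possible so they are not carried through the pipeline.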
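A sketch of early projection, assuming two placeholder inputs (myfile, myotherfile) of which only t, u, and x are needed downstream:

```pig
A = LOAD 'myfile' AS (t, u, v);
B = LOAD 'myotherfile' AS (x, y, z);

-- v, y, and z are never used below, so drop them before the join.
A1 = FOREACH A GENERATE t, u;
B1 = FOREACH B GENERATE x;

C = JOIN A1 BY t, B1 BY x;
D = GROUP C BY u;
E = FOREACH D GENERATE group, COUNT(C);
```

Without A1 and B1, the unused fields would be carried through the join and the group.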
4.Filter Early and Often
As with early projection, in most cases it is beneficial to apply filters as early as possible to reduce the amount of data flowing through the pipeline.
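For instance, a filter on one input can usually be applied before a join rather than after it. The sketch below uses placeholder file and field names and assumes the condition only involves fields of A:

```pig
A = LOAD 'myfile' AS (t: int, u, v);
B = LOAD 'myotherfile' AS (x: int, y, z);

-- Filter A first so that fewer rows reach the join.
A1 = FILTER A BY t > 0;
C  = JOIN A1 BY t, B BY x;
```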
5.Reduce Your Operator Pipeline
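For clarity of your script, you might choose to split your projections into several steps, for instance extracting values in one foreach and transforming them in the next. Combining such steps into a single statement can improve query performance.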
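The following sketch shows both forms side by side. The input 'data', the map field m, and the keys k1 and k2 are illustrative assumptions.

```pig
A = LOAD 'data' AS (m: map[]);

-- Two-step version: get the values out of the map, then concatenate them.
B = FOREACH A GENERATE m#'k1' AS k1, m#'k2' AS k2;
C = FOREACH B GENERATE CONCAT(k1, k2);

-- Combined version: one foreach does both steps.
C2 = FOREACH A GENERATE CONCAT(m#'k1', m#'k2');
```

The two-step version is easier to read; the combined one has a shorter operator pipeline.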
6.Make Your UDFs Algebraic
Queries that can take advantage of the combiner generally run much faster (sometimes several times faster) than versions that don't. The latest code significantly improves combiner usage; however, you need to make sure you do your part. If you have a UDF that works on grouped data and is, by nature, algebraic (meaning its computation can be decomposed into multiple steps), make sure you implement it as such. For details on how to write algebraic UDFs, see Algebraic Interface.
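The built-in COUNT and SUM functions are algebraic, so a plain group-and-aggregate like the sketch below is the kind of query where Pig can use the combiner. The file logs.txt and its fields are hypothetical.

```pig
logs    = LOAD 'logs.txt' AS (user:chararray, bytes:long);
by_user = GROUP logs BY user;

-- Only algebraic functions are applied to the grouped bag here.
totals  = FOREACH by_user GENERATE group AS user,
                                   COUNT(logs) AS hits,
                                   SUM(logs.bytes) AS total_bytes;
```

A custom UDF used in place of COUNT or SUM would need to implement the Algebraic interface to get the same benefit.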
7.Drop Nulls Before a Join
With the introduction of nulls, join and cogroup semantics were altered to work with nulls. The semantic for cogrouping with nulls is that nulls from a given input are grouped together, but nulls across inputs are not grouped together. This preserves the semantics of grouping (nulls are collected together from a single input to be passed to aggregate functions like COUNT) and the semantics of join (nulls are not joined across inputs). Since flattening an empty bag results in an empty row (and no output), in a standard join the rows with a null key will always be dropped. If the join keys contain many nulls, filtering them out before the join does not change the result and can improve performance.
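A minimal sketch of dropping null keys before an inner join, with placeholder file and field names:

```pig
A = LOAD 'myfile' AS (t, u, v);
B = LOAD 'myotherfile' AS (x, y, z);

-- Rows with a null join key would be dropped by the join anyway.
A1 = FILTER A BY t IS NOT NULL;
B1 = FILTER B BY x IS NOT NULL;

C = JOIN A1 BY t, B1 BY x;
```

This applies to the standard (inner) join described above; an outer join keeps rows with null keys from the outer side, so the filter would change its result.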
PIGGY BANK & Application?
PiggyBank is a collection of useful LOAD, STORE, and UDF functions. Mortar compiles and registers it automatically, so you can use anything you find there.
For example, to use the CommonLogLoader from PiggyBank, you can do:
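-- The USING clause names the PiggyBank loader; the input path is a placeholder.
data = LOAD 's3n://path/to/input'
       USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader()
       AS (addr: chararray, logname: chararray, user: chararray, time: chararray,
           method: chararray, uri: chararray, proto: chararray,
           status: int, bytes: int);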