/
Pig Tutorial Pig Tutorial

Pig Tutorial - PowerPoint Presentation

karlyn-bohler
karlyn-bohler . @karlyn-bohler
Follow
406 views
Uploaded On 2015-09-26

Pig Tutorial - PPT Presentation

Hui Li lihuiindianaedu Some material adapted from slides by Adam Kawa the 3 rd meeting of WHUG June 21 2012 What is Pig Framework for analyzing large unstructured and semistructured data on top of Hadoop ID: 141483

centroids pig distance centroid pig centroids centroid distance gpa move lines foreach generate result group results run load words centroided initial iter

Share:

Link:

Embed:

Download Presentation from below link

Download Presentation The PPT/PDF document "Pig Tutorial" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

Slide1

Pig Tutorial

Hui Li lihui@indiana.edu

Some material adapted from slides by Adam Kawa the 3rd meeting of WHUG June 21, 2012Slide2

What is Pig

Framework for analyzing large un-structured and semi-structured data on top of Hadoop.

Pig Engine Parses, compiles Pig Latin scripts into MapReduce jobs run on top of Hadoop. Pig Latin is simple but powerful data flow language similar to scripting languages. SQL – like languageProvide common data operations (e.g. filters, joins, ordering)Slide3

Motivation of Using Pig

Faster developmentFewer lines of code (Writing map reduce like writing SQL queries)Re-use the code (Pig library, Piggy bank)

One test: Find the top 5 words with most high frequency10 lines of Pig Latin V.S 200 lines in Java15 minutes in Pig Latin V.S 4 hours in JavaSlide4

Word Count using MapReduceSlide5

Word Count using Pig

Lines=

LOAD ‘input/access.log’ AS (line: chararray); Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;Groups = GROUP Words BY word;Counts = FOREACH Groups GENERATEgroup, COUNT(Words);Results = ORDER

Words BY Counts

DESC

;

Top5 =

LIMIT

Results 5;

STORE

Top5

INTO

/output/top5words;Slide6

Pig Tutorial

Basic Pig knowledge: (Word Count)

Pig Data TypesPig OperationsHow to run Pig ScriptsAdvanced Pig features: (Kmeans Clustering)Embedding Pig within PythonUser Defined Function Slide7

Pig Data Types

Pig Latin Data TypesPrimitive typesInt, long, float, double,

boolean,nul, chararray, bytearry, Complex typesCell  field in Database{(0002576169), (Tome), (21), (“Male”)….}Tuple  Row in Database( 0002576169, Tome, 21, “Male”)DataBag  Table or View in Database{(0002576169 , Tome, 21, “Male”), (0002576170, Mike, 20, “Male”), (0002576171 Lucy, 20, “Female”)…. }Slide8

Pig Operations

Loading dataLOAD loads input dataLines=

LOAD ‘input/access.log’ AS (line: chararray); ProjectionFOREACH … GENERTE (similar to SELECT)takes a set of expressions and applies them to every record. De-duplicationDISTINCT removes duplicate recordsGroupingGROUPS collects together records with the same keyAggregationAVG, COUNT, COUNT_STAR, MAX, MIN, SUMSlide9

How to run Pig Latin scripts

Local modeNeither Hadoop nor HDFS is required

Local host and local file system is usedUseful for prototyping and debuggingHadoop modeRun on a Hadoop cluster and HDFSBatch mode - run a script directly Pig –p input=someInput script.pigScript.pigLines = LOAD ‘$input’ AS (…);Interactive mode use the Pig shell to run scriptGrunt> Lines = LOAD ‘/input/input.txt’ AS (line:chararray);Grunt> Unique = DISTINCT Lines;Grunt> DUMP Unique;Slide10

Sample: Word Count using Pig

Lines=

LOAD ‘input/access.log’ AS (line: chararray); Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;Groups = GROUP Words BY word;Counts = FOREACH Groups GENERATEgroup, COUNT(Words);Results = ORDER

Words BY Counts

DESC

;

Top5 =

LIMIT

Results 5;

STORE

Top5

INTO

/output/top5words;Slide11

Sample: Kmeans using Pig

A

method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.Assignment step: Assign each observation to the cluster with the closest mean Update step: Calculate the new means to be the centroid of the observations in the clusterReference: http://en.wikipedia.org/wiki/K-means_clusteringSlide12

Kmeans Using Pig

PC =

Pig.compile("""register udf.jar DEFINE find_centroid FindCentroid('$centroids'); raw = load 'student.txt' as (name:chararray, age:int, gpa:double); centroided = foreach raw generate gpa, find_centroid(gpa) as centroid; grouped = group centroided by centroid; result = Foreach grouped Generate group,

AVG

(

centroided.gpa

);

store result into 'output';

""")

while

iter_num

<MAX_ITERATION:

PCB

=

PC.

bind

({'centroids':

initial_centroids

})

results =

PCB.

runSingle

()

iter

=

results.result

("result").iterator()

centroids = [None] *

v

distance_move

=

0.0

# get new centroid of this iteration,

calculate

the moving distance with last iteration

for

i

in

range(v):

tuple =

iter.next

()

centroids[

i

] = float(

str

(

tuple.get

(1)))

distance_move

=

distance_move

+

fabs

(

last_centroids

[

i

]-centroids[

i

])

distance_move

=

distance_move

/

v;

if

distance_move

<tolerance:

converged

= True

break

……Slide13

Embedding Python scripts with Pig

Pig does not support flow control statement: if/else, while loop, for loop

, etc.Pig embedding API can leverage all language features provided by Python including control flow: Loop and exit criteriaSimilar to the database embedding APIEasier parameter passing JavaScript is available as wellThe framework is extensible. Any JVM implementation of a language could be integrated Slide14

Compile Pig Script

P = Pig.

compile("""register udf.jar DEFINE find_centroid FindCentroid('$centroids'); raw = load 'student.txt' as (name:chararray, age:int, gpa:double); centroided = foreach raw generate gpa, find_centroid(gpa) as centroid; grouped = Group centroided by centroid; result = Foreach grouped Generate group, AVG(

centroided.gpa

);

store result into 'output';

""")

Compile the Pig script outside the loop since we will run the same query every time

Within the loop, we invoke the compiled Pig script

public class

Kmeans

extends Configured implements Tool {

while

iter_num

<MAX_ITERATION:

Q =

P.

bind

({'centroids':

initial_centroids

})

results =

Q.

runSingle

();

........

}//public classSlide15

User Defined Function

What is UDFWay to do an operation on a field or fieldsCalled from within a pig script

Currently all done in JavaWhy use UDFYou need to do more than grouping or filteringActually filtering is a UDFMaybe more comfortable in Java land than in SQL/Pig LatinP = Pig.compile("""register udf.jar DEFINE find_centroid FindCentroid('$centroids');Slide16

Zoom In Pig Kmeans code

while

iter_num<MAX_ITERATION: PCB = PC.bind({'centroids':initial_centroids}) results = PC.runSingle()iter = results.result("result").iterator() centroids = [None] * v distance_move = 0for i in range(v): tuple = iter.next() centroids[i] = float(

str

(

tuple.get

(1)))

distance_move

=

distance_move

+

fabs

(

last_centroids

[

i

]-centroids[

i

])

distance_move

=

distance_move

/

v;

if

distance_move

<tolerance:

writeoutput

()

converged

= True

break

last_centroids

= centroids[:]

initial_centroids

= ""

for

i

in

range(v):

initial_centroids = initial_centroids + str(last_centroids[i]) if i!=v-1: initial_centroids = initial_centroids + ":" iter_num += 1

Iterate MAX_ITERATION times

Binding parameters

get

new centroid of this iteration,

calculate

the moving distance with last iteration

Update CentroidsSlide17

Run Pig Kmeans Scripts

2012-07-14 14:51:24,636 [main] INFO

org.apache.pig.scripting.BoundScript - Query to run:register udf.jar DEFINE find_centroid FindCentroid('0.0:1.0:2.0:3.0'); raw = load 'student.txt' as (name:chararray, age:int, gpa:double); centroided = foreach raw generate gpa, find_centroid(gpa) as centroid; grouped = group centroided by centroid; result = foreach grouped generate group, AVG(centroided.gpa); store result into 'output';

Input(s): Successfully

read 10000 records (219190 bytes) from:

"

hdfs

://

iw-ubuntu

/user/developer/student.txt"

Output(s

): Successfully

stored 4 records (134 bytes) in:

"

hdfs

:

//

iw-ubuntu

/user/developer/output“

last centroids: [

0.371927835052,1.22406743491,2.24162171881,3.40173705722

]Slide18

References:

1) http://pig.apache.org

(Pig official site)2) http://en.wikipedia.org/wiki/K-means_clustering3) slides by Adam Kawa the 3rd meeting of WHUG June 21, 20124) Docs http://pig.apache.org/docs/r0.9.05) Papers: http://wiki.apache.org/pig/PigTalksPapers6) http://en.wikipedia.org/wiki/Pig_LatinQuestions?