Slide1
Pig Tutorial
Hui Li lihui@indiana.edu
Some material adapted from slides by Adam Kawa, the 3rd meeting of WHUG, June 21, 2012
Slide2
What is Pig
Framework for analyzing large unstructured and semi-structured data on top of Hadoop.
  Pig Engine: parses and compiles Pig Latin scripts into MapReduce jobs that run on top of Hadoop.
  Pig Latin: a simple but powerful data flow language, similar to scripting languages.
    SQL-like language
    Provides common data operations (e.g. filters, joins, ordering)
Slide3
Motivation of Using Pig
Faster development
  Fewer lines of code (writing MapReduce jobs becomes like writing SQL queries)
  Re-use of code (Pig library, Piggy Bank)
One test: find the top 5 most frequent words
  10 lines of Pig Latin vs. 200 lines of Java
  15 minutes in Pig Latin vs. 4 hours in Java
Slide4
Word Count using MapReduce
Slide5
Word Count using Pig
Lines = LOAD 'input/access.log' AS (line: chararray);
Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
Groups = GROUP Words BY word;
Counts = FOREACH Groups GENERATE group, COUNT(Words) AS cnt;
Results = ORDER Counts BY cnt DESC;
Top5 = LIMIT Results 5;
STORE Top5 INTO '/output/top5words';
Slide6
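For readers who want to check the data flow of the word count script above without a Hadoop setup, the same steps can be sketched in plain Python. This is only an illustration: the function name `top_words` is made up here, splitting on whitespace is a simplification of what TOKENIZE does, and the alphabetical tie-break is an assumption.

```python
from collections import Counter


def top_words(lines, n=5):
    """Mirror the Pig data flow: tokenize, group, count, order, limit."""
    # FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one word per record.
    # Simplification: split on whitespace only.
    words = [word for line in lines for word in line.split()]
    # GROUP ... BY word, then COUNT per group.
    counts = Counter(words)
    # ORDER ... DESC, then LIMIT n (ties broken alphabetically, an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


if __name__ == "__main__":
    print(top_words(["a b a", "b a c"], 2))
```

Each Python line corresponds to one relation in the Pig script, which is roughly why the Pig version stays so short.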
Pig Tutorial
Basic Pig knowledge: (Word Count)
  Pig Data Types
  Pig Operations
  How to run Pig Scripts
Advanced Pig features: (Kmeans Clustering)
  Embedding Pig within Python
  User Defined Function
Slide7
Pig Data Types
Pig Latin Data Types
  Primitive types
    int, long, float, double, boolean, null, chararray, bytearray
  Complex types
    Cell: a field in a database
      {(0002576169), (Tome), (21), ("Male") ….}
    Tuple: a row in a database
      (0002576169, Tome, 21, "Male")
    DataBag: a table or view in a database
      {(0002576169, Tome, 21, "Male"), (0002576170, Mike, 20, "Male"), (0002576171, Lucy, 20, "Female") ….}
Slide8
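As a rough analogy for the complex types above, a Pig tuple behaves much like a Python tuple and a bag like a collection of such tuples. The sketch below is an approximation only: Pig bags are unordered, whereas a Python list is ordered, and storing the IDs as strings (to keep the leading zeros) is an assumption. The helper `field` is hypothetical.

```python
# A tuple is an ordered set of fields, like one row in a database.
student = ("0002576169", "Tome", 21, "Male")

# A bag is a collection of tuples, like a table or view.
# Approximation: Pig bags are unordered; a list is used here for simplicity.
students = [
    ("0002576169", "Tome", 21, "Male"),
    ("0002576170", "Mike", 20, "Male"),
    ("0002576171", "Lucy", 20, "Female"),
]


def field(row, position):
    """Return a single field (cell) of a tuple by position."""
    return row[position]


if __name__ == "__main__":
    print(field(student, 1))
    print(len(students))
```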
Pig Operations
Loading data
  LOAD loads input data
  Lines = LOAD 'input/access.log' AS (line: chararray);
Projection
  FOREACH … GENERATE (similar to SELECT)
  takes a set of expressions and applies them to every record
De-duplication
  DISTINCT removes duplicate records
Grouping
  GROUP collects together records with the same key
Aggregation
  AVG, COUNT, COUNT_STAR, MAX, MIN, SUM
Slide9
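To make the operations listed above concrete, here is a minimal Python sketch of what projection, de-duplication, grouping and aggregation do to a small set of records. The sample records and variable names are invented for illustration; this shows the semantics only, not how Pig executes them.

```python
from collections import defaultdict

# Hypothetical input records: (name, age)
records = [("Tome", 21), ("Mike", 20), ("Lucy", 20), ("Mike", 20)]

# Projection (FOREACH ... GENERATE name): apply an expression to every record.
names = [name for name, age in records]

# De-duplication (DISTINCT): drop repeated records.
unique = sorted(set(records))

# Grouping (GROUP ... BY age): collect records that share a key.
groups = defaultdict(list)
for name, age in unique:
    groups[age].append(name)

# Aggregation (COUNT per group, AVG over all records).
count_by_age = {age: len(members) for age, members in groups.items()}
avg_age = sum(age for _, age in unique) / len(unique)

if __name__ == "__main__":
    print(names)
    print(dict(groups))
    print(count_by_age, avg_age)
```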
How to run Pig Latin scripts
Local mode
  Neither Hadoop nor HDFS is required
  The local host and local file system are used
  Useful for prototyping and debugging
Hadoop mode
  Runs on a Hadoop cluster and HDFS
Batch mode: run a script directly
  pig -p input=someInput script.pig
  script.pig:
    Lines = LOAD '$input' AS (…);
Interactive mode: use the Pig shell (Grunt) to run scripts
  grunt> Lines = LOAD '/input/input.txt' AS (line:chararray);
  grunt> Unique = DISTINCT Lines;
  grunt> DUMP Unique;
Slide10
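The batch-mode example above relies on parameter substitution: the value given with -p replaces $input in the script text before the script runs. The Python sketch below imitates that textual replacement with the standard library's `string.Template`. It is only an illustration, not Pig's own preprocessor: the function name `substitute_params` is hypothetical, and positional references such as $0 are not handled.

```python
from string import Template


def substitute_params(script_text, params):
    """Replace $name placeholders in a script with the supplied values.

    Raises KeyError if a placeholder has no value.
    Limitation: positional references such as $0 are not supported here.
    """
    return Template(script_text).substitute(params)


if __name__ == "__main__":
    script = "Lines = LOAD '$input' AS (line:chararray);"
    print(substitute_params(script, {"input": "someInput"}))
```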
Sample: Word Count using Pig
Lines = LOAD 'input/access.log' AS (line: chararray);
Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
Groups = GROUP Words BY word;
Counts = FOREACH Groups GENERATE group, COUNT(Words) AS cnt;
Results = ORDER Counts BY cnt DESC;
Top5 = LIMIT Results 5;
STORE Top5 INTO '/output/top5words';
Slide11
Sample: Kmeans using Pig
A method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
  Assignment step: assign each observation to the cluster with the closest mean
  Update step: calculate the new means to be the centroids of the observations in each cluster
Reference: http://en.wikipedia.org/wiki/K-means_clustering
Slide12
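The two steps described above can be sketched in a few lines of plain Python for one-dimensional data. This is a minimal sketch under stated assumptions, not the Pig implementation from this tutorial: the function names, the `tol` and `max_iter` defaults, and the choice to keep an empty cluster's old mean are all illustrative.

```python
def assign(points, means):
    """Assignment step: index of the closest mean for each observation."""
    return [min(range(len(means)), key=lambda j: abs(p - means[j]))
            for p in points]


def update(points, labels, means):
    """Update step: each mean becomes the centroid of its cluster.

    Assumption: an empty cluster keeps its previous mean.
    """
    new_means = []
    for j, old in enumerate(means):
        members = [p for p, label in zip(points, labels) if label == j]
        new_means.append(sum(members) / len(members) if members else old)
    return new_means


def kmeans_1d(points, means, max_iter=100, tol=1e-9):
    """Alternate the two steps until the means stop moving."""
    for _ in range(max_iter):
        labels = assign(points, means)
        new_means = update(points, labels, means)
        moved = max(abs(a - b) for a, b in zip(means, new_means))
        means = new_means
        if moved < tol:
            break
    return means, assign(points, means)


if __name__ == "__main__":
    print(kmeans_1d([1.0, 1.2, 0.8, 3.9, 4.1, 4.0], [0.0, 5.0]))
```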
Kmeans Using Pig
from math import fabs
from org.apache.pig.scripting import Pig

PC = Pig.compile("""register udf.jar
    DEFINE find_centroid FindCentroid('$centroids');
    raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
    centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
    grouped = group centroided by centroid;
    result = foreach grouped generate group, AVG(centroided.gpa);
    store result into 'output';
""")

while iter_num < MAX_ITERATION:
    PCB = PC.bind({'centroids': initial_centroids})
    results = PCB.runSingle()
    iter = results.result("result").iterator()
    centroids = [None] * v
    distance_move = 0.0
    # get new centroid of this iteration, calculate the moving distance with last iteration
    for i in range(v):
        tuple = iter.next()
        centroids[i] = float(str(tuple.get(1)))
        distance_move = distance_move + fabs(last_centroids[i] - centroids[i])
    distance_move = distance_move / v
    if distance_move < tolerance:
        converged = True
        break
    ……
Slide13
Embedding Python scripts with Pig
Pig does not support flow control statements: if/else, while loop, for loop, etc.
The Pig embedding API can leverage all language features provided by Python, including control flow:
  Loops and exit criteria
  Similar to the database embedding API
  Easier parameter passing
JavaScript is available as well
The framework is extensible: any JVM implementation of a language could be integrated
Slide14
Compile Pig Script
P = Pig.compile("""register udf.jar
    DEFINE find_centroid FindCentroid('$centroids');
    raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
    centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
    grouped = group centroided by centroid;
    result = foreach grouped generate group, AVG(centroided.gpa);
    store result into 'output';
""")
Compile the Pig script outside the loop, since we will run the same query every time
Within the loop, we invoke the compiled Pig script
while iter_num < MAX_ITERATION:
    Q = P.bind({'centroids': initial_centroids})
    results = Q.runSingle()
    ........
Slide15
User Defined Function
What is UDF
  Way to do an operation on a field or fields
  Called from within a Pig script
  Currently all done in Java
Why use UDF
  You need to do more than grouping or filtering
  Actually filtering is a UDF
  Maybe more comfortable in Java land than in SQL/Pig Latin
P = Pig.compile("""register udf.jar
    DEFINE find_centroid FindCentroid('$centroids');
Slide16
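The Java source of FindCentroid is not shown in these slides. Judging only from its name and from the colon-separated string passed to it (for example '0.0:1.0:2.0:3.0' in the run log), a plausible behavior is to return the centroid nearest to the input value. The Python sketch below illustrates that assumed logic; the function names and the tie-breaking rule (first centroid wins) are assumptions, not the actual UDF.

```python
def parse_centroids(spec):
    """Parse a colon-separated centroid string such as '0.0:1.0:2.0:3.0'."""
    return [float(part) for part in spec.split(":")]


def find_centroid(spec, value):
    """Return the centroid closest to value (assumed UDF behavior).

    On a tie, the centroid listed first is returned.
    """
    centroids = parse_centroids(spec)
    return min(centroids, key=lambda c: abs(c - value))


if __name__ == "__main__":
    print(find_centroid("0.0:1.0:2.0:3.0", 2.4))
```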
Zoom In Pig Kmeans code
# Iterate MAX_ITERATION times
while iter_num < MAX_ITERATION:
    # Binding parameters
    PCB = PC.bind({'centroids': initial_centroids})
    results = PCB.runSingle()
    iter = results.result("result").iterator()
    centroids = [None] * v
    distance_move = 0
    # get new centroid of this iteration, calculate the moving distance with last iteration
    for i in range(v):
        tuple = iter.next()
        centroids[i] = float(str(tuple.get(1)))
        distance_move = distance_move + fabs(last_centroids[i] - centroids[i])
    distance_move = distance_move / v
    if distance_move < tolerance:
        writeoutput()
        converged = True
        break
    # Update centroids
    last_centroids = centroids[:]
    initial_centroids = ""
    for i in range(v):
        initial_centroids = initial_centroids + str(last_centroids[i])
        if i != v - 1:
            initial_centroids = initial_centroids + ":"
    iter_num += 1
Slide17
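Two pieces of the loop above are plain Python and can be tried without Pig: the average distance the centroids moved, and the colon-separated string that is bound to $centroids for the next iteration. The helper names below are hypothetical; the slide code does the same work inline.

```python
from math import fabs


def mean_move(last_centroids, centroids):
    """Average absolute distance between old and new centroids."""
    total = sum(fabs(a - b) for a, b in zip(last_centroids, centroids))
    return total / len(centroids)


def format_centroids(centroids):
    """Join centroids with ':' so they can be bound to $centroids."""
    return ":".join(str(c) for c in centroids)


if __name__ == "__main__":
    old, new = [0.0, 1.0], [0.5, 1.5]
    print(mean_move(old, new))
    print(format_centroids(new))
```

Using `":".join(...)` avoids the explicit check for the last index that the inline loop needs.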
Run Pig Kmeans Scripts
2012-07-14 14:51:24,636 [main] INFO org.apache.pig.scripting.BoundScript - Query to run:
register udf.jar
DEFINE find_centroid FindCentroid('0.0:1.0:2.0:3.0');
raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
grouped = group centroided by centroid;
result = foreach grouped generate group, AVG(centroided.gpa);
store result into 'output';
Input(s):
Successfully read 10000 records (219190 bytes) from: "hdfs://iw-ubuntu/user/developer/student.txt"
Output(s):
Successfully stored 4 records (134 bytes) in: "hdfs://iw-ubuntu/user/developer/output"
last centroids: [0.371927835052,1.22406743491,2.24162171881,3.40173705722]
Slide18
References:
1) http://pig.apache.org (Pig official site)
2) http://en.wikipedia.org/wiki/K-means_clustering
3) Slides by Adam Kawa, the 3rd meeting of WHUG, June 21, 2012
4) Docs: http://pig.apache.org/docs/r0.9.0
5) Papers: http://wiki.apache.org/pig/PigTalksPapers
6) http://en.wikipedia.org/wiki/Pig_Latin
Questions?