/
HIVE CSCE 587 Spring 2018 HIVE CSCE 587 Spring 2018

HIVE CSCE 587 Spring 2018 - PowerPoint Presentation

southan
southan . @southan
Follow
344 views
Uploaded On 2020-06-24

HIVE CSCE 587 Spring 2018 - PPT Presentation

Step 1 Data Start by downloading the data httpsrawgithubusercontentcomhortonworksdatatutorialsmastertutorialshdphowtoprocessdatawithapachehiveassetsdriverdatazip There are two files in this data set that we saw last week with PIG ID: 785302

drivers timesheet col table timesheet drivers table col temp extract driverid regexp click data select logged csv miles execute

Share:

Link:

Embed:

Download Presentation from below link

Download The PPT/PDF document "HIVE CSCE 587 Spring 2018" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

Slide1

HIVE

CSCE 587

Spring 2018

Slide2

Step 1: Data

Start by downloading the data

https://raw.githubusercontent.com/hortonworks/data-tutorials/master/tutorials/hdp/how-to-process-data-with-apache-hive/assets/driver_data.zip

There are two files in this data set that we saw last week with PIG:

drivers.csv

timesheet.csv

Slide3

Step 2: Load the files into HDFS

Unlike last week, we will use the GUI to load.

Start by logging onto

ambaria

: vm-Hadoop-xx.cse.sc.edu:8080

use your

maria_dev

credentials

Slide4

Step 2: Load the files into HDFS

Click on the icon that resembles 3x3 grid on the menu bar

(it is at the top of the window on the far right side)

Select “Files View”

Slide5

Step 2: Load the files into HDFS

Navigate to /user/

maria_dev

1. scroll to bottom of list to find user.

2. click on “user”.

3. scroll down list to find “

maria_dev

4. click on “

maria_dev

Slide6

Navigating to /user/maria_dev

You should see something like this, although you should also see the other files that you created last week with Hadoop and PIG.

Slide7

Click on the “Upload” button

Then select the “Browse” button and navigate to where you stored the files on the

linux

file system.

Select drivers.csv to upload

Do the same for timesheet.csv

Slide8

Had you started with a tabula rasa, your directory would look like this:

Slide9

3. HIVE View 2.0

Switch context from “Files” view to “Hive View 2.0:

Slide10

Step 3.1

This brings up the query editor

Slide11

Step 3.2: Create an empty table

Enter:

create table

temp_drivers

(

col_value

STRING);

Then click on “Execute”

Slide12

Result

Slide13

S

tep 3.3

Enter:

LOAD DATA INPATH '/user/

maria_dev

/drivers.csv' OVERWRITE INTO TABLE

temp_drivers

;

Then click on “Execute”

Slide14

G

o back to the “Files” view

What has changed?

The file drivers.csv is no longer there

 Loading the file into Hive has “consumed” drivers.csv

Slide15

Take a peek at the table “

temp_drivers

Enter:

select * from

temp_drivers

limit 10;

then click on “Execute”

Slide16

Step 3.4: extract the fields we want

enter:

CREATE TABLE drivers (

driverId

INT, name STRING,

ssn

BIGINT, location STRING, certified STRING,

wageplan

STRING);

the click on “Execute”

Slide17

Step 3.5:

QUERY TO EXTRACT DATA FROM

Temp_drivers

insert overwrite table drivers

SELECT

regexp_extract

(

col_value

, '^(?:([^,]*),?){1}', 1)

driverId

,

regexp_extract

(

col_value

, '^(?:([^,]*),?){2}', 1) name,

regexp_extract

(

col_value

, '^(?:([^,]*),?){3}', 1)

ssn

,

regexp_extract

(

col_value

, '^(?:([^,]*),?){4}', 1) location,

regexp_extract

(

col_value

, '^(?:([^,]*),?){5}', 1) certified,

regexp_extract

(

col_value

, '^(?:([^,]*),?){6}', 1)

wageplan

from

temp_drivers

;

Slide18

Step 3.5:

create a query to

extract DATA FROM

Temp_drivers

insert overwrite table drivers

SELECT

regexp_extract

(

col_value

, '^(?:([^,]*),?){1}', 1)

driverId

,

regexp_extract

(

col_value

, '^(?:([^,]*),?){2}', 1) name,

regexp_extract

(

col_value

, '^(?:([^,]*),?){3}', 1)

ssn

,

regexp_extract

(

col_value

, '^(?:([^,]*),?){4}', 1) location,

regexp_extract

(

col_value

, '^(?:([^,]*),?){5}', 1) certified,

regexp_extract

(

col_value

, '^(?:([^,]*),?){6}', 1) wageplanfrom temp_drivers;

Slide19

Take a peek at the first 10 rows of the resulting table

Enter:

select * from drivers limit 10;

Then click on “Execute”

Slide20

Processing timesheet.csv

creating similar tables from timesheet.csv

Start by creating

temp_timesheet

with the following command:

CREATE TABLE

temp_timesheet

(

col_value

string

);

Then populate it with data from timesheet.csv

LOAD DATA INPATH '/user/

maria_dev

/timesheet.csv' OVERWRITE INTO TABLE

temp_timesheet

;

Finally, look at the first 10 lines of the table as a sanity check.

Select * from

temp_timesheet

limit 10;

CREATE TABLE temp_timesheet

(

col_value

string

);

Slide21

3.6

CREATE TABLE

temp_timesheet

(

col_value

string);

LOAD DATA INPATH '/user/

maria_dev

/timesheet.csv' OVERWRITE INTO TABLE

temp_timesheet

;

Slide22

Creating timesheet from

temp_timesheet

Start by creating an empty table timesheet

Enter:

CREATE

TABLE timesheet (

driverId

INT, week INT,

hours_logged

INT ,

miles_logged

INT);

Then click on “Execute”

Next populate the table by extracting columns

fromm

temp_timesheet

Slide23

Extracting columns from

temp_timesheet

Enter:

insert overwrite table timesheet

SELECT

regexp_extract

(

col_value

, '^(?:([^,]*),?){1}', 1)

driverId

,

regexp_extract

(

col_value

, '^(?:([^,]*),?){2}', 1) week,

regexp_extract

(

col_value

, '^(?:([^,]*),?){3}', 1)

hours_logged

,

regexp_extract

(

col_value

, '^(?:([^,]*),?){4}', 1)

miles_logged

f

rom

temp_timesheet

;

Then click on “Execute”

Slide24

Take a peek at the first 10 rows

enter:

select * from timesheet limit 10;

click on “Execute”

Slide25

Now group timesheet data by

driverID

so that we can

sum the hours logged and

sum the miles logged

Enter:

SELECT

driverId

, sum(

hours_logged

), sum(

miles_logged

) FROM timesheet GROUP BY

driverId

;

Then click on “Execute”

Slide26

Results after grouping by

driverID

and summing logged hours and logged miles

Slide27

Combine columns from drivers and timesheet tables

Columns from drivers table:

driverId

name

Columns from timesheets table:

total_hours

total_miles

Join column:

driverId

Slide28

Combine columns from drivers and timesheet tables

SELECT

d.driverId

, d.name,

t.total_hours

,

t.total_miles

from

drivers d

JOIN (SELECT

driverId

,

sum(

hours_logged

)

total_hours

, sum(

miles_logged

)

total_miles

FROM

timesheet

GROUP BY

driverId

) t

ON (

d.driverId

=

t.driverId

);

Slide29

Results after entering command and executing

Slide30