Step 1 Data Start by downloading the data httpsrawgithubusercontentcomhortonworksdatatutorialsmastertutorialshdphowtoprocessdatawithapachehiveassetsdriverdatazip There are two files in this data set that we saw last week with PIG ID: 785302
Download The PPT/PDF document "HIVE CSCE 587 Spring 2018" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.
Slide1
HIVE
CSCE 587
Spring 2018
Slide2Step 1: Data
Start by downloading the data
https://raw.githubusercontent.com/hortonworks/data-tutorials/master/tutorials/hdp/how-to-process-data-with-apache-hive/assets/driver_data.zip
There are two files in this data set that we saw last week with PIG:
drivers.csv
timesheet.csv
Slide3Step 2: Load the files into HDFS
Unlike last week, we will use the GUI to load.
Start by logging onto
ambaria
: vm-Hadoop-xx.cse.sc.edu:8080
use your
maria_dev
credentials
Slide4Step 2: Load the files into HDFS
Click on the icon that resembles 3x3 grid on the menu bar
(it is at the top of the window on the far right side)
Select “Files View”
Slide5Step 2: Load the files into HDFS
Navigate to /user/
maria_dev
1. scroll to bottom of list to find user.
2. click on “user”.
3. scroll down list to find “
maria_dev
”
4. click on “
maria_dev
”
Slide6Navigating to /user/maria_dev
You should see something like this, although you should also see the other files that you created last week with Hadoop and PIG.
Slide7Click on the “Upload” button
Then select the “Browse” button and navigate to where you stored the files on the
linux
file system.
Select drivers.csv to upload
Do the same for timesheet.csv
Slide8Had you started with a tabula rasa, your directory would look like this:
Slide93. HIVE View 2.0
Switch context from “Files” view to “Hive View 2.0:
Slide10Step 3.1
This brings up the query editor
Slide11Step 3.2: Create an empty table
Enter:
create table
temp_drivers
(
col_value
STRING);
Then click on “Execute”
Slide12Result
Slide13S
tep 3.3
Enter:
LOAD DATA INPATH '/user/
maria_dev
/drivers.csv' OVERWRITE INTO TABLE
temp_drivers
;
Then click on “Execute”
Slide14G
o back to the “Files” view
What has changed?
The file drivers.csv is no longer there
Loading the file into Hive has “consumed” drivers.csv
Slide15Take a peek at the table “
temp_drivers
”
Enter:
select * from
temp_drivers
limit 10;
then click on “Execute”
Slide16Step 3.4: extract the fields we want
enter:
CREATE TABLE drivers (
driverId
INT, name STRING,
ssn
BIGINT, location STRING, certified STRING,
wageplan
STRING);
the click on “Execute”
Slide17Step 3.5:
QUERY TO EXTRACT DATA FROM
Temp_drivers
insert overwrite table drivers
SELECT
regexp_extract
(
col_value
, '^(?:([^,]*),?){1}', 1)
driverId
,
regexp_extract
(
col_value
, '^(?:([^,]*),?){2}', 1) name,
regexp_extract
(
col_value
, '^(?:([^,]*),?){3}', 1)
ssn
,
regexp_extract
(
col_value
, '^(?:([^,]*),?){4}', 1) location,
regexp_extract
(
col_value
, '^(?:([^,]*),?){5}', 1) certified,
regexp_extract
(
col_value
, '^(?:([^,]*),?){6}', 1)
wageplan
from
temp_drivers
;
Slide18Step 3.5:
create a query to
extract DATA FROM
Temp_drivers
insert overwrite table drivers
SELECT
regexp_extract
(
col_value
, '^(?:([^,]*),?){1}', 1)
driverId
,
regexp_extract
(
col_value
, '^(?:([^,]*),?){2}', 1) name,
regexp_extract
(
col_value
, '^(?:([^,]*),?){3}', 1)
ssn
,
regexp_extract
(
col_value
, '^(?:([^,]*),?){4}', 1) location,
regexp_extract
(
col_value
, '^(?:([^,]*),?){5}', 1) certified,
regexp_extract
(
col_value
, '^(?:([^,]*),?){6}', 1) wageplanfrom temp_drivers;
Slide19Take a peek at the first 10 rows of the resulting table
Enter:
select * from drivers limit 10;
Then click on “Execute”
Slide20Processing timesheet.csv
creating similar tables from timesheet.csv
Start by creating
temp_timesheet
with the following command:
CREATE TABLE
temp_timesheet
(
col_value
string
);
Then populate it with data from timesheet.csv
LOAD DATA INPATH '/user/
maria_dev
/timesheet.csv' OVERWRITE INTO TABLE
temp_timesheet
;
Finally, look at the first 10 lines of the table as a sanity check.
Select * from
temp_timesheet
limit 10;
CREATE TABLE temp_timesheet
(
col_value
string
);
3.6
CREATE TABLE
temp_timesheet
(
col_value
string);
LOAD DATA INPATH '/user/
maria_dev
/timesheet.csv' OVERWRITE INTO TABLE
temp_timesheet
;
Slide22Creating timesheet from
temp_timesheet
Start by creating an empty table timesheet
Enter:
CREATE
TABLE timesheet (
driverId
INT, week INT,
hours_logged
INT ,
miles_logged
INT);
Then click on “Execute”
Next populate the table by extracting columns
fromm
temp_timesheet
Slide23Extracting columns from
temp_timesheet
Enter:
insert overwrite table timesheet
SELECT
regexp_extract
(
col_value
, '^(?:([^,]*),?){1}', 1)
driverId
,
regexp_extract
(
col_value
, '^(?:([^,]*),?){2}', 1) week,
regexp_extract
(
col_value
, '^(?:([^,]*),?){3}', 1)
hours_logged
,
regexp_extract
(
col_value
, '^(?:([^,]*),?){4}', 1)
miles_logged
f
rom
temp_timesheet
;
Then click on “Execute”
Slide24Take a peek at the first 10 rows
enter:
select * from timesheet limit 10;
click on “Execute”
Slide25Now group timesheet data by
driverID
so that we can
sum the hours logged and
sum the miles logged
Enter:
SELECT
driverId
, sum(
hours_logged
), sum(
miles_logged
) FROM timesheet GROUP BY
driverId
;
Then click on “Execute”
Slide26Results after grouping by
driverID
and summing logged hours and logged miles
Slide27Combine columns from drivers and timesheet tables
Columns from drivers table:
driverId
name
Columns from timesheets table:
total_hours
total_miles
Join column:
driverId
Slide28Combine columns from drivers and timesheet tables
SELECT
d.driverId
, d.name,
t.total_hours
,
t.total_miles
from
drivers d
JOIN (SELECT
driverId
,
sum(
hours_logged
)
total_hours
, sum(
miles_logged
)
total_miles
FROM
timesheet
GROUP BY
driverId
) t
ON (
d.driverId
=
t.driverId
);
Slide29Results after entering command and executing
Slide30