Friction-Free ELT: Automating Data Transformation with Cloudera Impala

Marcel Kornacker // Cloudera, Inc.


Presentation Transcript

Slide1

Marcel Kornacker // Cloudera, Inc.

Friction-Free ELT: Automating Data Transformation with Cloudera Impala

Slide2

Outline

Evolution of ELT in the context of analytics

traditional systems

Hadoop today

Cloudera’s vision for ELT

make most of it disappear

automate data transformation

Slide3

Traditional ELT

Extract: physical extraction from source data store

could be an RDBMS acting as an operational data store

or log data materialized as JSON

Load: the targeted analytic DBMS converts the transformed data into its binary format (typically columnar)

Transform:

data cleansing and standardization

conversion of naturally complex/nested data into a flat relational schema

Slide4

Three aspects to the traditional ELT process:

semantic transformation such as data standardization/cleansing
» makes data more queryable, adds value

representational transformation: from source to target schema (from complex/nested to flat relational)
» “lateral” transformation that doesn’t change semantics, adds operational overhead

data movement: from source to staging area to target system
» adds yet more operational overhead

Traditional ELT

Slide5

The goals of “analytics with no ELT”:

simplify aspect 1 (semantic transformation)

eliminate aspects 2 and 3 (representational transformation and data movement)

Traditional ELT

Slide6

A typical ELT workflow with Hadoop looks like this:

raw source data initially lands in HDFS (examples: text/xml/json log files) …


ELT with Hadoop Today

Slide7

… map that data into a table to make it queryable …


CREATE TABLE RawLogData (…) ROW FORMAT DELIMITED LOCATION '/raw-log-data/';

ELT with Hadoop Today

Slide8

… create the target table at a different location …


CREATE TABLE LogData (…) STORED AS PARQUET LOCATION '/log-data/';

ELT with Hadoop Today

Slide9

… convert the raw source data to the target format …


INSERT INTO LogData SELECT * FROM RawLogData;

ELT with Hadoop Today

Slide10

… the data is then available for batch reporting/analytics (via Impala, Hive, Pig, Spark) or interactive analytics (via Impala, Search)


ELT with Hadoop Today

Slide11

Compared to traditional ELT, this has several advantages:

Hadoop acts as a centralized location for all data: raw source data lives side by side with the transformed data

data does not need to be moved between multiple platforms/clusters

data in the raw source format is queryable as soon as it lands, although at reduced performance compared to an optimized columnar data format

ELT with Hadoop Today
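As a rough illustration (not from the original deck), the same query can be answered from either table: the raw-format table responds as soon as files land, the Parquet table responds faster once the conversion INSERT has run.

-- available immediately after raw files land, but scans text/JSON
SELECT COUNT(*) FROM RawLogData;

-- same result from the columnar copy, faster, but only after conversion
SELECT COUNT(*) FROM LogData;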

Slide12

However, even this still has drawbacks:

new data needs to be loaded periodically into the target table

doing that reliably and within SLAs can be a challenge

you now have two tables:

one with current but slow data

another with lagging but fast data
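A rough sketch (not from the deck) of the recurring load step behind these drawbacks; it assumes both tables carry an event_day column, which the slides do not show:

-- has to be scheduled, monitored, and retried by the user for every new batch
INSERT INTO LogData
SELECT * FROM RawLogData
WHERE event_day = '2015-06-01';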


Another user-managed data pipeline!

ELT with Hadoop Today

Slide13

No explicit loading/conversion step to move raw data into a target table

A single view of the data that is up-to-date and (mostly) in an efficient columnar format

[Diagram: raw text/XML/JSON data automatically converted to Parquet and transformed with custom logic, exposed as a single view to Impala, Hive, Spark, and Pig]

A Vision for Analytics with No ELT

Slide14

support for complex/nested schemas

» avoid remapping of raw data into a flat relational schema

background and incremental data conversion
» retain in-place single view of entire data set, with most data being in an efficient format

bonus: schema inference and schema evolution
» start analyzing data as soon as it arrives, regardless of its complexity

A Vision for Analytics with No ELT

Slide15

Standard relational: all columns have scalar values:

CHAR(n), DECIMAL(p, s), INT, DOUBLE, TIMESTAMP, etc.

Complex types: structs, arrays, maps

in essence, a nested relational schema

Supported file formats: Parquet, JSON, XML, Avro

Design principle for SQL extensions: maintain SQL’s way of dealing with multi-valued data

Support for Complex Schemas in Impala

Slide16

Sample schema

CREATE TABLE Customers (
  cid BIGINT,
  address STRUCT {
    street STRING,
    zip INT,
    city STRING
  },
  orders ARRAY<STRUCT {
    oid BIGINT,
    total DECIMAL(9, 2),
    items ARRAY<STRUCT {
      iid BIGINT,
      qty INT,
      price DECIMAL(9, 2)
    }>
  }>
)

Support for Complex Schemas in Impala

Slide17

Paths in expressions: generalization of column references to drill through structs

Can appear anywhere a conventional column reference is legal

SELECT address.zip, c.address.street

FROM Customers c
WHERE address.city = 'Oakland'

SQL Syntax Extensions: Structs

Slide18

Basic idea: nested collections are referenced like tables in the FROM clause

SELECT SUM(i.price * i.qty)

FROM Customers.orders.items i

WHERE i.price > 10

SQL Syntax Extensions: Arrays and Maps

Slide19

Parent/child relationships don’t require join predicates

SELECT c.cid, o.total
FROM Customers c, c.orders o

identical to

SELECT c.cid, o.total
FROM Customers c INNER JOIN c.orders o

SQL Syntax Extensions: Arrays and Maps

Slide20

All join types are supported (INNER, OUTER, SEMI, ANTI)

Advanced querying capabilities via correlated inline views and subqueries

SELECT c.cid

FROM Customers c LEFT ANTI JOIN c.orders o

SELECT c.cid

FROM Customers c

WHERE EXISTS

(SELECT * FROM c.orders.items WHERE iid = 117)

SQL Syntax Extensions: Arrays and Maps

Slide21

Aggregates over collections are concise and easy to read

SELECT c.cid,
  COUNT(c.orders), AVG(c.orders.items.price)
FROM Customers c

instead of

SELECT c.cid, a, b
FROM Customers c,
  (SELECT COUNT(*) AS a FROM c.orders),
  (SELECT AVG(price) AS b FROM c.orders.items)

SQL Syntax Extensions: Arrays and Maps

Slide22

Aggregates over collections can replace subqueries

SELECT c.cid
FROM Customers c
WHERE COUNT(c.orders) > 10

instead of

SELECT c.cid
FROM Customers c
WHERE (SELECT COUNT(*) FROM c.orders) > 10

SQL Syntax Extensions: Arrays and Maps

Slide23

Parquet optimizes storage of complex schemas by inlining structural data into the columnar data

(repetition/definition levels)

No performance penalty for accessing nested collections!

Parquet translates logical hierarchy into physical collocation:

you don’t need parent-child joins

queries run faster!
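As a hypothetical illustration (not on the slides) of repetition/definition levels, take the Customers.orders.total path from the sample schema and treat total as required, so the single repeated field on the path (orders) gives a maximum level of 1 for both:

value    repetition level    definition level
10.00    0                   1    -- first order total of a new customer record
20.00    1                   1    -- further order of the same customer
(none)   0                   0    -- a customer with no orders

Because this structural information travels with the column values, nested collections can be scanned without reassembling rows through joins.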

Complex Schemas with Parquet

Slide24

Automated data conversion takes care of

format conversion: from text/JSON/Avro/… to Parquet

compaction: multiple smaller files are coalesced into a single larger one

Sample workflow:

create table for data:

CREATE TABLE LogData (…) WITH CONVERSION TO PARQUET;

attach data to table:

LOAD DATA INPATH '/raw-log-data/file1' INTO LogData SOURCE FORMAT JSON;

Automated Data Conversion

Slide25

Automated Data Conversion

[Diagram: JSON files converted in the background to Parquet; a single-table view spans both formats and is queried by Impala, Hive, Spark, and Pig]

Slide26

Conversion process

atomic: the switch from the source to the target data files is atomic from the perspective of a running query (but any running query sees the full data set)

redundant: with option to retain original data

incremental: Impala’s catalog service automatically detects new data files that are not in the target format

Automated Data Conversion

Slide27

Specify data transformation via “view” definition:

derived table is expressed as Select, Project, Join, Aggregation

custom logic incorporated via UDFs, UDAs

CREATE DERIVED TABLE CleanLogData AS

SELECT StandardizeName(name), StandardizeAddr(addr, zip), …

FROM LogData

STORED AS PARQUET;

Automating Data Transformation: Derived Tables

Slide28

From the user’s perspective:

table is queryable like any other table (but doesn’t allow INSERTs)

reflects all data visible in source tables at time of query (not: at time of CREATE)

performance is close to that of a table created with CREATE TABLE … AS SELECT (i.e., that of a static snapshot)

Automating Data Transformation: Derived Tables
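For contrast, the static snapshot mentioned above could be built with a plain CTAS (standard Impala syntax; the snapshot table name here is made up), but unlike the derived table it would not reflect rows that arrive in LogData afterwards:

CREATE TABLE CleanLogDataSnapshot STORED AS PARQUET AS
SELECT StandardizeName(name), StandardizeAddr(addr, zip), …
FROM LogData;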

Slide29

From the system’s perspective:

table is Union of

physically materialized data, derived from input tables as of some point in the past

view over yet-unprocessed data of input tables

table is updated incrementally (and in the background) when new data is added to input tables

Automating Data Transformation: Derived Tables
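To make the union concrete, a conceptual sketch only (hypothetical, not actual Impala syntax; CleanLogData_materialized is a made-up name for the materialized portion): a query against CleanLogData behaves as if it ran over

SELECT * FROM CleanLogData_materialized    -- converted as of the last background run
UNION ALL
SELECT StandardizeName(name), StandardizeAddr(addr, zip), …
FROM LogData;                              -- restricted to files not yet materialized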

Slide30

Automating Data Transformation: Derived Tables

[Diagram: the derived table appears as a single-table view combining already-materialized data with a delta query over new base-table updates, queried by Impala, Hive, Spark, and Pig]

Slide31

Schema inference from data files is useful to reduce the barrier to analyzing complex source data

as an example, log data often has hundreds of fields

the time required to create the DDL manually is substantial

Example: schema inference from structured data files

available today:

CREATE TABLE LogData LIKE PARQUET '/log-data.pq'

future formats: XML, JSON, Avro

Schema Inference and Schema Evolution

Slide32

Schema evolution:

a necessary follow-on to schema inference: every schema evolves over time; explicit maintenance is as time-consuming as the initial creation

algorithmic schema evolution requires sticking to generally safe schema modifications: adding new fields

adding new top-level columns

adding fields within structs

Schema Inference and Schema Evolution
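For a sense of what the automation replaces, adding new top-level columns by hand would be an ALTER TABLE statement like the following (ADD COLUMNS is existing Impala/Hive DDL; the column names are made up):

ALTER TABLE LogData ADD COLUMNS (referrer STRING, session_id BIGINT);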

Slide33

Example workflow:

scans data to determine new columns/fields to add

synchronous: if there is an error, the ‘load’ is aborted and the user notified

LOAD DATA INPATH '/path' INTO LogData SOURCE FORMAT JSON WITH SCHEMA EXPANSION;

Schema Inference and Schema Evolution

Slide34

CREATE TABLE … LIKE <File>:

available today for Parquet

Impala 2.3+ for Avro, JSON, XML

Nested types: Impala 2.3

Background format conversion: Impala 2.4

Derived tables: > Impala 2.4

Timeline of Features in Impala

Slide35

Hadoop offers a number of advantages over traditional multi-platform ETL solutions:

availability of all data sets on a single platform

data becomes accessible through SQL as soon as it lands

However, this can be improved further:

a richer analytic SQL that is extended to handle nested data

an automated background conversion process that preserves an up-to-date view of all data while providing BI-typical performance

a declarative transformation process that focuses on application semantics and removes operational complexity

simple automation of initial schema creation and subsequent maintenance that makes dealing with large, complex schemas less labor-intensive

Conclusion

Slide36

Thank you

Marcel Kornacker | marcel@cloudera.com
