
Hadoop or Hadon’t - PowerPoint Presentation

briana-ranney . @briana-ranney
Uploaded On 2017-09-02

Presentation Transcript

Slide1

Hadoop or Hadon’t

Saqib Mustafa

Webinar Series 2016

Slide2

How Hadoop is used

Friction-free data loading repository

Easy loading of data in HDFS

Easy scaling up

Scaling down can be an issue

Scalable engine for data transformation

Used to prep data for more expensive data warehouse systems

Exploratory analysis

Short-term projects

Store historical data

Hadoop is a 10-year-old open source project that was originally created to solve search-engine woes

Slide3

Challenges of Hadoop

Difficult to Secure

Security is a nightmare

Complexity of use

Inefficient for processing structured data

Built for semi-structured data

Lack of access control

Difficult to provide security throughout the environment

Requires Java programmers to write MapReduce jobs

IT dependencies

Struggles with structured data from enterprise applications

Joins of multiple data sets can be slow

Fast analytics at scale

Concurrency at scale

Hadoop does not support workload management

Serializes workload

Slide4

Today’s realities

Barriers to insight

Analytics hindered by slow, incomplete access to data

Data pipeline itself becomes a roadblock

Costly, complex infrastructure

Significant resources consumed to build and maintain data platforms

Data cannot be combined in one system and queried efficiently

Data challenges

Silos of diverse data from diverse sources, growing rapidly

[Diagram labels: Web, 3rd-party, IoT, Enterprise apps, Hadoop & NoSQL, Data Warehouse(s), Datamarts]

Slide5

Limitations of current solutions

Legacy data warehousing

Costly: upfront capital costs, overprovisioning, …

Complex: partitioning, indexes, replication, …

Inflexible: forklift migrations, fixed schema, …

NoSQL platforms like Hadoop

Complex: new skills & tools required, like Java/MapReduce

Slow: poor performance on analytics at scale

Incomplete: patchwork of tools, incomplete SQL

Security: open access to environment, lack of enterprise controls

Slide6

Key unaddressed use cases

Integrated data analytics

Combine structured + semi-structured data for reporting & analytics

Exploratory & ad hoc analytics

Easy access to data for SQL analysts to explore data, identify correlations, build & test models

Datamart & data silo consolidation

Consolidate legacy datamarts to eliminate silos and serve data quicker

Slide7

Our vision: Reinvent the data warehouse

Data Warehousing

Performance & enterprise capabilities

Using SQL

Connectivity to existing tools in the ecosystem

Cloud

Elasticity & agility

Big Data

Flexibility & scalability

Can accommodate semi-structured data: JSON, Avro, XML

Slide8

What we built: The Snowflake Elastic Data Warehouse

All-new SQL data warehouse

No legacy code or constraints

Designed for the cloud

Running in Amazon Web Services

Delivered as a service

No infrastructure, knobs or tuning to manage

All your data

Deploy structured and semi-structured data in one place

Slide9

Traditional Big Data Pipeline to Analyst

How JSON data is adopted

[Diagram labels: Web, IoT, Preprocessing, Hadoop & NoSQL, Data Warehouse(s), Datamarts]

JSON file (tweet sample):

{name =“Saqib”}
{city = “Madison”}
{college=“Wisconsin”}

CSV file:

Name | City | College
Saqib | Madison | Wisconsin

Customer table:

Cust ID | Name | City | College | Time stamp
0001 | Saqib | Madison | Wisconsin | XX:YY:ZZ

Disadvantages

Involves extra steps from JSON to CSV to DB table

Any changes to the model need changes to the whole environment
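The extra steps from JSON to CSV to DB table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fixed column list `COLUMNS` and the helper names `json_to_csv` and `csv_to_table` are hypothetical, the tweet is written as valid JSON rather than in the slide's shorthand, and the time stamp column is left out for brevity.

```python
import csv
import io
import json

# Hypothetical fixed column list shared by the CSV file and the warehouse table.
COLUMNS = ["name", "city", "college"]

def json_to_csv(json_lines, columns=COLUMNS):
    """Step 1: flatten JSON records into CSV text with a fixed set of columns."""
    buffer = io.StringIO()
    # Fields that are not in the fixed column list are silently dropped.
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for line in json_lines:
        writer.writerow(json.loads(line))
    return buffer.getvalue()

def csv_to_table(csv_text, first_cust_id=1):
    """Step 2: load the CSV text into table rows, adding a customer ID."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"cust_id": f"{cust_id:04d}", **row}
        for cust_id, row in enumerate(reader, start=first_cust_id)
    ]

tweets = ['{"name": "Saqib", "city": "Madison", "college": "Wisconsin"}']
table = csv_to_table(json_to_csv(tweets))
```

In this sketch a new field in the source JSON never reaches the table until `COLUMNS`, the CSV layout, and the table definition are all changed together, which is the second disadvantage listed on the slide.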

Slide10

Big Data Pipeline to Snowflake

[Diagram labels: Web, IoT, JSON File, Customer Table]

JSON file (tweet sample):

{name =“Saqib”}
{city = “Madison”}
{college=“Wisconsin”}

How it is stored (customer table):

Cust ID | Time stamp | Tweet_text (type VARIANT)
0001 | XX:YY:ZZ | {name = “Saqib”} {city = “Madison”} {college= “Wisconsin”}

Query

Select Cust_id, Tweet_text:name, Tweet_text:city, Tweet_text:college From Customer

Result

Cust ID | Name | City | College
0001 | Saqib | Madison | Wisconsin

Advantages

Direct ingestion into table

No changes to schema for any change in the source data

Snowflake automatically ingests, columnarizes and optimizes the data

You can create joins on the variant type too
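The idea behind the VARIANT column can be modeled in plain Python: the whole JSON document is kept in one column and individual fields are resolved only when a query asks for them. This is a rough sketch of the concept, not Snowflake's implementation (which also columnarizes and optimizes the data). The helper names `load_row` and `select` and the second record are made up for illustration.

```python
import json

def load_row(cust_id, timestamp, raw_json):
    """Ingest a record directly: the JSON document is kept whole in one column."""
    return {
        "cust_id": cust_id,
        "timestamp": timestamp,
        "tweet_text": json.loads(raw_json),
    }

def select(rows, *paths):
    """Return cust_id plus the requested tweet_text fields, resolved at query time.

    Fields missing from a document come back as None, loosely like SQL NULL.
    """
    return [
        (row["cust_id"], *(row["tweet_text"].get(path) for path in paths))
        for row in rows
    ]

customer = [
    load_row("0001", "XX:YY:ZZ",
             '{"name": "Saqib", "city": "Madison", "college": "Wisconsin"}'),
    # Made-up record with an extra field: it loads without any schema change.
    load_row("0002", "XX:YY:ZZ",
             '{"name": "Ana", "city": "Austin", "followers": 10}'),
]

result = select(customer, "name", "city", "college")
```

Because nothing about the table layout depends on the fields inside the document, a source that adds or drops fields needs no change on the loading side; only queries that want the new field have to mention it.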

Slide11

A new architecture: Multi-cluster, shared data

Shared-nothing | Multi-cluster, shared data
Decentralized, local storage | Centralized, scale-out storage
Single cluster | Multiple, independent compute clusters
Hadoop / NoSQL architectures / some data warehouses | Snowflake

[Diagram labels: Test/Dev, Analysts, Sales]

Slide12

Enabling Strong Concurrency to allow Analytics at Scale

[Diagram labels: Optimization, Management, Security, Metadata; ODBC, JDBC, Connectors; Ad hoc / BI analytics, Development, Loading; Database Storage]

Centralized storage: Instant, automatic scalability & elasticity

Single service: Scalable, resilient cloud services layer coordinates access & management

Elastically scalable compute: Multiple “virtual warehouse” compute clusters scale horsepower & concurrency

Enabling concurrency: Concurrency through scaling and warehouse/storage separation
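The separation of centralized storage from independent compute clusters can be illustrated with a toy Python model. Everything here is an assumption made for illustration: the `VirtualWarehouse` class, the `SHARED_STORAGE` dictionary, and the use of thread pools as stand-ins for compute clusters are not how Snowflake is built. The sketch only shows the shape of the idea: each workload queues on its own pool while all pools read the same data.

```python
from concurrent.futures import ThreadPoolExecutor

# Centralized storage: one shared copy of the data that every cluster reads.
SHARED_STORAGE = {"customer": [{"cust_id": "0001", "city": "Madison"}]}

class VirtualWarehouse:
    """A hypothetical compute cluster with its own, independently sized worker pool."""

    def __init__(self, name, workers):
        self.name = name
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def submit(self, table, predicate):
        """Queue a filter query on this warehouse only; other warehouses are unaffected."""
        return self.pool.submit(
            lambda: [row for row in SHARED_STORAGE[table] if predicate(row)]
        )

    def shutdown(self):
        self.pool.shutdown(wait=True)

# Separate workloads get separate compute, sized independently.
analytics = VirtualWarehouse("adhoc_bi", workers=4)
loading = VirtualWarehouse("loading", workers=1)

in_madison = analytics.submit("customer", lambda row: row["city"] == "Madison")
everything = loading.submit("customer", lambda row: True)

madison_rows = in_madison.result()
all_rows = everything.result()

analytics.shutdown()
loading.shutdown()
```

A long-running job on the `loading` pool cannot delay queries on the `analytics` pool, because the two never share a queue; that is the concurrency argument the slide makes for warehouse/storage separation.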

Slide13

No infrastructure, knobs, or tuning

Manual

Query optimization: Dynamic optimization, parallelization, and concurrency management

Infrastructure management: Hardware, software, availability, resiliency, disaster recovery managed by Snowflake

Data storage management: Adaptive data distribution, automatic compression, automatic optimization

Metadata management: Automatic statistics collection, scaling, and redundancy

Slide14

Fits with your Ecosystem

Diverse data sources (big or traditional)

Java

Scripting

Reporting & Analytics

Data Management & Transformation

Custom

Slide15

Protected by industrial-strength security

Authentication

Embedded multi-factor authentication

Federated authentication available

Access control

Role-based access control model

Granular privileges on all objects & actions

Data encryption

All data encrypted, always, end-to-end

Encryption keys managed automatically

External validation

Certified against enterprise-class requirements (e.g. SOC 2 Type II, HIPAA)
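A role-based access control model with granular privileges on objects and actions can be sketched as follows. This is a generic toy model, not Snowflake's API: the `AccessControl` class, its method names, and the role and user names are all hypothetical.

```python
from collections import defaultdict

class AccessControl:
    """Toy role-based model: privileges are granted to roles, and roles to users."""

    def __init__(self):
        self.role_privileges = defaultdict(set)  # role -> {(action, object_name)}
        self.user_roles = defaultdict(set)       # user -> {role}

    def grant_privilege(self, role, action, object_name):
        """Give a role the right to perform one action on one object."""
        self.role_privileges[role].add((action, object_name))

    def grant_role(self, user, role):
        """Assign a role to a user."""
        self.user_roles[user].add(role)

    def is_allowed(self, user, action, object_name):
        """A user may act only if one of their roles holds that exact privilege."""
        return any(
            (action, object_name) in self.role_privileges.get(role, ())
            for role in self.user_roles.get(user, ())
        )

acl = AccessControl()
acl.grant_privilege("analyst", "select", "customer")
acl.grant_role("alice", "analyst")

can_read = acl.is_allowed("alice", "select", "customer")
can_delete = acl.is_allowed("alice", "delete", "customer")
```

Users never receive privileges directly in this model; changing what analysts may do means changing one role rather than every user who holds it.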

Slide16

Ad-tech analytics (JSON processing)

Scenario

Analyze and monetize large data set of website traffic

Growth through website acquisition

Pain points

Large data volumes of traditional and JSON data requiring exploration and analysis

Separate data warehouse and Hadoop environments

No single version of truth

Unpredictable performance on both data warehouse and Hadoop

No Dev/Test environment

Environment:

100N Hadoop environment

36N Data warehouse

Varying formats of JSON from different websites: Ask.com, about.com, investopedia.com, okcupid.com, dictionary.com

Slide17

Enabling Ad-tech Analytics

Solution

Use Snowflake to load all data into one warehouse

Load JSON and traditional data from different sources into Snowflake for data analysts to directly explore, build, test, and deploy new algorithms

Tableau with native connection to Snowflake for analytics

Use Snowflake’s cloning feature to provide an up-to-date Dev/Test environment

“Because of [Snowflake], business intelligence is moving from a cost center to a value center”

Keith Lavery, Sr. Director, BI, Data and Analytics

Slide18

Serving diverse customers

“Snowflake is faster, more flexible, and more scalable than the alternatives on the market. The fact that we don’t need to do any configuration or tuning is great because we can focus on analyzing data instead of on managing and tuning a data warehouse.”

Craig Lancaster, CTO

Slide19

Recap

Snowflake is

An all-new data warehouse

Designed for the cloud

Combines structured and semi-structured data in an optimized manner

Snowflake delivers...

One place for diverse data

Easier, faster analytics

Elastic scaling for any scale of data, workload, & concurrency

Without the cost and complexity of alternatives

Slide20

SQL> select questions from Audience

Historical Results (also a Snowflake feature)

SQL

Select Name, Email, website, twitter, json_example From presenter Where question = “Not addressed”

Result

Saqib Mustafa

Saqib.mustafa@snowflake.net

www.snowflake.net

@snowflakedb & @drkoalz

THANK YOU

Slide21

CONTACT US

Saqib Mustafa

Saqib.mustafa@snowflake.net

www.snowflake.net

@snowflakedb & @drkoalz

THANK YOU

Slide22

Enabling Machine Learning analysis

Scenario

Digital advertising click fraud detection through analytics

Pain points

Large data volumes of JSON data requiring exploration and analysis

Solution

Load JSON ad impression and click data into Snowflake for data analysts to directly explore, build, test, and deploy new algorithms

Able to speed up algorithms from 24 hours to 2 hours

“Snowflake enables us to unlock large datasets to make it possible for business analysts, developers and account managers to ask their own questions directly of the data.”

Tamer Hassan, Co-Founder and CTO

Slide23

www.synerzip.com

Ranjani Shah

ranjani.shah@synerzip.com

469.374.0500

Slide24

Synerzip in a Nutshell

Software product development partner for small/mid-sized technology companies

Exclusive focus on small/mid-sized technology companies, typically venture-backed companies in growth phase

By definition, all Synerzip work is the IP of its respective clients

Deep experience in full SDLC – design, dev, QA/testing, deployment

Dedicated team of high-caliber software professionals for each client

Seamlessly extends client’s local team, offering full transparency

Stable teams with very low turnover

NOT just “staff augmentation”, but provides full management support

Actually reduces risk of development/delivery

Experienced team – uses appropriate level of engineering discipline

Practices Agile development – responsive yet disciplined

Reduces cost – dual-site team, 50% cost advantage

Offers long-term flexibility – allows (facilitates) taking offshore team captive – aka “BOT” option

Slide25

Synerzip Clients

Slide26

Next Webinar

To Estimate or #NoEstimates, That is the Question

Presented by: Todd Little, Vice President of Product Development for IHS, a leading global provider of information, analytics, and expertise.

Thursday, June 23, 2016 @ Noon CST

Slide27

Ranjani Shah

ranjani.shah@synerzip.com

469.374.0500

Connect with Synerzip

@Synerzip

linkedin.com/company/synerzip

facebook.com/Synerzip
