Slide1
Hadoop or Hadon’t
Saqib Mustafa
Webinar Series 2016
Slide2
How Hadoop is used
Friction free data loading repository
Easy loading of data in HDFS
Easy scaling up
Scaling down can be an issue
Scalable engine for data transformation
Used to prep data for more expensive data warehouse systems
Exploratory analysis
Short term projects
Store historical data
Hadoop is a 10 year old open source project that was originally created to solve search-engine woes
Slide3
Challenges of Hadoop
Difficult to Secure
Security is a nightmare
Complexity of use
Inefficient for processing structured data
Built for semi-structured data
Lack of access control
Difficult to provide security throughout the environment
Requires Java programmers to write MapReduce jobs
IT Dependencies
Struggles with structured data from enterprise applications
Joins of multiple data sets can be slow
Fast analytics at Scale
Concurrency at scale
Hadoop does not support workload management
Serializes workload
Slide4
Today’s realities
Barriers to insight
Analytics hindered by slow, incomplete access to data
Data pipeline itself becomes a roadblock
Costly, complex infrastructure
Significant resources consumed to build and maintain data platforms
Data cannot be combined in one system and queried efficiently
Data challenges
Silos of diverse data from diverse sources, growing rapidly
Web
3rd-party
IoT
Enterprise apps
Hadoop & NoSQL
Data Warehouse(s)
Datamarts
Slide5
Limitations of current solutions
Legacy Data warehousing
Costly: upfront capital costs, overprovisioning, …
Complex: partitioning, indexes, replication, …
Inflexible: forklift migrations, fixed schema, …
Hadoop
NoSQL platforms like Hadoop
Complex: new skills & tools required like Java/MapReduce
Slow: poor performance on analytics at scale
Incomplete: patchwork of tools, incomplete SQL
Security: open access to environment, lack of enterprise controls
Slide6
Key unaddressed use cases
Integrated data analytics
Combine structured + semi-structured data for reporting & analytics
Exploratory & ad hoc analytics
Easy access to data for SQL analysts to explore data, identify correlations, build & test models
Datamart & data silo consolidation
Consolidate legacy datamarts to eliminate silos and serve data quicker
Slide7
Our vision: Reinvent the data warehouse
Data Warehousing
Performance & enterprise capabilities
Using SQL
Connectivity to existing tools in the ecosystem
Cloud
Elasticity & agility
Big Data
Flexibility & scalability
Can accommodate semi-structured data
JSON, Avro, XML
Slide8
What we built:
The Snowflake Elastic Data Warehouse
All-new SQL data warehouse
No legacy code or constraints
Designed for the cloud
Running in Amazon Web Services
Delivered as a service
No infrastructure, knobs or tuning to manage
All your data
Deploy Structured and Semi-structured data in one place
Slide9
Traditional Big Data Pipeline to Analyst
Sources: Web, CSV File, IoT
Preprocessing: Hadoop & NoSQL
Data Warehouse(s)
Datamarts
How JSON data is adopted
JSON File (Tweet Sample)
{name = “Saqib”}
{city = “Madison”}
{college = “Wisconsin”}
CSV File
Name, City, College
Saqib, Madison, Wisconsin
Customer Table
Cust ID, Name, City, College, Time stamp
0001, Saqib, Madison, Wisconsin, XX:YY:ZZ
Disadvantages
Involves extra steps from JSON to CSV to DB table
Any changes to the model need changes to the whole environment
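The extra steps from JSON to CSV to table can be sketched in Python. This is a minimal illustration, not the presenter's pipeline: the helper names `json_to_csv` and `csv_to_rows`, the `COLUMNS` list, and the extra `lang` field are hypothetical, and only the sample values come from the slide. It shows a JSON document being flattened to CSV, loaded into fixed-schema rows, and a new source field being lost because the schema was not updated everywhere.

```python
import csv
import io
import json

# Tweet document modeled on the slide's sample values.
tweet = json.loads('{"name": "Saqib", "city": "Madison", "college": "Wisconsin"}')

# The table schema is fixed up front; every stage below depends on it.
COLUMNS = ["name", "city", "college"]


def json_to_csv(doc, columns):
    """Step 1: flatten one JSON document into a CSV header and one row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns)
    writer.writeheader()
    # Only the known columns are written; anything else is dropped here.
    writer.writerow({c: doc.get(c, "") for c in columns})
    return buf.getvalue()


def csv_to_rows(text, cust_id, timestamp):
    """Step 2: load the CSV into table rows, adding the key and timestamp."""
    return [
        {"cust_id": cust_id, **row, "time_stamp": timestamp}
        for row in csv.DictReader(io.StringIO(text))
    ]


rows = csv_to_rows(json_to_csv(tweet, COLUMNS), "0001", "XX:YY:ZZ")
print(rows[0])

# A new source field (hypothetical "lang") is silently dropped until COLUMNS,
# the CSV step, and the table definition are all changed together.
tweet_v2 = {**tweet, "lang": "en"}
rows_v2 = csv_to_rows(json_to_csv(tweet_v2, COLUMNS), "0002", "XX:YY:ZZ")
print("lang" in rows_v2[0])  # False
```

The second print is the point of the sketch: the model change in the source does not reach the table unless every stage is edited.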
Slide10
Big Data Pipeline to Snowflake
Sources: Web, IoT
JSON File (Tweet Sample)
{name = “Saqib”}
{city = “Madison”}
{college = “Wisconsin”}
How it is stored
Customer Table
Cust ID, Time stamp, Tweet_text (type VARIANT)
0001, XX:YY:ZZ, {name = “Saqib”} {city = “Madison”} {college = “Wisconsin”}
Query
Select Cust_id, Tweet_text:name, Tweet_text:city, Tweet_text:college From Customer
Result
Cust ID, Name, City, College
0001, Saqib, Madison, Wisconsin
Advantages
Direct ingestion into table
No changes to schema for any change in the source data
Snowflake automatically ingests, columnarizes and optimizes the data
You can create joins on the variant type too
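The idea of keeping the whole document in one column and querying it by path can be sketched in plain Python. This is a stand-in for the concept only, not Snowflake's VARIANT implementation: the helpers `get_path` and `select`, the second customer, and the `lang` field are hypothetical.

```python
import json

# One row per tweet: fixed key columns plus the raw document, standing in
# for the slide's VARIANT column (an assumption of this sketch).
customer = [
    {
        "cust_id": "0001",
        "time_stamp": "XX:YY:ZZ",
        "tweet_text": json.loads(
            '{"name": "Saqib", "city": "Madison", "college": "Wisconsin"}'
        ),
    },
    {
        "cust_id": "0002",
        "time_stamp": "XX:YY:ZZ",
        # A new field in the source needs no schema change.
        "tweet_text": json.loads('{"name": "Ann", "city": "Austin", "lang": "en"}'),
    },
]


def get_path(doc, path):
    """Follow a dotted path into a nested document; None if a key is missing."""
    for key in path.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc


def select(table, paths):
    """Project cust_id plus the requested paths out of the document column."""
    return [
        (row["cust_id"], *(get_path(row["tweet_text"], p) for p in paths))
        for row in table
    ]


for result in select(customer, ["name", "city", "college"]):
    print(result)
```

A missing field comes back as `None` rather than breaking the load, which is the behavior the "no changes to schema" advantage relies on.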
Slide11
A new architecture: Multi-cluster, shared data
Shared-nothing
Decentralized, local storage
Single cluster
Hadoop / NoSQL architectures / Some data warehouses
Multi-cluster, shared data
Centralized, scale-out storage
Multiple, independent compute clusters (Test/Dev, Analysts, Sales)
Snowflake
Slide12
Enabling Strong Concurrency to allow Analytics at Scale
Centralized storage (database storage)
Instant, automatic scalability & elasticity
Single service
Scalable, resilient cloud services layer coordinates access & management
Services shown: Optimization, Management, Security, Metadata
Connectivity shown: ODBC, JDBC, Connectors
Elastically scalable compute
Multiple “virtual warehouse” compute clusters scale horsepower & concurrency
Virtual warehouses shown: Ad hoc/BI analytics, Development, Loading
Enabling Concurrency
Concurrency through scaling and warehouse/storage separation
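One way to picture warehouse/storage separation is a set of independent worker pools that all read a single shared copy of the data. The sketch below models only that isolation idea with Python thread pools; it is not how Snowflake schedules work, and the `VirtualWarehouse` class, the pool sizes, and the `clicks` values are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Shared, read-only storage: a single copy of the data (hypothetical values).
STORAGE = {"clicks": [3, 1, 4, 1, 5, 9, 2, 6]}


class VirtualWarehouse:
    """An independent compute pool that reads from the shared storage."""

    def __init__(self, name, workers):
        self.name = name
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def submit(self, query):
        # Work queues inside this warehouse only; other warehouses are unaffected.
        return self._pool.submit(query, STORAGE)

    def resize(self, workers):
        # Swap in a differently sized pool; storage is neither moved nor copied.
        old, self._pool = self._pool, ThreadPoolExecutor(max_workers=workers)
        old.shutdown(wait=True)

    def shutdown(self):
        self._pool.shutdown(wait=True)


loading = VirtualWarehouse("loading", workers=1)
analysts = VirtualWarehouse("analysts", workers=4)

total = analysts.submit(lambda s: sum(s["clicks"]))
peak = loading.submit(lambda s: max(s["clicks"]))
print(total.result(), peak.result())  # 31 9

analysts.resize(workers=8)
row_count = analysts.submit(lambda s: len(s["clicks"]))
print(row_count.result())  # 8

loading.shutdown()
analysts.shutdown()
```

Because each pool has its own queue, a slow load job cannot delay an analyst query, and resizing one pool changes nothing for the other.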
Slide13
No infrastructure, knobs, or tuning
Manual Query Optimization
Dynamic optimization, parallelization, and concurrency management
Infrastructure Management
Hardware, software, availability, resiliency, disaster recovery managed by Snowflake
Data Storage Management
Adaptive data distribution, automatic compression, automatic optimization
Metadata Management
Automatic statistics collection, scaling, and redundancy
Slide14
Fits with your Ecosystem
Diverse Data Sources (Big or Traditional)
Java
Scripting
Reporting & Analytics
Data Management & Transformation
Custom
Slide15
Protected by industrial-strength security
Authentication
Embedded multi-factor authentication
Federated authentication available
Access control
Role-based access control model
Granular privileges on all objects & actions
Data encryption
All data encrypted, always, end-to-end
Encryption keys managed automatically
External validation
Certified against enterprise-class requirements (e.g. SOC 2 Type II, HIPAA)
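A role-based access control model with privileges on objects and actions can be sketched as follows. This is a generic illustration of the model, not Snowflake's implementation: the `AccessControl` class and the role, user, and table names are hypothetical.

```python
class AccessControl:
    """Minimal role-based model: privileges go to roles, roles go to users."""

    def __init__(self):
        self._grants = {}      # role -> {(privilege, object)}
        self._user_roles = {}  # user -> {role}

    def grant(self, role, privilege, obj):
        self._grants.setdefault(role, set()).add((privilege, obj))

    def grant_role(self, role, user):
        self._user_roles.setdefault(user, set()).add(role)

    def is_allowed(self, user, privilege, obj):
        # Deny by default: allow only if some role of the user holds the grant.
        return any(
            (privilege, obj) in self._grants.get(role, set())
            for role in self._user_roles.get(user, set())
        )


ac = AccessControl()
ac.grant("analyst", "SELECT", "customer")
ac.grant("loader", "INSERT", "customer")
ac.grant_role("analyst", "maria")

print(ac.is_allowed("maria", "SELECT", "customer"))  # True
print(ac.is_allowed("maria", "INSERT", "customer"))  # False
```

Granting to roles rather than directly to users keeps each privilege tied to one action on one object, which is what makes the privileges granular.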
Slide16
Ad-tech analytics (JSON processing)
Scenario
Analyze and monetize large data set of website traffic
Growth through website acquisition
Pain Points
Large data volumes of traditional and JSON data requiring exploration and analysis
Separate data warehouse and Hadoop environments
No single version of truth
Unpredictable performance on both data warehouse and Hadoop
No Dev/Test environment
Environment:
100N Hadoop environment
36N Data warehouse
Varying formats of JSON from different websites
Ask.com, about.com, investopedia.com, okcupid.com, dictionary.com
Slide17
Enabling Ad-tech Analytics
Solution
Use Snowflake to load all data into one warehouse
Load JSON and traditional data from different sources into Snowflake for data analysts to directly explore, build, test, and deploy new algorithms
Tableau with native connection to Snowflake for analytics
Use Snowflake’s cloning feature to provide an up-to-date Dev/Test environment
“Because of [Snowflake], business intelligence is moving from a cost center to a value center”
Keith Lavery
Sr. Director, BI, Data and Analytics
Slide18
Serving diverse customers
Snowflake is faster, more flexible, and more scalable than the alternatives on the market. The fact that we don’t need to do any configuration or tuning is great because we can focus on analyzing data instead of on managing and tuning a data warehouse.
Craig Lancaster, CTO
Slide19
Recap
Snowflake is…
An all-new data warehouse
Designed for the cloud
Combined structured and semi-structured data in an optimized manner
Snowflake delivers...
One place for diverse data
Easier, faster analytics
Elastic scaling for any scale of data, workload, & concurrency
Without the cost and complexity of alternatives
Slide20
Sql> select questions from Audience
Historical Results (also a Snowflake Feature)
SQL
Select Name, Email, website, twitter, json_example From presenter Where question = “Not addressed”
Result
Saqib Mustafa
Saqib.mustafa@snowflake.net
www.snowflake.net
@snowflakedb & @drkoalz
THANK YOU
Slide21
CONTACT US
Slide22
Enabling Machine Learning analysis
Scenario
Digital advertising click fraud detection through analytics
Pain Points
Large data volumes of JSON data requiring exploration and analysis
Solution
Load JSON ad impression and click data into Snowflake for data analysts to directly explore, build, test, and deploy new algorithms
Able to speed up algorithms from 24 hours to 2 hours
Snowflake enables us to unlock large datasets to make it possible for business analysts, developers and account managers to ask their own questions directly of the data.
Tamer Hassan
Co-Founder and CTO
Slide23
www.synerzip.com
Ranjani Shah
ranjani.shah@synerzip.com
469.374.0500
Slide24
Synerzip in a Nutshell
Software product development partner for small/mid-sized technology companies
Exclusive focus on small/mid-sized technology companies, typically venture-backed companies in growth phase
By definition, all Synerzip work is the IP of its respective clients
Deep experience in full SDLC – design, dev, QA/testing, deployment
Dedicated team of high-caliber software professionals for each client
Seamlessly extends client’s local team offering full transparency
Stable teams with very low turnover
NOT just “staff augmentation”, but provides full management support
Actually reduces risk of development/delivery
Experienced team – uses appropriate level of engineering discipline
Practices Agile development – responsive yet disciplined
Reduces cost – dual-site team, 50% cost advantage
Offers long-term flexibility – allows (facilitates) taking offshore team captive – aka “BOT” option
Slide25
Synerzip Clients
Slide26
Next Webinar
To Estimate or #NoEstimates, That is the Question
Thursday, June 23, 2016 @ Noon CST
Presented by: Todd Little, Vice President of Product Development for IHS, a leading global provider of information, analytics, and expertise.
Slide27
Ranjani Shah
ranjani.shah@synerzip.com
469.374.0500
Connect with Synerzip
@Synerzip
linkedin.com/company/synerzip
facebook.com/Synerzip