Harnessing the Power of Hadoop: Cloud Scale with Microsoft Azure HDInsight
Lance Olson, Partner Group Program Manager
BRK2557
Agenda
Big data and traditional data warehouse
Big data in the cloud
Cloud versus on-premises
Patterns and case studies
HDInsight workloads
Big Data vs Traditional DW
Two Approaches to Information Management for Analytics: Top-Down + Bottom-Up
Top-Down (Deductive): Theory, Hypothesis, Observation, Confirmation
Bottom-Up (Inductive): Observation, Pattern, Hypothesis, Theory
Diagram (axes: VALUE versus DIFFICULTY, from INFORMATION to OPTIMIZATION):
Descriptive Analytics: What happened?
Diagnostic Analytics: Why did it happen?
Predictive Analytics: What will happen?
Prescriptive Analytics: How can we make it happen?
Data Warehousing Uses A Top-Down Approach
Understand Corporate Strategy
Gather Requirements: Business Requirements, Technical Requirements
Implement Data Warehouse: Dimension Modelling, Physical Design, ETL Design, ETL Development, Reporting & Analytics Design, Reporting & Analytics Development, Setup Infrastructure, Install and Tune
Diagram: Data sources (OLTP, ERP, CRM, LOB) -> ETL -> Data warehouse -> BI and analytics (Dashboards, Reporting)
The “data lake” Uses A Bottom-Up Approach
Ingest all data regardless of requirements
Store all data in native format without schema definition
Do analysis using analytic engines like Hadoop
Analysis: Interactive queries, Batch queries, Machine Learning, Data warehouse, Real-time analytics
Sources: Devices, Relational, Sensors, Video, LOB applications, Web, Social, Clickstream
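The "store in native format, apply structure later" idea can be sketched as schema-on-read. This is a minimal illustration in plain Python, not an HDInsight or Hadoop API: the sample records, the field names, and the `read_with_schema` helper are all hypothetical.

```python
import json

# Raw records kept exactly as ingested: no schema is enforced at write time.
# The field names below are hypothetical sample data.
RAW_STORE = [
    '{"device": "meter-1", "temp_c": "21.5", "ts": "2015-05-04T10:00:00"}',
    '{"device": "meter-2", "pressure_kpa": 101.3}',
    '{"user": "u42", "page": "/home"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: keep rows that have every requested
    field and cast each field with the given converter."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows


# Two different questions, two different schemas, same untouched raw data.
temperature_rows = read_with_schema(RAW_STORE, {"device": str, "temp_c": float})
clickstream_rows = read_with_schema(RAW_STORE, {"user": str, "page": str})
```

The raw store is never rewritten; each analysis brings its own view of the data, which is the contrast with the requirements-first warehouse design on the previous slide.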
Data Lake + Data Warehouse Better Together
What happened?
What is happening?
Why did it happen?
What are key relationships?
What will happen?
What if?
How risky is it?
What should happen?
What is the best option?
How can I optimize?
Diagram: Data sources (OLTP, ERP, CRM, LOB) -> ETL -> Data warehouse -> BI and analytics (Dashboards, Reporting)
Sources: Devices, Relational, Sensors, Video, LOB applications, Web, Social, Clickstream
Big Data in the Cloud
Why Cloud + Big Data?
Massive Compute and Storage
Deployment expertise
Data of all Volume, Variety, Velocity
Speed
Scale
Economics
Always Up, Always On
Open and flexible
Time to value
Why Microsoft Azure?
On-premises: Servers, Software, Appliances
Azure: Storage, HDInsight, Data Factory, ML, Stream Analytics, Database, DocumentDB, Search, Event Hubs
Azure Facts
>4 trillion objects in Azure
300,000-1M+ requests per second
Double compute and storage every 6 months
Introducing Azure HDInsight
Microsoft’s cloud Hadoop offering
100% open source Apache Hadoop
Built on the latest releases across Hadoop (2.6)
Up and running in minutes with no hardware to deploy
Harness existing .NET and Java skills
Utilize familiar BI tools for analysis including Microsoft Excel
Hadoop Is Being Run Everywhere in the World
Cloud and On-Premises “vs” or “+”?
Cloud + On-Premises Hybrid Scenarios
On-Premises
Development, Testing, & Pilot
IoT
Applications
Other Azure Services such as BI / ML
Use Cases: Let the data decide
Use Case: Where?
Active Archive / Compliance Reporting: Restricted data = “down here”. “Up there” could be considered for other scenarios.
ETL / Data Warehouse Optimization: Often has “down here” gravity, but cloud-based ETL offload has big payout
Smart Meter Analysis: Typically born “up there”
Single View of Customer: May have heavy “down here” gravity; unless you’re using SaaS apps, then why not “up there”?
New Data for Product Management: Restricted data = “down here”. “Up there” could be considered for many scenarios.
Vehicle Data for Transportation/Logistics: Why not “up there”?
Vehicle Data for Insurance: May have heavy “down here” gravity (ex. join w/risk data, etc.)
Use Cases: Patterns and Case Studies
Rockwell Automation is partnered with one of the six oil and gas super majors to build unmanned internet-connected gas dispensers. Each dispenser emits real-time management metrics allowing them to detect anomalies and predict when proactive maintenance needs to occur.
Store sensor data every 5 minutes
Temperature, pressure, vibration, etc.
Tens of thousands of data points / second
Diagram components: Data Factory, Azure Blobs, Azure HDInsight (Hive, Pig), Azure SQL DB, Power BI for O365, Mobile Notification Hub, Mobile Device, Real-time notification
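Detecting anomalies in a sensor series can be sketched with a rolling z-score. The slide does not say which method Rockwell Automation uses, so this is only an assumed illustration: the `find_anomalies` helper, the window size, the threshold, and the sample pressure values are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(readings, window=5, threshold=3.0):
    """Return indexes of readings more than `threshold` standard deviations
    away from the mean of the preceding `window` readings."""
    anomalies = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if readings[i] != mu:
                anomalies.append(i)
            continue
        if abs(readings[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical pressure samples, one per 5-minute interval.
pressure = [30.1, 30.0, 30.2, 29.9, 30.1, 30.0, 45.0, 30.1, 30.0]
anomalous_indexes = find_anomalies(pressure)
```

In a pipeline like the one on the slide, a check of this kind would run over data landed in storage, and a flagged index would be what triggers the real-time notification.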
JustGiving wanted to harness the power of their data by using network science to map people’s connections and relationships so that they could connect people with the causes they care about. Based on 15 years of data, the JustGiving GiveGraph is the world’s largest ecosystem of giving behavior. It contains more than 81 million person nodes, thousands of causes and 285 million connections and is the engine that drives JustGiving’s social platform, enabling levels of personalization and engagement that a traditional infrastructure would be unable to deliver.
Diagram components: SQL Server, On-premises, Agent, Azure Blobs, Azure HDInsight, GiveGraph, Azure Tables, Web API, Website + Event store, Service Bus, Real-time Event, Serves results, Azure Cache, Activity Feeds
Common Hadoop Patterns
Single view of entity:
Customer, Product, Machine, etc.
Predictive Analytics:
Data Scientists and Analysts finding patterns and correlations
New models emerge to explain business performance
New predictions emerge based on previously disassociated data
Data Discovery:
Large amounts of machine, sensor, clickstream, and geolocation data
New value emerges when correlated with data from product, customer, and inventory catalogs
Use Cases:
Ad Placement and Offers
Active Archive
ETL Offload
Single View of Customer
Recommendation Engine
Customer Targeting and Acquisition
New Data for Product Management
Vehicle Data
Web Personalization and Experience
HDInsight Workloads
HDInsight Supports Hive
SQL-like queries on Hadoop data in HDInsight
HDInsight provides easy-to-use graphical query interface for Hive
HiveQL is a SQL-like language (subset of SQL)
Hive structures include well-understood database concepts such as tables, rows, columns, partitions
Compiled into MapReduce jobs that are executed on Hadoop
Dramatic performance gains with Stinger/Tez
Stinger is a Microsoft, Hortonworks and OSS driven initiative to bring interactive queries with Hive
Brings query execution engine technology from Microsoft SQL Server to Hive
Performance gains up to 100x
Chart: Microsoft contribution to Apache code, sample query time
Hive 10: 1400s
HDP 1.3 / Hive 11: 44.3s (32x speedup)
HDP 2.0 (Hadoop 2.0): 35.1s (40x speedup)
HDP 2.1: 15s (100x speedup)
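The speedup labels on the chart can be checked by dividing the baseline query time by each later time. The sketch below assumes the times pair with the releases in the order they appear on the slide (1400s for Hive 10, then 44.3s, 35.1s, and 15s); the variable names are illustrative.

```python
# Query times in seconds as shown on the slide's chart.
baseline_seconds = 1400.0  # Hive 10
measured = {
    "HDP 1.3 / Hive 11": 44.3,
    "HDP 2.0": 35.1,
    "HDP 2.1": 15.0,
}


def speedup(baseline, improved):
    """Speedup factor: how many times faster the improved run is."""
    return baseline / improved


speedups = {name: speedup(baseline_seconds, secs) for name, secs in measured.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

Under that pairing the first two factors round to the 32x and 40x shown on the slide, while 1400s / 15s comes to roughly 93x, so the "100x" label reads as a rounded-up figure.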
HDInsight Supports HBase
NoSQL database on data in HDInsight
Columnar, NoSQL database
Runs on top of the Azure Blob Stores in HDInsight
Provides flexibility in that new columns can be added to column families at any time
Diagram: Name Node, Job Tracker, HMaster, Coordination; four worker nodes, each with Data Node, Task Tracker, Region Server
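The point about adding columns to column families at any time can be illustrated with a toy in-memory model. This is not the HBase client API: the `ToyTable` class, its methods, and the row and column names are hypothetical and only mimic the data model, in which column families are declared up front while column qualifiers are not.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a column-family table (not the HBase API)."""

    def __init__(self, column_families):
        # Column families are declared when the table is created.
        self.column_families = set(column_families)
        # row key -> family -> column qualifier -> value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # Qualifiers are not declared up front, so a new column
        # can appear on any write.
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key, family, qualifier, default=None):
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier, default)


table = ToyTable(["metrics"])
table.put("dispenser-1", "metrics", "temperature", 21.5)
# A column that did not exist before is added with no schema change.
table.put("dispenser-2", "metrics", "vibration", 0.02)
```

Rows simply omit the columns they never wrote, so `dispenser-1` has no `vibration` value and no placeholder is stored for it.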
Storm for Azure HDInsight
Stream analytics for Near-Real Time Processing
Consumes millions of real-time events from a scalable event broker (e.g. Apache Kafka, Azure Event Hub)
Performs time-sensitive computation
Output to persistent stores, dashboards or devices
Customizable with Java + .NET
Deeply integrated with Visual Studio
Diagram stages:
Event producers: Applications, Devices, Sensors, Web and Social
Collection: Cloud gateways (web APIs), Field gateways
Event Queuing System: Event Hubs, Kafka / RabbitMQ / ActiveMQ
Transformation: Stream processing (Apache Storm on HDInsight, Azure Stream Analytics), Storage adapters
Long-term storage: HDFS, Azure DBs, Azure storage, HBase
Presentation and action: Search and query, Data analytics (Excel), Web/thick client dashboards, Live Dashboards, Devices to take action
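The "consume events, perform time-sensitive computation, emit results" flow can be sketched with plain Python generators. This is not the Storm API: `event_source` and `windowed_counts` are hypothetical stand-ins for a spout and a bolt, and the events are made-up sample data assumed to arrive in time order.

```python
from collections import Counter


def event_source(events):
    """Stand-in for a spout: yields events one at a time."""
    for event in events:
        yield event


def windowed_counts(stream, window_seconds):
    """Stand-in for a bolt: counts events per device in tumbling windows
    and emits (window_start, counts) each time a window closes."""
    current_start = None
    counts = Counter()
    for timestamp, device in stream:
        start = timestamp - (timestamp % window_seconds)
        if current_start is None:
            current_start = start
        if start != current_start:
            yield current_start, dict(counts)
            current_start, counts = start, Counter()
        counts[device] += 1
    if counts:
        yield current_start, dict(counts)


# Hypothetical (timestamp_seconds, device_id) events, already in time order.
events = [(0, "a"), (3, "b"), (7, "a"), (12, "a"), (14, "b"), (25, "a")]
results = list(windowed_counts(event_source(events), window_seconds=10))
```

Each emitted tuple corresponds to what a downstream stage (a dashboard or a long-term store in the diagram) would receive; a real topology would distribute this work across nodes and handle late or out-of-order events, which this sketch does not.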
Azure HDInsight running Linux
Choice of Windows or Linux clusters
Managed & supported by Microsoft
Re-use common tools, documentation, samples from Hadoop/Linux ecosystem
Add Hadoop projects that were authored on Linux to HDInsight
Easier transition from on-premises to cloud
Microsoft Makes Hadoop Easier
Deep Visual Studio Integration
Debug Hive jobs through YARN logs or troubleshoot Storm topologies
Visualize Hadoop clusters, tables, and storage
Submit Hive queries, Storm topologies (C# or Java spouts/bolts)
IntelliSense for authoring Hive jobs and Storm business logic
Introducing Azure Data Lake
A hyper scale repository for big data analytic workloads
Built for Hadoop
Hyper Scale, Massive throughput
Enterprise Ready
Sign up: http://azure.com/datalake
Please evaluate this session
Your feedback is important to us!
Visit Myignite at http://myignite.microsoft.com or download and use the Ignite Mobile App with the QR code above.