Slide1
Visualizing Big Data
David Schmittdiel
CSC 9010-003
9/16/2014
Slide2
Outline
Me
Big Data review and background
Problem statement
Case study: StubHub
Slide3
Intro
I don’t have a Computer Science background (but I really, really regret it)
MATLAB
PHP
MySQL Oracle
Manager of Business Intelligence Development at StubHub
Bringing actionable data to the masses
Self-service, on-demand, exploratory BI
Data discovery through visualization
Automation
Slide4
Big Data, Big Ruse?
Stephen Few: “What the hell is Big Data anyway?”
BI vendor-driven responses:
Increased data volume AND velocity
New data sources (unstructured)
Fundamental question:
Do you really need Big Data?
“Until you’ve figured out how to use the data that you already have, collecting more will only distract you from the real task. Time spent collecting more data is time that could be better spent weaving it into something meaningful.”
Stephen Few, Perceptual Edge - July/August/September 2012, “Big Data, Big Ruse”
http://www.perceptualedge.com/articles/visual_business_intelligence/big_data_big_ruse.pdf
Slide5
The Real Task
Transforming raw data into meaningful, useful, actionable information
Leveraging the past to guide future endeavors
Finding the signals amidst the noise
Driving forces:
Scientific research
Business (ecommerce)
Government
Stephen Few: “The success of BI … [is] measured in our increased ability to understand data and then make better decisions based on that understanding.”
Slide6
Visualizing Small Data
MS Excel
Ease of use for tasks involving smaller data sets, limited interactivity
Stephen Few: “building applications on top of Excel can be arduous and painful”
Stephen Few, Perceptual Edge – September/October 2009, “Fundamental Differences in Analytical Tools”
http://www.perceptualedge.com/articles/visual_business_intelligence/differences_in_analytical_tools.pdf
Slide7
Visualizing Small Data
Static dashboards: “custom analytics”
Time-consuming to build but relatively easy to maintain
“Remove … functionality that isn’t relevant to the analytical objective of its users”
Slide8
Unique Challenges
Juliana Freire: “Visualization: Big Data Considerations”
Interactivity is key, but challenging for Big Data
Need better integration between data management and visualization components
Phil Simon describing Netflix’s data mindset:
Data should be accessible, easy to discover, and easy to process for everyone
The longer you take to find the data, the less valuable it becomes
Whether a dataset is large or small, being able to visualize it makes it easier to explain
Juliana Freire, DIMACS 2013, “Big Data Analysis and Integration”
http://dimacs.rutgers.edu/Workshops/BigData/Slides/2013-dimacs.pdf
Phil Simon, HBR Webinar, “The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions”
http://www.scribd.com/doc/232032215/HBR-Webinar-Summary-The-Visual-Organization
Slide9
Case Study: StubHub
Using SAP Business Objects (BO) since at least 2008 on top of Oracle 11g DW
Included in the “Leaders” quadrant of 2014 Gartner report
BO “delivers a broad range of BI and analytic capabilities through a semantic layer best suited for large IT-managed deployments that require robust governance and administrative capabilities”
Customers use “primarily for reporting; the number that use it for interactive discovery or visualization was well below the average”
Gartner, Magic Quadrant for Business Intelligence and Analytics Platforms
www.gartner.com/technology/reprints.do?id=1-1QLGACN&ct=140210&st=sb
Slide10
Case Study: StubHub
Feedback from business users was universally poor
Hard to use
Limited number of (inadequate) visualizations available
Not interactive
Supported by Tech org only
Reporting Team within Analytics org formed in January, 2013
Innovative
Responsive
Promote self-service
Objective vs subjective use of data
Slide11
Case Study: StubHub
General concept: aggregate any metrics by any breakdown, over any time period, filtered for anything
Supports “exploratory analytics”: pursue each question as it arises
Settle instead for a collection of dashboards categorized by business use case
Slide12
Case Study: StubHub
First iteration: Dynamic SQL
Complicated rules for commenting based on front-end selections
select
  -- DATE: sp.src_created_dttm_sale
  g.genre_cat_final as "GCF", -- DISPLAY CATEGORY: GCF
  g.genre_descr as "Genre", -- DISPLAY CATEGORY: Genre
  sum(sp.ticket_cost) as "GTS", -- DATA METRIC: GTS
  count(distinct transaction_id) as "# Orders", -- DATA METRIC: # Orders
from owbruntarget_dw.dw_sales_pipeline_fact sp
join owbruntarget_dw.dw_genre_dim g on sp.genre_dw_id = g.genre_dw_id -- DISPLAY CATEGORY or FILTER: GCF, Genre
where 1=1
  -- FILTER: g.genre_cat_final for GCF
  -- FILTER: g.genre_descr for Genre
  AND trunc(src_created_dttm_sale) between :startdate and :enddate
group by
  g.genre_cat_final, -- DISPLAY CATEGORY: GCF
  g.genre_descr, -- DISPLAY CATEGORY: Genre
  -- DATEG: sp.src_created_dttm_sale
Proved unworkable because of long query execution times, even after incorporating bind variables
Slide13
Case Study: StubHub
Next iteration: “pandas” DataFrames
Open source Python library for data manipulation and analysis
Fast and efficient DataFrame object for data manipulation with integrated indexing
Tools for reading and writing data between in-memory data structures and different formats (e.g. CSV)
For each dashboard, one static query
Tuning + Oracle query optimizer
Retrieve comprehensive data set needed to power the dashboard
Store data in CSV files on network
“Jukebox” functionality: only files needed are loaded into memory for processing
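A minimal sketch of how the “Jukebox” idea could look with pandas, assuming each dashboard extract is a CSV file in one shared directory. The `Jukebox` class, the file names, and the column names are hypothetical and chosen only for illustration; the slides do not show the actual implementation.

```python
import tempfile
from pathlib import Path

import pandas as pd


class Jukebox:
    """Load CSV extracts on first request and keep them cached in memory."""

    def __init__(self, data_dir):
        self.data_dir = Path(data_dir)
        self._loaded = {}  # file stem -> DataFrame

    def get(self, name, date_columns=None):
        # Read the file only the first time it is requested;
        # later requests for the same name reuse the cached DataFrame.
        if name not in self._loaded:
            path = self.data_dir / f"{name}.csv"
            self._loaded[name] = pd.read_csv(
                path, parse_dates=date_columns if date_columns else False
            )
        return self._loaded[name]

    def loaded_names(self):
        return sorted(self._loaded)


def demo():
    # Hypothetical extracts, standing in for the output of the static queries.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "sales.csv").write_text(
            "sale_date,genre,ticket_cost\n"
            "2014-09-01,Sports,120.0\n"
            "2014-09-02,Concerts,80.0\n"
        )
        Path(tmp, "listings.csv").write_text("listing_id,genre\n1,Sports\n")

        box = Jukebox(tmp)
        sales = box.get("sales", date_columns=["sale_date"])
        # Only the requested file is in memory; listings.csv was never read.
        return len(sales), box.loaded_names()


row_count, loaded = demo()
```

In this sketch a dashboard that needs only the sales extract never pays the cost of reading the other files.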
Pandas: http://pandas.pydata.org/pandas-docs/stable/index.html
Slide14
Case Study: StubHub
Results:
Huge decrease in dashboard run times
Corresponding increase in adoption rate
Slide15
Case Study: StubHub
Where does the interactivity necessary for data discovery come from?
Template-based front end built with PHP + HTML + CSS + jQuery
Provide different levels of granularity
Decreases amount of time needed to create a new dashboard (vs. Tableau)
Menus control requests for:
Categories: group by
Metrics: aggregate functions
Filters: where clause
Date range
Chart types, date aggregation
Slide16
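The menu mapping (categories to group by, metrics to aggregate functions, filters to where clause, plus a date range) can be sketched in pandas roughly as follows. The function name `run_selection`, its parameters, and the sample columns are assumptions made for illustration, not the production code.

```python
import pandas as pd


def run_selection(df, categories, metrics, filters=None, start=None, end=None,
                  date_col="sale_date"):
    """Apply front-end menu selections to an in-memory DataFrame.

    categories -> list of group-by columns
    metrics    -> {output name: (source column, aggregate function)}
    filters    -> {column: list of allowed values}, the "where clause"
    start/end  -> inclusive date range on date_col
    """
    mask = pd.Series(True, index=df.index)
    for column, allowed in (filters or {}).items():
        mask &= df[column].isin(allowed)
    if start is not None:
        mask &= df[date_col] >= pd.Timestamp(start)
    if end is not None:
        mask &= df[date_col] <= pd.Timestamp(end)

    # Named aggregation: each keyword is an output column,
    # each value is a (source column, function) pair.
    return df[mask].groupby(categories, as_index=False).agg(**metrics)


# Hypothetical sample data standing in for a dashboard extract.
sales = pd.DataFrame({
    "sale_date": pd.to_datetime(
        ["2014-09-01", "2014-09-01", "2014-09-02", "2014-09-03"]
    ),
    "genre": ["Sports", "Concerts", "Sports", "Theater"],
    "ticket_cost": [120.0, 80.0, 200.0, 50.0],
    "transaction_id": [1, 2, 3, 4],
})

result = run_selection(
    sales,
    categories=["genre"],
    metrics={"gts": ("ticket_cost", "sum"),
             "orders": ("transaction_id", "nunique")},
    filters={"genre": ["Sports", "Concerts"]},
    start="2014-09-01",
    end="2014-09-02",
)
```

Because the selections are plain data (lists and dicts), a front end could send them as a small request payload and the same function would serve every dashboard.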
Case Study: StubHub
How to provide integration between back-end data management and front-end visualization components?
Solution is Data-Driven Documents (D3.js)
JavaScript library to drive the creation and control of dynamic and interactive graphical forms which run in web browsers
W3C-compliant, making use of the widely implemented Scalable Vector Graphics (SVG), JavaScript, HTML5, and Cascading Style Sheets (CSS3) standards
Large data sets can be easily bound to SVG objects using JSON and simple D3 functions to generate charts and diagrams
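On the back end, handing data to D3 could be as small as serializing an aggregated DataFrame to a JSON array of row objects, which is the shape D3's data join (`selection.data()`) typically binds to SVG elements. This is a sketch under that assumption; `to_d3_json` and the sample columns are hypothetical names.

```python
import json

import pandas as pd


def to_d3_json(df):
    """Serialize a DataFrame as a JSON array with one object per row."""
    # orient="records" yields [{"col": value, ...}, ...];
    # date_format="iso" writes any datetime columns as ISO 8601 strings.
    return df.to_json(orient="records", date_format="iso")


# Hypothetical aggregated result for one chart.
aggregated = pd.DataFrame({
    "genre": ["Concerts", "Sports"],
    "gts": [80.0, 320.0],
})

payload = to_d3_json(aggregated)
rows = json.loads(payload)  # round-trip to confirm the payload is valid JSON
```

Each object in the array corresponds to one mark in the chart, for example one bar per genre.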
D3: http://d3js.org/
Slide17
Case Study: StubHub
Summary of approach
Create a collection of BI dashboards that are:
Fast
Customizable
Interactive
Highly visual
On-demand
Scalable
Consistent
Custom build EVERYTHING as needed
Leverage open source technologies whenever possible
Data source agnostic to accommodate new data stores as they become available
Output from MapReduce jobs in CSV format