Slide 1: Big IT Challenges and Opportunities for Neuroscience
Cristiana Amza, University of Toronto
Slide 2: Big Data is Here
Data growth by 2015: 100x in ten years [IDC 2012].
Value in data analysis for commerce, health, science, services, … [source: Economist]
(courtesy of Babak Falsafi)
Slide 3: Data Growing Faster than Technology
[figure: data growth outpacing technology over time, showing a growing technology gap; source: WinterCorp Survey, www.wintercorp.com]
(courtesy of Babak Falsafi)
Slide 4: Challenge 1: Costs of a Datacenter
Estimated costs of a datacenter: 46,000 servers, $3,500,000 per month to run.
With 3-year server and 10-year infrastructure amortization, servers and power are 88% of the total cost.
(data courtesy of James Hamilton [SIGMOD'11 Keynote])
Slide 5: Datacenter Energy Not Sustainable
In the modern world, datacenters consume 6% of all electricity, and this is growing at more than 20% per year!
[figure: datacenter energy in billion kilowatt-hours/year, 2001-2017; labels: a modern datacenter (17x a football stadium, $3 billion), 50 million homes]
(courtesy of Babak Falsafi)
Slide 6: Challenge 2: Data Management (Anomalies), or Cloudy with a Chance of Failure
Headlines:
- "Amazon Can't Recover All Its Cloud Data From Outage" (Max Eddy, 27 April 2011, www.geekosystem.com)
- "When the Cloud Fails: T-Mobile, Microsoft Lose Sidekick Customer Data" (Om Malik, 10 October 2009, gigaom.com)
- "Whoops – Facebook Loses 1 Billion Photos" (Chris Keall, 10 March 2009, The National Business Review)
- "Cloud Storage Often Results in Data Loss" (Chad Brooks, 10 October 2011, www.businessnewsdaily.com)
(courtesy of Haryadi S. Gunawi)
Slide 7: Data Center Statistics at Google
Slide 8: Problems Are Entrenched
I have been working in this area since 2001 (1993).
Problems have only grown more complex and intractable:
- Same old distributed-systems problems
- New: gaps due to heterogeneity, remoteness, and indirection
Slide 9: Why: Application and Provider Have Different Goals
Service provider, sharing resources: "Can I place customer A's DB alongside customer B's DB? Will their service levels be met?"
Customer, solving problems: "I'm only getting 500 TPS. What's wrong? Is the cloud to blame?"
Gap: application semantics are not available to the service provider.
Slide 10: Example: Application/Provider Gap
Cloud monitoring and logging data (for anomaly detection): IBM TJ Watson's 1500-VM deployment produces terabytes per day.
There are no notable success stories with analyzing this log data.
Slide 11: Why: Administrator/Machine Gap
Log data is textual; it used to be interpreted by humans, e.g., "Error x on reading input y".
No human can read terabytes of data; text parsing is expensive for a machine, and the machine can't understand what the human is looking for.
Slide 12: Why: Software/Data/Hardware Gap
Virtual machines mean "virtual" resources and remote data that are harder to track, making it harder to pinpoint what is wrong.
Provider: "Not my application, not my data, not my problem!" Developer: "Not my platform, not my fault!"
Slide 13: Misdirected Focus on Stop-gap Solutions
[figure: a data center or cloud running Hadoop, with periodic data ingest, hardware $$$, and cross-silo data management $$$]
Slide 14: What Can We Do? Find Meaningful Apps
We can produce and find tons of data.
We need to analyze something of vital importance to justify draining vital resources; otherwise, the simplest solution is to stop creating the problem(s).
Slide 15: Opportunity: The Brain Challenge
Brain modelling and simulation:
- Spent a sabbatical studying neuroscience applications (2010-11)
- Brain summits/workshops/conferences, e.g., held at IBM TJ Watson
- Started a collaboration with Prof. Stephen Strother at Baycrest (a top brain research group worldwide)
- An application that is both data and compute intensive
Slide 16: The Brain Worldwide
North America (US and Canada): top-down approach (from brain images, fMRI, to a brain model); the brain as a network, diagnosing normal vs. abnormal, drug trials.
Europe: bottom-up approach (from neurons in Petri dishes to a brain model).
IBM TJ Watson, Toronto Baycrest: cross-correlations between fMRI (image), EEG (signal), clinical data, and genetic data (centralized repository in Kingston); collaboration with Sick Kids and Sunnybrook hospitals and UHealthNet.
Slide 17: Neuroscience Challenges
Data intensive:
- An hour of brain slice scans in a human: ~1 GB of data, using functional MRI
- An hour in a mouse: 50~100 GB of data, using two-photon imaging
- Recording the entire brain of a zebrafish: 1 TB or more of data, using light-sheet microscopy (100k neurons in a zebrafish, vs. 10 billion neurons in a human)
Compute intensive: complex analyses
- A large set of tuning parameters in the analysis algorithms
- Univariate and multivariate analysis, PCA, etc.
- High-dimensional search to find the optimal parameters
Slide 18: Functional MRI
Goal: studying brain functionality.
Procedure: asking patients (subjects) to perform a task while capturing brain slices that measure blood-oxygen level, then correlating the images to identify brain activity (a minimal sketch of this correlation step follows below).
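As an illustration of that correlation step, a minimal Python/NumPy sketch, assuming a simple on/off block-design task; the data shapes and the task regressor are hypothetical stand-ins, not the actual pipeline:

```python
import numpy as np

# Hypothetical shapes: T scans over time, V voxels per scan.
T, V = 200, 50_000
rng = np.random.default_rng(0)
scans = rng.standard_normal((T, V))   # stand-in for preprocessed fMRI time series
task = (np.arange(T) // 20) % 2       # hypothetical on/off block-design regressor

# Voxel-wise Pearson correlation between each voxel's time series and the task.
scans_c = scans - scans.mean(axis=0)
task_c = task - task.mean()
r = (task_c @ scans_c) / (np.linalg.norm(task_c) * np.linalg.norm(scans_c, axis=0))

# Voxels with high |r| are candidate task-related activations.
active = np.flatnonzero(np.abs(r) > 0.3)
```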
Slide 19: Brain Modeling Loop
[figure: the brain modeling loop]
Slide 20: fMRI Data Preprocessing Example
Common preprocessing steps:
- Physiological noise correction
- Intra-subject motion correction
- Between-subject alignment
- Smoothing
- Intensity normalization
- Detrending
Each step has many tuning parameters to choose. There is NO fixed set of optimal parameters for all subjects; individual optimization consistently improves signal detection.
Goal: search for the optimal combination of parameters across these steps (see the sketch after this slide).
(courtesy of Stephen Strother)
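A minimal sketch of what that per-subject parameter search could look like; the step names, parameter grids, and quality metric are hypothetical stand-ins, not the actual Strother-lab pipeline:

```python
import itertools
import random

# Hypothetical per-step parameter grids for the preprocessing pipeline.
grids = {
    "motion_correction": ["rigid", "affine"],
    "smoothing_fwhm_mm": [0, 4, 8, 12],
    "detrend_order":     [0, 1, 2, 3],
}

def quality(subject, params):
    # Stand-in for running the pipeline with `params` on `subject` and
    # scoring signal detection; a deterministic placeholder so this runs.
    random.seed(hash((subject, tuple(sorted(params.items())))))
    return random.random()

def best_params(subject):
    # Exhaustive search over the cross-product of all grids. Real grids
    # are vastly larger, which is why guided search is needed (Slide 34).
    keys = list(grids)
    candidates = (dict(zip(keys, combo))
                  for combo in itertools.product(*grids.values()))
    return max(candidates, key=lambda p: quality(subject, p))

print(best_params("subject-01"))
```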
Slide 21: Brain Processing Pipeline
Subject Selection & Experimental Design → Data Acquisition → Data Preprocessing → Statistical Model → Results
Slide 22: Brain Processing Pipeline
Subject Selection & Experimental Design → Data Acquisition → Data Preprocessing → NPAIRS → Results
Slide 23: NPAIRS
[figure: split-half resampling. For each split J, the full data (scans plus "design" data) is divided into SPLIT-HALF 1 and SPLIT-HALF 2; each half yields a statistical parametric map (SPM), and comparing the two maps gives a reproducibility estimate (r).]
Slide 24: NPAIRS as Our Application
NPAIRS goal: processing images to find image correlations, for the same subject and across subjects.
Feature extraction is a common technique in image-processing applications (e.g., face recognition), using Principal Component Analysis to extract eigenvectors. A simplified sketch follows below.
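A minimal sketch of the split-half reproducibility idea behind NPAIRS, assuming a PCA-based map; the data, the single split, and the scoring are simplified stand-ins for the real procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
scans = rng.standard_normal((100, 5_000))    # hypothetical: 100 scans x 5,000 voxels

def spm_from(half):
    # Stand-in "statistical parametric map": the leading principal
    # component (eigenvector over voxels) of the half's data, via SVD.
    centered = half - half.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

# One split J: divide the scans into two disjoint halves.
idx = rng.permutation(len(scans))
spm1 = spm_from(scans[idx[:50]])
spm2 = spm_from(scans[idx[50:]])

# Reproducibility estimate r: correlation between the two half-maps
# (abs() because a principal component's sign is arbitrary).
r = abs(np.corrcoef(spm1, spm2)[0, 1])
print(f"split-half reproducibility r = {r:.3f}")
```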
Slide 25: Need for Interactivity to Find the Optimum as f(Q)
[figure: metric curves as a function of Q, with the optimum at Max(p, r)]
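In NPAIRS-style analyses, p and r are typically prediction and reproducibility scores; one common way to pick the optimum, assumed here rather than taken from the talk, is the Q whose (p, r) point lies closest to the ideal (1, 1):

```python
import math

# Hypothetical (p, r) measurements for a few values of Q.
samples = {2: (0.61, 0.48), 5: (0.70, 0.63), 10: (0.68, 0.71), 20: (0.57, 0.74)}

def distance_to_ideal(pr):
    p, r = pr
    return math.hypot(1.0 - p, 1.0 - r)   # Euclidean distance to (1, 1)

best_q = min(samples, key=lambda q: distance_to_ideal(samples[q]))
print(best_q)   # the Q closest to perfect prediction and reproducibility
```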
Slide 26: Long Sample Times
[figure: each sample takes 2 hrs]
Slide 27: Need for Guidance
[figure: samples taking 2 hrs and 4 hrs each]
Slides 28-29: What If?
- Vary 1000 parameters instead of just Q
- For 1000 subjects
- For 1000x more data (higher resolution, more images)
Combinatorial explosion! The curse of multi-dimensionality (we can't even visualize it).
Slide 30: Huge Search Space
Current approach: low-resolution, noisy data; a fixed selection of parameters for each step; one subject. Poor accuracy, and several days needed to get a solution.
Ideal approach: full data resolution, the full set of tuning parameters, many subjects. Never explored, due to lack of algorithms, user guidance, and computing power!
Slide 31: NPAIRS Execution on Different HW Nodes
[figure: NPAIRS execution on different hardware nodes]
Slide 32: Job Scheduling for Brain
Data centers usually have a heterogeneous structure: a variety of multicores, GPUs, etc., with lots of tuning knobs (parameters) in the infrastructure.
Job scheduling onto the available resources becomes non-trivial; it boils down to a search for the optimum in high-dimensional spaces.
Slide 33: IT Misfits for the Application
The state of the art in IT does not satisfy our needs:
- The cloud is too complex; MapReduce (Hadoop) is too simplistic
- Job-scheduling problems and IT-management gaps
- Data transfer (by truckload?), data safety, and privacy
- Long computations (days, weeks) may crash; failures are hard to tolerate or mask, and it is hard to know what went wrong
Slide 34: Key Idea
Focus on search in high-dimensional spaces as the unified high-level problem.
Now: experts in math, AI, OS, hardware, compilers, and networking work separately. Future: get these experts working on the same high-level problem.
Develop a common framework that works for performance modeling, anomaly modeling, biophysical modeling, and energy modeling (for both IT and neuroscience). A toy skeleton of such a search is sketched below.
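As a toy illustration of the unified problem, a minimal hill-climbing skeleton for high-dimensional search; the objective, perturbation, and budget are hypothetical, and real guided modeling would replace random refinement with model-based guidance:

```python
import random

def guided_search(objective, initial, perturb, budget=100):
    # Keep the best configuration seen so far and sample candidates
    # near it; a stand-in for model-guided search, not a real optimizer.
    best, best_score = initial, objective(initial)
    for _ in range(budget):
        candidate = perturb(best)
        score = objective(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Hypothetical 3-parameter configuration space with a peak at (0.7, 0.7, 0.7).
objective = lambda x: -sum((xi - 0.7) ** 2 for xi in x)
perturb = lambda x: [min(1.0, max(0.0, xi + random.gauss(0, 0.1))) for xi in x]
print(guided_search(objective, [0.5, 0.5, 0.5], perturb))
```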
Slide 35: Unified High-Level Goal
Service provider: search for the optimum in high-dimensional spaces.
Customer: search for the optimum in high-dimensional spaces.
We need guided modeling to understand!
Slide 36: Building Models Takes Time
Cache sizes:
  Core i7: L1 = 256 kB, L2 = 1024 kB, L3 = 8192 kB
Slide 37: Guided Modeling Can Help
Cache sizes:
  Core i7: L1 = 256 kB, L2 = 1024 kB, L3 = 8192 kB
Slide 38: Intelligent Guidance: Step Function (3D-to-2D Reduction)
Cache sizes:
  Xeon:    L1 = 256 kB, L2 = 2048 kB, L3 = 20480 kB
  Core i7: L1 = 256 kB, L2 = 1024 kB, L3 = 8192 kB
Slide 39: With Minimum New Samples …
Cache sizes:
  Xeon: L1 = 256 kB, L2 = 2048 kB, L3 = 20480 kB
A sketch of this step-function guidance follows below.
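A minimal sketch of the step-function idea: performance tends to shift at known cache-capacity boundaries, so a model for a new machine (here the Xeon) needs only a few samples per plateau rather than a full sweep. The cache sizes are from the slides; the latency numbers and fitting code are hypothetical:

```python
# Known cache boundaries (kB) act as guidance: sample each plateau a few
# times instead of exhaustively sweeping working-set sizes.
XEON_BOUNDARIES = [256, 2048, 20480]   # L1, L2, L3 sizes from the slide

def plateau(size_kb, boundaries=XEON_BOUNDARIES):
    # Index of the cache level a working set of size_kb falls into.
    for level, bound in enumerate(boundaries):
        if size_kb <= bound:
            return level
    return len(boundaries)             # beyond L3: main memory

def fit_step_model(samples):
    # samples: {size_kb: measured_latency}. Average the few samples that
    # land in each plateau to get the step function's levels.
    levels = {}
    for size, latency in samples.items():
        levels.setdefault(plateau(size), []).append(latency)
    return {lvl: sum(v) / len(v) for lvl, v in levels.items()}

# Hypothetical measurements: one sample per plateau already suffices.
model = fit_step_model({128: 1.1, 1024: 3.9, 8192: 12.2, 40960: 60.5})
predict = lambda size_kb: model[plateau(size_kb)]
print(predict(512))   # 512 kB working set falls in the L2 plateau
```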
Slide 40: Conclusions
Big Data processing is driving a quantum leap in IT, but it is hampered by slow progress in data center management.
We propose to investigate guided modeling for high-dimensional spaces. Promising preliminary results with neuroscience workloads: significant speedup of NPAIRS on a CPU+GPU cluster.
Slide 41: What Can We Do Next?
- Consolidate research agendas
- Find overarching, mission-critical paradigms
- Develop standards, common tools, and benchmarks
- Integrate solutions; think holistically, unified, end-to-end
Slide 42: Questions?
Jin Chen, PhD, from the Baycrest research group in neuroscience and my long-time collaborator, is available for this Q&A.
Slide 43: Job Modeling: Exhaustive Sampling
Sample set: 1 to 99. Fitness score R²: 0.995. Total run time: 64,933.
Slide 44: Guidance: Step Function + Fast Sampling
Sample set: 2, 4, 8, 12, 16, 20, 24, 32, 48, 96. Fitness score R²: 0.993. Total run time: 5,313. A large time saving!
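A minimal sketch of the comparison these two backup slides make: fitting a step model from an exhaustive sample set versus the guided one, and scoring each fit with R². The step-shaped workload being sampled is a hypothetical stand-in:

```python
import math

BOUNDARIES = [8, 24, 64]   # hypothetical plateau edges of the workload

def measure(x):
    # Stand-in for one expensive sample run: latency jumps at each
    # boundary, plus a small deterministic wiggle standing in for noise.
    return 1.0 + 3.0 * sum(x > b for b in BOUNDARIES) + 0.1 * math.sin(x)

def plateau(x):
    return sum(x > b for b in BOUNDARIES)

def fit(sample_xs):
    # Per-plateau mean of the sampled points: the step-function model.
    levels = {}
    for x in sample_xs:
        levels.setdefault(plateau(x), []).append(measure(x))
    return {k: sum(v) / len(v) for k, v in levels.items()}

def r_squared(model, test_xs):
    actual = [measure(x) for x in test_xs]
    pred = [model[plateau(x)] for x in test_xs]
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

test = range(1, 100)
exhaustive = list(range(1, 100))                 # 99 samples, as on Slide 43
guided = [2, 4, 8, 12, 16, 20, 24, 32, 48, 96]   # 10 samples, as on Slide 44
for name, xs in (("exhaustive", exhaustive), ("guided", guided)):
    print(name, len(xs), "samples, R² =", round(r_squared(fit(xs), test), 3))
```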