Presentation Transcript

Slide1

Big IT Challenges and Opportunities for Neuroscience

Cristiana Amza

University of Toronto

Slide2

Big Data is Here

Data growth (by 2015) = 100x in ten years [IDC 2012]

Value in data analysis for commerce, health, science, services, … [source: Economist]

courtesy of Babak Falsafi

Slide3

Data Growing Faster than Technology

WinterCorp Survey, www.wintercorp.com

Growing technology gap

courtesy of Babak Falsafi

Slide4

Challenge 1: Costs of a datacenter

Estimated costs of a datacenter:

46,000 servers

$3,500,000 per month to run

3-year server and 10-year infrastructure amortization

Server & power are 88% of total cost

Data courtesy of James Hamilton [SIGMOD’11 Keynote]

Slide5

Datacenter Energy Not Sustainable

In the modern world: 6% of all electricity, growing at >20%!

[Chart: billion kilowatt-hours/year, 2001–2017, with a “50 million homes” annotation]

A Modern Datacenter: 17x a football stadium, $3 billion

courtesy of Babak Falsafi

Slide6

Challenge 2: Data Management (Anomalies)

Cloudy with a chance of failure:

“Amazon Can’t Recover All Its Cloud Data From Outage” (Max Eddy, 27 April 2011, www.geekosystem.com)

“When the Cloud Fails: T-Mobile, Microsoft Lose Sidekick Customer Data” (Om Malik, 10 October 2009, gigaom.com)

“Whoops – Facebook loses 1 billion photos” (Chris Keall, 10 March 2009, The National Business Review)

“Cloud Storage Often Results in Data Loss” (Chad Brooks, 10 October 2011, www.businessnewsdaily.com)

courtesy of Haryadi S. Gunawi

Slide7


Data Center Statistics at Google

Slide8

Problems are entrenched

I have been working in this area since 2001 (1993)

Problems have only grown more complex/intractable

Same old Distributed Systems problems

New: gaps due to heterogeneity/remoteness/indirection

Slide9

Why: Application and Provider Have Different Goals

Service Provider wants to share resources: Can I place customer A’s DB alongside customer B’s DB? Will their service levels be met?

Customer wants to solve problems: I’m only getting 500 TPS. What’s wrong? Is the cloud to blame?

Gap: application semantics are not available to the service provider

Slide10

Ex: Application/Provider Gap

Cloud monitoring and logging data (for anomaly detection): an IBM TJ Watson deployment of 1500 VMs produces terabytes/day

No notable success stories with analyzing log data

Slide11

Why: Administrator/Machine Gap

Log data is textual; it used to be interpreted by humans, e.g.: “Error x on reading input y”

No human can read terabytes of data

Text parsing is expensive for a machine

The machine can’t understand what the human is looking for

Slide12

Why: Sw/Data/Hw Gap

Virtual Machines: “virtual” resources and remote data are harder to track, and it is harder to pinpoint what is wrong

Provider: Not my application, not my data, not my problem!

Developer: Not my platform, not my fault!

Slide13

Misdirected Focus on Stop-gap Solutions

[Diagram: a data center or cloud running Hadoop; periodic data ingest; hardware $$$; cross-silo data management $$$]

Slide14

What can we do? Find Meaningful Apps

We can produce/find tons of data

We need to analyze something of vital importance to justify draining vital resources

Otherwise the simplest solution is to stop creating the problem(s)

Slide15

Opportunity: The Brain Challenge

Brain Modelling and Simulation

Spent a sabbatical studying Neuroscience applications (2010-11)

Brain summits/workshops/conferences, e.g., held at IBM TJ Watson

Started a collaboration with Prof. Stephen Strother at Baycrest (a top brain research group worldwide)

An application that is both data and compute intensive

Slide16

The Brain Worldwide

North America (US and Canada): top-down approach (from fMRI brain images → brain model); brain as a network, diagnosing normal/abnormal, drug trials

Europe: bottom-up approach (neurons in Petri dishes → brain model)

IBM TJ Watson, Toronto Baycrest: cross-correlations between fMRI (image), EEG (signal), clinical data, genetic data (centralized repository in Kingston)

Collab with Sick Kids, Sunnybrook Hospitals, UHealthNet

Slide17

Neuroscience Challenges

Data intensive:

An hour of brain slice scans in a human → 1 GB of data, using functional MRI

An hour in a mouse → 50–100 GB of data, using two-photon imaging

Recording the entire brain of a zebrafish → 1 TB or more of data, using light-sheet microscopy

100k neurons in a zebrafish (10 billion neurons in a human)

Compute intensive: complex analyses

A large set of tuning parameters in the analysis algorithms

Univariate and multivariate analysis, PCA, etc.

High-dimensional search to find the optimum

Slide18

Functional MRI

Goal: studying brain functionality

Procedure: asking patients (subjects) to do a task and capturing brain slices that measure blood oxygen level, then correlating the images to identify brain activity (see the sketch below)
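As a concrete illustration of that correlation step, here is a minimal numpy sketch on synthetic data; the array shapes, the block design, and the |r| > 0.3 threshold are illustrative assumptions, not values from the talk.

```python
# Toy sketch: correlate each voxel's BOLD time series with a task design.
# All data and thresholds here are synthetic/illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_scans, n_voxels = 120, 5000
bold = rng.standard_normal((n_scans, n_voxels))   # fake BOLD signal per voxel
design = np.tile([0.0] * 10 + [1.0] * 10, 6)      # on/off task blocks, 120 scans

# Pearson correlation of every voxel's time series against the design vector.
bold_c = bold - bold.mean(axis=0)
design_c = design - design.mean()
r = (bold_c.T @ design_c) / (np.linalg.norm(bold_c, axis=0) * np.linalg.norm(design_c))

active = np.flatnonzero(np.abs(r) > 0.3)          # crude activation threshold
print(f"{active.size} of {n_voxels} voxels exceed |r| > 0.3")
```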


Slide19

Brain Modeling Loop


Slide20

fMRI Data Preprocessing Example


Common preprocessing steps:

Physiological noise correction

Intra-subject motion correction

Between-subject alignment

Smoothing

Intensity normalization

Detrending

Each step has many tuning parameters to choose from.

There is NO fixed set of optimal parameters for all subjects.

Individual optimization consistently improves signal detection.

Goal: search for the optimal combination of parameters across these steps (see the sketch below).

courtesy of Stephen Strother
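Here is a hedged sketch of that parameter search: enumerate combinations of a few hypothetical preprocessing knobs and keep the best-scoring one. The knob names, values, and scoring function are invented stand-ins, not the actual Baycrest pipeline.

```python
# Illustrative grid search over hypothetical preprocessing parameters.
from itertools import product

grid = {
    "motion_correction": ["rigid", "affine"],
    "smoothing_fwhm_mm": [4, 6, 8],
    "detrend_order": [0, 1, 2],
    "intensity_norm": ["global", "voxelwise"],
}

def signal_detection_score(params):
    # Stand-in for running the pipeline and measuring signal detection.
    return -abs(params["smoothing_fwhm_mm"] - 6)

candidates = (dict(zip(grid, combo)) for combo in product(*grid.values()))
best = max(candidates, key=signal_detection_score)
print("best of 2*3*3*2 = 36 combinations:", best)
```

Even this toy grid has 36 combinations, and per-subject optimization multiplies that by the number of subjects; this is exactly the explosion the later slides worry about.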

Slide21

Brain Processing Pipeline

Subject Selection & Experimental Design → Data Acquisition → Data Preprocessing → Statistical Model → Results

Slide22

Brain Processing Pipeline

Subject Selection & Experimental Design → Data Acquisition → Data Preprocessing → NPAIRS → Results

Slide23

NPAIRS

[Diagram: NPAIRS split-half resampling. For a split J, the full data (scans plus “design” data) is divided into split-half 1 and split-half 2. Each half yields a statistical parametric map (SPM), SJ1 and SJ2, and the correlation between the two map vectors, vSJ1 vs. vSJ2, gives the reproducibility estimate (r).]
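A minimal numpy sketch of the split-half idea in the diagram, under simplifying assumptions: synthetic per-subject maps, a plain mean map standing in for the real SPM computation, and a single split.

```python
# Toy NPAIRS-style split-half reproducibility: split the data in two,
# derive a map from each half, and correlate the two maps.
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_voxels = 20, 5000
signal = np.linspace(0.0, 1.0, n_voxels)          # shared "true" activation
maps = signal + rng.standard_normal((n_subjects, n_voxels))

half1, half2 = maps[: n_subjects // 2], maps[n_subjects // 2 :]
spm1 = half1.mean(axis=0)                         # stand-in SPM, split-half 1
spm2 = half2.mean(axis=0)                         # stand-in SPM, split-half 2

r = np.corrcoef(spm1, spm2)[0, 1]                 # reproducibility estimate
print(f"reproducibility r = {r:.3f}")
```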


Slide24

NPAIRS As Our Application

NPAIRS goal: processing images to find image correlations, for the same subject and across subjects

Feature extraction: a common technique in image processing applications (e.g., face recognition)

Uses Principal Component Analysis to extract eigenvectors (sketched below)
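A short sketch of the PCA step via SVD; the data is synthetic and the shapes and Q are arbitrary choices, not NPAIRS defaults.

```python
# Extract the top-Q principal components (eigenvectors of the covariance
# matrix) from a stack of images and project the images onto them.
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 4096))   # 100 images of 64x64 pixels, flattened
Xc = X - X.mean(axis=0)                # center the data

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Q = 10
eigenvectors = Vt[:Q]                  # principal axes in pixel space
features = Xc @ eigenvectors.T         # low-dimensional features per image
print(features.shape)                  # (100, 10)
```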


Slide25

Need for Interactivity to Find Optimum as f(Q)

[Plot: objective max(p, r) as a function of Q]

Slide26

Long Sample Times

Sample: 2 hrs

Slide27

Need for Guidance

Sample: 2 hrs

Sample: 4 hrs

Slide28

What if?

Vary 1000 parameters instead of just Q

For 1000 subjects

For 1000x more data (higher resolution, more images)

Slide29

What if?

Vary 1000 parameters instead of just Q

For 1000 subjects

For 1000x more data (higher resolution, more images)

Combinatorial explosion! (See the arithmetic below.)

Curse of multi-dimensionality (can’t even visualize)
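The arithmetic is unforgiving even under the most optimistic assumption; the two-settings-per-parameter figure below is my own simplification.

```python
# With only 2 settings per parameter, 1000 parameters already give an
# astronomically large exhaustive search space.
n_params, settings_each = 1000, 2
print(f"{settings_each ** n_params:.2e} combinations")  # ~1.07e+301
```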

Slide30

Huge Search Space

Current approach:

Low resolution, noisy data

Fixed selection of parameters for each step, one subject

Poor accuracy

Needs several days to get a solution

Ideal approach:

Full data resolution, full set of tuning parameters, many subjects

Never explored due to lack of algorithms, user guidance and computing power!

Slide31

NPAIRS Execution on Different HW Nodes


Slide32

Job Scheduling for Brain

Data centers usually have a heterogeneous structure: a variety of multicores, GPUs, etc.

Lots of tuning knobs (parameters) in the infrastructure

Job scheduling onto the available resources becomes non-trivial

It boils down to a search for the optimum in high-dimensional spaces (sketched below)
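As a sketch of what such a search might look like, here is a toy random search over hypothetical scheduling knobs; the knob names and the cost model are invented stand-ins for a measured or learned performance model.

```python
# Toy random search over a hypothetical job-scheduling configuration space.
import random

random.seed(0)
knobs = {
    "node_type": ["cpu16", "cpu32", "gpu"],
    "n_workers": [1, 2, 4, 8, 16],
    "chunk_mb": [64, 128, 256, 512],
}

def predicted_runtime(cfg):
    # Invented stand-in for a learned performance model.
    base = {"cpu16": 100.0, "cpu32": 60.0, "gpu": 30.0}[cfg["node_type"]]
    return base / cfg["n_workers"] + cfg["chunk_mb"] / 128.0

samples = ({k: random.choice(v) for k, v in knobs.items()} for _ in range(200))
best = min(samples, key=predicted_runtime)
print("best configuration found:", best)
```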

Slide33

IT Misfits for Application

State of the art in IT does not satisfy the application:

Cloud too complex, MapReduce (Hadoop) too simplistic

Job scheduling problems and IT management gaps

Data transfer (truckload?), data safety and privacy

Long computations (days, weeks) may crash

Hard to tolerate/mask failures

Hard to know what went wrong

Slide34

Key Idea

Focus on search in high-dimensional space as the unified high-level problem

Now: experts in Math, AI, OS, HW, Compiler, Network

Future: get the experts working on the same high-level problem

Develop a common framework that works for: performance modeling, anomaly modeling, biophysical modeling, energy modeling (for both IT and Neuroscience)

Slide35

Unified High-Level Goal

Service Provider: search for the optimum in high-dimensional spaces

Customer: search for the optimum in high-dimensional spaces

Need guided modeling to understand!

Slide36

Building models takes time

Cache      L1        L2        L3
Core i7    256 kB    1024 kB   8192 kB

Slide37

Guided Modeling Can Help

Cache      L1        L2        L3
Core i7    256 kB    1024 kB   8192 kB

Slide38

Intelligent Guidance: Step function (3D to 2D reduction)

Cache      L1        L2        L3
Xeon       256 kB    2048 kB   20480 kB
Core i7    256 kB    1024 kB   8192 kB
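One way to read the table: latency (and hence runtime) is roughly a step function of working-set size, with steps at the cache boundaries. The sketch below uses the table's cache sizes with made-up latency values; knowing where the steps fall on a new machine (the Xeon row) is what allows the dimensionality reduction and model transfer with few new samples.

```python
# Piecewise-constant (step-function) latency model over cache boundaries.
# Cache sizes are from the slide's table; latencies are illustrative guesses.
def latency_ns(working_set_kb, l1_kb, l2_kb, l3_kb):
    if working_set_kb <= l1_kb:
        return 1.0     # ~L1 hit
    if working_set_kb <= l2_kb:
        return 4.0     # ~L2 hit
    if working_set_kb <= l3_kb:
        return 12.0    # ~L3 hit
    return 60.0        # DRAM

core_i7 = (256, 1024, 8192)
xeon = (256, 2048, 20480)
for ws in (128, 512, 4096, 32768):
    print(f"{ws:>6} kB: Core i7 {latency_ns(ws, *core_i7):>5.1f} ns, "
          f"Xeon {latency_ns(ws, *xeon):>5.1f} ns")
```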

Slide39

With minimum new samples …

Cache      L1        L2        L3
Xeon       256 kB    2048 kB   20480 kB

Slide40

Conclusions

Big Data processing is driving a quantum leap in IT

It is hampered by slow progress in data center management

We propose to investigate guided modeling for high-dimensional spaces

Promising preliminary results with Neuroscience workloads

Significant speedup of NPAIRS on a CPU+GPU cluster

Slide41

What can we do next?

Consolidate research agendas

Find overarching, mission-critical paradigms

Develop standards, common tools and benchmarks

Integrate solutions; think holistically, unified, end-to-end

Slide42

Questions?

Jin Chen, PhD, from the Baycrest Research Group in Neuroscience

My long-time collaborator

Available for this Q&A

Slide43

Job Modeling: Exhaustive Sampling

Sample Set: 1 to 99

Fitness Score R²: 0.995

Total Run Time: 64933

Slide44

Guidance: Step function + Fast Sampling

Sample Set: 2 4 8 12 16 20 24 32 48 96

Fitness Score R²: 0.993

Total Run Time: 5313

Large Time Saving! (A toy version of this comparison is sketched below.)
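A toy version of this comparison, with an invented runtime curve standing in for the measured data: fit a simple model from the ten-point sample set above and check its fit quality against the exhaustive 1-to-99 sweep.

```python
# Fit from the sparse sample set and evaluate R^2 on the exhaustive range.
import numpy as np

def runtime(q):
    return 50 + 3 * q + 0.05 * q ** 2    # invented "true" runtime curve

exhaustive = np.arange(1, 100)
sparse = np.array([2, 4, 8, 12, 16, 20, 24, 32, 48, 96])

coeffs = np.polyfit(sparse, runtime(sparse), deg=2)  # fit on 10 samples
pred = np.polyval(coeffs, exhaustive)
true = runtime(exhaustive)
r2 = 1 - np.sum((true - pred) ** 2) / np.sum((true - true.mean()) ** 2)
print(f"R^2 from 10 samples: {r2:.3f}")  # ~1.0 for this smooth stand-in
```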