3DAPAS/ECMLS panel - PowerPoint Presentation

Uploaded On 2017-09-01


Dynamic Distributed Data Intensive Analysis Environments for Life Sciences. June 8 2011, San Jose. Geoffrey Fox, Shantenu Jha, Dan Katz, Judy Qiu, Jon Weissman. Discussion topics include programming methods: languages vs. frameworks (advantages and disadvantages of each).

Presentation Transcript

Slide1

3DAPAS/ECMLS panel

Dynamic Distributed Data Intensive Analysis Environments for Life Sciences

June 8 2011 San Jose

Geoffrey Fox, Shantenu Jha, Dan Katz, Judy Qiu, Jon Weissman

Slide2

Discussion Topics?

Programming methods: languages vs. frameworks (advantages and disadvantages of each)

Moving compute to data: Is the data localization model imposed by Clouds scalable and/or sustainable?

Does Life Sciences want clouds or supercomputers?

Data model for Life Sciences: Is it dynamic? What is its size? What is the access pattern? Is it shared or individual?

How important is data security and privacy?

Slide3

Programming methods: languages vs. frameworks for data intensive/Life Science areas

SaaS offers “Blast etc.” on demand

Role of PGAS, data parallel compilers (like Chapel), i.e. mainstream HPC high-level approaches

Nodes v. Cores v. GPUs: do hybrid programming models have special features?

MapReduce v. MPI

Distributed environments like SAGA

Data parallel analysis languages like Pig Latin, Sawzall

Role of databases like SciDB and SQL based analysis

See DryadLINQ

Is R (cloud R, parallel R) critical?

What about Excel, Matlab …

Slide4

Moving compute to data: Is the data localization model imposed by Clouds scalable and/or sustainable?

This is related to privacy and programming model questions

Is data stored in central resources?

Does data have co-located compute resources (cloud)?

If co-located, are data and compute on the same cluster?

How is data spread out over disks on nodes?

Or is data in a storage system supporting a wide area file system shared by nodes of the cloud?

Or is data in a database (SciDB, SkyServer)?

Or is data in an object store like OpenStack?

What kind of middleware exists, or needs to be developed, to enable effective compute-data movement? Or is it just a run-time scheduling problem?

What are the performance issues, and how do we get data there for dynamic data such as that produced by sequencers?

Slide5

Data model for Life Sciences: Is it dynamic? What is its size? What is the access pattern? Is it shared or individual?

Is it a few large centers?

Is it a distributed set of repositories containing, say, all data from a particular lab?

Or both of the above?

How to manage and present the stream of new data?

The world created ~1000 exabytes of data this year; how much will Life Sciences create?

Relative importance of large shared data centers versus instrumental or computer generated individually owned data?

Is data replication important?

Storage model: files, objects, databases?

How often are the different types of data read (presumably written once!)?

Which data is most important? Raw or processed to some level?

What is the metadata challenge?

Slide6

Does Life Sciences want Clouds or Supercomputers?

Clouds are cost effective and elastic for varying need

Supercomputers support low latency (MPI) parallel applications

Clouds are the main commercial offering; supercomputers are the main academic large scale computing solution

Also Open Science Grid, EGI ….

Cost (time) of transporting data from sequencers and repositories to analysis engines (clouds)

Will NLR or Internet2 link to clouds? They do to TeraGrid

What can the LS data-intensive community learn from the HEP community? E.g., will the HEP approach of community-wide "workload management systems" and VOs work?

What is the role of Campus Clusters/resources in genomic data-sharing?

No history of large cloud budgets in federal grants

Slide7

How important is data security and privacy?

Human Genome processing cannot use the most cost effective solutions, which will be shared resources such as public clouds

Commercial, military applications

What other research applications have such concerns?

Analysis of copyrighted material such as digital books

Partly technical; partly a policy issue of establishing a trusted approach

Companies accept off site paper storage?

See recent hacking attacks such as Sony network, gmail

How important is fault tolerance/autonomic computing?
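The "Cost (time) of transporting data from sequencers and repositories to analysis engines" item on Slide6 can be made concrete with a back-of-envelope estimate. The sketch below is illustrative only and is not from the panel: the function name, the 1 TB dataset size, the link speeds, and the efficiency factor are all assumptions chosen for the example. It simply divides the data volume in bits by the usable link bandwidth.

```python
def transfer_time_seconds(size_bytes, link_bits_per_second, efficiency=1.0):
    """Estimate the seconds needed to move size_bytes over a network link.

    efficiency is an assumed fraction (0 < efficiency <= 1) of the nominal
    link bandwidth that the transfer actually achieves.
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if link_bits_per_second <= 0:
        raise ValueError("link_bits_per_second must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    # 8 bits per byte; usable bandwidth is nominal bandwidth times efficiency.
    return (size_bytes * 8) / (link_bits_per_second * efficiency)


# Hypothetical inputs: a 1 TB (10**12 byte) sequencing dataset and two
# nominal link speeds. None of these figures come from the slides.
DATASET_BYTES = 10**12
LINKS = {"1 Gbit/s": 10**9, "10 Gbit/s": 10**10}

if __name__ == "__main__":
    for name, rate in LINKS.items():
        hours = transfer_time_seconds(DATASET_BYTES, rate) / 3600
        print(f"{name}: about {hours:.2f} hours for 1 TB")
```

Comparing such an estimate with the expected analysis time is one rough way to judge whether moving data to the compute or moving compute to the data is cheaper for a given workload; real transfers also depend on protocol overhead, disk speed, and contention, which this sketch ignores.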