
Anatomy of a Climate Science-centric Workflow - PowerPoint Presentation

Uploaded On 2016-07-08

Presentation Transcript

Slide1

Anatomy of a Climate Science-centric Workflow

Harinarayan Krishnan, CAlibrated and Systematic Characterization, Attribution, and Detection of Extremes (CASCADE Team)

Kevin Bensema, Surendra Byna, Soyoung Jeon, Karthik Kashinath, Burlen Loring, Pardeep Pall, Prabhat, Alexandru Romosan, Oliver Ruebel, Daithi Stone, Travis O'Brien, Christopher Paciorek, Michael Wehner, Wes Bethel, William Collins

Slide2

Challenges

Scale of data is already at TBs and will only grow larger.

Data at three- to six-hourly intervals is processed frequently.

The focus is now on high resolution, 1/4th to 1/8th degree, extensible to higher.

High-resolution and high-frequency analysis add several orders of magnitude.

Slide3
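To put rough numbers on the resolution and frequency points from the Challenges slide, the sketch below counts grid cells and time steps per year. It assumes a regular, cell-centered global latitude-longitude grid and a 365-day year; the 1-degree daily baseline and the function names are illustrative assumptions, not figures from the talk.

```python
def grid_points(spacing_deg):
    """Cells in a regular global lat-lon grid (assumed cell-centered)."""
    n_lon = round(360 / spacing_deg)
    n_lat = round(180 / spacing_deg)
    return n_lon * n_lat


def steps_per_year(interval_hours, days=365):
    """Number of output time steps in one (assumed 365-day) year."""
    return days * 24 // interval_hours


# (grid spacing in degrees, output interval in hours)
cases = [(1.0, 24), (0.25, 6), (0.125, 3)]

baseline = grid_points(1.0) * steps_per_year(24)
for spacing, interval in cases:
    values = grid_points(spacing) * steps_per_year(interval)
    print(f"{spacing:>6} deg, {interval:>2}-hourly: "
          f"{values:>14,} values per variable and level per year "
          f"({values // baseline}x baseline)")

ratio = grid_points(0.125) * steps_per_year(3) // baseline
```

Under these assumptions, going from 1-degree daily output to 1/8-degree three-hourly output multiplies the values per variable and level by 512, before counting additional variables, vertical levels, or ensemble members.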

Proposed Strategy

Identification of use cases, extraction of common computational algorithms, scaling & optimization of current work.

Template workflow configurations of common use cases.

Abstraction of services to HPC environments.

Easy to use archiving, distribution, and verification strategies.

Standardization of parallel work environment.

Slide4

What it is/What it is not

What it is not

Not a general workflow

Not a general infrastructure – Balancing between performance & exploratory science.

What it is

…

For example:

t = cascade.Teca()
t['filename'] = 'myfile'
writer = cascade.Writer(cascade.ESGF)
writer['input'] = t['out']
n = workflow.NERSC(<resources>, writer)
n.execute()

Note: Active work in progress & ongoing…

Slide5

Slide6

Slide7

What it is/What it is not

What it is not

Not a general workflow

Not a general infrastructure – Balancing between performance & exploratory science.

What it is

A highly customized climate-centric API (Zonal Mean Averages, GEV, etc…)

Workflow – Verification/Validations, Job scheduling, Staging, Deployment, etc…

Modules – Performance & Timing Support, Calendar Support, etc…

Template workflows

Slide8

Climate Science-centric Workflow

Workspace – A collaboration environment to share, track documents, visualize status, update issues.

One-on-one – Identify use cases that require implementing new features or scaling & performance optimization of existing ones.

Software tools – Development and Deployment of algorithms & software packages as well as building & maintaining packages for target environments.

Workflow components – Connecting it all together.

Slide9

Communication Infrastructure

Slide10

Quick Note: Software Environment

Infrastructure – cascade.lbl.gov / esg02.nersc.gov

Confluence – Portal to publish and collaborate with team members.

Jira – Bug & issue tracking portal.

CDash/Jenkins – Infrastructure to report status of software build & regression tests.

BitBucket – Main software repository.

ESGF service – Service for distribution of data generated by CASCADE.

Slide11

CASCADE Team

Detection & Attribution Team – Characterization, detection, and attribution of simulated and observed extremes in a variety of different contexts -- Analysis Algorithms

Model Fidelity – Evaluation and improvement of model fidelity in simulating extremes.

Statistics – Development of statistical frameworks for extremes analysis, uncertainty quantification, and model evaluation. Formulation of highly parallel software for analysis and uncertainty quantification of extremes.

Slide12

Analysis Infrastructure Tasks

Development of new climate-centric algorithms and evaluation of current ones. Implement scalable, parallel versions as needed.

Performance analysis and data management.

Deployment and maintenance on HPC environments.

Creating a standardized environment – Provide the same execution environment on all deployed platforms, and seamlessly bridge different technologies (Python <-> R).

User Support.

Slide13

Detection & Attribution

Single Program Multiple Data (SPMD) scripts – Refactoring current algorithms to work in parallel.

Distribution/Staging – Functionality to distribute generated data through ESGF and also stage data at NERSC.

TECA – Active development of the Parallel Toolkit for Extreme Climate Analysis.

Teleconnections – Ensemble analysis & software solutions to investigate the frequency of teleconnection events.

Slide14
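The SPMD refactoring listed under Detection & Attribution generally means that every process runs the same script on its own slice of the data, and the partial results are then combined. The sketch below shows only that slicing-and-combining logic in plain Python: ranks are simulated in a loop, the data values are made up, and `local_range` and `local_max` are hypothetical helper names. In an actual mpi4py script, `rank` and `size` would come from `MPI.COMM_WORLD.Get_rank()` and `MPI.COMM_WORLD.Get_size()`.

```python
def local_range(n_items, rank, size):
    """Half-open index range [start, stop) owned by one rank.

    Items are split into contiguous blocks; the first `extra` ranks
    receive one additional item when n_items is not divisible by size.
    """
    base, extra = divmod(n_items, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def local_max(values, rank, size):
    """Maximum over this rank's block, or None if the block is empty."""
    start, stop = local_range(len(values), rank, size)
    block = values[start:stop]
    return max(block) if block else None


# Made-up time series standing in for one variable at one grid cell.
values = [3.1, 7.4, 2.2, 9.8, 5.0, 6.3, 1.7, 8.8, 4.4, 0.9]
size = 4  # simulated number of processes

# Each "rank" computes its partial result independently ...
partials = [local_max(values, rank, size) for rank in range(size)]

# ... and a final reduction combines them (a max-reduce in MPI terms).
global_max = max(p for p in partials if p is not None)
print(partials, global_max)
```

The same decomposition works for any reduction whose partial results can be combined, such as sums, counts, or minima.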

Model Fidelity

Slide15

Model Fidelity

ILIAD workflow

The parallelization of the generation of initial conditions.

Dynamic Building, Compilation & Execution of CESM.

Module verification – monitor execution status & successful completion.

Module for automation of archiving of output (initial conditions, namelist files, CESM output).

DepCache – External tool for speeding up execution of Python libraries.

Slide16

Statistics

Integration of Statistical Algorithms – Working to deploy relevant statistical algorithms within CASCADE framework.

Parallelization – Scaling statistics scripts to work in a parallel environment.

Llex Installation – Generalized Extreme Value Analysis & Peaks Over Threshold statistical analysis algorithms (developed by Stats team members).

Slide17
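For reference, the Generalized Extreme Value quantities named on the Statistics slide follow standard textbook formulas. The sketch below is not the Llex API; it is a minimal plain-Python version of the GEV distribution function and the T-year return level, with parameter values (location `mu`, scale `sigma`, shape `xi`) chosen only for illustration.

```python
import math


def gev_cdf(x, mu, sigma, xi):
    """GEV cumulative distribution function F(x; mu, sigma, xi)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    if abs(xi) < 1e-12:
        # Gumbel limit (xi -> 0)
        return math.exp(-math.exp(-z))
    t = 1.0 + xi * z
    if t <= 0.0:
        # Outside the support: below the lower bound (xi > 0)
        # or above the upper bound (xi < 0).
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))


def gev_return_level(period, mu, sigma, xi):
    """Level exceeded on average once every `period` blocks (e.g. years)."""
    y = -math.log(1.0 - 1.0 / period)
    if abs(xi) < 1e-12:
        return mu - sigma * math.log(y)
    return mu + sigma / xi * (y ** (-xi) - 1.0)


# Illustrative parameters only, e.g. annual maxima of some variable.
mu, sigma, xi = 30.0, 5.0, 0.1
level_100 = gev_return_level(100, mu, sigma, xi)
print(level_100, gev_cdf(level_100, mu, sigma, xi))
```

By construction, the distribution function evaluated at the 100-year return level equals 1 - 1/100, which is a quick consistency check on any fitted parameters.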

Software Suite

Python environment

IPython, mpi4py, numpy, …

CDAT-Core (cdms2, cdtime, …)

Rpy2 (Python-R bridge)

R environment

extRemes, ismev

Llex – GEV & POT (Dr. Chris Paciorek's package)

pbdR – pbdMPI, pbdSLAP, pbdPROF, pbdNCDF (ORNL)

TECA – parallel toolkit developed at LBNL (TC, ETC, AR)

- Prototype deployment at NERSC (module load cascade)

- Transitioning maintenance of NERSC ESGF Node to CASCADE analysis group.

Slide18

Workflow Infrastructure

Unified Workflow Service

Load balanced services that handle job Scheduling, Validation & Verification, Fault Tolerance

Core Modules

Calendar support

Data Reduction Operations (Sum, Max, Min, Average, etc…)

I/O services (Parallel Read/Write)

Threading/MPI wrapping (Map|Foreach)

Slide19
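The data reduction and Map|Foreach items listed under Core Modules can be pictured with a small standard-library sketch. This is not the CASCADE implementation: `REDUCTIONS` and `map_reduce` are hypothetical names, and a thread pool stands in for whatever threading or MPI wrapper the framework provides.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean

# Named reductions, mirroring the list on the slide.
REDUCTIONS = {
    "sum": sum,
    "max": max,
    "min": min,
    "average": fmean,
}


def map_reduce(name, all_series, workers=4):
    """Apply one named reduction to every series, in parallel threads.

    Results are returned in the same order as the input series.
    """
    reduce_one = REDUCTIONS[name]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(reduce_one, all_series))


# Made-up time series, one per grid cell.
series = [[1.0, 2.0, 3.0], [4.0, 6.0], [5.0]]
print(map_reduce("average", series))
print(map_reduce("max", series))
```

Keeping the reduction a plain function that is mapped over independent series is what would let the same call be backed by threads, processes, or MPI ranks without changing user code.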

Additional Services

MPO – A tool for recording scientific workflows, developed by General Atomics & LBNL.

Tigres – Template Interfaces for Agile Parallel Data-Intensive Science, developed by the Advanced Computing for Science Group at LBNL.

ESGF – Support for automated distribution through ESGF installation.

Slide20

Modules & API

CoreModule

Timing, Logging

Standard definition of parameter inputs & outputs

All modules are inherently Workflows of one.

implicit connectivity of workflow

BaseAPI (Pythonic)

__getitem__, __setitem__: param["input"] = val

cascade_static_{param|output}_spec: {name, value, type, user_defined}

cascade_execute – core execution function

Slide21
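The BaseAPI conventions listed under Modules & API can be sketched as a small base class. Only the names `CascadeBase`, `cascade_static_param_spec`, `cascade_static_output_spec`, `cascade_execute`, and the spec keys come from the slides; the class body, the `ZonalMean` module, and the way `execute()` records timing are assumptions made for illustration, not the CASCADE source.

```python
import time


class CascadeBase:
    """Hypothetical base class; spec entries use the keys named on the slide."""

    cascade_static_param_spec = []
    cascade_static_output_spec = []

    def __init__(self):
        # Start every parameter and output at its declared default value.
        self._params = {s["name"]: s["value"] for s in self.cascade_static_param_spec}
        self._outputs = {s["name"]: s["value"] for s in self.cascade_static_output_spec}
        self.elapsed = None  # wall-clock seconds of the last execute()

    def __setitem__(self, name, value):
        spec = next((s for s in self.cascade_static_param_spec if s["name"] == name), None)
        if spec is None:
            raise KeyError(f"unknown parameter: {name}")
        if not isinstance(value, spec["type"]):
            raise TypeError(f"{name} expects {spec['type'].__name__}")
        self._params[name] = value

    def __getitem__(self, name):
        if name in self._outputs:
            return self._outputs[name]
        return self._params[name]

    def cascade_execute(self):
        """Core execution function; derived modules return {output name: value}."""
        raise NotImplementedError

    def execute(self):
        # Timing support wrapped around the core function (assumed behavior).
        start = time.perf_counter()
        self._outputs.update(self.cascade_execute())
        self.elapsed = time.perf_counter() - start
        return self


class ZonalMean(CascadeBase):
    # "user_defined" is carried in the spec but not interpreted in this sketch.
    cascade_static_param_spec = [
        {"name": "field", "value": None, "type": list, "user_defined": True},
    ]
    cascade_static_output_spec = [
        {"name": "zonal_mean", "value": None, "type": list, "user_defined": False},
    ]

    def cascade_execute(self):
        # Average each latitude row over all longitudes.
        rows = self._params["field"]
        return {"zonal_mean": [sum(row) / len(row) for row in rows]}


z = ZonalMean()
z["field"] = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]  # 2 latitudes x 3 longitudes
z.execute()
print(z["zonal_mean"], z.elapsed)
```

Declaring inputs and outputs as static specs means a module can be validated and inspected before it runs, which is one plausible reading of "all modules are inherently workflows of one".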

Example Workflow

Example use case: Running a single module

^^^^^^^^^^^^^^

t = Teca()  # Where Teca is a derived class of CascadeBase
filename = 'myfile'
t['filename'] = filename
t.execute()

^^^^^^^^^^^^^^^^^

t1 = Teca()  # Where Teca is a derived class of CascadeBase
t2 = TecaAnalysis()  # Where TecaAnalysis is a derived class of CascadeBase
t2['inputdata'] = t1['outputdata']  # Note, this establishes a link
t2.execute()

Slide22

Proposed Workflow

t1 = Teca()
t2 = TecaAnalysis()
t3 = TecaAnalysis()
s = Diff()
t2['inputdata'] = t1['outputdata']
t3['inputdata'] = t1['outputdata']
s['inputdata1'] = t2['outputdata']
s['inputdata2'] = t3['outputdata']
s.write('prefix', 'file')
s.execute()

Slide23
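One way the linking in the Proposed Workflow slide could behave is sketched below: reading an output returns a link rather than data, and calling `execute()` on the last module pulls results through the upstream modules, running the shared source only once. `Link`, `Source`, `Scale`, and the caching behavior are assumptions for illustration; they stand in for `Teca` and `TecaAnalysis`, whose real behavior the slides do not describe, and the `write()` call from the slide is omitted.

```python
class Link:
    """Reference to a named output of an upstream module."""

    def __init__(self, module, name):
        self.module = module
        self.name = name


class CascadeBase:
    outputs = ("outputdata",)

    def __init__(self):
        self._inputs = {}
        self._results = None
        self.run_count = 0

    def __setitem__(self, name, value):
        self._inputs[name] = value

    def __getitem__(self, name):
        # Reading an output yields a link, which establishes connectivity.
        if name in self.outputs:
            return Link(self, name)
        return self._inputs[name]

    def execute(self):
        # Run upstream modules first; cache results so shared inputs run once.
        if self._results is None:
            resolved = {}
            for name, value in self._inputs.items():
                if isinstance(value, Link):
                    value = value.module.execute()[value.name]
                resolved[name] = value
            self._results = self.cascade_execute(**resolved)
            self.run_count += 1
        return self._results


class Source(CascadeBase):  # stand-in for Teca
    def cascade_execute(self):
        return {"outputdata": [1.0, 2.0, 3.0]}


class Scale(CascadeBase):  # stand-in for TecaAnalysis
    def __init__(self, factor):
        super().__init__()
        self.factor = factor

    def cascade_execute(self, inputdata):
        return {"outputdata": [self.factor * x for x in inputdata]}


class Diff(CascadeBase):
    def cascade_execute(self, inputdata1, inputdata2):
        return {"outputdata": [a - b for a, b in zip(inputdata1, inputdata2)]}


t1, t2, t3, s = Source(), Scale(2.0), Scale(0.5), Diff()
t2["inputdata"] = t1["outputdata"]
t3["inputdata"] = t1["outputdata"]
s["inputdata1"] = t2["outputdata"]
s["inputdata2"] = t3["outputdata"]
result = s.execute()["outputdata"]
print(result, t1.run_count)
```

Because the links form a graph, a scheduler could also run `t2` and `t3` concurrently once `t1` has finished; this sketch resolves them sequentially for simplicity.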

Recap: Anatomy of Climate Science-centric Workflow

Software Environment – Development, Deployment, and Maintenance

Custom Use Case Support for D&A, Model Fidelity, and Statistics team needs.

Software Suite – Scaling, Parallelism, Performance Management, Software Services (Python, R, TECA)

Workflow Development – Thin Client & Workflow service, Module development, Optimization (Data Movement, Workflow execution), Provenance.

Slide24

Thanks

Questions?