Analysis Tools for - PowerPoint Presentation

celsa-spraggs
Uploaded On 2016-08-01

Presentation Transcript

Slide1

Analysis Tools for Data Enabled Science

SALSA HPC Group
http://salsahpc.indiana.edu
School of Informatics and Computing
Indiana University

Slide2

Bioinformatics Pipeline

Gene Sequences (N = 1 Million)

Distance Matrix

Interpolative MDS with Pairwise Distance Calculation

Multi-Dimensional Scaling (MDS)

Visualization

3D Plot

Reference Sequence Set (M = 100K)

N - M Sequence Set (900K)

Select Reference

Reference Coordinates

x, y, z

N - M Coordinates

x, y, z

Pairwise Alignment & Distance Calculation
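
A back-of-the-envelope sketch of why the pipeline splits the sequences into a reference set and an interpolated remainder. It only counts pairwise distances, using the N and M values from the slide. The function names are illustrative, and the assumption that each of the N - M remaining sequences is compared against every reference sequence is a simplification, not a statement about the actual implementation.

```python
def full_pairwise(n: int) -> int:
    """Number of distinct pairs among n sequences: n(n-1)/2, i.e. O(n^2)."""
    return n * (n - 1) // 2


def interpolated_pairwise(n: int, m: int) -> int:
    """Pairs needed when only m reference sequences get full pairwise treatment.

    Assumption (not from the slides): each of the n - m remaining sequences
    is compared against all m reference sequences.
    """
    return full_pairwise(m) + (n - m) * m


N = 1_000_000   # gene sequences
M = 100_000     # reference sequence set

full = full_pairwise(N)
split = interpolated_pairwise(N, M)
print(f"full pairwise:        {full:.3e} distances")
print(f"reference + the rest: {split:.3e} distances")
print(f"reduction factor:     {full / split:.1f}x")
```

Under that assumption the split cuts the number of distance calculations roughly fivefold for these values of N and M.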

O(N^2)

Slide3

Structure of Twister4Azure

Slide4

Iterative MapReduce for Azure

Merge Step

In-Memory Caching of static data

Cache-aware hybrid scheduling using Queues as well as a bulletin board (special table)

Slide5

Performance – Kmeans Clustering

Performance with/without data caching

Speedup gained using data cache

Scaling speedup

Increasing number of iterations

Slide6

Performance Comparisons

BLAST Sequence Search

Cap3 Sequence Assembly

Smith-Waterman Sequence Alignment

Slide7

Twister v0.9

Configuration program to set up the Twister environment automatically on a cluster

Full mesh network of brokers for facilitating communication

New messaging interface for reducing the message serialization overhead

Memory cache to share data between tasks and jobs

New Infrastructure for Iterative MapReduce Programming

Slide8

Twister-MDS Demo

This demo shows real-time visualization of a multidimensional scaling (MDS) calculation. Twister runs the parallel calculation inside the cluster, and PlotViz displays the intermediate results on the user's client computer. The program automates both the computation and the monitoring.

Slide9

MDS projection of 100,000 protein sequences showing a few experimentally identified clusters in preliminary work with Seattle Children’s Research Institute.

Twister-MDS Output

Slide10

Twister-MDS Work Flow

Master Node

Twister Driver

Twister-MDS

ActiveMQ Broker

MDS Monitor

PlotViz

I. Send message to start the job

II. Send intermediate results

Local Disk

III. Write data

IV. Read data
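
The four steps above (I to IV) can be mimicked with a small in-memory stand-in for the broker. This is only a sketch: the `Broker` class, the topic names, and the file layout are hypothetical and do not reflect the ActiveMQ or Twister APIs, and which component sends each message is an interpretation of the diagram labels.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


class Broker:
    """Toy synchronous pub/sub broker (hypothetical stand-in for ActiveMQ)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)


def run_demo(iterations=3):
    broker = Broker()
    out_dir = Path(tempfile.mkdtemp())
    written = []

    # III. The monitor writes each intermediate result to local disk.
    def monitor_on_result(message):
        path = out_dir / f"iteration_{message['iteration']}.json"
        path.write_text(json.dumps(message))
        written.append(path)

    # II. The driver side publishes intermediate results after each iteration.
    def driver_on_start(message):
        for i in range(message["iterations"]):
            coords = [[float(i), 0.0, 0.0]]  # placeholder x, y, z coordinates
            broker.publish("mds.results", {"iteration": i, "coords": coords})

    broker.subscribe("mds.results", monitor_on_result)
    broker.subscribe("mds.start", driver_on_start)

    # I. The monitor sends a message to start the job.
    broker.publish("mds.start", {"iterations": iterations})

    # IV. The viewer (PlotViz in the slides) reads the data back from disk.
    return [json.loads(p.read_text()) for p in written]


frames = run_demo()
print(len(frames), "intermediate frames read back")
```

Writing each iteration to its own file keeps the viewer decoupled from the messaging layer: it only needs to read files, which matches the separate "Write data" and "Read data" arrows.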

Client Node

Slide11

MDS Output Monitoring Interface

Pub/Sub Broker Network

Master Node: Twister Driver, Twister-MDS

Worker Node (shown twice): Worker Pool, Twister Daemon

map, reduce (shown twice)

calculateStress

calculateBC
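
The labels calculateBC and calculateStress suggest the two per-iteration steps of the SMACOF algorithm for MDS: a Guttman transform, X ← (1/N)·B(X)·X, followed by a stress evaluation. The sketch below is a serial, unweighted version under that assumption. It is not the Twister-MDS code, and the function names merely echo the slide labels.

```python
import math
import random


def distances(X):
    """Euclidean distance matrix of the current coordinates X (list of points)."""
    return [[math.dist(p, q) for q in X] for p in X]


def calculate_bc(X, delta):
    """Guttman transform: return (1/N) * B(X) * X for unweighted SMACOF."""
    n, dim = len(X), len(X[0])
    d = distances(X)
    # Off-diagonal entries of B(X) are -delta_ij / d_ij (0 where d_ij == 0).
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and d[i][j] > 0.0:
                B[i][j] = -delta[i][j] / d[i][j]
        B[i][i] = -sum(B[i])  # each row of B(X) sums to zero
    return [[sum(B[i][k] * X[k][c] for k in range(n)) / n for c in range(dim)]
            for i in range(n)]


def calculate_stress(X, delta):
    """Raw stress: sum over pairs i < j of (d_ij(X) - delta_ij)^2."""
    d = distances(X)
    n = len(X)
    return sum((d[i][j] - delta[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def smacof(delta, dim=3, iterations=50, seed=0):
    """Run a fixed number of SMACOF iterations from a random start."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(dim)] for _ in delta]
    history = [calculate_stress(X, delta)]
    for _ in range(iterations):
        X = calculate_bc(X, delta)
        history.append(calculate_stress(X, delta))
    return X, history


# Target dissimilarities: four points at the corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
delta = distances(square)
X, history = smacof(delta)
print(f"stress: {history[0]:.4f} -> {history[-1]:.6f}")
```

In a parallel setting each of the two functions is a natural candidate for its own map and reduce pair, since both are sums over blocks of rows of the distance matrix.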

Twister-MDS Structure

Slide12

New Network of Messaging Brokers

Twister Driver Node

Twister Daemon Node

ActiveMQ Broker Node

Broker-Daemon Connection

Broker-Broker Connection

Broker-Driver Connection

7 Brokers and 32 Computing Nodes in total

Full Mesh Network
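
For scale, a full mesh among b brokers needs b(b-1)/2 broker-to-broker connections. The sketch below works that out for the 7 brokers mentioned above and spreads the 32 computing nodes over the brokers round-robin. The round-robin assignment and all names are assumptions for illustration; the slide does not say how daemons are attached to brokers.

```python
from itertools import combinations


def full_mesh_links(brokers):
    """Every unordered pair of brokers gets one broker-broker connection."""
    return list(combinations(brokers, 2))


def assign_round_robin(nodes, brokers):
    """Hypothetical daemon-to-broker assignment: node i goes to broker i mod b."""
    assignment = {b: [] for b in brokers}
    for i, node in enumerate(nodes):
        assignment[brokers[i % len(brokers)]].append(node)
    return assignment


brokers = [f"broker{i}" for i in range(7)]
nodes = [f"node{i}" for i in range(32)]

links = full_mesh_links(brokers)
assignment = assign_round_robin(nodes, brokers)
print(len(links), "broker-broker connections")
print({b: len(ns) for b, ns in assignment.items()})
```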

Hierarchical Sending

Slide13

Performance Improvements

Slide14

Harnessing the Power of Workflow

Design Workflow Pattern

Configure Trident Jobs

Slide15

Harnessing the Power of Workflow

Future Work: Combine Windows Trident with Twister

Slide16

Twister for Polar Science

The Center for Remote Sensing of Ice Sheets

Research

Education

Knowledge Transfer

Utilizing the Power of Twister to Perform Large Scale Scientific Calculation

Slide17

Twister for Polar Science

Deploying a Twister Appliance for Polar Grid

copy

instantiate

Virtual Machines

GroupVPN

Virtual IP - DHCP: 5.5.1.1

Virtual IP - DHCP: 5.5.1.2

GroupVPN Credentials (from Web site)

Slide18

Twister Architecture

Layers: Applications, Runtimes, Infrastructure software, Hardware

Support Scientific Simulations (Data Mining and Data Analysis)

Life Sciences, Physics, Information Retrieval, Social Network

Smith Waterman Dissimilarities, CAP-3 Gene Assembly, PhyloD, High Energy Physics, Clustering, Multidimensional Scaling, Generative Topological Mapping

Services and Workflow

High Level Language

Cross Platform Iterative MapReduce

Messaging Middleware

Storage and Data Parallel File System

Linux HPC Bare-system, Amazon Cloud, Windows Server HPC Bare-system, Azure Cloud, Grid Appliance

Virtualization

CPU Nodes, GPU Nodes

Slide19

Twister Futures

Development of library of Collectives to use at Reduce phase

Broadcast and Gather needed by current applications

Discover other important ones

Implement efficiently on each platform, especially Azure

Better software message routing with broker networks using asynchronous I/O with communication fault tolerance

Support nearby location of data and computing using data parallel file systems

Clearer application fault tolerance model based on implicit synchronization points at iteration end points

Later: Investigate GPU support

Later: run time for data parallel languages like Sawzall, Pig Latin, LINQ

Slide20

 

Domain of MapReduce and Iterative Extensions

(a) Map Only: Input, map, Output

CAP3 Analysis; Smith-Waterman Distances; Parametric sweeps; PolarGrid Matlab data analysis

(b) Classic MapReduce: Input, map, reduce

High Energy Physics (HEP) Histograms; Distributed search; Distributed sorting; Information retrieval

(c) Iterative MapReduce: Input, map, reduce, Iterations

Expectation maximization clustering e.g. Kmeans; Linear Algebra; Multidimensional Scaling; Page Rank

MPI

(d) Loosely Synchronous: Pij

Many MPI scientific applications such as solving differential equations and particle dynamics
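
Kmeans, one of the applications listed above, illustrates why the iterative form differs from classic MapReduce: the data points are static and can stay cached across iterations, while only the small set of centroids changes and is re-sent each round. A minimal single-process sketch, assuming 1-D points and hypothetical function names (this is not the Twister API):

```python
from collections import defaultdict


def kmeans_map(cached_points, centroids):
    """Map: assign each cached (static) point to its nearest centroid."""
    partial = defaultdict(lambda: [0.0, 0])  # centroid index -> [sum, count]
    for x in cached_points:
        nearest = min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
        partial[nearest][0] += x
        partial[nearest][1] += 1
    return dict(partial)


def kmeans_reduce(partials, centroids):
    """Reduce: combine partial sums from all map tasks into new centroids."""
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for k, (s, c) in partial.items():
            totals[k][0] += s
            totals[k][1] += c
    # A centroid with no assigned points keeps its previous position.
    return [totals[k][0] / totals[k][1] if k in totals else centroids[k]
            for k in range(len(centroids))]


def iterative_kmeans(partitions, centroids, max_iterations=20, tol=1e-9):
    """Driver loop: partitions are loaded once; only centroids are re-sent."""
    for iteration in range(1, max_iterations + 1):
        partials = [kmeans_map(part, centroids) for part in partitions]
        new_centroids = kmeans_reduce(partials, centroids)
        shift = max(abs(a - b) for a, b in zip(new_centroids, centroids))
        centroids = new_centroids
        if shift < tol:  # convergence test at the end of each iteration
            break
    return centroids, iteration


partitions = [[1.0, 1.2, 0.8], [9.0, 9.5, 10.5], [1.1, 10.0]]
centroids, iterations = iterative_kmeans(partitions, [0.0, 5.0])
print(centroids, "after", iterations, "iterations")
```

The `partitions` list plays the role of the cached static data: it is built once and reused by every iteration, so the per-iteration traffic is just the centroid list and the partial sums.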

 

Twister Futures

Slide21

Education and Broader Impact

We devote considerable effort to guiding students who are interested in computing

Slide22

Education

We offer classes on new topics, together with tutorials on the most popular cloud computing tools

Slide23

Hosting workshops and spreading our technology across the nation

Giving students an unforgettable research experience

Broader Impact

Slide24

Acknowledgement

SALSA HPC Group
Indiana University
http://salsahpc.indiana.edu