Slide1
Analysis Tools for Data Enabled Science
SALSA HPC Group
http://salsahpc.indiana.edu
School of Informatics and Computing
Indiana University
Slide2
Bioinformatics Pipeline
Gene Sequences (N = 1 Million)
Distance Matrix
Interpolative MDS with Pairwise Distance Calculation
Multi-Dimensional Scaling (MDS)
Visualization
3D Plot
Reference Sequence Set (M = 100K)
N - M Sequence Set (900K)
Select Reference
Reference Coordinates
x, y, z
N - M Coordinates
x, y, z
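The pipeline on this slide splits the N sequences into a reference set of M and an N - M remainder. The sketch below is a rough illustration of why that split helps, assuming each remaining sequence is compared only against the M references. The helper names, the toy mismatch distance, and the inverse-distance placement rule are hypothetical stand-ins, not the group's actual alignment or interpolation method.

```python
def pairwise_count(n: int) -> int:
    """Number of unordered pairs among n items, n(n-1)/2, which grows as O(n^2)."""
    return n * (n - 1) // 2


def toy_distance(a: str, b: str) -> float:
    """Fraction of mismatched positions between two equal-length strings.

    Hypothetical stand-in for an alignment-based distance.
    """
    return sum(x != y for x, y in zip(a, b)) / len(a)


def interpolate(seq, references, ref_coords, k=2):
    """Place one out-of-sample sequence using its k nearest references.

    Simplified assumption: inverse-distance weighted average of the
    reference coordinates (x, y, z).
    """
    nearest = sorted(
        (toy_distance(seq, ref), i) for i, ref in enumerate(references)
    )[:k]
    if nearest[0][0] == 0.0:
        # Identical to a reference: reuse its coordinates directly.
        return ref_coords[nearest[0][1]]
    weights = [1.0 / d for d, _ in nearest]
    total = sum(weights)
    return tuple(
        sum(w * ref_coords[i][axis] for w, (_, i) in zip(weights, nearest)) / total
        for axis in range(3)
    )


N, M = 1_000_000, 100_000
full_cost = pairwise_count(N)                  # all pairs over N sequences
split_cost = pairwise_count(M) + (N - M) * M   # reference pairs + remainder vs references
```

Under these assumptions the split needs roughly one fifth of the distance computations of the full all-pairs calculation.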
Pairwise Alignment & Distance Calculation: O(N^2)
Slide3
Structure of Twister4Azure
Slide4
Iterative MapReduce for Azure
Merge Step
In-Memory Caching of static data
Cache-aware hybrid scheduling using Queues as well as a bulletin board (special table)
Slide5
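As a sketch of the Twister4Azure ideas listed on Slide4 (in-memory caching of static data and cache-aware scheduling), the following toy simulation may help. It is a simplified model under stated assumptions: the first iteration hands out tasks through a queue, later iterations announce tasks on a bulletin board, and each worker prefers a task whose data it already holds. `Worker`, `pick_task`, and `run_iterations` are hypothetical names, and no Azure API is involved.

```python
from collections import deque


class Worker:
    def __init__(self, name):
        self.name = name
        self.cache = {}   # data_id -> static data kept in memory
        self.loads = 0    # number of times data had to be fetched

    def load(self, data_id):
        if data_id not in self.cache:
            self.loads += 1
            self.cache[data_id] = f"contents of {data_id}"  # placeholder for a download
        return self.cache[data_id]


def pick_task(worker, board, queue):
    """Prefer a board task with cached data, then the queue, then any board task."""
    for task in board:
        if task in worker.cache:
            board.remove(task)
            return task
    if queue:
        return queue.popleft()
    if board:
        return board.pop(0)
    return None


def run_iterations(workers, data_ids, iterations):
    for it in range(iterations):
        # Assumption: queue scheduling first, bulletin board afterwards.
        queue = deque(data_ids) if it == 0 else deque()
        board = [] if it == 0 else list(data_ids)
        while queue or board:
            for worker in workers:
                task = pick_task(worker, board, queue)
                if task is None:
                    break
                worker.load(task)


workers = [Worker("w0"), Worker("w1")]
run_iterations(workers, ["d0", "d1", "d2", "d3"], iterations=3)
total_loads = sum(w.loads for w in workers)
```

In this model each data partition is fetched once, however many iterations run, because later iterations are served from the workers' caches.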
Performance – Kmeans Clustering
Performance with/without data caching
Speedup gained using data cache
Scaling speedup
Increasing number of iterations
Slide6
Performance Comparisons
BLAST Sequence Search
Cap3 Sequence Assembly
Smith Waterman Sequence Alignment
Slide7
Twister v0.9
Configuration Program to set up the Twister environment automatically on a cluster
Full mesh network of brokers for facilitating communication
New messaging interface for reducing the message serialization overhead
Memory Cache to share data between tasks and jobs
New Infrastructure for Iterative MapReduce Programming
Slide8
Twister-MDS Demo
This demo shows real-time visualization of a multidimensional scaling (MDS) calculation. Twister runs the parallel computation inside the cluster, and PlotViz displays the intermediate results on the user's client computer. The program automates both the computation and the monitoring.
Slide9
MDS projection of 100,000 protein sequences showing a few experimentally identified clusters in preliminary work with Seattle Children’s Research Institute.
Twister-MDS Output
Slide10
Twister-MDS Work Flow
Master Node
Twister Driver
Twister-MDS
ActiveMQ Broker
MDS Monitor
PlotViz
I. Send message to start the job
II. Send intermediate results
Local Disk
III. Write data
IV. Read data
Client Node
Slide11
MDS Output Monitoring Interface
Pub/Sub Broker Network
Worker Node
Worker Pool
Twister Daemon
Master Node
Twister Driver
Twister-MDS
Worker Node
Worker Pool
Twister Daemon
map
reduce
map
reduce
calculateStress
calculateBC
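The map, reduce, and calculateStress labels on this slide suggest that the stress value is computed as a map-reduce step. The sketch below shows one way that could look, assuming the unweighted raw stress, the sum over pairs i < j of (d_ij - delta_ij)^2, where d_ij is the distance between the current coordinates and delta_ij is the target dissimilarity. Twister-MDS may weight or normalize this differently; the Python function names are hypothetical and are not the Twister API.

```python
import math


def map_partial_stress(rows, coords, delta):
    """Map task: squared-error sum for one block of rows, over pairs i < j."""
    partial = 0.0
    for i in rows:
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            partial += (d - delta[i][j]) ** 2
    return partial


def reduce_stress(partials):
    """Reduce task: combine the per-block partial sums."""
    return sum(partials)


def calculate_stress(coords, delta, num_maps=2):
    n = len(coords)
    # Rows are dealt out cyclically so the map tasks get similar workloads.
    blocks = [range(b, n, num_maps) for b in range(num_maps)]
    return reduce_stress(map_partial_stress(rows, coords, delta) for rows in blocks)
```

Because each pair is counted by exactly one map task, the result does not depend on `num_maps`.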
Twister-MDS Structure
Slide12
New Network of Messaging Brokers
Twister Driver Node
Twister Daemon Node
ActiveMQ Broker Node
Broker-Daemon Connection
Broker-Broker Connection
Broker-Driver Connection
7 Brokers and 32 Computing Nodes in total
Full Mesh Network
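To make the full mesh concrete, here is a small sketch using the slide's numbers (7 brokers, 32 computing nodes). It assumes each daemon connects to exactly one broker and that daemons are spread round-robin. Both are illustrative assumptions, and the names below are hypothetical rather than Twister or ActiveMQ configuration.

```python
from itertools import combinations


def full_mesh(brokers):
    """Every broker is connected to every other broker."""
    return list(combinations(brokers, 2))


def assign_daemons(daemons, brokers):
    """Assumed policy: spread daemons over brokers round-robin."""
    assignment = {b: [] for b in brokers}
    for i, daemon in enumerate(daemons):
        assignment[brokers[i % len(brokers)]].append(daemon)
    return assignment


def route(src, dst, broker_of):
    """Path of a daemon-to-daemon message; a full mesh needs at most two brokers."""
    path = [src, broker_of[src]]
    if broker_of[dst] != broker_of[src]:
        path.append(broker_of[dst])
    path.append(dst)
    return path


brokers = [f"broker{i}" for i in range(7)]
daemons = [f"daemon{i}" for i in range(32)]
mesh = full_mesh(brokers)                 # 7 * 6 / 2 broker-broker connections
assignment = assign_daemons(daemons, brokers)
broker_of = {d: b for b, ds in assignment.items() for d in ds}
```

With these assumptions, a message between any two daemons crosses at most one broker-broker connection.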
Hierarchical Sending
Slide13
Performance Improvements
Slide14
Harnessing the Power of Workflow
Design Workflow Pattern
Configure Trident Jobs
Slide15
Harnessing the Power of Workflow
Future Work: Combine Windows Trident with Twister
Slide16
Twister for Polar Science
The Center for Remote Sensing of Ice Sheets
Research
Education
Knowledge Transfer
Utilizing the Power of Twister to Perform Large Scale Scientific Calculation
Slide17
Twister for Polar Science
Deploying a Twister Appliance for Polar Grid
copy
instantiate
…
Virtual Machines
GroupVPN
Virtual IP - DHCP 5.5.1.1
Virtual IP - DHCP 5.5.1.2
GroupVPN Credentials (from Web site)
Slide18
Twister Architecture
Linux HPC Bare-system
Amazon Cloud
Windows Server HPC Bare-system
Virtualization
Cross Platform Iterative MapReduce
Life Sciences, Physics, Information Retrieval, Social Network
Smith Waterman Dissimilarities, CAP-3 Gene Assembly, PhyloD, High Energy Physics, Clustering, Multidimensional Scaling, Generative Topological Mapping
CPU Nodes
Virtualization
Applications
Runtimes
Infrastructure software
Hardware
Azure Cloud
Services and Workflow
High Level Language
Messaging Middleware
Storage and Data Parallel File System
Grid Appliance
GPU Nodes
Support Scientific Simulations (Data Mining and Data Analysis)
Slide19
Twister Futures
Development of library of Collectives to use at Reduce phase
Broadcast and Gather needed by current applications
Discover other important ones
Implement efficiently on each platform – especially Azure
Better software message routing with broker networks using asynchronous I/O with communication fault tolerance
Support nearby location of data and computing using data parallel file systems
Clearer application fault tolerance model based on implicit synchronization points at iteration end points
Later: Investigate GPU support
Later: run time for data parallel languages like Sawzall, Pig Latin, LINQ
Slide20
(a) Map Only
(b) Classic MapReduce
(c) Iterative MapReduce
(d) Loosely Synchronous
Input
map
reduce
Input
map
reduce
Iterations
Input
Output
map
Pij
CAP3 Analysis
Smith-Waterman Distances
Parametric sweeps
PolarGrid Matlab data analysis
High Energy Physics (HEP) Histograms
Distributed search
Distributed sorting
Information retrieval
Many MPI scientific applications such as solving differential equations and particle dynamics
Domain of MapReduce and Iterative Extensions
MPI
Expectation maximization clustering, e.g. Kmeans
Linear Algebra
Multidimensional Scaling
Page Rank
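Pattern (c), Iterative MapReduce, can be illustrated with Kmeans, which this slide lists among the iterative applications. The sketch below is a minimal one-dimensional version, assuming the data partitions stay fixed across iterations while only the centroids change and are passed to every map task each round. All function names are hypothetical and this is not the Twister API.

```python
def kmeans_map(partition, centroids):
    """Map: for one static data partition, emit a (sum, count) pair per centroid."""
    partial = {c: [0.0, 0] for c in range(len(centroids))}
    for x in partition:
        nearest = min(range(len(centroids)), key=lambda c: abs(x - centroids[c]))
        partial[nearest][0] += x
        partial[nearest][1] += 1
    return partial


def kmeans_reduce(partials, centroids):
    """Reduce: merge the partial sums and compute the new centroids."""
    updated = []
    for c, old in enumerate(centroids):
        total = sum(p[c][0] for p in partials)
        count = sum(p[c][1] for p in partials)
        updated.append(total / count if count else old)  # keep empty clusters in place
    return updated


def iterative_kmeans(partitions, centroids, max_iter=20, tol=1e-9):
    for _ in range(max_iter):
        partials = [kmeans_map(p, centroids) for p in partitions]  # map phase
        updated = kmeans_reduce(partials, centroids)               # reduce phase
        converged = max(abs(a - b) for a, b in zip(updated, centroids)) < tol
        centroids = updated   # new centroids go to the next iteration's map tasks
        if converged:
            break
    return centroids


partitions = [[1.0, 2.0], [9.0, 10.0], [1.5, 9.5]]
result = iterative_kmeans(partitions, [0.0, 5.0])
```

Classic MapReduce, pattern (b), corresponds to a single pass through the loop body; the iterative pattern repeats it until the centroids stop moving.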
Twister Futures
Slide21
Education and Broader Impact
We devote a great deal of effort to guiding students who are interested in computing
Slide22
Education
We offer classes on new topics, together with tutorials on the most popular cloud computing tools
Slide23
Hosting workshops and spreading our technology across the nation
Giving students an unforgettable research experience
Broader Impact
Slide24
Acknowledgement
SALSA HPC Group
Indiana University
http://salsahpc.indiana.edu