Slide 1
Science in Clouds
SALSA Team
salsaweb/salsa
Community Grids Laboratory, Digital Science Center
Pervasive Technology Institute
Indiana University
Slide 2
Science Clouds Architecture
Virtual cluster provisioning via XCAT; supports both stateful and stateless OS images
Applications: Smith Waterman dissimilarities, CAP-3 gene assembly, PhyloD using DryadLINQ, High Energy Physics, Clustering, Multidimensional Scaling, Generative Topographic Mapping
Runtimes: Microsoft DryadLINQ / MPI; Apache Hadoop / MapReduce++ / MPI
Operating systems: Linux bare-system, Linux virtual machines, Windows Server 2008 HPC bare-system
Infrastructure software: XCAT infrastructure, Xen virtualization
Hardware: iDataplex bare-metal nodes
Slide 3
Pairwise Distances – Smith Waterman
DryadLINQ, Hadoop, and MapReduce++ implementations (also applied to High Energy Physics data analysis)
Calculate pairwise distances for a collection of genes (used for clustering and MDS)
Fine-grained tasks in MPI; coarse-grained tasks in DryadLINQ
Performed on 768 cores (Tempest cluster): 125 million distances in 4 hours and 46 minutes
[Diagram: an NxN matrix with block indices 0, 1, 2, ..., D-1 along rows and columns; only blocks in the upper triangle are computed]
The upper triangle of the NxN matrix is broken down into DxD blocks; consecutive blocks in each row are merged to form row blocks of NxD elements, so each process has a workload of NxD elements (see the sketch below)
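A minimal Python sketch of this decomposition, assuming D is the block edge length and that D divides N; the sizes and names here are illustrative, not taken from the SALSA implementation:

```python
# Enumerate the blocks of the upper triangle of an N x N distance
# matrix split into D x D blocks, and compute the per-process
# workload of one merged row block (~ N x D elements).

def upper_triangle_blocks(N, D):
    """Yield (row, col) block indices with col >= row."""
    nblocks = N // D                       # blocks per dimension
    for r in range(nblocks):
        for c in range(r, nblocks):        # upper triangle only
            yield (r, c)

def row_block_workload(N, D):
    """Element count of one merged row block."""
    return N * D

if __name__ == "__main__":
    N, D = 40000, 1000                     # hypothetical sizes
    blocks = list(upper_triangle_blocks(N, D))
    print(f"{len(blocks)} blocks; {row_block_workload(N, D):,} elements per row block")
```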
Slide 4
Scalability of Pairwise Distance Calculations
DryadLINQ on Windows HPC; Hadoop MapReduce on Linux bare metal; Hadoop MapReduce on Linux virtual machines running on the Xen hypervisor
VM overhead decreases as block size increases
Memory-bandwidth-bound computation; communication occurs in bursts
Performed on an iDataplex cluster using 32 nodes x 8 cores
Performance degradation for 125 million distance calculations on VMs: 15.33%
Perf. degradation = (T_vm - T_baremetal) / T_baremetal
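As a quick check of the metric above, a two-line helper with made-up timings chosen only to reproduce the reported 15.33% figure:

```python
def perf_degradation(t_vm, t_baremetal):
    """(T_vm - T_baremetal) / T_baremetal, as a fraction."""
    return (t_vm - t_baremetal) / t_baremetal

print(f"{perf_degradation(1153.3, 1000.0):.2%}")   # hypothetical times -> 15.33%
```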
Slide 5
Pairwise Distance Calculations: Effect of Inhomogeneous Data
Calculation time per pair [A,B] ∝ Length(A) x Length(B)
Inhomogeneity of the data does not have a significant effect when the sequence lengths are randomly distributed
This shows the natural load balancing of Hadoop MapReduce's dynamic task assignment through a global pipeline, in contrast to DryadLINQ's static assignments (see the sketch below)
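A toy model of the inhomogeneity effect, assuming per-pair cost proportional to Length(A) x Length(B) and purely static, contiguous partitioning; the lengths are hypothetical:

```python
# Cost of row A across all pairs ~ len(A) * sum(all lengths), so a
# contiguous static partition's cost tracks the lengths it happens
# to receive: skewed when sorted, balanced when randomly ordered.
import random

def chunk_costs(lengths, parts):
    n = len(lengths) // parts
    chunks = [lengths[i * n:(i + 1) * n] for i in range(parts)]
    total = sum(lengths)
    return [sum(l * total for l in chunk) for chunk in chunks]

lengths = [200] * 500 + [600] * 500        # hypothetical skewed lengths
print("sorted  :", chunk_costs(sorted(lengths), 4))
random.shuffle(lengths)
print("shuffled:", chunk_costs(lengths, 4))
```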
Slide 6
CAP3 – Gene Assembly
Expressed Sequence Tag (EST) assembly to reconstruct full-length mRNA
Performed using DryadLINQ, Apache Hadoop, and MapReduce++ implementations
[Figure: Performance of CAP3]
PhyloD using DryadLINQ
Derives associations between HLA alleles and HIV codons, and between the codons themselves
DryadLINQ implementation
Slide 7
K-Means Clustering & Matrix Multiplication Using Cloud Technologies
K-Means clustering on 2D vector data: DryadLINQ, Hadoop, MapReduce++, and MPI implementations; MapReduce++ performs close to MPI
[Figures: Performance of K-Means; Parallel Overhead – Matrix Multiplication]
Matrix multiplication in the MapReduce model: Hadoop, MapReduce++, and MPI; MapReduce++ performs close to MPI (see the sketch below)
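A toy map/reduce formulation of matrix multiplication, for intuition only; the slide does not say which decomposition the benchmarked implementations used:

```python
def mm_map(A, B):
    """Emit ((i, j), partial) pairs, one per scalar product."""
    n, m, p = len(A), len(B), len(B[0])
    for i in range(n):
        for k in range(m):
            for j in range(p):
                yield (i, j), A[i][k] * B[k][j]    # key = output cell

def mm_reduce(pairs, n, p):
    """Sum the partial products grouped by output cell."""
    C = [[0.0] * p for _ in range(n)]
    for (i, j), v in pairs:
        C[i][j] += v
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(mm_reduce(mm_map(A, B), 2, 2))               # [[19.0, 22.0], [43.0, 50.0]]
```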
Slide 8
Virtualization Overhead – Cloud Technologies
Nearly 15% performance degradation for Hadoop on Xen VMs
Hadoop handles inhomogeneous data better than Dryad; Hadoop's dynamic task scheduling makes this possible
Handling large data on VMs adds more overhead, especially if the data is accessed over the network
Slide 9
Virtualization Overhead – MPI
Matrix multiplication: implements Cannon's algorithm [1]; exchanges large messages, so it is more susceptible to bandwidth than to latency (see the sketch below)
14% reduction in speedup between the bare system and one VM per node
[Figures: Performance – 64 CPU cores; Speedup – fixed matrix size (5184x5184)]
K-Means clustering: up to 40 million 3D data points; the amount of communication depends only on the number of cluster centers, and communication << computation and the amount of data processed
At the highest granularity, VMs show 33% or more total overhead; overheads are extremely large for smaller grain sizes
[Figure: Performance – 128 CPU cores]
Overhead = (P * T(P) - T(1)) / T(1)
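For reference, a serial numpy sketch of Cannon's algorithm on a q x q block grid. In the MPI benchmark the per-step block shifts become the large message exchanges noted above; here they are simulated with list rotations:

```python
import numpy as np

def cannon_matmul(A, B, q):
    """Multiply square matrices via Cannon's block-shift schedule (q | n)."""
    n = A.shape[0]; b = n // q
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(q)] for i in range(q)]
    C = np.zeros_like(A)
    # initial skew: row i of A shifted left by i, column j of B shifted up by j
    Ab = [row[i:] + row[:i] for i, row in enumerate(Ab)]
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                C[i*b:(i+1)*b, j*b:(j+1)*b] += Ab[i][j] @ Bb[i][j]
        Ab = [row[1:] + row[:1] for row in Ab]                           # shift left
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]  # shift up
    return C

A = np.random.rand(6, 6); B = np.random.rand(6, 6)
assert np.allclose(cannon_matmul(A, B, 3), A @ B)
```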
Slide 10
MapReduce++
Streaming-based communication: intermediate results are transferred directly from the map tasks to the reduce tasks, eliminating local files
Cacheable map/reduce tasks: static data remains in memory across iterations
Combine phase to combine reductions
The user program is the composer of MapReduce computations; this extends the MapReduce model to iterative computations (see the sketch below)
Programming model: Map(Key, Value) -> Reduce(Key, List<Value>) -> Combine(Key, List<Value>), iterated under the control of the user program
[Architecture diagram: the user program and MR driver feed data splits to map (M) and reduce (R) workers on the worker nodes through a Pub/Sub broker network; an MRDaemon on each worker node handles data read/write against the file system and communication]
Different synchronization and intercommunication mechanisms are used by the parallel runtimes:
Yahoo Hadoop uses short-running processes communicating via disk and HTTP, coordinated by tracking processes
Microsoft Dryad uses short-running processes communicating via pipes, disk, or shared memory between cores
MapReduce++ uses long-running processes with asynchronous, distributed rendezvous synchronization over a pub/sub bus
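A toy Python driver illustrating this model; the class and function names are hypothetical, not the actual MapReduce++ API. Map tasks cache their static split, only the small variable data (here, cluster centers) is re-sent each iteration, and a combine step closes the loop:

```python
class CacheableMapTask:
    def __init__(self, split):
        self.split = split                 # static data, stays in memory

    def map(self, centers):                # Map(Key, Value)
        # emit (nearest-center-index, point) for every cached point
        return [(min(range(len(centers)), key=lambda j: abs(p - centers[j])), p)
                for p in self.split]

def reduce_phase(pairs, k):                # Reduce(Key, List<Value>)
    groups = [[] for _ in range(k)]
    for i, p in pairs:
        groups[i].append(p)
    return [sum(g) / len(g) if g else None for g in groups]

def combine(old, new):                     # Combine(Key, List<Value>)
    return [o if n is None else n for o, n in zip(old, new)]

tasks = [CacheableMapTask(s) for s in ([1.0, 2.0], [9.0, 11.0])]
centers = [0.0, 5.0]
for _ in range(10):                        # the user program iterates
    pairs = [kv for t in tasks for kv in t.map(centers)]
    centers = combine(centers, reduce_phase(pairs, len(centers)))
print(centers)                             # converges to [1.5, 10.0]
```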
Slide 11
High Performance Dimension Reduction and Visualization
The need is pervasive: large, high-dimensional data are everywhere in biology, physics, the Internet, ...
Visualization can help data analysis: map high-dimensional data into low dimensions, with the high performance required to process large data
Developing high-performance visualization algorithms: MDS (Multi-dimensional Scaling), GTM (Generative Topographic Mapping), DA-MDS (Deterministic Annealing MDS), DA-GTM (Deterministic Annealing GTM), ... (a classical-MDS sketch follows below)
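For intuition, a compact classical (spectral) MDS sketch; the algorithms named above are SMACOF-style and deterministic-annealing variants, not this simple form:

```python
import numpy as np

def classical_mds(D, dim=3):
    """Map an n x n distance matrix to n points in `dim` dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]              # top eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

pts = np.random.rand(100, 10)                    # hypothetical data
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
low = classical_mds(D, dim=3)
print(low.shape)                                 # (100, 3)
```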
Slide 12
Biology Clustering Results
[Figures: Alu families; Metagenomics]
Slide 13
Analysis of 26 Million PubChem Entries
26 million PubChem compounds with 166 features
Drug discovery; bioassay
3D visualization for data exploration/mining
Mapping by MDS (Multi-dimensional Scaling) and GTM (Generative Topographic Mapping)
Interactive visualization tool: PlotViz
Discover hidden structures
Slide 14
MDS/GTM for 100K PubChem
[Figures: GTM; MDS. Points colored by number of activity results: > 300; 200–300; 100–200; < 100]
Slide 15
Bioassay activity in PubChem
[Figures: MDS; GTM. Points colored by activity: highly active; active; inactive; highly inactive]
Slide 16
Correlation between MDS/GTM
[Figures: MDS; GTM]
Canonical correlation between MDS & GTM
Slide 17
Child Obesity Study
Discover environmental factors related to child obesity
About 137,000 patient records, with 8 health-related and 97 environmental factors, have been analyzed
Health data (BMI, blood pressure, weight, height, ...) and environment data (greenness, neighborhood, population, income, ...) are related via a genetic algorithm and Canonical Correlation Analysis, then visualized (see the sketch below)
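A minimal sketch of the canonical-correlation step using scikit-learn, on synthetic stand-in data; the study's genetic-algorithm feature selection is not shown:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
health = rng.normal(size=(500, 8))        # stand-in for BMI, blood pressure, ...
environment = rng.normal(size=(500, 97))  # stand-in for greenness, income, ...

cca = CCA(n_components=1)
h_scores, e_scores = cca.fit_transform(health, environment)
r = np.corrcoef(h_scores[:, 0], e_scores[:, 0])[0, 1]
print(f"first canonical correlation: {r:.3f}")
```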
Slide 18
Canonical Correlation Analysis and Multidimensional Scaling
a) The plot of the first pair of canonical variables for 635 census blocks
b) The color-coded correlation between MDS and the first eigenvector of the PCA decomposition
Slide 19
SALSA Dynamic Virtual Cluster Hosting
iDataplex bare-metal nodes (32 nodes), provisioned through the XCAT infrastructure
Cluster switching from Linux bare-system, to Linux on Xen VMs, to Windows Server 2008 HPC bare-system
SW-G runs on each configuration, using Hadoop on the Linux stacks and DryadLINQ on Windows
SW-G: Smith Waterman Gotoh dissimilarity computation – a typical MapReduce-style application
A monitoring infrastructure observes all configurations
Slide 20
Monitoring Infrastructure
[Diagram: a monitoring interface connects through a Pub/Sub broker network to a summarizer and a switcher, which manage the virtual/physical clusters hosted on the iDataplex bare-metal nodes (32 nodes) via the XCAT infrastructure]
Slide 21
SALSA HPC Dynamic Virtual Clusters
Slide 22
Life Science Demos
Metagenomics: clustering to find multiple genes
Biology data: visualization of PubChem data using MDS and GTM; visualization of ALU repetition alignment (chimp and human data combined) using Smith Waterman dissimilarity
PubChem: bioassay active counts; bioassay activity/inactivity classification (using multicore and MapReduce)