Science in Clouds
Presentation Transcript

Slide 1

Science in Clouds

SALSA Team
salsaweb/salsa
Community Grids Laboratory, Digital Science Center
Pervasive Technology Institute
Indiana University

Slide 2

Science Clouds Architecture

Virtual cluster provisioning via XCAT; supports both stateful and stateless OS images.

Applications: Smith Waterman dissimilarities, CAP3 gene assembly, PhyloD using DryadLINQ, High Energy Physics, clustering, multidimensional scaling, generative topographic mapping
Runtimes: Microsoft DryadLINQ / MPI, Apache Hadoop / MapReduce++ / MPI
Infrastructure software: XCAT infrastructure, Xen virtualization
Operating environments: Linux bare-system, Linux virtual machines on Xen, Windows Server 2008 HPC bare-system
Hardware: iDataplex bare-metal nodes

Slide 3

Pairwise Distances – Smith Waterman

DryadLINQ, Hadoop, and MapReduce++ implementations; High Energy Physics (HEP) data analysis.

Calculate pairwise distances for a collection of genes (used for clustering and MDS):
- Fine-grained tasks in MPI
- Coarse-grained tasks in DryadLINQ
- Performed on 768 cores (Tempest cluster)
- 125 million distances in 4 hours and 46 minutes

The upper triangle of the NxN matrix is broken down into DxD blocks; each set of D consecutive blocks is merged to form a row block with NxD elements, so each process has a workload of NxD elements.

(Figure: blocks in the upper triangle, indexed 0, 1, 2, ..., D-1 along each dimension.)
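
As a rough illustration of this decomposition (a minimal sketch, not the SALSA implementation; N, D, and the helper names below are hypothetical), the following Python groups the upper-triangle blocks of an N x N pairwise computation into per-process row blocks:

```python
# Minimal sketch: partition the upper triangle of an N x N pairwise-distance
# matrix into a D x D grid of blocks and group each block row into one work
# unit, roughly as described on this slide. N, D, and the worker grouping are
# illustrative assumptions, not the deck's actual code.

def upper_triangle_blocks(N, D):
    """Yield (row_block, col_block, row_range, col_range) for blocks in the
    upper triangle, including the diagonal blocks."""
    size = N // D                      # rows/columns per block (assumes D divides N)
    for bi in range(D):
        for bj in range(bi, D):        # upper triangle only: bj >= bi
            rows = range(bi * size, (bi + 1) * size)
            cols = range(bj * size, (bj + 1) * size)
            yield bi, bj, rows, cols

def row_block_workloads(N, D):
    """Merge the consecutive upper-triangle blocks of each block row into a
    single work unit, one per process."""
    work = {bi: [] for bi in range(D)}
    for bi, bj, rows, cols in upper_triangle_blocks(N, D):
        work[bi].append((bj, rows, cols))
    return work

if __name__ == "__main__":
    # Tiny example: a 16 x 16 matrix split into a 4 x 4 grid of blocks.
    for bi, blocks in row_block_workloads(16, 4).items():
        print(f"row block {bi}: {len(blocks)} blocks in the upper triangle")
```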

Slide 4

Scalability of Pairwise Distance Calculations

Configurations compared:
- DryadLINQ on Windows HPC
- Hadoop MapReduce on Linux bare metal
- Hadoop MapReduce on Linux virtual machines running on the Xen hypervisor

Observations:
- VM overhead decreases as block sizes increase
- Memory-bandwidth-bound computation
- Communication in bursts
- Performed on the iDataPlex cluster using 32 nodes * 8 cores
- Performance degradation for 125 million distance calculations on VMs: 15.33%

Perf. degradation = (T_vm - T_baremetal) / T_baremetal
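
The degradation metric quoted above can be stated as a one-line helper; the timings in the usage example are illustrative, not measurements from these experiments:

```python
# Minimal sketch of the performance-degradation metric on this slide:
# degradation = (T_vm - T_baremetal) / T_baremetal.
# The example timings below are hypothetical.

def perf_degradation(t_vm: float, t_baremetal: float) -> float:
    """Fractional slowdown of the VM run relative to the bare-metal run."""
    return (t_vm - t_baremetal) / t_baremetal

if __name__ == "__main__":
    # e.g. a hypothetical 1000 s bare-metal run that takes 1153 s on VMs
    print(f"{perf_degradation(1153.0, 1000.0):.2%}")   # ~15.3%
```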

Slide 5

Pairwise Distance Calculations – Effect of Inhomogeneous Data

Calculation time per pair [A, B] ∝ Length(A) * Length(B)

Inhomogeneity of the data does not have a significant effect when the sequence lengths are randomly distributed.

This shows the natural load balancing of Hadoop's dynamic task assignment through a global pipeline, in contrast to DryadLINQ's static assignments.
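
A small sketch of this cost model (synthetic sequence lengths and a simple static DxD block assignment, both assumptions made for illustration) shows why randomly ordered lengths keep the statically assigned blocks balanced while skewed ordering does not:

```python
# Minimal sketch: per the slide's cost model, the work for a pair [A, B] is
# proportional to Length(A) * Length(B). Compare block workloads for randomly
# ordered versus sorted sequence lengths under a static block assignment.

import random

def block_costs(lengths, D):
    """Cost of each upper-triangle block of a D x D grid, pair cost = len_i * len_j."""
    N = len(lengths)
    size = N // D
    costs = []
    for bi in range(D):
        for bj in range(bi, D):                      # upper triangle of blocks
            rows = lengths[bi * size:(bi + 1) * size]
            cols = lengths[bj * size:(bj + 1) * size]
            costs.append(sum(a * b for a in rows for b in cols))
    return costs

if __name__ == "__main__":
    random.seed(0)
    lengths = [random.randint(200, 600) for _ in range(400)]   # synthetic lengths
    for name, ordered in [("random order", lengths), ("sorted order", sorted(lengths))]:
        costs = block_costs(ordered, D=8)
        print(f"{name}: max/mean block cost = {max(costs) / (sum(costs) / len(costs)):.2f}")
```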

Slide 6

CAP3 – Gene Assembly

Expressed Sequence Tag (EST) assembly to reconstruct full-length mRNA.
Performed using DryadLINQ, Apache Hadoop, and MapReduce++ implementations.
(Chart: performance of CAP3.)

PhyloD using DryadLINQ

Derives associations between HLA alleles and HIV codons, and between codons themselves.
DryadLINQ implementation.

Slide 7

K-Means Clustering & Matrix Multiplication Using Cloud Technologies

K-Means clustering on 2D vector data:
- DryadLINQ, Hadoop, MapReduce++, and MPI implementations
- MapReduce++ performs close to MPI
(Chart: performance of K-Means.)

Matrix multiplication in the MapReduce model:
- Hadoop, MapReduce++, and MPI implementations
- MapReduce++ performs close to MPI
(Chart: parallel overhead of matrix multiplication.)
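
To illustrate why K-Means fits a map/reduce formulation, here is a minimal Python sketch (not the MapReduce++ or DryadLINQ code; the data, K, and iteration count are illustrative): each iteration maps points to their nearest centre and reduces partial sums into new centres.

```python
# Minimal sketch of K-Means on 2D points expressed as map and reduce steps.

import random

def kmeans_map(points, centres):
    """Map: emit (centre_index, (x, y, 1)) for each point's nearest centre."""
    for x, y in points:
        idx = min(range(len(centres)),
                  key=lambda c: (x - centres[c][0]) ** 2 + (y - centres[c][1]) ** 2)
        yield idx, (x, y, 1)

def kmeans_reduce(mapped, k):
    """Reduce: sum coordinates and counts per centre, then average."""
    sums = [[0.0, 0.0, 0] for _ in range(k)]
    for idx, (x, y, n) in mapped:
        sums[idx][0] += x; sums[idx][1] += y; sums[idx][2] += n
    return [(sx / n, sy / n) if n else (0.0, 0.0) for sx, sy, n in sums]

if __name__ == "__main__":
    random.seed(1)
    points = [(random.gauss(cx, 0.3), random.gauss(cy, 0.3))
              for cx, cy in [(0, 0), (3, 3), (0, 3)] for _ in range(100)]
    centres = random.sample(points, 3)
    for _ in range(10):                      # fixed number of iterations
        centres = kmeans_reduce(kmeans_map(points, centres), 3)
    print(centres)
```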

Slide 8

Virtualization Overhead – Cloud Technologies

Nearly 15% performance degradation for Hadoop on Xen VMs.
Hadoop handles inhomogeneous data better than Dryad; Hadoop's dynamic task scheduling makes this possible.
Handling large data on VMs adds more overhead, especially if the data is accessed over the network.

Slide 9

Virtualization Overhead – MPI

Matrix multiplication (performance on 64 CPU cores; speedup for a fixed 5184x5184 matrix):
- Implements Cannon's algorithm [1]
- Exchanges large messages
- More susceptible to bandwidth than latency
- 14% reduction in speedup between the bare system and 1 VM per node

K-Means clustering (performance on 128 CPU cores):
- Up to 40 million 3D data points
- Amount of communication depends only on the number of cluster centers
- Amount of communication << computation and the amount of data processed
- At the highest granularity, VMs show 33% or more total overhead
- Extremely large overheads for smaller grain sizes

Overhead = (P * T(P) - T(1)) / T(1)
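
The overhead and speedup definitions used on this slide can be written down directly; the run times in the example are hypothetical, not measurements from these experiments:

```python
# Minimal sketch of the parallel-overhead metric on this slide:
# overhead = (P * T(P) - T(1)) / T(1), where T(1) is the sequential time and
# T(P) the time on P cores.

def parallel_overhead(p: int, t_p: float, t_1: float) -> float:
    """Fraction of the total work P*T(P) that is not useful sequential work."""
    return (p * t_p - t_1) / t_1

def speedup(t_p: float, t_1: float) -> float:
    return t_1 / t_p

if __name__ == "__main__":
    # Hypothetical run: 1000 s sequentially, 20 s on 64 cores.
    print(f"overhead = {parallel_overhead(64, 20.0, 1000.0):.2f}")   # 0.28
    print(f"speedup  = {speedup(20.0, 1000.0):.1f}x")                # 50.0x
```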

Slide 10

MapReduce++

- Streaming-based communication: intermediate results are transferred directly from the map tasks to the reduce tasks, eliminating local files
- Cacheable map/reduce tasks: static data remains in memory
- Combine phase to combine reductions
- The user program is the composer of MapReduce computations
- Extends the MapReduce model to iterative computations

Programming model: Map(Key, Value), Reduce(Key, List<Value>), Combine(Key, List<Value>), and Iterate, driven by the user program.

(Diagram: the user program and MR driver split the data (D) and dispatch map (M) and reduce (R) tasks to worker nodes over a pub/sub broker network; an MRDaemon on each worker node manages map workers, reduce workers, data read/write against the file system, and communication.)

Different synchronization and intercommunication mechanisms are used by the parallel runtimes:
- Yahoo Hadoop uses short-running processes communicating via disk (HTTP) and tracking processes
- Microsoft Dryad uses short-running processes communicating via pipes, disk, or shared memory between cores
- MapReduce++ uses long-running processes with asynchronous, distributed rendezvous synchronization over a pub/sub bus
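
The following minimal Python sketch (an illustration of the programming model only, not the MapReduce++ runtime; all function names are hypothetical) shows the Map, Reduce, and Combine roles inside a user-driven Iterate loop, with static data held in memory across iterations:

```python
# Minimal sketch of an iterative MapReduce-style driver: Map(Key, Value),
# Reduce(Key, List<Value>), Combine(Key, List<Value>), iterated by the user
# program, with static data cached in memory.

from collections import defaultdict

def run_iteration(static_data, variable_data, map_fn, reduce_fn, combine_fn):
    """One MapReduce iteration over in-memory data."""
    # Map: each (key, value) may emit any number of intermediate pairs.
    intermediate = defaultdict(list)
    for key, value in variable_data:
        for k, v in map_fn(key, value, static_data):
            intermediate[k].append(v)
    # Reduce: per intermediate key.
    reduced = {k: reduce_fn(k, vs) for k, vs in intermediate.items()}
    # Combine: merge all reductions into the next iteration's variable data.
    return combine_fn(reduced)

if __name__ == "__main__":
    # Toy iterative computation: repeatedly move a value halfway toward a target.
    static = {"target": 10.0}
    data = [("x", 0.0)]
    map_fn = lambda k, v, s: [(k, (v + s["target"]) / 2.0)]
    reduce_fn = lambda k, vs: sum(vs) / len(vs)
    combine_fn = lambda reduced: list(reduced.items())
    for _ in range(5):                    # Iterate: the user program drives the loop
        data = run_iteration(static, data, map_fn, reduce_fn, combine_fn)
    print(data)                           # converges toward the target
```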

Slide 11

High Performance Dimension Reduction and Visualization

The need is pervasive:
- Large, high-dimensional data are everywhere: biology, physics, the Internet, ...
- Visualization can help data analysis

Visualization with high performance:
- Map high-dimensional data into low dimensions
- High performance is needed for processing large data

Developing high-performance visualization algorithms: MDS (Multi-dimensional Scaling), GTM (Generative Topographic Mapping), DA-MDS (Deterministic Annealing MDS), DA-GTM (Deterministic Annealing GTM), ...
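
As a small, single-node illustration of the MDS idea (classical Torgerson MDS in NumPy on synthetic data; not the parallel DA-MDS/DA-GTM algorithms developed here):

```python
# Minimal sketch of classical MDS: embed points described only by their
# pairwise distances into 2-D for visualization. The random high-dimensional
# data below is an illustrative stand-in.

import numpy as np

def classical_mds(dist, dim=2):
    """Classical (Torgerson) MDS from a pairwise distance matrix."""
    n = dist.shape[0]
    d2 = dist ** 2
    j = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    b = -0.5 * j @ d2 @ j                        # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(b)               # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:dim]           # keep the largest ones
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(100, 20))                              # 100 points in 20-D
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    coords = classical_mds(dist, dim=2)                         # 100 x 2 embedding
    print(coords.shape)
```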

Slide 12

Biology Clustering Results

(Figures: clustering of Alu families and of metagenomics data.)

Slide 13

Analysis of 26 Million PubChem Entries

- 26 million PubChem compounds with 166 features
- Drug discovery, bioassay
- 3D visualization for data exploration/mining
- Mapping by MDS (Multi-dimensional Scaling) and GTM (Generative Topographic Mapping)
- Interactive visualization tool: PlotViz
- Discover hidden structures

Slide 14

MDS/GTM for 100K PubChem

(Figures: MDS and GTM embeddings of 100K PubChem compounds, color-coded by number of activity results: > 300, 200-300, 100-200, < 100.)

Slide 15

Bioassay Activity in PubChem

(Figures: MDS and GTM embeddings color-coded by bioassay activity: highly active, active, inactive, highly inactive.)

Slide 16

Correlation between MDS/GTM

(Figure: canonical correlation between the MDS and GTM embeddings.)

Slide 17

Child Obesity Study

Goal: discover environmental factors related to child obesity. About 137,000 patient records with 8 health-related and 97 environmental factors have been analyzed.

Health data: BMI, blood pressure, weight, height, ...
Environment data: greenness, neighborhood, population, income, ...

Pipeline: genetic algorithm, canonical correlation analysis, visualization.
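
A minimal sketch of the canonical correlation analysis step (using synthetic stand-ins for the health and environment blocks and scikit-learn's CCA rather than the study's genetic-algorithm-driven pipeline):

```python
# Minimal sketch of CCA between a block of health variables and a block of
# environmental variables. The data here is synthetic and illustrative.

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 1000
# Hypothetical stand-ins: 8 health-related and 97 environmental variables,
# sharing one latent factor so the first canonical pair is correlated.
latent = rng.normal(size=(n, 1))
health = latent @ rng.normal(size=(1, 8)) + rng.normal(size=(n, 8))
environment = latent @ rng.normal(size=(1, 97)) + rng.normal(size=(n, 97))

cca = CCA(n_components=1)
cca.fit(health, environment)
h_c, e_c = cca.transform(health, environment)

# Correlation of the first pair of canonical variables (cf. the plot on Slide 18).
print(np.corrcoef(h_c[:, 0], e_c[:, 0])[0, 1])
```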

Slide 18

Canonical Correlation Analysis and Multidimensional Scaling

(Figures:
a) The plot of the first pair of canonical variables for 635 census blocks.
b) The color-coded correlation between MDS and the first eigenvector of the PCA decomposition.)

Slide 19

SALSA Dynamic Virtual Cluster Hosting

- iDataplex bare-metal nodes (32 nodes), provisioned through the XCAT infrastructure
- Cluster switching from Linux bare-system, to Linux on Xen VMs, to Windows Server 2008 HPC bare-system
- SW-G runs on each configuration: using Hadoop on the Linux bare-system and Xen VM clusters, and using DryadLINQ on Windows HPC
- Monitoring infrastructure

SW-G: Smith Waterman Gotoh dissimilarity computation, a typical MapReduce-style application.

Slide 20

Monitoring Infrastructure

(Diagram: monitoring interface, summarizer, and switcher connected through a pub/sub broker network to the virtual/physical clusters provisioned on the iDataplex bare-metal nodes (32 nodes) by the XCAT infrastructure.)

Slide 21

SALSA HPC Dynamic Virtual Clusters

Slide 22

Life Science Demos

Metagenomics: clustering to find multiple genes.

Biology data:
- Visualization of PubChem data using MDS and GTM
- Visualization of ALU repetition alignment (chimp and human data combined) using Smith Waterman dissimilarity

PubChem:
- Bioassay active counts
- Bioassay activity/inactivity classification (using multicore and MapReduce)