A Ping Too Far: Real World Network Latency Measurement

Presentation Transcript

Slide1

A Ping Too Far: Real World Network Latency Measurement

Gary Jackson, JHU/APL
Work done while at the University of Maryland

Pete Keleher and Alan Sussman
University of Maryland, Department of Computer Science

Slide2

Introduction

Context: peer-to-peer HPC resource discovery and management

Goal: collect a high-quality all-to-all network latency map
At campus or department scale, including HPC resources
As opposed to Internet scale, which is well-trod ground

Purpose:
Compare latency prediction techniques
Increase the fidelity of peer-to-peer system simulations

Solved many problems
Technical solutions to technical and policy obstacles

Managed only partial success
Could not get measurements on more than one HPC-equipped cluster, so it's not useful to us
But maybe the data set is useful to someone else

Slide3

Four Policy Challenges

Where to measure?
Ask for access
Compel stakeholders
Find existing resources that meet needs

Work around policy obstacles
Cannot run persistent daemons on resources

Minimal change
Cannot ask for significant changes to the environment or other policies

Non-disruptive
Use of resources cannot disrupt other users

Slide4

Five Technical Challenges

Load interferes with measurement

User-level programs on both ends

Quick measurements

Quality measurements

Fix technical obstacles

Slide5

The Plan

Use local resources: the UMIACS HTCondor pool
~160 nodes spread out over several clusters
Two clusters equipped with InfiniBand (IB)
"Backfills" HTC jobs onto clusters managed with TORQUE

OSU MPI microbenchmarks (a sketch of the core idea follows below)

Distributed system to schedule & collect
Distributed system to schedule & collectSlide6

Particulars of the Environment

Scheduling
Cannot schedule arbitrary pairs of nodes in HTCondor

Static slots
1 job per slot
1 slot per core
All slots must be controlled for exclusive measurement

Node heterogeneity

Lesson learned: the compute environment exists to support somebody's research, but maybe not yours

Slide7

Aside: Load Affects Network Latency

Space-sharing application model
Measurements between two IB-connected nodes
Varied CPU load

Higher load leads to:
Increased latency
Unpredictable latency

Lesson learned: the environment for measurement should match the model environment
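The shape of that experiment can be sketched as follows; this is a hypothetical reconstruction over TCP sockets, not the IB/MPI setup actually measured, and the host name, port, and trial counts are assumptions:

```python
# Measure round-trip latency while varying local CPU load.
import multiprocessing
import socket
import time

def spin():
    """Busy-loop to occupy one core (load generator)."""
    while True:
        pass

def rtt_samples(host, port, n=1000):
    """Return n round-trip times (seconds) of a 1-byte echo exchange."""
    with socket.create_connection((host, port)) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        out = []
        for _ in range(n):
            t0 = time.perf_counter()
            s.sendall(b"x")
            s.recv(1)                      # assumes an echo server remotely
            out.append(time.perf_counter() - t0)
        return out

if __name__ == "__main__":
    for n_busy in (0, 2, 4, 8):            # sweep CPU load, as on the slide
        burners = [multiprocessing.Process(target=spin, daemon=True)
                   for _ in range(n_busy)]
        for p in burners:
            p.start()
        rtts = sorted(rtt_samples("nodeB.example.edu", 9000))
        print("%d busy cores -> median RTT %.1f us"
              % (n_busy, 1e6 * rtts[len(rtts) // 2]))
        for p in burners:
            p.terminate()
```

Slide8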

Solved Technical Obstacles

OpenMPI is finicky about OS & libraries
Build OpenMPI separately for every single host

OpenMPI over TCP mysteriously hangs
Bogus bridge interface for virtualization
Tell OpenMPI not to use it

User limits for mapped memory prevent RDMA over IB
Had to modify the HTCondor init script

IB library provided by the OS didn't work
Had to build it ourselves on Cluster E
First hint that something was really wrong

Lesson learned: there are going to be a lot of little problems along the way.
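Two of those fixes can be made concrete. A hedged sketch: btl_tcp_if_exclude is a real Open MPI MCA parameter, but the bridge interface name (virbr0), the host names, and checking the limit from Python are assumptions here:

```python
# Check the locked-memory limit (RDMA memory registration needs it large,
# ideally unlimited; the talk raised it via the HTCondor init script),
# then keep Open MPI's TCP transport off the virtualization bridge.
import resource
import subprocess

soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
if soft != resource.RLIM_INFINITY:
    print("warning: RLIMIT_MEMLOCK is %d bytes; RDMA may fail" % soft)

cmd = [
    "mpirun", "-np", "2",
    "--host", "nodeA,nodeB",                     # hypothetical hosts
    "--mca", "btl_tcp_if_exclude", "lo,virbr0",  # skip loopback + bogus bridge
    "./osu_latency",
]
subprocess.run(cmd, check=True)
```

Slide9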

Solved Policy Obstacles

Local resource management systems
Cannot schedule arbitrary pairs of nodes
Cannot run processes outside of HTCondor & TORQUE
Cannot ask to change the way resources are allocated
Solution: built a distributed system to schedule & collect measurements

Accounts
Cannot get accounts on some systems
Solution: workaround to start OpenMPI daemon processes on both ends without SSH

Lesson learned: sometimes, there are technical solutions to policy problems.
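The slides don't spell out the no-SSH workaround. One plausible shape, sketched under assumptions: plm_rsh_agent is a real Open MPI MCA parameter for swapping out the remote-launch command, but the helper port and the idea of forwarding the launch to a process already running under the batch system are hypothetical:

```python
# agent.py -- invoked by mpirun as: agent.py <host> <orted command...>
# Instead of opening an SSH session, hand the daemon launch command to a
# helper that the batch job has already started on the target node.
import socket
import sys

def main():
    host, remote_cmd = sys.argv[1], sys.argv[2:]
    with socket.create_connection((host, 9001)) as s:  # helper port: assumption
        s.sendall(" ".join(remote_cmd).encode() + b"\n")

if __name__ == "__main__":
    main()

# mpirun would then be pointed at it with:
#   mpirun --mca plm_rsh_agent /path/to/agent.py ...
```

Slide10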

Setting the Stage for Failure

Cluster E
One of the two clusters in the pool with IB
Upstream IB libraries from the OS vendor didn't work
IB used exclusively for IPoIB to support Lustre
Nodes have a large amount of memory

OpenMPI processes kept crashing, despite rebuilding the IB libraries from the hardware vendor

Slide11

Fatal Obstacle

The IB driver has a tunable parameter that limits the amount of memory that can be mapped (64 GB)
Nodes have twice that much physical memory (128 GB)
The limit needs to be twice the physical memory size (256 GB)
The OS vendor has no guidelines for adjusting that value
Unknown impact on the Lustre filesystem using IPoIB
So this can't be fixed

Lesson learned: sometimes there's nothing you can do.
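For scale, the arithmetic behind the 64 GB ceiling; whether Cluster E's driver used exactly these Mellanox mlx4-style module parameters is an assumption, but the formula is representative:

```python
# Registrable memory on mlx4-style IB drivers is roughly
#   PAGE_SIZE * 2^log_num_mtt * 2^log_mtts_per_seg
PAGE = 4096
log_num_mtt, log_mtts_per_seg = 24, 0       # hypothetical default values
max_reg = PAGE * (1 << log_num_mtt) * (1 << log_mtts_per_seg)
print(max_reg // 2**30, "GiB mappable")     # -> 64 GiB, as on the slide

# Covering 2 x 128 GiB of RAM = 256 GiB means growing the exponents by
# two, e.g. log_num_mtt = 26 -- the change nobody could sign off on.
```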

Slide12

Cannot Lay Blame

Sysadmins? No: they made a conservative decision to support the primary stakeholders
IB vendor? A driver straight from the IB vendor probably would have worked
OS vendor? It supports what it intended to support (IPoIB)
Me? Using native RDMA over IB isn't asking too much

Lesson learned: sometimes it's no one's fault.

Slide13

Results

Ping is not a good predictor of application-level latency
It tends to over-estimate

Compared latency prediction techniques:
Distributed Tree Metric (DTM)
Vivaldi
Global Network Positioning (GNP)

Result: DTM continues to perform better than the other techniques
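For context on the techniques compared, a minimal sketch of a Vivaldi coordinate update (Dabek et al.); the dimension and step size are illustrative, and the height-vector and adaptive-timestep refinements of full Vivaldi are omitted:

```python
import math

def vivaldi_step(xi, xj, rtt, delta=0.25):
    """Nudge node i's coordinate so |xi - xj| better matches measured RTT."""
    diff = [a - b for a, b in zip(xi, xj)]
    dist = math.sqrt(sum(d * d for d in diff)) or 1e-9
    err = rtt - dist                       # predicted too short -> push away
    unit = [d / dist for d in diff]
    return [a + delta * err * u for a, u in zip(xi, unit)]

# After enough samples, coordinate distance approximates pairwise latency,
# so any pair's RTT can be predicted without measuring it directly.
xi = vivaldi_step([0.0, 0.0], [1.0, 0.0], rtt=5.0)
```

Slide14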

Takeaway

IF you are building a big system/thesis that will rely on many different systems/admin domains,
THEN you need to check all the potential choke-points in advance

If the work is self-contained, this is much easier.

I should have tested MPI over IB RDMA on that cluster much earlier in the process.

Slide15
Slide16

Policy

Cannot ask for invasive changes to policy or implementation
Cannot disrupt the HTCondor pool
Cannot interfere with TORQUE users
Cannot get accounts on compute nodes
Must be prepared for preemption

Lesson learned: policies exist to support someone's research, but maybe not yours.

Slide17

Seizing a Node

Submitter: queries HTCondor and submits master & slave jobs
Node masters & slaves: seize exclusive control over a node

For a node with n slots (see the sketch below):
Submit n-1 slaves
Submit 1 master
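A hedged sketch of the seize step. The submit-description keywords are standard HTCondor, but the executable names, the Machine requirement, and driving condor_submit from Python are assumptions about how this was done:

```python
import pathlib
import subprocess

def seize(machine, n_slots):
    """Queue n-1 slave jobs and 1 master job pinned to one machine."""
    for exe, count in (("slave", n_slots - 1), ("master", 1)):
        desc = (
            "universe   = vanilla\n"
            f"executable = {exe}\n"
            f'requirements = (Machine == "{machine}")\n'
            f"queue {count}\n"
        )
        subfile = pathlib.Path(f"{exe}.sub")
        subfile.write_text(desc)
        subprocess.run(["condor_submit", str(subfile)], check=True)

seize("node01.cluster.example", n_slots=8)   # hypothetical 8-slot node
```

Slide18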

Scheduling Measurements

When all slaves & the master are running, contact central control
Slaves & master yield periodically to allow other jobs to run

Scheduler:
Schedules measurements between masters
Collects & stores results from masters
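The talk doesn't specify how the scheduler pairs masters; one natural choice, sketched here, is the round-robin "circle method", which yields rounds of disjoint pairs so every master measures against every other without double-booking a node:

```python
def rounds(masters):
    """Yield lists of disjoint (a, b) pairs covering all combinations."""
    ms = list(masters)
    if len(ms) % 2:
        ms.append(None)                    # bye slot for an odd count
    half = len(ms) // 2
    for _ in range(len(ms) - 1):
        pairs = [(ms[i], ms[-1 - i]) for i in range(half)]
        yield [(a, b) for a, b in pairs if a is not None and b is not None]
        ms.insert(1, ms.pop())             # rotate everything but ms[0]

for r in rounds(["m0", "m1", "m2", "m3", "m4"]):
    print(r)   # each round: measurements that can run concurrently
```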