A Ping Too Far: Real World Network Latency Measurement

Gary Jackson
JHU/APL
Work done while at the University of Maryland

Pete Keleher and Alan Sussman
University of Maryland
Department of Computer Science

Introduction

Context:
- Peer-to-peer HPC resource discovery and management

Goal:
- Collect a high-quality all-to-all network latency map
- Campus or department scale, including HPC resources
- As opposed to Internet scale, which is well-trod ground

Purpose:
- Compare latency prediction techniques
- Increase the fidelity of peer-to-peer system simulations

Solved many problems:
- Technical solutions to technical and policy obstacles

Managed only partial success:
- Could not get measurements on more than one HPC-equipped cluster, so the map is not useful to us
- But maybe the data set is useful to someone else

Four Policy Challenges

Where to measure?
- Ask for access ✖
- Compel stakeholders ✖
- Find existing resources that meet needs ✔

Work around policy obstacles
- Cannot run persistent daemons on resources

Minimal change
- Cannot ask for significant changes to the environment or other policies

Non-disruptive
- Use of resources cannot disrupt other users

Five Technical Challenges

- Load interferes with measurement
- User-level programs on both ends
- Quick measurements
- Quality measurements
- Fix technical obstacles

The Plan

Use local resources: the UMIACS HTCondor pool
- ~160 nodes spread out over several clusters
- Two clusters equipped with InfiniBand (IB)
- "Backfills" HTC jobs onto clusters managed with TORQUE

OSU MPI microbenchmarks (see the sketch below)

Distributed system to schedule & collect
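
A minimal sketch of the kind of point-to-point measurement the plan calls for: run the OSU latency microbenchmark between two hosts under OpenMPI and parse the per-message-size results. The hostnames and benchmark path are hypothetical; adjust them for the local OpenMPI build and OSU Micro-Benchmarks install.

```python
import subprocess

def measure_latency(host_a, host_b, bench="./osu_latency"):
    """Run osu_latency between two hosts; return {msg_size: usec}."""
    out = subprocess.run(
        ["mpirun", "-np", "2", "-host", f"{host_a},{host_b}", bench],
        capture_output=True, text=True, check=True).stdout
    results = {}
    for line in out.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip the benchmark's header lines
        size, usec = line.split()
        results[int(size)] = float(usec)
    return results

if __name__ == "__main__":
    print(measure_latency("nodeA", "nodeB"))
```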

Particulars of the Environment

Scheduling
- Cannot schedule arbitrary pairs of nodes in HTCondor
- Static slots: 1 job per slot, 1 slot per core
- All slots must be controlled for exclusive measurement

Node heterogeneity

Lesson learned: the compute environment exists to support somebody's research, but maybe not yours.

Aside: Load Affects Network Latency

- Space-sharing application model
- Measurements between two IB-connected nodes
- Varied CPU load
- Higher load leads to increased and unpredictable latency (see the sketch below)

Lesson learned: the environment for measurement should match the model environment.
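
A hedged sketch of the load-variation experiment, assuming this driver runs on one of the two measured nodes: spin up N busy-loop workers to create CPU load there, then take a latency measurement while they run. It reuses the hypothetical measure_latency() from the earlier sketch; the real space-sharing experiment would place load on both endpoints.

```python
import multiprocessing

def busy_loop():
    while True:
        pass  # burn one core to simulate competing load

def latency_under_load(n_workers, host_a, host_b):
    workers = [multiprocessing.Process(target=busy_loop, daemon=True)
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    try:
        return measure_latency(host_a, host_b)
    finally:
        for w in workers:
            w.terminate()
```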

Solved Technical Obstacles

OpenMPI is finicky about OS & libraries
- Build OpenMPI separately for every single host

OpenMPI over TCP mysteriously hangs
- Caused by a bogus bridge interface for virtualization
- Tell OpenMPI not to use it (see the sketch below)

User limits for mapped memory prevent RDMA over IB
- Had to modify the HTCondor init script

IB library provided by the OS didn't work
- Had to build it ourselves on Cluster E
- First hint that something was really wrong

Lesson learned: there are going to be a lot of little problems along the way.
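
A hedged sketch of two of these fixes. btl_tcp_if_exclude is the OpenMPI MCA parameter that keeps the TCP transport off a given interface; the bridge name virbr0 is an assumption (the common libvirt default). RLIMIT_MEMLOCK is the per-process locked-memory limit that, when too small, breaks RDMA over IB.

```python
import resource
import subprocess

# (a) Keep OpenMPI's TCP transport off the bogus virtualization bridge.
# Setting btl_tcp_if_exclude replaces the default exclude list, so the
# loopback interface must be listed again explicitly.
MPIRUN_ARGS = ["mpirun", "--mca", "btl_tcp_if_exclude", "virbr0,lo",
               "-np", "2", "-host", "nodeA,nodeB", "./osu_latency"]

# (b) Fail fast if the locked-memory limit is too small for RDMA.
soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
if soft != resource.RLIM_INFINITY:
    raise SystemExit(
        f"RLIMIT_MEMLOCK is {soft} bytes; RDMA over IB wants it unlimited. "
        "Raise it where HTCondor starts (e.g., ulimit -l unlimited in the "
        "init script).")

subprocess.run(MPIRUN_ARGS, check=True)
```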

Solved Policy Obstacles

Local resource management systems
- Cannot schedule arbitrary pairs of nodes
- Cannot run processes outside of HTCondor & TORQUE
- Cannot ask to change the way resources are allocated
- Solution: built a distributed system to schedule & collect measurements

Accounts
- Cannot get accounts on some systems
- Solution: a workaround to start OpenMPI daemon processes on both ends without SSH (see the sketch below)

Lesson learned: sometimes, there are technical solutions to policy problems.
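
One plausible shape for the SSH-less launch, sketched here as an assumption rather than the authors' exact mechanism: OpenMPI's plm_rsh_agent MCA parameter substitutes an arbitrary command for ssh when mpirun starts its per-node daemons, so a batch-system-friendly launcher can stand in. The condor_launch script named below is hypothetical.

```python
import subprocess

# mpirun invokes the agent as: <agent> <hostname> <orted command line...>,
# so the hypothetical condor_launch script must arrange for that command
# to run on the target node through the batch system instead of ssh.
subprocess.run([
    "mpirun",
    "--mca", "plm_rsh_agent", "./condor_launch",
    "-np", "2", "-host", "nodeA,nodeB",
    "./osu_latency",
], check=True)
```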

Setting the Stage for Failure

Cluster E: one of the two clusters in the pool with IB
- Upstream IB libraries from the OS vendor didn't work
- IB used exclusively for IPoIB to support Lustre
- Nodes have a large amount of memory
- OpenMPI processes kept crashing, despite a rebuild of the IB libraries from the hardware vendor

Fatal Obstacle

- The IB driver has a tunable parameter for the amount of memory that can be mapped (64 GB here)
- The nodes have twice that much physical memory (128 GB)
- The parameter needs to be twice the physical memory size (256 GB); see the arithmetic below
- The OS vendor has no guidelines for adjusting that value
- Unknown impact on the Lustre filesystem running over IPoIB
- So this can't be fixed ☠

Lesson learned: sometimes there's nothing you can do.
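
The slides don't name the tunable, but on Mellanox mlx4-era drivers (an assumption here) the registerable-memory cap is set by the module parameters log_num_mtt and log_mtts_per_seg, with max_reg_mem = 2^log_num_mtt × 2^log_mtts_per_seg × PAGE_SIZE, and the usual guidance is to size it at twice physical RAM. The numbers on the slide fit that model:

```python
PAGE_SIZE = 4096          # bytes, typical x86-64 page

def max_reg_mem(log_num_mtt, log_mtts_per_seg=3):
    """Registerable memory implied by the mlx4 MTT module parameters."""
    return (2 ** log_num_mtt) * (2 ** log_mtts_per_seg) * PAGE_SIZE

GiB = 2 ** 30
assert max_reg_mem(21) == 64 * GiB    # the cluster's 64 GB cap
assert max_reg_mem(23) == 256 * GiB   # the needed 2 x 128 GB of RAM
```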

Cannot Lay Blame

The sysadmins?
- No, they made a conservative decision to support the primary stakeholders

The IB vendor?
- A driver straight from the IB vendor probably would have worked

The OS vendor?
- They support what they intended to support (IPoIB)

Me?
- Using native RDMA over IB isn't asking too much

Lesson learned: sometimes it's no one's fault.

Results

Ping is not a good predictor of application-level latency
- It tends to overestimate

Compared latency prediction techniques (a sketch of one follows):
- Distributed Tree Metric (DTM)
- Vivaldi
- Global Network Positioning (GNP)

Result: DTM continues to perform better than the other techniques.
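
For readers unfamiliar with the compared techniques, here is a minimal sketch of one Vivaldi coordinate update: each node keeps a synthetic coordinate and, after measuring an RTT to a peer, nudges its coordinate so that Euclidean distance better predicts latency. The fixed step size delta is a simplification; the full algorithm adapts it using per-node error estimates.

```python
import math

def vivaldi_update(x_i, x_j, rtt, delta=0.25):
    """Move x_i so that |x_i - x_j| gets closer to the measured rtt."""
    dist = math.dist(x_i, x_j)
    if dist == 0.0:
        return list(x_i)  # real Vivaldi picks a random direction here
    err = rtt - dist      # positive: predicted too close, push apart
    unit = [(a - b) / dist for a, b in zip(x_i, x_j)]
    return [a + delta * err * u for a, u in zip(x_i, unit)]
```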

Takeaway

IF you are building a big system/thesis that will rely on many different systems/admin domains,
THEN you need to check all the potential choke points in advance.

If the work is self-contained, this is much easier.

I should have tested MPI over IB RDMA on that cluster much earlier in the process.

Policy

- Cannot ask for invasive changes to policy or implementation
- Cannot disrupt the HTCondor pool
- Cannot interfere with TORQUE users
- Cannot get accounts on compute nodes
- Must be prepared for preemption

Lesson learned: policies exist to support someone's research, but maybe not yours.

Seizing a Node

Submitter: query HTCondor and submit master & slave jobs

Node masters & slaves: seize exclusive control over a node
- For a node with n slots, submit n-1 slaves and 1 master (see the sketch below)
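
A hedged sketch of the submitter side: count static slots per machine with condor_status, then submit n-1 slave jobs and one master job pinned to that machine. The submit files slave.sub and master.sub are hypothetical; pinning via an appended requirements expression is one way to target a specific machine.

```python
import subprocess
from collections import Counter

def slots_per_machine():
    """Count static slots per machine (one condor_status row per slot)."""
    out = subprocess.run(["condor_status", "-af", "Machine"],
                         capture_output=True, text=True, check=True).stdout
    return Counter(out.split())

def seize(machine, n_slots):
    pin = f'requirements = Machine == "{machine}"'
    for _ in range(n_slots - 1):
        subprocess.run(["condor_submit", "-append", pin, "slave.sub"],
                       check=True)
    subprocess.run(["condor_submit", "-append", pin, "master.sub"],
                   check=True)

for machine, n in slots_per_machine().items():
    seize(machine, n)
```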

Scheduling Measurements

- When all slaves & the master are running, contact central control
- Slaves & master yield periodically to allow other jobs to run

Scheduler:
- Schedule measurements between masters (a sketch follows)
- Collect & store results from masters
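
A minimal sketch of the central scheduler's pairing loop under the assumptions above: take whichever node masters are currently checked in, pair them off, and record one measurement per not-yet-covered pair. run_measurement() is a hypothetical stand-in for the RPC to a master.

```python
import itertools

def schedule_round(available_masters, done_pairs, results):
    """Measure each uncovered pair among the currently available masters."""
    for a, b in itertools.combinations(sorted(available_masters), 2):
        if (a, b) in done_pairs:
            continue  # already have this edge of the all-to-all map
        results[(a, b)] = run_measurement(a, b)  # hypothetical master RPC
        done_pairs.add((a, b))
```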