Slide1
LIBI: The Lightweight Infrastructure-Bootstrapping Infrastructure
Joshua Goehner and Dorian Arnold
University of New Mexico
(In collaboration with LLNL and U. of Wisconsin)
Slide2
A Little Preaching to the Choir
Key statistics from Top500 (Nov. 2011)
  149/500 (30%) more than 8,192 cores
  In 2006, this number was 12/500 (2.4%)
  7 systems: more than 64K cores
  6 systems: between 64K and 128K cores
  3 systems: more than 128K cores
Later this year: LLNL’s Sequoia w/ 1.6M cores
2018?: Exascale systems w/ 10^7 or 10^8 cores
Software systems must scale as well!
Slide3
Software Infrastructure-bootstrapping
Given a node allocation, start infrastructure's composite processes and propagate necessary startup information.
Before bootstrapping:
  Program image on storage device
  Set of (allocated) computer nodes
After bootstrapping:
  Application processes started on computer nodes
  Application’s configuration complete
  Ready for primary operation
Slide4
“I gotta get my application up and running!”
Slide5
(1) “First, I start all the processes on the appropriate nodes”
Slide6
(2) “Next, I must disseminate some initialization information”
Slide7
Bootstrapping is complete when the infrastructure is ready for steady-state usage.
Slide8
Basic Bootstrapping
Sequential instantiation
  e.g. rsh or ssh-based
Point-to-point propagation
70 seconds to start 2K processes!
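To see why sequential instantiation hurts at scale, here is a rough cost model in C. It is only a sketch: the function names are made up for illustration, launch cost is counted in abstract "steps", and the tree variant assumes each started process launches its own k children one after another. Taking the 70 seconds for 2K processes at face value and assuming the cost is purely sequential would put one launch at roughly 35 ms.

```c
#include <stddef.h>

/* Critical-path launch steps when a single process starts all n others
 * one after another (the rsh/ssh loop). */
size_t sequential_steps(size_t n)
{
    return n;
}

/* Depth of the shallowest k-ary launch tree (root not counted) that holds
 * n processes: the smallest d with k + k^2 + ... + k^d >= n.
 * Returns 0 for k == 0, which is not a meaningful fan-out. */
size_t tree_depth(size_t n, size_t k)
{
    size_t depth = 0, level = 1, total = 0;
    if (k == 0)
        return 0;
    while (total < n) {
        level *= k;      /* number of processes on the next level */
        total += level;  /* processes placed so far */
        depth++;
    }
    return depth;
}

/* Upper bound on critical-path steps for the tree: on every level a parent
 * starts up to k children sequentially (an assumption of this model). */
size_t tree_steps(size_t n, size_t k)
{
    return tree_depth(n, k) * k;
}
```

With n = 2048, the sequential loop has 2048 steps on its critical path, while a binary tree under this model needs at most 22.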
Slide9
More Scalable Approaches
Infrastructure-specific, scalable mechanisms
  Still limited by sequential operations
  Not portable to other infrastructures
Using high-performance resource managers
  Myriad interfaces
  No communication facilities
Generic bootstrapping infrastructures
  LaunchMON: targets tools with wrapper for existing RMs
  ScELA: sequentially launch agents that create local processes
Slide10
MRNet Sequential-ish Bootstrapping
Parent creates children
  Local: fork()/exec()
  Remote: rsh-based mechanism
Integrated instantiation and information propagation
MRNet’s “standard”
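The local half of "parent creates children" can be sketched with plain POSIX calls. This is not MRNet's code; `spawn_local` and `wait_exit_code` are hypothetical helper names. A remote child would presumably be created by exec'ing an rsh or ssh client in much the same way.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start one local child with fork()/exec().
 * Returns the child's pid in the parent, or -1 if fork() failed. */
pid_t spawn_local(char *const argv[])
{
    pid_t pid = fork();
    if (pid == 0) {
        execvp(argv[0], argv);  /* replaces the child image on success */
        _exit(127);             /* reached only if exec failed */
    }
    return pid;
}

/* Wait for a child and return its exit code,
 * or -1 if waiting failed or the child did not exit normally. */
int wait_exit_code(pid_t pid)
{
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A parent that calls `spawn_local` in a loop over its children is launching them sequentially, which is where the "sequential-ish" cost comes from.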
Slide11
MRNet XT (hybrid) process launch
Bulk-launch 1 process per node
Process launches collocated processes
Information propagated in similar fashion
Slide12
LIBI Approach
LIBI: Lightweight infrastructure-bootstrapping infrastructure
  Name is no longer a work in progress
Generic service for scalable distributed software infrastructure bootstrapping
  Process launch
  Scalable, low-level collectives
[Diagram: LIBI sits between large scale distributed software and the underlying services]
  Large Scale Distributed Software: Debuggers, System Monitors, Applications, Performance Analyzers, Overlay Networks
  Job Launchers: rsh/ssh, SLURM, OpenRTE, ALPS, LaunchMON
  Communication Services: COBO, MPI
Slide13
LIBI API
session: set of processes (to be) deployed
  session master manages other members
  session front-end interacts with session master
  LIBI currently supports only master/member communication
host-distribution: where to create processes
  <hostname, num-processes>
process distribution: how/where to create processes
  <session-id, executable, arguments, host-distribution, environment>
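The two tuples above map naturally onto C structs. The sketch below is an illustration only: the type and field names (`host_dist_t`, `proc_dist_t`, and so on) are hypothetical and are not LIBI's actual declarations.

```c
#include <stddef.h>

/* Hypothetical mirror of <hostname, num-processes>. */
typedef struct {
    const char *hostname;
    unsigned    num_processes;
} host_dist_t;

/* Hypothetical mirror of
 * <session-id, executable, arguments, host-distribution, environment>. */
typedef struct {
    int                session_id;
    const char        *executable;
    const char *const *arguments;    /* NULL-terminated argv */
    const host_dist_t *hosts;        /* host-distribution entries */
    size_t             num_hosts;
    const char *const *environment;  /* NULL-terminated, may be NULL */
} proc_dist_t;

/* Total number of processes one process distribution asks for. */
unsigned total_processes(const proc_dist_t *pd)
{
    unsigned total = 0;
    for (size_t i = 0; i < pd->num_hosts; i++)
        total += pd->hosts[i].num_processes;
    return total;
}
```

Keeping the host-distribution as an array of per-host counts lets one launch request describe an uneven placement, for example four processes on two nodes and two on a third.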
Slide14
LIBI API (cont’d)
launch(process-distribution-array)
  instantiate processes according to input distributions
[send|recv]UsrData(session-id, msg)
  communicate between front-end and session master
broadcast(), scatter(), gather(), barrier()
  communicate between session master and members
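For readers less familiar with collectives, the following single-process C model shows the usual meaning of broadcast, scatter, and gather between a master and its members. It assumes the conventional rank-ordered semantics known from MPI. The function names are hypothetical, and nothing here reflects how LIBI implements or orders its collectives.

```c
#include <stddef.h>
#include <string.h>

/* Broadcast: every member receives a full copy of the master's buffer. */
void model_broadcast(const char *master_buf, size_t len,
                     char *member_bufs[], size_t num_members)
{
    for (size_t i = 0; i < num_members; i++)
        memcpy(member_bufs[i], master_buf, len);
}

/* Scatter: member i receives bytes [i*chunk, (i+1)*chunk) of the
 * master's buffer. */
void model_scatter(const char *master_buf, size_t chunk,
                   char *member_bufs[], size_t num_members)
{
    for (size_t i = 0; i < num_members; i++)
        memcpy(member_bufs[i], master_buf + i * chunk, chunk);
}

/* Gather: the inverse of scatter; member i's chunk lands at offset
 * i*chunk in the master's buffer. */
void model_gather(char *master_buf, size_t chunk,
                  char *const member_bufs[], size_t num_members)
{
    for (size_t i = 0; i < num_members; i++)
        memcpy(master_buf + i * chunk, member_bufs[i], chunk);
}
```

A barrier has no data to model: it only guarantees that no member proceeds until all have arrived.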
Slide15
Example LIBI Front-end
front_end( )
{
    LIBI_fe_init();
    LIBI_fe_createSession(sess);

    // describe what to launch and where
    proc_dist_req_t pd;
    pd.sessionHandle = sess;
    pd.proc_path = get_ExePath();
    pd.proc_argv = get_ProgArgs();
    pd.hd = get_HostDistribution();
    LIBI_fe_launch(pd);

    //test broadcast and barrier
    LIBI_fe_sendUsrData(sess, msg, len);
    LIBI_fe_recvUsrData(sess, msg, len);

    //test scatter and gather
    LIBI_fe_sendUsrData(sess, msg, len);
    LIBI_fe_recvUsrData(sess, msg, len);

    return 0;
}
Slide16
Example LIBI-launched Application
session_member()
{
    LIBI_init();

    //test broadcast and barrier
    LIBI_recvUsrData(msg, msg_length);
    LIBI_broadcast(msg, msg_length);
    LIBI_barrier();
    LIBI_sendUsrData(msg, msg_length);

    //test scatter and gather
    LIBI_recvUsrData(msg, msg_length);
    LIBI_scatter(msg, sizeof(rcvmsg), rcvmsg);
    LIBI_gather(sndmsg, sizeof(sndmsg), msg);
    LIBI_sendUsrData(msg, msg_length);

    LIBI_finalize();
}
Slide17
LIBI Implementation Status
LaunchMON-based runtime
  Tested SLURM or rsh launching
COBO PMGR service
[Diagram: the LIBI architecture stack from Slide12 (large scale distributed software, job launchers, communication services)]
Slide18
LIBI Early Evaluation
Goal: demonstrate ease-of-integration and performance/scalability enhancements
Microbenchmark
  Time to instantiate a set of processes
  Time to broadcast 128 bytes followed by barrier
  Time to scatter and gather 128 bytes
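Each of the three measurements amounts to wall-clock time around a single operation. A minimal timing helper in C might look like the sketch below. It is not the harness used for these results; `time_operation` is a hypothetical name, and `sleep_for_ms` is only a stand-in for the launch, broadcast-plus-barrier, or scatter-plus-gather call being measured.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Elapsed wall-clock seconds for one call of op(arg), measured with a
 * monotonic clock so system time adjustments do not distort the result. */
double time_operation(void (*op)(void *), void *arg)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    op(arg);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Stand-in operation: sleep for *arg milliseconds (arg points to a long). */
void sleep_for_ms(void *arg)
{
    long ms = *(const long *)arg;
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}
```

In a real run the operation would be repeated and the timings averaged, since a single sample at this granularity is noisy.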
Slide19
LIBI Microbenchmark Results
Slide20
LIBI Microbenchmark Results
Slide21
MRNet/LIBI Integration
MRNet uses LIBI to launch all MRNet processes
  Parse topology file and setup/call LIBI_launch()
Session front-end gathers/scatters startup information
  Parent listening socket (IP/port)
Slide22
Some Future Work
STAT over MRNet over LIBI performance results
Basic, scalable rsh-based mechanism
Hybrid (XT-like) launch approach
Mechanisms to alleviate filesystem contention
  Like our scalable binary relocation service
More flexible process and host distributions
Instantiating different images in same session
Integrating allocating and launching
Integrated scalable communication infrastructure