Slide 1
HPX-5: ParalleX in Action
Martin Swany
Associate Chair and Professor, Intelligent Systems Engineering
Deputy Director, Center for Research in Extreme Scale Technology (CREST)
Indiana University

Slide 2
ParalleX Execution Model
Core Tenets
Fine-grained parallelism
Hide latency with concurrency
Runtime introspection and adaptation
Formal components
Global address space (shared memory programming)
Processes
Compute complexes
Lightweight control objects
Parcels
Fully flexible but promotes fine-grained dataflow programs
HPX-5 is based on ParalleX and is part of the Center for Shock-Wave Processing of Advanced Reactive Materials (C-SWARM) effort in PSAAP-II

Slide 3
Model: Global Address Space
Flat byte-addressable global addresses
Put/get with local and remote completion
Active message targets
Array collectives
Controls thread distribution and load balance
Current implementation
Block-based allocation
malloc/free with distribution (local, cyclic, user, etc.)
Traditional PGAS or directory-based AGAS
High performance local allocation (high frequency LCO allocation)
Soft core affinity for NUMA
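As a rough illustration of this interface, here is a minimal sketch against HPX-5's C API; the hpx_gas_* and hpx_lco_* names follow the HPX-5 headers, but the exact signatures shown are assumptions and may vary across releases:

  #include <hpx/hpx.h>

  // Minimal sketch; must execute inside an HPX-5 lightweight thread (an action).
  static void gas_example(void) {
    // Allocate 1024 blocks of 256 bytes each, distributed cyclically
    // across localities (block-based allocation, cyclic distribution).
    hpx_addr_t gva = hpx_gas_alloc_cyclic(1024, 256, 0);

    // Synchronous get: copy the first block into a local buffer.
    char buf[256];
    hpx_gas_memget_sync(buf, gva, sizeof(buf));

    // Asynchronous put that signals a future LCO on remote completion.
    hpx_addr_t rsync = hpx_lco_future_new(0);
    hpx_gas_memput(gva, buf, sizeof(buf), HPX_NULL, rsync);
    hpx_lco_wait(rsync);       // blocks only this lightweight thread
    hpx_lco_delete_sync(rsync);

    hpx_gas_free(gva, HPX_NULL);
  }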
Slide 4

Model: Parcels
Active messages with continuations
Target data: action, global address, immediate data
Continuation data: action, global address
lco_set, lco_delete, memput, free, etc.
Execute local to the target address
Unified local and remote execution model
send() is equivalent to thread_create()
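A hedged sketch of how this send()/continuation pairing looks with HPX-5's hpx_parcel_* calls; 'compute' and 'consume' are hypothetical registered actions, and the exact signatures are assumptions:

  #include <hpx/hpx.h>

  extern hpx_action_t compute;   // hypothetical target action
  extern hpx_action_t consume;   // hypothetical continuation action

  // Sketch: an active message whose result flows to a continuation.
  static void send_with_continuation(hpx_addr_t target, hpx_addr_t result,
                                     const void *args, size_t n) {
    hpx_parcel_t *p = hpx_parcel_acquire(args, n);   // immediate data
    hpx_parcel_set_target(p, target);                // runs local to this address
    hpx_parcel_set_action(p, compute);               // target action
    hpx_parcel_set_cont_target(p, result);           // continuation address
    hpx_parcel_set_cont_action(p, consume);          // e.g., lco_set semantics
    hpx_parcel_send(p, HPX_NULL);                    // async; no local completion
  }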
Slide 5

Model: User-level Threads
Cooperative threads
Block on dynamic dependencies (lco_get, memput, etc.)
Continuation-passing style
Progenitor parcel specifies continuation target and action
Thread “continues” a value
Call/cc “pushes” a continuation parcel
Isomorphic with parcels
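As a small illustration of continuing a value, reusing the HPX_THREAD_CONTINUE idiom from the Fibonacci slide later in the deck (registration flags as shown there; treat the details as assumptions):

  #include <hpx/hpx.h>

  // Sketch: the returned value is delivered to whatever continuation the
  // progenitor parcel specified (an LCO set, another action, etc.).
  static int square_handler(int n) {
    int sq = n * n;
    return HPX_THREAD_CONTINUE(sq);   // "continue" sq; isomorphic to a parcel send
  }
  HPX_ACTION(HPX_DEFAULT, 0, square, square_handler, HPX_INT);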
Slide 6

Model: Local Control Objects
Abstract synchronization interface
Unified local/remote access
Threads: get, set, wait, reset, compound ops
Parcel sends dependent on LCOs
Built-in classes
Futures, reductions, generation counts, semaphores, …
User-defined classes
Initialize, set handler, predicate
Colocates data with control and synchronization
Implement dataflow with parcel continuations
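A minimal sketch of the future LCO round-trip; hpx_lco_future_new, hpx_lco_get, and hpx_lco_delete_sync appear in the Fibonacci slide below, while the hpx_lco_set signature here is an assumption:

  #include <hpx/hpx.h>

  // Sketch: produce and consume a value through a future LCO in the GAS.
  static void future_example(void) {
    hpx_addr_t f = hpx_lco_future_new(sizeof(int));   // GAS-resident LCO

    int v = 42;
    // The set may come from any locality; HPX_NULL skips completion sync.
    hpx_lco_set(f, sizeof(v), &v, HPX_NULL, HPX_NULL);

    int out;
    hpx_lco_get(f, sizeof(out), &out);   // blocks this lightweight thread only
    hpx_lco_delete_sync(f);              // GAS free
  }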
Slide 7

Control: Parallel Parcels and Threads
Serial work
thread_continue
thread_call/cc
happens-before: Thread 1 < Thread 2
Parallel work
parcel_send
unordered: Thread 1 <> Thread 4
Higher level
hpx_call
local parfor, hierarchical parfor
[Diagram: Thread 1 reaches Thread 2 via thread_continue(x) and spawns unordered work, including Thread 4, via parcel_send(p), parcel_send(q), parcel_send(r)]
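A hedged sketch of the unordered fan-out pattern above, joining with an AND-gate LCO; 'work' is a hypothetical registered action, and the hpx_call varargs convention shown is an assumption:

  #include <hpx/hpx.h>

  extern hpx_action_t work;   // hypothetical registered action

  // Sketch: spawn n unordered parcels, then join on an AND-gate LCO.
  static void fan_out(int n) {
    hpx_addr_t done = hpx_lco_and_new(n);      // fires after n sets
    for (int i = 0; i < n; ++i) {
      // Each hpx_call is an unordered parcel send with 'done' as continuation.
      hpx_call(HPX_HERE, work, done, &i, sizeof(i));
    }
    hpx_lco_wait(done);                        // happens-before: all work < here
    hpx_lco_delete_sync(done);
  }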
Slide 8

Control: LCO Synchronization
Thread-thread synchronization
Traditional monitor-style synchronization
Dynamic output dependencies
Blocked threads as continuations
Data-flow execution
Pending parcels as continuations
Execution “consumes” output
Can be manually regenerated for iterative execution
Generic user-defined
Any set of continuations
Any function and predicate
Lazy evaluation of the function
[Diagram: lco_set(a), lco_set(b), …, lco_set(x) feed f(a, b, …, x) and pred(); a future LCO chains lco_set to lco_get and parcel_send(p); an “and” LCO gates lco_get and parcel_send(p)]

Slide 9
Data Structures, Distribution
Global linked data structures
Graphs, trees, DAGs
Global cyclic block arrays
locality(block address)
Global user-defined distributions
locality[block address]
Active GAS
Distributed directory allows blocks to be dynamically remapped from their home localities.
Application-specific explicit load balancing
Automatic load balancing through GAS tracing and graph partitioning (slow)

Slide 10
Fibonacci

#include <hpx/hpx.h>

HPX_ACTION_DECL(fib);

int fib_handler(int n) {
  if (n < 2) {
    return HPX_THREAD_CONTINUE(n);                    // sequential
  }

  int l = n - 1;
  int r = n - 2;
  hpx_addr_t lhs = hpx_lco_future_new(sizeof(int));   // GAS malloc
  hpx_addr_t rhs = hpx_lco_future_new(sizeof(int));   // GAS malloc

  hpx_call(HPX_HERE, fib, lhs, l);                    // parallel
  hpx_call(HPX_HERE, fib, rhs, r);                    // parallel

  hpx_lco_get(lhs, sizeof(int), &l);                  // LCO synchronization
  hpx_lco_get(rhs, sizeof(int), &r);                  // LCO synchronization
  hpx_lco_delete_sync(lhs);                           // GAS free
  hpx_lco_delete_sync(rhs);                           // GAS free

  int fn = l + r;
  return HPX_THREAD_CONTINUE(fn);                     // sequential
}
HPX_ACTION(HPX_DEFAULT, 0, fib, fib_handler, HPX_INT);
fib(n) = fib(n-1) + fib(n-2)
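For context, a hedged sketch of the driver boilerplate that would invoke this action; hpx_init/hpx_run/hpx_exit/hpx_finalize are HPX-5's standard entry points, but the exact signatures shown here are assumptions:

  #include <stdio.h>
  #include <hpx/hpx.h>

  // Sketch: a main action that calls fib and reports the result.
  static int main_handler(int n) {
    int fn = 0;
    hpx_addr_t f = hpx_lco_future_new(sizeof(int));
    hpx_call(HPX_HERE, fib, f, n);          // same calling style as the slide
    hpx_lco_get(f, sizeof(fn), &fn);
    hpx_lco_delete_sync(f);
    printf("fib(%d) = %d\n", n, fn);
    hpx_exit(HPX_SUCCESS);                  // end the global epoch
  }
  HPX_ACTION(HPX_DEFAULT, 0, main_act, main_handler, HPX_INT);

  int main(int argc, char *argv[]) {
    if (hpx_init(&argc, &argv) != 0) {
      return -1;
    }
    int n = 10;
    int e = hpx_run(&main_act, &n, sizeof(n));   // run until hpx_exit
    hpx_finalize();
    return e;
  }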
Slide 11

Networking / Comms
Internal interfaces
Preferred: put/get with remote completion
Legacy: parcel send
Photon
RDMA put/get with remote-completion operations
Native PSM (libfabric), IB verbs, uGNI, sockets (libfabric)
Parcel emulation through eager buffers
Synchronized with fine-grained point-to-point locking
Isend/Irecv
MPI_THREAD_FUNNELED implementation
PWC emulated through Isend/Irecv
Portability, legacy upgrade path

Slide 12
Networking / Comms
A key idea in the Photon library: put/get with completion (PWC)
Minimal overhead to trigger a waiting thread via an LCO
A useful paradigm when combined with an “unexpected active message” capability
Essentially, attach parcel continuations (either already-running threads or yet-to-be-instantiated parcels) to both local and remote completion operations
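At the HPX-5 level, one way this composes is to let the remote-completion LCO of a put gate a dependent call; hpx_call_when is an HPX-5 call-on-LCO-trigger primitive, though the exact signatures shown are assumptions and 'consume' is a hypothetical action:

  #include <hpx/hpx.h>

  extern hpx_action_t consume;   // hypothetical action to run on completion

  // Sketch: attach a yet-to-be-instantiated parcel to a put's remote completion.
  static void put_with_parcel_continuation(hpx_addr_t dst, const void *src,
                                           size_t n, hpx_addr_t result) {
    hpx_addr_t rsync = hpx_lco_future_new(0);
    hpx_gas_memput(dst, src, n, HPX_NULL, rsync);   // PWC-style one-sided put
    // When rsync fires (data visible remotely), launch 'consume' at dst;
    // cleanup of rsync is elided in this sketch.
    hpx_call_when(rsync, dst, consume, result);
  }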
Slide 13

Networking / Comms
One of the key lessons of HPX-5 is the power of the memget/memput-with-completion primitives (with the associated low-level photon_pwc and photon_gwc): they provide a very powerful abstraction
One-sided operations in AMTs are not, by themselves, that useful
The ability to continue threads or spawn parcels is what provides the performance-improving functionality

Slide 14
Thank you
hpx.crest.iu.edu