
HPX-5 - PowerPoint Presentation

Uploaded On 2017-04-03


ParalleX in Action. Martin Swany, Associate Chair and Professor, Intelligent Systems Engineering; Deputy Director, Center for Research in Extreme Scale Technology (CREST), Indiana University. ID: 533107



Presentation Transcript

Slide1

HPX-5: ParalleX in Action

Martin Swany

Associate Chair and Professor, Intelligent Systems Engineering

Deputy Director, Center for Research in Extreme Scale Technology (CREST)

Indiana University

Slide2

ParalleX Execution Model

Core Tenets

Fine grained parallelism

Hide latency with concurrency

Runtime introspection and adaptation

Formal components

Global address space (shared memory programming)

Processes

Compute complexes

Lightweight control objects

Parcels

Fully flexible but promotes fine-grained dataflow programs

HPX-5 is based on ParalleX and is part of the Center for Shock-Wave Processing of Advanced Reactive Materials (C-SWARM) effort in PSAAP-II

Slide3

Model: Global Address Space

Flat byte-addressable global addresses

Put/get with local and remote completion

Active message targets

Array collectives

Controls thread distribution and load balance

Current implementation

Block-based allocation

Malloc/free with distribution (local, cyclic, user, etc.)

Traditional PGAS or directory-based AGAS

High performance local allocation (high frequency LCO allocation)

Soft core affinity for NUMA

Slide4
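The block-based cyclic distribution listed on the global address space slide can be illustrated with a little address arithmetic. The sketch below is a minimal plain-C illustration, assuming a flat byte address is split into a block index and an offset and that blocks are dealt out round-robin. The names gas_layout_t, gas_location_t, and gas_locate_cyclic are hypothetical and are not part of the HPX-5 API, and the real address encoding may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; not the HPX-5 address format. */
typedef struct {
  uint32_t block_size;   /* bytes per block */
  uint32_t n_localities; /* number of localities taking part */
} gas_layout_t;

typedef struct {
  uint64_t block;    /* global block index */
  uint32_t offset;   /* byte offset inside the block */
  uint32_t locality; /* home locality under a cyclic distribution */
} gas_location_t;

/* Split a flat byte address into (block, offset) and choose the home
   locality round-robin: block b lives on locality b mod n_localities. */
static gas_location_t gas_locate_cyclic(gas_layout_t layout, uint64_t addr) {
  assert(layout.block_size > 0 && layout.n_localities > 0);
  gas_location_t loc;
  loc.block = addr / layout.block_size;
  loc.offset = (uint32_t)(addr % layout.block_size);
  loc.locality = (uint32_t)(loc.block % layout.n_localities);
  return loc;
}
```

A user-defined distribution would replace the modulo with a table lookup indexed by block, which is roughly what a directory-based scheme does when blocks are allowed to move.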

Model: Parcels

Active messages with continuations

Target data: action, global address, immediate data

Continuation data: action, global address

lco_set, lco_delete, memput, free, etc.

Execute local to target address

Unified local and remote execution model

send() equivalent to thread_create()

Slide5
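The parcel structure described on the parcels slide (a target action with its address and immediate data, plus a continuation action with its own address) can be sketched in a few lines of C. This is a single-process illustration under stated assumptions: an ordinary pointer stands in for a global address, and parcel_t, parcel_execute, add_action, and set_action are hypothetical names, not the HPX-5 parcel interface.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical; this is not the HPX-5 parcel API. */
typedef int (*action_t)(int *target, int arg);

typedef struct {
  action_t action;   /* target action to run */
  int *target;       /* stands in for the target global address */
  int arg;           /* immediate data */
  action_t c_action; /* continuation action, or NULL for none */
  int *c_target;     /* stands in for the continuation address */
} parcel_t;

/* Run the target action, then forward its result to the continuation. */
static int parcel_execute(const parcel_t *p) {
  assert(p != NULL && p->action != NULL);
  int value = p->action(p->target, p->arg);
  if (p->c_action != NULL) {
    return p->c_action(p->c_target, value);
  }
  return value;
}

/* Example target action: add the immediate data to the value at the target. */
static int add_action(int *target, int arg) { return *target + arg; }

/* Example continuation: store the continued value, in the spirit of lco_set. */
static int set_action(int *target, int value) {
  *target = value;
  return value;
}
```

Building a parcel with add_action as the target action and set_action as the continuation shows why sending a parcel resembles creating a thread: the parcel carries everything needed to run the work and to deliver its result.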

Model: User-level threads

Cooperative threads

Block on dynamic dependencies (lco_get, memput, etc.)

Continuation passing style

Progenitor parcel specifies continuation target, action

Thread “continues” value

Call/cc “pushes” continuation parcel

Isomorphic with parcels

Slide6

Model: Local Control Objects

Abstract synchronization interface

Unified local/remote access

Threads get, set, wait, reset, compound ops

Parcel sends dependent on

Built-in classes

Futures, reductions, generation counts, semaphores,

User defined classes

Initialize, set handler, predicate

Colocates data with control and synchronization

Implement dataflow with parcel continuations

Slide7
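The user-defined LCO class on the local control objects slide (initialize, set handler, predicate) lends itself to a small sketch. The code below is a single-threaded illustration with no real blocking or networking; every name in it is hypothetical and unrelated to the actual hpx_lco_* functions. It shows a sum reduction whose attached continuations fire once the predicate reports that all inputs have arrived.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LCO_MAX_WAITERS 4

/* Hypothetical single-threaded LCO; not the HPX-5 hpx_lco_* interface. */
typedef struct lco lco_t;
typedef void (*lco_set_fn)(lco_t *lco, int input); /* set handler */
typedef bool (*lco_pred_fn)(const lco_t *lco);     /* trigger predicate */
typedef void (*lco_cont_fn)(int value, void *env); /* continuation */

struct lco {
  int value;      /* data colocated with the control state */
  int inputs;     /* number of sets seen so far */
  int needed;     /* number of sets required by the predicate */
  bool triggered; /* true once the predicate has held */
  lco_set_fn on_set;
  lco_pred_fn pred;
  lco_cont_fn waiters[LCO_MAX_WAITERS];
  void *envs[LCO_MAX_WAITERS];
  size_t n_waiters;
};

static void lco_init(lco_t *lco, int needed, lco_set_fn on_set, lco_pred_fn pred) {
  assert(lco != NULL && on_set != NULL && pred != NULL);
  lco->value = 0;
  lco->inputs = 0;
  lco->needed = needed;
  lco->triggered = false;
  lco->on_set = on_set;
  lco->pred = pred;
  lco->n_waiters = 0;
}

/* Register a continuation; run it at once if the LCO has already triggered. */
static void lco_attach(lco_t *lco, lco_cont_fn k, void *env) {
  if (lco->triggered) {
    k(lco->value, env);
    return;
  }
  assert(lco->n_waiters < LCO_MAX_WAITERS);
  lco->waiters[lco->n_waiters] = k;
  lco->envs[lco->n_waiters] = env;
  lco->n_waiters++;
}

/* Apply the set handler, then fire all waiters once the predicate holds. */
static void lco_set(lco_t *lco, int input) {
  if (lco->triggered) return;
  lco->on_set(lco, input);
  lco->inputs++;
  if (!lco->pred(lco)) return;
  lco->triggered = true;
  for (size_t i = 0; i < lco->n_waiters; i++) {
    lco->waiters[i](lco->value, lco->envs[i]);
  }
  lco->n_waiters = 0;
}

/* A sum reduction: accumulate inputs, trigger after `needed` sets. */
static void sum_on_set(lco_t *lco, int input) { lco->value += input; }
static bool sum_pred(const lco_t *lco) { return lco->inputs >= lco->needed; }

/* Example continuation: copy the final value out. */
static void store_result(int value, void *env) { *(int *)env = value; }
```

Swapping in a different set handler and predicate gives other classes with the same skeleton: a future is the case where one set triggers, and an "and" gate ignores the input values and only counts them.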

Control: Parallel Parcels and Threads

Serial work

thread_continue

thread_call/cc

happens-before

Thread 1 < Thread 2

Parallel work

parcel_send

unordered

Thread 1 <> Thread 4

Higher level

hpx_call

local parfor, hierarchical parfor

[Figure: Thread 1, Thread 2, and Thread 4, with edges labeled thread_continue(x), parcel_send(p), parcel_send(r), and parcel_send(q)]

Slide8

Control: LCO Synchronization

Thread-thread synchronization

Traditional monitor style synchronization

Dynamic output dependencies

Blocked threads as continuations

Data-flow execution

Pending parcels as continuations

Execution "consumes" output

Can be manually regenerated for iterative execution

Generic user-defined

Any set of continuations

Any function and predicate

Lazy evaluation of function

[Figure: LCO diagrams labeled "future" (lco_set, lco_get, parcel_send(p)), "and" (lco_get, parcel_send(p)), and a generic LCO (lco_set(a), lco_set(b), …, lco_set(x); f(a, b, …, x); pred())]

Slide9

Data Structures, Distribution

Global linked data structures

Graphs, trees, DAGs

Global cyclic block arrays

locality(block address)

Global user-defined distributions

locality[block address]

Active GAS

Distributed directory allows blocks to be dynamically remapped from their home localities.

Application-specific explicit load balancing

Automatic load balancing through GAS tracing and graph partitioning (slow)

Slide10

Fibonacci

HPX_ACTION_DECL(fib);

int fib_handler(int n) {
  if (n < 2) {
    return HPX_THREAD_CONTINUE(n);                   // sequential
  }
  int l = n - 1;
  int r = n - 2;
  hpx_addr_t lhs = hpx_lco_future_new(sizeof(int));  // GAS malloc
  hpx_addr_t rhs = hpx_lco_future_new(sizeof(int));  // GAS malloc
  hpx_call(HPX_HERE, fib, lhs, l);                   // parallel
  hpx_call(HPX_HERE, fib, rhs, r);                   // parallel
  hpx_lco_get(lhs, sizeof(int), &l);                 // LCO synchronization
  hpx_lco_get(rhs, sizeof(int), &r);                 // LCO synchronization
  hpx_lco_delete_sync(lhs);                          // GAS free
  hpx_lco_delete_sync(rhs);                          // GAS free
  int fn = l + r;
  return HPX_THREAD_CONTINUE(fn);                    // sequential
}
HPX_ACTION(HPX_DEFAULT, 0, fib, fib_handler, HPX_INT);

fib(n) = fib(n-1) + fib(n-2)

Slide11
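To get a feel for how fine-grained the Fibonacci example is, it helps to count how many handler invocations the recursion generates. The sketch below is plain sequential C with no HPX-5 calls. fib_seq is a reference for checking results, and fib_invocations counts one invocation per call plus those of its two children when n >= 2, under the assumption that every recursive call becomes its own lightweight thread as in the slide. Both function names are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Sequential reference for fib(n) = fib(n-1) + fib(n-2),
   with fib(0) = 0 and fib(1) = 1. */
static uint64_t fib_seq(unsigned n) {
  assert(n <= 90); /* keeps every intermediate value inside 64 bits */
  uint64_t a = 0, b = 1;
  for (unsigned i = 0; i < n; i++) {
    uint64_t t = a + b;
    a = b;
    b = t;
  }
  return a;
}

/* Handler invocations made by the naive recursion for input n:
   inv(0) = inv(1) = 1, and inv(n) = 1 + inv(n-1) + inv(n-2) otherwise. */
static uint64_t fib_invocations(unsigned n) {
  assert(n <= 90);
  uint64_t prev = 1, cur = 1;
  for (unsigned i = 2; i <= n; i++) {
    uint64_t next = 1 + cur + prev;
    prev = cur;
    cur = next;
  }
  return cur;
}
```

The count grows as fast as the Fibonacci numbers themselves, so the example is mostly a stress test of thread creation, future allocation, and LCO synchronization rather than of arithmetic.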

Networking / Comms

Internal interfaces

Preferred: put/get with remote completion

Legacy: parcel send

Photon

RDMA put/get with remote completion operations

Native PSM (libfabric), IB verbs, uGNI, sockets (libfabric)

Parcel emulation through eager buffers

Synchronized with fine-grained point-to-point locking

Isend/Irecv

MPI_THREAD_FUNNELED implementation

PWC emulated through Isend/Irecv

Portability, legacy upgrade path

Slide12

Networking / Comms

A key idea in the Photon library: Put/Get with Completion

Minimal overhead to trigger waiting thread via LCO

Useful paradigm when combined with an "unexpected active message" capability

Essentially attach parcel continuations (either already-running threads or yet-to-be-instantiated parcels) to both local and remote completion operations

Slide13
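The put-with-completion idea can be illustrated without any network at all. The sketch below emulates it inside one process: a memcpy stands in for the RDMA transfer, and two caller-supplied continuations stand in for the local and remote completion events. The names completion_t, put_with_completion, and count_event are hypothetical and do not reflect the Photon API, and in a real transport the two completions would fire at different times on different nodes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names; a one-process emulation, not the Photon API. */
typedef void (*completion_fn)(void *env);

typedef struct {
  completion_fn fn; /* continuation to run, or NULL for none */
  void *env;        /* argument passed to the continuation */
} completion_t;

static void complete(completion_t c) {
  if (c.fn != NULL) c.fn(c.env);
}

/* Copy n bytes to dst, then signal completion on both sides.
   Local completion: the source buffer may be reused.
   Remote completion: the data is visible at the destination. */
static void put_with_completion(void *dst, const void *src, size_t n,
                                completion_t local, completion_t remote) {
  assert(dst != NULL && src != NULL);
  memcpy(dst, src, n);
  complete(local);
  complete(remote);
}

/* Example continuation: count events, as setting a counting LCO might. */
static void count_event(void *env) { ++*(int *)env; }
```

Either continuation could just as well resume a waiting thread or enqueue a new parcel, which is the attachment of parcel continuations to completion operations that the slide describes.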

Networking / Comms

One of the key lessons in HPX-5 is the power of memget and memput with completion

These primitives (with the associated low-level photon_pwc and photon_gwc) provide a very powerful abstraction

One-sided operations in AMTs are not themselves that useful

The ability to continue threads or spawn parcels provides performance-improving functionality

Slide14

Thank you

hpx.crest.iu.edu
