
Andrei Gheata (CERN) for the GeantV development team - PowerPoint Presentation

Uploaded On 2020-06-26



Presentation Transcript

Slide1

Andrei Gheata (CERN) for the GeantV development team

GeantV: Parallelism, transport structure and overall performance

Slide2

Outlook

Motivation & objectives

Implementation

Concurrent services

Tuning knobs and adaptive behavior

Interface to accelerators

NUMA awareness

Global performance and perspectives

geant-dev@cern.ch


October 2016

Slide3

GeantV – Adapting simulation to modern hardware

Classical simulation: hard to approach the full machine potential
  Single event scalar transport
  Embarrassing parallelism
  Cache coherence – low
  Vectorization – low (scalar auto-vectorization)

GeantV simulation: needs to profit as much as possible from all processing pipelines
  Multi-event vector-aware transport
  Fine grain parallelism
  Cache coherence – high
  Vectorization – high (explicit multi-particle interfaces)

Slide4

GeantV concurrency: static thread approach

[Diagram. Labels, grouped by box:]
Input queue, Scheduler, Outputs, Coproc. broker
Basketizer: Geometry filters (Vol 1, Vol 2, Vol 3, … Vol n), (Fast)Physics filters (e+/e-, γ), Baskets
Propagator (MT vector/scalar processing):
  Vector stepper: step sampling, filter neutrals, (field) propagator, step limiter, reshuffle
  VecGeom navigator: full geometry, simplified geometry
  Physics sampler: phys. process, post-step, secondaries
  To scheduler

Drawbacks:
  thread “awareness” (thread id)
  hard to connect to task-based frameworks
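As an illustration of the basketizing idea on this slide, here is a minimal single-threaded sketch. It is not GeantV code: the class name `Basketizer`, its methods and the fixed `basket_size` are assumptions made for the example. Tracks are grouped by the volume they are in, and a basket is handed out once it is full.

```python
from collections import defaultdict


class Basketizer:
    """Toy basketizer (hypothetical API): groups tracks by volume."""

    def __init__(self, basket_size):
        self.basket_size = basket_size
        # volume id -> tracks waiting for their basket to fill up
        self.pending = defaultdict(list)

    def add_track(self, volume, track):
        """Add one track; return a full basket (list of tracks) or None."""
        bucket = self.pending[volume]
        bucket.append(track)
        if len(bucket) == self.basket_size:
            self.pending[volume] = []  # start a fresh basket for this volume
            return bucket
        return None

    def flush(self):
        """Return all partially filled baskets as (volume, tracks) pairs."""
        leftovers = [(vol, trk) for vol, trk in self.pending.items() if trk]
        self.pending.clear()
        return leftovers
```

A real concurrent basketizer would also need synchronization between the threads that add tracks, which this sketch leaves out.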

Slide5

Scheduler features

Efficient concurrent basketizers
  Filtering tracks by several possible locality criteria
  Giving “reasonable” size vectors all along the simulation

Provide scalable & balanced workload
  Minimize memory footprint
  Minimize cool-down phase (tails)

Adaptive behavior to maximize performance
  Dynamic switch of vector/scalar processing
  Learning dynamically the “important” filters
  Adjust dynamically event slots to control memory

Accommodate additional concurrent processing in the simulation workflow
  Hits/digits/MC truth I/O
  Digitization/reconstruction tasks

Slide6

SOA data handling challenges

[Figure: SOA track containers A, B and C, each drawn as parallel arrays fEventV, fParticleV, …]

What we want before doing work on data: Use

What we need to do: Compact, Move, Reshuffle (charged / neutral / crossing), Basketize
  … single threaded
  … concurrently
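The compact and reshuffle operations named on this slide can be sketched on a toy SOA container. Only the array names `fEventV` and `fParticleV` come from the slide. The dict-of-lists layout, the `alive` mask and the `category` labels are assumptions for illustration.

```python
def compact(soa, alive):
    """Drop dead slots from every array so live tracks become contiguous."""
    keep = [i for i, ok in enumerate(alive) if ok]
    return {name: [column[i] for i in keep] for name, column in soa.items()}


def reshuffle(soa, category):
    """Stable reorder so tracks of the same category sit next to each other."""
    order = sorted(range(len(category)), key=lambda i: category[i])
    return {name: [column[i] for i in order] for name, column in soa.items()}


# Four tracks; the second one has been removed (a hole in the arrays).
tracks = {"fEventV": [0, 0, 1, 1], "fParticleV": [10, 11, 12, 13]}
compacted = compact(tracks, alive=[True, False, True, True])
grouped = reshuffle(compacted, category=["neutral", "charged", "neutral"])
```

Every operation has to touch all arrays of the container in the same way, which is why these steps cost more on SOA data than on an array of track objects.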

Slide7

And the price to pay…

[Plot: run-time fraction spent in different parts of GeantV. Axis: fraction of run time. Labels: Geometry, Hyperthreading. Machine: 24-core dual socket E5-2695 v2 @ 2.40GHz (IVB).]

The observed contention led us to organize tracks in a different way, in order to reduce the concurrency cost.

Slide8

Concurrent services: queues

We use concurrent queues for
  Aggregating/balancing the work load between workers

Several mutex-based/lock-free implementations evaluated

GeantV queues can work at ~10^5 transactions/sec

Lock-free queues are doing great on MacOSX + clang compared to mutex-based ones (50x factor!)
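A minimal sketch of balancing work between workers through one shared concurrent queue, using the mutex-based `queue.Queue` from the Python standard library. This is not one of the queue implementations evaluated for GeantV. The function `run_workers` and the use of `sum(basket)` as a stand-in for transport are assumptions for the example.

```python
import queue
import threading


def run_workers(baskets, n_workers):
    """Process baskets from one shared queue; return per-worker counts and results."""
    work = queue.Queue()      # mutex-based concurrent queue
    results = queue.Queue()
    for basket in baskets:
        work.put(basket)
    handled = [0] * n_workers  # each worker only writes its own slot

    def worker(worker_id):
        while True:
            try:
                basket = work.get_nowait()
            except queue.Empty:
                return                   # no work left
            results.put(sum(basket))     # stand-in for transporting the basket
            handled[worker_id] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    totals = []
    while not results.empty():
        totals.append(results.get())
    return handled, sorted(totals)
```

Because every worker pulls from the same queue, a worker that finishes early simply takes the next basket, so the load balances itself without a central assignment step.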

Slide9

Monitoring to understand behavior

Implemented real-time monitoring tools based on ROOT
  Very useful to understand model behavior
  Triggered improvement/evolution of the model

Some parameters can be really adaptive
  Such as “important” volumes that can feed vectors
  Basketizing only 10% of volumes in CMS leads to 70% of transport done in vector mode
  Switching the fraction of vector/scalar work has impact on performance in a large range

* FixedShield102880: 708955 steps
* HVQX8780: 462821 steps
* ZDC_EMLayer9b00: 83838 steps
* BeamTube22b780: 78748 steps
* OQUA6780: 62597 steps
* QuadInner3300: 56376 steps
* ZDC_EMAbsorber9d00: 53672 steps
* QuadOuter3700: 52155 steps
* QuadCoil3680: 49086 steps
* ZDC_EMFiber9e80: 41705 steps

[Plots (real time snapshots): frequency of baskets vs. basket size; #tracks transported, scalar vs. vector]
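The idea of picking “important” volumes can be illustrated with the step counts listed on this slide. The selection rule below (take the most-visited volumes until a target fraction of steps is covered) is an assumption for the example, not the actual GeantV algorithm, and the fractions are relative to the ten listed volumes only.

```python
# Step counts copied from the monitoring output on the slide.
steps = {
    "FixedShield102880": 708955,
    "HVQX8780": 462821,
    "ZDC_EMLayer9b00": 83838,
    "BeamTube22b780": 78748,
    "OQUA6780": 62597,
    "QuadInner3300": 56376,
    "ZDC_EMAbsorber9d00": 53672,
    "QuadOuter3700": 52155,
    "QuadCoil3680": 49086,
    "ZDC_EMFiber9e80": 41705,
}


def important_volumes(step_counts, target_fraction):
    """Most-visited volumes that together cover target_fraction of the steps."""
    total = sum(step_counts.values())
    selected, covered = [], 0
    ranked = sorted(step_counts.items(), key=lambda kv: kv[1], reverse=True)
    for name, count in ranked:
        if covered >= target_fraction * total:
            break
        selected.append(name)
        covered += count
    return selected, covered / total


selected, fraction = important_volumes(steps, 0.7)
```

With these numbers the two most-visited volumes already cover about 71% of the listed steps, which shows how a small set of volumes can feed most of the vectors.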

Slide10

Memory control

Memory determined by number of tracks “in flight”
  Determined by number of events “in flight”

Controlling the memory is important for low production cuts
  Number of secondary particles can explode

Currently implemented a policy to delete empty baskets when reaching a memory watermark
  Not fully effective, but keeping the memory constant

Extra levers (future work):
  Reducing dynamically the number of events in flight (possible with new event server)
  Back burner (waiting queue) for high energy tracks

[Plot curves: queued baskets, memory, tracks in flight]
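A minimal sketch of the watermark policy described here: empty baskets are deleted only once the allocated memory exceeds a limit. The basket layout (a dict with `capacity` and `tracks`) and the parameters `bytes_per_slot` and `watermark_bytes` are assumptions for illustration.

```python
def clean_baskets(baskets, bytes_per_slot, watermark_bytes):
    """Delete empty baskets, but only when allocated memory exceeds the watermark."""
    allocated = sum(b["capacity"] for b in baskets) * bytes_per_slot
    if allocated <= watermark_bytes:
        return baskets                        # below the watermark: keep everything
    return [b for b in baskets if b["tracks"]]  # above: drop the empty baskets


baskets = [
    {"capacity": 16, "tracks": [1, 2]},
    {"capacity": 16, "tracks": []},
]
below = clean_baskets(baskets, bytes_per_slot=100, watermark_bytes=5000)
above = clean_baskets(baskets, bytes_per_slot=100, watermark_bytes=3000)
```

Only empty baskets are reclaimed, so the policy cannot free memory held by tracks that are still in flight. That matches the slide's remark that it is not fully effective.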

Slide11

Scheduler “knobs”

Keep memory under control
  Limiting number of buffered events
  Prioritizing events “mostly” transported
  Using watermark limit to clean baskets

Keep the vectors up
  Optimize vector size
    Too large: too many pending baskets
    Too small: inefficient vectorization
  Trigger postponing tracks or tracking with scalar algorithms
  Adjust also dynamically basket size

Popularity service: basketize only the “important” volumes
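The vector/scalar trade-off behind these knobs can be sketched as a simple threshold rule. The names `dispatch`, `vector_fraction` and `min_vector_size` are hypothetical, and the real scheduler adapts its parameters dynamically rather than using a fixed cut.

```python
def dispatch(basket, min_vector_size):
    """Choose the processing mode for one basket from its size."""
    return "vector" if len(basket) >= min_vector_size else "scalar"


def vector_fraction(baskets, min_vector_size):
    """Fraction of tracks that would be transported in vector mode."""
    total = sum(len(b) for b in baskets)
    if total == 0:
        return 0.0
    in_vector = sum(
        len(b) for b in baskets if dispatch(b, min_vector_size) == "vector"
    )
    return in_vector / total


baskets = [[0] * 8, [0] * 2, [0] * 6]
low_cut = vector_fraction(baskets, min_vector_size=4)
high_cut = vector_fraction(baskets, min_vector_size=8)
```

Raising the threshold sends more tracks to scalar code. Lowering it keeps more tracks in vector mode but with shorter, less efficient vectors.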

Slide12

Optimization of scheduling parameters

Depends on what needs to be optimized

E.g. memory vs. computing time

A multivariate problem, probably too early to optimize

Development is iterative with short cycles

Genetic algorithm approach started to be investigated


Slide13

Optimizations for “dense” physics: reusing tracks

If interacting in the current step, no need to re-basketize (same volume)
  Recycle the input basket

Large gain for dense physics
  Normally the basketizer becomes fully blocking
  Large part of tracks can be reused in the same thread to release load
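A minimal sketch of the track-reuse idea, assuming each track records whether its last step ended in an interaction. The field name `interacted` and the function `split_after_step` are hypothetical.

```python
def split_after_step(basket):
    """Split a transported basket into tracks to reuse and tracks to re-basketize."""
    # Interacting tracks are still in the same volume: recycle them in place.
    reuse = [t for t in basket if t["interacted"]]
    # The others (for example boundary crossings) go back to the basketizer.
    rebasketize = [t for t in basket if not t["interacted"]]
    return reuse, rebasketize


basket = [
    {"id": 0, "interacted": True},
    {"id": 1, "interacted": False},
    {"id": 2, "interacted": True},
]
reuse, rebasketize = split_after_step(basket)
```

Only the second list has to pass through the shared basketizer, which is where the reduced contention for dense physics would come from.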

Slide14

New development: Integration with task-based experiment frameworks

Some experiments (e.g. CMS) adopted task-based frameworks

Integrating GeantV in a task-based workflow is very important (and now possible)

Several scenarios invoking GeantV as a task are possible, e.g.:

[Diagram: experiment framework chain]
  Event Generator/Reader
  Particle filter (type, region, energy, …)
  Full/fast simulation (GeantV)
  Digitization (MC truth + det. response)
  Tracking/Reconstruction

Slide15

GeantV flow of work in the task approach

[Diagram. Labels, grouped by box:]
Initial task: top level task spawning a “branch” in TBB tree of tasks
EventServer: input, inject event
Basketizer(s): concurrent service, injects full baskets
Basket queue: concurrent service, enqueue basket
Transport task: transports one basket for one step; reuse tracks keeping locality; output: transported tracks
  Transport task may be further split into subtasks
Flow control task: event finished? queue empty? memory threshold
Prioritizer: flush baskets/prioritize events task; inspect; command: dump all your baskets
User task (0..n): user digitizers tasks, user scoring, I/O task

Slide16

Connection to user framework

[Diagram. Labels, grouped by stage:]
User framework initialization: InitTask
StartSimulationTask: configure GeantV and call initialization task
User framework (e.g. running concurrent event reader tasks): UserReadEventsTask #1, #2, #3
Prepare events to be picked up (using PrimaryGenerator interface): UserGenerator #1, #2, #3, … #n->NextEvent()
GeantEventServer: Events
UserRunSim / Geant::ProcessEvents: keep <N> chains of GeantV simulation tasks (InitialTask, TransportTask, FlowController); pull baskets from event server/work queue
FlowController: ongoing R&D
UserEndEventTask (event finished?): post-event task (e.g. steering digitizers)
UserEndRunTask (run finished?): post-simulation task (e.g. steering reco)

Slide17

Preliminary TBB results

A first implementation of a task-based approach for GeantV using TBB was deployed.

Some overheads on Haswell/AVX2 not so obvious on KNL/AVX512

Less than 20% performance loss for the first implementation

[Plots: AVX2 (Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 2 sockets x 8 physical cores) and KNL/AVX512]

Slide18

Topology-aware GeantV

Replicate schedulers on NUMA clusters
  One basketizer per NUMA node
  libhwloc to detect topology
  Possible to use pinning/NUMA allocators to increase locality

Multi-propagator mode running one/more clusters per quadrant
  Loose communication between NUMA nodes at basketizing step

Implemented, currently being integrated

[Diagram: four clusters, numbered 0 to 3, each with its own Tracks, Transport, Basketizer and Scheduler, connected to a Global basketizer]

Slide19

Handling sub-node clustering

Known scalability issues of full GeantV due to synchronization in re-basketizing

New approach deploying several propagators clustering resources at sub-node level

Objectives: improved scalability at the scale of KNL and beyond, address both many-node and multi-socket (HPC) modes + non-homogeneous resources

Implemented recently

[Diagram: a GeantV run manager drives several GeantV propagators (…), each with its own Scheduler and Basketizer, placed on the sockets of a node through a NUMA discovery service (libhwloc). Below: four clusters, numbered 0 to 3, each with Tracks, Transport, Basketizer and Scheduler, connected to a Global basketizer]

Slide20

Multi-propagator performance

First version revealed a bottleneck in event fetching
  Triggered the development of the event server

Scalability gets better by increasing number of propagators

Not final results, still fixing/optimizing
  New version has still a bug in basketizing

[Plots: scalability vs. #ncores on KNL and vs. #ncores on XEON]

Slide21

GeantV plans for HPC environments

Standard mode (1 independent process per node)
  Always possible, no-brainer
  Possible issues with work balancing (events take different time)
  Possible issues with output granularity (merging may be required)

Multi-tier mode (event servers)
  Useful to work with events from file, to handle merging and workload balancing
  Communication with event servers via MPI to get event id’s in common files

[Diagram, marked “future R&D”: an Event feeder and an Event server connected via MPI to Node 1, Node 2, … Node mod[N]; each node runs Transport on Numa 0 and Numa 1; a Merging service collects the output]
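The work-balancing issue mentioned for the standard mode can be illustrated with a toy comparison. It contrasts a fixed assignment of events to nodes with nodes pulling the next event from a shared event server whenever they become free. No MPI is involved, and the function names and event durations are made up for the example.

```python
import heapq


def static_makespan(durations, n_nodes):
    """Events are pre-assigned round-robin; return the finish time of the slowest node."""
    loads = [0] * n_nodes
    for i, d in enumerate(durations):
        loads[i % n_nodes] += d
    return max(loads)


def pull_makespan(durations, n_nodes):
    """Each node asks the event server for the next event as soon as it is free."""
    free_at = [0] * n_nodes        # min-heap of times at which nodes become free
    heapq.heapify(free_at)
    for d in durations:
        t = heapq.heappop(free_at)  # earliest available node takes the event
        heapq.heappush(free_at, t + d)
    return max(free_at)


durations = [8, 1, 1, 1, 8, 1, 1, 1]   # events take different time
fixed = static_makespan(durations, 2)
pulled = pull_makespan(durations, 2)
```

In this example the fixed assignment puts both long events on the same node, while the pull model spreads them and finishes earlier.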

Slide22

Performance measurements for LHC setups: test matrix

Components tested: Scheduler, Geometry, Physics, Magnetic field stepper

Geant4 only: Legacy G4; various physics lists; various RK implementations

Geant4 or GeantV: VecGeom 2016 scalar; tabulated physics; scalar physics code; helix (fixed field); Cash-Karp Runge-Kutta

GeantV only: VecGeom 2015; VecGeom 2016 vector; legacy TGeo; vector physics code; vectorized RK implementation

[Slide annotations: “Semantic changes”; markers 1, 2, 3]

Slide23

Validation and performance for LHC setups

Exercise at the scale of LHC experiments (CMS & LHCb)
  Full geometry converted to VecGeom + uniform magnetic field
  Tabulated physics, fixed 1MeV

Measuring several cumulative observables in sensitive detectors
  Energy deposit and particle flux densities for p, π, K

Comparing GeantV single threaded with the corresponding Geant4 application
  Geant4.10.2, special physics list using tabulated physics

Comparable signal, number of secondaries, total steps and physics steps within statistical fluctuations.

T_G4/T_GV = 2.5

T_G4/T_GV = 3.5

Speed-up due to:
  1.5 - Infrastructure optimizations
  2.4 - Algorithmic improvements in geometry
  3.5 - Extra locality/marginal basket vectorization

To be profiled

Slide24

Future work

SOA->AOS integration

Tuning for many-core

R&D and testing in HPC environments

Adapting to new architectures (Power8)

Integration with physics and optimization: R-K propagator and multiple scattering

Slide25

Conclusions

GeantV core delivering already a part of the hoped performance
  Many optimization requirements, now understanding how to handle most of them
  More performance to be extracted from vectorization soon

Additional levels of locality (NUMA) available in modern HW
  Topology detection available in GeantV, currently being integrated

Integration with task-based HEP frameworks now possible
  A TBB-enabled GeantV version ready

Studying more efficient use of HPC resources
  Using a multi-tier approach for better workload balancing

Very promising results in complex applications
  Gains from infrastructure simplification, geometry and locality/vectorization