Slide1
Andrei Gheata (CERN) for the GeantV development team
GeantV – Parallelism, transport structure and overall performance
Slide2
Outlook
Motivation & objectives
Implementation
Concurrent services
Tuning knobs and adaptive behavior
Interface to accelerators
NUMA awareness
Global performance and perspectives
geant-dev@cern.ch
October 2016
Slide3
GeantV – Adapting simulation to modern hardware
Classical simulation: hard to approach the full machine potential
GeantV simulation: needs to make the best use of all processing pipelines
Single event scalar transport
Embarrassing parallelism
Cache coherence – low
Vectorization – low (scalar auto-vectorization)
Multi-event vector-aware transport
Fine grain parallelism
Cache coherence – high
Vectorization – high (explicit multi-particle interfaces)
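The contrast between single-track and multi-particle interfaces can be illustrated with a minimal sketch. The function names and the toy step-length formula are hypothetical and are not taken from GeantV; the point is only the shape of the two interfaces, one call per track versus one call per basket of tracks.

```python
import math

def step_length_scalar(energy, safety):
    """Scalar interface: one track per call (toy step limiter, hypothetical)."""
    return min(safety, 0.1 * math.sqrt(energy))

def step_length_vector(energies, safeties):
    """Multi-particle interface: one call handles a whole basket of tracks.

    In compiled code the loop body could map onto SIMD lanes; in Python it
    only illustrates the interface.
    """
    return [min(s, 0.1 * math.sqrt(e)) for e, s in zip(energies, safeties)]

energies = [1.0, 4.0, 9.0, 16.0]
safeties = [0.05, 0.5, 0.2, 1.0]

# Same physics, two calling conventions.
scalar_result = [step_length_scalar(e, s) for e, s in zip(energies, safeties)]
vector_result = step_length_vector(energies, safeties)
```

Both paths give the same numbers; the multi-particle form simply exposes the whole data set to the callee in one call.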
Slide4
GeantV concurrency: static thread approach
Drawbacks:
thread “awareness” (thread id)
hard to connect to task-based frameworks
[Diagram labels: SCHEDULER; Input queue; Propagator; Basketizer; Baskets; Geometry Filters (Vol 1, Vol 2, Vol 3, … Vol n); (Fast)Physics Filters (e+/e-, γ); Vector stepper (Step sampling, Filter neutrals, (Field) Propagator, Step limiter, reshuffle); VecGeom navigator (Full geometry, Simplified geometry); Physics sampler (Phys. Process post-step, Secondaries); TO SCHEDULER; Coproc. broker; Outputs; MT vector/scalar processing]
Slide5
Scheduler features
Efficient concurrent basketizers
Filtering tracks by several possible locality criteria
Giving “reasonable” size vectors all along the simulation
Provide scalable & balanced workload
Minimize memory footprint
Minimize cool-down phase (tails)
Adaptive behavior to maximize performance
Dynamic switch of vector/scalar processing
Learning dynamically the “important” filters
Adjust dynamically event slots to control memory
Accommodate additional concurrent processing in the simulation workflow
Hits/digits/MC truth I/O
Digitization/reconstruction tasks
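Filtering tracks by a locality criterion can be sketched as below, assuming the criterion is the volume a track sits in. The class and method names are hypothetical and the sketch is single threaded; it only shows how tracks accumulate per volume until a basket of "reasonable" size can be handed out, and how leftovers are flushed in the cool-down phase.

```python
from collections import defaultdict

class Basketizer:
    """Toy basketizer: groups tracks per volume (hypothetical API)."""

    def __init__(self, basket_size):
        self.basket_size = basket_size
        self.pending = defaultdict(list)   # volume -> tracks waiting

    def add_track(self, volume, track):
        """Store a track; return a full basket for that volume, or None."""
        bucket = self.pending[volume]
        bucket.append(track)
        if len(bucket) >= self.basket_size:
            self.pending[volume] = []      # start a fresh basket
            return bucket
        return None

    def flush(self):
        """Hand out all partially filled baskets (e.g. during cool-down)."""
        baskets = [(v, t) for v, t in self.pending.items() if t]
        self.pending.clear()
        return baskets

basketizer = Basketizer(basket_size=4)
full = []
for track_id in range(10):
    volume = "Vol1" if track_id % 5 else "Vol2"   # tracks 0 and 5 go to Vol2
    basket = basketizer.add_track(volume, track_id)
    if basket is not None:
        full.append((volume, basket))
leftovers = basketizer.flush()
```

A concurrent version would need the per-volume containers to be protected or lock-free, which is where the contention discussed later comes from.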
Slide6
SOA data handling challenges
What we want before doing work on data: use the SOA arrays (…, fEventV, fParticleV, …) directly
What we need to do: compact, move, reshuffle and basketize the SOA arrays (…, fEventV, fParticleV, …)
… single threaded
… concurrently
[Diagram labels: track containers A, B, C; operations Use, Compact, Move, Reshuffle, Basketize; track categories charged, neutral, crossing]
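The "compact" operation on a structure of arrays can be sketched as follows. The field names fEventV and fParticleV come from the slide; the `alive` flag and all method names are assumptions made for illustration only.

```python
class TrackSOA:
    """Structure of arrays: one list per field, same index = same track."""

    def __init__(self):
        self.fEventV = []      # event number of each track
        self.fParticleV = []   # particle index of each track
        self.alive = []        # hypothetical flag used to mark holes

    def add(self, event, particle):
        self.fEventV.append(event)
        self.fParticleV.append(particle)
        self.alive.append(True)

    def remove(self, i):
        self.alive[i] = False  # leaves a hole, the data stay in place

    def compact(self):
        """Move live tracks to the front so every array is contiguous again."""
        keep = [i for i, ok in enumerate(self.alive) if ok]
        self.fEventV = [self.fEventV[i] for i in keep]
        self.fParticleV = [self.fParticleV[i] for i in keep]
        self.alive = [True] * len(keep)
        return len(keep)

tracks = TrackSOA()
for event, particle in [(0, 10), (0, 11), (1, 20), (1, 21), (2, 30)]:
    tracks.add(event, particle)
tracks.remove(1)
tracks.remove(3)
n_live = tracks.compact()
```

Every field has to be moved in lockstep, which is why these operations are cheap to describe but costly to perform concurrently.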
Slide7
And the price to pay…
[Plot: run-time fraction spent in different parts of GeantV; axis: fraction of run time; labelled component: Geometry. 24-core dual socket E5-2695 v2 @ 2.40GHz (IVB), Hyperthreading]
The observed contention led us to organize tracks in a different way, in order to reduce the concurrency cost
Slide8
Concurrent services: queues
We use concurrent queues for aggregating/balancing the workload between workers
Several mutex-based/lock-free implementations evaluated
GeantV queues can work at ~10^5 transactions/sec
Lock-free queues do very well on MacOSX + clang compared to mutex-based ones (50x factor!)
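A minimal sketch of work balancing through a concurrent queue, using Python's standard `queue.Queue`, which is a lock-based implementation. It stands in for the mutex-based variant only; the standard library has no lock-free queue, and the worker body is a placeholder rather than real transport.

```python
import queue
import threading

def run_workers(baskets, n_workers):
    """Let n_workers threads pull baskets from a shared FIFO queue."""
    work = queue.Queue()            # thread-safe, lock-based FIFO
    done = queue.Queue()
    for basket in baskets:
        work.put(basket)
    for _ in range(n_workers):
        work.put(None)              # one stop marker per worker

    def worker():
        while True:
            basket = work.get()
            if basket is None:
                break
            done.put(len(basket))   # placeholder for transporting the basket

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    results = []
    while not done.empty():
        results.append(done.get())
    return results

baskets = [list(range(n)) for n in (3, 5, 2, 8, 1, 4)]
transported = run_workers(baskets, n_workers=3)
```

Idle workers take the next basket as soon as they are free, so the load balances itself without any static assignment; the order of the results depends on thread scheduling.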
Slide9
Monitoring to understand behavior
Implemented real-time monitoring tools based on ROOT
Very useful to understand model behavior
Triggered improvement/evolution of the model
Some parameters can be really adaptive
Such as “important” volumes that can feed vectors
Basketizing only 10% of volumes in CMS leads to 70% of transport done in vector mode
Switching the fraction of vector/scalar work has impact on performance in a large range
* FixedShield102880: 708955 steps
* HVQX8780: 462821 steps
* ZDC_EMLayer9b00: 83838 steps
* BeamTube22b780: 78748 steps
* OQUA6780: 62597 steps
* QuadInner3300: 56376 steps
* ZDC_EMAbsorber9d00: 53672 steps
* QuadOuter3700: 52155 steps
* QuadCoil3680: 49086 steps
* ZDC_EMFiber9e80: 41705 steps
[Plots: real time snapshots; frequency of baskets vs. basket size (scalar, vector); #tracks transported]
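Selecting the "important" volumes can be sketched as a ranking by step count. The step counts are the ten values listed on the slide; the selection function and its parameters are hypothetical. Since only ten volumes are listed, the resulting share is not expected to reproduce the 70% figure quoted for the full CMS geometry.

```python
steps_per_volume = {
    "FixedShield102880": 708955,
    "HVQX8780": 462821,
    "ZDC_EMLayer9b00": 83838,
    "BeamTube22b780": 78748,
    "OQUA6780": 62597,
    "QuadInner3300": 56376,
    "ZDC_EMAbsorber9d00": 53672,
    "QuadOuter3700": 52155,
    "QuadCoil3680": 49086,
    "ZDC_EMFiber9e80": 41705,
}

def important_volumes(steps, fraction):
    """Return the most visited `fraction` of volumes and their share of steps."""
    ranked = sorted(steps, key=steps.get, reverse=True)
    n_keep = max(1, round(fraction * len(ranked)))
    selected = ranked[:n_keep]
    share = sum(steps[v] for v in selected) / sum(steps.values())
    return selected, share

# Basketize only the top 10% of the listed volumes.
selected, share = important_volumes(steps_per_volume, fraction=0.1)
```

The same ranking can be refreshed periodically at run time, which is what makes the parameter adaptive.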
Slide10
Memory control
Memory determined by number of tracks “in flight”
Determined by number of events “in flight”
Controlling the memory is important for low production cuts
Number of secondary particles can explode
Currently implemented a policy to delete empty baskets when reaching a memory watermark
Not fully effective, but keeping the memory constant
Extra levers (future work):
Reducing dynamically the number of events in flight (possible with new event server)
Back burner (waiting queue) for high energy tracks
[Plot: queued baskets, memory and tracks in flight]
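The watermark policy can be sketched as below. The memory model (a fixed number of bytes per basket slot) and all names are assumptions for illustration; only the rule "delete empty baskets once a memory watermark is reached" comes from the slide.

```python
def clean_baskets(baskets, bytes_per_slot, watermark):
    """Drop empty baskets once allocated memory exceeds the watermark.

    `baskets` is a list of (capacity, n_tracks) pairs (hypothetical model).
    Returns the surviving baskets and the memory they occupy.
    """
    allocated = sum(cap for cap, _ in baskets) * bytes_per_slot
    if allocated <= watermark:
        return baskets, allocated          # below the watermark: do nothing
    kept = [(cap, n) for cap, n in baskets if n > 0]
    return kept, sum(cap for cap, _ in kept) * bytes_per_slot

pool = [(16, 0), (16, 12), (16, 0), (16, 3), (16, 0)]
kept, memory = clean_baskets(pool, bytes_per_slot=200, watermark=10_000)
```

Baskets that still hold tracks are never dropped, so the policy caps memory only as far as empty baskets account for it, consistent with the remark that it is not fully effective.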
Slide11
Scheduler “knobs”
Keep memory under control
Limiting number of buffered events
Prioritizing events “mostly” transported
Using watermark limit to clean baskets
Keep the vectors up
Optimize vector size
Too large: too many pending baskets
Too small: inefficient vectorization
Trigger postponing tracks or tracking with scalar algorithms
Also adjust the basket size dynamically
Popularity service: basketize only the “important” volumes
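The switch between vector and scalar algorithms can be sketched as a dispatch on the basket population. The threshold name and the two transport functions are hypothetical placeholders, not GeantV code.

```python
def transport_scalar(track):
    """Placeholder for a one-track propagation step."""
    return track + 1

def transport_vector(tracks):
    """Placeholder for a vectorized step over a whole basket."""
    return [t + 1 for t in tracks]

def transport_basket(tracks, min_vector_size):
    """Pick the processing mode from the number of tracks in the basket."""
    if len(tracks) >= min_vector_size:
        return "vector", transport_vector(tracks)
    return "scalar", [transport_scalar(t) for t in tracks]

mode_big, out_big = transport_basket([1, 2, 3, 4, 5, 6, 7, 8], min_vector_size=4)
mode_small, out_small = transport_basket([1, 2], min_vector_size=4)
```

Making `min_vector_size` a run-time value is what would turn it into a tuning knob.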
Slide12
Optimization of scheduling parameters
Depends on what needs to be optimized
E.g. memory vs. computing time
A multivariate problem, probably too early to optimize
Development is iterative with short cycles
A genetic algorithm approach has started to be investigated
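A genetic search over two scheduling parameters might look like the sketch below. The cost function is an invented toy mixing a run-time term and a memory term, and the parameter names are hypothetical; nothing here reflects the actual GeantV study, only the general technique (elitist selection, crossover, mutation).

```python
import random

def cost(basket_size, events_in_flight):
    """Invented toy cost: run time plus a memory penalty."""
    run_time = 100.0 / basket_size + 0.5 * basket_size + 50.0 / events_in_flight
    memory = 0.2 * basket_size * events_in_flight
    return run_time + memory

def evolve(generations, population_size, seed):
    rng = random.Random(seed)
    population = [(rng.uniform(1, 64), rng.uniform(1, 32))
                  for _ in range(population_size)]
    history = []                                       # best cost per generation
    for _ in range(generations):
        population.sort(key=lambda p: cost(*p))
        history.append(cost(*population[0]))
        parents = population[: population_size // 2]   # elitist selection
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.choice(parents), rng.choice(parents)
            child = (a[0], b[1])                       # crossover of the two genes
            # Mutation: scale each gene by up to 10%, keeping it >= 1.
            child = tuple(max(1.0, g * rng.uniform(0.9, 1.1)) for g in child)
            children.append(child)
        population = parents + children
    best = min(population, key=lambda p: cost(*p))
    return best, history

best, history = evolve(generations=30, population_size=20, seed=1)
```

Because the best half of each generation is carried over unchanged, the best cost can never get worse from one generation to the next.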
Slide13
Optimizations for “dense” physics: reusing tracks
If interacting in the current step, no need to re-basketize (same volume)
Recycle the input basket
Large gain for dense physics
Normally the basketizer becomes fully blocking
Large part of tracks can be reused in the same thread to release load
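Track reuse can be sketched as a split of the basket after each step. The `crossed` set is a hypothetical stand-in for what the navigator would report; tracks that stay in the same volume are recycled in place and only the others go back to the basketizer.

```python
def step_basket(tracks, crossed):
    """Split a transported basket after one step.

    `tracks` are track ids from one volume; `crossed` is the set of ids that
    left the volume during the step (hypothetical navigator output).
    """
    reuse = [t for t in tracks if t not in crossed]       # same volume: recycle
    rebasketize = [t for t in tracks if t in crossed]     # send to basketizer
    return reuse, rebasketize

basket = [0, 1, 2, 3, 4, 5, 6, 7]
reuse, rebasketize = step_basket(basket, crossed={2, 5})
```

In this example six of the eight tracks never touch the shared basketizer, which is where the reduced load comes from.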
Slide14
New development: Integration with task-based experiment frameworks
Some experiments (e.g. CMS) adopted task-based frameworks
Integrating GeantV in a task-based workflow is very important (and now possible)
Several scenarios invoking GeantV as a task are possible, e.g.:
Event Generator/Reader
Particle filter (type, region, energy, …)
Full/fast simulation (GeantV)
Digitization (MC truth + det. response)
Tracking/Reconstruction
[Diagram: the stages above shown inside the experiment framework]
Slide15
GeantV flow of work in the task approach
Initial task: top level task spawning a “branch” in TBB tree of tasks
Basketizer(s): concurrent service, injects full baskets
Basket queue: concurrent service
Transport task: transports one basket for one step
Transport task may be further split into subtasks
Flow control task: event finished? queue empty? memory threshold?
Flush baskets/prioritize events task (Prioritizer), command: dump all your baskets
Reuse tracks keeping locality
Output: transported tracks
User tasks (0..n): User Digitizers tasks, User scoring, I/O task
[Diagram labels: Framework; EventServer; arrows: inject event, input, enqueue basket, inspect, output]
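The loop formed by the transport task and the flow control task can be sketched in a single thread as below. The data model (each track is a pair of event id and remaining steps) and every function name are assumptions; TBB is not used, and the sketch only follows the rule that a transport task moves one basket by one step while flow control checks "queue empty?" and "event finished?".

```python
from collections import deque

def transport_task(basket):
    """Transport one basket for one step; tracks with no steps left stop."""
    return [(event, steps - 1) for event, steps in basket if steps > 1]

def run(events, basket_size):
    """`events` maps event id -> list of per-track step counts (toy input)."""
    basket_queue = deque()
    in_flight = {}                                   # event id -> live tracks
    for event, tracks in events.items():             # inject event
        in_flight[event] = len(tracks)
        items = [(event, s) for s in tracks]
        for i in range(0, len(items), basket_size):
            basket_queue.append(items[i:i + basket_size])
    finished = []
    while basket_queue:                              # flow control: queue empty?
        basket = basket_queue.popleft()
        survivors = transport_task(basket)
        for event, _ in basket:
            in_flight[event] -= 1
        for event, _ in survivors:
            in_flight[event] += 1
        if survivors:
            basket_queue.append(survivors)           # enqueue basket
        for event in sorted({e for e, _ in basket}):
            if in_flight[event] == 0:                # flow control: event finished?
                finished.append(event)
    return finished

order = run({0: [1, 2], 1: [3], 2: [1]}, basket_size=2)
```

Events finish in the order their last track stops, not in the order they were injected, which is why finished events need explicit bookkeeping.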
Slide16
Connection to user framework
User framework (e.g. running concurrent event reader tasks)
Prepare events to be picked up (using PrimaryGenerator interface)
StartSimulationTask: configure GeantV and call initialization task
Keep <N> chains of GeantV simulation tasks
Pull baskets from event server/work queue
UserRunSim: Geant::ProcessEvents
UserEndEventTask: post-event task (e.g. steering digitizers)
UserEndRunTask: post-simulation task (e.g. steering reco)
[Diagram labels: User Framework Initialization; UserReadEventsTask #1, #2, #3; UserGenerator #1, #2, #3, … #n->NextEvent(); GeantEventServer; Events; InitTask; InitialTask; TransportTask; FlowController; ongoing R&D; event finished?; Run finished?]
Slide17
Preliminary TBB results
A first implementation of a task-based approach for GeantV using TBB was deployed.
Some overheads on Haswell/AVX2, not so obvious on KNL/AVX512
Less than 20% performance loss for the first implementation
[Plots: AVX2 (Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 2 sockets x 8 physical cores) and KNL/AVX512]
Slide18
Topology-aware GeantV
Replicate schedulers on NUMA clusters
One basketizer per NUMA node
libhwloc to detect topology
Possible to use pinning/NUMA allocators to increase locality
Multi-propagator mode running one/more clusters per quadrant
Loose communication between NUMA nodes at basketizing step
Implemented, currently being integrated
[Diagram: four clusters (0 to 3), each with Tracks, Transport, Basketizer and Scheduler, plus a Global basketizer]
Slide19
Handling sub-node clustering
Known scalability issues of full GeantV due to synchronization in re-basketizing
New approach deploying several propagators clustering resources at sub-node level
Objectives: improved scalability at the scale of KNL and beyond, address both many-node and multi-socket (HPC) modes + non-homogeneous resources
Implemented recently
[Diagram labels: GeantV run manager; GeantV propagators (…), each with Scheduler and Basketizer; NUMA discovery service (libhwloc); node with sockets; clusters 0 to 3, each with Tracks, Transport, Basketizer and Scheduler; Global basketizer]
Slide20
Multi-propagator performance
First version revealed a bottleneck in event fetching
Triggered the development of the event server
Scalability gets better by increasing number of propagators
Not final results, still fixing/optimizing
New version still has a bug in basketizing
[Plots: x-axis #ncores, one for KNL and one for XEON]
Slide21
GeantV plans for HPC environments
Standard mode (1 independent process per node)
Always possible, no-brainer
Possible issues with work balancing (events take different time)
Possible issues with output granularity (merging may be required)
Multi-tier mode (event servers)
Useful to work with events from file, to handle merging and workload balancing
Communication with event servers via MPI to get event IDs in common files
[Diagram (future R&D): Event feeder on Node 1 and Node 2, Event server on Node mod[N], each node running Transport on Numa 0 and Numa 1; Merging service; links labelled MPI]
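The work-balancing issue of the standard mode, and what an event server could change, can be illustrated with a small scheduling model. The event durations are invented and no MPI is involved; the sketch only compares events pre-assigned to nodes with events pulled from a server whenever a node becomes idle.

```python
import heapq

def static_makespan(event_times, n_nodes):
    """Events pre-assigned round-robin: node i gets events i, i+n, ..."""
    return max(sum(event_times[i::n_nodes]) for i in range(n_nodes))

def served_makespan(event_times, n_nodes):
    """Each node asks the event server for the next event when it is idle."""
    free_at = [0.0] * n_nodes           # time at which each node becomes idle
    heapq.heapify(free_at)
    for t in event_times:
        start = heapq.heappop(free_at)  # earliest idle node takes the event
        heapq.heappush(free_at, start + t)
    return max(free_at)

# Invented durations: a few long events mixed with short ones.
event_times = [9.0, 1.0, 8.0, 1.0, 7.0, 1.0, 1.0, 1.0]
static_time = static_makespan(event_times, n_nodes=2)
served_time = served_makespan(event_times, n_nodes=2)
```

With this input the pre-assigned split leaves one node with all the long events, while pulling events on demand shortens the total time; real gains depend on the actual spread of event times and on the cost of the communication.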
Slide22
Performance measurements for LHC setups: test matrix
Scheduler | Geometry | Physics | Magnetic Field Stepper
Geant4 only | Legacy G4 | Various Physics Lists | Various RK implementations
Geant4 or GeantV | VecGeom 2016 scalar | Tabulated Physics; Scalar Physics Code | Helix (Fixed Field); Cash-Karp Runge-Kutta
GeantV only | VecGeom 2015; VecGeom 2016 vector; Legacy TGeo | Vector Physics Code | Vectorized RK Implementation
[Annotations: semantic changes 1, 2, 3]
Slide23
Validation and performance for LHC setups
Exercise at the scale of LHC experiments (CMS & LHCb)
Full geometry converted to VecGeom + uniform magnetic field
Tabulated physics, fixed 1MeV
Measuring several cumulative observables in sensitive detectors
Energy deposit and particle flux densities for p, π, K
Comparing GeantV single threaded with the corresponding Geant4 application
Geant4.10.2, special physics list using tabulated physics
Comparable signal, number of secondaries, total steps and physics steps within statistical fluctuations.
T_G4/T_GV = 2.5
T_G4/T_GV = 3.5
Speed-up due to:
1.5 - Infrastructure optimizations
2.4 - Algorithmic improvements in geometry
3.5 - Extra locality/marginal basket vectorization
To be profiled
Slide24
Future work
SOA->AOS integration
Tuning for many-core
R&D and testing in HPC environments
Adapting to new architectures (Power8)
Integration with physics and optimization: R-K propagator and multiple scattering
Slide25
Conclusions
GeantV core already delivering a part of the hoped-for performance
Many optimization requirements, now understanding how to handle most of them
More performance to be extracted from vectorization soon
Additional levels of locality (NUMA) available in modern HW
Topology detection available in GeantV, currently being integrated
Integration with task-based HEP frameworks now possible
A TBB-enabled GeantV version ready
Studying more efficient use of HPC resources
Using a multi-tier approach for better workload balancing
Very promising results in complex applications
Gains from infrastructure simplification, geometry and locality/vectorization