Slide1
New Track Seeding Techniques at the CMS experiment
Felice Pantaleo
CERN
felice@cern.ch
Slide2
Overview
Motivations
Heterogeneous Computing
Track seeding on GPUs during Run-3
Run-2 Track Seeding
Online
Offline
Conclusion
Slide3
CMS High-Level Trigger in 2016 (1/2)
Today the CMS online farm consists of ~22k Intel Xeon cores
The current approach: one event per logical core
Pixel Tracks are not reconstructed for all the events at the HLT
This will be even more difficult at higher pile-up
Combinatorial time in pixel seeding: O(m!) in the worst case
More memory/event
Slide4
CMS High-Level Trigger in 2016 (2/2)
Full track reconstruction and particle flow, e.g. jets, tau
Today the CMS online farm consists of ~22k Intel Xeon cores
The current approach: one event per logical core
Pixel Tracks are not reconstructed for all the events at the HLT
This will be even more difficult at higher pile-up
Combinatorial time in pixel seeding: O(m!) in the worst case
More memory/event
Slide5
CMS and LHC Upgrade Schedule
CMS pixel detector upgrade
Slide6
Terminology - CMS Silicon Tracker
Online: Pixel-only tracks used for fast tracking and vertexing
Offline: Pixel tracks are used as seeds for the Kalman filter in the strip detector
Slide7
Phase 1 Pixel detector (1/2)
The already complex online and offline track reconstruction will have to deal not only with a much more crowded environment but also with data coming from a more complex detector.
Slide8
Phase 1 Pixel detector (2/2)
The already complex online and offline track reconstruction will have to deal not only with a much more crowded environment but also with data coming from a more complex detector.
Slide9
Tracking at HLT
Pixel hits are used for pixel tracks, vertices, seeding
HLT Iterative tracking:
Iteration name   Phase0 Seeds   Phase1 Seeds   Target Tracks
Pixel Tracks     triplets       quadruplets
Iter0            Pixel Tracks   Pixel Tracks   Prompt, high pT
Iter1            triplets       quadruplets    Prompt, low pT
Iter2            doublets       triplets       High pT, recovery
Slide10
Pixel Tracks
The combinatorial complexity of Pixel Track evaluation could easily be dominated by track density and become the bottleneck of the High-Level Trigger and offline reconstruction execution times.
The CMS HLT farm and its offline computing infrastructure can no longer rely on the exponential growth of clock frequency guaranteed by the manufacturers.
Hardware and algorithmic solutions have been studied
Slide11
Heterogeneous Computing
Slide12
CPU vs GPU architectures
CPU
GPU
Slide13
CPU vs GPU architectures
CPU
Large caches (turn slow memory accesses into quick cache accesses)
Powerful ALUs
Low bandwidth to memory (tens of GB/s)
In CMS:
One event per thread
Rested on our laurels, thanks to the independence of events
HLT/Reco limited by I/O
Memory footprint is an issue
Slide14
CPU vs GPU architectures
GPU
Many Streaming Multiprocessors execute kernels (aka functions) using hundreds of threads concurrently
High bandwidth to memory (up to 1 TB/s)
Number of threads in flight increases
In CMS:
Cannot make threads work on different events due to the SIMT architecture
Handle heterogeneity to assign each job to the best match
Long-term solution: unroll and offload combinatorics to many threads
Slide15
Track seeding on GPUs during Run-3
Slide16
From RAW to Tracks during Run-3
Profit from the end-of-year upgrade of the Pixel to redesign the seeding code
Exploiting the information coming from the 4th layer would improve efficiency, b-tag, IP resolution
Trigger avg latency should stay within 220 ms
Reproducibility of the results (bit-by-bit equivalence CPU-GPU)
Integration in the CMS software framework
Ingredients:
Massive parallelism within the event
Independence from thread ordering in algorithms
Avoid useless data transfers and transformations
Simple data formats optimized for parallel memory access
Result:
A GPU-based application that takes RAW data and gives Tracks as result
Slide17
Algorithm Stack
Raw to Digi
Hits - Pixel Clusterizer
Hit Pairs
Ntuplets - Cellular Automaton
Input, size linear with PU
Output, size ~linear with PU
Slide18
Cellular Automaton (CA)
The CA is a track seeding algorithm designed for parallel architectures
It requires a list of layers and their pairings
A graph of all the possible connections between layers is created
Doublets aka Cells are created for each pair of layers (compatible with a region hypothesis)
Fast computation of the compatibility between two connected cells
No knowledge of the world outside adjacent neighboring cells required, making it easy to parallelize
Slide19
CAGraph of seeding layers
Seeding layers interconnections
Hit doublets for each layer pair can be computed independently by sets of threads
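A minimal sequential sketch of doublet creation per layer pair, written in Python rather than as GPU code. The layer names, the `(r, z)` hit representation, and the `points_to_beamline` cut are illustrative assumptions standing in for the region hypothesis; none of them come from the CMS software.

```python
from itertools import product


def points_to_beamline(h_in, h_out, max_z0=15.0):
    """Stand-in region cut (an assumption, not the CMS criterion).

    The straight line through both hits in the R-z plane must cross
    r = 0 within |z| < max_z0. Hits are (r, z) tuples on layers at
    different radii.
    """
    (r1, z1), (r2, z2) = h_in, h_out
    z0 = z1 - r1 * (z2 - z1) / (r2 - r1)
    return abs(z0) < max_z0


def make_doublets(hits_by_layer, layer_pair, is_compatible):
    """Return every (inner_hit, outer_hit) doublet, aka cell, for one layer pair."""
    inner, outer = layer_pair
    return [
        (h_in, h_out)
        for h_in, h_out in product(hits_by_layer[inner], hits_by_layer[outer])
        if is_compatible(h_in, h_out)
    ]


def build_cells(hits_by_layer, layer_pairs, is_compatible=points_to_beamline):
    # Each layer pair reads only its own two hit collections, so the
    # iterations are independent of one another and could be handed to
    # separate sets of threads.
    return {
        pair: make_doublets(hits_by_layer, pair, is_compatible)
        for pair in layer_pairs
    }
```

Because no layer pair depends on the result of another, the dictionary comprehension in `build_cells` is the part that maps naturally onto independent sets of threads.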
Slide20
CA: R-z plane compatibility
The compatibility between two cells is checked only if they share one hit
AB and BC share hit B
In the R-z plane a requirement is alignment of the two cells:
There is a maximum allowed misalignment, which depends on the minimum value of the momentum range that we would like to explore
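One possible way to express the alignment requirement is sketched below. It assumes that the bounded quantity is the angle between the segments AB and BC in the R-z plane. The source does not say how that bound follows from the minimum momentum, so `max_angle` is left as a free, hypothetical parameter.

```python
import math


def rz_aligned(hit_a, hit_b, hit_c, max_angle):
    """True if cells AB and BC, sharing hit B, are aligned in the R-z plane.

    Hits are (r, z) tuples. The angle between the two segments must not
    exceed max_angle (radians); the angle criterion is an assumption.
    """
    (ra, za), (rb, zb), (rc, zc) = hit_a, hit_b, hit_c
    ab = (rb - ra, zb - za)
    bc = (rc - rb, zc - zb)
    cross = ab[0] * bc[1] - ab[1] * bc[0]
    dot = ab[0] * bc[0] + ab[1] * bc[1]
    # atan2(cross, dot) is the signed angle from AB to BC.
    return abs(math.atan2(cross, dot)) <= max_angle
```

A lower momentum threshold would correspond to a larger `max_angle`, since softer tracks are allowed to deviate more from a straight line.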
Slide21
CA: x-y plane compatibility
In the transverse plane, the intersection between the circle passing through the hits forming the two cells and the beamspot is checked:
They intersect if the distance between the centers d(C,C') satisfies:
$r' - r < d(C, C') < r' + r$
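The intersection condition can be sketched as follows. The sketch reads r' as the radius of the circle through the three hits (center C') and r as the beamspot radius (center C). The function names and the `(x, y)` hit representation are illustrative assumptions.

```python
import math


def circle_through(p1, p2, p3):
    """Center and radius of the circle through three non-collinear (x, y) points."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d = 2.0 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if d == 0.0:
        raise ValueError("hits are collinear: no finite circle")
    s1 = x1 * x1 + y1 * y1
    s2 = x2 * x2 + y2 * y2
    s3 = x3 * x3 + y3 * y3
    cx = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d
    cy = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d
    return (cx, cy), math.hypot(x1 - cx, y1 - cy)


def crosses_beamspot(hits, beamspot_center, beamspot_radius):
    """Check r' - r < d(C, C') < r' + r for the circle through three hits."""
    center, r_track = circle_through(*hits)
    dist = math.hypot(center[0] - beamspot_center[0],
                      center[1] - beamspot_center[1])
    return r_track - beamspot_radius < dist < r_track + beamspot_radius
```

The two cells AB and BC contribute three distinct hits, which is exactly what is needed to define the circle.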
Since it is an Out-In propagation, a tolerance is added to the beamspot radius (in red)
One could also ask for a minimum value of transverse momentum and reject low values of r'
Slide22
Cells Connection
blockIdx.x and threadIdx.x = Cell id in a LayerPair
blockIdx.y = LayerPairIndex [0,13)
Each cell asks its innermost hits for cells to check compatibility with.
Slide23
Evolution
If two cells satisfy all the compatibility requirements they are said to be neighbors and their state is set to 0
In the evolution stage, their state increases in discrete generations if there is an outer neighbor with the same state
At the end of the evolution stage the state of the cells will contain the information about the length
If one is interested in quadruplets, there will surely be one starting from a state 2 cell, pentuplets state 3, etc.
Slide24
[Figure panels: T=0, T=1, T=2]
Slide25
Quadruplets finding
blockIdx.x and threadIdx.x = Cell id in an outermost LayerPair
blockIdx.y = LayerPairIndex in ExternalLayerPairs
Each cell on an outermost layer pair will perform a DFS of depth = 4 following inner neighbors.
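The evolution stage and the depth-first search can be sketched sequentially in a few lines of Python. This is a toy model under stated assumptions, not the CMSSW implementation. Any two cells that share a hit are treated as neighbors (the compatibility checks are omitted), hits are plain labels, "depth = 4" is read as four hits, and all function names are hypothetical.

```python
def link_cells(cells):
    """Map each cell index to its inner and outer neighbors.

    A cell is an (inner_hit, outer_hit) pair; cell j is an outer
    neighbor of cell i when it starts on the hit where i ends.
    """
    inner = {i: [] for i in range(len(cells))}
    outer = {i: [] for i in range(len(cells))}
    for i, (_, out_hit) in enumerate(cells):
        for j, (in_hit, _) in enumerate(cells):
            if out_hit == in_hit:
                outer[i].append(j)
                inner[j].append(i)
    return inner, outer


def evolve(cells, outer, generations):
    """Raise a cell's state when an outer neighbor has the same state."""
    state = [0] * len(cells)
    for _ in range(generations):
        # Synchronous update: every cell reads the previous generation.
        state = [
            s + 1 if any(state[j] == s for j in outer[i]) else s
            for i, s in enumerate(state)
        ]
    return state


def find_ntuplets(cells, inner, start_cells, n_hits=4):
    """Depth-first search from outermost cells, following inner neighbors."""
    found = []

    def dfs(path):
        if len(path) == n_hits - 1:  # n_hits hits span n_hits - 1 cells
            chain = list(reversed(path))
            hits = [cells[chain[0]][0]] + [cells[c][1] for c in chain]
            found.append(tuple(hits))
            return
        for neighbor in inner[path[-1]]:
            dfs(path + [neighbor])

    for cell in start_cells:
        dfs([cell])
    return found
```

In this toy model a cell's final state counts how many cells can follow it outward, so a chain of three cells (a quadruplet) leaves its innermost cell in state 2.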
Slide26
Triplet Propagation
Propagate 1-2-3 triplet to 4th layer and search for compatible hits
Natural continuation of the current approach from pairs to triplets
Slide27
Simulated Physics Performance PixelTracks
CA tuned to have same efficiency as Triplet Propagation
Efficiency significantly larger than 2016, especially in the forward region (|η|>1.5).
Slide28
Simulated Physics Performance PixelTracks
Fake rate up to 40% lower than Triplet Propagation
Two orders of magnitude lower than 2016 tracking thanks to the higher purity of quadruplets with respect to triplets
Slide29
Timing
The physics performance and timing justified the integration of the Cellular Automaton in its sequential implementation at the HLT already in 2017
Hardware: Intel Core i7-4771 @ 3.5 GHz, NVIDIA GTX 1080
Algorithm                 Time per event [ms]
2016 Pixel Tracks         29.3 ± 13.1
Triplet Propagation       72.1 ± 25.7
GPU Cellular Automaton    1.2 ± 0.9
CPU Cellular Automaton    14 ± 6.2
Slide30
Cellular Automaton @ Run-2 Offline Track Seeding
Slide31
CA in offline tracking
The performance of the sequential Cellular Automaton at the HLT justified its integration also in the 2017 offline iterative tracking
Slide32
CA in offline tracking
The performance of the sequential Cellular Automaton at the HLT justified its integration also in the 2017 offline iterative tracking
Slide33
CA Physics performance vs 2016
Reconstruction efficiency increased, especially in the forward region.
Fake rate significantly reduced in the entire pseudo-rapidity range
Slide34
CA Physics performance vs conventional
Overall track reconstruction efficiencies are very similar
Fake rate lower by up to about 40% when transitioning from the barrel to the forward region (|η|>1.5).
Slide35
CA Physics performance vs PU
Efficiency slightly affected by the increasing pile-up
More fake tracks pass the track selection and their hits get removed from the pool of available hits
Fake rate: ~quadratic dependence on PU
Slide36
Timing vs PU
CA track seeding at the same level as the 2016 seeding
More robust, smaller complexity vs PU than the 2016 track seeding, despite the increased number of layer combinations involved in the seeding phase
In pattern recognition, the CA seeding brings no additional gain
~20% faster track reconstruction with respect to 2016 tracking at avg PU70
Slide37
Conclusion
Pixel Track seeding algorithms have been redesigned with high-throughput parallel architectures in mind
Improvements in performance may come even when running sequentially
Factors at the HLT, tens of % in the offline, depending on the fraction of the code that uses the new algorithms
CA algorithms with additional graph knowledge capability are very powerful
By adding more Graph Theory sugar, steal some work from the track building and become more flexible
The GPU and CPU algorithms run in CMSSW and produce the same bit-by-bit result
Transition to GPUs@HLT during Run-3 smoother
Running Pixel Tracking at the CMS HLT for every event would become cheap @PU ~ 50-70
Integration in the CMS High-Level Trigger farm under study
Slide38
Backup
felice@cern.ch
Slide39
Impact parameters resolution
The 2017 detector shows better performance than 2016 over the whole η spectrum.
Slide40
Transverse momentum resolution
The performance in 2016 and 2017 is comparable.