Slide 1: Computing at the HL-LHC
Predrag Buncic, on behalf of the Trigger/DAQ/Offline/Computing Preparatory Group
ALICE: Pierre Vande Vyvre, Thorsten Kollegger, Predrag Buncic; ATLAS: David Rousseau, Benedetto Gorini, Nikos Konstantinidis; CMS: Wesley Smith, Christoph Schwick, Ian Fisk, Peter Elmer; LHCb: Renaud Legac, Niko Neufeld
Slide 2: Computing @ HL-LHC
- CPU needs (per event) will grow with track multiplicity (pileup)
- Storage needs are proportional to accumulated luminosity
- Resource estimates
- The probable evolution of the distributed computing technologies
Slide 3: LHCb & ALICE @ Run 3
[Data-flow diagram] LHCb: 40 MHz bunch crossing rate, readout at 40 MHz, software trigger processing 5-40 MHz, output 20 kHz (0.1 MB/event), peak output 2 GB/s to storage. ALICE: continuous readout at 50 kHz, reconstruction + compression, output 50 kHz (1.5 MB/event), peak output 75 GB/s to storage.

Slide 4: ATLAS & CMS @ Run 4
[Data-flow diagram] Level 1 trigger followed by HLT. ATLAS: 5-10 kHz (2 MB/event), peak output 10-20 GB/s to storage. CMS: 10 kHz (4 MB/event), peak output 40 GB/s to storage.
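The peak output figures follow directly from trigger rate × event size; a minimal sanity check of the numbers in the diagrams above:

```python
# Peak HLT output bandwidth = trigger rate x event size.
# Rates and event sizes are the figures quoted in the diagrams above.
experiments = {
    "LHCb  (Run 3)": (20_000, 0.1),   # 20 kHz, 0.1 MB/event
    "ALICE (Run 3)": (50_000, 1.5),   # 50 kHz, 1.5 MB/event
    "ATLAS (Run 4)": (10_000, 2.0),   # upper end of 5-10 kHz, 2 MB/event
    "CMS   (Run 4)": (10_000, 4.0),   # 10 kHz, 4 MB/event
}

for name, (rate_hz, size_mb) in experiments.items():
    gb_per_s = rate_hz * size_mb / 1000.0   # MB/s -> GB/s
    print(f"{name}: {gb_per_s:.0f} GB/s peak output")
# -> LHCb 2 GB/s, ALICE 75 GB/s, ATLAS 20 GB/s, CMS 40 GB/s
```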
Slide 5: Big Data
"Data that exceeds the boundaries and sizes of normal processing capabilities, forcing you to take a non-traditional approach"
[Chart: data size vs. access rate; the region where standard data processing is applicable, and beyond it, BIG DATA]
Slide 6: How do we score today?
[Chart: RAW + derived data volume in PB @ Run 1]
Slide 7: Data: Outlook for HL-LHC
Very rough estimate of new RAW data per year of running, using a simple extrapolation of current data volume scaled by the output rates. To be added: derived data (ESD, AOD), simulation, user data…
[Chart: projected RAW volume per year, in PB]
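A back-of-the-envelope version of that extrapolation: yearly RAW volume is trigger rate × event size × live seconds. The number of live seconds per year is an assumption here (~10^7 s); the slide's real estimate uses the experiments' own running assumptions.

```python
# Rough RAW volume per year = trigger rate x event size x live seconds.
# LIVE_SECONDS is an assumed ~10^7 s of data taking per year.
LIVE_SECONDS = 1e7

def raw_pb_per_year(rate_hz: float, event_size_mb: float) -> float:
    """RAW data volume per running year, in PB (1 PB = 1e9 MB)."""
    return rate_hz * event_size_mb * LIVE_SECONDS / 1e9

print(f"ALICE Run 3: {raw_pb_per_year(50e3, 1.5):.0f} PB before compression")
print(f"CMS   Run 4: {raw_pb_per_year(10e3, 4.0):.0f} PB")
```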
Slide 8: Digital Universe Expansion
Slide 9: Data storage issues
- We are heading towards Exabyte scale in storage: big data in the true sense
- Our data problems may still look small on the scale of the storage needs of internet giants: business e-mail, video, music, smartphones and digital cameras generate ever more demand for storage, and the cost will probably continue to fall
- Potential issues:
  - Commodity high-capacity disks may start to look more like tapes, optimized for multimedia storage and sequential access; they will need to be combined with flash memory disks for fast random access
  - The residual cost of disk servers will remain
  - While we might be able to write all this data, how long will it take to read it back? Need for sophisticated parallel I/O and processing.
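The read-back concern in numbers: a sketch of how long one full pass over an exabyte-scale store takes at a given aggregate bandwidth. Both values are illustrative assumptions.

```python
# How long does one full read pass over the data take?
# Both numbers are illustrative assumptions.
store_bytes = 1e18        # ~1 EB of accumulated data
aggregate_bw = 100e9      # 100 GB/s aggregate read bandwidth

days = store_bytes / aggregate_bw / 86400
print(f"~{days:.0f} days for a single sequential pass")   # ~116 days
```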
Slide 10: CPU: Online requirements
- ALICE & LHCb @ Run 3,4: HLT farms for online reconstruction/compression or event rejection, aiming to reduce the data volume to manageable levels (data volume x 10)
  - ALICE estimate: 200k of today's core equivalents, 2M HEP-SPEC-06
  - LHCb estimate: 880k of today's core equivalents, 8M HEP-SPEC-06
  - Looking into alternative platforms to optimize performance and minimize cost: GPUs (1 GPU = 3 CPUs for the ALICE online tracking use case), ARM, Atom…
- ATLAS & CMS @ Run 4: trigger rate x 10, event size x 3, data volume x 30
  - Assuming linear scaling with pileup, a total factor of 50 increase (10M HEP-SPEC-06) in HLT power would be needed wrt. today's CMS farm (see the sketch below)
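The factor 50 is the product of the individual scalings. The per-event CPU factor (~5) is the implied assumption behind "linear scaling with pileup"; the trigger-rate factor comes from the slide.

```python
# ATLAS/CMS @ Run 4: total HLT scaling, as quoted on the slide.
trigger_rate_factor = 10   # trigger rate x 10 (from the slide)
cpu_per_event_factor = 5   # implied assumption: CPU/event grows ~5x with pileup
print(f"HLT power: x{trigger_rate_factor * cpu_per_event_factor} wrt today")
```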
Slide 11: CPU: Outlook for HL-LHC
Very rough estimate of new CPU requirements for online and offline processing per year of data taking, using a simple extrapolation of current requirements scaled by the number of events. Little contingency left; we must work on improving the performance.
[Chart: ONLINE and GRID requirements in MHS06 vs. year, against the Moore's law limit]
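One way to read the "Moore's law limit" line in the chart: compare the required growth with what doubling every ~2 years delivers at flat cost. All numbers here are illustrative assumptions.

```python
# Required CPU growth vs. Moore's-law-style growth at flat budget
# (doubling every ~2 years). All numbers are illustrative assumptions.
years = 10
required_factor = 50
moore_factor = 2 ** (years / 2.0)
print(f"Moore's law: x{moore_factor:.0f}, required: x{required_factor}")
# x32 vs x50: the gap has to come from software performance, not hardware.
```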
Slide 12: How to improve the performance?
- Clock frequency: very little gain to be expected, and no action to be taken
- Vectors, instruction pipelining, instruction level parallelism (ILP), hardware threading: potential gain in throughput and in time-to-finish (NEXT TALK)
- Multi-core, multi-socket, multi-node: gain in memory footprint and time-to-finish, but not in throughput; running independent jobs per core (as we do now) is still the best solution for High Throughput Computing applications (THIS TALK)
Improving performance is the key for reducing the cost and maximizing detector potential; a small illustration of the vectorization gain follows below.
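A tiny illustration of the gain hiding in the vectors/ILP row: the same per-track computation written as a scalar loop and in vectorized form, with NumPy standing in for compiler auto-vectorization (the kernel itself is an invented example).

```python
import numpy as np

# Transverse momentum for a million tracks, scalar vs. vectorized.
px = np.random.rand(1_000_000)
py = np.random.rand(1_000_000)

# Scalar style: one track at a time (what a naive loop compiles to).
pt_scalar = [(x * x + y * y) ** 0.5 for x, y in zip(px, py)]

# Vectorized style: whole arrays at once, SIMD- and cache-friendly;
# typically an order of magnitude faster for this kind of kernel.
pt_vector = np.sqrt(px * px + py * py)
```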
Slide 13: WLCG Collaboration
- Distributed infrastructure of 150 computing centers in 40 countries
- 300+ k CPU cores (~3M HEP-SPEC-06)
- The biggest site has ~50k CPU cores; 12 T2 sites have 2-30k CPU cores each
- Distributed data, services and operation infrastructure
Slide 14: Distributed computing today
- Running millions of jobs and processing hundreds of PB of data
- Efficiency (CPU/Wall time) ranges from 70% (analysis) to 85% (organized activities)
Slide 15: Can it scale?
- Each experiment today runs its own Grid overlay on top of the shared WLCG infrastructure
- In all cases the underlying distributed computing architecture is similar, based on the "pilot" job model (e.g. GlideinWMS); a simplified sketch follows below
- Can we scale these systems to the expected levels?
- Lots of commonality and potential for consolidation
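A heavily simplified sketch of the pilot model the experiment frameworks share: the pilot lands on a worker node, then pulls real payloads from a central task queue until none remain. The queue URL, endpoints and payload format are hypothetical, invented for illustration.

```python
import subprocess
import requests

TASK_QUEUE = "https://taskqueue.example.org/api"   # hypothetical endpoint

def run_pilot() -> None:
    """Pilot job: pull payloads from the central queue until it is empty."""
    while True:
        resp = requests.get(f"{TASK_QUEUE}/match", params={"site": "CERN"})
        if resp.status_code == 204:        # no matching work left
            break
        job = resp.json()                  # e.g. {"id": ..., "cmd": [...]}
        result = subprocess.run(job["cmd"], capture_output=True)
        requests.post(f"{TASK_QUEUE}/done/{job['id']}", data=result.stdout)

if __name__ == "__main__":
    run_pilot()
```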
Slide 16: Grid Tier model
Network capabilities and data access technologies have significantly improved our ability to use resources independent of location.
http://cerncourier.com/cws/article/cern/52744
Slide 17: Global Data Federation
FAX – Federated ATLAS Xrootd; AAA – Any Data, Any Time, Anywhere (CMS); AliEn (ALICE). B. Bockelman, CHEP 2012. A miniature illustration of the fallback logic follows below.
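The federation idea in miniature: try a local replica first, then fall back to a global redirector that can locate the file anywhere in the federation. Hostnames, path and `open_xrootd()` are illustrative stand-ins; real access would go through an XRootD client library.

```python
# Federated read, in miniature. All names here are illustrative.
def open_xrootd(url: str):
    """Placeholder for a real XRootD open call."""
    raise IOError(f"no replica reachable at {url}")

CANDIDATES = [
    "root://local-se.example.org//store/data/run1/file.root",
    "root://global-redirector.example.org//store/data/run1/file.root",
]

def open_federated(candidates):
    for url in candidates:
        try:
            return open_xrootd(url)
        except IOError:
            continue      # not here: fall through to the next level
    raise IOError("file not found anywhere in the federation")
```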
Slide 18: Virtualization
Virtualization provides better system utilization by putting all those many cores to work, resulting in power & cost savings.
Slide 19: Software integration problem
- Traditional model: driven by the platform (grid middleware)
  - Horizontal layers, independently developed, maintained by different groups, each with a different lifecycle
- The application is deployed on top of the stack
  - Breaks if any layer changes; needs to be re-certified every time something changes
  - Results in a deployment and support nightmare: difficult to do upgrades, even worse to switch to new OS versions
[Diagram: stack of Hardware, OS, Libraries, Tools/Databases, Application]
Slide 20: Application driven approach
- Start by analysing the application requirements and dependencies
- Add the required tools and libraries
- Use virtualization to build a minimal OS and bundle all of this into a Virtual Machine image
- This separates the lifecycles of the application and the underlying computing infrastructure: decoupling Apps and Ops (see the sketch below)
[Diagram: Application on top of Libraries, Tools, Databases, minimal OS]
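The application-driven idea, sketched as data: the image contents are derived from what the application declares, rather than certifying the application against a fixed platform stack. All names here are illustrative.

```python
# Application-driven image build, sketched. All names are illustrative.
manifest = {
    "application": "experiment-reco-v1.2",
    "libraries":   ["ROOT", "Geant4", "libxml2"],
    "tools":       ["python3", "rsync"],
    "databases":   ["conditions-db-client"],
}

def image_layers(m: dict) -> list:
    """Minimal OS at the bottom, then only declared dependencies."""
    return (["minimal-os"] + m["tools"] + m["databases"]
            + m["libraries"] + [m["application"]])

print(image_layers(manifest))
```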
Slide 21: Cloud: Putting it all together
- IaaS = Infrastructure as a Service, accessed through an API (see the sketch below)
- Server consolidation using virtualization improves resource usage
- Public, on-demand, pay-as-you-go infrastructure with goals and capabilities similar to those of academic grids
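What "infrastructure as a service" means operationally: capacity is requested and released through an API rather than installed. A hypothetical EC2-style call; the endpoint, payload and flavor names are invented purely for illustration.

```python
import requests

IAAS_API = "https://cloud.example.org/v1"   # hypothetical IaaS endpoint

def grow_farm(n: int, flavor: str = "8-core-16GB") -> list:
    """Request n additional worker VMs and return their IDs."""
    resp = requests.post(f"{IAAS_API}/servers",
                         json={"count": n, "flavor": flavor,
                               "image": "worker-node-vm"})
    resp.raise_for_status()
    return [server["id"] for server in resp.json()["servers"]]
```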
Slide 22: Public Cloud
- At present 9 large sites/zones, up to ~2M CPU cores/site, ~4M total
- 10 x more cores on 1/10 of the sites compared to our Grid
Slide 23: Could we make our own Cloud?
- Open Source Cloud components
  - Open Source Cloud middleware
  - VM building tools and infrastructure: CernVM + CernVM-FS, BoxGrinder…
  - Common API
- From the Grid heritage…
  - Common authentication/authorization
  - High performance global data federation/storage
  - Scheduled file transfer
  - Experiment frameworks adapted to the Cloud
- To do
  - Unified access and cross-Cloud scheduling
  - Centralized key services at a few large centers
Slide 24: Grid on top of Clouds
[Vision diagram (ALICE example): the O2 Online-Offline Facility (HLT, DAQ) handles reconstruction, calibration and the custodial data store; a private CERN cloud runs AliEn and CAF on demand for analysis, with a data cache; public cloud(s) run AliEn for simulation; T1/T2/T3 sites run AliEn and CAF on demand for reconstruction, simulation and analysis, backed by a data store]
Slide 25: And if this is not enough…
- Example: the Titan supercomputer, 18,688 nodes with a total of 299,008 processor cores and one GPU per node; and there are spare CPU cycles available…
- Ideal for simulation-type workloads (60-70% of all CPU cycles used by the experiments)
- Simulation frameworks must be adapted to efficiently use such resources where available
Slide 26: Conclusions
- Resources needed for computing at HL-LHC are going to be large, but not unprecedented
- Data volume is going to grow dramatically in Run 4
- Projected CPU needs are reaching the Moore's law ceiling
- We might need to use heterogeneous computing resources to optimize the cost of computing
- Parallelization at all levels is essential for improving performance; this requires training and reengineering of experiment software frameworks
- New technologies such as clouds and virtualization may help to reduce complexity and to fully utilize the available resources
- We must continue to work on improving the performance and efficiency of our software, adapting to new technologies