Slide1
Offline & Computing
Tommaso Boccali (INFN Pisa)
Slide2
Slide3
WHAT is the Offline and Computing Project doing

We are in charge of the CMS Data from the moment it leaves the DAQ at the Cessy site and enters CERN/IT.
We are in charge of:
- Data custodiality (earthquake-solid)
- Data processing from DAQ channels to physics-enabling objects (from 0101001 to jets, particles, leptons, …)
- Periodic data reprocessing
- Delivery of data to users for analyses
A parallel path is followed for Monte Carlo events: from Monte Carlo generators (our DAQ…) to Geant4 (our detector…) to analysis-ready datasets.
Slide4
HOW do we do it?

A simple calculation, for a typical Run II year (say 2018):
- CMS collects 10 billion collision events, typically 1 MB/event
- You want to be earthquake-solid, so you want 2 copies on tape: 20 PB/year
- You need to process them at least a couple of times; processing takes ~20 sec/event, so you need 200 billion CPU seconds (~15,000 CPU cores)
- You want almost 2 MC events per data event (and they cost more, ~50 sec/event): another 70,000 cores, another 40 PB
- You do much more (reconstruction of previous years, simulation of future detectors): double the previous numbers
- You need to run user analyses: add another 50%
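The back-of-envelope numbers above can be checked in a few lines of Python. This is only an illustrative sketch: the quoted core counts also fold in scheduling efficiency and the assumed wall-clock window, so only the raw totals are reproduced here.

```python
# Back-of-envelope CMS resource estimate for a Run II year (2018),
# using the per-event figures quoted above.
events = 10e9            # collision events collected per year
event_size_mb = 1.0      # ~1 MB per RAW event
tape_copies = 2          # earthquake-solid: two tape copies

data_tape_pb = events * event_size_mb * tape_copies / 1e9   # MB -> PB
cpu_seconds = events * 20                                   # ~20 sec/event, one pass

mc_events = 2 * events                                      # ~2 MC events per data event
mc_cpu_seconds = mc_events * 50                             # ~50 sec/event for MC
mc_tape_pb = mc_events * event_size_mb * tape_copies / 1e9

print(f"data on tape: {data_tape_pb:.0f} PB/year")      # 20 PB
print(f"data CPU:     {cpu_seconds:.1e} core-seconds")  # 2e11 = 200 billion
print(f"MC on tape:   {mc_tape_pb:.0f} PB/year")        # 40 PB
```

Dividing the core-seconds by the seconds in a year gives a few thousand cores per pass; the slide's 15,000-core figure is consistent once multiple passes and realistic utilization are folded in.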
Slide5
All in all …

CMS computing requests for 2018 are roughly:
- CPU = 200,000 computing cores
- Disk = 150 PB
- Tape = 200 PB
How to handle this?
- A single huge center? You lose earthquake-solidity, and there are also political considerations
- Distributed computing: nice idea, but how? MONARC, GRID, eventually Clouds, etc. This is the story of the last 20 years
Slide6
Today - WLCG

200+ sites, with totals exceeding:
- 800,000 computing cores
- 500 PB disk
- 500 PB tape
WLCG provides the middleware to the experiments, governs its deployment and evolution, and reports to committees and review panels.
It is somehow an additional LHC experiment, and CMS is also part of it.
Slide7
Distributed Computing in CMS

The initial design was strictly hierarchical (from MONARC):
- A single Tier-0, close to the experiment, takes care of first processing and data shipping
- A few (5-10) regional centers (Tier-1s) ensure data safety and perform MC reconstruction and data reprocessing
- Many (20-100) institute-level centers (Tier-2s) support user analyses + MC simulation
Few guaranteed network links; a static model (10+ years of commitment from the centers).
Slide8
In reality now …

A much more dynamic infrastructure:
- Each center is well connected to every other
- A GRID-to-Cloud transformation, not only in our environment
Consequences? Much easier to provide computing to CMS!
- You do not need a large local staff; if you can provide a standard Cloud environment, we can run on it!
- Only the Tier-0 remains special, with its peculiarity of being close to the detector
Slide9
What we can use today (well, tomorrow early morning…)

At the resource level, CMS is more or less advanced in the utilization of:
- Standard GRID centers (easy…)
- Institute Cloud infrastructures (OpenStack, OpenNebula, VMware, …)
- DODAS
- Commercial Clouds (Google, Amazon, …): HepCloud
- HPC systems
With some specific effort, many of them can be used.
So, building and maintaining a Tier-2 is not the only option!
Slide10
Institute-level Cloud infrastructures

We provide a tool, called DODAS, which can run a "Tier-2/3 on demand" given an allocation on industry-standard Cloud platforms.
It is *really* one click and go:
- It deploys compute nodes, plus services like Squid caches and XRootD caches
- After deployment (~30 min), it can be used directly and transparently for both analysis and production
- This is the easiest way to provide resources to CMS, using what you might already have in your institutions
- It works equally well if you have a Cloud allocation or a grant from a commercial Cloud provider: you can build a virtual center of 1 to 1000s of computing cores, lasting from a few hours to months
Slide11
Slide12
Commercial Clouds

One way to provide resources to CMS (instead of building a Tier-2) is to provide access "in your name" to commercial cloud resources (Amazon, Google, Rackspace, T-Systems, …).
Fermilab is the most advanced with its HepCloud:
- It hides the commercial part behind its services
- Invisible to the experiment (we like it!)
DODAS is also a possibility, depending on how much local effort you want to put into the deployment; DODAS is easier but could be less optimal, being a catch-all solution.
Slide13
Slide14
HPC systems

They are expected to become more and more important for High Energy Physics computing.
There is experience, but currently each HPC system is unique and needs unique developments.
Some are friendly: networking with the outside world allowed; x86_64 architectures; CVMFS and virtualization provided; …
Others are not: exotic architectures; network segregation; …
What is the situation in your countries? Are there HPC centers available? We would love to work with you to check their usability and to help you prepare a utilization proposal!
Slide15
Tests are ongoing with CINECA/PRACE.
There is interest in a collaboration with MareNostrum/Barcelona.
Slide16
SW development

SW development is essential to CMS; our SW is anything but static:
- O(2-10) releases per week
- Adopting new technologies at a fast pace: C++ → C++11 → C++14 → C++17
- CUDA available in the standard SW environment
- TensorFlow and Keras available in the standard SW environment
CMS loves to slightly change the detector each year:
- 2017: pixels
- 2018: HCAL Endcap
- …
Big effort from CMSSW to support new technologies; this lowers the barrier for users testing ideas/algorithms.
Slide17
SW development

During 2004-2010 there was a huge effort on software, also detector-driven: people building a detector wanted to get the best performance out of it, and hence spent effort on SW.
From 2010 on, data analysis started to feel more exciting, and the effort on software dropped.
There is a lot of room for physicists / computer scientists to contribute and take ownership of MAJOR parts of the code:
- Reconstruction algorithms, including Machine Learning
- Code optimization, refactoring, re-engineering
- Adoption of new tools: algebra packages, toolkits, …
There are oceans of opportunities for someone who is computer-savvy and has a decent physics background.
A very fast example …
Slide18
B tagging …

I worked in b tagging for 10 years. We spent literally years designing and commissioning b-tagging algorithms, from easy ones to complex MVA analyses.
Then ML came. This is what it did:
- 10% larger b efficiency at the same contamination
- More than a factor of 2 in rejection at the same b efficiency
- Less than 1 year of effort, with ML support in CMSSW still at an initial stage
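To illustrate the kind of ML discriminator meant here (this is a toy, not the CMSSW b-tagging code: the two "jet features" and all numbers are invented), a minimal tagger can be trained on synthetic data with NumPy alone:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins for b-tagging inputs (hypothetical features, e.g. secondary-vertex
# mass and impact-parameter significance): b jets and light jets are drawn from
# two shifted Gaussians, so the classes are separable but overlapping.
n = 2000
b_jets = rng.normal(loc=[1.5, 3.0], scale=1.0, size=(n, 2))
light_jets = rng.normal(loc=[0.3, 0.5], scale=1.0, size=(n, 2))
X = np.vstack([b_jets, light_jets])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Minimal logistic-regression "tagger" trained by full-batch gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid discriminator output
    w -= 0.1 * X.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)

accuracy = np.mean(((1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5) == y)
print(f"toy tagger accuracy: {accuracy:.2f}")
```

Real taggers use deep networks over dozens of track- and vertex-level inputs, but the workflow is the same: features in, a continuous discriminator out, and a working point chosen on the efficiency-vs-rejection curve.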
Slide19
Service Tool development

CMS Computing is a complex machine:
- It has to move PBs of files weekly
- It has to process ~1 M jobs every day
- It handles 500 unique users per week
We need a lot of infrastructure:
- Data management
- Workload management
- Analysis infrastructure
- Databases, web services, monitoring tools, …
We feel we are deeply understaffed for most of these tasks; many of them are at "survival support level", and we would love an injection of new forces and ideas.
Slide20
Computing Operations

Every day in CMS:
- The LHC delivers ~0.5 /fb (300 TB of RAW data); we need to process it and store it safely
- PPD operators submit release-validation samples, production workflows, and data reprocessings
- Users submit 100s of analysis tasks
- Coverage is 24x7 (yes, the LHC does not stop at nights or weekends…)
- Each of these needs data movement, monitoring, failure recovery, …
You would be surprised by the number of FTEs involved here.
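The daily numbers above translate into sustained rates; a quick sanity check, assuming a flat profile over 24 hours:

```python
# Sustained rates implied by the daily operations numbers on the slide.
raw_tb_per_day = 300          # ~300 TB of RAW data per day
jobs_per_day = 1_000_000      # ~1 M processing jobs per day
sec_per_day = 24 * 3600

raw_gb_per_s = raw_tb_per_day * 1000 / sec_per_day
jobs_per_s = jobs_per_day / sec_per_day

print(f"sustained RAW rate: {raw_gb_per_s:.1f} GB/s")   # ~3.5 GB/s
print(f"job submission rate: {jobs_per_s:.1f} jobs/s")  # ~11.6 jobs/s
```

Several GB/s and tens of jobs per second, around the clock, is why each step needs automated data movement, monitoring, and failure recovery.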
Slide21
CMS & Big Data

CERN and CMS have prepared a "ready to use" solution to analyze big data samples (up to the PB level).
At the moment this mostly covers computing LOGS (for smart analyses like data popularity, anomaly detection, …).
The plan is to eventually use the same infrastructure for physics.
Interest from physicists / data scientists is large, and CMS needs these kinds of new ideas!
Please meet CMSSpark!
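In production this kind of analysis runs on the HDFS/Spark infrastructure via CMSSpark; the underlying idea, data popularity from access logs, can be sketched in plain Python (the dataset names and records below are invented for illustration):

```python
from collections import Counter

# Hypothetical access-log records: (dataset, user). In reality these would be
# billions of rows read from HDFS and aggregated with Spark.
access_log = [
    ("/SingleMuon/Run2018A/MINIAOD", "alice"),
    ("/SingleMuon/Run2018A/MINIAOD", "bob"),
    ("/TTJets/RunIIFall17/NANOAOD", "alice"),
    ("/SingleMuon/Run2018A/MINIAOD", "alice"),
]

# Data popularity: accesses per dataset, and how many distinct users touch it.
accesses = Counter(dataset for dataset, _ in access_log)
unique_users = {d: len({u for ds, u in access_log if ds == d}) for d in accesses}

for dataset, n_acc in accesses.most_common():
    print(f"{dataset}: {n_acc} accesses, {unique_users[dataset]} unique users")
```

Popularity rankings like this feed decisions such as which datasets to keep on disk and which replicas to retire.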
Slide22
R&D

Current extrapolations of computing to 2026 hint that, in order to process HL-LHC data, we would need ~5x more funding for computing. Obviously impossible.
There is a huge effort in finding solutions that improve operational efficiency:
- Reduced data formats
- Widespread utilization of GPUs, FPGAs, …
- Studies of operational modes: redo vs. store, …; remote access, caching, …
- Merging of the offline / online worlds (imagine one)
Slide23
Opportunities …

Opportunities are there in practically all the areas:
- Resource provision: CMS is trying to lower the threshold for resource utilization; it is no longer limited to "build a Tier-2 for us"
- SW development (physics and services): oceans of opportunities, ranging from code curation to the utilization of novel technologies in basically all fields
- Operations: even during the Long Shutdown! Computing never stops (indeed, periods without data taking are usually the busiest!)
- Short / long range R&D: dive into the CMS Big Data (computing AND physics); modelling, access to new technologies; tests of completely new ideas (always welcome!!!)
Slide24
.. And rewards!

We spoke about the EPR system last time; it is not covered here.
The tasks in O+C are usually high-visibility ones, whether you provide a resource OR take care of a service.
R&D tasks are very attractive for data scientists:
- Easy access to Keras/TF _AND_ a huge amount of data in a single place
- An HDFS/Spark cluster already configured and CMS-enabled
- SW, operations, …
All are challenging environments, with high returns in publications / conferences (CMS brought / proposed > 60 talks to the last CHEP!).
At the FA level, we are very interested in listening and helping with:
- Participation in EU projects
- Utilization of / collaboration with local facilities (HPC, clouds, …)