
Presentation Transcript

Slide1

Offline & Computing

Tommaso Boccali (INFN Pisa)

1

Slide2

2

Slide3

WHAT is the Offline and Computing Project doing?

We are in charge of the CMS data from the moment it leaves the DAQ at the Cessy site and enters CERN/IT

We are in charge of

Data custodiality (earthquake-solid)

Data processing from DAQ channels to physics-enabling objects (from 0101001 to jets, particles, leptons, …)

Periodic data reprocessing

Delivery of data to users for analyses

A parallel path is followed for Monte Carlo events

From Monte Carlo generators (our DAQ…) to Geant4 (our detector…) to analysis-ready datasets

3

Slide4

HOW do we do it?

A simple calculation, for a typical Run II year (say 2018); see the back-of-envelope sketch after this list

CMS collects 10 billion collision events, typically 1 MB/ev

You want to be earthquake-solid, so you want 2 copies on tape: 20 PB/y

You need to process them at least a couple of times; processing takes ~20 sec/ev

You need 200 billion CPU seconds (you need 15,000 CPU cores)

You want almost 2 MC events per data event (and they cost more, ~50 sec/ev)

Another 70,000 cores, another 40 PB

You do much more (reconstruction of previous years, simulation of future detectors)

Double the previous numbers

You need to run user analyses

Add another 50%
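The following is a minimal back-of-envelope sketch in Python of the arithmetic above, using only the numbers quoted on the slide. The naive 100%-efficiency core counts it prints come out below the 15,000 and 70,000 quoted, which presumably fold in multiple processing passes and realistic CPU efficiency.

```python
# Back-of-envelope sketch of the Slide 4 numbers for a 2018-like Run II year.
# Inputs are taken from the bullets above; this is an order-of-magnitude
# estimate, not an official CMS resource request.

events           = 10e9   # collision events collected in the year
event_size_mb    = 1.0    # ~1 MB per event
tape_copies      = 2      # "earthquake-solid": two tape copies

data_reco_time_s = 20     # ~20 s per data event
mc_per_data      = 2      # ~2 MC events per data event
mc_time_s        = 50     # ~50 s per MC event

seconds_per_year = 365 * 24 * 3600   # ~3.15e7 s

raw_on_tape_pb = events * event_size_mb * tape_copies / 1e9   # MB -> PB
data_cpu_s     = events * data_reco_time_s                    # ~2e11 CPU seconds
mc_cpu_s       = events * mc_per_data * mc_time_s             # ~1e12 CPU seconds

print(f"RAW on tape : {raw_on_tape_pb:.0f} PB")
print(f"Data reco   : {data_cpu_s:.1e} CPU s "
      f"(~{data_cpu_s / seconds_per_year:,.0f} cores at 100% for a year)")
print(f"MC prod     : {mc_cpu_s:.1e} CPU s "
      f"(~{mc_cpu_s / seconds_per_year:,.0f} cores at 100% for a year)")
```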

4

Slide5

All in all …

CMS computing requests for 2018 are roughly

CPU = 200,000 computing cores

DISK = 150 PB

Tape = 200 PB

How to handle this?

A single huge center (but you lose earthquake-solid); also, political reasons

Distributed computing

Nice idea, but how?

Monarc, GRID, eventually Clouds, etc …

This is the story of the last 20 years

5

Slide6

Today - WLCG

200+ sites

Totals exceeding

800,000 computing cores

500 PB Disk

500 PB Tape

WLCG provides the middleware to the experiments, governs its deployment and evolution, and reports to committees and review panels

It is somehow an additional LHC experiment, and CMS is also part of it

6

Slide7

Distributed Computing in CMS

Initial design strictly hierarchical (from Monarc)

A single Tier-0 close to the experiment, takes care of first processing and data shipping

A few (5-10) regional centers (Tier-1s), ensure data safety and perform MC reconstruction and data reprocessing

Many (20-100) institute-level centers (Tier-2s), support user analyses + MC simulation

Few guaranteed network links, static model (10+ years of commitment from centers)

7

Slide8

In reality now …

A much more dynamic infrastructure

Each center is well connected to any other

GRID to Cloud transformation, not only in our Environment

Consequences?

Much easier to provide computing to CMS!

You do not need large local staff; if you can provide a standard Cloud environment, we can run on it!

Only the Tier-0 remains special, with its peculiarity of being close to the detector

8

Slide9

What we can use today (well, tomorrow early morning…)

At the resource level, CMS is more or less advanced in the utilization of

Standard GRID centers (easy …)

Institute Cloud infrastructures (OpenStack, OpenNebula, VMware, …) → DODAS

Commercial Clouds (Google, Amazon, …) → HepCloud

HPC systems

With some specific effort,

many

of them can be used

So, building and maintaining a Tier-2 is not the only option!

9

Slide10

Institute level Cloud infrastructures

We provide a tool, called DODAS, which can run a "Tier-2/3 on demand" given an allocation on industry-standard Cloud platforms

It is *really* 1 click and go.

It deploys compute nodes and services like Squid caches and XRootD caches

After deployment (30 min), it can directly be used for either Analysis or Production, transparently

This is the easiest way to provide resources to CMS – using what you might already have in your institutions

It can equally work if you have a Cloud allocation / a grant from a commercial Cloud provider: you can build a virtual center from 1 to 1000s of computing cores, lasting from a few hours to months

10

Slide11

11

Slide12

Commercial Clouds

One way to provide resources to CMS (instead of building a Tier-2) is to provide access "in your name" to commercial cloud resources (Amazon, Google, Rackspace, T-Systems, …)

Fermilab is the most advanced with its HepCloud:

Hide the commercial part behind its services

Invisible to the experiment (we like it!)

DODAS is also a possibility, depending on how much local effort you want to put in the deployment

DODAS is easier but could be less optimal, being a catch-all solution

12

Slide13

13

Slide14

HPC systems

They are expected to be more and more important for High Energy Physics Computing

There is experience, but currently each HPC system is unique and needs unique developments

Also, some are friendly

Networking allowed with the outside world; x86_64 architectures; provide CVMFS and virtualization, …

Others are not

Exotic architectures; network segregation; …

What is the situation in your countries? Are there HPC centers available? We would love to work with you to check the usability & help you to prepare a utilization proposal!

14
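As a purely illustrative starting point for such a usability check, here is a minimal Python probe of the "friendliness" criteria listed above; the host name and paths are assumptions for illustration, not an official CMS validation procedure.

```python
# Hypothetical quick check of whether an HPC worker node looks "CMS friendly":
# x86_64 CPUs, the CMS CVMFS repository mounted, and outbound networking.
import os
import platform
import socket

def looks_cms_friendly() -> bool:
    ok = True

    if platform.machine() != "x86_64":
        print(f"architecture is {platform.machine()}, not x86_64")
        ok = False

    if not os.path.isdir("/cvmfs/cms.cern.ch"):
        print("CVMFS repository /cvmfs/cms.cern.ch is not mounted")
        ok = False

    try:
        # Outbound connectivity test (many HPC worker nodes are network-segregated).
        socket.create_connection(("cmsweb.cern.ch", 443), timeout=5).close()
    except OSError:
        print("no outbound network connectivity to cmsweb.cern.ch:443")
        ok = False

    return ok

if __name__ == "__main__":
    print("friendly" if looks_cms_friendly() else "needs work")
```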

Slide15

15

Tests ongoing with CINECA/PRACE

Interest for collaboration with MareNostrum/Barcelona

Slide16

SW development

SW development is essential to CMS

Our SW is anything but static

O(2-10) releases per week

Adopting new technologies at a fast pace

C++ → C++11 → C++14 → C++17

CUDA available in the standard SW environment

TensorFlow and Keras available in the standard SW environment

CMS loves to slightly change the detector each year

2017: pixels

2018: HCAL Endcap

16

Big effort from CMSSW to support new technologies

Lowers the barrier for users testing ideas/algorithms

Slide17

SW development

During 2004-2010 there was a huge effort on software, also detector-driven

People building a detector wanted to get the best performance out of it, hence were spending effort on SW

From 2010 on, data analysis started to feel more exciting

Effort on software much lower

There is a lot of space for physicists / CS to contribute and take ownership of MAJOR parts of the code

Reconstruction algorithms, including Machine Learning

Code optimization, refactoring, re-engineering

Adoption of new tools

Algebra packages, toolkits, …

There are oceans of opportunities for someone who is computer savvy and with a decent physics background

A very fast example …

17

Slide18

B tagging …

I worked in b tagging for 10 years.

We spent literally years designing and commissioning b-tagging algorithms, from simple ones to complex MVA analyses.

Then, ML came. This is what it did:

10% higher b efficiency at the same contamination

More than a factor of 2 better rejection at the same b efficiency

Less than 1 year of effort, with ML support in CMSSW still at the initial step

18
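To give a flavor of what this kind of work looks like, here is a minimal, hypothetical Keras sketch of a jet classifier of the sort that replaced hand-tuned b-tagging MVAs. The feature count, network size and random placeholder data are illustrative assumptions, not any real CMS tagger architecture.

```python
# Minimal, illustrative binary jet classifier (b vs light) in Keras.
import numpy as np
import tensorflow as tf

N_FEATURES = 16  # e.g. track impact parameters, secondary-vertex variables (assumed)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(N_FEATURES,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # discriminator: P(jet is a b jet)
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])

# Placeholder data standing in for labelled jets (b = 1, light = 0).
x = np.random.rand(1000, N_FEATURES).astype("float32")
y = np.random.randint(0, 2, size=(1000, 1)).astype("float32")
model.fit(x, y, epochs=2, batch_size=128, verbose=0)

print(model.predict(x[:5], verbose=0).ravel())  # b-tag discriminator values
```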

Slide19

Service Tool development

CMS Computing is a complex machine

It has to move PBs of files weekly

It has to process ~ 1 M jobs every day

It handles 500 unique users per week

We need a lot of infrastructure

Data Management

Workload management

Analysis Infrastructure

Databases, web services, monitoring tools, …

We feel we are deeply understaffed for most of the tasks

Many of them at "survival support level"; we would love the injection of new forces and ideas

19

Slide20

Computing Operations

Every day in CMS

LHC delivers ~ 0.5 /fb (300 TB of RAW data)

We need to process it and store it safely

PPD operators submit

Release validation samples

Production workflows

Data reprocessing

Users submit 100s of analysis tasks

Coverage is 24x7 (yes, LHC does not stop at nights or weekends…)

Each of these needs data movement, monitoring, failure recovery, …

You would be surprised by the number of FTEs involved here
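To put the daily rate in perspective, here is a hedged keep-up estimate combining the ~300 TB/day above with the per-event figures from Slide 4; the numbers come from the slides, the arithmetic is only an illustration.

```python
# Hedged keep-up estimate: one day of data taking (~300 TB of RAW, from this
# slide) combined with the ~1 MB/ev and ~20 s/ev figures quoted earlier.
# Order of magnitude only; it is higher than the year-averaged core counts on
# Slide 4 because the LHC delivers data only part of the year.
raw_per_day_tb = 300
event_size_mb  = 1.0
reco_time_s    = 20

events_per_day   = raw_per_day_tb * 1e6 / event_size_mb   # TB -> MB -> events
cpu_s_per_day    = events_per_day * reco_time_s
cores_to_keep_up = cpu_s_per_day / 86400                   # seconds per day

print(f"{events_per_day:.1e} events/day -> ~{cores_to_keep_up:,.0f} cores "
      "running 24x7 just to keep up with prompt reconstruction")
```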

20

Slide21

CMS & Big Data

CERN and CMS have prepared a “ready to use” solution to analyze big data samples (up to PB level)

At the moment this mostly includes computing logs (for smart analyses like data popularity, anomaly detection, …)

Plan is to use the same infrastructure for Physics, eventually

Interest from Physicists / data scientists is large

And CMS needs these kind of new ideas!

Please meet CMSSpark!

21
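As a flavor of what such an infrastructure enables, here is a hedged PySpark sketch that ranks datasets by access popularity from computing logs; the input path and column names are illustrative assumptions, not the actual CMSSpark schema.

```python
# Sketch of a data-popularity study with PySpark over access-log records.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dataset-popularity-sketch").getOrCreate()

# Hypothetical access-log records, one JSON object per line:
# {"dataset": "...", "user": "...", "bytes_read": 123, "ts": ...}
accesses = spark.read.json("hdfs:///path/to/access_logs/*.json")  # placeholder path

popularity = (
    accesses
    .groupBy("dataset")
    .agg(
        F.countDistinct("user").alias("unique_users"),
        F.count("*").alias("n_accesses"),
        F.sum("bytes_read").alias("bytes_read"),
    )
    .orderBy(F.desc("n_accesses"))
)

popularity.show(20, truncate=False)
spark.stop()
```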

Slide22

R&D

Current extrapolations of computing to 2026 hint that in order to process HL-LHC data we would need ~5x more funding for computing

Obviously impossible

Huge effort in finding solutions in order to have gains in operational efficiency

Reduced data formats

Wide spread utilization of GPUs, FPGAs, …

Studies on operational modes

Redo vs store, …

Remote access, caching, …

Merging of the offline / online worlds (imagine one)

22

Slide23

Opportunities …

Opportunities are there in all the areas, practically

CMS is trying to lower the threshold for Resource Utilization

No longer limited to "build a Tier-2 for us"

SW development (physics and services)

Oceans of opportunities, ranging from code curation to the utilization of novel technologies in basically all the fields

Operations

Even during Long Shutdown! Computing never stops (indeed periods w/o data taking are usually the busiest!)

Short / Long range R&D

Dive into the CMS Big Data (computing AND physics)

Modelling, access to new technologies

Test of completely new ideas (always welcome!!!)

23

Slide24

.. And rewards!

We spoke about the EPR system last time, not covered here …

The tasks in O+C are usually high-visibility ones

If you provide a resource OR take care of a service

R&D tasks are very attractive for data scientists

Easy access to

Keras/TF _AND_ a huge amount of data in a single place

An HDFS/Spark cluster already configured and CMS-enabled

SW, operations, …

All are challenging environments, with high return in publications / conferences (CMS brought / proposed > 60 talks to the last CHEP!)

At FA level, we are very interested in listening and helping with

Participation in EU projects

Utilization of / collaboration with local facilities (HPC, clouds, …)

24