Computing at the HL-LHC (Predrag Buncic)


Presentation Transcript

Slide1

Computing at the HL-LHC
Predrag Buncic, on behalf of the Trigger/DAQ/Offline/Computing Preparatory Group

ALICE: Pierre Vande Vyvre, Thorsten Kollegger, Predrag Buncic; ATLAS: David Rousseau, Benedetto Gorini, Nikos Konstantinidis; CMS: Wesley Smith, Christoph Schwick, Ian Fisk, Peter Elmer; LHCb: Renaud Legac, Niko Neufeld

Slide2

Computing @ HL-LHC

CPU needs (per event) will grow with track multiplicity (pileup)

Storage needs are proportional to accumulated luminosity

Resource estimates

The probable evolution of distributed computing technologies

Slide3

LHCb & ALICE @ Run 3
[Diagram: peak output to storage] LHCb: 40 MHz readout, reduced in stages to 5-40 MHz and then to 20 kHz (0.1 MB/event), ~2 GB/s to storage. ALICE: 50 kHz (1.5 MB/event) readout, online reconstruction + compression, ~75 GB/s to storage.

Slide4

ATLAS & CMS @ Run 4
[Diagram: peak output to storage] Level 1 trigger + HLT chains delivering 5-10 kHz (2 MB/event), i.e. 10-20 GB/s to storage, and 10 kHz (4 MB/event), i.e. 40 GB/s to storage.

Slide5

Big Data
“Data that exceeds the boundaries and sizes of normal processing capabilities, forcing you to take a non-traditional approach”
[Diagram] BIG DATA sits where size, access rate and the applicability of standard data processing intersect.

Slide6

How do we score today?
[Chart] RAW + derived data recorded @ Run 1, in PB.

Slide7

Data: Outlook for HL-LHC
Very rough estimate of the new RAW data per year of running, using a simple extrapolation of the current data volume scaled by the output rates. To be added: derived data (ESD, AOD), simulation, user data…
[Chart] Projected RAW data per year, in PB. (A rough sketch of this kind of extrapolation follows.)
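To make the extrapolation concrete, here is a minimal sketch of the scaling described above; the Run 1 baseline volume, trigger rates and event-size factor used below are illustrative placeholders, not figures from the talk.

```python
# Minimal sketch of the rate-based RAW data extrapolation described above.
# All baseline numbers are illustrative placeholders, NOT values from the talk.

RUN1_RAW_PB_PER_YEAR = 15      # assumed RAW volume per year in Run 1 (PB)
RUN1_OUTPUT_RATE_KHZ = 0.5     # assumed average output rate to storage in Run 1 (kHz)
RUN4_OUTPUT_RATE_KHZ = 7.5     # assumed output rate to storage at HL-LHC (kHz)
EVENT_SIZE_GROWTH = 2.0        # assumed growth of the RAW event size

def raw_data_per_year(run1_volume_pb: float, rate_scale: float, size_scale: float) -> float:
    """Scale the current yearly RAW volume by rate and event-size factors."""
    return run1_volume_pb * rate_scale * size_scale

estimate_pb = raw_data_per_year(
    RUN1_RAW_PB_PER_YEAR,
    RUN4_OUTPUT_RATE_KHZ / RUN1_OUTPUT_RATE_KHZ,
    EVENT_SIZE_GROWTH,
)
print(f"Rough RAW estimate: {estimate_pb:.0f} PB/year")
```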

Slide8

Digital Universe Expansion

Slide9

Data storage issues
We are heading towards the Exabyte scale in storage: big data in the true sense
Our data problems may still look small on the scale of the storage needs of the internet giants
Business e-mail, video, music, smartphones and digital cameras generate more and more need for storage
The cost will probably continue to fall
Potential issues:
Commodity high-capacity disks may start to look more like tapes, optimized for multimedia storage and sequential access
They will need to be combined with flash memory disks for fast random access
The residual cost of disk servers will remain
While we might be able to write all this data, how long will it take to read it back? We will need sophisticated parallel I/O and processing.

Slide10

CPU: Online requirements
ALICE & LHCb @ Run 3/4
HLT farms for online reconstruction/compression or event rejection, aiming to reduce the data volume to manageable levels
Data volume x 10
ALICE estimate: 200k of today's core equivalents, 2M HEP-SPEC-06
LHCb estimate: 880k of today's core equivalents, 8M HEP-SPEC-06
Looking into alternative platforms to optimize performance and minimize cost: GPUs (1 GPU = 3 CPUs for the ALICE online tracking use case), ARM, Atom…
ATLAS & CMS @ Run 4
Trigger rate x 10, event size x 3, data volume x 30
Assuming linear scaling with pileup, a total factor of 50 increase (10M HEP-SPEC-06) in HLT power would be needed wrt. today's CMS farm (see the arithmetic sketch below)
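As a rough illustration of where such a factor of 50 could come from, the sketch below combines the x10 trigger rate from the slide with a per-event cost that scales linearly with pileup; the pileup values and the current farm capacity are assumptions for illustration, not numbers from the talk.

```python
# Illustrative decomposition of the ~x50 HLT scaling factor mentioned above.
# Pileup values and the current farm capacity are assumptions, not talk numbers.

TRIGGER_RATE_FACTOR = 10        # L1 output rate increase, Run 1 -> Run 4 (from the slide)
PILEUP_NOW = 30                 # assumed average pileup today
PILEUP_HL_LHC = 140             # assumed average pileup at HL-LHC
CURRENT_HLT_MHS06 = 0.2         # assumed current CMS HLT capacity in MHS06

per_event_factor = PILEUP_HL_LHC / PILEUP_NOW          # linear scaling with pileup
total_factor = TRIGGER_RATE_FACTOR * per_event_factor  # ~50 for these assumptions

print(f"Per-event CPU factor : x{per_event_factor:.1f}")
print(f"Total HLT factor     : x{total_factor:.0f}")
print(f"Required HLT capacity: {CURRENT_HLT_MHS06 * total_factor:.0f} MHS06")
```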

Slide11

CPU: Outlook for HL-LHC
Very rough estimate of the new CPU requirements for online and offline processing per year of data taking, using a simple extrapolation of current requirements scaled by the number of events. Little contingency is left; we must work on improving the performance.
[Chart] Projected ONLINE and GRID CPU needs in MHS06, compared with the Moore's law limit.

Slide12

How to improve the performance?
[Diagram] Levels of parallelism: clock frequency, vectors, instruction pipelining, instruction-level parallelism (ILP), hardware threading, multi-core, multi-socket, multi-node. Annotated gains: very little gain to be expected (and no action to be taken) at the lowest levels; a potential gain in throughput and in time-to-finish from vectors; a gain in memory footprint and time-to-finish, but not in throughput, from using many cores; and, at the multi-node level, running independent jobs per core (as we do now) is still the best solution for High Throughput Computing applications. (Partly addressed in this talk, partly in the next talk; a minimal sketch of the independent-jobs-per-core pattern follows.)
Improving performance is the key for reducing the cost and maximizing detector potential.
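As a minimal illustration of the independent-jobs-per-core pattern referred to above, the sketch below fills all available cores with independent event-processing tasks; the process_event function and its workload are hypothetical stand-ins for real reconstruction code.

```python
# Minimal sketch of the "independent jobs per core" HTC pattern.
# process_event and its workload are hypothetical stand-ins for real code.
import os
from multiprocessing import Pool

def process_event(event_id: int) -> float:
    """Pretend to reconstruct one event; returns a dummy result."""
    total = 0.0
    for i in range(100_000):
        total += (event_id * i) % 7
    return total

if __name__ == "__main__":
    n_workers = os.cpu_count() or 1
    events = range(1_000)
    # One independent task per event, spread over all cores: this maximizes
    # throughput, at the price of one full process (and its memory) per core.
    with Pool(processes=n_workers) as pool:
        results = pool.map(process_event, events)
    print(f"Processed {len(results)} events on {n_workers} cores")
```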

Slide13

WLCG Collaboration
Distributed infrastructure of 150 computing centers in 40 countries
300+ k CPU cores (~3M HEP-SPEC-06)
The biggest site has ~50k CPU cores; 12 T2s have 2-30k CPU cores
Distributed data, services and operations infrastructure

Slide14

Distributed computing today
Running millions of jobs and processing hundreds of PB of data
Efficiency (CPU/wall time) from 70% (analysis) to 85% (organized activities)

Slide15

Can it scale?
Each experiment today runs its own Grid overlay on top of the shared WLCG infrastructure
In all cases the underlying distributed computing architecture is similar, based on the “pilot” job model (e.g. GlideinWMS); a minimal sketch of the pattern follows below
Can we scale these systems to the expected levels?
Lots of commonality and potential for consolidation
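To make the “pilot” job model concrete, here is a minimal sketch of the pattern: a pilot lands on a worker node, then pulls real payloads from the experiment's central task queue until none are left. The TaskQueue interface and function names are hypothetical illustrations, not any experiment's actual API.

```python
# Minimal sketch of the pilot-job pattern common to the experiment frameworks.
# The TaskQueue interface and payload format are hypothetical illustrations.
import subprocess
from typing import Optional

class TaskQueue:
    """Stand-in for an experiment's central task queue."""
    def __init__(self, payloads):
        self._payloads = list(payloads)

    def fetch_matching_job(self, site: str) -> Optional[list]:
        """Return the next payload command suited to this site, or None."""
        return self._payloads.pop(0) if self._payloads else None

def run_pilot(queue: TaskQueue, site: str) -> None:
    # The pilot occupies a batch slot and keeps pulling payloads until the
    # queue has nothing left for it, then exits and frees the slot.
    while True:
        payload = queue.fetch_matching_job(site)
        if payload is None:
            break
        subprocess.run(payload, check=False)

if __name__ == "__main__":
    queue = TaskQueue([["echo", "reconstruct run 1234"], ["echo", "simulate 10k events"]])
    run_pilot(queue, site="CERN-T0")
```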

Slide16

Grid Tier model
Network capabilities and data access technologies have significantly improved our ability to use resources independent of location
http://cerncourier.com/cws/article/cern/52744

Slide17

Global Data Federation
FAX (Federated ATLAS Xrootd), AAA (Any Data, Any Time, Anywhere, CMS), AliEn (ALICE)
B. Bockelman, CHEP 2012
(A hedged sketch of federated file access follows.)
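As an illustration of what federated access looks like from a user job, the sketch below opens a file by its global name through an XRootD redirector using PyROOT; the redirector host and file path are made-up placeholders, not real endpoints.

```python
# Minimal sketch of reading a file through an XRootD data federation with PyROOT.
# The redirector hostname and file path below are made-up placeholders.
import ROOT

# A federation redirector resolves the global logical name to whichever
# site actually hosts a copy, so the job does not need to know the location.
url = "root://some-federation-redirector.example.org//store/data/Run1/events.root"

f = ROOT.TFile.Open(url)            # network-aware open via the xrootd protocol
if f and not f.IsZombie():
    f.ls()                          # list the file contents as a quick check
    f.Close()
else:
    print("Could not open the file through the federation")
```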

Slide18

Virtualization
Virtualization provides better system utilization by putting all those many cores to work, resulting in power & cost savings.

Slide19

Software integration problem
Traditional model: driven by the platform (grid middleware)
Horizontal layers, independently developed, maintained by different groups, each with a different lifecycle
The application is deployed on top of the stack
Breaks if any layer changes; needs to be certified every time something changes
Results in a deployment and support nightmare: difficult to do upgrades, even worse to switch to new OS versions
[Diagram] Layered stack: Hardware, OS, Libraries, Tools/Databases, Application

Slide20

Application-driven approach
Start by analysing the application requirements and dependencies
Add the required tools and libraries
Use virtualization to build a minimal OS and bundle all of this into a Virtual Machine image
This separates the lifecycles of the application and the underlying computing infrastructure: decoupling Apps and Ops
[Diagram] Bundle: Application, Tools, Databases, Libraries, OS

Slide21

Cloud: Putting it all together
IaaS = Infrastructure as a Service, exposed through an API
Server consolidation using virtualization improves resource usage
Public, on-demand, pay-as-you-go infrastructure with goals and capabilities similar to those of academic grids (a hedged sketch of such an API call follows)
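As an illustration of the "infrastructure through an API" idea, the sketch below requests a batch of worker VMs from a generic IaaS-style REST endpoint; the URL, token handling and request fields are entirely hypothetical and stand in for whatever a real provider exposes.

```python
# Hypothetical illustration of provisioning worker VMs through an IaaS-style REST API.
# The endpoint, token and field names are made up; real providers each have their own API.
import requests

API_ENDPOINT = "https://cloud.example.org/api/v1/servers"   # placeholder URL
API_TOKEN = "REPLACE_WITH_REAL_TOKEN"                        # placeholder credential

def start_workers(image: str, flavor: str, count: int) -> list:
    """Ask the (hypothetical) cloud API for `count` VMs built from a given image."""
    response = requests.post(
        API_ENDPOINT,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"image": image, "flavor": flavor, "count": count},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["servers"]          # assumed response format

if __name__ == "__main__":
    servers = start_workers(image="cernvm-worker", flavor="4-core", count=10)
    print(f"Requested {len(servers)} worker VMs")
```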

Slide22

Public Cloud
At present 9 large sites/zones
Up to ~2M CPU cores/site, ~4M in total
10x more cores on 1/10 of the sites compared to our Grid

Slide23

Could we make our own Cloud?
Open Source Cloud components:
Open Source Cloud middleware
VM building tools and infrastructure: CernVM + CernVM-FS, BoxGrinder…
Common API
From the Grid heritage:
Common Authentication/Authorization
High-performance global data federation/storage
Scheduled file transfer
Experiment frameworks adapted to the Cloud
To do:
Unified access and cross-Cloud scheduling
Centralized key services at a few large centers

Slide24

Grid on top of Clouds
[Diagram] Vision: AliEn running across the O2 online-offline facility (DAQ, HLT), a private CERN Cloud with a CAF on demand, T1/T2/T3 sites and public Cloud(s), covering reconstruction, calibration, simulation, analysis, data caches and the custodial data store.

Slide25

And if this is not enough…
18,688 nodes with a total of 299,008 processor cores and one GPU per node, and there are spare CPU cycles available…
Ideal for simulation-type workloads (60-70% of all CPU cycles used by the experiments)
Simulation frameworks must be adapted to use such resources efficiently where available

Slide26

Conclusions
Resources needed for computing at the HL-LHC are going to be large but not unprecedented
Data volume is going to grow dramatically in Run 4
Projected CPU needs are reaching the Moore's law ceiling
We might need to use heterogeneous computing resources to optimize the cost of computing
Parallelization at all levels is essential for improving performance; it requires training and reengineering of the experiment software frameworks
New technologies such as clouds and virtualization may help to reduce complexity and to fully utilize available resources
We must continue to work on improving the performance and efficiency of our software, adapting to new technologies