Presentation Transcript

Slide1

What goes up… …must go down

A case study from RAL on shrinking an existing storage service

Rob Appleyard

Slide2

Introduction

Part 1

Slide3

Introduction

The STFC Rutherford Appleton Laboratory hosts the UK's WLCG Tier 1 Centre

Data storage on disk & tape

Large batch processing farm (25,000 cores)

Tier 1 for ATLAS, CMS, LHCb, and ALICE

Also storage for local facilities

Slide4

CASTOR

CERN-developed storage system*

Running at RAL since 2007

Currently 13 PB disk storage, 36 PB tape storage

CERN: 240 PB of data on tape; 16 PB of disk, all cache

24/7 production

I have been RAL's CASTOR service manager since 2014

*http://castor.web.cern.ch/

Slide5

CASTOR Users

All 4 big LHC experiments

Also local facilities at RAL:

ISIS neutron spallation source

Diamond Light Source

CEDA environmental data collaboration

Different use case to WLCG: separate, archive-only use cases

This talk largely concerns the WLCG users

Slide6

Echo

New storage system, based on Ceph, being scaled up now

Will replace CASTOR 'disk-only' storage

Currently 8 PB of usable disk storage

Experiment data migration underway; hope to finish ~Q1 2019

More information in about 3 hours’ time!

Slide7

Reasons for Migration

Echo introduced because of long-standing CASTOR issues:

Staff-intensive (DBAs)

HEP-specific; unattractive to customers outside the HEP community

Performance struggling

Slide8

CASTOR Service Types

RAL uses CASTOR for both 'disk-only' and 'tape-backed' storage (d1t0 and d0t1)

Disk-only: the file is guaranteed to have at least one copy on a hard disk, and will not be moved to tape. This is the element to be replaced by Echo.

Tape-backed: the file is guaranteed to have at least one copy on a tape, and may also exist on disk. A disk cache buffers tape reads and writes.

CERN (the CASTOR developers) now only use CASTOR for tape-backed storage
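The distinction between the two service classes boils down to which replica carries the durability guarantee. The following Python sketch is a toy model of the d1t0/d0t1 semantics described above, not CASTOR code; the class and field names are invented for the example.

from dataclasses import dataclass

@dataclass
class StoredFile:
    """Toy model of a file under one of the two CASTOR service types."""
    name: str
    service_class: str   # "d1t0" (disk-only) or "d0t1" (tape-backed)
    disk_copies: int = 0
    tape_copies: int = 0

    def satisfies_guarantee(self) -> bool:
        if self.service_class == "d1t0":
            # Disk-only: at least one disk copy, never migrated to tape.
            return self.disk_copies >= 1
        if self.service_class == "d0t1":
            # Tape-backed: at least one tape copy; any disk copy is only a cache.
            return self.tape_copies >= 1
        raise ValueError(f"unknown service class: {self.service_class}")

# A tape-backed file whose cached disk copy has been garbage-collected is still
# safe, because the guarantee is carried by the tape copy.
archived = StoredFile("run123.raw", "d0t1", disk_copies=0, tape_copies=1)
assert archived.satisfies_guarantee()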

Slide9

Tier 1 Data Flow Now

[Diagram: ATLAS, CMS, and LHCb each have their own disk pool and their own tape cache; ALICE & other VOs share the 'Gen' tape cache and the ALICE disk pool. All tape caches sit in front of the shared tape system.]

Slide10

CASTOR Databases

Everything in CASTOR is based on Oracle DBs:

Physical data location

Transaction information

Namespace mapping

Tape drive state

Transfer scheduling

Slide11

Database Groupings

'Central services' DB

One DB instance for all WLCG users

Manages the namespace

Manages the tape interface & contents of tapes

'Stager' DB

Manages data residing on disk

One DB instance per major user community ('instance' = one stager DB schema)

SRM DB

Provides an external interface

Collocated with the stager
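To make the groupings concrete, here is a minimal sketch of the layout described above: one shared central-services database, one stager schema per major user community, and an SRM database collocated with each stager. The instance and schema names are assumptions for illustration, not the real RAL configuration.

# Illustrative layout only; names are invented for the example.
central_services_db = {
    "shared_by": "all WLCG users",
    "manages": ["namespace", "tape interface", "contents of tapes"],
}

# One stager DB schema ("instance") per major user community, each managing
# data on disk, with an SRM DB collocated to provide the external interface.
stager_instances = {
    vo: {"stager_db": f"stager_{vo}", "srm_db": f"srm_{vo}"}
    for vo in ["atlas", "cms", "lhcb", "gen"]   # "gen" = ALICE / general use
}

for vo, dbs in stager_instances.items():
    print(vo, dbs["stager_db"], dbs["srm_db"])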

Slide12

What we have now

Part 2

Slide13

A Picture of CASTOR (GridFTP)

[Diagram: a GridFTP client connects to the CASTOR SRM nodes (x4, with their DB) and the CASTOR stager (with its DB), both replicated per VO (x4). These sit in front of the shared infrastructure: CASTOR 'central services' nodes (x2), DBs (x4), CASTOR schedulers (x2), CASTOR storage nodes (x54), CASTOR 'tape server' nodes (x22), and tape drives (x22).]

Slide14

A Picture of CASTOR (XrootD)

[Diagram: as above, but an XrootD client connects through a CASTOR XrootD server to the CASTOR stager (with its DB), replicated per VO (x4). The shared infrastructure is the same: CASTOR 'central services' nodes (x2), DBs (x4), CASTOR schedulers (x2), CASTOR storage nodes (x54), CASTOR 'tape server' nodes (x22), and tape drives (x22).]

Slide15

CASTOR Current State: Databases

Two Oracle RACs are used to support CASTOR operations

One hosts ATLAS and our ALICE/general-use instance

The other hosts CMS, LHCb, and the central services

Transaction rate: 390 Hz per RAC

Load is strongly driven by disk-only operations

Slide16

CASTOR Current State: Management Nodes

AKA 'headnodes'

Each instance has 3 dedicated management nodes, and 2-4 dedicated SRM interface nodes

Interface nodes handle control traffic only

Plus two shared nodes for the 'central services': name servers and the tape system

Grand total of 25 core management nodes

Management nodes are currently 'pets', not 'cattle'

One management node failure -> service offline

Slide17

CASTOR Current State: Storage Nodes (1)

AKA 'disk servers'

137 nodes

Each node is 60-120 TB, one big RAID 6 array

10 Gb networking

Peak I/O performance typically ~3 Gb/s per node, constrained by disk I/O

Slide18

CASTOR Current State: Storage Nodes (2)

29 of those storage nodes are used only for tape-backed storage, caching data on its way to/from tape

The remaining 108 are disk-only, and will be retired when the migration to Echo is complete

Slide19

What if we do nothing?

Disk server count drops from 137 to ~30

Transaction rate drops to ~5% of current (or lower)

But we still have…

29 management nodes

2 RACs

Management nodes outnumber storage!

Unacceptable management overhead
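The arithmetic behind this slide is simple enough to spell out; the sketch below just restates the approximate node counts quoted above (137 storage nodes, of which 108 are disk-only, against roughly 29 management nodes).

# Back-of-the-envelope check using the (approximate) counts from the slides.
storage_nodes_now = 137
disk_only_nodes = 108      # retired once the Echo migration completes
management_nodes = 29      # roughly constant if nothing else changes

storage_nodes_after = storage_nodes_now - disk_only_nodes   # ~29-30 tape-cache nodes
overhead = management_nodes / storage_nodes_after

print(f"storage nodes after migration: {storage_nodes_after}")
print(f"management nodes per storage node: {overhead:.2f}")
# A ratio around 1.0 means the management layer is as large as the storage it
# manages -- the unacceptable overhead the slide refers to.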

Slide20

What we are going to do

Part 3

Slide21

Project Objectives

Reduce node count

Reduce management overhead

Improve service quality

Don't lose any data!

Slide22

User Migration to Echo

Constraint on everything else

Users are responsible for their own data management

LHC VOs are well aware of the need to migrate

ATLAS: good progress at drawing down CASTOR disk; production use of Echo

CMS also using Echo in early-stage production

LHCb running a bit slower, but work is ongoing

Once a user says 'all clear from CASTOR', we can clean up any remaining data (there is always some)

Slide23

‘The Great Merger’

Once all 'disk-only' data has been migrated away…

Replace 4 stager instances with a single instance that supports all users

Reduce the management node requirement to <=5

Shared disk cache pool for all users

No need to merge existing stager DBs: just make a new one and re-point the interface nodes
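Operationally, the merger can be pictured as nothing more than re-pointing the interface layer at one new shared stager, rather than merging the four existing databases. The sketch below is a schematic illustration under that assumption; node and schema names are invented.

# Before: one stager instance (and disk cache pool) per major user community.
before = {
    "atlas": "stager_atlas",
    "cms": "stager_cms",
    "lhcb": "stager_lhcb",
    "gen": "stager_gen",    # ALICE / general use
}

# After: a single, freshly created stager shared by all tape users.
# Existing stager DBs are not merged; interface nodes are simply re-pointed.
shared_stager = "stager_shared"
after = {vo: shared_stager for vo in before}

def repoint_interfaces(interface_nodes, target_stager):
    """Toy stand-in for reconfiguring the SRM/XrootD interface nodes."""
    return {node: target_stager for node in interface_nodes}

print(repoint_interfaces(["srm01", "srm02", "xrootd01"], shared_stager))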

Slide24

Post-Echo CASTOR Data Flow

[Diagram: ATLAS, LHCb, CMS, and ALICE all write through a single unified tape cache in front of the tape system.]

Slide25

Issues: Contention

Potential for contention between users is introduced into the system

The disk cache needs to be relatively big to mitigate this

The issue is already in play for other system elements: tape drives are a shared resource for all users

Partitioning of the cache is possible… …but not desirable
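Why is partitioning the cache possible but not desirable? A toy capacity calculation makes the point: with one shared pool, a busy VO can borrow space that idle VOs are not using, whereas hard partitions strand that space. The numbers below are invented purely for illustration and say nothing about CASTOR's actual cache policy.

# Toy comparison of a shared disk cache vs. a hard-partitioned one.
total_cache_tb = 400
demand_tb = {"atlas": 250, "cms": 40, "lhcb": 20, "alice": 10}   # one busy VO

# Shared pool: the busy VO can use whatever the others leave free.
others = sum(v for vo, v in demand_tb.items() if vo != "atlas")
shared_usable = min(demand_tb["atlas"], total_cache_tb - others)

# Hard partitions: each VO is capped at its slice, regardless of idle space.
partition_tb = total_cache_tb // len(demand_tb)
partitioned_usable = min(demand_tb["atlas"], partition_tb)

print(f"shared pool:     ATLAS can use {shared_usable} TB")       # 250 TB
print(f"hard partitions: ATLAS can use {partitioned_usable} TB")  # 100 TB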

Slide26

Issues: Scheduling Interventions

Advantage of separate infrastructure for each user community: easy intervention scheduling

Not present when everyone shares: need to find a date that suits everyone

Difficult to mitigate

Saving grace: WLCG tape access is usually orderly, so we are able to plan with experiment data admins

Slide27

Other Improvements: Management Nodes

A change of structure is an opportunity to address other issues

The CERN CASTOR implementation uses 'cattle' headnodes: all management processes run on a set of identical nodes

Failure-tolerant

RAL will be replicating this approach

Shift from physical to virtualized infrastructure
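The 'cattle headnode' idea can be sketched as follows: every management process can run on any node in a pool of identical nodes, so losing one node just means rescheduling its processes rather than taking the service offline. This is a schematic illustration of the approach, not CERN's or RAL's actual implementation, and the process names are invented.

import itertools

# A pool of identical, interchangeable ("cattle") management nodes.
headnodes = ["head01", "head02", "head03"]
processes = ["nameserver", "stager-daemon", "scheduler", "tape-gateway"]

def schedule(procs, nodes):
    """Spread processes round-robin across whichever nodes are alive."""
    cycle = itertools.cycle(nodes)
    return {proc: next(cycle) for proc in procs}

print("normal operation:  ", schedule(processes, headnodes))

# If one node fails, the same processes are rescheduled on the survivors,
# instead of the whole service going offline (the 'pets' failure mode).
survivors = [n for n in headnodes if n != "head02"]
print("after head02 fails:", schedule(processes, survivors))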

Slide28

CASTOR Future (1)

CERN's CASTOR service is scheduled to be discontinued ~mid 2019

New product: 'CTA'¹

No more CASTOR development effort from CERN after this

¹ "An efficient, modular and simple tape archiving solution for LHC Run-3", S. Murray et al., http://iopscience.iop.org/article/10.1088/1742-6596/898/6/062013/pdf

Image from US NOAA, distributed under CC 2.0 license. https://www.flickr.com/photos/51647007@N08/5142792691

Slide29

CASTOR Future (2)

So what are we going to do?

No decision has been taken yet; all options are open

Migrating away from CASTOR will take some time, so improvements have time to bear fruit

Slide30

Conclusion

Migrating away from an old service is a project, just like making a new one!

It needs co-operation with users

It needs a fresh look at how the remaining elements will be implemented

Slide31

Any Questions?

Image by Marco Bellucci, distributed under CC 2.0 license. https://www.flickr.com/photos/marcobellucci/3534516458
