Slide 1: What goes up… …must go down
A case study from RAL on shrinking an existing storage service
Rob Appleyard
Slide 2: Introduction
Part 1
Slide 3: Introduction
- The STFC Rutherford Appleton Laboratory hosts the UK's WLCG Tier 1 Centre
- Data storage on disk & tape
- Large batch processing farm (25,000 cores)
- Tier 1 for ATLAS, CMS, LHCb, and ALICE
- Also storage for local facilities
Slide 4: CASTOR
- CERN-developed storage system*
- Running at RAL since 2007
- Currently 13 PB disk storage, 36 PB tape storage
- CERN: 240 PB of data on tape, 16 PB of disk (all cache)
- 24/7 production
- I have been RAL's CASTOR service manager since 2014
*http://castor.web.cern.ch/
Slide 5: CASTOR Users
- All 4 big LHC experiments
- Also local facilities at RAL:
  - ISIS neutron spallation source
  - Diamond Light Source
  - CEDA environmental data collaboration
- Different use case to WLCG: separate, archive-only use cases
- This talk largely concerns the WLCG users
Slide 6: Echo
- New storage system, based on Ceph, being scaled up now
- Will replace CASTOR 'disk-only' storage
- Currently 8 PB of usable disk storage
- Experiment data migration underway; hope to finish ~Q1 2019
- More information in about 3 hours' time!
Slide 7: Reasons for Migration
- Echo introduced because of long-standing CASTOR issues:
  - Staff-intensive (DBAs)
  - HEP-specific; unattractive to customers outside the HEP community
  - Performance struggling
Slide 8: CASTOR Service Types
- RAL uses CASTOR for both 'disk-only' and 'tape-backed' storage (d1t0 and d0t1)
- Disk-only: a file is guaranteed to have at least one copy on a hard disk, and will not be moved to tape
  - This is the element to be replaced by Echo
- Tape-backed: a file is guaranteed to have at least one copy on tape, and may also exist on disk
  - Disk cache to buffer tape reads and writes
- CERN (the CASTOR developers) now only use CASTOR for tape-backed storage
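The distinction between the two service types can be sketched as a small data model. This is purely an illustration of the d1t0/d0t1 semantics described above, not CASTOR's actual schema; the class and function names are my own:

```python
from dataclasses import dataclass

# Hypothetical model of the two RAL CASTOR service types.
# d1t0 = disk-only: one guaranteed disk copy, never migrated to tape.
# d0t1 = tape-backed: one guaranteed tape copy; disk acts only as a cache.
@dataclass(frozen=True)
class ServiceClass:
    name: str
    disk_copies: int     # guaranteed copies on disk
    tape_copies: int     # guaranteed copies on tape
    disk_is_cache: bool  # disk copy may be evicted once safely on tape

DISK_ONLY = ServiceClass("d1t0", disk_copies=1, tape_copies=0, disk_is_cache=False)
TAPE_BACKED = ServiceClass("d0t1", disk_copies=0, tape_copies=1, disk_is_cache=True)

def is_replaced_by_echo(sc: ServiceClass) -> bool:
    """Echo replaces only the disk-only service type."""
    return sc.tape_copies == 0

print(is_replaced_by_echo(DISK_ONLY))    # True: d1t0 moves to Echo
print(is_replaced_by_echo(TAPE_BACKED))  # False: d0t1 stays on CASTOR
```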
Slide 9: Tier 1 Data Flow Now
[Diagram: ATLAS, CMS, and LHCb each have their own disk pool and tape cache; ALICE & the other VOs use the ALICE disk and a shared 'Gen' tape cache. All four tape caches feed a single tape system.]
Slide 10: CASTOR Databases
- Everything in CASTOR is based on Oracle DBs:
  - Physical data location
  - Transaction information
  - Namespace mapping
  - Tape drive state
  - Transfer scheduling
Slide 11: Database Groupings
- 'Central services' DB
  - One DB instance for all WLCG users
  - Manages namespace
  - Manages tape interface & contents of tapes
- 'Stager' DB
  - Manages data residing on disk
  - One DB instance per major user community ('instance' = one stager DB schema)
- SRM DB
  - Provides an external interface
  - Collocated with stager
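The grouping above can be pictured as a small inventory. This is a sketch using the names from this slide (the per-community instance list follows slide 5's WLCG users), not a representation of the real Oracle schemas:

```python
# Sketch of the CASTOR database groupings described above:
# one shared 'central services' DB, plus one stager DB instance
# (with a collocated SRM DB) per major user community.
WLCG_INSTANCES = ["ATLAS", "CMS", "LHCb", "ALICE/Gen"]  # assumed instance names

databases = {
    "central_services": {
        "shared": True,
        "roles": ["namespace", "tape interface", "tape contents"],
    },
    "stager": {
        "shared": False,  # one instance per community
        "instances": WLCG_INSTANCES,
        "roles": ["data residing on disk"],
    },
    "srm": {
        "shared": False,  # collocated with each stager
        "instances": WLCG_INSTANCES,
        "roles": ["external interface"],
    },
}

# Four stager instances sit alongside the single shared central-services DB.
print(len(databases["stager"]["instances"]))  # 4
```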
Slide 12: What we have now
Part 2
Slide 13: A Picture of CASTOR (GridFTP)
[Diagram: a GridFTP client connects to the CASTOR SRM nodes (x4) and their DB, which talk to the CASTOR stager and its DB. Behind these sit the CASTOR 'Central Services' nodes (x2) with their DBs (x4), the CASTOR schedulers (x2), the CASTOR storage nodes (x54), and the 'Tape Server' nodes (x22) driving the tape drives (x22). Some components are shared; others are replicated per VO (x4).]
Slide 14: A Picture of CASTOR (XrootD)
[Diagram: an XrootD client connects to the CASTOR XrootD server, which talks to the CASTOR stager and its DB. Behind these sit the CASTOR 'Central Services' nodes (x2) with their DBs (x4), the CASTOR schedulers (x2), the CASTOR storage nodes (x54), and the 'Tape Server' nodes (x22) driving the tape drives (x22). Some components are shared; others are replicated per VO (x4).]
Slide 15: CASTOR Current State: Databases
- Two Oracle RACs are used to support CASTOR operations
  - One hosts ATLAS and our ALICE/general-use instance
  - The other hosts CMS, LHCb, and the central services
- Transaction rate: 390 Hz per RAC
- Load is strongly driven by disk-only operations
Slide 16: CASTOR Current State: Management Nodes
- AKA 'headnodes'
- Each instance has 3 dedicated management nodes and 2-4 dedicated SRM interface nodes
  - Interface nodes handle control traffic only
- Plus two shared nodes for the 'central services': name servers and the tape system
- Grand total of 25 core management nodes
- Management nodes are currently 'pets', not 'cattle'
  - One management node failure -> service offline
Slide 17: CASTOR Current State: Storage Nodes (1)
- AKA 'disk servers'
- 137 nodes
- Each node is 60-120 TB, one big RAID 6 array
- 10 Gb networking
- Peak I/O performance typically ~3 Gb/s per node, constrained by disk I/O
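The figures above imply the 10 Gb NIC is far from saturated; the bottleneck is the RAID array. A back-of-envelope calculation using the slide's numbers:

```python
# Back-of-envelope from the slide's figures: each disk server has a
# 10 Gb/s NIC but only sustains ~3 Gb/s of disk I/O at peak.
nodes = 137
nic_gbps = 10
peak_io_gbps = 3  # per node, limited by the single RAID 6 array

aggregate_gbps = nodes * peak_io_gbps
print(aggregate_gbps)           # 411 Gb/s across the whole fleet
print(peak_io_gbps / nic_gbps)  # 0.3: each NIC ~30% utilised at disk peak
```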
Slide 18: CASTOR Current State: Storage Nodes (2)
- 29 of those storage nodes are used only for tape-backed storage, caching data on its way to/from tape
- The remaining 108 are disk-only, and will be retired when the migration to Echo is complete
Slide 19: What if we do nothing?
- Disk server count drops from 137 to ~30
- Transaction rate drops to ~5% of current (or lower)
- But we still have 29 management nodes and 2 RACs
- Management nodes outnumber storage!
- Unacceptable management overhead
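To make the mismatch concrete, the slide's numbers (with the transaction rate taken from slide 15) work out roughly as follows:

```python
# The 'do nothing' scenario, using the figures quoted on the slides.
storage_nodes_now, storage_nodes_after = 137, 30  # ~30 tape-cache servers remain
management_nodes = 29           # unchanged if we do nothing
tx_rate_per_rac_hz = 390        # current rate, across two RACs
tx_fraction_remaining = 0.05    # ~5% of current load, or lower

# Roughly one management node per surviving storage node.
print(management_nodes / storage_nodes_after)
# The DB infrastructure would idle at ~19.5 Hz per RAC.
print(tx_rate_per_rac_hz * tx_fraction_remaining)
```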
Slide 20: What we are going to do
Part 3
Slide 21: Project Objectives
- Reduce node count
- Reduce management overhead
- Improve service quality
- Don't lose any data!
Slide 22: User Migration to Echo
- Constraint on everything else
- Users are responsible for their own data management
- LHC VOs well aware of the need to migrate
  - ATLAS: good progress at drawing down CASTOR disk; production use of Echo
  - CMS also using Echo in early-stage production
  - LHCb running a bit slower, but work ongoing
- Once a user says 'all clear from CASTOR', we can clean up any remaining data
  - There is always some
Slide 23: ‘The Great Merger’
- Once all ‘disk-only’ data is migrated away...
- Replace 4 stager instances with a single instance that supports all users
- Reduce the management node requirement to <=5
- Shared disk cache pool for all users
- No need to merge the existing stager DBs: just make a new one and re-point the interface nodes
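The merger amounts to a before/after transformation of the service inventory. The sketch below is illustrative only, using counts from earlier slides; the structure and names are mine, not RAL's actual configuration:

```python
# Sketch of 'The Great Merger': collapse the per-community stager
# instances into one shared instance once disk-only data has left CASTOR.
before = {
    "stager_instances": ["ATLAS", "CMS", "LHCb", "ALICE/Gen"],  # assumed names
    "core_management_nodes": 25,  # slide 16's grand total
    "tape_cache_pools": 4,        # one per community
}

after = {
    "stager_instances": ["unified"],  # a brand-new stager DB, not a merge
    "core_management_nodes": 5,       # target: <= 5
    "tape_cache_pools": 1,            # shared disk cache for all users
}

print(len(before["stager_instances"]), "->", len(after["stager_instances"]))
```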
Slide 24: Post-Echo CASTOR Data Flow
[Diagram: ATLAS, LHCb, CMS, and ALICE all feed a single unified tape cache in front of the tape system.]
Slide 25: Issues: Contention
- Potential for contention between users introduced into the system
- Disk cache needs to be relatively big to mitigate this
- Issue already in play for other system elements: tape drives are a shared resource for all users
- Partitioning of the cache is possible… but not desirable
Slide 26: Issues: Scheduling Interventions
- Advantage of separate infrastructure for each user community: easy intervention scheduling
- Not present when everyone shares: need to find a date that suits everyone
- Difficult to mitigate
- Saving grace: WLCG tape access is usually orderly, so we are able to plan with experiment data admins
Slide 27: Other Improvements: Management Nodes
- The change of structure is an opportunity to address other issues
- CERN's CASTOR implementation uses ‘cattle’ headnodes: all management processes run on a set of identical nodes
  - Failure-tolerant
- RAL will be replicating this approach
- Shift from physical to virtualized infrastructure
Slide 28: CASTOR Future (1)
Image from US NOAA, distributed under CC 2.0 license. https://www.flickr.com/photos/51647007@N08/5142792691
- The CERN CASTOR service is scheduled to be discontinued ~mid-2019
- New product: ‘CTA’ [1]
- No more CASTOR development effort from CERN after this
[1] An efficient, modular and simple tape archiving solution for LHC Run-3, S. Murray et al. http://iopscience.iop.org/article/10.1088/1742-6596/898/6/062013/pdf
Slide 29: CASTOR Future (2)
- So what are we going to do?
- No decision taken yet; all options open
- Migrating away from CASTOR will take some time
- Improvements have time to bear fruit
Slide 30: Conclusion
- Migrating away from an old service is a project, just like making a new one!
- Needs co-operation with users
- Needs a fresh look at how the remaining elements will be implemented
Slide 31: Any Questions?
Image by Marco Bellucci, distributed under CC 2.0 license. https://www.flickr.com/photos/marcobellucci/3534516458