Slide1
Xrootd usage @ LHC
An up-to-date technical survey about xrootd-based storage solutions
F. Furano (CERN IT-DM)
Slide2
Outline
- Intro
- Main use cases in the storage arena
- Pure xrootd: generic
- Pure xrootd @ LHC: the Atlas@SLAC way, the Alice way
- CASTOR2
- Roadmap
- Conclusions
Slide3
Introduction and use cases
Slide4
The historical problem: data access
Physics experiments rely on rare events and statistics:
- A huge amount of data is needed to get a significant number of events
- The typical data store can reach 5-10 PB... now
- Millions of files, thousands of concurrent clients

The transaction rate is very high:
- O(10^3) file opens/sec per cluster is not uncommon (average, not peak)
- Traffic sources: local GRID site, local batch system, WAN
- Up to O(10^4) clients per server!

If these needs are not met, the outcome is crashes, instability, workarounds, and the "need" for crazy things.

What is needed is scalable, high-performance direct data access:
- No imposed limits on performance, size or connectivity
- Higher performance, supports direct data access over the WAN
- Avoids worker-node (WN) under-utilization
- No inefficient local copies when they are not needed (do we fetch entire websites to browse one page?)
Slide5
The challenges: LHC user analysis

Boundary conditions:
- GRID environment: GSI authentication, user-space deployment
- CC environment: Kerberos, admin deployment

Batch data access, via remote access protocols (RAP: root, dcap, rfio, ...):
- High I/O load, moderate namespace load
- Many clients: O(1000-10000)
- Sequential and sparse file access
- Basic analysis (today): RAW, ESD
- Advanced analysis (tomorrow): ESD, AOD, Ntuples, histograms

Interactive data access, e.g. T0/T3 @ CERN:
- Preferred interface is MFS (mounted file systems): easy, intuitive, fast response, standard applications
- Moderate I/O load, high namespace load (compilation, software startup, searches)
- Fewer clients: O(#users)
Slide6
Main requirement

Data access has to work reliably at the desired scale.
This also means that it must not waste resources.
Slide7
A simple use case

I am a physicist, waiting for the results of my analysis jobs:
- Many bunches, several outputs, saved e.g. to an SE at CERN
- My laptop is configured to show histograms etc. with ROOT
- I leave for a conference; the jobs finish while I am on the plane
- Once there, I want to simply draw the results from my home directory
- Once there, I want to save my new histos in the same place

I have no time to lose tweaking things to get a copy of everything; I would lose the copies in the confusion.
I want to leave things where they are. I know nothing about things to tweak.
What can I expect? Can I do it?
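With direct data access the histograms can simply be opened from the remote SE with ROOT, with no local copy. A minimal sketch of what the laptop would run; the host, file path and histogram name below are illustrative placeholders, not real endpoints:

  #include "TFile.h"
  #include "TH1.h"

  // Open a remote file over the xroot protocol and draw one histogram.
  // "se.cern.ch" and the path/object names are placeholders.
  void draw_remote() {
     TFile *f = TFile::Open("root://se.cern.ch//store/user/me/results.root");
     if (!f || f->IsZombie()) { printf("open failed\n"); return; }
     TH1 *h = nullptr;
     f->GetObject("h_mass", h);   // assumed histogram name
     if (h) h->Draw();
  }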
F. Furano, A. Hanushevsky - Scalla/xrootd WAN globalization tools: where we are. (CHEP09)
Slide8
Another use case: ALICE analysis on the GRID

Each job reads ~100-150 MB from ALICE::CERN::SE:
- These are conditions data accessed directly, not file copies
- I.e. very efficient: a job reads only what it needs; it just works, no workarounds
- At 10-20 MB/s it takes 5-10 s (the most common case)
- At 5 MB/s it takes 20 s; at 1 MB/s it takes 100 s

Sometimes the data are accessed elsewhere:
- AliEn can save a job by making it read data from a different site, with very good performance
- Quite often the results are written/merged elsewhere
F. Furano, A. Hanushevsky - Scalla/xrootd WAN globalization tools: where we are. (CHEP09)
Slide9
Pure Xrootd
Slide10
xrootd plugin architecture

Plugin layers:
- Protocol driver (XRD)
- Protocol, 1 of n (xrootd)
- File system (ofs, sfs, alice, etc.)
- Authentication (gsi, krb5, etc.)
- Authorization (name based)
- Clustering (cmsd)
- Storage system (oss, drm/srm, etc.)
- lfn2pfn prefix encoding
Slide11
The client side

- Fault tolerance in data access: meets WAN requirements, reduces job mortality
- Connection multiplexing (authenticated sessions):
  - Up to 65536 parallel r/w requests at once per client process
  - Up to 32767 open files per client process
  - Opens bunches of up to O(1000) files at once, in parallel
- Full support for huge bulk prestages
- Smart r/w caching: supports normal read-ahead and "informed prefetching"
- Asynchronous background writes: boost writing performance in LAN/WAN
- Sophisticated integration with ROOT: the "right" chunks are read in advance while the application computes on the preceding ones, boosting read performance in LAN/WAN (up to the same order)
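On the ROOT side this read-ahead can also be steered by the application through the tree cache, which prefetches the needed baskets during the event loop. A minimal sketch; the URL, tree name and cache size are assumed placeholders:

  #include "TFile.h"
  #include "TTree.h"

  // Open a remote file over xroot and let ROOT prefetch baskets while the
  // event loop computes. URL and tree name are illustrative placeholders.
  void read_with_cache() {
     TFile *f = TFile::Open("root://server.example//data/sample.root");
     if (!f || f->IsZombie()) return;
     TTree *t = nullptr;
     f->GetObject("Events", t);
     if (!t) { delete f; return; }
     t->SetCacheSize(30 * 1024 * 1024);   // 30 MB tree cache
     t->AddBranchToCache("*", kTRUE);     // prefetch the branches read below
     Long64_t n = t->GetEntries();
     for (Long64_t i = 0; i < n; ++i)
        t->GetEntry(i);                   // compute on each entry here
     delete f;
  }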
Slide12
The Xrootd "protocol"

The XRootD protocol is a good one: efficient, clean, supports fault tolerance, etc.
It doesn't do any magic, however:
- It does not multiply your resources
- It does not overcome hardware bottlenecks
- BUT it allows the true usage of the hardware resources

One of the aims of the project is still software quality, in the carefully crafted pieces of software which come with the distribution.

What makes the difference with Scalla/XRootD is:
- Implementation details (performance + robustness); bad performance can hurt robustness, and vice versa
- Software architecture (scalability + performance + robustness), designed to fit the HEP requirements; you need a clean design to insert it into
- Born with efficient direct access in mind, but with the requirements of high-performance computing; copy-like access becomes a particular case
Slide13
Pure Xrootd @ LHC
Slide14
The Atlas@SLAC way with XROOTD

- Pure Xrootd + an Xrootd-based "filesystem" extension (FUSE)
- Adapters to talk to BestMan SRM and GridFTP
- More details in A. Hanushevsky's talk @ CHEP09
Diagram: a Scalla cluster (xrootd / cmsd / cnsd data servers plus a directory node) is exported through FUSE adapters; GridFTP and SRM sit behind the firewall and serve the GRID clients.
Slide15
The ALICE way with XROOTD

- Pure Xrootd + the ALICE strong authz plugin. No difference between T1 and T2 (only size and QoS)
- WAN-wide globalized deployment, very efficient direct data access
- CASTOR at Tier-0 serving data, pure Xrootd serving conditions data to the GRID jobs
- "Old" DPM+Xrootd in several Tier-2s
A globalized cluster: the Xrootd sites (e.g. GSI, CERN, any other site, each with its own xrootd/cmsd) are federated under the ALICE global redirector. Local clients work normally at each site. Missing a file? Ask the global redirector, get redirected to the right collaborating cluster, and fetch it. Immediately. A smart client could point directly at the global redirector. The result is a virtual Mass Storage System, built on data globalization.

More details and complete info in "Scalla/Xrootd WAN globalization tools: where we are." @ CHEP09
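As a sketch of what "a smart client could point here" means: a job can open its input through the global redirector and let the federation locate the file. The hostname and path below are placeholders, not the real ALICE endpoint:

  #include "TFile.h"

  // Open a file through a global redirector; the client is transparently
  // redirected to whichever collaborating site actually holds the file.
  void open_via_global_redirector() {
     TFile *f = TFile::Open("root://global-redirector.example//alice/data/run/file.root");
     if (f && !f->IsZombie()) {
        // ... read objects as usual; the redirections are invisible here
     }
     delete f;
  }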
Slide16
CASTOR2: putting everything together @ Tier-0/1s
Slide17
The CASTOR way

- The client connects to a redirector node
- The redirector asks CASTOR where the file is
- The client then connects directly to the disk server holding the data
- CASTOR handles the tapes in the back
Diagram: the client asks the redirector to "open file X"; the redirector asks CASTOR "where is X?", gets "on C", and replies "go to C"; the client then talks directly to disk server C, while CASTOR triggers tape migration/recall on the backend.

Credits: S. Ponce (IT-DM)
Slide18
CASTOR 2.1.8: improving latency - read

First focus: file (read) open latencies.

Chart (October 2008): read-open latencies on a logarithmic scale (1-1000 ms) for Castor 2.1.7 (rfio), Castor 2.1.8 (xroot) and Castor 2.1.9 (xroot, estimate), compared against the network latency limit.

Credits: A. Peters (IT-DM)
Slide19
CASTOR 2.1.8: improving latency - metadata read

Next focus: metadata (read) latencies.

Chart (October 2008): stat latencies on a logarithmic scale (1-1000 ms) for Castor 2.1.7, Castor 2.1.8 and Castor 2.1.9 (estimate), compared against the network latency limit.

Credits: A. Peters (IT-DM)
Slide20
Prototype architecture: XCFS overview - xroot + FUSE

Diagram, client side: a generic application gets POSIX access to /xcfs through glibc, the VFS and /dev/fuse; the xcfsd daemon (libfuse low-level implementation) maps these calls onto the XROOT POSIX library (libXrdPosix) and the XROOT client library (libXrdClient).

Diagram, server side: xrootd server daemons with security plugins (libXrdSec<plugin>, libXrdSecUnix) run on the disk servers (data filesystem, e.g. XFS) and on the MD server (name space provider, metadata filesystem, libXrdCatalogOfs/libXrdCatalogFs), with libXrdCatalogAuthz as the strong auth plugin issuing capabilities. ROOT plugs into the same remote access protocol directly.

Credits: A. Peters (IT-DM)
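Since the FUSE mount exposes a plain POSIX namespace, any unmodified application can read from it. A minimal sketch, assuming /xcfs is the mount point as in the diagram; the rest of the path is an illustrative placeholder:

  #include <cstdio>
  #include <fstream>
  #include <string>

  // Ordinary file access through the FUSE mount point: no xrootd-specific code.
  int main() {
     std::ifstream in("/xcfs/some/dir/notes.txt");   // placeholder path under /xcfs
     if (!in) { std::fprintf(stderr, "cannot open file\n"); return 1; }
     std::string line;
     while (std::getline(in, line))
        std::printf("%s\n", line.c_str());
     return 0;
  }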
Slide21
Early prototype evaluation: metadata performance

- File creation*: ~1.000/s
- File rewrite: ~2.400/s
- File read: ~2.500/s
- Rm: ~3.000/s
- Readdir/stat access: Σ = 70.000/s

* These values were measured by executing shell commands on 216 mount clients. Creation performance decreases as the namespace fills on a spinning medium. Using an XFS filesystem over a DRBD block device in a high-availability setup, file creation performance stabilizes at 400/s (20 Mio files in the namespace).

Credits: A. Peters (IT-DM)
Slide22
Network usage (or waste!)

- Network traffic is an important factor: it has to match the ratio IO(CPU server) / IO(disk server)
- Too much unneeded traffic means fewer clients supported (a serious bottleneck: 1 client works well, 100-1000 clients do not work at all)
- Lustre doesn't disable read-ahead during forward-seeking access and transfers the complete file if reads are found in the buffer cache (the read-ahead window starts at 1 MB and scales up to 40 MB)
- The XCFS/Lustre/NFS4 network volume without read-ahead is based on 4 KB pages in Linux: most requests are not page aligned and cause additional pages to be transferred (avg. read size 4 KB), hence they transfer twice as much data (but XCFS can skip this now!)
- A 2nd execution plays no real role for analysis, since datasets are usually bigger than the client buffer cache

Credits: A. Peters (IT-DM) - ACAT2008
Slide23
Why is that useful? CASTOR 2.1.8-6: cross pool redirection

- Users can access data by LFN, without specifying the stager
- Users are automatically directed to "their" pool with write permissions
Diagram: each stager pool (e.g. a T3 stager and a T0 stager, each with an xrootd/cmsd manager and its disk servers) subscribes to a meta manager, which also hosts the name space.

Example configuration:
- T3 pool subscribed: r/w for /castor/user, r/w for /castor/cms/user/
- T0 pool subscribed: ro for /castor, ro for /castor/cms/data

There are even more possibilities if a part of the namespace can be assigned to individual pools for write operations.

Credits: A. Peters (IT-DM)
Slide24
Towards a production version: further improvements - security

- GSI/VOMS authentication plugin prototype developed
  - Based on pure OpenSSL, additionally using code from mod_ssl & libgridsite
  - Significantly faster than the GLOBUS implementation
- After the security workshop with A. Hanushevsky, a virtual socket layer was introduced into the xrootd authentication plugin base, to allow socket-oriented authentication over the xrootd protocol layer
- The final version should be based on OpenSSL and the VOMS library
Slide25
The roadmap
Slide26
XROOT roadmap @ CERN

- XROOT is strategic for scalable analysis support with CASTOR at CERN / T1s
  - Other file access protocols will be supported until they become obsolete
- CASTOR
  - Secure RFIO has been released in 2.1.8; the deployment impact in terms of CPU may be significant
  - Secure XROOT is the default in 2.1.8 (Kerberos or X509); a lower CPU cost than rfio is expected, thanks to the session model
  - No plans to provide unauthenticated access via XROOT
Slide27
XROOTD roadmap

- CASTOR
  - Secure RFIO has been released in 2.1.8; the deployment impact in terms of CPU may be significant
  - Secure XROOT is the default in 2.1.8 (Kerberos or X509); a lower CPU cost than rfio is expected, thanks to the session model
  - No plans to provide unauthenticated access via XROOT
- DPM
  - Support for authentication via xrootd is scheduled; certification starts at the beginning of July
- dCache
  - Relies on a custom full re-implementation of the XROOTD protocol
  - The protocol docs have been updated by A. Hanushevsky
  - In contact with the CASTOR/DPM teams to add authentication/authorisation on the server side
  - Evaluating a common client plug-in / security protocol
Slide28
Conclusion

- A very dense roadmap, with many, many technical details
- Heading for solid, high-performance data access, for production and analysis
- More advanced user analysis scenarios
- Need to match existing architectures, protocols and workarounds
Slide29
Thank you
Questions?