Slide 1: New tape server software: status and plans
CASTOR face-to-face workshop, 22-23 September 2014
Eric Cano, on behalf of the CERN IT-DSS group
Slide 2: Overview
- Features for first release
- New tape server architecture
- Control and reporting flows
- Memory management and data flow
- Error handling
- Main process and sessions
- Stuck sessions and recovery
- Development methodologies and QA
- What changes in practice?
- What is still missing?
- Logical Block Protection investigation
- Release plans and potential new features
Slide 3: Features for first release
- Continuation of the push to replace legacy tape software
  - Started with the creation of the tape gateway and bridge
  - VMGR + VDQM will be next
- Drop-in replacement
  - tapeserverd consolidated into a single daemon
  - Replaces the previous stack: taped & satellites + rtcpd + tapebridged
  - Identical outside protocols (almost): stager / CLI client (readtp unchanged), VMGR/VDQM, tpstat/tpconfig
  - New labelling command (castor-tape-label)
- Keep what works: one process per session (pid listed in tpstat, as before)
- Better logs
- Latency shadowing (no impact of slow DB)
- Empty mount protection
- Result of big teamwork since the last meeting: E. Cano, S. Murray, V. Kotlyar, D. Kruse, D. Come
Slide 4: New tape server architecture
- Pipelined: based on FIFOs and threads/thread pools
  - Always fast to post to a FIFO
  - Push data blocks, reports, requests for more work
  - Each FIFO output is served by one thread (pool)
  - Simple loop: pop, use/serve the data/request, repeat
- All latencies are shadowed in the various threads
  - Keep the instruction pipeline non-empty with task prefetch
- N-way parallel disk access (as before)
- All reporting is asynchronous
- The tape thread is the central element that we want to keep busy at full speed (see the sketch below)
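A minimal C++ sketch of the FIFO + worker pattern described above (not the actual tapeserverd code; class and member names are illustrative only). Producers post quickly and never block for long; each FIFO is drained by a dedicated thread running a simple pop / execute / delete loop:

```cpp
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

class Task {
public:
  virtual ~Task() {}
  virtual void execute() = 0;   // e.g. read a file from disk, write a block to tape
};

// Thread-safe FIFO: posting is always fast, popping blocks until work arrives.
class TaskFifo {
public:
  void push(std::unique_ptr<Task> t) {
    std::lock_guard<std::mutex> lk(m_mutex);
    m_queue.push(std::move(t));
    m_cond.notify_one();
  }
  // Returns a null pointer when the end-of-work marker (nullptr) is popped.
  std::unique_ptr<Task> pop() {
    std::unique_lock<std::mutex> lk(m_mutex);
    m_cond.wait(lk, [this] { return !m_queue.empty(); });
    std::unique_ptr<Task> t = std::move(m_queue.front());
    m_queue.pop();
    return t;
  }
private:
  std::mutex m_mutex;
  std::condition_variable m_cond;
  std::queue<std::unique_ptr<Task>> m_queue;
};

// Worker loop serving one FIFO: pop, execute, delete, repeat.
void workerLoop(TaskFifo& fifo) {
  while (std::unique_ptr<Task> task = fifo.pop()) {
    task->execute();
  }                               // task deleted here by unique_ptr
}
```

A thread pool is simply several such worker loops draining the same FIFO; the tape thread is a single worker so that the drive sees one ordered stream of operations.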
Slide 5: Migration session overview
Diagram of the migration session (components and flows reconstructed from the slide):
- Migration Mount Manager (main thread): instantiates the memory manager, task injector, report packer, disk and tape threads; gives the initial kick to the task injector; waits for completion
- Memory Manager (1 thread): provides free blocks through a client queue; receives returned free blocks
- Disk Read Thread Pool (n threads): pops Disk Read Tasks from a task queue, executes and deletes them; each task gets free blocks, reads data from disk and pushes full data blocks into the data FIFO
- Tape Write Single Thread (1 thread): pops Tape Write Tasks from a task queue, executes and deletes them; each task pops a data block from the data FIFO, writes it to tape (flushing as needed), reports the result and returns the free block
- Task Injector (1 thread): triggered by requests for more work, sent on threshold; gets more work from the tape gateway, creates and pushes tasks
- Report Packer (1 thread): packs information and sends bulk reports on flush/end of session
- Global Status Reporter (1 thread): packs information for tapeserverd
Slide 6: Recall session overview
Diagram of the recall session (components and flows reconstructed from the slide):
- Recall Mount Manager (main thread): instantiates the memory manager, task injector, packer, disk and tape threads; gives the initial kick to the task injector; waits for completion
- Memory Manager (no thread): passive container of free blocks
- Tape Read Single Thread (1 thread): pops Tape Read Tasks from a task queue, executes and deletes them; each task pulls free blocks, reads data from tape and pushes full data blocks
- Disk Write Thread Pool (n threads): pops Disk Write Tasks from a task queue, executes and deletes them; each task pops data blocks from its data FIFO, writes them to disk, reports the result and returns the free blocks
- Task Injector (1 thread): triggered by requests for more work, sent on threshold; gets more work from the tape gateway, creates and pushes tasks
- Report Packer (1 thread): packs individual file reports, flush reports and the end-of-session report; sends bulk reports on threshold/end of session
- Global Status Reporter (1 thread): packs information for tapeserverd
Slide 7: Control flow
- Task injector
  - Initially called synchronously (empty mount detection)
  - Triggered by requests for more work (stored in a FIFO)
  - Gets more work from the client
  - Creates and injects tasks
  - Tasks are created, linked to each other (reader/writer couple) and injected into the tape and disk thread FIFOs
- Disk thread pool
  - Pops disk tasks, executes them, deletes them and moves to the next
- Tape thread
  - Same as disk, after initializing the session: mounting, tape identification, positioning for writing… and unmounting in the end
- The reader thread (pool) requests more work based on task FIFO content thresholds (see the sketch below)
  - Always asks for n files or m bytes (whichever comes first, configurable)
  - Asks again when half of that is still available in the task FIFO
  - Asks again one last time when the task FIFO becomes empty (last call)
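An illustrative sketch of the "request more work" policy described above (hypothetical names, not the tapeserverd implementation): a batch of up to maxFiles or maxBytes is requested, another request is posted once the task FIFO drops to half of that, and one last call is made when it runs empty:

```cpp
#include <cstdint>

struct InjectionPolicy {
  uint64_t maxFiles;   // batch sizes, taken from the configuration
  uint64_t maxBytes;

  // Called by the reader side after each task is popped from the task FIFO.
  // Returns true when a new bulk request should be posted to the task injector.
  // (A real implementation would also avoid duplicate outstanding requests.)
  bool shouldRequestMore(uint64_t filesLeftInFifo, uint64_t bytesLeftInFifo,
                         bool lastCallAlreadySent) const {
    if (filesLeftInFifo == 0 && bytesLeftInFifo == 0)
      return !lastCallAlreadySent;           // one last call on an empty FIFO
    return filesLeftInFifo <= maxFiles / 2 ||
           bytesLeftInFifo <= maxBytes / 2;  // half of the batch consumed
  }
};
```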
Slide 8: Reporting flow
- Reports to the client (file related)
  - Posted to a FIFO
  - Packed and transmitted in a separate thread
  - Sent on flush in migrations
  - Sent on thresholds in recalls
  - The end-of-session report also follows this path
- Reports to the parent process (tape/drive related)
  - Posted to a FIFO
  - Transmitted asynchronously by a separate thread
  - The parent process keeps track of the session's status and informs the VDQM and VMGR (see the sketch below)
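A minimal sketch of the report-packing idea above (illustrative names only): file-level outcomes are accumulated in memory and only sent to the client as one bulk report when a tape flush (migration) or a threshold (recall) is reached, so reporting latency never slows the data path:

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct FileReport {
  uint64_t fileId;
  bool     success;
  std::string errorMessage;   // empty on success
};

class ReportPacker {
public:
  void addReport(const FileReport& r) { m_pending.push_back(r); }

  // Called on flush (migration) or when a threshold is reached (recall).
  void sendBulkReport() {
    if (m_pending.empty()) return;
    transmitToClient(m_pending);   // one network round trip for many files
    m_pending.clear();
  }
private:
  void transmitToClient(const std::vector<FileReport>&) { /* client protocol */ }
  std::vector<FileReport> m_pending;
};
```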
Slide 9: Memory management and data flow
- Same as before: circulate a fixed number of memory blocks (size and count configurable)
- Errors can be piggy-backed on data blocks
  - The writer side always does the reporting, even for read errors
- Central memory manager
  - Migrations: actively pushes blocks for each tape write task
    - Disk read tasks pull blocks from there
    - The block is returned with data in a second FIFO
    - Data gets written to tape by the tape write task
  - Recalls: passive container
    - Tape read tasks pull memory blocks as needed
    - They are pushed to the disk write tasks (in FIFOs)
    - Disk write tasks push the data to the disk server
- Memory blocks get recycled to the memory manager after writing to disk or tape (see the sketch below)
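An illustrative sketch of the fixed-size block circulation described above (hypothetical types, not the tapeserverd classes): a configurable number of blocks is allocated once, handed out to readers, passed to writers through FIFOs, and recycled to the pool after use; an error can ride along with a block so that the writer side does all the reporting:

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct MemBlock {
  std::vector<char> data;        // fixed-size payload buffer
  size_t payloadSize = 0;        // bytes actually filled by the reader
  bool   failed = false;         // piggy-backed read error
  std::string errorMessage;

  explicit MemBlock(size_t blockSize) : data(blockSize) {}
  void reset() { payloadSize = 0; failed = false; errorMessage.clear(); }
};

// Pool holding the fixed set of blocks; size and count come from configuration.
class MemoryManager {
public:
  MemoryManager(size_t blockCount, size_t blockSize) {
    for (size_t i = 0; i < blockCount; ++i) m_free.push_back(new MemBlock(blockSize));
  }
  ~MemoryManager() { for (MemBlock* b : m_free) delete b; }

  MemBlock* getFreeBlock() {               // reader side
    if (m_free.empty()) return nullptr;    // simplified: real code would wait
    MemBlock* b = m_free.back();
    m_free.pop_back();
    return b;
  }
  void recycle(MemBlock* b) {              // writer side: after the disk/tape write
    b->reset();
    m_free.push_back(b);
  }
private:
  std::vector<MemBlock*> m_free;           // simplified: real code needs locking
};
```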
Slide 10: Error handling
- Reporting
  - Errors get logged when they happen
  - If an error happens in the reader, it gets propagated to the writer through the data path
  - The writer propagates the error to the client
- Session behaviour on error
  - Recalls: carry on for the stager, halt on error for readtp
    - Absolute positioning by blockId (stager)
    - Relative positioning by fSeq (readtp)
  - Migrations: any error ends the session
Slide 11: Main process and sessions
- The session is forked by the parent process
  - The parent process keeps track of sessions and drive statuses in a drive catalogue
  - Answers VDQM requests
  - Filters incoming requests based on drive state
  - Manages the configuration files
- The child session reports tape-related status to the parent process
  - Mounts, unmounts
  - Amount of data transferred, for the watchdog
- The parent process informs the VMGR and VDQM on behalf of the child session
- Client library completely rewritten
- Forking is actually done by a utility sub-process (forker), see the sketch below
  - No actual forking from the multithreaded parent process
- Process inventory: 1 parent process + 1 fork helper process + N session processes (at most 1 per drive)
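A minimal POSIX sketch of the fork-helper idea above (illustrative only, not the real forker): a small helper process is forked while the parent is still single-threaded; later, the multithreaded parent never calls fork() itself but instead asks the helper, over a socket pair, to fork the session processes. Error handling and child reaping are omitted, and the exec'ed program is a placeholder:

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
  int sv[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return 1;

  pid_t helper = fork();            // done before any threads are created
  if (helper == 0) {
    // --- fork helper process: stays single-threaded forever ---
    close(sv[0]);
    char cmd;
    while (read(sv[1], &cmd, 1) == 1) {
      if (cmd == 'F') {             // request to fork one session process
        pid_t session = fork();
        if (session == 0) {
          execl("/usr/bin/true", "session", (char*)nullptr); // placeholder program
          _exit(1);
        }
        write(sv[1], &session, sizeof(session));   // report the child pid back
      }
    }
    _exit(0);
  }

  // --- parent process: may now start threads, never forks directly ---
  close(sv[1]);
  char cmd = 'F';
  write(sv[0], &cmd, 1);            // ask the helper for a new session process
  pid_t sessionPid = -1;
  read(sv[0], &sessionPid, sizeof(sessionPid));
  // ... parent tracks sessionPid in its drive catalogue ...
  close(sv[0]);
  waitpid(helper, nullptr, 0);
  return 0;
}
```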
Slide 12: ZeroMQ + Protocol Buffers
- The parent/session process communication is a no-risk protocol
  - Both ends get released/deployed together
  - Can be changed at any time
  - Opportunity to experiment with new serialization methodologies (need to replace umbrello)
- This gave good results
  - Protocol Buffers provide robust serialization with little development effort
  - ZeroMQ handles many communication scenarios (see the sketch below)
- Still in finalization (issues in the watchdog communication)
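An illustrative sketch of the Protocol Buffers + ZeroMQ combination above (the message type, fields and endpoint are hypothetical, not the real tapeserverd protocol): the session serializes a status message with protobuf and ships it to the parent process over a ZeroMQ REQ socket. It requires protoc-generated code and linking against libprotobuf and libzmq:

```cpp
// session_status.proto (compiled with protoc --cpp_out=.):
//   syntax = "proto2";
//   message SessionStatus {
//     required string drive            = 1;
//     required uint64 bytesTransferred = 2;
//     optional string lastError        = 3;
//   }
#include <cstdint>
#include <string>
#include <zmq.h>
#include "session_status.pb.h"   // generated by protoc (hypothetical file name)

void reportStatusToParent(const std::string& drive, uint64_t bytes) {
  SessionStatus msg;                       // hypothetical message type
  msg.set_drive(drive);
  msg.set_bytestransferred(bytes);

  std::string wire;
  msg.SerializeToString(&wire);            // robust serialization for free

  void* ctx  = zmq_ctx_new();
  void* sock = zmq_socket(ctx, ZMQ_REQ);
  zmq_connect(sock, "ipc:///var/run/tapeserverd.sock");  // hypothetical endpoint
  zmq_send(sock, wire.data(), wire.size(), 0);

  char ack[16];
  zmq_recv(sock, ack, sizeof(ack), 0);     // wait for the parent's acknowledgement

  zmq_close(sock);
  zmq_ctx_destroy(ctx);
}
```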
Slide 13: Stuck sessions and recovery
- Stuck sessions do happen
  - RFIO problems suspected
- Currently handled by a script
  - Log-file based: no movement for a set time => kill
  - Problematic with unusually big files
- Watchdog will get more internal data
  - Too much to be logged
  - If data stops flowing for a given time => kill (see the sketch below)
- A clean-up process is launched automatically when a session is killed
- No clean-up after session failure
  - A non-stuck session failed to do its own clean-up => drive down
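A minimal sketch of the watchdog idea above (illustrative only): the session reports its transferred-bytes counter to the parent, which compares successive reports and kills the session process if the counter has not advanced for a configured amount of time. Names and the timeout handling are assumptions, not the real tapeserverd parameters:

```cpp
#include <cstdint>
#include <ctime>
#include <signal.h>
#include <sys/types.h>

class TransferWatchdog {
public:
  TransferWatchdog(pid_t sessionPid, time_t timeoutSeconds)
    : m_pid(sessionPid), m_timeout(timeoutSeconds),
      m_lastBytes(0), m_lastProgress(time(nullptr)) {}

  // Called whenever the session reports its transferred-bytes counter.
  void reportBytes(uint64_t bytesTransferred) {
    if (bytesTransferred != m_lastBytes) {
      m_lastBytes = bytesTransferred;
      m_lastProgress = time(nullptr);      // data is still flowing
    }
  }

  // Called periodically by the parent process.
  void check() const {
    if (time(nullptr) - m_lastProgress > m_timeout) {
      kill(m_pid, SIGKILL);                // stuck: kill and trigger the clean-up
    }
  }
private:
  pid_t    m_pid;
  time_t   m_timeout;
  uint64_t m_lastBytes;
  time_t   m_lastProgress;
};
```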
Slide 14: Development methodologies and QA
- Full C++, maintainable software
  - Object encapsulation for separately manageable units
  - Easy unit testing
  - Exception handling simplifies error reporting a lot
  - RAII (destructors) simplifies resource management
- Cleaner implementation of drive specifics through inheritance
  - Easy to add new models
- Hardcoding-free SCSI and tape format layers
  - Naming conventions matching the SCSI documentation
  - String error reporting for all SCSI errors
  - Very similar approach with the AUL tape format
- Unit testing (see the sketch below)
  - Allows running various scenarios systematically, on RPM build
  - Migrations, recalls, good day, bad day, full tape
  - Using fake objects for the drive and client interface
  - Easier debugging when problems can be reproduced in a unit test context
  - Tests run standalone and through valgrind and helgrind: automatic detection of memory leaks and race conditions
  - Completely brought into the CASTOR tree
- Automated system testing would be a nice addition to this setup
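An illustrative sketch of the fake-object testing approach above (hypothetical interfaces, not the real CASTOR test code): the tape drive sits behind an abstract interface so a fake implementation can stand in for real hardware, letting "good day / bad day" scenarios run on every build with Google Test:

```cpp
#include <gtest/gtest.h>
#include <stdexcept>
#include <string>

class DriveInterface {                 // hypothetical abstraction of a tape drive
public:
  virtual ~DriveInterface() {}
  virtual void mountTape(const std::string& vid) = 0;
};

class FakeDrive : public DriveInterface {   // fake object used in the tests
public:
  bool failMounts = false;                  // lets a test simulate a "bad day"
  std::string mountedVid;
  void mountTape(const std::string& vid) override {
    if (failMounts) throw std::runtime_error("simulated mount failure");
    mountedVid = vid;
  }
};

TEST(FakeDriveTest, GoodDayMountSucceeds) {
  FakeDrive drive;
  drive.mountTape("T12345");
  EXPECT_EQ("T12345", drive.mountedVid);
}

TEST(FakeDriveTest, BadDayMountThrows) {
  FakeDrive drive;
  drive.failMounts = true;
  EXPECT_THROW(drive.mountTape("T12345"), std::runtime_error);
}
```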
Slide 15: What changes in practice?
- The new logs
  - Convergence with the rest of the CASTOR logs
  - Single line at completion of the tape thread
    - Summarises the session for the tape log
  - More detailed timings
    - Will make it easier to pinpoint performance bottlenecks
  - New log parsing required
    - Should be greatly simplified as all relevant information is on a single line
- A single daemon
  - Configuration not radically changed
Slide 16: What is still missing?
- Support for Oracle libraries
- The parent process's watchdog for transfer sessions
  - Will move stuck-transfer detection from operator scripts to internal (with better precision)
- File transfer protocol switching
  - Add local file support (reliance on RFIO removed)
  - Add Xroot support
    - Switched on by configuration, instead of RFIO
    - Disk server >= 2.1.14-15 required (for the stat call)
  - Add Ceph support
    - Disk-path based switch, automatic
- Fine tuning of logs for operations
- Document the latest developments
Slide 17: Release and deployment
- Data transfers are being validated now on IBM drives
  - Oracle drives will follow, with mount support
- Some previously mentioned features are missing
- Target date for a tapeserverd-only 2.1.15 CASTOR release: end of November
- Production deployment: ~January
- Compatible with current 2.1.14 stagers
  - 2.1.14-15 on disk servers will be needed for using Xroot
- 2.1.14 is the end of the road for rtcpd/taped
Slide 18: Logical block protection
- Tests of the tape drive feature have been done by F. Nikolaidis, J. Leduc and K. Ha
- Adds a 4-byte checksum to tape blocks
  - Protects the data block during the transfer from computer memory to the tape drive
- 2 checksum algorithms in use today:
  - Reed-Solomon
  - CRC32-C
- Reed-Solomon requires 2 threads to match drive throughput
- CRC32-C can fit in a single thread (see the sketch below)
- CRC32-C is available on most recent drives
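A minimal software sketch of the CRC32-C (Castagnoli) checksum mentioned above, the 4-byte value appended to each tape block for logical block protection. This bit-by-bit version is for illustration only; production code would use a table-driven or SSE4.2 hardware implementation to keep up with drive throughput in a single thread:

```cpp
#include <cstddef>
#include <cstdint>

uint32_t crc32c(const void* buf, size_t len) {
  const uint8_t* data = static_cast<const uint8_t*>(buf);
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      // 0x82F63B78 is the reflected CRC32-C polynomial
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;   // the 4-byte checksum stored alongside the block
}
```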
Slide 19: Next tape developments
- Tapeserverd
  - Logical block protection integration
  - Support for pre-emption of sessions
- VDQM/VMGR
  - Merge of the two into a single tape resource manager
  - Simplify the interface
  - Asymmetric drive support
  - Improve scheduling (atomic tape-in-drive semantics for migrations)
    - Today, the chosen tape might not have compatible drives available, leading to migration delays
    - Remove the need for manual synchronization
  - Consider pre-emptive scheduling
    - Max out the system with background tasks (repack, verify)
    - Interrupt and make space for user sessions when they come
    - Allow over-quota for users when free drives exist
    - Leading to 100% utilisation of the drives
    - Facilitates tape server upgrades
  - Integrate the authentication part for tape (from Cupv)
Slide 20: Conclusion
- The tape server stack has been re-written and consolidated
- New features already provide improvements
  - Empty mount protection for both read and write
  - Full request and report latency shadowing
  - Better timing monitoring is already in place
- Major clean-up will allow easier development and maintenance
- More new features coming
  - Xroot/Ceph support
  - Logical block protection
  - Session pre-emption
- End of the road for rtcpd/taped
  - Will be dropped from 2.1.15 as soon as we are happy with tapeserverd in production
- More tape software consolidation around the corner: VDQM/VMGR