Slide1
SARA Data Loss Incident
ADC Weekly, 20.1.15
Slide2
Incident Report
An email was received from the SARA admin on the afternoon of Friday 16.1 with a list of ~500k lost files
Technical details:
The filesystem is on a RAID60, which is a multi-level disk set: it starts with three RAID6 sets of 12 disks (two in the front and one in the back of the chassis), and these sets are then aggregated at a higher level into a RAID0 array that has no redundancy of its own.
This means that when the rear SAS expander backplane failed, we lost a member (a RAID6 set of 12 disks) of the non-redundant top-level RAID0. At this point the controller marked the RAID60 as failed.
After the vendor replaced the rear SAS expander backplane, the missing 12 disks (one of the three RAID6 sets) were again able to receive power and be visible to the RAID controller. However, since the entire RAID60 was already marked as failed, there was no option to rebuild it.
Our vendor tried to mark those 12 drives as "good", hoping that the controller would allow the old array to be re-assembled. This was not possible because the 12 drives were detected as "new", meaning that we had two configurations (24 drives from a failed array + 12 "new" drives) that could not be merged.
The pool node is still in maintenance, so we currently have a reduced storage capacity.
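The failure mode in the technical details above can be illustrated with a small sketch. It is a simplified model, not the controller's actual logic: it assumes only the layout stated on the slide (three RAID6 sets of 12 disks under a RAID0) and the standard RAID6 property of tolerating at most two failed disks per set. All function and constant names are hypothetical.

```python
# Simplified model of the RAID60 layout described on the slide.
# Names are illustrative; this is not the RAID controller's real logic.

DISKS_PER_SET = 12      # from the slide: three RAID6 sets of 12 disks
NUM_SETS = 3
RAID6_TOLERANCE = 2     # a RAID6 set survives at most two failed member disks


def raid6_set_ok(failed_in_set: int) -> bool:
    """A RAID6 set stays usable while no more than two of its disks are gone."""
    return failed_in_set <= RAID6_TOLERANCE


def raid60_ok(failed_per_set: list[int]) -> bool:
    """The top-level RAID0 has no redundancy, so every member set must be usable."""
    return all(raid6_set_ok(n) for n in failed_per_set)


def usable_data_disks() -> int:
    """Each RAID6 set gives up two disks' worth of capacity to parity."""
    return NUM_SETS * (DISKS_PER_SET - RAID6_TOLERANCE)


if __name__ == "__main__":
    # Two failed disks in every set: each RAID6 set still holds, so RAID60 holds.
    print(raid60_ok([2, 2, 2]))
    # Rear backplane failure: all 12 disks of one set vanish at once.
    print(raid60_ok([0, 0, 12]))
    print(usable_data_disks())
```

In this model the array tolerates up to six disk failures if they are spread two per set, but a single backplane failure that hides one whole set takes down the top-level RAID0.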
Slide3
File Recovery
Files were declared lost to Rucio by DDM ops, in order:
Non-log DATADISK
SCRATCHDISK
logs DATADISK
Endpoint            Files Lost    Files Recoverable*    Total
SARA DATADISK       327857        98288                 426145
SARA SCRATCHDISK    39764         6210                  45974
*Files without rules (secondary) are included here but are not recovered
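As a quick consistency check on the table, the sketch below sums the per-endpoint counts and computes the recoverable share. It assumes that "Total" is the sum of the "Files Lost" and "Files Recoverable" columns, which is how the numbers on the slide add up. The variable and function names are hypothetical.

```python
# Counts copied from the slide: (files lost, files recoverable, total).
# Assumption: total = lost + recoverable, as the slide's numbers suggest.
COUNTS = {
    "SARA DATADISK": (327857, 98288, 426145),
    "SARA SCRATCHDISK": (39764, 6210, 45974),
}


def check_rows(counts: dict[str, tuple[int, int, int]]) -> bool:
    """Return True if lost + recoverable equals the total for every endpoint."""
    return all(lost + rec == total for lost, rec, total in counts.values())


def summarize(counts: dict[str, tuple[int, int, int]]) -> tuple[int, int, int, float]:
    """Sum each column over all endpoints and compute the recoverable fraction."""
    lost = sum(row[0] for row in counts.values())
    recoverable = sum(row[1] for row in counts.values())
    total = sum(row[2] for row in counts.values())
    return lost, recoverable, total, recoverable / total


if __name__ == "__main__":
    print("rows consistent:", check_rows(COUNTS))
    lost, recoverable, total, fraction = summarize(COUNTS)
    print(f"lost={lost} recoverable={recoverable} total={total}")
    print(f"recoverable share: {fraction:.1%}")
```

The grand total of 472119 files matches the "~500k lost files" quoted in the incident report. Because files without rules are counted as recoverable but are not recovered, the recoverable share is an upper bound on what actually comes back.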
Slide4
File Recovery
The processing of the bad files by Rucio was very fast
Workflow is much simpler than DQ2
We estimate most data has been recovered
Not possible to know without monitoring
Some lost files will be used to test prodsys2 recovery