SARA Data Loss Incident ADC Weekly, 20.1.15

Slide1

SARA Data Loss Incident

ADC Weekly, 20.1.15

Slide2

Incident Report

Email was received from SARA admin on Friday 16.1 afternoon with list of ~500k lost files

Technical details:

The filesystem is on a RAID60, which is a multi-level disk set: it starts with three RAID6 sets of 12 disks (two in the front and one in the back of the chassis), and these sets are then aggregated at a higher level into a RAID0 array that has no redundancy of its own.

This means that when the rear SAS expander backplane failed, we lost a member (a RAID6 set of 12 disks) of the non-redundant top-level RAID0. At this point the controller marked the RAID60 as failed.
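
The failure logic described above can be sketched in a few lines. This is only an illustrative model, not the controller's actual behaviour: the class and function names (`Raid6Set`, `raid60_is_alive`, `build_array`) and the set labels are hypothetical, and the only facts taken from the slides are the layout (three RAID6 sets of 12 disks under a RAID0) plus the standard RAID6 tolerance of two failed disks per set.

```python
from dataclasses import dataclass

# Standard RAID6 property: a set survives at most two unavailable disks.
RAID6_MAX_UNAVAILABLE = 2


@dataclass
class Raid6Set:
    """Hypothetical model of one RAID6 member of the RAID60."""
    name: str
    disks: int
    unavailable: int = 0

    def is_alive(self) -> bool:
        return self.unavailable <= RAID6_MAX_UNAVAILABLE


def raid60_is_alive(sets) -> bool:
    # The top-level RAID0 has no redundancy: every member set must be alive.
    return all(s.is_alive() for s in sets)


def build_array():
    # Layout from the slides: two sets in the front, one in the back.
    return [
        Raid6Set("front-1", 12),
        Raid6Set("front-2", 12),
        Raid6Set("rear", 12),
    ]


array = build_array()

# Two failed disks in a single set are still tolerated by RAID6.
array[0].unavailable = 2
print("after 2 disk failures in front-1:", raid60_is_alive(array))

# Backplane failure: all 12 rear disks disappear at once.
array[2].unavailable = 12
print("after rear backplane failure:", raid60_is_alive(array))
```

In this model, losing the whole rear set takes down the entire array even though the 24 front disks are healthy, which matches the controller marking the RAID60 as failed.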

After the vendor replaced the rear SAS expander backplane, the missing 12 disks (one of the three RAID6 sets) were again able to receive power and be visible to the RAID controller. However, since the entire RAID60 was already marked failed, there was no option to rebuild it.

Our vendor tried to mark those 12 drives as "good", hoping that the controller would allow the old array to be re-assembled. This was not possible because the 12 drives were detected as "new", meaning that we had two configurations (24 drives from a failed array + 12 "new" drives) that could not be merged.

The pool node is still in maintenance, so we currently have a reduced storage capacity.

Slide3

File Recovery

Files were declared lost to Rucio by DDM ops, in this order:

1. Non-log DATADISK
2. SCRATCHDISK
3. Logs DATADISK

Endpoint            Files Lost   Files Recoverable*    Total
SARA DATADISK           327857                98288   426145
SARA SCRATCHDISK         39764                 6210    45974

*Files without rules (secondary) are included here but are not recovered
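
As a quick consistency check on the table, the sketch below re-adds the columns and computes the recoverable share per endpoint. It assumes that "Total" is the sum of "Files Lost" and "Files Recoverable" (the numbers are consistent with that reading, but the slides do not state it), and the variable names are illustrative only.

```python
# Figures copied from the table: (files lost, files recoverable, total).
rows = {
    "SARA DATADISK": (327857, 98288, 426145),
    "SARA SCRATCHDISK": (39764, 6210, 45974),
}

fractions = {}
for endpoint, (lost, recoverable, total) in rows.items():
    # Assumed relation between the columns, checked explicitly.
    assert lost + recoverable == total, f"columns do not add up for {endpoint}"
    fractions[endpoint] = recoverable / total
    print(f"{endpoint}: {fractions[endpoint]:.1%} of affected files recoverable")

grand_total = sum(total for _, _, total in rows.values())
print("files affected across both endpoints:", grand_total)  # 472119
```

The combined total is in line with the "~500k lost files" quoted in the incident report.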

Slide4

File Recovery

The processing of the bad files by Rucio was very fast

Workflow is much simpler than DQ2

We estimate most data has been recovered

Not possible to know without monitoring

Some lost files will be used to test prodsys2 recovery