Appears in the Proceedings of the 5th USENIX Conference on File and Storage Technologies

Failure Trends in a Large Disk Drive Population

Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso
Google Inc.
1600 Amphitheatre Pkwy, Mountain View, CA 94043
{edpin, wolf, luiz}@google.com

Abstract

It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there ...
The information collected includes environmental factors (such as temperatures), activity levels and many of the Self-Monitoring Analysis and Reporting Technology (SMART) parameters that are believed to be good indicators of disk drive health. We mine through these data and attempt to find evidence that corroborates or contradicts many of the commonly held beliefs about how various factors can affect disk drive lifetime.

Our paper is unique in that it is based on data from a disk population size that is typically only available from vendor warranty databases, but has the depth of deployment visibility and detailed lifetime follow-up that only an end-user study can provide. Our key findings are:

- Contrary to previously reported results, we found very little correlation between failure rates and either elevated temperature or activity levels.
- Some SMART parameters (scan errors, reallocation counts, offline reallocation counts, and probational counts) have a large impact on failure probability.
- Given the lack of occurrence of predictive SMART signals on a large fraction of failed drives, it is unlikely that an accurate predictive failure model can be built based on these signals alone.

2 Background

In this section we describe the infrastructure that was used to gather and process the data used in this study, the types of disk drives included in the analysis, and information on how they are deployed.

2.1 The System Health Infrastructure

The System Health infrastructure is a large distributed software system that collects and stores hundreds of attribute-value pairs from all of Google's servers, and provides the interface for arbitrary analysis jobs to process that data.

The architecture of the System Health infrastructure is shown in Figure 1. It consists of a data collection layer, a distributed repository and an analysis framework. The collection layer is responsible for getting information from each of thousands of individual servers into a centralized repository. Different flavors of collectors exist to gather different types of data. Much of the health information is obtained from the machines directly. A daemon runs on every machine and gathers local data related to that machine's health, such as environmental parameters, utilization information of various resources, error indications, and configuration information.

Figure 1: Collection, storage, and analysis architecture.

It is imperative that this daemon's resource usage be very light, so not to interfere with the applications. One way to assure this is to have the machine-level collector poll individual machines relatively infrequently (every few minutes). Other slower changing data (such as configuration information) and data from other existing databases can be collected even less frequently than that. Most notably for this study, data regarding machine repairs and disk swaps are pulled in from another database.

The System Health database is built upon Bigtable [3], a distributed data repository widely used within Google, which itself is built upon the Google File System (GFS) [8]. Bigtable takes care of all the data layout, compression, and access chores associated with a large data store. It presents the abstraction of a 2-dimensional table of data cells, with different versions over time making up a third dimension. It is a natural fit for keeping track of the values of different variables (columns) for different machines (rows) over time. The System Health database thus retains a complete time-ordered history of the environment, utilization, error, configuration, and repair events in each machine's life.

Analysis programs run on top of the System Health database, looking at information from individual machines, or mining the data across thousands of machines. Large-scale analysis programs are typically built upon Google's Mapreduce [5] framework. Mapreduce automates the mechanisms of large-scale distributed computation (such as work distribution, load balancing, tolerance of failures), allowing the user to focus simply on the algorithms that make up the heart of the computation. The analysis pipeline used for this study consists of a Mapreduce job written in the Sawzall language and framework [15] to extract and clean up periodic SMART data and repair data related to disks, followed by a pass through R [1] for statistical analysis and final graph generation.
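The Sawzall and R programs themselves are not shown in the paper, but the shape of the pipeline described above is simple to sketch. The following is a minimal illustration in plain Python rather than the authors' Sawzall/Bigtable/R stack: the `smart_records` and `repair_events` lists stand in for the System Health repository and the repairs database, and a map/reduce pair groups periodic SMART samples per drive and joins them with disk-swap events to produce a labeled history per drive. All names and values are illustrative, not the authors' schema.

```python
from collections import defaultdict
from datetime import date

# Illustrative stand-ins for the System Health repository; in the paper the
# equivalent data live in Bigtable and are processed with Sawzall and R.
smart_records = [
    {"serial": "SN001", "date": date(2006, 3, 1), "scan_errors": 0, "realloc": 0, "temp_c": 38},
    {"serial": "SN001", "date": date(2006, 3, 8), "scan_errors": 1, "realloc": 0, "temp_c": 39},
    {"serial": "SN002", "date": date(2006, 3, 1), "scan_errors": 0, "realloc": 2, "temp_c": 35},
]
repair_events = [  # disk swaps pulled in from the repairs database
    {"serial": "SN001", "date": date(2006, 4, 20), "action": "disk_swap"},
]

def map_phase(records):
    """Emit (drive serial, SMART sample) pairs, one per periodic reading."""
    for rec in records:
        yield rec["serial"], rec

def reduce_phase(pairs, repairs):
    """Group samples per drive and attach a failure label from repair data."""
    per_drive = defaultdict(list)
    for serial, rec in pairs:
        per_drive[serial].append(rec)
    failed = {r["serial"] for r in repairs if r["action"] == "disk_swap"}
    return {
        serial: {"samples": sorted(samples, key=lambda r: r["date"]),
                 "failed": serial in failed}
        for serial, samples in per_drive.items()
    }

drive_histories = reduce_phase(map_phase(smart_records), repair_events)
for serial, hist in drive_histories.items():
    print(serial, "failed" if hist["failed"] else "ok", len(hist["samples"]), "samples")
```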
2.2 Deployment Details

The data in this study are collected from a large number of disk drives, deployed in several types of systems across all of Google's services. More than one hundred thousand disk drives were used for all the results presented here. The disks are a combination of serial and parallel ATA consumer-grade hard disk drives, ranging in speed from 5400 to 7200 rpm, and in size from 80 to 400 GB. All units in this study were put into production in or after 2001. The population contains several models from many of the largest disk drive manufacturers and from at least nine different models. The data used for this study were collected between December 2005 and August 2006.

As is common in server-class deployments, the disks were powered on, spinning, and generally in service for essentially all of their recorded life. They were deployed in rack-mounted servers and housed in professionally-managed datacenter facilities.

Before being put into production, all disk drives go through a short burn-in process, which consists of a combination of read/write stress tests designed to catch many of the most common assembly, configuration, or component-level problems. The data shown here do not include the fall-out from this phase, but instead begin when the systems are officially commissioned for use. Therefore our data should be consistent with what a regular end-user should see, since most equipment manufacturers put their systems through similar tests before shipment.

2.3 Data Preparation

Definition of Failure. Narrowly defining what constitutes a failure is a difficult task in such a large operation. Manufacturers and end-users often see different statistics when computing failures since they use different definitions for it. While drive manufacturers often quote yearly failure rates below 2% [2], user studies have seen rates as high as 6% [9]. Elerath and Shah [7] report between 15-60% of drives considered to have failed at the user site are found to have no defect by the manufacturers upon returning the unit. Hughes et al. [11] observe between 20-30% “no problem found” cases after analyzing failed drives from their study of 3477 disks.

From an end-user's perspective, a defective drive is one that misbehaves in a serious or consistent enough manner in the user's specific deployment scenario that it is no longer suitable for service. Since failures are sometimes the result of a combination of components (i.e., a particular drive with a particular controller or cable, etc), it is no surprise that a good number of drives that fail for a given user could be still considered operational in a different test harness. We have observed that phenomenon ourselves, including situations where a drive tester consistently “green lights” a unit that invariably fails in the field. Therefore, the most accurate definition we can present of a failure event for our study is: a drive is considered to have failed if it was replaced as part of a repairs procedure. Note that this definition implicitly excludes drives that were replaced due to an upgrade.

Since it is not always clear when exactly a drive failed, we consider the time of failure to be when the drive was replaced, which can sometimes be a few days after the observed failure event. It is also important to mention that the parameters we use in this study were not in use as part of the repairs diagnostics procedure at the time that these data were collected. Therefore there is no risk of false (forced) correlations between these signals and repair outcomes.
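Results throughout the paper are reported as annualized failure rates (AFR), but this excerpt does not spell out the exact formula the authors use. The sketch below assumes the common convention of dividing observed failures by accumulated drive-years, and applies the failure definition given above (replaced as part of a repairs procedure, upgrades excluded); the record layout and the numbers are hypothetical.

```python
def annualized_failure_rate(drives):
    """AFR under the assumed convention AFR = failures / accumulated drive-years.

    Each entry carries the number of days the drive was observed in the window
    and whether it was replaced as part of a repairs procedure (drives swapped
    out for upgrades are not counted as failures, per the definition above).
    """
    drive_years = sum(d["observed_days"] for d in drives) / 365.0
    failures = sum(1 for d in drives if d["replaced_in_repair"])
    return failures / drive_years if drive_years else 0.0

# Hypothetical toy population: three drives observed over a 9-month window, one failure.
population = [
    {"serial": "SN001", "observed_days": 270, "replaced_in_repair": True},
    {"serial": "SN002", "observed_days": 270, "replaced_in_repair": False},
    {"serial": "SN003", "observed_days": 200, "replaced_in_repair": False},
]
print(f"AFR = {annualized_failure_rate(population):.1%}")
```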
Filtering. With such a large number of units monitored over a long period of time, data integrity issues invariably show up. Information can be lost or corrupted along our collection pipeline. Therefore, some cleaning up of the data is necessary. In the case of missing values, the individual values are marked as not available and that specific piece of data is excluded from the detailed studies. Other records for that same drive are not discarded. In cases where the data are clearly spurious, the entire record for the drive is removed, under the assumption that one piece of spurious data draws into question other fields for the same drive. Identifying spurious data, however, is a tricky task. Because part of the goal of studying the data is to learn what the numbers mean, we must be careful not to discard too much data that might appear invalid. So we define spurious simply as negative counts or data values that are clearly impossible. For example, some drives have reported temperatures that were hotter than the surface of the sun. Others have had negative power cycles. These were deemed spurious and removed. On the other hand, we have not filtered any suspiciously large counts from the SMART signals, under the hypothesis that large counts, while improbable as ...

It is difficult for us to arrive at a meaningful numerical utilization metric given that our measurements do not provide enough detail to derive what 100% utilization might be for any given disk model. We choose instead to measure utilization in terms of weekly averages of read/write bandwidth per drive. We categorize utilization in three levels: low, medium and high, corresponding respectively to the lowest 25th percentile, 50-75th percentiles and top 75th percentile. This categorization is performed for each drive model, since the maximum bandwidths have significant variability across drive families. We note that using number of I/O operations and bytes transferred as utilization metrics provide very similar results. Figure 3 shows the impact of utilization on AFR across the different age groups.

Overall, we expected to notice a very strong and consistent correlation between high utilization and higher failure rates. However our results appear to paint a more complex picture. First, only very young and very old age groups appear to show the expected behavior. After the first year, the AFR of high utilization drives is at most moderately higher than that of low utilization drives. The three-year group in fact appears to have the opposite of the expected behavior, with low utilization drives having slightly higher failure rates than high utilization ones.

One possible explanation for this behavior is the survival of the fittest theory. It is possible that the failure modes that are associated with higher utilization are more prominent early in the drive's lifetime. If that is the case, the drives that survive the infant mortality phase are the least susceptible to that failure mode, and result in a population that is more robust with respect to variations in utilization levels.

Another possible explanation is that previous observations of high correlation between utilization and failures have been based on extrapolations from manufacturers' accelerated life experiments. Those experiments are likely to better model early life failure characteristics, and as such they agree with the trend we observe for the young age groups. It is possible, however, that longer term population studies could uncover a less pronounced effect later in a drive's lifetime.

When we look at these results across individual models we again see a complex pattern, with varying patterns of failure behavior across the three utilization levels. Taken as a whole, our data indicate a much weaker correlation between utilization levels and failures than previous work has suggested.
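The per-model utilization bucketing described above can be mimicked directly. The sketch below assumes one reasonable reading of the percentile cut-offs: "low" is a weekly average bandwidth below a model's 25th percentile, "high" is above its 75th percentile, and everything in between is "medium". The drive list, model names, and bandwidth figures are made up for illustration.

```python
from collections import defaultdict

def percentile(sorted_values, p):
    """Nearest-rank percentile on an already-sorted list (a simple convention)."""
    if not sorted_values:
        raise ValueError("empty input")
    k = max(0, min(len(sorted_values) - 1,
                   int(round(p / 100.0 * (len(sorted_values) - 1)))))
    return sorted_values[k]

def bucket_utilization(drives):
    """Label each drive low/medium/high relative to its own model's bandwidth
    distribution, since maximum bandwidths vary significantly across drive families."""
    by_model = defaultdict(list)
    for d in drives:
        by_model[d["model"]].append(d["weekly_avg_bandwidth_mb"])
    cutoffs = {}
    for model, values in by_model.items():
        values.sort()
        cutoffs[model] = (percentile(values, 25), percentile(values, 75))
    labels = {}
    for d in drives:
        p25, p75 = cutoffs[d["model"]]
        bw = d["weekly_avg_bandwidth_mb"]
        labels[d["serial"]] = "low" if bw < p25 else "high" if bw > p75 else "medium"
    return labels

# Hypothetical drives from two models with very different bandwidth ranges.
drives = [
    {"serial": f"A{i}", "model": "model-A", "weekly_avg_bandwidth_mb": bw}
    for i, bw in enumerate([5, 12, 20, 33, 41, 55, 64, 80])
] + [
    {"serial": f"B{i}", "model": "model-B", "weekly_avg_bandwidth_mb": bw}
    for i, bw in enumerate([100, 140, 160, 220, 300])
]
print(bucket_utilization(drives))
```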
Figure 3: Utilization AFR.

3.4 Temperature

Temperature is often quoted as the most important environmental factor affecting disk drive reliability. Previous studies have indicated that temperature deltas as low as 15°C can nearly double disk drive failure rates [4]. Here we take temperature readings from the SMART records every few minutes during the entire 9-month window of observation and try to understand the correlation between temperature levels and failure rates.

We have aggregated temperature readings in several different ways, including averages, maxima, fraction of time spent above a given temperature value, number of times a temperature threshold is crossed, and last temperature before failure. Here we report data on averages and note that other aggregation forms have shown similar trends and therefore suggest the same conclusions.

We first look at the correlation between average temperature during the observation period and failure. Figure 4 shows the distribution of drives with average temperature in increments of one degree and the corresponding annualized failure rates. The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend.

Figure 5 looks at the average temperatures for different age groups. The distributions are in sync with Figure 4, showing a mostly flat failure rate at mid-range temperatures and a modest increase at the low end of the temperature distribution. What stands out are the 3 and 4-year old drives, where the trend for higher failures with higher temperature is much more constant and also more pronounced.

Overall our experiments can confirm previously re...

Figure 6: AFR for scan errors.

Figure 7: AFR for reallocation counts.

Figure 8: Impact of scan errors on survival probability. Left figure shows aggregate survival probability for all drives after first scan error. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of scan errors.

The critical threshold analysis confirms what the charts visually imply: the critical threshold for scan errors is one. After the first scan error, drives are 39 times more likely to fail within 60 days than drives without scan errors.
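The "39 times more likely to fail within 60 days" statement compares two conditional probabilities: failure within 60 days of the first scan error versus failure within a 60-day window for drives that never reported one. The sketch below computes that kind of ratio from per-drive event dates; it is a plausible reconstruction of the comparison, not the authors' actual procedure, and the records are hypothetical.

```python
from datetime import date, timedelta

def failure_ratio_within(drives, window_days=60):
    """Ratio of P(fail within window after first scan error) to
    P(fail within a same-length window) for drives with no scan errors.
    For error-free drives the window is measured from observation start,
    which is an assumption of this sketch."""
    with_err = [d for d in drives if d.get("first_scan_error")]
    without_err = [d for d in drives if not d.get("first_scan_error")]

    def frac_failed(group, start_key):
        if not group:
            return 0.0
        failed = 0
        for d in group:
            start = d[start_key] if start_key else d["observation_start"]
            end = d.get("failure_date")
            if end and (end - start) <= timedelta(days=window_days):
                failed += 1
        return failed / len(group)

    p_with = frac_failed(with_err, "first_scan_error")
    p_without = frac_failed(without_err, None)
    return p_with / p_without if p_without else float("inf")

# Hypothetical records: first_scan_error / failure_date are None when absent.
drives = [
    {"observation_start": date(2006, 1, 1), "first_scan_error": date(2006, 2, 1),
     "failure_date": date(2006, 3, 15)},   # failed 42 days after first scan error
    {"observation_start": date(2006, 1, 1), "first_scan_error": date(2006, 2, 1),
     "failure_date": None},                # reported an error but survived
    {"observation_start": date(2006, 1, 1), "first_scan_error": None,
     "failure_date": None},
    {"observation_start": date(2006, 1, 1), "first_scan_error": None,
     "failure_date": date(2006, 2, 10)},   # failed without ever reporting a scan error
    {"observation_start": date(2006, 1, 1), "first_scan_error": None,
     "failure_date": None},
]
print(f"{failure_ratio_within(drives):.1f}x")
```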
3.5.2 Reallocation Counts

When the drive's logic believes that a sector is damaged (typically as a result of recurring soft errors or a hard error) it can remap the faulty sector number to a new physical sector drawn from a pool of spares. Reallocation counts reflect the number of times this has happened, and are seen as an indication of drive surface wear. About 9% of our population has reallocation counts greater than zero. Although some of our drive models show higher absolute values than others, the trends we observe are similar across all models.

As with scan errors, the presence of reallocations seems to have a consistent impact on AFR for all age groups (Figure 7), even if slightly less pronounced. Drives with one or more reallocations do fail more often than those with none. The average impact on AFR appears to be between a factor of 3-6x.

Figure 11 shows the survival probability after the first reallocation. We truncate the graph to 8.5 months, due to a drastic decrease in the confidence levels after that point. In general, as the left graph shows, about 85% of the drives survive past 8 months after the first reallocation. The effect is more pronounced (middle graph) for drives in the age ranges [10,20) and [20,60] months, while newer drives in the range [0,5) months suffer more than their next generation. This could again be due to infant mortality effects, although it appears to be less drastic in this case than for scan errors. After their first reallocation, drives are over 14 times more likely to fail within 60 days than drives without reallocation counts, making the critical threshold for this parameter also one.

Figure 12: Impact of offline reallocation on survival probability. Left figure shows aggregate survival probability for all drives after first offline reallocation. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of offline reallocations.

Figure 13: Impact of probational count values on survival probability. Left figure shows aggregate survival probability for all drives after first probational count. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of probational counts.

... therefore, can be seen as a softer error indication. It could provide earlier warning of possible problems but might also be a weaker signal, in that sectors on probation may indeed never be reallocated. About 2% of our drives had non-zero probational count values. We note that this number is lower than both online and offline reallocation counts, likely indicating that sectors may be removed from probation after further observation of their behavior. Once more, the distribution of drives with non-zero probational counts is somewhat skewed towards a subset of disk drive models.

Figures 10 and 13 show that probational count trends are generally similar to those observed for offline reallocations, with age group being somewhat less pronounced. The critical threshold for probational counts is also one: after the first event, drives are 16 times more likely to fail within 60 days than drives with zero probational counts.
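The survival plots referenced above (Figures 8 and 11-13) show the fraction of drives still in service as a function of time since their first error event. A bare-bones empirical version, ignoring the censoring corrections a proper Kaplan-Meier estimator would apply and using invented inputs, might look like this:

```python
def empirical_survival(events, horizons):
    """events: list of (days observed after first event, failed: bool) pairs.

    For each horizon h, use only drives that either failed within h days or
    were still under observation at h days (drives censored earlier are
    dropped), and return the fraction of those that survived to h.
    """
    curve = []
    for h in horizons:
        eligible = [(t, failed) for t, failed in events
                    if (failed and t <= h) or t >= h]
        if not eligible:
            curve.append(None)
            continue
        survived = sum(1 for t, failed in eligible if not (failed and t <= h))
        curve.append(survived / len(eligible))
    return curve

# Hypothetical (days after first probational count, failed?) pairs.
events = [(20, True), (45, True), (200, False), (90, True), (250, False),
          (120, False), (75, False), (300, False), (10, True), (180, False)]
horizons = [30 * m for m in (1, 2, 4, 6, 8)]
for h, s in zip(horizons, empirical_survival(events, horizons)):
    print(f"{h} days after first event: {s:.0%} still in service")
```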
3.5.5 Miscellaneous Signals

In addition to the SMART parameters described in the previous sections, which we have found to most closely impact failure rates, we have also studied several other parameters from the SMART set as well as other environmental factors. Here we briefly mention our relevant findings for some of those parameters.

Seek Errors. Seek errors occur when a disk drive fails to properly track a sector and needs to wait for another revolution to read or write from or to a sector. Drives report it as a rate, and it is meant to be used in combination with model-specific thresholds. When examining our population, we find that seek errors are widespread within drives of one manufacturer only, while others are more conservative in showing this kind of error. For this one manufacturer, the trend in seek errors is not clear, changing from one vintage to another. For other manufacturers, there is no correlation between failure rates and seek errors.

CRC Errors. Cyclic redundancy check (CRC) errors ...

Figure 14: Percentage of failed drives with SMART errors.

4 Related Work

Previous studies in this area generally fall into two categories: vendor (disk drive or storage appliance) technical papers and user experience studies. Disk vendor studies provide valuable insight into the electro-mechanical characteristics of disks and both model-based and experimental data that suggests how several environmental factors and usage activities can affect device lifetime. Yang and Sun [21] and Cole [4] describe the processes and experimental setup used by Quantum and Seagate to test new units and the models that attempt to make long-term reliability predictions based on accelerated life tests of small populations. Power-on-hours, duty cycle, and temperature are identified as the key deployment parameters that impact failure rates, each of them having the potential to double failure rates when going from nominal to extreme values. For example, Cole presents thermal de-rating models showing that MTBF could degrade by as much as 50% when going from operating temperatures of 30°C to 40°C. Cole's report also presents yearly failure rates from Seagate's warranty database, indicating a linear decrease in annual failure rates from 1.2% in the first year to 0.39% in the third (and last year of record). In our study, we did not find much correlation between failure rate and either elevated temperature or utilization. It is the most surprising result of our study. Our annualized failure rates were generally higher than those reported by vendors, and more consistent with other user experience studies.

Shah and Elerath have written several papers based on the behavior of disk drives inside Network Appliance storage products [6,7,19]. They use a reliability database that includes field failure statistics as well as support logs, and their position as an appliance vendor enables them more control and visibility into actual deployments than a typical disk drive vendor might have. Although they do not report directly on the correlation between SMART parameters or environmental factors and failures (possibly for confidentiality concerns), their work is useful in enabling a qualitative understanding of factors that affect disk drive reliability. For example, they comment that end-user failure rates can be as much as ten times higher than what the drive manufacturer might expect [7]; they report in [6] a strong experimental correlation between number of heads and higher failure rates (an effect that is also predicted by the models in [4]); and they observe that different failure mechanisms are at play at different phases of a drive lifetime [19]. Generally, our findings are in line with these results.

User experience studies may lack the depth of insight into the device inner workings that is possible in manufacturer reports, but they are essential in understanding device behavior in real-world deployments. Unfortunately, there are very few such studies to date, probably due to the large number of devices needed to observe statistically significant results and the complex infrastructure required to track failures and their contributing factors. Talagala and Patterson [20] perform a detailed error analysis of 368 SCSI disk drives over an eighteen month period, reporting a failure rate of 1.9%. Results on a larger number of desktop-class ATA drives under deployment at the Internet Archive are presented by Schwarz et al. [17]. They report on a 2% failure rate for a population of 2489 disks during 2005, while mentioning that replacement rates have been as high as 6% in the past. Gray and van Ingen [9] cite observed failure rates ranging from 3.3-6% in two large web properties with 22,400 and 15,805 disks respectively. A recent study by Schroeder and Gibson [16] helps shed light into the statistical properties of disk drive failures. The study uses failure data from several large scale deployments, including a large number of SATA drives. They report a significant overestimation of mean time to failure by manufacturers and a lack of infant mortality effects. None of these user studies have attempted to correlate failures with SMART parameters or other environmental factors.

We are aware of two groups that have attempted to correlate SMART parameters with failure statistics: Hughes et al. [11,13,14] and Hamerly and Elkan [10]. The largest populations studied by these groups were of 3744 and 1934 drives, and they derive failure models that achieve predictive rates as high as 30%, at false positive rates of about 0.2% (that false-positive rate corresponded to a number of drives between 20-43% of the drives that actually failed in their studies). Hughes et al. ...

References

[8] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, pages 29–43, December 2003.

[9] Jim Gray and Catherine van Ingen. Empirical measurements of disk failure rates and error rates. Technical Report MSR-TR-2005-166, December 2005.

[10] Greg Hamerly and Charles Elkan. Bayesian approaches to failure prediction for disk drives. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML'01), June 2001.

[11] Gordon F. Hughes, Joseph F. Murray, Kenneth Kreutz-Delgado, and Charles Elkan. Improved disk-drive failure warnings. IEEE Transactions on Reliability, 51(3):350–357, September 2002.

[12] Peter Lyman and Hal R. Varian. How much information? October 2003. http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/index.htm.

[13] Joseph F. Murray, Gordon F. Hughes, and Kenneth Kreutz-Delgado. Hard drive failure prediction using non-parametric statistical methods. Proceedings of ICANN/ICONIP, June 2003.

[14] Joseph F. Murray, Gordon F. Hughes, and Kenneth Kreutz-Delgado. Machine learning methods for predicting failures in hard drives: A multiple-instance application. J. Mach. Learn. Res., 6:783–816, 2005.

[15] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure, 13(4):227–298.

[16] Bianca Schroeder and Garth A. Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST), February 2007.

[17] Thomas Schwarz, Mary Baker, Steven Bassi, Bruce Baumgart, Wayne Flagg, Catherine van Ingen, Kobus Joste, Mark Manasse, and Mehul Shah. Disk failure investigations at the internet archive. 14th NASA Goddard, 23rd IEEE Conference on Mass Storage Systems and Technologies, May 2006.

[18] Sandeep Shah and Jon G. Elerath. Disk drive vintage and its effect on reliability. In Proceedings of the Annual Symposium on Reliability and Maintainability, pages 163–167, January 2004.

[19] Sandeep Shah and Jon G. Elerath. Reliability analysis of disk drive failure mechanisms. In Proceedings of the Annual Symposium on Reliability and Maintainability, pages 226–231, January 2005.

[20] Nisha Talagala and David Patterson. An analysis of error behavior in a large storage system. Technical Report CSD-99-1042, University of California, Berkeley, February 1999.

[21] Jimmy Yang and Feng-Bin Sun. A comprehensive review of hard-disk drive reliability. In Proceedings of the Annual Symposium on Reliability and Maintainability, pages 403–409, January 1999.
entalfactors.WeareawareoftwogroupsthathaveattemptedtocorrelateSMARTparameterswithfailurestatistics.Hughesetal[11,13,14]andHamerlyandElkan[10].Thelargestpopulationsstudiedbythesegroupswasof3744and1934drivesandtheyderivefailuremodelsthatachievepredictiveratesashighas30%,atfalseposi-tiveratesofabout0.2%(thatfalse-positiveratecorre-spondedtoanumberofdrivesbetween20-43%ofthedrivesthatactuallyfailedintheirstudies).Hughesetal. the19thACMSymposiumonOperatingSystemsPrinciples,pages29–43,December2003.[9]JimGrayandCatherinevanIngen.Empiricalmeasurementsofdiskfailureratesanderrorrates.TechnicalReportMSR-TR-2005-166,December2005.[10]GregHamerlyandCharlesElkan.Bayesianap-proachestofailurepredictionfordiskdrives.InProceedingsoftheEighteenthInternationalCon-ferenceonMachineLearning(ICML'01),June2001.[11]GordonF.Hughes,JosephF.Murray,KennethKreutz-Delgado,andCharlesElkan.Improveddisk-drivefailurewarnings.IEEETransactionsonReliability,51(3):350–357,September2002.[12]PeterLymanandHalR.Varian.Howmuchinformation?October2003.http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/index.htm.[13]JosephF.Murray,GordonFHughes,andKennethKreutz-Delgado.Harddrivefailurepredictionus-ingnon-parametricstatisticalmethods.Proceed-ingsofICANN/ICONIP,June2003.[14]JosephF.Murray,GordonF.Hughes,andKen-nethKreutz-Delgado.Machinelearningmethodsforpredictingfailuresinharddrives:Amultiple-instanceapplication.J.Mach.Learn.Res.,6:783–816,2005.[15]RobPike,SeanDorward,RobertGriesemer,andSeanQuinlan.Interpretingthedata:Parallelanal-ysiswithsawzall.ScienticProgrammingJour-nal,SpecialIssueonGridsandWorldwideCom-putingProgrammingModelsandInfrastructure,13(4):227–298.[16]BiancaSchroederandGarthA.Gibson.Diskfailuresintherealworld:Whatdoesanmttfof1,000,000hoursmeantoyou?InProceedingsofthe5thUSENIXConferenceonFileandStorageTechnologies(FAST),February2007.[17]ThomasSchwartz,MaryBaker,StevenBassi,BruceBaumgart,WayneFlagg,CatherinevanIngen,KobusJoste,MarkManasse,andMehulShah.Diskfailureinvestigationsattheinternetarchive.14thNASAGoddard,23rdIEEEConfer-enceonMassStorageSystemsandTechnologies,May2006.[18]SandeepShahandJonG.Elerath.Diskdrivevin-tageanditseffectonreliability.InProceedingsoftheAnnualSymposiumonReliabilityandMain-tainability,pages163–167,January2004.[19]SandeepShahandJonG.Elerath.Reliabilityanal-ysisofdiskdrivefailuremechanisms.InProceed-ingsoftheAnnualSymposiumonReliabilityandMaintainability,pages226–231,January2005.[20]NishaTalagalaandDavidPatterson.Ananalysisoferrorbehaviorinalargestoragesystem.Techni-calReportCSD-99-1042,UniversityofCalifornia,Berkeley,February1999.[21]JimmyYangandFeng-BinSun.Acomprehensivereviewofhard-diskdrivereliability.InProceed-ingsoftheAnnualSymposiumonReliabilityandMaintainability,pages403–409,January1999.