1600 Amphitheatre Pkwy, Mountain View, CA 94043
{edpin,wolf,luiz}@google.com

Abstract

It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there [...]
"Appears in the Proceedings of the th USE..."
The information collected includes environmental factors (such as temperatures), activity levels and many of the Self-Monitoring Analysis and Reporting Technology (SMART) parameters that are believed to be good indicators of disk drive health. We mine through these data and attempt to find evidence that corroborates or contradicts many of the commonly held beliefs about how various factors can affect disk drive lifetime.

Our paper is unique in that it is based on data from a disk population size that is typically only available from vendor warranty databases, but has the depth of deployment visibility and detailed lifetime follow-up that only an end-user study can provide. Our key findings are:

• Contrary to previously reported results, we found very little correlation between failure rates and either elevated temperature or activity levels.
• Some SMART parameters (scan errors, reallocation counts, offline reallocation counts, and probational counts) have a large impact on failure probability.
• Given the lack of occurrence of predictive SMART signals on a large fraction of failed drives, it is unlikely that an accurate predictive failure model can be built based on these signals alone.

2 Background

In this section we describe the infrastructure that was used to gather and process the data used in this study, the types of disk drives included in the analysis, and information on how they are deployed.

2.1 The System Health Infrastructure

The System Health infrastructure is a large distributed software system that collects and stores hundreds of attribute-value pairs from all of Google's servers, and provides the interface for arbitrary analysis jobs to process that data.

The architecture of the System Health infrastructure is shown in Figure 1. It consists of a data collection layer, a distributed repository and an analysis framework. The collection layer is responsible for getting information from each of thousands of individual servers into a centralized repository. Different flavors of collectors exist to gather different types of data. Much of the health information is obtained from the machines directly. A daemon runs on every machine and gathers local data related to that machine's health, such as environmental parameters, utilization information of various resources, error indications, and configuration information. It is imperative that this daemon's resource usage be very light, so not to interfere with the applications. One way to assure this is to have the machine-level collector poll individual machines relatively infrequently (every few minutes). Other slower changing data (such as configuration information) and data from other existing databases can be collected even less frequently than that. Most notably for this study, data regarding machine repairs and disk swaps are pulled in from another database.

Figure 1: Collection, storage, and analysis architecture.

The System Health database is built upon Bigtable [3], a distributed data repository widely used within Google, which itself is built upon the Google File System (GFS) [8]. Bigtable takes care of all the data layout, compression, and access chores associated with a large data store. It presents the abstraction of a 2-dimensional table of data cells, with different versions over time making up a third dimension. It is a natural fit for keeping track of the values of different variables (columns) for different machines (rows) over time. The System Health database thus retains a complete time-ordered history of the environment, utilization, error, configuration, and repair events in each machine's life.

Analysis programs run on top of the System Health database, looking at information from individual machines, or mining the data across thousands of machines. Large-scale analysis programs are typically built upon Google's Mapreduce [5] framework. Mapreduce automates the mechanisms of large-scale distributed computation (such as work distribution, load balancing, tolerance of failures), allowing the user to focus simply on the algorithms that make up the heart of the computation.

The analysis pipeline used for this study consists of a Mapreduce job written in the Sawzall language and framework [15] to extract and clean up periodic SMART data and repair data related to disks, followed by a pass through R [1] for statistical analysis and final graph generation.

2.2 Deployment Details

The data in this study are collected from a large number of disk drives, deployed in several types of systems across all of Google's services. More than one hundred thousand disk drives were used for all the results presented here. The disks are a combination of serial and parallel ATA consumer-grade hard disk drives, ranging in speed from 5400 to 7200 rpm, and in size from 80 to 400 GB. All units in this study were put into production in or after 2001. The population contains several models from many of the largest disk drive manufacturers and from at least nine different models. The data used for this study were collected between December 2005 and August 2006.

As is common in server-class deployments, the disks were powered on, spinning, and generally in service for essentially all of their recorded life. They were deployed in rack-mounted servers and housed in professionally-managed datacenter facilities.

Before being put into production, all disk drives go through a short burn-in process, which consists of a combination of read/write stress tests designed to catch many of the most common assembly, configuration, or component-level problems. The data shown here do not include the fall-out from this phase, but instead begin when the systems are officially commissioned for use. Therefore our data should be consistent with what a regular end-user should see, since most equipment manufacturers put their systems through similar tests before shipment.

2.3 Data Preparation

Definition of Failure. Narrowly defining what constitutes a failure is a difficult task in such a large operation. Manufacturers and end-users often see different statistics when computing failures since they use different definitions for it. While drive manufacturers often quote yearly failure rates below 2% [2], user studies have seen rates as high as 6% [9]. Elerath and Shah [7] report between 15-60% of drives considered to have failed at the user site are found to have no defect by the manufacturers upon returning the unit. Hughes et al. [11] observe between 20-30% "no problem found" cases after analyzing failed drives from their study of 3477 disks.

From an end-user's perspective, a defective drive is one that misbehaves in a serious or consistent enough manner in the user's specific deployment scenario that it is no longer suitable for service. Since failures are sometimes the result of a combination of components (i.e., a particular drive with a particular controller or cable, etc), it is no surprise that a good number of drives that fail for a given user could be still considered operational in a different test harness. We have observed that phenomenon ourselves, including situations where a drive tester consistently "green lights" a unit that invariably fails in the field. Therefore, the most accurate definition we can present of a failure event for our study is: a drive is considered to have failed if it was replaced as part of a repairs procedure. Note that this definition implicitly excludes drives that were replaced due to an upgrade.

Since it is not always clear when exactly a drive failed, we consider the time of failure to be when the drive was replaced, which can sometimes be a few days after the observed failure event. It is also important to mention that the parameters we use in this study were not in use as part of the repairs diagnostics procedure at the time that these data were collected. Therefore there is no risk of false (forced) correlations between these signals and repair outcomes.

Filtering. With such a large number of units monitored over a long period of time, data integrity issues invariably show up. Information can be lost or corrupted along our collection pipeline. Therefore, some cleaning up of the data is necessary. In the case of missing values, the individual values are marked as not available and that specific piece of data is excluded from the detailed studies. Other records for that same drive are not discarded. In cases where the data are clearly spurious, the entire record for the drive is removed, under the assumption that one piece of spurious data draws into question other fields for the same drive. Identifying spurious data, however, is a tricky task. Because part of the goal of studying the data is to learn what the numbers mean, we
must be careful not to discard too much data that might appear invalid. So we define spurious simply as negative counts or data values that are clearly impossible. For example, some drives have reported temperatures that were hotter than the surface of the sun. Others have had negative power cycles. These were deemed spurious and removed. On the other hand, we have not filtered any suspiciously large counts from the SMART signals, under the hypothesis that large counts, while improbable as [...]

It is difficult for us to arrive at a meaningful numerical utilization metric given that our measurements do not provide enough detail to derive what 100% utilization might be for any given disk model. We choose instead to measure utilization in terms of weekly averages of read/write bandwidth per drive. We categorize utilization in three levels: low, medium and high, corresponding respectively to the lowest 25th percentile, 50-75th percentiles and top 75th percentile. This categorization is performed for each drive model, since the maximum bandwidths have significant variability across drive families. We note that using number of I/O operations and bytes transferred as utilization metrics provides very similar results. Figure 3 shows the impact of utilization on AFR across the different age groups.

Overall, we expected to notice a very strong and consistent correlation between high utilization and higher failure rates. However, our results appear to paint a more complex picture. First, only very young and very old age groups appear to show the expected behavior. After the first year, the AFR of high utilization drives is at most moderately higher than that of low utilization drives. The three-year group in fact appears to have the opposite of the expected behavior, with low utilization drives having slightly higher failure rates than high utilization ones.

One possible explanation for this behavior is the survival of the fittest theory. It is possible that the failure modes that are associated with higher utilization are more prominent early in the drive's lifetime. If that is the case, the drives that survive the infant mortality phase are the least susceptible to that failure mode, and result in a population that is more robust with respect to variations in utilization levels.

Another possible explanation is that previous observations of high correlation between utilization and failures have been based on extrapolations from manufacturers' accelerated life experiments. Those experiments are likely to better model early life failure characteristics, and as such they agree with the trend we observe for the young age groups. It is possible, however, that longer term population studies could uncover a less pronounced effect later in a drive's lifetime.

When we look at these results across individual models we again see a complex pattern, with varying patterns of failure behavior across the three utilization levels. Taken as a whole, our data indicate a much weaker correlation between utilization levels and failures than previous work has suggested.

Figure 3: Utilization AFR

3.4 Temperature

Temperature is often quoted as the most important environmental factor affecting disk drive reliability. Previous studies have indicated that temperature deltas as low as 15°C can nearly double disk drive failure rates [4]. Here we take temperature readings from the SMART records every few minutes during the entire 9-month window of observation and try to understand the correlation between temperature levels and failure rates.

We have aggregated temperature readings in several different ways, including averages, maxima, fraction of time spent above a given temperature value, number of times a temperature threshold is crossed, and last temperature before failure. Here we report data on averages and note that other aggregation forms have shown similar trends and therefore suggest the same conclusions.

We first look at the correlation between average temperature during the observation period and failure. Figure 4 shows the distribution of drives with average temperature in increments of one degree and the corresponding annualized failure rates. The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend.

Figure 5 looks at the average temperatures for different age groups. The distributions are in sync with Figure 4, showing a mostly flat failure rate at mid-range temperatures and a modest increase at the low end of the temperature distribution. What stands
out are the 3 and 4-year old drives, where the trend for higher failures with higher temperature is much more constant and also more pronounced.

Overall our experiments can confirm previously re- [...]

Figure 6: AFR for scan errors.

Figure 7: AFR for reallocation counts.

Figure 8: Impact of scan errors on survival probability. Left figure shows aggregate survival probability for all drives after first scan error. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of scan errors.

The critical threshold analysis confirms what the charts visually imply: the critical threshold for scan errors is one. After the first scan error, drives are 39 times more likely to fail within 60 days than drives without scan errors.

3.5.2 Reallocation Counts

When the drive's logic believes that a sector is damaged (typically as a result of recurring soft errors or a hard error) it can remap the faulty sector number to a new physical sector drawn from a pool of spares. Reallocation counts reflect the number of times this has happened, and is seen as an indication of drive surface wear. About 9% of our population has reallocation counts greater than zero. Although some of our drive models show higher absolute values than others, the trends we observe are similar across all models.

As with scan errors, the presence of reallocations seems to have a consistent impact on AFR for all age groups (Figure 7), even if slightly less pronounced. Drives with one or more reallocations do fail more often than those with none. The average impact on AFR appears to be between a factor of 3-6x.

Figure 11 shows the survival probability after the first reallocation. We truncate the graph to 8.5 months, due to a drastic decrease in the confidence levels after that point. In general, the left graph shows, about 85% of the drives survive past 8 months after the first reallocation. The effect is more pronounced (middle graph) for drives in the age ranges [10,20) and [20,60] months, while newer drives in the range [0,5) months suffer more than their next generation. This could again be due to infant mortality effects, although it appears to be less drastic in this case than for scan errors.

After their first reallocation, drives are over 14 times more likely to fail within 60 days than drives without reallocation counts, making the critical threshold for this parameter also one.

Figure 12: Impact of offline reallocation on survival probability. Left figure shows aggregate survival probability for all drives after first offline reallocation. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of offline reallocation.

Figure 13: Impact of probational count values on survival probability. Left figure shows aggregate survival probability for all drives after first probational count. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of probational counts.

[...] therefore, can be seen as a softer error indication. It could provide earlier warning of possible problems but might also be a weaker signal, in that sectors on probation may indeed never be reallocated. About 2% of our drives had non-zero probational count values. We note that this number is lower than both online and offline reallocation counts, likely indicating that sectors may be removed from probation after further observation of their behavior. Once more, the distribution of drives with non-zero probational counts are somewhat skewed towards a subset of disk drive models.

Figures 10 and 13 show that probational count trends are generally similar to those observed for offline reallocations, with age group being somewhat less pronounced. The critical threshold for probational counts is also one: after the first event, drives are 16 times more likely to fail within 60 days than drives with zero probational counts.

3.5.5 Miscellaneous Signals

In addition to the SMART parameters described in the previous sections, which we have found to most closely impact failure rates, we have also studied several other parameters from the SMART set as well as other environmental factors. Here we briefly mention our relevant findings for some of those parameters.

Seek Errors. Seek errors occur when a disk drive fails to properly track a sector and needs to wait for another revolution to read or write from or to a sector. Drives report it as a rate, and it is meant to be used in combination with model-specific thresholds. When examining our population, we find that seek errors are widespread within drives of one manufacturer only, while others are more conservative in showing this kind of errors. For this one manufacturer, the trend in seek errors is not clear, changing from one vintage to another. For other manufacturers, there is no correlation between failure rates and seek errors.

CRC Errors. Cyclic redundancy check (CRC) errors [...]

Figure 14: Percentage of failed drives with SMART errors.

4 Related Work

Previous studies in this area generally fall into two categories: vendor (disk drive or storage appliance) technical papers and user experience studies. Disk vendors studies provide valuable insight into the electromechanical characteristics of disks and both model-based and experimental data that suggests how several environmental factors and usage activities can affect device lifetime. Yang and Sun [21] and Cole [4] describe the processes and experimental setup used by Quantum and Seagate to test new units and the models that attempt to make long-term reliability predictions based on accelerated life tests of small populations. Power-on-hours, duty cycle, temperature are identified as the key deployment parameters that impact failure rates, each of them having the potential to double failure rates when going from nominal to extreme values. For example, Cole presents thermal de-rating models showing that MTBF could degrade by as much as 50% when going from operating temperatures of 30°C to 40°C. Cole's report also presents yearly failure rates from Seagate's warranty database, indicating a linear decrease in annual failure rates from 1.2% in the first year to 0.39% in the third (and last year of record). In our study, we did not find much correlation between failure rate and either elevated temperature or utilization. It is the most surprising result of our study. Our annualized failure rates were generally higher than those reported by vendors, and more consistent with other user experience studies.

Shah and Elerath have written several papers based on the behavior of disk drives inside Network Appliance storage products [6, 7, 19]. They use a reliability database that includes field failure statistics as well as support logs, and their position as an appliance vendor enables them more control and visibility into actual deployments than a typical disk drive vendor might have. Although they do not report directly on the correlation between SMART parameters or environmental factors and failures (possibly for confidentiality concerns), their work is useful in enabling a
qualitative understanding of factors that affect disk drive reliability. For example, they comment that end-user failure rates can be as much as ten times higher than what the drive manufacturer might expect [7]; they report in [6] a strong experimental correlation between number of heads and higher failure rates (an effect that is also predicted by the models in [4]); and they observe that different failure mechanisms are at play at different phases of a drive lifetime [19]. Generally, our findings are in line with these results.

User experience studies may lack the depth of insight into the device inner workings that is possible in manufacturer reports, but they are essential in understanding device behavior in real-world deployments. Unfortunately, there are very few such studies to date, probably due to the large number of devices needed to observe statistically significant results and the complex infrastructure required to track failures and their contributing factors.

Talagala and Patterson [20] perform a detailed error analysis of 368 SCSI disk drives over an eighteen month period, reporting a failure rate of 1.9%. Results on a larger number of desktop-class ATA drives under deployment at the Internet Archive are presented by Schwarz et al. [17]. They report on a 2% failure rate for a population of 2489 disks during 2005, while mentioning that replacement rates have been as high as 6% in the past. Gray and van Ingen [9] cite observed failure rates ranging from 3.3-6% in two large web properties with 22,400 and 15,805 disks respectively. A recent study by Schroeder and Gibson [16] helps shed light into the statistical properties of disk drive failures. The study uses failure data from several large scale deployments, including a large number of SATA drives. They report a significant overestimation of mean time to failure by manufacturers and a lack of infant mortality effects. None of these user studies have attempted to correlate failures with SMART parameters or other environmental factors.

We are aware of two groups that have attempted to correlate SMART parameters with failure statistics: Hughes et al. [11, 13, 14] and Hamerly and Elkan [10]. The largest populations studied by these groups were of 3744 and 1934 drives, and they derive failure models that achieve predictive rates as high as 30%, at false positive rates of about 0.2% (that false-positive rate corresponded to a number of drives between 20-43% of the drives that actually failed in their studies). Hughes et al. [...]

[...] the 19th ACM Symposium on Operating Systems Principles, pages 29-43, December 2003.

[9] Jim Gray and Catherine van Ingen. Empirical measurements of disk failure rates and error rates. Technical Report MSR-TR-2005-166, December 2005.

[10] Greg Hamerly and Charles Elkan. Bayesian approaches to failure prediction for disk drives. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML'01), June 2001.

[11] Gordon F. Hughes, Joseph F. Murray, Kenneth Kreutz-Delgado, and Charles Elkan. Improved disk-drive failure warnings. IEEE Transactions on Reliability, 51(3):350-357, September 2002.

[12] Peter Lyman and Hal R. Varian. How much information? October 2003. http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/index.htm.

[13] Joseph F. Murray, Gordon F. Hughes, and Kenneth Kreutz-Delgado. Hard drive failure prediction using non-parametric statistical methods. Proceedings of ICANN/ICONIP, June 2003.

[14] Joseph F. Murray, Gordon F. Hughes, and Kenneth Kreutz-Delgado. Machine learning methods for predicting failures in hard drives: A multiple-instance application. J. Mach. Learn. Res., 6:783-816, 2005.

[15] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure, 13(4):227-298.

[16] Bianca Schroeder and Garth A. Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST), February 2007.

[17] Thomas Schwarz, Mary Baker, Steven Bassi, Bruce Baumgart, Wayne Flagg, Catherine van Ingen, Kobus Joste, Mark Manasse, and Mehul Shah. Disk failure investigations at the Internet Archive. 14th NASA Goddard, 23rd IEEE Conference on Mass Storage Systems and Technologies, May 2006.

[18] Sandeep Shah and Jon G. Elerath. Disk drive vintage and its effect on reliability. In Proceedings of the Annual Symposium on Reliability and Maintainability, pages 163-167, January 2004.

[19] Sandeep Shah and Jon G. Elerath. Reliability analysis of disk drive failure mechanisms. In Proceedings of the Annual Symposium on Reliability and Maintainability, pages 226-231, January 2005.

[20] Nisha Talagala and David Patterson. An analysis of error behavior in a large storage system. Technical Report CSD-99-1042, University of California, Berkeley, February 1999.

[21] Jimmy Yang and Feng-Bin Sun. A comprehensive review of hard-disk drive reliability. In Proceedings of the Annual Symposium on Reliability and Maintainability, pages 403-409, January 1999.
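As an illustration of the filtering rule in Section 2.3, the following Python sketch shows one way the policy could be expressed. It is not the Sawzall job used in the study. The record layout (`DriveRecord`), the field names, and the temperature cutoff `MAX_PLAUSIBLE_TEMP_C` are all assumptions made for this example. The sketch keeps missing values as "not available", drops every record of a drive that reports a negative count or an impossible temperature, and leaves suspiciously large counts untouched.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed cutoff for a "clearly impossible" reading; the paper gives no number.
MAX_PLAUSIBLE_TEMP_C = 150.0


@dataclass
class DriveRecord:
    """One periodic sample for one drive (hypothetical layout)."""
    drive_id: str
    temperature_c: Optional[float]  # None means "not available"
    power_cycles: Optional[int]
    scan_errors: Optional[int]


def is_spurious(rec: DriveRecord) -> bool:
    """True for negative counts or clearly impossible values.

    Missing values (None) are not spurious, and large counts are kept.
    """
    counts = (rec.power_cycles, rec.scan_errors)
    if any(c is not None and c < 0 for c in counts):
        return True
    t = rec.temperature_c
    return t is not None and t > MAX_PLAUSIBLE_TEMP_C


def filter_records(records: List[DriveRecord]) -> List[DriveRecord]:
    """Remove every record of any drive that has a spurious sample."""
    bad_drives = {r.drive_id for r in records if is_spurious(r)}
    return [r for r in records if r.drive_id not in bad_drives]
```

Removing by drive rather than by sample follows the stated assumption that one spurious value casts doubt on the other fields for the same drive.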
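Section 3.5 summarizes each SMART signal with a statement of the form "drives are N times more likely to fail within 60 days". The text does not spell out how that factor is computed, so the sketch below assumes the simplest reading: the fraction of drives that fail within the window after their first event, divided by the same fraction for drives with no event. The function names and the boolean-list input format are hypothetical.

```python
from typing import Sequence


def failure_fraction(failed_within_window: Sequence[bool]) -> float:
    """Fraction of drives in a group that failed inside the window."""
    if not failed_within_window:
        raise ValueError("group is empty")
    return sum(failed_within_window) / len(failed_within_window)


def relative_failure_likelihood(
    with_signal: Sequence[bool],
    without_signal: Sequence[bool],
) -> float:
    """Ratio of failure fractions: drives with the signal vs. without it.

    Each input holds one boolean per drive: True if the drive failed
    within the chosen window (for example, 60 days).
    """
    baseline = failure_fraction(without_signal)
    if baseline == 0:
        raise ValueError("no failures in the baseline group")
    return failure_fraction(with_signal) / baseline
```

For example, if 4 of 8 drives with a first scan error fail within the window and 1 of 8 drives without one does, the function returns 4.0. These numbers are made up for illustration and are not taken from the study.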