ROARS: A Scalable Repository for Data Intensive Scientific Computing
Hoang Bui, Peter Bui, Patrick Flynn, and Douglas Thain
University of Notre Dame

ABSTRACT

As scientific research becomes more data intensive, there is an increasing need for scalable, reliable, and high performance storage systems. Such data repositories must provide both data archival services and rich metadata, and cleanly integrate with large scale computing resources. ROARS is a hybrid approach to distributed storage that provides both large, robust, scalable storage and efficient rich metadata queries for scientific applications. In this paper, we demonstrate that ROARS is capable of importing and exporting large quantities of data, migrating data to new storage nodes, providing robust fault tolerance, and generating materialized views based on metadata queries. Our experimental results demonstrate that ROARS' aggregate throughput scales with the number of concurrent clients while providing fault-tolerant data access. ROARS is currently being used to store 5.1 TB of data in our local biometrics repository.

1. INTRODUCTION

Recent advances in digital technologies now make it possible for an individual or a small group to create and maintain enormous amounts of data. "Ordinary" researchers in all branches of science operate cameras, digital detectors, and computer simulations that can generate new data as fast as the researcher can pose a hypothesis. This increase in the production of data allows the individual to carry out complex studies that were previously only possible with a large staff of lab technicians, computer operators, and system administrators. Of course, such problems are not limited to science. A similar discussion applies to digital libraries, to paperless business, or to a thinly staffed internet startup that finds sudden success.

Unfortunately, this huge growth in data and storage comes with the unwanted burden of managing a large data archive. As an archive grows, it becomes significantly harder to find what data items are needed, to migrate the data from one technology to another, to re-organize as the data and goals change, and to deal with equipment failures.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DIDC 2010, Chicago, IL. Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$10.00.

The two canonical models for data storage, the filesystem and the database, are not well suited for supporting these kinds of applications. Both concepts can be made parallel and/or distributed for both capacity and performance. The relational database is well suited for querying, sorting, and reducing many discrete data items, but requires a high degree of advance schema design and system administration. A database can store large binary objects, but it is not highly optimized for this task [14]. On the other hand, the filesystem has a much lower barrier to entry, and is well suited for simply depositing large binary objects as they are created. However, as a filesystem becomes larger, querying, sorting, and searching can only be done efficiently if they match the chosen hierarchical structure. As an enterprise grows, no single hierarchy is likely to meet all needs. So while end users prefer working with filesystems, current storage systems lack the query capabilities necessary for efficient operation.

To address this mismatch, we have created ROARS (Rich Object ARchival System), an online data archive that combines some features of both the filesystem and database models, while eliminating some of the dangerous flexibility of each. Although there exist a number of designs for scalable storage [9, 7, 8, 25, 2, 3, 19], ROARS occupies an unexplored design point that combines several unusual features that together provide a powerful, scalable, manageable scientific data storage system:

Rich searchable metadata. Each data object is associated with a user metadata record of arbitrary (name, type, value) tuples, allowing the system to provide some search optimization without demanding elaborate schema design.

Discrete object storage. Each data object is stored as a single, discrete object on local storage, replicated multiple times for safety and performance. This allows for a compact statement of locality needed for efficient batch computing.

Materialized filesystem views. Rather than impose a single filesystem hierarchy from the beginning, fast queries may be used
to generate materialized views that the user sees as a normal filesystem. In this way, multiple users may organize the same data as they see fit, and make temporal snapshots to ensure reproducibility of results.

Transparent, incremental management. ROARS does not need to be taken offline even briefly in order to perform an integrity check, add or decommission servers, or to migrate to new resources. All of these tasks can be performed incrementally while the system is running, and even be paused, rescheduled, or restarted without harm.

Failure independence. Each object storage node in the system can fail or even be destroyed independently without affecting the behavior or performance of the other nodes. The metadata server is more critical, but it functions only as an (important) cache. If completely lost, the metadata can be reconstructed by a parallel scan of the object storage.

In our previous work on BXGrid [4], we created a discipline-specific data archive tightly integrated with a web portal for biometrics research. ROARS is our "second version" of this concept, which has been decoupled from biometrics, generalized to an abstract data model, and expanded in the areas of execution, management, and fault tolerance.

This paper is organized as follows. In section 2, we present the abstract data model and user interface to ROARS. In section 3, we describe our implementation of ROARS using a relational database and storage cluster. In section 4, an operational and performance evaluation of ROARS is presented. In section 5, we compare ROARS to other scalable storage systems. We conclude with future issues to explore.

2. SYSTEM DESIGN

ROARS is designed to store millions to billions of individual objects, each typically measured in megabytes or gigabytes. Each object contains both binary data and structured metadata that describes the binary data. Because ROARS is designed for the preservation of scientific data, data objects are write-once, read-many (WORM), but the associated metadata can be updated by logging. The system can be accessed with an SQL-like interface and also by a filesystem-like interface.

2.1 Data Model

A ROARS system stores a number of named collections. Each collection consists of a number of unordered objects. Each object consists of the two following components:

1. Binary Data: Each data object corresponds to a single discrete binary file that is stored on a filesystem. This object is usually an opaque file such as a TIFF or PDF, meaning that the system does not extract any information from the file other than the basic filesystem attributes.

2. Structured Metadata: Associated with each data object is a set of metadata items that describes or annotates the raw data object with domain-specific properties and values. This information is stored in plain text as rows of (NAME, TYPE, VALUE, OWNER, TIME) tuples as shown in the example metadata record here:

NAME         TYPE     VALUE      OWNER   TIME
recordingid  string   nd3R22829  pflynn  1257373461
subjectid    string   nd1S04388  pflynn  1257373461
state        string   problem    dthain  1254049876
problemtype  integer  34         dthain  1254049876
state        string   fixed      hbui    1254050851

In the example metadata record above, each tuple contains fields for NAME, TYPE, and VALUE, which define the name of the object property, its type, and its value. Because objects may have a varying number and types of attributes, and the user never specifies an exact specification of what objects should contain, this data model is schema-free. However, since scientific data tends to be semi-structured, the data model allows for the storage system to transparently group similar items into collections for efficiency and organizational purposes. Due to this regularity in the names and types of fields, and the ability to automatically group objects into collections, we consider the data model to be schema-implicit. The user never formally expresses the schema of the data objects, but an implicit one can be generated from the metadata records due to the semi-structured nature of the scientific data.

In addition to the NAME, TYPE, and VALUE fields, each metadata entry also contains a field for OWNER and TIME. This is to provide provenance information and transactional history of the metadata. Rather than overwriting metadata entries when a field is updated, new values are simply appended to the end of the record. In the example above, the state value is initially set to problem by one user and then later to fixed by another. By doing so, the latest value for a particular field will always be the last entry found in the record. This transactional metadata log is critical to scientific researchers who often need to keep track of not only the data, but how it is updated and transformed over time. These additional fields enable the users
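The append-only metadata log described above can be modeled in a few lines. The following is a minimal sketch, not the ROARS implementation: a record is a list of (name, type, value, owner, time) tuples, and resolving a field means taking the last tuple appended under that name.

```python
# Sketch of an append-only metadata record as a list of
# (name, type, value, owner, time) tuples. The current value of a
# field is simply the last tuple appended under that name.
record = [
    ("recordingid", "string", "nd3R22829", "pflynn", 1257373461),
    ("subjectid", "string", "nd1S04388", "pflynn", 1257373461),
    ("state", "string", "problem", "dthain", 1254049876),
    ("problemtype", "integer", "34", "dthain", 1254049876),
]

def update(record, name, type_, value, owner, time):
    """Append a new tuple rather than overwriting the old value,
    preserving the transactional history of the field."""
    record.append((name, type_, value, owner, time))

def latest(record, name):
    """Return the most recent value for a field, or None if unset."""
    for n, t, v, owner, time in reversed(record):
        if n == name:
            return v
    return None

# One user marks the object as a problem; another later fixes it.
update(record, "state", "string", "fixed", "hbui", 1254050851)
assert latest(record, "state") == "fixed"       # last entry wins
assert latest(record, "problemtype") == "34"    # earlier fields intact
```

The log keeps every intermediate value, so a researcher can replay who changed a field and when, which is exactly the provenance property the OWNER and TIME columns exist to provide.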
to track who made the updates, when the updates occurred, and what the new values were.

This data model fits in with the write-once-read-many nature of most scientific data. The discrete data files are rarely if ever updated and often contain data to be processed by highly optimized domain-specific applications. The metadata, however, may change or evolve over time and is used to organize and query the datasets.

2.2 User Interface

Users may interact with the system using either a command-line tool or a filesystem interface. The command line interface supports the following operations:

IMPORT <coll> FROM <dir>
QUERY <coll> WHERE <expr>
EXPORT <coll> WHERE <expr> INTO <dir> [AS <pattern>]
VIEW <coll> WHERE <expr> AS <pattern>
DELETE <coll> WHERE <expr>

The IMPORT operation loads a local directory containing objects and metadata into a specific collection in the repository. QUERY retrieves the metadata for each object matching the given expression. EXPORT retrieves both the data and metadata for each object matching the given expression, which are stored on the local disk as pairs of files. VIEW creates a materialized view on the local disk of all objects matching the given expression, using the specified pattern for the pathname. DELETE removes data objects and the related metadata from the repository, and is usually invoked only after a failed IMPORT. Given constraints by users, ROARS finds all associated data objects, deletes them, and removes the metadata entries.

Applications may also view ROARS as a read-only filesystem. Individual objects and their corresponding metadata can be accessed via their unique file identifiers using absolute paths. However, most users find it effective to access files using materialized views. Using the VIEW command above, a subset of the data repository can be queried, depositing a tree of named links onto the local filesystem. Each link is named according to the metadata of the containing object, and points to the absolute path of an item in the repository. For example, Figure 1 shows a view generated by the following command:

VIEW faces WHERE true AS "gender/race/fileid.type"

[Figure 1: Example Materialized View (faces organized by gender and race, e.g. Female/Asian, Male/Hispanic)]

Because the materialized view is stored in the normal local filesystem, it can be kept indefinitely, shared with other users, sent along with batch jobs, or packed up into an archive file and emailed to other users. The creating user manages their own disk space and is thus responsible for cleanup at the appropriate time. The ability to generate materialized views that provide third party applications a robust and scalable filesystem interface to the data objects is a distinguishing feature of ROARS. Rather than force users to implant their domain-specific tools into a database execution engine, or wrap them in a distributed programming abstraction, ROARS enables scientific researchers to continue using their familiar workflows and applications.

3. IMPLEMENTATION

Figure 2 shows the basic architecture of ROARS. To support the discrete object data model and the data operations previously outlined, ROARS utilizes a hybrid approach to construct scientific data repositories. Multiple Storage Servers are used for storing both the data and metadata in archival format. A Metadata Server (MDS) indexes all of the metadata on the storage servers, along with the location of each replicated object. The MDS serves as the primary name and entry point to an instance of ROARS.

[Figure 2: ROARS Architecture (applications reach the Metadata Server through a query tool or Parrot for metadata lookup; data access goes directly to Storage Servers in Groups A and B)]

The decision to employ both a database and a cluster of storage servers comes from the observation that while one type of system meets the requirements of one of the components of a scientific data repository, it is not adequate for the other. For instance, while it is possible to record
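A materialized view of this kind is essentially a tree of links whose path components are filled in from each object's metadata. The following is a hedged sketch of how a pattern such as "gender/race/fileid.type" could be expanded against a metadata record; it is an illustration of the naming rule, not the actual VIEW implementation, and the sample metadata values are made up.

```python
def view_path(pattern, meta):
    """Expand each '/'-separated component of a VIEW pattern by
    looking it up in the object's metadata. A component containing
    a dot, such as 'fileid.type', resolves both halves (sketch of
    the naming rule, not the real implementation)."""
    parts = []
    for comp in pattern.split("/"):
        if "." in comp:
            stem, ext = comp.split(".", 1)
            parts.append("%s.%s" % (meta[stem], meta[ext]))
        else:
            parts.append(str(meta[comp]))
    return "/".join(parts)

# Illustrative metadata record (values are assumptions).
meta = {"gender": "Female", "race": "Asian",
        "fileid": 4050, "type": "tiff"}
assert view_path("gender/race/fileid.type", meta) == "Female/Asian/4050.tiff"
```

In the real system each generated path becomes a named link pointing at the absolute path of a replica in the repository, so the view costs only a metadata query plus one link per matching object.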
both the metadata and raw data in a database, the performance would generally be poor and difficult to scale, especially to the level required for large scale distributed experiments; nor would it fit in with the workflow normally used by research scientists. Moreover, the distinct advantage of using a database, which is its transactional nature, is hardly utilized in a scientific repository because the data is mostly write-once-read-many, and thus rarely needs atomic updating. From our experience, during the lifetime of the repository, metadata may change once or twice, while the raw data stays untouched. Besides the scalability disadvantages, keeping raw data in a database poses bigger challenges in everyday maintenance and failure recovery. So, although a database would provide good metadata querying capabilities, it would not be able to satisfy the requirement for large scale data storage.

On the other hand, a distributed storage system, even with a clever file naming scheme, is also not adequate for scientific repositories. Such distributed storage systems provide scalable high performance I/O, but provide limited support for rich metadata operations, which generally devolve into full dataset scans or searches using fragile and ad hoc scripts. Although there are possible tricks and techniques for improving metadata availability in the filesystem, these all fall short of the efficiency required for a scientific repository. For instance, while it is possible to encode particular attributes in the filename, it is still inflexible and inefficient, particularly for data that belong to many different categories. Fast access to metadata remains nearly impossible, because parsing thousands or millions of filenames is the same if not worse than writing a cumbersome script to parse collections of metadata text files.

The hybrid design of ROARS takes the best aspects from both databases and distributed filesystems and combines them to provide rich metadata capabilities and robust scalable storage. To meet the storage requirement, ROARS replicates the data objects along with their associated metadata across multiple storage nodes. Like in traditional distributed systems, this use of data replication allows for scalable streaming read access and fault tolerance. In order to provide fast metadata query operations, the metadata information is persistently cached in a traditional database server upon importing the data objects into the repository. Queries and operations on the data objects access this cache for fast and efficient storage operations and metadata operations.

Overall, this storage organization is similar to the one used in the Google Filesystem [7] and Hadoop [8], where simple DataNodes store raw data and a single NameNode maintains the metadata. Our architecture differs in a few important ways, however. First, rather than striping the data as blocks across multiple Storage Nodes as done in Hadoop and the Google Filesystem, ROARS stores discrete whole data files on the storage nodes. While this prevents us from being able to support extremely large file sizes, this is not an important feature since most scientific data collections tend to be many small files, rather than a few extremely large ones. Moreover, the use of whole data files greatly simplifies recovery and enables failure independence. Likewise, the use of a database server as the metadata cache enables us to provide sophisticated and efficient metadata queries. While the Google Filesystem and Hadoop are restricted to basic filesystem-type metadata, ROARS can handle queries that work on constraints on domain-specific metadata information, allowing researchers to search and organize their data in terms familiar to their research focus.

[Figure 3: ROARS Metadata Structure (metadata, file, and replica tables linked by fileid)]

3.1 Database Structure

In ROARS, the metadata is stored in three main database tables: a metadata table, a file table, and a replica table. The metadata table stores the domain-specific scientific metadata. Each entry in this table can have one or more pointers to a fileid in the file table. In the case where there is only metadata and no corresponding raw data, the entry has a NULL fileid. The file table plays the same role for ROARS that an inode table plays in a traditional Unix filesystem, and holds
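The three-table layout can be sketched in SQL. This is an illustrative schema only: the column names follow the paper's description, but the exact DDL, host names, and paths are assumptions, and Python's built-in sqlite3 stands in for the MySQL server ROARS actually uses.

```python
import sqlite3

# Illustrative sketch of the three ROARS tables; the exact DDL is
# an assumption based on the fields named in the paper.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE file (
    fileid     INTEGER PRIMARY KEY,  -- plays the role of an inode number
    size       INTEGER,
    checksum   TEXT,
    createtime INTEGER
);
CREATE TABLE replica (                -- one row per stored copy
    fileid    INTEGER REFERENCES file(fileid),
    host      TEXT,
    path      TEXT,
    state     TEXT,
    lastcheck INTEGER
);
CREATE TABLE metadata (               -- append-only attribute log
    fileid INTEGER REFERENCES file(fileid),  -- NULL if no raw data
    name   TEXT, type TEXT, value TEXT,
    owner  TEXT, time INTEGER
);
""")
db.execute("INSERT INTO file VALUES (4050, 790000, 'S330', 0)")
db.executemany("INSERT INTO replica VALUES (?,?,?,?,?)",
               [(4050, "fs03", "/2/5/1290.4050", "validated", 0),
                (4050, "fs07", "/0/5/1290.4050", "validated", 0)])

# Accessing a file = look up its replica locations, then pick a node.
rows = db.execute("SELECT host FROM replica WHERE fileid=? "
                  "ORDER BY host", (4050,)).fetchall()
hosts = [h for (h,) in rows]
assert hosts == ["fs03", "fs07"]
```

The same fileid appearing in several replica rows is what gives the system its fault tolerance: losing one host only invalidates rows, never the file's identity.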
the essential information about raw data files. Entries in this table represent ROARS inodes, and therefore have the following important file information: fileid, size, checksum, and createtime. ROARS utilizes this information not only to keep track of files but also to emulate traditional UNIX system calls such as stat. For any given fileid, there can be multiple replica entries in the replica table. This third table keeps track of where the actual raw data files are stored. The structure of the replica table is very straightforward and includes the following fields: fileid, host, path, state, and lastcheck.

Figure 3 gives an example of the relationship between the metadata, file, and replica tables. In this configuration, each file is given a unique fileid in the file table. In the replica table, the fileid may occur multiple times, with each row representing a separate replica location in the storage cluster. Accessing a file then involves looking up the fileid, finding the set of associated replica locations, and then selecting a storage node.

As can be seen, this database organization provides both the ability to query files based on domain-specific metadata, and the ability to provide scalable data distribution and fault-tolerant operation through the use of replicas. Some of the additional fields such as lastcheck, state, and checksum are used by high level data access operations provided by ROARS to maintain the integrity of the system, and will be discussed in later subsections.

3.2 Storage Nodes

ROARS utilizes an array of Storage Nodes running Chirp [21] for replicating data. These Storage Nodes are usually conventional machines with large local single disks organized in a compute cluster. These Nodes are grouped together based on locality into different Storage Groups, and given a group id. During an IMPORT, ROARS makes a conscious decision to spread out replicas so that each Storage Group has at least one replica, thus providing a static form of load balancing. By convention, if a data object was named X.jpg, then the associated metadata file would be named X.meta, and both of these files are replicated across the Storage Nodes in each of the Storage Groups.

By replicating the raw data across the network, ROARS provides distributed applications scalable, high throughput data access. Moreover, because each Storage Group has at least one copy of the data file, distributed applications can easily take advantage of data locality with ROARS. To facilitate determining where a certain data object resides, ROARS includes a LOCATE command that will find the closest copy of the data for a requesting application. If the application is running on the same Storage Node, then the data is already on the local node, and so no data transfer is needed.

Robustness

Due to the use of data object replication, ROARS is stable and has support for recovery mechanisms and fail-over. By default, data is replicated across the Storage Nodes at least three times. During a read operation, if a replica is not reachable due to server outage or hardware failure, ROARS will randomly try another available replica after a user-specified timeout (normally 30 seconds). As mentioned earlier, Storage Nodes are organized into groups based on their locality. When data is populated into the data repository, ROARS intelligently places the data to ensure that there is a replica in each server group. By spreading replicas across multiple groups, a systematic failure of a Storage Group has only a minimal effect on ROARS operation and performance. Similar to read operations, write operations performed during IMPORT will randomly choose a server within a server group to write data. If the server is not responsive, another is chosen until the write is successful.

ROARS also ensures integrity of the data repository by tracking and comparing checksums of replicas. As a data file is ingested into ROARS, its checksum is calculated and recorded as a part of the data object's metadata. Read/write requests can internally check to make sure the replica's checksum matches the original data file. However, frequent checksum calls can reduce system performance, and so this integrity check is only performed during special data integrity management operations such as AUDIT. This command will scan the metadata database, perform checksums on the data objects, and ensure the current checksums match the ones recorded in the database. In the same process, the AUDIT command will also check the status of the Storage Nodes and perform any maintenance as necessary. Because of this, ROARS gives integrity checking a broader meaning, since it maintains integrity of the system as a whole, not simply single replicas.

ROARS' robust design also enables transparent and incremental
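The read-side fail-over described above amounts to shuffling the replica list and trying each copy in turn. A minimal sketch under that reading follows; the per-attempt timeout handling and the Chirp protocol are omitted, `fetch` is a hypothetical stand-in for the real transfer, and the host names are made up.

```python
import random

def read_with_failover(replicas, fetch, rng=random):
    """Try replicas in random order until one read succeeds (sketch
    of ROARS' read fail-over; the real system also applies a
    user-specified timeout, normally 30 seconds, per attempt)."""
    order = list(replicas)
    rng.shuffle(order)
    last_error = None
    for host, path in order:
        try:
            return fetch(host, path)
        except IOError as e:
            last_error = e      # server outage: try the next copy
    raise last_error            # every replica failed

# Simulated cluster: fs03 is down, fs07 still serves the file.
def fetch(host, path):
    if host == "fs03":
        raise IOError("server unreachable")
    return b"data-from-" + host.encode()

data = read_with_failover([("fs03", "/2/5/1290.4050"),
                           ("fs07", "/0/5/1290.4050")], fetch)
assert data == b"data-from-fs07"
```

Because at least three replicas exist and each Storage Group holds one, a read only fails outright when every group is unreachable at once.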
management. Whenever a Storage Node needs to be taken offline or decommissioned, invoking REMOVE will delete all entries associated with that node from the replica table and update the metadata database in a transactional manner. To add new Storage Nodes, an ADD_NODE followed by MIGRATE will spawn the creation of replicas on the new Storage Nodes. ROARS takes advantage of the atomic transactions of the database server to manage these operations. Because of this use of the database as a transaction manager, these operations can be performed transparently and incrementally. For instance, an AUDIT can be scheduled to run for only a few minutes at a time or during off-peak hours. Even if it does not complete, due to the use of the database as a transactional log, it can continue where it left off the next time it is run. The same goes for operations such as MIGRATE or REMOVE. These commands can be paused, rescheduled, and restarted without affecting the integrity of the system and without having to shut down the whole repository.

This robustness further extends to the ability to provide failure independence. Since the data is stored as complete data files rather than striped across multiple Storage Nodes, the integrity of the data files is never affected by a sudden loss of data servers. Additionally, failure of one server does not affect the performance or behavior of the other Storage Nodes. Because the metadata is stored alongside the data file replicas, ROARS is also capable of recovering from the loss of the Database Node, which only serves as a persistent metadata cache. To perform this recovery, a parallel scan can be performed on the Storage Nodes to reconstruct the metadata cache. This is in contrast to systems such as Hadoop and the Google Filesystem, which take special care to replicate all of the state of the NameNode. In the case of a loss of the NameNode in those systems, the layout and organization of the data can be completely lost since the data is striped across multiple servers. ROARS avoids this problem by storing complete discrete data files, and maintaining the metadata log next to these replicas on the Storage Nodes. This enables ROARS to robustly provide failure independence and a simple means of recovery.

4. EVALUATION

To evaluate the performance and operational characteristics of ROARS, we deployed a traditional network filesystem, Hadoop, and ROARS on a testbed cluster consisting of 32 data nodes and 1 separate head node. Each of these storage nodes is a commodity dual-core Intel 2.4 GHz machine, with 4 GB of RAM and 750 GB of disk, all connected via a single Gigabit Ethernet switch. The traditional network filesystem was a single Chirp file server on one of the data nodes. For Hadoop, we configured the Hadoop Distributed Filesystem (HDFS) to use the 32 storage nodes as the HDFS Datanodes and the separate head node as the HDFS Namenode. We kept the usual Hadoop defaults, such as employing a 64 MB block size for HDFS. Our ROARS configuration consisted of a dedicated metadata server running MySQL on the head server and 32 Chirp servers on the same data nodes as the Hadoop cluster. To provide our test software access to these storage systems, we utilized Parrot as a filesystem adaptor.

The following experimental results test the performance of ROARS and demonstrate its capabilities while performing a variety of storage system activities such as importing data, exporting materialized views, and migrating replicas. These experiments also include micro-benchmarks of traditional filesystem operations to determine the latency of common system calls, and concurrent access benchmarks that demonstrate how well the system scales. For these latter performance tests, we compare ROARS's performance to that of the traditional network server and Hadoop, which is an often cited alternative for distributed data archiving. At the end, we include operational results that demonstrate the data management capabilities of ROARS.

4.1 Data Import

Before performing any data access experiments, we first tested the performance of importing large datasets into both Hadoop and ROARS. For this data import experiment, we divided our test into several sets of files. Each set consists of a number of fixed size files, ranging from 1 KB to 1 GB. To perform the experiment, we imported the data from a local disk to the distributed systems. In the case of Hadoop this simply involved copying the data from the local machine to HDFS. For ROARS, we used the IMPORT operation.

Figure 4 shows the data import performance for Hadoop and ROARS for several sets of data. The graph shows the throughput as the file sizes increase. For the small file datasets, ROARS data mirroring outperforms HDFS striping, while for the larger file datasets, Hadoop is faster than ROARS. In either case, both
ROARS and Hadoop import larger files faster than they do smaller files.

[Figure 4: Import Performance (throughput vs. file size, 1 KB to 1 GB, for ROARS mirroring and HDFS striping)]

[Figure 5: Query Performance (run-time vs. number of records (x1000) for grep on HDFS, MapReduce, MySQL, and MySQL with index)]

The differences in performance between ROARS and Hadoop are due to the way importing and storing replicas works in both systems. In the case of Hadoop, a replica creation involves a complex set of transactions that set up a dataflow pipeline between the replication nodes of a single block. This overhead is probably the reason why Hadoop is slightly slower for smaller files. For larger files, however, this data pipeline enables higher overall system bandwidth usage and thus leads to better performance than ROARS, which does not perform any data pipelining. Rather, it merely copies the data file to each of the replicas in sequential order. This import and replica creation overhead also explains why large file import is faster than small file import. In ROARS, each imported file needs 9 database transactions, which can be costly when importing small files, where the time spent transferring data is overwhelmed by database transaction execution time. With the larger files, there is less time lost to setting up the connections and transactions, and more time spent on transferring the data to the Storage Nodes.

4.2 Metadata Query

In this benchmark, we studied the cost of performing a metadata query. As previously noted, one of the advantages of ROARS over distributed systems such as Hadoop is that it provides a means of quickly searching and manipulating the metadata in the repository. For this experiment, we created multiple metadata databases of increasing size and performed a query that looks for objects of a particular type. As a baseline reference, we performed a custom grep of the database records on a single node accessing HDFS, which is normally what happens in rudimentary scientific data collections. For Hadoop, we stored all the metadata in a single file, and queried the metadata by executing the custom script using MapReduce [5]. For ROARS, we queried the metadata using QUERY, which internally uses the MySQL execution engine. We did this with indexing on and off to examine its effect on performance.

Figure 5 clearly shows that ROARS takes full advantage of the database query capabilities and is much faster than either MapReduce or standard grepping. Evidently, as the metadata database increases in size, the grep performance degrades quickly. The same is true for the QUERY operation. Hadoop, however, mostly retains a steady running time, regardless of the size of the database. This is because the MapReduce version was able to take advantage of multiple compute nodes and thus scale up its performance. Unfortunately, due to the overhead incurred in setting up the computation and organizing the MapReduce execution, the Hadoop query had a high startup cost and thus was slower than MySQL. Furthermore, the standard grep and MySQL queries were performed on a single node, and thus did not benefit from scaling. That said, the ROARS query was still faster than Hadoop, even when the database reached 2,699,488 data objects.

4.3 Microbenchmark

As mentioned earlier, ROARS does not directly support traditional system calls such as stat, open, read, and close. Rather, ROARS provides these operations to external applications through a Parrot ROARS service which emulates these system calls. Since Hadoop also does not provide a native filesystem interface, we also implemented a Parrot HDFS service. To test the latency of these common filesystem functions, we performed a series of stats, opens, reads, and closes on a single file on the traditional file server, HDFS, and ROARS. For ROARS we provide the results for a version with SQL query caching and one without this small optimization.

[Figure 6: Microbenchmark (latency in milliseconds of stat, open, read, and close for Traditional, HDFS, ROARS cached, and ROARS uncached)]

Figure 6 shows the latency of the micro-operations on a traditional network file server, HDFS, and ROARS. As can be seen, ROARS provides comparable latency to the traditional network server, and in the case of stat, open, and read, lower latency than HDFS. Since all file access went through the Parrot storage adapter, there was some overhead for each system call. However, since all of the storage systems were accessed though the same Parrot adapter, this additional overhead is the same for all of the systems and thus does not affect
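The cached variant in the microbenchmark amounts to memoizing metadata lookups on the client side so that repeated stat or open calls on the same file skip a round trip to the MySQL server. A sketch under that assumption (the real caching layer lives inside the Parrot ROARS service; the lookup function and its return fields are illustrative):

```python
import functools

calls = {"db": 0}

@functools.lru_cache(maxsize=4096)
def lookup_inode(fileid):
    """Fetch file-table fields for a fileid; memoized so repeated
    stat/open on the same file hit the cache instead of the
    database (illustrative values, not real ROARS data)."""
    calls["db"] += 1            # stands in for a MySQL round trip
    return {"fileid": fileid, "size": 790000, "checksum": "S330"}

st = None
for _ in range(100):            # 100 stats of the same file...
    st = lookup_inode(4050)
assert calls["db"] == 1         # ...cost a single database query
assert st["size"] == 790000
```

The trade-off is the usual one for caches: a stale entry can briefly disagree with the database, which is acceptable here because the file table for a WORM object rarely changes after import.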
the relative latencies. These results show that while using a SQL database as a persistent metadata cache does incur an overhead cost that increases the latency of these filesystem micro-operations, the latencies provided by the ROARS system remain comparable to HDFS. Moreover, this additional overhead can be slightly mitigated by caching the SQL queries on the client side, as shown in the graph. With this small optimization, operations such as stat and open are significantly faster with ROARS than with HDFS. Even without this caching, though, ROARS still provides lower latency than HDFS.

[Figure 7: Concurrent Access (aggregate throughput in MB/s vs. concurrent clients: (a) 10,000 x 320 KB files, (b) 1,000 x 5 MB files)]

4.4 Concurrent Access

To determine the scalability of ROARS in comparison to a traditional file server and HDFS, we exported two different datasets to each of the systems and performed a test that read all of the data in each set. In the case of ROARS, we used a materialized view with symbolic links to take advantage of the data replication features of the system, while for the traditional filesystem and HDFS, we exported the data directory to each of those systems. We ran our test program using Condor [22] with 1-32 concurrent readers.

Figure 7 shows the performance results of all three systems for both datasets. In Figure 7(a), the clients read 10,000 320 KB files, while in Figure 7(b) 1,000 5 MB files were read. In both graphs, the overall aggregate throughput for both HDFS and ROARS increases with an increasing number of concurrent clients, while the traditional file server levels off after around 8 clients. This is because the single file server is limited to a maximum upload rate of about 120 MB/s, which it reaches after 8 concurrent readers. ROARS and HDFS, however, use replicas to enable reading from multiple machines, and thus scale with the number of readers. As with the case of importing data, these read tests also show that accessing larger files is much more efficient in both ROARS and HDFS than working on smaller files.

While both ROARS and HDFS achieve improved aggregate performance over the traditional file server, ROARS outperforms HDFS by a factor of 2. In the case of the small files, ROARS was able to achieve an aggregate throughput of 526.66 MB/s, while HDFS only reached 245.23 MB/s. For the larger test, ROARS hit 1030.94 MB/s and HDFS managed 581.88 MB/s. There are a couple of possible reasons for this difference. First, ROARS has less overhead in setting up the data transfers than HDFS, as indicated in the micro-operations benchmarks. Such overhead limits the number of concurrent data transfers and thus aggregate throughput.

             Iris Still  Face Still  Iris Video  Face Video
Method       (300KB)     (1MB)       (5MB)       (50MB)
Local        10          18          106         187
Remote x2    80          45          150         134
Remote x4    23          26          57          79
Remote x8    22          16          58          70
Remote x16   12          12          18          33
Remote x32   12          17          16          17

Figure 8: Transcoding in Active Storage (turnaround time in seconds to convert 50 images of each type)

Another cause for the performance difference is the behavior of the Storage Nodes. In HDFS, each block is checksummed and there is some additional overhead to maintain data integrity, while in ROARS, data integrity is only enforced during high level operations such as IMPORT, MIGRATE, and AUDIT. Since the Storage Nodes in ROARS are simple network file servers, no checksumming is performed during a read operation, while in HDFS data integrity is enforced throughout, even during reads.

4.5 Active Storage

ROARS is also capable of executing programs internally, co-locating the computation with the data that it requires. This technique is known as active storage [12]. In ROARS, an active storage job is dispatched to a specific file server containing the input files, where it is run in an identity box [20] to prevent it from harming the archive.

Active storage is frequently used in ROARS to provide transcoding from one data format to another. For example, a large MPEG format animation might be converted down to a 10-frame low resolution GIF animation to use as a preview image on a website. A given web page might show tens or hundreds of thumbnails that must be transcoded and displayed simultaneously. With active storage, we can harness the parallelism of the cluster to deliver the result faster.

Figure 8 shows the performance of transcoding various kinds of images using the active storage facilities of ROARS. Each line shows the turnaround time (in seconds) to convert 50 images of the given type. The 'Local' line shows the time to complete the conversions sequentially using ROARS
The 'Remote' lines show the turnaround time using the indicated number of active storage servers. As can be seen, the active storage facility does not help when applied to small still images, but offers a significant speedup when applied to large videos with significant processing cost.

4.6 Integrity Check & Recovery
In ROARS, the AUDIT command is used to perform an integrity check. As we have mentioned, the file table keeps records of a data file's size, checksum, and last-checked date. AUDIT uses this information to detect suspect replicas and replace them. At the lowest level, AUDIT checks the size of each replica to make sure it matches the file table entry. This type of check is not expensive to perform, but it is also not reliable: a replica could have a number of bytes modified yet remain the same size. A better way to check a replica's integrity is to compute the checksum of the replica and compare it to the value in the file table. This is expensive because the whole replica must be read to compute the checksum.

Figure 9: Cost of Calculating Checksum (elapsed time in seconds vs. file size, 4KB to 2GB, for ROARS and HDFS).

Figure 9 shows the cost of computing checksums in both ROARS and HDFS. As file size increases, the time required to perform a checksum also increases for both systems. However, when the file size is bigger than the HDFS block size (64MB), ROARS begins to outperform HDFS because the latter incurs additional overhead in selecting a new block and setting up a new transaction. Moreover, ROARS lets Storage Nodes perform the checksum remotely, where the data file is stored, while for HDFS the data must be streamed locally before the operation can be performed.

When a replica is deemed suspect, ROARS will spawn a new replica and delete the suspect copy. ROARS does this by making a copy of a good replica, in one of two ways. The first way is to read the good replica to a local machine and then copy it to a Storage Node (first party put). Another way is to tell the Storage Node where the good copy is located and have it perform the transfer on the user's behalf (third party put). The latter requires two extra file server operations on the target Storage Nodes.

4.7 Dynamic Data Migration
ROARS is a highly flexible data management system where users can transparently and incrementally add and remove Storage Nodes without shutting down a running system.
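Looping back to the AUDIT procedure of Section 4.6 for a moment: its two-level check, a cheap size comparison first and a full checksum only when the sizes agree, can be sketched as follows. This is a hypothetical Python rendering; the paper does not specify the checksum algorithm (MD5 is assumed here), and the function name and return values are invented for illustration.

```python
import hashlib
import os

CHUNK = 1 << 20  # stream in 1MB pieces so large replicas are not held in memory

def audit_replica(path, expected_size, expected_checksum):
    """Two-level integrity check in the spirit of ROARS AUDIT."""
    # Level 1: cheap size comparison against the file table entry.
    # Catches truncation, but not same-size corruption.
    if os.path.getsize(path) != expected_size:
        return "suspect: size mismatch"
    # Level 2: full checksum, required to catch modified bytes.
    # Expensive, since the whole replica must be read.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            h.update(chunk)
    if h.hexdigest() != expected_checksum:
        return "suspect: checksum mismatch"
    return "ok"
```

A replica flagged as suspect would then be replaced by copying a good replica, via either a first party or third party put as described above.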
The system provides operations to add new Storage Nodes (ADD_NODE), migrate data to new Storage Nodes (MIGRATE), and remove data from old or unreliable Nodes (REMOVE_NODE).

To demonstrate the fault tolerance and failover features of ROARS, we set up a migration experiment as follows. We added 16 new Storage Nodes to our current system and started a MIGRATE process to spawn new replicas. Starting with 30 active Storage Nodes, we intentionally turned off a number of Storage Nodes during the MIGRATE process. After some time, we turned some Storage Nodes back on, leaving the others inactive. By dropping Storage Nodes from the system, we wanted to ensure that ROARS could remain functional even when hardware failures occur.

Figure 10 demonstrates that ROARS remained operational during the MIGRATE process. As expected, throughput dips as the number of active Storage Nodes decreases. The decrease in performance occurs because when ROARS contacts an inactive Storage Node, it fails to obtain the necessary replica for copying. Within a global timeout, ROARS retries the same Storage Node and then moves on to the next available Node. As Nodes remain inactive, ROARS continues to endure more and more timeouts, which leads to the decrease in system throughput.

Although throughput decreases only slightly when there are two inactive Storage Nodes, it takes a more significant hit when a larger number of Storage Nodes is inactive. There are ways to reduce this negative effect on performance. First, ROARS can dynamically shorten the global timeout, effectively cutting down retry time. Better yet, ROARS can detect inactive Storage Nodes after a number of failed attempts and blacklist them, thus avoiding picking replicas from inactive Nodes in the future.

5. RELATED WORK
Our goal was to construct a scientific data repository that required both scalable, fault-tolerant data storage and efficient querying of the rich domain-specific metadata. Unfortunately, traditional filesystems and databases fail to meet both of these requirements. While most distributed filesystems provide robust, scalable data archiving, they fail to adequately provide for efficient rich metadata operations. In contrast, database systems provide efficient querying capabilities, but fail to match the workflow of scientific researchers.
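Before turning to related work, the blacklist optimization proposed at the end of Section 4.7 can be sketched. This is a speculative Python illustration, not ROARS code: the class, the failure threshold, and the per-node counters are all assumptions about how such a policy might be realized.

```python
class ReplicaPicker:
    """After `max_failures` consecutive failed connections, a Storage
    Node is blacklisted and no longer chosen as a replica source."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = {}  # node name -> consecutive failure count

    def pick(self, holders):
        # Return the first node holding the replica that is not
        # blacklisted, so inactive nodes stop costing timeouts.
        for node in holders:
            if self.failures.get(node, 0) < self.max_failures:
                return node
        return None  # every node holding this replica is blacklisted

    def record_failure(self, node):
        self.failures[node] = self.failures.get(node, 0) + 1

    def record_success(self, node):
        # A node that comes back online is forgiven its past failures.
        self.failures[node] = 0
```

Under this policy, the repeated global-timeout retries against inactive Nodes observed in the migration experiment would stop after a few attempts per node, recovering most of the lost throughput.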
5.1 Filesystems
In order to facilitate sharing of scientific data, researchers usually employ various network filesystems such as NFS [13] or AFS [9] to provide data distribution and concurrent access. To get scalable and fault-tolerant data storage, scientists may look to distributed storage systems such as Ceph [25] or Hadoop [8]. Most of the data in these filesystems is organized into sets of directories and files along with associated metadata. Since some of these filesystems, such as Ceph and Hadoop, perform automatic data replication, they provide not only fault-tolerant data access but also the ability to scale the system. Therefore, with regard to the need for scalable, fault-tolerant data storage, current distributed storage systems adequately meet the requirement.

Where filesystems still fail, however, is in providing an efficient means of performing rich metadata queries. Since filesystems do not provide a direct means to perform these metadata operations,
export processes usually involve a complex set of ad hoc scripts which tend to be error-prone, inflexible, and unreliable.

Figure 10: Dynamic Data Migration (data transferred, number of online Storage Nodes, and overall throughput over elapsed time, with switch-off and switch-on events marked).

More importantly, these manual searches through the data repository are also time consuming, since all of the metadata in the repository must be analyzed for each export. Although some distributed systems such as Hadoop provide programming tools such as MapReduce [5] to facilitate searching through large datasets in a reliable and scalable manner, these full-repository searches are still costly and time consuming, since each experimental run must scan the repository and extract the particular data files required by the user. Moreover, even with these programming tools it is still not possible to dynamically organize and group subsets of the data repository based on the metadata in a persistent manner, making it difficult to export reusable snapshots of particular datasets.

5.2 Databases
The other common approach to managing scientific data is to go the route of projects such as the Sloan Digital Sky Survey [18]. That is, rather than opt for the "flat file" data access pattern used in filesystems, the scientific data is collected and organized directly in a large distributed database such as MonetDB [10] or Vertica [23]. Besides providing efficient query capabilities, such systems also provide advanced data analysis tools to examine and probe the data. However, these systems remain undesirable to many scientific researchers.

The first problem with database systems is that, in order to use them, the data must be organized in a highly structured, explicit schema. From our experience, it is rarely the case that scientific researchers know the exact nature of their data a priori, or which attributes are relevant or necessary. Because scientific data tends to be semi-structured rather than highly structured, this requirement of a full explicit schema imposes a barrier to the adoption of database systems and explains why most research groups opt for filesystem-based storage systems, which fit their organic and evolving method of data collection.

Most importantly, database systems are not ideal for scientific data repositories because they do not fit into the workflow commonly used by scientific researchers. In projects such as the Sloan Digital Sky
Survey and Sequoia 2000 [17], the scientific data is stored directly in database tables, and the database system is used as a data processing and analysis engine to query and search through the data. For scientific projects such as these, the recent work outlined by Stonebraker et al. [16] is a more suitable storage system for such highly structured scientific repositories.

In most fields of scientific research, however, it is not feasible or realistic to put the raw scientific data directly into the database and use the database as an execution engine. Rather, in fields such as biological computing, for instance, genome sequence data is generally stored in large flat files and analyzed using highly optimized tools such as BLAST [1] on distributed systems such as Condor [22]. Although it may be possible to stuff the genome data into a high-end database and use the database engine to execute BLAST as a UDF (user-defined function), this goes against the common practices of most researchers and diverts from their normal workflow. Therefore, using a database as a scientific data repository moves the scientists away from their domains of expertise and their familiar tools to the realm of database optimization and management, which is not desirable for many scientific researchers.

Because of these limitations, traditional distributed filesystems and databases are not desirable for scientific data repositories which require both large, scalable storage and efficient rich metadata operations. Although distributed systems provide robust and scalable data storage, they do not provide direct metadata querying capabilities. In contrast, databases do provide the necessary metadata querying capabilities, but fail to fit into the workflow of research scientists. The purpose of ROARS is to address these shortcomings by constructing a hybrid system that leverages the strengths of both distributed filesystems and relational databases to provide fault-tolerant, scalable data storage and efficient rich metadata manipulation.

This hybrid design is similar to SDM [11], which also utilizes a database together with a filesystem. The design of SDM is highly optimized for n-dimensional array data. Moreover, SDM uses multiple disks to support high-throughput I/O for MPI [6], while ROARS uses a distributed active storage cluster. Another example of a filesystem-database combination is HEDC
[15]. HEDC is implemented on a single large enterprise-class machine rather than an array of Storage Nodes. iRODS [24] and its predecessor, the Storage Resource Broker [3], support tagged, searchable metadata implemented as a vertical schema. ROARS manages metadata with a horizontal schema pointing to files and replicas, which allows the full expressiveness of SQL to be applied.

6. CONCLUSION
We have described the overall design and implementation of ROARS, an archival system for scientific data with support for rich metadata operations. ROARS couples a database server and an array of Storage Nodes to provide users the ability to search data quickly and to store large amounts of data while enabling high-performance throughput for distributed applications. Through our experiments, ROARS has demonstrated the ability to scale up and perform as well as HDFS in most cases, and to provide unique features such as transparent, incremental operation and failure independence.

Currently ROARS is used as the backend storage of BXGrid [4], a biometrics data repository. At the time of writing, BXGrid has 265,927 recordings for a total of 5.1 TB of data spread across 40 Storage Nodes and has been used in production for 16 months.

7. ACKNOWLEDGEMENTS
This work was supported by National Science Foundation grants CCF-06-21434, CNS-06-43229, and CNS-01-30839. This work is also supported by the Federal Bureau of Investigation, the Central Intelligence Agency, the Intelligence Advanced Research Projects Activity, the Biometrics Task Force, and the Technical Support Working Group through US Army contract W91CRB-08-C-0093.

8. REFERENCES
[1] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 3(215):403-410, Oct 1990.
[2] Amazon Simple Storage Service (Amazon S3). http://aws.amazon.com/s3/, 2009.
[3] C. Baru, R. Moore, A. Rajasekar, and M. Wan. The SDSC storage resource broker. In Proceedings of CASCON, Toronto, Canada, 1998.
[4] H. Bui, M. Kelly, C. Lyon, M. Pasquier, D. Thomas, P. Flynn, and D. Thain. Experience with BXGrid: A Data Repository and Computing Grid for Biometrics Research. Journal of Cluster Computing, 12(4):373, 2009.
[5] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Operating Systems Design and Implementation, 2004.
[6] J. J. Dongarra and D. W. Walker. MPI: A standard message passing interface. Supercomputer, pages 56-68, January 1996.
[7] S. Ghemawat, H. Gobioff, and S. Leung. The Google file system. In ACM Symposium on Operating Systems Principles, 2003.
[8] Hadoop. http://hadoop.apache.org/, 2007.
[9] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, and M. West. Scale and performance in a distributed file system. ACM Trans. on Comp. Sys., 6(1):51-81, February 1988.
[10] M. Ivanova, N. Nes, R. Goncalves, and M. Kersten. MonetDB/SQL meets SkyServer: the challenges of a scientific database. In Scientific and Statistical Database Management, 2007.
[11] J. No, R. Thakur, and A. Choudhary. Integrating parallel file I/O and database support for high-performance scientific data management. In IEEE High Performance Networking and Computing, 2000.
[12] E. Riedel, G. A. Gibson, and C. Faloutsos. Active storage for large scale data mining and multimedia. In Very Large Databases (VLDB), 1998.
[13] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon. Design and implementation of the Sun network filesystem. In USENIX Summer Technical Conference, pages 119-130, 1985.
[14] R. Sears, C. V. Ingen, and J. Gray. To blob or not to blob: Large object storage in a database or a filesystem. Technical Report MSR-TR-2006-45, Microsoft Research, April 2006.
[15] E. Stolte, C. von Praun, G. Alonso, and T. Gross. Scientific data repositories: designing for a moving target. In SIGMOD, 2003.
[16] M. Stonebraker, J. Becla, D. J. DeWitt, K.-T. Lim, D. Maier, O. Ratzesberger, and S. B. Zdonik. Requirements for science databases and SciDB. In CIDR, 2009.
[17] M. Stonebraker, J. Frew, and J. Dozier. An overview of the Sequoia 2000 project. In Proceedings of the Third International Symposium on Large Spatial Databases, pages 397-412, 1992.
[18] A. S. Szalay, P. Z. Kunszt, A. Thakar, J. Gray, and D. R. Slutz. Designing and mining multi-terabyte astronomy archives: The Sloan Digital Sky Survey. In SIGMOD Conference, 2000.
[19] O. Tatebe, N. Soda, Y. Morita, S. Matsuoka, and S. Sekiguchi. Gfarm v2: A grid file system that supports high-performance distributed and parallel data computing. In Computing in High Energy Physics (CHEP), September 2004.
[20] D. Thain. Identity Boxing: A New Technique for Consistent Global Identity. In IEEE/ACM Supercomputing, pages 51-61, 2005.
[21] D. Thain, C. Moretti, and J. Hemmes. Chirp: A Practical Global Filesystem for Cluster and Grid Computing. Journal of Grid Computing, 7(1):51-72, 2009.
[22] D. Thain, T. Tannenbaum, and M. Livny. Condor and the grid. In F. Berman, G. Fox, and T. Hey, editors, Grid Computing: Making the Global Infrastructure a Reality. John Wiley, 2003.
[23] Vertica. http://www.vertica.com/, 2009.
[24] M. Wan, R. Moore, A. Rajasekar, and W. Schroeder. A prototype rule-based distributed data management system. In HPDC Workshop on Next Generation Distributed Data Management, May 2006.
[25] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In USENIX Operating Systems Design and Implementation, 2006.