ROARS: A Scalable Repository for Data Intensive Scientific Computing
Hoang Bui, Peter Bui, Patrick Flynn, and Douglas Thain
University of Notre Dame

ABSTRACT

As scientific research becomes more data intensive, there is an increasing need for scalable, reliable, and high performance storage systems. Such data repositories must provide both data archival services and rich metadata, and cleanly integrate with large scale computing resources. ROARS is a hybrid approach to distributed storage that provides both large, robust, scalable storage and efficient rich metadata queries for scientific applications. In this paper, we demonstrate that ROARS is capable of importing and exporting large quantities of data, migrating data to new storage nodes, providing robust fault tolerance, and generating materialized views based on metadata queries. Our experimental results demonstrate that ROARS' aggregate throughput scales with the number of concurrent clients while providing fault-tolerant data access. ROARS is currently being used to store 5.1 TB of data in our local biometrics repository.

1. INTRODUCTION

Recent advances in digital technologies now make it possible for an individual or a small group to create and maintain enormous amounts of data. "Ordinary" researchers in all branches of science operate cameras, digital detectors, and computer simulations that can generate new data as fast as the researcher can pose a hypothesis. This increase in the production of data allows the individual to carry out complex studies that were previously only possible with a large staff of lab technicians, computer operators, and system administrators. Of course, such problems are not limited to science. A similar discussion applies to digital libraries, to paperless business, or to a thinly staffed internet startup that finds sudden success.

Unfortunately, this huge growth in data and storage comes with the unwanted burden of managing a large data archive. As an archive grows, it becomes significantly harder to find what data items are needed, to migrate the data from one technology to another, to re-organize as the data and goals change, and to deal with equipment failures.

The two canonical models for data storage, the filesystem and the database, are not well suited for supporting these kinds of applications. Both concepts can be made parallel and/or distributed for both capacity and performance. The relational database is well suited for querying, sorting, and reducing many discrete data items, but requires a high degree of advance schema design and system administration. A database can store large binary objects, but it is not highly optimized for this task [14]. On the other hand, the filesystem has a much lower barrier to entry, and is well suited for simply depositing large binary objects as they are created. However, as a filesystem becomes larger, querying, sorting, and searching can only be done efficiently if they match the chosen hierarchical structure. As an enterprise grows, no single hierarchy is likely to meet all needs. So while end users prefer working with filesystems, current storage systems lack the query capabilities necessary for efficient operation.
To address this mismatch, we have created ROARS (Rich Object ARchival System), an online data archive that combines some features of both the filesystem and database models, while eliminating some of the dangerous flexibility of each. Although there exist a number of designs for scalable storage [9, 7, 8, 25, 2, 3, 19], ROARS occupies an unexplored design point that combines several unusual features that together provide a powerful, scalable, manageable scientific data storage system:

Rich searchable metadata. Each data object is associated with a user metadata record of arbitrary (name, type, value) tuples, allowing the system to provide some search optimization without demanding elaborate schema design.

Discrete object storage. Each data object is stored as a single, discrete object on local storage, replicated multiple times for safety and performance. This allows for a compact statement of locality needed for efficient batch computing.

Materialized filesystem views. Rather than impose a single filesystem hierarchy from the beginning, fast queries may be used to generate materialized views that the user sees as a normal filesystem. In this way, multiple users may organize the same data as they see fit, and make temporal snapshots to ensure reproducibility of results.

Transparent, incremental management. ROARS does not need to be taken offline even briefly in order to perform an integrity check, add or decommission servers, or to migrate to new resources. All of these tasks can be performed incrementally while the system is running, and even be paused, rescheduled, or restarted without harm.

Failure independence. Each object storage node in the system can fail or even be destroyed independently without affecting the behavior or performance of the other nodes. The metadata server is more critical, but it functions only as an (important) cache. If completely lost, the metadata can be reconstructed by a parallel scan of the object storage.

In our previous work on BXGrid [4], we created a discipline-specific data archive tightly integrated with a web portal for biometrics research. ROARS is our "second version" of this concept, which has been decoupled from biometrics, generalized to an abstract data model, and expanded in the areas of execution, management, and fault tolerance.

This paper is organized as follows. In Section 2, we present the abstract data model and user interface to ROARS. In Section 3, we describe our implementation of ROARS using a relational database and storage cluster. In Section 4, an operational and performance evaluation of ROARS is presented. In Section 5, we compare ROARS to other scalable storage systems. We conclude with future issues to explore.

2. SYSTEM DESIGN

ROARS is designed to store millions to billions of individual objects, each typically measured in megabytes or gigabytes. Each object contains both binary data and structured metadata that describes the binary data. Because ROARS is designed for the preservation of scientific data, data objects are write-once, read-many (WORM), but the associated metadata can be updated by logging. The system can be accessed with an SQL-like interface and also by a filesystem-like interface.

2.1 Data Model

A ROARS system stores a number of named collections. Each collection consists of a number of unordered objects. Each object consists of the two following components:

1. Binary Data: Each data object corresponds to a single discrete binary file that is stored on a filesystem. This object is usually an opaque file such as a TIFF or PDF, meaning that the system does not extract any information from the file other than the basic filesystem attributes.

2. Structured Metadata: Associated with each data object is a set of metadata items that describes or annotates the raw data object with domain-specific properties and values. This information is stored in plain text as rows of (NAME, TYPE, VALUE, OWNER, TIME) tuples, as shown in the example metadata record here:

    NAME         TYPE     VALUE      OWNER   TIME
    recordingid  string   nd3R22829  pflynn  1257373461
    subjectid    string   nd1S04388  pflynn  1257373461
    state        string   problem    dthain  1254049876
    problemtype  integer  34         dthain  1254049876
    state        string   fixed      hbui    1254050851

In the example metadata record above, each tuple contains fields for NAME, TYPE, and VALUE, which define the name of the object property, its type, and its value. Because objects may have a varying number and types of attributes, and the user never specifies an exact specification of what objects should contain, this data model is schema-free. However, since scientific data tends to be semi-structured, the data model allows the storage system to transparently group similar items into collections for efficiency and organizational purposes. Due to this regularity in the names and types of fields, and the ability to automatically group objects into collections, we consider the data model to be schema-implicit. The user never formally expresses the schema of the data objects, but an implicit one can be generated from the metadata records due to the semi-structured nature of the scientific data.

In addition to the NAME, TYPE, and VALUE fields, each metadata entry also contains a field for OWNER and TIME. This is to provide provenance information and transactional history of the metadata. Rather than overwriting metadata entries when a field is updated, new values are simply appended to the end of the record. In the example above, the state value is initially set to problem by one user and then later to fixed by another. By doing so, the latest value for a particular field will always be the last entry found in the record. This transactional metadata log is critical to scientific researchers, who often need to keep track of not only the data, but how it is updated and transformed over time. These additional fields enable the users to track who made the updates, when the updates occurred, and what the new values were.

This data model fits in with the write-once-read-many nature of most scientific data. The discrete data files are rarely if ever updated and often contain data to be processed by highly optimized domain-specific applications. The metadata, however, may change or evolve over time and is used to organize and query the datasets.
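Because the latest value of a field is simply the last entry appended to the record, reducing the log to a current view of an object's attributes is a single pass. A minimal sketch in Python, reusing part of the example record above (an illustration of the idea, not the ROARS implementation):

    # Resolve the current attributes of an object from its append-only
    # (NAME, TYPE, VALUE, OWNER, TIME) metadata log. Entries appear in
    # the order they were appended, so the last write wins.
    log = [
        ("recordingid", "string", "nd3R22829", "pflynn", 1257373461),
        ("state",       "string", "problem",   "dthain", 1254049876),
        ("state",       "string", "fixed",     "hbui",   1254050851),
    ]

    def current_values(entries):
        latest = {}
        for name, typ, value, owner, time in entries:
            latest[name] = value
        return latest

    print(current_values(log))  # {'recordingid': 'nd3R22829', 'state': 'fixed'}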
2.2 User Interface

Users may interact with the system using either a command-line tool or a filesystem interface. The command line interface supports the following operations:

    IMPORT <coll> FROM <dir>
    QUERY <coll> WHERE <expr>
    EXPORT <coll> WHERE <expr> INTO <dir> [AS <pattern>]
    VIEW <coll> WHERE <expr> AS <pattern>
    DELETE <coll> WHERE <expr>

The IMPORT operation loads a local directory containing objects and metadata into a specific collection in the repository. QUERY retrieves the metadata for each object matching the given expression. EXPORT retrieves both the data and metadata for each object matching the given expression, which are stored on the local disk as pairs of files. VIEW creates a materialized view on the local disk of all objects matching the given expression, using the specified pattern for the pathname. DELETE removes data objects and the related metadata from the repository, and is usually invoked only after a failed IMPORT. Given constraints by users, ROARS finds all associated data objects, deletes them, and removes the metadata entries.

Applications may also view ROARS as a read-only filesystem. Individual objects and their corresponding metadata can be accessed via their unique file identifiers using absolute paths. However, most users find it effective to access files using materialized views. Using the VIEW command above, a subset of the data repository can be queried, depositing a tree of named links onto the local filesystem. Each link is named according to the metadata of the containing object, and points to the absolute path of an item in the repository. For example, Figure 1 shows a view generated by the following command:

    VIEW faces WHERE true AS "gender/race/fileid.type"

[Figure 1: Example materialized view. A faces collection is exported as a gender/race tree of links (Male/Female; Asian, Hispanic, White) pointing to image files in the repository.]

Because the materialized view is stored in the normal local filesystem, it can be kept indefinitely, shared with other users, sent along with batch jobs, or packed up into an archive file and emailed to other users. The creating user manages their own disk space and is thus responsible for cleanup at the appropriate time. The ability to generate materialized views that provide third party applications a robust and scalable filesystem interface to the data objects is a distinguishing feature of ROARS. Rather than force users to implant their domain-specific tools into a database execution engine, or wrap them in a distributed programming abstraction, ROARS enables scientific researchers to continue using their familiar workflows and applications.
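The effect of VIEW can be pictured as a small routine that expands the naming pattern against each matching object's metadata and deposits symbolic links. The sketch below assumes hypothetical query results and repository paths; only the pattern idea is taken from the example above:

    import os

    # Hypothetical results of a metadata query; "path" is the absolute
    # location of a replica inside the repository.
    results = [
        {"fileid": 4050, "type": "jpg", "gender": "Male", "race": "Asian",
         "path": "/roars/fs03/2/5/1290.4050"},
        {"fileid": 4051, "type": "jpg", "gender": "Female", "race": "White",
         "path": "/roars/fs01/0/5/1289.4051"},
    ]

    def make_view(records, pattern, root="view"):
        for r in records:
            # Name each link from the object's own metadata...
            link = os.path.join(root, pattern.format(**r))
            os.makedirs(os.path.dirname(link), exist_ok=True)
            # ...and point it at the item inside the repository.
            os.symlink(r["path"], link)

    make_view(results, "{gender}/{race}/{fileid}.{type}")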
3. IMPLEMENTATION

Figure 2 shows the basic architecture of ROARS. To support the discrete object data model and the data operations previously outlined, ROARS utilizes a hybrid approach to construct scientific data repositories. Multiple Storage Servers are used for storing both the data and metadata in archival format. A Metadata Server (MDS) indexes all of the metadata on the storage servers, along with the location of each replicated object. The MDS serves as the primary name and entry point to an instance of ROARS.

[Figure 2: ROARS architecture. A query tool and Parrot applications perform metadata lookups against the Metadata Server and data access against Storage Servers organized into groups.]

The decision to employ both a database and a cluster of storage servers comes from the observation that while one type of system meets the requirements of one of the components of a scientific data repository, it is not adequate for the other. For instance, while it is possible to record both the metadata and the raw data in a database, the performance would generally be poor and difficult to scale, especially to the level required for large scale distributed experiments; nor would it fit in with the workflow normally used by research scientists. Moreover, the distinct advantage of using a database, which is its transactional nature, is hardly utilized in a scientific repository, because the data is mostly write-once-read-many and thus rarely needs atomic updating. From our experience, during the lifetime of the repository, metadata may change once or twice, while the raw data stays untouched. Besides the scalability disadvantages, keeping raw data in a database poses bigger challenges for everyday maintenance and failure recovery. So, although a database would provide good metadata querying capabilities, it would not be able to satisfy the requirement for large scale data storage.

On the other hand, a distributed storage system, even with a clever file naming scheme, is also not adequate for scientific repositories. Such distributed storage systems provide scalable high performance I/O, but provide limited support for rich metadata operations, which generally devolve into full dataset scans or searches using fragile and ad hoc scripts. Although there are possible tricks and techniques for improving metadata availability in the filesystem, these all fall short of the efficiency required for a scientific repository. For instance, while it is possible to encode particular attributes in the filename, it is still inflexible and inefficient, particularly for data that belong to many different categories. Fast access to metadata remains nearly impossible, because parsing thousands or millions of filenames is the same if not worse than writing a cumbersome script to parse collections of metadata text files.

The hybrid design of ROARS takes the best aspects of both databases and distributed filesystems and combines them to provide rich metadata capabilities and robust scalable storage. To meet the storage requirement, ROARS replicates the data objects along with their associated metadata across multiple storage nodes. As in traditional distributed systems, this use of data replication allows for scalable streaming read access and fault tolerance. In order to provide fast metadata query operations, the metadata information is persistently cached in a traditional database server upon importing the data objects into the repository. Queries and operations on the data objects access this cache for fast and efficient storage and metadata operations.

Overall, this storage organization is similar to the one used in the Google Filesystem [7] and Hadoop [8], where simple DataNodes store raw data and a single NameNode maintains the metadata. Our architecture differs in a few important ways, however. First, rather than striping the data as blocks across multiple Storage Nodes as done in Hadoop and the Google Filesystem, ROARS stores discrete whole data files on the storage nodes. While this prevents us from being able to support extremely large file sizes, this is not an important feature, since most scientific data collections tend to be many small files rather than a few extremely large ones. Moreover, the use of whole data files greatly simplifies recovery and enables failure independence. Likewise, the use of a database server as the metadata cache enables us to provide sophisticated and efficient metadata queries. While the Google Filesystem and Hadoop are restricted to basic filesystem-type metadata, ROARS can handle queries with constraints on domain-specific metadata information, allowing researchers to search and organize their data in terms familiar to their research focus.
3.1 Database Structure

In ROARS, the metadata is stored in three main database tables: a metadata table, a file table, and a replica table. The metadata table stores the domain-specific scientific metadata. Each entry in this table can have one or more pointers to a fileid in the file table. In the case where there is only metadata and no corresponding raw data, the entry would have a NULL fileid.

The file table plays the same role for ROARS that an inode table does in a traditional Unix filesystem, and holds the essential information about raw data files. Entries in this table represent ROARS inodes, and therefore have the following important file information: fileid, size, checksum, and createtime. ROARS utilizes this information not only to keep track of files but also to emulate traditional UNIX system calls such as stat.

For any given fileid, there can be multiple replica entries in the replica table. This third table keeps track of where the actual raw data files are stored. The structure of the replica table is very straightforward and includes the following fields: fileid, host, path, state, and lastcheck.

[Figure 3: ROARS metadata structure. Metadata rows (recordingid, subject, eye, emotion, state, date) point to file table entries (fileid, size, checksum, state), which map to replica table rows (replicaid, fileid, host, path, state) locating copies such as fs03:/2/5/1290.4050; data and metadata files (e.g., 1290.4051.jpg and 1290.4051.meta) reside on the Storage Servers.]

Figure 3 gives an example of the relationship between the metadata, file, and replica tables. In this configuration, each file is given a unique fileid in the file table. In the replica table, the fileid may occur multiple times, with each row representing a separate replica location in the storage cluster. Accessing a file then involves looking up the fileid, finding the set of associated replica locations, and then selecting a storage node. As can be seen, this database organization provides both the ability to query files based on domain-specific metadata, and the ability to provide scalable data distribution and fault-tolerant operation through the use of replicas. Some of the additional fields such as lastcheck, state, and checksum are used by high level data access operations provided by ROARS to maintain the integrity of the system, and will be discussed in later subsections.
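This three-table layout can be rendered in miniature with any SQL engine; the sketch below uses SQLite purely for illustration (ROARS itself uses a MySQL metadata server, and the exact column types are assumptions):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE files    (fileid INTEGER PRIMARY KEY, size INTEGER,
                           checksum TEXT, createtime INTEGER, state TEXT);
    CREATE TABLE metadata (fileid INTEGER,  -- NULL if there is no raw data
                           name TEXT, type TEXT, value TEXT,
                           owner TEXT, time INTEGER);
    CREATE TABLE replicas (replicaid INTEGER PRIMARY KEY, fileid INTEGER,
                           host TEXT, path TEXT, state TEXT, lastcheck INTEGER);
    """)

    # Accessing a file: resolve a metadata constraint to fileids, then
    # fetch the set of replica locations and select a storage node.
    rows = con.execute("""
        SELECT r.host, r.path
        FROM metadata m JOIN replicas r USING (fileid)
        WHERE m.name = 'emotion' AND m.value = 'smile'
    """).fetchall()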
3.2 Storage Nodes

ROARS utilizes an array of Storage Nodes running Chirp [21] for replicating data. These Storage Nodes are usually conventional machines with large local single disks, organized in a compute cluster. These Nodes are grouped together based on locality into different Storage Groups, and given a groupid. During an IMPORT, ROARS makes a conscious decision to spread out replicas so that each Storage Group has at least one replica, thus providing a static form of load balancing. By convention, if a data object was named X.jpg, then the associated metadata file would be named X.meta, and both of these files are replicated across the Storage Nodes in each of the Storage Groups.

By replicating the raw data across the network, ROARS provides distributed applications scalable, high throughput data access. Moreover, because each Storage Group has at least one copy of the data file, distributed applications can easily take advantage of data locality with ROARS. To facilitate determining where a certain data object resides, ROARS includes a LOCATE command that will find the closest copy of the data for a requesting application. If the application is running on the same Storage Node, then the data is already on the local node, and so no data transfer is needed.

Robustness

Due to the use of data object replication, ROARS is stable and has support for recovery mechanisms and fail-over. By default, data is replicated across the Storage Nodes at least three times. During a read operation, if a replica is not reachable due to server outage or hardware failure, ROARS will randomly try another available replica after a user-specified timeout (normally 30 seconds). As mentioned earlier, Storage Nodes are organized into groups based on their locality. When data is populated into the data repository, ROARS intelligently places the data to ensure that there is a replica in each server group. By spreading replicas across multiple groups, a systematic failure of a Storage Group has only a minimal effect on ROARS operation and performance. Similar to read operations, write operations performed during IMPORT will randomly choose a server within a server group to write data. If the server is not responsive, another is chosen until the write is successful.

ROARS also ensures integrity of the data repository by tracking and comparing checksums of replicas. As a data file is ingested into ROARS, its checksum is calculated and recorded as a part of the data object's metadata. Read/write requests can internally check to make sure the replica's checksum matches the original data file. However, frequent checksum calls can reduce system performance, and so this integrity check is only performed during special data integrity management operations such as AUDIT. This command will scan the metadata database, perform checksums on the data objects, and ensure the current checksums match the ones recorded in the database. In the same process, the AUDIT command will also check the status of the Storage Nodes and perform any maintenance as necessary. Because of this, ROARS gives integrity checking a broader meaning, since it maintains the integrity of the system as a whole, not simply single replicas.

ROARS' robust design also enables transparent and incremental management. Whenever a Storage Node needs to be taken offline or decommissioned, invoking REMOVE will delete all entries associated with that node from the replica table and update the metadata database in a transactional manner. To add new Storage Nodes, an ADD_NODE followed by MIGRATE will spawn the creation of replicas on the new Storage Nodes. ROARS takes advantage of the atomic transactions of the database server to manage these operations. Because of this use of the database as a transaction manager, these operations can be performed transparently and incrementally. For instance, an AUDIT can be scheduled to run for only a few minutes at a time or during off-peak hours. Even if it does not complete, due to the use of the database as a transactional log, it can continue where it left off the next time it is run. The same goes for operations such as MIGRATE or REMOVE. These commands can be paused, rescheduled, and restarted without affecting the integrity of the system and without having to shut down the whole repository.

This robustness further extends to the ability to provide failure independence. Since the data is stored as complete data files rather than striped across multiple Storage Nodes, the integrity of the data files is never affected by a sudden loss of data servers. Additionally, failure of one server does not affect the performance or behavior of the other Storage Nodes. Because the metadata is stored alongside the data file replicas, ROARS is also capable of recovering from the loss of the Database Node, which only serves as a persistent metadata cache. To perform this recovery, a parallel scan can be performed on the Storage Nodes to reconstruct the metadata cache. This is in contrast to systems such as Hadoop and the Google Filesystem, which take special care to replicate all of the state of the NameNode. In the case of a loss of the NameNode in those systems, the layout and organization of the data can be completely lost, since the data is striped across multiple servers. ROARS avoids this problem by storing complete discrete data files, and maintaining the metadata log next to these replicas on the Storage Nodes. This enables ROARS to robustly provide failure independence and a simple means of recovery.
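The read path described above, trying a randomly chosen replica and falling back to another after the timeout, can be sketched as follows. The transport is left abstract; only the 30-second default comes from the text:

    import random, socket

    TIMEOUT = 30  # user-specified timeout, normally 30 seconds

    def read_object(replicas, fetch):
        # `replicas` is a list of (host, path) rows from the replica table;
        # `fetch` is whatever transport reads one replica (e.g., a Chirp client).
        candidates = list(replicas)
        random.shuffle(candidates)      # spread read load across replicas
        for host, path in candidates:
            try:
                return fetch(host, path, timeout=TIMEOUT)
            except (socket.timeout, OSError):
                continue                # outage or failure: try another replica
        raise IOError("no reachable replica")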
4. EVALUATION

To evaluate the performance and operational characteristics of ROARS, we deployed a traditional network filesystem, Hadoop, and ROARS on a testbed cluster consisting of 32 data nodes and 1 separate head node. Each of these storage nodes is a commodity dual-core Intel 2.4 GHz machine, with 4 GB of RAM and 750 GB of disk, all connected via a single Gigabit Ethernet switch. The traditional network filesystem was a single Chirp file server on one of the data nodes. For Hadoop, we configured the Hadoop Distributed Filesystem (HDFS) to use the 32 storage nodes as the HDFS Datanodes and the separate head node as the HDFS Namenode. We kept the usual Hadoop defaults, such as employing a 64 MB block size for HDFS. Our ROARS configuration consisted of a dedicated metadata server running MySQL on the head server and 32 Chirp servers on the same data nodes as the Hadoop cluster. To provide our test software access to these storage systems, we utilized Parrot as a filesystem adaptor.

The following experimental results test the performance of ROARS and demonstrate its capabilities while performing a variety of storage system activities such as importing data, exporting materialized views, and migrating replicas. These experiments also include micro-benchmarks of traditional filesystem operations to determine the latency of common system calls, and concurrent access benchmarks that demonstrate how well the system scales. For these latter performance tests, we compare ROARS's performance to that of the traditional network server and Hadoop, which is an often cited alternative for distributed data archiving. At the end, we include operational results that demonstrate the data management capabilities of ROARS.

4.1 Data Import

Before performing any data access experiments, we first tested the performance of importing large datasets into both Hadoop and ROARS. For this data import experiment, we divided our test into several sets of files. Each set consists of a number of fixed-size files, ranging from 1 KB to 1 GB. To perform the experiment, we imported the data from a local disk to the distributed systems. In the case of Hadoop, this simply involved copying the data from the local machine to HDFS. For ROARS, we used the IMPORT operation.

Figure 4 shows the data import performance for Hadoop and ROARS for several sets of data. The graph shows the throughput as the file sizes increase. For the small file datasets, ROARS data mirroring outperforms HDFS striping, while for the larger file datasets, Hadoop is faster than ROARS. In either case, both ROARS and Hadoop import larger files faster than they do smaller files.

[Figure 4: Import performance. Throughput versus file size, from 1 KB to 1 GB, for ROARS mirroring and HDFS striping.]

The differences in performance between ROARS and Hadoop are due to the way importing and storing replicas works in each system. In the case of Hadoop, replica creation involves a complex set of transactions that set up a dataflow pipeline between the replication nodes of a single block. This overhead is probably the reason why Hadoop is slightly slower for smaller files. For larger files, however, this data pipeline enables higher overall system bandwidth usage and thus leads to better performance than ROARS, which does not perform any data pipelining. Rather, it merely copies the data file to each of the replicas in sequential order. This import and replica creation overhead also explains why large file import is faster than small file import. In ROARS, each imported file needs 9 database transactions, which can be costly when importing small files, where the time spent transferring data is overwhelmed by database transaction execution time. With larger files, there is less time lost to setting up the connections and transactions, and more time spent transferring the data to the Storage Nodes.
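Schematically, importing one object costs a handful of database transactions plus one whole-file copy per Storage Group, performed in sequence. The helpers below (insert_file, pick_node, store, insert_replica) are invented for illustration; the text states only that roughly 9 transactions are issued per file and that no pipelining is done:

    import hashlib, os

    def import_object(db, groups, data_path, meta_path):
        # Record the file's size and checksum in the file table.
        # (The digest algorithm is an assumption; the paper does not name one.)
        digest = hashlib.md5(open(data_path, "rb").read()).hexdigest()
        fileid = db.insert_file(size=os.path.getsize(data_path),
                                checksum=digest)            # transaction
        db.insert_metadata(fileid, open(meta_path).read())  # transaction
        # No pipelining: copy the whole file to one node per Storage Group,
        # one group after another, so every group holds at least one replica.
        for group in groups:
            node = group.pick_node()
            dest = node.store(data_path)   # whole-file copy of X.jpg
            node.store(meta_path)          # replica of X.meta alongside it
            db.insert_replica(fileid, host=node.host, path=dest)  # transaction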
4.2 Metadata Query

In this benchmark, we studied the cost of performing a metadata query. As previously noted, one of the advantages of ROARS over distributed systems such as Hadoop is that it provides a means of quickly searching and manipulating the metadata in the repository. For this experiment, we created multiple metadata databases of increasing size and performed a query that looks for objects of a particular type. As a baseline reference, we performed a custom grep of the database records on a single node accessing HDFS, which is normally what happens in rudimentary scientific data collections. For Hadoop, we stored all the metadata in a single file, and queried the metadata by executing the custom script using MapReduce [5]. For ROARS, we queried the metadata using QUERY, which internally uses the MySQL execution engine. We did this with indexing on and off to examine its effect on performance.

[Figure 5: Query performance. Run-time (seconds, log scale) versus number of records (x1000) for grep on HDFS, MapReduce, MySQL, and MySQL with an index.]

Figure 5 clearly shows that ROARS takes full advantage of the database query capabilities and is much faster than either MapReduce or standard grepping. Evidently, as the metadata database increases in size, the grep performance degrades quickly. The same is true for the QUERY operation. Hadoop, however, mostly retains a steady running time, regardless of the size of the database. This is because the MapReduce version was able to take advantage of multiple compute nodes and thus scale up its performance. Unfortunately, due to the overhead incurred in setting up the computation and organizing the MapReduce execution, the Hadoop query had a high startup cost and thus was slower than MySQL. Furthermore, the standard grep and MySQL queries were performed on a single node, and thus did not benefit from scaling. That said, the ROARS query was still faster than Hadoop, even when the database reached 2,699,488 data objects.
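The indexing effect measured here is easy to reproduce in miniature with any SQL engine. A sketch using SQLite (ROARS uses MySQL; the table shape follows Section 3.1):

    import sqlite3, time

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE metadata (fileid INTEGER, name TEXT, value TEXT)")
    con.executemany("INSERT INTO metadata VALUES (?, 'type', ?)",
                    [(i, "iris" if i % 2 else "face") for i in range(200000)])

    def timed_count():
        t = time.time()
        con.execute("SELECT COUNT(*) FROM metadata "
                    "WHERE name = 'type' AND value = 'iris'").fetchone()
        return time.time() - t

    full_scan = timed_count()
    con.execute("CREATE INDEX idx_meta ON metadata (name, value)")
    indexed = timed_count()
    print(f"full scan: {full_scan:.4f}s  with index: {indexed:.4f}s")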
4.3 Microbenchmark

As mentioned earlier, ROARS does not directly support traditional system calls such as stat, open, read, and close. Rather, ROARS provides these operations to external applications through a Parrot ROARS service which emulates these system calls. Since Hadoop also does not provide a native filesystem interface, we also implemented a Parrot HDFS service. To test the latency of these common filesystem functions, we performed a series of stats, opens, reads, and closes on a single file on the traditional file server, HDFS, and ROARS. For ROARS, we provide the results for a version with SQL query caching and one without this small optimization.

[Figure 6: Microbenchmark. Latency (milliseconds) of stat, open, read, and close for the traditional server, HDFS, ROARS (cached), and ROARS (uncached).]

Figure 6 shows the latency of the micro-operations on a traditional network file server, HDFS, and ROARS. As can be seen, ROARS provides latency comparable to the traditional network server, and in the case of stat, open, and read, lower latency than HDFS. Since all file access went through the Parrot storage adapter, there was some overhead for each system call. However, since all of the storage systems were accessed through the same Parrot adapter, this additional overhead is the same for all of the systems and thus does not affect the relative latencies. These results show that while using a SQL database as a persistent metadata cache does incur an overhead cost that increases the latency of these filesystem micro-operations, the latencies provided by the ROARS system remain comparable to HDFS. Moreover, this additional overhead can be slightly mitigated by caching the SQL queries on the client side, as shown in the graph. With this small optimization, operations such as stat and open are significantly faster with ROARS than with HDFS. Even without this caching, though, ROARS still provides lower latency than HDFS.

4.4 Concurrent Access

To determine the scalability of ROARS in comparison to a traditional file server and HDFS, we exported two different datasets to each of the systems and performed a test that read all of the data in each set. In the case of ROARS, we used a materialized view with symbolic links to take advantage of the data replication features of the system, while for the traditional filesystem and HDFS, we exported the data directory to each of those systems. We ran our test program using Condor [22] with 1-32 concurrent readers.

[Figure 7: Concurrent access. Aggregate throughput (MB/s) versus number of concurrent clients for the traditional server, HDFS, and ROARS: (a) 10,000 x 320 KB files; (b) 1,000 x 5 MB files.]

Figure 7 shows the performance results of all three systems for both datasets. In Figure 7(a), the clients read 10,000 320 KB files, while in Figure 7(b) 1,000 5 MB files were read. In both graphs, the overall aggregate throughput for both HDFS and ROARS increases with an increasing number of concurrent clients, while the traditional file server levels off after around 8 clients. This is because the single file server is limited to a maximum upload rate of about 120 MB/s, which it reaches after 8 concurrent readers. ROARS and HDFS, however, use replicas to enable reading from multiple machines, and thus scale with the number of readers. As with the case of importing data, these read tests also show that accessing larger files is much more efficient in both ROARS and HDFS than working on smaller files.

While both ROARS and HDFS achieve improved aggregate performance over the traditional file server, ROARS outperforms HDFS by a factor of 2. In the case of the small files, ROARS was able to achieve an aggregate throughput of 526.66 MB/s, while HDFS only reached 245.23 MB/s. For the larger test, ROARS hit 1030.94 MB/s and HDFS managed 581.88 MB/s. There are a couple of possible reasons for this difference. First, ROARS has less overhead in setting up data transfers than HDFS, as indicated in the micro-operations benchmarks. Such overhead limits the number of concurrent data transfers and thus aggregate throughput. Another cause for the performance difference is the behavior of the Storage Nodes. In HDFS, each block is checksummed and there is some additional overhead to maintain data integrity, while in ROARS, data integrity is only enforced during high level operations such as IMPORT, MIGRATE, and AUDIT. Since the Storage Nodes in ROARS are simple network file servers, no checksumming is performed during a read operation, while in HDFS data integrity is enforced throughout, even during reads.
4.5 Active Storage

ROARS is also capable of executing programs internally, co-locating the computation with the data that it requires. This technique is known as active storage [12]. In ROARS, an active storage job is dispatched to a specific file server containing the input files, where it is run in an identity box [20] to prevent it from harming the archive.

Active storage is frequently used in ROARS to provide transcoding from one data format to another. For example, a large MPEG format animation might be converted down to a 10-frame low resolution GIF animation to use as a preview image on a web site. A given web page might show tens or hundreds of thumbnails that must be transcoded and displayed simultaneously. With active storage, we can harness the parallelism of the cluster to deliver the result faster.

Figure 8 shows the performance of transcoding various kinds of images using the active storage facilities of ROARS. Each row shows the turnaround time (in seconds) to convert 50 images of the given type. The 'Local' row shows the time to complete the conversions sequentially using ROARS as an ordinary filesystem. The 'Remote' rows show the turnaround time using the indicated number of active storage servers. As can be seen, the active storage facility does not help when applied to small still images, but offers a significant speedup when applied to large videos with significant processing cost.

    Figure 8: Transcoding in Active Storage (seconds to convert 50 images)

    Method      Iris (300KB)  Face (1MB)  Still (5MB)  Video (50MB)
    Local            10           18          106          187
    Remote x2        80           45          150          134
    Remote x4        23           26           57           79
    Remote x8        22           16           58           70
    Remote x16       12           12           18           33
    Remote x32       12           17           16           17

4.6 Integrity Check & Recovery

In ROARS, the AUDIT command is used to perform an integrity check. As we have mentioned, the file table keeps records of a data file's size, checksum, and last checked date. AUDIT uses this information to detect suspect replicas and replace them. At the lowest level, AUDIT checks the size of the replicas to make sure it is the same as the file table entries indicate. This type of check is not expensive to perform, but it is also not reliable. A replica could have a number of bytes modified, but remain the same size. A better way to check a replica's integrity is to compute the checksum of the replica, and compare it to the value in the file table. This is expensive because the process needs to read the whole replica to compute the checksum.

[Figure 9: Cost of calculating checksums. Elapsed time (seconds) versus file size, from 4 KB to 2 GB, for ROARS and HDFS.]

Figure 9 shows the cost of computing checksums in both ROARS and HDFS. As file size increases, the time required to perform a checksum also increases for both systems. However, when the file size is bigger than an HDFS block size (64 MB), ROARS begins to outperform HDFS, because the latter incurs additional overhead in selecting a new block and setting up a new transaction. Moreover, ROARS lets Storage Nodes perform the checksum remotely, where the data file is stored, while for HDFS this data must be streamed locally before the operation can be performed.

When a replica is deemed to be suspect, ROARS will spawn a new replica and delete the suspect copy. ROARS does this by making a copy of a good replica. There are two ways to do this. The first way is to read the good replica to a local machine and then copy it to a Storage Node (first party put). Another way is to tell the Storage Node where the good copy is located and have it perform the transfer on the user's behalf (third party put). The latter requires 2 extra file server operations on the target Storage Nodes.
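The two levels of checking that AUDIT applies to a single replica, the cheap size comparison and the expensive full checksum, might look like this in outline (the digest algorithm is an assumption; the text does not name one):

    import hashlib, os

    def replica_is_good(path, expected_size, expected_checksum):
        # Cheap but unreliable: a corrupted replica can keep its size.
        if os.path.getsize(path) != expected_size:
            return False
        # Reliable but expensive: the whole replica must be read.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest() == expected_checksum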
4.7 Dynamic Data Migration

ROARS is a highly flexible data management system where users can transparently and incrementally add and remove Storage Nodes without shutting down a running system. The system provides operations to add new Storage Nodes (ADD_NODE), migrate data to new Storage Nodes (MIGRATE), and remove data from old, unreliable Nodes (REMOVE_NODE).

To demonstrate the fault tolerance and failover features of ROARS, we set up a migration experiment as follows. We added 16 new Storage Nodes to our current system, and we started a MIGRATE process to spawn new replicas. Starting with 30 active Storage Nodes, we intentionally turned off a number of Storage Nodes during the MIGRATE process. After some time, we turned some Storage Nodes back on, leaving the others inactive. By dropping Storage Nodes from the system, we wanted to ensure that ROARS could still be functional even when hardware failure occurs.

[Figure 10: Dynamic data migration. Data transferred (GB), online Storage Nodes, and overall throughput (MB/s) over the elapsed time of the experiment, with switch-off and switch-on events marked.]

Figure 10 demonstrates that ROARS remained operational during the MIGRATE process. As expected, throughput takes a dip as the number of active Storage Nodes decreases. The decrease in performance occurs because when ROARS contacts an inactive Storage Node, it fails to obtain the necessary replica for copying. Within a global timeout, ROARS will retry the connection to the same Storage Node and then move on to the next available Node. As Nodes remain inactive, ROARS continues to endure more and more timeouts, which leads to the decrease in system throughput.

Although throughput decreases only slightly when there are just two inactive Storage Nodes, it takes a more significant hit when there is a larger number of inactive Storage Nodes. There are ways to reduce this negative effect on performance. First, ROARS can dynamically shorten the global timeout, effectively cutting down retry time. Or better yet, ROARS can detect inactive Storage Nodes after a number of failed attempts and blacklist them, thus avoiding picking replicas from inactive Nodes in the future.

5. RELATED WORK

Our goal was to construct a scientific data repository that required both scalable fault-tolerant data storage and efficient querying of the rich domain-specific metadata. Unfortunately, traditional filesystems and databases fail to meet both of these requirements. While most distributed filesystems provide robust scalable data archiving, they fail to adequately provide for efficient rich metadata operations. In contrast, database systems provide efficient querying capabilities, but fail to match the workflow of scientific researchers.

5.1 Filesystems

In order to facilitate sharing of scientific data, scientific researchers usually employ various network filesystems such as NFS [13] or AFS [9] to provide data distribution and concurrent access. To get scalable and fault tolerant data storage, scientists may look into distributed storage systems such as Ceph [25] or Hadoop [8]. Most of the data in these filesystems is organized into sets of directories and files along with associated metadata. Since some of these filesystems, such as Ceph and Hadoop, perform automatic data replication, they provide not only fault-tolerant data access but also the ability to scale the system. Therefore, in regards to the need for scalable, fault-tolerant data storage, current distributed storage systems adequately meet this requirement.

Where filesystems still fail, however, is in providing an efficient means of performing rich metadata queries. Since filesystems do not provide a direct means to perform these metadata operations, export processes usually involve a complex set of ad hoc scripts which tend to be error prone, inflexible, and unreliable. More importantly, these manual searches through the data repository are also time consuming, since all of the metadata in the repository must be analyzed for each export. Although some distributed systems such as Hadoop provide programming tools such as MapReduce [5] to facilitate searching through large datasets in a reliable and scalable manner, these full repository searches are still costly and time consuming, since each experimental run has to scan the repository and extract the particular data files required by the user. Moreover, even with the presence of these programming tools, it is still not possible to dynamically organize and group subsets of the data repository based on the metadata in a persistent manner, making it difficult to export reusable snapshots of particular datasets.
5.2 Databases

The other common approach to managing scientific data is to go the route of projects such as the Sloan Digital Sky Survey [18]. That is, rather than opt for the "flat file" data access pattern used in filesystems, the scientific data is collected and organized directly in a large distributed database such as MonetDB [10] or Vertica [23]. Besides providing efficient query capabilities, such systems also provide advanced data analysis tools to examine and probe the data. However, these systems remain undesirable to many scientific researchers.

The first problem with database systems is that in order to use them, the data must be organized in a highly structured explicit schema. From our experience, it is rarely the case that scientific researchers know the exact nature of their data a priori, or which attributes are relevant or necessary. Because scientific data tends to be semi-structured rather than highly structured, this requirement of a full explicit schema imposes a barrier to the adoption of database systems and explains why most research groups opt for filesystem based storage systems, which fit their organic and evolving method of data collection.

Most importantly, database systems are not ideal for scientific data repositories because they do not fit into the workflow commonly used by scientific researchers. In projects such as the Sloan Digital Sky Survey and Sequoia 2000 [17], the scientific data is directly stored in database tables and the database system is used as a data processing and analysis engine to query and search through the data. For scientific projects such as these, the recent work outlined by Stonebraker et al. [16] is a more suitable storage system for these highly structured scientific repositories.

In most fields of scientific research, however, it is not feasible or realistic to put the raw scientific data directly into the database and use the database as an execution engine. Rather, in fields such as biological computing, for instance, genome sequence data is generally stored in large flat files and analyzed using highly optimized tools such as BLAST [1] on distributed systems such as Condor [22]. Although it may be possible to stuff the genome data into a high-end database and use the database engine to execute BLAST as a UDF (user defined function), this goes against the common practices of most researchers and diverts from their normal workflow. Therefore, using a database as a scientific data repository moves the scientists away from their domains of expertise and their familiar tools to the realm of database optimization and management, which is not desirable for many scientific researchers.

Because of these limitations, traditional distributed filesystems and databases are not desirable for scientific data repositories, which require both large scalable storage and efficient rich metadata operations. Although distributed systems provide robust and scalable data storage, they do not provide direct metadata querying capabilities. In contrast, databases do provide the necessary metadata querying capabilities, but fail to fit into the workflow of research scientists. The purpose of ROARS is to address these shortcomings by constructing a hybrid system that leverages the strengths of both distributed filesystems and relational databases to provide fault-tolerant scalable data storage and efficient rich metadata manipulation.

This hybrid design is similar to SDM [11], which also utilizes a database together with a filesystem. The design of SDM is highly optimized for n-dimensional array data. Moreover, SDM uses multiple disks to support high throughput I/O for MPI [6], while ROARS uses a distributed active storage cluster. Another example of a filesystem-database combination is HEDC [15]. HEDC is implemented on a single large enterprise-class machine rather than an array of Storage Nodes.
iRODS [24] and its predecessor the Storage Resource Broker [3] support tagged searchable metadata implemented as a vertical schema. ROARS manages metadata with a horizontal schema pointing to files and replicas, which allows the full expressiveness of SQL to be applied.

6. CONCLUSION

We have described the overall design and implementation of ROARS, an archival system for scientific data with support for rich metadata operations. ROARS couples a database server and an array of Storage Nodes to provide users the ability to search data quickly, and to store large amounts of data while enabling high performance throughput for distributed applications. Through our experiments, ROARS has demonstrated the ability to scale up and perform as well as HDFS in most cases, and to provide unique features such as transparent, incremental operation and failure independence.

Currently, ROARS is used as the backend storage of BXGrid [4], a biometrics data repository. At the time of writing, BXGrid has 265,927 recordings for a total of 5.1 TB of data spread across 40 Storage Nodes and has been used in production for 16 months.

7. ACKNOWLEDGEMENTS

This work was supported by National Science Foundation grants CCF-06-21434, CNS-06-43229, and CNS-01-30839. This work is also supported by the Federal Bureau of Investigation, the Central Intelligence Agency, the Intelligence Advanced Research Projects Activity, the Biometrics Task Force, and the Technical Support Working Group through US Army contract W91CRB-08-C-0093.

8. REFERENCES

[1] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 3(215):403-410, Oct 1990.
[2] Amazon Simple Storage Service (Amazon S3). http://aws.amazon.com/s3/, 2009.
[3] C. Baru, R. Moore, A. Rajasekar, and M. Wan. The SDSC storage resource broker. In Proceedings of CASCON, Toronto, Canada, 1998.
[4] H. Bui, M. Kelly, C. Lyon, M. Pasquier, D. Thomas, P. Flynn, and D. Thain. Experience with BXGrid: A Data Repository and Computing Grid for Biometrics Research. Journal of Cluster Computing, 12(4):373, 2009.
[5] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Operating Systems Design and Implementation, 2004.
[6] J. J. Dongarra and D. W. Walker. MPI: A standard message passing interface. Supercomputer, pages 56-68, January 1996.
[7] S. Ghemawat, H. Gobioff, and S. Leung. The Google filesystem. In ACM Symposium on Operating Systems Principles, 2003.
[8] Hadoop. http://hadoop.apache.org/, 2007.
[9] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, and M. West. Scale and performance in a distributed file system. ACM Trans. on Comp. Sys., 6(1):51-81, February 1988.
[10] M. Ivanova, N. Nes, R. Goncalves, and M. Kersten. MonetDB/SQL meets SkyServer: the challenges of a scientific database. In Scientific and Statistical Database Management, International Conference on, 0:13, 2007.
[11] J. No, R. Thakur, and A. Choudhary. Integrating parallel file I/O and database support for high-performance scientific data management. In IEEE High Performance Networking and Computing, 2000.
[12] E. Riedel, G. A. Gibson, and C. Faloutsos. Active storage for large scale data mining and multimedia. In Very Large Databases (VLDB), 1998.
[13] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon. Design and implementation of the Sun network filesystem. In USENIX Summer Technical Conference, pages 119-130, 1985.
[14] R. Sears, C. van Ingen, and J. Gray. To BLOB or not to BLOB: Large object storage in a database or a filesystem. Technical Report MSR-TR-2006-45, Microsoft Research, April 2006.
[15] E. Stolte, C. von Praun, G. Alonso, and T. Gross. Scientific data repositories: designing for a moving target. In SIGMOD, 2003.
[16] M. Stonebraker, J. Becla, D. J. DeWitt, K.-T. Lim, D. Maier, O. Ratzesberger, and S. B. Zdonik. Requirements for science data bases and SciDB. In CIDR, 2009. www.crdrdb.org.
[17] M. Stonebraker, J. Frew, and J. Dozier. An overview of the Sequoia 2000 project. In Proceedings of the Third International Symposium on Large Spatial Databases, pages 397-412, 1992.
[18] A. S. Szalay, P. Z. Kunszt, A. Thakar, J. Gray, and D. R. Slutz. Designing and mining multi-terabyte astronomy archives: The Sloan Digital Sky Survey. In SIGMOD Conference, 2000.
[19] O. Tatebe, N. Soda, Y. Morita, S. Matsuoka, and S. Sekiguchi. Gfarm v2: A grid filesystem that supports high-performance distributed and parallel data computing. In Computing in High Energy Physics (CHEP), September 2004.
[20] D. Thain. Identity Boxing: A New Technique for Consistent Global Identity. In IEEE/ACM Supercomputing, pages 51-61, 2005.
[21] D. Thain, C. Moretti, and J. Hemmes. Chirp: A Practical Global Filesystem for Cluster and Grid Computing. Journal of Grid Computing, 7(1):51-72, 2009.
[22] D. Thain, T. Tannenbaum, and M. Livny. Condor and the grid. In F. Berman, G. Fox, and T. Hey, editors, Grid Computing: Making the Global Infrastructure a Reality. John Wiley, 2003.
[23] Vertica. http://www.vertica.com/, 2009.
[24] M. Wan, R. Moore, W. Schroeder, and A. Rajasekar. A prototype rule-based distributed data management system. In HPDC Workshop on Next Generation Distributed Data Management, May 2006.
[25] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In USENIX Operating Systems Design and Implementation, 2006.