SOFTWARE: PRACTICE AND EXPERIENCE
Softw. Pract. Exper., (6), 551–576 (1999)

Cluster Computing: The Commodity Supercomputer

MARK BAKER AND RAJKUMAR BUYYA
School of Computer Science, University of Portsmouth, Milton Campus, Southsea, Hants, PO4 8JF, UK (email: Mark.Baker@port.ac.uk)

Correspondence to: M. Baker, School of Computer Science, University of Portsmouth, Milton Campus, Southsea, Hampshire, PO4 8JF, UK.

[ ] points out there are three ways of improving performance: (a) Work harder, (b) Work smarter, and (c) Get help. [...] techniques used to solve computational tasks. Finally, getting help refers to using multiple computers to solve a particular task.

The use of parallel processing as a means of providing high-performance computational facilities for large-scale and grand-challenge applications has been investigated widely for many years. However, until fairly recently, the benefits of this research were confined to those individuals who had access to large (and normally expensive) parallel computing platforms.

Today the situation is changing. Many organisations are moving away from using large supercomputing facilities and turning towards clusters of workstations. This move is primarily due to the recent advances in high speed networks and improved microprocessor performance. These advances mean that clusters are becoming an appealing and cost effective vehicle for parallel computing. Clusters, built using commodity hardware and software components, are now playing a major role in redefining the concept of supercomputing and the situation now exists where clusters can be considered today's commodity supercomputers.

An important factor that has made the usage of workstations a practical proposition is the standardisation of many of the tools and utilities used by parallel applications. Examples of these standards are the message passing library MPI [ ] and data-parallel language HPF [ ]. In this context, standardisation enables applications to be developed, tested and even run on NOW and then at a later stage to be ported, with little modification, onto dedicated parallel platforms where CPU-time is accounted and charged for.

The following list highlights some of the reasons why workstation clusters are preferred over specialised parallel computers [ ]:
(a) Individual workstations are becoming increasingly powerful.
(b) The communications bandwidth between workstations is increasing as new networking technologies and protocols are implemented.
(c) Workstation clusters are easier to integrate into existing networks than specialised parallel computers.
(d) Typical low user utilisation of personal workstations.
(e) The development tools for workstations are more mature than the contrasting proprietary solutions for parallel computers – mainly due to the non-standard nature of many parallel systems.
(f) Workstation clusters are a cheap and readily available alternative to specialised high performance computing platforms.
(g) Clusters can be enlarged and individual node capabilities can be easily extended, for example additional memory or processors can be installed.

At a basic level a cluster is a collection of workstations or PCs that are interconnected via some network technology. A more likely scenario is that the computers will be state-of-the-art and high-performance and the network will be one with a high bandwidth and low latency. Such a cluster can provide fast and reliable services to computationally intensive applications. A cluster is capable of providing similar or better performance and reliability than traditional mainframes or supercomputers. Also, if designed correctly, better fault-tolerance at a much lower hardware cost can be achieved. Many applications have successfully used clusters including computationally intensive ones such as quantum chemistry and computational fluid dynamics. In this paper, we focus on the components, tools, techniques, and methodologies involved in using clusters for high-performance or parallel computing.
Networks of Workstations (NOW), Clusters of Workstations (COW) and Workstation Clusters are synonymous.

Copyright 1999 John Wiley & Sons, Ltd. Softw. Pract. Exper., (6), 551–576 (1999)

Towards low cost parallel computing

In the 1980s, it was believed that computer performance was best improved by creating faster and more efficient processors. This idea was challenged by parallel processing, which in essence means linking together two or more computers to jointly solve some problem. Since the early 1990s there has been an increasing trend to move away from expensive and specialised proprietary parallel supercomputers towards networks of workstations. Among the driving forces that have enabled this transition have been the rapid improvement and availability of commodity high-performance components for workstations and networks. These technologies are making networks of computers (PCs or workstations) an appealing vehicle for parallel processing and this is consequently leading to low-cost commodity supercomputing.

Clusters can also be classified as [ ]:
(a) Dedicated clusters.
(b) Non-dedicated clusters.

The distinction between these two cases is based on the ownership of the workstations in a cluster. In the case of dedicated clusters, a particular individual does not own a workstation and the resources are shared so that parallel computing can be performed across the entire cluster [ ]. The alternative non-dedicated case is where individuals own workstations and, in this case, applications are executed by stealing idle CPU cycles [ ]. The rationale for this is based on the fact that most workstation CPU cycles are unused, even during peak hours [ ].
Parallel computing on a dynamically changing set of non-dedicated workstations is called adaptive parallel computing.

Where non-dedicated workstations are used, a tension exists between the workstation owners and remote users who need the workstations to run their application. The former expect a fast interactive response from their workstation, whilst the latter are only concerned with fast application turnaround by utilising any spare CPU cycles. This emphasis on sharing the processing resources removes the concept of node ownership and introduces the need for complexities such as process migration and load balancing strategies. Such strategies allow clusters to deliver adequate interactive performance as well as providing shared resources to demanding sequential and parallel applications.

Clearly, the workstation environment is better suited to applications that are not communication intensive – typically, one would see high message start-up latencies and low bandwidths. If an application requires higher communication performance, the existing LAN architectures, such as Ethernet, are not capable of providing it.

Traditionally in science and industry, a workstation referred to some sort of UNIX platform and the dominant function of PC-based machines was for administrative work and word processing. There has, however, been a rapid convergence in processor performance and kernel-level functionality of UNIX workstations and PC-based machines in the last three years – this can be associated with the introduction of high-performance Pentium-based machines and the Windows NT operating system. This convergence has led to an increased level of interest in utilising PC-based systems as some form of computational resource for parallel computing. This factor, coupled with the comparatively low cost of PCs and their widespread availability in both academia and industry, has helped initiate a number of software projects whose primary aim is to harness these resources in some collaborative way.
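The tension described above between workstation owners and remote users is typically resolved by a placement policy that only borrows nodes whose owners are away. The following is a minimal sketch of such a policy under stated assumptions, not the scheduler of any particular system: the Workstation record, the owner_active flag, the load figures, the max_load threshold and the per-task cost are all hypothetical names and values introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Workstation:
    name: str
    load: float          # fraction of CPU in use, 0.0 to 1.0 (hypothetical metric)
    owner_active: bool   # True while the owner is working interactively


def pick_idle_node(nodes: List[Workstation],
                   max_load: float = 0.25) -> Optional[Workstation]:
    """Return the least-loaded workstation whose owner is away, or None."""
    candidates = [n for n in nodes
                  if not n.owner_active and n.load <= max_load]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.load)


def place_tasks(tasks: List[str], nodes: List[Workstation],
                cost: float = 0.1) -> Dict[str, Optional[str]]:
    """Greedily assign each task to the currently least-loaded idle node.

    `cost` is an assumed load increment per task. A task that finds no
    idle node is left unplaced (mapped to None) rather than disturbing
    a workstation owner.
    """
    placement: Dict[str, Optional[str]] = {}
    for task in tasks:
        node = pick_idle_node(nodes)
        if node is None:
            placement[task] = None
            continue
        placement[task] = node.name
        node.load += cost  # account for the work just placed on this node
    return placement
```

A real system would also need the process migration mentioned above, so that a task can be moved off a node when its owner returns; this sketch only covers the initial placement decision.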
COMMODITY COMPONENTS FOR CLUSTERS

The continuous improvement of workstation and network performance and the availability of standardised APIs are helping pave the way for cluster-based parallel computing. In this section, we discuss some of the hardware and software components commonly used.

Processors and memory

Processors

Over the past two decades, rapid advances have taken place in the design of microprocessor architectures, for example, RISC, CISC, VLIW, and Vector. Today, a single-chip CPU is almost as powerful as processors used in supercomputers of the recent past. Indeed, recently researchers have been attempting to integrate combinations of processor, memory and network interfaces into a single chip. For example, the Berkeley Intelligent RAM [ ] project is exploring the entire spectrum of issues involved in designing general-purpose computer systems that integrate a processor and DRAM onto a single chip – from circuits, VLSI design and architectures to compilers and operating systems. Digital, in their Alpha 21364 processor, are trying to integrate processing, memory controller and network interface into a single chip.

Intel processors are most commonly used in PC-based computers. The current generation Intel x86 processor family includes the Pentium Pro and II. These processors, whilst not in the high band of performance, match the performance of medium level workstation processors [ ]. In the high performance band, the Pentium Pro has good integer performance, beating Sun's UltraSPARC at the same clock speed; however, the floating-point performance is much lower. The Pentium II Xeon [
], like the newer Pentium II's, uses a 100 MHz memory bus. It is available with a choice of 512 KB to 2 MB of L2 cache, and the cache is clocked at the same speed as the CPU, overcoming the L2 cache size and performance issues of the plain Pentium II. The accompanying 450NX chipset for the Xeon has a 64-bit PCI bus which can support interconnects.

Other popular processors include x86 variants (AMD x86, Cyrix x86), Digital Alpha, IBM PowerPC, SUN SPARC, SGI MIPS, and HP PA. Computer systems based on these processors have also been used as clusters; for example, Berkeley NOW uses Sun's SPARC family of processors in their cluster nodes. Further information about the performance of commodity microprocessors can be found at the VLSI Microprocessors Guide [ ].

Memory

The amount of memory needed for the cluster is likely to be determined by the cluster target applications. Programs that are parallelised should be distributed such that the memory, as well as the processing, is distributed between processors for scalability. Thus, it is not necessary to install enough RAM to hold the entire problem in memory on each system (in-core), but there should be enough to avoid frequent swapping of memory blocks (page-miss) to disk, since disk access has a huge impact on performance.

Access to DRAM is extremely slow compared to the speed of the processor and can take orders of magnitude more time than a single CPU clock cycle. Caches are used to overcome the DRAM bottleneck. Recently used blocks of memory are kept in cache for fast CPU access
if another reference to the memory is made. However, the very fast memory used for cache (Static RAM) is expensive and cache control circuitry becomes more complex as the size of the cache increases. Because of these limitations, the total size of a cache is usually in the range of 8 KB to 2 MB.

Within Pentium based machines it is not uncommon to have a 64-bit wide memory bus as well as a chipset that supports 2 Mbytes of external cache. These improvements were necessary to exploit the full power of the Pentiums, and make the memory architecture very similar to that of UNIX workstations.

Disk I/O

There are two disk interfaces commonly used with PCs. The most common is IDE, which on early generations of the PC was usually found on a daughter card, but is now often built into Pentium motherboards in the Enhanced IDE (EIDE) form [ ]. This is a 16-bit wide interface with a peak transfer rate of 5 MBytes/s. The largest IDE drive commonly available is 10 Gbytes with a price of about $35 per GByte. An IDE interface allows two devices to be connected, and there are typically only one or two interfaces within a PC. CD-ROMs and tape drives are also available with IDE interfaces.

The other disk interface found on PCs is SCSI [ ]. The Fast and Wide SCSI interface is capable of a peak rate of 10 MBytes/s in asynchronous and 20 MBytes/s in synchronous modes. A SCSI interface may contain up to 8 devices (this includes the host). Although most commonly used for disks, other devices such as CD-ROMs, printers and scanners are available with SCSI interfaces.

For similar sizes, SCSI disks tend to outperform IDE drives; however, IDE has a significant price advantage (as much as two to one). SCSI disks are, however, available with much larger capacities: up to about 18 GBytes is common, retailing at about $55 per GByte.

Improving I/O performance

Improvements in disk access time have not kept pace with microprocessor performance, which has been improving by 50 per cent or more per year. Although magnetic-media densities have increased, reducing disk transfer times by approximately 60–80 per cent per annum [
], overall improvement in disk access times, which rely upon advances in mechanical systems, has been less than 10 per cent per year.

Parallel and grand-challenge applications need to process large amounts of data and data sets. Amdahl's law implies that the speedup obtained from faster processors is limited by the system's slowest components; therefore, it is necessary to improve I/O performance so that it is balanced with the CPU's performance. One way of improving I/O performance is to carry out I/O operations concurrently with the support of a parallel file system. One such parallel file system, known as software RAID, can be constructed by using the disks associated with each workstation in the cluster.

Cluster interconnects

Individual nodes in a cluster are usually connected with a low-latency and high-bandwidth network. The nodes communicate over high-speed networks using a standard networking protocol such as TCP/IP or a low-level protocol such as Active Messages [ ] or Fast Messages [ ]. A number of high performance network technologies are available in the marketplace. In this section, we discuss several of these technologies, including Fast Ethernet, ATM, and Myrinet as the means of interconnecting clusters. Further performance figures using the NetPIPE analysis tool can be found on the Interconnect Performance Web site [ ].

Requirements

In most facilities where there is more than one workstation it is still likely that the interconnect will be via standard Ethernet [
]. In terms of performance (latency and bandwidth), this technology is showing its age, but it is a cheap and easy way to provide file and printer sharing. A single Ethernet connection cannot be used seriously as the basis for cluster-based computing: its bandwidth and latency are not balanced compared to the computational power of the workstations now available. Typically one would expect the cluster interconnect bandwidth to exceed 10 MBytes/s and have message latencies of less than 100 μs. A number of interconnect options now exist. Some, such as ATM, were originally intended for use in WANs but have been adapted to be used as the basis of high performance LANs. Other interconnects have been designed primarily for use in LANs, for example Myrinet.

The system bus

The initial PC bus (AT, or now known as ISA bus) was clocked at 5 MHz and was 8 bits wide. When first introduced its abilities were well matched to the rest of the system. PCs are modular systems and until fairly recently only the processor and memory were located on the motherboard, other components being typically found on daughter cards, connected via a system bus. The performance of PCs has increased significantly since the ISA bus was first used and it has consequently become a bottleneck which has limited the machine's throughput abilities. The ISA bus was extended to be 16 bits wide and was clocked in excess of 13 MHz. This, however, is still not sufficient to meet the demands of the latest CPUs, disk interfaces and other peripherals. Although a group of PC manufacturers introduced the VESA local bus, a 32-bit bus which matched the system's clock speed, this has largely been supplanted by the Intel created PCI [ ] bus, which allows 133 Mbytes/s transfers and is used inside Pentium based PCs. PCI has also been adopted for use in non-Intel based platforms such as the Digital AlphaServer range. This has further blurred the distinction between PCs and workstations as the I/O sub-system of a workstation may be built from commodity interface and interconnect cards.

Ethernet and Fast Ethernet

Standard Ethernet [
] has become almost synonymous with workstation networking. This technology is in widespread usage, both in the academic and commercial sectors; however, its 10 Mbps bandwidth is no longer sufficient for use in environments where users are transferring large data quantities or there are high traffic densities. An improved version, commonly known as Fast Ethernet, provides 100 Mbps bandwidth and has been designed to provide an upgrade path for existing Ethernet installations. Standard and Fast Ethernet cannot co-exist on a particular cable, but each uses the same cable type. When an installation is hub-based and uses twisted-pair it is possible to upgrade the hub to one which supports both standards, and replace the Ethernet cards in only those machines where it is believed to be necessary.

Asynchronous Transfer Mode (ATM)

ATM [ ] is a switched virtual-circuit technology and it was originally developed for the telecommunications industry. It is embodied within a set of protocols and standards defined by the International Telecommunications Union (ITU). The international ATM Forum, a non-profit organisation, continues this work. Unlike some other networking technologies, ATM is intended to be used for both LAN and WAN, presenting a unified approach to both. ATM is based around small fixed sized data packets called cells. It is designed to allow cells to be transferred using a number of different media such as both copper wire and fibre optic. This hardware variety also results in a number of different interconnect performance levels.

When it was first introduced ATM used optical fibre as the link technology. However, in desktop environments this is undesirable: for example, twisted pair cables may have been used to interconnect a networked environment and moving to fibre-based ATM would mean an expensive upgrade. The two most common cabling technologies found in a desktop environment are telephone style cables, termed Category 3, and a better quality cable termed Category 5. The former, CAT-3, is suitable for Ethernet; the latter, CAT-5, is more appropriate for Fast Ethernet and is suitable for use with ATM. CAT-5 can be used with ATM at speeds of 15.5 MBytes/s, allowing upgrades of existing networks without replacing cabling.

Scalable Coherent Interface (SCI)

SCI [ , ] is an IEEE 1596 standard aimed at providing low-latency distributed shared memory access across a network. SCI is the modern equivalent of a Processor-Memory-I/O bus and LAN, combined. It is designed to support distributed multiprocessing with high bandwidth and low latency. It provides a scalable architecture that allows large systems to be built out of many inexpensive mass produced components [ ].

SCI is a point-to-point architecture with directory-based cache coherence. It can reduce the delay of interprocessor communications even when compared to the newest and best technologies currently available, such as Fibre Channel and ATM. SCI achieves this by eliminating the need for run-time layers of software protocol-paradigm translation. A remote communication in SCI takes place as just a part of a simple load or store process in a processor. Typically, a remote address results in a cache miss. This in turn causes the cache controller to address remote memory via SCI to get the data. The data is fetched to cache with a delay in the order of a few μs and then the processor continues execution.

Dolphin currently produces SCI cards for the SPARC SBus and PCI-based systems. Dolphin provide a version of MPI for their SCI cards. Currently MPI is available for Sun SPARC platforms, where it achieved a less than 12 μs zero message-length latency; they intend to port MPI to Windows NT in the near future. The Portland Group Inc. provide a SCI version of High Performance Fortran (HPF). Further information relating to SCI-based cluster computing activities can be found at the University of Florida's HCS Lab [ ].

Although SCI is favoured in terms of fast distributed shared memory support, it has not been taken up widely because its scalability is constrained by the current generation of switches and its components are relatively expensive.

Myrinet

Myrinet is a 1.28 Gbps full duplex LAN supplied by Myricom [ ]. It is a proprietary, high performance interconnect used in many of the more expensive clusters. Myrinet uses
low latency cut-through routing switches, which offer fault tolerance by automatic mapping of the network configuration and simplify the setting up of a network. There are Myrinet drivers for many operating systems, including Linux, Solaris and Windows NT. In addition to TCP/IP support, the MPICH implementation of MPI [ ] is provided as well as a number of other customer developed packages, which offer sub-10 μs latencies.

Myrinet is rather expensive when compared to Fast Ethernet, but has the advantage of a very low latency (5 μs, one-way point-to-point), high throughput (1.28 Gbps), and a programmable on-board processor that allows for greater flexibility. Myrinet can saturate the effective bandwidth of a PCI bus at almost 120 Mbytes/s with 4 Kbytes packets [ ].

One of the main disadvantages of Myrinet is, as mentioned, its price compared to Fast Ethernet. The cost of Myrinet-LAN components for connecting high-performance workstations or PCs, including the cables and switches, is in the range of $1600 to $1800 per host. Also, switches with more than 16 ports are not available, so scaling can be messy, although switch chaining can be used to construct larger Myrinet clusters.

Operating Systems

In this section we focus on some of the popular operating systems available for workstation platforms. Operating system technology has matured and today an OS can be easily extended and new subsystems added without modification of the underlying structure. Modern operating systems support multithreading at kernel level, and high-performance user level multithreading systems can be built without kernel intervention. Most PC operating systems have become stable and support multitasking, multithreading, and networking and they are smart enough to operate on each cluster node. The most popular and widely used cluster operating systems are Solaris, Linux and Windows NT.

Linux

Linux [
] is a UNIX-like operating system, which was initially developed by Linus Torvalds, a Finnish undergraduate student, in 1991–92. The original releases of Linux relied heavily on the Minix OS; however, the efforts of a number of collaborating programmers resulted in development and implementation of a robust and reliable, POSIX compliant, operating system.

Although initially developed by a single author, there are now a large number of authors involved in the development of Linux. One major advantage of this distributed development has been that there is a wide range of software tools, libraries and utilities available. This is due to the fact that any capable programmer has access to the OS source and can implement the feature that they wish. Obviously the features that are implemented vary in functionality and quality but the release strategy of Linux, that kernel releases are still controlled at a single point, and availability via the Internet, which leads to fast feedback about bugs and other problems, has led to Linux providing a very rich and stable environment.

The following are some of the reasons why Linux is so popular:
(a) Linux runs on cheap x86 platforms, yet offers the power and flexibility of UNIX.
(b) Linux is available from the Internet and can be downloaded without cost.
(c) It is easy to fix bugs and improve system performance.
(d) Users can develop or fine-tune hardware drivers and these can easily be made available to other users.

Figure 1. Windows NT architecture

Linux provides the features typically found in UNIX implementations such as:
(a) Pre-emptive multi-tasking.
(b) Demand-paged virtual memory.
(c) Multi-user support.
(d) Multi-processor support.

Linux provides a POSIX compatible UNIX environment. Most applications written for UNIX will require little more than a recompile. In addition to the Linux kernel, a large amount of application and systems software is freely available. This includes the GNU software such as Bash, Emacs and the C compiler as well as XFree86, a public domain X server.

Microsoft Windows NT

Microsoft Corp. [
] is the dominant provider of software in the personal computing marketplace. Microsoft provides two basic operating systems: Windows 95/98 and Windows NT 4 (soon to become Windows NT 5/2000) [ ]. NT and Windows 95 had approximately 66 per cent of desktop operating systems market share in 1996 – IBM OS/2, UNIX, Mac OS and DOS comprise the remainder of the market share.

Windows NT (both Workstation and Server versions) is a 32-bit pre-emptive, multi-tasking and multi-user operating system [ , ]. NT is fault tolerant – each 32-bit application operates in its own virtual memory address space. Unlike earlier versions of Windows (such as Windows for Workgroups and Windows 95/98), NT is a complete operating system and not an addition to DOS. NT supports most CPU architectures, including Intel x86, IBM PowerPC, MIPS and DEC Alpha. NT also supports multiprocessor machines through the usage of threads. NT has an object-based security model and its own special file system (NTFS) that allows permissions to be set on a file and directory basis. A schematic of the NT architecture is shown in Figure 1. Windows NT has the network protocols and services integrated with the base operating system.
Solaris

The Solaris operating system from SunSoft is a UNIX based multithreaded and multiuser operating system. It supports Intel x86 and SPARC based platforms. Its networking support includes a TCP/IP protocol stack and layered features such as Remote Procedure Calls (RPC), and the Network File System (NFS). The Solaris programming environment includes ANSI-compliant C and C++ compilers, as well as tools to profile and debug multithreaded programs.

The Solaris kernel supports multithreading, multiprocessing and has real-time scheduling features that are critical for multimedia applications. Solaris supports two kinds of threads: Light Weight Processes (LWPs) and user level threads. The threads are intended to be sufficiently lightweight so that there can be thousands present and that synchronisation and context switching can be accomplished rapidly without entering the kernel.

Solaris, in addition to the BSD file system, also supports several types of non-BSD file systems to increase performance and ease of use. For performance there are three new file system types: [...]. The caching file system allows a local disk to be used as an operating system managed cache of either remote NFS disk or CD-ROM file systems. With [...], an entire local disk can be used as cache. The temporary file system uses main memory to contain a file system. In addition, there are other file systems like the [...] file system and Volume file system to improve system usability.

Solaris supports distributed computing and is able to store and retrieve distributed information to describe the system and users through the Network Information Service (NIS) and database. The Solaris GUI, OpenWindows, is a combination of X11R5 and the Adobe Postscript system, which allows applications to be run on remote systems with the display shown along with local applications.

Windows of opportunity

The resources available in the average NOW, such as processors, network interfaces, memory and disks, offer a number of research opportunities, such as:
(a) Parallel Processing: use the multiple processors to build MPP/DSM like system for parallel computing.
(b) Network RAM: use the memory associated with each workstation as aggregate DRAM cache: this can dramatically improve virtual memory and file system performance.
(c) Software RAID: use the arrays of workstation disks to provide cheap, highly
available, and scalable file storage by using redundant arrays of workstation disks with the LAN as I/O backplane. In addition, it is possible to provide parallel I/O support to applications through middleware such as MPI-IO.
(d) Multi-path Communication: use the multiple networks for parallel data transfer between nodes.

Scalable parallel applications require good floating-point performance, low latency and high bandwidth communications, scalable network bandwidth, and faster access to files. Cluster software can meet these requirements by using resources associated with clusters. A file system supporting parallel I/O can be built using disks associated with each workstation instead of using expensive hardware RAID. Virtual memory performance can be drastically improved by using Network RAM as a backing-store instead of hard disk. In a way, parallel file systems and Network RAM reduce the widening performance gap between processors and disks.

It is common to connect cluster nodes using the standard Ethernet and a specialised high performance network such as Myrinet. These multiple networks can be utilised for transferring data simultaneously across cluster nodes. The multi-path communication software can demultiplex data at the transmitting end to multiple networks and multiplex data at the receiving end. Thus, all available networks can be utilised for faster communication of data between cluster nodes.

PROGRAMMING TOOLS FOR HPC ON CLUSTERS

Message passing systems (PVM/MPI)

Message passing libraries allow efficient parallel programs to be written for distributed memory systems. These libraries provide routines to initiate and configure the messaging environment as well as sending and receiving packets of data. Currently, the two most popular high-level message-passing systems for scientific and engineering applications are PVM [ ] (Parallel Virtual Machine) from Oak Ridge National Laboratory and MPI (Message Passing Interface) defined by the MPI Forum [ ].
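The programming style that such libraries support, explicit sends and receives between numbered processes, can be illustrated with a small sketch. This is not PVM or MPI code: the Mailboxes class and the worker and parallel_sum functions are hypothetical names invented for this example, and Python threads with in-memory queues stand in for separate processes on separate nodes. Threads share an address space, so the sketch only mimics the message-passing discipline (no shared variables, data moves only through send and recv) and not a real distributed memory system.

```python
import queue
import threading


class Mailboxes:
    """One FIFO mailbox per rank: a toy stand-in for a message-passing library."""

    def __init__(self, size: int):
        self.size = size
        self._boxes = [queue.Queue() for _ in range(size)]

    def send(self, dest: int, data) -> None:
        # Deliver data to the mailbox of rank `dest`.
        self._boxes[dest].put(data)

    def recv(self, rank: int):
        # Blocking receive on the caller's own mailbox (5 s safety timeout).
        return self._boxes[rank].get(timeout=5)


def worker(rank: int, comm: Mailboxes, chunk) -> None:
    # Each worker sums only its own chunk and sends the result to rank 0.
    comm.send(0, sum(chunk))


def parallel_sum(values, workers: int = 3) -> int:
    comm = Mailboxes(workers + 1)  # rank 0 is the master, ranks 1..workers compute
    chunks = [values[i::workers] for i in range(workers)]
    threads = [
        threading.Thread(target=worker, args=(r + 1, comm, chunks[r]))
        for r in range(workers)
    ]
    for t in threads:
        t.start()
    # The master combines one partial result per worker, in arrival order.
    total = sum(comm.recv(0) for _ in range(workers))
    for t in threads:
        t.join()
    return total
```

In a real message-passing program the partitioning, the per-process computation and the final combination have the same shape, but process start-up and the transport are supplied by the library and its runtime environment.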
PVM is both an environment and a message-passing library, which can be used to run parallel applications on systems ranging from high-end supercomputers through to clusters of workstations. MPI, in contrast, is a specification for message passing, designed to be a standard for distributed memory parallel computing using explicit message passing. This interface attempts to establish a practical, portable, efficient, and flexible standard for message passing. MPI is available on most of the HPC systems including SMP machines.

The MPI standard [ ] is the amalgamation of what were considered the best aspects of the most popular message-passing systems at the time of its conception. It is the result of work undertaken by the MPI Forum, a committee composed of vendors and users formed at the Supercomputing Conference in 1992 with the aim of defining a message passing standard. The goals of the MPI design were portability, efficiency and functionality. The standard only defines a message passing library and leaves, amongst other things, the initialisation and control of processes to individual developers to define. Like PVM, MPI is available on a wide range of platforms from tightly coupled, massively parallel machines, through to NOWs. The choice of whether to use PVM or MPI to develop a parallel application is beyond the scope of this paper, but generally application developers choose MPI as it is fast becoming the de facto standard for message passing. MPI and PVM libraries are available for Fortran 77, Fortran 90, ANSI C and C++. There also exist interfaces to other languages – one such example is Java [ ].

MPICH [ ], developed by Argonne National Laboratory and Mississippi State, is probably the most popular of the current, free, implementations of MPI. MPICH is a version of MPI built on top of Chameleon [
]. The portability of MPICH derives from being built on top of a restricted number of hardware-independent low-level functions, collectively forming an Abstract Device Interface (ADI). The ADI contains approximately 25 functions and the rest of MPI approximately 125 functions. Implementing the ADI functions is all that is required to run MPICH on a new platform.

The ADI encapsulates the details and complexity of the underlying communication hardware into a separate module. By restricting the services provided to basic point-to-point message passing, it offers the minimum required to build a complete MPI implementation as well as a very general and portable interface. On top of the ADI, the remaining MPICH code implements the rest of the MPI standard, including the management of communicators, derived data types, and collective operations. MPICH has been ported onto most computing platforms [ ], including Windows NT [ ].

Distributed Shared Memory (DSM) systems

The most efficient and widely used programming paradigm on distributed memory systems is message passing. A problem with this paradigm is its complexity and difficulty of efficient programming as compared to having a single image system, such as a uni-processor or shared-memory system. Shared memory systems offer a simple and general programming model, but they suffer from scalability. An alternate cost effective solution is provided by Distributed Shared Memory (DSM) systems. These offer a shared memory programming paradigm as well as the scalability of a distributed memory system. A DSM system offers a physically distributed and logically shared memory, which is an attractive solution for large scale high-performance computing. DSM systems can be implemented by using software or hardware solutions [
]. The characteristics of software implemented DSM systems are: they are usually built as a separate layer on top of the message passing interface; they take full advantage of the application characteristics; virtual memory pages, objects, and language types are units of sharing. The implementation can be achieved by: compiler implementation; user-level runtime package; and operating system level (inside or outside the kernel). That is, a shared memory programming model for a distributed memory machine can be implemented either solely by runtime methods [ ], by compile time methods [ ], or by a combined compile time and runtime approach [ ]. A few representative software DSM systems are Munin [ ], TreadMarks [ ], Linda [ ] and Clouds [ ].

The characteristics of hardware implemented DSM systems are:
(a) Better performance (much faster than software DSM approaches).
(b) No burden on user and software layers (full transparency).
(c) Finer granularity of sharing (cache block); extensions of the cache coherence schemes (snoopy or directory).
(d) Increased hardware complexity (not unreasonable).

Typical classes of hardware DSM systems are CC-NUMA, for example DASH [ ], COMA, for example the KSR1 [ ], and reflective memory systems, such as Merlin [ ].

Parallel debuggers and profilers

To develop correct and efficient high performance applications it is highly desirable to have some form of easy-to-use parallel debugger and performance profiling tools. Most vendors of HPC systems provide some form of debugger and performance analyser for their platforms. Ideally, these tools should be able to work in a heterogeneous environment, thus making it possible to develop and implement a parallel application on, say, a NOW, and then actually do production runs on a dedicated HPC platform, such as the SGI/Cray T3E.

Debuggers

The number of parallel debuggers that are capable of being used in a cross-platform, heterogeneous, development environment is very limited [ ]. For this reason an effort was begun in 1996 to define a cross-platform parallel debugging standard that defined the features and interface users wanted. The High Performance Debugging Forum (HPDF) was formed as a Parallel Tools Consortium [
] (PTools) project. The HPDF members include vendors, academics and government researchers, as well as HPC users. The forum meets at regular intervals and its efforts have culminated in the HPD Version 1 specification. This standard defines the functionality, semantics, and syntax for a command-line parallel debugger. Currently reference implementations for the IBM SP2 and the SGI/Cray Origin 2000 are underway.

Ideally, a parallel debugger should be capable of at least:

(a) Managing multiple processes and multiple threads within a process.
(b) Displaying each process in its own window.
(c) Displaying source code, stack trace and stack frame for one or more processes.
(d) Diving into objects, subroutines and functions.
(e) Setting both source-level and machine-level breakpoints.
(f) Sharing breakpoints between groups of processes.
(g) Defining watch and evaluation points.
(h) Displaying arrays and array slices.
(i) Manipulating code variables and constants.

TotalView

TotalView is a commercial product from Dolphin Interconnect Solutions [ ]. It is currently the only widely available parallel debugger that supports multiple HPC platforms. TotalView supports most commonly used scientific languages (C, C++, F77, F90 and HPF), message passing libraries (PVM/MPI) and operating systems (SunOS/Solaris, IBM AIX, Digital UNIX and SGI IRIX 6). Even though TotalView can run on multiple platforms, it can only be used in homogeneous environments, namely, where each process of the parallel application being debugged is running under the same version of the OS. TotalView is GUI driven and has no command-line interface.

Performance analysis tools

The basic purpose of performance analysis tools is to help a programmer understand the performance characteristics of a particular application. In particular, they help to analyse and locate parts of an application that exhibit poor performance and create program bottlenecks. Such tools are useful for understanding the behaviour of normal sequential applications and can be enormously helpful when trying to analyse the performance characteristics of parallel applications.

To create performance information, most tools produce performance data during program execution and then provide a post-mortem analysis and display of the performance information. Some tools do both steps in an integrated manner and a few tools have the capability for run-time analysis, either in addition to or instead of post-mortem analysis.

Most performance monitoring tools consist of some or all of the following components:

(a) A means of inserting instrumentation calls to the performance monitoring routines into the user's application.
(b) A runtime performance library that consists of a set of monitoring routines that measure and record various aspects of a program's performance.
(c) A set of tools that process and display the performance data.

In summary, a post-mortem performance analysing tool works by:

1. Adding instrumentation calls into the source code.
2. Compiling and linking the application with a performance analysis runtime library.
3. Running the application to generate a trace file.
4. Processing and viewing the trace file.

A particular issue with performance monitoring tools is the intrusiveness of the tracing calls and their impact on the application's performance. It is very important that instrumentation does not affect the performance characteristics of the parallel application and thus provide a false view of its performance behaviour. A lesser, but still important, secondary issue is the format of the trace file. The file's format is important for two reasons: firstly, it must contain detailed and useful execution information, but not be huge in size, and secondly, it should conform to some 'standard' format so that it is possible to use various GUIs to visualise the performance data.

Table I lists the most commonly used tools for performance analysis and visualisation on message passing systems.

Cluster monitoring: system administration tools

Monitoring clusters is a challenging task that can be eased by tools that allow entire clusters to be observed at different levels using a GUI. Good management software is crucial for exploiting a cluster as a high performance computing platform.

There are many projects investigating system administration of clusters that support parallel computing, including:

(a) The Berkeley NOW [ ] system administration tool gathers and stores data in a relational database. It uses a Java applet to allow users to monitor a system from their browser [ ].
(b) The SMILE (Scalable Multicomputer Implementation using Low-cost Equipment) administration tool is called K-CAP. Its environment consists of compute nodes, which execute the compute-intensive tasks; a management node, which acts as a file server and cluster manager; and a management console, a client that can control and monitor the cluster. K-CAP uses a Java applet to connect to the management node through a predefined URL address in the cluster.
(c) Solstice from Sun Microsystems allows standalone workstations to be monitored. It uses client-server technology for monitoring such a workstation [ ].
(d) The Node Status Reporter (NSR) provides a standard mechanism for measurement and access to status information of clusters [ ]. Parallel applications/tools can access NSR through the NSR Interface.
(e) PARMON [ ] is a comprehensive environment for monitoring large clusters. It uses client-server technology to provide transparent access to all nodes to be monitored. The two major components of PARMON are the parmon-server, which provides system resource activity and utilisation information, and the parmon-client, a Java applet capable of gathering and visualising real-time cluster information.

Table I. Performance analysis and visualisation tools

Tool    | Supports                                                                         | URL
--------|----------------------------------------------------------------------------------|----------------------------------------------
AIMS    | instrumentation, monitoring library, logging library and snapshot visualisation | http://science.nas.nasa.gov/Software/AIMS/
Pablo   | monitoring library and analysis                                                  | http://www-pablo.cs.uiuc.edu/Projects/Pablo/
Paradyn | dynamic instrumentation runtime analysis                                         |
SvPablo | integrated instrumentor, monitoring library and analysis                         | http://www-pablo.cs.uiuc.edu/Projects/Pablo/
Vampir  | monitoring library visualisation                                                 |
        | monitoring library performance analysis visualisation tool for IBM SP machine    |
Dimemas | prediction for message passing programs                                          | http://www.pallas.com/pages/dimemas.htm
Paraver | program visualisation and analysis                                               |

REPRESENTATIVE CLUSTER SYSTEMS

There are many projects [ ,
] investigating the development of supercomputing-class machines using commodity off-the-shelf components. They include:

(a) The Networks of Workstations (NOW) project at University of California, Berkeley.
(b) The High Performance Virtual Machine (HPVM) project at University of Illinois at Urbana-Champaign.
(c) The Beowulf Project at the Goddard Space Flight Center, NASA.
(d) The Solaris-MC project at Sun Labs, Sun Microsystems, Inc., Palo Alto, CA.

Networks Of Workstations (NOW)

The Berkeley Network of Workstations [ ] (NOW) project demonstrates how a successful large-scale parallel computing system can be put together with volume-produced commercial workstations and the latest commodity switch-based network components. To attain the goal of combining distributed workstations into a single system, the NOW project included research and development into network interface hardware, fast communication protocols, distributed file systems, distributed scheduling and job control. The Berkeley NOW system consists of the following packages.

Inter-processor communications

Active Messages (AM) are the basic communications primitives in Berkeley NOW. This interface generalises previous AM interfaces to support a broader spectrum of applications such as client/server programs, file systems and operating systems, as well as continuing support for parallel programs. The AM communication is essentially a simplified remote procedure call that can be implemented efficiently on a wide range of hardware. Berkeley NOW includes a collection of low-latency, parallel communication primitives. These include extensions of current ones, including Berkeley Sockets, Fast Sockets, shared address space parallel C (Split-C), MPI and a version of HPF. These communications layers provide the link between the large existing base of MPP programs and the fast, but primitive, communication layers of Berkeley NOW.

Process management

GLUNIX (Global Layer UNIX) is an operating system layer for Berkeley NOW. GLUNIX is designed to provide transparent remote execution, support for interactive parallel and sequential jobs, load balancing, and backward compatibility for existing application binaries. GLUNIX is a multi-user system implemented at the user level so that it can be easily ported to a number of different platforms. GLUNIX aims to provide:

(a) A cluster-wide namespace: GLUNIX uses Network PIDs (NPIDs) and Virtual Node Numbers (VNNs). NPIDs are globally unique process identifiers for both sequential and parallel programs throughout the system. VNNs are used to facilitate communications among processes of a parallel program.
(b) A suite of user tools for interacting with and manipulating NPIDs and VNNs, equivalent to UNIX run, kill, make, tcsh, up and stat.
(c) A programming API through the GLUNIX runtime library, which allows interaction with NPIDs and VNNs.

Virtual memory

On a fast network with AM, fetching a page from a remote machine's main memory can be more than an order of magnitude faster than getting the same amount of data from the local disk. The Berkeley NOW system is able to utilise the memory on idle machines as a paging device for busy machines. The designed system is serverless, and any machine can be a server when it is idle, or a client when it needs more memory than is physically available. Two prototype global virtual memory systems have been developed within the Berkeley NOW project to allow sequential processes to page to the memory of remote idle nodes. One of these uses custom Solaris segment drivers to implement an external user-level pager which exchanges pages with remote page daemons. The other provides similar operations on similarly mapped regions using signals.

File system

xFS is a serverless, distributed file system, which attempts to provide low-latency, high-bandwidth access to file system data by distributing the functionality of the server among the clients. The typical duties of a server include maintaining cache coherence, locating data, and servicing disk requests. The function of locating data in xFS is distributed by having each client responsible for servicing requests on a subset of the files. File data is striped across multiple clients to provide high bandwidth.

The High Performance Virtual Machine (HPVM)

Figure 2. The HPVM layered architecture

The goal of the HPVM project [ ,
] is to deliver supercomputer performance on low-cost COTS systems. HPVM also aims to hide the complexities of a distributed system behind a clean interface. The HPVM project, being undertaken by the Concurrent System Architecture Group (CSAG) in the Dept. of Computer Science at the University of Illinois at Urbana-Champaign, provides software that enables high-performance computing on clusters of PCs and workstations. The HPVM architecture is shown in Figure 2 and it consists of a number of software components with high-level APIs, such as MPI, SHMEM and Global Arrays, that allow HPVM clusters to be competitive with dedicated MPP systems.

The HPVM project aims to address the following challenges:

(a) Delivering high-performance communication to standard, high-level APIs.
(b) Co-ordinating scheduling and resource management.
(c) Managing heterogeneity.

A critical part of HPVM was the development of a high-bandwidth and low-latency communications protocol known as Illinois Fast Messages (FM). The FM interface is based on Berkeley Active Messages. Unlike other messaging layers, FM is not the surface API, but the underlying semantics. FM contains functions for sending long and short messages and for extracting messages from the network. The services provided by FM include guarantees about, and control over, the memory hierarchy that FM presents to software built with it. FM also guarantees reliable and ordered packet delivery as well as control over the scheduling of communication work.

The FM interface was originally developed on a Cray T3D and a cluster of SPARCstations connected by Myrinet hardware. Myricom's Myrinet hardware is a programmable network interface card capable of providing 160 MBytes/s links with switch latencies of under a µs. FM has a low-level software interface that delivers hardware communication performance; however, higher-level interface layers offer greater functionality, application portability and ease of use.

The Beowulf Project

The Beowulf project [
] was initiated in the summer of 1994 under the sponsorship of the US NASA HPCC Earth and Space Sciences (ESS) project. The aim of the project was to investigate the potential of PC clusters for performing computational tasks. Beowulf refers to a Pile-of-PCs (PoPC) to describe a loose ensemble or cluster of PCs, which is similar to a Cluster or Network Of Workstations (NOW). An emphasis of PoPC is the use of mass-market commodity components, dedicated processors (rather than stealing cycles from idle workstations), and the usage of a private communications network. An overall goal of Beowulf is to achieve the 'best' overall system cost/performance ratio for the cluster.

In the taxonomy of parallel computers, Beowulf clusters fall somewhere between tightly coupled systems like the Cray T3E and cluster architectures such as Berkeley NOW. Programmers using a Beowulf still need to worry about locality, load balancing, granularity and communication overheads in order to obtain the best performance.

Beowulf adds to the PoPC model by emphasising a number of other factors:

(a) No custom components: Beowulf exploits the use of commodity components and industry standards which have been developed under competitive market conditions and are in mass production. A major advantage of this approach is that no single vendor owns the rights to the product, as essentially identical subsystem components such as motherboards, peripheral controllers and I/O devices can be multi-sourced. These subsystems provide accepted, standard interfaces such as PCI bus, IDE and SCSI interfaces, and Ethernet communications.
(b) Incremental growth and technology tracking: as new PC technologies become available, the commodity approach means that the Beowulf system administrator has total control over the configuration of the cluster and so may choose whichever 'best' suits their applications' needs, rather than being restricted to vendor-based configurations.
(c) Usage of readily available and free software components: Beowulf uses the Linux OS for performance, availability of source code, device support, and wide user acceptance. Linux is distributed with X windows, most popular shells, and standard compilers for the most popular programming languages. Both PVM and MPI are available for Linux as well as a variety of distributed-memory programming tools.

Grendel software architecture

The collection of software tools being developed and evolving within the Beowulf project is known as Grendel. These tools are for resource management and to support distributed applications. The Beowulf distribution includes several programming environments and development libraries as separate packages. These include PVM, MPI, and BSP, as well as SYS V-style IPC.

Key to the success of the Beowulf environment is inter-processor communications bandwidth and system support for parallel I/O. The communication between processors on a Beowulf cluster is achieved through standard TCP/IP protocols over the Ethernet internal to the cluster. The performance of inter-processor communications is, therefore, limited by the performance characteristics of the Ethernet and the system software managing message passing. Beowulf has been used to explore the feasibility of employing multiple Ethernet networks in parallel to satisfy the internal data transfer bandwidths required. Each Beowulf workstation has user-transparent access to multiple parallel Ethernet networks. This architecture was achieved by 'channel bonding' techniques implemented as a number of enhancements to the Linux kernel. The Beowulf project has shown that up to three networks could be ganged together to obtain significant throughput, thus validating the usage of the channel bonding technique. New commodity network technologies, such as Fast Ethernet, will ensure that even better inter-processor communications performance will be achieved in the future.

In the Beowulf scheme, as is common in NOW clusters, every node is responsible for running its own copy of the kernel. However, in the interests of presenting a uniform system image to both users and applications, Beowulf has extended the Linux kernel to allow a loose ensemble of nodes to participate in a number of global namespaces. Normal UNIX processes 'belong' to the kernel running them and have a unique identifier, the Process ID (PID). In a distributed scheme it is often convenient for processes to have a PID that is unique across an entire cluster, spanning several kernels. Beowulf implements two Global Process ID (GPID) schemes. The first is independent of external libraries. The second, GPID-PVM, is designed to be compatible with the PVM Task ID format and uses PVM as its signal transport. The traditional UNIX calls work transparently with both schemes. While the GPID extension is sufficient for cluster-wide control and signalling of processes, it is of little use without a global view of the processes. To this end, the Beowulf project is developing a mechanism that allows unmodified versions of standard UNIX process utilities to work across a cluster.

Programming models

Beowulf supports several distributed programming paradigms. The most commonly used are the message passing environments PVM, MPI and BSP. In the near future, a distributed shared memory package will also be available. Beowulf systems can take advantage of a number of libraries written to provide parallel file system interfaces to NOWs. MPI-IO is expected to become core software for new applications.

Solaris MC: a high performance operating system for clusters

Solaris MC [ ] (Multi-Computer) is a distributed operating system for a multi-computer, a cluster of computing nodes connected by a high-speed interconnect. It provides a single system image, making the cluster appear like a single machine to the user, to applications, and to the network. Solaris MC is built as a globalisation layer on top of the existing Solaris kernel, as shown in Figure 3. It extends operating system abstractions across the cluster and preserves the existing Solaris ABI/API, and hence runs existing Solaris 2.x applications and device drivers without modifications. Solaris MC consists of several modules: C++ and object framework support, globalised processes and file system, and networking.

The interesting features of Solaris MC include the following:

(a) Extends existing Solaris operating system.
(b) Preserves the existing Solaris ABI/API compliance.
(c) Provides support for high availability.
(d) Uses C++, IDL, CORBA in the kernel.
(e) Leverages Spring technology.

Figure 3. The Solaris MC architecture

Solaris MC uses an object-oriented framework for communication between nodes. The object-oriented framework is based on CORBA and provides remote object method invocations. These look like standard C++ method invocations to the programmer. The framework also provides object reference counting: notification to the object server when there are no more references (local/remote) to the object. Another feature of the Solaris MC object framework is that it supports multiple object handlers.

A key component in providing a single system image in Solaris MC is a global file system. It provides consistent access from multiple nodes to files and file attributes. It uses caching for high performance and a new distributed file system called the ProXy File System (PXFS), which provides a globalised file system without the need for modifying the existing file system.

The second important component of Solaris MC supporting a single system image is its globalised process management. It globalises process operations such as signals. It also globalises the /proc file system, providing access to process state for commands such as ps and for the debuggers. It supports remote execution, which allows it to start up new processes on any node in the system.

Solaris MC also globalises its support for networking and I/O. It allows more than one network connection and provides support to multiplex between arbitrary network links.

A comparison of the four cluster environments

The cluster projects described in the above sections share a common goal of attempting to provide a unified resource out of interconnected PCs or workstations. Each system claims that it is capable of providing supercomputing resources from commodity off-the-shelf components. Each project provides these resources in different ways, both in terms of how the hardware is connected together and the way the system software and tools provide the services for parallel applications. Table II shows the key hardware and software components that each system uses.

Table II. Cluster systems comparison

Project      | Platform                           | Communications                | OS                                                       | Other
-------------|------------------------------------|-------------------------------|----------------------------------------------------------|----------------------------------------------------------
Beowulf      | PCs                                | Multiple Ethernet with TCP/IP | Linux and Grendel                                        | MPI/PVM, Sockets and HPF
Berkeley NOW | Solaris-based PCs and workstations | Myrinet and Active Messages   | Solaris + GLUNIX + xFS                                   | AM, PVM/MPI, HPF and Split-C
HPVM         | PCs                                | Myrinet with Fast Messages    | NT or Linux connection and global resource manager + LSF | Java-frontend, FM, Sockets, Global Arrays, SHMEM and MPI
Solaris MC   | Solaris-based PCs and workstations | Solaris-supported             | Solaris + Globalisation layer                            | C++ and CORBA

Beowulf and HPVM are capable of using any PC, whereas Berkeley NOW and Solaris MC function on platforms where Solaris is available: currently PCs, Sun workstations and various clone systems. Berkeley NOW and HPVM use Myrinet with a fast, low-level communications protocol (Active and Fast Messages). Beowulf uses multiple standard Ethernet and Solaris MC uses NICs which are supported by Solaris; these range from Ethernet to ATM and SCI.

Each system consists of some middleware interfaced into the OS kernel, which is used to provide a globalisation layer, or unified view of the distributed cluster resources. Berkeley NOW and Solaris MC use the Solaris OS, whereas Beowulf uses Linux with a modified kernel and HPVM is available for both Linux and Windows NT.

All four systems provide a rich variety of tools and utilities commonly used to develop, test and run parallel applications. These include various high-level APIs for message passing and shared-memory programming.

SUMMARY AND CONCLUSIONS

In this paper we have briefly discussed the different hardware and software components that are commonly used in the current generation of cluster-based systems. We have also described four state-of-the-art projects that are using subtly different approaches ranging from an all-COTS approach through to a mixture of technologies. It was never an objective of this paper to try and determine the 'best' technologies, software or cluster computing projects. Rather, the goal of this paper was to lead the reader through the range of hardware and software technologies available and then briefly discuss a number of cluster projects that are using these technologies. The reader should be able to draw their own conclusions about which particular cluster project would suit their needs.

Hardware and software trends

In the last five years several important advances have taken place and prominent among these are:

(a) A network performance increase of tenfold (10x) with 100BaseT Ethernet with full duplex support.
(b) The availability of switched network circuits, including full crossbar switches for proprietary network technologies such as Myrinet.
(c) Workstation performance has significantly improved and now resembles that of yesterday's supercomputers.
(d) Improvements in microprocessor performance have led to the availability of desktop PCs with the performance of low-end workstations, but at a significantly lower cost.
(e) The availability of a powerful and stable UNIX-like operating system (Linux) for x86 class PCs with source code access.

Culler and Singh [ ] quantify a number of hardware trends in their book Parallel Computer Architecture: A Hardware/Software Approach. Foremost of these is the design and manufacture of microprocessors. A basic advance is the decrease in feature size which enables circuits to become either faster or lower in power consumption. In conjunction with this is the growing die size that can be manufactured. These factors mean that:

(a) The average number of transistors on a chip is growing by about 40 per cent per annum.
(b) The clock frequency growth rate is about 30 per cent per annum.

It is anticipated that by early 2000 AD there will be 700 MHz processors with about 100 million transistors.

There is a similar story for storage but the divergence between memory capacity and speed is more pronounced. Memory capacity increased by three orders of magnitude (1000x) between 1980 and 1995, yet its speed has only doubled (2x). It is anticipated that Gigabit DRAM (128 MByte) will be available in early 2000, but the gap to processor speed is getting greater all the time.

The problem is that memories are getting larger whilst processors are getting faster, so getting access to data in memory is becoming a bottleneck. One method of overcoming this bottleneck is to configure the DRAM in banks and then transfer data from these banks in parallel. In addition, multi-level memory hierarchies organised as caches make memory access more effective, but their effective and efficient design is complicated. The access bottleneck also applies to disk access, which can also take advantage of parallel disks.

The performance of network interconnects is increasing steadily while costs fall. The use of networks such as ATM, SCI and Myrinet in clustering for parallel processing appears promising. This has been demonstrated by many commercial and academic projects such as Berkeley's NOW and Beowulf. But no single network interconnect has emerged as a clear winner. Myrinet is not a commodity product and costs a lot more than Ethernet, but has real advantages over it: very low latency, high bandwidth, and a programmable on-board processor allowing for greater flexibility. SCI networks have been used to build distributed shared memory systems, but lack scalability. ATM is used in clusters that are mainly used for multimedia applications.

Two of the most popular operating systems of the 90s are Linux and NT. Linux has become a popular alternative to a commercial operating system due to its free availability and superior performance compared to other desktop operating systems such as NT. Linux currently has more than 7 million users worldwide and it has become the researcher's choice of operating system. Linux is also available for multiprocessor machines with up to eight processors.

NT has a large installed base and it has almost become a ubiquitous operating system. NT 5 will have a thinner and faster TCP/IP stack, which supports faster communication of messages, yet using standard communication technology. NT systems for parallel computing
are in a similar situation to UNIX workstations five to seven years ago and it is only a matter of time before NT catches up. NT developers benefit from the time and money invested in research by the UNIX community, as it seems that they are borrowing many of its technologies.

Cluster technology trends

We have discussed a number of cluster projects within this paper. These range from those which are proprietary-based (Solaris) through to a totally commodity-based system (Beowulf). HPVM can be considered a hybrid system using commodity computers and specialised network interfaces. It should be noted that the projects detailed in this paper are a few of the most popular and well known, rather than an exhaustive list of all those available.

All the projects discussed claim to consist of commodity components. Although this is true, one could argue that true commodity technologies would be those that are pervasive at most academic or industrial sites. If this were the case, then true commodity would mean PCs running Windows 95/98 with standard 10 Mbps Ethernet. However, when considering parallel applications with demanding computational and network needs, this type of low-end cluster would be incapable of providing the resources needed.

Each of the projects discussed tries to overcome the bottlenecks associated with using cluster-based systems for demanding parallel applications in a slightly different way. Without fail, however, the main bottleneck is not the computational resource (be it a PC or UNIX workstation); rather, it is the provision of a low-latency, high-bandwidth interconnect and an efficient low-level communications protocol to provide high-level APIs.

The Beowulf project explores the usage of multiple standard Ethernet cards to overcome the communications bottleneck, whereas Berkeley NOW and HPVM use programmable Myrinet NICs and AM/FM communications protocols. Solaris MC uses standard high-performance NICs and TCP/IP. The choice of what is the best solution cannot just be based on performance; the cost per node to provide the NIC should also be considered. For example, a standard Ethernet card costs less than $100, whereas Myrinet cards cost in excess of $1000 each. Another factor that must also be considered in this equation is the availability of Fast Ethernet and the advent of Gigabit Ethernet. It seems
that Ethernet technologies are more likely to be mainstream, mass produced and consequently cheaper than specialised network interfaces. As an aside, all the projects that have been discussed are in the vanguard of the cluster computing revolution and their research is helping the following army determine which are the best techniques and technologies to adopt.

Some predictions about the future

Emerging hardware technologies along with maturing software resources mean that cluster-based systems are rapidly closing the performance gap with dedicated parallel computing platforms. Without doubt the gap will continue to close. Cluster systems that scavenge idle cycles from PCs and workstations will continue to use whatever hardware and software components are available on public workstations. Clusters dedicated to high performance applications will continue to evolve as new and more powerful computers and network interfaces become available in the marketplace.

It is likely that individual cluster nodes will comprise multiple processors. Currently two and four processor PCs and UNIX workstations are becoming fairly common. Software that allows SMP nodes to be efficiently and effectively used by parallel applications will no doubt be developed and added to the OS kernel in the near future. It is likely that there
will be widespread usage of Fast and Gigabit Ethernet and as such they will become the de facto networking technology for clusters. To reduce message passing latencies cluster software systems will by-pass the OS kernel, thus avoiding the need for expensive system calls, and exploit the usage of intelligent network cards.

The OS used on future clusters will be determined by its ability to provide a rich set of development tools and utilities as well as the provision of robust and reliable services. UNIX-based OSs are likely to be most popular, but the steady improvement and acceptance of Windows NT will mean that it will not be far behind.

Final thoughts

Our need for computational resources in all fields of science, engineering and commerce far outstrips our ability to fulfil these needs. The usage of clusters of computers is, perhaps, one of the most promising means by which we can bridge the gap between our needs and the available resources. The usage of a COTS-based cluster system has a number of advantages including:

(a) Price/performance when compared to a dedicated parallel supercomputer.
(b) Incremental growth that often matches yearly funding patterns.
(c) The provision of a multi-purpose system: one that could, for example, be used for secretarial purposes during the day and as a commodity parallel supercomputer at night.

These and other advantages will fuel the evolution of cluster computing and its acceptance as a means of providing commodity supercomputing facilities.

ACKNOWLEDGEMENTS

We would like to thank Rose Rayner, John Rosbottom, Toni Cortes and Lars Rzymianowicz for their efforts proofreading this paper.

REFERENCES

1. G. Pfister, In Search of Clusters, Prentice Hall PTR, 1998.
2. M. Snir, S. Otto, S. Huss-Lederman, D. Walker and J. Dongarra, MPI: The Complete Reference (Vol. 1), 2nd Edition, MIT Press, September 1998.
3. C. Koelbel, D. Loveman, R. Schreiber, G. Steele Jr. and M. Zosel, The High Performance Fortran Handbook, MIT Press, 1994.
4. M. Baker, G. Fox and H. Yau, 'Review of cluster management software', NHSE Review, May 1996. (http://nhse.cs.rice.edu/NHSEreview/CMS/).
5. L. Turcotte, 'A survey of software environments for exploiting networked computing resources', Engineering Research Center for Computational Field Simulation, Mississippi State, 1993.
6. T. Anderson, D. Culler and D. Patterson, 'A case for NOW (Network of Workstations)', IEEE Micro, (1), 54–64 (February 1995).
7. R. Buyya (ed), High Performance Cluster Computing: Systems and Architectures, Volume 1, Prentice Hall, NJ, 1999.
8. Jazznet: A dedicated cluster of Linux PCs. (http://math.nist.gov/jazznet/).
9. Gardens project. (http://www.fit.qut.edu.au/~szypersk/Gardens/).
10. M. Mutka and M. Livny, 'Profiling workstations' available capacity for remote execution', Performance '87, Procs. of the 12th IFIP WG 7.3 Symposium on Computer Performance, Brussels, December 1987.
11. The Berkeley Intelligent RAM Project. (http://iram.cs.berkeley.edu/).
12. The Standard Performance Evaluation Corporation (SPEC). (http://open.specbench.org).
13. Intel's Pentium II Xeon Processor. (http://www.tomshardware.com/xeon.html).
14. Russian Academy of Sciences, 'VLSI Microprocessors: A guide to high performance microprocessors'. (http://www.microprocessor.sscc.ru/).
15. The Enhanced IDE FAQ, J. Wehman and P. Herweijer.
16. SCSI-1: Doc # X3.131-1986, ANSI, 1430 Broadway, NY, USA.
17. C. Ruemmler and J. Wilkes, 'Modelling disks', Technical Report HPL9368, Hewlett Packard Labs., July
18. Berkeley NOW. (http://now.cs.berkeley.edu/).
19. HPVM. (http://www-csag.cs.uiuc.edu/projects/clusters.html).
20. Interconnect Performance page. (http://www.scl.ameslab.gov/Projects/ClusterCookbook/icperf.html).
21. PCI BIOS Specification, Revision 2.1, PCI SIG, Hillsboro, Oregon, August 1994.
22. CSMA/CD (Ethernet), ISO/IEC 8802-3:1993 (ANSI 802.3), 1993.
23. ATM User-Network Interface Specification, ATM Forum/Prentice Hall, September 1993.
24. 'Scalable Coherent Interface (SCI)', IEEE 1596-1992, August 1993.
25. R. Clark and K. Alnes, 'An SCI Interconnect Chipset and Adapter', Symposium Record, Hot Interconnects, August 1996, pp. 221–235.
26. SCI Association. (http://www.SCIzzL.com/).
27. HCS Lab. University of Florida. (http://www.hcs.ufl.edu/sci.html).
28. Myricom, Inc. (http://www.myri.com/).
29. MPI-FM: MPI for Fast Messages. (http://www-csag.cs.uiuc.edu/projects/comm/mpi-fm.html).
30. N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic and W. Su, 'Myrinet – A gigabit-per-second local-area network', IEEE Micro, (1) (February 1995).
31. Linux Meta-FAQ. (http://sunsite.unc.edu/mdw/linux.html).
32. Microsoft Corporation. (http://www.microsoft.com).
33. A. Watts, 'High command', PC Direct, (December 1997).
34. The Microsoft Market Share. (http://newsport.sfsu.edu/ms/markets.html).
35. H. Custer, Inside Windows NT, Microsoft Press, 1993.
36. Windows NT Server. (http://www.microsoft.com/ntserver/).
37. MPI Forum. (http://www.mpi-forum.org/docs/docs.html).
38. A. Beguelin, J. Dongarra, G. Geist, R. Manchek and V. Sunderam, 'The PVM project', Technical Report, Oak Ridge National Laboratory, February 1993.
39. Message Passing Interface Forum, 'MPI: A Message-Passing Interface Standard', University of Tennessee, Knoxville, Report No. CS-94-230, May 5 1994.
40. mpiJava. (http://www.npac.syr.edu/projects/prpc/mpiJava/). August 1998.
41. MPICH. (http://www.mcs.anl.gov/mpi/mpich/).
42. W. Gropp, E. Lusk, N. Doss and A. Skjellum, 'A high-performance, portable implementation of the MPI message passing interface standard'. (http://www.mcs.anl.gov/mpi/mpicharticle/paper.html).
43. W. Gropp and B. Smith, 'Chameleon parallel programming tools users manual', Technical Report ANL-93/23, Argonne National Laboratory, March 1993.
44. MPI Implementations. (http://www.mpi.nd.edu/MPI/).
45. M. A. Baker, 'MPI on NT: The current status and performance of the available environments', EuroPVM/MPI98, Springer-Verlag, Vol. 1497, September 1998, pp. 63–75.
46. R. Butler and E. Lusk, User's Guide to the p4 Parallel Programming System, ANL-92/17, Argonne National Laboratory, October 1992.
47. R. Butler and E. Lusk, 'Monitors, messages, and clusters: The p4 parallel programming system', Parallel Computing, 547–564 (April 1994).
48. S. Pakin, V. Karamcheti and A. Chien, 'Fast Messages (FM): Efficient, portable communication for workstation clusters and massively-parallel processors', IEEE Microprocessor Operating Systems, 60–73 (April–June 1997).
49. T. von Eicken, D. Culler, S. Goldstein and K. Schauser, 'Active Messages: a mechanism for integrated communications and computation', Proc. of International Symposium on Computer Architectures, 1992.
50. J. Protic, M. Tomasevic and V. Milutinovic, 'Distributed shared memory: concepts and systems', Tutorial, International Conference on HPC, Seoul, Korea, April 28–May 2 1997.
51. K. Li and P. Hudak, 'Memory coherence in shared virtual memory systems', ACM TOCS (November 1989).
52. S. Hiranandani, K. Kennedy and C. Tseng, 'Compiling Fortran D for MIMD distributed memory machines', CACM (August 1992).
53. S. Dwarkadas, A. Cox and W. Zwaenepoel, 'An integrated compile-time/run-time software distributed shared memory system', Operating System Review (December 1996).
54. W. Zwaenepoel, J. Bennett and J. Carter, 'Munin: DSM using multi-protocol release consistency', A. Karshmer and J. Nehmer (eds), Operating Systems of the 90s and Beyond, Springer-Verlag, 1991, pp. 56–
55. TreadMarks. (http://www.cs.rice.edu/~willy/TreadMarks/overview.html).
56. N. Carriero and D. Gelernter, 'Linda in context', Comm. ACM, (4), 444–458 (April 1989).
57. P. Dasgupta, R. C. Chen et al., 'The design and implementation of the Clouds distributed operating system', Technical Report, (Computing Systems Journal, USENIX, Winter 1990). (ftp://helios.cc.gatech.edu/pub/papers/design.ps.Z).
58. J. Laudon, D. Lenoski et al., 'The Stanford DASH multiprocessor', IEEE Computer, (3) (March 1992).
59. S. Franks and J. R. Burkhardt III, 'The KSR1: Bridging the gap between shared memory and MPPs', COMPCON 93, February 1993.
60. C. Mapples and L. Wittie, 'Merlin: A superglue for multiprocessor systems', COMPCON '90, March 1990.
61. S. Browne, 'Cross-platform parallel debugging and performance analysis tools', Proceedings of the EuroPVM/MPI98 Workshop, Liverpool, September 1998.
62. Parallel Tools Consortium project. (http://www.ptools.org/).
63. Dolphin Interconnect Solutions. (http://www.dolphinics.no/).
64. E. Anderson and D. Patterson, 'Extensible, scalable monitoring for clusters of computers', Proceedings of the 11th Systems Administration Conference (LISA '97), San Diego, CA, October 26–31 1997.
65. P. Uthayopas, C. Jaikaew and T. Srinak, 'Interactive management of workstation clusters using world wide web', Cluster Computing Conference - CCC 1997 Proceedings. (http://www.mathcs.emory.edu/~ccc97/).
66. Sun Microsystems, 'Solstice SyMON 1.1 User's Guide', Palo Alto, CA, 1996.
67. C. Roder, T. Ludwig and A. Bode, 'Flexible status measurement in heterogeneous environ
ment',Proceedingsofthe5InternationalConferenceonParallelandDistributedProcessing,TechniquesandApplications(PDPTA'98),CSREAPublishers,LasVegas,USA,1998.68.Rajkumar,KrishnamohanandBindu,`PARMON:Acomprehensiveclustermonitoringsystem',TheAustralianUsersGroupforUNIXandOpenSystemsConferenceandExhibition,AUUG'98–OpenSystems:TheCommonThread,Sydney,Australia,1998.69.ClusterComputinglinks.(http://www.tu-chemnitz.de/informatik/RA/cchp/).70.ComputerArchitecturelinks.(http://www.cs.wisc.edu/˜arch/www/).71.S.Parkin,M.Lauria,A.Chienetal.,HighPerformanceVirtualMachines(HPVM):ClusterswithSupercomputingAPIsandPerformance,SIAMConferenceonParallelProcessingforScienticComputing(PP97),March,1997.72.TheBeowulfProject.(http://www.beowulf.org/).73.SolarisMC.(http://www.sunlabs.com/research/solaris-mc/).74.Y.Khalidi,J.Bernabeu,V.Matena,K.ShirriffandM.Thadani,SolarisMC:AMulti-ComputerOS',USENIXConference,January1996.75.K.Shirriff,`Nextgenerationdistributedcomputing:TheSolarisMCoperatingsystem',JapanTalk,1996.(http://www.sunlabs.com/research/solaris-mc/doc/japantalk.ps).76.D.CullerandJ.Singh,`Parallelcomputerarchitecture:ahardware/softwareapproach'.(http://www.cs.berkeley.edu/˜culler/book.alpha/).Copyright1999JohnWiley&Sons,Ltd.Softw.Pract.Exper.,(6),551–576(1999)