Implementing Efficient MPI on LAPI for IBM RS/6000 SP Systems: Experiences and Performance Evaluation

Mohammad Banikazemi, Rama K. Govindaraju, Robert Blackmore, Dhabaleswar K. Panda

Dept. of Computer and Information Science, The Ohio State University, Columbus, OH 43210. Email: {banikaze, panda}@cis.ohio-state.edu
Communication Subsystems, IBM Power Parallel Systems, Poughkeepsie, NY 12601. Email: {ramag, blackmor}@us.ibm.com

Abstract

The IBM RS/6000 SP system is one of the most cost-effective commercially available high performance machines. IBM RS/6000 SP systems support the Message Passing Interface standard (MPI) and LAPI. LAPI is a low level, reliable and efficient one-sided communication API library implemented on IBM RS/6000 SP systems. This paper explains how the high performance of the LAPI library has been exploited in order to implement the MPI standard more efficiently than the existing MPI. It describes how to avoid unnecessary data copies at both the sending and receiving sides in such an implementation. The resolution of problems arising from the mismatches between the requirements of the MPI standard and the features of LAPI is discussed. As a result of this exercise, certain enhancements to LAPI are identified to enable an efficient implementation of MPI on LAPI. The performance of the new implementation of MPI is compared with that of the underlying LAPI itself. The latency (in polling and interrupt modes) and bandwidth of our new implementation are also compared with those of the native MPI implementation on RS/6000 SP systems. The results indicate that the MPI implementation on LAPI performs comparably or better than the original MPI implementation in most cases. Improvements of up to … in polling mode latency, … in interrupt mode latency, and … in bandwidth are obtained for certain message sizes. The implementation of MPI on top of LAPI also outperforms the native MPI implementation for the NAS Parallel Benchmarks. It should be noted that the implementation of MPI on top of LAPI is not a part of any IBM product and no assumptions should be made regarding its availability as a product.

1 Introduction

The IBM RS/6000 SP system [1, 9] (referred to as SP in the rest of this paper) is a general-purpose scalable parallel system based on a distributed-memory, message-passing architecture. (IBM, RS/6000, SP, AIX, Power-PC, and Power2-Super are trademarks or registered trademarks of the IBM Corporation in the United States, other countries, or both.) Configurations ranging from 2-node systems to 128-node systems are available from IBM. The uniprocessor nodes are available with the latest Power2-Super (P2SC) microprocessors and the TB3 adapter. The SMP nodes are available with the 4-way, 332 MHz Power-PC microprocessors and the TBMX adapter. The nodes are interconnected via a switch adapter to a high-performance, multistage, packet-switched network for interprocessor communication, capable of delivering a bi-directional data-transfer rate of up to 160 MB/s between each node pair. Each node contains its own copy of the standard AIX operating system and other standard RS/6000 system software.

IBM SP systems support several communication libraries such as MPI [6], MPL, and LAPI [4, 7]. MPL, an IBM designed interface, was the first message passing interface developed by IBM on SP systems. After MPI became a standard, it was implemented by reusing some of the infrastructure of MPL. This reuse allowed SP systems to provide an implementation of MPI quite rapidly, but it also imposed some inherent constraints on the MPI implementation, which are discussed in detail in Section 2. In 1997, the LAPI library interface was designed and implemented on SP systems. The primary design goal for LAPI was to define an architecture with semantics that would allow efficient implementation on the underlying hardware and firmware infrastructure provided by SP systems. LAPI is a user space library which provides a one-sided communication model, thereby avoiding the complexities associated with two-sided protocols (message matching, ordering, etc.).

In this paper we describe an implementation of the MPI standard on top of LAPI (MPI-LAPI) that avoids some of the inherent performance constraints of the current implementation of MPI (native MPI) and exploits the high performance of LAPI. There are several challenges involved in implementing a 2-sided protocol such as MPI on top of a 1-sided protocol such as LAPI. The major issue is finding the address of the receiving buffer: in 2-sided protocols, the sender does not have any information about the address of the receive buffer into which the message should be copied. There are some existing solutions to this problem. A temporary buffer can be used at the receiving side to store the message before the address of its destination is resolved. This solution incurs the cost of a data copy, which increases the data transfer time and the protocol overhead, especially for large messages. An alternative solution is to use a rendezvous protocol, in which, in response to a request from the sender, the receiver provides the receive buffer address to the sender, and then the sender can send the message.
In this method the unnecessary data copy (into a temporary buffer) is avoided, but the cost of the round trip control messages that provide the receive buffer address to the sender impacts performance considerably, especially for small messages. The impact is increased latency and control traffic. It is therefore important that a more efficient method be used for resolving the receive buffer address. In this paper, we explain how the flexibility of the LAPI architecture is used to solve this problem in an efficient manner. Another challenge in implementing MPI on top of LAPI is to keep the cost of enforcing the semantics of MPI small, so that the efficiency of LAPI is realized to the fullest. A further motivation behind our effort has been to provide better reuse by making LAPI the common transport layer for other communication libraries. Once again, it should be noted that MPI-LAPI is not a part of any IBM product and no assumptions should be made regarding its availability as a product.

This paper is organized as follows. In Section 2, we detail the different messaging layers in the current implementation of MPI. In Section 3, we present an overview of LAPI and its functionality. In Section 4, we discuss the different MPI communication modes and show how these modes are supported using LAPI. In Section 5, we discuss the strategies used to implement MPI on top of LAPI and the various changes we made to improve the performance of MPI-LAPI. Experimental results, including latency, bandwidth, and benchmark performance, are presented in Section 6. Related work is discussed in Section 7. In Section 8, we outline some of our conclusions.

2 The Native MPI Overview

The protocol stack for the current implementation of MPI on SP systems is shown in Figure 1a. This protocol stack consists of several layers. The function of each of the layers is described briefly below.

Figure 1. Protocol stack layering:
  (a) MPI:          MPI semantics layer / MPCI pt-to-pt message layer / Pipes reliable byte stream / HAL packet layer / adapter microcode / adapter hardware / switch hardware
  (b) LAPI:         LAPI reliable transport layer / HAL packet layer / adapter microcode / adapter hardware / switch hardware
  (c) MPI on LAPI:  MPI semantics layer / new MPCI pt-to-pt message layer / LAPI reliable transport layer / HAL packet layer / adapter microcode / adapter hardware / switch hardware

- The MPI layer enforces all MPI semantics. It breaks down all collective communication calls into a series of point-to-point message passing calls in MPCI (Message Passing Client Interface).

- The MPCI layer provides a point-to-point communication interface with message matching, buffering for early arrivals, and so on. It sends data by copying data from the user buffer into the pipe buffers; the pipe layer then has the responsibility for sending the data. Likewise, data received by the pipe layer is matched and, if the corresponding receive has been posted, copied from the pipe buffers into the user buffer; otherwise (if the receive is not posted) the data is copied into an early arrival buffer.

- The Pipes layer provides a reliable byte stream interface [8]. It ensures that data in the pipe buffers is reliably transmitted and received. This layer is also used to enforce ordering of packets at the receiving end pipe buffer if packets come out of order (the switch network has four routes between each pair of nodes, and packets on some routes can take longer than on others, depending on the switch congestion on the route). A sliding window flow control protocol is used, and reliability is enforced using an acknowledgement-retransmit mechanism.

- The HAL layer (packet layer, also referred to as the Hardware Abstraction Layer) provides a packet interface to the upper layers. Data from the pipe buffers is packetized in the HAL network send buffers and then injected into the switch network. Likewise, packets arriving from the network are assembled in the HAL network receive buffers. The HAL network buffers are pinned down. The HAL layer handshakes with the adapter microcode to send and receive packets to and from the switch network.

- The adapter DMAs the data from the HAL network send buffers onto the switch adapter and then injects the packets into the switch network. Likewise, packets arriving from the switch network into the switch adapter are DMAed onto the HAL network receive buffers.

The current MPI implementation, for the first and last 16 Kbytes of data, incurs a copy from the user buffer to the pipe buffers and from the pipe buffers to the HAL buffers when sending messages [8]. Similarly, received messages are first DMAed into HAL buffers and then copied into the pipe buffers. This extra copying of data is performed in order to simplify the communication protocol. These two extra data copies affect the performance of MPI. In the following sections we discuss LAPI (Fig. 1b) and explain how LAPI can replace the Pipes layer (Fig. 1c) in order to avoid the extra data copies and improve the performance of the message passing library.
3 LAPI Communication Model Overview

LAPI is a low level API designed to support efficient one-sided communication between tasks on SP systems [9]. The protocol stack of LAPI is shown in Figure 1b. An overview of the LAPI communication model (for LAPI_Amsend) is given in Figure 2, which has been captured from [7]. The steps involved in LAPI communication functions are as follows. Each message is sent with a LAPI header and possibly a user header (step 1). On arrival of the first packet of the message at the target machine, the header is parsed by a header handler (step 2), which is responsible for accomplishing three tasks (step 3). First, it must return the location of a data buffer where the message is to be assembled. Second, it may return a pointer to a completion handler function, which is called when all the packets have arrived in the buffer location previously returned. Finally, if a completion handler function is provided, it also returns a pointer to data which is passed to the completion handler. The completion handler is executed after the last packet of the message has been received and copied into the buffer (step 4).

In general, three counters may be used so that a programmer can determine when it is safe to reuse buffers and can detect completion of data transfer. The first counter (org_cntr) is the origin counter, located in the address space of the sending task; it is incremented when it is safe for the origin task to update the origin buffer. The second counter, located in the target task's address space, is the target counter (tgt_cntr); it is incremented after the message has arrived at the target task. The third counter, the completion counter (cmpl_cntr), is updated on completion of the message transfer. The completion counter is similar to the target counter except that it is located in the origin task's address space.

[Figure 2. LAPI overview: the flow of a LAPI_Amsend operation between the origin and target processes, showing the LAPI dispatcher, the header handler (hdr_hdl), the completion handler (cmpl_hdl), the user header (uhdr) and user data (udata), the target buffer, and the org_cntr, tgt_cntr, and cmpl_cntr counters.]

The use of LAPI functions may require that the origin task specify pointers to functions or to addresses in the target task address space. Once the address of the header handler has been determined, the sending process does not necessarily need to know the receive buffer address in the receiver address space, since the header handler is responsible for returning the receive buffer address. The header handler may, for example, interpret the header data as a set of tags which, when matched with requests on the receiving side, may be used to determine the address of the receive buffer. As we shall see, this greatly simplifies the task of implementing a two-sided communication protocol with a one-sided infrastructure. To avoid deadlocks, LAPI functions cannot be called from header handlers. Completion handlers are executed on a separate thread and can make LAPI calls.

LAPI functions may be broadly broken into two classes. The first class consists of communication functions using the infrastructure described above. In addition, a number of utility functions are provided so that the communication functions may be used effectively. All the LAPI functions are shown in Table 1. For more information about LAPI we refer the reader to [7].

Table 1. LAPI functions.

  LAPI Function      Purpose
  LAPI_Init          Initialize the LAPI subsystem
  LAPI_Term          Terminate the LAPI subsystem
  LAPI_Put           Data transfer function
  LAPI_Get           Data transfer function
  LAPI_Amsend        Active message send function
  LAPI_Rmw           Synchronization (read-modify-write)
  LAPI_Setcntr       Set the value of a counter
  LAPI_Getcntr       Get the value of a counter
  LAPI_Waitcntr      Wait for a counter to reach a value
  LAPI_Address_init  Exchange addresses of interest
  LAPI_Fence         Enforce ordering of messages
  LAPI_Gfence        Enforce ordering of messages
  LAPI_Qenv          Query the environment state
  LAPI_Senv          Set the environment state
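To make this control flow concrete, the following self-contained C sketch models the target side of an active-message transfer as described above. It is a toy model of the semantics, not actual LAPI code: the handler types, the dispatch function, and the buffer are illustrative, and the real prototypes live in lapi.h.

    #include <stdio.h>
    #include <string.h>

    typedef void (*cmpl_hdl_t)(void *cmpl_data);

    /* Header handler: given the user header of an arriving message, return
     * the buffer where the message should be assembled, and optionally a
     * completion handler plus the data to pass to it (steps 2 and 3). */
    typedef void *(*hdr_hdl_t)(const void *uhdr, unsigned uhdr_len,
                               unsigned long msg_len,
                               cmpl_hdl_t *cmpl_hdl, void **cmpl_data);

    static char user_buffer[256];
    static int  tgt_cntr;            /* models LAPI's target counter */

    static void my_cmpl_hdl(void *cmpl_data)
    {
        printf("message complete: %s\n", (char *)cmpl_data);
    }

    static void *my_hdr_hdl(const void *uhdr, unsigned uhdr_len,
                            unsigned long msg_len,
                            cmpl_hdl_t *cmpl_hdl, void **cmpl_data)
    {
        (void)uhdr; (void)uhdr_len; (void)msg_len;
        *cmpl_hdl  = my_cmpl_hdl;    /* run after the last packet (step 4) */
        *cmpl_data = user_buffer;
        return user_buffer;          /* where the message is assembled */
    }

    /* Models the dispatcher's handling of one incoming message. */
    static void dispatch(hdr_hdl_t hh, const void *uhdr, unsigned uhdr_len,
                         const void *udata, unsigned long udata_len)
    {
        cmpl_hdl_t cmpl = NULL;
        void *cmpl_data = NULL;
        void *buf = hh(uhdr, uhdr_len, udata_len, &cmpl, &cmpl_data);
        memcpy(buf, udata, udata_len);   /* packets assembled into buf */
        tgt_cntr++;                      /* message has arrived */
        if (cmpl != NULL)
            cmpl(cmpl_data);             /* completion handler */
    }

    int main(void)
    {
        dispatch(my_hdr_hdl, "tag", 4, "hello", 6);
        printf("tgt_cntr = %d\n", tgt_cntr);
        return 0;
    }

In real LAPI the completion handler runs on a separate thread, and the counters are manipulated with LAPI_Setcntr, LAPI_Getcntr, and LAPI_Waitcntr; the single-threaded model above only illustrates the order of events.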
4 Supporting MPI on top of LAPI

The protocol stack used for the new MPI implementation is shown in Figure 1c. The Pipes layer is replaced by the LAPI layer. The MPCI layer used in this implementation is thinner than that of the native MPI implementation and does not include the interface with the Pipes layer. In this section, we first discuss the different communication modes defined by MPI and then explain how the new MPCI layer has been designed and implemented to support MPI on top of LAPI.

The MPI standard defines four communication modes: Standard, Synchronous, Buffered, and Ready [6]. These four modes are usually implemented by means of two internal protocols called the Eager and Rendezvous protocols. The translation of the MPI communication modes into these internal protocols in our implementation is shown in Table 2.

Table 2. Translation of MPI communication modes to internal protocols.

  MPI Communication Mode   Internal Protocol
  Standard                 if (size <= EagerLimit) eager, else rendezvous
  Ready                    eager
  Synchronous              rendezvous
  Buffered                 if (size <= EagerLimit) eager, else rendezvous

The Rendezvous protocol is used for large messages to avoid the potential buffer exhaustion caused by unexpected messages (messages whose receives have not been posted by the time they reach the destination). The value of EagerLimit can be set by the user and has a default value of 4096 bytes. This value can be set smaller or larger based on the size of the buffer available for storing unexpected messages.
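The translation in Table 2 amounts to a small decision function. The sketch below states it in C; the enum and constant names are illustrative, with EAGER_LIMIT standing for the user-settable threshold whose default is the 4096 bytes mentioned above.

    enum mpi_mode { MODE_STANDARD, MODE_READY, MODE_SYNCHRONOUS, MODE_BUFFERED };
    enum protocol { PROTO_EAGER, PROTO_RENDEZVOUS };

    #define EAGER_LIMIT 4096   /* user-settable; 4096 bytes is the default */

    /* Table 2 as code: Ready is always eager, Synchronous is always
     * rendezvous, and Standard/Buffered switch on the message size. */
    static enum protocol choose_protocol(enum mpi_mode mode, unsigned long size)
    {
        switch (mode) {
        case MODE_READY:       return PROTO_EAGER;
        case MODE_SYNCHRONOUS: return PROTO_RENDEZVOUS;
        case MODE_STANDARD:
        case MODE_BUFFERED:
        default:
            return (size <= EAGER_LIMIT) ? PROTO_EAGER : PROTO_RENDEZVOUS;
        }
    }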
In the Eager protocol, messages are sent regardless of the state of the receiver. Arriving messages whose matching receives have not yet been posted are stored in a buffer called the Early Arrival Buffer until the corresponding receives get posted. If an arriving message finds its matching receive, the message is copied directly into the user buffer. In the Rendezvous protocol, a Request_to_send control message is first sent to the receiver, which acknowledges it as soon as the matching receive gets posted. The message is sent to the receiver only after the arrival of this acknowledgment.

Blocking and nonblocking versions of the MPI communication modes are defined in the MPI standard. In the blocking version of a send operation, control returns to the application only when the user data buffer can be reused by the application. In the blocking version of a receive operation, control returns to the application only when the message has been completely received into the application buffer. In the nonblocking version of send operations, control returns to the user immediately once the message has been submitted for transmission, and it is the responsibility of the user to ensure safe reuse of the send buffer (by using MPI_WAIT or MPI_TEST operations). In the nonblocking version of receive, the receive is posted and control is returned to the user; it is the responsibility of the user to determine when the message has arrived.
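As a reminder of what these semantics mean to the application, the fragment below contrasts the blocking and nonblocking sends using standard MPI calls (the buffer, count, destination, and tag are arbitrary, and the code assumes MPI_Init has been called):

    #include <mpi.h>

    static double buf[1024];

    void send_examples(void)
    {
        MPI_Request req;
        MPI_Status  status;

        /* Blocking: returns only when buf may safely be reused. */
        MPI_Send(buf, 1024, MPI_DOUBLE, 1 /* dest */, 0 /* tag */,
                 MPI_COMM_WORLD);

        /* Nonblocking: returns at once; buf must not be modified until
         * MPI_Wait (or a successful MPI_Test) completes the request. */
        MPI_Isend(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        /* ... computation that does not touch buf ... */
        MPI_Wait(&req, &status);
    }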
In the following sections we explain how the internal protocols and the MPI communication modes are implemented using LAPI.

4.1 Implementing the Internal Protocols

As mentioned in Section 3, LAPI provides one-sided operations such as LAPI_Put and LAPI_Get. LAPI also provides Active Message style operations through the LAPI_Amsend function. We decided to implement the MPI point-to-point operations on top of this LAPI active message infrastructure. The LAPI active message interface (LAPI_Amsend) provides some enhancements to the active message semantics defined in GAM [10]. The LAPI_Amsend function allows the user to specify a header handler function to be executed at the target side once the first packet of the message arrives at the target. The header handler must return a buffer pointer to LAPI, which tells LAPI where the packets of the message must be reassembled. Because the task making the LAPI_Amsend call is not required to specify the target address for the message being sent, LAPI_Amsend is ideally suited as the basis for implementing MPI-LAPI. The header handler is used to process the message matching and early arrival semantics, thereby avoiding the need for an extra copy at the target side. The header handler also allows the user to specify a completion handler function to be executed after all the packets of the message have been copied into the target buffer. The completion handler therefore allows the application to incorporate the arriving message into the ongoing computation. In our MPI implementation the completion handler serves to update local state (marking messages complete) and possibly to send a control message back to the sender. LAPI_Amsend thus provides the hooks that allow an application to get control both when the first packet of a message arrives and when the complete message has arrived at the target buffer. In Sections 4.1.1 and 4.1.2, we explain how the Eager and Rendezvous protocols have been implemented.

4.1.1 Implementing the Eager Protocol

In the MPI-LAPI implementation, LAPI_Amsend is used to send the message to the receiver (Fig. 3a). The message description (such as the message TAG and Communicator) is encoded in the user header, which is passed to the header handler (Fig. 3b). Using the message description, the posted Receive Queue (Receive_queue) is searched to see if a matching receive has already been posted. If such a receive has been posted, the address of the user buffer is returned to LAPI, and LAPI assembles the data into the user buffer. It should be noted that LAPI takes care of out of order packets and copies the data to the correct offset in the user buffer. If the header handler doesn't find a matching receive, it returns the address of an Early Arrival Buffer (EA_buffer) for LAPI to assemble the message into (the buffer space is allocated if needed). The header handler also posts the arrival of the message in the Early Arrival Queue (EA_queue). If the message being received is a Ready-mode message and its matching receive has not yet been posted, a fatal error is raised and the job is terminated. If the matching receive is found, the header handler also sets the function Eager_cmpl_hdl to be executed as the completion handler. The completion handler is executed when the whole message has been copied into the user buffer, and the corresponding receive is marked as complete (Fig. 3c).
It should be noted that, in order to make the description of the implementation more readable, we have omitted some of the required parameters of the LAPI functions from the outlines.

Figure 3. Outline of the Eager protocol: (a) Eager_send, (b) the header handler for the eager send, and (c) the completion handler for the eager send.

    (a) Function Eager_send
            LAPI_Amsend(eager_hdr_hdl, msg_description, msg)
        end Eager_send

    (b) Function Eager_hdr_hdl(msg_description)
            if (matching_receive_posted(msg_description)) begin
                completion_handler = Eager_cmpl_hdl
                return (user_buffer)
            end
            else begin
                if (Ready_Mode)
                    Error_handler(Fatal, "Recv not posted")
                post msg_description in EA_queue
                completion_handler = NULL
                return (EA_buffer)
            endif
        end Eager_hdr_hdl

    (c) Function Eager_cmpl_hdl(msg_description)
            Mark the recv as COMPLETE
        end Eager_cmpl_hdl

4.1.2 Implementing the Rendezvous Protocol

The Rendezvous protocol is implemented in two steps. In the first step a Request_to_send control message is sent to the receiver using LAPI_Amsend (Fig. 4). The second step is executed when the acknowledgment of this message is received (indicating that the corresponding receive has been posted); the message is then sent using LAPI_Amsend in the same way the message is transmitted in the Eager protocol (Fig. 3a). In the next section, we explain how these protocols are employed to implement the different communication modes defined in the MPI standard.

Figure 4. Outline of the first phase of the Rendezvous protocol: (a) Request_to_send, (b) the header handler for the request to send, and (c) the completion handler for the request to send.

    (a) Function Request_to_send
            LAPI_Amsend(Request_to_send_hdr_hdl, msg_description, NULL)
        end Request_to_send

    (b) Function Request_to_send_hdr_hdl(msg_description)
            if (matching_receive_posted(msg_description)) begin
                completion_handler = Request_to_send_cmpl_hdl
                return (NULL)
            end
            else begin
                post msg_description in EA_queue
                completion_handler = NULL
                return (NULL)
            endif
        end Request_to_send_hdr_hdl

    (c) Function Request_to_send_cmpl_hdl(msg_description)
            LAPI_Amsend(Request_to_send_acked_hdr_hdl, msg_description, NULL)
        end Request_to_send_cmpl_hdl

4.2 Implementing the MPI Communication Modes

Standard-mode messages which are smaller than the EagerLimit, and Ready-mode messages, are sent using the Eager protocol (Fig. 5). Depending on whether the send is blocking or not, a wait statement (LAPI_Waitcntr) might be used to ensure that the user buffer can be reused.

Figure 5. Outline of the standard send for messages shorter than the eager limit, and of the ready-mode send.

    Function Stnd_Short_ready_send
        Eager_send
        if (blocking)
            Wait until Origin counter is set
    end Stnd_Short_ready_send

Standard-mode messages which are longer than the EagerLimit, and Synchronous-mode messages, are transmitted using the 2-phase Rendezvous protocol. Figure 6 illustrates how these sends are implemented. In the nonblocking version, the second phase of the send is executed in the completion handler specified in the header handler of the active message that acknowledges the Request_to_send message, as shown in Figure 7.

Figure 6. Outline of the standard send for messages longer than the eager limit, and of the synchronous-mode send.

    Function Stnd_Long_sync_send
        Request_to_send
        if (blocking) begin
            Wait until request to send is acknowledged
            Eager_send
            Wait until Origin counter is set
        endif
    end Stnd_Long_sync_send

Figure 7. Outline of receive for messages sent using the Rendezvous protocol.

    Function Request_to_send_acked_hdr_hdl
        if (blocking(msg_description))
            mark the request as acknowledged
        else
            completion_handler = Request_to_send_acked_cmpl_hdl
    end Request_to_send_acked_hdr_hdl

    Function Request_to_send_acked_cmpl_hdl
        Eager_send
    end Request_to_send_acked_cmpl_hdl

Buffered-mode messages are transmitted using the same procedure as for sending nonblocking standard messages. The only difference is that messages are first copied into a user specified buffer (defined by MPI_Buffer_attach). The receiver informs the sender when the whole message has been received, so that the sender can free the buffer used for transmitting the message (Figure 8).

Figure 8. Outline of the buffered-mode send.

    Function Buffered_send
        Copy the msg to the attached buffer
        if (msg_size <= EagerLimit)
            Eager_send
        else
            Request_to_send
    end Buffered_send

Figure 9 shows how the blocking and nonblocking receive operations are implemented. It should be noted that in response to a Request_to_send message, a LAPI_Amsend is used to acknowledge the request. When this acknowledgment is received at the sender side of the original communication, the entire message is transmitted to the receiver. If the original send operation is a blocking send, the sender is blocked until the Request_to_send message is marked as acknowledged, and the blocking send then sends out the message. If the original send is a nonblocking send, the message is sent out in the completion handler specified in the header handler of Request_to_send_acked (Fig. 7).

Figure 9. Outline of receive for messages sent by the Eager protocol.

    Function Receive
        if (found_matching_msg(EA_queue, msg_description))
            if (request_to_send) begin
                LAPI_Amsend(Request_to_send_acked, msg_description, NULL)
            endif
        else
            Post the receive in Receive_queue
        if (blocking)
            Wait until msg is marked as COMPLETE
    end Receive
5 Optimizing the MPI-LAPI Implementation

In this section we first discuss the performance of the base implementation of MPI-LAPI, which follows the description outlined in Section 4. After discussing the shortcomings of this implementation, we present two methods to improve the performance of MPI-LAPI.

5.1 Performance of the Base MPI-LAPI

We compared the performance of our base implementation with that of LAPI itself. We measured the time to send a number of messages (of a particular message size) from one node to another node. Each time, the receiving node would send back a message of the same size, and the sender node would send a new message only after receiving a message from the receiver. The number of messages sent back and forth was large enough to make the timer error negligible, and the granularity of the timer was less than a microsecond. LAPI_Put and LAPI_Waitcntr were used to send the message and to wait for the reply, respectively. The time for the MPI-LAPI implementation was measured in a similar fashion, with MPI_Send and MPI_Recv as the communication functions.
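For illustration, a minimal MPI version of the ping-pong loop just described might look as follows (REPS and the buffer are arbitrary, and the code assumes MPI_Init has been called; the raw-LAPI measurement replaces the send/receive pair with LAPI_Put and LAPI_Waitcntr):

    #include <mpi.h>

    #define REPS 1000

    /* Ping-pong between ranks 0 and 1: returns the one-way time per
     * message of 'size' bytes, in seconds. */
    double ping_pong(int rank, char *buf, int size)
    {
        MPI_Status status;
        double t0 = MPI_Wtime();
        int i;

        for (i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
            } else {
                MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        return (MPI_Wtime() - t0) / (2.0 * REPS);
    }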
It should be noted that in all cases the Rendezvous protocol was used for messages larger than the EagerLimit (… bytes). Figure 10 shows the measured time for messages of different sizes. We observed that the message transfer time of the MPI-LAPI implementation was too high to be attributed only to the cost of protocol processing, such as the message matching required by MPI but not by the 1-sided LAPI primitives.

5.2 MPI-LAPI with Counters

Careful study of the design, and profiling of the base implementation, showed that the thread context switch required to go from the header handler to the completion handler was the major source of the increase in data transfer time. (Recall that in LAPI, completion handlers are executed on a separate thread; see Section 3.) To verify this hypothesis, we modified the design so that the execution of completion handlers is not required. As described in Section 4, when the Eager protocol is used, the only action taken in the completion handler is marking the message as completed (Fig. 3), so that the receive (or MPI_WAIT or MPI_TEST) can recognize the completion of the receipt of the message. LAPI provides a set of counters to signal the completion of LAPI operations: the target counter specified in LAPI_Amsend is incremented by one after the message has been completely received (and the completion handler, if any, has executed). We used this counter to indicate that the message has been completely received. However, the address of this counter, which resides at the receiving side of the operation, must be specified at the sender side of the operation (where LAPI_Amsend is called). To take advantage of this feature, we modified the base implementation to use a set of counters whose addresses are exchanged among the participating MPI processes during initialization. By using these counters we avoided using the completion handler for messages sent through the Eager protocol.

We could not employ the same strategy for the first phase of the Rendezvous protocol: the reception of the Request_to_send control message at the receiving side does not imply that the message can be sent. If the receive has not yet been posted, the sender cannot start sending the message even though the Request_to_send message has already been received at the target. The time for the message transfer of this modified version is shown in Figure 10. As can be observed, this implementation provided better performance for short messages (which are sent in Eager mode) compared to the base implementation. This experiment was performed solely to verify the correctness of our hypothesis.
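A sketch of this counter scheme is shown below. It is illustrative rather than exact: the LAPI_Amsend argument order is an assumption (the real prototype is in lapi.h), remote_cntr[] stands for the counter addresses exchanged at initialization (e.g., with LAPI_Address_init), the msg_desc_t type is hypothetical, and error handling is omitted.

    /* Illustrative declarations; real types come from lapi.h. */
    typedef struct { int tag, comm; } msg_desc_t;   /* hypothetical */
    extern lapi_handle_t hndl;
    extern lapi_cntr_t   org_cntr, my_cntr;
    extern lapi_cntr_t  *remote_cntr[];   /* exchanged at initialization */
    extern void *eager_hdr_hdl();

    void eager_send_with_counters(int dest, void *user_buf,
                                  unsigned long nbytes, msg_desc_t *md)
    {
        /* Passing the receiver's counter as the target counter makes LAPI
         * increment it once the message is fully assembled, so no
         * completion handler (and no thread context switch) is needed. */
        LAPI_Amsend(hndl, dest, (void *)eager_hdr_hdl, md, sizeof(*md),
                    user_buf, nbytes,
                    remote_cntr[dest],   /* tgt_cntr, in receiver's space */
                    &org_cntr,           /* origin counter */
                    NULL);               /* completion counter not needed */
    }

    void wait_for_message(void)
    {
        /* The receiver observes completion by waiting on its counter
         * instead of being signalled by a completion handler. */
        LAPI_Waitcntr(hndl, &my_cntr, 1, NULL);
    }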
5.3 MPI-LAPI Enhanced

The results in Figure 10 confirmed our hypothesis that the major source of overhead was the cost of the context switching required for the execution of the completion handlers. We showed above how completion handlers can be avoided for messages sent in Eager mode; however, completion handlers are still needed for larger messages (sent in Rendezvous mode). In order to avoid the high cost of context switching for all messages, we enhanced LAPI to include predefined completion handlers which are executed in the same context. In this modified version of LAPI, operations such as updating a local variable, or updating a remote variable (which requires a LAPI function call) to indicate the occurrence of certain events, are executed in the same context. The results for this version are shown in Figure 10. The time of this version of MPI-LAPI comes very close to that of the bare LAPI itself. The remaining difference between the curves can be attributed to the cost of posting and matching receives required by MPI, and to the cost of locking and unlocking the data structures used for these functions at the MPI level.

[Figure 10. Comparison between the performance of raw LAPI and different versions of MPI-LAPI: time in microseconds vs. message size in bytes, for raw LAPI, MPI-LAPI Base, MPI-LAPI Counters, and MPI-LAPI Enhanced.]
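One way to picture this enhancement is a dispatcher that recognizes a small menu of predefined completion actions and executes them inline, deferring only general handlers to the separate completion-handler thread. The sketch below is purely illustrative of the idea and is not LAPI code.

    enum cmpl_kind { CMPL_NONE, CMPL_MARK_COMPLETE, CMPL_GENERAL };

    struct cmpl_action {
        enum cmpl_kind kind;
        volatile int *flag;       /* for CMPL_MARK_COMPLETE */
        void (*fn)(void *);       /* for CMPL_GENERAL */
        void *arg;
    };

    /* Stands in for handing work to the completion-handler thread. */
    static void defer_to_handler_thread(void (*fn)(void *), void *arg)
    {
        fn(arg);
    }

    static void run_completion(const struct cmpl_action *a)
    {
        switch (a->kind) {
        case CMPL_NONE:
            break;
        case CMPL_MARK_COMPLETE:      /* predefined: runs inline, no switch */
            *a->flag = 1;
            break;
        case CMPL_GENERAL:            /* arbitrary handler: the costly path */
            defer_to_handler_thread(a->fn, a->arg);
            break;
        }
    }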
In the following section, we compare the latency and bandwidth of our MPI-LAPI Enhanced implementation with those of the native MPI implementation and explain the differences between the performance of the two implementations.

6 Performance Evaluation

In this section we first present a comparison between the native MPI and MPI-LAPI (the Enhanced version) in terms of latency and bandwidth. Then, we compare the results obtained from running the NAS benchmarks using MPI-LAPI with those obtained using the native MPI. In all of our experiments we use an SP system with 332 MHz Power-PC nodes and the TBMX adapter. The EagerLimit is set to … bytes for all experiments.

6.1 Latency and Bandwidth

We compared the performance of MPI-LAPI with that of the native MPI available on SP systems. The time for message transfer was measured by sending messages back and forth between two nodes, as described in Section 5. The MPI primitives used for these experiments were MPI_Send and MPI_Recv, and the eager limit for both systems was set to … bytes. To measure the bandwidth, we repeatedly sent messages from one node to another and then waited for the last message to be acknowledged. We measure the time for sending these back to back messages and stop the timer when the acknowledgment of the last message is received. The number of messages sent is large enough to make the time for the transmission of the acknowledgment of the last message negligible in comparison with the total time. For this experiment we used the MPI_Isend and MPI_Irecv primitives.

Figure 11 illustrates the time of MPI-LAPI and the native MPI for different message sizes. It can be observed that the time of MPI-LAPI for very short messages is slightly higher than that of the native MPI. This increase is in part due to the extra parameter checking done by LAPI which, unlike the internal Pipes interface, is an exposed interface. The difference between the sizes of the packet headers in the two implementations is another factor contributing to the slightly increased latency: the header size in the native MPI is … bytes, while the header size for MPI-LAPI is … bytes. It can also be observed that for messages larger than … bytes, the latency of MPI-LAPI becomes lower than that of the native MPI; an improvement of up to … was measured. As mentioned earlier, unlike the native implementation of MPI, the MPI-LAPI implementation copies messages directly from the user buffer into the NIC buffer and vice versa. Avoiding the extra data copying helps improve the performance of the MPI-LAPI implementation.

[Figure 11. Comparison between the performance of the native MPI and MPI-LAPI: time in microseconds vs. message size in bytes.]

The obtainable bandwidth of the native MPI and MPI-LAPI is shown in Figure 12. It can be seen that, for a wide range of message sizes, the bandwidth of MPI-LAPI is higher than that of the native MPI. For … byte messages, MPI-LAPI achieves a bandwidth of … MB/s, which is a … improvement over the … MB/s bandwidth obtained with the native MPI.

[Figure 12. Comparison between the bandwidth of the native MPI and MPI-LAPI: bandwidth in MB/s vs. message size in bytes.]

For measuring the time required for sending messages from one node to another in interrupt mode, we used a method similar to the one used for measuring latency. The only difference was that the receiver would post the receive (using MPI_Irecv) and check the content of the receive buffer until the message arrived; then it would send back a message of the same size. The results of our measurements are shown in Figure 13. It can be seen that MPI-LAPI performs consistently and considerably better than the native MPI implementation; for short messages of … bytes an improvement of … is observed. The native MPI performs poorly in this experiment. The major reason is the hysteresis scheme used in its interrupt handler: the handler waits for a certain period of time to see whether more packets are coming, in order to avoid further interrupts, and if more packets do arrive it increases the time it waits in this loop. The value of the waiting period can be set by the user. LAPI does not use any such hysteresis in its interrupt handler and thus provides better performance.

[Figure 13. Comparison between the performance of the native MPI and MPI-LAPI in interrupt mode: time in microseconds vs. message size in bytes.]

6.2 NAS Benchmarks

In this section we present the execution times of programs from the NAS Parallel Benchmarks for the native MPI and MPI-LAPI. The NAS Parallel Benchmarks (version 2.3) consist of eight benchmarks written in MPI. These benchmarks were used to evaluate the performance of our MPI implementation in a more realistic environment. We used the native implementation of MPI and MPI-LAPI to compare the execution times of these benchmarks on a four-node SP system. The benchmarks were executed several times.
The best execution time for each application was recorded. MPI-LAPI performs consistently better than the native MPI: improvements of …, …, …, …, and … were obtained for the LU, IS, CG, BT, and FT benchmarks, respectively. The percentages of improvement for EP, MG, and SP were less than ….

7 Related Work

Previous work on implementing MPI on top of low-level one-sided communication interfaces includes (a) the effort at Cornell in porting MPICH on top of their GAM (generic active message) implementation on SP systems [2], and (b) the effort at the University of Illinois in porting MPICH on top of the FM (fast messages) communication interface on a workstation cluster connected with the Myrinet network [5]. In both cases the public domain version of MPI (MPICH [3]) has been the starting point of the implementation. In the MPI implementation on top of AM, short messages are copied into a retransmission buffer after they are injected into the network. Lost messages are retransmitted from the retransmission buffers, which are freed when a corresponding acknowledgment is received from the target. Short messages therefore require a copy at the sender side. The other problem is that a buffer must be allocated for each pair of nodes in the system, which limits the scalability of the protocol. The MPI-LAPI implementation avoids these problems (which degrade performance) by using the header handler feature of LAPI. Unlike MPI-LAPI, the implementation of MPI on AM described in [2] does not support packet arrival interrupts, which impacts the performance of applications whose communication behavior is asynchronous. In the implementation of MPI on top of FM [5], FM was modified to avoid extra copying at the sender side (gather) as well as at the receiver side (upcall). FM has been optimized for short messages; therefore, for long messages (larger than … bytes) MPI-FM performs poorly in comparison with the native implementation of MPI on SP systems (Fig. 10 of [5]).

8 Concluding Remarks

In this paper, we have presented how the MPI standard is implemented on top of LAPI for SP systems. The details of this implementation, and the mismatches between the requirements of the MPI standard and the functionality of LAPI, have been discussed. We have also shown how LAPI can be enhanced in order to make the MPI implementation more efficient. The flexibility provided by header handlers and completion handlers makes it possible to avoid unnecessary data copies. The performance of MPI-LAPI is shown to be very close to that of bare LAPI, and the cost added by enforcing the MPI standard semantics is shown to be minimal. MPI-LAPI performs comparably to or better than the native MPI in terms of latency and bandwidth. We plan to implement MPI datatypes, which have not been implemented yet.

Acknowledgements

We would like to thank several members of the CSS team: William Tuel and Robert Straub for their help in providing us with details of the MPCI layer, and Dick Treumann for helping with the details of the MPI layer. We would also like to thank Kevin Gildea, M. T. Raghunath, Gautam Shah, Paul DiNicola, and Chulho Kim from the LAPI team for their input and for the early discussions which helped motivate this project.

Disclaimer

MPI-LAPI is not a part of any IBM product and no assumptions should be made regarding its availability as a product. The performance results quoted in this paper are from measurements done in August of 1998; the system is continuously being tuned for improved performance.

References

[1] T. Agerwala, J. L. Martin, J. H. Mirza, D. C. Sadler, D. M. Dias, and M. Snir. SP2 System Architecture. IBM Systems Journal, 34(2):152–184, 1995.
[2] C. Chang, G. Czajkowski, C. Hawblitzel, and T. von Eicken. Low Latency Communication on the IBM RISC System/6000 SP. In Supercomputing '96, 1996.
[3] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. Technical report, Argonne National Laboratory and Mississippi State University.
[4] IBM. PSSP Command and Technical Reference - LAPI Chapter. IBM, 1997.
[5] M. Lauria and A. Chien. MPI-FM: High Performance MPI on Workstation Clusters. Journal of Parallel and Distributed Computing, pages 4–18, Jan 1997.
[6] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, Mar 1994.
[7] G. Shah, J. Nieplocha, J. Mirza, C. Kim, R. Harrison, R. K. Govindaraju, K. Gildea, P. DiNicola, and C. Bender. Performance and Experience with LAPI - a New High-Performance Communication Library for the IBM RS/6000 SP. In Proceedings of the International Parallel Processing Symposium, pages 260–267, March 1998.
[8] M. Snir, P. Hochschild, D. D. Frye, and K. J. Gildea. The communication software and parallel environment of the IBM SP2. IBM Systems Journal, 34(2):205–221, 1995.
[9] C. B. Stunkel, D. Shea, D. G. Grice, P. H. Hochschild, and M. Tsao. The SP1 High Performance Switch. In Scalable High Performance Computing Conference, pages 150–157, 1994.
[10] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active Messages: A Mechanism for Integrated Communication and Computation. In International Symposium on Computer Architecture, pages 256–266, 1992.