Thread Reinforcer Dynamically Determining Number of Th

Uploaded On 2015-06-05

Bhuyan, Department of Computer Science and Engineering, University of California, Riverside, Riverside, USA 92521. {kishore, gupta, bhuyan}@cs.ucr.edu

Abstract: It is often assumed that to maximize the performance of a multithreaded application t


B. Tuning the Implementation of Applications

Previous studies of PARSEC have been carried out for machine configurations with a small number of cores (2, 4, or 8). It has been observed that the performance of these programs scales well for a small number of cores. However, since we are conducting a study which considers the scalability of these application programs for a larger number of cores, we first examined the programs to see if their implementations require any tuning consistent with the use of a larger number of cores. Our study of the applications revealed two main issues that required tuning of implementations. First, for programs that make extensive use of heap memory, to avoid the high overhead of malloc [36], we used the libmtmalloc library to allow multiple threads to access the heap concurrently. Second, in some applications where the input load is not evenly distributed across worker threads, we improved the load distribution code.

By tuning the implementations in the above fashion, the performance of seven out of the eight applications considered was improved. In some cases the improvements are small (ferret, blackscholes, streamcluster, and bodytrack), moderate improvement was observed in the case of fluidanimate, and very high improvement was observed for swaptions. The improvement in swaptions can be explained as follows. We observed a dramatic reduction in locking events when we switched from malloc to mtmalloc in a run where 24 worker threads are used. In the original swaptions worker thread code, the input load of 128 swaptions is distributed across 24 threads as follows: five swaptions each are given to 23 threads, and 13 swaptions are assigned to the 24th thread. This is because the code first assigns equal load to all threads and all remaining load to the last thread. When the number of threads is large, this causes load imbalance. To remove this imbalance, we modified the code such that it assigns six swaptions each to eight threads and five swaptions each to the remaining 16 threads. That is, instead of assigning the extra load to one thread, we distribute it across multiple threads.

C. Performance for Varying Number of Threads

We ran each program for varying number of threads and collected the speedups observed. Each program was run ten times and speedups were averaged over these runs. Table III shows the maximum speedup (Max Speedup) for each program on the 24-core machine along with the minimum number of threads (called OPT-Threads) that produced this speedup. The data is provided for both the tuned version of the program and the original version of the program. As we can see, tuning resulted in improved performance for several programs. In the rest of the paper we will only consider the tuned versions of the programs.

TABLE III: Maximum speedups observed and corresponding number of threads for PARSEC programs on the 24-core machine. Programs were run from a minimum of 4 threads to a maximum of 128 threads.

                     Tuned Version               Original
Program          Max Speedup  OPT-Threads   Max Speedup  OPT-Threads
swaptions            21.9         33            3.6           7
ferret               14.1         63           13.7          63
bodytrack            11.4         26           11.1          26
blackscholes          4.9         33            4.7          33
canneal               3.6         41           no change
fluidanimate         12.7         21           12            65
facesim               4.9         16            4.6          16
streamcluster         4.2         17            4.0          17

As we can see from Table III, not only does the maximum speedup achieved by these programs vary widely (from 3.6x for canneal to 21.9x for swaptions), the number of threads that produces the maximum speedup also varies widely (from 16 threads for facesim to 63 threads for ferret). Moreover, for the first five programs the maximum speedup results from creating more threads than the number of cores, i.e., OPT-Threads is greater than 24. For the other three programs OPT-Threads is less than the number of cores.

The above observation that the value of OPT-Threads varies widely is significant: it tells us that the choice of the number of threads that are created is an important one. Experiments in prior studies involving PARSEC [2], [4], [24] were performed for configurations with a small number of cores (4 and 8). In these studies the number of threads was typically set to equal the number of cores, as this typically provided the best performance. However, the same approach cannot be taken when machines with a larger number of cores are being used. In other words, we must select an appropriate number of threads to maximize the speedups obtained.

To observe how the speedup varies with the number of threads, we plot the speedups for all our experiments in Figure 2. The graph on the left shows the speedups for programs for which OPT-Threads is greater than 24, and the graph on the right shows the speedups for the programs for which OPT-Threads is less than 24.

Fig. 2: Speedup behavior of PARSEC workloads for varying number of threads. The graph on the left shows the behavior of applications where maximum speedup was observed for Number of Threads > Number of Cores = 24; the graph on the right shows the behavior of applications where maximum speedup was observed for Number of Threads < Number of Cores = 24.

The programs with OPT-Threads greater than 24 exhibit different behaviors. The speedups for swaptions and ferret scale well with the number of threads, with maximum speedups resulting from the use of 33 and 63 threads respectively. While bodytrack provides substantial speedups, once the maximum speedup of 11.6 is achieved with 26 threads, the speedup starts to fall gradually as more threads are added. The speedups of blackscholes and canneal increase very slowly with the number of threads due to lack of parallelism in these programs. For programs with OPT-Threads less than 24, once the number of threads reaches OPT-Threads, speedups fall as additional threads are created. This behavior is the result of lock contention that increases with the number of threads.

D. Factors Determining Scalability

In this section we present additional data collected with the aim of understanding the factors that lead to the observed speedup behaviors presented in Figure 2. Using the prstat [7] utility, we studied the following main components of the execution times for threads in each application.

1) User: The percentage of time a thread spends in user mode.
2) System: The percentage of time a thread spends in processing the following system events: system calls, system traps, text page faults, and data page faults.
3) Lock-contention: The percentage of time a thread spends waiting for user locks, condition variables, etc.
4) Latency: The percentage of time a thread spends waiting for a CPU. In other words, although the thread is ready to run, it is not scheduled on any core.
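As an aside on the swaptions tuning described in Section B, the arithmetic behind the two load distributions can be checked with a small sketch. This is only an illustration, not the PARSEC source: the function names `original_split` and `balanced_split` are hypothetical, and giving the extra items to the first threads (rather than any other eight) is an assumption, since the text only states how many threads receive six swaptions.

```python
from typing import List


def original_split(items: int, threads: int) -> List[int]:
    """Equal share for every thread; the whole remainder goes to the last thread."""
    base = items // threads
    counts = [base] * threads
    counts[-1] += items - base * threads  # remainder piled onto one thread
    return counts


def balanced_split(items: int, threads: int) -> List[int]:
    """Spread the remainder one item at a time over the first `extra` threads
    (which threads receive the extra item is an assumption of this sketch)."""
    base, extra = divmod(items, threads)
    return [base + 1 if i < extra else base for i in range(threads)]


if __name__ == "__main__":
    # 128 swaptions on 24 worker threads, as in the text.
    print(original_split(128, 24))  # 23 threads with 5 items, one with 13
    print(balanced_split(128, 24))  # 8 threads with 6 items, 16 with 5
```

With 128 items and 24 threads, the heaviest thread carries 13 items under the original scheme and 6 under the balanced one, which is why the imbalance matters more as the thread count grows.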
We studied the above times for all threads to see if changes in these times would explain the changes in speedups observed by varying the number of threads. Although we examined the data for all threads, it quickly became apparent that in many programs not all threads were critical to the overall speedup. We identified the critical threads and studied them in greater detail. The critical threads for each application are listed in the table below.

Program          Critical Threads
ferret           Rank stage Threads
canneal          Main Thread
swaptions        Worker Threads
blackscholes     Main Thread
bodytrack        All Threads
fluidanimate     Worker Threads
streamcluster    Worker Threads
facesim          All Threads

Fig. 3: Breakdown of elapsed time of critical threads.

Figure 3 provides the breakdown of the time of critical threads in the above four categories; this data is for the OPT-Threads run and is the average across all critical threads. As we can see, in some programs lock-contention (LCK) plays a critical role, in others the threads spend significant time waiting for a CPU as latency (LAT) is high, and the system time (SYS) is the highest for canneal and blackscholes. In the remainder of this section we analyze the above times for each of the programs in greater detail to study their relationship with the speedup variations that are observed when the number of threads is varied. We further identify the program characteristics that are the causes of the observed speedup variations.

1) OPT-Threads > Number of Cores

Scalable Performance. As we can see from the graph on the left in Figure 2, for three programs (swaptions, bodytrack, and ferret) in this category, the speedups scale quite well. As the number of threads is varied from a few threads to around 24, which is the number of cores, the speedup increases linearly with the number of threads. However, once the number of threads is increased further, the three programs exhibit different trends as described below:

(Erratic) swaptions: Although the speedup for swaptions can be significantly increased, from 20 for 25 threads to 21.9 for 33 threads, its trend is erratic. Sometimes the addition of more threads increases the speedup, while at other times an increase in the number of threads reduces the speedup.
(Steady Decline) bodytrack: The speedup for bodytrack decreases as the number of threads is increased beyond 26 threads. The decline in speedup is quite steady.
(Continued Increase) ferret: The speedup for ferret continues to increase linearly. In fact the linear increase in speedup is observed from the minimum number of 6 threads all the way up to 63 threads. Interestingly, no change in behavior is observed when the number of threads is increased from less than the number of cores to more than the number of cores.

Next we trace the differing behaviors back to specific characteristics of these programs.

swaptions: First let us consider the erratic behavior of speedups observed in swaptions. We first examined the lock contention and latency information. As shown in Figure 4(a), the lock contention (LOCK) is very low and remains very low throughout, and the latency (LAT) increases steadily, which shows that the additional threads created are ready to run but are simply waiting for a CPU (core) to become available. This keeps the execution time the same. Therefore we need to look elsewhere for an explanation. Upon further analysis we found that the speedup behavior is correlated with the thread migration rate. As we can see from Figure 4(b), when the migration rate goes up, the speedup goes down and vice versa; the migration rate was measured using the mpstat [7] utility. Migrations are expensive events as they cause a thread to pull its working set into cold caches, often at the expense of other threads [7]. Thus, the speedup behavior is a direct consequence of changes in the thread migration rate.

The OS scheduler plays a significant role here as it is responsible for making migration decisions. When a thread makes a transition from the sleep state to a ready-to-run state, if the core on which it last ran is not available, the thread is likely to be migrated to another available core. In general, one would expect more migrations as the number of threads increases beyond the number of cores. However, if the number of threads is divisible by the number of cores, then the likelihood of migrations is lower than when this is not the case. In the former case, the OS scheduler can allocate an equal number of threads to each core, balancing the load and thus reducing the need for migrations. Thus we conclude that variations in the degree of load balancing across cores cause corresponding variations in the thread migration rate and hence in the observed speedups. For example, in Figure 4(b), the thread migration rate for 48 threads on 24 cores is lower than the thread migration rate for 40 threads on 24 cores. Moreover, we can expect a low thread migration rate when the input load (128 swaptions) is perfectly divisible by the number of threads (e.g., 16, 32, 64 etc.).

Fig. 4: swaptions: Cause of Erratic Speedup Changes. (a) Lock and Latency. (b) Speedup vs. Mig. Rate.

TABLE IV: Behavior of ferret.

Total         Load(1)          Segment(n)  Extract(n)   Vector(n)      Rank(n)      Out(1)
Threads   n   USR   SYS  LOCK   USR  LOCK   USR  LOCK   USR  LOCK    USR  LOCK   USR  LOCK  Speedup
     15   3    22     4    74     8    92     1    99    44    56    100     0   0.5  99.3      3.3
     31   7    44   7.8    48   6.7    93     1    99    43    57    100     0   0.6    99      7.5
     47  11    56  11.3    32   5.4    95     1    99    40    60    100     0   0.7    99     11.5
     55  13    64    14    19     5    95     1    99    44    56     98     0   0.7    99     12.5
     63  15    79    20     0     5    95     1    99    43    57     96     0   0.7    99     14.1
     71  17    77    20     0     5    95     1    99    37    63     80    16   0.7    99     13.8
     87  21    78    17     0     4    96     1    99    28    72     65    33   0.4  99.3     13.7
    103  25    75    17     0     3    97     1    99    24    76   53.5    45   0.4  99.3     13.4
    119  29    74    17     0     3    97     1    99    20    80     46  52.5   0.4  99.4     13.2
    127  31    70    20     0     3    97     1    99    19    81     40    59   0.4  99.4     13.1

bodytrack: Next let us consider the steady decline in speedup observed for bodytrack. Figure 5(a) shows that although the latency (LAT) rises as more threads are created, so does the lock contention (LOCK), which is significant for bodytrack. In addition, bodytrack is an I/O intensive benchmark where I/O is performed by all the threads. We observed that this program produces around 350 ioctl() calls per second. Both lock contention and I/O have the consequence of increasing the thread migration rate. This is because both lock contention and
I/O result in sleep-to-wakeup and run-to-sleep state transitions for the threads involved. When a thread wakes up from the sleep state, the OS scheduler immediately tries to give a core to that thread; if it fails to schedule the thread on the same core that it used last, it migrates the thread to another core. As we can see from Figure 5(b), the thread migration rate for bodytrack rises with the number of threads, which causes a steady decline in its speedup.

Fig. 5: bodytrack: Cause of Decline in Speedup. (a) Lock and Latency. (b) Migration Rate.

ferret: The behavior of this program is interesting as its speedup increases linearly from 6 threads all the way up to 63 threads even though only 24 cores are available. To understand this behavior we need to examine the program in greater detail. The program is divided into six pipeline stages; the results of processing in one stage are passed on to the next stage. The stages are: Load, Segment, Extract, Vector, Rank, and Out. The first and last stages have a single thread each, but each of the intermediate stages is a pool of n threads. Between each pair of consecutive stages a queue is provided through which results are communicated, and locking is used to control queue accesses.

The reason for the observed behavior is as follows. The Rank stage performs most of the work and thus the speedup of the application is determined by the Rank stage. Moreover, the other stages perform relatively little work and thus their threads together use only a fraction of the compute power of the available cores. Thus, as long as cores are not sufficiently utilized, more speedup can be obtained by creating additional threads for the Rank stage. The maximum speedup of 14.1 for ferret was observed when the total number of threads created was 63, which actually corresponds to 15 threads for the Rank stage. That is, the linear rise in speedup is observed from 1 thread to 15 threads for the Rank stage, which is well under the total of 24 cores available; the remaining cores are sufficient to satisfy the needs of all other threads.

The justification of the above reasoning can be found in the data presented in Table IV, where we show the average percentage of USR and LOCK times for all stages and SYS time for only the Load stage, because all other times are quite small. The threads belonging to the Segment, Extract, and Out stages perform very little work and mostly spend their time waiting for results to become available in their incoming queues. While the Load and Vector stages do perform a significant amount of work, they nevertheless perform less work than the Rank stage. The performance of the Rank stage determines the overall speedup: adding threads to the Rank stage continues to yield additional speedups as long as this stage does not experience lock contention. Once lock contention times start to rise (starting at n = 17), the speedup begins to fall.

To further confirm our observations above, we ran an experiment in which we increased the number of threads in the Rank stage and lowered the number of threads in the other intermediate stages. We found that the configuration with (1, 10, 10, 10, 16, 1) threads gave a speedup of 13.9, and when we changed the configuration to (1, 16, 16, 16, 16, 1) threads the speedup remained the same. This further confirms the importance of the Rank stage.

Performance Does Not Scale. (blackscholes and canneal) Although the maximum speedups of these programs (4.9 and 3.6) are observed when 32 and 40 worker threads are created, the speedups of both these programs increase very little beyond 16 worker threads. This is because most of the work is performed by the main thread and the overall CPU utilization becomes low. The main thread takes up 85% and 70% of the time for blackscholes and canneal respectively. During the rest of the time the parallelized part of the program is executed by worker threads. The impact of parallelizing this limited part on the overall speedup diminishes with an increasing number of threads.

2) OPT-Threads < Number of Cores

The three programs where maximum speedup was achieved using fewer threads than the number of cores are fluidanimate, facesim, and streamcluster. In these programs the key factor that limits performance is lock contention. Figure 6 shows that the time due to lock contention (LOCK) dramatically increases with the number of threads while the latency (LAT) shows modest or no increase. The maximum speedups are observed at 21 threads for fluidanimate, 16 threads for facesim, and 17 threads for streamcluster.

Fig. 6: Maximum Speedup When Number of Threads < Number of Cores. (a) fluidanimate. (b) facesim. (c) streamcluster.

When the number of threads is less than the number of cores, the load balancing task of the OS scheduler becomes simple and thread migrations become rare. Thus, unlike swaptions and bodytrack, where maximum speedups were observed for more than 24 threads, the thread migration rate does not play any role in the performance of the three programs considered in this section. However, the increased lock contention leads to slowdowns because of an increased context switch rate. We can divide context switches into two types: involuntary context switches (ICX) and voluntary context switches (VCX). Involuntary context switches happen when threads are involuntarily taken off a core (e.g., due to expiration of their time quantum). Voluntary context switches occur when a thread performs a blocking system call (e.g., for I/O) or when it fails to acquire a lock. In such cases a thread voluntarily releases the core using the yield() system call before going to sleep using the lwp_park() system call. Therefore, as more threads are created and lock contention increases, the VCX rate rises as shown in Figure 7.

Fig. 7: Voluntary Context Switch Rate.

It is also worth noting that most of the context switches performed by the three programs are in the VCX category. We measured the VCX and ICX data using the prstat utility. Table V shows that the percentage of VCX ranges from 84% to 97% for the three programs considered here. In contrast, VCX represents only 11% and 13% of context switches for swaptions and ferret.

TABLE V: Voluntary vs. Involuntary Context Switches.

Program          VCX (%)   ICX (%)
fluidanimate        84        16
facesim             97         3
streamcluster       94         6
swaptions           11        89
ferret              13        87

Since the speedup behavior of an application correlates with variations in LOCK, MIGR_RATE, VCX_RATE, and CPU_UTIL, in the next section we develop a framework for automatically determining the number of threads by runtime monitoring of the above characteristics.

III. The Thread Reinforcer Framework

The applications considered allow the user to control the number of threads created using the command line argument n in Table II. Since our experiments show that the number of threads that yields peak performance varies greatly from one program to another, the selection of n places an added burden on the user. Therefore, in this section, we develop a framework for automatically selecting the number of threads.

The framework we propose runs the application in two steps. In the first step the application is run multiple times for short durations of time during which its behavior is monitored, and based upon runtime observations Thread Reinforcer searches for the appropriate number of threads. Once this number is found, in the second step, the application is fully reexecuted with the number of threads determined in the first step. We have to rerun the applications for short durations because the applications are written such that they do not support varying the number of threads online. Thus, Thread Reinforcer does not consider phase changes of the target program. However, out of the 16 programs tested, only the ammp program shows two significantly different phases, and its first phase dominates the execution. Therefore Thread Reinforcer works well also for the ammp program.

Each time an application is to be executed on a new input, Thread Reinforcer is used to determine the appropriate number of threads for that input. This is done in order to handle applications whose runtime behavior is input dependent and for which the optimal number of threads may therefore vary across inputs. Our goal is twofold: to find the appropriate number of threads and to do so quickly so as to minimize runtime overhead. The applications we have considered take from tens of seconds to a few hundred seconds to execute in the OPT-Threads configuration. Therefore, we aim to design Thread Reinforcer so that the time it takes to search for the appropriate number of threads is only a few seconds. This ensures that the benefits of the algorithm outweigh the runtime overhead of using it.

Thread Reinforcer searches for the appropriate number of threads in the range of Tmin to Tmax threads as follows. It runs the application for an increasing number of threads for short time durations. Each successive run contains either Tstep or Tstep/2 additional threads. The decision of whether or not to run the program for a higher number of threads, and whether to increase the number of threads by Tstep or Tstep/2, is based upon changes in profiles observed over the past two runs. The profile consists of four components: LOCK (lock contention), MIGR_RATE (thread migration rate), VCX_RATE (voluntary context switch rate), and CPU_UTIL (processor utilization). The values of each of these measures are characterized as either low or high based upon set thresholds for these parameters. Our algorithm not only examines the current values of the above profiles, it also examines how rapidly they are changing. The changes of these values over the past two runs are denoted as ΔLOCK, ΔMIGR_RATE, ΔVCX_RATE, and ΔCPU_UTIL. The changes are also characterized as low or high to indicate whether the change is gradual or rapid.

At any point, the penultimate run represents the current best solution of our algorithm, and the last run is compared with it to see if the last run should be viewed as an improvement over the penultimate run. If it is considered to be an improvement, then the last run becomes our current best solution. Based upon the strength of the improvement, we run the program with Tstep or Tstep/2 additional threads. The above process continues as long as improvement is observed. Eventually Thread Reinforcer terminates if no improvement, or a degradation, is observed, or if we have already reached the maximum number of threads Tmax.

Table VI identifies the components that play an important role when the number of threads is no more than the number of cores (i.e., 24) versus when the number of threads is greater than the number of cores. Lock contention is an important factor which must be considered throughout. However, for ≤ 24 threads the VCX_RATE is important, while for > 24 threads the MIGR_RATE is important to consider. In general, the limit of parallelism for a program may be reached at any time. Thus CPU_UTIL is an important factor to consider throughout. The above observations are a direct consequence of the study presented earlier.

TABLE VI: Factors considered with respect to the number of threads.

Factor       ≤ 24 Threads   > 24 Threads
LOCK             Yes            Yes
VCX_RATE         Yes             -
MIGR_RATE         -             Yes
CPU_UTIL         Yes            Yes

Figure 8 presents Thread Reinforcer in detail. Thread Reinforcer is initiated by calling FindN(), and when it terminates it returns the value of the command line parameter n that is closest to the number of threads that are expected to give the best performance. FindN() is iterative: it checks for termination by calling Terminate(), and if the termination conditions are not met, it calls ComputeNextT() to find out the number of threads that must be used in the next run.

Consider the code for Terminate(). It first checks if processor utilization has increased from the penultimate run to the last run. If this is not the case then the algorithm terminates; otherwise the lock contention is examined for termination. If lock contention is high then termination occurs if one of the following is true: lock contention has increased significantly; the number of threads is no more than the number of cores and the voluntary context switch rate has sharply increased; or the number of threads is greater than the number of cores and the thread migration rate has sharply increased. Finally, if the above termination condition is also not met, we do not terminate the algorithm unless we have already reached the upper limit for the number of threads. Before iterating another step, the number of additional threads to be created is determined. ComputeNextT() does this task: if the overheads of locking, context switches, or migration rate increase slowly then Tstep additional threads are created; otherwise Tstep/2 additional threads are created.

We implemented Thread Reinforcer to evaluate its effectiveness in finding the appropriate number of threads and to study its runtime overhead. Before experimentation, we needed to select the various thresholds used by Thread Reinforcer. To guide the selection of thresholds we used three of the eight programs: fluidanimate, facesim, and blackscholes. We ran these selected programs on small inputs: for fluidanimate and blackscholes we used the simlarge input, and for facesim we used the simsmall input. We studied the profiles of the programs and identified the threshold values for LOCK, MIGR_RATE, VCX_RATE, and CPU_UTIL as follows. The threshold values were chosen such that after reaching the threshold value, the value of the profile characteristic became more sensitive to the number of threads and showed a rapid increase. There are two types of threshold values: absolute thresholds and Δ thresholds. The Δ threshold indicates how rapidly the corresponding characteristic is changing. For LOCK and VCX_RATE both thresholds are used by our algorithm. For MIGR_RATE and CPU_UTIL only the Δ threshold is used. It should be noted that the three programs that were chosen to help in the selection of thresholds collectively cover all four of the profile characteristics: for fluidanimate both LOCK and MIGR_RATE are important; for facesim VCX_RATE is important; and for blackscholes CPU_UTIL is important.
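The search loop described above can be sketched in a few lines. Figure 8 is not reproduced in this excerpt, so the following is only a reading of the prose description of FindN(), Terminate(), and ComputeNextT(), not the authors' code. Several things are assumptions of the sketch: the `Profile` and `Thresholds` types, every numeric threshold value, the `run_and_profile` callback (a stand-in for a short monitored run of the application), the choice to return the last thread count that counted as an improvement, and the use of an absolute threshold on LOCK only (the text says VCX_RATE also has one but does not say where it is applied).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Profile:
    lock: float       # LOCK: % of time spent waiting on user locks
    migr_rate: float  # MIGR_RATE: thread migration rate
    vcx_rate: float   # VCX_RATE: voluntary context switch rate
    cpu_util: float   # CPU_UTIL: processor utilization


@dataclass
class Thresholds:
    # All values are illustrative placeholders, not the paper's thresholds.
    lock_high: float = 10.0  # absolute threshold on LOCK
    d_lock: float = 5.0      # delta thresholds: "rapid" change between runs
    d_vcx: float = 500.0
    d_migr: float = 50.0
    d_cpu: float = 1.0       # minimum CPU_UTIL gain that counts as an increase


def rapid_change(prev: Profile, last: Profile, n: int, cores: int,
                 th: Thresholds) -> bool:
    """True if LOCK, or the rate relevant for this thread count, rose sharply."""
    if last.lock - prev.lock >= th.d_lock:
        return True
    if n <= cores:  # at or below the core count, VCX_RATE is the relevant factor
        return last.vcx_rate - prev.vcx_rate >= th.d_vcx
    return last.migr_rate - prev.migr_rate >= th.d_migr  # above it, MIGR_RATE


def terminate(prev: Profile, last: Profile, n: int, cores: int,
              th: Thresholds) -> bool:
    """Stop if utilization did not rise, or LOCK is high and overheads jumped."""
    if last.cpu_util - prev.cpu_util < th.d_cpu:
        return True
    return last.lock >= th.lock_high and rapid_change(prev, last, n, cores, th)


def compute_next_t(prev: Profile, last: Profile, n: int, cores: int,
                   t_step: int, th: Thresholds) -> int:
    """Full step while overheads grow slowly, half step otherwise."""
    step = max(1, t_step // 2) if rapid_change(prev, last, n, cores, th) else t_step
    return n + step


def find_n(run_and_profile: Callable[[int], Profile], cores: int, t_min: int,
           t_max: int, t_step: int, th: Optional[Thresholds] = None) -> int:
    """Return the last thread count that still counted as an improvement."""
    th = th or Thresholds()
    best_n, best = t_min, run_and_profile(t_min)
    n = min(t_min + t_step, t_max)
    while n > best_n:  # the loop ends once t_max has been accepted
        last = run_and_profile(n)
        if terminate(best, last, n, cores, th):
            break
        next_n = min(compute_next_t(best, last, n, cores, t_step, th), t_max)
        best_n, best = n, last
        n = next_n
    return best_n
```

As a usage illustration with a synthetic profile in which utilization stops rising at 20 threads and lock contention stays low, `find_n(lambda n: Profile(1.0, 0.0, 0.0, 4.0 * min(n, 20)), cores=24, t_min=4, t_max=64, t_step=8)` probes 4, 12, 20, and 28 threads and settles on 20.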