


Cost-effective Outbreak Detection in Networks

Jure Leskovec (Carnegie Mellon University), Andreas Krause (Carnegie Mellon University), Carlos Guestrin (Carnegie Mellon University), Christos Faloutsos (Carnegie Mellon University), Jeanne VanBriesen (Carnegie Mellon University), Natalie Glance (Nielsen BuzzMetrics)

ABSTRACT

Given a water distribution network, where should we place sensors to quickly detect contaminants? Or, which blogs should we read to avoid missing important stories? These seemingly different problems share common structure: outbreak detection can be modeled as selecting nodes (sensor locations, blogs) in a network, in order to detect the spreading of a virus or information as quickly as possible.

We present a general methodology for near-optimal sensor placement in these and related problems. We demonstrate that many realistic outbreak detection objectives (e.g., detection likelihood, population affected) exhibit the property of "submodularity". We exploit submodularity to develop an efficient algorithm that scales to large problems, achieving near-optimal placements, while being 700 times faster than a simple greedy algorithm. We also derive online bounds on the quality of the placements obtained by any algorithm. Our algorithms and bounds also handle cases where nodes (sensor locations, blogs) have different costs.

We evaluate our approach on several large real-world problems, including a model of a water distribution network from the EPA, and real blog data. The obtained sensor placements are provably near optimal, providing a constant fraction of the optimal solution. We show that the approach scales, achieving speedups and savings in storage of several orders of magnitude. We also show how the approach leads to deeper insights in both applications, answering multicriteria trade-off, cost-sensitivity and generalization questions.

Categories and Subject Descriptors: F.2.2 Analysis of
[…] big, well-known blogs. However, these usually have a large number of posts, and are time-consuming to read. We show that, perhaps counterintuitively, a more cost-effective solution can be obtained by reading smaller, but higher quality, blogs, which our algorithm can find.

There are several possible criteria one may want to optimize in outbreak detection. For example, one criterion seeks to minimize detection time (i.e., to know about a cascade as soon as possible, or avoid spreading of contaminated water). Similarly, another criterion seeks to minimize the population affected by an undetected outbreak (i.e., the number of blogs referring to the story we just missed, or the population consuming the contamination we cannot detect). Optimizing these objective functions is NP-hard, so for large, real-world problems, we cannot expect to find the optimal solution.

In this paper, we show that these and many other realistic outbreak detection objectives are submodular, i.e., they exhibit a diminishing returns property: reading a blog (or placing a sensor) when we have only read a few blogs provides more new information than reading it after we have read many blogs (placed many sensors).

We show how we can exploit this submodularity property to efficiently obtain solutions which are provably close to the optimal solution. These guarantees are important in practice, since selecting nodes is expensive (reading blogs is time-consuming, sensors have high cost), and we desire solutions which are not too far from the optimal solution.

The main contributions of this paper are:

- We show that many objective functions for detecting outbreaks in networks are submodular, including detection time and population affected in the blogosphere and water distribution monitoring problems. We show that our approach also generalizes work by [10] on selecting nodes maximizing influence in a social network.
- We exploit the submodularity of the objective (e.g., detection time) to develop an efficient approximation algorithm, CELF, which achieves near-optimal placements (guaranteeing at least a constant fraction of the optimal solution), providing a novel theoretical result for non-constant node cost functions. CELF is up to 700 times faster than a simple greedy algorithm. We also derive novel online bounds on the quality of the placements obtained by any algorithm.
- We extensively evaluate our methodology on the applications introduced above: water quality and blogosphere monitoring. These are large real-world problems, involving a model of a water distribution network from the EPA with millions of contamination scenarios, and real blog data with millions of posts.
- We show how the proposed methodology leads to deeper insights in both applications, including multicriterion, cost-sensitivity analyses and generalization questions.

2. OUTBREAK DETECTION

2.1 Problem statement

The water distribution and blogosphere monitoring problems, despite being very different domains, share essential structure. In both problems, we want to select a subset $A$ of nodes (sensor locations, blogs) in a graph $G = (V, E)$, which detect outbreaks (spreading of a virus/information) quickly.

Fig. 2 presents an example of such a graph for a blog network. Each of the six blogs consists of a set of posts. Connections between posts represent hyperlinks, and labels show the time difference between the source and destination post (e.g., post […] linked to […] one day after […] was published).

Figure 2: Blogs have posts, and there are time-stamped links between the posts. The links point to the sources of information and the cascades grow (information spreads) in the reverse direction of the edges. Reading only blog […] captures all cascades, but late. […] also has many posts, so by reading […] we detect cascades sooner.

These outbreaks (e.g., information cascades) initiate from a single node of the network (e.g., […] and […]), and spread over the graph, such that the traversal of every edge $(s, t) \in E$ takes a certain amount of time (indicated by the edge labels). As soon as the event reaches a selected node, an alarm is triggered; e.g., selecting blog […] would detect the cascades originating from posts […], after 6, 6 and 2 timesteps from the start of the respective cascades.

Depending on which nodes we select, we achieve a certain placement score. Fig. 2 illustrates several criteria one may want to optimize. If we only want to detect as many stories as possible, then reading just blog […] is best. However, reading […] would only miss one cascade ([…]), but would detect the other cascades immediately. In general, this placement score (representing, e.g., the fraction of detected cascades, or the population saved by placing a water quality sensor) is a set function $R$, mapping every placement $A$ to a real number $R(A)$ (our reward), which we intend to maximize.

Since sensors are expensive, we also associate a cost $c(A)$ with every placement $A$, and require that this cost does not exceed a specified budget $B$ which we can spend. For example, the cost of selecting a blog could be the number of posts in it (i.e., […] has cost 2, while […] has cost 6). In the water distribution setting, accessing certain locations in the network might be more difficult (expensive) than other locations. Also, we could have several types of sensors to choose from, which vary in their quality (detection accuracy) and cost. We associate a nonnegative cost $c(s)$ with every sensor $s$, and define the cost of placement $A$ as $c(A) = \sum_{s \in A} c(s)$.

Using this notion of reward and cost, our goal is to solve the optimization problem

$$\max_{A \subseteq V} R(A) \quad \text{subject to} \quad c(A) \le B, \tag{1}$$

where $B$ is a budget we can spend for selecting the nodes.

2.2 Placement objectives

An event $i \in I$ from a set $I$ of scenarios (e.g., cascades, contaminant introduction) originates from a node $s' \in V$ of a network $G = (V, E)$, and spreads through the network, affecting other nodes (e.g., through citations, or flow through pipes). Eventually, it reaches a monitored node $s \in A \subseteq V$ (i.e., blogs we read, pipe junction we instrument with a sensor), and gets detected. Depending on the time of detection $t = T(i, s)$, and the impact on the network before the detection (e.g., the size of the cascades missed, or the population affected by a contaminant), we incur penalty $\pi_i(t)$. The […]

Theorem 2 ([17]). If $R$ is a submodular, nondecreasing set function and $R(\emptyset) = 0$, then the greedy algorithm finds a set $A_G$, such that $R(A_G) \ge (1 - 1/e) \max_{|A| = k} R(A)$.

Hence, the greedy algorithm is guaranteed to find a solution which achieves at least a constant fraction $(1 - 1/e)$ (about 63%) of the optimal score. The penalty reduction $R$ satisfies all requirements of Theorem 2, and hence the greedy algorithm approximately solves the maximization problem Eq. (1).

Non-constant costs. What if the costs of the nodes are not constant? It is easy to see that the simple greedy algorithm, which iteratively adds sensors using the rule from Eq. (2) until the budget is exhausted, can fail badly, since it is indifferent to the costs (i.e., a very expensive sensor providing reward $r$ is preferred over a cheaper sensor providing reward $r - \varepsilon$). To avoid this issue, the greedy rule Eq. (2) can be modified to take costs into account:

$$s_k = \operatorname*{argmax}_{s \in V \setminus A_{k-1}} \frac{R(A_{k-1} \cup \{s\}) - R(A_{k-1})}{c(s)}, \tag{3}$$

i.e., the greedy algorithm picks the element maximizing the benefit/cost ratio. The algorithm stops once no element can be added to the current set without exceeding the budget. Unfortunately, this intuitive generalization of the greedy algorithm can perform arbitrarily worse than the optimal solution. Consider the case where we have two locations, $s_1$ and $s_2$, with $c(s_1) = \varepsilon$ and $c(s_2) = B$. Also assume we have only one scenario $i$, with $R(\{s_1\}) = 2\varepsilon$ and $R(\{s_2\}) = B$. Now, $R(\{s_1\})/c(s_1) = 2$, and $R(\{s_2\})/c(s_2) = 1$. Hence the greedy algorithm would pick $s_1$. After selecting $s_1$, we cannot afford $s_2$ anymore, and our total reward would be $2\varepsilon$. However, the optimal solution would pick $s_2$, achieving total penalty reduction of $B$. As $\varepsilon$ goes to 0, the performance of the greedy algorithm becomes arbitrarily bad.

However, the greedy algorithm can be improved to achieve a constant factor approximation. This new algorithm, CELF (Cost-Effective Lazy Forward selection), computes the solution $A_{GCB}$ using the benefit-cost greedy algorithm, using rule (3), and also computes the solution $A_{GUC}$ using the unit-cost greedy algorithm (ignoring the costs), using rule (2). For both rules, CELF only considers elements which do not exceed the budget $B$. CELF then returns the solution with the higher score. Even though both solutions can be arbitrarily bad, the following result shows that at least one of them is not too far away from the optimum, and hence CELF provides a constant factor approximation.

Theorem 3. Let $R$ be a nondecreasing submodular function with $R(\emptyset) = 0$. Then

$$\max\{R(A_{GCB}), R(A_{GUC})\} \ge \frac{1}{2}(1 - 1/e) \max_{A,\, c(A) \le B} R(A).$$

Theorem 3 was proved by [11] for the special case of the Budgeted MAX-COVER problem, and here we prove this result for arbitrary nondecreasing submodular functions. Theorem 3 states that the better solution of $A_{GCB}$ and $A_{GUC}$ (which is returned by CELF) is within a constant factor $\frac{1}{2}(1 - 1/e)$ of the optimal solution.

Note that the running time of CELF is $O(B|V|)$ in the number of possible locations $|V|$ (if we consider a function evaluation $R(A)$ as an atomic operation, and the lowest cost of a node is constant). In [25], it was shown that even in the non-constant cost case, the approximation guarantee of $(1 - 1/e)$ can be achieved. However, their algorithm is $\Omega(B|V|^4)$ in the size of possible locations $|V|$ we need to select from, which
is prohibitive in the applications we consider. In addition, in our case studies, we show that the solutions of CELF are provably very close to the optimal score.

(Footnote: In MAX-COVER, we pick from a collection of sets, such that the union of the picked sets is as large as possible.)

3.2 Online bounds for any algorithm

The approximation guarantees of $(1 - 1/e)$ and $\frac{1}{2}(1 - 1/e)$ in the unit- and non-constant cost cases are offline, i.e., we can state them in advance before running the actual algorithm. We can also use submodularity to acquire tight online bounds on the performance of an arbitrary placement (not just the one obtained by the CELF algorithm).

Theorem 4. For a placement $\hat{A} \subseteq V$, and each $s \in V \setminus \hat{A}$, let $\delta_s = R(\hat{A} \cup \{s\}) - R(\hat{A})$. Let $r_s = \delta_s / c(s)$, and let $s_1, \dots, s_m$ be the sequence of locations with $r_s$ in decreasing order. Let $k$ be such that $C = \sum_{i=1}^{k-1} c(s_i) \le B$ and $\sum_{i=1}^{k} c(s_i) > B$. Let $\lambda = (B - C)/c(s_k)$. Then

$$\max_{A,\, c(A) \le B} R(A) \le R(\hat{A}) + \sum_{i=1}^{k-1} \delta_{s_i} + \lambda \delta_{s_k}. \tag{4}$$

Theorem 4 presents a way of computing how far any given solution $\hat{A}$ (obtained using CELF or any other algorithm) is from the optimal solution. This theorem can be readily turned into an algorithm, as formalized in Algorithm 2. We empirically show that this bound is much tighter than the bound $\frac{1}{2}(1 - 1/e)$, which is roughly 31%.

4. SCALING UP THE ALGORITHM

4.1 Speeding up function evaluations

Evaluating the penalty reductions $R$ can be very expensive. E.g., in the water distribution application, we need to run physical simulations in order to estimate the effect of a contamination at a certain node. In the blog networks, we need to consider several millions of posts, which make up the cascades. However, in both applications, most outbreaks are sparse, i.e., affect only a small area of the network (cf. [12, 16]), and hence are only detected by a small number of nodes. Hence, most nodes $s$ do not reduce the penalty incurred by an outbreak (i.e., $R_i(\{s\}) = 0$). Note that this sparsity is only present if we consider penalty reductions. If for each sensor $s \in V$ and scenario $i \in I$ we store the actual penalty $\pi_i(T(i, s))$, the resulting representation is not sparse. Our implementation exploits this sparsity by representing the penalty function $R$ as an inverted index, which allows fast lookup of the penalty reductions by sensor index $s$. By looking up all scenarios detected by all sensors in our placement $A$, we can quickly compute the penalty reduction

$$R(A) = \sum_{i:\, i \text{ detected by } A} P(i) \max_{s \in A} R_i(\{s\}),$$

without having to scan the entire data set.

(Footnote: The index is inverted, since the data set facilitates the lookup by scenario index $i$, because we need to consider cascades, or contamination simulations, for each scenario.)

The inverted index is the main data structure we use in our optimization algorithms. After the problem (water distribution network simulations, blog cascades) has been compressed into this structure, we use the same implementation for optimizing sensor placements and computing bounds.

In the water distribution network application, exploiting this sparsity allows us to fit the set of all possible intrusions considered in the BWSN challenge in main memory (16 GB), which leads to several orders of magnitude improvements in the running time, since we can avoid hard-drive accesses.

Figure 3: (a) Performance of the CELF algorithm and offline and online bounds for the PA objective function (reduction in population affected versus number of blogs). (b) Comparison of the objective functions PA, DT and DL (penalty reduction versus number of blogs).

[…] The offline bound (Section 3.1) shows that the unknown optimal solution lies between our solution (bottom line) and the bound (top line). Notice that the discrepancy between the lines is big, which means the bound is very loose. On the other hand, the middle line shows the online bound (Section 3.2), which again tells us that the optimal solution is somewhere between our current solution and the bound. Notice that the gap is much smaller. This means (a) that our online bound is much tighter than the traditional offline bound, and (b) that our CELF algorithm performs very close to the optimum.

In contrast to the offline bound, the online bound is algorithm independent, and thus can be computed regardless of the algorithm used to obtain the solution. Since it is tighter, it gives a much better worst-case estimate of the solution quality. For this particular experiment, we see that CELF works very well: after selecting 100 blogs, we are at most 13.8% away from the optimal solution.

Figure 3(b) shows the performance using various objective functions (from top to bottom: DL, DT, PA). DL increases the fastest, which means that one only needs to read a few blogs to detect most of the cascades, or equivalently that most cascades hit one of the big blogs. However, the population affected (PA) increases much more slowly, which means that one needs many more blogs to know about stories before the rest of the population does. By using the online bound we also calculated that all objective functions are at most 5% to 15% from optimal.

5.4 Cost of a blog

The results presented so far assume that every blog has the same cost. Under this unit cost model, the algorithm tends to pick large, influential blogs that have many posts. For example, […] is the best blog when optimizing PA, but it has 4,593 posts. Interestingly, most of the blogs among the top 10 are politics blogs: […], and sciencepolitics.blogspot.com. Some popular aggregators of interesting things and trends on the blogosphere are also selected: […] and […]. The top 10 PA blogs had more than 21,000 posts in 2006. They account for 0.2% of all posts, 3.5% of all in-links, 1.7% of out-links inside the dataset, and 0.37% of all out-links.

Under the unit cost model, large blogs are important, but reading a blog with many posts is time consuming. This motivates the number of posts (NP) cost model, where we set the cost of a blog to the number of posts it had in 2006.

First, we compare the NP cost model with the unit cost in Fig. 4(a). The top curve shows the value of the PA criterion for budgets of $B$ posts, i.e., we optimize PA such that the selected blogs can have at most $B$ posts total. Note that under the unit cost model, CELF chooses expensive blogs with many posts. For example, to obtain the same PA objective value, one needs to read 10,710 posts under the unit cost model. The NP cost model achieves the same score while reading just 1,500 posts. Thus, optimizing the benefit-cost ratio (PA/cost) leads to drastically improved performance.

Figure 4: (a) Comparison of the unit and the number-of-posts cost models. (b) For a fixed value of PA, we get multiple solutions varying in costs.

Figure 5: Heuristic blog selection methods under (a) the unit cost model and (b) the number-of-posts cost model.

Interestingly, the solutions obtained under the NP cost model are very different from the unit cost model. Under NP, political blogs are not chosen anymore, but rather summarizers (e.g., […]) are important. Blogs selected under NP cost appear about 3 days later in the cascade than those selected under unit cost, which further suggests that summarizer blogs tend to be chosen under the NP model.

In practice, the cost of reading a blog is not simply proportional to the number of posts, since we also need to navigate to the blog (which takes constant effort per blog). Hence, a combination of unit and NP cost is more realistic. Fig. 4(b) interpolates between these two cost models. Each curve shows the solutions with the same value $R$ of the PA objective, but using a different number of posts (x-axis) and blogs (y-axis) each. For a given $R$, the ideal spot is the one closest to the origin, which means that we want to read the least number of posts from the fewest blogs to obtain the desired score $R$. Only at the end points does CELF tend to pick extreme solutions: few blogs with many posts, or many blogs with few posts. Note that there is a clear knee on the plots of Fig. 4(b), which means that by only slightly increasing the number of blogs we allow ourselves to read, the number of posts needed decreases drastically, while still maintaining the same value of the objective function.

5.5 Comparison to heuristic blog selection

Next, we compare our method with several intuitive heuristic selection techniques. For example, instead of optimizing the DT, DL or PA objective function using CELF, we may […]
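The effect of the cost model discussed in Section 5.4 can be illustrated with a small sketch. It is a minimal, non-lazy rendering of the selection rule from Section 3.1 (run the benefit/cost greedy and the unit-cost greedy, then keep the better placement), not the authors' implementation. The blog names, post counts, cascade IDs and the function names `greedy`, `better_of_two` and `coverage` are hypothetical, and the reward is assumed to be the number of distinct cascades detected.

```python
from typing import Callable, Dict, Set

Reward = Callable[[Set[str]], float]


def greedy(reward: Reward, cost: Dict[str, float], budget: float,
           use_ratio: bool) -> Set[str]:
    """Greedy selection under a budget (costs are assumed to be positive).

    use_ratio=True ranks candidates by marginal gain / cost (benefit/cost rule);
    use_ratio=False ranks by marginal gain alone (unit-cost rule).
    """
    chosen: Set[str] = set()
    spent = 0.0
    while True:
        base = reward(chosen)
        best, best_score = None, 0.0
        for s in cost:  # dict order makes tie-breaking deterministic
            if s in chosen or spent + cost[s] > budget:
                continue  # skip selected nodes and nodes that no longer fit
            gain = reward(chosen | {s}) - base
            score = gain / cost[s] if use_ratio else gain
            if score > best_score:
                best, best_score = s, score
        if best is None:  # nothing affordable improves the reward
            return chosen
        chosen.add(best)
        spent += cost[best]


def better_of_two(reward: Reward, cost: Dict[str, float],
                  budget: float) -> Set[str]:
    """Run both greedy variants and return the higher-scoring placement."""
    a_gcb = greedy(reward, cost, budget, use_ratio=True)
    a_guc = greedy(reward, cost, budget, use_ratio=False)
    return a_gcb if reward(a_gcb) >= reward(a_guc) else a_guc


# Hypothetical toy data: cascades each blog would detect, and its post count.
detects = {"big": {1, 2, 3, 4}, "small_a": {1, 2}, "small_b": {3, 4, 5}}
posts = {"big": 10.0, "small_a": 2.0, "small_b": 3.0}


def coverage(blogs: Set[str]) -> float:
    """Number of distinct cascades detected (a submodular set function)."""
    return float(len(set().union(*(detects[b] for b in blogs))))


best = better_of_two(coverage, posts, budget=10.0)
print(sorted(best), coverage(best))
```

With a budget of 10 posts in this toy setting, the unit-cost greedy spends the whole budget on the large blog and detects 4 cascades, while the benefit/cost greedy reads the two small blogs and detects all 5 cascades for 5 posts, which mirrors the qualitative finding of Section 5.4.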
6. CASE STUDY 2: WATER NETWORKS

6.1 Experimental setup

In the water distribution system application, we used the data and rules introduced by the Battle of Water Sensor Networks (BWSN) challenge [19]. We considered both the small network on 129 nodes (BWSN1), and a large, realistic, 12,527 node distribution network (BWSN2) provided as part of the BWSN challenge. In addition we also consider a third water distribution network (NW3) of a large metropolitan area in the United States. The network (not including the household level) contains 21,000 nodes and 25,000 pipes (edges). To our knowledge, this is the largest water distribution network considered for sensor placement optimization so far. The networks consist of a static description (junctions and pipes) and dynamic parameters (time-varying water consumption demand patterns at different nodes, opening and closing of valves, pumps, tanks, etc.).

6.2 Objective functions

In the BWSN challenge, we want to select a set of 20 sensors, simultaneously optimizing the objective functions DT, PA and DL, as introduced in Section 2.2. To obtain cascades we use a realistic disease model defined by [19], which depends on the demands and the contaminant concentration at each node. In order to evaluate these objectives, we use the EPANET simulator [24], which is based on a physical model to provide realistic predictions on the detection time and concentration of contaminant for any possible contamination event. We consider simulations of 48 hours length, with 5 minute simulation timesteps. Contaminations can happen at any node and any time within the first 24 hours, and spread through the network according to the EPANET simulation. The time of the outbreak is important, since water consumption varies over the day and the contamination spreads at different rates depending on the time of the day. Altogether, we consider a set of 3.6 million possible contamination scenarios, and each of these is associated with a "cascade" of contaminant spreading over the network.

6.3 Solution quality

We first used CELF to optimize placements of increasing size, according to the three criteria DL, DT, PA. We again normalized the scores to be between 0 and 1, where 1 is the best achievable score when placing sensors at every node.

Fig. 8(a) presents the CELF score, the offline and online bounds for the PA objective on the BWSN2 network. Consistently with the blog experiments, the online bound is much tighter than the offline bound, and the solutions obtained by our algorithm are very close to the optimum.

Fig. 8(b) shows CELF's performance on all 3 objective functions. Similarly to the blog data, the population affected (PA) score increases very quickly. The reason is that most contamination events only impact a small fraction of the network. Using few sensors, it is relatively easy to detect most of the high impact outbreaks. However, if we want to detect all scenarios, we need to place a large number of sensors (2,263 in our experiment). Hence, the DL (and correspondingly DT) increase more slowly than PA.

Figure 8: (a) CELF with offline and online bounds for the PA objective. (b) Different objective functions.

Fig. 9 shows two 20-sensor placements after optimizing DL and PA respectively on BWSN2. When optimizing the population affected (PA), the placed sensors are concentrated in the dense high-population areas, since the goal is to detect outbreaks which affect the population the most. When optimizing the detection likelihood, the sensors are uniformly spread out over the network. Intuitively this makes sense, since according to the BWSN challenge [19], outbreaks happen with the same probability at every node. So, for DL, the placed sensors should be as close to all nodes as possible.

Figure 9: Water network sensor placements: (a) when optimizing PA, sensors are concentrated in high population areas; (b) when optimizing DL, sensors are uniformly spread out.

We also compared the scores achieved by CELF with several heuristic sensor placement techniques, where we order the nodes by some "goodness" criteria, and then pick the top nodes. We consider the following criteria: population at the node, water flow through the node, and the diameter and the number of pipes at the node. Fig. 11(a) shows the results for the PA objective function. CELF outperforms the best heuristic by 45%. The best heuristics select nodes at random, by degree or by their population. We see that heuristics perform poorly, since nodes which are close in the graph tend to have similar flow, diameter and population, and hence the sensors will be spread out too little. Even the maximum over one hundred random trials performs far worse than CELF.

6.4 Multicriterion optimization

Using the theory developed in Section 2.4, we traded off different objectives for the water distribution application. We selected pairs of objectives, e.g., DL and PA, and varied the weights to produce (approximately) Pareto-optimal solutions. In Fig. 10(a) we plot the trade-off curves for different placement sizes $k$. By adding more sensors, both objectives DL and PA increase. The curves also show that if we, e.g., optimize for DL, the PA score can be very low. However, there are points which achieve near-optimal scores in both criteria (the knee in the curve). This sweet spot is what we aim for in multi-criteria optimization.

We also traded off the affected population PA and a fourth criterion defined by BWSN, the expected consumption of contaminated water. Fig. 10(b) shows the trade-off curve for this experiment. Notice that the curves (almost) collapse to points, indicating that these criteria are highly correlated, which we expect for this pair of objective functions. Again, the efficiency of our implementation allows us to quickly generate and explore these trade-off curves, while maintaining strong guarantees about near-optimality of the results.

Figure 10: (a) Trading off PA and DL. (b) Trading off PA and consumed contaminated water. Curves are shown for k = 2, 3, 5, 10, 20, 35 and 50.

Figure 11: (a) Solutions of CELF outperform heuristic approaches (degree, random, population, diameter, flow). (b) Running time of exhaustive search, greedy and CELF.

6.5 Scalability

In the water distribution setting, we need to simulate 3.6 million contamination scenarios, each of which takes approximately 7 seconds and produces 14 KB of data. Since most computer cluster scheduling systems break if one submits 3.6 million jobs into the queue, we developed a distributed architecture, where the clients obtain simulation parameters and then confirm the successful completion of the simulation. We ran the simulation for a month on a cluster of around 40 machines. This produced 152 GB of outbreak simulation data. By exploiting the properties of the problem described in Section 4, the size of the inverted index (which represents the relevant information for evaluating placement scores) is reduced to 16 GB, which we were able to fit into the main memory of a server. The fact that we could fit the data into main memory alone sped up the algorithms by at least a factor of 1000.

Fig. 11(b) presents the running times of CELF, the naive greedy algorithm and exhaustive search (extrapolated). We can see that CELF is 10 times faster than the greedy algorithm when placing 10 sensors. Again, a drastic speedup.

7. DISCUSSION AND RELATED WORK

7.1 Relationship to Influence Maximization

In [10], a Triggering Model was introduced for modeling the spread of influence in a social network. As the authors show, this model generalizes the Independent Cascade, Linear Threshold and Listen-once models commonly used for modeling the spread of influence. Essentially, this model describes a probability distribution over directed graphs, and the influence is defined as the expected number of nodes reachable from a set of nodes, with respect to this distribution. Kempe et al. showed that the problem of selecting a set of nodes with maximum influence is submodular, satisfying the conditions of Theorem 2, and hence the greedy algorithm provides a $(1 - 1/e)$ approximation. The problem addressed in this paper generalizes this Triggering Model:

Theorem 5. The Triggering Model [10] is a special case of our network outbreak detection problem.

In order to prove Theorem 5, we consider fixed directed graphs sampled from the Triggering distribution. If we reverse the arcs in any such graph, then our PA objective corresponds exactly to the influence function of [10] applied to the original graph. Details of the proof can be found in [15].

Theorem 5 shows that spreading influence under the general Triggering Model is a special case of our outbreak detection formalism. The problems are fundamentally related since, when spreading influence, one tries to affect as many nodes as possible, while when detecting outbreaks, one wants to minimize the effect of an outbreak in the network. Secondly, note that in the example of reading blogs, it is not necessarily a good strategy to affect nodes which are very influential, as these tend to have many posts, and hence are expensive to read. In contrast to influence maximization, the notion of cost-benefit analysis is crucial to our applications.

7.2 Related work

Optimizing submodular functions. The fundamental result about the greedy algorithm for maximizing submodular functions in the unit-cost case goes back to [17]. The first approximation results about maximizing submodular functions in the non-constant cost case were proved by [25]. They developed an algorithm with approximation guarantee of $(1 - 1/e)$, which however requires a number of function evaluations $\Omega(B|V|^4)$ in the size of the ground set $V$ (if the lowest cost is constant). In contrast, the number of evaluations required by CELF is $O(B|V|)$, while still providing a constant factor approximation guarantee.

Virus propagation and outbreak detection. Work on spread of diseases in networks and immunization mostly focuses on determining the value of the epidemic threshold, a critical value of the virus transmission probability above which the virus creates an epidemic. Several strategies for immunization have also been proposed: uniform node immunization, targeted immunization of high degree nodes [20] and acquaintance immunization, which focuses on highly connected nodes [5]. In the context of our work, uniform immunization corresponds to randomly placing sensors in the network. Similarly, targeted immunization corresponds to selecting nodes based on their in- or out-degree. As we have seen in Figures 5 and 11, both strategies perform worse than direct optimization of the population affected criterion.

Information cascades and blog networks. Cascades have been studied for many years by sociologists concerned with the diffusion of innovation [23]; more recently, cascades were used for studying viral marketing [8, 14], selecting trendsetters in social networks [21], and explaining trends in blogspace [9, 13]. Studies of blogspace either spend effort mining topics from posts [9] or consider only the properties of blogspace as a graph of unlabeled URLs [13]. Recently, [16] studied the properties and models of information cascades in blogs. While previous work either focused on empirical analyses of information propagation and/or provided models for it, we develop a general methodology for node selection in networks while optimizing a given criterion.
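The online bound of Theorem 4 (Section 3.2) lends itself to a short sketch: compute the marginal gain of every unused location, sort by gain per unit cost, and fill the budget in that order, counting only a fraction of the first location that no longer fits. The code below is one possible reading of that theorem under stated assumptions, not the paper's Algorithm 2; the function name `online_bound`, the toy blogs and their cascades are hypothetical, and costs are assumed to be positive.

```python
from typing import Callable, Dict, Set


def online_bound(reward: Callable[[Set[str]], float], cost: Dict[str, float],
                 budget: float, placement: Set[str]) -> float:
    """Upper bound on the best reward achievable within `budget`."""
    base = reward(placement)
    # Marginal gain delta_s and ratio r_s = delta_s / c(s) per unused location.
    items = []
    for s, c in cost.items():
        if s not in placement:
            delta = reward(placement | {s}) - base
            items.append((delta / c, delta, c))
    items.sort(key=lambda t: t[0], reverse=True)  # decreasing r_s

    bound, remaining = base, budget
    for _ratio, delta, c in items:
        if c <= remaining:   # locations s_1 .. s_{k-1} fit entirely
            bound += delta
            remaining -= c
        else:                # s_k contributes only the fraction lambda
            bound += delta * remaining / c
            break
    return bound


# Hypothetical toy data: cascades each blog would detect, and its post count.
detects = {"big": {1, 2, 3, 4}, "small_a": {1, 2}, "small_b": {3, 4, 5}}
posts = {"big": 10.0, "small_a": 2.0, "small_b": 3.0}


def coverage(blogs: Set[str]) -> float:
    """Number of distinct cascades detected (a submodular set function)."""
    return float(len(set().union(*(detects[b] for b in blogs))))


print(coverage({"big"}), online_bound(coverage, posts, 10.0, {"big"}))
```

In this toy setting the placement containing only the large blog scores 4 while the bound evaluates to 5, so that placement is certified to reach at least 80% of the optimum without the optimum ever being computed. The bound depends only on the placement, not on the algorithm that produced it.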