



FUNNEL: Automatic Mining of Spatially Coevolving Epidemics

Yasuko Matsubara†, Yasushi Sakurai†, Willem G. van Panhuis§, Christos Faloutsos‡
† Dept. of Computer Science and Electrical Engineering, Kumamoto University
§ Dept. of Epidemiology, University of Pittsburgh
‡ Dept. of Computer Science, Carnegie Mellon University
{yasuko,yasushi}@cs.kumamoto-u.ac.jp, wav10@pitt.edu, christos@cs.cmu.edu

ABSTRACT

Given a large collection of epidemiological data consisting of the count of $d$ contagious diseases for $l$ locations of duration $n$, how can we find patterns, rules and outliers? For example, Project Tycho provides open access to the count of infections for U.S. states from 1888 to 2013, for 56 contagious diseases (e.g., measles, influenza), which include missing values, possible recording errors, sudden spikes (or dives) of infections, etc. So how can we find a combined model for all these diseases, locations, and time-ticks?

In this paper, we present FUNNEL, a unifying analytical model for large-scale epidemiological data, as well as a novel fitting algorithm, FUNNELFIT, which solves the above problem. Our method has the following properties: (a) Sense-making: it detects important patterns of epidemics, such as periodicities, the appearance of vaccines, external shock events, and more; (b) Parameter-free: our modeling framework frees the user from providing parameter values; (c) Scalable: FUNNELFIT is carefully designed to be linear on the input size; (d) General: our model is general and practical, and can be applied to various types of epidemics, including computer-virus propagation, as well as human diseases.

Extensive experiments on real data demonstrate that FUNNELFIT does indeed discover important properties of epidemics: (P1) disease seasonality, e.g., influenza spikes in January, Lyme disease spikes in July and the absence of yearly periodicity for gonorrhea; (P2) disease reduction effect, e.g., the appearance of vaccines; (P3) local/state-level sensitivity, e.g., many measles cases in NY; (P4) external shock events, e.g., historical flu pandemics; (P5) detection of incongruous values, i.e., data reporting errors.

Categories and Subject Descriptors: H.2.8 [Database management]: Database applications–Data mining

Keywords: Epidemics; Time-series; Automatic mining

1. INTRODUCTION

Given
a huge collection of co-evolving epidemic time-series, such as measles and influenza, how can we find typical patterns or anomalies, and statistically summarize all the epidemic sequences? In this paper, we present a unifying model, namely FUNNEL, which provides a good description of large collections of epidemiological data.¹ Intuitively, the problem we wish to solve is as follows:

INFORMAL PROBLEM 1. Given a large collection of epidemiological data, which consists of $d$ diseases in $l$ locations of duration $n$, with missing values and recording errors, we want to
- find basic patterns of diseases (e.g., seasonality)
- find extra patterns (e.g., outbreaks, sudden drops)
- detect anomalies (i.e., possible errors)

Uncovering the mechanisms and patterns of contagious diseases is an important and challenging task for public health scientists and policy makers. In this paper, we study a publicly available resource of epidemiological data: Tycho [32], which contains the count of infections of 56 diseases in the U.S., covering over 125 years on a weekly basis.²

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
KDD'14, August 24–27, 2014, New York, NY, USA.
Copyright 2014 ACM 978-1-4503-2956-9/14/08...$15.00.
http://dx.doi.org/10.1145/2623330.2623624

Preview of our results. Figure 1(a) shows the number of measles cases in the United States from 1928 to 1982, as gray circles, and our fitted model, as a solid red line. The sequence has a clear yearly periodicity, but also characteristic bi- and triennial patterns resulting in alternating large (1941, 1958) and small (1940, 1947) epidemic years [22], which is known as the "skip" phenomenon [29]. It should also be noted that the number of cases suddenly dropped in 1965. This was achieved through the vaccination program that started in 1963.
Figure 1(b) shows the potential population of susceptibles for measles for each state in the U.S. The top three states in this respect are NY, PA, and CA. Figure 1(c) shows a scatter plot of the seasonality strength vs. the peak season for each disease. We determined four categories of diseases in terms of (1) the strength of annual periodicity (radius) and (2) their phase difference, i.e., the month in which they peak (angle). For example, we found previously characterized epidemic peaks for influenza in January-February, for respiratory childhood diseases (e.g., measles) in the spring [27], and for tick-borne diseases (e.g., Lyme disease) in the summer [28]; there is no periodicity for sexually transmitted diseases (e.g., gonorrhea). We can capture these important patterns in epidemic data with our proposed model, simply by changing its parameters. More importantly, our method is fully automatic, that is, it provides a good description of a large collection of epidemiological data, without user intervention, prior training, or parameter tuning.

Contrast with competitors. Table 1 illustrates the relative advantages of our method. Only our approach has checks against all entries, while
- The SI model (and SIR, SIRS, etc.) can compress the data into a fixed number of parameters and capture the dynamics of epidemiological data; however, it cannot describe periodic patterns, and is incapable of forecasting.
¹ Available at http://www.cs.kumamoto-u.ac.jp/~yasuko/software.html
² Project Tycho at University of Pittsburgh: http://www.tycho.pitt.edu/

[Figure 1 panels: (a) Fitting result of FUNNELFIT (measles), count vs. year in linear and log scales; (b) Potential population of susceptibles (measles); (c) Seasonality strength (radius) vs. peak season (angle)]

Figure 1: Modeling power of FUNNELFIT: (a) the original number of measles cases (gray dots), and our model (red lines). It captures the yearly cycle, external spikes, and vaccination starting in 1963, as well as (b) the local sensitivity (e.g., many patients in NY, PA, CA, TX); (c) the scatter plot of the seasonality strength vs. the peak season; (angle): the peak month for each disease and (radius): the strength of the fluctuation, e.g., influenza peaks every winter, measles in the spring, and no periodicity for gonorrhea.

Table 1: Capabilities of approaches. Only our approach meets all specifications.
                    SIRS   AR/PLiF   PARAFAC   FUNNELFIT
Compression          ✓       ✓         ✓          ✓
Domain knowledge     ✓                            ✓
Missing values               ✓                    ✓
Periodicity                  ✓                    ✓
Forecasting                  ✓                    ✓
Parameter free                                    ✓

- The autoregression (AR) model and PLiF [15] have the ability to compress and forecast sequences, but they are fundamentally unsuitable for epidemic data, and cannot capture the non-linear patterns of virus propagation.
- Our epidemic data can be turned into a tensor. PARAFAC is capable of compression, but it cannot handle missing values, periodicity, or forecasting.
Most importantly, none of the above are parameter-free methods.

Contributions. Our method has the following desirable properties:
1. Sense-making: thanks to our modeling framework, our method can provide an intuitive explanation for epidemics, such as the seasonality of diseases, vaccination, and external shocks. It matches the behavior of various types of contagious diseases, such as measles, influenza, and smallpox.
2. Automatic: it is fully automatic, requiring no human intervention. Our algorithm is theoretically founded on the idea of minimizing the cost of the resulting modeling.
3. Scalable: it scales linearly with the input size.
4. Generality: it includes earlier patterns and models as special cases (e.g., SIRS), and it can be applied to various types of epidemic data including computer virus infections.

Outline. The rest of the paper is organized in the conventional way: next we describe related work, followed by our proposed model and algorithms, experiments, discussion and conclusions.

2. RELATED WORK

We provide a survey of the related literature, which falls broadly into two categories: (1) epidemiology, and (2) pattern discovery in time series.

Epidemiology. The canonical textbook for epidemiological models including SI/SIR is Anderson and May [2]. Grenfell et al. [9] studied the recurrent travelling waves for measles, while the work in [8] explained the complex dynamical transitions in epidemics. Stone et al. [29] studied the seasonal dynamics of recurrent epidemics including measles, and identified a new threshold for predicting the occurrence of either a future epidemic, or a 'skip' (i.e., a year in which an epidemic fails to initiate). Van Panhuis et al. [32
] digitized the entire history of weekly Nationally Notifiable Disease Surveillance Reports for the U.S. from 1888 to 2013.

Pattern discovery in time series. In recent years, there has been an explosion of interest in mining time series [4, 6, 21, 16]. Traditional approaches applied to data mining include auto-regression (AR), linear dynamical systems (LDS), Kalman filters (KF) and their variants [10, 15, 31]. Similarity search and pattern discovery in time sequences have also attracted huge interest [33, 13, 30, 26, 7]. Regarding large-scale time-series mining, TriMine [18] is a scalable method for forecasting co-evolving multiple (thousands of) sequences, while [17] developed a fully automatic mining algorithm for co-evolving time sequences. Rakthanmanon et al. [25] proposed a similarity search algorithm for "trillions of time series" under the DTW distance. Recently, analyses of epidemics, social media, propagation and the cascades they create have attracted much interest [24, 12, 14, 23, 19].

However, none of these methods specifically focused on automatic mining of non-linear dynamics in coevolving epidemics.

3. PROPOSED MODEL

In this section we present our proposed model.

3.1 Design philosophy of FUNNEL

Data description. The Project Tycho [32] covers more than a century of weekly surveillance reports of nationally notifiable diseases (56 diseases in total) for all 50 states in the U.S., from 1888 to the present, with 87,950,807 reported individual cases for diseases. This dataset consists of tuples of the form: (disease, location, timestamp). We then have a collection of entries with $d$ unique diseases, and $l$ states, with duration $n$ (on a weekly basis). We can treat this set of $d \times l$ epidemic sequences as a 3rd-order tensor, i.e., $\mathcal{X} \in \mathbb{N}^{d \times l \times n}$, where the element $x_{ij}(t)$ of $\mathcal{X}$ shows the total number of entries of the $i$-th disease in the $j$-th state at time-tick $t$. For example, ('measles', 'PA', 'April 1-7, 1931'; 4740) means that the number of cases due to 'measles' in 'PA' on 'April 1-7 in 1931' is '4740'. We refer to each sequence of the $i$-th disease in the $j$-th state, $x_{ij} = \{x_{ij}(t)\}_{t=1}^{n}$, as a "local/state"-level epidemic sequence. Similarly, we can turn these local sequences into "global/country"-level epidemics, $x_i = \{x_i(t)\}_{t=1}^{n}$, where $x_i(t)$ shows the total count of the $i$-th disease at time-tick $t$, i.e., $x_i(t) = \sum_{j=1}^{l} x_{ij}(t)$.

[Figure 2 panels: (a) Influenza (in CA, TX, VA), 1942-1946; (b) Measles (in CA, NY, PA), 1931-1935. Top: number of cases (log) vs. temperature (F); bottom: cases (log) and temperature vs. year]

Figure 2: The air temperature vs. # of cases: (a) influenza is completely anti-correlated with the air temperature (i.e., peaking in the winter), while (b) measles also has strong periodicity, but it peaks in the spring (i.e., with a phase shift).

Preliminary observations. Here, we provide the reader with several important observations. Figure 2 shows the scatter plots (top) and sequence plots (bottom) of the original local-level sequences of influenza and measles counts in three states, versus the average air temperature for five years.
³ In Figure 2(a), influenza cases are strongly anti-correlated with the air temperature, corresponding to influenza epidemics in colder seasons. On the other hand, for measles (Figure 2(b)), the scatter plot exhibits characteristic loop shapes, which indicates that there is a phase shift of measles vs. temperature; in fact, measles peaks in the spring. As shown in Figure 1(c), there are several groups of infectious diseases with specific seasonal patterns, including children's diseases (e.g., measles, mumps) in the spring, and tick-borne diseases (e.g., Lyme disease) in the summer. Consequently we have:

OBSERVATION 1 (DISEASE SEASONALITY). Many diseases have yearly cycles with different phases, that is, they are correlated with air temperature and the seasons.

The next observation refers to the abrupt decline of several diseases. Luckily, many diseases have been eradicated or significantly reduced over the last century, through various factors including vaccination, sanitation and antibiotics. For example, in Figure 1(a), the number of measles cases has been decreasing since the vaccination program was introduced in 1963. We will collectively refer to such abrupt declines as disease reduction effects.

OBSERVATION 2 (DISEASE REDUCTION EFFECT). Many infectious diseases have been reduced or eliminated through vaccination programs, antibiotics, sanitation, etc.

Next, let us look at the topic from a local point of view. In Figure 2, three local sequences are correlated with each other, but with different fractions of patients, which correspond to the number of susceptible people in each state. For example, measles mainly affects children, and so, the more children there are, the more cases of measles there will be (see NY, PA, CA, TX in Figure 1(b)).
OBSERVATION 3 (AREA SPECIFICITY AND SENSITIVITY). For each disease, neighbors are correlated with different sensitivity.

³ National climate data center: http://www.ncdc.noaa.gov/cag/

Table 2: Symbols and definitions

Symbol          Definition
$d$             Number of diseases
$l$             Number of states (i.e., locations)
$n$             Duration of sequences
$\mathcal{X}$   3rd-order tensor ($\mathcal{X} \in \mathbb{N}^{d \times l \times n}$)
$x_{ij}$        Local-level epidemic sequence of disease $i$ in state $j$
$x_i$           Global-level epidemic sequence of disease $i$
$S_{ij}(t)$     Count of susceptibles of disease $i$ in state $j$ at time $t$
$I_{ij}(t)$     Count of infectives of disease $i$ in state $j$ at time $t$
$V_{ij}(t)$     Count of vigilants of disease $i$ in state $j$ at time $t$
$\mathbf{B}$    Base matrix ($d \times 6$), i.e., $\mathbf{B} = \{b_1, \ldots, b_d\}$
$\mathbf{R}$    Disease reduction matrix ($d \times 2$), i.e., $\mathbf{R} = \{r_1, \ldots, r_d\}$
$\mathbf{N}$    Geo-disease matrix ($d \times l$), i.e., $\mathbf{N} = \{N_{ij}\}_{i,j=1}^{d,l}$
$\mathcal{E}$   External shock tensor, i.e., $\mathcal{E} = \{\mathbf{E}^{(D)}, \mathbf{E}^{(T)}, \mathbf{E}^{(S)}\}$
$\mathcal{M}$   Mistake tensor, i.e., $\mathcal{M} = \{m_{ij}(t)\}_{i,j,t=1}^{d,l,n}$
$\mathcal{F}$   Complete set of FUNNEL, i.e., $\mathcal{F} = \{\mathbf{B}, \mathbf{R}, \mathbf{N}, \mathcal{E}, \mathcal{M}\}$

The last two observations are the extra properties of epidemics. Figure 1(a) shows large outbreaks of measles in 1941 and 1958, while Figure 2(a) shows two large flu pandemics in 1944 and 1946.

OBSERVATION 4 (EXTERNAL SHOCK EVENTS). There are some extreme spikes, representing major events such as historical flu pandemics.

Basically, real-world datasets are subject to quality constraints such as typing errors and incorrect reports (we refer to them as "mistakes").

OBSERVATION 5 (MISTAKES). There are some implausible spikes, which are completely independent of the dynamics of epidemic patterns.

Summary. In this paper, we propose a new model, namely FUNNEL, which tries to incorporate all the above important properties that we observed in the real epidemic data. Consequently, we would like to capture the following properties:
- (P1): yearly periodicity
- (P2): disease reduction effects
- (P3): area specificity and sensitivity
- (P4): external shock events
- (P5): mistakes, incorrect values
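The data description in Section 3.1 (tuples turned into a tensor $\mathcal{X}$, and local sequences summed into global ones) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `build_tensor` and `global_sequences`, the use of NumPy, the integer week index, and every count except the 'measles'/'PA'/4740 entry quoted in the text are assumptions. Missing reports are simply left at zero here, which is a simplification.

```python
import numpy as np

def build_tensor(records, diseases, states, n_weeks):
    """Accumulate (disease, state, week_index, count) records into X[d, l, n]."""
    d_index = {name: i for i, name in enumerate(diseases)}
    s_index = {name: j for j, name in enumerate(states)}
    X = np.zeros((len(diseases), len(states), n_weeks), dtype=np.int64)
    for disease, state, week, count in records:
        X[d_index[disease], s_index[state], week] += count
    return X

def global_sequences(X):
    """x_i(t) = sum over states j of x_ij(t): collapse the state axis."""
    return X.sum(axis=1)

records = [
    ("measles", "PA", 0, 4740),    # example entry from the text, placed at week index 0
    ("measles", "NY", 0, 1000),    # made-up count, for illustration only
    ("influenza", "PA", 1, 12),    # made-up count, for illustration only
]
X = build_tensor(records, ["measles", "influenza"], ["NY", "PA"], n_weeks=3)
x_global = global_sequences(X)     # shape (d, n): one country-level sequence per disease
```

Keeping the state axis in the middle mirrors the $d \times l \times n$ layout of $\mathcal{X}$, so the global-level sequences are a single sum over axis 1.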
For simplicity, let's focus on a simple step first, where (a) we assume that we are given a single epidemic sequence, say, the number of measles cases in NY. We then (b) extend our model to multiple co-evolving epidemics, that is, to capture the individual patterns of $d$ diseases in $l$ states.

3.2 FUNNEL - with a single epidemic

We begin with the simplest case, where we assume that we are given a single epidemic sequence.

3.2.1 Base model - FUNNEL-BASE

The model we propose has nodes (= people) of three classes:
- Susceptible: nodes in this class can get infected by any neighboring node who is infectious.
- Infected: nodes who have been infected and are capable of transmitting the infection to those in the susceptible class.
- Vigilant (i.e., recovered/immune): nodes in this class cannot get infected nor can they cause infections.

Figure 3(a) shows a diagram of our base model, where $\beta(t)$ represents the rate of effective contacts between infected and susceptible individuals; $\delta$ is the rate at which infected individuals recover; $\gamma$ is the immunization loss probability for a recovered or vigilant individual.⁴ More importantly, to handle the first property of epidemics (P1), we assume that the infection rate $\beta(t)$ is a periodic function of time $t$. We refer to it as FUNNEL-BASE.

[Figure 3 panels: (a) FUNNEL-BASE; (b) FUNNEL-RE]

Figure 3: FUNNEL diagrams: there are three classes - susceptible (i.e., healthy, but can get infected), infected (i.e., capable of transmission), vigilant (i.e., healthy, and cannot get infected).

MODEL 1 (FUNNEL-BASE). Let $S(t)$, $I(t)$, $V(t)$ be the number of susceptible, infected, vigilant people at time-tick $t$. Our base model is governed by the following equations:

$$
\begin{aligned}
S(t+1) &= S(t) - \beta(t) S(t) I(t) + \gamma V(t) \\
I(t+1) &= I(t) + \beta(t) S(t) I(t) - \delta I(t) \\
V(t+1) &= V(t) + \delta I(t) - \gamma V(t)
\end{aligned}
\tag{1}
$$

where $\beta(t) = \beta_0 \cdot \left(1 + P_a \cdot \cos\left(\frac{2\pi}{P_p}(t + P_s)\right)\right)$, $P_p = 52$,⁵ and we have the invariant $N = S(t) + I(t) + V(t)$, with initial conditions $S(1) = N - 1$, $I(1) = 1$, $V(1) = 0$.

Consequently, FUNNEL-BASE consists of a set of the following parameters: $b = \{N, \beta_0, \delta, \gamma, P_a, P_s\}$, specifically,
- $N$: Potential population of the disease. $N$ is composed of susceptible, infected and vigilant individuals.
- $\beta_0$: Rate of effective contacts between infected and susceptible individuals averaged over the year.
- $\delta$: Healing rate of the disease.
- $\gamma$: Forgetting rate of the disease.
- $P_a$: Amplitude of the fluctuation; specifically, it gives the relative value of the peak/off-season.
- $P_s$: Phase shift of the seasonal cycle.

3.2.2 With disease reduction - FUNNEL-R

With respect to the second property (P2), we also introduce an essential concept, namely, the "disease reduction" effect.

MODEL 2 (FUNNEL-R). We add a disease reduction rate, $\theta(t)$, to capture the effect of the disease reduction program, that is,

$$
\begin{aligned}
S(t+1) &= S(t) - \beta(t) S(t) I(t) + \gamma V(t) - \theta(t) S(t) \\
I(t+1) &= I(t) + \beta(t) S(t) I(t) - \delta I(t) \\
V(t+1) &= V(t) + \delta I(t) - \gamma V(t) + \theta(t) S(t)
\end{aligned}
\tag{2}
$$

where the disease reduction program started at time $t_\theta$ and $\theta(t)$ is defined as:

$$
\theta(t) =
\begin{cases}
0 & (t < t_\theta) \\
\theta_0 & (t \ge t_\theta)
\end{cases}
$$

The model is identical to FUNNEL-BASE, with the addition of the disease reduction factor, $\theta(t)$, which corresponds to the direct immunization probability when susceptible (see Figure 3(b)). Note that this effect is due to vaccination, antibiotics and any other anti-disease factors. Hereafter, we simply say the "disease reduction effect", unless otherwise specified.

In addition to the base parameters $b$, FUNNEL-R requires a set of two parameters, $r = \{t_\theta, \theta_0\}$, where
- $t_\theta$: Starting time of the disease reduction effect.
- $\theta_0$: Diffusion rate of the disease reduction effect.

⁴ This factor also incorporates the birth and mortality rate.
⁵ We have 52 time-ticks (weeks) in one year.

3.2.3 With external shocks - FUNNEL-RE

Next, with respect to the property (P4), we assume that there are external shock events, such as flu pandemics. So how do we go about capturing such unexpected patterns? Assume that there is a swine flu pandemic. In this situation, many more people in the susceptible class would become infected than in previous years. An elementary concept we need to introduce is the temporal susceptible rate $\epsilon(t)$.
Figure 3(b) describes how this is done. The idea is that the number of susceptibles $S(t)$ is the count of victims available for infection, and if there is an external shock event at time-tick $t$, the virus attacks are much stronger than usual, each victim-attack pair would lead to a new victim, and this will eventually cause a major pandemic.

MODEL 3 (FUNNEL-RE). Our full model can be described as the following equations:

$$
\begin{aligned}
S(t+1) &= S(t) - \beta(t)\epsilon(t) S(t) I(t) + \gamma V(t) - \theta(t) S(t) \\
I(t+1) &= I(t) + \beta(t)\epsilon(t) S(t) I(t) - \delta I(t) \\
V(t+1) &= V(t) + \delta I(t) - \gamma V(t) + \theta(t) S(t)
\end{aligned}
\tag{3}
$$

In addition, we introduce the temporal susceptible rate, $\epsilon(t)$, which is defined as follows:

$$
\epsilon(t) = 1 + \sum_{i=1}^{k} f(t; e^{(T)}_i), \qquad
f(t; e^{(T)}) =
\begin{cases}
\epsilon_0 & (t_\mu \le t \le t_\mu + t_\sigma) \\
0 & (\text{else})
\end{cases}
$$

where $k$ is the number of shocks, and if $k = 0$, then $\epsilon(t) = 1$. Here, each external shock consists of $e^{(T)} = \{t_\mu, t_\sigma, \epsilon_0\}$, i.e.,
- $t_\mu$: Central time point of the external shock event.
- $t_\sigma$: Duration of the event.
- $\epsilon_0$: Strength of the external shock effect.

3.3 FUNNEL - with multi-evolving epidemics

So far we have seen how FUNNEL captures the dynamics of a single epidemic sequence. The next question is, "how can we apply FUNNEL to multiple co-evolving epidemics in $\mathcal{X}$, and capture the individual behavior of $d$ diseases in $l$ states?"

We want to estimate the parameter set of FUNNEL for each individual epidemic sequence in $\mathcal{X}$. The straightforward solution would be to consider a set of $(d \times l)$ sequences of length $n$ generated from $\mathcal{X} = \{x_{ij}\}_{i,j=1}^{d,l}$ (i.e., "local-level" epidemic sequences), and estimate the parameter set $\{b, r, e^{(T)}\}$ for each sequence. However, some of the (disease, state) pairs have very sparse sequences (e.g., Lyme disease in Alaska), which derails the fitting result. Also, we are interested in capturing global/country-level patterns, as well as local/state-level trends. So how can we deal with this issue? We thus propose "sharing" the global-level parameters for all $l$ states, to achieve much better modeling.

FUNNEL - full model parameter set. Our goal is to extract the main trends and external patterns of co-evolving epidemics $\mathcal{X} \in \mathbb{N}^{d \times l \times n}$, and make a good representation of $\mathcal{X}$.
Figure 4 shows our modeling framework. Given epidemic data $\mathcal{X}$, we try to find important patterns with respect to the following five aspects: (P1) $\mathbf{B}$: base properties of diseases, (P2) $\mathbf{R}$: disease reduction effects, (P3) $\mathbf{N}$: locations vs. diseases, (P4) $\mathcal{E}$: external shock events, and (P5) $\mathcal{M}$: mistake values. The first two are global/country-level parameter sets, the third is a local/state-level parameter set, and the last two are used for describing extra trends in $\mathcal{X}$.

DEFINITION 1 (COMPLETE SET OF FUNNEL). Let $\mathcal{F}$ be a complete set of parameters (namely, $\mathcal{F} = \{\mathbf{B}, \mathbf{R}, \mathbf{N}, \mathcal{E}, \mathcal{M}\}$) that describe the global/local/extra patterns of epidemics in $\mathcal{X}$.

Next, we will see each property in detail.

(P1), (P2) Global/country view. Basically, we assume that the following parameters are the same for all $l$ states.

[Figure 4(a): FUNNEL structure (i.e., $\mathcal{F} = \{\mathbf{B}, \mathbf{R}, \mathbf{N}, \mathcal{E}, \mathcal{M}\}$)]
[Figure 4(b): External shock tensor (i.e., $\mathcal{E} = \{\mathbf{E}^{(D)}, \mathbf{E}^{(T)}, \mathbf{E}^{(S)}\}$)]

Figure 4: Illustration of FUNNEL structure: (a) we extract the important behavior of epidemics from an original tensor $\mathcal{X}$ (i.e., the base matrix $\mathbf{B}$, disease reduction matrix $\mathbf{R}$, geo-disease matrix $\mathbf{N}$, external shock tensor $\mathcal{E}$, and mistake tensor $\mathcal{M}$); also, (b) the external shock tensor $\mathcal{E}$ can be described as a set of matrices, i.e., $\{\mathbf{E}^{(D)}, \mathbf{E}^{(T)}, \mathbf{E}^{(S)}\}$, for disease, time, state.

DEFINITION 2 (BASE MATRIX $\mathbf{B}$ ($d \times 6$)). Let $\mathbf{B}$ be a set of base parameters of $d$ diseases, i.e., $\mathbf{B} = \{b_1, \ldots, b_d\}$, where $b_i$ is the parameter set of the $i$-th disease.

For example, the infection/healing rate of measles should be the same for NY and FL. Similarly, once the measles vaccine has been introduced (i.e., the disease reduction effect), it could be immediately spread all over the country, that is, the starting time of the disease reduction effect would be the same for all locations.

DEFINITION 3 (DISEASE REDUCTION MATRIX $\mathbf{R}$ ($d \times 2$)). Let $\mathbf{R}$ be a parameter set of the reduction of $d$ diseases, i.e., $\mathbf{R} = \{r_1, \ldots, r_d\}$, where $r_i$ is the parameter set of the $i$-th disease.

(P3) Local/state view. We also want to analyze and explain local-specific patterns and trends in $\mathcal{X}$. So, what is the difference between measles in NY and in FL? Our answer is: they are exactly the same, except for the "local sensitivity" of the disease. The idea is that we share the parameters of the global-level matrices for all $l$ states with but one exception, local sensitivity, $N_{ij}$, which describes the potential population of the disease $i$ in the $j$-th state. Specifically, we set the invariant $N_{ij} = S_{ij}(t) + I_{ij}(t) + V_{ij}(t)$ in Model 3
. This parameter corresponds to the fraction of individuals who are likely to be infected by the disease. For example, NY has more measles patients than FL, because measles mainly affects children (i.e., there were more children in NY than FL in the last century).

DEFINITION 4 (GEO-DISEASE MATRIX $\mathbf{N}$ ($d \times l$)). Let $\mathbf{N}$ be a parameter set of the potential population of $d$ diseases and $l$ states, i.e., $\mathbf{N} = \{N_{ij}\}_{i,j=1}^{d,l}$, where $N_{ij}$ is the potential population of susceptibles of the $i$-th disease in the $j$-th state.

(P4) Extra view - external shocks. Consider that in 1946 a serious flu pandemic spread throughout the country. We want to describe this external shock event in terms of three aspects, (disease, state, time), e.g., (e1) "influenza, country-wide, 1946". Similarly, there was a community-wide outbreak of cryptosporidiosis in Utah in 2007, i.e., (e2) "cryptosporidiosis, Utah, 2007". To describe these external shock events, we create a new parameter set, namely the external shock tensor $\mathcal{E}$, which consists of a set of external shock events, as described in Figure 4(b).

The external shock tensor $\mathcal{E}$ can also be decomposed into three-aspect matrices, $\{\mathbf{E}^{(D)}, \mathbf{E}^{(S)}, \mathbf{E}^{(T)}\}$, each of which shows the patterns in terms of disease, state, time. A single external shock event can be described as triplet vectors $\{e^{(D)}, e^{(S)}, e^{(T)}\}$, where
- The shock (disease) vector $e^{(D)}$ shows the assignment of the external shock to the disease ID (i.e., $1 \le e^{(D)} \le d$).
- The shock (state) vector $e^{(S)}$ describes the participation strength of each state for each external shock event.
- The shock (time) vector $e^{(T)}$ shows the temporal pattern of the external shock event.

Specifically, $e^{(T)}$ is the global-level parameters, as described in subsubsection 3.2.3, and $e^{(S)}$ is the local-level parameters, i.e., $e^{(S)} = \{e^{(S)}_j\}_{j=1}^{l}$, where we change $\epsilon_0$ in Model 3 to describe the strength of the external shock for $l$ individual locations. That is, the strength of the shock effect in the $j$-th state is $\epsilon_0 \cdot e^{(S)}_j$. Consequently, we have the following:

DEFINITION 5 (EXTERNAL SHOCK TENSOR $\mathcal{E}$). Let $\mathcal{E}$ be a 3rd-order tensor of external shock events, i.e., $\mathcal{E} = \{\mathbf{E}^{(D)}, \mathbf{E}^{(S)}, \mathbf{E}^{(T)}\}$, where triplet matrices show the parameters in terms of three aspects, namely, "disease", "state", and "time".

(P5) Extra view - Mistakes. Basically, real datasets contain many errors such as incorrect reports. FUNNEL should detect and filter them out as outliers. We thus introduce an additional concept.

DEFINITION 6 (MISTAKE TENSOR $\mathcal{M}$). Let $\mathcal{M}$ be a 3rd-order tensor of mistake data points, where the element $m_{ij}(t)$ of $\mathcal{M}$ shows the entry of the $i$-th disease in the $j$-th state at time-tick $t$. Note that $\mathcal{M}$ is very sparse, and very often $m_{ij}(t) = 0$.

[Figure 5 panels: (a) External shock fitting; (b) Mistake fitting. Each panel shows count vs. year, 2007-2012, in linear and log scales, with the original sequence, I(t), S(t) and V(t)]

Figure 5: External shock vs. mistake for giardiasis in 2007: (a) the model (i.e., red line) is greatly influenced by the large distance of the outlier from the original sequence, while (b) it filters out the mistake point, and fits the sequence very well.

Figure 5 compares the fitting results of the external shock vs. mistake for the giardiasis cases, which contain an incongruous point in 2006 (approximately 10,000). In this case, the point should be treated as (b) a mistake value, instead of (a) an external shock event; in figure (a), the model (red line) is strongly influenced by the extreme point, while in (b), it successfully captures the real patterns of the original sequence.
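To make the single-sequence dynamics of Model 3 (Section 3.2.3) concrete, the sketch below iterates the three difference equations for one (disease, state) sequence. It is only an illustration under stated assumptions: the function name `simulate_funnel_re`, the parameter values, the sign of the phase shift inside the cosine, and the inclusive shock window are all choices made here, not values or conventions confirmed by the text.

```python
import math

def simulate_funnel_re(n, N, beta0, delta, gamma, Pa, Ps,
                       t_theta=None, theta0=0.0, shocks=(), Pp=52):
    """Iterate the FUNNEL-RE difference equations for n weekly time-ticks.

    shocks is an iterable of (t_mu, t_sigma, eps0) triples.
    Returns a list of (S, I, V) tuples, one per time-tick t = 1..n.
    """
    S, I, V = N - 1.0, 1.0, 0.0          # initial conditions S(1), I(1), V(1)
    trajectory = []
    for t in range(1, n + 1):
        trajectory.append((S, I, V))
        # Seasonal infection rate; the sign of the phase shift is an assumption.
        beta = beta0 * (1.0 + Pa * math.cos(2.0 * math.pi / Pp * (t + Ps)))
        # Disease reduction switches on at t_theta (None means no reduction).
        theta = theta0 if (t_theta is not None and t >= t_theta) else 0.0
        # Temporal susceptible rate: 1 plus the strength of every active shock.
        eps = 1.0 + sum(e0 for (t_mu, t_sigma, e0) in shocks
                        if t_mu <= t <= t_mu + t_sigma)
        new_infections = beta * eps * S * I
        # Tuple assignment evaluates the right-hand side with the old S, I, V.
        S, I, V = (S - new_infections + gamma * V - theta * S,
                   I + new_infections - delta * I,
                   V + delta * I - gamma * V + theta * S)
    return trajectory

# Illustrative (not fitted) parameter values.
params = dict(n=520, N=1000.0, beta0=0.0005, delta=0.3, gamma=0.05, Pa=0.4, Ps=0.0)
base = simulate_funnel_re(**params)
shocked = simulate_funnel_re(shocks=[(100, 4, 1.0)], **params)
```

Setting `shocks=()` and `t_theta=None` reduces the loop to FUNNEL-BASE, and adding only `t_theta`/`theta0` gives FUNNEL-R, which mirrors how the three models nest. Because every term removed from one class is added to another, $S(t) + I(t) + V(t)$ stays at $N$ throughout.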
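4. OPTIMIZATION ALGORITHM

In this section, we describe our fitting algorithm, FUNNELFIT. Our goal is to extract the important patterns of epidemics from $\mathcal{X}$. More specifically, the problem that we want to solve is as follows:

PROBLEM 1. Given a tensor $\mathcal{X}$ of (disease, state, time) triplets, find a compact description that best summarizes $\mathcal{X}$, that is, $\mathcal{F} = \{\mathbf{B}, \mathbf{R}, \mathbf{N}, \mathcal{E}, \mathcal{M}\}$.

We want to find a good representation $\mathcal{F}$ to solve the problem. The essential questions are: (a) How can we estimate the parameter set that best captures the dynamics and patterns in $\mathcal{X}$? (b) How should we decide the number of external shocks? (c) How can we ignore mistake (i.e., outlier) values in $\mathcal{X}$?

4.1 Model quality and data compression

We provide a new intuitive coding scheme, which is based on the minimum description length (MDL) principle. In short, it follows the assumption that the more we can compress the data, the more we can learn about its underlying patterns.

Model description cost. The description complexity of the model parameter set consists of the following terms:
- The number of diseases $d$, states $l$, and time-ticks $n$ require $\log^*(d) + \log^*(l) + \log^*(n)$ bits.⁶
- The model parameter sets of the base ($\mathbf{B}$), reduction ($\mathbf{R}$), and geo-disease ($\mathbf{N}$) matrices require $d \times 6$, $d \times 2$, $d \times l$ parameters, respectively, i.e., $Cost_M(\mathbf{B}) + Cost_M(\mathbf{R}) + Cost_M(\mathbf{N}) = c_F \cdot d(6 + 2 + l)$, where $c_F$ is the floating point cost.⁷

Similarly, the model description cost of the external shock tensor $\mathcal{E} = \{\mathbf{E}^{(D)}, \mathbf{E}^{(S)}, \mathbf{E}^{(T)}\}$ consists of the following:
- The number of external shocks $k$ requires $\log^*(k)$ bits.
- The shock-disease matrix $\mathbf{E}^{(D)}$ requires $\log^*(d)$ for each shock.
- The shock-time parameter set $e^{(T)} = \{t_\mu, t_\sigma, \epsilon_0\}$ in $\mathbf{E}^{(T)}$ requires $\log^*(n)$, $\log^*(n)$, $c_F$, respectively.
- The shock-state matrix $\mathbf{E}^{(S)}$ requires $c_F \cdot k \cdot l$.
Consequently, the model cost of the external shock tensor $\mathcal{E}$ is $Cost_M(\mathcal{E}) = \log^*(k) + k \cdot (\log^*(d) + 2\log^*(n) + c_F(1 + l))$.

The model cost of the mistake tensor $\mathcal{M}$ consists of:
- The number of non-zero elements in $\mathcal{M}$ requires $\log^*(|\mathcal{M}|)$.
- The location of each non-zero element and its value, $m_{ij}(t)$, require $\log^*(d)$, $\log^*(l)$, $\log^*(n)$, $\log^*(m_{ij}(t))$, respectively.
Thus, $Cost_M(\mathcal{M}) = \log^*(|\mathcal{M}|) + \sum_{m_{ij}(t) \neq 0}^{|\mathcal{M}|} \left(\log^*(d) + \log^*(l) + \log^*(n) + \log^*(m_{ij}(t))\right)$, where $|\mathcal{M}|$ is the number of non-zero elements in $\mathcal{M}$.

Data coding cost. Once we have decided the full parameter set $\mathcal{F}$, we can encode the data $\mathcal{X}$ using Huffman coding [3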
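], i.e., a number of bits is assigned to each value in $\mathcal{X}$, which is the logarithm of the inverse of the probability of the values (here, we use a Gaussian distribution). The encoding cost of $\mathcal{X}$ given $\mathcal{F}$ is:

$$
Cost_C(\mathcal{X} \mid \mathcal{F}) = \sum_{i,j,t=1}^{d,l,n} \log_2 p^{-1}_{Gauss(\mu, \sigma^2)}\left(x_{ij}(t) - m_{ij}(t) - I_{ij}(t)\right),
$$

where $x_{ij}(t)$, $m_{ij}(t)$ are the elements in $\mathcal{X}$ and the mistake tensor $\mathcal{M}$, respectively, and $I_{ij}(t)$ is the estimated count of infections (i.e., Model 3). Also, $\mu$ and $\sigma^2$ are the mean and variance of the distance between the original and estimated values.⁸

Putting it all together. Consequently, the total code length for $\mathcal{X}$ with respect to a given parameter set $\mathcal{F}$ can be described as follows:

$$
\begin{aligned}
Cost_T(\mathcal{X}; \mathcal{F}) ={} & \log^*(d) + \log^*(l) + \log^*(n) \\
& + Cost_M(\mathbf{B}) + Cost_M(\mathbf{R}) + Cost_M(\mathbf{N}) \\
& + Cost_M(\mathcal{E}) + Cost_M(\mathcal{M}) + Cost_C(\mathcal{X} \mid \mathcal{F})
\end{aligned}
\tag{4}
$$

Thus our next goal is to minimize the above function.

⁶ Here, $\log^*$ is the universal code length for integers.
⁷ We used 4 × 8 bits in our setting.
⁸ Here, $\mu$, $\sigma^2$ need $2 \times c_F$ bits, but we can eliminate them because they are constant values and independent of our modeling.

4.2 Multi-layer optimization

Until now, we have seen how we can measure the goodness of the representation of $\mathcal{X}$, if we are given a candidate parameter set $\mathcal{F}$. The next question is how to find an optimal solution of the full parameter set: $\mathcal{F} = \{\mathbf{B}, \mathbf{R}, \mathbf{N}, \mathcal{E}, \mathcal{M}\}$. As described in subsection 3.3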
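, our FUNNEL model consists of multiple parameter sets, each of which explains either the local or global pattern of epidemics in $\mathcal{X}$. For example, the base and reduction matrices $\mathbf{B}$, $\mathbf{R}$ explain the global-level behavior of each disease, while the geo-disease matrix $\mathbf{N}$ describes the local-level trends. Also, the extra tensors $\mathcal{E}$, $\mathcal{M}$ consist of both global- and local-level parameters. More specifically, the external shocks consist of $\mathcal{E} = \{\mathbf{E}^{(D)}, \mathbf{E}^{(T)}, \mathbf{E}^{(S)}\}$, where the first two are global-level, and the last one is local-level. Similarly, the mistake tensor can also be described by the triplet matrix $\mathcal{M} = \{\mathbf{M}^{(D)}, \mathbf{M}^{(S)}, \mathbf{M}^{(T)}\}$, each of which describes the location of the mistake values in terms of disease, state, time. So, how can we efficiently estimate these model parameters?

We propose a multi-layer optimization algorithm, to search for the optimal solution in terms of both the global- and local-level parameters. The idea is that we split the parameter set $\mathcal{F}$ into two subsets, i.e., $\mathcal{F}_G$ and $\mathcal{F}_L$, which correspond to the global-level and local-level parameter sets, respectively, and try to fit the parameter sets separately. Our algorithm consists of the following two phases:
- GLOBALFIT: find good global-level parameters for $\{x_i\}_{i=1}^{d}$, i.e., $\mathcal{F}_G = \{\mathbf{B}, \mathbf{R}, \mathbf{E}^{(D)}, \mathbf{E}^{(T)}, \mathbf{M}^{(D)}, \mathbf{M}^{(T)}\}$
- LOCALFIT: find good local-level parameters for $\{x_{ij}\}_{i,j=1}^{d,l}$, i.e., $\mathcal{F}_L = \{\mathbf{N}, \mathbf{E}^{(S)}, \mathbf{M}^{(S)}\}$
Here, the global epidemic sequence of the $i$-th disease, $x_i$, can be described as the sum of the $l$ local sequences, i.e., $x_i(t) = \sum_{j=1}^{l} x_{ij}(t)$.

Algorithm 1 shows an overview of FUNNELFIT. Given a tensor $\mathcal{X}$, it finds the full set of FUNNEL parameters.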
Algorithm 1 FUNNELFIT(X)
  Input: Tensor X (d × l × n)
  Output: Complete set of parameters, i.e., F = {B, R, N, E, M}
  /* Parameter fitting for global-level sequences */
  {F_G} = GLOBALFIT(X);
  /* Parameter fitting for local-level sequences */
  {F_L} = LOCALFIT(X, F_G);
  return F = {F_G, F_L};

4.2.1 Global-level parameter fitting

Given a tensor $\mathcal{X}$, our sub-goal is to find the optimal global-level parameter set $\mathcal{F}_G$ that minimizes the cost function (i.e., Equation 4). We want to fit the basic parameters of each disease (i.e., the base and reduction matrices), and estimate the appropriate number of external shocks and mistake values, simultaneously. Finding the appropriate number of external shocks/mistakes is a particular issue here, because the parameter fittings are very sensitive to outliers, as described in Figure 5(a). To find a good basic parameter set for $\mathcal{X}$, we have to filter out the external shocks and mistakes appropriately. Simultaneously, a good external-shock/mistake filter requires a well-estimated base model. We escape this circular dependency by applying an iterative method that alternates between external-shock/mistake detection and filtering, and basic model fitting, until the cost function reaches a minimum value.
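The quantity that this alternation drives down is the coding cost of Section 4.1. A rough sketch of the data coding cost for one sequence is given below, to show why absorbing an implausible spike into the mistake term can shorten the code. The function name `gaussian_coding_cost`, the variance floor, and the toy numbers are assumptions made for illustration; the model description cost of the extra mistake entry is deliberately left out, so this is not the full $Cost_T$.

```python
import math

def gaussian_coding_cost(x, estimate, mistakes=None):
    """Sum of -log2 Gaussian density over residuals x(t) - m(t) - I(t).

    The mean and variance are taken from the residuals themselves.
    """
    if mistakes is None:
        mistakes = [0.0] * len(x)
    residuals = [xi - mi - ei for xi, mi, ei in zip(x, mistakes, estimate)]
    mu = sum(residuals) / len(residuals)
    var = sum((r - mu) ** 2 for r in residuals) / len(residuals)
    var = max(var, 1e-12)  # guard against zero variance (an assumption of this sketch)
    log2e = math.log2(math.e)
    # -log2 of the Gaussian density, written in log form to avoid underflow.
    return sum(0.5 * math.log2(2.0 * math.pi * var)
               + (r - mu) ** 2 / (2.0 * var) * log2e
               for r in residuals)

# Toy weekly counts with one implausible spike at index 4 (made-up numbers).
x        = [10, 12, 11, 9, 500, 10, 11, 12]
estimate = [10, 11, 11, 10, 10, 10, 11, 11]      # stand-in for the fitted I(t)
as_is        = gaussian_coding_cost(x, estimate)
with_mistake = gaussian_coding_cost(x, estimate, mistakes=[0, 0, 0, 0, 490, 0, 0, 0])
```

In this toy case `with_mistake` is smaller than `as_is`, because removing the spike shrinks the residual variance. In the full scheme that saving would still have to outweigh the extra bits spent describing the mistake entry.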
Algorithm 2 GLOBALFIT(X)
  Input: Tensor X
  Output: Set of global-level parameters F_G
  for i = 1 : d do
    Create x_i from X;  /* Global sequence x_i of i-th disease */
    /* Initialize external shocks and mistake values for disease i */
    E_i(D) = E_i(T) = M_i(D) = M_i(T) = ∅;
    while improving the cost do
      b = argmin_{b'} Cost_C(x_i | b', r, E_i(T), M_i(T));  /* Base */
      r = argmin_{r'} Cost_C(x_i | b, r', E_i(T), M_i(T));  /* Reduction */
      E_i(D) = E_i(T) = M_i(D) = M_i(T) = ∅;  /* Initialize values */
      /* Find external shocks and mistakes for disease i */
      while improving the cost do
        e(T) = argmin_{e(T)} Cost_C(x_i | b, r, {E_i(T) ∪ e(T)}, M_i(T));
        m(T) = argmin_{m(T)} Cost_C(x_i | b, r, E_i(T), {M_i(T) ∪ m(T)});
        /* Compare external shock vs. mistake */
        if Cost_T(x_i; e(T)) ≤ Cost_T(x_i; m(T)) then
          /* External shock wins - treat as an external shock */
          E_i(D) = {E_i(D) ∪ i};  E_i(T) = {E_i(T) ∪ e(T)};
        else
          /* Mistake wins - treat as a mistake value */
          M_i(D) = {M_i(D) ∪ i};  M_i(T) = {M_i(T) ∪ m(T)};
        end if
      end while
    end while
    /* Update parameter set of i-th disease */
    B = B ∪ b_i;  R = R ∪ r_i;
    E(D) = E(D) ∪ E_i(D);  E(T) = E(T) ∪ E_i(T);
    M(D) = M(D) ∪ M_i(D);  M(T) = M(T) ∪ M_i(T);
  end for
  return F_G = {B, R, E(D), E(T), M(D), M(T)};

External shock vs. mistake. There is also an important issue regarding the external shock vs. the mistake value. We want to distinguish automatically between an external shock event and a typing error. For example, in Figure 5, there is a clear "typo", rather than an external shock event. Our coding scheme enables us to provide the answer. The idea is that we try to fit the parameters by treating the data as both an external shock event and a mistake value, and then compare the cost of the two alternatives. For Figure 5, the cost of (b) is less than (a), thus the algorithm determines that there is a mistake value in 2007.

Algorithm. Algorithm 2
is a detailed algorithm of the global-level fitting. Given a tensor X, it creates a set of d global sequences: {x_i}_{i=1}^{d}. It tries to fit the global-level parameter set, as well as find the appropriate number of external shocks/mistakes. We use the Levenberg-Marquardt (LM) algorithm to minimize the cost function. Note that the extra tensors E and M consist of entries of the form (disease, state, time), but this algorithm can find only the global-level entry, which consists of (disease, time). The local-level entries E^(S) and M^(S) can be computed by local-level parameter fitting, as shown next in Algorithm 3. Also, the cost function (Equation 4) includes the cost of local-level parameters such as N, but these terms are independent of the global model fitting. Hence, we can simply consider them to be constant.

Algorithm 3 LOCALFIT(X, B, R, E^(D), E^(T), M^(D), M^(T))
  Input: (a) Tensor X, (b) global-level parameter set F_G
  Output: Set of local-level parameters, i.e., F_L
  while improving the cost do
    /* For each local sequence x_ij of i-th disease in j-th state */
    for i = 1 : d do
      for j = 1 : l do
        N_ij = argmin_{N'_ij} Cost_C(x_ij | B, R, N'_ij, E, M);
      end for
    end for
    for each external shock (e^(D), e^(S), e^(T)) ∈ E do
      Update e^(S) to minimize the cost /* Local participation rate */
    end for
    for each mistake (m^(D), m^(S), m^(T)) ∈ M do
      Update m^(S) to minimize the cost /* Mistake value */
    end for
  end while
  return F_L = {N, E^(S), M^(S)};

4.2.2 Local-level parameter fitting

Given a set of d × l local-level sequences, {x_ij}_{i,j=1}^{d,l} ∈ X, and a set of global-level parameters, F_G, our next goal is to fit the individual parameters of each disease in each state, that is, F_L = {N, E^(S), M^(S)}. We propose an iterative optimization algorithm (see Algorithm 3). Our algorithm searches for the optimal solution with respect to (a) the geo-disease matrix N, (b) the local-level
external shocks E^(S), and (c) the local-level mistake values M^(S), so that the total coding cost is minimized.

LEMMA 1. The computation time of FUNNELFIT is O(dln).

PROOF. To create the global-level sequences from X, the algorithm requires O(dln) time. For global-level parameter fitting, it needs O(#iter · (|E| + |M|) · dn) time, where #iter is the number of iterations, and |E| and |M| denote the number of external shocks and the number of non-zero values in M, respectively. Similarly, for the local-level parameter fitting, it needs O(#iter · (|E| + |M|) · dln) time to fit the parameters. Note that #iter, |E|, and |M| are small constant values that are negligible. Thus, the complexity is O(dln).

5. EXPERIMENTS

In this section we demonstrate the effectiveness of FUNNEL with real epidemic data. The experiments were designed to answer the following questions:

Q1 Sense-making: Can our method help us understand the given input epidemic data?
Q2 Accuracy: How well does our method match the data?
Q3 Scalability: How does our method scale in terms of computational time?

5.1 Matching co-evolving epidemic patterns

We demonstrate how effectively FUNNEL can learn important patterns given a large collection of epidemics. Figure 6 shows the results of model fitting on 15 typical diseases. We show the original sequences (i.e., black dots) and the estimated sequences I(t) (i.e., red line) in linear-linear (top) and linear-log (bottom) scales. In the linear-log scale, we also show the susceptible S(t) and vigilant V(t) counts. We made several important observations, which correspond to the five properties of the epidemic sequences.

(P1) Disease seasonality. As we have already seen in the introduction section (Figure 1(c)), we identified four categories:
  - Influenza has very strong periodic spikes in January-February.
  - Children's diseases (e.g., measles, mumps, chickenpox) also have strong periodicity, but they peak in spring [27].
  - Tick-borne diseases (e.g., Lyme disease) and cryptosporidiosis (i.e., a water-borne disease) have strong periodicity, peaking in the summer, related to vector and human behavior and climate factors [28].
  - Gonorrhea, i.e., a sexually transmitted disease (STD), has no periodicity.

(P2) Disease reduction effects. FUNNEL is capable of automatically detecting the disease reduction impact. For example, in Figure 6