Finding unusual time series in terabyte sized data sets

Uploaded On 2017-04-11


Presentation Transcript

like the scale considered in this work. In [7] the authors consider an astronomical data set taken from the Sloan Digital Sky Survey, with 111,456 records and 68 variables. They find anomalies by building a Bayesian network and then looking for objects with a low log-likelihood. Because the dimensionality is relatively small and they only used 10,000 out of the 111,456 records to build the model, all items could be placed in main memory. They report 3 hours of CPU time (with a 400 MHz machine). For the secondary storage case they would also require at least two scans, one to build the model, and one to create anomaly scores. In addition, this approach requires the setting of many parameters, including choices for discretization of real variables, a maximum number of iterations for EM (a sub-routine), the number of mixture components, etc.

In a sequence of papers Otey and colleagues [10] introduce a series of algorithms for mining distance based outliers. Their approach has many advantages, including the ability to handle both real-valued and discrete data. Furthermore, like our approach, their approach also requires only two passes over the data, one to build a model and one to find the outliers. However, it also requires significant CPU time, being linear in the size of the data set but quadratic in the dimensionality of the examples. For instance, for two million objects with a dimensionality of 128 they report needing 12.5 hours of CPU time (on a 2.4 GHz machine). In contrast, we can handle a data set of size two million objects with dimensionality 512 in less than two hours, most of which is I/O time.

Jagadish et al. [11] produced an influential paper on finding unusual time series (which they call deviants) with a dynamic programming approach. Again this method is quadratic in the length of the time series, and thus it is only demonstrated on kilobyte sized data sets.

The discord introducing work [13] suggests a fast heuristic technique (termed HOTSAX) for pruning quickly the data space and focusing only on the potential discords. The authors obtain a lower dimensional representation for the time series at hand and then build a trie in main memory to index these lower dimensional sequences. A drawback of the approach is that choosing a very small dimensionality size results in a large number of discord candidates, which makes the algorithm essentially quadratic, while choosing a more accurate representation increases the index structure exponentially. The data sets used in that evaluation are also assumed to fit in main memory.

In order to discover discords in massive data sets we must design special purpose algorithms. The main memory algorithms achieve speed-up in a variety of ways, but all require random access to the data. Random access and linear search have essentially the same time requirements in main memory, but on disk resident data sets, random access is expensive and should be avoided where possible. As a general rule of thumb in the database community it is said that random access to just 10% of a disk resident data set takes about the same time as a linear search over the entire data. In fact, recent studies suggest that this gap is widening. For example, [19] notes that the internal data rate of IBM's hard disks improved from about 4 MB/sec to more than 60 MB/sec. In the same time period, the positioning time only improved from about 18 msec to 9 msec. This implies that sequential disk access has become about 15 times faster, while random access has only improved by a factor of two. Given the above, efficient algorithms for disk resident data sets should strive to do only a few sequential scans of the data.

3. Notation

Let a time series $T = t_1, \ldots, t_m$ be defined as an ordered set of scalar or multivariate observations $t_i$ measured at equal intervals in time. When $m$ is very large, looking at the time series as a whole does not reveal much useful information. Instead, one might be more interested in subsequences $C = t_p, \ldots, t_{p+n-1}$ of $T$ with length $n \ll m$ (here $p$ is an arbitrary position, such that $1 \le p \le m - n + 1$).

Working with time series databases there are usually two scenarios in which the examples in the database might have been generated. In one of them the time series are generated from short distinct events, e.g. a set of astronomical observations (see Section 6.1.1). In the second scenario, the database simply consists of all possible subsequences extracted from the time series of a long ongoing process, e.g. the yearly recordings of a meteorological sensor. Knowing whether the database is populated with subsequences of the same process is essential when performing pattern recognition tasks. The reason for this is that two subsequences $C$ and $M$ extracted from close positions $p_1$ and $p_2$ are very likely to be
similar to one another. This might falsely lead to a conclusion that the subsequence $C$ is not a rare example in the database. In these cases, when $p_1$ and $p_2$ are not “significantly” different, the subsequences $C$ and $M$ are called trivial matches [5]. The positions $p_1$ and $p_2$ are significantly different with respect to a distance function $Dist$, if there exists a subsequence $Q$ starting at position $p_3$, such that $p_1 < p_3 < p_2$ and $Dist(C, M) < Dist(C, Q)$.

With the above notation in hand, we can now present the formal definition of time series discords:

Definition 1. Time Series Discord: Given a database $S$, the time series $C \in S$ is called the most significant discord in $S$ if the distance to its nearest neighbor (or its nearest non-trivial match in case of subsequence databases) is largest. I.e. for an arbitrary time series $M \in S$ the following holds: $\min(Dist(C, Q)) \ge \min(Dist(M, P))$, where $Q, P \in S$ (and $Q$, $P$ are non-trivial matches of $C$ and $M$ in case of subsequence databases).

Based on the distance calculations, for each $S_i$ there are three situations:

1. The distance between the discord candidate in $C$ and the item on disk is greater than the current value of $C.dist_j$. If this is true we do nothing.
2. The distance between the discord candidate in $C$ and the item on disk is less than $r$. If this happens it means that the discord candidate cannot be a discord; it is a false positive. We can permanently remove it from the set $C$ (line 11 and line 12).
3. The distance between the discord candidate in $C$ and the item on disk is less than the current value of $C.dist_j$ (but still greater than $r$, otherwise we would have removed it). If this is true we simply update the current distance to the nearest neighbor (line 14).

It is straightforward to see that upon completion of Algorithm 1 the subset $C$ contains only the true discords at range at least $r$, and that no such discord has been deleted from $C$, provided that it has already been in it.

The time complexity for the algorithm depends critically on the size $|C|$ of the subset. In the pathological case where $|C| = |S|$, it becomes a brute force search, quadratic in the size $|S|$. Obviously, such a candidate set could be produced if the range parameter $r$ is equal to 0. If, however, the candidate set $C$ contains just one item, the algorithm becomes essentially a linear scan over the disk for the nearest neighbor to that one item. A very interesting observation is that if the candidate set $C$ contains two or three items instead of one, this will most likely not change the time for the algorithm to run. This is so because for a very small $|C|$ the CPU calculations will execute faster than the disk reading operations, and thus the running time for the algorithm is just the time taken for a linear scan of the disk data.

To summarize, the efficiency of Algorithm 1 depends on the two critical assumptions that:

1. For a given value of $r$, we can efficiently build a set $C$ which contains all the discords with a discord distance greater than or equal to $r$. This set may also contain some non-discords, but the number of these “false positives” must be relatively small.
2. We can provide a “good” value for $r$ which allows us to do '1' above. If we choose too low a value, then the size of set $C$ will be very large, and our algorithm will become slow, and even worse, the set $C$ might no longer fit in main memory. In contrast, if we choose too large a value for $r$, we may discover that after running the algorithm above the set $C$ is empty. This will be the correct result; there are simply no discords with a distance of that value. However, we probably wanted to find a handful of discords.

4.2. Candidates Selection Phase

In this section we address the first of the above assumptions, i.e. given a threshold $r$ we present an efficient algorithm for building a compact set $C$ with a small number of false positives. A formal description of this candidate selection phase is given as Algorithm 2.
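As a rough illustration of how the two phases fit together, the following Python sketch builds a candidate set in one pass and then refines it in a second pass using the three situations listed above. It is a sketch under stated assumptions, not the authors' implementation: the function names `select_candidates` and `refine` are hypothetical, the distance is taken to be Euclidean (the text allows an arbitrary distance function), an in-memory list stands in for the disk resident data set, and the database is assumed to hold whole time series, so trivial matches are not handled.

```python
import math
from typing import Dict, List, Sequence

Series = Sequence[float]


def dist(a: Series, b: Series) -> float:
    """Euclidean distance (an assumption; any distance function could be used)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_candidates(data: List[Series], r: float) -> List[int]:
    """Phase 1: one sequential scan that returns indices of discord candidates.

    A series stays in the candidate list only while no series seen so far
    lies closer to it than r.
    """
    candidates: List[int] = []
    for i, s in enumerate(data):
        is_candidate = True
        survivors: List[int] = []
        for j in candidates:
            if dist(s, data[j]) < r:
                # s and data[j] are within r of each other: neither is a discord.
                is_candidate = False
            else:
                survivors.append(j)
        candidates = survivors
        if is_candidate:
            candidates.append(i)
    return candidates


def refine(data: List[Series], candidates: List[int], r: float) -> Dict[int, float]:
    """Phase 2: second sequential scan over the data.

    Returns a map from each surviving candidate index to its nearest
    neighbor distance.
    """
    nn_dist: Dict[int, float] = {j: math.inf for j in candidates}
    for i, s in enumerate(data):
        for j in list(nn_dist):      # copy the keys so deletion is safe
            if i == j:
                continue             # do not compare a series with itself
            d = dist(s, data[j])
            if d < r:
                del nn_dist[j]       # situation 2: false positive, drop it
            elif d < nn_dist[j]:
                nn_dist[j] = d       # situation 3: closer neighbor found
            # situation 1: d >= current nearest distance, nothing to do
    return nn_dist
```

Calling `refine(data, select_candidates(data, r), r)` on a small list of equal-length series returns the series whose nearest neighbor is at distance at least `r`, together with that distance. Choosing `r = 0` keeps every series as a candidate, which reproduces the quadratic worst case described above.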
Algorithm 2 Candidates Selection Phase

    procedure [C] = DC_Selection(S, r)
    in:  S: disk resident data set of time series
         r: discord defining range
    out: C: list of discord candidates
     1: C = {S_1}
     2: for i = 2 to |S| do
     3:   isCandidate = true
     4:   for ∀ C_j ∈ C do
     5:     if (Dist(S_i, C_j) < r) then
     6:       C = C \ C_j
     7:       isCandidate = false
     8:     end if
     9:   end for
    10:   if (isCandidate) then
    11:     C = C ∪ S_i
    12:   end if
    13: end for

The algorithm performs one linear scan through the database and for each time series $S_i$ it validates the possibility for the candidates already in $C$ to be discords (line 5). If a candidate fails the validation, then it is removed from this set. In the end, the new $S_i$ is either added to the candidates list (line 11), if it is likely to be a discord, or it is omitted. To show the correctness of this procedure, and hence of the overall discord detection algorithm, we first point out an observation that holds for an arbitrary distance function:

Proposition 1. Global Invariant. Let $S_i$ be a time series in the data set $S$ and $d_{s_i}$ be the distance from $S_i$ to its nearest neighbor in $S$. For any subset $C \subseteq S$ the distance $d_{c_i}$ from $S_i$ to its nearest neighbor in $C$ is larger or equal to $d_{s_i}$, i.e. $d_{c_i} \ge d_{s_i}$.

Indeed, if the nearest neighbor of $S_i$ is part of $C$ then $d_{s_i} = d_{c_i}$. Otherwise, as $C$ does not contain elements outside of $S$, the distance $d_{c_i}$ should be larger than $d_{s_i}$.

Using the above global invariant, we can now easily justify the following proposition:

Proposition 2. Upon completion of Algorithm 2, the candidates list $C$ contains all discords $S_i$ at distance $d_{s_i} \ge r$ from their nearest neighbors in $S$.

Proof. Let $S_i$ be a discord at distance $d_{s_i} \ge r$ from its nearest neighbor in $S$. From the global invariant it follows
Figure 8: Random walk Data ($|S| = 10^6$). Number of examples in $C$ after processing each of the 100 pages during the two phases of the algorithm. The method remains stable even if we select a slightly different threshold $r$ during the sampling procedure.

... algorithm detects the most significant discords in less than four times the time necessary to find the nearest neighbor of a single example only.

Figure 8 demonstrates the size $|C|$ after processing each database page. The graphs also show how the size varies when changing the threshold. The plots demonstrate that with a 2%-5% change in its values we still detect the required 10 discords with just two scans, while the maximum memory and the running time do not increase drastically.

It is interesting to note how quickly the memory drops after the refinement step is initiated. This implies that most of the non-discord elements in the candidates list get eliminated after scanning just a few pages of the database. From this point on the algorithm performs a very limited number of distance computations to update the nearest neighbor distances for the remaining candidates in $C$. Similar behavior was observed throughout all data sets studied.

Heterogeneous Data. Finally we check the efficiency of the discord detection algorithm on a large data set of real-world time series coming from a mixture of distributions. To generate such a data set we combined three data sets each of size $4 \times 10^5$ (1.2 million elements in total). The time series have length of 140 points. The three data sets are: motion capture data, EEG recordings of a rat, and meteorological data from the Tropical Atmosphere Ocean project (TAO) [2].

Table 2: Heterogeneous data. Time efficiency of the algorithm.

    Examples     Disk Size    Time (Phase 1)    Time (Phase 2)
    1.2 mill.    1.17 Gb      15 min.           16 min.
Table 2 lists the running time of the algorithm on the heterogeneous data set. Again we are looking for the top 10 discords in the data set. On the sample the threshold is estimated as $r = 12.86$. After the candidate selection phase the set $C$ contains 690 elements, and at the end of the refinement phase there are 59 elements that meet the threshold $r$. No restarts of the algorithm were necessary for this data set either. The discords detected are mostly from the TAO class as its time series exhibit much larger variability compared to the time series for the other two classes.

7. Discussion

In a sense, the approach taken here may appear surprising. Most data mining algorithms for time series use some approximation of the data, such as DFT, DWT, SVD etc. Previous (main memory) algorithms for finding discords have used SAX [13][24], or Haar wavelets [9]. However, we are working with just the raw data. It is worth explaining why.

Most time series data mining algorithms achieve speed-up with the Gemini framework (or some variation thereof) [8]. The basic idea is to approximate the full data set in main memory, approximately solve the problem at hand, and then make (hopefully few) accesses to the disk to confirm or adjust the solution. Note that this framework requires one linear scan just to create the main memory approximation, and our algorithm requires a total of two linear scans. So there is at most a factor of two possibility of improvement. However, it is clear that even this cannot be achieved. Even if we assume that some algorithm can be created to approximately solve the problem in main memory, that algorithm must still make some accesses to disk to check the raw data. Because such random accesses are ten times more expensive than sequential accesses [19], if the algorithm must access more than 10% of the data it can no longer be competitive. In fact, it is difficult to see how any algorithm could avoid retrieving 100% of the data in the second phase. For all time series approximations, it is possible that two objects appear arbitrarily close in approximation space, but are arbitrarily far apart in the raw data space. Most data mining algorithms exploit lower bound pruning to find the nearest neighbor, but here upper bounds are required to prune objects that cannot be the furthest nearest neighbor. While there has been some work on providing upper bounds for time series, these bounds tend to be exceptionally weak [22]. Intuitively this makes sense: there are only so many ways two time series can be similar to each other, hence the ability to tightly lower bound. However, there is a much larger space of possible ways that two time series could be different, and an upper bound must somehow capture all of them.

In the same vein, it is worth discussing why we do not attempt to index the candidate set $C$ in main memory, to speed up both phase one and phase two of our algorithm. The answer is simply that it does not improve performance. The many time series indexing algorithms that exist [8][22] are designed to reduce the number of disk accesses; they have little utility when all the data resides in main memory (as with the candidate set $C$). For high dimensional time series in main memory it is impossible to beat a linear scan, especially when the linear scan is highly optimized with early abandoning. Furthermore, in phase one of our algorithm every object seen in the disk resident data set is either added to the candidate set $C$ or causes an object to be ejected from $C$; this overhead in maintaining the index more than nullifies any possible gain.

8. Conclusions

The work introduced a highly efficient algorithm for mining range discords in massive time series databases. The algorithm performs two linear scans through the database and a limited amount of memory based computations. It is intuitive and very simple to implement. We further demonstrated that with a suitable sampling technique the method can be adapted to robustly detect the top $k$ discords in the data. The utility of the discord definition combined with the efficiency of the method suggest it as a valuable tool across multiple domains, such as astronomy, surveillance, web mining, etc. Experimental results from all these areas have been demonstrated.

We are currently exploring adaptive approaches that allow for the efficient detection of statistically significant discords when the time series are generated by a mixture of different processes. In these cases alternating the range parameter according to the distribution of each example turns out to be essential when looking for the top discords with respect to the individual classes.

9. Acknowledgements

We would like to thank Dr. M. Vlachos for providing us the web query data, Dr. P. Protopapas for the light-curves, Dr. A. Naftel
and Dr. L. Latecki for the trajectory data sets.

References

[1] http://bulge.astro.princeton.edu/ogle/.
[2] http://www.pmel.noaa.gov/tao/index.shtml.
[3] J. Ameen and R. Basha. Mining time series for identifying unusual sub-sequences with applications. 1st International Conference on Innovative Computing, Information and Control, 1:574–577, 2006.
[4] S. Berchtold, C. Böhm, D. Keim, and H. Kriegel. A cost model for nearest neighbor search in high-dimensional data space. In Proc. of the 16th ACM Symposium on Principles of database systems (PODS), pages 78–86, 1997.
[5] B. Chiu, E. Keogh, and S. Lonardi. Probabilistic discovery of time series motifs. In Proc. of the 9th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD'03), pages 493–498, 2003.
[6] M. Chuah and F. Fu. ECG anomaly detection via time series analysis. Technical Report LU-CSE-07-001, 2007.
[7] S. Davies and A. Moore. Mix-nets: Factored mixtures of gaussians in bayesian networks with mixed continuous and discrete variables. In Proc. of the 16th Conference on Uncertainty in Artificial Intelligence, pages 168–175, 2000.
[8] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. SIGMOD Record, 23(2):419–429, 1994.
[9] A. Fu, O. Leung, E. Keogh, and J. Lin. Finding time series discords based on Haar transform. In Proc. of the 2nd International Conference on Advanced Data Mining and Applications, pages 31–41, 2006.
[10] A. Ghoting, S. Parthasarathy, and M. Otey. Fast mining of distance-based outliers in high dimensional datasets. In Proc. of the 6th SIAM International Conference on Data Mining, 2006.
[11] H. Jagadish, N. Koudas, and S. Muthukrishnan. Mining deviants in a time series database. In Proc. of the 25th International Conference on Very Large Data Bases, pages 102–113, 1999.
[12] E. Keogh and S. Kasetty. On the need for time series data mining benchmarks: a survey and empirical demonstration. In Proc. of the 8th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 102–111, 2002.
[13] E. Keogh, J. Lin, and A. Fu. Hot sax: Efficiently finding the most unusual time series subsequence. In Proc. of the 5th IEEE International Conference on Data Mining, pages 226–233, 2005.
[14] E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. In Proc. of the 24th International Conference on Very Large Data Bases (VLDB), pages 392–403, 1998.
[15] K. Malatesta, S. Beck, G. Menali, and E. Waagen. The AAVSO data validation project. Journal of the American Association of Variable Star Observers (JAAVSO), 78:31–44, 2005.
[16] A. Naftel and S. Khalid. Classifying spatiotemporal object trajectories using unsupervised learning in the coefficient feature space. Multimedia Syst., 12(3):227–238, 2006.
[17] D. Pokrajac, A. Lazarevic, and L. Latecki. Incremental local outlier detection for data streams. In IEEE Symposium on Computational Intelligence and Data Mining, pages 504–515, 2007.
[18] P. Protopapas, J. Giammarco, L. Faccioli, M. Struble, R. Dave, and C. Alcock. Finding outlier light-curves in catalogs of periodic variable stars. Monthly Notices of the Royal Astronomical Society, 369:677–696, 2006.
[19] M. Riedewald, D. Agrawal, A. Abbadi, and F. Korn. Accessing scientific data: Simpler is better. In Proc. of the 8th International Symposium in Spatial and Temporal Databases, pages 214–232, 2003.
[20] B. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, 1986.
[21] D. Stoyan. On estimators of the nearest neighbour distance distribution function for stationary point processes. Metrica, 64(2):139–150, 2006.
[22] C. Wang and X. Wang. Multilevel filtering for high dimensional nearest neighbor search. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 37–43, 2000.
[23] D. Wang, P. Fortier, H. Michel, and T. Mitsa. Hierarchical agglomerative clustering based t-outlier detection. 6th International Conference on Data Mining - Workshops, 0:731–738, 2006.
[24] L. Wei, E. Keogh, and X. Xi. SAXually explicit images: Finding unusual shapes. In Proc. of the 6th International Conference on Data Mining, pages 711–720, 2006.