
Frequent Pattern Mining

Editors
Charu C. Aggarwal, IBM T.J. Watson Research Center, Yorktown Heights, New York, USA
Jiawei Han, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA

ISBN 978-3-319-07820-5
ISBN 978-3-319-07821-2 (eBook)
DOI 10.1007/978-3-319-07821-2
Springer Cham Heidelberg New York Dordrecht London
Library of Congress Control Number: 2014944536

© Springer International Publishing Switzerland 2014

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)
Preface

The field of data mining has four main super-problems corresponding to clustering, classification, outlier analysis, and frequent pattern mining. Compared to the other three problems, the frequent pattern mining model was formulated relatively recently. In spite of its shorter history, frequent pattern mining is considered the marquee problem of data mining. The reason for this is that interest in the data mining field increased rapidly soon after the seminal paper on association rule mining by Agrawal, Imielinski, and Swami. The earlier data mining conferences were often dominated by a large number of frequent pattern mining papers. This is one of the reasons that frequent pattern mining has a very special place in the data mining community. At this point, the field of frequent pattern mining is considered a mature one.

While the field has reached a relative level of maturity, very few books cover different aspects of frequent pattern mining. Most of the existing books are either too generic or do not cover frequent pattern mining in an exhaustive way. A need exists for a book on the topic that covers its different nuances exhaustively.

This book provides comprehensive surveys in the field of frequent pattern mining. Each chapter is designed as a survey that covers the key aspects of the field. The chapters are typically of the following types:

- Algorithms: In these cases, the key algorithms for frequent pattern mining are explored. These include join-based methods and pattern-growth methods.
- Variations: Many variations of frequent pattern mining, such as interesting patterns, negative patterns, constrained pattern mining, or compressed patterns, are explored in these chapters.
- Scalability: The large sizes of data in recent years have led to the need for big data and streaming frameworks for frequent pattern mining. Frequent pattern mining algorithms need to be modified to work with these advanced scenarios.
- Data Types: Different data types lead to different challenges for frequent pattern mining algorithms. Frequent pattern mining algorithms need to be able to work with complex data types, such as temporal or graph data.
- Applications: In these chapters, different applications of frequent pattern mining are explored. These include the application of frequent pattern mining methods to problems such as clustering and classification. Other more complex algorithms are also explored.

This book is, therefore, intended to provide an overview of the field of frequent pattern mining, as it currently stands. It is hoped that the book will serve as a useful guide for students, researchers, and practitioners.

Contents

1 An Introduction to Frequent Pattern Mining
Charu C. Aggarwal
   1 Introduction
   2 Frequent Pattern Mining Algorithms
      2.1 Frequent Pattern Mining with the Traditional Support Framework
      2.2 Interesting and Negative Frequent Patterns
      2.3 Constrained Frequent Pattern Mining
      2.4 Compressed Representations of Frequent Patterns
   3 Scalability Issues in Frequent Pattern Mining
      3.1 Frequent Pattern Mining in Data Streams
      3.2 Frequent Pattern Mining with Big Data
   4 Frequent Pattern Mining with Advanced Data Types
      4.1 Sequential Pattern Mining
      4.2 Spatiotemporal Pattern Mining
      4.3 Frequent Patterns in Graphs and Structured Data
      4.4 Frequent Pattern Mining with Uncertain Data
   5 Privacy Issues
   6 Applications of Frequent Pattern Mining
      6.1 Applications to Major Data Mining Problems
      6.2 Generic Applications
   7 Conclusions and Summary

2 Frequent Pattern Mining Algorithms: A Survey
Charu C. Aggarwal, Mansurul A. Bhuiyan and Mohammad Al Hasan
   1 Introduction
      1.1 Definitions
   2 Join-Based Algorithms
      2.1 Apriori Method
      2.2 DHP Algorithm
      2.3 Special Tricks for 2-Itemset Counting
      2.4 Pruning by Support Lower Bounding
      2.5 Hypercube Decomposition
   3 Tree-Based Algorithms
      3.1 AIS Algorithm
      3.2 TreeProjection Algorithms
      3.3 Vertical Mining Algorithms
   4 Recursive Suffix-Based Growth
      4.1 The FP-Growth Approach
      4.2 Variations
   5 Maximal and Closed Frequent Itemsets
      5.1 Definitions
      5.2 Frequent Maximal Itemset Mining Algorithms
      5.3 Frequent Closed Itemset Mining Algorithms
   6 Other Optimizations and Variations
      6.1 Row Enumeration Methods
      6.2 Other Exploration Strategies
   7 Reducing the Number of Passes
      7.1 Combining Passes
      7.2 Sampling Tricks
      7.3 Online Association Rule Mining
   8 Conclusions and Summary

3 Pattern-Growth Methods
Jiawei Han and Jian Pei
   1 Introduction
   2 FP-Growth: Pattern Growth for Mining Frequent Itemsets
   3 Pushing More Constraints in Pattern-Growth Mining
   4 PrefixSpan: Mining Sequential Patterns by Pattern Growth
   5 Further Development of Pattern Growth-Based Pattern Mining
   6 Conclusions

4 Mining Long Patterns
Feida Zhu
   1 Introduction
   2 Preliminaries
   3 A Pattern Lattice Model
   4 Pattern Enumeration Approach
      4.1 Breadth-First Approach
      4.2 Depth-First Approach
   5 Row Enumeration Approach
   6 Pattern Merge Approach
      6.1 Piece-wise Pattern Merge
      6.2 Fusion-style Pattern Merge
   7 Pattern Traversal Approach
   8 Conclusion

5 Interesting Patterns
Jilles Vreeken and Nikolaj Tatti
   1 Introduction
   2 Absolute Measures
      2.1 Frequent Itemsets
      2.2 Tiles
      2.3 Low Entropy Sets
   3 Advanced Methods
   4 Static Background Models
      4.1 Independence Model
      4.2 Beyond Independence
      4.3 Maximum Entropy Models
      4.4 Randomization Approaches
   5 Dynamic Background Models
      5.1 The General Idea
      5.2 Maximum Entropy Models
      5.3 Tile-based Techniques
      5.4 Swap Randomization
   6 Pattern Sets
      6.1 Itemsets
      6.2 Tiles
      6.3 Swap Randomization
   7 Conclusions

6 Negative Association Rules
Luiza Antonie, Jundong Li and Osmar Zaiane
   1 Introduction
   2 Negative Patterns and Negative Association Rules
   3 Current Approaches
   4 Associative Classification and Negative Association Rules
   5 Conclusions

7 Constraint-Based Pattern Mining
Siegfried Nijssen and Albrecht Zimmermann
   1 Introduction
   2 Problem Definition
      2.1 Constraints
   3 Level-Wise Algorithm
      3.1 Generic Algorithm
   4 Depth-First Algorithm
      4.1 Basic Algorithm
      4.2 Constraint-based Itemset Mining
      4.3 Generic Frameworks
      4.4 Implementation Considerations
   5 Languages
   6 Conclusions

8 Mining and Using Sets of Patterns through Compression
Matthijs van Leeuwen and Jilles Vreeken
   1 Introduction
   2 Foundations
      2.1 Kolmogorov Complexity
      2.2 MDL
      2.3 MDL in Data Mining
   3 Compression-based Pattern Models
      3.1 Pattern Models for MDL
      3.2 Code Tables
      3.3 Instances of Compression-based Models
   4 Algorithmic Approaches
      4.1 Candidate Set Filtering
      4.2 Direct Mining of Patterns that Compress
   5 MDL for Data Mining
      5.1 Classification
      5.2 A Dissimilarity Measure for Datasets
      5.3 Identifying and Characterizing Components
      5.4 Other Data Mining Tasks
      5.5 The Advantage of Pattern-based Models
   6 Challenges Ahead
      6.1 Toward Mining Structured Data
      6.2 Generalization
      6.3 Task- and/or User-specific Usefulness
   7 Conclusions

9 Frequent Pattern Mining in Data Streams
Victor E. Lee, Ruoming Jin and Gagan Agrawal
   1 Introduction
   2 Preliminaries
      2.1 Frequent Pattern Mining: Definition
      2.2 Data Windows
      2.3 Frequent Item Mining
   3 Frequent Itemset Mining Algorithms
      3.1 Mining the Full Data Stream
      3.2 Recently Frequent Itemsets
      3.3 Closed and Maximal Itemsets
      3.4 Mining Data Streams with Uncertain Data
   4 Mining Patterns Other than Itemsets
      4.1 Subsequences
      4.2 Subtrees and Semistructured Data
      4.3 Subgraphs
   5 Concluding Remarks

10 Big Data Frequent Pattern Mining
David C. Anastasiu, Jeremy Iverson, Shaden Smith and George Karypis
   1 Introduction
   2 Frequent Pattern Mining: Overview
      2.1 Preliminaries
      2.2 Basic Mining Methodologies
   3 Paradigms for Big Data Computation
      3.1 Principles of Parallel Algorithms
      3.2 Shared Memory Systems
      3.3 Distributed Memory Systems
   4 Frequent Itemset Mining
      4.1 Memory Scalability
      4.2 Work Partitioning
      4.3 Dynamic Load Balancing
      4.4 Further Considerations
   5 Frequent Sequence Mining
      5.1 Serial Frequent Sequence Mining
      5.2 Parallel Frequent Sequence Mining
   6 Frequent Graph Mining
      6.1 Serial Frequent Graph Mining
      6.2 Parallel Frequent Graph Mining
   7 Conclusion

11 Sequential Pattern Mining
Wei Shen, Jianyong Wang and Jiawei Han
   1 Introduction
   2 Problem Definition
   3 Apriori-based Approaches
      3.1 Horizontal Data Format Algorithms
      3.2 Vertical Data Format Algorithms
   4 Pattern Growth Algorithms
      4.1 FreeSpan
      4.2 PrefixSpan
   5 Extensions
      5.1 Closed Sequential Pattern Mining
      5.2 Multi-level, Multi-dimensional Sequential Pattern Mining
      5.3 Incremental Methods
      5.4 Hybrid Methods
      5.5 Approximate Methods
      5.6 Top-k Closed Sequential Pattern Mining
      5.7 Frequent Episode Mining
   6 Conclusions and Summary

12 Spatiotemporal Pattern Mining: Algorithms and Applications
Zhenhui Li
   1 Introduction
   2 Basic Concept
      2.1 Spatiotemporal Data Collection
      2.2 Data Preprocessing
      2.3 Background Information
   3 Individual Periodic Pattern
      3.1 Automatic Discovery of Periodicity in Movements
      3.2 Frequent Periodic Pattern Mining
      3.3 Using Periodic Pattern for Location Prediction
   4 Pairwise Movement Patterns
      4.1 Similarity Measure
      4.2 Generic Pattern
      4.3 Behavioral Pattern
      4.4 Semantic Patterns
   5 Aggregate Patterns over Multiple Trajectories
      5.1 Frequent Trajectory Pattern Mining
      5.2 Detection of Moving Object Cluster
      5.3 Trajectory Clustering
   6 Summary

13 Mining Graph Patterns
Hong Cheng, Xifeng Yan and Jiawei Han
   1 Introduction
   2 Frequent Subgraph Mining
      2.1 Problem Definition
      2.2 Apriori-Based Approach
      2.3 Pattern-Growth Approach
      2.4 Closed and Maximal Subgraphs
      2.5 Mining Subgraphs in a Single Graph
      2.6 The Computational Bottleneck
   3 Mining Significant Graph Patterns
      3.1 Problem Definition
      3.2 gboost: A Branch-and-Bound Approach
      3.3 gPLS: A Partial Least Squares Regression Approach
      3.4 LEAP: A Structural Leap Search Approach
      3.5 GraphSig: A Feature Representation Approach
   4 Mining Representative Orthogonal Graphs
      4.1 Problem Definition
      4.2 Randomized Maximal Subgraph Mining
      4.3 Orthogonal Representative Set Generation
   5 Mining Dense Graph Patterns
      5.1 Cliques and Quasi-Cliques
      5.2 K-Core and K-Truss
      5.3 Other Dense Subgraph Patterns
   6 Mining Graph Patterns in Streams
   7 Mining Graph Patterns in Uncertain Graphs
   8 Conclusions

14 Uncertain Frequent Pattern Mining
Carson Kai-Sang Leung
   1 Introduction
   2 The Probabilistic Model for Mining Expected Support-Based Frequent Patterns from Uncertain Data
   3 Candidate Generate-and-Test Based Uncertain Frequent Pattern
   4 Hyperlinked Structure-Based Uncertain Frequent Pattern Mining
   5 Tree-Based Uncertain Frequent Pattern Mining
      5.1 UF-growth
      5.2 UFP-growth
      5.3 CUF-growth
      5.4 PUF-growth
   6 Constrained Uncertain Frequent Pattern Mining
   7 Uncertain Frequent Pattern Mining from Big Data
   8 Streaming Uncertain Frequent Pattern Mining
      8.1 SUF-growth
      8.2 UF-streaming for the Sliding Window Model
      8.3 TUF-streaming for the Time-Fading Model
      8.4 LUF-streaming for the Landmark Model
      8.5 Hyperlinked Structure-Based Streaming Uncertain Frequent Pattern Mining
   9 Vertical Uncertain Frequent Pattern Mining
      9.1 U-Eclat: An Approximate Algorithm
      9.2 UV-Eclat: An Exact Algorithm
      9.3 U-VIPER: An Exact Algorithm
   10 Discussion on Uncertain Frequent Pattern Mining
   11 Extension: Probabilistic Frequent Pattern Mining
      11.1 Mining Probabilistic Heavy Hitters
      11.2 Mining Probabilistic Frequent Patterns
   12 Conclusions

15 Privacy Issues in Association Rule Mining
Aris Gkoulalas-Divanis, Jayant Haritsa and Murat Kantarcioglu
   1 Introduction
   2 Input Privacy
      2.1 Problem Framework
      2.2 Evolution of the Literature
   3 Output Privacy
      3.1 Terminology and Preliminaries
      3.2 Taxonomy of ARH Algorithms
      3.3 Heuristic and Exact ARH Algorithms
      3.4 Metrics and Performance Analysis
   4 Cryptographic Methods
      4.1 Horizontally Partitioned Data
      4.2 Vertically Partitioned Data
   5 Conclusions

16 Frequent Pattern Mining Algorithms for Data Clustering
Arthur Zimek, Ira Assent and Jilles Vreeken
   1 Introduction
   2 Generalizing Pattern Mining for Clustering
      2.1 Generalized Monotonicity
      2.2 Count Indexes
      2.3 Pattern Explosion and Redundancy
   3 Frequent Pattern Mining in Subspace Clustering
      3.1 Subspace Cluster Search
      3.2 Subspace Search
      3.3 Redundancy in Subspace Clustering
   4 Conclusions

17 Supervised Pattern Mining and Applications to Classification
Albrecht Zimmermann and Siegfried Nijssen
   1 Introduction
   2 Supervised Pattern Mining
      2.1 Explicit Class Labels
      2.2 Classes as Data Subsets
      2.3 Numerical Target Values
   3 Supervised Pattern Set Mining
      3.1 Local Evaluation, Local Modification
      3.2 Global Evaluation, Global Modification
      3.3 Local Evaluation, Global Modification
      3.4 Data Instance-Based Selection
   4 Classifier Construction
      4.1 Direct Classification
      4.2 Indirect Classification
   5 Summary

18 Applications of Frequent Pattern Mining
Charu C. Aggarwal
   1 Introduction
   2 Frequent Patterns for Customer Analysis
   3 Frequent Patterns for Clustering
   4 Frequent Patterns for Classification
   5 Frequent Patterns for Outlier Analysis
   6 Frequent Patterns for Indexing
   7 Web Mining Applications
      7.1 Web Log Mining
      7.2 Web Linkage Mining
   8 Frequent Patterns for Text Mining
   9 Temporal Applications
   10 Spatial and Spatiotemporal Applications
   11 Software Bug Detection
   12 Chemical and Biological Applications
      12.1 Chemical Applications
      12.2 Biological Applications
   13 Resources for the Practitioner
   14 Conclusions and Summary

Contributors

Charu C. Aggarwal: IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
Gagan Agrawal: Ohio State University, Columbus, OH, USA
Luiza Antonie: University of Guelph, Guelph, Canada
David C. Anastasiu: Department of Computer Science and Engineering, University of Minnesota, Minneapolis, USA
Ira Assent: Department of Computer Science, Aarhus University, Aarhus, Denmark
Mansurul A. Bhuiyan: Indiana University Purdue University, Indianapolis, IN
Hong Cheng: Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong, China
Aris Gkoulalas-Divanis: IBM Research-Ireland, Damastown Industrial Estate, Mulhuddart, Dublin, Ireland
Jiawei Han: University of Illinois at Urbana-Champaign, Urbana, IL, USA; Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, USA
Jayant Haritsa: Database Systems Lab, Indian Institute of Science (IISc), Bangalore, India
Mohammad Al Hasan: Indiana University Purdue University, Indianapolis, IN
Jeremy Iverson: Department of Computer Science and Engineering, University of Minnesota, Minneapolis, USA
Ruoming Jin: Kent State University, Kent, OH, USA
Murat Kantarcioglu: UTD Data Security and Privacy Lab, University of Texas at Dallas, Texas, USA
Victor E. Lee: John Carroll University, University Heights, OH, USA
Matthijs van Leeuwen: KU Leuven, Leuven, Belgium
Carson Kai-Sang Leung: University of Manitoba, Winnipeg, MB, Canada
Jundong Li: University of Alberta, Alberta, Canada
Zhenhui Li: Pennsylvania State University, University Park, USA
George Karypis: Department of Computer Science and Engineering, University of Minnesota, Minneapolis, USA
Siegfried Nijssen: KU Leuven, Leuven, Belgium; Universiteit Leiden, Leiden, The Netherlands
Jian Pei: Simon Fraser University, Burnaby, BC, Canada
Shaden Smith: Department of Computer Science and Engineering, University of Minnesota, Minneapolis, USA
Wei Shen: Tsinghua University, Beijing, China
Nikolaj Tatti: HIIT, Department of Information and Computer Science, Aalto University, Helsinki, Finland
Jilles Vreeken: Max-Planck Institute for Informatics and Saarland University, Saarbrücken, Germany
Jianyong Wang: Tsinghua University, Beijing, China
Xifeng Yan: Department of Computer Science, University of California at Santa Barbara, Santa Barbara, USA
Osmar Zaiane: University of Alberta, Alberta, Canada
Feida Zhu: Singapore Management University, Singapore, Singapore
Albrecht Zimmermann: INSA Lyon, Villeurbanne CEDEX, France
Arthur Zimek: Ludwig-Maximilians-Universität München, Munich, Germany
Chapter 1
An Introduction to Frequent Pattern Mining

Charu C. Aggarwal

The problem of frequent pattern mining has been widely studied in the literature because of its numerous applications to a variety of data mining problems such as clustering and classification. In addition, frequent pattern mining also has numerous applications in diverse domains such as spatiotemporal data, software bug detection, and biological data. The algorithmic aspects of frequent pattern mining have been explored very widely. This chapter provides an overview of these methods, as they relate to the organization of this book.

Keywords: Frequent pattern mining, Association rules

C. C. Aggarwal, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA
e-mail: charu@us.ibm.com

C. C. Aggarwal, J. Han (eds.), Frequent Pattern Mining, DOI 10.1007/978-3-319-07821-2_1, © Springer International Publishing Switzerland 2014

1 Introduction

The problem of frequent pattern mining is that of finding relationships among the items in a database. The problem can be stated as follows.

Given a database with transactions, determine all patterns that are present in at least a fraction s of the transactions.

The fraction s is referred to as the minimum support. The parameter s can be expressed either as an absolute number, or as a fraction of the total number of transactions in the database. Each transaction can be considered a sparse binary vector, or a set of discrete values representing the identifiers of the binary attributes that are instantiated to the value of 1. The problem was originally proposed in the context of market basket data in order to find frequent groups of items that are bought together [10]. Thus, in this scenario, each attribute corresponds to an item in a superstore, and the binary value represents whether or not it is present in the transaction. Since the problem was originally proposed, it has been applied to numerous other applications in the context of data mining, Web log mining, sequential pattern mining, and software bug analysis.

In the original model of frequent pattern mining [10], the problem of finding association rules was also proposed, which is closely related to that of frequent patterns. In general, association rules can be considered a second-stage output, derived from frequent patterns. Consider the sets of items U and V. The rule U ⇒ V is considered an association rule at minimum support s and minimum confidence c, when the following two conditions hold true:

1. The set U ∪ V is a frequent pattern.
2. The ratio of the support of U ∪ V to that of U is at least c.

The minimum confidence c is always a fraction no greater than 1 because the support of the set U ∪ V is never greater than that of U. Because the first step of finding frequent patterns is usually the computationally more challenging one, most of the research in this area is focused on the former. Nevertheless, some computational and modeling issues also arise during the second step, especially when the frequent pattern mining problem is used in the context of other data mining problems such as classification. Therefore, this book will also discuss various aspects of association rule mining along with those of frequent pattern mining.

A related problem is that of sequential pattern mining, in which an order is present in the transactions [5]. Temporal order is quite natural in many scenarios such as customer buying behavior, because the items are bought at specific time stamps, and often follow a natural temporal order. In these cases, the problem is redefined to that of sequential pattern mining, in which it is desirable to determine relevant and frequent sequences of items. Some examples of important applications are as follows:

- Customer Transaction Analysis: In this case, the transactions represent sets of items that co-occur in customer buying behavior. It is desirable to determine frequent patterns of buying behavior, because they can be used for making decisions about shelf stocking or recommendations.
- Other Data Mining Problems: Frequent pattern mining can be used to enable other major data mining problems such as classification, clustering and outlier analysis [11, 52, 73]. This is because the use of frequent patterns is so fundamental in the analytical process for a host of data mining problems.
- Web Mining: In this case, the Web logs may be processed in order to determine important patterns in the browsing behavior [24, 63]. This information can be used for Web site design, recommendations, or even outlier analysis.
- Software Bug Analysis: Executions of software programs can be represented as graphs with typical patterns. Logical errors in these programs often show up as specific kinds of patterns that can be mined for further analysis [41, 51].
- Chemical and Biological Analysis: Chemical and biological data are often represented as graphs and sequences. A number of methods have been proposed in the literature for using the frequent patterns in such graphs for a wide variety of applications in different scenarios [8, 29, 41, 42, 69–75].

Since the publication of the original article on frequent pattern mining [10], numerous techniques have been proposed both for frequent and sequential pattern mining [5, 4, 13, 33, 62]. Furthermore, many variants of frequent pattern mining, such as sequential pattern mining, constrained pattern mining, and graph mining, have been proposed in the literature.

Frequent pattern mining is a rather broad area of research, and it relates to a wide variety of topics, at least from an application-specific perspective. Broadly speaking, the research in the area falls into one of four different categories:

- Technique-centered: This area relates to the determination of more efficient algorithms for frequent pattern mining. A wide variety of algorithms have been proposed in this context that use different enumeration tree exploration strategies, and different data representation methods. In addition, numerous variations such as the determination of compressed patterns are of great interest to researchers in data mining.
- Scalability issues: The scalability issues in frequent pattern mining are very significant. When the data arrives in the form of a stream, multi-pass methods can no longer be used. When the data is distributed or very large, then parallel or big-data frameworks must be used. These scenarios necessitate different types of algorithms.
- Advanced data types: Numerous variations of frequent pattern mining have been proposed for advanced data types. These variations have been utilized in a wide variety of tasks. In addition, different data domains such as graph data, tree structured data, and streaming data often require specialized algorithms for frequent pattern mining. Issues of interestingness of the patterns are also quite relevant in this context [6].
- Applications: Frequent pattern mining has numerous applications to other major data mining problems, Web applications, software bug analy
sis,andchemicalandbiologicalapplications.Asignicantamountofresearchhasbeendevotedtoapplicationsbecausetheseareparticularlyimportantinthecontextoffrequentpatternmining.Thisbookwillcoverallthesedifferentareascomprehensively,soastoprovideacomprehensiveoverviewofthisbroaderarea.Thischapterisorganizedasfollows.Thenextsectiondiscussesalgorithmsforthefrequentpatternminingproblem,anditsbasicvariations.Section3discussesscalabilityissuesforfrequentpatternmining.FrequentpatternminingmethodsareadvanceddatatypesarediscussedinSect.4.PrivacyissuesoffrequentpatternminingareaddressedinSect.5.TheapplicationsarediscussedinSect.6.Section7givestheconclusionsandsummary.2FrequentPatternMiningAlgorithmsMostofthealgorithmsforfrequentpatternmininghavebeendesignedwiththetra-ditionalsupport-condenceframework,orforspecializedframeworksthatgenerate C.C.Aggarwalmoreinterestingkindsofpatterns.Thesespecializedframeworkmayusediffer-enttypesofinterestingnessmeasures,modelnegativerules,oruseconstraint-basedframeworkstodeterminemorerelevantpatterns.2.1FrequentPatternMiningwiththeTraditionalSupportFrameworkThesupportframeworkisdesignedtodeterminepatternsforwhichtherawfrequencyisgreaterthanaminimumthreshold.Althoughthisisasimplisticwayofdeningfrequentpatterns,thismodelhasanalgorithmicallyconvenientproperty,whichisreferredtoasthelevel-wiseproperty.Thelevel-wisepropertyoffrequentpatternmin-ingisalgorithmicallycrucialbecauseitenablesthedesignofabottom-upapproachtoexploringthespaceoffrequentpatterns.Inotherwords,a(1)-patternmaynotbefrequentwhenanyofitssubsetsisnotfrequent.Thisisacrucialobservationthatisusedbyvirtuallyalltheefcientfrequentpatternminingalgorithms.Sincetheproblemoffrequentpatternminingwasrstproposed,numerousal-gorithmshavebeenproposedinordertomakethesolutionstotheproblemmoreefcient.Thisareaofresearchissopopularthatanannualworkshopwasde-votedtoimplementationsoffrequentpatternminingforafewyears.Thissite[77]isnoworganizedasarepository,wheremanyefcientimplementationsoffrequentpatternminingareavaila
ble.Thetechniquesforfrequentpatternminingstartedwith-likejoin-basedmethods.Inthesealgorithms,candidateitemsetsaregener-atedinincreasingorderofitemsetsize.Thegenerationinincreasingorderofitemsetsizeisreferredtoaslevel-wiseexploration.Theseitemsetsarethentestedagainsttheunderlyingtransactiondatabaseandthefrequentonessatisfyingtheminimumsupportconstraintareretainedforfurtherexploration.Eventually,itwasrealizedthat-likemethodscouldbemoresystematicallyexploredasenumerationtrees.ThisstructurewillbeexplainedindetailinChap.2,andprovidesamethod-ologytoperformsystematicandnon-redundantfrequentpatternexploration.Theenumerationtreeprovidesamoreexibleframeworkforfrequentitemsetminingbecausethetreecanbeexploredinavarietyofdifferentstrategiessuchasdepth-rst,breadth-rst,orotherhybridstrategies[13].Onepropertyofthebreadth-rststrategyisthatlevel-wisepruningcanbeused,whichisnotpossiblewithotherstrategies.Nevertheless,strategiessuchasdepth-rstsearchhaveotheradvantages,especiallyformaximalpatternmining.Thisobservationforthecaseofmaximalpatternminingwasrststatedin[12].Thisisbecauselongpatternsarediscoveredearly,andtheycanbeusedfordownwardclosure-basedpruningoflargepartsoftheenumerationtreethatarealreadyknowntobefrequent.Itshouldbepointedout,thatforthecasewherefrequentpatternsaremined,theorderofexplorationofanenumerationtreedoesnotaffectthenumberofcandidatesthatareexploredbecausethesizeoftheenumerationtreeisxed. 
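The level-wise, join-based exploration described above can be sketched in a few lines of Python. This is a minimal illustration rather than any of the algorithms cited in this chapter: the function name `apriori_sketch`, the toy transactions, and the use of an absolute support count (an itemset is treated as frequent when it occurs in at least `min_count` transactions) are all assumptions made for the example.

```python
from itertools import combinations


def apriori_sketch(transactions, min_count):
    """Return {itemset: support count} for all frequent itemsets (level-wise)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count the individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 1
    while current:
        # Join step: merge pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Prune step: every k-subset of a candidate must itself be frequent.
            if all(frozenset(sub) in current for sub in combinations(union, k)):
                candidates.add(union)
        # Count step: one pass over the transactions for this level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy database of five transactions.
toy = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Milk"},
    {"Butter", "Milk"},
    {"Bread", "Butter", "Milk"},
]
result = apriori_sketch(toy, min_count=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On this toy data, each single item and each pair occurs in at least three transactions, whereas the 3-itemset occurs in only two, so the exploration ends after the third level. Each level requires one pass over the transactions.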
Join-based algorithms are always level-wise, and can be viewed as equivalent to breadth-first enumeration tree exploration. The algorithm proposed in the first frequent pattern mining paper [10] was an enumeration-tree based algorithm, whereas the second algorithm proposed was referred to as Apriori, and was a join-based algorithm [4]. Both algorithms are level-wise algorithms. Subsequently, many algorithms have been proposed in order to improve the implementations based on the enumeration tree paradigm with the use of techniques such as lookahead [17], depth-first search [12, 13, 33], and vertical exploration [62]. Some of these methods, such as TreeProjection, DepthProject, and FP-growth [33], use a projection strategy in which smaller transaction databases are explored at lower levels of the tree.

One of the challenges of frequent pattern mining is that a large number of redundant patterns are often mined. For example, every subset of a frequent pattern is also guaranteed to be frequent, and by mining a maximal itemset, one is assured that the other frequent patterns can also be generated from this smaller set. Therefore, one possibility is to mine for only maximal itemsets [17]. However, the mining of maximal itemsets loses information about the exact value of support of the subsets of maximal patterns. Therefore, a further refinement would be to find closed itemsets [58, 74]. Closed frequent itemsets are defined as frequent patterns, no superset of which has the same frequency as that itemset. By mining closed frequent itemsets, it is possible to significantly reduce the number of patterns found, without losing any information about the support level. Closed patterns can be viewed as the maximal patterns from each group of equi-support patterns (i.e., patterns with the same support). All maximal patterns are, therefore, closed.

The depth-first method has been shown to have a number of advantages in maximal pattern mining [12], because of the greater effectiveness of the pruning-based lookaheads in the depth-first strategy. Different techniques for frequent pattern mining will be discussed in Chaps. 2 and 3. The former chapter will generally focus on frequent pattern mining algorithms, whereas the latter chapter will focus on pattern-growth algorithms. An additional chapter with greater detail has been devoted to pattern-growth methods, because they are considered a state-of-the-art technique in frequent pattern mining.

The efficiency in frequent pattern mining algorithms can be gained in several ways:

1. Reducing the size of the candidate search space, with the use of pruning methods, such as maximality pruning. The notion of closure can also be used to prune large parts of the search space. However, these methods often do not exhaustively return the full set of frequent patterns. Many of these methods return condensed representations such as maximal patterns or closed patterns.
2. Improving the efficiency of counting, with the use of database projection. Methods such as TreeProjection speed up the rate at which each pattern is counted, by reducing the size of the database with respect to which patterns are compared.
3. Using more efficient data structures, such as vertical lists, or an FP-Tree for more compressed database representation. In frequent pattern mining, both memory and computational speeds can be improved by judicious choice of data structures.

A particular scenario of interest is one in which the patterns to be mined are very long. In such cases, the number of subsets of frequent patterns can be extremely large. Therefore, a number of techniques need to be designed in order to mine very long patterns. In such cases, a variety of methods are used to explore the long patterns early, so that their subsets can be pruned effectively. The scenario of long pattern generation is discussed in detail in Chap. 4, though it is also discussed to some extent in the earlier Chaps. 2 and 3.

2.2 Interesting and Negative Frequent Patterns

A major challenge in frequent pattern mining is that the rules found may often not be very interesting, when quantifications such as support and confidence are used. This is because such quantifications do not normalize for the original frequency of the underlying items. For example, an item that occurs very rarely in the underlying database would naturally also occur in itemsets with lower frequency. Therefore, the frequency often does not tell us much about the likelihood of items to co-occur together, because of the biases associated with the frequencies of the individual items. Therefore, numerous methods have been proposed in the literature for finding interesting frequent patterns that normalize for the underlying item frequencies [6, 26]. Methods for finding interesting frequent patterns are discussed in Chap. 5. The issue of interestingness is also related to compressed representations of patterns such as closed or maximal itemsets. These issues are also discussed in the chapter.

In negative association rule mining, we attempt to determine rules such as Bread ⇒ ¬Butter, where the symbol ¬ indicates negation. Therefore, in this case, ¬Butter becomes a pseudo-item denoting a negative item. One possibility is to add negative items to the data, and perform the mining in the same way as one would determine rules in the support-confidence framework. However, this is not a feasible solution. This is because traditional support frameworks are not designed for cases where an item is present in the data 98% of the time. This is the case for negative items. For example, most transactions may not contain the item Butter, and therefore even positively correlated items may appear as negative rules. For example, the rule Bread ⇒ ¬Butter may have confidence greater than 50%, even though Bread is clearly correlated in a positive way with Butter. This is because the item ¬Butter may have an even higher support of 98%.

The issue of finding negative patterns is closely related to that of finding interesting patterns in the data [6] because one is looking for patterns that satisfy the support requirement in an interesting way. This relationship between the two problems tends to be under-emphasized in the literature, and the problem of negative pattern mining is often treated independently from interesting pattern mining. Some frameworks, such as collective strength, are designed to address both issues simultaneously. Methods for negative pattern mining are addressed in Chap. 6. The relationship between interesting pattern mining and negative pattern mining will be discussed in the same chapter.
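The Bread and Butter example above can be made concrete with a small calculation. The counts below are hypothetical and chosen only so that Butter is rare. Lift (the confidence of a rule divided by the support of its consequent) is used here as one common way of normalizing for item frequencies; it is an assumption of this sketch and not necessarily the measure used in the later chapters.

```python
def rule_measures(n, n_x, n_y, n_xy):
    """Confidence and lift of X => Y and X => not-Y from raw counts.

    n: number of transactions; n_x, n_y: transactions containing X and Y;
    n_xy: transactions containing both X and Y.
    """
    conf_pos = n_xy / n_x             # confidence of X => Y
    conf_neg = (n_x - n_xy) / n_x     # confidence of X => not-Y
    sup_y = n_y / n                   # support of Y
    sup_not_y = (n - n_y) / n         # support of the pseudo-item not-Y
    return {
        "conf_pos": conf_pos,
        "conf_neg": conf_neg,
        "support_not_y": sup_not_y,
        "lift_pos": conf_pos / sup_y,
        "lift_neg": conf_neg / sup_not_y,
    }


# Hypothetical counts: 1000 transactions, Bread in 30, Butter in 20, both in 12.
measures = rule_measures(n=1000, n_x=30, n_y=20, n_xy=12)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these counts, the rule Bread ⇒ ¬Butter reaches a confidence above 50% only because ¬Butter has a support of 98%. Its lift is below 1, whereas the lift of Bread ⇒ Butter is far above 1, which reflects the positive correlation that the raw confidence hides.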
2.3 Constrained Frequent Pattern Mining

Off-the-shelf frequent pattern mining algorithms discover a large number of patterns which are not useful when it is desired to determine patterns on the basis of more refined criteria. Frequent pattern mining methods are often particularly useful in the context of constrained applications, in which rules satisfying particular criteria are discovered. For example, one may desire specific items to be present in the rule. One solution is to first mine all the itemsets, and then enable online mining from this set of base patterns [3]. However, pushing constraints directly into the mining process has several advantages. This is because when constraints are pushed directly into the mining process, the mining can be performed at much lower support levels than can be performed by using a two-phase approach. This is especially the case when a large number of intermediate candidates can be pruned by the constraint-based pattern mining algorithm.

A variety of arbitrary constraints may also be present in the patterns. The major problem with such methods is that the constraints may result in the violation of the downward closure property. Because most frequent pattern mining algorithms depend crucially on this property, its violation is a serious issue. Nevertheless, many constraints have specialized properties because of which specialized algorithms can be developed. Methods for constrained frequent pattern mining have been discussed in [55, 57, 60]. Constrained methods have also been developed for the sequential pattern mining problem [31, 61]. In real applications, the output of the vanilla frequent pattern mining problem may be too large, and it is only by pushing constraints into the pattern mining process that useful application-specific patterns can be found. Constrained frequent pattern mining methods are closely related to the problem of pattern-based classification, because the latter problem requires us to discover discriminative patterns from the underlying data. Methods for constrained frequent pattern mining will be discussed in Chap. 2.

2.4 Compressed Representations of Frequent Patterns

A major problem in frequent pattern mining algorithms is that the volume of the mined patterns is often extremely large. This scenario creates numerous challenges for using these patterns in a meaningful way. Furthermore, different kinds of redundancy are present in the mined patterns. For example, maximal patterns imply the presence of all their subsets in the data. There is some information loss in terms of the exact support values of these subsets. Therefore, if it is not needed to preserve the values of the support across the patterns, then the determination of concise representations can be very useful.

A particularly interesting form of concise representation is that of closed patterns [56]. An itemset is said to be closed if none of its supersets have the same support. Therefore, by determining all the closed frequent patterns, one can derive not only the exhaustive set of frequent itemsets, but also their supports. Note that support values are lost by maximal pattern mining. In other words, the set of maximal patterns cannot be used to derive the support values of missing subsets. However, the support values of closed frequent itemsets can be used to derive the support values of missing subsets. Many interesting methods [58, 67, 74] have been designed for identifying frequent closed patterns. The general principle of determining frequent closed patterns has been generalized to that of determining δ-free sets [18]. This issue is closely related to that of mining all non-derivable frequent itemsets [20]. A survey on this topic may be found in [21]. These different forms of compression are discussed in Chaps. 2 and 5.

Finally, a formal way of viewing compression is from the perspective of information-theoretic models. Information-theoretic models are designed for compressing different kinds of data, and can therefore be used to compress itemsets as well. This basic principle has been used for methods such as the one in [66]. The problem of determining compressed representations of frequent itemsets is discussed in Chap. 8. This chapter focusses mostly on the information-theoretic issues of frequent itemset compression.

3 Scalability Issues in Frequent Pattern Mining

In the modern era, the ability to collect large amounts of data has increased significantly because of advances in hardware and software platforms. The amount of data is often so large that specialized methods are required for the mining process. The streaming and big-data architectures are slightly different and pose different challenges for the mining process. The following discussion will address each of these.

3.1 Frequent Pattern Mining in Data Streams

In recent years, data streams have become very popular because of the advances in hardware and software technology that can collect and transmit data continuously over time. In such cases, the major constraint on data mining algorithms is to execute the algorithms in a single pass. This can be significantly challenging because frequent and sequential pattern mining methods are generally designed as level-wise methods. There are two variants of frequent pattern mining for data streams:

• Frequent Items or Heavy Hitters: In this case, frequent 1-itemsets need to be determined from a data stream in a single pass. Such an approach is generally needed when the total number of distinct items is too large to be held in main memory. Typically, sketch-based methods are used in order to create a compressed data structure that maintains approximate counts of the items [23, 27].
• Frequent itemsets: In this case, it is not assumed that the number of distinct items is too large. Therefore, the main challenge in this case is computational, because
the typical frequent pattern mining methods are multi-pass methods. Multiple passes are clearly not possible in the context of data streams [22, 39].

The streaming scenario also presents numerous challenges in the context of data of advanced types. For example, graph streams are often encountered in the context of network data. In such cases, methods need to be designed for determining dense groups of nodes in real time [16]. Methods for mining frequent items and itemsets in data streams are discussed in Chap. 9.

3.2 Frequent Pattern Mining with Big Data

The big data scenario poses numerous challenges for the problem of frequent pattern mining. A major problem arises when the data is large enough to be stored in a distributed way. Therefore, significant costs are incurred in shuffling around data or intermediate results of the mining process across the distributed nodes. These costs are also referred to as data transfer costs. When data sets are very large, the algorithms need to be designed to take into account both the disk access constraint and the data transfer costs. In addition, many distributed frameworks such as MapReduce [28] require specialized algorithms for frequent pattern mining. The focus of the big-data framework is somewhat different from streams, in that it is closely related to the issue of shuffling large amounts of data around for the mining process. Interestingly, it is sometimes easier to process the data in a single pass in streaming fashion than when it has already been stored in distributed frameworks where access costs become a major issue. Algorithms for frequent pattern mining with big data are discussed in detail in Chap. 10. This chapter discusses both the parallel algorithms and the big-data algorithms that are based on the MapReduce framework.

4 Frequent Pattern Mining with Advanced Data Types

Although the frequent pattern mining problem is naturally defined on sets, it can be extended to various advanced data types. The most natural extension of frequent pattern mining algorithms is to the case of temporal data. This was one of the earliest proposed extensions and is referred to as sequential pattern mining. Subsequently, the problem has been generalized to other advanced data types, such as spatiotemporal data, graphs, and uncertain data. Many of the developed algorithms are basic variations of the frequent pattern mining problem. In general, the basic frequent pattern mining algorithms need to be modified carefully to address the variations required by the advanced data types.

4.1 Sequential Pattern Mining

The problem of sequential pattern mining is closely related to that of frequent pattern mining. The major difference in this case is that records contain baskets of items arranged sequentially. For example, each record may be of the following form:

⟨{Bread}, {Butter, Cake}, {Chicken, Yogurt}⟩

In this case, each entity within braces is a basket of items that are bought together and, therefore, do not have a temporal ordering. This basket of items is collectively referred to as an event. The length of a pattern is equal to the sum of the lengths of the complex items in it. For example, the record above is a 5-pattern, even though it has 3 events. The different complex entities (or events) do have a temporal ordering. In the aforementioned example, it is clear that {Bread} has been bought earlier than {Butter, Cake}. The problem of sequential pattern mining is that of finding sequences of events that are present in at least a fraction s of the underlying records [5]. For example, the sequence ⟨{Bread}, {Butter}, {Chicken}⟩ is present in the aforementioned record, but not the sequence ⟨{Bread}, {Cake}, {Butter}⟩. The pattern may also contain complex events. For example, the pattern ⟨{Bread}, {Chicken, Yogurt}⟩ is present.

The problem of sequential pattern mining is closely related to that of frequent pattern mining, except that it is somewhat more complex to account for both the presence of complex baskets of items in the database and the temporal ordering of the individual baskets. An extension of a sequential pattern may either be a set-wise extension of a complex item, or a temporal extension with an entirely new event. This affects the nature of the extensions of items in the transactions. Numerous modifications of known frequent pattern mining methods, such as Apriori and its variants, TreeProjection and its variants [32], and FP-growth and its variants, can be used in order to solve the sequential pattern mining problem [5, 35, 36]. The enumeration tree concept can also be generalized to sequential pattern mining [32]. Therefore, in principle, all enumeration tree algorithms can be generalized to sequential pattern mining. This is a powerful ability because, as we will see in Chap. 2, all frequent pattern mining algorithms are, implicitly or explicitly, enumeration-tree algorithms. Sequential pattern mining methods will be discussed in detail in Chap. 11.

4.2 Spatiotemporal Pattern Mining

The advent of GPS-enabled mobile phones and wearable sensors has enabled the collection of large amounts of spatiotemporal data. Such data may include trajectory data, location-tagged images, or other content. In some cases, the spatiotemporal data exists in the form of RFID data [37]. The mining of patterns from such spatiotemporal data provides numerous insights in a wide variety of applications, such as traffic control and social sensing [2]. Frequent patterns are also used for trajectory clustering, classification, and outlier analysis [38, 45–48]. Many trajectory analysis problems can be approximately transformed to sequential pattern mining with the use of appropriate transformations. Algorithms for spatiotemporal pattern mining are discussed in Chap. 12.

4.3 Frequent Patterns in Graphs and Structured Data

Many kinds of chemical and biological data, XML data, software program traces, and Web browsing behaviors can be represented as structured graphs. In these cases, frequent pattern mining is very useful for making inferences in such data. This is because frequent structural patterns provide important insights about the graphs. For example, specific chemical structures result in particular properties, specific program structures result in software bugs, and so on. Such patterns can even be used for clustering and classification of graphs [14, 73].

A variety of methods for structural frequent pattern mining are discussed in [41, 69–71, 72]. A major problem in the context of graphs is the problem of isomorphism, because of which there are multiple ways to match two graphs. An Apriori-like algorithm can be developed for graph pattern mining. However, because of the complexity of graphs and also because of issues related to isomorphism, the algorithms are more complex. For example, in an Apriori-like algorithm, pairs of graphs can be joined in multiple ways. Pairs of graphs can be joined when they have (k − 1) nodes in common, or when they have (k − 1) edges in common. Furthermore, either kind of join between a pair of graphs can have multiple results. The counting process is also more challenging because of isomorphism. Pattern mining in graphs becomes especially challenging when the graphs are large, and the isomorphism problem becomes significant. Another particularly difficult case is the streaming scenario [16], where one has to determine dense patterns in the graph stream. Typically, these problems cannot be solved exactly, and approximations are required.

Frequent pattern mining in graphs has numerous applications. In some cases, these methods can be used in order to perform classification and clustering of structured data [14, 73]. Graph patterns are used for chemical and biological data analysis, and software bug detection in computer programs. Methods for finding frequent patterns in graphs are discussed in Chap. 13. The applications of graph pattern mining are discussed in Chap. 18.

4.4 Frequent Pattern Mining with Uncertain Data

Uncertain or probabilistic data has become increasingly common over the last few years, as methods have been designed in order to collect data with very low quality. The attribute values in such data sets are probabilistic, which implies that the values are represented as probability distributions. Numerous algorithms have been proposed in the literature for uncertain frequent pattern mining [15], and a computational evaluation of the different techniques is provided in [64]. Many algorithms such as FP-growth are harder to generalize to uncertain data [15] because of the difficulty in storing probability information with the FP-Tree. Nevertheless, as the work in [15] shows, other related methods such as the one in [59] can be generalized easily to the case of uncertain data. Uncertain frequent pattern mining methods have also been extended to the case of graph data [76]. A variant of uncertain graph pattern mining discovers highly reliable subgraphs [40]. Highly reliable subgraphs are subgraphs that are hard to disconnect in spite of the uncertainty associated with the edges. A discussion of the different methods for frequent pattern mining with uncertain data is provided in Chap. 14.

5 Privacy Issues

Privacy has increasingly become a topic of concern in recent years because of the wide availability of personal data about individuals [7]. This has often led to reluctance to share data, to sharing it in a constrained way, or to sharing downgraded versions of the data. The additional constraints and downgrading translate to challenges in discovering frequent patterns. In the context of frequent pattern and association rule mining, the primary challenges are as follows:

1. When privacy-preservation methods such as randomization are used, it becomes a challenge to discover associations from the underlying data. This is because a significant amount of noise has been added to the data, and it is often difficult to discover the association rules in the presence of this noise. Therefore, one class of association rule mining methods [30] proposes effective methods to perturb the data, so that meaningful patterns may be discovered while retaining privacy of the perturbed data.
2. In some cases, the output of a privacy-preserving data mining algorithm can lead to violation of privacy. This is because association rules can reveal sensitive information about individuals when they relate sensitive attributes to other kinds of attributes. Therefore, one class of methods focusses on the problem of rule hiding.
3. In many cases, the data to be mined is stored in a distributed way by competitors who may wish to determine global insights without, at the same time, revealing their local insights. This problem is referred to as that of distributed privacy preservation [25]. The data may be either horizontally partitioned (across different records) or vertically partitioned (across attributes). Each of these forms of partitioning requires different methods for distributed mining.

Methods for privacy-preserving association rule mining are addressed in Chap. 15.
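The effect of randomization on support estimation, mentioned in the first challenge above, can be illustrated with a simple bit-flipping model. This is only a sketch under strong assumptions, and it is not the perturbation scheme of [30]: each transaction is assumed to report the presence or absence of a single item incorrectly with a known probability `flip_prob`, and both function names are hypothetical.

```python
def observed_support(true_support, flip_prob):
    """Expected support of an item after each presence bit is flipped
    independently with probability flip_prob."""
    kept = true_support * (1.0 - flip_prob)        # present and reported present
    injected = (1.0 - true_support) * flip_prob    # absent but reported present
    return kept + injected


def estimate_true_support(observed, flip_prob):
    """Invert the expected distortion to estimate the original support."""
    if flip_prob == 0.5:
        # At 0.5 the perturbed bit carries no information about the original.
        raise ValueError("support cannot be recovered when flip_prob is 0.5")
    return (observed - flip_prob) / (1.0 - 2.0 * flip_prob)


# Hypothetical values: an item with 30% support, bits flipped 10% of the time.
noisy = observed_support(0.30, 0.10)
print(round(noisy, 4))
print(round(estimate_true_support(noisy, 0.10), 4))
```

The correction recovers the true support only in expectation. On a finite sample, the division by (1 − 2 · flip_prob) amplifies sampling noise as the flip probability approaches 0.5, which is one reason why mining perturbed data is difficult.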
6 Applications of Frequent Pattern Mining

Frequent pattern mining has applications of two types. The first type of application is to other major data mining problems such as clustering, outlier detection, and classification. Frequent patterns are often used to determine relevant clusters from the underlying data. In addition, rule-based classifiers are often constructed with the use of frequent pattern mining methods. Frequent pattern mining is also used in generic applications, such as Web log analytics, software bug analysis, and the analysis of chemical and biological data.

6.1 Applications to Major Data Mining Problems

Frequent pattern mining methods can also be applied to other major data mining problems such as clustering [9, 19], classification, and outlier analysis. For example, frequent pattern mining methods are often used for subspace clustering [11], by discretizing the quantitative attributes, and then finding patterns from these discrete values. Each such pattern, therefore, corresponds to a rectangular region in a subspace of the data. These rectangular regions can then be integrated together in order to create a more comprehensive subspace representation.

Frequent pattern mining is also applied to problems such as classification, in which rules are generated by using patterns on the left-hand side of the rule, and the class variable on the right-hand side of the rule [52]. The main goal here is to find patterns for the purpose of classification, rather than simply patterns that satisfy the support requirements. Such methods have also been extended to structured XML data [73] by finding discriminative graph-structured patterns. In addition, sequential pattern mining methods can be applied to other temporal mining methods such as event detection [43, 44, 53, 54] and sequence classification [68]. Frequent pattern mining has also been applied to the problem of outlier analysis [1], by determining deviations from the expected patterns in the underlying data.

Methods for clustering based on frequent pattern mining are discussed in Chap. 16, while rule-based classification is discussed in Chap. 17. It should be pointed out that constrained frequent pattern mining is closely related to the problem of classification with frequent patterns, and therefore both are discussed in the same chapter.

6.2 Generic Applications

Frequent pattern mining has applications to a variety of problems such as clustering, classification, and event detection. In addition, specific application areas such as Web mining and software bug detection can also benefit from frequent pattern mining methods. In the context of Web mining, numerous methods have been proposed for finding useful patterns from Web logs in order to make recommendations [63]. Such techniques can also be used to determine outliers from Web log sequences [1]. Frequent patterns are also used for trajectory classification and outlier analysis [48, 49]. Frequent pattern mining methods can also be used in order to determine relevant rules and patterns in spatial data, as they relate to spatial and non-spatial properties of objects. For example, an association rule could be created from the relationships of land temperatures of nearby geographical locations. In the context of spatiotemporal data, the relationships between the motions of different objects could be used to create spatiotemporal frequent patterns. Frequent pattern mining methods have been used for finding patterns in biological and chemical data [42, 29, 75]. In addition, because software programs can be represented as graphs, frequent pattern mining methods can be used in order to find logical bugs from program execution traces [51]. Numerous applications of frequent pattern mining are discussed in Chap. 18.

7 Conclusions and Summary

Frequent pattern mining is one of four major problems in the data mining domain. This chapter provides an overview of the major topics in frequent pattern mining. The earliest work in this area was focussed on determining efficient algorithms for frequent pattern mining, and variants such as long pattern mining, interesting pattern mining, constraint-based pattern mining, and compression. In recent years, scalability has become an issue because of the massive amounts of data that continue to be created in various applications. In addition, because of advances in data collection technology, advanced data types such as temporal data, spatiotemporal data, graph data, and uncertain data have become more common. Such data types have numerous applications to other data mining problems such as clustering and classification. In addition, such data types are used quite often in various temporal applications, such as Web log analytics.

References

1. C. Aggarwal. Outlier Analysis, Springer, 2013.
2. C. Aggarwal. Social Sensing, Managing and Mining Sensor Data, Springer, 2013.
3. C. C. Aggarwal, and P. S. Yu. Online generation of Association Rules, ICDE Conference, 1998.
4. R. Agrawal, and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases, VLDB Conference, pp. 487–499, 1994.
5. R. Agrawal, and R. Srikant. Mining Sequential Patterns, ICDE Conference, 1995.
6. C. C. Aggarwal, and P. S. Yu. A New Framework for Itemset Generation, ACM PODS Conference, 1998.
7. C. Aggarwal and P. Yu. Privacy-preserving data mining: Models and Algorithms, Springer.
8. C. C. Aggarwal, and H. Wang. Managing and Mining Graph Data, Springer, 2010.
9. C. C. Aggarwal, and C. K. Reddy. Data Clustering: Algorithms and Applications, CRC Press.
10. R. Agrawal, T. Imielinski, and A. Swami. Database Mining: A Performance Perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6), pp. 914–925, 1993.
11. R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, ACM SIGMOD Conference, 1998.
12. R. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. Depth-first Generation of Long Patterns, ACM KDD Conference, 2000. Also appears as IBM Research Report, RC 21538, 1999.
13. R. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. A Tree Projection Algorithm for Generation of Frequent Itemsets, Journal of Parallel and Distributed Computing, 61(3), pp. 350–371, 2001. Also appears as IBM Research Report, RC 21341, 1999.
14. C. C. Aggarwal, N. Ta, J. Wang, J. Feng, M. Zaki. Xproj: A framework for projected structural clustering of XML documents, ACM KDD Conference, 2007.
15. C. C. Aggarwal, Y. Li, J. Wang, J. Feng. Frequent Pattern Mining with Uncertain Data, ACM KDD Conference, 2009.
16. C. Aggarwal, Y. Li, P. Yu, and R. Jin. On dense pattern mining in graph streams, VLDB Conference, 2010.
17. R. J. Bayardo Jr. Efficiently mining long patterns from databases. ACM SIGMOD Conference, 1998.
18. J.-F. Boulicaut, A. Bykowski, and C. Rigotti. Free-sets: A Condensed Representation of Boolean data for the Approximation of Frequency Queries.
Data Mining and Knowledge Discovery, 7(1), pp. 5–22, 2003.
19. G. Buehrer, and K. Chellapilla. A Scalable Pattern Mining Approach to Web Graph Compression with Communities. WSDM Conference, 2009.
20. T. Calders, and B. Goethals. Mining all non-derivable frequent itemsets, Principles of Knowledge Discovery and Data Mining, 2006.
21. T. Calders, C. Rigotti, and J. F. Boulicaut. A survey on condensed representations for frequent sets. In Constraint-based mining and inductive databases, pp. 64–80, Springer, 2006.
22. J. H. Chang, W. S. Lee. Finding Recent Frequent Itemsets Adaptively over Online Data Streams. ACM KDD Conference, 2003.
23. M. Charikar, K. Chen, and M. Farach-Colton. Finding Frequent Items in Data Streams, Automata, Languages and Programming, pp. 693–703, 2002.
24. M. S. Chen, J. S. Park, and P. S. Yu. Efficient data mining for path traversal patterns, IEEE Transactions on Knowledge and Data Engineering, 10(2), pp. 209–221, 1998.
25. C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Zhu. Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations Newsletter, 4(2), pp. 28–34, 2002.
26. E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. Ullman, and C. Yang. Finding Interesting Associations without Support Pruning, IEEE TKDE, 13(1), pp. 64–78, 2001.
27. G. Cormode, S. Muthukrishnan. What's hot and what's not: tracking most frequent items dynamically, ACM TODS, 30(1), pp. 249–278, 2005.
28. J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, pp. 137–150, 2004.
29. M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis. Frequent substructure-based approaches for classifying chemical compounds. IEEE TKDE, 17(8), pp. 1036–1050, 2005.
30. A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. Information Systems, 29(4), pp. 343–364, 2004.
31. M. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints, VLDB Conference, 1999.
32. V. Guralnik, and G. Karypis. Parallel tree-projection-based sequence mining algorithms. Parallel Computing, 30(4), pp. 443–472, April 2004.
33. J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation, ACM SIGMOD Conference, 2000.
34. J. Han, H. Cheng, D. Xin, and X. Yan. Frequent Pattern Mining: Current Status and Future Directions, Data Mining and Knowledge Discovery, 15(1), pp. 55–86, 2007.
35. J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. C. Hsu. FreeSpan: frequent pattern-projected sequential pattern mining. ACM KDD Conference, 2000.
36. J. Han, J. Pei, H. Pinto, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE Conference.
37. J. Han, J.-G. Lee, H. Gonzalez, X. Li. Mining Massive RFID, Trajectory, and Traffic Data Sets (Tutorial). ACM KDD Conference, 2008. Video of Tutorial Lecture at:
38. H. Jeung, M. L. Yiu, X. Zhou, C. Jensen, H. Shen. Discovery of Convoys in Trajectory Databases, VLDB Conference, 2008.
39. R. Jin, G. Agrawal. Frequent Pattern Mining in Data Streams, Data Streams: Models and Algorithms, pp. 61–84, Springer, 2007.
40. R. Jin, L. Liu, and C. Aggarwal. Discovering highly reliable subgraphs in uncertain graphs. ACM KDD Conference, 2011.
41. M. Kuramochi and G. Karypis. Frequent Subgraph Discovery, ICDM Conference, 2001.
42. A. R. Leach and V. J. Gillet. An Introduction to Chemoinformatics. Springer, 2003.
43. W. Lee, S. Stolfo, and P. Chan. Learning Patterns from Unix Execution Traces for Intrusion Detection, AAAI workshop on AI methods in Fraud and Risk Management, 1997.
44. W. Lee, S. Stolfo, and K. Mok. A Data Mining Framework for Building Intrusion Detection Models, IEEE Symposium on Security and Privacy, 1999.
45. J.-G. Lee, J. Han, K.-Y. Whang. Trajectory Clustering: A Partition-and-Group Framework, ACM SIGMOD Conference, 2007.
46. J.-G. Lee, J. Han, X. Li. Trajectory Outlier Detection: A Partition-and-Detect Framework, ICDE Conference, 2008.
47. J.-G. Lee, J. Han, X. Li, H. Gonzalez. TraClass: trajectory classification using hierarchical region-based and trajectory-based clustering. PVLDB, 1(1), pp. 1081–1094, 2008.
48. X. Li, J. Han, and S. Kim. Motion-alert: Automatic Anomaly Detection in Massive Moving Objects, IEEE Conference in Intelligence and Security Informatics, 2006.
49. X. Li, J. Han, S. Kim and H. Gonzalez. ROAM: Rule- and Motif-based Anomaly Detection in Massive Moving Object Data Sets, SDM Conference, 2007.
50. Z. Li, B. Ding, J. Han, R. Kays. Swarm: Mining Relaxed Temporal Moving Object Clusters, VLDB Conference, 2010.
51. C. Liu, X. Yan, H. Lu, J. Han, and P.
S.Yu.MiningBehaviorGraphsforbacktraceofnon-crashingbugs,SDMConference,2005.52.B.Liu,W.Hsu,Y.Ma.IntegratingClassicationandAssociationRuleMining,ACMKDDConference,1998.53.S.Ma,andJ.Hellerstein.MiningPartiallyPeriodicEventPatternswithUnknownPeriods,IEEEInternationalConferenceonDataEngineering,2001.54.H.Mannila,H.Toivonen,andA.I.Verkamo.DiscoveringFrequentEpisodesinSequences,ACMKDDConference,1995.55.R.Ng,L.V.S.Lakshmanan,J.Han,andA.Pang.Exploratoryminingandpruningoptimizationsofconstrainedassociationsrules.ACMSIGMODConference,1998.56.N.Pasquier,Y.Bastide,R.Taouil,andL.Lakhal.Discoveringfrequentcloseditemsetsforassociationrules.InternationalConferenceonDatabaseTheory,pp.398416,1999.57.J.Pei,andJ.Han.Canwepushmoreconstraintsintofrequentpatternmining?ACMKDDConference,2000.58.J.Pei,J.Han,R.Mao.CLOSET:AnEfcientAlgorithmsforMiningFrequentClosedItemsets,DMKDWorkshop,2000.59.J.Pei,J.Han,H.Lu,S.Nishio,S.Tang,andD.Yang.H-mine:Hyper-structureminingoffrequentpatternsinlargedatabases.InDataMining,ICDMConference,2001.60.J.Pei,J.Han,andL.V.S.Lakshmanan.MiningFrequentPatternswithConvertibleConstraintsinLargeDatabases,ICDEConference,2001. 
1AnIntroductiontoFrequentPatternMining61.J.Pei,J.Han,andW.Wang.Constraint-basedSequentialPatternMining:ThePattern-GrowthJournalofIntelligentInformationSystems,28(2),pp.133160,2007.62.P.Shenoy,J.Haritsa,S.Sudarshan,G.Bhalotia,M.Bawa,D.Shah.Turbo-chargingVerticalMiningofLargeDatabases.ACMSIGMODConference,pp.2233,2000.63.J.Srivastava,R.Cooley,M.Deshpande,andP.N.Tan.Webusagemining:DiscoveryandapplicationsofusagepatternsfromWebdata.ACMSIGKDDExplorationsNewsletter,1(2),pp.1223,2000.64.Y.Tong,L.Chen,Y.Cheng,P.Yu.MiningFrequentItemsetsoverUncertainDatabases.5(11),pp.16501661,2012.65.V.S.Verykios,A.K.Elmagarmid,E.Bertino,Y.Saygin,andE.Dasseni.AssociationruleIEEETransactionsonKnowledgeandDataEngineering,pp.434447,16(4),pp.434447,2004.66.J.Vreeken,M.vanLeeuwen,andA.Siebes.Krimp:Miningitemsetsthatcompress.MiningandKnowledgeDiscovery,23(1),pp.169214,2011.67.J.Wang,J.Han,andJ.Pei.CLOSET+:SearchingfortheBeststrategiesforminingfrequentcloseditemsets.ACMKDDConference,2003.68.Z.Xing,J.Pei,andE.Keogh.ABriefSurveyonSequenceClassication,ACMSIGKDDExplorations,12(1),2010.69.X.Yan,P.S.Yu,andJ.Han,Graphindexing:Afrequentstructure-basedapproach.ACMSIGMODConference,2004.70.X.Yan,P.S.Yu,andJ.Han.Substructuresimilaritysearchingraphdatabases.ACMSIGMODConference,2005.71.X.Yan,F.Zhu,J.Han,andP.S.Yu.Searchingsubstructureswithsuperimposeddistance,ICDEConference,2006.72.M.Zaki.Efcientlyminingfrequenttreesinaforest:Algorithmsandapplications.TransactionsonKnowledgeandDataEngineering,17(8),pp.10211035,2005.73.M.Zaki,C.Aggarwal.XRules:AnEffectiveClassierforXMLData,ACMKDDConference74.M.Zaki,C.J.Hsiao.CHARM:AnEfcientAlgorithmforClosedFrequentItemsetMining,SDMConference,2002.75.S.Zhang,T.Wang.DiscoveringFrequentAgreementSubtreesfromPhylogeneticData.TransactionsonKnowledgeandDataEngineering,20(1),pp.6882,2008.76.Z.Zou,J.Li,H.Gao,andS.Zhang.MiningFrequentSubgraphPatternsfromUncertainGraphIEEETransactionsonKnowledgeandDataEngineering,22(9),pp.12031218,2010.77.http://mi.ua.ac.be/ 
Chapter 2
Frequent Pattern Mining Algorithms: A Survey

Charu C. Aggarwal, Mansurul A. Bhuiyan and Mohammad Al Hasan

C. C. Aggarwal, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA, e-mail: charu@us.ibm.com
M. A. Bhuiyan, M. A. Hasan, Indiana University-Purdue University, Indianapolis, IN, USA, e-mail: mbhuiyan@cs.iupui.edu
M. A. Hasan, e-mail: alhasan@cs.iupui.edu
C. C. Aggarwal, J. Han (eds.), Frequent Pattern Mining, DOI 10.1007/978-3-319-07821-2_2, © Springer International Publishing Switzerland 2014

Abstract This chapter will provide a detailed survey of frequent pattern mining algorithms. A wide variety of algorithms will be covered, starting from Apriori. Many algorithms such as TreeProjection and FP-growth will be discussed. In addition, a discussion of several maximal and closed frequent pattern mining algorithms will be provided. Thus, this chapter will provide one of the most detailed surveys of frequent pattern mining algorithms available in the literature.

Keywords Frequent pattern mining algorithms · TreeProjection · FP-growth

1 Introduction

In data mining, frequent pattern mining (FPM) is one of the most intensively investigated problems in terms of computational and algorithmic development. Over the last two decades, numerous algorithms have been proposed to solve frequent pattern mining or some of its variants, and the interest in this problem still persists [45, 75]. Different frameworks have been defined for frequent pattern mining. The most common one is the support-based framework, in which itemsets with frequency above a given threshold are found. However, such itemsets may sometimes not represent interesting positive correlations between items because they do not normalize for the absolute frequencies of the items. Consequently, alternative measures for interestingness have been defined in the literature [7, 11, 16, 63]. This chapter will focus on the support-based framework because the algorithms based on the interestingness
framework are provided in a different chapter. Surveys on frequent pattern mining may be found in [26, 33].

One of the main reasons for the high level of interest in frequent pattern mining algorithms is the computational challenge of the task. Even for a moderately sized dataset, the search space of FPM is enormous: it is exponential in the length of the transactions in the dataset. This naturally creates challenges for itemset generation when the support levels are low. In fact, in most practical scenarios, the support levels at which one can mine the corresponding itemsets are limited (bounded below) by the memory and computational constraints. Therefore, it is critical to be able to perform the analysis in a space- and time-efficient way. During the first few years of research in this area, the primary focus of work was to find FPM algorithms with better computational efficiency.

Several classes of algorithms have been developed for frequent pattern mining, many of which are closely related to one another. In fact, the execution tree of all the algorithms is mostly different in terms of the order in which the patterns are explored, and whether the counting work done for different candidates is independent of one another. To explain this point, we introduce a primitive baseline algorithm that forms the heart of most frequent pattern mining algorithms.

Figure 2.1 presents the pseudocode for a very simple baseline frequent pattern mining algorithm. The algorithm takes the transaction database and a user-defined support value as input. It first populates all length-one frequent patterns in a frequent pattern data-store. Then it generates a candidate pattern and computes its support in the database. If the support of the candidate pattern is equal to or higher than the minimum support threshold, the pattern is stored in the data-store. The process continues until all the frequent patterns from the database are found.

Fig. 2.1 A generic frequent pattern mining algorithm

In the aforementioned algorithm, candidate patterns are generated from the previously generated frequent patterns. Then, the transaction database is used to determine which of the candidates are truly frequent patterns. The key issues of computational efficiency arise in terms of generating the candidate patterns in an orderly and carefully designed fashion, pruning irrelevant and duplicate candidates, and using well chosen tricks to minimize the work in counting the candidates. Clearly, the effectiveness of these different strategies depends on each other. For example, the effectiveness of a pruning strategy may be dependent on the order of exploration of the candidates (level-wise vs. depth-first), and the effectiveness of counting is also dependent on the order of exploration because the work done for counting at the higher levels (shorter itemsets) can be reused at the lower levels (longer itemsets) with certain strategies, such as those explored in TreeProjection and FP-growth.

Surprising as it might seem, virtually all frequent pattern mining algorithms can be considered complex variations of this simple baseline pseudocode. The major challenge of all of these methods is that the number of frequent patterns and candidate patterns can sometimes be large. This is a fundamental problem of frequent pattern mining, although it is possible to speed up the counting of the different candidate patterns with the use of various tricks such as database projections. An analysis on the number of candidate patterns may be found in [25].

The candidate generation process of the earliest algorithms used joins. The original Apriori algorithm belongs to this category [1]. Although Apriori is presented as a join-based algorithm, it can be shown that the algorithm is a breadth-first exploration of a structured arrangement of the itemsets, known as a lexicographic tree or enumeration tree. Therefore, later classes of algorithms explicitly discuss tree-based enumeration [4, 5]. The algorithms assume a lexicographic tree (or enumeration tree) of candidate patterns and explore the tree using breadth-first or depth-first strategies. The use of the enumeration tree forms the basis for understanding search space decomposition, as in the case of the TreeProjection algorithm [5]. The enumeration tree concept is very useful because it provides an understanding of how the search space of candidate patterns may be explored in a systematic and non-redundant way. Frequent pattern mining algorithms typically need to evaluate the support of frequent portions of the enumeration tree, and also rule out an additional layer of infrequent extensions of the frequent nodes in the enumeration tree. This makes the candidate space of
all frequent pattern mining algorithms virtually invariant unless one is interested in particular types of patterns such as maximal patterns.

The enumeration tree is defined on the prefixes of frequent itemsets, and will be introduced later in this chapter. Later algorithms such as FP-growth perform suffix-based recursive exploration of the search space. In other words, the frequent patterns with a particular pattern as a suffix are explored at one time. This is because FP-growth uses the opposite item ordering convention to most enumeration tree algorithms, though the recursive exploration order of FP-growth is similar to an enumeration tree.

Note that all classes of algorithms, implicitly or explicitly, explore the search space of patterns defined by an enumeration tree of frequent patterns with different strategies such as joins, prefix-based depth-first exploration, or suffix-based depth-first exploration. However, there are significant differences in terms of the order in which the search space is explored, the pruning methods used, and how the counting is performed. In particular, certain projection-based methods help in reusing the counting work done for $k$-itemsets when counting $(k+1)$-itemsets, with the use of the notion of projected databases. Many algorithms such as TreeProjection and FP-growth are able to achieve this goal.
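The generic baseline described in this introduction (populate the frequent single items, then repeatedly generate a candidate from an already known frequent pattern and count its support) can be sketched in a few lines. This is only a minimal Python sketch under stated assumptions, not the pseudocode of Fig. 2.1 itself: the names `support` and `baseline_fpm`, the one-item extension used to generate candidates, and the five-transaction database at the bottom are all illustrative choices that do not come from the chapter.

```python
from itertools import chain


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def baseline_fpm(transactions, minsup):
    """Naive baseline: extend known frequent patterns by one frequent item."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set(chain.from_iterable(transactions)))

    # Step 1: populate the store with all frequent patterns of length one.
    frequent = {}
    for item in items:
        count = support(frozenset([item]), transactions)
        if count >= minsup:
            frequent[frozenset([item])] = count

    frequent_items = sorted(next(iter(p)) for p in frequent)
    frontier = list(frequent)   # frequent patterns not yet extended
    seen = set(frequent)        # every candidate generated so far

    # Step 2: generate candidates from frequent patterns and count them.
    while frontier:
        pattern = frontier.pop()
        for item in frequent_items:
            if item in pattern:
                continue
            candidate = pattern | {item}
            if candidate in seen:
                continue        # duplicate candidate, already counted
            seen.add(candidate)
            count = support(candidate, transactions)
            if count >= minsup:
                frequent[candidate] = count
                frontier.append(candidate)
    return frequent


# Hypothetical toy database (not Table 2.1 of the chapter).
toy_db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = baseline_fpm(toy_db, minsup=3)
for pattern in sorted(result, key=lambda p: (len(p), sorted(p))):
    print("".join(sorted(pattern)), result[pattern])
```

On this hypothetical database the sketch reports the three single items and the three pairs as frequent, while the candidate abc is counted once (support 2) and rejected. Every candidate costs a full database scan here, which is exactly the inefficiency that the ordering, pruning, and counting strategies surveyed in this chapter are designed to reduce.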
Table 2.1 Toy transaction database and frequent items of each transaction for a minimum support of 3

tid   Items              Sorted frequent items
2     a, b, c, d, f, h   a, b, c, d, f
3     a, f, g            a, f
4     b, e, f, g         b, f, e
5     a, b, c, d, e, h   a, b, c, d, e

This chapter is organized as follows. The remainder of this section discusses notations and definitions relevant to frequent pattern mining. Section 2 discusses join-based algorithms. Section 3 discusses tree-based algorithms. All the algorithms discussed in Sects. 2 and 3 extend prefixes of itemsets to generate frequent patterns. A number of methods that extend suffixes of frequent patterns are discussed in Sect. 4. Variants of frequent pattern mining, such as closed and maximal frequent pattern mining, are discussed in Sect. 5. Other optimized variations of frequent pattern mining algorithms are discussed in Sect. 6. Methods for reducing the number of passes, with the use of sampling and aggregation, are proposed in Sect. 7. Finally, Sect. 8 concludes the chapter with an overall summary.

1.1 Definitions

In this section, we define several key concepts of frequent pattern mining (FPM) that we will use in the remaining part of the chapter.

Let $\mathcal{T}$ be a transaction database, where each transaction $T_i \in \mathcal{T}$ consists of a set of items, say $T_i = \{x_1, x_2, \ldots, x_l\}$. A set $P \subseteq T_i$ is called an itemset. The size of an itemset is defined by the number of items it contains. We will refer to an itemset as an $l$-itemset (or $l$-pattern) if its size is $l$. The number of transactions containing $P$ is referred to as the support of $P$. A pattern $P$ is defined to be frequent if its support is at least equal to the minimum threshold.

Table 2.1 depicts a toy database with 5 transactions. The second column shows the items in each transaction. In the third column, we show the set of items that are frequent in the corresponding transaction for a minimum support value of 3. For example, the item $h$ in the transaction with tid value of 2 is an infrequent item with a support value of 2. Therefore, it is not listed in the third column of the corresponding row. Similarly, a pattern of two or more items (written in abbreviated form by concatenating its items, for example $ab$ for $\{a, b\}$) is frequent if it has a support value of at least 3.

The frequent patterns are often used to generate association rules. Consider the rule $A \Rightarrow B$, where $A$ and $B$ are sets of items. The confidence of the rule $A \Rightarrow B$ is equal to the ratio of the support of $A \cup B$ to the support of $A$. In other words, it can be viewed as the conditional probability that $B$ occurs, given that $A$ has occurred. The support of the rule is equal
to the support of $A \cup B$. Association rule generation is a two-phase process. The first phase determines all the frequent patterns at a given minimum support level. The second phase extracts all the rules from these patterns. The second phase is fairly trivial and of limited sophistication. Therefore, most of the algorithmic work in frequent pattern mining focusses on the first phase. This chapter will also focus on the first phase of frequent pattern mining, which is generally considered more important and non-trivial.

Frequent patterns satisfy a downward closure property, according to which every subset of a frequent pattern is also frequent. This is because if a pattern $P$ is a subset of a transaction $T$, then every pattern $P' \subseteq P$ will also be a subset of $T$. Therefore, the support of $P'$ can be no less than that of $P$. The space of exploration of frequent patterns can be arranged as a lattice, in which every node is one of the $2^d$ possible itemsets (where $d$ is the number of items), and an edge represents an immediate subset relationship between these itemsets. An example of a lattice of possible itemsets for a universe of items corresponding to $\{a, b, c, d\}$ is illustrated in Fig. 2.2. The lattice represents the search space of frequent patterns, and all frequent pattern mining algorithms must, in one way or another, traverse this lattice to identify the frequent nodes of this lattice. The lattice is separated into a frequent and an infrequent part with the use of a border. An example of a border is illustrated in Fig. 2.2. This border must satisfy the downward closure property.

Fig. 2.2 The lattice of itemsets over $\{a, b, c, d\}$, with the border between frequent and infrequent itemsets

The lattice can be traversed with a variety of strategies such as breadth-first or depth-first methods. Furthermore, candidate nodes of the lattice may be generated in many ways, such as using joins, or using lexicographic tree-based extensions. Many of these methods are conceptually equivalent to one another. The following discussion will provide an overview of the different strategies that are commonly used.

2 Join-Based Algorithms

Join-based algorithms generate $(k+1)$-candidates from frequent $k$-patterns with the use of joins. These candidates are then validated against the transaction database. The Apriori method uses joins to create candidates from frequent patterns, and is one of the earliest algorithms for frequent pattern mining.

2.1 Apriori Method

The most basic join-based algorithm is the Apriori method [1]. The Apriori approach uses a level-wise approach in which all frequent itemsets of length $k$ are generated before those of length $(k+1)$. The main observation which is used for the Apriori algorithm is that every subset of a frequent pattern is also frequent. Therefore, candidates for frequent patterns of length $(k+1)$ can be generated from known frequent patterns of length $k$ with the use of joins. A join is defined by pairs of frequent $k$-patterns that have at least $(k-1)$ items in common. Specifically, consider a pattern $\{i_1, i_2, i_3, i_4\}$ that is frequent, but has not yet been discovered because only itemsets of length 3 have been discovered so far. In this case, because the patterns $\{i_1, i_2, i_3\}$ and $\{i_1, i_2, i_4\}$ are frequent, they will be present in the set $F_3$ of all frequent patterns with length $k = 3$. Note that this particular pair also has $k - 1 = 2$ items in common. By performing a join on this pair, it is possible to create the candidate pattern $\{i_1, i_2, i_3, i_4\}$. This pattern is referred to as a candidate because it might be frequent, and one must either rule it in or rule it out by support counting. Therefore, this candidate is then validated against the transaction database by counting its support. Clearly, the design of an efficient support counting method plays a critical role in the overall efficiency of the process. Furthermore, it is important to note that the same candidate can be produced by joining multiple frequent patterns. For example, one might join $\{i_1, i_2, i_3\}$ with $\{i_2, i_3, i_4\}$ to achieve the same result. Therefore, in order to avoid duplication in candidate generation, two itemsets are joined only when their first $(k-1)$ items are the same, based on a lexicographic ordering imposed on the items. This provides all the $(k+1)$-candidates in a non-redundant way.

It should be pointed out that some candidates can be pruned out in an efficient way, without validating them against the transaction database. For any $(k+1)$-candidate, it is checked whether all its $k$-subsets are frequent. Although it is already known that the two subsets contributing to the join are frequent, it is not known whether its remaining subsets are frequent. If any of its subsets is not frequent, then the candidate can be pruned from consideration because of the downward closure property. This is known as the Apriori pruning trick. For example, in the previous case, if the itemset $\{i_1, i_3, i_4\}$ does not exist in the set of frequent 3-itemsets which have already been found, then the candidate itemset $\{i_1, i_2, i_3, i_4\}$ can be pruned from consideration with no further computational effort. This greatly speeds up the overall algorithm. The generation of 1-itemsets and 2-itemsets is usually performed in a specialized way with more efficient techniques.

Therefore, the basic Apriori algorithm can be described recursively in level-wise fashion. The overall algorithm comprises three steps that are repeated over and over again, for different values of $k$, where $k$ is the length of the pattern generated in the current iteration. The three steps are those of (i) generation of candidate patterns $C_{k+1}$ by using joins on the patterns in $F_k$, (ii) the pruning of candidates from $C_{k+1}$ for which not all $k$-subsets lie in $F_k$, and (iii) the validation of the patterns in $C_{k+1}$ against the transaction database $\mathcal{T}$, to determine the subset of $C_{k+1}$ which is truly frequent. The algorithm is terminated when the set of frequent $k$-patterns $F_k$ in a given iteration is empty. The pseudo-code of the overall procedure is presented in Fig. 2.3.

Fig. 2.3 Pseudo-code of the overall Apriori procedure

The computationally intensive procedure in this case is the counting of the candidates in $C_{k+1}$ with respect to the transaction database $\mathcal{T}$. Therefore, a number of optimizations and data structures have been proposed in [1] (and also the subsequent literature) to speed up the counting process. The data structure proposed in [1] is that of constructing a hash-tree to maintain the candidate patterns. A leaf node of the hash-tree contains a list of itemsets, whereas an interior node contains a hash-table. An itemset is mapped to a leaf node of the tree by defining a path from the root to the leaf node with the use of the hash function. At a node of level $i$, a hash function is applied to the $i$th item to decide which branch to follow. The itemsets in the leaf node are stored in sorted order. The tree is constructed recursively in top-down fashion, and a minimum threshold is imposed on the number of candidates in the leaf node.

To perform the counting, all possible $k$-itemsets which are subsets of a transaction are discovered in a single exploration of the hash-tree. To achieve this goal, all possible paths in the hash-tree that could correspond to subsets of the transaction are followed in recursive fashion, to determine which leaf nodes are relevant to that transaction. After the leaf nodes have been discovered, the itemsets at these leaf nodes that are subsets of that transaction are isolated and their count is incremented. The actual selection of the relevant leaf nodes is performed by recursive traversal as follows. At the root node, all branches are followed such that any of the items in the transaction hash to one of the branches. At a given interior node, if the $i$th item of the transaction was last hashed, then all items following it in the transaction are hashed to determine the possible children to follow. Thus, by following all these paths, the relevant leaf nodes in the tree are determined. The candidates in the leaf node are stored in sorted order, and can be compared efficiently to the hashed sequence of items in the transaction to determine whether they are relevant. This provides a count of the itemsets relevant to the transaction. This process is repeated for each transaction to determine the final support count for each itemset.

It should be pointed out that the reason for using a hash function at the intermediate nodes is to reduce the branching factor of the hash-tree. However, if desired, a trie can be used explicitly, in which the degree of a node is potentially of the order of the total number of items. An example of such an implementation is provided in [12], and it seems to work quite well. An algorithm that shares some similarities with the Apriori method was independently proposed in [44], and subsequently a combined work was published in [3].

Figure 2.4 illustrates the execution tree of the join-based Apriori algorithm over the toy transaction database mentioned in Table 2.1 for minimum support value 3. As mentioned in the pseudocode of Apriori, a candidate $k$-pattern is generated by joining two frequent itemsets of size $(k-1)$. For example, at level 3, a candidate 3-pattern is generated by joining two frequent 2-patterns that share their first item. After generating the candidate patterns, the support of the patterns is computed by scanning every transaction in the database and determining the frequent ones. In Fig. 2.4, a candidate pattern is shown in a box along with its support value. A frequent candidate is shown in a solid box, and an infrequent candidate is shown in a dotted box. An edge represents the join relationship between a candidate pattern of size $k$ and a frequent pattern of size $(k-1)$ such that the latter is used to generate the former. The figure also illustrates the fact that a pair of frequent patterns is used to generate a candidate pattern, whereas no candidates are generated from an infrequent pattern.

Fig. 2.4 Execution tree of Apriori

2.1.1 Apriori Optimizations

Numerous optimizations were proposed for the Apriori algorithm [1] that are referred to as AprioriTid and AprioriHybrid respectively. In the AprioriTid algorithm, each transaction is replaced by a shorter transaction (or null transaction) during the $k$th phase. Let the set of $(k+1)$-candidates in $C_{k+1}$ that are contained in transaction $T$ be denoted by $R(T, C_{k+1})$. This set $R(T, C_{k+1})$ is added to a newly created transaction database $\mathcal{T}'_k$. If the set $R(T, C_{k+1})$ is null, then clearly the transaction does not contribute to further counting. A number of different tradeoffs exist with the use of such an approach.

• Because each newly created transaction in $\mathcal{T}'_k$ is much shorter, this makes subsequent support counting more efficient.
• In some cases, no candidate may be a subset of the transaction. Such a transaction can be dropped from the database because it does not contribute to the counting of support values.
• In other cases, more than one candidate may be a subset of the transaction, which will actually increase the overhead of the algorithm. Clearly, this is not a desirable scenario.

Thus, the first two factors improve the efficiency of the new representation, whereas the last factor worsens it. Typically, the impact of the last factor is greater in the early iterations, whereas the impact of the first two factors is greater in the later iterations. Therefore, to maximize the overall efficiency, a natural approach would be to not use this optimization in the early iterations, and apply it only in the later iterations. This variation is referred to as the AprioriHybrid algorithm [1]. Another optimization proposed in [9] is that the support of many patterns can be inferred from those of key patterns in the data. This is used to significantly enhance the efficiency of the approach.

Numerous other techniques have been proposed to optimize the original implementation of the Apriori algorithm. As an example, the methods in [1] and [44] share a number of similarities but are somewhat different at the implementation level. A work that combines the ideas from these different pieces of work is presented in [3].

2.2 DHP Algorithm

The DHP algorithm, also known as the Direct Hashing and Pruning method [50], was proposed soon after the Apriori method. It proposes two main optimizations to speed up the algorithm. The first optimization is to prune the candidate itemsets in each iteration, and the second optimization is to trim the transactions to make the support-counting process more efficient.

To prune the itemsets, the algorithm tracks partial information about candidate $(k+1)$-itemsets, while explicitly counting the support of candidate $k$-itemsets. During the counting of candidate $k$-itemsets, all $(k+1)$-subsets of the transaction are found and hashed into a table that maintains the counts of the number of subsets hashed into each entry. During the phase of counting $(k+1)$-itemsets, the counts in the hash table are retrieved for each candidate itemset. Clearly, these counts are overestimates because of possible collisions in the hash table. Those itemsets for which the counts are below the user-specified support level are then pruned from consideration.

A second optimization proposed in DHP is that of transaction trimming. A key observation here is that if an item does not appear in at least $k$ frequent itemsets in $F_k$, then no frequent itemset in $F_{k+1}$ will contain that item. This follows from the fact that each frequent pattern in $F_{k+1}$ that contains a particular item has at least $k$ (immediate) subsets that also occur in $F_k$ and also contain that item. This implies that if an item does not appear in at least $k$ frequent itemsets in $F_k$, then that item is no longer relevant to further support counting for finding frequent patterns. Therefore, that item can be trimmed from the transaction. This reduces the width of the transaction, and increases the efficiency of processing. The overhead from the data structures is significant, and most of the advantages are obtained for patterns of smaller length such as 2-itemsets. It was pointed out in later work [46, 47, 60] that the use of triangular arrays for support counting of 2-itemsets in the context of the Apriori method is even more efficient than such an approach.

2.3 Special Tricks for 2-Itemset Counting

A number of special tricks can be used to improve the effectiveness of 2-itemset counting. The case of 2-itemset counting is special and is often similar for join-based and tree-based algorithms. As mentioned above, one approach is to use a triangular array that maintains the counts of the $k$-patterns (with $k = 2$) explicitly. For each transaction, a nested loop can be used to explore all pairs of items in the transaction and increment the corresponding counts in the triangular array. A number of caching tricks can be used [5] to improve data locality of access during the counting process. However, if the number of possible items is very large, this will still be a very significant overhead because an entry must be maintained for each pair of items. This is also very wasteful if many of the 1-items are not frequent, or some of the 2-item counts are zero. Therefore, a possible approach would be to first prune out all the 1-items which are not frequent. It is simply not necessary to count the support of a 2-itemset unless both of its constituent items are frequent. A hash table can then be used to maintain the frequency counts of the corresponding 2-itemsets. As before, the transactions are explored in a doubly nested loop, and all pairs of items are hashed into the table, with the caveat that each of the individual items must be frequent. The set of itemsets which satisfy the support requirements are reported.

2.4 Pruning by Support Lower Bounding

Most of the pruning tricks discussed earlier prune itemsets when they are guaranteed not to meet the required support threshold. It is also possible to skip the counting process for an itemset if the itemset is guaranteed to meet the support threshold. Of course, the caveat here is that the exact support of that itemset will not be available, beyond the knowledge that it meets the minimum threshold. This is sufficient in the case of many applications.

Consider two $k$-itemsets $A$ and $B$ that have $k-1$ items in common. Then, the union of the items in $A$ and $B$, denoted by $A \cup B$, will have exactly $k+1$ items. Then, if $sup(\cdot)$ represents the support of an itemset, then the support of
$A \cup B$ can be lower bounded as follows:

$$sup(A \cup B) \geq sup(A) + sup(B) - sup(A \cap B) \qquad (2.1)$$

Here $A$ and $B$ are the two $k$-itemsets under consideration. This condition follows directly from set-theoretic considerations. Thus, the support of $(k+1)$-candidates can be lower bounded in terms of the (already computed) support values of itemsets of length $k$ or less. If the computed value on the right-hand side is greater than the required minimum support, then the counting of the candidate does not need to be performed explicitly, and therefore considerable savings can be achieved. An example of a method which uses this kind of pruning is the method of [10].

Another interesting rule is that if the support of an itemset $A$ is the same as that of $A \cup B$, then for any superset $A' \supseteq A$, it is the case that the support of the itemset $A'$ is the same as that of $A' \cup B$. This rule can be shown directly as a corollary of the equation above. This is very useful in a variety of frequent pattern mining algorithms. For example, once the support of $\{ab\}$ has been shown to be the same as that of $\{a\}$, then, for any superset $A'$ of $\{a\}$, it is no longer necessary to explicitly compute the support of $A' \cup \{b\}$ after the support of $A'$ has already been computed. Such optimizations have been shown to be quite effective in the context of many frequent pattern mining algorithms [13, 51, 17]. As discussed later, this trick is not exclusive to join-based algorithms, and is often used effectively in tree-based algorithms.

2.5 Hypercube Decomposition

One feasible way to reduce the computation cost of support counting is to find the support of multiple frequent patterns at one time. LCM [66] devises a technique referred to as hypercube decomposition for this purpose. The multiple itemsets obtained at one time comprise a hypercube in the itemset lattice. Suppose that $P$ is a frequent pattern, $tidset(P)$ contains the transactions that $P$ is part of, and $tail(P)$ denotes the latest item extension to the itemset $P$. $H(P)$ is the set of items $e$ satisfying $e > tail(P)$ and $tidset(P) = tidset(P \cup e)$. The set $H(P)$ is referred to as the hypercube set. Then, for any $P' \subseteq H(P)$, $tidset(P \cup P') = tidset(P)$ is true, and $P \cup P'$ is frequent. The work in [66] uses this property in the candidate generation phase. For two itemsets $P$ and $P \cup P'$, we say that $P''$ is between $P$ and $P \cup P'$ if $P \subseteq P'' \subseteq P \cup P'$. In the phase with respect to $P$, we output all $P''$ between $P$ and $P \cup H(P)$. This technique saves significant time in counting.

3 Tree-Based Algorithms

The tree-based algorithm is based on set-enumeration concepts. The candidates can be explored with the use of a subgraph of the lattice of itemsets (see Fig. 2.2), which is also referred to as the lexicographic tree or enumeration tree [5]. These terms will, therefore, be used interchangeably. Thus, the problem of frequent itemset generation is equivalent to that of constructing the enumeration tree. The tree can be grown in a wide variety of ways such as breadth-first or depth-first order. Because most of the discussion in this section will use this structure as a base for algorithmic development, this concept will be discussed in detail here. The main characteristic of tree-based algorithms is that the enumeration tree (or lexicographic tree) provides a certain order of exploration that can be extremely useful in many scenarios.

Fig. 2.5 The lexicographic tree (also known as enumeration tree)

It is assumed that a lexicographic ordering exists among the items in the database. This lexicographic ordering is essential for efficient set enumeration without repetition. To indicate that an item $i$ occurs lexicographically earlier than $j$, we will use the notation $i \leq_L j$. The lexicographic tree is an abstract representation of the large itemsets with respect to this ordering. The lexicographic tree is defined in the following way:

• A node exists in the tree corresponding to each large itemset. The root of the tree corresponds to the null itemset.
• Let $I = \{i_1, \ldots, i_k\}$ be a large itemset, where $i_1, i_2, \ldots, i_k$ are listed in lexicographic order. The parent of the node $I$ is the itemset $\{i_1, \ldots, i_{k-1}\}$.

This definition of ancestral relationship naturally defines a tree structure on the nodes that is rooted at the null node. A frequent 1-extension of an itemset such that the last item is the contributor to the extension will be called a frequent lexicographic tree extension, or simply a tree extension. Thus, each edge in the lexicographic tree corresponds to an item which is the frequent lexicographic tree extension to a node. The frequent lexicographic extensions of node $P$ are denoted by $E(P)$. An example of the lexicographic tree is illustrated in Fig. 2.5. In this example, the frequent lexicographic extensions of node $a$ are $b$, $c$, $d$, and $f$.
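The parent rule in the definition above (drop the lexicographically last item) is enough to recover the whole tree from a set of frequent itemsets. The following is a minimal Python sketch under stated assumptions: the function name `build_lexicographic_tree` is hypothetical, Python's default ordering of the item labels stands in for the lexicographic ordering, and the input is assumed to be downward closed so that every parent is itself present. The example itemsets are illustrative and are not the ones of Fig. 2.5.

```python
def build_lexicographic_tree(frequent_itemsets):
    """Map each node (a sorted tuple of items) to its tree extensions.

    The returned dictionary plays the role of E(P): for a node P it lists
    the items whose addition to P yields a child of P in the tree.
    """
    nodes = {tuple(sorted(itemset)) for itemset in frequent_itemsets}
    nodes.add(())  # the root corresponds to the null itemset
    extensions = {node: [] for node in nodes}
    for node in nodes:
        if node:
            parent = node[:-1]          # drop the lexicographically last item
            extensions[parent].append(node[-1])
    for items in extensions.values():
        items.sort()
    return extensions


# Hypothetical, downward-closed collection of frequent itemsets.
frequent = [{"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
tree = build_lexicographic_tree(frequent)
for node in sorted(tree, key=lambda n: (len(n), n)):
    print("".join(node) or "null", "->", tree[node])
```

Each itemset appears exactly once, as the child of its unique prefix, which is why exploring this tree enumerates the patterns without repetition.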
Let $Q$ be the immediate ancestor of the itemset $P$ in the lexicographic tree. The set of prospective branches of a node $P$ is defined to be those items in $E(Q)$ which occur lexicographically after the node $P$. These are the possible lexicographic extensions of $P$. We denote this set by $R(P)$. Thus, we have the following relationship: $E(P) \subseteq R(P) \subset E(Q)$. The value of $E(P)$ in Fig. 2.5, when $P = ab$, is $\{c, d\}$. The value of $R(P)$ for $P = ab$ is $\{c, d, f\}$, and for $P = af$, $R(P)$ is empty.

It is important to point out that virtually all non-maximal and maximal algorithms, starting from Apriori, can be considered enumeration-tree methods. In fact, there are few frequent pattern mining algorithms which do not use the enumeration tree, or a subset thereof (in maximal pattern mining), for frequent itemset generation. However, the order in which the different algorithms explore the lexicographic tree is quite different. For example, Apriori uses a breadth-first strategy, whereas other algorithms discussed later in this chapter use a depth-first strategy. Some methods are explicit about the relationship of the candidate generation process with the enumeration tree, whereas others, such as Apriori, are not. For example, by examining Fig. 2.4, it is evident that Apriori candidates can be generated by joining two frequent siblings of a lexicographic tree. In fact, all candidates can be generated in an exhaustive and non-redundant way by joining frequent siblings. For example, the two itemsets $acdfh$ and $acdfg$ are siblings, because they are children of the node $acdf$. By joining them, one obtains the candidate pattern $acdfgh$. Thus, while Apriori is a join-based algorithm, it can also be explained in terms of the enumeration tree.

Parts of the enumeration tree may be removed by some of the algorithms by pruning methods. For example, the Apriori algorithm uses a levelwise pruning trick. For maximal pattern mining the advantages gained from pruning tricks can be very significant. Therefore, the number of candidates in the execution tree of different algorithms is different only because of pruning optimization tricks. However, some methods are able to achieve better counting strategies by using the structure of the enumeration tree to avoid re-doing the counting work already done for $k$-candidates when counting $(k+1)$-candidates.

Therefore, explicitly introducing the enumeration tree is helpful because it allows a more flexible way to visualize candidate exploration strategies than join-based methods. The explicit introduction of the enumeration tree also helps in understanding whether the gains in different algorithms arise as a result of a smaller number of candidates, or whether they arise as a result of better counting strategies.

3.1 AIS Algorithm

The original AIS algorithm [2] is a simple version of the lexicographic-tree algorithm, though it is not directly presented as such. In this approach, the tree is constructed in levelwise fashion and the corresponding itemsets at a given level are counted with the use of the transaction database. The algorithm does not use any specific optimizations to improve the efficiency of the counting process. As will be discussed later, a variety of methods can be used to further improve the efficiency of tree-based algorithms. Thus, this is a primitive approach that explores the entire search space with no optimization.

3.2 TreeProjection Algorithms

Two variants of an algorithm which use recursive projections of the transactions down the lexicographic tree structure are proposed in [5] and [4], respectively. The goal of using these recursive projections is to reuse the counting work done at a given level for lower levels of the tree. This reduces the counting work at the lower levels by orders of magnitude, as long as it is possible to successfully manage the memory requirements of the projected transactions. The main difference between the different versions of TreeProjection is the exploration strategy used. TreeProjection can be viewed as a generic framework that advocates the notion of database projection, in the context of several different strategies for constructing the enumeration tree, such as breadth-first, depth-first, or a combination of the two. The depth-first version, described in detail in [4], also incorporates maximal pruning, though the disabling of the pruning options can also materialize all the patterns. The breadth-first and depth-first algorithms have different advantages. The former allows level-wise pruning, which is not possible in depth-first methods, though it is often not used in projection-based methods. The depth-first version allows better memory management. The depth-first approach works best when the itemsets are very long, and it is desirable to quickly discover maximal patterns, so that portions of the lexicographic tree can be pruned off quickly during exploration and it can al
sobeusedfordiscoveringallpatternsincludingnon-maximalones.Whenallpatternsarerequired,includingnon-maximalones,theprimarydifferencebetweendifferentstrategiesisnotoneofthesizeofthecandidatespace,butthatofeffectivememorymanagementoftheprojectedtransactions.Thisisbecausethesizeofthecandidatespaceisdenedbythesizeoftheenumerationtree,whichisxed,andisagnostictothestrategyusedfortreeexploration.Ontheotherhand,memorymanagementofprojectedtransactionsiseasierwiththedepth-rststrategybecauseoneonlyneedstomaintainasmallnumberofprojectedtransactionsetsalongthedepthofthetree.ThenotionofdatabaseprojectioniscommontoTreeProjectionFP-growth,andhelpsreducethecountingworkbyrestrictingthesizeofthedatabaseusedforsupportcounting.TreeProjectionwasdevelopedindependentlyfromFP-growth.WhiletheFP-growthpaperprovidesabriefdiscussionofTreeProjection,thischapterwillprovideamoredetaileddiscussionofthesimilaritiesanddifferencesbetweenthetwomethods.Onemajordifferencebetweenthetwomethodsisthattheinternalrepresentationofthecorrespondingprojecteddatabasesisdifferentinthetwocases.ThebasicdatabaseprojectionapproachisverysimilarinbothcasesofTreeProjec-FP-growth.Animportantobservationisthatifatransactionisnotrelevantforcountingatagivennodeintheenumerationtree,thenitwillnotberelevantforcountinginanydescendentofthatnode.Therefore,onlythosetransactionsareretainedthatcontainallitemsinforcountingatthenodeintheprojectedtrans-actions.Notethatthissetstrictlyreducesaswemovetolowerlevelsofthetree,andthesetofrelevanttransactionsatthelowerleveloftheenumerationtreeisasubsetofthesetatahigherlevel.Furthermore,onlythepresenceofitemscorrespondingtothecandidateextensionsofanodearerelevantforcountingatanyofthesubtreesrooted 2FrequentPatternMiningAlgorithms:ASurveyFig.2.6Enumerationtreeexploration 
atthatnode.Therefore,thedatabaseisalsoprojectedintermsofattributes,inwhichonlyitemswhicharecandidateextensionsatanodeareretained.Thecandidateset)ofitemextensionsofnodeisaverysmallsubsetoftheuniverseofitemsatlowerlevelsoftheenumerationtree.Infact,eventheitemsinthenodenotberetainedexplicitlyinthetransaction,becausetheyareknowntoalwaysbepresentinalltheselectedtransactionsbasedontherstcondition.Thisprojectionprocessisperformedrecursivelyintopdownfashiondowntheenumerationtreeforcountingpurposes,wherelowerlevelnodesinherittheprojectionsfromhigherlevelnodesandaddoneadditionalitemtotheprojectionateachlevel.Theideaofthisinheritance-basedapproachisthattheprojecteddatabaseremembersthecount-ingworkdoneathigherlevelsoftheenumerationtreeby(successively)removingirrelevanttransactionsandirrelevantitemsateachleveloftheprojection.Suchanapproachworksefcientlybecauseitneverrepeatsthecountingworkwhichhasalreadybeendoneatthehigherlevels.Thus,theprimarysavingsinthestrategyarisefromavoidingrepetitiveandwastefulcounting.Abare-bonesdepth-rstversionofTreeProjection,thatissimilartoDepthProjectbutwithoutmaximalpruning,isdescribedinFig.2.6.Amoredetaileddescrip-tionwithmaximalpruningandotheroptimizationsisprovidedlaterinthischapter.Becausethealgorithmisdescribedrecursively,thecurrentprex(nodeofthelexicographictree)beingextendedisoneoftheargumentstothealgorithm.Intheinitialcall,thevalueofbecauseoneintendstodetermineallfrequentde-scendantsattherootofthelexicographictree.Thisalgorithmrecursivelyextendsfrequentprexesandmaintainsonlythetransactiondatabaserelevanttotheprex.Thefrequentprexesareextendedbydeterminingtheitemsthatarefrequentin.Thentheitemsetisreported.Theextensionofthefrequentprexcanbeviewedasarecursivecallatanodeoftheenumerationtree.Thus,atagivenenumerationtreenode,onenowhasacompletelyindependentproblemofextendingtheprexwiththeprojecteddatabasethatisrelevanttoalldescendantsofthatnode.Theconditionaldatabasereferstothesubsetoftheoriginaltransactiondatabasecorrespondingtotransactionscontainin
gitem.Furthermore,theitemandanyitemoccurringlexicographicallyearliertoitisnotretainedinthedatabasebecause C.C.Aggarwaletal.theseitemsarenotrelevanttocountingtheextensionsof.Thisindependentproblemissimilarinstructuretotheoriginalproblem,andcanbesolvedrecursively.Althoughitisnaturaltouserecursionforthedepth-rstversionsofTreeProjectionthebreadth-rstversionsarenotdenedrecursively.Nevertheless,thebreadth-rstversionsexploreapatternspaceofthesamesizeasthedepth-rstversions,andarenodifferenteitherintermsofthetreesizeorthecountingworkdoneovertheen-tirealgorithm.Themajorchallengeinthebreadth-rstversionisinmaintainingtheprojectedtransactionsalongthebreadthofthetree,whichisstorage-intensive.Itisshownin[5],howmanyoftheseissuescanberesolvedwiththeuseofacombinationofexplorationstrategiesfortreegrowthandcounting.Furthermore,itisalsoshownin[5]howbreadth-rstanddepth-rstmethodsmaybecombined.NotethatthisconceptofdatabaseprojectioniscommonbetweenTreeProjectionFP-growthalthoughtherearesomedifferencesintheinternalrepresentationoftheprojecteddatabases.Theaforementioneddescriptionisdesignedfordiscoveringallpatterns,anddoesnotincorporatemaximalpatternpruning.Whengeneratingtheitemsets,themainadvantageofthedepth-rststrategyoverthebreadth-rststrategyisthatitislessmemoryintensive.Thisisbecauseonedoesnothavehandlethelargenumberofcandidatesalongthebreadthoftheenumerationtreeatanypointinthecourseofalgorithmexecutionwhencombinedwithcountingdatastructures.Theoverallsizeofthecandidatespaceisxed,anddenedbythesizeoftheenumerationtree.Therefore,overtheentireexecutionofthealgorithm,thereisnodifferencebetweenthetwostrategiesintermsofsearchspacesize,beyondmemoryoptimization.Projection-basedalgorithms,suchasTreeProjection,canbeimplementedeitherrecursivelyornon-recursively.Depth-rstvariationsofprojectionstrategies,suchDepthProjectFP-growth,aregenerallyimplementedrecursivelyinwhichaparticularprex(orsufx)offrequentitemsisgrownrecursively(seeFig.2.6).Forrecursivevariations,thestructureandsizeoftherecursiontreeist
hesameastheenumerationtree.Non-recursivevariationsofTreeProjectionmethodsdirectlypresenttheprojection-basedalgorithmsintermsoftheenumerationtreebystoringprojectedtransactionsatthenodesintheenumerationtree.Describingprojectionstrategiesdirectlyintermsoftheenumerationtreeishelpful,becauseonecanusetheenumerationtreeexplicitlytooptimizetheprojection.Forexample,onedoesnotneedtoprojectateverynodeoftheenumerationtree,butprojectonlywhenthesizeofthedatabasereducesbyaparticularfactorwithrespecttothenearestancestornodewherethelastprojectionwasstored.Suchoptimizationscanreducethespace-overheadofrepeatedelementsintheprojecteddatabasesatdifferentlevelsoftheenumeration(recursion)tree.IthasbeenshownhowtousethisoptimizationindifferentvariationsofTreeProjection.Furthermore,breadth-rstvariationsofthestrategyarenaturallydenednon-recursivelyintermsoftheenumerationtree.Therecursivedepth-rstversionsmaybeviewedeitherasdivide-and-conquerstrategies(becausetheyrecursivelysolveasetofsmallersubproblems),orasprojection-basedcountingreusestrategies.Thenotionofprojection-basedcountingreuseclearlydescribeshowcomputationalsavingsareachievedinbothversionsofthealgorithm. 
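The recursive, projection-based exploration described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pseudo-code of Fig. 2.6: the function name project_and_mine, the tuple-based representation of projected transactions, and the sample database DB (the five transactions implied by the tidlists of Table 2.2) are assumptions made for the example. No maximal pruning is performed.

```python
from collections import Counter

# Hypothetical sample database: the five transactions implied by Table 2.2.
DB = [tuple("abcde"), tuple("abcdfh"), tuple("afg"), tuple("befg"), tuple("abcdeh")]


def project_and_mine(prefix, transactions, min_support, out=None):
    """Depth-first enumeration with database projection (illustrative sketch).

    `transactions` is the projected database for `prefix`: every transaction
    holds only items that come lexicographically after the items of `prefix`.
    """
    if out is None:
        out = {}
    # Support counting touches only the projected database of this node.
    counts = Counter(item for t in transactions for item in t)
    frequent = sorted(i for i, c in counts.items() if c >= min_support)
    keep = set(frequent)
    for item in frequent:
        pattern = prefix + (item,)
        out[pattern] = counts[item]
        # Project on transactions (must contain `item`) and on items
        # (frequent items that occur lexicographically after `item`).
        projected = []
        for t in transactions:
            if item in t:
                rest = tuple(x for x in t if x > item and x in keep)
                if rest:
                    projected.append(rest)
        # Independent subproblem for all descendants of `pattern`.
        project_and_mine(pattern, projected, min_support, out)
    return out


patterns = project_and_mine((), DB, min_support=3)
```

Each recursive call sees a database that has already dropped irrelevant transactions and irrelevant items, which is how the counting work of higher levels is reused at lower levels.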
When generating maximal patterns, the depth-first strategy has clear advantages in terms of pruning as well. We refer the reader to the detailed description of the DepthProject algorithm later in this chapter. That description shows how several specialized pruning techniques are enabled by the depth-first strategy for maximal pattern mining. The TreeProjection algorithm has also been generalized to sequential pattern mining [31].

There are many different types of data structures that may be used in projection-style algorithms. The choice of data structure is sensitive to the data set. Two common choices that are used with the TreeProjection family of algorithms are as follows:

• Arrays: In this case, the projected database is maintained as a 2-dimensional array. One of the dimensions of the array is equal to the number of relevant transactions and the other dimension is equal to the number of relevant items in the projected database. Both dimensions of the projected database reduce from the top level to lower levels of the enumeration tree with successive projection.

• Bit strings: In this case, the projected database is maintained as a 0-1 bit string whose width is fixed to the total number of frequent 1-items, but the number of projected transactions reduces with successive projection. Such an approach loses the power of item-wise projection, but this is balanced by the fact that the bit strings can be used more efficiently for counting operations. Assume that each transaction contains n bits, and can therefore be expressed in the form of n/8 bytes. Each byte of the transaction contains the information about the presence or absence of eight items, and the integer value of the corresponding bit string can take on any value from 0 to 2^8 - 1 = 255. Correspondingly, for each byte of the (projected) transaction at a node, 256 counters are maintained and a value of 1 is added to the counter corresponding to the integer value of that transaction byte. This process is repeated for each transaction in the projected database at the node. Therefore, at the end of this process, one has 256 · (n/8) counts for the n different items. At this point, a postprocessing phase is initiated in which the support of an item is determined by adding the counts of the 128 counters which take on the value of 1 for that bit. Thus, the second phase requires only 128 · n operations, and is independent of database size. The first phase (which is the bottleneck) is the improvement over the naive counting method because it performs only one operation for each byte in the transaction, which contains eight items. Thus, the method would be a factor of eight faster than the naive counting technique, which would need to scan the entire bit string. Projection is also very efficient in the bit string representation with simple AND operations.

The major problem with fixed-width bit strings is that they are not efficient representations at lower levels of the enumeration tree, at which only a small number of items are relevant, and therefore most entries in these bit strings are 0. One approach to speed this up is to perform the item-wise projection only at selected nodes in the tree, when the reduction in the number of items from the last ancestor at which the item-wise projection was performed is at least a particular multiplicative factor. At this point, a shorter bit string is used for representation for the descendants at that node, until the width of the bit string is reduced even further by the same multiplicative factor. This ensures that the bit string representations are not sparse and wasteful.

The key issue here is that different representations provide different tradeoffs in terms of memory management and efficiency. Later in this chapter, an approach called FP-growth will be discussed which uses the trie data structure to achieve compression of projected transactions for better memory management.

3.3 Vertical Mining Algorithms

The vertical pattern mining algorithms use a vertical representation of the transaction database to enable more efficient counting. The basic idea of the vertical representation is that one can express the transaction database as an inverted list. In other words, for each item, one can have a list of the identifiers of the transactions that contain it. This is referred to as a tidset or tidlist. An example of a vertical representation of the transactions in Table 2.1 is illustrated in Table 2.2.

Table 2.2 Vertical representation of transactions. Note that the support of itemset ab can be computed as the length of the intersection of the tidlists of a and b

Item    tidlist
a       1, 2, 3, 5
b       1, 2, 4, 5
c       1, 2, 5
d       1, 2, 5
e       1, 4, 5
f       2, 3, 4
g       3, 4
h       2, 5
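As a small illustration of the vertical representation, the following Python sketch converts a horizontal database into tidlists. The helper name to_vertical and the dictionary HORIZONTAL are assumptions introduced for this example; the five transactions are the ones implied by the tidlists of Table 2.2.

```python
from collections import defaultdict

# Hypothetical horizontal database: tid -> items, as implied by Table 2.2.
HORIZONTAL = {1: "abcde", 2: "abcdfh", 3: "afg", 4: "befg", 5: "abcdeh"}


def to_vertical(transactions):
    """Map each item to the sorted list of transaction ids that contain it."""
    tidlists = defaultdict(list)
    for tid in sorted(transactions):
        for item in transactions[tid]:
            tidlists[item].append(tid)
    return dict(tidlists)


VERTICAL = to_vertical(HORIZONTAL)
```

A single pass over the transactions suffices, and because transaction identifiers are visited in increasing order, every tidlist comes out sorted.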
The key idea in vertical pattern mining algorithms is that the support of k-patterns can be computed by intersection of the underlying tidlists. There are two different ways in which this can be done.

• The support of a k-itemset can be computed as a k-way set intersection of the lists of the individual items.

• The support of a k-itemset can be computed as an intersection of the tidlists of two (k - 1)-itemsets that join to that k-itemset.

The latter approach is more efficient. The credit for both the notion of vertical tidlists and the advantages of recursive intersection of tidlists is shared by the Monet [56] and the Partition algorithms [57]. Not all vertical pattern mining algorithms use an enumeration tree concept to describe the algorithm. Many of the algorithms directly use joins to generate a (k + 1)-candidate pattern from a frequent k-pattern, though even a join-based algorithm, such as Apriori, can be explained in terms of an enumeration tree. Many of the later variations of vertical methods use an enumeration tree concept to explore the lattice of itemsets more carefully and realize the full power of the vertical approach. The individual ensemble component of Savasere et al.'s [57] Partition algorithm is the progenitor of all vertical pattern mining algorithms today, and the original Eclat algorithm is a memory-optimized and candidate-partitioned version of this Apriori-like algorithm.

3.3.1 Eclat

Eclat uses a breadth-first approach like Savasere et al.'s algorithm [57] on lattice partitions, after partitioning the candidate set into disjoint groups, using a candidate partitioning approach similar to earlier parallel versions of the Apriori algorithm. The Eclat [71] algorithm is best described with the concept of an enumeration tree because of the wide variation in the different strategies used by the algorithm. An important contribution of Eclat [71] is to recognize the earlier pioneering work of the Monet and Partition algorithms [56, 57] on recursive intersection of tidlists, and propose many efficient variants of this paradigm.
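The two ways of computing support by tidlist intersection mentioned earlier in this section can be contrasted with a short Python sketch. The names TIDSETS and support_k_way are assumptions made for illustration; the tidsets are a subset of those in Table 2.2.

```python
from functools import reduce

# Tidsets of four items, taken from Table 2.2.
TIDSETS = {
    "a": {1, 2, 3, 5},
    "b": {1, 2, 4, 5},
    "c": {1, 2, 5},
    "d": {1, 2, 5},
}


def support_k_way(itemset):
    """Support via a k-way intersection of the tidsets of the single items."""
    return len(reduce(set.__and__, (TIDSETS[i] for i in itemset)))


# Recursive alternative: intersect the tidsets of two (k - 1)-itemsets
# that join to the k-itemset, reusing intersections already computed.
tid_ab = TIDSETS["a"] & TIDSETS["b"]   # tidset of ab
tid_ac = TIDSETS["a"] & TIDSETS["c"]   # tidset of ac
tid_abc = tid_ab & tid_ac              # tidset of abc from ab and ac
support_abc = len(tid_abc)
```

The second form tends to work on shorter lists, since the tidsets of (k - 1)-itemsets are never longer than those of the individual items they contain.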
Different variations of Eclat explore the candidates with different strategies. The earliest description of Eclat may be found in [74]. A journal paper exploring different aspects of Eclat may be found in [71]. In the earliest versions of the work [74], a breadth-first strategy is used. The journal version in [71] also presents experimental results for only the breadth-first strategy, although the possibility of a depth-first strategy is mentioned in the paper. Therefore, the original Eclat algorithm should be considered a breadth-first algorithm. More recent depth-first versions of Eclat, such as dEclat, use recursive tidlist intersection with differencing [72], and realize the full benefit of the depth-first approach. The Eclat algorithm, as presented in [74], uses a levelwise strategy in which all (k + 1)-candidates within a lattice partition are generated from frequent k-patterns in level-wise fashion, as in Apriori. The tidlists are used to perform support counting. The frequent patterns are determined from these tidlists. At this point, a new levelwise phase is initiated for frequent patterns of size (k + 1).

Other variations and depth-first exploration strategies of Eclat, along with experimental results, are presented in later work such as dEclat [72]. The dEclat work in [72] presents some additional enhancements such as diffsets to improve counting. In this chapter, we present a simplified pseudo-code of this version of Eclat. The algorithm is presented in Fig. 2.8. The algorithm is structured as a recursive algorithm. A pattern set FP is part of the input, and is set to the set of all frequent 1-items at the top level call. Therefore, it may be assumed that, at the top level, the set of frequent 1-items and tidlists have already been computed, though this computation is not shown in the pseudocode. In each recursive call of Eclat, a new set of candidates FP_i is generated for every pattern (itemset) P_i, by extending the itemset by one unit. The support of a candidate is determined with the use of tidlist intersection. Finally, if the candidate is frequent, it is added to the pattern set FP_i for the next level.
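A minimal Python sketch of a recursive procedure of this kind is given below. It follows the description above rather than the exact pseudo-code of Fig. 2.8, and it omits diffsets; the function names eclat and mine, the representation of a pattern as an (itemset, tidset) pair, and the reuse of the Table 2.2 tidsets are all assumptions made for the example.

```python
# Tidsets from Table 2.2.
TIDSETS = {
    "a": {1, 2, 3, 5}, "b": {1, 2, 4, 5}, "c": {1, 2, 5}, "d": {1, 2, 5},
    "e": {1, 4, 5}, "f": {2, 3, 4}, "g": {3, 4}, "h": {2, 5},
}


def eclat(patterns, min_support, out=None):
    """Recursive tidset-intersection mining (illustrative sketch).

    `patterns` is a list of (itemset, tidset) pairs that are all frequent,
    share a common prefix, and are sorted lexicographically.
    """
    if out is None:
        out = {}
    for idx, (p_i, tids_i) in enumerate(patterns):
        out[p_i] = len(tids_i)
        next_level = []  # plays the role of the pattern set FP_i
        for p_j, tids_j in patterns[idx + 1:]:
            tids = tids_i & tids_j  # support counting by intersection
            if len(tids) >= min_support:
                next_level.append((p_i + p_j[-1:], tids))
        if next_level:
            eclat(next_level, min_support, out)
    return out


def mine(tidsets, min_support):
    """Top-level call: start from the frequent 1-items and their tidsets."""
    level1 = [((item,), tids) for item, tids in sorted(tidsets.items())
              if len(tids) >= min_support]
    return eclat(level1, min_support)


frequent_patterns = mine(TIDSETS, min_support=3)
```

Only patterns that are later siblings of P_i are joined with it, so every candidate is generated exactly once.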
Figure 2.7 illustrates the itemset generation tree with support computation by tidlist intersection for the sample database from Table 2.1. The corresponding tidlists in the tree are also illustrated. All infrequent itemsets in each level are denoted by dotted, bordered rectangles. For example, an itemset ab is generated by joining b to a. The tidlist of a is {1, 2, 3, 5}, and the tidlist of b is {1, 2, 4, 5}. We can determine the support of ab by intersecting the two tidlists to obtain the tidlist {1, 2, 5} of this candidate. Therefore, the support of ab is given by the length of this tidlist, which is 3.

Fig. 2.7 Execution of Eclat

Further gains may be obtained with the use of the notion of diffsets [72]. This approach realizes the true power of vertical pattern mining. The basic idea in diffsets is to maintain only the portion of the tidlists at a node that corresponds to the change in the inverted list from the parent node. Thus, the tidlists at a node can be reconstructed by examining the tidlists at the ancestors of a node in the tree. The major advantage of diffsets is that they save significant storage in terms of the size of the data structure required (Fig. 2.8).
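The diffset idea can be made concrete with the tidlists of a, b, and c from Table 2.2. The sketch below is illustrative only: the variable names are assumptions, and the second-level rule used here (the diffset of abc is the diffset of ac minus the diffset of ab) is stated as the standard diffset recurrence rather than taken from the text above.

```python
# Tidsets of three items from Table 2.2.
T_A = {1, 2, 3, 5}
T_B = {1, 2, 4, 5}
T_C = {1, 2, 5}

# Level 2: store only what is lost when moving from parent a to child ab.
d_ab = T_A - T_B                    # diffset of ab relative to a
d_ac = T_A - T_C                    # diffset of ac relative to a
support_ab = len(T_A) - len(d_ab)   # support(a) minus the size of the diffset
rebuilt_ab = T_A - d_ab             # tidset of ab recovered from the parent

# Level 3: the diffset of abc relative to ab, obtained from diffsets only.
d_abc = d_ac - d_ab
support_abc = support_ab - len(d_abc)
```

Here the diffset of ab holds a single transaction identifier, whereas the tidlist of ab holds three, which is the kind of storage saving the approach aims for on dense data.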
3.3.2 VIPER

The VIPER algorithm [58] uses a vertical approach to mining frequent patterns. The basic idea in the VIPER algorithm is to represent the vertical database in the form of compressed bit vectors that are also referred to as snakes. These snakes are then used for efficient counting of the frequent patterns. The compressed representation provides a number of optimization advantages that are leveraged by the algorithm. Intrinsically, VIPER is not very different from Eclat in terms of the basic counting approach. The major difference is in terms of the choice of the compressed bit vector representation, and the efficient handling of this representation. Details may be found in [58].

4 Recursive Suffix-Based Growth

In these algorithms, recursive suffix-based exploration of the patterns is performed. Note that in most frequent pattern mining algorithms, the enumeration tree (execution tree) of patterns explores the patterns in the form of a lexicographic tree of itemsets built on the prefixes. Suffix-based methods use a different convention in which the suffixes of frequent patterns are extended. As in all projection-based methods, one only needs to use the transaction database containing a given itemset in order to count itemsets that have this itemset as a suffix. Itemsets are extended from the suffix backwards. In each iteration, the conditional transaction database (or projected database) of transactions containing the current suffix being explored is an input to the algorithm. Furthermore, it is assumed that the conditional database contains only frequent extensions of the suffix. For the top-level call, the suffix is null, and the frequent items are determined using a single preprocessing pass that is not shown in the pseudo-code. Because each item is already known to be frequent, the frequent patterns obtained by adding an item to the suffix can be immediately generated for each item in the conditional database. The database is projected further to include only transactions containing that item, and a recursive call is initiated with the extended suffix. The projected database corresponding to transactions containing the extended suffix is determined. Infrequent items are removed from this projected database. Thus, the transactions are recursively projected to reflect the addition of an item in the suffix. Thus, this is a smaller subproblem that can be solved recursively.

Fig. 2.9 Suffix-based pattern exploration

The FP-growth approach uses the suffix-based pattern exploration, as illustrated in Fig. 2.9. In addition, the FP-growth approach uses an efficient data structure, known as the FP-Tree, to represent the conditional transaction database with the use of compressed prefixes. The FP-Tree will be discussed in more detail in a later section. The suffix in the top level call to the algorithm is the null itemset.

Recursive suffix-based exploration of the pattern space is, in principle, no different from prefix-based exploration of the enumeration tree space with the ordering of the items reversed. In other words, by using a reverse ordering of items, suffix-based recursive pattern space exploration can be simulated with prefix-based enumeration tree exploration. Indeed, as discussed in the last section, prefix-based enumeration tree methods order items from the least frequent to the most frequent, whereas the suffix-based methods of this section order items from the most frequent to the least frequent, to account for this difference. Thus, suffix-based recursive growth has an execution tree that is identical in structure to a prefix-based enumeration tree. This is a difference only of convention, and it does not affect the pattern space that is explored.

It is instructive to compare the suffix-based exploration with the pseudocode of the prefix-based TreeProjection algorithm in Fig. 2.6. The two pseudocodes are structured differently because the initial pre-processing pass of removing infrequent items is not assumed in the TreeProjection algorithm. Therefore, in each recursive call of the prefix-based TreeProjection, frequent itemsets must be counted before they are reported. In suffix-based exploration, this step is done as a preprocessing step (for the top-level call) and just before the recursive call for deeper calls. Therefore, each recursive call always starts with a database of frequent items. This is, of course, a difference in terms of how the recursive calls are structured, but is not different in terms of the basic search strategy, or the amount of overall computational work required, because infrequent items need to be removed in either case. A few other key differences are evident:

• TreeProjection uses database projections on top of a prefix-based enumeration tree. Suffix-based recursive methods have a recursion tree whose structure is similar to an enumeration tree on the frequent suffixes instead of the prefixes. The removal of infrequent items from the projected database in FP-growth is similar to determining which branches of the enumeration tree to extend further.

• The use of suffix-based exploration is a difference only of convention from prefix-based exploration. For example, after reversing the item order, one might implement FP-growth by growing patterns on the prefixes, but constructing a compressed FP-Tree on the suffixes (the resulting FP-Tree will be a suffix-based trie). The resulting exploration order and execution in the two different implementations of FP-growth will be identical, but the latter can be more easily related to traditional enumeration tree methods.

• Various database projection methods are different in terms of the specific data structures used for the projected database. The different variations of TreeProjection use arrays and bit strings to represent the projected database. The FP-growth method uses an FP-Tree. The FP-Tree will be discussed in the next section. Later variations of FP-Tree also use combinations of arrays and pointers to represent the projected database. Some variations, such as OpportuneProject [38], combine different data structures in an optimized way to obtain the best result.

• Suffix-based recursive growth is inherently defined as a depth-first strategy. On the other hand, as is evident from the discussion in [5], the specific choice of exploration strategy on the enumeration tree is orthogonal to the process of database projection. The overall size of the enumeration tree is the same, no matter how it is explored, unless maximal pattern pruning is used. Thus, TreeProjection explores a variety of strategies such as breadth-first and depth-first strategies, with no difference to the (overall) work required for counting. The major challenge with the breadth-first strategy is the simultaneous maintenance of projected transaction sets along the breadth of the tree. The issue of effective memory management of breadth-first strategies is discussed in [5], which shows how certain optimizations such as cache-blocking can improve the effectiveness in this case. Breadth-first strategies also allow certain kinds of pruning such as level-wise pruning.

• The major advantages of depth-first strategies arise in the context of maximal pattern mining. This is because a depth-first strategy discovers the maximal patterns very early, which can be used to prune the smaller non-maximal patterns. In this case, the size of the search space explored truly reduces because of a depth-first strategy. This issue is discussed in the section on maximal pattern mining. The advantages for maximal pattern mining were first proposed in the context of the DepthProject algorithm [4].

Next, we will describe the FP-Tree data structure that uses compressed representations of the transaction database for more efficient counting.

4.1 The FP-Growth Approach

The FP-growth approach combines suffix-based pattern exploration with a compressed representation of the projected database for more efficient counting. The prefix-based FP-Tree is a compressed representation of the database which is built by considering a fixed order among the items in an itemset [32]. This tree is used to represent the conditional transaction sets of Fig. 2.9. An FP-Tree may be viewed as a prefix-based trie data structure of the transaction database of frequent items. Just as each node in a trie is labeled with a symbol, a node in the FP-Tree is labeled with an item. In addition, the node holds the support of the itemset defined by the items of the nodes that are on the path from the root to that node. By consolidating the prefixes, one obtains compression. This is useful for effective memory management. On the other hand, the maintenance of counts and pointers with the prefixes is an additional overhead. This results in a different set of trade-offs as compared to the array representation.

Fig. 2.10 FP-Tree construction (the transactions a,b,c,d,e; a,b,c,f,d; a,f; b,f,e; and a,b,c,d,e are inserted one at a time into an initially empty tree, after which the pointers are added)

The initial FP-Tree is constructed as follows. We start with the empty FP-Tree FPT. Before constructing the FP-Tree, the database is scanned and infrequent items are removed. The frequent items are sorted in decreasing order of support. The initial construction of the FP-Tree is straightforward, and similar to how one might insert a string in a trie. For every insertion, the counts of the relevant nodes that are affected by the insertion are incremented by 1. If the current transaction being inserted shares a prefix with a previously inserted transaction, then it will follow the same path until the end of the common prefix. Beyond this common prefix, new nodes are inserted in the tree for the remaining items in the transaction, with support count initialized to 1. The above procedure ends when all transactions have been inserted.

To store the items in the final FP-Tree, a list structure called the header table is maintained. A chain of pointers threads through the occurrences of each item in the FP-Tree. Thus, this chain of pointers needs to be constructed in addition to the trie data structure. Each entry in this table stores the item label and a pointer to the node representing the leftmost occurrence of the item in the FP-Tree (the first item in the pointer chain). The reason for maintaining these pointers is that it is possible to determine the conditional FP-Tree for an item by chasing the pointers for that item. An example of the initial construction of the FP-Tree data structure from a database of five transactions is illustrated in Fig. 2.10. The ordering of the items, as reflected in the transactions of the figure, is a, b, c, f, d, e. It is clear that a trie data structure is created, and the node counts are updated by the insertion of each transaction in the FP-Tree. Figure 2.10 also shows all the pointers between the different items. The sum of the counts on the nodes on the pointer path of an item is the support of the item. This support is always at least the minimum support because a fully constructed FP-Tree (with pointers) contains only frequent items. The actual counting of the support of item-extensions and the removal of infrequent items must be done during conditional transaction database (and the relevant FP-Tree) creation. The pointer paths are not available during the FP-Tree creation process. For example, an item may have two nodes on its pointer path, with counts 2 and 1. By summing up these counts, a total count of three for the item is obtained. It is not difficult to verify that three transactions contain the item.

With this new compressed representation of the conditional transaction database of frequent items, one can directly extract the frequent patterns. The pseudo-code of the FP-growth algorithm is presented in Fig. 2.11. Although this pseudo-code looks much more complex than the earlier pseudocode of Fig. 2.9, the main difference is that more details of the data structure (FP-Tree), used to represent the conditional transaction sets, have been added. The algorithm accepts an FP-Tree FPT, the current itemset suffix, and the user-defined minimum support as input. The additional suffix has been added to the parameters to facilitate the recursive description. At the top level call made by the user, the suffix is null. Furthermore, the conditional FP-Tree is constructed on a database of frequent items rather than all the items. This property is maintained across different recursive calls.

Fig. 2.11 FP-growth

For an FP-Tree FPT, the conditional FP-Trees are built for each item in FPT (which is already known to be frequent). The conditional FP-Trees are constructed by chasing pointers for each item in the FP-Tree. This yields all the conditional prefix paths for the item. The infrequent nodes from these paths are removed, and they are put together to create a conditional FP-Tree for the item. Because the infrequent items have already been removed from it, the new conditional FP-Tree also contains only frequent items. Therefore, in the next level recursive call, any item from the conditional FP-Tree can be appended to the extended suffix to generate another pattern. The supports of those patterns can also be reconstructed via pointer chasing during the process of reporting the patterns. Thus, the current pattern suffix is extended with the frequent item appended to the front. The pattern given by this extended suffix also needs to be reported as frequent. The resulting conditional FP-Tree is the compressed database representation of the corresponding projected database of Fig. 2.9 in the previous section. Thus, it is a smaller conditional tree that contains information relevant only to the extraction of various prefix paths relevant to different items that will extend the suffix further in the backwards direction. Note that infrequent items are removed from the conditional FP-Tree during this step, which requires the support counting of all items in it. Because the pointers have not yet been constructed for the conditional FP-Tree, the support of each item-extension of the extended suffix must be explicitly determined by locating each instance of an item in the conditional FP-Tree. This is the primary computational bottleneck step. The removal of infrequent items from the conditional FP-Tree may result in a different structure of the FP-Tree in the next step.

Finally, if the conditional FP-Tree is not empty, the FP-growth method is called recursively with parameters corresponding to the conditional FP-Tree, the extended suffix, and the minimum support. Note that successive projected transaction databases (and corresponding conditional FP-Trees) in the recursion will be smaller because of the recursive projection. The base case of the recursion occurs when the entire FP-Tree is a single path. This is likely to occur when the projected transaction database becomes small enough. In that case, FP-growth determines all combinations of nodes on this path, appends the suffix to them, and reports them.

Fig. 2.12 Generating a conditional FP-Tree by pointer chasing (pointer chasing of e, followed by removal of e)

An example of how the conditional FP-Tree is created for a minimum support of 1 unit is illustrated in Fig. 2.12. Note that if the minimum support were 2, then the right branch would not be included in the conditional FP-Tree. In this case, the pointers for item e are chased in the FP-Tree to create the conditional prefix paths of the relevant conditional transaction database. This represents all transactions containing e. The counts on the prefix paths are re-adjusted because many branches are pruned. The removal of infrequent items and that of the item e might lead to a conditional FP-Tree that looks very different from the conditional prefix paths. These kinds of conditional FP-Trees need to be generated for each conditional frequent item, although only a single item has been shown for the sake of simplicity. Note that, in general, the pointers may need to be recreated every time a conditional FP-Tree is created.

4.2 Variations

As the database grows larger, the construction of the FP-Tree becomes challenging in terms of both runtime and space complexity. There have been many works [8, 24, 27, 29, 30, 36, 39, 55, 59, 61, 62] to tackle these challenges. These variations of the FP-growth method can be classified into two categories. Methods belonging to the first category design a memory-based mining process using a memory-resident data structure that holds the partitioned database. Methods belonging to the second category improve the efficiency of the FP-Tree representation. In this subsection, we will present these approaches briefly.

4.2.1 Memory-Resident Variations

In the following, a number of different memory-resident variations of the basic FP-growth idea will be described.

CT-PRO Algorithm. In this work [62], the authors introduced a new FP-Tree-like data structure called Compact FP-Tree (CFP-Tree) that holds the same information as the FP-Tree but with 50% less storage. They also designed a mining algorithm called CT-PRO which follows a non-recursive procedure, unlike FP-growth. As discussed earlier, during the mining process, FP-growth constructs many conditional FP-Trees, which becomes an overhead as the patterns get longer or the support gets lower. To overcome this problem, the CT-PRO algorithm divides the database into several disjoint projections, where each projection is represented as a CFP-Tree. Then a non-recursive mining process is executed over each projection independently. Significant modifications were made to the header table data structure. In the original FP-Tree, the nodes store the support and item label. However, in the CFP-Tree, item labels are mapped to an increasing sequence of integers that is actually the index of the header table. The header table of the CFP-Tree stores the support of each item. To compress the original FP-Tree, all identical subtrees are removed by accumulating them and storing the relevant information in the leftmost branch. The header table contains a pointer to each node on the leftmost branch of the CFP-Tree, as these nodes are roots of subtrees starting with different items.

The mining process starts from the pointers of the least frequent items in the header table. This prunes a large number of nodes at an early stage and shrinks the tree structure. By following the pointers to the same item, a projection of all transactions ending with the corresponding item is built. This projection is also represented as a CFP-Tree, called the local CFP-Tree. The local CFP-Tree is then traversed to extract the frequent patterns in the projection.

H-Mine Algorithm. The authors in [54] proposed an efficient algorithm called H-Mine. It uses a memory-efficient hyper-structure called H-Struct. The fundamental strategy of H-Mine is to partition the database and mine each partition in memory. Finally, the results from different partitions are consolidated into global frequent patterns. An intelligent feature of H-Mine is that it can identify whether the database is dense or sparse, and it is able to make dynamic choices between different data structures based on this identification. More details may be found in Chap. 3 on pattern-growth methods.

4.2.2 Improved Data Structure Variations

In this section, several variations of the basic algorithm that improve the underlying data structure will be described.

Using Arrays. A significant part of the mining time in FP-growth is spent on traversing the tree. To reduce this time, the authors in [29] designed an array-based implementation of FP-growth, named FP-growth*, which drastically reduces the traversal time of the mining algorithm. It uses the FP-Tree data structure in combination with an array-like data structure and it incorporates various optimization schemes. It should be pointed out that the TreeProjection family of algorithms also
onindependently.SignicantmodicationsweremadetotheheaderTable4.1datastructure.IntheoriginalFP-Tree,thenodesstorethesupportanditemlabel.However,intheCFP-Tree,itemlabelsaremappedtoanincreasingsequenceofintegersthatisactuallytheindexoftheheadertable.TheheadertableofCFP-Treestoresthesupportofeachitem.TocompresstheoriginalFP-Tree,allidenticalsubtreesareremovedbyaccumulatingthemandstoringtherelevantinformationintheleftmostbranch.TheheadertablecontainsapointertoeachnodeontheleftmostbranchoftheCFP-Tree,asthesenodesarerootsofsubtreesstartingwithdifferentitems.Theminingprocessstartsfromthepointersoftheleastfrequentitemsintheheadertable.Thisprunesalargenumberofnodesatanearlystageandshrinksthetree C.C.Aggarwaletal.structure.Byfollowingthepointerstothesameitem,aprojectionofalltransactionsendingwiththecorrespondingitemisbuilt.ThisprojectionisalsorepresentedasaCFP-TreecalledlocalCFP-Tree.ThelocalCFP-Treeisthentraversedtoextractthefrequentpatternsintheprojection.H-MineAlgorithmTheauthorsin[54]proposedanefcientalgorithmcalled.Itusesamemoryefcienthyper-structurecalledH-Struct.Thefundamentalstrategyofistopartitionthedatabaseandmineeachpartitioninthememory.Finally,theresultsfromdifferentpartitionsareconsolidatedintoglobalfrequentpatterns.Anintelligentmoduleofisthatitcanidentifywhetherthedatabaseisdenseorsparse,anditisabletomakedynamicchoicesbetweendifferentdatastructuresbasedonthisidentication.MoredetailsmaybefoundinChap.3onpattern-growthmethods.4.2.2ImprovedDataStructureVariationsInthissection,severalvariationsofthebasicalgorithmbyimprovingtheunderlyingdatastructurewillbedescribed.UsingArraysAsignicantpartoftheminingtimeinFP-growthisspentontravers-ingthetree.Toreducethistime,theauthorsin[29]designedanarraybasedimplementationofFP-growth,namedFP-growth*whichdrasticallyreducesthetraversaltimeoftheminingalgorithm.ItusestheFP-Treedatastructureincom-binationwithanarray-likedatastructureanditincorporatesvariousoptimizationschemes.ItshouldbepointedoutthattheTreeProjectionfamilyofalgorithmsalso
usesarrays,thoughtheoptimizationsusedarequitedifferent.Whentheinputdatabaseissparse,thearraybasedtechniqueperformswellbe-causethearraysavesthetraversaltimeforalltheitems;moreovertheinitializationofthenextlevelofFP-Treesiseasierusinganarray.Butincaseofdensedatabase,thetreebaserepresentationismorecompact.Todealwiththesituation,FP-growth*devisesamechanismtoidentifywhetherthedatabaseissparseornot.Todoso,growth*countsthenumberofnodesineachlevelofthetree.Basedonexperiments,theyfoundthatiftheupperquarterofthetreecontainslessthan15%ofthetotalnumberofnodes,thenthedatabaseismostlikelydense.Otherwise,itissparse.Ifthedatabaseturnsouttobesparse,FP-growth*allocatesanarrayforeachFP-Treeinthenextlevelofmining.ThenonordfpApproachThiswork[55]presentedanimprovedimplementationofthewellknownFP-growthalgorithmusinganefcientFP-Treelikedatastructurethatallowsfasterallocation,traversalandoptionalprojection.Thetreenodesdonotstoretheirlabels(itemidentiers).Thereisnoconceptofheadertable.Thedatastructurestoreslessadministrativeinformationinthetreenodewhichallowtherecursivestepofminingwithoutrebuildingthetree. 
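The density test used by FP-growth* (described under "Using Arrays" above) can be sketched in a few lines. This is only an illustrative sketch and not the implementation of [29]: the function name `classify_density`, the per-level node-count input, and the reading of "upper quarter of the tree" as the top quarter of the levels are all assumptions made here.

```python
def classify_density(nodes_per_level, threshold=0.15):
    """Classify a database as 'dense' or 'sparse' from its FP-Tree shape.

    nodes_per_level[i] is the number of FP-Tree nodes at depth i + 1
    (the root is excluded). Hypothetical helper; the 15% threshold is the
    experimentally derived value quoted in the text.
    """
    depth = len(nodes_per_level)
    total = sum(nodes_per_level)
    if total == 0:
        raise ValueError("the tree has no nodes")
    # Assumption: "upper quarter" means the top quarter of the levels,
    # with at least one level for very shallow trees.
    upper_levels = max(1, depth // 4)
    upper_fraction = sum(nodes_per_level[:upper_levels]) / total
    return "dense" if upper_fraction < threshold else "sparse"


if __name__ == "__main__":
    print(classify_density([1, 2, 4, 8, 16, 32, 64, 128]))
    print(classify_density([10, 5, 3, 2]))
```

In an actual miner, the result of such a test would decide whether an array is allocated for the FP-Trees of the next mining level.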
Fig. 2.13 Frequent, maximal and closed itemsets

5 Maximal and Closed Frequent Itemsets

One of the major challenges of frequent itemset mining is that most of the itemsets mined are subsets of the set of single-length frequent items. Therefore, a significant amount of time is spent on counting redundant itemsets. One solution to this problem is to discover condensed representations of the frequent itemsets, that is, representations that summarize the properties of the set of itemsets completely or partially. A compact representation not only saves computational and memory resources but also paves a much easier way towards the knowledge discovery stage after mining. Another interesting observation by [53] was that, instead of mining the complete set of frequent itemsets and their associations, association mining only needs to find frequent closed itemsets and their corresponding rules. So, mining frequent closed itemsets can fulfill the objectives of mining all frequent itemsets but with less redundancy and better efficiency and effectiveness in mining. In this section, we will discuss two types of condensed representation of itemsets: maximal and closed frequent itemsets.

5.1 Definitions

Maximal Frequent Itemset
Suppose $\mathcal{T}$ is the transaction database, $\mathcal{I}$ is the set of all items in the database, and $\mathcal{F}$ is the set of all frequent itemsets. A frequent itemset $P \in \mathcal{F}$ is called maximal if it has no frequent superset. Let $\mathcal{M}$ be the set of all frequent maximal itemsets, which is denoted by

$\mathcal{M} = \{P \mid P \in \mathcal{F} \text{ and } \nexists\, Q \supset P \text{ such that } Q \in \mathcal{F}\}$
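The definition of a maximal frequent itemset can be checked directly on small data. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the frequent itemsets are found by brute-force enumeration (practical only for toy inputs), and the example transactions are made up for illustration and are not those of Table 2.1.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all frequent itemsets (brute force)."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            support = sum(1 for t in transactions if candidate <= t)
            if support >= min_support:
                result[candidate] = support
    return result


def maximal_itemsets(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return {p for p in frequent if not any(p < q for q in frequent)}


if __name__ == "__main__":
    # Hypothetical toy database, not the one from Table 2.1.
    db = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("cd")]
    freq = frequent_itemsets(db, min_support=2)
    print(sorted("".join(sorted(p)) for p in maximal_itemsets(freq)))
```

Note that the maximal set alone does not retain the supports of its subsets, which is the information loss discussed later in this section.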
For the toy transaction database in Table 2.1, the frequent maximal itemsets at minimum support 3 are abcd, as illustrated in Fig. 2.13. All the rectangles filled with grey color represent maximal frequent patterns. As we can see in Fig. 2.4, there are no frequent supersets of abcd.

Closed Frequent Itemset
The closure operator $\gamma$ induces an equivalence relation on the power set of items, partitioning it into disjoint subsets called equivalence classes. The largest element with respect to the number of items in each equivalence class is called a closed itemset. A frequent itemset $P$ is closed if $\gamma(P) = P$. From the closure property it can be said that both $\gamma(P)$ and $P$ have the same tidset. In simpler terms, an itemset is closed if it does not have any frequent superset with the same support. The set of closed itemsets can be written as:

$\mathcal{C} = \{P \mid P \in \mathcal{F} \text{ and } \nexists\, Q \supset P \text{ such that } support(Q) = support(P)\}$

Because maximal itemsets have no frequent superset, they are vacuously closed frequent itemsets. Thus, all maximal patterns are closed. However, there is a key difference between mining maximal itemsets and closed itemsets. Mining maximal itemsets loses information about the support of the underlying itemsets. On the other hand, mining closed itemsets does not lose any information about the support. The support of the missing subsets can be derived from the closed frequent pattern database. One way of viewing closed frequent patterns is as the maximal patterns from each group of frequent patterns with equal support. Closed frequent itemsets are a condensed representation of frequent itemsets that is lossless.

For the toy transaction database of Table 2.1, the frequent closed patterns are abcd for a minimum support value of 3, as illustrated in Fig. 2.13. All the rectangles with a dotted border represent closed frequent patterns. The remaining nodes in the tree (neither filled nor with a dotted border) represent the other frequent itemsets.

5.2 Frequent Maximal Itemset Mining Algorithms

In this subsection, we will discuss some of the maximal frequent itemset mining algorithms.

5.2.1 MaxMiner Algorithm

The MaxMiner algorithm was the first algorithm that used a variety of optimizations to improve the effectiveness of tree explorations [10]. This algorithm is generally focussed on determining maximal patterns rather than all patterns. The author of [10] observed that it is usually sufficient to report only maximal patterns when frequent patterns are long. This is because of the combinatorial explosion in examining all subsets of patterns. Although the exploration of the tree is still done in breadth-first fashion, a number of optimizations are used to improve the efficiency of exploration:

• The concept of lookaheads is defined. Let $F(P)$ be the set of candidate items that might extend node $P$. Before counting, it is checked whether $P \cup F(P)$ is a subset of any of the frequent patterns found so far. If such is indeed the case, then it is known that the entire subtree rooted at $P$ is frequent, and it can be pruned from consideration (for maximal pattern mining). While counting the support of the individual item extensions of $P$, the support of $P \cup F(P)$ is also determined. If the set $P \cup F(P)$ is frequent, then it is known that all itemsets in the entire subtree rooted at that node are frequent. Therefore, the tree does not need to be explored further, and can be pruned.

• The support lower bounding trick discussed earlier can be used to quickly determine patterns which are frequent without explicit counting. The counts of extensions of nodes can be determined without counting in many cases, where the count does not change by extending an item.

It has been shown in [10] that these simple optimizations can improve over the Apriori algorithm by orders of magnitude.

5.2.2 DepthProject Algorithm

The DepthProject algorithm is based on the notion of the lexicographic tree, defined in [5]. Unlike TreeProjection, the approach aggressively explores the candidates in a depth-first strategy, both to ensure better pruning and faster counting. As in TreeProjection, the database is recursively projected down the lexicographic tree to ensure more efficient counting. This kind of projection ensures that the counting information for $k$-candidates is reused for $(k+1)$-candidates, as in the case of FP-growth.

For the case of the DepthProject method [4], the lexicographic tree is explored in depth-first order to maximize the advantage of lookaheads, in which entire subtrees can be pruned because it is known that all patterns in them are frequent. The overall pseudocode for the depth-first strategy is illustrated in Fig. 2.14. The pseudocodes for candidate generation and counting are not provided because they are similar to the previously discussed algorithms. However, one important distinction in counting is that projected databases are used for counting. This is similar to the FP-growth class of algorithms. Note that the recursive transaction projection is particularly effective with a depth-first strategy because a smaller number of projected databases need to be stored along a path in the tree, as compared to the breadth of the tree.

To reduce the overhead of counting long patterns, the notion of lookaheads is used. At any node $P$ of the tree, let $F(P)$ be its possible (candidate) item extensions. Then, it is checked whether $P \cup F(P)$ is frequent in two ways:

1. Before counting the support of the individual extensions of $P$, it is checked whether $P \cup F(P)$ occurs as a subset of a frequent itemset that has already been discovered earlier during depth-first exploration. If such is the case, then the entire subtree rooted at $P$ is pruned because it is known to be frequent and it is not a maximal pattern. This type of pruning is particularly effective with a depth-first strategy.

2. During support counting of the item extensions, the support of $P \cup F(P)$ is also determined. If, after support counting, $P \cup F(P)$ turns out to be frequent, then the entire subtree rooted at node $P$ can be pruned. Note that the projected database at node $P$ (as in TreeProjection) is used.

Although lookaheads are also used in the MaxMiner algorithm, it should be pointed out that the effectiveness of lookaheads is maximized with a depth-first strategy. This is true of the first of the two aforementioned strategies, in which it is checked whether $P \cup F(P)$ is a subset of an already existing frequent pattern. This is because a depth-first strategy tends to explore the itemsets in dictionary order. In dictionary order, maximal itemsets are usually explored much earlier than most of their subsets. For example, for a 10-itemset abcdefghij, only 9 of the 1024 subsets of the itemset will be explored before exploring the itemset abcdefghij. These 9 itemsets are the immediate prefixes of the itemset. When the longer itemsets are explored early, they become available to prune shorter itemsets.

The following information is stored at each node during the process of construction of the lexicographic tree:

1. The itemset $P$ at that node.

2. The set of lexicographic tree extensions at that node.

3. A pointer to the projected transaction set $\mathcal{T}(Q)$, where $Q$ is some ancestor of $P$ (including itself). The root of the tree points to the entire transaction database.

4. A bitvector containing the information about which transactions contain the itemset
for node $P$ as a subset. The length of this bitvector is equal to the total number of transactions in the projected transaction set $\mathcal{T}(Q)$ to which the node points. The value of a bit for a transaction is equal to one if the itemset $P$ is a subset of the transaction. Otherwise it is equal to zero. Thus, the number of 1 bits is equal to the number of transactions in $\mathcal{T}(Q)$ which project to $P$. The bitvectors are used to make the process of support counting more efficient.

After all the projected transactions at a given node have been identified, finding the subtree rooted at that node is a completely independent itemset generation problem with a substantially reduced transaction set. The number of transactions at a node is proportional to the support at that node.

The description in Fig. 2.14 shows how the depth-first creation of the lexicographic tree is performed. The algorithm is described recursively, so that the call from each node is a completely independent itemset generation problem that finds all frequent itemsets that are descendants of a node. There are three parameters to the algorithm: a pointer to the database, the itemset node, and the bitvector. The bitvector contains one bit for each transaction in the database, and indicates whether or not the transaction should be used in finding the frequent extensions of the node. A bit for a transaction is one if the itemset at that node is a subset of the corresponding transaction. The first call to the algorithm is from the root node, with the entire transaction database as the database parameter. Because each transaction in the database is relevant to perform the counting, the bitvector consists of all one values.

Fig. 2.14 The depth-first strategy

One property
of the DepthProject algorithm is that the projection is performed only when the transaction database reduces by a certain size. This is the ProjectionCondition in Fig. 2.14.

Most of the nodes in the lexicographic tree correspond to the lower levels. Thus, the counting times at these levels account for most of the CPU time of the algorithm. For these levels, a strategy called bucketing can substantially improve the counting times. The idea is to change the counting technique at a node $P$ in the lexicographic tree if $|F(P)|$, the number of candidate item extensions of $P$, is less than a certain value. In this case, an upper bound on the number of distinct projected transactions is $2^{|F(P)|}$. Thus, for example, when $|F(P)|$ is nine, there are only 512 distinct projected transactions at the node $P$. Clearly, this is because the projected database contains several repetitions of the same (projected) transaction. The fact that the number of distinct transactions in the projected database is small can be exploited to yield substantially more efficient counting algorithms. The aim is to count the support for the entire subtree rooted at $P$ with a quick pass through the data, and an additional postprocessing phase which is independent of database size. The process of performing bucket counting consists of two phases:

1. In the first phase, the counts of each distinct transaction present in the projected database are determined. This can be accomplished easily by maintaining $2^{|F(P)|}$ buckets or counters, scanning the transactions one by one, and adding counts to the buckets. The time for performing this set of operations is linear in the number of (projected) database transactions.

2. In the second phase, the counts of the $2^{|F(P)|}$ distinct transactions are used to determine the aggregate support counts for each itemset. In general, the support count of an itemset may be obtained by adding the counts of all the supersets of that itemset to it. An efficient algorithm for performing these operations is illustrated in Fig. 2.15.

Fig. 2.15 Aggregating bucket

Consider a string composed of 0, 1, and * that refers to an itemset in which the positions with 0 and 1 are fixed to those values (corresponding to absence or presence of items), while a position with a * is a "don't care". Thus, all itemsets can be expressed in terms of 1 and *, because itemsets are traditionally defined with respect to presence of items. Consider, for example, the case when $|F(P)| = 4$, and there are four items, numbered 1, 2, 3, 4. An itemset containing items 2 and 4 is denoted by *1*1. We start off with the information on $2^4 = 16$ bitstrings which are composed of 0 and 1. These represent all possible distinct transactions. The algorithm aggregates the counts in $|F(P)|$ iterations. The count for a string with a * in a particular position may be obtained by adding the counts for the strings with a 0 and a 1 in that position. For example, the count for the string *1*1 may be expressed as the sum of the counts of the strings 01*1 and 11*1.

The procedure in Fig. 2.15 works by starting with the counts of the 0-1 strings, and then converts them to strings with 1 and *. The algorithm requires $|F(P)|$ iterations. In the $i$th iteration, it increases the counts of all those buckets with a 0 in the $i$th bit, so that the count now corresponds to the case in which that bucket contains a * in that position. This is achieved by adding, to each bucket with a 0 in the $i$th position, the count of the bucket with a 1 in that position and all other bits having the same value. For example, the count of the string 0*1* is obtained by adding the counts of the buckets 001* and 011*. In Fig. 2.15, the step that adds the count of the bucket with a 1 in the $i$th position to that of the matching bucket with a 0 achieves this.

The second phase of the bucketing operation requires $|F(P)|$ iterations, and each iteration requires $2^{|F(P)|-1}$ operations. Therefore, the total time required by the method is proportional to $2^{|F(P)|} \cdot |F(P)|$. When $|F(P)|$ is sufficiently small, the time required by the second phase of postprocessing is small compared to the first phase, whereas the first phase is essentially proportional to reading the database for the current projection.

Fig. 2.16 Performing the second phase of bucketing

We have illustrated the second phase of bucketing by an example in which $|F(P)| = 3$. The process illustrated in Fig. 2.16 shows how the second phase of bucketing is efficiently performed. The exact strings and the corresponding counts in each of the 3 iterations are illustrated. In the first iteration, all those bitstrings with a 0 in the lowest order position have their counts added with the count of the bitstring with a 1 in that position. Thus, $2^{3-1} = 4$ pairwise addition operations
take place during this step. The same process is repeated two more times with the second and third order bits. At the end of three passes, each bucket contains the support count for the appropriate itemset, where each 0 for the itemset is replaced by a "don't care", which is represented by a *. Note that the number of transactions in this example is 27. This is represented by the entry for the bucket ***. Only two transactions contain all three items, which is represented by the bucket 111.

The projection-based methods were shown to have an order of magnitude improvement over the MaxMiner algorithm. The depth-first approach has subsequently been used in the context of many tree-based algorithms. Other examples of such algorithms include those in [17, 18, 14]. Among these, the MAFIA algorithm [14] is discussed in some detail in the next subsection. An approach which varies the projection methodology, and uses opportunistic projection, is discussed in [38]. This algorithm opportunistically chooses between array-based and tree-based representations to represent projected transaction subsets. Such an approach has been shown to be more efficient than many state-of-the-art methods such as the FP-growth method. Other variations of tree-based algorithms have also been proposed [70] that use different strategies in tree exploration.

5.2.3 MAFIA Algorithm

The MAFIA algorithm proposed in [14] shares a number of similarities with the DepthProject approach, though it uses a bitmap-based approach for counting rather than a projected transaction database. In the bitmap-based approach, a sequence of bits is maintained for each itemset, in which each bit indicates whether or not the corresponding transaction contains that itemset. Sparse representations (such as a list of transaction identifiers) may also be used when the fraction of transactions containing the itemset is small. Note that such an approach may be considered a special case of database projection [5], in which vertical projection is used but horizontal projection is not. This has the advantage of requiring less memory, but it reuses a smaller fraction of the counting information from higher level nodes. A number of other pruning optimizations have also been proposed in this work that further improve the effectiveness of the algorithm. In particular, it has been pointed out that when the support
of the extension of a node is the same as that of its parent, then that subtree can be pruned away, because the counts of all the itemsets in the subtree can be derived from those of other itemsets in the data. This is the same as the support lower bounding trick discussed in Sect. 2.4, and also used in MaxMiner for pruning. Thus, the approach in [14] uses many of the same strategies used in TreeProjection and DepthProject, but in a different combination, and with some variations in the specific implementation.

5.2.4 GenMax

Like MAFIA, GenMax uses the vertical representation to speed up counting. Specifically, the tid-lists are used by GenMax to speed up the counting approach. In particular, the more recent notion of diffsets [72] was used, together with a depth-first exploration strategy. An approach known as successive focussing was used to further improve the efficiency of the algorithm. The details of the approach may be found in [28].

5.3 Frequent Closed Itemset Mining Algorithms

Several frequent closed itemset mining algorithms [41, 42, 51–53, 64, 66–69, 73] exist to date. Most of the maximal and closed pattern mining algorithms are based on different variations of the non-maximal pattern mining algorithms. Typically, pruning strategies are incorporated within the non-maximal pattern mining algorithms to yield more efficient algorithms.

5.3.1 Close

In this algorithm [52], the authors apply Apriori-based pattern generation over the closed itemset search space. The use of the closed itemset lattice (search space) significantly reduces the overall search space of the algorithm. Close operates in an iterative manner. Each iteration consists of three phases. First, the closure function is applied to obtain the candidate closed itemsets and their support. Next, the obtained set of candidate closed itemsets is tested against the minimum support constraint. The candidates that pass the test are marked as frequent closed itemsets. Finally, the same procedure is initiated to generate the next level of candidate closed itemsets. This process continues until all frequent closed itemsets have been generated.

5.3.2 CHARM

CHARM [73] is a frequent closed itemset mining algorithm that takes advantage of the vertical representation of the database, as in the case of Eclat [71], for efficient closure checking. For pruning the search space, CHARM uses the following three properties. Suppose that $P$ and $Q$ are itemsets. If tidset($P$) = tidset($Q$), then it replaces every occurrence of $P$ with $P \cup Q$ and prunes the whole branch under $Q$. On the other hand, if tidset($P$) $\subset$ tidset($Q$), it replaces every occurrence of $P$ with $P \cup Q$ but does not prune the branch under $Q$. Finally, if tidset($P$) $\neq$ tidset($Q$) and neither tidset contains the other, none of the aforementioned prunings can be applied. The initial call of CHARM accepts a set of single-length frequent items and the minimum support as input. As a first step, it sorts the set by increasing order of support of the items. For each item $P$, CHARM tries to extend it by another item $Q$ from the same set and applies the three conditions for pruning. If the itemset newly created by extension is frequent, CHARM performs closure checking to identify whether the itemset is closed. CHARM also updates the set accordingly. In other words, it replaces $P$ with $P \cup Q$ if the corresponding pruning condition is met. If the set is then not empty, CHARM is called recursively.

5.3.3 CLOSET and CLOSET+

The CLOSET [53] and CLOSET+ [69] frequent closed itemset mining algorithms are inspired by the FP-growth method. The CLOSET algorithm makes use of the principles of the FP-Tree data structure to avoid the candidate generation step during the process of mining frequent closed itemsets. This work introduces a technique, referred to as single prefix path compression, that speeds up the mining process. CLOSET also applies partition-based projection mechanisms for better scalability. The mining procedure of CLOSET follows the FP-growth algorithm. However, the algorithm is able to extract only the closed patterns by careful book-keeping. CLOSET treats items appearing in every transaction of the conditional database specially. For example, if $Q$ is the set of items that appear in every transaction of the $P$-conditional database, then $P \cup Q$ creates a frequent closed itemset if it is not a proper subset of any frequent closed itemset with equal support. CLOSET also prunes the search space. For example, if $P$ and $Q$ are frequent itemsets with equal support, where $Q$ is also a closed itemset and $P \subset Q$, then it does not mine the conditional database of $P$ because the latter will not produce any frequent closed itemsets.

CLOSET+ is a follow-up work after CLOSET by the same group of authors. CLOSET+ attempts to design the most optimized frequent closed itemset mining algorithm by finding the best trade-off between depth-first search versus breadth-first search, vertical formats versus horizontal formats, tree structures versus other data structures, top-down versus bottom-up traversal, and pseudo-projection versus physical projection of the conditional database. CLOSET+ keeps track of the unpromising prefix itemsets for generating potential closed frequent itemsets and prunes the search space by deleting them. CLOSET+ also applies item merging and sub-itemset based pruning. To save memory in the closure-checking operation, CLOSET+ uses a combination of the 2-level hash-indexed tree-based method and the pseudo-projection based upward checking method. Interested readers are encouraged to refer to [69] for more details.

5.3.4 DCI_CLOSED

DCI_CLOSED [41, 42] uses a bitwise vertical representation of the input database. DCI_CLOSED can be executed independently on each partition of the database in any order and, thus, also in parallel. DCI_CLOSED is designed to improve memory efficiency by avoiding the storage of duplicate closed itemsets. DCI_CLOSED uses a novel strategy for searching the lattice that can detect and discard duplicate closed patterns on the fly. Using the concept of order-preserving generators of frequent closed itemsets, a new visitation scheme of the search space is introduced. Such a visitation scheme results in a disjoint subdivision of the search space. This also facilitates parallelism. DCI_CLOSED applies several optimization tricks to improve execution time, such as the bitwise intersection of tidsets to compute support and closure. Where possible, it reuses previously computed intersections to avoid redundant computations.

6 Other Optimizations and Variations

In this section, a number of other optimizations and variations of frequent pattern mining algorithms will be discussed. Many of these methods are discussed in detail in other chapters of this book, and therefore they will be discussed only briefly here.

6.1 Row Enumeration Methods

Not all frequent pattern mining algorithms follow the fundamental steps of the baseline algorithm. There exist a number of special cases for which specialized frequent pattern mining algorithms have been designed. An interesting case is that of micro-array data sets, in which the number of columns is very large but the number of rows is not. In such cases, a method called row enumeration is used [22, 23, 40, 48, 49] instead of the usual column enumeration, in which combinations of rows are examined during the search process. There are two categories of row enumeration algorithms. Algorithms in the first category perform a bottom-up [22, 23, 48] search over the row enumeration tree, whereas algorithms in the other category perform a top-down [40] search.

Row enumeration algorithms perform mining over the transpose of the transaction database. In the transposed database, each transaction id becomes an item and each item corresponds to a transaction. Mining over the transposed database is basically a bottom-up search for frequent patterns by enumeration of row sets. However, the bottom-up search strategy cannot take advantage of the user-specified minimum support threshold to effectively prune the search space, and therefore leads to longer running times and large memory overhead. As a solution, the authors of [40] introduce a top-down approach to mining that uses a novel row enumeration tree. Their approach can take full advantage of the user-defined minimum support value and prune the search space efficiently, which lowers the execution time. Note that both search strategies are applied over the transposed transaction database. Most of the algorithms developed using the row enumeration technique concentrate on mining frequent closed itemsets (explained in Sect. 5). The reason is that, due to the nature of micro-array data, there is a large amount of redundancy among the frequent patterns for a given minimum support threshold, and closed patterns are capable of summarizing the whole database.

These strategies will be discussed in detail in Chap. 4, and therefore only a brief discussion is provided here.
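The transposition that underlies row enumeration can be sketched as follows. This is a minimal illustration under stated assumptions, not the data layout of any of the cited algorithms: the function names `transpose` and `items_common_to` and the toy table are hypothetical.

```python
def transpose(table):
    """Transpose a transaction table.

    table maps each row (transaction) id to its set of items; the result
    maps each item to the set of row ids that contain it.
    """
    transposed = {}
    for row_id, items in table.items():
        for item in items:
            transposed.setdefault(item, set()).add(row_id)
    return transposed


def items_common_to(rowset, transposed):
    """Return the itemset shared by every row in rowset.

    In row enumeration, a combination of rows is examined and the pattern
    it induces is the set of items whose row-id set covers that combination.
    """
    return {item for item, rows in transposed.items() if rowset <= rows}


if __name__ == "__main__":
    # Hypothetical table: few rows, items drawn from a larger column set.
    table = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "c"}}
    t = transpose(table)
    print(sorted(items_common_to({1, 2}, t)))
```

With few rows and many columns, the number of row combinations to enumerate is far smaller than the number of column (item) combinations, which is the motivation stated above.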
6.2 Other Exploration Strategies

The advantage of tree-enumeration strategies is that they facilitate the exploration of candidates in the tree in an arbitrary order. A method known as Pincer-Search is proposed in [37] that combines top-down and bottom-up exploration in "pincer" fashion to avail of the advantages of both subset and superset pruning. Two primary observations are used in pincer search:

1. Any subset of a frequent itemset is frequent.

2. Any superset of an infrequent itemset is infrequent.

In pincer search, top-down and bottom-up exploration are combined and irrelevant itemsets are pruned using both observations. More details of this approach are discussed in [37]. Note that, for sparse transaction data, superset pruning is likely to be inefficient. Other recent methods have been proposed for long pattern mining with methods such as "leap search". These methods are discussed in the chapter on long pattern mining in this book.

7 Reducing the Number of Passes

A major challenge in frequent pattern mining arises when the data is disk resident. In such cases, it is desirable to use level-wise methods to ensure that random accesses to disk are minimized. This is the reason that most of the available algorithms use level-wise methods, which ensure that the number of passes over the database is bounded by the size of the longest pattern. Even so, this can be significant when many long patterns are present in the database. Therefore, a number of methods have been proposed in the literature to reduce the number of passes over the data. These methods could be used in the context of join-based algorithms, tree-based algorithms, or even other classes of frequent pattern mining methods. They include combining the level-wise database passes, using sampling, and using a preprocess-once-query-many paradigm.

7.1 Combining Passes

The earliest work on combining passes was proposed in the original Apriori algorithm [1]. The key idea in combining passes is that it is possible to use joins to create candidates of higher order than $(k+1)$ in a single pass. For example, $(k+2)$-candidates can be created from $(k+1)$-candidates before actual validation of the $(k+1)$-candidates over the data. Then, the candidates of size $(k+1)$ and $(k+2)$ can be validated together in a single pass over the data. Although such an approach reduces the number of passes over the data, it has the downside that the number of spurious $(k+2)$-candidates will be far larger because the $(k+1)$-candidates were not confirmed to be frequent before they were joined. Therefore, the saving of database passes comes at an increased computational cost. For this reason, it was proposed in [1] that the approach should be used for later passes, when the number of candidates has already reduced significantly. This reduces the likelihood that the number of candidates blows up too much with this approach.

7.2 Sampling Tricks

A number of sampling tricks can be used to greatly improve the efficiency of the frequent pattern mining process. Most sampling methods require two passes over the data, the first of which is used for sampling. An interesting approach that uses two passes with the use of sampling is discussed in [65]. This method generates the approximately frequent patterns over the data using a sample. False negatives can be reduced by lowering the minimum support level appropriately, so that bounds can be defined on the likelihood of false negatives. False positives can be removed with the use of a second pass over the data. The major downside of the approach is that the reduction in the minimum support level needed to reduce the number of false negatives can be significant. This also reduces the computational efficiency of the approach. The method, however, requires only two passes over the data, where the first pass is used to create the sample, and the second pass is used to remove the false positives.

An interesting approach proposed in [57] divides the disk-resident database into smaller memory-resident partitions. For each partition, more efficient algorithms can be used because of the memory-resident nature of the partition. It should be pointed out that each frequent pattern over the entire database will appear as a frequent pattern in at least one partition. Therefore, the union of the itemsets over the different partitions provides a superset of the true frequent patterns. A post-processing phase is then used to filter out the spurious itemsets by counting this candidate set against the transaction database. As long as the partitions are reasonably large, the superset found approximates the true frequent patterns very well, and therefore the additional time spent in counting irrelevant candidates is relatively small. The main advantage of this approach is that it requires only two passes over the database. Therefore, such an approach is particularly effective when the data is resident on disk.

The Dynamic Itemset Counting (DIC) algorithm [15] divides the database into intervals, and generates longer candidates when it is known that the subsets of these candidates are already frequent. These are then validated over the database. Such an approach can reduce the number of passes over the data because it implicitly combines the process of candidate generation and counting.

7.3 Online Association Rule Mining

In many applications, a user may wish to query the transaction data to find the association rules or the frequent patterns. In such cases, even at high support levels, it is often impossible to create the frequent patterns in online time because of the multiple passes required over a potentially large database. One of the earliest algorithms for online association rule mining was proposed in [6]. In this approach, an augmented lexicographic tree is stored either on disk or in main memory. The lexicographic tree is augmented with all the edges representing the subset relationships between itemsets, and is also referred to as the itemset lattice. For any given query, the itemset lattice may be traversed to determine the association rules. It has been shown in [6] that such an approach can also be used to determine the non-redundant association rules in the underlying data. A second method [40] uses a condensed frequent pattern tree (instead of a lattice) to pre-process and store the itemsets. This structure can be queried to provide online responses.

A very different approach for online association rule mining has been proposed in [34], in which the transaction database is processed in real time. In this case, an incremental approach is used to mine the transaction database. This is a Continuous Association Rule Mining Algorithm, which is referred to as CARMA. In this case, transactions are processed as they arrive, and candidate itemsets are generated on the fly by examining the subsets of that transaction. Clearly, the downside of such an approach is that it will create many more candidates than any of the offline algorithms, which use level-wise methods to generate the candidates. This general characteristic is of course true of any algorithm which tries to reduce the number of passes with approximate candidate generation. One interesting characteristic of the CARMA algorithm is that it allows the user to change the minimum support level during execution. In that case, the algorithm is guaranteed to have generated a superset of the true frequent itemsets in the data. If desired, a second pass over the data can be used to remove the spurious frequent itemsets.

Many streaming methods have also been proposed that use only one pass over the transaction data [19–21, 35, 43]. It should be pointed out that it is often difficult to find even 1-itemsets exactly over a data stream because of the one-pass constraint [21] when the number of distinct items is larger than the available main memory. This is often true of $k$-itemsets as well, especially at low support levels. Furthermore, if the patterns in the stream change over time, then the frequent $k$-itemsets will change significantly as well. These methods therefore have the challenge of finding the frequent itemsets efficiently, maintaining them, and handling issues involving evolution of the data stream. Given the numerous challenges of pattern mining in this scenario, most of these methods find the frequent items approximately. These issues will be discussed in detail in Chap. 9 on streaming pattern mining algorithms.
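Among the pass-reduction ideas of this section, the partition-based two-pass approach of Sect. 7.2 lends itself to a compact illustration. The sketch below is written under stated assumptions and is not the algorithm of [57] as published: the function names are hypothetical, the in-memory local miner is a brute-force subset counter that is practical only for toy data, and the minimum support is passed as an exact `Fraction` of the database size so that the thresholds are computed without rounding error.

```python
from fractions import Fraction
from itertools import combinations
from math import ceil


def local_frequent(transactions, min_count):
    """Brute-force miner for one memory-resident partition (toy data only)."""
    counts = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                counts[combo] = counts.get(combo, 0) + 1
    return {frozenset(c) for c, n in counts.items() if n >= min_count}


def partition_mine(transactions, min_frac, num_parts):
    """Two-pass mining: local candidates first, then one global count."""
    n = len(transactions)
    size = ceil(n / num_parts)
    parts = [transactions[i:i + size] for i in range(0, n, size)]

    # Pass 1: an itemset that is globally frequent must be frequent, at the
    # same relative support, in at least one partition.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, ceil(min_frac * len(part)))

    # Pass 2: count the candidate superset against the whole database and
    # drop the spurious itemsets.
    global_min = ceil(min_frac * n)
    result = {}
    for c in candidates:
        support = sum(1 for t in transactions if c <= t)
        if support >= global_min:
            result[c] = support
    return result


if __name__ == "__main__":
    # Hypothetical toy database.
    db = [set("ab"), set("abc"), set("ac"), set("bc"), set("ab"), set("c")]
    for itemset, support in sorted(partition_mine(db, Fraction(1, 2), 2).items(),
                                   key=lambda kv: sorted(kv[0])):
        print("".join(sorted(itemset)), support)
```

The first pass may admit itemsets that are only locally frequent; these are exactly the spurious candidates that the second pass removes.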
2FrequentPatternMiningAlgorithms:ASurvey8ConclusionsandSummaryThischapterprovidesasurveyofdifferentfrequentpatternminingalgorithms.mostfrequentpatternalgorithms,implicitlyorexplicitly,exploretheenumerationtreeofitemsets.Algorithmssuchasexploretheenumerationtreeinbreadth-rstfashionwithjoin-basedcandidategeneration.Althoughthenotionofanenumerationtreeisnotexplicitlymentionedbythealgorithm,theexecutiontreeexploresthecandidatesaccordingtoanenumerationtreeconstructedontheprexes.OtheralgorithmssuchasTreeProjectionFP-growthusethehierarchicalrelationshipsbetweentheprojecteddatabasesforpatternsofdifferentlengths,andavoidre-doingthecountingworkdonefortheshorterpatterns.Maximalandclosedversionsoffrequentpatternminingalgorithmsarealsoabletoachievemuchbetterpruningperformance.Anumberofefciency-basedoptimizationsoffrequentpatternminingalgorithmswerealsodiscussedinthischapter.References1.R.Agrawal,andR.Srikant.FastAlgorithmsforMiningAssociationRulesinLargeDatabases,VLDBConference,pp.487499,1994.2.R.Agrawal,T.Imielinski,andA.Swami.Miningassociationrulesbetweensetsofitemsinlargedatabases.ACMSIGMODConference,1993.3.R.Agrawal,H.Mannila,R.Srikant,H.Toivonen,andA.I.Verkamo.Fastdiscoveryofassociationrules,AdvancesinKnowledgeDiscoveryandDataMining,pp.307328,1996.4.R.Agarwal,C.C.Aggarwal,andV.V.V.Prasad.Depth-rstGenerationofLongPatterns,ACMKDDConference,2000.AlsoavailableasIBMResearchReport,RC21538,July1999.5.R.Agarwal,C.C.Aggarwal,andV.V.V.Prasad.ATreeProjectionAlgorithmforGenerationofFrequentItemsets,JournalofParallelandDistributedComputing,61(3),pp.350371,2001.AlsoavailableasIBMResearchReport,RC21341,1999.6.C.C.Aggarwal,P.S.Yu.OnlineGenerationofAssociationRules,ICDEConference,1998.7.C.C.Aggarwal,P.S.Yu.ANewFrameworkforItemsetGeneration,ACMPODSConference8.E.AzkuralandC.Aykanat.ASpaceOptimizationforFP-Growth,FIMIworkshop,2004.9.Y.Bastide,R.Taouil,N.Pasquier,G.Stumme,andL.Lakhal.MiningFrequentPatternswithCountingInference.ACMSIGKDDExplorationsNewsletter,2(2),pp.6675,2000.10.R.J.BayardoJr.E
Chapter 3
Pattern-Growth Methods

Jiawei Han and Jian Pei

Mining frequent patterns has been a focused topic in data mining research in recent years, with the development of numerous interesting algorithms for mining association, correlation, causality, sequential patterns, partial periodicity, constraint-based frequent pattern mining, associative classification, emerging patterns, etc. Many studies adopt an Apriori-like, candidate generation-and-test approach. However, based on our analysis, candidate generation and test may still be expensive, especially when encountering long and numerous patterns.

A new methodology, called frequent pattern growth, which mines frequent patterns without candidate generation, has been developed. The method adopts a divide-and-conquer philosophy to project and partition databases based on the currently discovered frequent patterns and grow such patterns to longer ones in the projected databases. Moreover, efficient data structures have been developed for effective database compression and fast in-memory traversal. Such a methodology may eliminate or substantially reduce the number of candidate sets to be generated and also reduce the size of the database to be iteratively examined, and, therefore, lead to high performance.

In this paper, we provide an overview of this approach and examine its methodology and implications for mining several kinds of frequent patterns, including association, frequent closed itemsets, max-patterns, sequential patterns, and constraint-based mining of frequent patterns. We show that frequent pattern growth is efficient at mining large databases and its further development may lead to scalable mining of many other kinds of patterns as well.

Keywords: Scalable data mining methods and algorithms · Frequent patterns · Sequential patterns · Constraint-based mining

J. Han
University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
e-mail: hanj@cs.uiuc.edu

J. Pei
Simon Fraser University, Burnaby, BC V5A 1S6, Canada
e-mail: jpei@cs.sfu.ca

C.C. Aggarwal, J. Han (eds.), Frequent Pattern Mining, DOI 10.1007/978-3-319-07821-2_3, © Springer International Publishing Switzerland 2014
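The divide-and-conquer idea in the abstract, growing a pattern only inside the database projected on it, can be illustrated with a short sketch. This is not the authors' FP-growth implementation: it uses plain lists of transactions instead of the compressed FP-tree structure, and the function name `pattern_growth`, its parameters, and the toy data are assumptions made for illustration. It also assumes that items are comparable (for example, strings) and that each transaction lists an item at most once.

```python
from collections import Counter


def pattern_growth(transactions, min_support, prefix=()):
    """Yield (itemset, support) pairs by growing `prefix` in its projected database.

    Hypothetical helper for illustration; it uses plain lists, not an FP-tree.
    """
    # Count each item once per transaction in the current (projected) database.
    counts = Counter(item for t in transactions for item in set(t))
    # Visit frequent items in a fixed order so every itemset is produced once.
    frequent = sorted(item for item, c in counts.items() if c >= min_support)
    for item in frequent:
        pattern = prefix + (item,)
        yield pattern, counts[item]
        # Project: keep only transactions that contain `item`, restricted to
        # frequent items that come later in the order.
        projected = [
            [x for x in t if x > item and counts[x] >= min_support]
            for t in transactions
            if item in t
        ]
        # Empty projections cannot extend the pattern, so drop them.
        projected = [t for t in projected if t]
        # Recurse on the smaller database; no candidate sets are generated.
        yield from pattern_growth(projected, min_support, pattern)


# Toy database (assumed data, not from the chapter).
transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c"],
    ["a", "b", "c"],
]
result = dict(pattern_growth(transactions, min_support=3))
for itemset, support in sorted(result.items()):
    print(itemset, support)
```

Each recursive call sees only the transactions that support the current pattern, so the database being scanned shrinks as patterns grow. That is the property the abstract credits for the method's performance; the FP-tree adds compression on top of the same recursion.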
