Frequent Pattern Mining

Charu C. Aggarwal, Jiawei Han
Editors

Charu C. Aggarwal
IBM T.J. Watson Research Center
Yorktown Heights, New York, USA

Jiawei Han
University of Illinois at Urbana-Champaign
Urbana, Illinois, USA

ISBN 978-3-319-07820-5
ISBN 978-3-319-07821-2 (eBook)
DOI 10.1007/978-3-319-07821-2
Springer Cham Heidelberg New York Dordrecht London
Library of Congress Control Number: 2014944536
© Springer International Publishing Switzerland 2014

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)
Preface

The field of data mining has four main "super-problems" corresponding to clustering, classification, outlier analysis, and frequent pattern mining. Compared to the other three problems, the frequent pattern mining model was formulated relatively recently. In spite of its shorter history, frequent pattern mining is considered the marquee problem of data mining. The reason for this is that interest in the data mining field increased rapidly soon after the seminal paper on association rule mining by Agrawal, Imielinski, and Swami. The earlier data mining conferences were often dominated by a large number of frequent pattern mining papers. This is one of the reasons that frequent pattern mining has a very special place in the data mining community. At this point, the field of frequent pattern mining is considered a mature one.

While the field has reached a relative level of maturity, very few books cover different aspects of frequent pattern mining. Most of the existing books are either too generic or do not cover frequent pattern mining in an exhaustive way. A need exists for a book that covers the different nuances of the topic exhaustively.

This book provides comprehensive surveys in the field of frequent pattern mining. Each chapter is designed as a survey that covers the key aspects of the field of frequent pattern mining. The chapters are typically of the following types:

Algorithms: In these cases, the key algorithms for frequent pattern mining are explored. These include join-based methods such as Apriori, and pattern-growth methods.

Variations: Many variations of frequent pattern mining such as interesting patterns, negative patterns, constrained pattern mining, or compressed patterns are explored in these chapters.

Scalability: The large sizes of data in recent years have led to the need for big data and streaming frameworks for frequent pattern mining. Frequent pattern mining algorithms need to be modified to work with these advanced scenarios.

Data Types: Different data types lead to different challenges for frequent pattern mining algorithms. Frequent pattern mining algorithms need to be able to work with complex data types, such as temporal or graph data.
Applications: In these chapters, different applications of frequent pattern mining are explored. These include the application of frequent pattern mining methods to problems such as clustering and classification. Other more complex algorithms are also explored.

This book is, therefore, intended to provide an overview of the field of frequent pattern mining, as it currently stands. It is hoped that the book will serve as a useful guide for students, researchers, and practitioners.

Contents

1 An Introduction to Frequent Pattern Mining
Charu C. Aggarwal
  1 Introduction
  2 Frequent Pattern Mining Algorithms
    2.1 Frequent Pattern Mining with the Traditional Support Framework
    2.2 Interesting and Negative Frequent Patterns
    2.3 Constrained Frequent Pattern Mining
    2.4 Compressed Representations of Frequent Patterns
  3 Scalability Issues in Frequent Pattern Mining
    3.1 Frequent Pattern Mining in Data Streams
    3.2 Frequent Pattern Mining with Big Data
  4 Frequent Pattern Mining with Advanced Data Types
    4.1 Sequential Pattern Mining
    4.2 Spatiotemporal Pattern Mining
    4.3 Frequent Patterns in Graphs and Structured Data
    4.4 Frequent Pattern Mining with Uncertain Data
  5 Privacy Issues
  6 Applications of Frequent Pattern Mining
    6.1 Applications to Major Data Mining Problems
    6.2 Generic Applications
  7 Conclusions and Summary

2 Frequent Pattern Mining Algorithms: A Survey
Charu C. Aggarwal, Mansurul A. Bhuiyan and Mohammad Al Hasan
  1 Introduction
    1.1 Definitions
  2 Join-Based Algorithms
    2.1 Apriori Method
    2.2 DHP Algorithm
    2.3 Special Tricks for 2-Itemset Counting
    2.4 Pruning by Support Lower Bounding
    2.5 Hypercube Decomposition
  3 Tree-Based Algorithms
    3.1 AIS Algorithm
    3.2 TreeProjection Algorithms
    3.3 Vertical Mining Algorithms
  4 Recursive Suffix-Based Growth
    4.1 The FP-Growth Approach
    4.2 Variations
  5 Maximal and Closed Frequent Itemsets
    5.1 Definitions
    5.2 Frequent Maximal Itemset Mining Algorithms
    5.3 Frequent Closed Itemset Mining Algorithms
  6 Other Optimizations and Variations
    6.1 Row Enumeration Methods
    6.2 Other Exploration Strategies
  7 Reducing the Number of Passes
    7.1 Combining Passes
    7.2 Sampling Tricks
    7.3 Online Association Rule Mining
  8 Conclusions and Summary

3 Pattern-Growth Methods
Jiawei Han and Jian Pei
  1 Introduction
  2 FP-Growth: Pattern Growth for Mining Frequent Itemsets
  3 Pushing More Constraints in Pattern-Growth Mining
  4 PrefixSpan: Mining Sequential Patterns by Pattern Growth
  5 Further Development of Pattern Growth-Based Pattern Mining
  6 Conclusions

4 Mining Long Patterns
Feida Zhu
  1 Introduction
  2 Preliminaries
  3 A Pattern Lattice Model
  4 Pattern Enumeration Approach
    4.1 Breadth-First Approach
    4.2 Depth-First Approach
  5 Row Enumeration Approach
  6 Pattern Merge Approach
    6.1 Piece-wise Pattern Merge
    6.2 Fusion-style Pattern Merge
  7 Pattern Traversal Approach
  8 Conclusion

5 Interesting Patterns
Jilles Vreeken and Nikolaj Tatti
  1 Introduction
  2 Absolute Measures
    2.1 Frequent Itemsets
    2.2 Tiles
    2.3 Low Entropy Sets
  3 Advanced Methods
  4 Static Background Models
    4.1 Independence Model
    4.2 Beyond Independence
    4.3 Maximum Entropy Models
    4.4 Randomization Approaches
  5 Dynamic Background Models
    5.1 The General Idea
    5.2 Maximum Entropy Models
    5.3 Tile-based Techniques
    5.4 Swap Randomization
  6 Pattern Sets
    6.1 Itemsets
    6.2 Tiles
    6.3 Swap Randomization
  7 Conclusions

6 Negative Association Rules
Luiza Antonie, Jundong Li and Osmar Zaiane
  1 Introduction
  2 Negative Patterns and Negative Association Rules
  3 Current Approaches
  4 Associative Classification and Negative Association Rules
  5 Conclusions

7 Constraint-Based Pattern Mining
Siegfried Nijssen and Albrecht Zimmermann
  1 Introduction
  2 Problem Definition
    2.1 Constraints
  3 Level-Wise Algorithm
    3.1 Generic Algorithm
  4 Depth-First Algorithm
    4.1 Basic Algorithm
    4.2 Constraint-based Itemset Mining
    4.3 Generic Frameworks
    4.4 Implementation Considerations
  5 Languages
  6 Conclusions

8 Mining and Using Sets of Patterns through Compression
Matthijs van Leeuwen and Jilles Vreeken
  1 Introduction
  2 Foundations
    2.1 Kolmogorov Complexity
    2.2 MDL
    2.3 MDL in Data Mining
  3 Compression-based Pattern Models
    3.1 Pattern Models for MDL
    3.2 Code Tables
    3.3 Instances of Compression-based Models
  4 Algorithmic Approaches
    4.1 Candidate Set Filtering
    4.2 Direct Mining of Patterns that Compress
  5 MDL for Data Mining
    5.1 Classification
    5.2 A Dissimilarity Measure for Datasets
    5.3 Identifying and Characterizing Components
    5.4 Other Data Mining Tasks
    5.5 The Advantage of Pattern-based Models
  6 Challenges Ahead
    6.1 Toward Mining Structured Data
    6.2 Generalization
    6.3 Task- and/or User-specific Usefulness
  7 Conclusions

9 Frequent Pattern Mining in Data Streams
Victor E. Lee, Ruoming Jin and Gagan Agrawal
  1 Introduction
  2 Preliminaries
    2.1 Frequent Pattern Mining: Definition
    2.2 Data Windows
    2.3 Frequent Item Mining
  3 Frequent Itemset Mining Algorithms
    3.1 Mining the Full Data Stream
    3.2 Recently Frequent Itemsets
    3.3 Closed and Maximal Itemsets
    3.4 Mining Data Streams with Uncertain Data
  4 Mining Patterns Other than Itemsets
    4.1 Subsequences
    4.2 Subtrees and Semistructured Data
    4.3 Subgraphs
  5 Concluding Remarks

10 Big Data Frequent Pattern Mining
David C. Anastasiu, Jeremy Iverson, Shaden Smith and George Karypis
  1 Introduction
  2 Frequent Pattern Mining: Overview
    2.1 Preliminaries
    2.2 Basic Mining Methodologies
  3 Paradigms for Big Data Computation
    3.1 Principles of Parallel Algorithms
    3.2 Shared Memory Systems
    3.3 Distributed Memory Systems
  4 Frequent Itemset Mining
    4.1 Memory Scalability
    4.2 Work Partitioning
    4.3 Dynamic Load Balancing
    4.4 Further Considerations
  5 Frequent Sequence Mining
    5.1 Serial Frequent Sequence Mining
    5.2 Parallel Frequent Sequence Mining
  6 Frequent Graph Mining
    6.1 Serial Frequent Graph Mining
    6.2 Parallel Frequent Graph Mining
  7 Conclusion

11 Sequential Pattern Mining
Wei Shen, Jianyong Wang and Jiawei Han
  1 Introduction
  2 Problem Definition
  3 Apriori-based Approaches
    3.1 Horizontal Data Format Algorithms
    3.2 Vertical Data Format Algorithms
  4 Pattern Growth Algorithms
    4.1 FreeSpan
    4.2 PrefixSpan
  5 Extensions
    5.1 Closed Sequential Pattern Mining
    5.2 Multi-level, Multi-dimensional Sequential Pattern Mining
    5.3 Incremental Methods
    5.4 Hybrid Methods
    5.5 Approximate Methods
    5.6 Top-k Closed Sequential Pattern Mining
    5.7 Frequent Episode Mining
  6 Conclusions and Summary

12 Spatiotemporal Pattern Mining: Algorithms and Applications
Zhenhui Li
  1 Introduction
  2 Basic Concept
    2.1 Spatiotemporal Data Collection
    2.2 Data Preprocessing
    2.3 Background Information
  3 Individual Periodic Pattern
    3.1 Automatic Discovery of Periodicity in Movements
    3.2 Frequent Periodic Pattern Mining
    3.3 Using Periodic Pattern for Location Prediction
  4 Pairwise Movement Patterns
    4.1 Similarity Measure
    4.2 Generic Pattern
    4.3 Behavioral Pattern
    4.4 Semantic Patterns
  5 Aggregate Patterns over Multiple Trajectories
    5.1 Frequent Trajectory Pattern Mining
    5.2 Detection of Moving Object Cluster
    5.3 Trajectory Clustering
  6 Summary

13 Mining Graph Patterns
Hong Cheng, Xifeng Yan and Jiawei Han
  1 Introduction
  2 Frequent Subgraph Mining
    2.1 Problem Definition
    2.2 Apriori-Based Approach
    2.3 Pattern-Growth Approach
    2.4 Closed and Maximal Subgraphs
    2.5 Mining Subgraphs in a Single Graph
    2.6 The Computational Bottleneck
  3 Mining Significant Graph Patterns
    3.1 Problem Definition
    3.2 gboost: A Branch-and-Bound Approach
    3.3 gPLS: A Partial Least Squares Regression Approach
    3.4 LEAP: A Structural Leap Search Approach
    3.5 GraphSig: A Feature Representation Approach
  4 Mining Representative Orthogonal Graphs
    4.1 Problem Definition
    4.2 Randomized Maximal Subgraph Mining
    4.3 Orthogonal Representative Set Generation
  5 Mining Dense Graph Patterns
    5.1 Cliques and Quasi-Cliques
    5.2 K-Core and K-Truss
    5.3 Other Dense Subgraph Patterns
  6 Mining Graph Patterns in Streams
  7 Mining Graph Patterns in Uncertain Graphs
  8 Conclusions

14 Uncertain Frequent Pattern Mining
Carson Kai-Sang Leung
  1 Introduction
  2 The Probabilistic Model for Mining Expected Support-Based Frequent Patterns from Uncertain Data
  3 Candidate Generate-and-Test Based Uncertain Frequent Pattern Mining
  4 Hyperlinked Structure-Based Uncertain Frequent Pattern Mining
  5 Tree-Based Uncertain Frequent Pattern Mining
    5.1 UF-growth
    5.2 UFP-growth
    5.3 CUF-growth
    5.4 PUF-growth
  6 Constrained Uncertain Frequent Pattern Mining
  7 Uncertain Frequent Pattern Mining from Big Data
  8 Streaming Uncertain Frequent Pattern Mining
    8.1 SUF-growth
    8.2 UF-streaming for the Sliding Window Model
    8.3 TUF-streaming for the Time-Fading Model
    8.4 LUF-streaming for the Landmark Model
    8.5 Hyperlinked Structure-Based Streaming Uncertain Frequent Pattern Mining
  9 Vertical Uncertain Frequent Pattern Mining
    9.1 U-Eclat: An Approximate Algorithm
    9.2 UV-Eclat: An Exact Algorithm
    9.3 U-VIPER: An Exact Algorithm
  10 Discussion on Uncertain Frequent Pattern Mining
  11 Extension: Probabilistic Frequent Pattern Mining
    11.1 Mining Probabilistic Heavy Hitters
    11.2 Mining Probabilistic Frequent Patterns
  12 Conclusions

15 Privacy Issues in Association Rule Mining
Aris Gkoulalas-Divanis, Jayant Haritsa and Murat Kantarcioglu
  1 Introduction
  2 Input Privacy
    2.1 Problem Framework
    2.2 Evolution of the Literature
  3 Output Privacy
    3.1 Terminology and Preliminaries
    3.2 Taxonomy of ARH Algorithms
    3.3 Heuristic and Exact ARH Algorithms
    3.4 Metrics and Performance Analysis
  4 Cryptographic Methods
    4.1 Horizontally Partitioned Data
    4.2 Vertically Partitioned Data
  5 Conclusions

16 Frequent Pattern Mining Algorithms for Data Clustering
Arthur Zimek, Ira Assent and Jilles Vreeken
  1 Introduction
  2 Generalizing Pattern Mining for Clustering
    2.1 Generalized Monotonicity
    2.2 Count Indexes
    2.3 Pattern Explosion and Redundancy
  3 Frequent Pattern Mining in Subspace Clustering
    3.1 Subspace Cluster Search
    3.2 Subspace Search
    3.3 Redundancy in Subspace Clustering
  4 Conclusions

17 Supervised Pattern Mining and Applications to Classification
Albrecht Zimmermann and Siegfried Nijssen
  1 Introduction
  2 Supervised Pattern Mining
    2.1 Explicit Class Labels
    2.2 Classes as Data Subsets
    2.3 Numerical Target Values
  3 Supervised Pattern Set Mining
    3.1 Local Evaluation, Local Modification
    3.2 Global Evaluation, Global Modification
    3.3 Local Evaluation, Global Modification
    3.4 Data Instance-Based Selection
  4 Classifier Construction
    4.1 Direct Classification
    4.2 Indirect Classification
  5 Summary

18 Applications of Frequent Pattern Mining
Charu C. Aggarwal
  1 Introduction
  2 Frequent Patterns for Customer Analysis
  3 Frequent Patterns for Clustering
  4 Frequent Patterns for Classification
  5 Frequent Patterns for Outlier Analysis
  6 Frequent Patterns for Indexing
  7 Web Mining Applications
    7.1 Web Log Mining
    7.2 Web Linkage Mining
  8 Frequent Patterns for Text Mining
  9 Temporal Applications
  10 Spatial and Spatiotemporal Applications
  11 Software Bug Detection
  12 Chemical and Biological Applications
    12.1 Chemical Applications
    12.2 Biological Applications
  13 Resources for the Practitioner
  14 Conclusions and Summary

Contributors

Charu C. Aggarwal, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
Gagan Agrawal, Ohio State University, Columbus, OH, USA
Luiza Antonie, University of Guelph, Guelph, Canada
David C. Anastasiu, Department of Computer Science and Engineering, University of Minnesota, Minneapolis, USA
Ira Assent, Department of Computer Science, Aarhus University, Aarhus, Denmark
Mansurul A. Bhuiyan, Indiana University-Purdue University, Indianapolis, IN, USA
Hong Cheng, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong, China
Aris Gkoulalas-Divanis, IBM Research-Ireland, Damastown Industrial Estate, Mulhuddart, Dublin, Ireland
Jiawei Han, University of Illinois at Urbana-Champaign, Urbana, IL, USA; Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, USA
Jayant Haritsa, Database Systems Lab, Indian Institute of Science (IISc), Bangalore, India
Mohammad Al Hasan, Indiana University-Purdue University, Indianapolis, IN, USA
Jeremy Iverson, Department of Computer Science and Engineering, University of Minnesota, Minneapolis, USA
Ruoming Jin, Kent State University, Kent, OH, USA
Murat Kantarcioglu, UTD Data Security and Privacy Lab, University of Texas at Dallas, Texas, USA
Victor E. Lee, John Carroll University, University Heights, OH, USA
Matthijs van Leeuwen, KU Leuven, Leuven, Belgium
Carson Kai-Sang Leung, University of Manitoba, Winnipeg, MB, Canada
Jundong Li, University of Alberta, Alberta, Canada
Zhenhui Li, Pennsylvania State University, University Park, USA
George Karypis, Department of Computer Science and Engineering, University of Minnesota, Minneapolis, USA
Siegfried Nijssen, KU Leuven, Leuven, Belgium; Universiteit Leiden, Leiden, The Netherlands
Jian Pei, Simon Fraser University, Burnaby, BC, Canada
Shaden Smith, Department of Computer Science and Engineering, University of Minnesota, Minneapolis, USA
Wei Shen, Tsinghua University, Beijing, China
Nikolaj Tatti, HIIT, Department of Information and Computer Science, Aalto University, Helsinki, Finland
Jilles Vreeken, Max-Planck Institute for Informatics and Saarland University, Saarbrücken, Germany
Jianyong Wang, Tsinghua University, Beijing, China
Xifeng Yan, Department of Computer Science, University of California at Santa Barbara, Santa Barbara, USA
Osmar Zaiane, University of Alberta, Alberta, Canada
Feida Zhu, Singapore Management University, Singapore, Singapore
Albrecht Zimmermann, INSA Lyon, Villeurbanne CEDEX, France
Arthur Zimek, Ludwig-Maximilians-Universität München, Munich, Germany
Chapter 1
An Introduction to Frequent Pattern Mining

Charu C. Aggarwal

C. C. Aggarwal, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA; e-mail: charu@us.ibm.com
C. C. Aggarwal, J. Han (eds.), Frequent Pattern Mining, DOI 10.1007/978-3-319-07821-2_1, © Springer International Publishing Switzerland 2014

The problem of frequent pattern mining has been widely studied in the literature because of its numerous applications to a variety of data mining problems such as clustering and classification. In addition, frequent pattern mining also has numerous applications in diverse domains such as spatiotemporal data, software bug detection, and biological data. The algorithmic aspects of frequent pattern mining have been explored very widely. This chapter provides an overview of these methods, as it relates to the organization of this book.

Keywords: Frequent pattern mining, Association rules

1 Introduction

The problem of frequent pattern mining is that of finding relationships among the items in a database. The problem can be stated as follows.

Given a database D with transactions T_1, ..., T_N, determine all patterns P that are present in at least a fraction s of the transactions.

The fraction s is referred to as the minimum support. The parameter s can be expressed either as an absolute number, or as a fraction of the total number of transactions in the database. Each transaction can be considered a sparse binary vector, or a set of discrete values representing the identifiers of the binary attributes that are instantiated to the value of 1. The problem was originally proposed in the context of market basket data in order to find frequent groups of items that are bought together [10]. Thus, in this scenario, each attribute corresponds to an item in a superstore, and the binary value represents whether or not it is present in the transaction. Since the problem was originally proposed, it has been applied to numerous other applications in the context of data mining, Web log mining, sequential pattern mining, and software bug analysis.

In the original model of frequent pattern mining [10], the problem of finding association rules was also proposed, which is closely related to that of frequent patterns. In general, association rules can be considered a "second-stage" output, derived from frequent patterns. Consider the sets of items U and V. The rule U ⇒ V is considered an association rule at minimum support s and minimum confidence c, when the following two conditions hold true:

1. The set U ∪ V is a frequent pattern.
2. The ratio of the support of U ∪ V to that of U is at least c.

The minimum confidence c is always a fraction no greater than 1, because the support of the set U ∪ V is never greater than that of U. Because the first step of finding frequent patterns is usually the computationally more challenging one, most of the research in this area is focused on the former. Nevertheless, some computational and modeling issues also arise during the second step, especially when the frequent pattern mining problem is used in the context of other data mining problems such as classification. Therefore, this book will also discuss various aspects of association rule mining along with that of frequent pattern mining.

A related problem is that of sequential pattern mining, in which an order is present in the transactions [5]. Temporal order is quite natural in many scenarios such as customer buying behavior, because the items are bought at specific time stamps, and often follow a natural temporal order. In these cases, the problem is redefined to that of sequential pattern mining, in which it is desirable to determine relevant and frequent sequences of items.

Some examples of important applications are as follows:

Customer Transaction Analysis: In this case, the transactions represent sets of items that co-occur in customer buying behavior. It is desirable to determine frequent patterns of buying behavior, because they can be used for making decisions about shelf stocking or recommendations.

Other Data Mining Problems: Frequent pattern mining can be used to enable other major data mining problems such as classification, clustering and outlier analysis [11, 52, 73]. This is because the use of frequent patterns is so fundamental in the analytical process for a host of data mining problems.

Web Mining: In this case, the Web logs may be processed in order to determine important patterns in the browsing behavior [24, 63]. This information can be used for Web site design, recommendations, or even outlier analysis.

Software Bug Analysis: Executions of software programs can be represented as graphs with
typical patterns. Logical errors in these programs often show up as specific kinds of patterns that can be mined for further analysis [41, 51].
• Chemical and Biological Analysis: Chemical and biological data are often represented as graphs and sequences. A number of methods have been proposed in the literature for using the frequent patterns in such graphs for a wide variety of applications in different scenarios [8, 29, 41, 42, 69-75].

Since the publication of the original article on frequent pattern mining [10], numerous techniques have been proposed both for frequent and sequential pattern mining [4, 5, 13, 33, 62]. Furthermore, many variants of frequent pattern mining, such as sequential pattern mining, constrained pattern mining, and graph mining, have been proposed in the literature.

Frequent pattern mining is a rather broad area of research, and it relates to a wide variety of topics, at least from an application-specific perspective. Broadly speaking, the research in the area falls into one of four categories:

• Technique-centered: This area relates to the determination of more efficient algorithms for frequent pattern mining. A wide variety of algorithms have been proposed in this context that use different enumeration tree exploration strategies and different data representation methods. In addition, numerous variations such as the determination of compressed patterns are of great interest to researchers in data mining.
• Scalability issues: The scalability issues in frequent pattern mining are very significant. When the data arrives in the form of a stream, multi-pass methods can no longer be used. When the data is distributed or very large, parallel or big-data frameworks must be used. These scenarios necessitate different types of algorithms.
• Advanced data types: Numerous variations of frequent pattern mining have been proposed for advanced data types. These variations have been utilized in a wide variety of tasks. In addition, different data domains such as graph data, tree-structured data, and streaming data often require specialized algorithms for frequent pattern mining. Issues of interestingness of the patterns are also quite relevant in this context [6].
• Applications: Frequent pattern mining has numerous applications to other major data mining problems, Web applications, software bug analysis, and chemical and biological applications. A significant amount of research has been devoted to applications because these are particularly important in the context of frequent pattern mining.

This book will cover all these different areas, so as to provide a comprehensive overview of this broader area.

This chapter is organized as follows. The next section discusses algorithms for the frequent pattern mining problem and its basic variations. Section 3 discusses scalability issues for frequent pattern mining. Frequent pattern mining methods for advanced data types are discussed in Sect. 4. Privacy issues of frequent pattern mining are addressed in Sect. 5. The applications are discussed in Sect. 6. Section 7 gives the conclusions and summary.

2 Frequent Pattern Mining Algorithms

Most of the algorithms for frequent pattern mining have been designed with the traditional support-confidence framework, or for specialized frameworks that generate more interesting kinds of patterns. These specialized frameworks may use different types of interestingness measures, model negative rules, or use constraint-based frameworks to determine more relevant patterns.

2.1 Frequent Pattern Mining with the Traditional Support Framework

The support framework is designed to determine patterns for which the raw frequency is greater than a minimum threshold. Although this is a simplistic way of defining frequent patterns, this model has an algorithmically convenient property, which is referred to as the level-wise property. The level-wise property of frequent pattern mining is algorithmically crucial because it enables the design of a bottom-up approach to exploring the space of frequent patterns. In other words, a $(k+1)$-pattern cannot be frequent if any of its subsets is not frequent. This is a crucial observation that is used by virtually all the efficient frequent pattern mining algorithms.

Since the problem of frequent pattern mining was first proposed, numerous algorithms have been proposed in order to make the solutions to the problem more efficient. This area of research is so popular that an annual workshop was devoted to implementations of frequent pattern mining for a few years. The workshop site [77] is now organized as a repository, where many efficient implementations of frequent pattern mining are available.

The techniques for frequent pattern mining started with Apriori-like join-based methods. In these algorithms, candidate itemsets are generated in increasing order of itemset size. The generation in increasing order of itemset size is referred to as level-wise exploration. These itemsets are then tested against the underlying transaction database, and the frequent ones satisfying the minimum support constraint are retained for further exploration.

Eventually, it was realized that Apriori-like methods could be more systematically explored as enumeration trees. This structure will be explained in detail in Chap. 2, and provides a methodology to perform systematic and non-redundant frequent pattern exploration. The enumeration tree provides a more flexible framework for frequent itemset mining because the tree can be explored with a variety of different strategies such as depth-first, breadth-first, or other hybrid strategies [13]. One property of the breadth-first strategy is that level-wise pruning can be used, which is not possible with other strategies. Nevertheless, strategies such as depth-first search have other advantages, especially for maximal pattern mining, an observation first stated in [12]. This is because long patterns are discovered early, and they can be used for downward closure-based pruning of large parts of the enumeration tree that are already known to be frequent. It should be pointed out that for the case where all frequent patterns are mined, the order of exploration of an enumeration tree does not affect the number of candidates that are explored because the size of the enumeration tree is fixed.
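The level-wise, join-based exploration described in this section can be illustrated with a minimal sketch. It is not the implementation from [4] or from any chapter of this book; the function name `apriori`, the use of an absolute count as the support threshold, and the toy baskets are assumptions made purely for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset.

    min_support is an absolute count (an assumption of this sketch);
    an itemset is kept when its count is >= min_support.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 1
    while frequent:
        # Join step: merge pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Level-wise pruning: every k-subset must itself be frequent.
            if all(frozenset(sub) in frequent for sub in combinations(union, k)):
                candidates.add(union)

        # Counting step: test the candidates against the transaction database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical toy database of five baskets.
baskets = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Milk"},
    {"Butter", "Milk"},
    {"Bread", "Butter", "Milk"},
]
frequent_itemsets = apriori(baskets, min_support=3)
```

With a threshold of 3, each single item and each pair is frequent in this toy database, while the 3-itemset occurs in only two baskets and is discarded after counting. The sketch favors readability over speed; the projection and data-structure techniques discussed in this chapter exist precisely because the counting step here rescans every transaction for every candidate.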
Join-based algorithms are always level-wise, and can be viewed as equivalent to breadth-first enumeration tree exploration. The algorithm proposed in the first frequent pattern mining paper [10] was an enumeration-tree based algorithm, whereas the second algorithm proposed was referred to as Apriori, and was a join-based algorithm [4]. Both algorithms are level-wise algorithms. Subsequently, many algorithms have been proposed in order to improve the implementations based on the enumeration tree paradigm with the use of techniques such as lookahead [17], depth-first search [12, 13, 33], and vertical exploration [62]. Some of these methods, such as TreeProjection, DepthProject, and FP-growth [33], use a projection strategy in which smaller transaction databases are explored at lower levels of the tree.

One of the challenges of frequent pattern mining is that a large number of redundant patterns are often mined. For example, every subset of a frequent pattern is also guaranteed to be frequent, so by mining the maximal itemsets, one is assured that all other frequent patterns can be generated from this smaller set. Therefore, one possibility is to mine for only maximal itemsets [17]. However, the mining of maximal itemsets loses information about the exact value of support of the subsets of maximal patterns. Therefore, a further refinement would be to find closed itemsets [58, 74]. Closed frequent itemsets are defined as frequent patterns, no superset of which has the same frequency as that itemset. By mining closed frequent itemsets, it is possible to significantly reduce the number of patterns found, without losing any information about the support level. Closed patterns can be viewed as the maximal patterns from each group of patterns with the same support. All maximal patterns are, therefore, closed.

The depth-first method has been shown to have a number of advantages in maximal pattern mining [12], because of the greater effectiveness of the pruning-based lookaheads in the depth-first strategy. Different techniques for frequent pattern mining will be discussed in Chaps. 2 and 3. The former chapter will generally focus on frequent pattern mining algorithms, whereas the latter chapter will focus on pattern-growth algorithms. An additional chapter with greater detail has been devoted to pattern-growth methods, because they are considered a state-of-the-art technique in frequent pattern mining.

The efficiency of frequent pattern mining algorithms can be improved in several ways:

1. Reducing the size of the candidate search space, with the use of pruning methods such as maximality pruning. The notion of closure can also be used to prune large parts of the search space. However, these methods often do not exhaustively return the full set of frequent patterns. Many of these methods return condensed representations such as maximal patterns or closed patterns.
2. Improving the efficiency of counting, with the use of database projection. Methods such as TreeProjection speed up the rate at which each pattern is counted, by reducing the size of the database with respect to which patterns are compared.
3. Using more efficient data structures, such as vertical lists, or an FP-Tree for a more compressed database representation. In frequent pattern mining, both memory usage and computational speed can be improved by a judicious choice of data structures.

A particular scenario of interest is one in which the patterns to be mined are very long. In such cases, the number of subsets of frequent patterns can be extremely large. Therefore, a number of techniques need to be designed in order to mine very long patterns. In such cases, a variety of methods are used to explore the long patterns early, so that their subsets can be pruned effectively. The scenario of long pattern generation is discussed in detail in Chap. 4, though it is also discussed to some extent in the earlier Chaps. 2 and 3.

2.2 Interesting and Negative Frequent Patterns

A major challenge in frequent pattern mining is that the rules found may often not be very interesting when quantifications such as support and confidence are used. This is because such quantifications do not normalize for the original frequency of the underlying items. For example, an item that occurs very rarely in the underlying database would naturally also occur in itemsets with lower frequency. Therefore, the frequency often does not tell us much about the likelihood of items to co-occur, because of the biases associated with the frequencies of the individual items. Therefore, numerous methods have been proposed in the literature for finding interesting frequent patterns that normalize for the underlying item frequencies [6, 26]. Methods for finding interesting frequent patterns are discussed in Chap. 5. The issue of interestingness is also related to compressed representations of patterns such as closed or maximal itemsets. These issues are also discussed in the same chapter.

In negative association rule mining, we attempt to determine rules such as $\text{Bread} \Rightarrow \neg\text{Butter}$, where the symbol $\neg$ indicates negation. Therefore, in this case, $\neg\text{Butter}$ becomes a pseudo-item denoting a "negative item." One possibility is to add negative items to the data, and perform the mining in the same way as one would determine rules in the support-confidence framework. However, this is not a feasible solution. This is because traditional support frameworks are not designed for cases where an item is present in the data 98% of the time. This is the case for "negative items." For example, most transactions may not contain the item Butter, and therefore even positively correlated items may appear as negative rules. For example, the rule $\text{Bread} \Rightarrow \neg\text{Butter}$ may have confidence greater than 50%, even though Bread is clearly correlated in a positive way with Butter. This is because the item $\neg\text{Butter}$ may have an even higher support of 98%.

The issue of finding negative patterns is closely related to that of finding interesting patterns in the data [6] because one is looking for patterns that satisfy the support requirement in an interesting way. This relationship between the two problems tends to be under-emphasized in the literature, and the problem of negative pattern mining is often treated independently from interesting pattern mining. Some frameworks, such as collective strength, are designed to address both issues simultaneously. Methods for negative pattern mining are addressed in Chap. 6. The relationship between interesting pattern mining and negative pattern mining will be discussed in the same chapter.
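The Bread and Butter argument above can be checked with a small worked example. The basket counts below are hypothetical and chosen only so that Butter appears in 2% of the transactions. The normalization used here is lift, a commonly used measure that is swapped in for illustration; it is not the collective strength measure mentioned in the text, and the function names are assumptions of this sketch.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def negative_rule_confidence(transactions, x, y):
    """Confidence of the rule x => NOT y."""
    s_x = support(transactions, {x})
    s_xy = support(transactions, {x, y})
    return (s_x - s_xy) / s_x


def lift(transactions, x, y):
    """Ratio of the observed joint support to the support expected
    if x and y were independent; values above 1 indicate positive correlation."""
    return support(transactions, {x, y}) / (
        support(transactions, {x}) * support(transactions, {y})
    )


# Hypothetical database of 1000 baskets: Butter occurs in only 20 of them.
transactions = (
    [frozenset({"Bread", "Butter"})] * 15
    + [frozenset({"Bread"})] * 85
    + [frozenset({"Butter"})] * 5
    + [frozenset({"Milk"})] * 895
)

conf_bread_not_butter = negative_rule_confidence(transactions, "Bread", "Butter")
lift_bread_butter = lift(transactions, "Bread", "Butter")
support_not_butter = 1 - support(transactions, {"Butter"})
```

In this toy data, the rule Bread implies not-Butter has a confidence of about 0.85, well above 50%, even though the lift of Bread and Butter is about 7.5, which indicates a strong positive correlation. The negative item itself has support of about 0.98, which is why the unnormalized confidence is misleading.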
2.3 Constrained Frequent Pattern Mining

Off-the-shelf frequent pattern mining algorithms discover a large number of patterns which are not useful when it is desired to determine patterns on the basis of more refined criteria. Frequent pattern mining methods are often particularly useful in the context of constrained applications, in which rules satisfying particular criteria are discovered. For example, one may desire specific items to be present in the rule. One solution is to first mine all the itemsets, and then enable online mining from this set of base patterns [3]. However, pushing constraints directly into the mining process has several advantages. When constraints are pushed directly into the mining process, the mining can be performed at much lower support levels than is possible with a two-phase approach. This is especially the case when a large number of intermediate candidates can be pruned by the constraint-based pattern mining algorithm.

A variety of arbitrary constraints may also be present in the patterns. The major problem with such methods is that the constraints may result in the violation of the downward closure property. Because most frequent pattern mining algorithms depend crucially on this property, its violation is a serious issue. Nevertheless, many constraints have specialized properties because of which specialized algorithms can be developed. Methods for constrained frequent pattern mining have been discussed in [55, 57, 60]. Constrained methods have also been developed for the sequential pattern mining problem [31, 61]. In real applications, the output of the vanilla frequent pattern mining problem may be too large, and it is only by pushing constraints into the pattern mining process that useful application-specific patterns can be found. Constrained frequent pattern mining methods are closely related to the problem of pattern-based classification, because the latter problem requires us to discover discriminative patterns from the underlying data. Methods for constrained frequent pattern mining will be discussed in Chap. 2.

2.4 Compressed Representations of Frequent Patterns

A major problem in frequent pattern mining algorithms is that the volume of the mined patterns is often extremely large. This scenario creates numerous challenges for using these patterns in a meaningful way. Furthermore, different kinds of redundancy are present in the mined patterns. For example, maximal patterns imply the presence of all their subsets in the data. There is some information loss in terms of the exact support values of these subsets. Therefore, if it is not necessary to preserve the values of the support across the patterns, then the determination of concise representations can be very useful.

A particularly interesting form of concise representation is that of closed patterns [56]. An itemset is said to be closed if none of its supersets have the same support. Therefore, by determining all the closed frequent patterns, one can derive not only the exhaustive set of frequent itemsets, but also their supports. Note that support values are lost by maximal pattern mining. In other words, the set of maximal patterns cannot be used to derive the support values of missing subsets. However, the support values of closed frequent itemsets can be used to derive the support values of missing subsets. Many interesting methods [58, 67, 74] have been designed for identifying frequent closed patterns. The general principle of determining frequent closed patterns has been generalized to that of determining $\delta$-free sets [18]. This issue is closely related to that of mining all non-derivable frequent itemsets [20]. A survey on this topic may be found in [21]. These different forms of compression are discussed in Chaps. 2 and 5.

Finally, a formal way of viewing compression is from the perspective of information-theoretic models. Information-theoretic models are designed for compressing different kinds of data, and can therefore be used to compress itemsets as well. This basic principle has been used for methods such as the one in [66]. The problem of determining compressed representations of frequent itemsets is discussed in Chap. 8. This chapter focuses mostly on the information-theoretic issues of frequent itemset compression.

3 Scalability Issues in Frequent Pattern Mining

In the modern era, the ability to collect large amounts of data has increased significantly because of advances in hardware and software platforms. The amount of data is often so large that specialized methods are required for the mining process. The streaming and big-data architectures are slightly different and pose different challenges for the mining process. The following discussion will address each of these challenges.

3.1 Frequent Pattern Mining in Data Streams

In recent years, data streams have become very popular because of the advances in hardware and software technology that can collect and transmit data continuously over time. In such cases, the major constraint on data mining algorithms is to execute the algorithms in a single pass. This can be significantly challenging because frequent and sequential pattern mining methods are generally designed as level-wise methods. There are two variants of frequent pattern mining for data streams:

• Frequent Items or Heavy Hitters: In this case, frequent 1-itemsets need to be determined from a data stream in a single pass. Such an approach is generally needed when the total number of distinct items is too large to be held in main memory. Typically, sketch-based methods are used in order to create a compressed data structure that maintains approximate counts of the items [23, 27].
• Frequent Itemsets: In this case, it is not assumed that the number of distinct items is too large. Therefore, the main challenge in this case is computational, because
the typical frequent pattern mining methods are multi-pass methods. Multiple passes are clearly not possible in the context of data streams [22, 39].

The streaming scenario also presents numerous challenges in the context of data of advanced types. For example, graph streams are often encountered in the context of network data. In such cases, methods need to be designed for determining dense groups of nodes in real time [16]. Methods for mining frequent items and itemsets in data streams are discussed in Chap. 9.

3.2 Frequent Pattern Mining with Big Data

The big data scenario poses numerous challenges for the problem of frequent pattern mining. A major problem arises when the data is large enough to be stored in a distributed way. Therefore, significant costs are incurred in shuffling around data or intermediate results of the mining process across the distributed nodes. These costs are also referred to as data transfer costs. When data sets are very large, the algorithms need to be designed to take into account both the disk access constraint and the data transfer costs. In addition, many distributed frameworks such as MapReduce [28] require specialized algorithms for frequent pattern mining. The focus of the big-data framework is somewhat different from streams, in that it is closely related to the issue of shuffling large amounts of data around for the mining process. Interestingly, it is sometimes easier to process the data in a single pass in streaming fashion than when it has already been stored in distributed frameworks where access costs become a major issue. Algorithms for frequent pattern mining with big data are discussed in detail in Chap. 10. This chapter discusses both the parallel algorithms and the big-data algorithms that are based on the MapReduce framework.

4 Frequent Pattern Mining with Advanced Data Types

Although the frequent pattern mining problem is naturally defined on sets, it can be extended to various advanced data types. The most natural extension of frequent pattern mining algorithms is to the case of temporal data. This was one of the earliest proposed extensions and is referred to as sequential pattern mining. Subsequently, the problem has been generalized to other advanced data types, such as spatiotemporal data, graphs, and uncertain data. Many of the developed algorithms are basic variations of the frequent pattern mining problem. In general, the basic frequent pattern mining algorithms need to be modified carefully to address the variations required by the advanced data types.

4.1 Sequential Pattern Mining

The problem of sequential pattern mining is closely related to that of frequent pattern mining. The major difference in this case is that records contain baskets of items arranged sequentially. For example, each record may be of the following form:

$Y = \langle \{\text{Bread}\}, \{\text{Butter}, \text{Cake}\}, \{\text{Chicken}, \text{Yogurt}\} \rangle$

In this case, each entity within $Y$ is a basket of items that are bought together and, therefore, do not have a temporal ordering. This basket of items is collectively referred to as an event. The length of a pattern is equal to the sum of the lengths of the complex items in it. For example, $Y$ is a 5-pattern, even though it has 3 events. The different complex entities (or events) do have a temporal ordering. In the aforementioned example, it is clear that $\{\text{Bread}\}$ has been bought earlier than $\{\text{Butter}, \text{Cake}\}$. The problem of sequential pattern mining is that of finding sequences of events that are present in at least a fraction $s$ of the underlying records [5]. For example, the sequence $\langle \{\text{Bread}\}, \{\text{Butter}\}, \{\text{Chicken}\} \rangle$ is present in the aforementioned record, but not the sequence $\langle \{\text{Bread}\}, \{\text{Cake}\}, \{\text{Butter}\} \rangle$. The pattern may also contain complex events. For example, the pattern $\langle \{\text{Bread}\}, \{\text{Chicken}, \text{Yogurt}\} \rangle$ is present. The problem of sequential pattern mining is closely related to that of frequent pattern mining, except that it is somewhat more complex because it must account for both the presence of complex baskets of items in the database and the temporal ordering of the individual baskets. An extension of a sequential pattern may either be a set-wise extension of a complex item, or a temporal extension with an entirely new event. This affects the nature of the extensions of items in the transactions. Numerous modifications of known frequent pattern mining methods, such as Apriori and its variants, TreeProjection and its variants [32], and FP-growth and its variants, can be used in order to solve the sequential pattern mining problem [5, 35, 36]. The enumeration tree concept can also be generalized to sequential pattern mining [32]. Therefore, in principle, all enumeration tree algorithms can be generalized to sequential pattern mining. This is a powerful ability because, as we will see in Chap. 2, all frequent pattern mining algorithms are, implicitly or explicitly, enumeration-tree algorithms. Sequential pattern mining methods will be discussed in detail in Chap. 11.

4.2 Spatiotemporal Pattern Mining

The advent of GPS-enabled mobile phones and wearable sensors has enabled the collection of large amounts of spatiotemporal data. Such data may include trajectory data, location-tagged images, or other content. In some cases, the spatiotemporal data exists in the form of RFID data [37]. The mining of patterns from such spatiotemporal data provides numerous insights in a wide variety of applications, such as traffic control and social sensing [2]. Frequent patterns are also used for trajectory clustering, classification, and outlier analysis [38, 45-48]. Many trajectory analysis problems can be approximately transformed to sequential pattern mining with the use of appropriate transformations. Algorithms for spatiotemporal pattern mining are discussed in Chap. 12.

4.3 Frequent Patterns in Graphs and Structured Data

Many kinds of chemical and biological data, XML data, software program traces, and Web browsing behaviors can be represented as structured graphs. In these cases, frequent pattern mining is very useful for making inferences in such data. This is because frequent structural patterns provide important insights about the graphs. For example, specific chemical structures result in particular properties, specific program structures result in software bugs, and so on. Such patterns can even be used for clustering and classification of graphs [14, 73].

A variety of methods for structural frequent pattern mining are discussed in [41, 69-72]. A major problem in the context of graphs is the problem of isomorphism, because of which there are multiple ways to match two graphs. An Apriori-like algorithm can be developed for graph pattern mining. However, because of the complexity of graphs and also because of issues related to isomorphism, the algorithms are more complex. For example, in an Apriori-like algorithm, pairs of graphs can be joined in multiple ways. Pairs of graphs can be joined when they have $(k-1)$ nodes in common, or when they have $(k-1)$ edges in common. Furthermore, either kind of join between a pair of graphs can have multiple results. The counting process is also more challenging because of isomorphism.

Pattern mining in graphs becomes especially challenging when the graphs are large, and the isomorphism problem becomes significant. Another particularly difficult case is the streaming scenario [16], where one has to determine dense patterns in the graph stream. Typically, these problems cannot be solved exactly, and approximations are required.

Frequent pattern mining in graphs has numerous applications. In some cases, these methods can be used in order to perform classification and clustering of structured data [14, 73]. Graph patterns are used for chemical and biological data analysis, and software bug detection in computer programs. Methods for finding frequent patterns in graphs are discussed in Chap. 13. The applications of graph pattern mining are discussed in Chap. 18.

4.4 Frequent Pattern Mining with Uncertain Data

Uncertain or probabilistic data has become increasingly common over the last few years, as methods have been designed to collect data of very low quality. The attribute values in such data sets are probabilistic, which implies that the values are represented as probability distributions. Numerous algorithms have been proposed in the literature for uncertain frequent pattern mining [15], and a computational evaluation of the different techniques is provided in [64]. Many algorithms such as FP-growth are harder to generalize to uncertain data [15] because of the difficulty in storing probability information with the FP-Tree. Nevertheless, as the work in [15] shows, other related methods such as the one in [59] can be generalized easily to the case of uncertain data. Uncertain frequent pattern mining methods have also been extended to the case of graph data [76]. A variant of uncertain graph pattern mining discovers highly reliable subgraphs [40]. Highly reliable subgraphs are subgraphs that are hard to disconnect in spite of the uncertainty associated with the edges. A discussion of the different methods for frequent pattern mining with uncertain data is provided in Chap. 14.

5 Privacy Issues

Privacy has increasingly become a topic of concern in recent years because of the wide availability of personal data about individuals [7]. This has often led to a reluctance to share data, or to a preference for sharing it in a constrained way or in downgraded form. The additional constraints and downgrading translate to challenges in discovering frequent patterns. In the context of frequent pattern and association rule mining, the primary challenges are as follows:

1. When privacy-preservation methods such as randomization are used, it becomes a challenge to discover associations from the underlying data. This is because a significant amount of noise has been added to the data, and it is often difficult to discover the association rules in the presence of this noise. Therefore, one class of association rule mining methods [30] proposes effective methods to perturb the data, so that meaningful patterns may be discovered while retaining privacy of the perturbed data.
2. In some cases, the output of a privacy-preserving data mining algorithm can lead to violation of privacy. This is because association rules can reveal sensitive information about individuals when they relate sensitive attributes to other kinds of attributes. Therefore, one class of methods focuses on the problem of rule hiding.
3. In many cases, the data to be mined is stored in a distributed way by competitors who may wish to determine global insights without, at the same time, revealing their local insights. This problem is referred to as that of distributed privacy preservation [25]. The data may be either horizontally partitioned across rows (different records) or vertically partitioned (across attributes). Each of these forms of partitioning requires different methods for distributed mining.

Methods for privacy-preserving association rule mining are addressed in Chap. 15.
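The first challenge above, recovering supports from randomized data, can be illustrated with a minimal sketch based on classical randomized response. This is a simplification chosen for illustration, not the perturbation scheme of [30]. Each item's presence bit is flipped independently with a known probability $p$; if the true support of an item is $s$, the expected observed support is $s(1-p) + (1-s)p$, which can be inverted when $p \neq 0.5$. The function names and parameters are assumptions of this sketch.

```python
import random


def randomize(transactions, items, flip_prob, rng):
    """Flip the presence bit of each item independently with probability flip_prob."""
    noisy_transactions = []
    for t in transactions:
        noisy = set()
        for item in items:
            present = item in t
            if rng.random() < flip_prob:
                present = not present
            if present:
                noisy.add(item)
        noisy_transactions.append(frozenset(noisy))
    return noisy_transactions


def estimate_support(observed_support, flip_prob):
    """Invert E[observed] = s(1 - p) + (1 - s)p to estimate the true support s.

    Undefined for flip_prob == 0.5, where the noisy data carries no information.
    """
    return (observed_support - flip_prob) / (1 - 2 * flip_prob)


# Hypothetical data: "Bread" appears in 30% of 20000 transactions.
rng = random.Random(0)
items = ["Bread", "Butter"]
original = [frozenset({"Bread"})] * 6000 + [frozenset()] * 14000
noisy = randomize(original, items, flip_prob=0.1, rng=rng)

observed = sum(1 for t in noisy if "Bread" in t) / len(noisy)
estimated = estimate_support(observed, flip_prob=0.1)
```

The estimate is unbiased for single items, but its variance grows as the flip probability approaches 0.5, and the correction becomes considerably more involved for longer itemsets because the noise on each item compounds. This is one way to see why discovering associations in the presence of randomization is difficult.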
6 Applications of Frequent Pattern Mining

Frequent pattern mining has applications of two types. The first type of application is to other major data mining problems such as clustering, outlier detection, and classification. Frequent patterns are often used to determine relevant clusters from the underlying data. In addition, rule-based classifiers are often constructed with the use of frequent pattern mining methods. Frequent pattern mining is also used in generic applications, such as Web log analytics, software bug analysis, and the analysis of chemical and biological data.

6.1 Applications to Major Data Mining Problems

Frequent pattern mining methods can also be applied to other major data mining problems such as clustering [9, 19], classification, and outlier analysis. For example, frequent pattern mining methods are often used for subspace clustering [11], by discretizing the quantitative attributes, and then finding patterns from these discrete values. Each such pattern, therefore, corresponds to a rectangular region in a subspace of the data. These rectangular regions can then be integrated together in order to create a more comprehensive subspace representation.

Frequent pattern mining is also applied to problems such as classification, in which rules are generated by using patterns on the left-hand side of the rule, and the class variable on the right-hand side of the rule [52]. The main goal here is to find discriminative patterns for the purpose of classification, rather than simply patterns that satisfy the support requirements. Such methods have also been extended to structured XML data [73] by finding discriminative graph-structured patterns. In addition, sequential pattern mining methods can be applied to other temporal mining problems such as event detection [43, 44, 53, 54] and sequence classification [68]. Frequent pattern mining has also been applied to the problem of outlier analysis [1], by determining deviations from the expected patterns in the underlying data.

Methods for clustering based on frequent pattern mining are discussed in Chap. 16, while methods for rule-based classification are discussed in Chap. 17. It should be pointed out that constrained frequent pattern mining is closely related to the problem of classification with frequent patterns, and therefore both are discussed in the same chapter.

6.2 Generic Applications

Frequent pattern mining has applications to a variety of problems such as clustering, classification, and event detection. In addition, specific application areas such as Web mining and software bug detection can also benefit from frequent pattern mining methods. In the context of Web mining, numerous methods have been proposed for finding useful patterns from Web logs in order to make recommendations [63]. Such techniques can also be used to determine outliers from Web log sequences [1]. Frequent patterns are also used for trajectory classification and outlier analysis [48, 49]. Frequent pattern mining methods can also be used in order to determine relevant rules and patterns in spatial data, as they relate to spatial and non-spatial properties of objects. For example, an association rule could be created from the relationships of land temperatures of "nearby" geographical locations. In the context of spatiotemporal data, the relationships between the motions of different objects could be used to create spatiotemporal frequent patterns. Frequent pattern mining methods have been used for finding patterns in biological and chemical data [29, 42, 75]. In addition, because software programs can be represented as graphs, frequent pattern mining methods can be used in order to find logical bugs from program execution traces [51]. Numerous applications of frequent pattern mining are discussed in Chap. 18.

7 Conclusions and Summary

Frequent pattern mining is one of four major problems in the data mining domain. This chapter provides an overview of the major topics in frequent pattern mining. The earliest work in this area was focused on determining efficient algorithms for frequent pattern mining, and variants such as long pattern mining, interesting pattern mining, constraint-based pattern mining, and compression. In recent years, scalability has become an issue because of the massive amounts of data that continue to be created in various applications. In addition, because of advances in data collection technology, advanced data types such as temporal data, spatiotemporal data, graph data, and uncertain data have become more common. Such data types have numerous applications to other data mining problems such as clustering and classification. In addition, such data types are used quite often in various temporal applications, such as Web log analytics.

References

1. C. Aggarwal. Outlier Analysis, Springer, 2013.
2. C. Aggarwal. Social Sensing, Managing and Mining Sensor Data, Springer, 2013.
3. C. C. Aggarwal, and P. S. Yu. Online Generation of Association Rules, ICDE Conference, 1998.
4. R. Agrawal, and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases, VLDB Conference, pp. 487-499, 1994.
5. R. Agrawal, and R. Srikant. Mining Sequential Patterns, ICDE Conference, 1995.
6. C. C. Aggarwal, and P. S. Yu. A New Framework for Itemset Generation, ACM PODS Conference, 1998.
7. C. Aggarwal and P. Yu. Privacy-preserving Data Mining: Models and Algorithms, Springer.
8. C. C. Aggarwal, and H. Wang. Managing and Mining Graph Data, Springer, 2010.
9. C. C. Aggarwal, and C. K. Reddy. Data Clustering: Algorithms and Applications, CRC Press.
10. R. Agrawal, T. Imielinski, and A. Swami. Database Mining: A Performance Perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6), pp. 914-925, 1993.
11. R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, ACM SIGMOD Conference, 1998.
12. R. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. Depth-first Generation of Long Patterns, ACM KDD Conference, 2000. Also appears as IBM Research Report, RC 21538, 1999.
13. R. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. A Tree Projection Algorithm for Generation of Frequent Itemsets, Journal of Parallel and Distributed Computing, 61(3), pp. 350-371, 2001. Also appears as IBM Research Report, RC 21341, 1999.
14. C. C. Aggarwal, N. Ta, J. Wang, J. Feng, M. Zaki. Xproj: A Framework for Projected Structural Clustering of XML Documents, ACM KDD Conference, 2007.
15. C. C. Aggarwal, Y. Li, J. Wang, J. Feng. Frequent Pattern Mining with Uncertain Data, ACM KDD Conference, 2009.
16. C. Aggarwal, Y. Li, P. Yu, and R. Jin. On Dense Pattern Mining in Graph Streams, VLDB Conference, 2010.
17. R. J. Bayardo Jr. Efficiently Mining Long Patterns from Databases. ACM SIGMOD Conference, 1998.
18. J.-F. Boulicaut, A. Bykowski, and C. Rigotti. Free-sets: A Condensed Representation of Boolean Data for the Approximation of Frequency Queries.
Data Mining and Knowledge Discovery, 7(1), pp. 5-22, 2003.
19. G. Buehrer, and K. Chellapilla. A Scalable Pattern Mining Approach to Web Graph Compression with Communities. WSDM Conference, 2009.
20. T. Calders, and B. Goethals. Mining All Non-derivable Frequent Itemsets, Principles of Knowledge Discovery and Data Mining, 2006.
21. T. Calders, C. Rigotti, and J. F. Boulicaut. A Survey on Condensed Representations for Frequent Sets. In Constraint-based Mining and Inductive Databases, pp. 64-80, Springer, 2006.
22. J. H. Chang, W. S. Lee. Finding Recent Frequent Itemsets Adaptively over Online Data Streams. ACM KDD Conference, 2003.
23. M. Charikar, K. Chen, and M. Farach-Colton. Finding Frequent Items in Data Streams, Automata, Languages and Programming, pp. 693-703, 2002.
24. M. S. Chen, J. S. Park, and P. S. Yu. Efficient Data Mining for Path Traversal Patterns, IEEE Transactions on Knowledge and Data Engineering, 10(2), pp. 209-221, 1998.
25. C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Zhu. Tools for Privacy Preserving Distributed Data Mining. ACM SIGKDD Explorations Newsletter, 4(2), pp. 28-34, 2002.
26. E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. Ullman, and C. Yang. Finding Interesting Associations without Support Pruning, IEEE TKDE, 13(1), pp. 64-78, 2001.
27. G. Cormode, S. Muthukrishnan. What's Hot and What's Not: Tracking Most Frequent Items Dynamically, ACM TODS, 30(1), pp. 249-278, 2005.
28. J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, pp. 137-150, 2004.
29. M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis. Frequent Substructure-based Approaches for Classifying Chemical Compounds. IEEE TKDE, 17(8), pp. 1036-1050, 2005.
30. A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy Preserving Mining of Association Rules. Information Systems, 29(4), pp. 343-364, 2004.
31. M. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints, VLDB Conference, 1999.
32. V. Guralnik, and G. Karypis. Parallel Tree-projection-based Sequence Mining Algorithms. Parallel Computing, 30(4), pp. 443-472, April 2004.
33.J.Han,J.Pei,andY.Yin.MiningFrequentPatternswithoutCandidateGeneration, ACM SIGMODConference ,2000. 34.J.Han,H.Cheng,D.Xin,andX.Yan.FrequentPatternMining:CurrentStatusandFuture Directions, DataMiningandKnowledgeDiscovery , 15(1),pp.55…86,2007. C.C.Aggarwal35.J.Han,J.Pei,B.Mortazavi-Asl,Q.Chen,U.Dayal,andM.C.Hsu.FreeSpan:frequentpattern-projectedsequentialpatternmining.ACMKDDConference,2000.36.J.Han,J.Pei,H.Pinto,B.Mortazavi-Asl,Q.Chen,U.Dayal,andM.C.Hsu.Pre“xSpan:Miningsequentialpatternsef“cientlybypre“x-projectedpatterngrowth.ICDEConference37.J.Han,J.-G.Lee,H.Gonzalez,X.Li.MiningMassiveRFID,Trajectory,andTraf-“cDataSets(Tutorial).ACMKDDConference,2008.VideoofTutoralLectureat:38.H.Jeung,M.L.Yiu,X.Zhou,C.Jensen,H.Shen,DiscoveryofConvoysinTrajectoryVLDBConference,2008.39.R.Jin,G.Agrawal.FrequentPatternMininginDataStreams,DataStreams:Modelsand,pp.61…84,Springer,2007.40.R.Jin,L.Liu,andC.Aggarwal.Discoveringhighlyreliablesubgraphsinuncertaingraphs.ACMKDDConference,2011.41.G.KuramuchiandG.Karypis.FrequentSubgraphDiscovery,ICDMConference,2001.42.A.R.LeachandV.J.Gillet.AnIntroductiontoChemoinformatics.Springer,2003.43.W.Lee,S.Stolfo,andP.Chan.LearningPatternsfromUnixExecutionTracesforIntrusionAAAIworkshoponAImethodsinFraudandRiskManagement,1997.44.W.Lee,S.Stolfo,andK.Mok.ADataMiningFrameworkforBuildingIntrusionDetectionIEEESymposiumonSecurityandPrivacy,1999.45.J.-G.Lee,J.Han,K.-Y.Whang,TrajectoryClustering:APartition-and-GroupFramework,ACMSIGMODConference,2007.46.J.-G.Lee,J.Han,X.Li.TrajectoryOutlierDetection:APartition-and-DetectFramework,ICDEConference,2008.47.J.-G.Lee,J.Han,X.Li,H.Gonzalez.TraClass:trajectoryclassi“cationusinghierarchicalregion-basedandtrajectory-basedclustering.,1(1):pp.1081…1094,2008.48.X.Li,J.Han,andS.Kim.Motion-alert:AutomaticAnomalyDetectioninMassiveMovingIEEEConferenceinIntelligenceandSecurityInformatics,2006.49.X.Li,J.Han,S.KimandH.Gonzalez.ROAM:Rule-andMotif-basedAnomalyDetectioninMassiveMovingObjectDataSets,SDMConference,2007.50.Z.Li,B.Ding,J.
Han,R.Kays.Swarm:MiningRelaxedTemporalObjectMovingClusters,VLDBConference,2010.51.C.Liu,X.Yan,H.Lu,J.Han,andP.S.Yu.MiningBehaviorGraphsforbacktraceŽofnon-crashingbugs,SDMConference,2005.52.B.Liu,W.Hsu,Y.Ma.IntegratingClassi“cationandAssociationRuleMining,ACMKDDConference,1998.53.S.Ma,andJ.Hellerstein.MiningPartiallyPeriodicEventPatternswithUnknownPeriods,IEEEInternationalConferenceonDataEngineering,2001.54.H.Mannila,H.Toivonen,andA.I.Verkamo.DiscoveringFrequentEpisodesinSequences,ACMKDDConference,1995.55.R.Ng,L.V.S.Lakshmanan,J.Han,andA.Pang.Exploratoryminingandpruningoptimizationsofconstrainedassociationsrules.ACMSIGMODConference,1998.56.N.Pasquier,Y.Bastide,R.Taouil,andL.Lakhal.Discoveringfrequentcloseditemsetsforassociationrules.InternationalConferenceonDatabaseTheory,pp.398…416,1999.57.J.Pei,andJ.Han.Canwepushmoreconstraintsintofrequentpatternmining?ACMKDDConference,2000.58.J.Pei,J.Han,R.Mao.CLOSET:AnEf“cientAlgorithmsforMiningFrequentClosedItemsets,DMKDWorkshop,2000.59.J.Pei,J.Han,H.Lu,S.Nishio,S.Tang,andD.Yang.H-mine:Hyper-structureminingoffrequentpatternsinlargedatabases.InDataMining,ICDMConference,2001.60.J.Pei,J.Han,andL.V.S.Lakshmanan.MiningFrequentPatternswithConvertibleConstraintsinLargeDatabases,ICDEConference,2001. 
1AnIntroductiontoFrequentPatternMining61.J.Pei,J.Han,andW.Wang.Constraint-basedSequentialPatternMining:ThePattern-GrowthJournalofIntelligentInformationSystems,28(2),pp.133…160,2007.62.P.Shenoy,J.Haritsa,S.Sudarshan,G.Bhalotia,M.Bawa,D.Shah.Turbo-chargingVerticalMiningofLargeDatabases.ACMSIGMODConference,pp.22…33,2000.63.J.Srivastava,R.Cooley,M.Deshpande,andP.N.Tan.Webusagemining:DiscoveryandapplicationsofusagepatternsfromWebdata.ACMSIGKDDExplorationsNewsletter,1(2),pp.12…23,2000.64.Y.Tong,L.Chen,Y.Cheng,P.Yu.MiningFrequentItemsetsoverUncertainDatabases.5(11),pp.1650…1661,2012.65.V.S.Verykios,A.K.Elmagarmid,E.Bertino,Y.Saygin,andE.Dasseni.AssociationruleIEEETransactionsonKnowledgeandDataEngineering,pp.434…447,16(4),pp.434…447,2004.66.J.Vreeken,M.vanLeeuwen,andA.Siebes.Krimp:Miningitemsetsthatcompress.MiningandKnowledgeDiscovery,23(1),pp.169…214,2011.67.J.Wang,J.Han,andJ.Pei.CLOSET+:SearchingfortheBeststrategiesforminingfrequentcloseditemsets.ACMKDDConference,2003.68.Z.Xing,J.Pei,andE.Keogh.ABriefSurveyonSequenceClassi“cation,ACMSIGKDDExplorations,12(1),2010.69.X.Yan,P.S.Yu,andJ.Han,Graphindexing:Afrequentstructure-basedapproach.ACMSIGMODConference,2004.70.X.Yan,P.S.Yu,andJ.Han.Substructuresimilaritysearchingraphdatabases.ACMSIGMODConference,2005.71.X.Yan,F.Zhu,J.Han,andP.S.Yu.Searchingsubstructureswithsuperimposeddistance,ICDEConference,2006.72.M.Zaki.Ef“cientlyminingfrequenttreesinaforest:Algorithmsandapplications.TransactionsonKnowledgeandDataEngineering,17(8),pp.1021…1035,2005.73.M.Zaki,C.Aggarwal.XRules:AnEffectiveClassi“erforXMLData,ACMKDDConference74.M.Zaki,C.J.Hsiao.CHARM:AnEf“cientAlgorithmforClosedFrequentItemsetMining,SDMConference,2002.75.S.Zhang,T.Wang.DiscoveringFrequentAgreementSubtreesfromPhylogeneticData.TransactionsonKnowledgeandDataEngineering,20(1),pp.68…82,2008.76.Z.Zou,J.Li,H.Gao,andS.Zhang.MiningFrequentSubgraphPatternsfromUncertainGraphIEEETransactionsonKnowledgeandDataEngineering,22(9),pp.1203…1218,2010.77.http://“mi.ua.ac.be/ 
Chapter 2
Frequent Pattern Mining Algorithms: A Survey

Charu C. Aggarwal, Mansurul A. Bhuiyan and Mohammad Al Hasan

C. C. Aggarwal, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA, e-mail: charu@us.ibm.com
M. A. Bhuiyan and M. A. Hasan, Indiana University-Purdue University, Indianapolis, IN, USA, e-mail: mbhuiyan@cs.iupui.edu, alhasan@cs.iupui.edu
C. C. Aggarwal, J. Han (eds.), Frequent Pattern Mining, DOI 10.1007/978-3-319-07821-2_2, © Springer International Publishing Switzerland 2014

This chapter will provide a detailed survey of frequent pattern mining algorithms. A wide variety of algorithms will be covered starting from Apriori. Many algorithms such as Eclat, TreeProjection, and FP-growth will be discussed. In addition, a discussion of several maximal and closed frequent pattern mining algorithms will be provided. Thus, this chapter will provide one of the most detailed surveys of frequent pattern mining algorithms available in the literature.

Keywords: Frequent pattern mining algorithms, Apriori, TreeProjection, FP-growth

1 Introduction

In data mining, frequent pattern mining (FPM) is one of the most intensively investigated problems in terms of computational and algorithmic development. Over the last two decades, numerous algorithms have been proposed to solve frequent pattern mining or some of its variants, and the interest in this problem still persists [45, 75]. Different frameworks have been defined for frequent pattern mining. The most common one is the support-based framework, in which itemsets with frequency above a given threshold are found. However, such itemsets may sometimes not represent interesting positive correlations between items because they do not normalize for the absolute frequencies of the items. Consequently, alternative measures for interestingness have been defined in the literature [7, 11, 16, 63]. This chapter will focus on the support-based framework because the algorithms based on the interestingness framework are provided in a different chapter. Surveys on frequent pattern mining may be found in [26, 33].

One of the main reasons for the high level of interest in frequent pattern mining algorithms is the computational challenge of the task. Even for a moderately sized dataset, the search space of FPM is enormous: it is exponential in the length of the transactions in the dataset. This naturally creates challenges for itemset generation when the support levels are low. In fact, in most practical scenarios, the support levels at which one can mine the corresponding itemsets are limited (bounded below) by the memory and computational constraints. Therefore, it is critical to be able to perform the analysis in a space- and time-efficient way. During the first few years of research in this area, the primary focus of work was to find FPM algorithms with better computational efficiency.

Several classes of algorithms have been developed for frequent pattern mining, many of which are closely related to one another. In fact, the execution trees of the algorithms differ mostly in terms of the order in which the patterns are explored, and whether the counting work done for different candidates is independent of one another. To explain this point, we introduce a primitive "baseline" algorithm that forms the heart of most frequent pattern mining algorithms.

Figure 2.1 presents the pseudocode for a very simple "baseline" frequent pattern mining algorithm. The algorithm takes the transaction database and a user-defined support value as input. It first populates all length-one frequent patterns in a frequent pattern data-store. Then it generates a candidate pattern and computes its support in the database. If the support of the candidate pattern is equal to or higher than the minimum support threshold, the pattern is stored in the data-store. The process continues until all the frequent patterns from the database are found.

[Fig. 2.1: A generic frequent pattern mining algorithm]

In the aforementioned algorithm, candidate patterns are generated from the previously generated frequent patterns. Then, the transaction database is used to determine which of the candidates are truly frequent patterns. The key issues of computational efficiency arise in terms of generating the candidate patterns in an orderly and carefully designed fashion, pruning irrelevant
and duplicate candidates, and using well-chosen tricks to minimize the work in counting the candidates. Clearly, the effectiveness of these different strategies depends on each other. For example, the effectiveness of a pruning strategy may be dependent on the order of exploration of the candidates (level-wise vs. depth-first), and the effectiveness of counting is also dependent on the order of exploration because the work done for counting at the higher levels (shorter itemsets) can be reused at the lower levels (longer itemsets) with certain strategies, such as those explored in TreeProjection and FP-growth.

Surprising as it might seem, virtually all frequent pattern mining algorithms can be considered complex variations of this simple baseline pseudocode. The major challenge of all of these methods is that the number of frequent patterns and candidate patterns can sometimes be large. This is a fundamental problem of frequent pattern mining, although it is possible to speed up the counting of the different candidate patterns with the use of various tricks such as database projections. An analysis of the number of candidate patterns may be found in [25].

The candidate generation process of the earliest algorithms used joins. The original Apriori algorithm belongs to this category [1]. Although Apriori is presented as a join-based algorithm, it can be shown that the algorithm is a breadth-first exploration of a structured arrangement of the itemsets, known as a lexicographic tree or enumeration tree. Therefore, later classes of algorithms explicitly discuss tree-based enumeration [4, 5]. The algorithms assume a lexicographic tree (or enumeration tree) of candidate patterns and explore the tree using breadth-first or depth-first strategies. The use of the enumeration tree forms the basis for understanding search space decomposition, as in the case of the TreeProjection algorithm [5]. The enumeration tree concept is very useful because it provides an understanding of how the search space of candidate patterns may be explored in a systematic and non-redundant way. Frequent pattern mining algorithms typically need to evaluate the support of frequent portions of the enumeration tree, and also rule out an additional layer of infrequent extensions of the frequent nodes in the enumeration tree. This makes the candidate space of all frequent pattern mining algorithms virtually invariant unless one is interested in particular types of patterns such as maximal patterns.

The enumeration tree is defined on the prefixes of frequent itemsets, and will be introduced later in this chapter. Later algorithms such as FP-growth perform suffix-based recursive exploration of the search space. In other words, the frequent patterns with a particular pattern as a suffix are explored at one time. This is because FP-growth uses the opposite item ordering convention from most enumeration tree algorithms, though the recursive exploration order of FP-growth is similar to an enumeration tree.

Note that all classes of algorithms, implicitly or explicitly, explore the search space of patterns defined by an enumeration tree of frequent patterns with different strategies such as joins, prefix-based depth-first exploration, or suffix-based depth-first exploration. However, there are significant differences in terms of the order in which the search space is explored, the pruning methods used, and how the counting is performed. In particular, certain projection-based methods help in reusing the counting work for $k$-itemsets for $(k+1)$-itemsets with the use of the notion of projected databases. Many algorithms such as TreeProjection and FP-growth are able to achieve this goal.
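The baseline framework described in this section (seed the data-store with frequent single items, generate candidates from frequent patterns, count support, keep the frequent ones) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the pseudocode of Fig. 2.1: the function names `support` and `baseline_fpm`, the level-wise order, and the toy data are all hypothetical choices made for this sketch.

```python
def support(pattern, transactions):
    """Count the transactions that contain every item of `pattern`."""
    return sum(1 for t in transactions if pattern <= t)


def baseline_fpm(transactions, min_support):
    """Naive level-wise miner: extend frequent patterns by one item, then count.

    Returns a dict mapping each frequent pattern (a frozenset) to its support.
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})

    # Seed the data-store with all frequent length-one patterns.
    frequent = {}
    for item in items:
        s = support(frozenset([item]), transactions)
        if s >= min_support:
            frequent[frozenset([item])] = s
    frequent_items = sorted(i for p in frequent for i in p)

    level = list(frequent)
    while level:
        # Candidate generation: a set removes duplicate candidates.
        candidates = {p | {i} for p in level for i in frequent_items if i not in p}
        # Validation: one scan of the database per candidate (deliberately naive).
        level = []
        for c in candidates:
            s = support(c, transactions)
            if s >= min_support:
                frequent[c] = s
                level.append(c)
    return frequent


# Hypothetical toy data (not the chapter's Table 2.1).
toy_db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = baseline_fpm(toy_db, min_support=3)
```

The sketch makes the cost structure visible: every candidate triggers a full database scan, which is exactly the work that the candidate ordering, pruning, and counting tricks discussed in this chapter try to reduce.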
Table 2.1 Toy transaction database and frequent items of each transaction for a minimum support of 3

tid   Items               Sorted frequent items
2     a, b, c, d, f, h    a, b, c, d, f
3     a, f, g             a, f
4     b, e, f, g          b, f, e
5     a, b, c, d, e, h    a, b, c, d, e

This chapter is organized as follows. The remainder of this section discusses notations and definitions relevant to frequent pattern mining. Section 2 discusses join-based algorithms. Section 3 discusses tree-based algorithms. All the algorithms discussed in Sects. 2 and 3 extend prefixes of itemsets to generate frequent patterns. A number of methods that extend suffixes of frequent patterns are discussed in Sect. 4. Variants of frequent pattern mining, such as closed and maximal frequent pattern mining, are discussed in Sect. 5. Other optimized variations of frequent pattern mining algorithms are discussed in Sect. 6. Methods for reducing the number of passes with the use of sampling and aggregation are discussed in Sect. 7. Finally, Sect. 8 concludes the chapter with an overall summary.

1.1 Definitions

In this section, we define several key concepts of frequent pattern mining (FPM) that we will use in the remaining part of the chapter.

Let $\mathcal{T} = \{T_1, T_2, \ldots, T_n\}$ be a transaction database, where each $T_i$ consists of a set of items, say $T_i = \{x_1, x_2, \ldots, x_l\}$. A set $P \subseteq T_i$ is called an itemset. The size of an itemset is defined by the number of items it contains. We will refer to an itemset as an $l$-itemset (or $l$-pattern) if its size is $l$. The number of transactions containing $P$ is referred to as the support of $P$. A pattern $P$ is defined to be frequent if its support is at least equal to the minimum threshold.

Table 2.1 depicts a toy database with 5 transactions. The second column shows the items in each transaction. In the third column, we show the set of items that are frequent in the corresponding transaction for a minimum support value of 3. For example, the item $h$ in the transaction with tid value of 2 is an infrequent item with a support value of 2. Therefore, it is not listed in the third column of the corresponding row. Similarly, a pattern (written in abbreviated form by concatenating its items) is frequent when it has a support value of at least 3.

The frequent patterns are often used to generate association rules. Consider the rule $X \Rightarrow Y$, where $X$ and $Y$ are sets of items. The confidence of the rule $X \Rightarrow Y$ is equal to the ratio of the support of $X \cup Y$ to the support of $X$. In other words, it can be viewed as the conditional probability that $Y$ occurs, given that $X$ has occurred. The support of the rule is equal to the support of $X \cup Y$. Association rule generation is a two-phase process. The first phase determines all the frequent patterns at a given minimum support level. The second phase extracts all the rules from these patterns. The second phase is fairly trivial and has limited sophistication. Therefore, most of the algorithmic work in frequent pattern mining focuses on the first phase. This chapter will also focus on the first phase of frequent pattern mining, which is generally considered more important and non-trivial.

Frequent patterns satisfy a downward closure property, according to which every subset of a frequent pattern is also frequent. This is because if a pattern $P$ is a subset of a transaction $T$, then every pattern $P' \subseteq P$ will also be a subset of $T$. Therefore, the support of $P'$ can be no less than that of $P$. The space of exploration of frequent patterns can be arranged as a lattice, in which every node is one of the $2^d$ possible itemsets over a universe of $d$ items, and an edge represents an immediate subset relationship between these itemsets. An example of a lattice of possible itemsets for a universe of items corresponding to $\{a, b, c, d\}$ is illustrated in Fig. 2.2. The lattice represents the search space of frequent patterns, and all frequent pattern mining algorithms must, in one way or another, traverse this lattice to identify the frequent nodes of this lattice. The lattice is separated into a frequent and an infrequent part with the use of a border. An example of a border is illustrated in Fig. 2.2. This border must satisfy the downward closure property.

[Fig. 2.2: The lattice of itemsets over {a, b, c, d}, from Null to abcd, with a border between the frequent itemsets and the infrequent itemsets]

The lattice can be traversed with a variety of strategies such as breadth-first or depth-first methods. Furthermore, candidate nodes of the lattice may be generated in many ways, such as using joins, or using lexicographic tree-based extensions. Many of these methods are conceptually equivalent to one another. The following discussion will provide an overview of the different strategies that are commonly used.

2 Join-Based Algorithms

Join-based algorithms generate $(k+1)$-candidates from frequent $k$-patterns with the use of joins. These candidates are then validated against the transaction database. The Apriori method uses joins to create candidates from frequent patterns, and is one of the earliest algorithms for frequent pattern mining.

2.1 Apriori Method

The most basic join-based algorithm is the Apriori method [1]. The Apriori approach is level-wise: all frequent itemsets of length $k$ are generated before those of length $(k+1)$. The main observation which is used for the Apriori algorithm is that every subset of a frequent pattern is also frequent. Therefore, candidates for frequent patterns of length $(k+1)$ can be generated from known frequent patterns of length $k$ with the use of joins. A join is defined by pairs of frequent $k$-patterns that have at least $(k-1)$ items in common. Specifically, consider a 4-itemset $\{a, b, c, d\}$ that is frequent, but has not yet been discovered because only itemsets of length 3 have been discovered so far. In this case, because the patterns $\{a, b, c\}$ and $\{a, b, d\}$ are frequent, they will be present in the set of all frequent patterns with length $k = 3$. Note that this particular pair also has $k - 1 = 2$ items in common. By performing a join on this pair, it is possible to create the candidate pattern $\{a, b, c, d\}$. This pattern is referred to as a candidate because it might be frequent, and one must either rule it in or rule it out by support counting. Therefore, this candidate is then validated against the transaction database by counting its support. Clearly, the design of an efficient support counting method plays a critical role in the overall efficiency of the process.

Furthermore, it is important to note that the same candidate can be produced by joining multiple frequent patterns. For example, one might join $\{a, b, c\}$ with $\{b, c, d\}$ to achieve the same result. Therefore, in order to avoid duplication in candidate generation, two itemsets are joined only when their first $(k-1)$ items are the same, based on a lexicographic ordering imposed on the items. This provides all the $(k+1)$-candidates in a non-redundant way.

It should be pointed out that some candidates can be pruned out in an efficient way, without validating them against the transaction database. For any $(k+1)$-candidate, it is checked whether all of its $k$-subsets are frequent. Although it is already known that the two subsets contributing to the join are frequent, it is not known whether its remaining subsets are frequent. If not all of its subsets are frequent, then the candidate can be pruned from consideration because of the downward closure property. This is known as the Apriori pruning trick. For example, in the previous case, if the itemset $\{a, c, d\}$ does not exist in the set of frequent 3-itemsets which have already
been found, then the candidate itemset can be pruned from consideration with no further computational effort. This greatly speeds up the overall algorithm. The generation of 1-itemsets and 2-itemsets is usually performed in a specialized way with more efficient techniques.

Therefore, the basic Apriori algorithm can be described recursively in level-wise fashion. The overall algorithm comprises three steps that are repeated over and over again for different values of $k$, where $k$ is the length of the pattern generated in the current iteration. Let $F_k$ denote the set of frequent $k$-patterns and $C_{k+1}$ the set of $(k+1)$-candidates. The three steps are (i) the generation of the candidate patterns $C_{k+1}$ by using joins on the patterns in $F_k$, (ii) the pruning of candidates from $C_{k+1}$ for which not all $k$-subsets lie in $F_k$, and (iii) the validation of the patterns in $C_{k+1}$ against the transaction database, to determine the subset of $C_{k+1}$ which is truly frequent. The algorithm terminates when the set of frequent $k$-patterns $F_k$ in a given iteration is empty. The pseudo-code of the overall procedure is presented in Fig. 2.3.

[Fig. 2.3]

The computationally intensive procedure in this case is the counting of the candidates in $C_{k+1}$ with respect to the transaction database. Therefore, a number of optimizations and data structures have been proposed in [1] (and also the subsequent literature) to speed up the counting process. The data structure proposed in [1] is a hash-tree that maintains the candidate patterns. A leaf node of the hash-tree contains a list of itemsets, whereas an interior node contains a hash-table. An itemset is mapped to a leaf node of the tree by defining a path from the root to the leaf node with the use of the hash function. At a node of level $i$, a hash function is applied to the $i$th item to decide which branch to follow. The itemsets in the leaf node are stored in sorted order. The tree is constructed recursively in top-down fashion, and a minimum threshold is imposed on the number of candidates in the leaf node.

To perform the counting, all possible $k$-itemsets which are subsets of a transaction are discovered in a single exploration of the hash-tree. To achieve this goal, all possible paths in the hash tree that could correspond to subsets of the transaction are followed in recursive fashion, to determine which leaf nodes are relevant to that transaction. After the leaf nodes have been discovered, the itemsets at these leaf nodes that are subsets of that transaction are isolated and their count is incremented. The actual selection of the relevant leaf nodes is performed by recursive traversal as follows. At the root node, all branches are followed such that any of the items in the transaction hash to one of the branches. At a given interior node, if the $i$th item of the transaction was last hashed, then all items following it in the transaction are hashed to determine the possible children to follow. Thus, by following all these paths, the relevant leaf nodes in the tree are determined. The candidates in the leaf node are stored in sorted order, and can be compared efficiently to the hashed sequence of items in the transaction to determine whether they are relevant. This provides a count of the itemsets relevant to the transaction. This process is repeated for each transaction to determine the final support count for each itemset.

It should be pointed out that the reason for using a hash function at the intermediate nodes is to reduce the branching factor of the hash tree. However, if desired, a trie can be used explicitly, in which the degree of a node is potentially of the order of the total number of items. An example of such an implementation is provided in [12], and it seems to work quite well. An algorithm that shares some similarities with the Apriori method was independently proposed in [44], and subsequently a combined work was published in [3].

[Fig. 2.4: Execution tree of Apriori]

Figure 2.4 illustrates the execution tree of the join-based Apriori algorithm over the toy transaction database mentioned in Table 2.1 for minimum support value 3. As mentioned in the pseudocode of Apriori, a candidate $k$-pattern is generated by joining two frequent itemsets of size $(k-1)$. For example, at level 3, a candidate 3-pattern is generated by joining two frequent 2-patterns. After generating the candidate patterns, the support of the patterns is computed by scanning every transaction in the database and determining the frequent ones. In Fig. 2.4, a candidate pattern is shown in a box along with its support value. A frequent candidate is shown in a solid box, and an infrequent candidate is shown in a dotted box. An edge represents the join relationship between a candidate pattern of size $k$ and a frequent pattern of size $(k-1)$ such that the latter is used to generate the former. The figure also illustrates the fact that a pair of frequent patterns is used to generate a candidate pattern, whereas no candidates are generated from an infrequent pattern.

2.1.1 Apriori Optimizations

Numerous optimizations were proposed for the Apriori algorithm [1], which are referred to as AprioriTid and AprioriHybrid, respectively. In the AprioriTid algorithm, each transaction is replaced by a shorter transaction (or null transaction) during the $k$th phase. Consider the set of $(k+1)$-candidates that are contained in a given transaction. This set is added to a newly created transaction database. If this set is null, then clearly the transaction contains no candidate. A number of different tradeoffs exist with the use of such an approach.

• Because each newly created transaction in the new database is much shorter, this makes subsequent support counting more efficient.
• In some cases, no candidate may be a subset of the transaction. Such a transaction can be dropped from the database because it does not contribute to the counting of support values.
• In other cases, more than one candidate may be a subset of the transaction, which will actually increase the overhead of the algorithm. Clearly, this is not a desirable scenario.

Thus, the first two factors improve the efficiency of the new representation, whereas the last factor worsens it. Typically, the impact of the last factor is greater in the early iterations, whereas the impact of the first two factors is greater in the later iterations. Therefore, to maximize the overall efficiency, a natural approach would be not to use this optimization in the early iterations, and to apply it only in the later iterations. This variation is referred to as the AprioriHybrid algorithm [1]. Another optimization proposed in [9] is that the support of many patterns can be inferred from those of key patterns in the data. This is used to significantly enhance the efficiency of Apriori.

Numerous other techniques have been proposed to optimize the original implementation of the Apriori algorithm. As an example, the methods in [1] and [44] share a number of similarities but are somewhat different at the implementation level. A work that combines the ideas from these different pieces of work is presented in [3].

2.2 DHP Algorithm

The DHP algorithm, also known as the Direct Hashing and Pruning method [50], was proposed soon after the Apriori method. It proposes two main optimizations to speed up the algorithm. The first optimization is to prune the candidate itemsets in each iteration, and the
second optimization is to trim the transactions to make the support-counting process more efficient.

To prune the itemsets, the algorithm tracks partial information about candidate $(k+1)$-itemsets, while explicitly counting the support of candidate $k$-itemsets. During the counting of candidate $k$-itemsets, all $(k+1)$-subsets of the transaction are found and hashed into a table that maintains the counts of the number of subsets hashed into each entry. During the phase of counting $(k+1)$-itemsets, the counts in the hash table are retrieved for each itemset. Clearly, these counts are overestimates because of possible collisions in the hash table. Those itemsets for which the counts are below the user-specified support level are then pruned from consideration.

A second optimization proposed in DHP is that of transaction trimming. Let $F_k$ denote the set of frequent $k$-itemsets. A key observation here is that if an item does not appear in at least $k$ frequent itemsets in $F_k$, then no frequent itemset in $F_{k+1}$ will contain that item. This follows from the fact that each frequent pattern in $F_{k+1}$ containing a particular item has at least $k$ (immediate) subsets that also occur in $F_k$ and also contain that item. This implies that if an item does not appear in at least $k$ frequent itemsets in $F_k$, then that item is no longer relevant to further support counting for finding frequent patterns. Therefore, that item can be trimmed from the transaction. This reduces the width of the transaction, and increases the efficiency of processing.

The overhead from the data structures is significant, and most of the advantages are obtained for patterns of smaller length such as 2-itemsets. It was pointed out in later work [46, 47, 60] that the use of triangular arrays for support counting of 2-itemsets in the context of the Apriori method is even more efficient than such an approach.

2.3 Special Tricks for 2-Itemset Counting

A number of special tricks can be used to improve the effectiveness of 2-itemset counting. The case of 2-itemset counting is special and is often similar for join-based and tree-based algorithms. As mentioned above, one approach is to use a triangular array that maintains the counts of the 2-patterns explicitly. For each transaction, a nested loop can be used to explore all pairs of items in the transaction and increment the corresponding counts in the triangular array. A number of caching tricks can be used [5]
to improve data locality during the counting process. However, if the number of possible items is very large, this will still be a very significant overhead because an entry must be maintained for each pair of items. This is also very wasteful if many of the 1-items are not frequent, or some of the 2-item counts are zero. Therefore, a possible approach would be to first prune out all the 1-items which are not frequent. It is simply not necessary to count the support of a 2-itemset unless both of its constituent items are frequent. A hash table can then be used to maintain the frequency counts of the corresponding 2-itemsets. As before, the transactions are explored in a doubly nested loop, and all pairs of items are hashed into the table, with the caveat that each of the individual items must be frequent. The itemsets which satisfy the support requirements are reported.

2.4 Pruning by Support Lower Bounding

Most of the pruning tricks discussed earlier prune itemsets when they are guaranteed not to meet the required support threshold. It is also possible to skip the counting process for an itemset if the itemset is guaranteed to meet the support threshold. Of course, the caveat here is that the exact support of that itemset will not be available, beyond the knowledge that it meets the minimum threshold. This is sufficient in the case of many applications.

Consider two $k$-itemsets $A$ and $B$ that have $k-1$ items in common. Then, the union of the items in $A$ and $B$, denoted by $A \cup B$, will have exactly $k+1$ items. If $\mathrm{sup}(\cdot)$ represents the support of an itemset, then the support of $A \cup B$ can be lower bounded as follows:

\[ \mathrm{sup}(A \cup B) \geq \mathrm{sup}(A) + \mathrm{sup}(B) - \mathrm{sup}(A \cap B) \tag{2.1} \]

This condition follows directly from set-theoretic considerations. Thus, the support of $(k+1)$-candidates can be lower bounded in terms of the (already computed) support values of itemsets of length $k$ or less. If the computed value on the right-hand side is greater than the required minimum support, then the counting of the candidate does not need to be performed explicitly, and therefore considerable savings can be achieved. An example of a method which uses this kind of pruning is given in [10].

Another interesting rule is that if the support of an itemset $X$ is the same as that of $X \cup Y$, then for any superset $X' \supseteq X$, the support of the itemset $X' \cup Y$ is the same as that of $X'$. This rule can be shown directly as a corollary of the equation above. This is very useful in a variety of frequent pattern mining algorithms. For example, once the support of $X \cup Y$ has been shown to be the same as that of $X$, then, for any superset $X'$ of $X$, it is no longer necessary to explicitly compute the support of $X' \cup Y$ after the support of $X'$ has already been computed. Such optimizations have been shown to be quite effective in the context of many frequent pattern mining algorithms [13, 51, 17]. As discussed later, this trick is not exclusive to join-based algorithms, and is often used effectively in tree-based algorithms.

2.5 Hypercube Decomposition

One feasible way to reduce the computation cost of support counting is to find the support of multiple frequent patterns at one time. LCM [66] devises a technique referred to as hypercube decomposition for this purpose. The multiple itemsets obtained at one time comprise a hypercube in the itemset lattice. Suppose that $P$ is a frequent pattern, $\mathrm{tidset}(P)$ contains the transactions that $P$ is part of, and $\mathrm{tail}(P)$ denotes the latest item extension to the itemset $P$. $H(P)$ is the set of items $e$ satisfying $e > \mathrm{tail}(P)$ and $\mathrm{tidset}(P) = \mathrm{tidset}(P \cup e)$. The set $H(P)$ is referred to as the hypercube set. Then, for any $P' \subseteq H(P)$, $\mathrm{tidset}(P \cup P') = \mathrm{tidset}(P)$ is true, and $P \cup P'$ is frequent. The work in [66] uses this property in the candidate generation phase. For two itemsets $P$ and $P \cup P'$, we say that $P''$ is between $P$ and $P \cup P'$ if $P \subseteq P'' \subseteq P \cup P'$. In the phase with respect to $P$, we output all $P''$ between $P$ and $P \cup H(P)$. This technique saves significant time in counting.

3 Tree-Based Algorithms

The tree-based algorithm is based on set-enumeration concepts. The candidates can be explored with the use of a subgraph of the lattice of itemsets (see Fig. 2.2), which is also referred to as the lexicographic tree or enumeration tree [5]. These terms will, therefore, be used interchangeably. Thus, the problem of frequent itemset generation is equivalent to that of constructing the enumeration tree. The tree can be grown in a wide variety of ways such as breadth-first or depth-first order. Because most of the discussion in this section will use this structure as a base for algorithmic development, this concept will be discussed in detail here.

[Fig. 2.5: The lexicographic tree (also known as enumeration tree)]

The main characteristic of tree-based algorithms is that the enumeration tree (or lexicographic tree) provides a certain order of exploration that can be extremely useful in many scenarios. It is assumed that a lexicographic ordering exists among the items in the database. This lexicographic ordering is essential for efficient set enumeration without repetition. To indicate that an item $i$ occurs lexicographically earlier than $j$, we will use the notation $i \leq_L j$. The lexicographic tree is an abstract representation of the large (i.e., frequent) itemsets with respect to this ordering. The lexicographic tree is defined in the following way:

• A node exists in the tree corresponding to each large itemset. The root of the tree corresponds to the null itemset.
• Let $I = \{i_1, \ldots, i_k\}$ be a large itemset, where $i_1, i_2, \ldots, i_k$ are listed in lexicographic order. The parent of the node $I$ is the itemset $\{i_1, \ldots, i_{k-1}\}$.

This definition of ancestral relationship naturally defines a tree structure on the nodes that is rooted at the null node. A frequent 1-extension of an itemset such that the last item is the contributor to the extension will be called a frequent lexicographic tree extension, or simply a tree extension. Thus, each edge in the lexicographic tree corresponds to an item which is the frequent lexicographic tree extension to a node. The frequent lexicographic extensions of node $P$ are denoted by $E(P)$. An example of the lexicographic tree is illustrated in Fig. 2.5. In this example, the frequent lexicographic extensions of node $a$ are $b$, $c$, $d$, and $f$.
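The parent rule above (drop the lexicographically last item) is enough to assemble the tree from a collection of frequent itemsets. The sketch below is a hedged illustration rather than any published implementation: the function name `build_enumeration_tree`, the use of Python string order as the lexicographic ordering, and the input list are assumptions made for this example.

```python
from collections import defaultdict


def build_enumeration_tree(frequent_itemsets):
    """Map each node (a sorted tuple of items) to its sorted tree extensions.

    The parent of a node is obtained by dropping its last item, so every
    itemset appears exactly once in the tree.
    """
    nodes = {tuple(sorted(itemset)) for itemset in frequent_itemsets}
    nodes.add(())  # the root corresponds to the null itemset

    extensions = defaultdict(list)
    for node in nodes:
        if not node:
            continue
        parent = node[:-1]
        if parent not in nodes:
            # Downward closure guarantees the parent of a frequent itemset
            # is frequent; a missing parent means the input is inconsistent.
            raise ValueError(f"parent {parent} of {node} is missing")
        extensions[parent].append(node[-1])

    return {node: sorted(extensions[node]) for node in nodes}


# Hypothetical frequent itemsets, written in abbreviated form.
hypothetical_frequent = ["a", "b", "c", "d", "f",
                         "ab", "ac", "ad", "af", "abc", "abd"]
tree = build_enumeration_tree(hypothetical_frequent)
```

With this hypothetical input, `tree[("a",)]` lists the tree extensions of node a and `tree[("a", "f")]` is empty, since af has no children. Storing only the last item on each edge keeps the representation compact while the full itemset is recovered from the path to the root.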
Let $Q$ be the immediate ancestor of the itemset $P$ in the lexicographic tree. The set of prospective branches of a node $P$ is defined to be those items in $E(Q)$, the set of frequent lexicographic extensions of $Q$, which occur lexicographically after the node $P$. These are the possible lexicographic extensions of $P$. We denote this set by $R(P)$. Thus, we have the following relationship: $E(P) \subseteq R(P) \subset E(Q)$. The value of $E(P)$ in Fig. 2.5, when $P = ab$, is $\{c, d\}$. The value of $R(P)$ for $P = ab$ is $\{c, d, f\}$, and for $P = af$, $R(P)$ is empty.

It is important to point out that virtually all non-maximal and maximal algorithms, starting from Apriori, can be considered enumeration-tree methods. In fact, there are few frequent pattern mining algorithms which do not use the enumeration tree, or a subset thereof (in maximal pattern mining), for frequent itemset generation. However, the order in which the different algorithms explore the lexicographic tree is quite different. For example, Apriori uses a breadth-first strategy, whereas other algorithms discussed later in this chapter use a depth-first strategy. Some methods are explicit about the relationship between the candidate generation process and the enumeration tree, whereas others, such as Apriori, are not. For example, by examining Fig. 2.4, it is evident that Apriori candidates can be generated by joining two frequent siblings of a lexicographic tree. In fact, all candidates can be generated in an exhaustive and non-redundant way by joining frequent siblings. For example, the two itemsets acdfh and acdfg are siblings, because they are children of the node acdf. By joining them, one obtains the candidate pattern acdfgh. Thus, while Apriori is a join-based algorithm, it can also be explained in terms of the enumeration tree.

Parts of the enumeration tree may be removed by some of the algorithms by pruning methods. For example, the Apriori algorithm uses a levelwise pruning trick. For pattern mining, the advantages gained from pruning tricks can be very significant. Therefore, the number of candidates in the execution tree of different algorithms is different only because of pruning optimization tricks. However, some methods are able to achieve better counting strategies by using the structure of the enumeration tree to avoid re-doing for $(k+1)$-candidates the counting work already done for $k$-candidates. Therefore, explicitly introducing the enumeration tree is helpful because it allows a more flexible way to visualize candidate exploration strategies than join-based methods. The explicit introduction of the enumeration tree also helps in understanding whether the gains in different algorithms arise as a result of a smaller number of candidates, or whether they arise as a result of better counting strategies.

3.1 AIS Algorithm

The original AIS algorithm [2] is a simple version of the lexicographic-tree algorithm, though it is not directly presented as such. In this approach, the tree is constructed in levelwise fashion and the corresponding itemsets at a given level are counted with the use of the transaction database. The algorithm does not use any specific optimizations to improve the efficiency of the counting process. As will be discussed later, a variety of methods can be used to further improve the efficiency of tree-based algorithms. Thus, this is a primitive approach that explores the entire search space with no optimization.

3.2 TreeProjection Algorithms

Two variants of an algorithm which use recursive projections of the transactions down the lexicographic tree structure are proposed in [5] and [4], respectively. The goal of using these recursive projections is to reuse the counting work done at a given level for lower levels of the tree. This reduces the counting work at the lower levels by orders of magnitude, as long as it is possible to successfully manage the memory requirements of the projected transactions. The main difference between the different versions of TreeProjection is the exploration strategy used. TreeProjection can be viewed as a generic framework that advocates the notion of database projection, in the context of several different strategies for constructing the enumeration tree, such as breadth-first, depth-first, or a combination of the two. The depth-first version, described in detail in [4], also incorporates maximal pruning, though disabling the pruning options can also materialize all the patterns. The breadth-first and depth-first algorithms have different advantages. The former allows level-wise pruning, which is not possible in depth-first methods, though it is often not used in projection-based methods. The depth-first version allows better memory management. The depth-first approach works best when the itemsets are very long, and it is desirable to quickly discover maximal patterns, so that portions of the lexicographic tree can be pruned off quickly during explo
rationanditcanalsobeusedfordiscoveringallpatternsincludingnon-maximalones.Whenallpatternsarerequired,includingnon-maximalones,theprimarydifferencebetweendifferentstrategiesisnotoneofthesizeofthecandidatespace,butthatofeffectivememorymanagementoftheprojectedtransactions.Thisisbecausethesizeofthecandidatespaceisde“nedbythesizeoftheenumerationtree,whichis“xed,andisagnostictothestrategyusedfortreeexploration.Ontheotherhand,memorymanagementofprojectedtransactionsiseasierwiththedepth-“rststrategybecauseoneonlyneedstomaintainasmallnumberofprojectedtransactionsetsalongthedepthofthetree.ThenotionofdatabaseprojectioniscommontoTreeProjectionFP-growth,andhelpsreducethecountingworkbyrestrictingthesizeofthedatabaseusedforsupportcounting.TreeProjectionwasdevelopedindependentlyfromFP-growth.WhiletheFP-growthpaperprovidesabriefdiscussionofTreeProjection,thischapterwillprovideamoredetaileddiscussionofthesimilaritiesanddifferencesbetweenthetwomethods.Onemajordifferencebetweenthetwomethodsisthattheinternalrepresentationofthecorrespondingprojecteddatabasesisdifferentinthetwocases.ThebasicdatabaseprojectionapproachisverysimilarinbothcasesofTreeProjec-FP-growth.Animportantobservationisthatifatransactionisnotrelevantforcountingatagivennodeintheenumerationtree,thenitwillnotberelevantforcountinginanydescendentofthatnode.Therefore,onlythosetransactionsareretainedthatcontainallitemsinforcountingatthenodeintheprojectedtrans-actions.Notethatthissetstrictlyreducesaswemovetolowerlevelsofthetree,andthesetofrelevanttransactionsatthelowerleveloftheenumerationtreeisasubsetofthesetatahigherlevel.Furthermore,onlythepresenceofitemscorrespondingtothecandidateextensionsofanodearerelevantforcountingatanyofthesubtreesrooted 2FrequentPatternMiningAlgorithms:ASurveyFig.2.6Enumerationtreeexploration 
at that node. Therefore, the database is also projected in terms of attributes, in which only items which are candidate extensions at a node are retained. The candidate set C(P) of item extensions of node P is a very small subset of the universe of items at lower levels of the enumeration tree. In fact, even the items in the node P need not be retained explicitly in the transaction, because they are known to always be present in all the selected transactions based on the first condition. This projection process is performed recursively in top-down fashion down the enumeration tree for counting purposes, where lower level nodes inherit the projections from higher level nodes and add one additional item to the projection at each level. The idea of this inheritance-based approach is that the projected database remembers the counting work done at higher levels of the enumeration tree by (successively) removing irrelevant transactions and irrelevant items at each level of the projection. Such an approach works efficiently because it never repeats the counting work which has already been done at the higher levels. Thus, the primary savings in the strategy arise from avoiding repetitive and wasteful counting.

A bare-bones depth-first version of TreeProjection, which is similar to DepthProject but without maximal pruning, is described in Fig. 2.6. A more detailed description with maximal pruning and other optimizations is provided later in this chapter. Because the algorithm is described recursively, the current prefix P (node of the lexicographic tree) being extended is one of the arguments to the algorithm. In the initial call, the value of P is null because one intends to determine all frequent descendants at the root of the lexicographic tree. This algorithm recursively extends frequent prefixes and maintains only the transaction database relevant to the prefix. The frequent prefixes are extended by determining the items i that are frequent in the projected database. Then the itemset P ∪ {i} is reported. The extension of the frequent prefix can be viewed as a recursive call at a node of the enumeration tree. Thus, at a given enumeration tree node, one now has a completely independent problem of extending the prefix P ∪ {i} with the projected database that is relevant to all descendants of that node. The conditional database refers to the subset of the original transaction database corresponding to transactions containing item i. Furthermore, the item i and any item occurring lexicographically earlier than it are not retained in the database because these items are not relevant to counting the extensions of P ∪ {i}. This independent problem is similar in structure to the original problem, and can be solved recursively.

Although it is natural to use recursion for the depth-first versions of TreeProjection, the breadth-first versions are not defined recursively. Nevertheless, the breadth-first versions explore a pattern space of the same size as the depth-first versions, and are no different either in terms of the tree size or the counting work done over the entire algorithm. The major challenge in the breadth-first version is in maintaining the projected transactions along the breadth of the tree, which is storage-intensive. It is shown in [5] how many of these issues can be resolved with the use of a combination of exploration strategies for tree growth and counting. Furthermore, it is also shown in [5] how breadth-first and depth-first methods may be combined.

Note that this concept of database projection is common between TreeProjection and FP-growth, although there are some differences in the internal representation of the projected databases. The aforementioned description is designed for discovering all patterns, and does not incorporate maximal pattern pruning. When generating the itemsets, the main advantage of the depth-first strategy over the breadth-first strategy is that it is less memory intensive. This is because one does not have to handle the large number of candidates along the breadth of the enumeration tree at any point in the course of algorithm execution when combined with counting data structures. The overall size of the candidate space is fixed, and defined by the size of the enumeration tree. Therefore, over the entire execution of the algorithm, there is no difference between the two strategies in terms of search space size, beyond memory optimization.

Projection-based algorithms, such as TreeProjection, can be implemented either recursively or non-recursively. Depth-first variations of projection strategies, such as DepthProject and FP-growth, are generally implemented recursively, in which a particular prefix (or suffix) of frequent items is grown recursively (see Fig. 2.6). For recursive variations, the structure and
size of the recursion tree is the same as the enumeration tree. Non-recursive variations of TreeProjection methods directly present the projection-based algorithms in terms of the enumeration tree by storing projected transactions at the nodes in the enumeration tree. Describing projection strategies directly in terms of the enumeration tree is helpful, because one can use the enumeration tree explicitly to optimize the projection. For example, one does not need to project at every node of the enumeration tree, but project only when the size of the database reduces by a particular factor with respect to the nearest ancestor node where the last projection was stored. Such optimizations can reduce the space overhead of repeated elements in the projected databases at different levels of the enumeration (recursion) tree. It has been shown how to use this optimization in different variations of TreeProjection. Furthermore, breadth-first variations of the strategy are naturally defined non-recursively in terms of the enumeration tree. The recursive depth-first versions may be viewed either as divide-and-conquer strategies (because they recursively solve a set of smaller subproblems), or as projection-based counting reuse strategies. The notion of projection-based counting reuse clearly describes how computational savings are achieved in both versions of the algorithm.
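The recursive, projection-based counting reuse described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pseudocode of Fig. 2.6: the function name `project_and_mine`, the list-of-lists database layout, and the sample database (reconstructed here from the tidlists of Table 2.2) are all hypothetical choices made for the example.

```python
from collections import Counter

# Hypothetical sample database, reconstructed from the tidlists of Table 2.2.
TRANSACTIONS = [
    ["a", "b", "c", "d", "e"],
    ["a", "b", "c", "d", "f", "h"],
    ["a", "f", "g"],
    ["b", "e", "f", "g"],
    ["a", "b", "c", "d", "e", "h"],
]


def project_and_mine(transactions, min_support, prefix=()):
    """Yield (itemset, support) for every frequent extension of `prefix`.

    `transactions` is the database already projected onto `prefix`: it holds
    only transactions that contained the prefix, restricted to candidate
    extension items.
    """
    # Count each item once per projected transaction (items are distinct).
    counts = Counter(item for t in transactions for item in t)
    frequent = sorted(i for i, c in counts.items() if c >= min_support)

    for item in frequent:
        pattern = prefix + (item,)
        yield pattern, counts[item]

        # Transaction-wise projection: keep transactions containing `item`.
        # Item-wise projection: keep only frequent items that are
        # lexicographically later than `item`.
        later = {j for j in frequent if j > item}
        projected = [
            [j for j in t if j in later] for t in transactions if item in t
        ]
        projected = [t for t in projected if t]  # drop emptied transactions

        # Independent subproblem on the smaller, projected database.
        yield from project_and_mine(projected, min_support, pattern)


frequent_patterns = dict(project_and_mine(TRANSACTIONS, min_support=3))
```

Each recursive call sees only the transactions and items that survived the projections of its ancestors, which is the sense in which counting work is inherited rather than repeated. A production implementation would also project only at selected nodes, as discussed above, to limit the space overhead.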
When generating patterns, the depth-first strategy has clear advantages in terms of pruning as well. We refer the reader to a detailed description of the DepthProject algorithm, described later in this chapter. That description shows how several specialized pruning techniques are enabled by the depth-first strategy for maximal pattern mining. The TreeProjection algorithm has also been generalized to sequential pattern mining [31]. There are many different types of data structures that may be used in projection-style algorithms. The choice of data structure is sensitive to the data set. Two common choices that are used with the TreeProjection family of algorithms are as follows:

• Arrays: In this case, the projected database is maintained as a 2-dimensional array. One of the dimensions of the array is equal to the number of relevant transactions and the other dimension is equal to the number of relevant items in the projected database. Both dimensions of the projected database reduce from the top level to lower levels of the enumeration tree with successive projection.

• Bit strings: In this case, the projected database is maintained as a 0-1 bit string whose width is fixed to the total number of frequent 1-items, but the number of projected transactions reduces with successive projection. Such an approach loses the power of item-wise projection, but this is balanced by the fact that the bit strings can be used more efficiently for counting operations. Assume that each transaction contains n bits, and can therefore be expressed in the form of n/8 bytes. Each byte of the transaction contains the information about the presence or absence of eight items, and the integer value of the corresponding bit string can take on any value from 0 to 2^8 − 1 = 255. Correspondingly, for each byte of the (projected) transaction at a node, 256 counters are maintained and a value of 1 is added to the counter corresponding to the integer value of that transaction byte. This process is repeated for each transaction in the projected database at the node. Therefore, at the end of this process, one has 256 counts for each byte position. At this point, a postprocessing phase is initiated in which the support of an item is determined by adding the counts of the 128 counters whose index has a value of 1 in the bit for that item. Thus, the second phase requires only 128 operations per item, and is independent of database size. The first phase (which is the bottleneck) is the improvement over the naive counting method because it performs only one operation for each byte in the transaction, which contains eight items. Thus, the method would be a factor of eight faster than the naive counting technique, which would need to scan the entire bit string. Projection is also very efficient in the bit string representation with simple AND operations.

The major problem with fixed-width bit strings is that they are not efficient representations at lower levels of the enumeration tree, at which only a small number of items are relevant, and therefore most entries in these bit strings are 0. One approach to speed this up is to perform the item-wise projection only at selected nodes in the tree, when the reduction in the number of items from the last ancestor at which the item-wise projection was performed reaches a particular multiplicative factor. At this point, a shorter bit string is used for representation for the descendants at that node, until the width of the bit string is reduced even further by the same multiplicative factor. This ensures that the bit string representations are not sparse and wasteful.

The key issue here is that different representations provide different tradeoffs in terms of memory management and efficiency. Later in this chapter, an approach called FP-growth will be discussed which uses the trie data structure to achieve compression of projected transactions for better memory management.

3.3 Vertical Mining Algorithms

The vertical pattern mining algorithms use a vertical representation of the transaction database to enable more efficient counting. The basic idea of the vertical representation is that one can express the transaction database as an inverted list. In other words, for each item, one can have a list of the identifiers of the transactions that contain it. This is referred to as a tidset or tidlist. An example of a vertical representation of the transactions in Table 2.1 is illustrated in Table 2.2.

Table 2.2 Vertical representation of transactions. Note that the support of itemset ab can be computed as the length of the intersection of the tidlists of a and b

Item    tidlist
a       1, 2, 3, 5
b       1, 2, 4, 5
c       1, 2, 5
d       1, 2, 5
e       1, 4, 5
f       2, 3, 4
g       3, 4
h       2, 5
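As a small illustration of the inverted-list idea, the following Python sketch converts a horizontal database into tidlists. Table 2.1 is not reproduced in this section, so the horizontal database `HORIZONTAL` is an assumption: it is reconstructed from the tidlists of Table 2.2. The function name `to_vertical` is likewise hypothetical.

```python
from collections import defaultdict

# Assumed horizontal database (transaction id -> items), reconstructed
# from the tidlists of Table 2.2.
HORIZONTAL = {1: "abcde", 2: "abcdfh", 3: "afg", 4: "befg", 5: "abcdeh"}


def to_vertical(horizontal):
    """Invert the database: map each item to the set of tids containing it."""
    tidlists = defaultdict(set)
    for tid, items in horizontal.items():
        for item in items:
            tidlists[item].add(tid)
    return dict(tidlists)


VERTICAL = to_vertical(HORIZONTAL)

# Support of itemset ab = size of the intersection of the two tidlists.
support_ab = len(VERTICAL["a"] & VERTICAL["b"])
```

Sets are used for the tidlists so that intersection is a single operation. Sorted arrays with a merge-style intersection are a common alternative when memory layout matters.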
The key idea in vertical pattern mining algorithms is that the support of k-patterns can be computed by intersection of the underlying tidlists. There are two different ways in which this can be done.

• The support of a k-itemset can be computed as a k-way set intersection of the lists of the individual items.
• The support of a k-itemset can be computed as an intersection of the tidlists of two (k − 1)-itemsets that join to that k-itemset.

The latter approach is more efficient. The credit for both the notion of vertical tidlists and the advantages of recursive intersection of tidlists is shared by the Monet [56] and the Partition [57] algorithms. Not all vertical pattern mining algorithms use an enumeration tree concept to describe the algorithm. Many of the algorithms directly use joins to generate a (k + 1)-candidate pattern from a frequent k-pattern, though even a join-based algorithm, such as Apriori, can be explained in terms of an enumeration tree. Many of the later variations of vertical methods use an enumeration tree concept to explore the lattice of itemsets more carefully and realize the full power of the vertical approach. The individual ensemble component of Savasere et al.'s [57] Partition algorithm is the progenitor of all vertical pattern mining algorithms today, and the original Eclat algorithm is a memory-optimized and candidate-partitioned version of this Apriori-like algorithm.

3.3.1 Eclat

Eclat uses a breadth-first approach like Savasere et al.'s algorithm [57] on lattice partitions, after partitioning the candidate set into disjoint groups, using a candidate partitioning approach similar to earlier parallel versions of the Apriori algorithm. The Eclat [71] algorithm is best described with the concept of an enumeration tree because of the wide variation in the different strategies used by the algorithm. An important contribution of Eclat [71] is to recognize the earlier pioneering work of the Monet and Partition algorithms [56, 57] on recursive intersection of tidlists, and to propose many efficient variants of this paradigm.
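The two ways of computing support by tidlist intersection, listed at the start of Sect. 3.3, can be contrasted with a short Python sketch. The tidlists are those of Table 2.2; the function names `tidlist_k_way` and `tidlist_by_join` are hypothetical and chosen only for this illustration.

```python
from functools import reduce

# Tidlists of the individual items, as given in Table 2.2.
TIDLISTS = {
    "a": {1, 2, 3, 5}, "b": {1, 2, 4, 5}, "c": {1, 2, 5}, "d": {1, 2, 5},
    "e": {1, 4, 5}, "f": {2, 3, 4}, "g": {3, 4}, "h": {2, 5},
}


def tidlist_k_way(itemset):
    """First way: intersect the tidlists of all k individual items."""
    return reduce(set.intersection, (TIDLISTS[item] for item in itemset))


def tidlist_by_join(tidlist_x, tidlist_y):
    """Second way: intersect the tidlists of two (k-1)-itemsets that join
    to the k-itemset. The inputs are already computed, so work is reused."""
    return tidlist_x & tidlist_y


# 2-itemsets are computed once ...
tid_ab = tidlist_by_join(TIDLISTS["a"], TIDLISTS["b"])
tid_ac = tidlist_by_join(TIDLISTS["a"], TIDLISTS["c"])
# ... and reused: abc needs one intersection of two short lists.
tid_abc = tidlist_by_join(tid_ab, tid_ac)

support_abc = len(tid_abc)
```

Both routes give the same tidlist for abc. The second operates on lists that have already been shortened by earlier intersections, which is why recursive intersection tends to be cheaper as k grows.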
Different variations of Eclat explore the candidates with different strategies. The earliest description of Eclat may be found in [74]. A journal paper exploring different aspects of Eclat may be found in [71]. In the earliest versions of the work [74], a breadth-first strategy is used. The journal version in [71] also presents experimental results for only the breadth-first strategy, although the possibility of a depth-first strategy is mentioned in the paper. Therefore, the original Eclat algorithm should be considered a breadth-first algorithm. More recent depth-first versions of Eclat, such as dEclat, use recursive tidlist intersection with differencing [72], and realize the full benefit of the depth-first approach. The Eclat algorithm, as presented in [74], uses a levelwise strategy in which all (k + 1)-candidates within a lattice partition are generated from frequent k-patterns in level-wise fashion, as in Apriori. The tidlists are used to perform support counting. The frequent patterns are determined from these tidlists. At this point, a new levelwise phase is initiated for frequent patterns of size (k + 1).

Other variations and depth-first exploration strategies of Eclat, along with experimental results, are presented in later work such as dEclat [72]. The dEclat work in [72] presents some additional enhancements such as diffsets to improve counting. In this chapter, we present a simplified pseudo-code of this version of Eclat. The algorithm is presented in Fig. 2.8. The algorithm is structured as a recursive algorithm. A pattern set FP is part of the input, and is set to the set of all frequent 1-items at the top level call. Therefore, it may be assumed that, at the top level, the set of frequent 1-items and tidlists have already been computed, though this computation is not shown in the pseudocode. In each recursive call of Eclat, a new set of candidates FP_i is generated for every pattern (itemset) P_i, which extends the itemset by one unit. The support of a candidate is determined with the use of tidlist intersection. Finally, if P_i is frequent, it is added to a pattern set FP_i for the next level.
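The recursive structure just described can be sketched in Python as follows. This is a simplified illustration written for this text, not the pseudocode of Fig. 2.8: the function name `eclat`, the use of strings for itemsets, and the output dictionary are assumptions, and diffsets are not used. The tidlists are those of Table 2.2, with an assumed minimum support of 3.

```python
# Tidlists of the individual items, as given in Table 2.2.
TIDLISTS = {
    "a": {1, 2, 3, 5}, "b": {1, 2, 4, 5}, "c": {1, 2, 5}, "d": {1, 2, 5},
    "e": {1, 4, 5}, "f": {2, 3, 4}, "g": {3, 4}, "h": {2, 5},
}
MIN_SUPPORT = 3  # assumed threshold for the example


def eclat(patterns, min_support, found):
    """Depth-first Eclat sketch.

    `patterns` is a list of (itemset, tidlist) pairs that share a common
    prefix and are all frequent, in lexicographic order.
    """
    for index, (itemset_i, tids_i) in enumerate(patterns):
        found[itemset_i] = len(tids_i)
        # Extend itemset_i by joining it with each later sibling.
        extensions = []
        for itemset_j, tids_j in patterns[index + 1:]:
            tids = tids_i & tids_j  # support via tidlist intersection
            if len(tids) >= min_support:
                extensions.append((itemset_i + itemset_j[-1], tids))
        if extensions:
            eclat(extensions, min_support, found)
    return found


# Top-level call: frequent 1-items with their tidlists, computed beforehand.
frequent_items = [
    (item, TIDLISTS[item])
    for item in sorted(TIDLISTS)
    if len(TIDLISTS[item]) >= MIN_SUPPORT
]
found = eclat(frequent_items, MIN_SUPPORT, {})
```

Only frequent siblings are carried into the next level, so every intersection is performed between two lists that are already known to be at least `MIN_SUPPORT` long.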
Figure 2.7 illustrates the itemset generation tree with support computation by tidlist intersection for the sample database from Table 2.1. The corresponding tidlists in the tree are also illustrated. All infrequent itemsets in each level are denoted by dotted, bordered rectangles. For example, an itemset ab is generated by joining b to a. The tidlist of a is {1, 2, 3, 5}, and the tidlist of b is {1, 2, 4, 5}. We can determine the support of ab by intersecting the two tidlists to obtain the tidlist {1, 2, 5} of this candidate. Therefore, the support of ab is given by the length of this tidlist, which is 3.

Fig. 2.7 Execution of Eclat

Fig. 2.8 Eclat

Further gains may be obtained with the use of the notion of diffsets [72]. This approach realizes the true power of vertical pattern mining. The basic idea in diffsets is to maintain only the portion of the tidlists at a node that corresponds to the change in the inverted list from the parent node. Thus, the tidlists at a node can be reconstructed by examining the tidlists at the ancestors of a node in the tree. The major advantage of diffsets is that they save significant storage in terms of the size of the data structure required (Fig. 2.8).
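The diffset idea can be made concrete with a short Python sketch on the tidlists of Table 2.2. The function name `diffset` is hypothetical. The sibling rule used in the second half is our reading of the diffset recurrence and is verified here against a direct intersection rather than taken on trust.

```python
# Tidlists of selected items, as given in Table 2.2.
TIDLISTS = {
    "a": {1, 2, 3, 5}, "b": {1, 2, 4, 5}, "c": {1, 2, 5}, "e": {1, 4, 5},
}


def diffset(parent_tids, child_tids):
    """Tids that are in the parent's tidlist but lost at the child."""
    return parent_tids - child_tids


# Node ab, child of a: store only the change relative to the parent.
tids_ab = TIDLISTS["a"] & TIDLISTS["b"]        # {1, 2, 5}
diff_ab = diffset(TIDLISTS["a"], tids_ab)      # {3}
support_ab = len(TIDLISTS["a"]) - len(diff_ab)

# The full tidlist can be rebuilt from the ancestor when needed.
rebuilt_ab = TIDLISTS["a"] - diff_ab

# Deeper level: siblings bc and be under parent b, kept only as diffsets.
diff_bc = diffset(TIDLISTS["b"], TIDLISTS["b"] & TIDLISTS["c"])  # {4}
diff_be = diffset(TIDLISTS["b"], TIDLISTS["b"] & TIDLISTS["e"])  # {2}
support_bc = len(TIDLISTS["b"]) - len(diff_bc)

# Assumed sibling rule: the diffset of bce relative to bc is obtained
# from the two sibling diffsets alone, without touching any tidlist.
diff_bce = diff_be - diff_bc
support_bce = support_bc - len(diff_bce)

# Cross-check against a direct three-way intersection.
direct_bce = TIDLISTS["b"] & TIDLISTS["c"] & TIDLISTS["e"]
```

In this small example the diffsets hold one element each, whereas the corresponding tidlists hold three, which illustrates the storage savings on dense data.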
Fig. 2.9 Suffix-based pattern exploration

3.3.2 VIPER

The VIPER algorithm [58] uses a vertical approach to mining frequent patterns. The basic idea in the VIPER algorithm is to represent the vertical database in the form of compressed bit vectors that are also referred to as snakes. These snakes are then used for efficient counting of the frequent patterns. The compressed representation provides a number of optimization advantages that are leveraged by the algorithm. Intrinsically, VIPER is not very different from Eclat in terms of the basic counting approach. The major difference is in terms of the choice of the compressed bit vector representation, and the efficient handling of this representation. Details may be found in [58].

4 Recursive Suffix-Based Growth

In these algorithms, recursive suffix-based exploration of the patterns is performed. Note that in most frequent pattern mining algorithms, the enumeration tree (execution tree) of patterns explores the patterns in the form of a lexicographic tree of itemsets built on the prefixes. Suffix-based methods use a different convention in which the suffixes of frequent patterns are extended. As in all projection-based methods, one only needs to use the transaction database containing itemset P in order to count itemsets that have the suffix P. Itemsets are extended from the suffix backwards. In each iteration, the conditional transaction database (or projected database) of transactions containing the current suffix P being explored is an input to the algorithm. Furthermore, it is assumed that the conditional database contains only frequent extensions of P. For the top-level call, the value of P is null, and the frequent items are determined using a single preprocessing pass that is not shown in the pseudo-code. Because each item i in the conditional database is already known to be frequent, the frequent pattern obtained by adding i to the suffix P can be immediately generated for each item. The database is projected further to include only transactions containing i, and a recursive call is initiated with the extended suffix. The projected database corresponding to transactions containing i is determined. Infrequent items are removed from this projected database. Thus, the transactions are recursively projected to reflect the addition of an item in the suffix. Thus, this is a
smaller subproblem that can be solved recursively. The FP-growth approach uses the suffix-based pattern exploration, as illustrated in Fig. 2.9. In addition, the FP-growth approach uses an efficient data structure, known as the FP-Tree, to represent the conditional transaction database with the use of compressed prefixes. The FP-Tree will be discussed in more detail in a later section. The suffix in the top-level call to the algorithm is the null itemset.

Recursive suffix-based exploration of the pattern space is, in principle, no different from prefix-based exploration of the enumeration tree space with the ordering of the items reversed. In other words, by using a reverse ordering of items, suffix-based recursive pattern space exploration can be simulated with prefix-based enumeration tree exploration. Indeed, as discussed in the last section, prefix-based enumeration tree methods order items from the least frequent to the most frequent, whereas the suffix-based methods of this section order items from the most frequent to the least frequent, to account for this difference. Thus, suffix-based recursive growth has an execution tree that is identical in structure to a prefix-based enumeration tree. This is a difference only of convention, and it does not affect the pattern space that is explored.

It is instructive to compare the suffix-based exploration with the pseudocode of the prefix-based TreeProjection algorithm in Fig. 2.6. The two pseudocodes are structured differently because the initial pre-processing pass of removing infrequent items is not assumed in the TreeProjection algorithm. Therefore, in each recursive call of the prefix-based TreeProjection, frequent itemsets must be counted before they are reported. In suffix-based exploration, this step is done as a preprocessing step (for the top-level call) and just before the recursive call for deeper calls. Therefore, each recursive call always starts with a database of frequent items. This is, of course, a difference in terms of how the recursive calls are structured, but it is not a difference in terms of the basic search strategy, or the amount of overall computational work required, because infrequent items need to be removed in either case. A few other key differences are evident:

• TreeProjection uses database projections on top of a prefix-based enumeration tree.
Suffix-based recursive methods have a recursion tree whose structure is similar to an enumeration tree on the frequent suffixes instead of the prefixes. The removal of infrequent items from FP-growth is similar to determining which branches of the enumeration tree to extend further.

• The use of suffix-based exploration is a difference only of convention from prefix-based exploration. For example, after reversing the item order, one might implement FP-growth by growing patterns on the prefixes, but constructing a compressed FP-Tree on the suffixes. (The resulting FP-Tree will be a suffix-based trie.) The resulting exploration order and execution in the two different implementations of FP-growth will be identical, but the latter can be more easily related to traditional enumeration tree methods.

• Various database projection methods are different in terms of the specific data structures used for the projected database. The different variations of TreeProjection use arrays and bit strings to represent the projected database. The FP-growth method uses an FP-Tree. The FP-Tree will be discussed in the next section. Later variations of FP-Tree also use combinations of arrays and pointers to represent the projected database. Some variations, such as OpportuneProject [38], combine different data structures in an optimized way to obtain the best result.

• Suffix-based recursive growth is inherently defined as a depth-first strategy. On the other hand, as is evident from the discussion in [5], the specific choice of exploration strategy on the enumeration tree is orthogonal to the process of database projection. The overall size of the enumeration tree is the same, no matter how it is explored, unless maximal pattern pruning is used. Thus, TreeProjection explores a variety of strategies such as breadth-first and depth-first strategies, with no difference to the (overall) work required for counting. The major challenge with the breadth-first strategy is the simultaneous maintenance of projected transaction sets along the breadth of the tree. The issue of effective memory management of breadth-first strategies is discussed in [5], which shows how certain optimizations such as cache-blocking can improve the effectiveness in this case. Breadth-first strategies also allow certain kinds of pruning such as level-wise pruning.

• The major advantages of depth-first strategies arise in the context of maximal pattern mining. This is because a depth-first strategy discovers the maximal patterns very early, which can be used to prune the smaller non-maximal patterns. In this case, the size of the search space explored truly reduces because of a depth-first strategy. This issue is discussed in the section on maximal pattern mining. The advantages for maximal pattern mining were first proposed in the context of the DepthProject algorithm [4].

Next, we will describe the FP-Tree data structure that uses compressed representations of the transaction database for more efficient counting.

4.1 The FP-Growth Approach

The FP-growth approach combines suffix-based pattern exploration with a compressed representation of the projected database for more efficient counting. The prefix-based FP-Tree is a compressed representation of the database which is built by considering a fixed order among the items in an itemset [32]. This tree is used to represent the conditional transaction sets of Fig. 2.9. An FP-Tree may be viewed as a prefix-based trie data structure of the transaction database of frequent items. Just as each node in a trie is labeled with a symbol, a node in the FP-Tree is labeled with an item. In addition, the node holds the support of the itemset defined by the items of the nodes that are on the path from the root to that node. By consolidating the prefixes, one obtains compression. This is useful for effective memory management. On the other hand, the maintenance of counts and pointers with the prefixes is an
additional overhead. This results in a different set of trade-offs as compared to the array representation.

Fig. 2.10 FP-Tree construction (panels: add 1st transaction a, b, c, d, e; add 2nd transaction a, b, c, f, d; add 3rd transaction a, f; add 4th transaction b, f, e; add 5th transaction a, b, c, d, e; add pointers)

The initial FP-Tree is constructed as follows. We start with the empty FP-Tree FPT. Before constructing the FP-Tree, the database is scanned and infrequent items are removed. The frequent items are sorted in decreasing order of support. The initial construction of the FP-Tree is straightforward, and similar to how one might insert a string in a trie. For every insertion, the counts of the relevant nodes that are affected by the insertion are incremented by 1. If there has been any sharing of prefix between the current transaction being inserted and a previously inserted transaction, then the current transaction follows the same path up to the end of the common prefix. Beyond this common prefix, new nodes are inserted in the tree for the remaining items in the transaction, with support count initialized to 1. The above procedure ends when all transactions have been inserted.

To store the items in the final FP-Tree, a list structure called the header table is maintained. A chain of pointers threads through the occurrences of each item in the FP-Tree. Thus, this chain of pointers needs to be constructed in addition to the trie data structure. Each entry in this table stores the item label and a pointer to the node representing the leftmost occurrence of the item in the FP-Tree (first item in the pointer chain). The reason for maintaining these pointers is that it is possible to determine the conditional FP-Tree for an item by chasing the pointers for that item.

Fig. 2.11 FP-growth

An example of the initial construction of the FP-Tree data structure from a
database of five transactions is illustrated in Fig. 2.10. The ordering of the items, as reflected in the transactions of Fig. 2.10, is a, b, c, f, d, e. It is clear that a trie data structure is created, and the node counts are updated by the insertion of each transaction in the FP-Tree. Figure 2.10 also shows all the pointers between the different items. The sum of the counts on the nodes of an item's pointer path is the support of the item. For example, the item e has two nodes on its pointer path, corresponding to e:2 and e:1. By summing up these counts, a total count of three for the item e is obtained. It is not difficult to verify that three transactions contain the item e. This support is always at least the minimum support because a fully constructed FP-Tree (with pointers) contains only frequent items. The actual counting of the support of item-extensions and the removal of infrequent items must be done during conditional transaction database (and the relevant FP-Tree) creation. The pointer paths are not available during the FP-Tree creation process.

With this new compressed representation of the conditional transaction database of frequent items, one can directly extract the frequent patterns. The pseudo-code of the FP-growth algorithm is presented in Fig. 2.11. Although this pseudo-code looks much more complex than the earlier pseudocode of Fig. 2.9, the main difference is that more details of the data structure (FP-Tree), used to represent the conditional transaction sets, have been added.

The algorithm accepts an FP-Tree FPT, the current itemset suffix P, and the user-defined minimum support as input. The additional suffix P has been added to the parameter set to facilitate the recursive description. At the top-level call made by the user, the value of P is null. Furthermore, the conditional FP-Tree is constructed on a database of frequent items rather than all the items. This property is maintained across different recursive calls.

For an FP-Tree FPT, the conditional FP-Trees are built for each item i in FPT (which is already known to be frequent). The conditional FP-Trees are constructed by chasing pointers for each item in the FP-Tree. This yields all the conditional prefix
paths for the item i. The infrequent nodes from these paths are removed, and they are put together to create a conditional FP-Tree FPT_i. Because the infrequent items have already been removed from FPT_i, the new conditional FP-Tree also contains only frequent items. Therefore, in the next level recursive call, any item from FPT_i can be appended to P_i to generate another pattern. The supports of those patterns can also be reconstructed via pointer chasing during the process of reporting the patterns. Thus, the current pattern suffix P is extended with the frequent item i appended to the front. This extended suffix is denoted by P_i. The pattern P_i also needs to be reported as frequent. The resulting conditional FP-Tree FPT_i is the compressed database representation of the corresponding conditional database of Fig. 2.9 in the previous section. Thus, FPT_i is a smaller conditional tree that contains information relevant only to the extraction of various prefix paths relevant to different items that will extend the suffix P_i further in the backwards direction.

Note that infrequent items are removed from FPT_i during this step, which requires the support counting of all items in FPT_i. Because the pointers have not yet been constructed for FPT_i, the support of each item-extension of P_i corresponding to the items in FPT_i must be explicitly determined by locating each instance of an item in FPT_i. This is the primary computational bottleneck step. The removal of infrequent items from FPT_i may result in a different structure of the FP-Tree in the next step.

Finally, if the conditional FP-Tree FPT_i is not empty, the FP-growth method is called recursively with parameters corresponding to the conditional FP-Tree FPT_i, the extended suffix P_i, and the minimum support. Note that successive projected transaction databases (and corresponding conditional FP-Trees) in the recursion will be smaller because of the recursive projection. The base case of the recursion occurs when the entire FP-Tree is a single path. This is likely to occur when the projected transaction database becomes small enough. In that case, FP-growth determines all combinations of nodes on this path, appends the suffix to them, and reports them.

An example of how the conditional FP-Tree is created for a minimum support of 1 unit is illustrated in Fig. 2.12. Note that if the minimum support were 2, then the right branch would not be included in the conditional FP-Tree. In this case, the pointers for item e are chased in the FP-Tree to create the conditional prefix paths of the relevant conditional transaction database. This represents all transactions containing e. The counts on the prefix paths are re-adjusted because many branches are pruned. The removal of infrequent items and that of the item e might lead to a conditional FP-Tree that looks very different from the conditional prefix paths. These kinds of conditional FP-Trees need to be generated for each conditional frequent item, although only a single item has been shown for the sake of simplicity. Note that, in general, the pointers may need to be recreated every time a conditional FP-Tree is created.

Fig. 2.12 Generating a conditional FP-Tree by pointer chasing (panels: pointer chasing of e; removal of e)

4.2 Variations

As the database grows larger, the construction of the FP-Tree becomes challenging both in terms of runtime and space complexity. There have been many works [8, 24, 27, 29, 30, 36, 39, 55, 59, 61, 62] to tackle these challenges. These variations of the FP-growth method can be classified into two categories. Methods belonging to the first category design a memory-based mining process using a memory-resident data structure that holds a partitioned database. Methods belonging to the second category improve the efficiency of the FP-Tree representation. In this subsection, we will present these approaches briefly.

4.2.1 Memory-Resident Variations

In the following, a number of different memory-resident variations of the basic FP-growth idea will be described.

CT-PRO Algorithm In this work [62], the authors introduced a new FP-Tree-like data structure called the Compact FP-Tree (CFP-Tree) that holds the same information as the FP-Tree but with 50% less storage. They also designed a mining algorithm called CT-PRO which follows a non-recursive procedure, unlike FP-growth. As discussed earlier, during the mining process, FP-growth constructs many conditional FP-Trees, which becomes an overhead as the patterns get longer or the support gets lower. To overcome this problem, the CT-PRO algorithm divides the database into several disjoint projections, where each projection is represented as a CFP-Tree. Then a non-recursive mining process is executed over each projection independently. Significant modifications were made to the header table data structure. In the original FP-Tree, the nodes store the support and item label. However, in the CFP-Tree, item labels are mapped to an increasing sequence of integers that is actually the index of the header table. The header table of the CFP-Tree stores the support of each item. To compress the original FP-Tree, all identical subtrees are removed by accumulating them and storing the relevant information in the leftmost branch. The header table contains a pointer to each node on the leftmost branch of the CFP-Tree, as these nodes are roots of subtrees starting with different items.

The mining process starts from the pointers of the least frequent items in the header table. This prunes a large number of nodes at an early stage and shrinks the tree structure. By following the pointers to the same item, a projection of all transactions ending with the corresponding item is built. This projection is also represented as a CFP-Tree, called the local CFP-Tree. The local CFP-Tree is then traversed to extract the frequent patterns in the projection.

H-Mine Algorithm The authors in [54] proposed an efficient algorithm called H-Mine. It uses a memory-efficient hyper-structure called H-Struct. The fundamental strategy of H-Mine is to partition the database and mine each partition in memory. Finally, the results from different partitions are consolidated into global frequent patterns. An intelligent feature of H-Mine is that it can identify whether the database is dense or sparse, and it is able to make dynamic choices between different data structures based on this identification. More details may be found in Chap. 3 on pattern-growth methods.

4.2.2 Improved Data Structure Variations

In this section, several variations of the basic algorithm that improve the underlying data structure will be described.

Using Arrays A significant part of the mining time in FP-growth is spent on traversing the tree. To reduce this time, the authors in [29] designed an array-based implementation of FP-growth, named FP-growth*, which drastically reduces the traversal time of the mining algorithm. It uses the FP-Tree data structure in combination with an array-like data structure and it incorporates various optimization schemes. It should be pointed out that the TreeProjection fam
ilyofalgorithmsalsousesarrays,thoughtheoptimizationsusedarequitedifferent.Whentheinputdatabaseissparse,thearraybasedtechniqueperformswellbe-causethearraysavesthetraversaltimeforalltheitems;moreovertheinitializationofthenextlevelofFP-Treesiseasierusinganarray.Butincaseofdensedatabase,thetreebaserepresentationismorecompact.Todealwiththesituation,FP-growth*devisesamechanismtoidentifywhetherthedatabaseissparseornot.Todoso,growth*countsthenumberofnodesineachlevelofthetree.Basedonexperiments,theyfoundthatiftheupperquarterofthetreecontainslessthan15%ofthetotalnumberofnodes,thenthedatabaseismostlikelydense.Otherwise,itissparse.Ifthedatabaseturnsouttobesparse,FP-growth*allocatesanarrayforeachFP-Treeinthenextlevelofmining.ThenonordfpApproachThiswork[55]presentedanimprovedimplementationofthewellknownFP-growthalgorithmusinganef“cientFP-Treelikedatastructurethatallowsfasterallocation,traversalandoptionalprojection.Thetreenodesdonotstoretheirlabels(itemidenti“ers).Thereisnoconceptofheadertable.Thedatastructurestoreslessadministrativeinformationinthetreenodewhichallowtherecursivestepofminingwithoutrebuildingthetree. 
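The sparse/dense test that FP-growth* applies, as described above, can be sketched in a few lines. This is only an illustrative sketch and not the implementation from [29]: the function name `looks_dense`, the representation of the tree as a list of per-level node counts, and the reading of "upper quarter" as the top 25% of the levels (rounded down, but at least one level) are all assumptions made here.

```python
def looks_dense(nodes_per_level, level_fraction=0.25, node_fraction=0.15):
    """Heuristic density test in the spirit of FP-growth*.

    nodes_per_level[i] is the number of tree nodes at depth i + 1
    (the root is excluded). All names and defaults are hypothetical.
    """
    total = sum(nodes_per_level)
    if total == 0:
        return False  # an empty tree carries no evidence of density
    # Assumption: "upper quarter" means the top 25% of the levels.
    upper_levels = max(1, int(len(nodes_per_level) * level_fraction))
    upper_nodes = sum(nodes_per_level[:upper_levels])
    # Few nodes near the root relative to the whole tree suggests a dense database.
    return upper_nodes < node_fraction * total


# Tree that widens strongly with depth: the top 2 of 8 levels hold 5 of 160 nodes.
print(looks_dense([2, 3, 5, 10, 20, 30, 40, 50]))  # True  (treated as dense)
# Tree that is widest near the root: the top level holds 40 of 100 nodes.
print(looks_dense([40, 30, 20, 10]))               # False (treated as sparse)
```

Under this reading, a caller would allocate the per-tree arrays only when `looks_dense` returns `False`.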
Fig. 2.13 Frequent, maximal and closed itemsets

5 Maximal and Closed Frequent Itemsets

One of the major challenges of frequent itemset mining is that most of the itemsets mined are subsets of the set of single-length frequent items. Therefore, a significant amount of time is spent on counting redundant itemsets. One solution to this problem is to discover condensed representations of the frequent itemsets. Such representations summarize the properties of the set of itemsets completely or partially. A compact representation not only saves computational and memory resources but also paves a much easier way towards the knowledge discovery stage after mining. Another interesting observation by [53] was that, instead of mining the complete set of frequent itemsets and their associations, association mining only needs to find frequent closed itemsets and their corresponding rules. So, mining frequent closed itemsets can fulfill the objectives of mining all frequent itemsets, but with less redundancy and better efficiency and effectiveness in mining. In this section, we will discuss two types of condensed representation of itemsets: maximal and closed frequent itemsets.

5.1 Definitions

Maximal Frequent Itemset. Suppose $T$ is the transaction database, $I$ is the set of all items in the database, and $F$ is the set of all frequent itemsets. A frequent itemset $P \in F$ is called maximal if it has no frequent superset. Let $M$ be the set of all frequent maximal itemsets, which is denoted by

$M = \{P \mid P \in F \text{ and } \nexists\, Q \supset P \text{ such that } Q \in F\}$
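To make the definition concrete, the following minimal sketch enumerates all frequent itemsets by brute force and keeps those that have no frequent proper superset. The toy transactions, the function names and the brute-force enumeration are assumptions made purely for illustration; this is not the database of Table 2.1, and it is not how practical maximal-pattern miners work.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of frequent itemsets (tiny examples only)."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            support = sum(1 for t in transactions if cset <= t)
            if support >= min_support:
                frequent[cset] = support
    return frequent


def maximal_itemsets(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return {p for p in frequent if not any(p < q for q in frequent)}


# Hypothetical toy database with five transactions.
toy = [frozenset("abc"), frozenset("abc"), frozenset("ab"),
       frozenset("bd"), frozenset("bd")]
freq = frequent_itemsets(toy, min_support=2)
print(sorted("".join(sorted(p)) for p in maximal_itemsets(freq)))  # ['abc', 'bd']
```

Here nine itemsets are frequent at minimum support 2, but only `abc` and `bd` are maximal; every other frequent itemset is a subset of one of these two.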
For the toy transaction database in Table 2.1, the frequent maximal itemsets at minimum support 3 include abcd, as illustrated in Fig. 2.13. All the rectangles filled with grey color represent maximal frequent patterns. As we can see in Fig. 2.4, there are no frequent supersets of abcd.

Closed Frequent Itemset. The closure operator induces an equivalence relation on the power set of items, partitioning it into disjoint subsets called equivalence classes. The largest element, with respect to the number of items, in each equivalence class is called a closed itemset. A frequent itemset is closed if it is equal to its own closure. From the closure property, it can be said that an itemset and its closure have the same tidset. In simpler terms, an itemset is closed if it does not have any frequent superset with the same support. With $F$ denoting the set of all frequent itemsets, the set $C$ of closed frequent itemsets can be written as:

$C = \{P \mid P \in F \text{ and } \nexists\, Q \supset P \text{ such that } support(Q) = support(P)\}$

Because maximal itemsets have no frequent superset, they are vacuously closed frequent itemsets. Thus, all maximal patterns are closed. However, there is a key difference between mining maximal itemsets and closed itemsets. Mining maximal itemsets loses information about the support of the underlying itemsets. On the other hand, mining closed itemsets does not lose any information about the support. The support of the missing subsets can be derived from the closed frequent pattern database. One way of viewing closed frequent patterns is as the maximal patterns from each group of frequent patterns with the same support. Closed frequent itemsets are a condensed representation of frequent itemsets that is lossless.

For the toy transaction database of Table 2.1, the frequent closed patterns include abcd for a minimum support value of 3, as illustrated in Fig. 2.13. All the rectangles with dotted borders represent closed frequent patterns. The remaining nodes in the tree (not filled and without dotted borders) represent frequent itemsets.

5.2 Frequent Maximal Itemset Mining Algorithms

In this subsection, we will discuss some of the maximal frequent itemset mining algorithms.

5.2.1 MaxMiner Algorithm

The MaxMiner algorithm was the first algorithm that used a variety of optimizations to improve the effectiveness of tree explorations [10]. This algorithm is generally focussed on determining maximal patterns rather than all patterns. The author of [10] observed that it is usually sufficient to only report maximal patterns when frequent patterns are long. This is because of the combinatorial explosion in examining all subsets of patterns. Although the exploration of the tree is still done in breadth-first fashion, a number of optimizations are used to improve the efficiency of exploration:

• The concept of lookaheads is defined. Let $F(P)$ be the set of candidate items that might extend node $P$. Before counting, it is checked whether $P \cup F(P)$ is a subset of any of the frequent patterns found so far. If such is indeed the case, then it is known that the entire subtree rooted at $P$ is frequent, and it can be pruned from consideration (for maximal pattern mining). While counting the support of the individual item extensions of $P$, the support of $P \cup F(P)$ is also determined. If the set $P \cup F(P)$ is frequent, then it is known that all itemsets in the entire subtree rooted at that node are frequent. Therefore, the tree does not need to be explored further, and can be pruned.
• The support lower bounding trick discussed earlier can be used to quickly determine patterns which are frequent without explicit counting. The counts of extensions of nodes can be determined without counting in many cases, where the count does not change by extending an item.

It has been shown in [10] that these simple optimizations can improve over the Apriori algorithm by orders of magnitude.

5.2.2 DepthProject Algorithm

The DepthProject algorithm is based on the notion of the lexicographic tree, defined in [5]. Unlike TreeProjection, the approach aggressively explores the candidates in a depth-first strategy, both to ensure better pruning and faster counting. As in TreeProjection, the database is recursively projected down the lexicographic tree to ensure more efficient counting. This kind of projection ensures that the counting information for $k$-candidates is reused for $(k+1)$-candidates, as in the case of FP-growth.

For the case of the DepthProject method [4], the lexicographic tree is explored in depth-first order to maximize the advantage of lookaheads, in which entire subtrees can be pruned because it is known that all patterns in them are frequent. The overall pseudocode for the depth-first strategy is illustrated in Fig. 2.14. The pseudocodes for candidate generation and counting are not provided because they are similar to the previously discussed algorithms. However, one important distinction in counting is that projected databases are used for counting. This is similar to the FP-growth family of algorithms. Note that the recursive transaction projection is particularly effective with a depth-first strategy because a smaller number of projected databases need to be stored along a path in the tree, as compared to the breadth of the tree.

To reduce the overhead of counting long patterns, the notion of lookaheads is used. At any node $P$ of the tree, let $F(P)$ be its possible (candidate) item extensions. Then, it is checked whether $P \cup F(P)$ is frequent in two ways:

1. Before counting the support of the individual extensions of $P$, it is checked whether $P \cup F(P)$ occurs as a subset of a frequent itemset that has already been discovered earlier during depth-first exploration. If such is the case, then the entire subtree rooted at $P$ is pruned because it is known to be frequent and it is not a maximal pattern. This type of pruning is particularly effective with a depth-first strategy.
2. During support counting of the item extensions, the support of $P \cup F(P)$ is also determined. If, after support counting, $P \cup F(P)$ turns out to be frequent, then the entire subtree rooted at node $P$ can be pruned. Note that the projected database at node $P$ (as in TreeProjection) is used.

Although lookaheads are also used in the MaxMiner algorithm, it should be pointed out that the effectiveness of lookaheads is maximized with a depth-first strategy. This is especially true of the first of the two aforementioned strategies, in which it is checked whether $P \cup F(P)$ is a subset of an already existing frequent pattern. This is because a depth-first strategy tends to explore the itemsets in dictionary order. In dictionary order, maximal itemsets are usually explored much earlier than most of their subsets. For example, for a 10-itemset abcdefghij, only 9 of the 1024 subsets of the itemset will be explored before exploring the itemset abcdefghij. These 9 itemsets are the immediate prefixes of the itemset. When the longer itemsets are explored early, they become available to prune shorter itemsets.

The following information is stored at each node during the process of construction of the lexicographic tree:

1. The itemset $P$ at that node.
2. The set of lexicographic tree extensions at that node.
3. A pointer to the projected transaction set $T(Q)$, where $Q$ is some ancestor of $P$ (including $P$ itself). The root of the tree points to the entire transaction database.
4. A bitvector containing the information about which transactions contain the itemset for node $P$ as a subset. The length of this bitvector is equal to the total number of transactions in $T(Q)$. The value of a bit for a transaction is equal to one if the itemset $P$ is a subset of the transaction. Otherwise it is equal to zero. Thus, the number of 1 bits is equal to the number of transactions in $T(Q)$ which project to $P$. The bitvectors are used to make the process of support counting more efficient.

After all the projected transactions at a given node have been identified, finding the subtree rooted at that node is a completely independent itemset generation problem with a substantially reduced transaction set. The number of transactions at a node is proportional to the support at that node.

The description in Fig. 2.14 shows how the depth-first creation of the lexicographic tree is performed. The algorithm is described recursively, so that the call from each node is a completely independent itemset generation problem that finds all frequent itemsets that are descendants of a node. There are three parameters to the algorithm: a pointer to the database, the itemset node $P$, and the bitvector. The bitvector contains one bit for each transaction in the database, and indicates whether or not the transaction should be used in finding the frequent extensions of $P$. A bit for a transaction is one if the itemset at that node is a subset of the corresponding transaction. The first call to the algorithm is from the root node, with the database parameter set to the entire transaction database. Because each transaction in the database is relevant to perform the counting, the bitvector consists of all "one" values.

Fig. 2.14 The depth first strategy

One property
of the DepthProject algorithm is that the projection is performed only when the transaction database reduces by a certain size. This is the ProjectionCondition in Fig. 2.14.

Most of the nodes in the lexicographic tree correspond to the lower levels. Thus, the counting times at these levels account for most of the CPU time of the algorithm. For these levels, a strategy called bucketing can substantially improve the counting times. The idea is to change the counting technique at a node $P$ in the lexicographic tree if $|F(P)|$, the number of candidate item extensions of $P$, is less than a certain value. In this case, an upper bound on the number of distinct projected transactions is $2^{|F(P)|}$. Thus, for example, when $|F(P)|$ is nine, there are only 512 distinct projected transactions at the node. Clearly, this is because the projected database contains several repetitions of the same (projected) transaction. The fact that the number of distinct transactions in the projected database is small can be exploited to yield substantially more efficient counting algorithms. The aim is to count the support for the entire subtree rooted at $P$ with a quick pass through the data, and an additional postprocessing phase which is independent of database size. The process of performing bucket counting consists of two phases:

1. In the first phase, the counts of each distinct transaction present in the projected database are determined. This can be accomplished easily by maintaining $2^{|F(P)|}$ buckets or counters, scanning the transactions one by one, and adding counts to the buckets. The time for performing this set of operations is linear in the number of (projected) database transactions.
2. In the second phase, the counts of the $2^{|F(P)|}$ distinct transactions are used to determine the aggregate support counts for each itemset. In general, the support count of an itemset may be obtained by adding the counts of all the supersets of that itemset to it. A skillful algorithm (from the efficiency perspective) for performing these operations is illustrated in Fig. 2.15.

Fig. 2.15 Aggregating bucket

Consider a string composed of 0, 1, and * that refers to an itemset in which the positions with 0 and 1 are fixed to those values (corresponding to absence or presence of items), while a position with a * is a "don't care". Thus, all itemsets can be expressed in terms of 1 and *, because itemsets are traditionally defined with respect to presence of items. Consider, for example, the case when $|F(P)| = 4$, and there are four items, numbered 1, 2, 3, 4. An itemset containing items 2 and 4 is denoted by *1*1. We start off with the information on $2^4 = 16$ bit strings which are composed of 0 and 1. These represent all possible distinct transactions. The algorithm aggregates the counts in $|F(P)|$ iterations. The count for a string with a "*" in a particular position may be obtained by adding the counts for the strings with a 0 and a 1 in that position. For example, the count for the string *1*1 may be expressed as the sum of the counts of the strings 01*1 and 11*1.

The procedure in Fig. 2.15 works by starting with the counts of the 0-1 strings, and then converts them to strings with 1 and *. The algorithm requires $|F(P)|$ iterations. In the $i$-th iteration, it increases the counts of all those buckets with a 0 in the $i$-th bit, so that the count now corresponds to the case when that bucket contains a * in that position. This can be achieved by adding, to the count of each bucket with a 0 in the $i$-th position, the count of the bucket with a 1 in that position and all other bits having the same value. For example, the count of the string 0*1* is obtained by adding the counts of the buckets 001* and 011*. In Fig. 2.15, this is achieved by adding the count of one bucket to that of its partner bucket.

The second phase of the bucketing operation requires $|F(P)|$ iterations, and each iteration requires $2^{|F(P)|}$ operations. Therefore, the total time required by the method is proportional to $2^{|F(P)|} \cdot |F(P)|$. When $|F(P)|$ is sufficiently small, the time required by the second phase of postprocessing is small compared to the first phase, whereas the first phase is essentially proportional to reading the current projected database.

Fig. 2.16 Performing the second phase of bucketing

We have illustrated the second phase of bucketing by an example in which $|F(P)| = 3$. Figure 2.16 shows how the second phase of bucketing is efficiently performed. The exact strings and the corresponding counts in each of the 3 iterations are illustrated. In the first iteration, all those bit strings with 0 in the lowest order position have their counts added with the count of the bit string with a 1 in that position. Thus, $2^{|F(P)|-1}$ pairwise addition operations take place during this step. The same process is repeated two more times with the second and third order bits. At the end of three passes, each bucket contains the support count for the appropriate itemset, where each 0 for the itemset is replaced by a "don't care", which is represented by a *. Note that the number of transactions in this example is 27. This is represented by the entry for the bucket ***. Only two transactions contain all three items, which is represented by the bucket 111.

The projection-based methods were shown to have an order of magnitude improvement over the MaxMiner algorithm. The depth-first approach has subsequently been used in the context of many tree-based algorithms. Other examples of such algorithms include those in [17, 18, 14]. Among these, the MAFIA algorithm [14] is discussed in some detail in the next subsection. An approach which varies on the projection methodology, and uses opportunistic projection, is discussed in [38]. This algorithm opportunistically chooses between array-based and tree-based representations to represent projected transaction subsets. Such an approach has been shown to be more efficient than many state-of-the-art methods such as the FP-growth method. Other variations of tree-based algorithms have also been proposed [70] that use different strategies in tree exploration.

5.2.3 MAFIA Algorithm

The MAFIA algorithm proposed in [14] shares a number of similarities with the DepthProject approach, though it uses a bitmap-based approach for counting, rather than the use of a projected transaction database. In the bitmap-based approach, a sequence of bits is maintained for each itemset, in which each bit indicates whether or not the corresponding transaction contains that itemset. Sparse representations (such as a list of transaction identifiers) may also be used when the fraction of transactions containing the itemset is small. Note that such an approach may be considered a special case of database projection [5], in which vertical projection is used but horizontal projection is not. This has the advantage of requiring less memory, but it reuses a smaller fraction of the counting information from higher level nodes. A number of other pruning optimizations have also been proposed in this work that further improve the effectiveness of the algorithm. In particular, it has been pointed out that when
the support of the extension of a node is the same as that of its parent, then that subtree can be pruned away, because the counts of all the itemsets in the subtree can be derived from those of other itemsets in the data. This is the same as the support lower bounding trick discussed in Sect. 2.4, and also used in MaxMiner for pruning. Thus, the approach in [14] uses many of the same strategies used in TreeProjection, but within a different combination, and with some variations in the specific implementation.

5.2.4 GenMax

Like MAFIA, GenMax uses the vertical representation to speed up counting. Specifically, tid-lists are used by GenMax to speed up the counting approach. In particular, the more recent notion of diffsets [72] was used, together with a depth-first exploration strategy. An approach known as successive focussing was used to further improve the efficiency of the algorithm. The details of the approach may be found in [28].

5.3 Frequent Closed Itemset Mining Algorithms

Several frequent closed itemset mining algorithms [41, 42, 51–53, 64, 66–69, 73] exist to date. Most of the maximal and closed pattern mining algorithms are based on different variations of the non-maximal pattern mining algorithms. Typically, pruning strategies are incorporated within the non-maximal pattern mining algorithms to yield more efficient algorithms.

5.3.1 Close

In this algorithm [52], the authors apply Apriori-based pattern generation over the closed itemset search space. The use of the closed itemset lattice (search space) significantly reduces the overall search space of the algorithm. Close operates in an iterative manner. Each iteration consists of three phases. First, the closure function is applied for obtaining the candidate closed itemsets and their support. Next, the obtained set of candidate closed itemsets is tested against the minimum support constraint. If they succeed, the candidates are marked as frequent closed itemsets. Finally, the same procedure is initiated to generate the next level of candidate closed itemsets. This process continues until all frequent closed itemsets have been generated.

5.3.2 CHARM

CHARM [73] is a frequent closed itemset mining algorithm that takes advantage of the vertical representation of the database, as in [71], for an efficient closure checking operation. For pruning the search space, CHARM uses the following three properties. Suppose that, for itemsets $X_1$ and $X_2$, tidset($X_1$) = tidset($X_2$); then it replaces every occurrence of $X_1$ with $X_1 \cup X_2$ and prunes the whole branch under $X_2$. On the other hand, if tidset($X_1$) $\subset$ tidset($X_2$), it replaces every occurrence of $X_1$ with $X_1 \cup X_2$ but does not prune the branch under $X_2$. Finally, if tidset($X_1$) $\neq$ tidset($X_2$) and neither tidset contains the other, none of the aforementioned prunings can be applied. The initial call of CHARM takes the set of single-length frequent items and the minimum support as input. As a first step, it sorts this set by increasing order of support of the items. For each item, CHARM tries to extend it by another item from the same set and applies the three conditions for pruning. If the itemset newly created by extension is frequent, CHARM performs closure-checking to identify whether the itemset is closed. CHARM also updates the set accordingly. In other words, it replaces the itemset with its extension if the corresponding pruning condition is met. If the set is not empty, then CHARM is called recursively.

5.3.3 CLOSET and CLOSET+

The CLOSET [53] and CLOSET+ [69] frequent closed itemset mining algorithms are inspired by the FP-growth method. The CLOSET algorithm makes use of the principles of the FP-Tree data structure to avoid the candidate generation step during the process of mining frequent closed itemsets. This work introduces a technique, referred to as single prefix path compression, that quickly assists the mining process. CLOSET also applies partition-based projection mechanisms for better scalability. The mining procedure of CLOSET follows the FP-growth algorithm. However, the algorithm is able to extract only the closed patterns by careful book-keeping. CLOSET treats items appearing in every transaction of the conditional database specially. For example, if $Q$ is the set of items that appear in every transaction of the $P$-conditional database, then $P \cup Q$ creates a frequent closed itemset if it is not a proper subset of any frequent closed itemset with equal support. CLOSET also prunes the search space. For example, if $P$ and $Q$ are frequent itemsets with equal support, where $Q$ is also a closed itemset and $P \subset Q$, then it does not mine the conditional database of $P$ because the latter will not produce any frequent closed itemsets.

CLOSET+ is a follow-up work after CLOSET by the same group of authors. CLOSET+ attempts to design the most optimized frequent closed itemset mining algorithm by finding the best trade-off between depth-first search versus breadth-first search, vertical formats versus horizontal formats, tree structure versus other data structures, top-down versus bottom-up traversal, and pseudo projection versus physical projection of the conditional database. CLOSET+ keeps track of the unpromising prefix itemsets for generating potential closed frequent itemsets and prunes the search space by deleting them. CLOSET+ also applies "item merging" and "sub-itemset" based pruning. To save the memory of the closure checking operation, CLOSET+ uses the combination of the 2-level hash-indexed tree based method and the pseudo-projection based upward checking method. Interested readers are encouraged to refer to [69] for more details.

5.3.4 DCI_CLOSED

DCI_CLOSED [41, 42] uses a bitwise vertical representation of the input database. DCI_CLOSED can be executed independently on each partition of the database in any order and, thus, also in parallel. DCI_CLOSED is designed to improve memory-efficiency by avoiding the storage of duplicate closed itemsets. DCI_CLOSED designs a novel strategy for searching the lattice that can detect and discard duplicate closed patterns on the fly. Using the concept of order-preserving generators of frequent closed itemsets, a new visitation scheme of the search space is introduced. Such a visitation scheme results in a disjoint subdivision of the search space. This also facilitates parallelism. DCI_CLOSED applies several optimization tricks to improve execution time, such as the bitwise intersection of tidsets to compute support and closure. Where possible, it reuses previously computed intersections to avoid redundant computations.

6 Other Optimizations and Variations

In this section, a number of other optimizations and variations of frequent pattern mining algorithms will be discussed. Many of these methods are discussed in detail in other chapters of this book, and therefore they will be discussed only briefly here.

6.1 Row Enumeration Methods

Not all frequent pattern mining algorithms follow the fundamental steps of the baseline algorithm; there exist a number of special cases for which specialized frequent pattern mining algorithms have been designed. An interesting case is that of micro-array data sets, in which the number of columns is very large but the number of rows is not. In such cases, a method called row-enumeration is used [22, 23, 40, 48, 49] instead of the usual column enumeration, in which combinations of rows are examined during the search process. There are two categories of row enumeration algorithms. Algorithms in one category perform a bottom-up [22, 23, 48] search over the row enumeration tree, whereas algorithms in the other category use a top-down [40] search strategy.

Row enumeration algorithms perform mining over the transpose of the transaction database. In the transposed database, each transaction id becomes an item and each item corresponds to a transaction. Mining over the transposed database is basically a bottom-up search for frequent patterns by enumeration of row sets. However, the bottom-up search strategy cannot take advantage of the user-specified minimum support threshold to effectively prune the search space, and therefore leads to longer running time and large memory overhead. As a solution, [40] introduces a top-down approach of mining using a novel row enumeration tree. This approach can take full advantage of the user-defined minimum support value and prune the search space efficiently, hence lowering the execution time. Note that both of the search strategies are applied over the transposed transaction database. Most of the algorithms developed using the row enumeration technique concentrate on mining frequent closed itemsets (explained in Sect. 5). The reason behind this is that, due to the nature of micro-array data, there exists a large amount of redundancy among the frequent patterns for a minimum support threshold, and closed patterns are capable of summarizing the whole database. These strategies will be discussed in detail in Chap. 4, and therefore only a brief discussion is provided here.
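The transposition that row enumeration methods work on can be illustrated with a small sketch. The function names and the toy data below are hypothetical and are not taken from any of the cited algorithms; the sketch only shows the transposed view (item to row ids) and the itemset shared by a given row set, which is the basic object examined when combinations of rows are enumerated.

```python
from collections import defaultdict


def transpose(transactions):
    """Map each item to the set of row (transaction) ids that contain it."""
    transposed = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            transposed[item].add(tid)
    return dict(transposed)


def rowset_pattern(transactions, rowset):
    """Items shared by every row in `rowset`.

    The resulting itemset is contained in every chosen row, so its
    support is at least len(rowset).
    """
    rows = [transactions[tid] for tid in rowset]
    if not rows:
        return set()
    return set(rows[0]).intersection(*rows[1:])


# Hypothetical data: three rows over the items a, b, c.
rows = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
print(transpose(rows) == {"a": {0, 1}, "b": {0, 2}, "c": {0, 1, 2}})  # True
print(sorted(rowset_pattern(rows, [0, 1])))                           # ['a', 'c']
```

With few rows and many columns, the number of row sets to enumerate is far smaller than the number of column (item) combinations, which is the motivation given above for this family of methods.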
6.2 Other Exploration Strategies

The advantage of tree-enumeration strategies is that they facilitate the exploration of candidates in the tree in an arbitrary order. A method known as Pincer-Search is proposed in [37] that combines top-down and bottom-up exploration in "pincer" fashion to avail of the advantages of both subset and superset pruning. Two primary observations are used in pincer search:

1. Any subset of a frequent itemset is frequent.
2. Any superset of an infrequent itemset is infrequent.

In pincer-search, top-down and bottom-up exploration are combined, and irrelevant itemsets are pruned using both observations. More details of this approach are discussed in [37]. Note that, for sparse transaction data, superset pruning is likely to be inefficient. Other recent methods have been proposed for long pattern mining with methods such as "leap search." These methods are discussed in the chapter on long pattern mining in this book.

7 Reducing the Number of Passes

A major challenge in frequent pattern mining arises when the data is disk resident. In such cases, it is desirable to use level-wise methods to ensure that random accesses to disk are minimized. This is the reason that most of the available algorithms use level-wise methods, which ensure that the number of passes over the database is bounded by the size of the longest pattern. Even so, this can be significant when many long patterns are present in the database. Therefore, a number of methods have been proposed in the literature to reduce the number of passes over the data. These methods could be used in the context of join-based algorithms, tree-based algorithms, or even other classes of frequent pattern mining methods. These correspond to combining the level-wise database passes, using sampling, and using a preprocess-once-query-many paradigm.

7.1 Combining Passes

The earliest work on combining passes was proposed in the original Apriori algorithm [1]. The key idea in combining passes is that it is possible to use joins to create candidates of higher order than $(k+1)$ in a single pass. For example, $(k+2)$-candidates can be created from $(k+1)$-candidates before actual validation of the $(k+1)$-candidates over the data. Then, the candidates of size $(k+1)$ and $(k+2)$ can be validated together in a single pass over the data. Although such an approach reduces the number of passes over the data, it has the downside that the number of spurious $(k+2)$-candidates will be far larger because the $(k+1)$-candidates were not confirmed to be frequent before they were joined. Therefore, the saving of database passes comes at an increased computational cost. Therefore, it was proposed in [1] that the approach should be used for later passes, when the number of candidates has already reduced significantly. This reduces the likelihood that the number of candidates blows up too much with this approach.

7.2 Sampling Tricks

A number of sampling tricks can be used to greatly improve the efficiency of the frequent pattern mining process. Most sampling methods require two passes over the data, the first of which is used for sampling. An interesting approach that uses two passes with the use of sampling is discussed in [65]. This method generates the approximately frequent patterns over the data using a sample. False negatives can be reduced by lowering the minimum support level appropriately, so that bounds can be defined on the likelihood of false negatives. False positives can be removed with the use of a second pass over the data. The major downside of the approach is that the reduction in the minimum support level needed to reduce the number of false negatives can be significant. This also reduces the computational efficiency of the approach. The method, however, requires only two passes over the data, where the first pass is used to create the sample, and the second pass is used to remove the false positives.

An interesting approach proposed in [57] divides the disk-resident database into smaller memory-resident partitions. For each partition, more efficient algorithms can be used, because of the memory-resident nature of the partition. It should be pointed out that each frequent pattern over the entire database will appear as a frequent pattern in at least one partition. Therefore, the union of the itemsets over the different partitions provides a superset of the true frequent patterns. A post-processing phase is then used to filter out the spurious itemsets, by counting this candidate set against the transaction database. As long as the partitions are reasonably large, the superset found approximates the true frequent patterns very well, and therefore the additional time spent in counting irrelevant candidates is relatively small. The main advantage of this approach is that it requires only two passes over the database. Therefore, such an approach is particularly effective when the data is resident on disk.

The Dynamic Itemset Counting (DIC) algorithm [15] divides the database into intervals, and generates longer candidates when it is known that the subsets of these candidates are already frequent. These are then validated over the database. Such an approach can reduce the number of passes over the data, because it implicitly combines the process of candidate generation and counting.

7.3 Online Association Rule Mining

In many applications, a user may wish to query the transaction data to find the association rules or the frequent patterns. In such cases, even at high support levels, it is often impossible to create the frequent patterns in online time because of the multiple passes required over a potentially large database. One of the earliest algorithms for online association rule mining was proposed in [6]. In this approach, an augmented lexicographic tree is stored either on disk or in main memory. The lexicographic tree is augmented with all the edges representing the subset relationships between itemsets, and is also referred to as the itemset lattice. For any given query, the itemset lattice may be traversed to determine the association rules. It has been shown in [6] that such an approach can also be used to determine the non-redundant association rules in the underlying data. A second method [40] uses a condensed frequent pattern tree (instead of a lattice) to pre-process and store the itemsets. This structure can be queried to provide online responses.

A very different approach for online association rule mining has been proposed in [34], in which the transaction database is processed in real time. In this case, an incremental approach is used to mine the transaction database. This is the Continuous Association Rule Mining Algorithm, which is referred to as CARMA. In this case, transactions are processed as they arrive, and candidate itemsets are generated on the fly by examining the subsets of each transaction. Clearly, the downside of such an approach is that it will create a lot more candidates than any of the offline algorithms which use level-wise methods to generate the candidates. This general characteristic is of course true of any algorithm which tries to
reduce the number of passes with approximate candidate generation. One interesting characteristic of the CARMA algorithm is that it allows the user to change the minimum support level during execution. In that case, the algorithm is guaranteed to have generated a superset of the true frequent itemsets in the data. If desired, a second pass over the data can be used to remove the spurious frequent itemsets.

Many streaming methods have also been proposed that use only one pass over the transaction data [19–21, 35, 43]. It should be pointed out that it is often difficult to find even 1-itemsets exactly over a data stream because of the one-pass constraint [21], when the number of distinct items is larger than the main memory availability. This is often true of $k$-itemsets as well, especially at low support levels. Furthermore, if the patterns in the stream change over time, then the frequent $k$-itemsets will change significantly as well. These methods therefore have the challenge of finding the frequent itemsets efficiently, maintaining them, and handling issues involving evolution of the data stream. Given the numerous challenges of pattern mining in this scenario, most of these methods find the frequent items approximately. These issues will be discussed in detail in Chap. 9 on streaming pattern mining algorithms.
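As a recap of the two-pass partitioning idea from Sect. 7.2, the following sketch mines each partition with a relative support threshold, takes the union of the locally frequent itemsets as the candidate set, and validates the candidates in one further pass. It is a minimal illustration under stated assumptions and not the algorithm of [57]: the function names are hypothetical, the per-partition miner is a brute-force enumeration, and the threshold is passed as an exact `Fraction` so that rounding cannot cause a locally frequent itemset to be missed.

```python
from fractions import Fraction
from itertools import combinations


def local_frequent(partition, min_fraction):
    """Brute-force frequent itemsets of one memory-resident partition."""
    items = sorted(set().union(*partition))
    found = set()
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            support = sum(1 for t in partition if cset <= t)
            if support >= min_fraction * len(partition):
                found.add(cset)
    return found


def partition_mine(transactions, num_partitions, min_fraction):
    """Two-pass mining: local candidates first, global validation second."""
    if not transactions:
        return {}
    size = -(-len(transactions) // num_partitions)  # ceiling division
    # Pass 1: union of the locally frequent itemsets (a superset of the answer).
    candidates = set()
    for start in range(0, len(transactions), size):
        candidates |= local_frequent(transactions[start:start + size], min_fraction)
    # Pass 2: count every candidate against the full database.
    result = {}
    for cset in candidates:
        support = sum(1 for t in transactions if cset <= t)
        if support >= min_fraction * len(transactions):
            result[cset] = support
    return result


# Hypothetical database of six transactions, split into two partitions.
db = [frozenset("ab"), frozenset("ab"), frozenset("ac"),
      frozenset("c"), frozenset("bc"), frozenset("bc")]
frequent = partition_mine(db, num_partitions=2, min_fraction=Fraction(1, 2))
print(sorted(("".join(sorted(k)), v) for k, v in frequent.items()))
# [('a', 3), ('b', 4), ('c', 4)]
```

In this example, `ab` and `bc` are locally frequent in one partition each but fail the global threshold, so they are the spurious candidates removed by the second pass.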
8 Conclusions and Summary

This chapter provides a survey of different frequent pattern mining algorithms. Most frequent pattern algorithms, implicitly or explicitly, explore the enumeration tree of itemsets. Algorithms such as Apriori explore the enumeration tree in breadth-first fashion with join-based candidate generation. Although the notion of an enumeration tree is not explicitly mentioned by the algorithm, its execution explores the candidates according to an enumeration tree constructed on the prefixes. Other algorithms such as TreeProjection and FP-growth use the hierarchical relationships between the projected databases for patterns of different lengths, and avoid re-doing the counting work done for the shorter patterns. Maximal and closed versions of frequent pattern mining algorithms are also able to achieve much better pruning performance. A number of efficiency-based optimizations of frequent pattern mining algorithms were also discussed in this chapter.

References

1. R. Agrawal, and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. VLDB Conference, pp. 487–499, 1994.
2. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. ACM SIGMOD Conference, 1993.
3. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. Advances in Knowledge Discovery and Data Mining, pp. 307–328, 1996.
4. R. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. Depth-first Generation of Long Patterns. ACM KDD Conference, 2000. Also available as IBM Research Report, RC21538, July 1999.
5. R. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. A Tree Projection Algorithm for Generation of Frequent Itemsets. Journal of Parallel and Distributed Computing, 61(3), pp. 350–371, 2001. Also available as IBM Research Report, RC21341, 1999.
6. C. C. Aggarwal, P. S. Yu. Online Generation of Association Rules. ICDE Conference, 1998.
7. C. C. Aggarwal, P. S. Yu. A New Framework for Itemset Generation. ACM PODS Conference.
8. E. Azkural and C. Aykanat. A Space Optimization for FP-Growth. FIMI workshop, 2004.
9. Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal. Mining Frequent Patterns with Counting Inference. ACM SIGKDD Explorations Newsletter, 2(2), pp. 66–75, 2000.
10. R. J. Bayardo Jr. Efficiently mining long patterns from databases. ACM SIGMOD Conference.
11. J. Blanchard, F. Guillet, R. Gras, and H. Briand. Using Information-theoretic Measures to Assess Association Rule Interestingness. ICDM Conference, 2005.
12. C. Borgelt, R. Kruse. Induction of Association Rules: Apriori Implementation. Conference on Computational Statistics, 2002. http://fuzzy.cs.uni-magdeburg.de/borgelt/software.html.
13. J.-F. Boulicaut, A. Bykowski, and C. Rigotti. Free-sets: A Condensed Representation of Boolean data for the Approximation of Frequency Queries. Data Mining and Knowledge Discovery, 7(1), pp. 5–22, 2003.
14. D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases. ICDE Conference, 2000. Implementation URL: http://himalaya-tools.sourceforge.net/Mafia/.
15. S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. ACM SIGMOD Conference, 1997.
16. S. Brin, R. Motwani, and C. Silverstein. Beyond Market Baskets: Generalizing Association Rules to Correlations. ACM SIGMOD Conference, 1997.
17. T. Calders, and B. Goethals. Mining all non-derivable frequent itemsets. Principles of Data Mining and Knowledge Discovery, pp. 1–42, 2002.
18. T. Calders, and B. Goethals. Depth-first Non-derivable Itemset Mining. SDM Conference.
19. T. Calders, N. Dexters, J. Gillis, and B. Goethals. Mining Frequent Itemsets in a Stream. Information Systems, to appear, 2013.
20. J. H. Chang, and W. S. Lee. Finding Recent Frequent Itemsets Adaptively over Online Data. ACM KDD Conference, 2003.
21. M. Charikar, K. Chen, and M. Farach-Colton. Finding Frequent Items in Data Streams. Automata, Languages and Programming, pp. 693–703, 2002.
22. G. Cong, A. K. H. Tung, X. Xu, F. Pan, and J. Yang. FARMER: Finding interesting rule groups in microarray datasets. ACM SIGMOD Conference, 2004.
23. G. Cong, K.-L. Tan, A. K. H. Tung, X. Xu. Mining Top-k covering Rule Groups for Gene Expression Data. ACM SIGMOD Conference, 2005.
24. M. El-Hajj and O. Zaiane. COFI-tree Mining: A New Approach to Pattern Growth with Reduced Candidacy Generation. FIMI Workshop, 2003.
25. F. Geerts, B. Goethals, J. Bussche. A Tight Upper Bound on the Number of Candidate Patterns. ICDM Conference, 2001.
26. B. Goethals. Survey on frequent pattern mining. Technical report, University of Helsinki, 2003.
27. R. P. Gopalan and Y. G. Sucahyo. High Performance Frequent Pattern Extraction using Compressed FP-Trees. Proceedings of SIAM International Workshop on High Performance and Distributed Mining, 2004.
28. K. Gouda, and M. Zaki. Genmax: An efficient algorithm for mining maximal frequent itemsets. Data Mining and Knowledge Discovery, 11(3), pp. 223–242, 2005.
29. G. Grahne, and J. Zhu. Efficiently Using Prefix-trees in Mining Frequent Itemsets. IEEE ICDM Workshop on Frequent Itemset Mining, 2004.
30. G. Grahne, and J. Zhu. Fast Algorithms for Frequent Itemset Mining Using FP-Trees. Transactions on Knowledge and Data Engineering, 17(10), pp. 1347–1362, October 2005.
31. V. Guralnik, and G. Karypis. Parallel tree-projection-based sequence mining algorithms. Parallel Computing, 30(4), pp. 443–472, April 2004.
32. J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. ACM SIGMOD Conference, 2000.
33. J. Han, H. Cheng, D. Xin, and X. Yan. Frequent Pattern Mining: Current Status and Future. Data Mining and Knowledge Discovery, 15(1), pp. 55–86, 2007.
34. C. Hidber. Online Association Rule Mining. ACM SIGMOD Conference, 1999.
35. R. Jin, and G. Agrawal. An Algorithm for in-core Frequent Itemset Mining on Streaming Data. ICDM Conference, 2005.
36. Q. Lan, D. Zhang, and B. Wu. A New Algorithm For Frequent Itemsets Mining Based On Apriori And FP-Tree. IEEE International Conference on Global Congress on Intelligent Systems, pp. 360–364, 2009.
37. D.-I. Lin, and Z. Kedem. Pincer-search: A New Algorithm for Discovering the Maximum Frequent Set. EDBT Conference, 1998.
38. J. Liu, Y. Pan, K. Wang. Mining Frequent Item Sets by Opportunistic Projection. ACM KDD Conference, 2002.
39. G. Liu, H. Lu and J. X. Yu. AFOPT: An Efficient Implementation of Pattern Growth Approach. FIMI Workshop, 2003.
40. H. Liu, J. Han, D. Xin, and Z. Shao. Mining frequent patterns on very high dimensional data: a top-down row enumeration approach. SDM Conference, 2006.
41. C. Lucchese, S. Orlando, and R. Perego. DCI-Closed: A fast and memory efficient algorithm to mine frequent closed itemsets. FIMI Workshop, 2004.
42. C. Lucchese, S. Orlando, and R. Perego. Fast and memory efficient mining of frequent closed itemsets. IEEE TKDE Journal, 18(1), pp. 21–36, January 2006.
43. G. Manku, R. Motwani. Approximate Frequency Counts over Data Streams. VLDB Conference, 2002.
44. H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. Proceedings of the AAAI Workshop on Knowledge Discovery in Databases, pp. 181–192, 1994.
45. B. Negrevergne, T. Guns, A. Dries, and S. Nijssen. Dominance Programming for Itemset Mining. IEEE ICDM Conference, 2013.
46. S. Orlando, P. Palmerini, R. Perego. Enhancing the a-priori algorithm for frequent set counting. Third International Conference on Data Warehousing and Knowledge Discovery, 2001.
47. S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. Adaptive and resource-aware mining of frequent sets. ICDM Conference, 2002.
48. F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. J. Zaki. Finding closed patterns in long biological datasets. ACM KDD Conference, 2003.
49. F. Pan, A. K. H. Tung, G. Cong, X. Xu. COBBLER: Combining Column and Row Enumeration for Closed Pattern Discovery. SSDBM, 2004.
50. J.-S. Park, M. S. Chen, and P. S. Yu. An Effective Hash-based Algorithm for Mining Association Rules. ACM SIGMOD Conference, 1995.
51. N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT Conference, 1999.
52. N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemset lattices. Journal of Information Systems, 24(1), pp. 25–46, 1999.
53. J. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD Workshop, 2000.
54. J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, D. Yang. H-mine: Hyper-structure mining of frequent patterns in large databases. ICDM Conference, 2001.
55. B. Racz. nonordfp: An FP-Growth Variation without Rebuilding the FP-Tree. FIMI Workshop, 2004.
56. M. Holsheimer, M. Kersten, H. Mannila, and H. Toivonen. A Perspective on Databases and Data Mining. ACM KDD Conference, 1995.
57. A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB Conference, 1995.
58. P. Shenoy, J. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa, D. Shah. Turbo-charging Vertical Mining of Large Databases. ACM SIGMOD Conference, pp. 22–33, 2000.
59. Z. Shi, and Q. He. Efficiently Mining Frequent Itemsets with Compact FP-Tree. IFIP International Federation for Information Processing, V-163, pp. 397–406, 2005.
60. R. Srikant. Fast algorithms for mining association rules and sequential patterns. PhD thesis, University of Wisconsin, Madison, 1996.
61. Y. G. Sucahyo and R. P. Gopalan. CT-ITL: Efficient Frequent Item Set Mining Using a Compressed Prefix Tree with Pattern Growth. Proceedings of the 14th Australasian Database Conference, 2003.
62. Y. G. Sucahyo and R. P. Gopalan. CT-PRO: A Bottom Up Non Recursive Frequent Itemset Mining Algorithm Using Compressed FP-Tree Data Structures. FIMI Workshop, 2004.
63. P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. ACM KDD Conference, 2002.
64. I. Taouil, N. Pasquier, Y. Bastide, and L. Lakhal. Mining Basis for Association Rules using Closed Sets. ICDE Conference, 2000.
65. H. Toivonen. Sampling large databases for association rules. VLDB Conference, 1996.
66. T. Uno, M. Kiyomi and H. Arimura. Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets. FIMI Workshop, 2004.
67. J. Wang, J. Han. BIDE: Efficient Mining of Frequent Closed Sequences. ICDE Conference, 2004.
68. J. Wang, J. Han, Y. Lu, and P. Tzvetkov. TFP: An efficient algorithm for mining top-k closed itemsets. IEEE Transactions on Knowledge and Data Engineering, 17, pp. 652–664.
69. J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best strategies for mining frequent closed itemsets. ACM KDD Conference, 2003.
70. G. I. Webb. Efficient Search for Association Rules. ACM KDD Conference, 2000.
71. M. J. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3), pp. 372–390, 2000.
72. M. Zaki, and K. Gouda. Fast vertical mining using diffsets. ACM KDD Conference, 2003.
73. M. J. Zaki and C. Hsiao. CHARM: An efficient algorithm for closed association rule mining. SDM Conference, 2002.
74. M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New Algorithms for Fast Discovery of Association Rules. KDD Conference, pp. 283–286, 1997.
75. C. Zeng, J. F. Naughton, and J.-Y. Cai. On Differentially Private Frequent Itemset Mining. In Proceedings of 39th International Conference on Very Large Data Bases, 2012.

Chapter 3
Pattern-Growth Methods

Jiawei Han and Jian Pei

Mining frequent patterns has been a focused topic in data mining research in recent years, with the development of numerous interesting algorithms for mining association, correlation, causality, sequential patterns, partial periodicity, constraint-based frequent pattern mining, associative classification, emerging patterns, etc. Many studies adopt an Apriori-like, candidate generation-and-test approach. However, based on our analysis, candidate generation and test may still be expensive, especially when encountering long and numerous patterns.

A new methodology, called frequent pattern growth, which mines frequent patterns without candidate generation, has been developed. The method adopts a divide-and-conquer philosophy to project and partition databases based on the currently discovered frequent patterns and grow such patterns to longer ones in the projected databases. Moreover, efficient data structures have been developed for effective database compression and fast in-memory traversal. Such a methodology may eliminate or substantially reduce the number of candidate sets to be generated and also reduce the size of the database to be iteratively examined, and, therefore, lead to high performance.

In this paper, we provide an overview of this approach and examine its methodology and implications for mining several kinds of frequent patterns, including association, frequent closed itemsets, max-patterns, sequential patterns, and constraint-based mining of frequent patterns. We show that frequent pattern growth is efficient at mining large databases and its further development may lead to scalable mining of many other kinds of patterns as well.

Keywords Scalable data mining methods and algorithms · Frequent patterns · Sequential patterns · Constraint-based mining

J. Han
University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
e-mail: hanj@cs.uiuc.edu

J. Pei
Simon Fraser University, Burnaby, BC V5A 1S6, Canada
e-mail: jpei@cs.sfu.ca

C. C. Aggarwal, J. Han (eds.), Frequent Pattern Mining, DOI 10.1007/978-3-319-07821-2_3, © Springer International Publishing Switzerland 2014
