Presentation Transcript

1. Market Basket, Frequent Itemsets, Association Rules, Apriori and Other Algorithms

2. Market Basket Analysis
Market basket analysis lets you discover what is missing from each customer's basket, so you can offer the right product. Think of Amazon or McDonald's: just try buying only one thing from them. The result is increased sales.

3. Items & Baskets
Items are the objects between which we identify associations. For an online retailer, each item is a product in the shop. Baskets are instances of groups of items co-occurring together; items go into baskets. The support of an item or itemset is the number of transactions in our data set that contain that item or itemset.

4. Support Threshold & Frequent Itemsets
What is the support of these itemsets?

Trans. ID   Purchased Items
1           A, D
2           A, C
3           A, B, C
4           B, E, F
5           A, C, F

Sup(A,B) = 1
Sup(A,C) = 3
If the support threshold is 2, then {A,C} is a frequent itemset.
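
To make this concrete, here is a minimal Python sketch (not part of the original slides; the function name is illustrative) that counts absolute support over these five baskets:

```python
baskets = [{"A", "D"}, {"A", "C"}, {"A", "B", "C"},
           {"B", "E", "F"}, {"A", "C", "F"}]

def sup(itemset):
    # Absolute support: number of baskets containing every item in itemset.
    return sum(1 for b in baskets if itemset <= b)

print(sup({"A", "B"}))  # 1 -> below threshold 2, not frequent
print(sup({"A", "C"}))  # 3 -> frequent at threshold 2
```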

5. Association Rules
Association rules are if/then statements that help uncover relationships between seemingly unrelated data. A common example of association rules is market basket analysis.
Ex. If a customer buys a dozen eggs, he is 80% likely to also purchase milk.
Ex. If a customer buys a mouse, he/she is 95% likely to buy a keyboard as well.
The two main components of an association rule are:
- Antecedent: an item found in the data; can be viewed as the "if"
- Consequent: an item found in combination with the antecedent; can be viewed as the "then"

6. Association Rules Cont'd
Association rules are created by analyzing data for frequent if/then patterns and using the criteria support and confidence to identify the most important relationships. Support and confidence help identify the relationship between items.
Support - the fraction of transactions in the dataset in which the itemset appears.
Confidence - indicates how often the if/then statement has been found to be true.
Ex. For a rule A ⇒ B:
Support = frq(A, B) / N, where N is the total number of transactions
Confidence = frq(A, B) / frq(A)
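
A minimal Python sketch of these two formulas (illustrative helpers, not from the deck):

```python
def support(itemset, baskets):
    """frq(itemset) / N: fraction of the N baskets containing the itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """frq(A, B) / frq(A): of the baskets containing A, the fraction
    that also contain B."""
    both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    return both / sum(1 for b in baskets if antecedent <= b)
```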

7. Uses
Data mining: association rules are useful for analyzing and predicting customer behavior. They play an important part in shopping basket data analysis, product clustering, catalog design, and store layout. Programmers also use association rules to build programs capable of machine learning.

8. Basic Concepts: Association Rules
Find all the rules X → Y with minimum support and confidence:
- support, s: probability that a transaction contains X ∪ Y
- confidence, c: conditional probability that a transaction containing X also contains Y

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

(Venn diagram on the slide: customers buying beer, buying diapers, and buying both.)

Let minsup = 50%, minconf = 50%.
Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules (many more exist):
- Beer → Diaper (60%, 100%)
- Diaper → Beer (60%, 75%)
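
As a check (assuming the `support` and `confidence` helpers sketched on the previous slide are in scope), this dataset reproduces the two quoted rules:

```python
baskets = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]
print(support({"Beer", "Diaper"}, baskets))       # 3/5 = 0.6
print(confidence({"Beer"}, {"Diaper"}, baskets))  # 3/3 = 1.0
print(confidence({"Diaper"}, {"Beer"}, baskets))  # 3/4 = 0.75
```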

9. Association Rule Generation (Problem Definition)
Two sub-problems:
- Finding frequent itemsets (those whose occurrences exceed a predefined minimum support threshold)
- Deriving association rules from those frequent itemsets (subject to a minimum confidence threshold)
Apriori property: all nonempty subsets of a frequent itemset must also be frequent. If {beer, diaper, nuts} is a frequent itemset, then {beer, diaper}, {diaper, nuts}, and {beer, nuts} must also be frequent. A minimal check of this property is sketched below.
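
A minimal sketch of that subset check (function name is illustrative, not from the slides):

```python
from itertools import combinations

def survives_prune(candidate, frequent_smaller):
    """True iff every (k-1)-subset of the k-candidate is already frequent."""
    k = len(candidate)
    return all(frozenset(s) in frequent_smaller
               for s in combinations(candidate, k - 1))

L2 = {frozenset(p) for p in [("beer", "diaper"),
                             ("diaper", "nuts"),
                             ("beer", "nuts")]}
print(survives_prune(frozenset(["beer", "diaper", "nuts"]), L2))  # True
```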

10. The Apriori Algorithm
Let Ck be the set of candidate itemsets of size k, and Lk the set of frequent itemsets of size k. The main steps of each iteration are:
- Find the frequent itemsets Lk-1.
- Join step: generate Ck by joining Lk-1 with itself (the Cartesian product Lk-1 × Lk-1).
- Prune step (using the Apriori property): any (k−1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, so it is removed from Ck.
- Obtain the frequent itemsets Lk and repeat until Lk = ∅. A runnable sketch follows.
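
A minimal Python sketch of these steps (illustrative, with an absolute support count as the threshold):

```python
from itertools import combinations

def apriori(baskets, min_support):
    """baskets: list of sets of items; min_support: absolute count."""
    # Pass 1: count singletons to build L1.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    L = set(frequent)
    k = 2
    while L:
        # Join step: union (k-1)-itemsets pairwise, keeping size-k results.
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset of a
        # candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Count the surviving candidates against the full data.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        L = {c for c in candidates if counts[c] >= min_support}
        frequent.update({c: counts[c] for c in L})
        k += 1
    return frequent
```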

11. Apriori Algorithm Example
Consider a database D consisting of 9 transactions. Suppose the minimum support count required is 2 (i.e. 2/9 = 22%) and the minimum confidence required is 70%. We first find the frequent itemsets using the Apriori algorithm; then association rules are generated using the minimum support and minimum confidence.

TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3
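
For reference, running the `apriori` sketch from the previous slide on this table (an illustrative check, not part of the original deck) reproduces the frequent itemsets the next slides derive step by step:

```python
baskets = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
freq = apriori(baskets, min_support=2)
# Expected frequent itemsets (support counts):
# L1: I1:6, I2:7, I3:6, I4:2, I5:2
# L2: {I1,I2}:4, {I1,I3}:4, {I1,I5}:2, {I2,I3}:4, {I2,I4}:2, {I2,I5}:2
# L3: {I1,I2,I3}:2, {I1,I2,I5}:2
```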

12.-16. Apriori Algorithm Example (slide figures not recovered in this transcript; the worked run sketched above reproduces the resulting frequent itemsets)

17. Generating Association Rules
From the frequent 2-itemsets found above, each rule's confidence is sup(A ∪ B) / sup(A):

Rule        Confidence
I1 → I2     4/6 = 67%
I1 → I3     4/6 = 67%
I1 → I5     2/6 = 33%
I2 → I1     4/7 = 57%
I2 → I3     4/7 = 57%
I2 → I4     2/7 = 29%
I2 → I5     2/7 = 29%
I3 → I1     4/6 = 67%
I3 → I2     4/6 = 67%
I4 → I2     2/2 = 100%
I5 → I1     2/2 = 100%
I5 → I2     2/2 = 100%

18. Generating Association Rules (cont'd)
Rules from the frequent 3-itemset {I1, I2, I5}:

Rule            Confidence
I1, I2 → I5     2/4 = 50%
I1, I5 → I2     2/2 = 100%
I2, I5 → I1     2/2 = 100%
I1 → I2, I5     2/6 = 33%
I2 → I1, I5     2/7 = 29%
I5 → I1, I2     2/2 = 100%

Rules from the frequent 3-itemset {I1, I2, I3}:

Rule            Confidence
I1, I2 → I3     2/4 = 50%
I1, I3 → I2     2/4 = 50%
I2, I3 → I1     2/4 = 50%
I1 → I2, I3     2/6 = 33%
I2 → I1, I3     2/7 = 29%
I3 → I1, I2     2/6 = 33%

This is solved iteratively until no more new rules emerge. With the minimum confidence of 70%, only the 100% rules above are accepted.
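
A compact sketch of this rule enumeration (illustrative code, not from the deck; it prints the rules from {I1, I2, I5} that survive the 70% threshold):

```python
from itertools import combinations

baskets = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
           {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
           {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]

def rules_from_itemset(itemset, min_conf=0.7):
    """Enumerate antecedent -> consequent over all nonempty proper
    subsets of a frequent itemset, keeping those meeting min_conf."""
    itemset = frozenset(itemset)
    n_full = sum(1 for b in baskets if itemset <= b)
    for r in range(1, len(itemset)):
        for ante in map(frozenset, combinations(itemset, r)):
            conf = n_full / sum(1 for b in baskets if ante <= b)
            if conf >= min_conf:
                print(sorted(ante), "->", sorted(itemset - ante), f"{conf:.0%}")

rules_from_itemset({"I1", "I2", "I5"})
# Keeps I1,I5 -> I2 (100%), I2,I5 -> I1 (100%), I5 -> I1,I2 (100%)
```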

19. PCY (Park-Chen-Yu) Algorithm
A hash-based improvement to A-Priori. During Pass 1 of A-Priori, most memory is idle; PCY uses that memory to keep counts of buckets into which pairs of items are hashed. For each basket, every pair of items is hashed and the count of its bucket is incremented. The PCY algorithm therefore accomplishes more on the first pass. It makes use of an array disguised as a hash table (where indices represent keys).

20. PCY (Park-Chen-Yu) Algorithm (cont'd)
Between passes:
- Replace the buckets by a bit-vector ("bitmap"): 1 means the bucket is frequent; 0 means it is not.
- Also decide which items are frequent and list them for the second pass.
This gives extra conditions that candidate pairs must satisfy on Pass 2, as sketched below.
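
A sketch of both passes (illustrative; the bucket count and function names are assumptions, and Python's built-in tuple hash stands in for the bucket hash function):

```python
from collections import Counter
from itertools import combinations

def pcy_pass1(baskets, num_buckets, min_support):
    """Pass 1: count single items and, in the otherwise idle memory,
    hash every pair into a bucket counter."""
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        item_counts.update(basket)
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % num_buckets] += 1
    # Between passes: compress bucket counts into a bitmap.
    bitmap = [1 if c >= min_support else 0 for c in bucket_counts]
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    return frequent_items, bitmap

def pcy_pass2(baskets, frequent_items, bitmap, num_buckets):
    """Pass 2: a pair is a candidate only if both items are frequent
    AND the pair hashes to a frequent bucket."""
    counts = Counter()
    for basket in baskets:
        items = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(items, 2):
            if bitmap[hash(pair) % num_buckets]:
                counts[pair] += 1
    return counts
```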

21. Simple Algorithm
Instead of using the entire file of baskets, pick a random subset of the baskets and treat it as if it were the entire dataset. The safest way to pick the sample is to read the entire dataset and, for each basket, select that basket for the sample with some fixed probability p. If there are m baskets in the entire file, the sample will end up with a size very close to pm baskets. The sampling step is sketched below.
Ex. If the support threshold for the full dataset is s and we choose a sample of 1% of the baskets, then we examine the sample for itemsets that appear in at least s/100 of the baskets. Smaller support thresholds recognize more frequent itemsets but require more memory.
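
A minimal sampling sketch (illustrative; assumes one whitespace-separated basket per input line):

```python
import random

def sample_baskets(basket_lines, p):
    """Select each basket for the sample with fixed probability p,
    so a file of m baskets yields close to p*m sampled baskets."""
    sample = []
    for line in basket_lines:
        if random.random() < p:
            sample.append(set(line.split()))
    return sample

# With a 1% sample (p = 0.01) and full-dataset threshold s, scale the
# threshold to s * p (= s/100) when mining the sample.
```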

22. SON Algorithm
Part 1: The first pass of the SON algorithm performs the simple algorithm on subsets that partition the dataset; processing the subsets in parallel is more efficient.
- Scan the data
- Break the data into chunks that can be processed in main memory
- Continuously fill memory with a new batch of data
- Run the sampling algorithm on each batch
- Generate candidate frequent itemsets
Part 2: The second pass counts the output from the first pass and determines whether an itemset is frequent across the entire dataset.
- Validates the candidate itemsets
- Counts all candidate itemsets and determines which are frequent in the entire set
Monotonicity property: if itemset X is frequent overall, then it is frequent in at least one batch. A two-pass sketch follows.
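
A two-pass sketch (illustrative; `mine` is assumed to be any in-memory frequent-itemset miner, e.g. the `apriori` sketch earlier, returning a mapping from itemset to count):

```python
from itertools import islice

def son_pass1(baskets, chunk_size, support_fraction, mine):
    """Pass 1: mine each in-memory chunk at a proportionally lowered
    absolute threshold; union the locally frequent itemsets."""
    it = iter(baskets)
    candidates = set()
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        local_threshold = max(int(support_fraction * len(chunk)), 1)
        candidates |= set(mine(chunk, local_threshold))
    return candidates

def son_pass2(baskets, candidates, support_fraction):
    """Pass 2: count every candidate over the full data and keep
    those that are frequent in the entire set."""
    counts = {c: 0 for c in candidates}
    n = 0
    for basket in baskets:
        n += 1
        for c in candidates:
            if c <= basket:
                counts[c] += 1
    return {c: k for c, k in counts.items() if k >= support_fraction * n}
```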

23. Toivonen's Algorithm
Start as in the simple algorithm, but lower the support threshold further.
- Example: for a 1% sample, use s/125 rather than s/100.
The goal is to prevent false negatives, i.e. to ensure that no truly frequent itemset is missed. An itemset whose support in the sample is close to, but below, the scaled threshold may still be counted as frequent in this algorithm.
An itemset is in the negative border if it is not deemed frequent in the sample, but all of its immediate subsets are. For example, if {a,b,c,d} is not frequent in the sample but {a,b,c}, {a,b,d}, {a,c,d}, and {b,c,d} all are, then {a,b,c,d} is in the negative border.
In the second pass, count all the frequent itemsets from the first pass together with the negative border. If a member of the negative border turns out to be frequent in the full dataset, you have to start over with a different support threshold level.
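
A partial sketch of the negative-border idea, restricted to pairs and triples for brevity (illustrative; `frequent` is assumed to be the set of frozensets found frequent in the sample, including singletons):

```python
from itertools import combinations

def negative_border(frequent, items):
    """Itemsets not frequent in the sample whose immediate subsets
    all are; sketched here only for pairs and triples."""
    border = set()
    # Pairs whose two singleton subsets are frequent but the pair is not.
    for a, b in combinations(sorted(items), 2):
        pair = frozenset([a, b])
        if pair not in frequent and {frozenset([a]), frozenset([b])} <= frequent:
            border.add(pair)
    # Triples whose three pair subsets are frequent but the triple is not.
    for t in combinations(sorted(items), 3):
        tri = frozenset(t)
        subs = {frozenset(s) for s in combinations(t, 2)}
        if tri not in frequent and subs <= frequent:
            border.add(tri)
    return border
```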

24. Demonstration