Mining Sequential Patterns

Rakesh Agrawal    Ramakrishnan Srikant*
IBM Almaden Research Center
650 Harry Road, San Jose, CA 95120

Abstract

We are given a large database of customer transactions, where each transaction consists of customer-id, transaction-time, and the items bought in the transaction. We introduce the problem of mining sequential patterns over such databases. We present three algorithms to solve this problem, and empirically evaluate their performance using synthetic data. Two of the proposed algorithms, AprioriSome and AprioriAll, have comparable performance, albeit AprioriSome performs a little better when the minimum number of customers that must support a sequential pattern is low. Scale-up experiments show that both AprioriSome and AprioriAll scale linearly with the number of customer transactions. They also have excellent scale-up properties with respect to the number of transactions per customer and the number of items in a transaction.

1  Introduction

Database mining is motivated by the decision support problem faced by most large retail organizations. Progress in bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data, referred to as the basket data. A record in such data typically consists of the transaction date and the items bought in the transaction. Very often, data records also contain customer-id, particularly when the purchase has been made using a credit card or a frequent-buyer card. Catalog companies also collect such data using the orders they receive.

We introduce the problem of mining sequential patterns over this data. An example of such a pattern is that customers typically rent "Star Wars", then "Empire Strikes Back", and then "Return of the Jedi". Note that these rentals need not be consecutive. Customers who rent some other videos in between also support this sequential pattern. Elements of a sequential pattern need not be simple items. "Fitted sheet and flat sheet and pillow cases", followed by "comforter", followed by "drapes and ruffles" is an example of a sequential pattern in which the elements are sets of items.

(* Also Department of Computer Science, University of Wisconsin, Madison.)

Problem Statement

We are given a database of customer transactions. Each transaction consists of the following fields: customer-id, transaction-time, and the items purchased in the transaction. No customer has more than one transaction with the same transaction-time. We do not consider quantities of items bought in a transaction: each item is a binary variable representing whether an item was bought or not.

An itemset is a non-empty set of items. A sequence is an ordered list of itemsets. Without loss of generality, we assume that the set of items is mapped to a set of contiguous integers. We denote an itemset i by (i1 i2 ... im), where ij is an item. We denote a sequence s by ⟨s1 s2 ... sn⟩, where sj is an itemset.

A sequence ⟨a1 a2 ... an⟩ is contained in another sequence ⟨b1 b2 ... bm⟩ if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin. For example, the sequence ⟨(3) (4 5) (8)⟩ is contained in ⟨(7) (3 8) (9) (4 5 6) (8)⟩, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8). However, the sequence ⟨(3) (5)⟩ is not contained in ⟨(3 5)⟩ (and vice versa). The former represents items 3 and 5 being bought one after the other, while the latter represents items 3 and 5 being bought together. In a set of sequences, a sequence s is maximal if s is not contained in any other sequence.

All the transactions of a customer can together be viewed as a sequence, where each transaction corresponds to a set of items, and the list of
transactions, ordered by increasing transaction-time, corresponds to a sequence. We call such a sequence a customer-sequence.

   Customer Id   Transaction Time   Items Bought
   1             June 25 '93        30
   1             June 30 '93        90
   2             June 10 '93        10, 20
   2             June 15 '93        30
   2             June 20 '93        40, 60, 70
   3             June 25 '93        30, 50, 70
   4             June 25 '93        30
   4             June 30 '93        40, 70
   4             July 25 '93        90
   5             June 12 '93        90

   Figure 1: Database Sorted by Customer Id and Transaction Time

   Customer Id   Customer Sequence
   1             ⟨(30) (90)⟩
   2             ⟨(10 20) (30) (40 60 70)⟩
   3             ⟨(30 50 70)⟩
   4             ⟨(30) (40 70) (90)⟩
   5             ⟨(90)⟩

   Figure 2: Customer-Sequence Version of the Database

Formally, let the transactions of a customer, ordered by increasing transaction-time, be T1, T2, ..., Tn. Let the set of items in Ti be denoted by itemset(Ti). The customer-sequence for this customer is the sequence ⟨itemset(T1) itemset(T2) ... itemset(Tn)⟩.

A customer supports a sequence s if s is contained in the customer-sequence for this customer. The support for a sequence is defined as the fraction of total customers who support this sequence.

Given a database of customer

transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support. Each such maximal sequence represents a sequential pattern. We call a sequence satisfying the minimum support constraint a large sequence.

Example. Consider the database shown in Fig. 1. (This database has been sorted on customer-id and transaction-time.) Fig. 2 shows this database expressed as a set of customer sequences. With minimum support set to 25%, i.e., a minimum support of 2 customers, two sequences, ⟨(30) (90)⟩ and ⟨(30) (40 70)⟩, are maximal among those satisfying the support constraint, and are the desired sequential patterns.

   Sequential Patterns with support ≥ 25%:
   ⟨(30) (90)⟩
   ⟨(30) (40 70)⟩

   Figure 3: The answer set

The sequential pattern ⟨(30) (90)⟩ is supported by customers 1 and 4. Customer 4 buys items (40 70) in between items 30 and 90, but supports the pattern ⟨(30) (90)⟩ since we are looking for patterns that are not necessarily contiguous. The sequential pattern ⟨(30) (40 70)⟩ is supported by customers 2 and 4. Customer 2 buys 60 along with 40 and 70, but supports this pattern since (40 70) is a subset of (40 60 70).

An example of a sequence that does not have minimum support is the sequence ⟨(10 20) (30)⟩, which is only supported by customer 2. The sequences ⟨(30)⟩, ⟨(40)⟩, ⟨(70)⟩, ⟨(90)⟩, ⟨(30) (40)⟩, ⟨(30) (70)⟩ and ⟨(40 70)⟩, though having minimum support, are not in the answer because they are not maximal.

Related Work

In [1], the problem of discovering "what items are bought together in a transaction" over basket data was introduced. While related, the problem of finding what items are bought together is concerned with finding intra-transaction patterns, whereas

the problem of finding sequential patterns is concerned with inter-transaction patterns. A pattern in the first problem consists of an unordered set of items, whereas a pattern in the latter case is an ordered list of sets of items.

Discovering patterns in sequences of events has been an area of active research in AI (see, for example, [6]). However, the focus in this body of work is on discovering the rule underlying the generation of a given sequence in order to be able to predict a plausible sequence continuation (e.g. the rule to predict what number will come next, given a sequence of numbers). We, on the other hand, are interested in finding all common patterns embedded in a database of sequences of sets of events (items).

Our problem is related to the problem of finding text subsequences that match a given regular expression (cf. the UNIX grep utility). There also has been work on finding text subsequences that approximately match a given string (e.g. [5] [12]). These techniques are oriented toward finding matches for one pattern. In our problem, the difficulty is in figuring out what patterns to try and then efficiently finding out which ones are contained in a customer sequence. Techniques based on multiple alignment [11] have
been proposed to find entire text sequences that are similar. There also has been work to find locally similar subsequences [4] [8] [9]. However, as pointed out in [10], these techniques apply when the discovered patterns consist of consecutive characters or multiple lists of consecutive characters separated by a fixed length of noise characters.

Closest to our problem is the problem formulation in [10] in the context of discovering similarities in a database of genetic sequences. The patterns they wish to discover are subsequences made up of consecutive characters separated by a variable number of noise characters. A sequence in our problem consists of a list of sets of characters (items), rather than being simply a list of characters. Thus, an element of the sequential pattern we discover can be a set of characters (items), rather than being simply a character. Our solution approach is entirely different. The solution in [10] is not guaranteed to be complete, whereas we guarantee that we have discovered all sequential patterns of interest that are present in a specified minimum number of sequences. The algorithm in [10] is a main-memory algorithm based on a generalized suffix tree [7] and was tested against a database of 150 sequences (although the paper does contain some hints on how they might extend their approach to handle larger databases). Our solution is targeted at millions of customer sequences.

Organization of the Paper

We solve the problem of finding all sequential patterns in five phases: i) sort phase, ii) litemset phase, iii) transformation phase, iv) sequence phase, and v) maximal phase. Section 2 gives this

problem decomposition. Section 3 examines the sequence phase in detail and presents algorithms for this phase. We empirically evaluate the performance of these algorithms and study their scale-up properties in Section 4. We conclude with a summary and directions for future work in Section 5.

2  Finding Sequential Patterns

Terminology. The length of a sequence is the number of itemsets in the sequence. A sequence of length k is called a k-sequence. The sequence formed by the concatenation of two sequences x and y is denoted as x.y.

The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction. Thus the itemset i and the 1-sequence ⟨i⟩ have the same support. An itemset with minimum support is called a large itemset or litemset. Note that each itemset in a large sequence must have minimum support. Hence, any large sequence must be a list of litemsets.

   Large Itemsets   Mapped To
   (30)             1
   (40)             2
   (70)             3
   (40 70)          4
   (90)             5

   Figure 4: Large Itemsets

2.1  The Algorithm

We split the problem of mining sequential patterns into the following phases:

1. Sort Phase. The database is sorted, with customer-id as the major key and transaction-time as the minor key. This step implicitly converts the original transaction database into a database of customer sequences.

2. Litemset Phase. In this phase we find the set of all litemsets L. We are also simultaneously finding the set of all large 1-sequences, since this set is just {⟨l⟩ | l ∈ L}.

The problem of finding large itemsets in a given set of customer transactions, albeit with a slightly different definition of support, has been considered in [1] [2]. In these papers, the support for an itemset has been defined as the fraction of transactions in which the itemset is present, whereas in the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions. It is straightforward to adapt any of the algorithms in [2] to find litemsets. The main difference is that the support count should be incremented only once per customer even if the customer buys the same set of items in two different transactions.

The set of litemsets is mapped to a set of contiguous integers. In the example database given in Fig. 1, the large itemsets are (30), (40), (70), (40 70) and (90). A possible mapping for this set is shown in Fig. 4. The reason for this mapping is that by treating litemsets as single entities, we can compare two litemsets for equality in constant time, and reduce the time required to check if a sequence is contained in a customer sequence.
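For illustration, the containment test and the support count can be sketched in Python (an illustrative sketch, not the paper's implementation, which compares mapped litemset integers and uses a hash-tree for counting). Greedy left-to-right matching suffices for the containment relation of the Problem Statement:

```python
def is_contained(seq_a, seq_b):
    """True if sequence seq_a is contained in seq_b, i.e. each itemset of
    seq_a maps, in order, to a superset itemset of seq_b."""
    pos = 0
    for a in seq_a:
        # advance to the earliest remaining itemset of seq_b that is a superset of a
        while pos < len(seq_b) and not set(a) <= set(seq_b[pos]):
            pos += 1
        if pos == len(seq_b):
            return False
        pos += 1
    return True

def support(seq, customer_sequences):
    """Fraction of customers whose customer-sequence contains seq;
    each customer is counted at most once."""
    return sum(is_contained(seq, c) for c in customer_sequences) / len(customer_sequences)

# The customer-sequence database of Fig. 2:
customers = [
    [[30], [90]],
    [[10, 20], [30], [40, 60, 70]],
    [[30, 50, 70]],
    [[30], [40, 70], [90]],
    [[90]],
]
```

On this database, support([[30], [90]], customers) and support([[30], [40, 70]], customers) both evaluate to 0.4, matching the two sequential patterns of Fig. 3.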
3. Transformation Phase. As we will see in Section 3, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, we transform each customer sequence into an alternative representation.

In a transformed customer sequence, each transaction is replaced by the set of all litemsets contained in that transaction. If a transaction does not contain any litemset, it is not retained in the transformed sequence. If a customer sequence does not contain any litemset, this sequence is dropped from the transformed database. However, it still contributes to the count of total number of customers. A customer sequence is now represented by a list of sets of litemsets. Each set of litemsets is represented by {l1, l2, ..., ln}, where li is a litemset.

This transformed database is called DT. Depending on the disk availability, we can physically create this transformed database, or this transformation can be done on-the-fly, as we read each customer sequence during a pass. (In our experiments, we physically created the transformed database.)

The transformation of the database in Fig. 2 is shown in Fig. 5. For example, during the transformation of the customer sequence with Id 2, the transaction (10 20) is dropped because it does not contain any litemset, and the transaction (40 60 70) is replaced by the set of litemsets {(40), (70), (40 70)}.

4. Sequence Phase. Use the set of litemsets to find the desired sequences. Algorithms for this phase are described in Section 3.

5. Maximal Phase. Find the maximal sequences among the set of large sequences. In some algorithms in Section 3, this phase is combined with the sequence phase to reduce the time wasted in counting non-maximal sequences. Having found the set S of all large sequences in the sequence phase, the following algorithm can be used for finding maximal sequences. Let the length of the longest sequence be n. Then:

   for (k = n; k > 1; k--) do
      foreach k-sequence sk do
         Delete from S all subsequences of sk

Data structures (the hash-tree) and an algorithm to quickly find all subsequences of a given sequence are described in [3] (and are similar to those used to find all subsets of a given itemset [2]).

3  The Sequence Phase

The general structure of the algorithms for the sequence phase is that they make multiple passes over the data. In each pass, we start with a seed set of large sequences. We use the seed set for generating new potentially large sequences, called candidate sequences. We find the support for these candidate sequences during the pass over the data. At the end of the pass, we determine which of the candidate sequences are actually large. These large candidates become the seed for the next pass. In the first pass, all 1-sequences with minimum support, obtained in the litemset phase, form the seed set.

We present two families of algorithms, which we call count-all and count-some. The count-all algorithms count all the large sequences, including non-maximal sequences. The non-maximal sequences must then be pruned out (in the maximal phase). We present one count-all algorithm, called AprioriAll, based on the Apriori algorithm for finding large itemsets presented in [2].

We present two count-some algorithms: AprioriSome and DynamicSome. The intuition behind these algorithms is that since we are only interested in maximal sequences, we can avoid counting sequences which are contained in a longer sequence if we first count longer sequences. However, we have to be careful not to count a lot of longer sequences that do not have minimum support. Otherwise, the time saved by not counting sequences contained in a longer sequence may be less than the time wasted counting sequences without minimum support that would never have been counted because their subsequences were not large.

Both the count-some algorithms have a forward phase, in which we find all large sequences of certain lengths, followed by a backward phase, where we find all remaining large sequences. The essential difference is in the procedure they use for generating candidate sequences during the forward phase. As we will see momentarily, AprioriSome generates candidates for a pass using only the large sequences found in the previous pass and then makes a pass over the data to find their support. DynamicSome generates candidates on-the-fly using the large sequences found in the previous passes and the customer sequences read from the database.

(The AprioriHybrid algorithm presented in [2] did better than Apriori for finding large itemsets. However, it used the property that a k-itemset is present in a transaction if any two of its (k-1)-subsets are present in the transaction to avoid scanning the database in later passes. Since this property does not hold for sequences, we do not expect an algorithm based on AprioriHybrid to do any better than the algorithm based on Apriori.)
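The pass structure just described — seed set, candidate generation, one counting scan, prune — can be sketched end-to-end for the count-all case. This is a simplified in-memory illustration, not the paper's implementation: sequence elements here are single litemset ids rather than itemsets, and counting is a linear scan rather than a hash-tree.

```python
def apriori_all(customer_seqs, min_support):
    """Count-all sketch: each pass builds candidate k-sequences from the
    large (k-1)-sequences, counts them in one scan over the data, and keeps
    those with minimum support; a final maximal phase prunes subsequences."""
    n = len(customer_seqs)

    def contained(seq, cust):
        # greedy subsequence test over flat lists of litemset ids
        pos = 0
        for item in seq:
            try:
                pos = cust.index(item, pos) + 1
            except ValueError:
                return False
        return True

    def sup(seq):
        return sum(contained(seq, c) for c in customer_seqs) / n

    items = sorted({x for c in customer_seqs for x in c})
    large = [(i,) for i in items if sup((i,)) >= min_support]  # seed: 1-sequences
    answer = list(large)
    while large:
        prev = set(large)
        # join L_{k-1} with itself on the first k-2 elements ...
        cand = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
        # ... then prune candidates that have a non-large (k-1)-subsequence
        cand = {c for c in cand
                if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}
        large = [c for c in sorted(cand) if sup(c) >= min_support]
        answer.extend(large)

    # maximal phase: drop sequences contained in a longer large sequence
    def subseq(a, b):
        it = iter(b)
        return all(x in it for x in a)

    return [s for s in answer
            if not any(len(t) > len(s) and subseq(s, t) for t in answer)]
```

For example, apriori_all([[1, 2, 3], [1, 3], [1, 2, 3], [2, 3]], 0.5) returns [(1, 2, 3)], the only maximal large sequence at 50% support in that toy database.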
   Customer Id   Original                     Transformed                             After
                 Customer Sequence            Customer Sequence                       Mapping
   1             ⟨(30) (90)⟩                  ⟨{(30)} {(90)}⟩                         ⟨{1} {5}⟩
   2             ⟨(10 20) (30) (40 60 70)⟩    ⟨{(30)} {(40), (70), (40 70)}⟩          ⟨{1} {2, 3, 4}⟩
   3             ⟨(30 50 70)⟩                 ⟨{(30), (70)}⟩                          ⟨{1, 3}⟩
   4             ⟨(30) (40 70) (90)⟩          ⟨{(30)} {(40), (70), (40 70)} {(90)}⟩   ⟨{1} {2, 3, 4} {5}⟩
   5             ⟨(90)⟩                       ⟨{(90)}⟩                                ⟨{5}⟩

   Figure 5: Transformed Database

   L1 = {large 1-sequences};   // Result of litemset phase
   for (k = 2; Lk-1 ≠ ∅; k++) do
   begin
      Ck = New candidates generated from Lk-1 (see Section 3.1.1).
      foreach customer-sequence c in the database do
         Increment the count of all candidates in Ck that are contained in c.
      Lk = Candidates in Ck with minimum support.
   end
   Answer = Maximal Sequences in ∪k Lk;

   Figure 6: Algorithm AprioriAll

Notation. In all the algorithms, Lk denotes the set of all large k-sequences, and Ck the set of candidate k-sequences.

3.1  Algorithm AprioriAll

The algorithm is given in Fig. 6. In each pass, we use the large sequences from the previous pass to generate the candidate sequences and then measure their support by making a pass over the database. At the end of the pass, the support of the candidates is used to determine the large sequences. In the first pass, the output of the litemset phase is used to initialize the set

of large 1-sequences. The candidates are stored in a hash-tree [2] [3] to quickly find all candidates contained in a customer sequence.

3.1.1  Apriori Candidate Generation

The apriori-generate function takes as argument Lk-1, the set of all large (k-1)-sequences. The function works as follows. First, join Lk-1 with Lk-1:

   insert into Ck
   select p.litemset1, ..., p.litemsetk-1, q.litemsetk-1
   from Lk-1 p, Lk-1 q
   where p.litemset1 = q.litemset1, ..., p.litemsetk-2 = q.litemsetk-2;

Next, delete all sequences c ∈ Ck such that some (k-1)-subsequence of c is not in Lk-1.

   Large          Candidate       Candidate
   3-Sequences    4-Sequences     4-Sequences
                  (after join)    (after pruning)
   ⟨1 2 3⟩        ⟨1 2 3 4⟩       ⟨1 2 3 4⟩
   ⟨1 2 4⟩        ⟨1 2 4 3⟩
   ⟨1 3 4⟩        ⟨1 3 4 5⟩
   ⟨1 3 5⟩        ⟨1 3 5 4⟩
   ⟨2 3 4⟩

   Figure 7: Candidate Generation

For example, consider the set of 3-sequences shown in the first column of Fig. 7. If this is given as input to the apriori-generate function, we will get the sequences shown in the second column after the join. After pruning out sequences whose subsequences are not in L3, the sequences shown in the third column will be left. For example, ⟨1 2 4 3⟩ is pruned out because the subsequence ⟨2 4 3⟩ is not in L3. Proof of correctness of the candidate generation procedure is given in [3].

3.1.2  Example

Consider a database with the

customer-sequences shown in Fig. 8. We have not shown the original database in this example. The customer sequences are in transformed form, where each transaction has been replaced by the set of litemsets contained in the transaction and the litemsets have been replaced by integers. The minimum support has been specified to be 40% (i.e. 2 customer sequences). The first pass over the database is made in the litemset phase, and we determine the large 1-sequences shown in Fig. 9. The large sequences together with their support at the end of the second, third, and fourth passes are also shown in the same figure. No candidate is generated for the
fifth pass. The maximal large sequences would be the three sequences ⟨1 2 3 4⟩, ⟨1 3 5⟩ and ⟨4 5⟩.

   ⟨{1 5} {2} {3} {4}⟩
   ⟨{1} {3} {4} {3 5}⟩
   ⟨{1} {2} {3} {4}⟩
   ⟨{1} {3} {5}⟩
   ⟨{4} {5}⟩

   Figure 8: Customer Sequences

3.2  Algorithm AprioriSome

This algorithm is given in Fig. 10. In the forward pass, we only count sequences of certain lengths. For example, we might count sequences of length 1, 2, 4 and 6 in the forward phase and count sequences of length 3 and 5 in the backward phase. The function next takes as parameter the length of sequences counted in the last pass and returns the length of sequences to be counted in the next pass. Thus, this function determines exactly which sequences are counted, and balances the tradeoff between the time wasted in counting non-maximal sequences versus counting extensions of small candidate sequences. One extreme is next(k) = k + 1 (k is the length for which candidates were counted last), when all non-maximal sequences are counted, but no extensions of small candidate sequences are counted. In this case, AprioriSome degenerates into AprioriAll. The other extreme is a function like next(k) = 100k, when almost no non-maximal large sequence is counted, but lots of extensions of small candidates are counted.

Let hitk denote the ratio of the number of large k-sequences to the number of candidate k-sequences (i.e., |Lk| / |Ck|). The next function we used in the experiments is given below. The intuition behind the heuristic is that as the percentage of candidates counted in the current pass which had minimum support increases, the time wasted by counting extensions of small candidates when we skip a length goes down.

   function next(k: integer)
   begin
      if (hitk < 0.666) return k + 1;
      elsif (hitk < 0.75) return k + 2;
      elsif (hitk < 0.80) return k + 3;
      elsif (hitk < 0.85) return k + 4;
      else return k + 5;
   end

We use the apriori-generate function given in Section 3.1.1 to generate new candidate sequences. However, in the kth pass, we may not have the large sequence set Lk-1 available as we did not count the (k-1)-candidate sequences. In that case, we use the candidate set Ck-1 to generate Ck. Correctness is maintained because Lk-1 ⊆ Ck-1.

   // Forward Phase
   L1 = {large 1-sequences};   // Result of litemset phase
   C1 = L1;                    // so that we have a nice loop condition
   last = 1;                   // we last counted C_last
   for (k = 2; Ck-1 ≠ ∅ and L_last ≠ ∅; k++) do
   begin
      if (Lk-1 known) then
         Ck = New candidates generated from Lk-1;
      else
         Ck = New candidates generated from Ck-1;
      if (k == next(last)) then
      begin
         foreach customer-sequence c in the database do
            Increment the count of all candidates in Ck that are contained in c.
         Lk = Candidates in Ck with minimum support.
         last = k;
      end
   end

   // Backward Phase
   for (k--; k >= 1; k--) do
      if (Lk not found in forward phase) then
      begin
         Delete all sequences in Ck contained in some Li, i > k;
         foreach customer-sequence c in DT do
            Increment the count of all candidates in Ck that are contained in c.
         Lk = Candidates in Ck with minimum support.
      end
      else   // Lk already known
         Delete all sequences in Lk contained in some Li, i > k.

   Answer = ∪k Lk;

   Figure 10: Algorithm AprioriSome

In the backward phase, we count sequences for the lengths we skipped over during the forward phase, after first deleting all sequences contained in some large sequence. These smaller sequences cannot be in the answer because we are only interested in maximal sequences. We also delete the large sequences found in the forward phase that are non-maximal.

In the implementation, the forward and backward phases are interspersed to reduce the memory used by the candidates. However, we have omitted this detail in Fig. 10 to simplify the exposition.
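The next heuristic above transcribes directly into Python (the hit ratio |Lk| / |Ck| for the last counted length is supplied by the caller):

```python
def next_length(k, hit):
    """Return the next sequence length to count, given that length k was
    counted last and hit = |L_k| / |C_k| for that pass."""
    if hit < 0.666:
        return k + 1
    elif hit < 0.75:
        return k + 2
    elif hit < 0.80:
        return k + 3
    elif hit < 0.85:
        return k + 4
    else:
        return k + 5
```

A low hit ratio keeps AprioriSome close to AprioriAll (next_length(2, 0.5) == 3), while a high one skips aggressively (next_length(2, 0.9) == 7).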
   1-Sequences   Support      2-Sequences   Support
   ⟨1⟩           4            ⟨1 2⟩         2
   ⟨2⟩           2            ⟨1 3⟩         4
   ⟨3⟩           4            ⟨1 4⟩         3
   ⟨4⟩           4            ⟨1 5⟩         2
   ⟨5⟩           4            ⟨2 3⟩         2
                              ⟨2 4⟩         2
                              ⟨3 4⟩         3
                              ⟨3 5⟩         2
                              ⟨4 5⟩         2

   3-Sequences   Support      4-Sequences   Support
   ⟨1 2 3⟩       2            ⟨1 2 3 4⟩     2
   ⟨1 2 4⟩       2
   ⟨1 3 4⟩       3
   ⟨1 3 5⟩       2
   ⟨2 3 4⟩       2

   Figure 9: Large Sequences

   ⟨1 2 3⟩  ⟨1 2 4⟩  ⟨1 3 4⟩  ⟨1 3 5⟩  ⟨2 3 4⟩  ⟨3 4 5⟩

   Figure 11: Candidate 3-sequences

3.2.1  Example

Using again the database used in the example for the AprioriAll algorithm, we find the large 1-sequences (L1) shown in Fig. 9 in the litemset phase (during the first pass over the database). Take for illustration simplicity next(k) = 2k. In the second pass, we count C2 to get L2 (Fig. 9).
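The apriori-generate join-and-prune step used in the following passes can be sketched in Python, with sequences as tuples of litemset ids (an illustrative sketch rather than the paper's hash-tree-based implementation):

```python
def apriori_generate(large_prev):
    """Generate candidate k-sequences from the large (k-1)-sequences:
    join pairs that agree on their first k-2 elements, then delete any
    candidate with a (k-1)-subsequence that is not large."""
    prev = set(large_prev)
    joined = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
    # prune: every (k-1)-subsequence (drop one position) must be in prev
    return sorted(c for c in joined
                  if all(c[:i] + c[i + 1:] in prev for i in range(len(c))))
```

Fed the large 3-sequences of Fig. 7, it returns [(1, 2, 3, 4)]: sequences such as ⟨1 2 4 3⟩ and ⟨1 3 4 5⟩ survive the join but are pruned because one of their 3-subsequences is not large.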

After the third pass, apriori-generate is called with L2 as argument to get C3. The candidates in C3 are shown in Fig. 11. We do not count C3, and hence do not generate L3. Next, apriori-generate is called with C3 to get C4, which after pruning, turns out to be the same C4 shown in the third column of Fig. 7. After counting C4 to get L4 (Fig. 9), we try generating C5, which turns out to be empty.

We then start the backward phase. Nothing gets deleted from L4 since there are no longer sequences. We had skipped counting the support for sequences in C3 in the forward phase. After deleting those sequences in C3 that are subsequences of sequences in L4, i.e., subsequences of ⟨1 2 3 4⟩, we are left with the sequences ⟨1 3 5⟩ and ⟨3 4 5⟩. These would be counted to get ⟨1 3 5⟩ as a maximal large 3-sequence. Next, all the sequences in L2 except ⟨4 5⟩ are deleted since they are contained in some longer sequence. For the same reason, all sequences in L1 are also deleted.

3.3  Algorithm DynamicSome

The DynamicSome algorithm is shown in Fig. 12. Like AprioriSome, we skip counting candidate sequences of certain lengths in the forward phase. The candidate sequences that are counted are determined by the variable step. In the

initialization phase, all the candidate sequences of length up to and including step are counted. Then in the forward phase, all sequences whose lengths are multiples of step are counted. Thus, with step set to 3, we will count sequences of lengths 1, 2, and 3 in the initialization phase, and 6, 9, 12, ... in the forward phase. We really wanted to count only sequences of lengths 3, 6, 9, 12, ... We can generate sequences of length 6 by joining sequences of length 3. We can generate sequences of length 9 by joining sequences of length 6 with sequences of length 3, etc. However, to generate the sequences of length 3, we need sequences of lengths 1 and 2, and hence the initialization phase.

As in AprioriSome, during the backward phase, we count sequences for the lengths we skipped over during the forward phase. However, unlike in AprioriSome, these candidate sequences were not generated in the forward phase. The intermediate phase generates them. Then the backward phase is identical to the one for AprioriSome. For example, assume that we count L3 and L6, and L9 turns out to be empty in the forward phase. We generate C7 and C8 (intermediate phase), and then count C8 followed by C7 after deleting non-maximal sequences (backward phase). This process is then repeated for C4 and C5. In the actual implementation, the intermediate phase is interspersed with the backward phase, but we have omitted this detail in Fig. 12 to simplify the exposition.

We use apriori-generate in the initialization and intermediate phases, but use otf-generate in the forward phase. The otf-generate procedure is given in Section 3.3.1.
   // step is an integer
   // Initialization Phase
   L1 = {large 1-sequences};   // Result of litemset phase
   for (k = 2; k <= step and Lk-1 ≠ ∅; k++) do
   begin
      Ck = New candidates generated from Lk-1;
      foreach customer-sequence c in DT do
         Increment the count of all candidates in Ck that are contained in c.
      Lk = Candidates in Ck with minimum support.
   end

   // Forward Phase
   for (k = step; Lk ≠ ∅; k += step) do
   begin
      // find Lk+step from Lk and Lstep
      Ck+step = ∅;
      foreach customer sequence c in DT do
      begin
         X = otf-generate(Lk, Lstep, c);   // See Section 3.3.1
         For each sequence x ∈ X, increment its count in Ck+step
         (adding it to Ck+step if necessary).
      end
      Lk+step = Candidates in Ck+step with minimum support.
   end

   // Intermediate Phase
   for (k--; k > 1; k--) do
      if (Lk not yet determined) then
         if (Lk-1 known) then
            Ck = New candidates generated from Lk-1;
         else
            Ck = New candidates generated from Ck-1;

   // Backward Phase: Same as that of AprioriSome

   Figure 12: Algorithm DynamicSome

The reason is that apriori-generate generates fewer candidates than otf-generate when we generate Ck+1 from Lk [2]. However, this may not hold when we try to find Lk+step from Lk and Lstep, as is the case in the forward phase. In addition, if the size of Lk and Lstep is less than the size of Ck+step generated by apriori-generate, it may be faster to find all members of Lk and Lstep contained in c than to find all members of Ck+step contained in c.

3.3.1  On-the-fly Candidate Generation

The otf-generate function takes as

arguments Lk, the set of large k-sequences, Lj, the set of large j-sequences, and the customer sequence c. It returns the set of candidate (k+j)-sequences contained in c. The intuition behind this generation procedure is that if sk ∈ Lk and sj ∈ Lj are both contained in c, and they don't overlap in c, then sk.sj is a candidate (k+j)-sequence. Let c be the sequence ⟨c1 c2 ... cn⟩. The implementation of this function is as shown below:

   // c is the sequence ⟨c1 c2 ... cn⟩
   Xk = subseq(Lk, c);
   forall sequences x ∈ Xk do
      x.end = min{ j | x is contained in ⟨c1 c2 ... cj⟩ };
   Xj = subseq(Lj, c);
   forall sequences x ∈ Xj do
      x.start = max{ j | x is contained in ⟨cj cj+1 ... cn⟩ };
   Answer = join of Xk with Xj, with the join condition Xk.end < Xj.start;

For example, consider L2 to be the set of 2-sequences in Fig. 9, and let otf-generate be called with parameters L2, L2 and the customer-sequence ⟨{1} {2} {3 7} {4}⟩. Thus c1 corresponds to {1}, c2 to {2}, etc. The end and start values for each sequence in L2 which is contained in c are shown in Fig. 13. Thus, the result of the join with the join condition X2.end < X2.start (where X2 denotes the set of sequences of length 2) is the single sequence ⟨1 2 3 4⟩.

3.3.2  Example

Continuing with our example of Section 3.1.2, consider a step of 2. In the initialization phase, we determine L2, shown in Fig. 9. Then, in the forward

phase, we get 2 candidate sequences in C4: ⟨1 2 3 4⟩ with a support of 2 and ⟨1 3 4 5⟩ with a support of 1. Out of these, only ⟨1 2 3 4⟩ is large. In the next pass, we find C6 to be

(The apriori-generate procedure in Section 3.1.1 needs to be generalized to generate Ck+j from Lk. Essentially, the join condition has to be changed to require equality of the first k - j terms, and the concatenation of the remaining terms.)
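The otf-generate procedure of Section 3.3.1 can be sketched in Python (an illustrative sketch: here a sequence is a tuple of frozensets of items and a customer sequence a list of sets, following the transformed representation of Fig. 5):

```python
def otf_generate(lk, lj, cust):
    """Candidate (k+j)-sequences contained in customer sequence cust:
    pair every x in lk with every y in lj such that both are contained
    in cust and x ends strictly before y starts."""
    def contained(seq, trans):
        pos = 0
        for itemset in seq:
            while pos < len(trans) and not itemset <= trans[pos]:
                pos += 1
            if pos == len(trans):
                return False
            pos += 1
        return True

    def end_of(x):    # min d such that x is contained in <c1 ... cd>
        return min(d for d in range(1, len(cust) + 1) if contained(x, cust[:d]))

    def start_of(y):  # max d such that y is contained in <cd ... cn>
        return max(d for d in range(1, len(cust) + 1) if contained(y, cust[d - 1:]))

    xk = [(x, end_of(x)) for x in lk if contained(x, cust)]
    xj = [(y, start_of(y)) for y in lj if contained(y, cust)]
    # join: x may be followed by y only if x ends before y starts
    return [x + y for x, e in xk for y, s in xj if e < s]
```

Called with L2, L2 and the customer sequence ⟨{1} {2} {3 7} {4}⟩ of the example above, it reproduces the end/start values of Fig. 13 and yields the single candidate ⟨1 2 3 4⟩.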
Page 9
    Sequence   End   Start
    ⟨1 2⟩       2      1
    ⟨1 3⟩       3      1
    ⟨1 4⟩       4      1
    ⟨2 3⟩       3      2
    ⟨2 4⟩       4      2
    ⟨3 4⟩       4      3

Figure 13: Start and End Values

empty. Now, in the intermediate phase, we generate C_5 from L_4, and C_3 from L_2. Since C_5 turns out to be empty, we count just C_3
during the backward phase to get L_3.

4 Performance

To assess the relative performance of the algorithms and study their scale-up properties, we performed several experiments on an IBM RS/6000 530H workstation with a CPU clock rate of 33 MHz, 64 MB of main memory, running AIX 3.2. The data resided in the AIX file system and was stored on a 2GB SCSI 3.5" drive, with measured sequential throughput of about 2 MB/second.

4.1 Generation of Synthetic Data

To evaluate the performance of the algorithms over a large range of data characteristics, we generated synthetic customer
transactions. These transactions mimic the transactions in the retailing environment. In our model of the "real" world, people buy sequences of sets of items. Each such sequence of itemsets is potentially a maximal large sequence. An example of such a sequence might be sheets and pillow cases, followed by a comforter, followed by shams and ruffles. However, some people may buy only some of the items from such a sequence. For instance, some people might buy only sheets and pillow cases followed by a comforter, and some only comforters. A customer-sequence may contain more than one such sequence. For example, a customer might place an order for a dress and jacket when ordering sheets and pillow cases, where the dress and jacket together form part of another sequence. Customer-sequence sizes are typically clustered around a mean, and a few customers may have many transactions. Similarly, transaction sizes are usually clustered around a mean, and a few transactions have many items.

The synthetic data generation program takes the parameters shown in Table 1.

    |D|   Number of customers (= size of Database)
    |C|   Average number of transactions per Customer
    |T|   Average number of items per Transaction
    |S|   Average length of maximal potentially large Sequences
    |I|   Average size of Itemsets in maximal potentially large sequences
    N_S   Number of maximal potentially large Sequences
    N_I   Number of maximal potentially large Itemsets
    N     Number of items

Table 1: Parameters

    Name                |C|   |T|   |S|   |I|    Size (MB)
    C10-T5-S4-I1.25     10    5     4     1.25   5.8
    C10-T5-S4-I2.5      10    5     4     2.5    6.0
    C20-T2.5-S4-I1.25   20    2.5   4     1.25   6.9
    C20-T2.5-S8-I1.25   20    2.5   8     1.25   7.8

Table 2: Parameter settings (Synthetic datasets)

We generated datasets by setting N_S = 5000, N_I = 25000 and N = 10000. The number of customers, |D|, was set to 250,000. Table 2 summarizes the dataset parameter settings. We refer the reader to [3] for the details of
the data generation program.

4.2 Relative Performance

Fig. 14 shows the relative execution times for the three algorithms for the datasets given in Table 2, as the minimum support is decreased from 1% to 0.2%. We have not plotted the execution times for DynamicSome for low values of minimum support, since it generated too many candidates and ran out of memory. Even if DynamicSome had more memory, the cost of finding the support for that many candidates would have ensured execution times much larger than those for AprioriAll or AprioriSome. As expected, the execution times of all the algorithms increase as the support is decreased, because of a large increase in the number of large sequences in the result.

DynamicSome performs worse than the other two algorithms, mainly because it generates and counts a much larger number of candidates in the forward phase. The difference in the number of candidates generated is due to the otf-generate candidate generation procedure it uses. The apriori-generate procedure does not count any candidate sequence that contains any subsequence which is not large; the otf-generate procedure does not have this pruning capability.

[Figure 14: Execution times — time (sec) vs. minimum support (0.75% down to 0.2%) for DynamicSome, AprioriAll, and AprioriSome, one panel per dataset: C10-T5-S4-I1.25, C10-T5-S4-I2.5, C20-T2.5-S4-I1.25, C20-T2.5-S8-I1.25.]

The major advantage of AprioriSome over AprioriAll is that it avoids counting many non-maximal sequences. However, this advantage is reduced for two reasons. First, candidates C_k in AprioriAll are generated using L_{k-1}, whereas AprioriSome sometimes uses C_{k-1} for this purpose. Since C_{k-1} is a superset of L_{k-1}, the number of candidates generated using AprioriSome can be larger. Second, although AprioriSome skips over counting candidates of some lengths, they are generated nonetheless and stay memory resident. If memory gets filled up, AprioriSome is forced to count the last set of candidates generated, even if the heuristic suggests skipping some more candidate sets. This effect decreases the skipping distance between the two candidate sets that are indeed counted, and AprioriSome starts behaving more like AprioriAll. For lower supports, there are longer large sequences, and hence more non-maximal sequences, and AprioriSome does better.

4.3 Scale-up

We will present in this section the results of scale-up experiments for the AprioriSome algorithm. We also performed the same experiments for AprioriAll, and found the results to be very similar. We do not report the AprioriAll
results to conserve space. We will present the scale-up results for some selected datasets; similar results were obtained for other datasets. Fig. 15 shows how AprioriSome scales up as the number of customers is increased ten times, from 250,000 to 2.5 million. (The scale-up graph for increasing the number of customers from 25,000 to 250,000 looks very similar.) We show the results for the
[Figure 15: Scale-up: Number of customers — relative time vs. number of customers (in 000s), from 250 to 2500, at minimum supports of 2%, 1%, and 0.5%.]

dataset C10-T2.5-S4-I1.25 with three levels of minimum
support. The size of the dataset for 2.5 million customers was 368 MB. The execution times are normalized with respect to the times for the 250,000-customer dataset. As shown, the execution times scale quite linearly.

Next, we investigated the scale-up as we increased the total number of items in a customer sequence. This increase was achieved in two different ways: (i) by increasing the average number of transactions per customer, keeping the average number of items per transaction the same; and (ii) by increasing the average number of items per transaction, keeping the average number of transactions per customer the same. The aim of this experiment was to see how our data structures scaled with the customer-sequence size, independent of other factors like the database size and the number of large sequences. We kept the size of the database roughly constant by keeping the product of the average customer-sequence size and the number of customers constant. We fixed the minimum support in terms of the number of transactions in this experiment. Fixing the minimum support as a percentage would have led to large increases in
the number of large sequences, and we wanted to keep the size of the answer set roughly the same. All the experiments had the large sequence length set to 4 and the large itemset size set to 1.25. The average transaction size was set to 2.5 in the first graph, while the number of transactions per customer was set to 10 in the second. The numbers in the key (e.g. 100) refer to the minimum support.

The results are shown in Fig. 16. As shown, the execution times usually increased with the customer-sequence size, but only gradually. The main reason for the increase was that, in spite of setting the minimum support in terms of the number of customers, the number of large sequences increased with increasing customer-sequence size. A secondary reason was that finding the candidates present in a customer sequence took a little more time. For a support level of 200, the execution time actually went down a little when the transaction size was increased. The reason for this decrease is that there is an overhead associated with reading a transaction. At a high level of support, this overhead comprises a significant part of the total execution time.
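The cost of finding which candidates are present in a customer sequence ultimately rests on the containment test for a single candidate. A minimal sketch of that test (our own illustrative code, not the paper's hash-tree-based implementation; itemsets are Python sets):

```python
def is_contained(seq, customer_seq):
    """Return True if `seq` (a list of itemsets) is contained in
    `customer_seq`: each itemset of seq must be a subset of a distinct
    transaction, and the matching transactions must appear in order."""
    i = 0
    for itemset in seq:
        # scan forward for the next transaction that covers this itemset
        while i < len(customer_seq) and not itemset <= customer_seq[i]:
            i += 1
        if i == len(customer_seq):
            return False
        i += 1
    return True
```

For example, ⟨(3) (4 5) (8)⟩ is contained in ⟨(7) (3 8) (9) (4 5 6) (8)⟩, but ⟨(3) (5)⟩ is not contained in ⟨(3 5)⟩, since the elements of a sequence must come from distinct transactions.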

Since this overhead decreases when the number of transactions decreases, the total execution time also decreases a little.

5 Conclusions and Future Work

We introduced a new problem of mining sequential patterns from a database of customer sales transactions, and presented three algorithms for solving this problem. Two of the algorithms, AprioriSome and AprioriAll, have comparable performance, although AprioriSome performs a little better for the lower values of the minimum number of customers that must support a sequential pattern. Scale-up experiments show that both AprioriSome and
AprioriAll scale linearly with the number of customer transactions. They also have excellent scale-up properties with respect to the number of transactions in a customer sequence and the number of items in a transaction.

In some applications, the user may want to know the ratio of the number of people who bought the first k + 1 items in a sequence to the number of people who bought the first k items, for 0 < k < length of the sequence. In this case, we will have to make an additional pass over the data to get counts for all prefixes of large sequences if we were using the AprioriSome
algorithm. With the AprioriAll algorithm, we already have these counts. In such applications, therefore, AprioriAll will become the preferred algorithm. These algorithms have been implemented on several data repositories, including the AIX file system and DB2/6000, as part of the Quest project, and have been run against data from several sources. In the future, we plan to extend this work along the following lines:

- Extension of the algorithms to discover sequential patterns across item categories. An example of such a category hierarchy is that a dishwasher is a kitchen appliance, which in turn is a heavy electric appliance, etc.
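One way to realize the category extension just described is to let a pattern element match any bought item of which it is an ancestor in an is-a taxonomy. The sketch below is purely illustrative: the taxonomy, function names, and matching rule are our assumptions, not part of the paper:

```python
# "is-a" edges: child -> parent (hypothetical example taxonomy)
TAXONOMY = {
    "dishwasher": "kitchen appliance",
    "kitchen appliance": "heavy electric appliance",
}

def ancestors(item):
    """Return the item together with all its ancestors in the taxonomy."""
    result = {item}
    while item in TAXONOMY:
        item = TAXONOMY[item]
        result.add(item)
    return result

def matches(pattern_item, bought_item):
    """A bought item matches a pattern element if the element is the
    item itself or one of its categories."""
    return pattern_item in ancestors(bought_item)
```

With this rule, a pattern over the category "kitchen appliance" would be supported by customers who bought a dishwasher, but not vice versa.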
[Figure 16: Scale-up: Number of Items per Customer — two panels of relative time, one vs. number of transactions per customer (10 to 50), one vs. transaction size (2.5 to 12.5), each at minimum supports of 200, 100, and 50.]

- Transposition of constraints into the discovery algorithms. There could be item constraints (e.g. sequential patterns involving home appliances) or time constraints (e.g. the elements of the patterns should come from transactions that are at least x and at most y days apart).

References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining
association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, pages 207-216, Washington, D.C., May 1993.

[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the VLDB Conference, Santiago, Chile, September 1994. Expanded version available as IBM Research Report RJ9839, June 1994.

[3] R. Agrawal and R. Srikant. Mining sequential patterns. Research Report RJ 9910, IBM Almaden Research Center, San Jose, California, October 1994.

[4] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. A basic local alignment search tool. Journal of Molecular Biology, 1990.

[5] A. Califano and I. Rigoutsos. FLASH: A fast look-up algorithm for string homology. In Proc. of the 1st International Conference on Intelligent Systems for Molecular Biology, Bethesda, MD, July 1993.

[6] T. G. Dietterich and R. S. Michalski. Discovering patterns in sequences of events. Artificial Intelligence, 25:187-232, 1985.

[7] L. Hui. Color set size problem with applications to string matching. In A. Apostolico, M. Crochemore, Z. Galil, and U. Manber, editors,
Combinatorial Pattern Matching, LNCS 644, pages 230-243. Springer-Verlag, 1992.

[8] M. Roytberg. A search for common patterns in many sequences. Computer Applications in the Biosciences, 8(1):57-64, 1992.

[9] M. Vingron and P. Argos. A fast and sensitive multiple sequence alignment algorithm. Computer Applications in the Biosciences, 5:115-122, 1989.

[10] J. T.-L. Wang, G.-W. Chirn, T. G. Marr, B. Shapiro, D. Shasha, and K. Zhang. Combinatorial pattern discovery for scientific data: Some preliminary results. In Proc. of the ACM SIGMOD Conference on Management of Data,
Minneapolis, May 1994.

[11] M. Waterman, editor. Mathematical Methods for DNA Sequence Analysis. CRC Press, 1989.

[12] S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83-91, October 1992.