
Slide 1

Mining Frequent Patterns II: Mining Sequential & Navigational Patterns

Bamshad Mobasher

DePaul University

Slide 2

Sequential pattern mining

Association rule mining does not consider the order of transactions.

In many applications such orderings are significant. For example, in market basket analysis it is useful to know whether people buy certain items in sequence, e.g., a bed first and bed sheets some time later. In Web usage mining, it is useful to find navigational patterns of users in a Web site from their sequences of page visits.

Slide 3

Sequential Patterns

Extending Frequent Itemsets

Sequential patterns add an extra dimension to frequent itemsets and association rules: time.

Items can appear before, after, or at the same time as each other.

General form: “x% of the time, when A appears in a transaction, B appears within z transactions.”

Note that other items may appear between A and B, so sequential patterns do not necessarily imply consecutive appearances of items (in terms of time).

Examples:

Renting "Star Wars", then "Empire Strikes Back", then "Return of the Jedi", in that order

A collection of ordered events within an interval

Most sequential pattern discovery algorithms are based on extensions of the Apriori algorithm for discovering itemsets.

Navigational Patterns

These can be viewed as a special form of sequential patterns which capture navigational patterns among users of a site. In this case, a session is a consecutive sequence of pageview references for a user over a specified period of time.

Slide 4

Objective

Given a set S of input data sequences (or a sequence database), the problem of mining sequential patterns is to find all the sequences that have a user-specified minimum support.

Each such sequence is called a frequent sequence, or a sequential pattern.

The support for a sequence is the fraction of the total data sequences in S that contain this sequence.

Slide 5

Sequence Databases

A sequence database consists of an ordered list of elements or events. Each element can be a set of items or a single item (a singleton set).

Transaction databases vs. sequence databases

A sequence database:

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

A transaction database:

TID   itemsets
10    a, b, d
20    a, c, d
30    a, d, e
40    b, e, f

Elements in (…) are sets.

Slide 6

Subsequence vs. Supersequence

A sequence is an ordered list of events, denoted <e1 e2 … el>.

Given two sequences α = <a1 a2 … an> and β = <b1 b2 … bm>, α is called a subsequence of β, denoted α ⊆ β, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn.

Examples:

<(ab), d> is a subsequence of <(abc), (de)>

<3, (4, 5), 8> is contained in (i.e., is a subsequence of) <6, (3, 7), 9, (4, 5, 8), (3, 8)>

<a.html, c.html, f.html> ⊆ <a.html, b.html, c.html, d.html, e.html, f.html, g.html>
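To make the definition concrete, here is a minimal Python sketch (my own illustration, not from the slides) that tests subsequence containment for sequences of itemsets:

```python
def is_subsequence(alpha, beta):
    """Test whether sequence alpha is a subsequence of sequence beta.

    Sequences are lists of itemsets (Python sets); alpha is a
    subsequence of beta if its elements can be matched, in order, to
    elements of beta such that each a_i is a subset of the matched b_j.
    Greedy earliest matching is sufficient for this test.
    """
    j = 0  # current search position in beta
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1  # skip elements of beta that cannot host a
        if j == len(beta):
            return False
        j += 1  # the next element must match strictly later
    return True

# Examples from the slide:
print(is_subsequence([{"a", "b"}, {"d"}],
                     [{"a", "b", "c"}, {"d", "e"}]))          # True
print(is_subsequence([{3}, {4, 5}, {8}],
                     [{6}, {3, 7}, {9}, {4, 5, 8}, {3, 8}]))  # True
```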

Slide 7

What Is Sequential Pattern Mining?

Given a set of sequences and a support threshold, find the complete set of frequent subsequences.

A sequence: <(ef) (ab) (df) c b>

An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.

Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.

A sequence database:

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
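As a sanity check, the following sketch (an illustration, not from the slides) counts the support of a candidate over the database above, reusing the is_subsequence function defined earlier:

```python
# Reuses is_subsequence() from the earlier sketch.
db = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],  # SID 10
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],              # SID 20
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],       # SID 30
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],          # SID 40
]

def support(candidate, db):
    """Number of data sequences in db that contain the candidate."""
    return sum(1 for seq in db if is_subsequence(candidate, seq))

# <(ab)c> is contained in sequences 10 and 30, so with min_sup = 2
# it is a sequential pattern.
print(support([{"a", "b"}, {"c"}], db))  # 2
```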

Slide 8

Another Example

Transactions Sorted by Customer ID (table shown on the slide, not reproduced in the transcript)

Slide 9

Example (continued)

Sequences produced from transactions and the final sequential patterns (tables shown on the slide, not reproduced in the transcript)

Slide 10

GSP mining algorithm

Very similar to the Apriori algorithm.
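The slide leaves the algorithm at this one line; below is a hedged sketch of that Apriori-style, level-wise outline, simplified to sequences of single items (so it omits GSP's itemset elements, candidate pruning, and time constraints; all names are my own):

```python
def gsp_simplified(db, min_sup):
    """Level-wise (Apriori-style) mining of frequent item sequences.

    Generate length-k candidates by extending frequent (k-1)-sequences,
    then keep those meeting min_sup. A simplification of GSP's outline,
    not the full algorithm.
    """
    def contains(cand, seq):
        # Is cand a (not necessarily contiguous) subsequence of seq?
        j = 0
        for item in cand:
            while j < len(seq) and seq[j] != item:
                j += 1
            if j == len(seq):
                return False
            j += 1
        return True

    def support(cand):
        return sum(1 for seq in db if contains(cand, seq))

    items = sorted({x for seq in db for x in seq})
    frequent = [(i,) for i in items if support((i,)) >= min_sup]
    patterns = list(frequent)
    while frequent:
        # Join step: extend each frequent (k-1)-sequence by one item.
        candidates = {f + (i,) for f in frequent for i in items}
        frequent = [c for c in sorted(candidates) if support(c) >= min_sup]
        patterns.extend(frequent)
    return patterns

sessions = [list("abc"), list("abc"), list("abd"), list("acd")]
print(gsp_simplified(sessions, min_sup=2))
```

Real GSP generates length-k candidates by joining frequent (k-1)-sequences and prunes any candidate with an infrequent subsequence, exactly as Apriori does with itemsets.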

Slide 11

Sequential Pattern Mining Algorithms

Apriori-based method: GSP (Generalized Sequential Patterns; Srikant & Agrawal, 1996)

Pattern-growth methods: FreeSpan & PrefixSpan (Han et al., 2000; Pei et al., 2001)

Vertical format-based mining: SPADE (Zaki, 2000)

Constraint-based sequential pattern mining: SPIRIT (Garofalakis et al., 1999; Pei et al., 2002)

Mining closed sequential patterns: CloSpan (Yan, Han & Afshar, 2003)

From: J. Han and M. Kamber, Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanj

Slide 12

Mining Navigation Patterns

Each session induces a user trail through the site

A trail is a sequence of web pages followed by a user during a session, ordered by time of access.

A sequential pattern in this context is a frequent trail.

Sequential pattern mining can help identify common navigational sequences, which in turn helps in understanding common user behavioral patterns.

If the goal is to make predictions about future user actions based on past behavior, approaches such as Markov models (e.g., Markov Chains) can be used.

Slide 13

Mining Navigational Patterns

Another Approach: Markov Chains

The idea is to model the navigational sequences through the site as a state-transition diagram without cycles (a directed acyclic graph).

A Markov Chain consists of a set of states (pages or pageviews in the site) S = {s1, s2, …, sn} and a set of transition probabilities P = {p1,1, …, p1,n, p2,1, …, p2,n, …, pn,1, …, pn,n}.

A path r from a state si to a state sj is a sequence of states in which the transition probabilities for all consecutive states are greater than 0.

The probability Pr(r) of reaching a state sj from a state si via a path r is the product of all the transition probabilities along the path.

The probability of reaching sj from si is the sum over all paths r from si to sj: Pr(sj | si) = Σr Pr(r).
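As a small illustration (my own, not the slides'), both quantities can be computed from a transition-probability table; the helper names and the toy probabilities below are assumptions:

```python
def path_probability(path, P):
    """Probability of one specific path: the product of the
    transition probabilities of its consecutive state pairs."""
    prob = 1.0
    for s, t in zip(path, path[1:]):
        prob *= P.get((s, t), 0.0)
    return prob

def reach_probability(si, sj, P):
    """Probability of reaching sj from si: the sum of path
    probabilities over all paths (finite, since the chain is
    assumed acyclic, as on the slide)."""
    if si == sj:
        return 1.0
    return sum(p * reach_probability(t, sj, P)
               for (s, t), p in P.items() if s == si)

# Hypothetical acyclic chain with made-up probabilities.
P = {("Home", "Search"): 0.4, ("Home", "Cat"): 0.6,
     ("Search", "PD"): 0.5, ("Cat", "PD"): 0.3}
print(path_probability(["Home", "Search", "PD"], P))   # 0.2
print(round(reach_probability("Home", "PD", P), 2))    # 0.38
```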

Slide 14

Construct Markov Chain from Web Navigational Data

Add a unique start state

The start state has a transition to the first page in each session (representing the start of a session).

Alternatively, it could have a transition to every state, assuming that every page can potentially be the start of a session.

Add a unique final state

The last page in each trail has a transition to the final state (representing the end of the session).

The transition probabilities are obtained by counting click-throughs.

The Markov chain built is called absorbing, since we always end up in the final state.
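A minimal sketch of this construction (mine, not the slides'), assuming sessions arrive as lists of page names; build_markov_chain is a hypothetical helper name:

```python
from collections import Counter

def build_markov_chain(sessions):
    """Build an absorbing Markov chain from session data.

    Wraps each session in unique 'Start' and 'Final' states, counts
    click-through transitions, and normalizes the counts into
    transition probabilities.
    """
    counts, outgoing = Counter(), Counter()
    for session in sessions:
        trail = ["Start"] + list(session) + ["Final"]
        for s, t in zip(trail, trail[1:]):
            counts[(s, t)] += 1
            outgoing[s] += 1
    return {(s, t): c / outgoing[s] for (s, t), c in counts.items()}

sessions = [["A", "B"], ["A", "B", "C"], ["B", "C"]]
P = build_markov_chain(sessions)
print(round(P[("Start", "A")], 2))  # 0.67
print(round(P[("B", "C")], 2))      # 0.67
```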

Slide 15

A Hypothetical Markov Chain

What is the probability that a user who visits the Home page purchases a product?

Home -> Search -> PD -> $ = 1/3 * 1/2 * 1/2 = 1/12 = 0.083
Home -> Cat -> PD -> $ = 1/3 * 1/3 * 1/2 = 1/18 = 0.056
Home -> Cat -> $ = 1/3 * 1/3 = 1/9 = 0.111
Home -> RS -> PD -> $ = 1/3 * 2/3 * 1/2 = 1/9 = 0.111

Sum = 0.361

(An example Markov Chain is shown on the slide.)
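These four path products can be checked mechanically; the transition probabilities below are read off the slide's own arithmetic, since the diagram itself is not in the transcript:

```python
from fractions import Fraction as F

# Transition probabilities implied by the slide's calculation.
P = {("Home", "Search"): F(1, 3), ("Home", "Cat"): F(1, 3),
     ("Home", "RS"): F(1, 3), ("Search", "PD"): F(1, 2),
     ("Cat", "PD"): F(1, 3), ("Cat", "$"): F(1, 3),
     ("RS", "PD"): F(2, 3), ("PD", "$"): F(1, 2)}

paths = [["Home", "Search", "PD", "$"], ["Home", "Cat", "PD", "$"],
         ["Home", "Cat", "$"], ["Home", "RS", "PD", "$"]]

total = F(0)
for path in paths:
    p = F(1)
    for s, t in zip(path, path[1:]):
        p *= P[(s, t)]  # multiply transition probabilities along the path
    print(" -> ".join(path), "=", p, "=", round(float(p), 3))
    total += p
print("Sum =", round(float(total), 3))  # 0.361
```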

Slide 16

Markov Chain Example

(Diagram on slide: Web site hyperlink graph over pages A, B, C, D, E.)

Sessions:
A, B
A, B
A, B, C
A, B, C
A, B, C, D
A, B, C, E
A, C, E
A, C, E
A, B, D
A, B, D
A, B, D, E
B, C
B, C
B, C, D
B, C, E
B, D, E

Calculating conditional probabilities for transitions, e.g., transition B -> C:
Total occurrences of B: 14
Total occurrences of B -> C: 8
Pr(C|B) = 8/14 = 0.57
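This estimate can be reproduced with a few lines of Python (an illustration, using the sessions listed above):

```python
sessions = [
    ["A", "B"], ["A", "B"], ["A", "B", "C"], ["A", "B", "C"],
    ["A", "B", "C", "D"], ["A", "B", "C", "E"], ["A", "C", "E"],
    ["A", "C", "E"], ["A", "B", "D"], ["A", "B", "D"],
    ["A", "B", "D", "E"], ["B", "C"], ["B", "C"],
    ["B", "C", "D"], ["B", "C", "E"], ["B", "D", "E"],
]

# Count B -> C transitions and all occurrences of B.
bc = sum(1 for s in sessions
         for x, y in zip(s, s[1:]) if (x, y) == ("B", "C"))
b = sum(s.count("B") for s in sessions)
print(f"Pr(C|B) = {bc}/{b} = {bc / b:.2f}")  # Pr(C|B) = 8/14 = 0.57
```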

Slide 17

Markov Chain Example (cont.)

The Full Markov Chain

Sessions: the same sixteen sessions as on the previous slide.

(Diagram on slide: the full chain over states Start, A, B, C, D, E, Final, with transition probabilities 0.69, 0.31, 0.82, 0.18, 0.57, 0.21, 0.20, 0.40, 0.33, 0.67, 1.00, 0.14, 0.40 on its edges.)

Probability that someone will visit page C? Paths SBC + SAC + SABC:
(0.31 * 0.57) + (0.69 * 0.18) + (0.69 * 0.82 * 0.57) = 0.503

Probability that someone who has visited B will visit E? Paths BDE + BCE + BCDE:
(0.21 * 0.33) + (0.57 * 0.40) + (0.57 * 0.20 * 0.33) = 0.335

Probability that someone visiting page C will leave the site? 0.40 = 40%

Slide 18

Mining Frequent Trails Using Markov Chains

Support s in [0, 1): accept only trails whose initial probability is above s.

Confidence c in [0, 1): accept only trails whose probability is above c.

Recall: the probability of a trail is obtained by multiplying the transition probabilities of the links in the trail.

Mining for Patterns

Find all trails whose initial probability is higher than s and whose trail probability is above c. Use depth-first search on the Markov chain to compute the trails (a sketch follows below). The average time needed to find the frequent trails is proportional to the number of web pages in the site.
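One way such a depth-first enumeration might look (my own sketch, reusing the build_markov_chain helper from the earlier example; it assumes the chain is acyclic, as stated on the earlier slide):

```python
def frequent_trails(P, s, c, start="Start", final="Final"):
    """Enumerate trails by depth-first search over the chain.

    A trail qualifies if its first transition out of the start state
    has probability above s (support) and the product of its own
    transition probabilities stays above c (confidence).
    """
    succ = {}
    for (u, v), p in P.items():
        succ.setdefault(u, []).append((v, p))

    results = []

    def dfs(page, trail, prob):
        if prob <= c:
            return  # extensions can only lower the probability
        if len(trail) > 1:
            results.append((trail, prob))
        for nxt, p in succ.get(page, []):
            if nxt != final:
                dfs(nxt, trail + [nxt], prob * p)

    for first, p0 in succ.get(start, []):
        if p0 > s and first != final:
            dfs(first, [first], 1.0)
    return results

sessions = [["A", "B"], ["A", "B", "C"], ["B", "C"]]
P = build_markov_chain(sessions)  # from the earlier sketch
for trail, prob in frequent_trails(P, s=0.1, c=0.3):
    print(" > ".join(trail), round(prob, 2))
```

Because the chain is acyclic and probabilities only shrink as trails grow, pruning as soon as the running probability drops to c keeps the search shallow.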

Slide 19

Markov Chains: Another Example

ID   Session Trail
1    A1 > A2 > A3
2    A1 > A2 > A3
3    A1 > A2 > A3 > A4
4    A5 > A2 > A4
5    A5 > A2 > A4 > A6
6    A5 > A2 > A3 > A6

Slide 20

Frequent Trails From Example (Support = 0.1 and Confidence = 0.3)

Trail           Probability
A1 > A2 > A3    0.67
A5 > A2 > A3    0.67
A2 > A3         0.67
A1 > A2 > A4    0.33
A5 > A2 > A4    0.33
A2 > A4         0.33
A4 > A6         0.33
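These values follow from transition probabilities estimated over the six trails; a short script (an illustration, with helper names of my own choosing) reproduces them:

```python
from collections import Counter

trails = [["A1", "A2", "A3"], ["A1", "A2", "A3"],
          ["A1", "A2", "A3", "A4"], ["A5", "A2", "A4"],
          ["A5", "A2", "A4", "A6"], ["A5", "A2", "A3", "A6"]]

# Transition counts and state occurrence counts.
pair = Counter((x, y) for t in trails for x, y in zip(t, t[1:]))
occ = Counter(x for t in trails for x in t)

def trail_prob(trail):
    """Product of estimated transition probabilities along the trail."""
    p = 1.0
    for x, y in zip(trail, trail[1:]):
        p *= pair[(x, y)] / occ[x]
    return p

print(round(trail_prob(["A1", "A2", "A3"]), 2))  # 0.67
print(round(trail_prob(["A2", "A4"]), 2))        # 0.33
```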

Slide 21

Frequent Trails From Example (Support = 0.1 and Confidence = 0.5)

Trail           Probability
A1 > A2 > A3    0.67
A5 > A2 > A3    0.67
A2 > A3         0.67

Slide 22

Efficient Management of Navigational Trails

Approach: store sessions in an aggregated sequence tree, initially introduced in the Web Utilization Miner (WUM) (Spiliopoulou, 1998).

For each occurrence of a sequence, start a new branch or increase the frequency counts of the matching nodes.

In the example on the slide, note that s6 contains "b" twice; hence the sequence is <(b,1),(d,1),(b,2),(e,1)>.
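A compact way to picture the aggregated sequence tree (my own sketch; it ignores WUM's occurrence-numbering detail just mentioned) is a trie whose nodes carry frequency counts:

```python
def build_aggregate_tree(sessions):
    """Build an aggregated sequence tree: a trie with counts.

    Each node is {"count": n, "children": {page: node}}. Inserting a
    session either extends an existing branch (incrementing the
    counts of matching nodes) or starts a new branch.
    """
    root = {"count": 0, "children": {}}
    for session in sessions:
        root["count"] += 1
        node = root
        for page in session:
            child = node["children"].setdefault(
                page, {"count": 0, "children": {}})
            child["count"] += 1
            node = child
    return root

sessions = [["a", "b", "e"], ["a", "b", "e", "f"], ["a", "c"], ["b", "d"]]
tree = build_aggregate_tree(sessions)
print(tree["children"]["a"]["count"])                   # 3
print(tree["children"]["a"]["children"]["b"]["count"])  # 2
```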

Slide 23

Mining Navigational Patterns

The aggregated sequence tree can be used directly to determine support and confidence for navigational patterns.

Navigation pattern: a → b
Support = 11/35 = 0.31
Confidence = 11/21 = 0.52

Navigation pattern: a → b → e
Support = 11/35 = 0.31
Confidence = 11/11 = 1.00

Navigation pattern: a → b → e → f
Support = 3/35 = 0.086
Confidence = 3/11 = 0.27

Support = count at the node / count at the root
Confidence = count at the node / count at the parent

Note that each node represents a navigational path ending in that node.
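With the build_aggregate_tree sketch above, these ratios can be read straight off the node counts (the 11/35-style numbers come from the slide's own tree, which is not reproduced in the transcript; pattern_stats is a hypothetical helper):

```python
def pattern_stats(tree, pattern):
    """Support and confidence of a navigational pattern from the tree.

    Support = count at the pattern's end node / count at the root.
    Confidence = count at the end node / count at its parent.
    """
    node, parent = tree, None
    for page in pattern:
        parent, node = node, node["children"].get(page)
        if node is None:
            return 0.0, 0.0  # pattern never occurs
    return node["count"] / tree["count"], node["count"] / parent["count"]

sessions = [["a", "b", "e"], ["a", "b", "e", "f"], ["a", "c"], ["b", "d"]]
tree = build_aggregate_tree(sessions)
print(pattern_stats(tree, ["a", "b"]))       # (0.5, 0.67): support, confidence
print(pattern_stats(tree, ["a", "b", "e"]))  # (0.5, 1.0)
```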