
Web Mining: An Overview - PowerPoint Presentation

tatiana-dople
Uploaded On 2018-11-12



Presentation Transcript

Slide1

Web Mining: An Overview

CSC 575 Intelligent Information Retrieval

Slide2

Web Mining

Today

Overview of Web Data Mining

Web Content Mining / Text Mining

Web Usage Mining

Web Personalization

Slide3

What Is Data Mining

Data Mining: A Definition

The non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from data in large data repositories

Non-trivial: obvious knowledge is not useful

Implicit: hidden, difficult-to-observe knowledge

Previously unknown

Potentially useful: actionable; easy to understand

Slide4

The Knowledge Discovery in Data (KDD)

Viewed as a Process

Slide5

What Can Data Mining Do

Many Data Mining Tasks

often inter-related

often need to try different techniques for each task

each task may require different types of knowledge discovery

Typical data mining tasks

Classification

Prediction

Clustering

Association Discovery

Sequence Analysis

Characterization

Discrimination

Slide6

What is Web Mining

From its very beginning, the potential of extracting valuable knowledge from the Web has been quite evident.

Web mining is the collection of technologies to fulfill this potential.

Web Mining Definition

The application of data mining and machine learning techniques to extract useful knowledge from the content, structure, and usage of Web resources.

Slide7

Types of Web Mining

Web Mining

Web Content Mining

Web Structure Mining

Web Usage Mining

Slide8

Types of Web Mining: Web Content Mining

Extracting useful knowledge from the contents of Web documents or other semantic information about Web resources

Slide9

Types of Web Mining: Web Content Mining

Content data may consist of text, images, audio, video, structured records from lists and tables, or item attributes from backend databases.

Slide10

Types of Web Mining: Web Content Mining

Applications:

document clustering or categorization

topic identification / tracking

concept discovery

focused crawling

content-based personalization

intelligent search tools

Slide11

Types of Web Mining: Web Usage Mining

Extracting interesting patterns from user interactions with resources on one or more Web sites

Slide12

Types of Web Mining: Web Usage Mining

Applications:

user and customer behavior modeling

Web site optimization

e-customer relationship management

Web marketing

targeted advertising

recommender systems

Slide13

Types of Web Mining: Web Structure Mining

Discovering useful patterns from the hyperlink structure connecting Web sites or Web resources

Slide14

Types of Web Mining: Web Structure Mining

Data sources include the explicit hyperlinks between documents, or implicit links among objects (e.g., two objects being “tagged” using the same keyword).

Slide15

Types of Web Mining: Web Structure Mining

Applications:

document retrieval and ranking (e.g., Google)

discovery of “hubs” and “authorities”

discovery of Web communities

social network analysis

Slide16

Web Content Mining :: data preparation

Typical steps in content preprocessing

Extract text and meta data from Web documents (generally performed automatically using a Web crawler)

Recognize special entities (e.g., dates) and pre-defined keywords

Remove stop words and non-relevant terms

Perform stemming and morphological analysis

Compute statistics based on term occurrences

document frequency (DF): number of documents with the term occurrence

term frequency (TF): frequency of occurrence within a specific document
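The two statistics above can be illustrated with a minimal sketch. The documents below are hypothetical and assumed to be already tokenized, with stop words removed; nothing here comes from the original slides beyond the TF and DF definitions.

```python
from collections import Counter

# Hypothetical, already-preprocessed documents (tokenized, stop words removed).
docs = {
    "A": ["web", "mining", "web", "data"],
    "B": ["data", "retrieval", "information"],
    "C": ["web", "information", "information"],
}

# Term frequency (TF): occurrences of each term within one document.
tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

# Document frequency (DF): number of documents containing each term.
# set(tokens) ensures a term is counted at most once per document.
df = Counter(term for tokens in docs.values() for term in set(tokens))
```

Here `tf["A"]["web"]` counts occurrences inside document A, while `df["web"]` counts how many documents mention the term at all.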

Additional considerations

For entities such as products, movies, songs, etc., may need to extract or obtain structured information such as item attributes from databases or available domain ontologies

It may be desirable to identify or discover phrases, facets, collocations, etc. (in order to treat commonly occurring groups of features as a single term).

Slide17

Web Content Mining :: data representation

Vector Representation

Typically, each document is represented as a multi-dimensional vector over all terms extracted in the preprocessing step

dimension values represent weights associated with terms in the document

Term weights may be binary or may be derived as a function of term frequency (TF) and document frequency (DF)

In some applications, only a limited number of terms (a controlled vocabulary) is maintained, and the weights may be assigned manually or based on external criteria
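The slides do not say which function of TF and DF is used. One widely used choice, shown here only as an assumption, is tf-idf weighting, where $N$ is the total number of documents, $\mathrm{tf}_{t,d}$ is the frequency of term $t$ in document $d$, and $\mathrm{df}_t$ is the number of documents containing $t$:

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t}
```

Under this scheme a term that occurs in every document gets weight $\log 1 = 0$, so it contributes nothing to distinguishing documents.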

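[Table: term-document matrix. Columns are documents / objects A to E; rows are terms / attributes. Values are listed per row in column order; empty cells are not preserved in this transcript, so column positions cannot be recovered.]

web: 3, 2, 1, 1

data: 4, 1, 4

mining: 2, 1

business: 3, 5

intelligence: 3, 1

marketing: 2, 1, 1, 1

information: 1, 5, 2, 4

retrieval: 6, 1, 3

Slide18

Web Content Mining :: common approaches and applications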

Basic notion: document similarity

Most Web content mining and information retrieval applications involve measuring similarity among two or more documents

Vector representation facilitates similarity computations using vector-space operations (such as Cosine of the angle between two vectors)
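As a minimal sketch of the cosine measure mentioned above: the vectors are hypothetical term weights, and returning 0.0 for an all-zero vector is a convention chosen for this example rather than something the slides specify.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical weights over the terms (web, data, mining).
doc_1 = [3, 4, 0]
doc_2 = [6, 8, 0]   # same direction as doc_1, twice the length
doc_3 = [0, 0, 2]   # shares no terms with doc_1
```

Because cosine depends only on direction, `doc_1` and `doc_2` are maximally similar even though one is twice as long, which is why the measure is robust to document length.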

Examples

Search engines:

measure the similarity between a query (represented as a vector) and the indexed document vectors to return a ranked list of relevant documents

Document clustering:

group documents based on similarity or dissimilarity (distance) among them

Document categorization:

measure the similarity of a new document to be classified with representations of existing categories (such as the mean vector representing a group of document vectors)

Personalization:

recommend documents or items based on their similarity to a representation of the user’s profile (may be a term vector representing concepts or terms of interest to the user)

Slide19

Web Content Mining :: example – clustered search results

Can drill down within clusters to view sub-topics or to view the relevant subset of results

Slide20

Web Content Mining :: example – personalized content delivery

Google's personalized news is an example of a content-based recommender system which recommends items (in part) based on the similarity of their content to a user’s profile (gathered from search and click history)

Slide21

Web Structure Mining :: graph structures on the Web

The structure of a typical Web graph

Web pages as nodes

hyperlinks as edges connecting two related pages

Hyperlink Analysis

Hyperlinks can serve as a tool for pure navigation

But often they are used to point to pages with authority on the same topic as the source page (similar to a citation in a publication)

Some interesting Web structures *

Slide22

Web Structure Mining :: example – Google’s PageRank algorithm

Basic idea:

Rank of a page depends on the ranks of pages pointing to it

The out-degree of a page is the number of edges pointing away from it; it is used to compute the contribution of the page to those to which it points

The final PageRank value represents the probability that a random surfer will reach the page

d is the probability that a random surfer chooses the page directly rather than getting there via navigation
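The idea can be sketched as a simple power iteration. This is an illustrative version, not Google's implementation: it follows the convention above in which `d` is the probability of a direct jump (many references instead call the link-following probability the damping factor), and the value 0.15, the iteration count, the handling of pages without out-links, and the graph itself are all assumptions.

```python
def pagerank(links, d=0.15, iterations=50):
    """links maps each page to the list of pages it points to.
    Every page must appear as a key. d is the direct-jump probability."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the direct-jump share first.
        new_rank = {p: d / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Split this page's rank evenly over its out-edges (out-degree).
                share = (1 - d) * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no out-links spreads its rank uniformly.
                for p in pages:
                    new_rank[p] += (1 - d) * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
```

In this toy graph, page C is pointed to by three pages and ends up with the highest rank, while D, which nothing links to, keeps only the direct-jump share.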

Illustration of PageRank propagation

Slide23

Web Structure Mining :: example – Hubs and Authorities

Basic idea

Authority comes from in-edges

Being a hub comes from out-edges

Mutually reinforcing relationship

A good authority is a page that is pointed to by many good hubs.

A good hub is a page that points to many good authorities.

Together they tend to form a bipartite graph

This idea can be used to discover authoritative pages related to a topic

HITS algorithm – Hypertext Induced Topic Search
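The mutually reinforcing updates can be sketched as follows. This is a simplified illustration of the HITS iteration on a hypothetical graph; the fixed iteration count and the Euclidean normalization are choices made for this example.

```python
import math

def hits(links, iterations=20):
    """links maps each page to the pages it points to (all pages appear as keys)."""
    hub = {p: 1.0 for p in links}
    auth = {p: 1.0 for p in links}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages pointing in.
        auth = {p: 0.0 for p in links}
        for page, targets in links.items():
            for t in targets:
                auth[t] += hub[page]
        # Hub score: sum of authority scores of the pages pointed to.
        hub = {p: sum(auth[t] for t in links[p]) for p in links}
        # Normalize so the scores stay bounded across iterations.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: h1 and h2 link out, a1 and a2 are linked to.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(graph)
```

Page `a1` is pointed to by both hubs and so scores higher as an authority than `a2`; `h1` points to both authorities and so scores higher as a hub than `h2`.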

[Figure labels: Hubs, Authorities]

Slide24

Web Structure Mining :: example – Web communities

Basic idea

Web communities are collections of Web pages such that each member node has more hyperlinks (in either direction) within the community than outside the community.

Typical approach: Maximal-flow model *

Ex: separate the two subgraphs with any choice of source node (left subgraph) and sink node (right subgraph), removing the three dashed links

[Figure labels: Community 1, Community 2, source node, sink]

* Source: G. Flake, et al. “Self-Organization and Identification of Web Communities”, IEEE Computer, Vol. 35, No. 3, pp. 66-71, March 2002.

Slide25

Web Usage Mining

The Problem: analyze Web navigational data to

Find how the Web site is used by Web users

Understand the behavior of different user segments

Predict how users will behave in the future

Target relevant or interesting information to individuals or groups of users

Increase sales, profit, loyalty, etc.

Challenge

Quantitatively capture Web users’ common interests and characterize their underlying tasks

Slide26

Applications of Web Usage Mining

Electronic Commerce

design cross marketing strategies across products

evaluate promotional campaigns

target electronic ads and coupons at user groups based on their access patterns

predict user behavior based on previously learned rules and users’ profiles

present dynamic information to users based on their interests and profiles: “Web personalization”

Effective and Efficient Web Presence

determine the best way to structure the Web site

identify “weak links” for elimination or enhancement

prefetch files that are most likely to be accessed

enhance workgroup management & communication

Search Engines

Behavior-based ranking

Slide27

Behavior-based ranking

For each query Q, keep track of which docs in the results are clicked on

On subsequent requests for Q, re-order docs in results based on click-throughs.

Relevance assessment based on behavior/usage vs. content

Slide28

Query-doc popularity matrix

[Figure: matrix $B$ indexed by queries and docs; the entry for query $q$ and doc $j$ is $B_{qj}$]

$B_{qj}$ = number of times doc $j$ was clicked through on query $q$
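A rough sketch of how such a matrix could be kept and used to re-order results is shown below. The nested-dictionary layout and the function names `record_click` and `reorder` are hypothetical choices for this example.

```python
from collections import defaultdict

# B[q][j] counts click-throughs on doc j for query q (all names hypothetical).
B = defaultdict(lambda: defaultdict(int))

def record_click(query, doc):
    B[query][doc] += 1

def reorder(query, results):
    """Order results by click count for this query, highest first.
    sorted() is stable, so docs with equal counts keep their original order."""
    return sorted(results, key=lambda doc: -B[query][doc])

for doc in ["d3", "d1", "d3"]:
    record_click("web mining", doc)

ranked = reorder("web mining", ["d1", "d2", "d3"])
```

The counts are kept per exact query string, so a query that has never been seen leaves the original result order unchanged.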

When query $q$ is issued again, order docs by $B_{qj}$ values.

Slide29

Vector space implementation

Maintain a term-doc popularity matrix $C$

as opposed to query-doc popularity

initialized to all zeros

Each column represents a doc $j$

If doc $j$ is clicked on for query $q$, update $C_j \leftarrow C_j + q$ (here $q$ is viewed as a vector).
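A minimal sketch of this update, together with the cosine comparison that is applied at query time, might look like the following. Sparse `Counter` columns, whitespace tokenization, and all function names are assumptions made for illustration; combining the result with a regular text score is left out.

```python
import math
from collections import Counter, defaultdict

# C[j] is the column for doc j: a sparse term vector, initially all zeros.
C = defaultdict(Counter)

def to_vector(query):
    """View a query as a term-count vector (whitespace tokenization is an assumption)."""
    return Counter(query.lower().split())

def record_click(query, doc):
    C[doc].update(to_vector(query))  # C_j <- C_j + q

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def popularity_scores(query):
    """Cosine proximity of the query to every doc column seen so far."""
    q = to_vector(query)
    return {doc: cosine(q, col) for doc, col in C.items()}

record_click("white house", "d1")
record_click("white paint", "d2")
scores = popularity_scores("house tour")
```

Unlike the query-doc matrix, a new query such as "house tour" can still match `d1` because it shares the term "house" with an earlier clicked query.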

On a query $q'$, compute its cosine proximity to $C_j$ for all $j$. Combine this with the regular text score.

Slide30

Issues

Normalization of $C_j$ after updating

Assumption of query compositionality

“white house” document popularity derived from “white” and “house”

Updating - live or batch?

Basic assumption: Relevance can be directly measured by number of click-throughs

Valid?

Slide31

Web Usage Mining :: data sources

Typical Sources of Data:

automatically generated Web/application server access logs

e-commerce and product-oriented user events (e.g., shopping cart changes, product clickthroughs, etc.)

user profiles and/or user ratings

meta-data, page content, site structure

User Transactions

sets or sequences of pageviews possibly with associated weights

a pageview is a set of page files and associated objects that contribute to a single display in a Web browser

Slide32

What’s in a Typical Server Log?

Slide33

Typical Fields in a Log File Entry

client IP address: 1.2.3.4

base url: maya.cs.depaul.edu

date/time: 2006-02-01 00:08:43

http method: GET

file accessed: /classes/cs589/papers.html

protocol version: HTTP/1.1

status code: 200 (successful access)

bytes transferred: 9221

referrer page: http://dataminingresources.blogspot.com/

user agent: Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727)
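To make the field list concrete, here is a small sketch that splits one log line into the fields above. The space-separated layout and the field order are assumptions for this example; the slide lists the fields but not how they are arranged on a line, and real servers let the administrator configure this.

```python
# Assumed field order for one space-separated log line (hypothetical layout).
FIELDS = ["date", "time", "client_ip", "base_url", "method", "file",
          "protocol", "status", "bytes", "referrer", "user_agent"]

def parse_entry(line):
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    entry = dict(zip(FIELDS, values))
    # Numeric fields are converted so they can be filtered or summed later.
    entry["status"] = int(entry["status"])
    entry["bytes"] = int(entry["bytes"])
    return entry

line = ("2006-02-01 00:08:43 1.2.3.4 maya.cs.depaul.edu GET "
        "/classes/cs589/papers.html HTTP/1.1 200 9221 "
        "http://dataminingresources.blogspot.com/ "
        "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727)")
entry = parse_entry(line)
```

Splitting on whitespace works here only because the user agent encodes its spaces as `+`; a log format with quoted fields would need a different parser.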

In addition, there may be fields corresponding to

login information

client-side cookies (unique keys, issued to clients in order to identify a repeat visitor)

session ids issued by the Web or application servers

Slide34

Usage Data Preparation Tasks

Data cleaning

remove irrelevant references and fields in server logs

remove references due to spider navigation

add missing references due to caching

Data integration

synchronize data from multiple server logs

integrate e-commerce and application server data

integrate meta-data

Data Transformation

pageview identification

identification of unique users

sessionization – partitioning each user’s record into multiple sessions or transactions (usually representing different visits)

mapping between user sessions and topics or categories

Slide35
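Sessionization, listed among the data transformation tasks on the previous slide, can be sketched with a simple inactivity timeout. The 30-minute threshold is a common heuristic assumed here, not something the slides prescribe, and the request data is hypothetical.

```python
from datetime import datetime, timedelta

def sessionize(requests, timeout=timedelta(minutes=30)):
    """Split one user's (timestamp, url) requests into sessions.
    A gap longer than `timeout` starts a new session (a heuristic assumed here)."""
    sessions = []
    last_time = None
    for when, url in sorted(requests):
        if last_time is None or when - last_time > timeout:
            sessions.append([])
        sessions[-1].append(url)
        last_time = when
    return sessions

t0 = datetime(2006, 2, 1, 0, 8, 43)
requests = [
    (t0, "/index.html"),
    (t0 + timedelta(minutes=2), "/classes/cs589/papers.html"),
    (t0 + timedelta(hours=3), "/index.html"),
]
sessions = sessionize(requests)
```

Each resulting session is a sequence of pageviews, which is the unit that the session-by-pageview representation below is built from.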

Conceptual Representation of User Transactions or Sessions

[Figure: matrix of sessions / user transactions against pageviews / objects]

Raw weights may be binary or based on time spent on a page; in practice, need to normalize or standardize this data.

Slide36

Web Usage Mining as a Process

Slide37

Common Web Usage Mining Tasks

Clustering (unsupervised):

Automatically group together users with similar purchasing or navigational patterns

User / customer segments

Automatically group together items based on co-occurrence in user sessions

Automatic creation of concept or functional hierarchies for the site

Classification / Prediction (supervised)

categorize pages or items into topics in a concept hierarchy

classify users into behavioral groups based on their navigation or purchase histories (e.g., browser, likely to purchase, loyal customer, etc.)

predict a user’s interest in an item based on that user’s profile and those of other similar users

predict the lifetime value of a customer based on transaction history and navigation behavior

Slide38

Common Web Usage Mining Tasks

Association Rules

Associating presence of a set of items with other sets of items

$X \Rightarrow Y$, where $X$ and $Y$ are sets of items

Support of the itemset $X \cup Y$: $\Pr(X, Y)$; confidence of the rule: $\Pr(Y \mid X)$
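Support and confidence as defined above can be computed directly from a list of transactions. The sessions below are hypothetical and only serve to make the two definitions concrete.

```python
def support(transactions, items):
    """Fraction of transactions containing every item in `items`: Pr(items)."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """Pr(Y | X) = support(X union Y) / support(X)."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Hypothetical sessions, each a set of visited pages.
sessions = [
    {"special-offers", "order"},
    {"special-offers", "order", "home"},
    {"special-offers", "home"},
    {"home"},
]
s = support(sessions, {"special-offers", "order"})
c = confidence(sessions, {"special-offers"}, {"order"})
```

Two of the four sessions contain both pages, so the support is 0.5; of the three sessions that include the special-offers page, two also include an order, so the confidence is 2/3.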

Examples:

30% of users who accessed the special-offers page also placed an online order in /products/software/

Customers who bought The Da Vinci Code and Holy Blood, Holy Grail were 65% likely to also purchase the Harry Potter and the Goblet of Fire DVD

Sequential Patterns / Path Analysis

Finding common sequences of events/items appearing frequently in transactions

General form: “x% of the time, when A and B appear in a transaction together, C appears within z transactions (alternatively, within t time units)”

15% of visitors had the following common pattern in their navigation path during a session: home → * → software → * → shopping cart → checkout

Slide39

Example: Association Analysis for Ecommerce

Confidence:

41% who purchased Fully Reversible Mats also purchased Egyptian Cotton Towels

Lift:

People who purchased Fully Reversible Mats were 456 times more likely to purchase the Egyptian Cotton Towels compared to the general population
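The slide does not give a formula for lift; the standard definition, stated here as background, relates the rule's confidence to the baseline probability of the consequent:

```latex
\mathrm{lift}(X \Rightarrow Y) = \frac{\Pr(Y \mid X)}{\Pr(Y)} = \frac{\Pr(X, Y)}{\Pr(X)\,\Pr(Y)}
```

Read this way, a confidence of 41% with a lift of 456 would imply that only about $0.41 / 456 \approx 0.09\%$ of all customers purchased the towels.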

Product: Fully Reversible Mats

Association: Egyptian Cotton Towels

Lift: 456

Confidence: 41%

Slide40

Example: Association Rules for Personalized Recommendations

Slide41

Input

set of relevant pageviews in preprocessed log

set of user transactions

each transaction is a pageview vector

Cluster Transactions (e.g., using k-means)

each cluster contains a set of transaction vectors

for each cluster compute centroid as cluster representative

Aggregate Usage Profiles

a set of pageview-weight pairs: for transaction cluster $C$, select each pageview $p_i$ such that its weight in the cluster centroid is greater than a pre-specified threshold
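The profile-building step can be sketched as follows, assuming the clustering has already been done and transactions are binary (a pageview is either present or absent). The cluster contents and the 0.5 threshold are hypothetical.

```python
def usage_profile(cluster, threshold=0.5):
    """Centroid of a cluster of binary transactions, keeping only the
    pageviews whose centroid weight is greater than `threshold`."""
    pages = set().union(*cluster)
    centroid = {p: sum(p in t for t in cluster) / len(cluster) for p in pages}
    return {p: w for p, w in centroid.items() if w > threshold}

# Hypothetical cluster of four transactions (sets of visited pageviews).
cluster = [
    {"A.html", "B.html", "F.html"},
    {"A.html", "B.html", "F.html"},
    {"A.html", "B.html", "C.html", "F.html"},
    {"B.html", "F.html"},
]
profile = usage_profile(cluster)
```

With binary transactions the centroid weight of a pageview is simply the fraction of transactions in the cluster that contain it, so `C.html` (one transaction out of four) falls below the threshold and is dropped from the profile.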

Profile Aggregation Based on Clustering Transactions (PACT)

Slide42

Characterizing User Segments via Clustering

Cluster 0 (Cluster Size = 3)
--------------------------------------
1.00 C.html
1.00 D.html

Cluster 1 (Cluster Size = 4)
--------------------------------------
1.00 B.html
1.00 F.html
0.75 A.html
0.25 C.html

Cluster 2 (Cluster Size = 3)
--------------------------------------
1.00 A.html
1.00 D.html
1.00 E.html
0.33 C.html

[Figure labels: Original session/user data; Result of clustering]

Given an active session A → B, the best matching cluster is Profile 1. This may result in a recommendation for page F.html, since it appears with high weight in that cluster.

Slide43

Example: Clustering User Transactions

Transaction Clusters:

Clustering similar user transactions and using centroid of each cluster as a usage profile (representative for a user segment)

Support | URL | Pageview Description

1.00 | /courses/syllabus.asp?course=450-96-303&q=3&y=2002&id=290 | SE 450 Object-Oriented Development class syllabus

0.97 | /people/facultyinfo.asp?id=290 | Web page of a lecturer who taught the above course

0.88 | /programs/ | Current Degree Descriptions 2002

0.85 | /programs/courses.asp?depcode=96&deptmne=se&courseid=450 | SE 450 course description in SE program

0.82 | /programs/2002/gradds2002.asp | M.S. in Distributed Systems program description

Sample cluster centroid from the CS dept. Web site (cluster size = 330)

Slide44

Example: Collaborative Filtering

Popular Recommendation Technology

Recommend items to users by finding other users with similar tastes or interests

Compare a target user’s profile (typically ratings on various items) to the profiles of other users in the database with ratings for some common items

Use these “nearest neighbors” to predict a rating by the target user on an unseen item

Collaborative recommendation is one example of using data mining for automatic personalization
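The nearest-neighbor idea can be sketched in a few lines. The similarity measure (cosine over co-rated items), the weighted-average prediction, the value of `k`, and the ratings are all assumptions made for illustration; practical systems often use mean-centered measures such as Pearson correlation instead.

```python
import math

def similarity(a, b):
    """Cosine similarity over the items both users rated (one common choice)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(a[i] ** 2 for i in common))
    nb = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (na * nb)

def predict(ratings, target, item, k=2):
    """Similarity-weighted average of the k nearest neighbors' ratings for `item`."""
    neighbors = [(similarity(ratings[target], r), r[item])
                 for user, r in ratings.items()
                 if user != target and item in r]
    neighbors = sorted(neighbors, reverse=True)[:k]
    total = sum(sim for sim, _ in neighbors)
    if total == 0:
        return None  # no usable neighbors for this item
    return sum(sim * rating for sim, rating in neighbors) / total

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"m1": 5, "m2": 1},
    "bob": {"m1": 5, "m2": 1, "m3": 4},
    "cat": {"m1": 1, "m2": 5, "m3": 2},
}
prediction = predict(ratings, "ann", "m3")
```

Because ann's ratings agree with bob's far more than with cat's, the predicted rating for `m3` is pulled toward bob's 4 rather than cat's 2.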

Source: J. Riedl, “Why Does KDD Care About Personalization?”

Slide45

Example: Collaborative Filtering

Slide46

Web Mining Approach to Personalization

Basic Idea

generate aggregate user models (usage profiles) by discovering user access patterns through Web usage mining (offline process)

Clustering user transactions

Clustering items / pageviews

Association rule mining

Sequential pattern discovery

match a user’s active session against the discovered models to provide dynamic content (online process)

Advantages

no explicit user ratings or interaction with users

helps preserve user privacy, by making effective use of anonymous data

enhance the effectiveness and scalability of collaborative filtering

more accurate and broader recommendations than content-only approaches

Slide47

Web Personalization

The General Problem

dynamically serve customized content (pages, products, etc.) to users based on their profiles, preferences, or expected interests

as we have seen, many of the data mining approaches that allow us to learn aggregate user models can be used for personalization or recommendation

Slide48

Real-Time Recommendation Engine

Keep track of users’ navigational history through the site

a fixed-size sliding window over the active session to capture the current user’s “short-term” history depth

Match current user’s activity against the discovered profiles

profiles either can be based on aggregate usage profiles, or are obtained directly from association rules or sequential patterns

Dynamically generated recommendations are added to the returned page

each pageview can be assigned a recommendation score based on

matching score to user profiles (e.g., aggregate usage profiles)

“information value” of the pageview based on domain knowledge (e.g., link distance of the candidate recommendation to the active session)

Slide49

Problems with Web Usage Mining

New item problem

Patterns will not capture new items recently added

Bad for dynamic Web sites

Poor machine interpretability

Hard to generalize and reason about patterns

No domain knowledge used to enhance results

E.g., Knowing a user is interested in a program, we could recommend the prerequisites, core or popular courses in this program to the user

Poor insight into the patterns themselves

The nature of the relationships among items or users in a pattern is not directly available

Slide50

Solution: Integrate Semantic Knowledge with Web Usage Mining

Information Retrieval/Extraction Approach

Represent semantic knowledge in pageviews as keyword vectors

Keywords extracted from text or meta-data

Text mining can be used to capture higher-level concepts or associations among concepts

Cannot capture deeper relationships among objects based on their inherent properties or attributes

Ontology-based approach

Represent domain knowledge using relational model or ontology representation languages

Process Web usage data with the structured domain knowledge

Requires the extraction of ontology instances from Web pages

Challenge: performing underlying mining operations on structured objects (e.g., computing similarities or performing aggregations)