Slide1
Web Mining: An Overview
CSC 575
Intelligent Information Retrieval
Slide2
Web Mining
Today
Overview of Web Data Mining
Web Content Mining / Text Mining
Web Usage Mining
Web Personalization
Slide3
What Is Data Mining?
Data Mining: A Definition
The non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from data in large data repositories
Non-trivial: obvious knowledge is not useful
Implicit: hidden, difficult-to-observe knowledge
Previously unknown
Potentially useful: actionable; easy to understand
Slide4
The Knowledge Discovery in Data (KDD) Viewed as a Process
Slide5
What Can Data Mining Do
Many Data Mining Tasks
often inter-related
often need to try different techniques for each task
each task may require different types of knowledge discovery
Typical data mining tasks
Classification
Prediction
Clustering
Association Discovery
Sequence Analysis
Characterization
Discrimination
Slide6
What is Web Mining
From its very beginning, the potential of extracting valuable knowledge from the Web has been quite evident.
Web mining is the collection of technologies to fulfill this potential.
Web Mining Definition
The application of data mining and machine learning techniques to extract useful knowledge from the content, structure, and usage of Web resources.
Slide7
Types of Web Mining
Web Mining
Web Content Mining
Web Structure Mining
Web Usage Mining
Slide8
Types of Web Mining
Web Content Mining
Extracting useful knowledge from the contents of Web documents or other semantic information about Web resources
Slide9
Types of Web Mining
Web Content Mining
Content data may consist of text, images, audio, video, structured records from lists and tables, or item attributes from backend databases.
Slide10
Types of Web Mining
Web Content Mining
Applications:
document clustering or categorization
topic identification / tracking
concept discovery
focused crawling
content-based personalization
intelligent search tools
Slide11
Types of Web Mining
Web Usage Mining
Extracting interesting patterns from user interactions with resources on one or more Web sites
Slide12
Types of Web Mining
Web Usage Mining
Applications:
user and customer behavior modeling
Web site optimization
e-customer relationship management
Web marketing
targeted advertising
recommender systems
Slide13
Types of Web Mining
Web Structure Mining
Discovering useful patterns from the hyperlink structure connecting Web sites or Web resources
Slide14
Types of Web Mining
Web Structure Mining
Data sources include the explicit hyperlinks between documents, or implicit links among objects (e.g., two objects being “tagged” using the same keyword).
Slide15
Types of Web Mining
Web Structure Mining
Applications:
document retrieval and ranking (e.g., Google)
discovery of “hubs” and “authorities”
discovery of Web communities
social network analysis
Slide16
Web Content Mining :: data preparation
Typical steps in content preprocessing
Extract text and meta data from Web documents (generally performed automatically using a Web crawler)
Recognize special entities (e.g., dates) and pre-defined keywords
Remove stop words and non-relevant terms
Perform stemming and morphological analysis
Compute statistics based on term occurrences
document frequency (DF): number of documents in which the term occurs
term frequency (TF): frequency of occurrence within a specific document
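As a minimal sketch of the statistics step above, the following Python computes TF and DF for a toy corpus. The three documents, the stop-word list, and the function names are illustrative assumptions; stemming and entity recognition are left out.

```python
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "of", "and", "a", "for", "is"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (t.strip(".,;:!?()").lower() for t in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]

def term_frequencies(docs):
    """TF: occurrences of each term within each document."""
    return {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

def document_frequencies(tf):
    """DF: number of documents in which each term occurs at least once."""
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    return df

# Hypothetical documents.
docs = {
    "A": "Web mining is the mining of Web data",
    "B": "Data mining for business intelligence",
    "C": "Information retrieval and Web search",
}
tf = term_frequencies(docs)
df = document_frequencies(tf)
```

Here `tf["A"]["mining"]` counts occurrences inside document A, while `df["mining"]` counts how many documents mention the term at all.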
Additional considerations
For entities such as products, movies, songs, etc., may need to extract or obtain structured information such as item attributes from databases or available domain ontologies
It may be desirable to identify or discover phrases, facets, collocations, etc. (in order to treat commonly occurring groups of features as a single term).
Slide17
Web Content Mining :: data representation
Vector Representation
Typically, each document is represented as a multi-dimensional vector over all terms extracted in the preprocessing step
dimension values represent weights associated with terms in the document
Term weights may be binary or may be derived as a function of term frequency (TF) and document frequency (DF)
In some applications, only a limited number of terms (a controlled vocabulary) is maintained, and the weights may be assigned manually or based on external criteria
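One common way to derive a weight from TF and DF is TF multiplied by log(N / DF), where N is the number of documents. The slide does not name a specific function, so this choice is an assumption, as are the term counts and the function name in the sketch below.

```python
import math

def tfidf_vectors(tf, vocabulary):
    """Return one weight vector per document, ordered by `vocabulary`.

    Weight = TF * log(N / DF), one common choice; a term that occurs in
    every document gets weight 0.
    """
    n_docs = len(tf)
    df = {t: sum(1 for counts in tf.values() if counts.get(t, 0) > 0)
          for t in vocabulary}
    vectors = {}
    for doc_id, counts in tf.items():
        vectors[doc_id] = [
            counts.get(t, 0) * math.log(n_docs / df[t]) if df[t] else 0.0
            for t in vocabulary
        ]
    return vectors

# Hypothetical raw term counts for three documents.
tf = {
    "A": {"web": 3, "data": 4, "mining": 2},
    "B": {"web": 2, "business": 3},
    "C": {"web": 1, "data": 1, "retrieval": 6},
}
vocabulary = ["web", "data", "mining", "business", "retrieval"]
weights = tfidf_vectors(tf, vocabulary)
```

With this weighting, "web" contributes nothing to any vector because it appears in all three documents, whereas a term confined to one document is weighted up.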
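[Table: term-by-document weight matrix. Columns: documents / objects A, B, C, D, E. Rows: terms / attributes. Nonzero entries are listed per row from left to right; their column positions are not recoverable.]
web: 3, 2, 1, 1
data: 4, 1, 4
mining: 2, 1
business: 3, 5
intelligence: 3, 1
marketing: 2, 1, 1, 1
information: 1, 5, 2, 4
retrieval: 6, 1, 3
Slide18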
Web Content Mining :: common approaches and applications
Basic notion: document similarity
Most Web content mining and information retrieval applications involve measuring similarity among two or more documents
Vector representation facilitates similarity computations using vector-space operations (such as Cosine of the angle between two vectors)
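The cosine measure mentioned above can be sketched in a few lines of Python. The document vectors, the query, and the function names are hypothetical; the ranking step only illustrates how a query vector might be compared against indexed document vectors.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, doc_vectors):
    """Return (doc_id, similarity) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical vectors over the terms [web, data, mining, retrieval].
doc_vectors = {
    "A": [3, 4, 2, 0],
    "B": [0, 1, 0, 6],
    "C": [1, 0, 0, 3],
}
query = [1, 0, 1, 0]  # a query mentioning "web" and "mining"
ranking = rank(query, doc_vectors)
```

The same `cosine` function would serve clustering, categorization against a mean vector, or matching against a user profile; only the vectors being compared change.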
Examples
Search engines:
measure the similarity between a query (represented as a vector) and the indexed document vectors to return a ranked list of relevant documents
Document clustering:
group documents based on similarity or dissimilarity (distance) among them
Document categorization:
measure the similarity of a new document to be classified with representations of existing categories (such as the mean vector representing a group of document vectors)
Personalization:
recommend documents or items based on their similarity to a representation of the user’s profile (may be a term vector representing concepts or terms of interest to the user)
Slide19
Web Content Mining :: example – clustered search results
Can drill down within clusters to view sub-topics or to view the relevant subset of results
Slide20
Web Content Mining :: example – personalized content delivery
Google's personalized news is an example of a content-based recommender system which recommends items (in part) based on the similarity of their content to a user’s profile (gathered from search and click history)
Slide21
Web Structure Mining :: graph structures on the Web
The structure of a typical Web graph
Web pages as nodes
hyperlinks as edges connecting two related pages
Hyperlink Analysis
Hyperlinks can serve as a tool for pure navigation
But often they are used to point to pages with authority on the same topic as the source page (similar to a citation in a publication)
Some interesting Web structures
Slide22
Web Structure Mining :: example – Google’s PageRank algorithm
Basic idea:
Rank of a page depends on the ranks of pages pointing to it
Out Degree of page is the number of edges pointing away from it – used to compute the contribution of the page to those to which it points
The final PageRank value represents the probability that a random surfer will reach the page
d is the probability that a random surfer chooses the page directly rather than getting there via navigation
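The slide's formula is not reproduced in this text, so the sketch below uses one standard formulation that matches the description: PR(p) = d/N + (1 - d) * sum over pages q linking to p of PR(q) / OutDegree(q), with d as the direct-jump probability. The graph, the default d = 0.15, the iteration count, and the treatment of pages without out-links are assumptions.

```python
def pagerank(links, d=0.15, iterations=50):
    """Power-iteration PageRank.

    links: dict mapping each page to the list of pages it links to.
    d: probability that the surfer jumps to a random page directly
       (many texts instead call 1 - d the damping factor).
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: d / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = (1 - d) * rank[page] / len(targets)  # split by out-degree
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                for t in pages:
                    new_rank[t] += (1 - d) * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page graph; every link target is also a key.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
```

In this toy graph, page C collects rank from three in-links and ends up highest, while D has no in-links and keeps only the direct-jump share d/N.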
Illustration of PageRank propagation
Slide23
Web Structure Mining :: example – Hubs and Authorities
Basic idea
Authority comes from in-edges
Being a hub comes from out-edges
Mutually reinforcing relationship
A good authority is a page that is pointed to by many good hubs.
A good hub is a page that points to many good authorities.
Together they tend to form a bipartite graph
This idea can be used to discover authoritative pages related to a topic
HITS algorithm – Hypertext Induced Topic Search
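The mutually reinforcing update can be sketched as a simple iteration. This is a minimal illustration of the hub/authority idea rather than the full HITS procedure (it skips building a topic-specific subgraph from a query); the graph, the iteration count, and the unit-length normalization are assumptions.

```python
import math

def hits(links, iterations=20):
    """Iteratively compute hub and authority scores.

    links: dict mapping each page to the list of pages it points to.
    Returns (hub, authority) dicts, each normalized to unit length.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages pointing in (in-edges).
        auth = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for t in targets:
                auth[t] += hub[page]
        # Hub: sum of authority scores of the pages pointed to (out-edges).
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: h1..h3 act as hubs, a1 and a2 as authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(links)
```

Page a1 is pointed to by all three hubs and so receives the highest authority score; h1 and h2 point to both authorities and outrank h3 as hubs.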
[Figure labels: Hubs, Authorities]
Slide24
Web Structure Mining :: example – Web communities
Basic idea
Web communities are collections of Web pages such that each member node has more hyperlinks (in either direction) within the community than outside the community.
Typical approach: Maximal-flow model *
Ex: separate the two subgraphs with any choice of source node (left subgraph) and sink node (right subgraph), removing the three dashed links
* Source: G. Flake, et al. “Self-Organization and Identification of Web Communities”, IEEE Computer, Vol. 35, No. 3, pp. 66-71, March 2002.
[Figure labels: Community 1, Community 2, source node, sink]
Slide25
Web Usage Mining
The Problem:
analyze Web navigational data to
Find how the Web site is used by Web users
Understand the behavior of different user segments
Predict how users will behave in the future
Target relevant or interesting information to individual or groups of users
Increase sales, profit, loyalty, etc.
Challenge
Quantitatively capture Web users’ common interests and characterize their underlying tasks
Slide26
Applications of Web Usage Mining
Electronic Commerce
design cross marketing strategies across products
evaluate promotional campaigns
target electronic ads and coupons at user groups based on their access patterns
predict user behavior based on previously learned rules and users’ profiles
present dynamic information to users based on their interests and profiles: “Web personalization”
Effective and Efficient Web Presence
determine the best way to structure the Web site
identify “weak links” for elimination or enhancement
prefetch files that are most likely to be accessed
enhance workgroup management & communication
Search Engines
Behavior-based ranking
Slide27
Behavior-based ranking
For each query Q, keep track of which docs in the results are clicked on
On subsequent requests for Q, re-order docs in results based on click-throughs.
Relevance assessment based on
Behavior/usage
vs. content
Slide28
Query-doc popularity matrix $B$
Rows are queries; columns are docs.
$B_{qj}$ = number of times doc $j$ was clicked through on query $q$
When query $q$ is issued again, order docs by their $B_{qj}$ values.
Slide29
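The query-doc popularity matrix B described above can be sketched as a nested counter. The click log, document ids, and function names below are hypothetical, and the matrix is stored sparsely so that unseen query-doc pairs count as zero.

```python
from collections import defaultdict

# B[q][j] = number of times doc j was clicked through on query q.
B = defaultdict(lambda: defaultdict(int))

def record_click(query, doc):
    B[query][doc] += 1

def rerank(query, results):
    """Order results by click count for this query; ties keep the original order."""
    clicks = B.get(query, {})
    return sorted(results, key=lambda doc: clicks.get(doc, 0), reverse=True)

# Hypothetical click log for the query "web mining".
for doc in ["d3", "d1", "d3", "d2", "d3", "d1"]:
    record_click("web mining", doc)

reordered = rerank("web mining", ["d1", "d2", "d3", "d4"])
```

A query that has never been seen has no row in B, so its results come back in their original order.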
Vector space implementation
Maintain a term-doc popularity matrix $C$
as opposed to query-doc popularity
initialized to all zeros
Each column represents a doc $j$
If doc $j$ is clicked on for query $q$, update $C_j \leftarrow C_j + q$ (here $q$ is viewed as a vector).
On a query $q'$, compute its cosine proximity to $C_j$ for all $j$. Combine this with the regular text score.
Slide30
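For the term-doc popularity matrix C described above, a minimal sketch follows. It assumes a query is turned into a bag-of-words vector with raw counts, stores each column sparsely, and leaves out normalization of the columns and the combination with the text score; all names and the sample clicks are hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine between two sparse vectors stored as term -> weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def query_vector(query):
    """Assumption: a query is a bag of words with raw counts as weights."""
    vec = defaultdict(float)
    for term in query.lower().split():
        vec[term] += 1.0
    return dict(vec)

# C[j] is the popularity column for doc j; a missing entry stands for zero.
C = defaultdict(lambda: defaultdict(float))

def record_click(query, doc):
    """Update C_j <- C_j + q when doc j is clicked for query q."""
    for term, weight in query_vector(query).items():
        C[doc][term] += weight

def popularity_score(query, doc):
    """Cosine proximity of a new query q' to the column C_j."""
    return cosine(query_vector(query), C.get(doc, {}))

# Hypothetical clicks.
record_click("white house", "d1")
record_click("white house tour", "d1")
record_click("house prices", "d2")
```

Because columns are indexed by terms rather than whole queries, a query such as "white" that was never issued before still gets a popularity score for d1. How that score is blended with the regular text score is left open here, as it is on the slide.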
Issues
Normalization of $C_j$ after updating
Assumption of query compositionality
“white house” document popularity derived from “white” and “house”
Updating - live or batch?
Basic assumption:
Relevance can be directly measured by number of click-throughs
Valid?
Slide31
Web Usage Mining :: data sources
Typical Sources of Data:
automatically generated Web/application server access logs
e-commerce and product-oriented user events (e.g., shopping cart changes, product clickthroughs, etc.)
user profiles and/or user ratings
meta-data, page content, site structure
User Transactions
sets or sequences of pageviews possibly with associated weights
a pageview is a set of page files and associated objects that contribute to a single display in a Web browser
Slide32
What’s in a Typical Server Log?
Slide33
Typical Fields in a Log File Entry
client IP address: 1.2.3.4
base url: maya.cs.depaul.edu
date/time: 2006-02-01 00:08:43
http method: GET
file accessed: /classes/cs589/papers.html
protocol version: HTTP/1.1
status code: 200 (successful access)
bytes transferred: 9221
referrer page: http://dataminingresources.blogspot.com/
user agent: Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1; +SV1;+.NET+CLR+2.0.50727)
In addition, there may be fields corresponding to
login information
client-side cookies (unique keys, issued to clients in order to identify a repeat visitor)
session ids issued by the Web or application servers
Slide34
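As a rough illustration of reading such an entry, the sketch below parses one space-delimited line built from the example values above. The field order, the single-line layout, and the removal of the stray space in the user agent are assumptions; real servers let the administrator choose which fields are logged and in what order.

```python
from datetime import datetime

# Assumed field order, following the list of typical fields.
FIELDS = ["client_ip", "host", "date", "time", "method", "path",
          "protocol", "status", "bytes", "referrer", "user_agent"]

def parse_entry(line):
    """Split one space-delimited log line into a dict of typed fields."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    entry = dict(zip(FIELDS, parts))
    entry["timestamp"] = datetime.strptime(
        entry.pop("date") + " " + entry.pop("time"), "%Y-%m-%d %H:%M:%S")
    entry["status"] = int(entry["status"])
    entry["bytes"] = int(entry["bytes"])
    # Some log formats encode spaces in the user agent as "+".
    entry["user_agent"] = entry["user_agent"].replace("+", " ")
    return entry

line = ("1.2.3.4 maya.cs.depaul.edu 2006-02-01 00:08:43 GET "
        "/classes/cs589/papers.html HTTP/1.1 200 9221 "
        "http://dataminingresources.blogspot.com/ "
        "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727)")
entry = parse_entry(line)
```

Typed fields such as `entry["timestamp"]` and `entry["status"]` are what later cleaning and sessionization steps would work from.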
Usage Data Preparation Tasks
Data cleaning
remove irrelevant references and fields in server logs
remove references due to spider navigation
add missing references due to caching
Data integration
synchronize data from multiple server logs
integrate e-commerce and application server data
integrate meta-data
Data transformation
pageview identification
identification of unique users
sessionization – partitioning each user’s record into multiple sessions or transactions (usually representing different visits)
mapping between user sessions and topics or categories
Slide35
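Sessionization, as described in the data transformation tasks above, is often done with an inactivity timeout. The sketch below assumes a 30-minute threshold and already-identified users; both are assumptions, as are the sample requests and the function name.

```python
from datetime import datetime, timedelta

# Assumption: a gap of more than 30 minutes starts a new session.
TIMEOUT = timedelta(minutes=30)

def sessionize(requests, timeout=TIMEOUT):
    """Partition (user, timestamp, page) requests into per-user sessions.

    Returns a dict mapping each user to a list of sessions, where each
    session is a list of pages in the order they were requested.
    """
    sessions = {}
    last_seen = {}
    for user, ts, page in sorted(requests, key=lambda r: (r[0], r[1])):
        if user not in sessions or ts - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])  # open a new session
        sessions[user][-1].append(page)
        last_seen[user] = ts
    return sessions

t0 = datetime(2006, 2, 1, 0, 8, 43)
# Hypothetical cleaned requests; "u1" and "u2" stand for identified users.
requests = [
    ("u1", t0, "/index.html"),
    ("u1", t0 + timedelta(minutes=5), "/classes/cs589/papers.html"),
    ("u2", t0 + timedelta(minutes=6), "/index.html"),
    ("u1", t0 + timedelta(hours=2), "/index.html"),
]
sessions = sessionize(requests)
```

User u1 ends up with two sessions because the last request arrives two hours after the previous one.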
Conceptual Representation of User Transactions or Sessions
[Figure: session-pageview matrix; axes labeled Sessions / user transactions and Pageview / objects]
Raw weights may be binary or based on time spent on a page; in practice, this data needs to be normalized or standardized.
Slide36
Web Usage Mining as a Process
Slide37
Common Web Usage Mining Tasks
Clustering (unsupervised):
Automatically group together users with similar purchasing or navigational patterns
User / customer segments
Automatically group together items based on co-occurrence in user sessions
Automatic creation of concept or functional hierarchies for the site
Classification / Prediction (supervised)
categorize pages or items into topics in a concept hierarchy
classify users into behavioral groups based on their navigation or purchase histories (e.g., browser, likely to purchase, loyal customer, etc.)
predict a user’s interest in an item based on that user’s profiles and those of other similar users
predict the life-time-value for a customer based on transaction history and navigation behavior
Slide38
Common Web Usage Mining Tasks
Association Rules
Associating presence of a set of items with other sets of items
X ⇒ Y, where X and Y are sets of items
Support of the itemset X ∪ Y: Pr(X,Y); Confidence of rule: Pr(Y|X)
Examples:
30% of users who accessed the special-offers page also placed an online order in /products/software/
Customers who bought The Da Vinci Code and Holy Blood, Holy Grail were 65% likely to also purchase the Harry Potter and the Goblet of Fire DVD
Sequential Patterns / Path Analysis
Finding common sequences of events/items appearing frequently in transactions
General form: “x% of the time, when A and B appear in a transaction together, C appears within z transactions (alternatively, within t time units)”
15% of visitors had the following common pattern in their navigation path during a session: home * software * shopping cart checkout
Slide39
Example: Association Analysis for Ecommerce
Confidence:
41% who purchased Fully Reversible Mats also purchased Egyptian Cotton Towels
Lift:
People who purchased Fully Reversible Mats were 456 times more likely to purchase the Egyptian Cotton Towels compared to the general population
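Support, confidence, and lift for a rule X ⇒ Y can be computed directly from a list of transactions. The sketch below uses the usual definitions (support = Pr(X,Y), confidence = Pr(Y|X), lift = Pr(Y|X) / Pr(Y)); the transactions are hypothetical and are not the data behind the figures quoted above.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, x, y):
    """Pr(Y | X): support of X and Y together divided by support of X."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

def lift(transactions, x, y):
    """Confidence of X => Y divided by the baseline support of Y."""
    return confidence(transactions, x, y) / support(transactions, y)

# Hypothetical purchase transactions.
transactions = [
    {"mats", "towels"},
    {"mats", "towels", "soap"},
    {"mats"},
    {"soap"},
    {"towels", "soap"},
    {"soap"},
    {"soap"},
    {"soap"},
]
rule_confidence = confidence(transactions, {"mats"}, {"towels"})
rule_lift = lift(transactions, {"mats"}, {"towels"})
```

A lift above 1 means buyers of X purchase Y more often than the general population does, which is how the "456 times more likely" figure should be read.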
Product | Association | Lift | Confidence
Fully Reversible Mats | Egyptian Cotton Towels | 456 | 41%
Slide40
Example: Association Rules for Personalized Recommendations
Slide41
Profile Aggregation Based on Clustering Transactions (PACT)
Input
set of relevant pageviews in preprocessed log
set of user transactions
each transaction is a pageview vector
Cluster transactions (e.g., using k-means)
each cluster contains a set of transaction vectors
for each cluster compute centroid as cluster representative
Aggregate Usage Profiles
a set of pageview-weight pairs: for transaction cluster C, select each pageview $p_i$ whose weight (in the cluster centroid) is greater than a pre-specified threshold
Slide42
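The profile-building step of PACT (cluster centroid, then thresholding) can be sketched as follows. The clustering itself is assumed to have been done already, for example by k-means; the transaction vectors, the threshold of 0.5, and the function names are illustrative assumptions.

```python
def centroid(vectors):
    """Mean of equal-length transaction vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def usage_profile(cluster, pageviews, threshold=0.5):
    """Pageview-weight pairs whose centroid weight exceeds the threshold."""
    weights = centroid(cluster)
    return {p: w for p, w in zip(pageviews, weights) if w > threshold}

pageviews = ["A.html", "B.html", "C.html", "D.html", "E.html", "F.html"]
# Hypothetical binary transaction vectors already assigned to one cluster;
# each position corresponds to a pageview above.
cluster = [
    [1, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 1],
    [1, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 1],
]
profile = usage_profile(cluster, pageviews)
```

With binary vectors, each centroid weight is simply the fraction of the cluster's transactions that include the pageview, so C.html (present in one of four transactions) falls below the threshold and is dropped.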
Characterizing User Segments via Clustering
Cluster 0 (Cluster Size = 3)
--------------------------------------
1.00 C.html
1.00 D.html
Cluster 1 (Cluster Size = 4)
--------------------------------------
1.00 B.html
1.00 F.html
0.75 A.html
0.25 C.html
Cluster 2 (Cluster Size = 3)
--------------------------------------
1.00 A.html
1.00 D.html
1.00 E.html
0.33 C.html
[Figure labels: Original Session/user data, Result of Clustering]
Given an active session A → B, the best matching cluster is Profile 1. This may result in a recommendation for page F.html, since it appears with high weight in that cluster.
Slide43
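The matching of an active session against the cluster profiles above can be sketched with cosine similarity. The slide does not say which matching measure is used, so cosine is an assumption here, as are the function names; the profile weights are the ones listed for Clusters 0 to 2.

```python
import math

# Profiles copied from the clustering result above (pageview -> weight).
profiles = {
    0: {"C.html": 1.00, "D.html": 1.00},
    1: {"B.html": 1.00, "F.html": 1.00, "A.html": 0.75, "C.html": 0.25},
    2: {"A.html": 1.00, "D.html": 1.00, "E.html": 1.00, "C.html": 0.33},
}

def match_score(session, profile):
    """Cosine similarity between a binary session vector and a profile."""
    dot = sum(profile.get(page, 0.0) for page in session)
    norm = math.sqrt(len(session)) * math.sqrt(sum(w * w for w in profile.values()))
    return dot / norm if norm else 0.0

def recommend(session, profiles):
    """Return (best profile id, unseen pages of that profile by weight)."""
    best = max(profiles, key=lambda pid: match_score(session, profiles[pid]))
    unseen = {p: w for p, w in profiles[best].items() if p not in session}
    return best, sorted(unseen, key=unseen.get, reverse=True)

best, pages = recommend({"A.html", "B.html"}, profiles)
```

For the active session containing A.html and B.html, Profile 1 scores highest and F.html is the top unseen page, in line with the recommendation described above.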
Example: Clustering User Transactions
Transaction Clusters:
Clustering similar user transactions and using centroid of each cluster as a usage profile (representative for a user segment)
Support | URL | Pageview Description
1.00 | /courses/syllabus.asp?course=450-96-303&q=3&y=2002&id=290 | SE 450 Object-Oriented Development class syllabus
0.97 | /people/facultyinfo.asp?id=290 | Web page of a lecturer who taught the above course
0.88 | /programs/ | Current Degree Descriptions 2002
0.85 | /programs/courses.asp?depcode=96&deptmne=se&courseid=450 | SE 450 course description in SE program
0.82 | /programs/2002/gradds2002.asp | M.S. in Distributed Systems program description
Sample cluster centroid from the CS dept. Web site (cluster size = 330)
Slide44
Example: Collaborative Filtering
Popular Recommendation Technology
Recommend items to users by finding other users with similar tastes or interests
Compare a target user’s profile (typically ratings on various items) to the profiles of other users in the database with ratings for some common items
Use these “nearest neighbors” to predict a rating by the target user on an unseen item
Collaborative recommendation is one example of using data mining for automatic personalization
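A minimal user-based sketch of the nearest-neighbor idea above: compare the target user's ratings with those of other users, then predict an unseen item's rating as a similarity-weighted average over the k most similar users who rated it. The ratings, user names, choice of cosine similarity, and k = 2 are all assumptions.

```python
import math

def similarity(a, b):
    """Cosine similarity between two users' rating dicts (missing = 0)."""
    common = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict(ratings, target, item, k=2):
    """Similarity-weighted average of the k nearest neighbors' ratings for `item`."""
    neighbors = [(similarity(ratings[target], r), r[item])
                 for user, r in ratings.items()
                 if user != target and item in r]
    neighbors = sorted(neighbors, reverse=True)[:k]
    total = sum(sim for sim, _ in neighbors)
    if total == 0:
        return None  # no usable neighbors
    return sum(sim * rating for sim, rating in neighbors) / total

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"item1": 5, "item2": 4, "item3": 1},
    "bob":   {"item1": 5, "item2": 5, "item3": 1, "item4": 5},
    "carol": {"item1": 1, "item2": 2, "item3": 5, "item4": 1},
    "dave":  {"item1": 4, "item2": 4, "item4": 4},
}
prediction = predict(ratings, "alice", "item4")
```

Here alice's nearest neighbors are bob and dave, whose tastes resemble hers, so the predicted rating for item4 lands between their ratings of 5 and 4 rather than near carol's 1.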
Source: J. Riedl, “Why Does KDD Care About Personalization?”
Slide45
Example: Collaborative Filtering
Slide46
Web Mining Approach to Personalization
Basic Idea
generate aggregate user models (usage profiles) by discovering user access patterns through Web usage mining (offline process)
Clustering user transactions
Clustering items / pageviews
Association rule mining
Sequential pattern discovery
match a user’s active session against the discovered models to provide dynamic content (online process)
Advantages
no explicit user ratings or interaction with users
helps preserve user privacy, by making effective use of anonymous data
enhance the effectiveness and scalability of collaborative filtering
more accurate and broader recommendations than content-only approaches
Slide47
Web Personalization
The General Problem
dynamically serve customized content (pages, products, etc.) to users based on their profiles, preferences, or expected interests
as we have seen, many of the data mining approaches that allow us to learn aggregate user models can be used for personalization or recommendation
Slide48
Real-Time Recommendation Engine
Keep track of users’ navigational history through the site
a fixed-size sliding window over the active session to capture the current user’s “short-term” history depth
Match current user’s activity against the discovered profiles
profiles either can be based on aggregate usage profiles, or are obtained directly from association rules or sequential patterns
Dynamically generated recommendations are added to the returned page
each pageview can be assigned a recommendation score based on
matching score to user profiles (e.g., aggregate usage profiles)
“information value” of the pageview based on domain knowledge (e.g., link distance of the candidate recommendation to the active session)
Slide49
Problems with Web Usage Mining
New item problem
Patterns will not capture new items recently added
Bad for dynamic Web sites
Poor machine interpretability
Hard to generalize and reason about patterns
No domain knowledge used to enhance results
E.g., Knowing a user is interested in a program, we could recommend the prerequisites, core or popular courses in this program to the user
Poor insight into the patterns themselves
The nature of the relationships among items or users in a pattern is not directly available
Slide50
Solution: Integrate Semantic Knowledge with Web Usage Mining
Information Retrieval/Extraction Approach
Represent semantic knowledge in pageviews as keyword vectors
Keywords extracted from text or meta-data
Text mining can be used to capture higher-level concepts or associations among concepts
Cannot capture deeper relationships among objects based on their inherent properties or attributes
Ontology-based approach
Represent domain knowledge using relational model or ontology representation languages
Process Web usage data with the structured domain knowledge
Requires the extraction of ontology instances from Web pages
Challenge: performing underlying mining operations on structured objects (e.g., computing similarities or performing aggregations)