Slide1
Search Query Log Analysis
Kristina Lerman
Slide2
What can we learn from web search queries?
Characteristics
Length has steadily grown over the years
1990’s: < 2 terms
2001: 2.4 terms
2014: long search queries, e.g., “where is the nearest coffee shop”
Heavy-tailed distribution of term frequency
Billions of queries
User intentions
Aggregate query words with results of search to learn user’s needs, wants, goals
Create a database of commonsense knowledge
Cf. Cyc
Does data exist?
AOL search query log
Google Trends
Slide3
2006 AOL search query log dataset
~20M web queries
~650K users
3 month period: March 1 – May 31, 2006
Data format
AnonID – an anonymous user ID number
Query – the query issued by the user
QueryTime – time query was submitted
ItemRank – rank of item clicked in results
ClickURL – the domain of the clicked item
Slide4
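A minimal sketch of reading records in the five-field format listed for the AOL log. It assumes tab-separated lines with the fields in the order shown, and that ItemRank and ClickURL are missing when nothing was clicked. The sample lines and the names `QueryRecord`, `parse_record` and `mean_query_length` are made up for illustration.

```python
from typing import NamedTuple, Optional

class QueryRecord(NamedTuple):
    anon_id: str
    query: str
    query_time: str
    item_rank: Optional[int]   # None when no result was clicked
    click_url: Optional[str]   # None when no result was clicked

def parse_record(line: str) -> QueryRecord:
    """Split one tab-separated log line into its five fields (assumed layout)."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (5 - len(fields))   # pad missing click fields
    anon_id, query, query_time, rank, url = fields[:5]
    return QueryRecord(anon_id, query, query_time,
                       int(rank) if rank else None,
                       url or None)

def mean_query_length(records) -> float:
    """Average number of whitespace-separated terms per query."""
    lengths = [len(r.query.split()) for r in records]
    return sum(lengths) / len(lengths)

# Hypothetical sample lines, not taken from the real dataset.
sample = [
    "1001\twhere is the nearest coffee shop\t2006-03-01 10:15:00\t1\thttp://example.com",
    "1001\tcoffee\t2006-03-01 10:16:30",
]
records = [parse_record(line) for line in sample]
```

Counting terms per query this way is one simple route to the query-length characteristics mentioned earlier in the deck.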
Timeline
8/4/06: Announcement to SIG-IRList from AOL
8/6/06: TechCrunch slams AOL over privacy
8/7/06: Dataset removed
8/9/06: NYTimes identifies user 4417749
Thelma Arnold, 62, from Lilburn, Georgia
8/21/06: AOL CTO Maureen Govern resigns
AOL researcher and supervisor are fired
Slide5
Slide6
Weakly-supervised discovery of named entities using web search queries
Marius Pasca (Google)
CIKM-07: Conference on Information and Knowledge Management, Lisbon, Portugal
Slide7
Weakly Supervised Discovery of Named Entities using Web Search (2007)
Goal: Automatically extract knowledge (entities) from texts created by many people
Discover new instances of classes
Red Alert is a videogame
Lilburn is a town
Lorazepam is a drug
For what purpose?
Cataloging human knowledge
Understanding searching users
#399392 in Lilburn takes Lorazepam, plays Red Alert
Slide8
Intuition
Templates in queries
“side effects of xanax pills”
“side effects of birth control pills”
“side effects of lipitor pills”
…
Prefix: “side effects of”
Postfix: “pills”
But, templates are difficult to specify
Cf. extraction patterns in web information retrieval
Slide9
“Weakly”-supervised approach
Guided by a small set of known seed instances
Input is a target class and some examples
Drug: {phentermine, viagra, vicodin, vioxx, xanax}
City: {london, paris, san francisco, tokyo, toronto}
Food: {chicken, fish, milk, tomatoes, wheat}
Identify the patterns seed instances occur in
Learn many more new instances automatically
Use patterns to find more instances
Slide10
Step 1: Identify query templates
Identify all queries that contain each known class instance
vioxx
Extract left and right context
“long term vioxx use”
Prefix: “long term”
Postfix: “use”
Infix: “vioxx”
Slide11
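Step 1 (identifying query templates) can be sketched as follows. This is an illustrative sketch, not the paper's implementation: it assumes whitespace tokenization and lowercase matching, and the function name `extract_templates` is hypothetical.

```python
def extract_templates(queries, seeds):
    """Return the set of (prefix, postfix) templates found around any seed instance."""
    templates = set()
    for query in queries:
        tokens = query.lower().split()
        for seed in seeds:
            seed_tokens = seed.lower().split()
            n = len(seed_tokens)
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == seed_tokens:
                    prefix = " ".join(tokens[:i])
                    postfix = " ".join(tokens[i + n:])
                    # Skip the empty template, which would match every query.
                    if prefix or postfix:
                        templates.add((prefix, postfix))
    return templates

queries = ["long term vioxx use", "side effects of xanax pills", "cheap flights"]
templates = extract_templates(queries, ["vioxx", "xanax"])
```

For the slide's example query “long term vioxx use”, the seed “vioxx” is the infix and the template is the pair (“long term”, “use”).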
Step 2: Generate candidate instances
Go over the query log again
Identify all queries that match template
Collect query infixes as candidate instances
{low blood pressure, xanax, lamictal, generic birth control, lipitor, vicodin, beta blockers, …}
Slide12
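Step 2 (generating candidate instances) can be sketched as a second pass over the log that collects the infix of every query fitting a known template. The helper names `match_template` and `collect_candidates` and the sample queries are hypothetical, and matching is assumed to be on whole whitespace-separated tokens.

```python
def match_template(query, prefix, postfix):
    """Return the infix if the query fits 'prefix <infix> postfix', else None."""
    tokens = query.lower().split()
    pre, post = prefix.split(), postfix.split()
    if len(tokens) <= len(pre) + len(post):
        return None                      # no room for a non-empty infix
    if tokens[:len(pre)] != pre:
        return None
    if post and tokens[-len(post):] != post:
        return None
    return " ".join(tokens[len(pre):len(tokens) - len(post)])

def collect_candidates(queries, templates):
    """Collect the infixes of all queries that match any template."""
    candidates = set()
    for query in queries:
        for prefix, postfix in templates:
            infix = match_template(query, prefix, postfix)
            if infix is not None:
                candidates.add(infix)
    return candidates

templates = {("side effects of", "pills")}
queries = [
    "side effects of lipitor pills",
    "side effects of birth control pills",
    "lipitor price",
]
candidates = collect_candidates(queries, templates)
```

Note that the candidate set is noisy by design: anything that fills the slot is collected, which is why the later steps rank candidates rather than accept them outright.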
Step 3. Compile search signatures
Each candidate is represented as a vector
Each template is a dimension
Weighted by frequency in queries
Slide13
Step 4. Reference signatures
Vectors for example class instances are combined
Prototype of search signature for the class
Slide14
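Steps 3 and 4 (search signatures and the reference signature) can be sketched with sparse vectors keyed by template. The slides do not say how the seed vectors are combined, so summing them is an assumption here; the function names and sample matches are hypothetical.

```python
from collections import Counter, defaultdict

def build_signatures(matches):
    """matches: iterable of (candidate, template) pairs, one per matching query.
    Returns {candidate: Counter({template: frequency})}."""
    signatures = defaultdict(Counter)
    for candidate, template in matches:
        signatures[candidate][template] += 1
    return signatures

def reference_signature(signatures, seeds):
    """Combine the seed instances' vectors by summing them (an assumed choice)."""
    reference = Counter()
    for seed in seeds:
        reference.update(signatures.get(seed, Counter()))
    return reference

side_effects = ("side effects of", "pills")
matches = [
    ("xanax", side_effects),
    ("xanax", side_effects),
    ("xanax", ("buy", "online")),
    ("vioxx", ("long term", "use")),
    ("vioxx", side_effects),
    ("lipitor", side_effects),
]
signatures = build_signatures(matches)
reference = reference_signature(signatures, ["xanax", "vioxx"])
```

Each template is one dimension of the vector, and its weight is the number of queries in which the candidate filled that template.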
Example
Slide15
Step 5. Compute signature similarity
Vector similarity between reference signature and candidate signature
Jensen-Shannon similarity function
Output is rank-ordered list
Drug: {viagra, phentermine, ambien, adderall, vicodin, hydrocodone, xanax, vioxx, oxycontin, cialis, valium, lexapro, ritalin, zoloft, percocet, …}
Slide16
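Step 5 (signature similarity) can be sketched with the Jensen-Shannon divergence between normalized signatures, ranking candidates from lowest to highest divergence. This is a sketch under assumptions: base-2 logarithms are used so the value lies between 0 and 1, and the names and sample vectors are hypothetical.

```python
import math

def normalize(vector):
    """Scale non-negative weights so they sum to 1."""
    total = sum(vector.values())
    return {key: value / total for key, value in vector.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2): 0 for identical, 1 for disjoint distributions."""
    p, q = normalize(p), normalize(q)
    divergence = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)            # mixture distribution
        if pk > 0:
            divergence += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            divergence += 0.5 * qk * math.log2(qk / mk)
    return divergence

def rank_candidates(reference, signatures):
    """Order candidates from most to least similar to the reference signature."""
    return sorted(signatures, key=lambda c: js_divergence(reference, signatures[c]))

reference = {"t1": 3, "t2": 1}
signatures = {
    "low blood pressure": {"t3": 5},
    "lipitor": {"t1": 2, "t2": 1},
}
ranking = rank_candidates(reference, signatures)
```

A candidate that shares no template with the reference gets the maximum divergence and falls to the bottom of the rank-ordered list.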
Evaluation
Slide17
Repeatability
Need enormous database of search query logs
Probably best done at Google or Microsoft
What can be done with small query databases?
What types of social media text could this method be applied to?
Slide18
Classifying the user intent of web queries using k-means clustering
Ashish Kathuria, Bernard J. Jansen, Carolyn Hafernik, and Amanda Spink
Slide19
Problem Introduction
The WWW plays a vital role in many people’s daily lives
Nearly 70 percent of searchers use a search engine
Search engines receive hundreds of millions of queries per day
Billions of results per week in response to these queries
Smart users: novel and increasingly assorted ways of searching
Slide20
Understanding intent behind searching
Can help to improve search engine performance via page ranking, result clustering, advertising, and presentation of results
Slide21
Approach
Automatically classify a large set of queries from a web search engine log as informational, navigational and transactional.
Encode the characteristics of informational, navigational and transactional queries identified from prior work to develop an automatic classifier using k-means clustering.
Use data-mining techniques to classify queries by user intent automatically and more accurately
Overcome limitations of previous research:
Small datasets
Limited methodology
Slide22
Classification of Queries
Images from http://moz.com/blog/segmenting-search-intent
Slide23
Research methodology
Dataset: Transaction log from Dogpile.
Each record has fields like: User identification, Cookie, Time of day, Query terms, Source
Step 1: Creating sessions and removing duplicates
The fields Time of day, User identification, Cookie, and Query were used to locate the initial query of a session and then recreate the series of actions in the session.
Collapsed the searches using user identification, cookie, and query to eliminate duplicate and null queries
Slide24
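The duplicate removal in Step 1 of the research methodology can be sketched as keeping the earliest record for each (user identification, cookie, query) key and dropping empty queries. The dictionary field names and the sample log below are hypothetical, and timestamps are assumed to sort correctly as strings.

```python
def collapse_duplicates(records):
    """Keep the first record per (user_id, cookie, query); drop null queries."""
    seen = set()
    kept = []
    for record in sorted(records, key=lambda r: r["time"]):
        query = record["query"].strip().lower()
        if not query:
            continue                     # null query
        key = (record["user_id"], record["cookie"], query)
        if key in seen:
            continue                     # repeated submission of the same query
        seen.add(key)
        kept.append(record)
    return kept

# Hypothetical sample log.
log = [
    {"user_id": "u1", "cookie": "c1", "time": "10:00:05", "query": "cheap flights"},
    {"user_id": "u1", "cookie": "c1", "time": "10:00:00", "query": "cheap flights"},
    {"user_id": "u1", "cookie": "c1", "time": "10:01:00", "query": ""},
    {"user_id": "u2", "cookie": "c2", "time": "10:02:00", "query": "Cheap Flights"},
]
unique = collapse_duplicates(log)
```

Sorting by time first means the record that survives for each key is the initial query, which is the one needed to rebuild the session.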
Research methodology
Step 2: Generating additional attributes
Calculated three additional attributes for each record: query length, query reformulation and result page
Step 3: Assignment of terms
1. Navigational:
Contain company/business/organization/people names
Queries containing portions of URLs or even complete URLs
2. Transactional:
Analysis, specifically via the identification of key terms related to transactional domains such as entertainment and ecommerce
3. Informational:
Queries that use natural language terms
Longer sessions than for informational searching
Slide25
Research methodology
Step 4: Textual data to numerical data
Step 5: Converting string to vector
Slide26
K-means Clustering
Navigational
Informational
Transactional
The resulting data set had four attributes that could be used for classification: query length, source, query reformulation rate, user intent weight of the query
Slide27
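The k-means clustering step over the four attributes (query length, source, query reformulation rate, user intent weight) can be sketched with plain Lloyd's algorithm. The numeric encodings, sample points and initial centroids below are made up for illustration, and a real run would normally scale the attributes first.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, initial_centroids, max_iterations=100):
    """Lloyd's algorithm; returns (centroids, labels)."""
    centroids = list(initial_centroids)
    k = len(centroids)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break                        # assignments stable: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Hypothetical records: (query length, source code, reformulation rate, intent weight)
points = [
    (1.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.1, 0.9),
    (6.0, 1.0, 0.5, 0.1), (7.0, 1.0, 0.6, 0.2),
    (3.0, 2.0, 0.2, 0.5), (3.0, 2.0, 0.3, 0.4),
]
centroids, labels = kmeans(points, [points[0], points[2], points[4]])
```

With three clusters, each cluster would then be interpreted as navigational, informational or transactional; the algorithm itself only produces unlabeled groups.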
Results
Performed on various datasets and achieved 94% accuracy
Overall, about 76 percent of the queries were classified as informational, while about 12 percent were classified as transactional, and 12 percent were classified as navigational
Slide28
Results
Navigational queries: Low rates of reformulation, typically sessions of just one query.
Informational queries: Low occurrences of query reformulation, probably indicating relatively easy informational needs, such as fact finding
Transactional queries: Shorter queries
Slide29
Discussion of approach
Limitations:
Is the Dogpile user population representative of web search engine users in general?
What if a prototype has multiple user intents associated with it?
Is relying solely on transaction logs sufficient?
Future Scope:
Investigate subcategories
A laboratory study on how searchers express their underlying intent
Develop algorithmic approaches for more in-depth analysis of individual queries
The approach has a high success rate, uses a large data set of queries and does not depend on external content, thereby making it implementable in real time.
Slide30
Summary
Identifying the user intent of web queries is very useful for web search engines because it would allow them to provide more relevant results to searchers and more precisely targeted sponsored links.
Classifying queries helps in focused search:
Informational queries: Provide relevant information and ads
Navigational queries: Provide links straight to a requested web page
Transactional queries: Focus on all commercial links for future purchase as well
The use of k-means as an automatic clustering and classification technique yielded positive results and opened effective ways to improve performance of web search engines.
Slide31
- Neha Mundada
Slide32
Acquiring Explicit Goals from Search Query Logs
Understanding human goals is necessary to
Recognize goals of actions
Create a plan
E.g., ‘plan a trip to Vienna’ has subgoals
‘contact travel agent’
‘book hotel’
‘buy concert tickets’, etc.
Automatically acquire human goals from search query logs
Acquire and organize commonsense knowledge
Slide33
Research overview
Research Question:
Can search query logs be utilized to overcome the problem of acquiring knowledge about human goals, and if so, how?
Following an exploratory research style, we intend to show:
Search query logs contain a small but interesting number of user goals
Queries containing goals can be separated by automatic methods
Results:
Knowledge about the automatic acquisition of goals out of search query logs
Knowledge about the nature of goals extracted from search query logs
Slide34
Results of Human Subject Study
4 independent raters labeled 3000 queries
Examples
bug killing devices
mothers working from home
how to lose weight
Classes appear to be separable
Slide35
Experimental Setup
AOL search query log
~20 million search queries recorded between March 1 and May 31 (2006)
ethical issues
pre-processing steps to reduce noise
5 million queries
labeled queries from the human subject study were utilized as training examples (controversial queries were omitted)
Slide36
Classification approach
Part of speech tagging
Maximum entropy tagger converts a sequence of words into a sequence of POS tags
Example
Query “buy a car”
buy/VB a/DT car/NN
Set of words {buy, car}
Part of speech trigrams
$ VB DT NN $
{$ VB DT, VB DT NN, DT NN $}
Slide37
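The two feature types in the “buy a car” example (the set of words and the part-of-speech trigrams) can be sketched as below. The tagger itself is not reproduced: the tagged input is given by hand, and the tiny stopword list is an assumption made only so that “a” drops out of the word set as on the slide.

```python
STOPWORDS = {"a", "an", "the"}   # assumed minimal list, for illustration only

def word_features(tagged_query):
    """Set of query words with stopwords removed."""
    return {word for word, _ in tagged_query if word not in STOPWORDS}

def pos_trigrams(tags, boundary="$"):
    """Trigrams over the tag sequence, padded with a boundary marker at both ends."""
    padded = [boundary] + list(tags) + [boundary]
    return [" ".join(padded[i:i + 3]) for i in range(len(padded) - 2)]

# Output of a POS tagger for the query "buy a car", written out by hand.
tagged = [("buy", "VB"), ("a", "DT"), ("car", "NN")]
words = word_features(tagged)
trigrams = pos_trigrams(tag for _, tag in tagged)
```

Both feature sets would then be fed to the classifier together, so that word choice and grammatical shape each contribute evidence for a goal.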
Classification approach (2)
Linear Support Vector Machine [Dumais98]
Robust and effective in the area of text classification
Weka Machine Learning Toolkit
http://www.cs.waikato.ac.nz/ml/weka/
Performance:
10 trials – 3-fold Cross Validation
Precision, Recall and F1-Measure for the class: “queries containing goals”
Precision = 0.77
Recall = 0.63
F1-Measure = 0.69
Slide38
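The reported F1-Measure for the class “queries containing goals” is the harmonic mean of the reported precision and recall, which can be checked directly. The helper names and the confusion counts in the second example are hypothetical.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision and recall for one class from its confusion counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values reported on the slide.
f1 = f1_score(0.77, 0.63)

# Hypothetical counts, only to show how precision and recall arise.
example_precision, example_recall = precision_recall(3, 1, 2)
```

Rounded to two decimals, `f1` agrees with the 0.69 on the slide.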
N-fold cross validation
Problem: limited amount of labeled data
Solution: N-fold cross validation
Divide data into N equal segments (folds)
Training data: N-1 folds
Testing data: remaining fold
Repeat for remaining test folds and average results
Slide39
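The N-fold cross validation procedure can be sketched as follows. The function names are hypothetical, the `evaluate` callback stands in for training and scoring a classifier, and shuffling before splitting is left out to keep the sketch short.

```python
def n_fold_splits(items, n):
    """Yield (train, test) lists, using each of n roughly equal folds as test once."""
    folds = [items[i::n] for i in range(n)]
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def cross_validate(items, n, evaluate):
    """Average the score returned by evaluate(train, test) over all n folds."""
    scores = [evaluate(train, test) for train, test in n_fold_splits(items, n)]
    return sum(scores) / len(scores)

labeled = list(range(9))          # stand-in for nine labeled queries
splits = list(n_fold_splits(labeled, 3))
# Toy evaluation: fraction of all items that ended up in the test fold.
average = cross_validate(labeled, 3, lambda train, test: len(test) / 9)
```

Every labeled item is used for testing exactly once and for training in the other N-1 rounds, which is what makes the method attractive when labeled data is scarce.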
Goals are diverse
Rank-Frequency plot of goals is heavy tailed
Few goals are shared by many users
Majority of goals are shared by only a few users
Slide40
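The rank-frequency data behind the “Goals are diverse” plot can be sketched by counting how often each goal occurs and sorting by frequency. The sample goals are made up for illustration (only “lose weight” echoes an example query from the deck), and the function names are hypothetical.

```python
from collections import Counter

def rank_frequency(goals):
    """Return (rank, goal, frequency) tuples, most frequent goal first."""
    counts = Counter(goals)
    return [
        (rank, goal, freq)
        for rank, (goal, freq) in enumerate(counts.most_common(), start=1)
    ]

def singleton_share(goals):
    """Fraction of distinct goals that occur exactly once (the long tail)."""
    counts = Counter(goals)
    return sum(1 for freq in counts.values() if freq == 1) / len(counts)

# Hypothetical extracted goals.
goals = ["lose weight"] * 3 + ["buy a car", "learn spanish", "fix a faucet"]
table = rank_frequency(goals)
tail = singleton_share(goals)
```

Plotting frequency against rank on logarithmic axes is the usual way to make a heavy tail like this visible.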
Most frequent goals
Slide41
Most frequent goals with “get”, “make”, “change” and “be”
Slide42
Summary
Web search queries are an abundant, but very sparse and very noisy, source of data about needs, desires, intentions of people
Clever methods can learn from these diverse data
Named entities
Goals
Can these methods be used in social media?