
Search Query Log Analysis - PowerPoint Presentation

Uploaded by tatiana-dople on 2018-11-04




Presentation Transcript

Slide 1

Search Query Log Analysis

Kristina Lerman

Slide 2

What can we learn from web search queries?

Characteristics

Length has steadily grown over the years

1990s: < 2 terms

2001: 2.4 terms

2014: long search queries, e.g., “where is the nearest coffee shop”

Heavy-tailed distribution of term frequency

Billions of queries

User intentions

Aggregate query words with results of search to learn user’s needs, wants, goals

Create a database of commonsense knowledge

Cf. Cyc

Does the data exist?

AOL search query log

Google Trends

Slide 3

2006 AOL search query log dataset

~20M web queries

~650K users

3-month period: March 1 – May 31, 2006

Data format:

AnonID – an anonymous user ID number

Query – the query issued by the user

QueryTime – the time the query was submitted

ItemRank – the rank of the item clicked in the results

ClickURL – the domain of the clicked item

Slide 4

Timeline

8/4/06: Announcement to SIG-IRList from AOL

8/6/06: TechCrunch slams AOL over privacy

8/7/06: Dataset removed

8/9/06: NYTimes identifies user 4417749 as Thelma Arnold, 62, from Lilburn, Georgia

8/21/06: AOL CTO Maureen Govern resigns; the AOL researcher and supervisor are fired

Slide 5

Slide 6

Weakly-supervised discovery of named entities using web search queries

Marius Pasca (Google)

CIKM-07: Conference on Information and Knowledge Management, Lisbon, Portugal

Slide 7

Weakly Supervised Discovery of Named Entities using Web Search (2007)

Goal: Automatically extract knowledge (entities) from texts created by many people

Discover new instances of classes:

Red Alert is a videogame

Lilburn is a town

Lorazepam is a drug

For what purpose?

Cataloging human knowledge

Understanding searching users

User #399392 in Lilburn takes Lorazepam, plays Red Alert

Slide 8

Intuition

Templates in queries:

“side effects of xanax pills”

“side effects of birth control pills”

“side effects of lipitor pills”

Prefix: “side effects of”

Postfix: “pills”

But templates are difficult to specify

Cf. extraction patterns in web information retrieval

Slide 9

“Weakly”-supervised approach

Guided by a small set of known seed instances

Input is a target class and some examples:

Drug: {phentermine, viagra, vicodin, vioxx, xanax}

City: {london, paris, san francisco, tokyo, toronto}

Food: {chicken, fish, milk, tomatoes, wheat}

Identify the patterns the seed instances occur in

Use the patterns to find more instances

Learn many more new instances automatically

Slide 10

Step 1: Identify query templates

Identify all queries that contain each known class instance, e.g., vioxx

Extract the left and right context:

“long term vioxx use”

Prefix: “long term”

Infix: “vioxx”

Postfix: “use”

Slide 11

Step 2: Generate candidate instances

Go over the query log again

Identify all queries that match a template

Collect the query infixes as candidate instances:

{low blood pressure, xanax, lamictal, generic birth control, lipitor, vicodin, beta blockers, …}

Slide 12
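Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not Pasca's actual implementation; the toy query log and seed set below are invented for the example.

```python
from collections import Counter

def extract_templates(queries, seeds):
    """Step 1: for each query containing a seed instance as an infix,
    record the surrounding (prefix, postfix) pair as a template."""
    templates = Counter()
    for q in queries:
        words = q.split()
        for seed in seeds:
            s = seed.split()
            for i in range(len(words) - len(s) + 1):
                if words[i:i + len(s)] == s:
                    prefix = " ".join(words[:i])
                    postfix = " ".join(words[i + len(s):])
                    if prefix or postfix:  # skip the bare seed query
                        templates[(prefix, postfix)] += 1
    return templates

def generate_candidates(queries, templates):
    """Step 2: re-scan the log; any query matching a template yields
    its infix as a candidate instance."""
    candidates = Counter()
    for q in queries:
        for (prefix, postfix) in templates:
            if q.startswith(prefix + " ") and q.endswith(" " + postfix):
                infix = q[len(prefix):len(q) - len(postfix)].strip()
                if infix:
                    candidates[infix] += 1
    return candidates

# Toy query log (invented)
queries = [
    "side effects of xanax pills",
    "side effects of lipitor pills",
    "side effects of beta blockers pills",
    "long term vioxx use",
]
templates = extract_templates(queries, {"xanax", "vioxx"})
candidates = generate_candidates(queries, templates)
# candidates now includes lipitor and beta blockers as new drug candidates
```

In a real log the templates would number in the millions, so both passes would be implemented as streaming scans rather than nested loops.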

Step 3: Compile search signatures

Each candidate is represented as a vector

Each template is a dimension

Weighted by frequency in queries

Slide 13

Step 4: Reference signatures

Vectors for the example class instances are combined

Prototype of the search signature for the class

Slide 14

Example

Slide 15

Step 5: Compute signature similarity

Vector similarity between the reference signature and each candidate signature

Jensen-Shannon similarity function

Output is a rank-ordered list:

Drug: {viagra, phentermine, ambien, adderall, vicodin, hydrocodone, xanax, vioxx, oxycontin, cialis, valium, lexapro, ritalin, zoloft, percocet, …}

Slide 16
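Steps 3 through 5 can be illustrated together in a small sketch: each candidate's signature is a template-frequency vector, the reference signature combines the seed vectors, and candidates are ranked by Jensen-Shannon divergence (lower divergence = more similar). All frequency values here are invented, and the exact weighting in the paper may differ.

```python
import math

def normalize(vec, dims):
    """Turn a sparse {template: count} map into a probability vector over dims."""
    total = sum(vec.get(d, 0) for d in dims) or 1.0
    return [vec.get(d, 0) / total for d in dims]

def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability vectors."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Template dimensions and toy frequency vectors (invented for illustration)
dims = [("side effects of", "pills"), ("buy", "online"), ("map of", "")]
seed_sigs = {
    "xanax": {dims[0]: 9, dims[1]: 3},
    "vioxx": {dims[0]: 7, dims[1]: 1},
}
candidates = {
    "lipitor": {dims[0]: 8, dims[1]: 2},   # drug-like signature
    "tokyo":   {dims[2]: 10},              # city-like signature
}

# Step 4: reference signature = average of the normalized seed signatures
ref = [sum(normalize(s, dims)[i] for s in seed_sigs.values()) / len(seed_sigs)
       for i in range(len(dims))]

# Step 5: rank candidates, most similar to the drug prototype first
ranked = sorted(candidates,
                key=lambda c: js_divergence(ref, normalize(candidates[c], dims)))
# -> ["lipitor", "tokyo"]
```

A candidate whose queries match the drug templates ends up close to the drug prototype, while a city-like candidate is pushed to the bottom of the ranked list.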

Evaluation

Slide 17

Repeatability

Needs an enormous database of search query logs

Probably best done at Google or Microsoft

What can be done with small query databases?

What types of social media text could this method be applied to?

Slide 18

Classifying the user intent of web queries using k-means clustering

Ashish Kathuria, Bernard J. Jansen, Carolyn Hafernik, and Amanda Spink

Slide 19

Problem Introduction

The WWW plays a vital role in many people’s daily lives

Nearly 70 percent of searchers use a search engine

Search engines receive hundreds of millions of queries per day

Billions of results per week are returned in response to these queries

Smart users: novel and increasingly assorted ways of searching!

Slide 20

Understanding intent behind searching

Can help improve search engine performance via page ranking, result clustering, advertising, and presentation of results

Slide 21

Approach

Automatically classify a large set of queries from a web search engine log as informational, navigational, or transactional

Encode the characteristics of informational, navigational, and transactional queries identified in prior work to develop an automatic classifier using k-means clustering

Use data-mining techniques to more accurately and automatically classify queries by user intent

Overcome limitations of previous research:

Small datasets

Limited methodology

Slide 22

Classification of Queries

Images from http://moz.com/blog/segmenting-search-intent

Slide 23

Research methodology

Dataset: transaction log from Dogpile

Each record has fields such as: user identification, cookie, time of day, query terms, source

Step 1: Creating sessions and removing duplicates

The Time of day, User identification, Cookie, and Query fields were used to locate the initial query of a session and then recreate the series of actions in the session.

The searches were collapsed using user identification, cookie, and query to eliminate duplicate result and null queries

Slide 24

Research methodology

Step 2: Generating additional attributes

Calculated three additional attributes for each record: query length, query reformulation, and result page

Step 3: Assignment of terms

1. Navigational: contain company/business/organization/people names; queries containing portions of URLs or even complete URLs

2. Transactional: identified via analysis, specifically the identification of key terms related to transactional domains such as entertainment and ecommerce

3. Informational: queries that use natural language terms; longer sessions than for navigational searching

Slide 25

Research methodology

Step 4: Converting textual data to numerical data

Step 5: Converting string to vector

Slide 26

K-means Clustering

Clusters: Navigational, Informational, Transactional

The resulting data set had four attributes that could be used for classification: query length, source, query reformulation rate, and user intent weight of the query

Slide 27
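A minimal sketch of the clustering step, assuming simple numeric feature vectors per query. The toy feature values below (query length, reformulation rate) are invented for illustration; the paper's actual four-attribute pipeline is richer.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        centroids = [
            tuple(sum(p[d] for p in cl) / len(cl)
                  for d in range(len(points[0]))) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Toy query feature vectors: (query length, reformulation rate)
points = [
    (1, 0.0), (1, 0.1), (2, 0.0),    # short, rarely reformulated
    (6, 0.2), (7, 0.3), (6, 0.25),   # long natural-language queries
    (3, 0.6), (3, 0.7), (2, 0.65),   # short, often reformulated
]
centroids, clusters = kmeans(points, k=3)
```

With k=3 the clusters can then be labeled navigational, informational, or transactional by inspecting their centroids, which is the labeling move the paper describes.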

Results

Performed on various datasets and achieved 94% accuracy

Overall, about 76 percent of the queries were classified as informational, about 12 percent as transactional, and 12 percent as navigational

Slide 28

Results

Navigational queries: low rates of reformulation; typically sessions of just one query

Informational queries: low occurrences of query reformulation, indicating probably relatively easy informational needs, such as fact finding

Transactional queries: shorter queries

Slide 29

Discussion of approach

Limitations:

Is the Dogpile user population representative of web search engine users in general?

What if a prototype has multiple user intents associated with it?

Is relying solely on transactional logs sufficient?

Future scope:

Investigate subcategories

A laboratory study on how searchers express their underlying intent

Develop algorithmic approaches for more in-depth analysis of individual queries

The approach has a high success rate; it uses a large data set of queries and does not depend on external content, making it implementable in real time.

Slide 30

Summary

Identifying the user intent of web queries is very useful for web search engines because it would allow them to provide more relevant results to searchers and more precisely targeted sponsored links.

Classifying queries helps in focused search:

Informational queries: provide relevant information and ads

Navigational queries: provide links straight to the requested web page

Transactional queries: focus on commercial links for a future purchase

The use of k-means as an automatic clustering and classification technique yielded positive results and opened effective ways to improve the performance of web search engines.

Slide 31

- Neha Mundada

Slide 32

Acquiring Explicit Goals from Search Query Logs

Understanding human goals is necessary to:

Recognize the goals of actions

Create a plan

E.g., ‘plan a trip to Vienna’ has subgoals ‘contact travel agent’, ‘book hotel’, ‘buy concert tickets’, etc.

Automatically acquire human goals from search query logs

Acquire and organize commonsense knowledge

Slide 33

Research overview

Research question: If and how can search query logs be utilized to overcome the problem of acquiring knowledge about human goals?

Following an exploratory research style, we intend to show:

Search query logs contain a small but interesting number of user goals

These goals can be separated by automatic methods

Results:

Knowledge about the automatic acquisition of goals from search query logs

Knowledge about the nature of goals extracted from search query logs

Slide 34

Results of Human Subject Study

4 independent raters labeled 3000 queries

Examples:

bug killing devices

mothers working from home

how to lose weight

Classes appear to be separable

Slide 35

Experimental Setup

AOL search query log

~20 million search queries recorded between March 1 and May 31, 2006

Ethical issues

Pre-processing steps to reduce noise left 5 million queries

Labeled queries from the human subject study were utilized as training examples (controversial queries were omitted)

Slide 36

Classification approach

Part-of-speech tagging

A maximum entropy tagger converts a sequence of words into a sequence of POS tags

Example:

Query “buy a car” → buy/VB a/DT car/NN

Set of words: {buy, car}

Part-of-speech trigrams:

$ VB DT NN $ → {$ VB DT, VB DT NN, DT NN $}

Slide 37
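The trigram construction over the padded tag sequence can be sketched as follows; the tag sequence is taken from the slide's "buy a car" example, and a real pipeline would obtain the tags from a POS tagger rather than hard-coding them.

```python
def pos_trigrams(tags):
    """Pad the POS tag sequence with boundary markers '$' and
    return all consecutive trigrams as feature strings."""
    padded = ["$"] + tags + ["$"]
    return [" ".join(padded[i:i + 3]) for i in range(len(padded) - 2)]

# "buy a car" -> buy/VB a/DT car/NN
trigrams = pos_trigrams(["VB", "DT", "NN"])
# -> ["$ VB DT", "VB DT NN", "DT NN $"]
```

The boundary markers let the classifier see which tag patterns start and end a query, which is useful because goal queries often begin with a verb.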

Classification approach (2)

Linear Support Vector Machine [Dumais98]

Robust and effective in the area of text classification

Weka Machine Learning Toolkit: http://www.cs.waikato.ac.nz/ml/weka/

Performance: 10 trials of 3-fold cross validation

Precision, recall, and F1-measure for the class “queries containing goals”:

Precision = 0.77

Recall = 0.63

F1-Measure = 0.69

Slide 38
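The reported F1 follows directly from the precision and recall values as their harmonic mean, which is easy to verify:

```python
def f1_score(precision, recall):
    """F1 = harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values reported for the class "queries containing goals"
f1 = f1_score(0.77, 0.63)  # ~0.69
```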

N-fold cross validation

Problem: limited amount of labeled data

Solution: N-fold cross validation

Divide data into N equal segments (folds)

Training data: N-1 folds

Testing data: remaining fold

Repeat for the remaining test folds and average the results

Slide 39
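The fold-splitting scheme above can be sketched as follows (a generic illustration, not the authors' code):

```python
def n_fold_splits(data, n):
    """Divide data into n near-equal folds; yield (train, test)
    pairs where each fold serves as the test set exactly once."""
    folds = [data[i::n] for i in range(n)]
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(9))
splits = list(n_fold_splits(data, 3))
# 3 splits, each with 6 training items and 3 test items
```

A classifier score would be computed on each test fold and the three scores averaged, matching the "10 trials of 3-fold cross validation" protocol on the previous slide.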

Goals are diverse

The rank-frequency plot of goals is heavy-tailed

A few goals are shared by many users

The majority of goals are shared by only a few users

Slide 40

Most frequent goals

Slide 41

Most frequent goals with “get”, “make”, “change” and “be”

Slide 42

Summary

Web search queries are an abundant, but very sparse and very noisy, source of data about the needs, desires, and intentions of people

Clever methods can learn from these diverse data

Named entities

Goals

Can these methods be used in social media?