User-centered System Evaluation
Reference
Diane Kelly (2009). Methods for Evaluating Interactive Information Retrieval Systems with Users. Foundations and Trends in Information Retrieval, 3(1-2), 1-224. DOI: 10.1561/1500000012
Introduction
Interactive Information Retrieval (IIR)
Traditional IR evaluations abstract users out of the evaluation process
IIR focuses on users' behaviors and experiences:
physical, cognitive, and affective
Interactions between users and systems
Interactions between users and information
Different evaluation questions
Classic IR evaluation (non-user-centric): does this system retrieve relevant documents?
IIR evaluation (user-centric): can people use this system to retrieve relevant documents?
Therefore, IIR is also viewed as a sub-area of HCI
Relevance Feedback
The same information need can lead to
different queries, different search results, and different relevance feedback.
Dealing with users is difficult because the causes and consequences of interactions cannot be observed easily (they are in the user's head)
The available observations: issuing a query, saving a document, providing relevance feedback
Based on these observations, we must infer what is happening in the user's head
Difficulties
Each individual user has a different cognitive composition and behavioral disposition
Some interactions are not easily observable or measurable
Motivation
How much the user knows about the topic
Expectations
IIR
Using users to evaluate IR
Different approaches
Using users to evaluate the research results of a system (users are treated as black boxes)
Search log analysis (queries, search results and click-through behavior)
TREC Interactive Track evaluation model (evaluating a system or interface)
General information search behavior in electronic environments (observing and documenting users' natural search behaviors and interactions)
Approaches
Research goals
Setting up a clear research goal:
Exploration: used when little is known about the subject; the focus is on learning about the subject rather than making predictions. Research questions are open-ended and formal hypotheses are uncommon.
Description: documenting and describing a subject (e.g., query-log or query-behavior analysis) to provide benchmark descriptions and classifications; results can be used to inform other studies.
Explanation: examining the relationship between two or more variables with the goals of prediction and explanation, and of establishing causality.
Approaches
Evaluations vs. Experiments
Evaluation: to assess the goodness of a system, interface or interaction technique.
Experiments: to understand behavior (similar to experiments in psychology or education); compare at least two things.
Lab and naturalistic studies
Lab (more control) vs. naturalistic (less control)
Longitudinal studies
Take place over an extended period of time, with measurements taken at fixed intervals.
Approaches
Case studies
The intensive study of a small number of cases
A case may be a user, a system, or an organization.
Case studies usually take place in naturalistic settings and involve some longitudinal elements.
The goal is not generalization but an in-depth view of a particular case.
Wizard of Oz studies and simulations
Testing a "non-real" or simulated system
Used for proof-of-concept
Provide an indication of what might happen in ideal circumstances
Wizard of Oz studies are simulations
Simulated users can represent different actions or steps a real user might take while interacting with an IR system
Research basics
Problems and Questions
Identify and describe problems
Provide roadmap for research
Example of research questions
Exploratory:
How do people re-find information on the Web?
Descriptive:
What Web browser functionalities are currently being used during web-based information-seeking tasks?
Explanatory:
What are the differences between written and spoken queries in terms of their retrieval characteristics and performance outcomes?
What is the relationship between query box size and query length? What is the relationship between query length and performance?
Theory
A theory is a system of logical principles that attempts to explain relations among natural, observable phenomena.
Theory is abstract and general, and can generate more specific hypotheses
Hypotheses
Hypotheses state expected relationships between two variables
Alternative hypotheses vs. null hypotheses
Specific relationship vs. no relationship
Hypotheses can be directional or non-directional
Variables and measurement
Variables represent concepts
To analyze concepts
Conceptualization
Defining concepts: provide a working definition and divide the concept into dimensions
Operationalization
How to measure the concept:
Direct and indirect observables
Directly observed:
# of queries entered, the amount of time spent searching
Indirectly observed:
User satisfaction
Variables
Independent: the causes
E.g., examining differences in how males and females use an experimental and a baseline IIR system: sex is the independent variable
Dependent: the effects
E.g., satisfaction with or performance on the systems
Confounding variables
Affect the independent or dependent variable, but have not been controlled by the researcher
E.g., maybe males are more familiar with these systems than females.
Measurement
Range of variation
Preciseness of the measure
E.g., categories of usage frequency of a system
Exhaustiveness
Complete list of choices
Exclusiveness
How to differentiate partially relevant vs. somewhat relevant (in your relevance rubric)
Equivalence
Find items that are of the same type and at the same level of specificity
Anchors on different scales should be equivalent: "I know details" = very familiar, "I know nothing" = very unfamiliar
Appropriateness
E.g., "How likely are you to recommend this system to others?" with a five-point scale from strongly agree to strongly disagree, which does not match the question
Level of Measurement
Two basic levels of measurement: discrete vs. continuous
Discrete measures: categorical responses
Nominal: no order
E.g., interface type, sex, task-type
Ordinal: ordered
Rank-order (from most relevant to least relevant) or Likert-type order (five-point scale with 1 = not relevant, 5 = relevant)
Relative measure: one subject's 2 may not represent the same thing internally as another subject's 2.
We could not say that a document rated 4 was twice as relevant as a document rated 2, since the scale contains no true zero
Level of Measurement
Two basic levels of measurement: discrete vs. continuous
Continuous measure: interval vs. ratio
Interval: differences between consecutive points are equal, but there is no true zero
E.g., the Fahrenheit temperature scale, IQ test scores
Zero does not mean no heat or no intelligence
The difference between 50 and 80 is the same as the difference between 90 and 120
Ratio: the highest level of measurement, e.g., the number of occurrences.
There is a true zero
E.g., time, number of pages viewed (zero is meaningful)
Experimental design
The basic experimental design in IIR evaluation examines the relationship between two or more systems or interfaces (independent variable) on some set of outcome measures (dependent variables)
IIR design
General goal of IIR is to determine if a particular system helps subjects find relevant documents
Developing a valid baseline in IIR evaluation involves identifying the status quo against which the experimental system can be compared.
Random assignment can be used to increase the likelihood that subject characteristics are evenly distributed across groups (see the sketch below)
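As a minimal sketch (not from the monograph), random assignment of a hypothetical subject pool to a baseline and an experimental system might look like this:

```python
import random

# Hypothetical subject pool and condition names (illustrative only).
subjects = [f"S{i:02d}" for i in range(1, 17)]
conditions = ["baseline", "experimental"]

random.shuffle(subjects)  # let chance order the subjects

# Deal subjects out evenly, so each condition gets the same number and
# subject characteristics are distributed by chance rather than by choice.
assignment = {c: subjects[i::len(conditions)] for i, c in enumerate(conditions)}

for condition, group in assignment.items():
    print(condition, group)
```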
Factorial Designs
Good for studying the impact of more than one stimulus or variable (see the sketch below)
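A minimal sketch of how factorial conditions are formed: crossing two hypothetical factors (system and task type) enumerates every condition of a 2×2 design.

```python
from itertools import product

# Hypothetical 2x2 factorial design: two independent variables, two levels each.
systems = ["baseline", "experimental"]
task_types = ["navigational", "informational"]

# The full design crosses every level of each factor with every level of the other.
conditions = list(product(systems, task_types))
print(len(conditions), "conditions:", conditions)  # 4 conditions
```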
Rotation and counterbalancing
The primary purpose of rotation and counterbalancing is to control for order effects and to increase the chance that results can be attributed to the experimental treatments and conditions.
Rotating variables:
Latin square design
Graeco-Latin square design
Rotation and counterbalancing
A basic design with no rotation. Numbers in cells represent different topics
Cons:
Order effects
Some topics are easier than others, some systems may do better with some topics than others.
Fatigue can impact the results
Latin Square rotation
Basic Latin square rotation of topics (see the sketch below)
Basic Latin square rotation of topics with randomization of columns
Problems:
Interactions among topics
The order effects of interfaces still exist
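A minimal sketch of the basic cyclic rotation (illustrative, with hypothetical topic labels): each row is the topic order for one subject, so every topic appears once in every position.

```python
def latin_square(items):
    """Cyclic Latin square: row i is the item list rotated left by i."""
    n = len(items)
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]

for row in latin_square(["T1", "T2", "T3", "T4"]):
    print(row)
# ['T1','T2','T3','T4'] / ['T2','T3','T4','T1'] / ['T3','T4','T1','T2'] / ['T4','T1','T2','T3']
```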
Graeco-Latin Square Design
Solves the problem of interface order effects noted above.
A Graeco-Latin square is a combination of two or more Latin squares (a construction is sketched below).
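One way to build such a square (a sketch, not Kelly's procedure): superimpose two cyclic Latin squares, one rotating topics and one rotating interfaces. The step-size trick below yields orthogonal squares when the number of items is odd.

```python
def graeco_latin_square(topics, interfaces):
    """Superimpose two cyclic Latin squares (orthogonal for odd n)."""
    n = len(topics)
    assert len(interfaces) == n and n % 2 == 1, "this construction needs odd n"
    return [[(topics[(i + j) % n], interfaces[(i + 2 * j) % n])
             for j in range(n)] for i in range(n)]

for row in graeco_latin_square(["T1", "T2", "T3"], ["A", "B", "C"]):
    print(row)
# Each topic and each interface appears once per row and column,
# and every (topic, interface) pair occurs exactly once overall.
```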
Study mode
Batch-mode
Multiple subjects complete the study at the same location and time
Single-mode
Subjects complete the study alone, with only the researcher present.
The choice of mode is determined by the purpose of the study.
Single-mode: if each subject has to be interviewed, or some interactive communication needed between subject and researcher
Batch-mode: self-contained, efficient (but subjects can influence each other)
Protocols
A protocol is a step-by-step account of what will happen in a study.
A protocol helps maintain the integrity of the study and ensures that subjects experience the study in similar ways.
Tutorials
Provide some instruction on how to use a new IIR system
Printed materials
Verbal instructions
Video tutorial
Try to avoid bias in the tutorial,
such as focusing especially on one particular feature.
Pilot testing
To estimate time
To identify problems with instruments, instructions, and protocols
To get detailed feedback from test subjects
Sampling
Sampling
It is not possible to include all elements from a population in a study
The population in IIR evaluation is assumed to be all people who engage in online information search.
Sample size: the more, the better
Two approaches to sampling: probability sampling and non-probability sampling
Probability Sampling
Selecting a sample from a population that maintains the same variation and diversity that exists within the population.
Representative sample:
If 60% of a population are males and 40% are females, then a representative sample would also contain roughly the same ratio of males to females.
Increases the generalizability of the results
Assumes that all elements in the population have an equal chance of being selected.
Probability sampling
Simple random sampling
Randomly select elements
Systematic sampling
Select every kth element, where k = population size / sample size
Stratified sampling
Subdivide the population into more refined groups according to specific strata
Select a sample that is proportionate to the population in each stratum (both list-based procedures are sketched below).
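A minimal sketch of the two list-based procedures, assuming a hypothetical population list as the sampling frame:

```python
import random

population = [f"user{i}" for i in range(1000)]  # hypothetical sampling frame

def systematic_sample(pop, sample_size):
    """Take every kth element, k = population size / sample size, from a random start."""
    k = len(pop) // sample_size
    start = random.randrange(k)
    return pop[start::k][:sample_size]

def stratified_sample(strata, sample_size):
    """strata: dict mapping stratum name -> members; draw proportionately from each."""
    total = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        n = round(sample_size * len(members) / total)
        sample.extend(random.sample(members, n))
    return sample

print(len(systematic_sample(population, 50)))  # 50
print(len(stratified_sample({"male": population[:600], "female": population[600:]}, 50)))  # 30 + 20
```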
Non-probability sampling
Used when the full set of elements in a population is unknown or unavailable.
This limits the ability to generalize
Researchers should be cautious when generalizing their data and be aware of the sampling limitations in their research.
Non-probability sampling
Three major types of non-probability sampling:
Convenience: relying on available elements the researcher can access, e.g., undergraduate students or people located close to the researcher.
Purposive or judgmental sampling: the researcher selects subjects or other elements that have particular characteristics, expertise, or perspectives
Quota sampling: similar to stratified sampling, but subjects for each stratum are selected on a first-come, first-served basis.
Subject Recruitment
Many ways to recruit subjects
Send solicitations to mailing lists
Invitations
Using referral services
Crowdsourcing (e.g., Mechanical Turk)
Web advertising
Mass mailings
Virtual postings in online locations
Pros and cons: using lab mates or members of one's own research group as study subjects
Collections
Collections for testing
Identification of a set of documents for subjects to search, a set of tasks or topics that directs this searching, and the ground truth about the relevance of the searched objects to the topics
A test collection: corpus, topics, and relevance judgments
TREC collections
TREC Interactive and HARD tracks
Newswire, blog, legal
Artificial topics
Relevance assessment generalization problem
Web corpora
The major drawback is that it is impossible to replicate the study since the Web is constantly changing.
The same queries issued at different times can return completely different results
Natural corpora
Corpora assembled over time by study participants
Pros: meaningful to subjects, controllable
Cons: lack of replicability and equivalence across participants
Tasks and topics
Most information needs can be characterized in terms of tasks and topics
Information need = task = topic
Information needs
People do not know their information needs
People have difficulty articulating their information needs,
or using a vocabulary appropriate for a system
Generating information needs
It is not clear at what level of specificity a task or topic should be defined
A task can be broken down into a series of sub-tasks (e.g., writing a research proposal)
Query logs can be mined to develop information needs
Data collection techniques
Data collection techniques
Corpora, tasks, topics, and relevance assessments are major instruments to evaluate IIR systems
Other instruments, such as questionnaires and screen-capture software, also allow researchers to collect data.
Think-Aloud
Subjects articulate their thinking and decision-making during the evaluation process of IIR.
Tools: microphone, recording software
It is unnatural, as most people do not articulate their thoughts as they complete tasks.
Stimulated Recall
The researcher records the computer screen as the subject completes a search task. The recording is then played back to the subject, who is asked to articulate their thinking and decision-making.
Tool: screen-recording software
Spontaneous and prompted self-report
Elicit feedback from subjects periodically while they search.
Goal: get more refined feedback about the search, rather than summative feedback at the end of the search
Observation
The researcher sits near subjects and observes them as they conduct IIR activities
Tool: video camera, screen capture software
Time consuming, and labor intensive
Prone to selective attention and researcher bias.
Logging
Analyzing transaction logs.
Client-side logging provides a more robust and comprehensive log of the user’s interactions.
But it is hard to build a client-side logger
Questionnaire
Consist of:
Closed questions, where a specific response set is provided (e.g., a five-point scale): quantitative analysis
Open questions: qualitative analysis
Closed questions use Likert-type scales (e.g., five to seven points: strongly agree, agree, neutral, disagree, strongly disagree)
Open questions are analyzed via content analysis
Different modes: electronic, pen-and-paper, interview
Interview
Few IIR evaluations consist solely of interviews, but interviews are a common component of many study protocols.
Subjects respond to open-ended questions better in interviews than in the other two modes (electronic or pen-and-paper)
Interviews can be structured, semi-structured, or open
Measures
Four basic measures
Four basic classes of measures
Contextual (age, sex, search experience, personality type)
Interaction (# of queries issued, # of documents viewed, query length): can be extracted from log data
Performance (# of relevant documents saved, mean average precision, discounted cumulated gain): can be computed from log data
Usability: subjects' attitudes and feelings about the system and their interactions
Contextual
Individual differences: their impact on the study results
Information needs: domain expertise is measured using credentials
Persistence of information needs
Immediacy of information need
Information-seeking stage
Interaction
Measures:
# of queries, # of search results viewed, # of documents viewed, # of documents saved, query length
The implicit definition of interaction is tied to feedback (a sketch of computing these measures from a log follows)
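As an illustration, such measures can be computed directly from a transaction log; the (subject, action, detail) record format below is hypothetical.

```python
from collections import Counter

# Hypothetical transaction log: one (subject, action, detail) event per row.
log = [
    ("S01", "query", "cheap flights paris"),
    ("S01", "view_doc", "d17"),
    ("S01", "save_doc", "d17"),
    ("S01", "query", "paris budget airlines"),
    ("S01", "view_doc", "d23"),
]

actions = Counter(action for _, action, _ in log)
queries = [detail for _, action, detail in log if action == "query"]
mean_query_length = sum(len(q.split()) for q in queries) / len(queries)

print("queries issued:", actions["query"])              # 2
print("documents viewed:", actions["view_doc"])         # 2
print("documents saved:", actions["save_doc"])          # 1
print("mean query length (terms):", mean_query_length)  # 3.0
```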
Performance
Directly applying TREC measures to IIR evaluation assumes that relevance is binary, static, uni-dimensional, and generalizable
It is questionable whether TREC-based performance metrics are meaningful to end users
A measure that evaluates systems based on the retrieval of 1,000 documents is unlikely to be meaningful to users, since most users will not look through 1,000 documents.
Traditional IR performance measures
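The slide's formulas did not survive the transcript; for reference, the standard definitions are:

```latex
\mathrm{Precision} = \frac{|\,\text{relevant} \cap \text{retrieved}\,|}{|\,\text{retrieved}\,|}
\qquad
\mathrm{Recall} = \frac{|\,\text{relevant} \cap \text{retrieved}\,|}{|\,\text{relevant}\,|}
```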
Interactive recall and precision
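This slide's formulas are also lost; a common formulation (substituting the documents the user saves for the system's retrieved set) is:

```latex
\mathrm{Precision}_{\text{interactive}} = \frac{|\,\text{relevant documents saved}\,|}{|\,\text{documents saved}\,|}
\qquad
\mathrm{Recall}_{\text{interactive}} = \frac{|\,\text{relevant documents saved}\,|}{|\,\text{relevant documents in the collection}\,|}
```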
Measures that accommodate multi-level relevance and rank
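The canonical example here is discounted cumulated gain, mentioned earlier among the performance measures; in one widely used form, with rel_i the graded relevance of the document at rank i:

```latex
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}
```

Variants differ in the discount function and in whether the result is normalized (nDCG).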
Time-based measures
A variety of time-based measures
The length of time subjects spend in different states or modes
The amount of time it takes a subject to save the first relevant article
The number of relevant documents saved during a fixed period of time
The number of actions or steps taken to complete a task
Cost and utility measures
Some search services are not free
Cost and utility measures have always been an important part of the evaluation of library and information services
Evaluative feedback from subjects
Usability
Effectiveness, efficiency and satisfaction as key dimensions of usability
Effectiveness: precision, recall
Efficiency: the time it takes a subject to complete a task.
Satisfaction: subjects' satisfaction with each experimental feature of the system, and their perceptions of outcomes and interactions
Available instruments for measuring usability
Questionnaire for User Interface Satisfaction (QUIS): http://lap.umd.edu/quis/
10-point scale for software, screen, terminology, system, etc.
The USE questionnaire
Usefulness, ease of use, ease of learning, satisfaction (7-point scale)
Software Usability Measurement Inventory (SUMI): http://sumi.ucc.ie/whatis.html
Agree, do not know, and disagree for 50 items
Data analysis
Qualitative data analyses
The goal of most qualitative data analyses conducted in IIR is to reduce qualitative responses into a set of categories or themes that can be used to characterize and summarize the responses.
Content analysis: starts with a well-defined and structured classification scheme, including categories and classification rules.
Open coding: categories are developed inductively as the researcher analyzes the data.
Quantitative data analysis
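This slide carries no detail in the transcript; as a minimal, hypothetical illustration, a significance test comparing a dependent measure across two systems might look like:

```python
from scipy import stats  # assumes SciPy is available

# Hypothetical per-subject precision scores under each system.
baseline = [0.42, 0.51, 0.38, 0.47, 0.55, 0.40, 0.49, 0.44]
experimental = [0.53, 0.61, 0.48, 0.59, 0.62, 0.50, 0.57, 0.52]

# Independent-samples t-test (use stats.ttest_rel for a within-subjects design).
t, p = stats.ttest_ind(experimental, baseline)
print(f"t = {t:.2f}, p = {p:.4f}")
```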
Validity and reliability
Validity
Internal validity: quality of what happens during the study
Whether the selected instrument yields poor or inaccurate data
External validity: to what extent the results from a study can be generalized to the real world.
Lab studies are generally less valid (externally), but more reliable, than naturalistic studies
Using instruments with established reliability