User-centered System Evaluation

Presentation Transcript


Reference

Diane Kelly (2009). Methods for Evaluating Interactive Information Retrieval Systems with Users. Foundations and Trends in Information Retrieval, 3(1-2), 1-224. DOI: 10.1561/1500000012

Introduction

Interactive Information Retrieval (IIR)

Traditional IR evaluations abstract users out of the evaluation process

IIR focuses on users' behaviors and experiences: physical, cognitive, and affective

Interactions between users and systems

Interactions between users and information

Different evaluation questions

Classic IR evaluation (non-user-centric): does this system retrieve relevant documents?

IIR evaluation (user-centric): can people use this system to retrieve relevant documents?

Therefore, IIR is viewed as a sub-area of HCI

Relevance Feedback

The same information need → different queries → different search results → different relevance feedback.

Dealing with users is difficult, as the causes and consequences of interactions cannot be observed easily (they are in the user's head).

The available observations: issuing a query, saving a document, providing relevance feedback.

Based on these observations, we must make inferences.

Difficulties

Each individual user has a different cognitive composition and behavioral disposition

Some interactions are neither easily observable nor measurable:

Motivation

Topic familiarity (how much the user knows about the topic)

Expectations

IIR

Using users to evaluate IR

Different approaches

Using users to evaluate the research results of a system (users are treated as black boxes)

Search log analysis (queries, search results and click-through behavior)

TREC Interactive Track evaluation model (evaluating a system or interface)

General information search behavior in electronic environments (observing and documenting users' natural search behaviors and interactions)

Approaches

Research goals

Setting up a clear research goal:

Exploration: used when the subject is less known; the focus is on learning about the subject rather than making predictions. The inquiry is open-ended, and formal research questions or hypotheses are uncommon.

Description: documenting and describing a subject (e.g., query log or query behavior analysis) to provide benchmark descriptions and classifications; results can be used to inform other studies.

Explanation: examining the relationship between two or more variables with the goal of prediction and explanation, and of establishing causality.

Approaches

Evaluations vs. Experiments

Evaluation: to assess the goodness of a system, interface or interaction technique.

Experiments: to understand behavior (similar to experiments in psychology or education); they compare at least two things.

Lab and naturalistic studies

Lab (more controls) vs. naturalistic (less controls)

Longitudinal studies

They take place over an extended period of time, with measurements taken at fixed intervals.

Approaches

Case studies

The intensive study of a small number of cases

A case may be a user, a system, or an organization.

It usually takes place in naturalistic settings and involves some longitudinal elements.

The goal is not generalization but an in-depth view of a particular case.

Wizard of Oz studies and simulations

Testing a “non-real” or simulated system

Used for proof-of-concept

Provide an indication of what might happen in ideal circumstances

Wizard of Oz studies are simulations

Simulated users can represent different actions or steps a real user might take while interacting with an IR system

Research basics

Problems and Questions

Identify and describe problems

Provide roadmap for research

Example of research questions

Exploratory:

How do people re-find information on the Web?

Descriptive:

What Web browser functionalities are currently being used during web-based information-seeking tasks?

Explanatory:

What are the differences between written and spoken queries in terms of their retrieval characteristics and performance outcomes?

What is the relationship between query box size and query length? What is the relationship between query length and performance?

Theory

A theory is a system of logical principles that attempts to explain relations among natural, observable phenomena.

Theory is abstract and general; it can generate more specific hypotheses.

Hypotheses

Hypotheses state expected relationships between two variables

Alternative hypotheses vs. null hypotheses

Specific relationship vs. no relationship

Hypotheses can be directional or non-directional.

Variables and measurement

Variables represent concepts.

To analyze concepts:

Conceptualization

Defining the concept: provide a temporary (working) definition and divide it into dimensions.

Operationalization

Deciding how to measure the concept, via direct and indirect observables:

Directly observed: # of queries entered, the amount of time spent searching

Indirectly observed: user satisfaction

Variables

Independent: the causes

E.g., examining differences in how males and females use an experimental and a baseline IIR system: sex is an independent variable.

Dependent: the effects

E.g., satisfaction or performance on the systems.

Confounding variables: affect the independent or dependent variable but have not been controlled by the researcher.

E.g., perhaps males are more familiar with these systems than females.

Measurement

Range of variation: the preciseness of the measure.

E.g., the categories of usage frequency of a system

Exhaustiveness: a complete list of choices

Exclusiveness: e.g., how to differentiate “partially relevant” vs. “somewhat relevant” in your relevance rubric

Equivalence: find items that are of the same type and at the same level of specificity.

E.g., matching scale anchors: “I know details” = very familiar; “I know nothing” = very unfamiliar

Appropriateness: e.g., “How likely are you to recommend this system to others?” answered on a five-point scale from strongly agree to strongly disagree, which does not match the question

Level of Measurement

Two basic levels of measurement: discrete vs. continuous

Discrete measures: categorical responses

Nominal: no order

E.g., interface type, sex, task-type

Ordinal: ordered

Rank-order (from most relevant to least relevant) or Likert-type order (five-point scale with 1 = not relevant, 5 = relevant)

Ordinal scales are relative measures: one subject's 2 may not represent the same thing internally as another subject's 2.

We could not say that a document rated 4 was twice as relevant as a document rated 2, since the scale contains no true zero.

Level of Measurement

Two basic levels of measurement: discrete vs. continuous

Continuous measure: interval vs. ratio

Interval: differences between consecutive points are equal, but there is no true zero.

E.g., the Fahrenheit temperature scale or IQ test scores:

Zero does not mean no heat or no intelligence

The differences between 50 vs. 80 and 90 vs. 120 are the same

Ratio: the highest level of measurement, e.g., the number of occurrences.

There is a true zero

E.g., time, number of pages viewed (zero is meaningful)

Experimental design

The basic experimental design in IIR evaluation examines the relationship between two or more systems or interfaces (independent variable) on some set of outcome measures (dependent variables).

IIR design

General goal of IIR is to determine if a particular system helps subjects find relevant documents

Developing a valid baseline in IIR evaluation involves identifying and blending the status quo and the experimental system.

Random assignment can be used to increase the chance that subject characteristics are evenly distributed across groups.
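To make this concrete, here is a minimal sketch of balanced random assignment in Python (illustrative only; the subject IDs and condition names are assumptions, not from the paper):

```python
import random

def assign_conditions(subjects, conditions=("baseline", "experimental"), seed=42):
    """Shuffle subjects, then deal them round-robin into conditions,
    so group sizes stay balanced while each group's composition is random."""
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    return {s: conditions[i % len(conditions)] for i, s in enumerate(shuffled)}

# Example: sixteen hypothetical subjects, two conditions
groups = assign_conditions([f"S{i:02d}" for i in range(1, 17)])
```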

Factorial Designs

Good for studying the impact of more than one stimulus or variable.

Rotation and counterbalancing

The primary purpose of rotation and counterbalancing is to control for order effects and to increase the chance that results can be attributed to the experimental treatments and conditions.

Rotating variables:

Latin square design

Graeco-Latin square design

Rotation and counterbalancing

A basic design with no rotation. Numbers in cells represent different topics

Cons:

Order effects

Some topics are easier than others, some systems may do better with some topics than others.

Fatigue can impact the results

Latin Square rotation

Basic Latin Square rotation of topics

Basic Latin Square rotation of topics and randomization of columns

Problems:

Interaction among topics

The order effects of interfaces still exist
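As a concrete illustration (not from the paper), a cyclic Latin square can be generated in a few lines; each row gives one subject's topic order, and every topic appears exactly once in each row and each column:

```python
def latin_square(items):
    """Row i is the item list rotated left by i positions: a cyclic Latin square."""
    n = len(items)
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]

topics = [1, 2, 3, 4]
for subject, order in enumerate(latin_square(topics), start=1):
    print(f"Subject {subject}: topic order {order}")
```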

Graeco-Latin Square Design

Addresses the remaining problem of interface order effects noted above.

A Graeco-Latin square is a combination of two or more Latin squares.

Graeco-Latin Square Design
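A minimal sketch of one textbook construction (an assumption for illustration; this simple form works for an odd number of conditions): superimpose two orthogonal Latin squares so that every pair, e.g., (system, topic block), occurs exactly once in the grid:

```python
def graeco_latin_square(n):
    """Pair two orthogonal Latin squares: cell (i, j) holds
    ((i + j) mod n, (i + 2j) mod n); the pairing is valid when n is odd."""
    assert n % 2 == 1, "this simple construction requires odd n"
    return [[((i + j) % n, (i + 2 * j) % n) for j in range(n)] for i in range(n)]

# Each tuple could represent, e.g., (system index, topic-block index)
for row in graeco_latin_square(3):
    print(row)
```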

Study mode

Batch-mode

Multiple subjects complete the study at the same location and time

Single-mode

Subjects complete the study alone, with only the researcher present.

The choice of mode is determined by the purpose of the study.

Single-mode: if each subject has to be interviewed, or some interactive communication is needed between subject and researcher

Batch-mode: self-contained and efficient (but subjects can influence each other)

Protocols

A protocol is a step-by-step account of what will happen in a study.

A protocol helps maintain the integrity of the study and ensures that subjects experience the study in similar ways.

Tutorials

Provide some instruction on how to use a new IIR system

Printed materials

Verbal instructions

Video tutorial

Try to avoid bias in the tutorial

Such as focusing especially on one particular feature.

Pilot testing

To estimate time

To identify problems with instruments, instructions, and protocols

To get detailed feedback from test subjects

Sampling

Sampling

It is not possible to include all elements from a population in a study

The population in IIR evaluation is assumed to be all people who engage in online information search.

Sample size: the more, the better

Two approaches to sampling: probability sampling and non-probability sampling

Probability Sampling

Selecting a sample from a population that maintains the same variation and diversity that exists within the population.

Representative sample:

In a population: 60% are males and 40% are females, then your representative sample would also contain roughly the same ratio of males and females.

Increases the generalizability of the results

Assumes that all elements in the population have an equal chance of being selected.

Probability sampling

Simple random sampling

Randomly pick elements

Systematic sampling

Pick every kth element, where k = population size / sample size

Stratified sampling

Subdivide the population into more refined groups according to specific strata

Select a sample that is proportionate to the population in each stratum.
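A minimal sketch of the three schemes in Python (illustrative; the population list and stratum function are assumptions):

```python
import random
from collections import defaultdict

rng = random.Random(7)  # seeded for reproducibility

def simple_random(population, n):
    # Every element has an equal chance of selection
    return rng.sample(population, n)

def systematic(population, n):
    k = len(population) // n        # k = population size / sample size
    start = rng.randrange(k)        # random start within the first interval
    return population[start::k][:n]

def stratified(population, n, stratum_of):
    # Group by stratum, then sample proportionately from each group
    strata = defaultdict(list)
    for element in population:
        strata[stratum_of(element)].append(element)
    sample = []
    for members in strata.values():
        share = round(n * len(members) / len(population))
        sample.extend(rng.sample(members, min(share, len(members))))
    return sample
```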

Non-probability sampling

Used when all of the elements in a population are unknown or unavailable.

It limits the ability to generalize.

Researchers should be cautious when generalizing their data and be aware of the sampling limitations in their research.

Non-probability sampling

Three major types of non-probability sampling:

Convenience: relying on available elements the researcher can access, e.g., undergraduate students or people located close to the researcher.

Purposive or judgmental sampling: a researcher selects subjects or other elements that have particular characteristics, expertise or perspectives

Quota sampling: similar to stratified sampling, but the subjects for the strata are recruited on a first-come, first-served basis.

Subject Recruitment

Many ways to recruit subjects

Sending solicitations to mailing lists

Issuing invitations

Using referral services

Crowdsourcing (e.g., Mechanical Turk)

Web advertising

Mass mailings

Virtual posting in online locations

Pros and cons: using lab mates or one's own research group members as study subjects

Collections

Collections for testing

Identification of a set of documents for subjects to search, a set of tasks or topics which directs this searching, and the ground truth about the relevance of the searched objects to the topics.

A test collection: corpus, topics, and relevance judgments

TREC collections

TREC Interactive and HARD tracks

Newswire, blog, legal

Artificial topics

Relevance assessment generalization problem

Web corpora

The major drawback is that it is impossible to replicate the study since the Web is constantly changing.

The same queries issued at different times can return completely different results.

Natural corpora

Corpora assembled over time by study participants

Pros: meaningful to subjects, controllable

Cons: lack of replicability and equivalence across participants

Tasks and topics

Most information needs can be characterized in terms of tasks and topics

Information need = task = topic

Information needs

People do not know their information needs

People have difficulty articulating their information needs

Or expressing them in vocabulary appropriate for a system

Generating information needs

It is not clear at what level of specificity a task or topic should be defined

A task can be broken down into a series of sub-tasks, such as writing a research proposal.

Query logs can be mined to develop information needs.

Data collection techniques

Data collection techniques

Corpora, tasks, topics, and relevance assessments are major instruments to evaluate IIR systems

Other instruments, such as questionnaires and screen capture software, allow researchers to collect additional data.

Think-Aloud

Subjects articulate their thinking and decision-making during the evaluation process of IIR.

Tools: microphone, recording software

It is unnatural, as most people do not articulate their thoughts as they complete tasks.

Stimulated Recall

The researcher records the computer screen as the subject completes a searching task. Then the recording is played back and the subject is asked to articulate their thinking and decision-making.

Tool: screen recording software

Spontaneous and prompted self-report

Elicit feedback from subjects periodically while they search.

Goal: get more refined feedback about the search, rather than summative feedback at the end of the search

Observation

The researcher is seated near subjects and observes them as they conduct IIR activities

Tools: video camera, screen capture software

Time-consuming and labor-intensive

Prone to selective attention and researcher bias.

Logging

Analyzing transaction logs.

Client-side logging provides a more robust and comprehensive log of the user’s interactions.

But it is very hard to build a client-side logger.
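As a toy illustration of the idea (not a real logging tool; the event names and record format are assumptions), the core of a client-side logger is appending timestamped records of each interaction:

```python
import json, time

class SearchLogger:
    """Append one timestamped JSON record per user interaction to a log file."""
    def __init__(self, path, subject_id):
        self.path, self.subject_id = path, subject_id

    def log(self, event, **details):
        record = {"t": time.time(), "subject": self.subject_id,
                  "event": event, **details}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")

logger = SearchLogger("session.log", subject_id="S01")
logger.log("query", text="user-centered evaluation")
logger.log("view_document", doc_id="d123", rank=2)
logger.log("save_document", doc_id="d123")
```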

Questionnaire

Consist of:

Closed questions, where a specific response set is provided (e.g., a five-point scale) → quantitative analysis

Open questions → qualitative analysis

Closed questions:

Likert-type scale (e.g., five to seven points: strongly agree, agree, neutral, disagree, strongly disagree)

Open questions: content analysis

Different modes: electronic, pen-and-paper, interview

Interview

Few IIR evaluations consist solely of interviews, but interviews are a common component of many study protocols.

Subjects respond to open-ended questions better in interviews than in the other two modes (electronic or pen-and-paper).

Interviews can be structured, semi-structured, or open.

Measures

Four basic measures

Four basic classes of measures

Contextual (age, sex, search experience, personality type)

Interaction (# of queries issued, # of documents viewed, query length), can be extracted from log data

Performance (# of relevant documents saved, mean average precision, discounted cumulated gain), can be computed from log data

Usability: subject attitudes and feelings about the system and their interactions

Contextual

Individual differences: their impact on the study results

Information needs: domain expertise is measured using credentials

Persistence of information needs

Immediacy of information need

Information-seeking stage

Interaction

Measures:

# of queries, # of search results viewed, # of documents viewed, # of documents saved, query length

The implicit definition of interaction is tied to feedback.
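Continuing the illustrative log format from the logging sketch above, these interaction measures fall out of a simple pass over the records:

```python
import json

def interaction_measures(log_path):
    """Count queries, views, and saves, and compute mean query length."""
    with open(log_path) as f:
        events = [json.loads(line) for line in f]
    queries = [e for e in events if e["event"] == "query"]
    return {
        "num_queries": len(queries),
        "num_docs_viewed": sum(e["event"] == "view_document" for e in events),
        "num_docs_saved": sum(e["event"] == "save_document" for e in events),
        "mean_query_length": (sum(len(q["text"].split()) for q in queries)
                              / len(queries)) if queries else 0.0,
    }
```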

Performance

Directly applying TREC measures to IIR evaluation assumes that relevance is binary, static, uni-dimensional, and generalizable.

Are TREC-based performance metrics meaningful to end users?

A measure that evaluates systems based on the retrieval of 1000 documents is unlikely to be meaningful to users, since most users will not look through 1000 documents.

Traditional IR performance measures
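The slide's formulas are not in the transcript; the standard definitions are:

\[
\text{Precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|},
\qquad
\text{Recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}
\]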

Interactive recall and precision
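The slide content is likewise missing; in the IIR literature these are commonly adapted to what the subject actually saved (this formulation is an assumption, not recovered from the slide):

\[
\text{Interactive precision} = \frac{\#\,\text{relevant documents saved}}{\#\,\text{documents saved}},
\qquad
\text{Interactive recall} = \frac{\#\,\text{relevant documents saved}}{\#\,\text{relevant documents known}}
\]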

Measures that accommodate multi-level relevance and rank
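One such measure, mentioned earlier, is discounted cumulated gain; a common formulation over graded relevance \(\mathrm{rel}_i\) at rank \(i\) is:

\[
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i + 1)}
\]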

Time-based measures

A variety of time-based measures can be used, for example (a short computation sketch follows this list):

The length of time subjects spend in different states or modes

The amount of time it takes a subject to save the first relevant article

The number of relevant documents saved during a fixed period of time

The number of actions or steps taken to complete a task
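For instance, time to the first relevant save can be read directly off the illustrative log format used above (a sketch under those assumptions):

```python
import json

def time_to_first_relevant_save(log_path, relevant_ids):
    """Seconds from session start until the first relevant document is saved;
    returns None if no relevant document was saved."""
    with open(log_path) as f:
        events = [json.loads(line) for line in f]
    start = events[0]["t"]
    for e in events:
        if e["event"] == "save_document" and e["doc_id"] in relevant_ids:
            return e["t"] - start
    return None
```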

Cost and utility measures

Some search services are not free

Cost and utility measures have always been an important part of the evaluation of library and information services.

Evaluative feedback from subjects

Usability

Effectiveness, efficiency and satisfaction as key dimensions of usability

Effectiveness: precision, recall

Efficiency: the time it takes a subject to complete a task.

Satisfaction: measured for each experimental feature of the system; subjects' perceptions of outcomes and interactions

Available instruments for measuring usability

Questionnaire for User Interface Satisfaction (QUIS): http://lap.umd.edu/quis/

10-point scale for software, screen, terminology, system, etc.

The USE questionnaire

Usefulness, ease of use, ease of learning, satisfaction (7-point scale)

Software Usability Measurement Inventory (SUMI): http://sumi.ucc.ie/whatis.html

Agree / don't know / disagree responses for 50 items

Data analysis

Qualitative data analyses

The goal of most qualitative data analyses conducted in IIR is to reduce the qualitative responses to a set of categories or themes that can be used to characterize and summarize responses.

Content analysis: it starts with a well-defined and structured classification scheme, including categories and classification rules.

Open coding: the categories are usually developed inductively, as the researcher analyzes the data.

Quantitative data analysis

Validity and reliability

Validity

Internal validity: quality of what happens during the study

E.g., whether the selected instrument yields poor or inaccurate data

External validity: to what extent the results from a study can be generalized to the real world.

Lab studies are generally less externally valid, but more reliable, than naturalistic studies

Using instruments with established reliability