Presentation Transcript

Slide1

Crowdsourcing Quality Control Mechanisms

Vlad Hosu, Universität Konstanz, 21.06.2017

With thanks to Matthias Hirth from the University of Würzburg

Slide2

Agenda

Honesty in Online Labor Markets

Exemplary Quality Control Mechanisms
Filtering of workers
Screening of workers
Statistical methods
Workflow-based mechanisms

Costs of Quality Control Mechanisms

Slide3

Honesty in Online Labor Markets

Slide4

Honesty in Online Labor Markets

Suri et al., "Honesty in an Online Labor Market": user studies on MTurk with varying bonus payments.

Study 1
Fixed payment per task: 0.25 USD
Bonus payment: 0.25 USD times the roll of a die (ratio between minimum and maximum payment: 3.5)
Users asked to use their own die or random.org
175 participants

Study 2
Fixed payment per task: 0.25 USD
Fixed bonus payment of 0.75 USD upon completion, plus 0.05 USD times the roll of a die (ratio between minimum and maximum payment: 1.24)
Users asked to use their own die or random.org
267 participants (expected payout: 1.125 USD)

Slide5

Study Results

In both studies: Significant trends towards higher rolls

Workers tend to be dishonest even if only small benefits are expected (Lab participants showed the same behavior)

[Figure: distributions of reported die rolls in Study 1 and Study 2]

Slide6

Influencing Factors on Poor Quality

Workers are not qualified for the task
Workers do not work thoughtfully
Workers deliberately cheat
Workers do not understand the instructions

[Diagram: factors mapped to countermeasures — qualification, reliability, task design; train, educate, remove]

Slide7

Which Workers Do You Need?

Different tasks need different workers along two axes: trustworthy vs. non-trustworthy and qualified vs. non-qualified.

[Diagram: QoE measurement, surveys, software testing, data extraction & labeling, and content creation placed on the trustworthiness/qualification axes]

Slide8

Types of Malicious Behavior

Different types of malicious behavior were observed in an exemplary online survey (Gadiraju, Ujwal, et al., "Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys").

Ineligible Workers
Workers who do not conform with previously stated prerequisites
Example: Worker performs the task on a device other than the requested one

Fast Deceivers
Workers trying to maximize their income by supplying ill-fitting responses
Example: Worker copies parts of the instructions into free-text questions

Rule Breakers
Workers not sticking to the instructions
Example: Worker ignores the word limit for articles

Smart Deceivers
Similar to Fast Deceivers, but they stick to the instructions to avoid automatic detection
Example: Correctly formatted but irrelevant image tags

Slide9

Possible Attacks on Tasks

Individual attacks
Random Answers: Worker manually submits random/incorrect answers to minimize effort
Semi-Automated Answers: Worker uses a script to automate parts of the spamming process
Automated Answers: Worker uses a bot to automatically submit random/incorrect answers to minimize effort

Group attacks
Agree on Answers: Workers communicate and submit consistent answers to unique task items, e.g., images
Answer Sharing: Answers to gold standard questions are stored and shared among workers
Artificial Clones: Workers create clones that act as independent workers and duplicate the worker's answers

Slide10

Exemplary Quality Control Mechanisms

Slide11

Types of Quality Control Mechanisms

Filtering of workers
Applied before a task
Restricts access to the task to certain workers
Exemplary filter criteria: country of origin, reputation

Screening of workers
Applied during a task
Based on explicit and implicit tests integrated into the task
Examples: test questions/consistency questions, attention tests, gold standard data, behavior monitoring

Statistical methods
Applied after task completion
Based on the task results
Examples: Random Clicker, Quadrant of Euphoria, CrowdMOS, BT.500-13

Workflow-based mechanisms
Based on combinations of different task results
Examples: repetition and aggregation, two-stage design, Find-Fix-Verify

Slide12

Filtering of workers

Slide13

Filtering of Workers

Applied before the beginning of a task to restrict access for a set of workers
Filter criteria are often based on:
Results from previous campaigns / qualification tasks
Reputation of the worker / work history on the platform
Properties of the worker, e.g., country of origin
Possible application: filtering out known spammers
Issues:
Reputation data on crowdsourcing platforms is often 'broken'
Potential discrimination against new workers
Ethical issues when filtering users based on country of origin

Slide14

Screening of workers

Slide15

Screening of Workers

Applied during the task to estimate the quality of a worker
Screening mechanisms based on explicit tests:
Content questions
Consistency questions
Trap questions
Attention tests
Gold standard data
Screening mechanisms based on implicit feedback:
Behavior monitoring
Possible applications:
Assessment of worker qualification, attention, and diligence
Assessment of trustworthiness and honesty
Issues:
Generation of reference data is often required → limited set of test items
Some methods need to be adapted to the specific task

Slide16

Verifiable Content Questions

Simple questions about the item/content the worker needs to process or pay attention to during a task
Content questions are easier to verify than the actual task result
Example: Video quality assessment
The worker is supposed to watch a video and give a subjective rating of its quality
Possible content question: 'Which sport was shown in the clip? A) Tennis. B) Soccer. C) Skiing.'
Example: Keywords for an article
The worker needs to read the article and provide meaningful keywords
Possible content question: 'How many paragraphs does the article have?'

Slide17

Consistency Questions

The same question is asked multiple times in slightly different ways
Example:
Pre-task survey: 'Please select your country of origin'
Post-task survey: 'Which continent do you live on?'
Potential issues:
Subjects might not be willing to provide correct personal data
Some questions might be ambiguous or too difficult → unnecessary rejection of valid data

Slide18

Attention Checks

Actions to objectively assess the worker's focus on the task
Example: Video quality assessment
The worker is supposed to watch a video and give a subjective rating of its quality
Attention check: an additional button is added to the test; the worker needs to click it on random requests
Example: Survey
The worker is requested to complete a survey
Attention check: asking the worker to select an (unlikely or impossible) option in the next answer
Potential issues:
The worker might be distracted (e.g., in the video test)
The checks might be too obvious (cf. the survey)

Slide19

Trap Questions

Question and answer schemes that encourage sloppy workers to select incorrect answers
Example: Information extraction task
The worker is supposed to extract information from an email, given the email header and a set of possible answers
Potential issue: the correct data needs to be available

Slide20

Trustworthiness Tests

Tests the basic willingness to give honest answers (important for QoE and other subjective tests)
Only measurable via psychological tests
These are usually too complex and too long for a crowdsourcing test
But: adding some questions from psychological tests provides an estimate
General idea: use questions that seem non-verifiable at first sight
Examples: the number of visible cats, the number of recognized differences between two images

Slide21

Gold Standard Data

Questions for which the correct results are already known
Gold standard questions are interspersed among the other questions
The most common quality control approach
Example: Video quality assessment
The worker is supposed to watch a video and give a subjective rating of its quality
Gold standard question: 'Did you notice any stalling in the video?'
Example: Image categorization
The worker needs to assign an image to one of two categories
Gold standard data: a set of (unambiguously) categorized images included among the regular tasks
Potential issues:
Additional costs for creating gold standard data
Limited set of gold standard data (there are approaches to create/extend gold standard data automatically)
Gold standard data needs to be robust against worker bias
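The mechanics are straightforward to operationalize. Below is a minimal sketch, assuming tasks and gold items share one answer format; the function names, the 20% gold ratio, and the accuracy threshold are illustrative assumptions, not values from the slides.

```python
# Minimal sketch of gold-standard screening; names and ratio are assumptions.
import random

def build_task_batch(tasks, gold_items, gold_ratio=0.2):
    """Intersperse gold questions (with known answers) among regular tasks."""
    n_gold = max(1, int(len(tasks) * gold_ratio))
    batch = tasks + random.sample(gold_items, n_gold)
    random.shuffle(batch)  # workers cannot tell gold items from regular ones
    return batch

def gold_accuracy(worker_answers, gold_answers):
    """Share of gold questions a worker answered correctly."""
    hits = sum(worker_answers.get(q) == a for q, a in gold_answers.items())
    return hits / len(gold_answers)

# Workers below a chosen accuracy threshold (e.g., 0.8) would be rejected.
```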

Slide22

Behavior Monitoring to Identify Unreliable Users

Idea: use objective application-level measures to estimate the user's interactions with the application
Goal: estimate the quality of the task results based on technical parameters
Challenges:
A model of "correct" behavior is required (different for every task)
The implementation can become very complex

Expected behavior → expected measurement values
Diverging measurements → reconstruction of behavior

Slide23

Application Layer Measurements (ALM)

Client side
Analysis of interactions with the web browser:
Mouse movement
Scrolling
Window focus

Server side
Analysis using server-side logging:
Requested pages
Time spent on a specific page

Prerequisite: a web-based task interface
Privacy: non-intrusive, only the task environment is observable

Slide24

Case Study: Language Test

Hirth, Matthias, et al., "Predicting Result Quality in Crowdsourcing Using Application Layer Monitoring."

Requirements for the test case:
The task needs to be "simple" (no special skills required)
The task evaluation must be objective → English language test with multiple-choice answers

Test design:
Five texts from the Simple English version of Wikipedia
Five multiple-choice questions per text, each with four possible answers
Implicit feedback about the user via question design, answer design, and a free-text question about the most interesting text

User study:
215 participants in Feb. 2013 (0.10 USD per task)
Participants from 22 different nations (2% native speakers)

Slide25

Implementation of the Language Test

[Screenshot of the test interface, annotated with (1) a trustworthiness trap, (2) the incentive, and (3) an online translation service]

Slide26

Identifying a Suitable Qualification Threshold

Qualification threshold for a binary classification (qualified vs. unqualified)
Test settings:
Number of questions: 25
Number of answers per question: 4
Desired sample size: between … and …

Considerations:
Probability of randomly selecting the correct answer to a question: $1/4$
Probability of randomly answering exactly $k$ questions correctly (binomial distribution): $P(X = k) = \binom{25}{k}\left(\tfrac{1}{4}\right)^{k}\left(\tfrac{3}{4}\right)^{25-k}$

Qualification threshold:
Goal: a low probability of a false positive qualification
The qualification threshold is set to 50% of the questions (cf. the evaluation on the next slide)
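As a sketch of this consideration in scipy (the 50% threshold is taken from the evaluation on the next slide; variable names are mine):

```python
# Probability that a random clicker (success probability 1/4 per question)
# answers at least k of the 25 questions correctly -- the false-positive
# probability of the qualification threshold.
from scipy.stats import binom

N_QUESTIONS, P_RANDOM = 25, 1 / 4

def p_false_positive(k):
    """P(X >= k) for X ~ Binomial(25, 1/4)."""
    return binom.sf(k - 1, N_QUESTIONS, P_RANDOM)

# A 50% threshold (13 of 25 correct) keeps the false-positive rate low:
print(f"{p_false_positive(13):.4f}")  # ~0.0034
```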

Slide27

Evaluation of Language Test Results

Scores normalized to 100% → qualification threshold: 50%; at minimum two answers correct
18% of the candidates achieved the maximum score
71% "qualified"
How can we predict a worker's test result?

Slide28

ALM User Modeling

Expected behavior:
Reading the instructions
Starting with the first questions / checking the length of the test
Reading the text
Reading and answering the questions / rechecking the text
Proceeding to the next text

Expected interactions: waiting, scrolling, interacting

Slide29

ALM Implementation

Implemented measures:
Time of page access and leaving
Vertical scroll position: measured synchronously every 10 s; allows reconstructing the user's field of sight
Interactions with answering elements: measured asynchronously based on interaction events; allows reconstructing answering times

Evaluated metrics:
Completion time: the time between entering the test and submitting the final result; a simple metric that only considers server-side measures
Consideration time: the time between the first time the user saw a question and the last interaction with the related answer; a complex metric based on multiple client-side measures
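A minimal sketch of how the two metrics could be computed from an event log; the (timestamp, event, question) layout and the event names are illustrative assumptions, not the original implementation.

```python
# Sketch of the two ALM metrics over an assumed event log of
# (timestamp_in_seconds, event_name, question_id) tuples.
def completion_time(events):
    """Server-side metric: from entering the test to submitting the result."""
    start = min(t for t, e, _ in events if e == "enter_test")
    end = max(t for t, e, _ in events if e == "submit_result")
    return end - start

def consideration_time(events, question_id):
    """Client-side metric: from first sight of a question to the last
    interaction with its answer elements."""
    first_seen = min(t for t, e, q in events
                     if e == "question_visible" and q == question_id)
    last_answer = max(t for t, e, q in events
                      if e == "answer_interaction" and q == question_id)
    return last_answer - first_seen
```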

Slide30

Completion Time

Plausibility threshold: the minimum time needed to read the test; automatically pre-calculated
Temporal qualification threshold: the minimum completion time of qualified workers; derived from the test results
→ No clear differentiation of workers is possible in most cases!
The plausibility threshold filters only a small number of workers, and the temporal qualification threshold requires already evaluated test results

Slide31

Consideration Time and Question Type

Different question types:
Comprehension: the required information is stated explicitly in the text
Reorganization: a combination of several pieces of explicit information from the text
Inference: the required information is only implicitly stated in the text

Significant differences between qualified and unqualified workers:
Qualified workers spend more time in general
The time spent by qualified workers increases with the difficulty of the questions

Slide32

Consideration Time and Answer Type

Different answer types:
Plausible, in text: the correct information, stated in the text
Plausible, not in text: information matching the question but not in the text
Not plausible, in text: information from the text but not matching the question

Observations:
Finding the correct answer requires the largest amount of time (plausible, in text)
Users guessing the answer without reading require the least amount of time (plausible, not in text)
Users guessing the answer without reading the question properly lie in between (not plausible, in text)

Slide33

Conclusion Application Layer Monitoring

Low-quality results in crowdsourcing:
There are different reasons for low-quality results, and these reasons are hard to detect → an understanding of worker-task interactions is necessary

Application layer monitoring:
Non-intrusive technical worker monitoring on the client and server side
Task-specific and complex to implement
Allows designing objective worker-task interaction measures

Results of the proof-of-concept implementation:
Traditional interaction measures (completion time) are outperformed by more sophisticated ALM measures (consideration time)
A Support Vector Machine using ALM measures is capable of predicting worker qualification with an accuracy of ~89%

Slide34

Statistical Methods

Slide35

Random Clicker

Assumption: if a worker selects random answers, the distribution of the answers follows a uniform distribution
Test idea: use Pearson's $\chi^2$ test to test the null hypothesis 'the user is a random clicker'
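A minimal sketch of such a test, assuming four answer options per question; the significance level and the decision rule are illustrative assumptions.

```python
# Sketch of a random-clicker check via Pearson's chi-square test.
from collections import Counter
from scipy.stats import chisquare

def looks_like_random_clicker(answers, n_options=4, alpha=0.05):
    """H0: 'the worker clicks uniformly at random'."""
    counts = Counter(answers)
    observed = [counts.get(o, 0) for o in range(n_options)]
    stat, p_value = chisquare(observed)  # uniform expected counts by default
    # If H0 cannot be rejected, the answer pattern is consistent with
    # random clicking (which is evidence, not proof, of spamming).
    return p_value >= alpha

print(looks_like_random_clicker([0, 2, 1, 3, 0, 2, 3, 1, 2, 0, 1, 3]))  # True
```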

Slide36

CrowdMOS

Assumptions:
In subjective studies, worker ratings should converge to a global trend
Individual workers should agree with this trend
Test idea: use the sample correlation to compare a user's ratings with the global average ratings
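A sketch of this idea, assuming ratings arranged as a workers × stimuli matrix; the correlation threshold is an illustrative assumption.

```python
# Sketch of CrowdMOS-style screening: correlate each worker's ratings with
# the per-stimulus average and reject workers far from the global trend.
import numpy as np

def screen_by_correlation(ratings, min_corr=0.25):
    """ratings: 2-D array, rows = workers, columns = stimuli."""
    ratings = np.asarray(ratings, dtype=float)
    global_mean = ratings.mean(axis=0)  # the global trend per stimulus
    return [float(np.corrcoef(worker, global_mean)[0, 1]) >= min_corr
            for worker in ratings]
```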

Slide37

Transitivity Satisfaction Rate

Assumption: subjective ratings are transitive (A is preferred to B and B is preferred to C → A is preferred to C)
Test idea:
Use pair comparisons of all data points
Compute the compliance with the transitivity assumption
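A sketch of computing such a rate over one worker's pairwise preferences; the (winner, loser) data layout is an assumption.

```python
# Sketch of a transitivity satisfaction rate over paired comparisons.
from itertools import permutations

def transitivity_satisfaction_rate(preferences):
    prefs = set(preferences)  # (a, b) means "a is preferred to b"
    items = {x for pair in prefs for x in pair}
    satisfied = total = 0
    for a, b, c in permutations(items, 3):
        if (a, b) in prefs and (b, c) in prefs:  # testable triad: a>b, b>c
            total += 1
            satisfied += (a, c) in prefs         # transitivity demands a>c
    return satisfied / total if total else 1.0

print(transitivity_satisfaction_rate([("A", "B"), ("B", "C"), ("A", "C")]))  # 1.0
```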

Slide38

ITU-R BT.500

Assumption: subjective ratings for the same test condition should vary among workers only by a fixed maximum amount
Test idea:
Compute the global mean $\bar{u}$ and standard deviation $\sigma$ of all ratings
Determine the distribution of the ratings (normal or non-normal)
Test whether individual ratings lie within an interval around the global mean ($\bar{u} \pm 2\sigma$ if the ratings are normally distributed, else $\bar{u} \pm \sqrt{20}\,\sigma$)
Filter workers based on the number of ratings outside the given range
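A sketch of this screening, assuming a workers × conditions rating matrix and the BT.500 kurtosis convention (treat the ratings as normal if the kurtosis lies between 2 and 4).

```python
# Sketch of BT.500-style outlier counting per test condition.
import numpy as np
from scipy.stats import kurtosis

def bt500_outlier_counts(ratings):
    """ratings: 2-D array, rows = workers, columns = test conditions."""
    ratings = np.asarray(ratings, dtype=float)
    outliers = np.zeros(ratings.shape[0], dtype=int)
    for col in ratings.T:
        mean, std = col.mean(), col.std(ddof=1)
        beta2 = kurtosis(col, fisher=False)      # Pearson kurtosis
        factor = 2.0 if 2.0 <= beta2 <= 4.0 else np.sqrt(20.0)
        outliers += (np.abs(col - mean) > factor * std).astype(int)
    return outliers  # workers with many outliers would be filtered
```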

Slide39

Reliability of Statistical Methods

Dataset: a subjective study on the impact of stalling on YouTube QoE
Gold standard data: unreliable workers identified by content questions and behavior monitoring
Issues in this study:
Different statistical methods identify different sets of reliable and unreliable workers
Some methods show a large number of false negatives (no money for those users)
General issue: the filtering is based on already evaluated data

Slide40

Workflow-Based Approaches

Slide41

Workflow-Based Approaches

Based on combining task results and varying tasks from different workers
Examples:
Find-Fix-Verify
Majority decision
Control group decision
Two-stage design
Possible applications:
Assessment of worker quality
Increasing the quality of the final outcome
Issue: needs to be tailored to the specific task

Slide42

Majority Decision

Parameters:
Number of workers $n$ (odd)
Probability of an incorrect task submission $p$

Probability of a correct majority decision:
$P_{\text{maj}} = \sum_{i=\lceil n/2 \rceil}^{n} \binom{n}{i} (1-p)^{i}\, p^{n-i}$
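The formula transcribes directly into a few lines of Python (names are mine):

```python
# Probability of a correct majority decision among n workers (n odd),
# each submitting an incorrect result independently with probability p.
from math import ceil, comb

def p_majority_correct(n, p):
    return sum(comb(n, i) * (1 - p) ** i * p ** (n - i)
               for i in range(ceil(n / 2), n + 1))

print(p_majority_correct(3, 0.2))  # 0.896
```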

Slide43

Control Group

Parameters:
One worker performing the main task
Number of workers $n$ used for the majority decision of the control group
Probability of an incorrect task submission $p$ (same probability for the main task and the control tasks)
Probability of a correct majority decision $P_{\text{maj}}$, as above

Slide44

Probability of a Correct Control Group Decision

Possible outcomes and their probabilities:
Main worker correct, control group decides correctly → correct task approved, probability $(1-p)\,P_{\text{maj}}$
Main worker incorrect, control group decides correctly → incorrect task disapproved, probability $p\,P_{\text{maj}}$
Main worker correct, control group decides incorrectly → correct task disapproved, probability $(1-p)(1-P_{\text{maj}})$
Main worker incorrect, control group decides incorrectly → incorrect task approved, probability $p\,(1-P_{\text{maj}})$

The probability of a correct control group decision equals the probability of a correct result under a majority decision.
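A sketch of these outcome probabilities, reusing the p_majority_correct helper from the majority-decision sketch:

```python
# Outcome probabilities of the control group approach.
from math import ceil, comb

def p_majority_correct(n, p):
    return sum(comb(n, i) * (1 - p) ** i * p ** (n - i)
               for i in range(ceil(n / 2), n + 1))

def control_group_outcomes(n_control, p):
    p_maj = p_majority_correct(n_control, p)
    return {
        "correct_approved":      (1 - p) * p_maj,
        "incorrect_disapproved": p * p_maj,
        "correct_disapproved":   (1 - p) * (1 - p_maj),
        "incorrect_approved":    p * (1 - p_maj),
    }

print(control_group_outcomes(3, 0.2))
```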

Slide45

Two Stage Design

Combined workflow using different quality control mechanisms

Phase 1:
Pre-filtering of users according to the test requirements
A simple and cheap task including training phases, qualification and trustworthiness tests (all workers are paid except obvious spammers)
→ A pseudo-reliable crowd is identified

Phase 2:
The actual task, longer and more expensive
Only accessible to trusted workers from phase 1
Limited application of quality control mechanisms to save time and money

Slide46

Costs of Quality Control Mechanisms


Slide47

Costs of Quality Control

The total cost of quality-related activities can be divided into conformance costs and non-conformance costs.

Conformance costs: costs spent on activities to avoid poor quality
Prevention costs
Costs for activities that prevent the end result from failing the quality requirements
Examples: robust workflow design, easy-to-use task interface
Appraisal costs
Costs for finding errors
Examples: repetition of tasks, additional work for gold standard tasks

Non-conformance costs: costs resulting from poor quality
Internal failure costs
Costs due to detected quality issues
Example: reassignment of failed tasks
External failure costs
Costs due to errors in the final product
Examples: an invalid translation used in an article, a low-quality article published

A trade-off between conformance and non-conformance costs is necessary.

Slide48

Cost Model for Majority Decisions

Cost factors:
Costs for correct and incorrect task submissions
Costs for false negative and false positive approvals (accepting an invalid task / rejecting a correct task)
Number of correct tasks and number of incorrect tasks

Expected costs of a majority decision: the submission costs for all correct and incorrect submissions plus the expected costs of wrong approval decisions
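A hedged sketch of one such cost model, under the assumptions that every submission is paid and that a wrong majority decision incurs a fixed error cost (both are my simplifications, not the slide's exact model):

```python
# Simplified expected-cost sketch for a majority decision over n workers.
from math import ceil, comb

def p_majority_correct(n, p):
    return sum(comb(n, i) * (1 - p) ** i * p ** (n - i)
               for i in range(ceil(n / 2), n + 1))

def expected_cost_majority(n, p, cost_task, cost_error, n_tasks=1):
    """All n submissions are paid per task; a wrong majority decision
    (probability 1 - p_majority_correct) adds cost_error."""
    p_err = 1 - p_majority_correct(n, p)
    return n_tasks * (n * cost_task + p_err * cost_error)

print(expected_cost_majority(n=3, p=0.2, cost_task=0.1, cost_error=1.0))
```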

Slide49

Impact of Cost Factors on the Majority Decision

Fixed cost values and a fixed number of workers
Total costs increase with decreasing worker quality
The influence of the individual cost factors varies depending on the worker quality

Slide50

Cost Model for the Control Group Approach

Cost parameters:
Costs for correct and incorrect main tasks
Costs for false negative and false positive approvals of the main task (accepting an invalid task / rejecting a correct task)
Costs for the control group's majority decision
Costs for the control group outcomes

The control group approach is repeated until the first result is accepted → on average $1/P_{\text{approve}}$ repetitions, with $P_{\text{approve}} = (1-p)\,P_{\text{maj}} + p\,(1-P_{\text{maj}})$

Slide51

Cost Model for the Control Group Approach (cont.)

Costs for the approval and disapproval of the main task: the costs for correctly accepted main tasks, the costs for incorrectly accepted main tasks, and the number of rejected main tasks times the costs per rejected main task
Costs for the control group approach: the approval/disapproval costs of the main task plus the costs of the control group decisions
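A hedged sketch of these costs under simplifying assumptions (each round pays one main worker plus n control workers, rounds repeat until a result is approved, and an incorrectly approved result adds a fixed error cost); this is my simplification, not the slide's exact model:

```python
# Simplified expected-cost sketch for the control group approach.
from math import ceil, comb

def p_majority_correct(n, p):
    return sum(comb(n, i) * (1 - p) ** i * p ** (n - i)
               for i in range(ceil(n / 2), n + 1))

def expected_cost_control_group(n_control, p, cost_main, cost_control, cost_error):
    p_maj = p_majority_correct(n_control, p)
    p_approve = (1 - p) * p_maj + p * (1 - p_maj)    # outcomes ending the loop
    expected_rounds = 1 / p_approve                  # geometric repetition
    cost_per_round = cost_main + n_control * cost_control
    p_bad_final = p * (1 - p_maj) / p_approve        # incorrect result approved
    return expected_rounds * cost_per_round + p_bad_final * cost_error

print(expected_cost_control_group(3, 0.2, cost_main=0.2, cost_control=0.05,
                                  cost_error=1.0))
```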

Slide52

Impact of Cost Factors on the Control Group Approach

Fixed cost values for the control group and the main task, and a fixed group size of 3 workers
The total costs reach their maximum at an intermediate worker error probability
The influence of the individual cost factors varies depending on the worker quality

Slide53

Cost-Optimal Mechanism for Routine Tasks

Assumptions:
Low costs per task; wrong tasks are not paid at all
Small or no difference between the costs for the main and control tasks
Some invalid submissions are tolerable

→ The majority decision is always better than the control group approach

Slide54

Cost-Optimal Mechanism for Complex Tasks

Assumptions:
High costs per task
Control tasks are (often) cheaper than the main task
False positive and false negative approvals result in high additional costs

→ The control group approach is always better than the majority decision if the control tasks are cheaper than the main task