Slide 1: Crowdsourcing Quality Control Mechanisms
Vlad Hosu, Universität Konstanz, 21.06.2017
With thanks to Matthias Hirth from the University of Würzburg
Slide 2: Agenda
- Honesty in Online Labor Markets
- Exemplary Quality Control Mechanisms: filtering of workers, screening of workers, statistical methods, workflow-based mechanisms
- Costs of Quality Control Mechanisms
21.06.2017
Crowdsourcing - Quality Control Mechanisms
2
Slide 3: Honesty in Online Labor Markets
Slide 4: Honesty in Online Labor Markets
Suri et al., "Honesty in an Online Labor Market": user studies on MTurk with varying bonus payments
Study 1
- Fixed payment per task: 0.25 USD
- Bonus payment: 0.25 USD times the roll of a die (ratio between minimum and maximum payment: 3.5)
- Users asked to use their own die or random.org
- 175 participants
Study 2
- Fixed payment per task: 0.25 USD
- Fixed bonus payment of 0.75 USD upon completion, plus 0.05 USD times the roll of a die (ratio between minimum and maximum payment: 1.24)
- Users asked to use their own die or random.org
- 267 participants (expected payout: 1.125 USD)
Slide 5: Study Results
In both studies: significant trends towards higher reported rolls
Workers tend to be dishonest even if only small benefits are expected (lab participants showed the same behavior)
(Figure: distributions of reported die rolls for Study 1 and Study 2)
Slide 6: Influencing Factors on Poor Quality
Causes of poor quality:
- Workers are not qualified for the task (qualification)
- Workers do not work thoughtfully (reliability)
- Workers deliberately cheat
- Workers do not understand the instructions (task design)
Countermeasures: train, educate, or remove workers; improve the task design
Slide 7: Which Workers Do You Need?
Tasks differ in the worker properties they require, along two axes: trustworthy vs. non-trustworthy and qualified vs. non-qualified
Example task types placed on these axes: QoE measurement, surveys, software testing, data extraction & labeling, content creation
Slide 8: Types of Malicious Behavior
Different types of malicious behavior observed in an exemplary online survey (Gadiraju, Ujwal, et al., "Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys")
Ineligible workers
- Workers who do not conform with previously stated prerequisites
- Example: worker performs the task on a device other than the requested one
Fast deceivers
- Workers trying to maximize income by supplying ill-fitting responses
- Example: worker copies parts of the instructions into free-text questions
Rule breakers
- Workers not sticking to the instructions
- Example: worker ignores the word limit for articles
Smart deceivers
- Similar to fast deceivers, but stick to the instructions to avoid automatic detection
- Example: correctly formatted but irrelevant image tags
Slide 9: Possible Attacks on Tasks
Individual attacks
- Random answers: worker manually submits random/incorrect answers to minimize effort
- Semi-automated answers: worker uses a script to automate parts of the spamming process
- Automated answers: worker uses a bot to automatically submit random/incorrect answers
Group attacks
- Agree on answers: workers communicate and submit consistent answers to unique task items, e.g., images
- Answer sharing: answers to gold standard questions are stored and shared among workers
- Artificial clones: workers create clones that act as independent workers and duplicate the worker's answers
Slide 10: Exemplary Quality Control Mechanisms
Slide 11: Types of Quality Control Mechanisms
Filtering of workers
- Applied before a task; restricts access to the task to certain workers
- Exemplary filter criteria: country of origin, reputation
Screening of workers
- Applied during a task
- Based on explicit and implicit tests integrated into the task
- Examples: test/consistency questions, attention tests, gold standard data, behavior monitoring
Statistical methods
- Applied after task completion; based on the task results
- Examples: random clicker test, Quadrant of Euphoria, CrowdMOS, ITU-R BT.500-13
Workflow-based mechanisms
- Based on combining different task results
- Examples: repetition and aggregation, two-stage design, Find-Fix-Verify
Slide 12: Filtering of Workers
Slide 13: Filtering of Workers
Applied before the beginning of a task to restrict access for a set of workers
Filter criteria often based on:
- Results from previous campaigns / qualification tasks
- Reputation of workers / work history on the platform
- Properties of the worker, e.g., country of origin
Possible application: filtering of known spammers
Issues
- Reputation data on crowdsourcing platforms is often 'broken'
- Potential discrimination against new workers
- Ethical issues when filtering users based on country of origin
Slide 14: Screening of Workers
Slide 15: Screening of Workers
Applied during the task to estimate the quality of a worker
Screening mechanisms based on explicit tests: content questions, consistency questions, trap questions, attention tests, gold standard data
Screening mechanisms based on implicit feedback: behavior monitoring
Possible applications
- Assessment of worker qualification, attention, and diligence
- Assessment of trustworthiness and honesty
Issues
- Often requires the generation of reference data
- Limited set of test items
- Some methods need to be adapted to the specific task
Slide 16: Verifiable Content Questions
Simple questions about the item/content the worker needs to process or pay attention to during a task
Content questions are easier to verify than the actual task result
Example: video quality assessment
- Worker is supposed to watch a video and give a subjective rating of its quality
- Possible content question: 'Which sport was shown in the clip? A) Tennis. B) Soccer. C) Skiing.'
Example: keywords for an article
- Worker needs to read the article and provide meaningful keywords
- Possible content question: 'How many paragraphs does the article have?'
Slide 17: Consistency Questions
The same question is asked multiple times in a slightly different manner
Example
- Pre-task survey: 'Please select your country of origin'
- Post-task survey: 'Which continent do you live on?'
Potential issues
- Subjects might not be willing to provide correct personal data
- Some questions might be ambiguous or too difficult, leading to unnecessary rejection of valid data
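A consistency check like the country/continent example above is easy to automate. A minimal sketch, assuming answers arrive as plain strings; the small lookup table is illustrative, not part of the slides:

```python
# Minimal consistency check: does the stated country match the stated continent?
# The lookup table here is illustrative; a real study needs a complete mapping.
COUNTRY_TO_CONTINENT = {
    "Germany": "Europe",
    "India": "Asia",
    "USA": "North America",
    "Brazil": "South America",
}

def is_consistent(pre_task_country: str, post_task_continent: str) -> bool:
    """Return True if the two survey answers agree; unknown countries are
    treated as consistent to avoid rejecting valid data unnecessarily."""
    expected = COUNTRY_TO_CONTINENT.get(pre_task_country)
    return expected is None or expected == post_task_continent

consistent = is_consistent("Germany", "Europe")   # matching answers
inconsistent = is_consistent("India", "Europe")   # contradictory answers
```

Treating unknown countries as consistent reflects the slide's warning: an incomplete mapping must not cause unnecessary rejection of valid data.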
Slide 18: Attention Checks
Actions to objectively assess the worker's focus on the task
Example: video quality assessment
- Worker is supposed to watch a video and give a subjective rating of its quality
- Attention check: an additional button is added to the test; the worker needs to click it on random requests
Example: survey
- Worker is requested to complete a survey
- Attention check: asking the worker to select an (unlikely/impossible) option in the next answer
Potential issues
- Workers might be distracted (e.g., video test)
- Checks might be too obvious (cf. survey)
Slide 19: Trap Questions
Question-and-answer schemes encouraging sloppy users to select incorrect answers
Example: information extraction task
- Worker is supposed to extract information from an email, given the email header and a set of possible answers
Potential issue: correct data needs to be available
Slide 20: Trustworthiness Tests
Testing the basic willingness to give honest answers (important for QoE and other subjective tests)
Only measurable via psychological tests
- Usually too complex and too long for a crowdsourcing test
- But: adding some questions from psychological tests gives an estimate
General idea: use questions that seem non-verifiable at first sight
Examples
- Number of visible cats in an image
- Recognized number of differences between two images
Slide 21: Gold Standard Data
Questions for which the correct results are already known
Gold standard questions are interspersed among other questions
Most common quality control approach
Example: video quality assessment
- Worker is supposed to watch a video and give a subjective rating of its quality
- Gold standard question: 'Did you notice any stallings in the video?'
Example: image categorization
- Worker needs to assign an image to one of two categories
- Gold standard data: a set of (unambiguously) categorized images included in the regular tasks
Potential issues
- Additional costs for creating gold standard data
- Limited set of gold standard data (approaches exist to create/extend gold standard data automatically)
- Gold standard data needs to be robust against worker bias
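Interspersed gold standard items lead to a simple accuracy-based filter per worker. A sketch, assuming each worker's answers to the gold items are available; the 0.7 threshold is an arbitrary illustrative choice:

```python
def gold_accuracy(answers: dict, gold: dict) -> float:
    """Fraction of gold standard items the worker answered correctly."""
    correct = sum(1 for item, truth in gold.items() if answers.get(item) == truth)
    return correct / len(gold)

def passes_gold_check(answers: dict, gold: dict, threshold: float = 0.7) -> bool:
    """Accept the worker only if gold-item accuracy reaches the threshold."""
    return gold_accuracy(answers, gold) >= threshold

# Illustrative data: image categorization with known gold labels.
gold = {"img1": "cat", "img2": "dog", "img3": "cat", "img4": "dog"}
diligent = {"img1": "cat", "img2": "dog", "img3": "cat", "img4": "cat"}  # 3/4 correct
spammer = {"img1": "cat", "img2": "cat", "img3": "cat", "img4": "cat"}   # 2/4 correct
```

The limited-set issue from the slide shows up directly here: with only four gold items, the accuracy estimate is coarse and a lucky spammer can slip through.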
Slide 22: Behavior Monitoring to Identify Unreliable Users
Idea: use objective measures on the application level to estimate the user's interactions with the application
Goal: estimate the quality of the task results based on technical parameters
Challenges
- A model of 'correct' behavior is required (different for every task)
- Implementation can become very complex
Expected behavior → expected measurement values; diverging measurements → reconstruction of behavior
(Image from sxc.hu by zenpixel)
Slide 23: Application Layer Measurements (ALM)
Client side: analysis of interactions with the web browser
- Mouse movement, scrolling, window focus, …
Server side: analysis using server-side logging
- Requested pages, time spent on a specific page, …
Prerequisite: web-based task interface
Privacy: non-intrusive, only the task environment is observable
Slide 24: Case Study: Language Test
Hirth, Matthias, et al., "Predicting Result Quality in Crowdsourcing Using Application Layer Monitoring"
Requirements for the test case
- Task needs to be 'simple' (no special skills required)
- Task evaluation must be objective → English language test with multiple-choice answers
Test design
- Five texts from the Simple English version of Wikipedia
- Five multiple-choice questions per text, each with four possible answers
- Implicit feedback about the user via question design, answer design, and a free-text question about the most interesting text
User study
- 215 participants in Feb. 2013 (0.10 USD per task)
- Participants from 22 different nations (2% native speakers)
Slide 25: Implementation of the Language Test
(Screenshot with three annotated elements: trustworthiness trap, incentive, online translation service)
Slide 26: Identifying a Suitable Qualification Threshold
Qualification threshold for binary classification (qualified and unqualified)
Test settings
- Number of questions: 25
- Number of answers per question: 4
- Desired sample size: between … and …
Considerations
- Probability of randomly selecting the correct answer to a question: 1/4
- Probability of randomly answering exactly k questions correctly (binomial distribution): P(X = k) = C(25, k) · (1/4)^k · (3/4)^(25−k)
Qualification threshold
- Goal: a low probability of a false positive qualification
- Qualification threshold set to 50% of the questions answered correctly
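The false-positive probability for a pure random clicker follows directly from the binomial distribution. A sketch with the test parameters from this slide (25 questions, 4 answers each) and a threshold of 13 of 25 correct answers, i.e. just above 50%:

```python
from math import comb

def p_at_least(k_min: int, n: int, p: float) -> float:
    """Tail probability P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# Random clicker: 25 questions, 4 answers each -> success probability 1/4.
# A worker passing the threshold by pure guessing is a false positive.
false_positive = p_at_least(13, 25, 0.25)
```

With a mean of 6.25 correct guesses, 13 correct answers is roughly three standard deviations above chance, so the false-positive probability is well below one percent.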
Slide 27: Evaluation of Language Test Results
Scores normalized to 100%
Qualification threshold: 50% (at minimum, two answers correct)
18% of the candidates achieved the maximum score
71% 'qualified'
How can we predict a worker's test result?
Slide 28: ALM User Modeling
Expected behavior
- Read the instructions
- Start with the first questions / check the length of the test
- Read the text
- Read and answer the questions / recheck the text
- Proceed to the next text
Expected interaction: waiting, scrolling, interacting
Slide 29: ALM Implementation
Implemented measures
- Time of page access and leaving
- Vertical scroll position: synchronous measure every 10 s; allows reconstructing the user's field of sight
- Interactions with answering elements: asynchronous measure based on interaction events; allows reconstructing answering times
Evaluated metrics
- Completion time: time between entering the test and submitting the final result; a simple metric considering only server-side measures
- Consideration time: time between the first time the user saw a question and the last interaction with the related answer; a complex measure based on multiple client-side measures
Slide 30: Completion Time
Plausibility threshold
- Minimum time needed to read the test; automatically pre-calculated
Temporal qualification threshold
- Minimum completion time of qualified workers; derived from the test results
No clear differentiation of workers possible in most cases!
- Filters only a small number of workers
- Requires already-evaluated test results
Slide 31: Consideration Time and Question Type
Different question types
- Comprehension: required information stated explicitly in the text
- Reorganization: combination of several pieces of explicit information from the text
- Inference: required information only implicitly stated in the text
Significant differences between qualified and unqualified workers
- Qualified workers spend more time in general
- Time spent by qualified workers increases with the difficulty of the questions
Slide 32: Consideration Time and Answer Type
Different answer types
- Plausible, in text: correct information stated in the text
- Plausible, not in text: information matching the question but not in the text
- Not plausible, in text: information from the text but not matching the question
Observations
- Finding the correct answer requires the largest amount of time (plausible, in text)
- Users guessing the answer without reading require the least amount of time (plausible, not in text)
- Users guessing the answer without reading the question properly lie in between (not plausible, in text)
Slide 33: Conclusion: Application Layer Monitoring
Low-quality results in crowdsourcing
- Different reasons for low-quality results; the reasons are hard to detect
- Understanding of worker-task interactions is necessary
Application layer monitoring
- Non-intrusive technical worker monitoring on client and server side
- Task-specific and complex to implement
- Allows designing objective worker-task interaction measures
Results of the proof-of-concept implementation
- Traditional interaction measures (completion time) are outperformed by more sophisticated ALM measures (consideration time)
- A Support Vector Machine using ALM measures predicts worker qualification with an accuracy of ~89%
Slide 34: Statistical Methods
Slide 35: Random Clicker
Assumption: if a worker selects random answers, the distribution of answers follows a uniform distribution
Test idea: use Pearson's χ²-test to test the null hypothesis 'user is a random clicker'
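The χ²-test on a worker's answer distribution can be sketched as follows. The statistic is implemented directly rather than via a statistics library; 7.815 is the standard 5% critical value for three degrees of freedom (four answer options):

```python
def chi_square_statistic(observed: list) -> float:
    """Pearson chi-square statistic against a uniform expected distribution."""
    n = sum(observed)
    expected = n / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

def looks_like_random_clicker(observed: list, critical: float = 7.815) -> bool:
    """True if we fail to reject the null hypothesis 'answers are uniform'
    at the 5% level (4 categories -> 3 degrees of freedom)."""
    return chi_square_statistic(observed) <= critical

uniform_counts = [25, 24, 26, 25]  # answer counts A-D, close to uniform
skewed_counts = [60, 20, 10, 10]   # a real worker favouring particular answers
```

Note the direction of the test: a worker whose answer counts stay consistent with the uniform distribution is flagged as a suspected random clicker; a clearly non-uniform distribution rejects that null hypothesis.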
Slide 36: CrowdMOS
Assumptions
- In subjective studies, worker ratings should converge to a global trend
- Individual workers should agree with this trend
Test idea: use sample correlation to compare user ratings with the global average ratings
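The CrowdMOS-style agreement check reduces to a sample correlation between one worker's ratings and the per-item averages over all workers. A minimal sketch (the data layout and any rejection threshold applied to the correlations are illustrative assumptions):

```python
def pearson_r(x: list, y: list) -> float:
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Ratings per item: each row is one worker, columns are the test items.
ratings = [
    [1, 2, 3, 4, 5],  # follows the global trend
    [1, 3, 3, 4, 5],  # follows the global trend
    [5, 4, 3, 2, 1],  # rates against the trend
]
item_means = [sum(col) / len(col) for col in zip(*ratings)]
agreement = [pearson_r(worker, item_means) for worker in ratings]
```

Workers whose correlation with the global average is low (or negative, like the third worker above) are candidates for rejection.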
Slide 37: Transitivity Satisfaction Rate
Assumption: subjective ratings are transitive (A is preferred to B and B is preferred to C ⇒ A is preferred to C)
Test idea
- Use pair comparisons of all data points
- Compute compliance with the transitivity assumption
Slide 38: ITU-R BT.500
Assumption: subjective ratings for the same test condition should vary at most by a fixed amount among the workers
Test idea
- Compute the global mean and standard deviation of all ratings
- Determine the distribution of the ratings (normal or non-normal)
- Test whether individual ratings lie within an interval of 2σ around the global mean if the ratings are normally distributed, else √20·σ
- Filter workers based on the number of ratings outside the given range
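The screening rule above can be sketched as follows. This is a simplified per-condition version: the full BT.500-13 procedure additionally decides normality via a kurtosis test and counts outliers across all conditions, whereas here normality is simply passed in as a flag:

```python
def bt500_outliers(ratings: list, normal: bool) -> list:
    """Indices of ratings outside mean +/- c*sigma, with c = 2 for normally
    distributed ratings and c = sqrt(20) otherwise (BT.500-13 style)."""
    n = len(ratings)
    mean = sum(ratings) / n
    sigma = (sum((r - mean) ** 2 for r in ratings) / (n - 1)) ** 0.5
    c = 2.0 if normal else 20 ** 0.5
    return [i for i, r in enumerate(ratings) if abs(r - mean) > c * sigma]

ratings = [4, 4, 5, 4, 3, 4, 5, 4, 1]  # last rating far below the rest
outliers = bt500_outliers(ratings, normal=True)
```

Workers accumulating too many flagged ratings across all test conditions would then be removed from the evaluation.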
Slide 39: Reliability of Statistical Methods
Dataset
- Subjective study on the impact of stalling on YouTube QoE
- Gold standard data: unreliable workers identified by content questions and behavior monitoring
Issues in this study
- Different statistical methods identify different sets of reliable and unreliable workers
- Some methods show a large number of false negatives (no money for those users)
General issue: filtering is based on already-evaluated data
Slide 40: Workflow-Based Approaches
Slide 41: Workflow-Based Approaches
Based on combining task results and varying tasks from different workers
Examples
- Find-Fix-Verify
- Majority decision
- Control group decision
- Two-stage design
Possible applications
- Assessment of worker quality
- Increasing the quality of the final outcome
Issue: needs to be tailored to the specific task
Slide 42: Majority Decision
Parameters
- Number of workers n (odd, to avoid ties)
- Probability of an incorrect task submission p
Probability of a correct majority decision: P = Σ from k = (n+1)/2 to n of C(n, k) · (1 − p)^k · p^(n−k)
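With an odd number of workers n and a per-worker error probability p, the probability of a correct majority decision follows the binomial distribution. A sketch consistent with the parameters on this slide:

```python
from math import comb

def p_majority_correct(n: int, p: float) -> float:
    """Probability that a majority of n workers (n odd) is correct when each
    worker independently submits an incorrect result with probability p."""
    assert n % 2 == 1, "odd n avoids ties"
    need = n // 2 + 1  # smallest majority
    return sum(comb(n, k) * (1 - p) ** k * p ** (n - k)
               for k in range(need, n + 1))

# Three workers, each wrong 20% of the time:
p3 = p_majority_correct(3, 0.2)  # 0.8**3 + 3 * 0.8**2 * 0.2 = 0.896
```

Adding workers improves the result only if individual workers are better than chance; at p = 0.5 the majority is no better than a coin flip.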
Slide 43: Control Group
Parameters
- One worker performing the main task
- Number of workers used for the majority decision of the control group
- Probability of an incorrect task submission (same probability for the main task and the control tasks)
- Probability of a correct majority decision of the control group
Slide 44: Probability of a Correct Control Group Decision
Possible outcomes (main worker result × control group decision), with p the probability of an incorrect submission and p_m the probability of a correct control-group majority decision:
- Correct result, correct decision: correct task approved, probability (1 − p) · p_m
- Incorrect result, correct decision: incorrect task disapproved, probability p · p_m
- Correct result, incorrect decision: correct task disapproved, probability (1 − p) · (1 − p_m)
- Incorrect result, incorrect decision: incorrect task approved, probability p · (1 − p_m)
Probability of a correct control group decision: same probability for a correct result as the majority decision
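With p the main worker's error probability and p_m the probability of a correct control-group majority, the four outcome probabilities can be written down directly. A sketch (the symbol and key names are my own; the slide's original formulas were lost in extraction):

```python
def control_group_outcomes(p: float, p_m: float) -> dict:
    """Probabilities of the four main-worker/control-group outcomes, where p is
    the probability of an incorrect submission and p_m the probability of a
    correct control-group majority decision."""
    return {
        "correct_approved":      (1 - p) * p_m,        # right result kept
        "incorrect_disapproved": p * p_m,              # wrong result caught
        "correct_disapproved":   (1 - p) * (1 - p_m),  # right result rejected
        "incorrect_approved":    p * (1 - p_m),        # wrong result slips through
    }

# p_m here is the 3-worker majority probability from the earlier example.
outcomes = control_group_outcomes(p=0.2, p_m=0.896)
total = sum(outcomes.values())
```

The first two outcomes are the desirable ones; their combined probability is exactly p_m, matching the slide's note that the control group achieves the same correctness probability as a plain majority decision.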
Slide 45: Two-Stage Design
Combined workflow using different quality control mechanisms
Phase 1
- Pre-filtering of users according to the test requirements
- Simple and cheap task including training phases, qualification and trustworthiness tests (all workers paid except obvious spammers)
- → a pseudo-reliable crowd is identified
Phase 2
- Actual task, longer and more expensive
- Only accessible to trusted workers from phase 1
- Limited application of quality control mechanisms to save time and money
Slide 46: Costs of Quality Control Mechanisms
Slide 47: Costs of Quality Control
The total cost of quality-related activities can be divided into conformance costs and non-conformance costs
Conformance costs: costs spent on activities to avoid poor quality
- Prevention costs: costs for activities that prevent the end result from failing quality requirements
  Examples: robust workflow design, easy-to-use task interface
- Appraisal costs: costs for finding errors
  Examples: repetition of tasks, additional work for gold standard tasks
Non-conformance costs: costs resulting from poor quality
- Internal failure: costs due to detected quality issues
  Example: reassignment of failed tasks
- External failure: costs due to errors in the final product
  Examples: invalid translation used in an article, low-quality article published
A trade-off between conformance and non-conformance costs is necessary
Slide 48: Cost Model for Majority Decisions
Cost factors
- Costs for correct and incorrect task submissions
- Costs for false negative and false positive approvals (accepting an invalid task / rejecting a correct task)
- Number of correct tasks and number of incorrect tasks
Expected costs of the majority decision: composed of the costs for correct submissions, the costs for incorrect submissions, and the expected costs of wrong approval decisions
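One plausible instantiation of the cost factors listed above, as a hedged sketch: all symbol names and the linear structure are assumptions on my part, since the slide's formulas were lost. Every submission is paid, and each wrong majority decision additionally incurs a false-positive or false-negative penalty:

```python
from math import comb

def p_majority_correct(n: int, p: float) -> float:
    """Probability of a correct majority among n workers (n odd), each wrong
    with probability p."""
    need = n // 2 + 1
    return sum(comb(n, k) * (1 - p) ** k * p ** (n - k) for k in range(need, n + 1))

def expected_cost_majority(n, p, c_task, c_fp, c_fn, n_items):
    """Hypothetical expected cost for n_items items, each judged by n workers:
    payments for all submissions plus expected penalties for wrong decisions.
    c_fp: cost of approving an incorrect result; c_fn: cost of rejecting a
    correct one. The penalty weighting by the a-priori probability of each
    case is a simplification."""
    p_wrong = 1 - p_majority_correct(n, p)
    payments = n_items * n * c_task
    penalties = n_items * p_wrong * (p * c_fp + (1 - p) * c_fn)
    return payments + penalties

cost = expected_cost_majority(n=3, p=0.2, c_task=0.10, c_fp=1.0, c_fn=0.5, n_items=100)
```

The model reproduces the qualitative statement on the next slide: as worker quality drops (p grows), the wrong-decision penalties and hence the total costs increase.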
Slide 49: Impact of Cost Factors on the Majority Decision
- Fixed cost values and a fixed number of workers
- Total costs increase with decreasing quality of the workers
- The influence of the individual cost factors varies depending on the worker quality
Slide 50: Cost Model for the Control Group Approach
Cost parameters
- Costs for correct and incorrect main tasks
- Costs for false negative and false positive approvals of the main task (accepting an invalid task / rejecting a correct task)
- Costs for the control group majority decision
- Costs for the control group outcomes
The control group approach is repeated until the first result is accepted; the average number of repetitions follows from the acceptance probability (geometric distribution)
Slide 51: Cost Model for the Control Group Approach
Cost components
- Costs for approval and disapproval of the main task
- Costs for the control group approach
- Costs for a correctly accepted main task
- Costs for an incorrectly accepted main task
- Number of rejected main tasks and costs per rejected main task
Slide 52: Impact of Cost Factors on the Control Group Approach
- Fixed cost values for the control group and the main task, and a fixed number of workers (3)
- The total costs reach their maximum at a certain worker error probability
- The influence of the individual cost factors varies depending on the worker quality
Slide 53: Cost-Optimal Mechanism for Routine Tasks
Assumptions
- Low costs per task; wrong tasks are not paid at all
- Small or no difference between the costs for main and control tasks
- Some invalid submissions are tolerable
→ The majority decision is always better than the control group approach
Slide 54: Cost-Optimal Mechanism for Complex Tasks
Assumptions
- High costs per task
- Control tasks are (often) cheaper than the main task
- False positive and false negative approvals result in high additional costs
→ The control group approach is always better than the majority decision if the control task is cheaper than the main task