Slide1
“It is better to observe than to criticise.”
– Bobby Wellins
(Jazz Line-up, 13/2/2011)
Slide2
“Best of all is to convey the magnitude of the effect and the degree of certainty explicitly.”
– Pinker (2014, p. 45)
Slide3
“Usually what one wants to know is not whether the change makes any difference, but to know how likely it is that the change will be big enough.”
– Landauer (1997, p. 222)
Slide4
Magnitude-based inference in behavioural research
Paul van Schaik
p.van-schaik@tees.ac.uk
http://sss-studnet.tees.ac.uk/psychology/staff/Paul_vs/index.htm
Slide5
Outline
Problem and proposed solution
Quantification in behavioural research
Statistical inference in behavioural research
Magnitude-based inference
The application of magnitude-based inference in behavioural research
Other approaches
Limitations
Recommendations
Slide6
The problem
A researcher conducts a study comparing two software designs in terms of their usability
She conducts usability tests with two groups, each using one of the designs, and collects various measures
These include perceived usability, error rate and time-on-task
She then compares the two groups in terms of their mean scores on the measures, using a t test
She finds that, although differences in mean scores are apparent, the test results do not show statistical significance
What should the researcher conclude about the difference in usability between the two designs?
Slide7
A proposed solution
As an alternative to null-hypothesis significance testing (NHST), use information about uncertainty in the data, the observed value of the effect and smallest substantial values for the effect to make two kinds of magnitude-based inference: mechanistic and practical
Use the results of NHST as input
Use spreadsheets available on the Internet to generate inferences
Developed and influential in sport and exercise science
Slide8
Quantification in user research
“The systematic study of the goals, needs, and capabilities of users so as to specify design, construction, or improvement of tools to benefit how users work and live” (Schumacher, 2009, p. 6)
Usability and user-experience data
E.g. psychometric data, error rate and time-on-task
Formative research: users’ interaction with an artefact is studied to generate data that, when analysed, provide information to inform system improvement
Summative research: establishes the quality of interaction with an artefact in comparison with another artefact or a benchmark
Slide9
Statistical inference in user research
Usually, null-hypothesis significance testing (NHST) is used; limitations:
The null hypothesis of no effect is (almost) always false
NHST ignores the smallest important effect, which has no influence on the inference that is made
NHST does not address practical relevance; it does not clearly define or distinguish practical and mechanistic significance
A non-significant result is inconclusive, and a crude classification of inference is used (reject or retain H0)
Sample size estimation is based on NHST
Slide10
Merits of magnitude-based inference
Requires the researcher to define the smallest important effect, rather than the null effect
Uses the smallest important effect as an integral part of inference, so inferences are not an artefact of sample size
Provides a rigorous and principled approach to inferring practical significance; provides a rigorous distinction between practical and mechanistic significance
Slide11
More merits
Provides a more refined classification of inferences that can be made than merely rejecting or retaining the null hypothesis
Estimates of required sample size are based on practical significance or mechanistic significance and the researcher-defined smallest important effect
Slide12
Inference of mechanistic significance (1)
For descriptive purposes, an effect can be classified in terms of its size in relation to the smallest important + and - effect sizes as positive, trivial or negative
For inference proper, the chances of an effect being positive, negative or trivial are used
The chances of the effect being positive: effect falling above the threshold of the smallest important + effect
The chances of the effect being negative: effect falling below the threshold of the smallest important - effect
The chances of a trivial effect: 100% minus the sum of the chances of a + effect and those of a - effect
Slide13
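The three chances defined on the preceding slide (mechanistic significance, part 1) can be sketched in a few lines of Python. This is only a minimal illustration, not the spreadsheet's method: it assumes a normal sampling distribution centred on the observed effect, and the function name `chances`, its parameters and the example numbers are all hypothetical.

```python
from statistics import NormalDist


def chances(effect, se, smallest_neg, smallest_pos):
    """Return (p_negative, p_trivial, p_positive) for an effect estimate.

    Assumption (not from the slides): the sampling distribution is normal,
    centred on the observed effect, with standard error `se`.
    """
    dist = NormalDist(mu=effect, sigma=se)
    p_pos = 1.0 - dist.cdf(smallest_pos)  # above the smallest important + effect
    p_neg = dist.cdf(smallest_neg)        # below the smallest important - effect
    p_triv = 1.0 - p_pos - p_neg          # 100% minus the other two chances
    return p_neg, p_triv, p_pos


# Illustrative numbers only: observed d = 0.3, standard error 0.2, thresholds -0.2 and +0.2
p_neg, p_triv, p_pos = chances(0.3, 0.2, -0.2, 0.2)
print(f"negative {p_neg:.1%}, trivial {p_triv:.1%}, positive {p_pos:.1%}")
```

With small samples a t distribution would probably be the more appropriate choice; the normal distribution is used here only to keep the sketch within the standard library.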
Inference of mechanistic significance (2)
An inference is then made from the chances of each of three ranges of outcome (positivity, triviality and negativity) as follows
Unclear effect: both the chances of the effect being + and the chances of the effect being - are too large (e.g., both greater than the default value of 0.05 or other appropriate cut-offs)
Otherwise, clear effect, seen as substantially +, - or trivial and considered to have the size of the observed value, with a qualification of probability
Proposed interpretation of probability ranges
Slide14
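The decision rule for mechanistic inference on the preceding slide can be written as a small function. This is a sketch under stated assumptions: the function name and return labels are hypothetical, the chances are supplied by the caller, and "falling above/below the threshold" is read as a strict inequality. The qualification of probability is not included.

```python
def mechanistic_inference(effect, p_neg, p_pos, smallest_neg, smallest_pos, cutoff=0.05):
    """Classify an effect as unclear, or clearly positive, trivial or negative.

    `p_neg` and `p_pos` are the chances (as proportions) of a substantially
    negative and a substantially positive effect; `cutoff` is the default 0.05.
    """
    # Unclear: the chances of + and the chances of - are both too large
    if p_neg > cutoff and p_pos > cutoff:
        return "unclear"
    # Otherwise clear; the effect is taken to have the size of the observed value
    if effect > smallest_pos:
        return "clear: positive"
    if effect < smallest_neg:
        return "clear: negative"
    return "clear: trivial"


# Illustrative values only
print(mechanistic_inference(0.30, 0.006, 0.69, -0.2, 0.2))
print(mechanistic_inference(0.05, 0.20, 0.30, -0.2, 0.2))
```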
Probability      Chances          Odds             The effect … (positive/trivial/negative; beneficial/negligible/harmful)
<0; 0.005]       <0; 0.5%]        <0; 1:199]       is almost certainly not …
<0.005; 0.05]    <0.5%; 5%]       <1:199; 1:19]    is very unlikely to be …
<0.05; 0.25]     <5%; 25%]        <1:19; 1:3]      is unlikely to be …, is probably not …
<0.25; 0.75]     <25%; 75%]       <1:3; 3:1]       is possibly (not) …, may (not) be …
<0.75; 0.95]     <75%; 95%]       <3:1; 19:1]      is likely to be …, is probably …
<0.95; 0.995]    <95%; 99.5%]     <19:1; 199:1]    is very likely to be …
<0.995; 1>       <99.5%; 100%>    <199:1; ∞>       is almost certainly …
Slide15
Slide16
Slide17
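The table of probability ranges above maps a probability onto a qualitative descriptor. A minimal sketch of that lookup follows; the names `UPPER_BOUNDS`, `PHRASES` and `describe` are hypothetical, and assigning exactly 0 and exactly 1 to the first and last categories is an assumption, since the table leaves those end points open.

```python
import bisect

# Upper bounds of the first six ranges in the table; each range is (lower; upper]
UPPER_BOUNDS = [0.005, 0.05, 0.25, 0.75, 0.95, 0.995]
PHRASES = [
    "is almost certainly not",
    "is very unlikely to be",
    "is unlikely to be",
    "is possibly",
    "is likely to be",
    "is very likely to be",
    "is almost certainly",
]


def describe(probability, outcome):
    """Return the qualitative statement for a probability between 0 and 1."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    # bisect_left keeps a value equal to an upper bound in the lower range
    index = bisect.bisect_left(UPPER_BOUNDS, probability)
    return f"The effect {PHRASES[index]} {outcome}"


print(describe(0.92, "beneficial"))
```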
Inference of practical significance (1)
For descriptive purposes, an effect can be classified in terms of its size in relation to the smallest important beneficial and harmful effect sizes as beneficial, negligible or harmful
For inference proper, the chances of an effect being beneficial, harmful or negligible are used
The chances of the effect being beneficial: effect falling above the threshold of the smallest important beneficial effect
The chances of the effect being harmful: effect falling below the threshold of the smallest important harmful effect
The chances of a negligible effect: 100% minus the sum of the chances of a beneficial effect and those of a harmful effect
Slide18
Inference of practical significance (2)
Type-1 practical error: analogous to Type-I error in NHST (rejecting the null hypothesis when it is true)
Type-2 practical error: analogous to Type-II error in NHST (retaining the null hypothesis when it is false)
In the practical (‘clinical’) application of effects, the chance of using a harmful effect (a Type-1 practical error) needs to be far smaller than the chance of not using a beneficial effect (a Type-2 practical error)
Slide19
Inference of practical significance (3)
An inference is then made from the chances of each of three ranges of outcome (benefit, negligibility and harm) as follows
If the chances of benefit are greater than the suggested cut-off of 25% for a Type-2 practical error and the chances of harm are greater than the suggested cut-off of 0.5% for a Type-1 practical error, then the effect is unclear
If the chances of benefit are greater than 25% and the chances of harm are smaller than 0.5%, then the effect is clearly beneficial
Otherwise, the effect is clearly negligible or harmful
Proposed interpretation of probability ranges as before
Slide20
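The three-way rule for practical inference on the preceding slide translates directly into code. The sketch below uses the suggested cut-offs of 25% and 0.5%; the function and parameter names are hypothetical, and the tie case (chances of harm exactly equal to 0.5%), which the rule does not specify, is treated as unclear here as a cautious assumption.

```python
def practical_inference(p_benefit, p_harm, type2_cutoff=0.25, type1_cutoff=0.005):
    """Apply the practical decision rule to chances given as proportions."""
    if p_benefit > type2_cutoff:
        # Assumption: harm exactly at the cut-off counts as unclear
        if p_harm >= type1_cutoff:
            return "unclear"
        return "clearly beneficial"
    return "clearly negligible or harmful"


# Illustrative values only
print(practical_inference(0.92, 0.000))
print(practical_inference(0.40, 0.100))
print(practical_inference(0.00, 0.250))
```

The asymmetric cut-offs reflect the point made earlier that the chance of using a harmful effect should be far smaller than the chance of not using a beneficial one.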
Example from sport science (1)
I am grateful to Matt Weston for providing this example
A sports researcher is interested in whether a new, commercially available nutritional supplement has a beneficial or harmful effect on elite cyclists’ 40 km time trial performance (the faster the time, the better the performance)
The researcher conducts an experiment to examine the effect of two different doses of the supplement (a low dose and a high dose)
Experimental crossover design: all of the cyclists perform the time trial under three different conditions (placebo [no supplement], low dose and high dose), in a counterbalanced manner
The researcher’s experience led to the belief that the smallest worthwhile change in 40 km time trial performance was -1%
Slide21
Example from sport science (2)
The mean (± SD) performance times were 59.5 ± 1.6 min (low dose), 60.9 ± 2.2 min (high dose) and 60.5 ± 1.9 min (placebo)
Magnitude-based inferences calculate the chances of benefit (or harm), with reference to a change of -1%
Compared to placebo, the low dose improved performance, changing time by -1.7% (90% confidence interval -2.4 to -0.9%), with a 92% chance of benefit and a 0.0% chance of harm
A low dose of the supplement is therefore likely to be beneficial and is recommended
However, compared to placebo, the high dose impaired performance by 0.7% (90% confidence interval -0.1 to 1.5%), with a 0% chance of benefit and a 25% chance of harm
A high dose of the supplement is therefore most unlikely to be beneficial and is not recommended
Slide22
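The figures in the sport-science example on the preceding slide can be roughly reproduced from the reported means and confidence intervals. The sketch below makes several assumptions that are not in the slides: the percent change is taken as a simple ratio of means, the standard error is backed out of the 90% interval using a normal distribution, and harm is taken to mean a change above +1%. The function names are hypothetical.

```python
from statistics import NormalDist

Z90 = NormalDist().inv_cdf(0.95)  # a two-sided 90% interval ends at the 95th percentile


def percent_change(treatment, placebo):
    """Percent change in mean time relative to placebo (negative = faster)."""
    return 100.0 * (treatment - placebo) / placebo


def chances_from_ci(effect, lower, upper, benefit_below=-1.0, harm_above=1.0):
    """Return (chance of benefit, chance of harm) under a normal approximation."""
    se = (upper - lower) / (2.0 * Z90)  # standard error implied by the interval
    dist = NormalDist(mu=effect, sigma=se)
    return dist.cdf(benefit_below), 1.0 - dist.cdf(harm_above)


low = percent_change(59.5, 60.5)
high = percent_change(60.9, 60.5)
low_benefit, low_harm = chances_from_ci(-1.7, -2.4, -0.9)
high_benefit, high_harm = chances_from_ci(0.7, -0.1, 1.5)

print(f"low dose:  {low:+.1f}%, benefit {low_benefit:.0%}, harm {low_harm:.0%}")
print(f"high dose: {high:+.1f}%, benefit {high_benefit:.0%}, harm {high_harm:.0%}")
```

The normal approximation gives chances close to, but not identical with, the reported 92% and 25%; the reported values were presumably obtained with a t distribution, which this sketch does not attempt to replicate.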
Demonstration
Example: unrelated t test
Mechanistic inference
Practical inference
Spreadsheets available at http://www.sportsci.org/
Slide23
Observations
Practical and mechanistic inference, but not statistical inference, depend on the smallest worthwhile effect
The range of practical and mechanistic inferences (e.g., “is very (un)likely to be harmful/trivial/beneficial”) is greater than that of statistical inference (dichotomous)
The results of practical and mechanistic inference concur about half of the time with those of statistical inference; when the results differ, the latter is more conservative
Practical and mechanistic inference mostly concur
Slide24
Smallest harmful/   Smallest beneficial/   Total sample size (N)   Sample size ratio
-ive d              +ive d                 P      M      S         S/P    S/M    M/P
-0.2                0.2                    268    274    788       2.94   2.88   1.02
-0.3                0.3                    122    122    352       2.89   2.89   1.00
-0.4                0.4                    70     70     198       2.83   2.83   1.00
-0.5                0.5                    46     46     128       2.78   2.78   1.00
-0.6                0.6                    34     32     90        2.65   2.81   0.94
-0.7                0.7                    26     24     66        2.54   2.75   0.92
-0.8                0.8                    22     20     52        2.36   2.60   0.91
-0.9                0.9                    18     16     42        2.33   2.63   0.89
-1.0                1.0                    14     14     34        2.43   2.43   1.00
-1.1                1.1                    14     12     28        2.00   2.33   0.86
-1.2                1.2                    14     10     24        1.71   2.40   0.71
Slide25
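The ratio columns in the sample-size table above are plain quotients of the total sample sizes, which a short check makes explicit. Reading P, M and S as practical, mechanistic and statistical inference is an assumption based on the surrounding slides; only three rows are copied here, and the function name is hypothetical.

```python
# (smallest important d, N for P, N for M, N for S), copied from the table
ROWS = [
    (0.2, 268, 274, 788),
    (0.5, 46, 46, 128),
    (1.2, 14, 10, 24),
]


def sample_size_ratios(n_p, n_m, n_s):
    """Return the ratios S/P, S/M and M/P for one row of the table."""
    return n_s / n_p, n_s / n_m, n_m / n_p


for d, n_p, n_m, n_s in ROWS:
    s_p, s_m, m_p = sample_size_ratios(n_p, n_m, n_s)
    print(f"d = {d}: S/P = {s_p:.2f}, S/M = {s_m:.2f}, M/P = {m_p:.2f}")
```

Across the table, S is between roughly 1.7 and 2.9 times P.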
Further alternatives to NHST
Counter-null statistic (Rosenthal & Rubin, 1994)
p-rep (Killeen, 2005)
p-intervals (Cumming, 2008)
Minimum-effect tests (Murphy & Myors, 1999)
Equivalence testing (Tryon, 2001)
Non-inferiority testing (Head et al., 2014)
Bayesian statistics (Rouder et al., 2009)
Slide26
Limitations
Apparent
As in NHST, need to make several choices or accept recommended choices:
Confidence level
Type-1 and Type-2 practical-error rates
The smallest important effect
The mapping of quantitative probabilities onto qualitative descriptors
As in NHST, assumptions about the sampling distribution of the outcome statistic; can use bootstrapping
Substantive
The decision rules do not necessarily take all relevant factors into account, for example the (financial) value of inputs to and outputs from using a harmful or beneficial effect (Murphy & Myors, 1999)
Slide27
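The limitations on the preceding slide mention bootstrapping as a way around assumptions about the sampling distribution. One possible reading is sketched below: resample each group with replacement, recompute the difference in means, and count how often it falls beyond each threshold. Treating those proportions as the chances of a negative, trivial or positive effect is an assumption of this sketch, and the function name, the data and the thresholds are all hypothetical.

```python
import random
from statistics import mean


def bootstrap_chances(group_a, group_b, smallest_neg, smallest_pos, n_boot=5000, seed=1):
    """Estimate (p_negative, p_trivial, p_positive) for mean(b) - mean(a)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_neg = n_pos = 0
    for _ in range(n_boot):
        # Resample each group with replacement, keeping the group sizes
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diff = mean(resample_b) - mean(resample_a)
        if diff > smallest_pos:
            n_pos += 1
        elif diff < smallest_neg:
            n_neg += 1
    p_neg = n_neg / n_boot
    p_pos = n_pos / n_boot
    return p_neg, 1.0 - p_neg - p_pos, p_pos


# Made-up time-on-task data (seconds) for two designs; thresholds of -2 s and +2 s
design_a = [41, 38, 45, 40, 39, 43, 42, 37]
design_b = [44, 42, 47, 45, 41, 46, 48, 43]
print(bootstrap_chances(design_a, design_b, -2.0, 2.0))
```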
Recommendations
Plan sample size using magnitude-based inference
Analyse data using NHST; make better use of the results as input for magnitude-based inference
Always analyse data using mechanistic inference; also use practical inference for effects where benefit and harm can be meaningfully defined
Use appropriate spreadsheets for sample size estimation and magnitude-based inference (http://www.sportsci.org/)
When preparing for journal publication, cogently argue why it is appropriate to use magnitude-based inference in your research; in your Data Analysis section, explain the specific magnitude-based inference that you have used (see, e.g., Barnes et al., 2015)
Slide28
Some publications
Barnes, K. R., Hopkins, W. G., McGuigan, M. R., & Kilding, A. E. (2015). Warm-up with a weighted vest improves running performance via leg stiffness and running economy. Journal of Science and Medicine in Sport, 18, 103-108. doi:10.1016/j.jsams.2013.12.005
Batterham, A. M., & Hopkins, W. G. (2006). Making meaningful inferences about magnitudes. International Journal of Sports Physiology and Performance, 1(1), 50-57.
Hopkins, W. G. (2006). Estimating sample size for magnitude-based inference. Sport Science, 10, 63-70.
Hopkins, W. G. (2006). Spreadsheets for analysis of controlled trials, with adjustment for a subject characteristic. Sport Science, 10, 46-50.
Hopkins, W. G., Marshall, S. W., Batterham, A. M., & Hanin, J. (2009). Progressive statistics for studies in sports medicine and exercise science. Medicine and Science in Sports and Exercise, 41(1), 3-12. doi:10.1249/MSS.0b013e31818cb278
Schaik, P. van & Weston, M. (2016). Magnitude-based inference and its application in user research. International Journal of Human-Computer Studies, 88, 38-50.