Evaluation Methodologies


Presentation Transcript

Slide 1

Evaluation Methodologies

Slide 2

Validity*

Internal Validity
Construct Validity
External Validity

* In the context of a research study, i.e., not measurement validity.

Slide 3

Internal Validity

Generally relevant only to studies with causal relationships:

Temporal precedence
Correlation
No plausible alternative

Key question: can the outcome be attributed to causes other than the designed intervention? If so, it is likely that internal validity needs to be tightened up.

Slide 4

Threats to Internal Validity

Single Group Threats
Multiple Group Threats
Social threats to internal validity

Slide 5

Single Group Threats

Imagine an educational program where two different testing regimens are used.

In one, an intervention is followed by a post-test.

In the second, a pre-test, an intervention, and a post-test are used.

What are the single-group threats for this design?

Slide 6

Single Group Threats

History (something happened at the same time)
Maturation (something would have happened at the same time)
Testing (the testing itself induced an effect)
Instrumentation (changes in the testing)
Mortality (attrition in study participants)

Regression (regression to the mean)
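The regression threat is easy to demonstrate numerically. Below is a minimal simulation sketch (plain Python with NumPy; all numbers are fabricated for illustration) showing that a group selected for extreme pre-test scores drifts back toward the population mean on the post-test even with no intervention at all.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
ability = rng.normal(50, 10, n)        # stable true ability, no change
pre = ability + rng.normal(0, 5, n)    # pre-test = ability + noise
post = ability + rng.normal(0, 5, n)   # post-test, with NO intervention

low_scorers = pre < np.percentile(pre, 20)   # "enroll" the bottom 20%
print(f"selected group pre-test mean:  {pre[low_scorers].mean():.1f}")
print(f"selected group post-test mean: {post[low_scorers].mean():.1f}")
# The post-test mean drifts back toward 50 even though nothing changed;
# a single-group design would misread this as a program effect.
```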

Slide 7

Multiple Group Threats

Suppose for the previous study we had multiple groups instead of a single group.

Multiple group threats are variations on the single group threats with selection bias added. If the added second group is a control, for instance, it must be selected in a way that makes it fully comparable to the first group (random assignment). If participants cannot be randomly assigned, the result is a quasi-experimental design.
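As a concrete illustration of the random-assignment remedy mentioned above, here is a minimal sketch; the participant IDs and group sizes are invented for illustration.

```python
import random

random.seed(7)  # reproducibility only; any seed works
participants = [f"p{i:02d}" for i in range(20)]   # invented IDs
random.shuffle(participants)            # randomize the ordering
mid = len(participants) // 2
treatment = participants[:mid]          # groups are comparable
control = participants[mid:]            # in expectation
print("treatment:", treatment)
print("control:  ", control)
```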

Slide 8

Social Interaction Threats

Applicable to the social sciences (because people do not react simply to stimuli).

Diffusion (people in the treatment groups talk to one another)
Compensatory rivalry (treatment groups know what is happening and develop a rivalry)
Resentful demoralization (same as above, but with the opposite sign)
Compensatory equalization (researchers or others equalize the groups)

Slide 9

External Validity

Are the results valid for other persons, in other places, and at other times? Do they generalize?

Types of generalization
Threats to external validity

Slide 10

Generalizations

Sampling Model: try to make certain that your study groups are a random sample of the population to which you wish your generalization to extend.
“Proximal Similarity”: measure or stratify the sample on the things you cannot randomize.
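A small sketch of the stratification idea, assuming a single attribute ("region") that cannot be randomized; the population, strata, and sampling fraction are all invented for illustration.

```python
import random

random.seed(3)
# Invented population: one attribute ("region") we cannot randomize.
population = [{"id": i, "region": random.choice(["urban", "rural"])}
              for i in range(1000)]

def stratified_sample(pop, key, frac):
    """Draw the same fraction from every stratum defined by `key`."""
    strata = {}
    for person in pop:
        strata.setdefault(person[key], []).append(person)
    sample = []
    for members in strata.values():
        k = max(1, round(frac * len(members)))
        sample.extend(random.sample(members, k))
    return sample

sample = stratified_sample(population, "region", frac=0.05)
print({r: sum(p["region"] == r for p in sample) for r in ("urban", "rural")})
```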

Slide 11

Threats to external validity

People
Places
Times

Slide 12

Construct Validity

An assessment of how well ideas or theories are translated into actual programs.

Mapping of concrete activities onto theoretical constructs.

Slide 13

Formal articulations:

Nomological network (Cronbach and Meehl, 1955): researchers were to establish a theoretical network of what to measure, an empirical framework for how to measure it, and the linkages between the two.
Multitrait-Multimethod Matrix (Campbell and Fiske, 1959): convergent concepts should show higher correlations; divergent concepts, lower correlations.
Pattern matching (Trochim, 1985): linking a theoretical pattern with an operational pattern.
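To make the Multitrait-Multimethod idea concrete, here is a hedged sketch that builds such a correlation matrix from simulated scores; the traits, methods, and noise levels are invented for illustration, not drawn from Campbell and Fiske.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
anxiety = rng.normal(size=n)        # latent trait 1
motivation = rng.normal(size=n)     # latent trait 2, independent of 1

# Each measure = trait signal + method-specific noise.
measures = {
    "anxiety_self":    anxiety    + 0.5 * rng.normal(size=n),
    "anxiety_obs":     anxiety    + 0.5 * rng.normal(size=n),
    "motivation_self": motivation + 0.5 * rng.normal(size=n),
    "motivation_obs":  motivation + 0.5 * rng.normal(size=n),
}

names = list(measures)
corr = np.corrcoef(np.array([measures[k] for k in names]))

# Convergent validity: same trait, different method -> high correlation.
# Discriminant validity: different traits -> low correlation.
for name, row in zip(names, corr):
    print(f"{name:16s}" + "".join(f"{v:7.2f}" for v in row))
```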

Slide 14

Threats to Construct Validity

Poorly defined constructs
Mono-operation bias: the construct is larger than the single program / treatment you devised.
Mono-method bias: the construct is larger than the limited set of measurements you devised.
Test and treatment interaction: measurement changes the treatment group.

Other threats generally fall under “labeling” threats: a construct is essentially a metaphor, and if it is not precisely articulated, different persons can hold differing meanings.

Slide 15

Social Threats to Construct Validity

Hypothesis guessing: participants guess at the purpose of your study and attempt to game it.
Evaluation apprehension: if apprehension causes participants to do poorly (or to pose as doing well), then the apprehension becomes a confounding factor.
Researcher expectancies: the researcher's expectations confound the outcome.
Hawthorne effect: people change behavior when observed.

Rosenthal effect: researcher expectations can change outcomes even when subjects are uninformed.

Slide 16

Wake Up and Smell the Coffee

Slide 17

Wake Up Overview

The authors see methodology as intellectual infrastructure.

They believe that rapid change in computer science produces outdated methodology.

Three key claims:
Workloads used need to be appropriate
Experimental design needs to be appropriate
Analysis needs to be rigorous

Slide 18

Wake Up: Focus

For this paper, the authors focus on Java.

Modern language features (type safety, memory management, secure execution) have been added to Java. The authors believe these additions make previous benchmarking practices untenable:

Tradeoffs due to garbage collection, where heap size is a control variable
Non-determinism due to adaptive optimization and sampling technologies

System warm-up from dynamic class loading and just-in-time compilation

Slide 19

Wake Up: Workloads

The authors created a suite (DaCapo) of benchmarks suitable for research. The suite consists of open-source applications. DaCapo validates its diversity by collecting a variety of metrics and then applying principal component analysis (PCA). The authors point to “cherry picking” research by Perez, showing that reducing the diversity of measures increases ambiguous and incorrect conclusions.
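The diversity-validation step can be sketched as follows. This is not the DaCapo analysis itself: the metric values are fabricated placeholders, and the PCA is a plain NumPy implementation for illustration.

```python
import numpy as np

# Rows: benchmarks; columns: metrics (e.g., allocation rate, code size,
# cache miss rate). The values are fabricated stand-ins, not DaCapo data.
rng = np.random.default_rng(1)
metrics = rng.normal(size=(10, 6))      # 10 benchmarks x 6 metrics

# Standardize each metric, then PCA via SVD of the centered matrix.
z = (metrics - metrics.mean(axis=0)) / metrics.std(axis=0)
u, s, vt = np.linalg.svd(z, full_matrices=False)
explained = s**2 / np.sum(s**2)         # variance explained per component
scores = z @ vt.T                       # benchmark coordinates in PC space

print("variance explained:", np.round(explained, 2))
# Benchmarks spread out in PC space exercise the system in distinct ways;
# tight clusters would signal redundant workloads.
print("first two PCs:\n", np.round(scores[:, :2], 2))
```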

Slide 20

Wake Up: Experimental Design

In their results, the authors show four ways to evaluate garbage collection. Any specific measure can be “gamed” to produce a desired result.

Classic comparison of Fortran / C / C++: control for host platform and language runtime.
New comparisons: control for host platform, language runtime, heap size, nondeterminism, and warm-up.
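A hedged sketch of such an experimental design is below. For a real Java study each invocation would be a fresh JVM launched with a fixed heap size (e.g., -Xms/-Xmx); here a Python callable stands in so the sketch stays self-contained.

```python
import statistics
import time

def measure(workload, invocations=10, warmup_iters=5, timed_iters=5):
    """Time `workload`, discarding warm-up iterations in each invocation."""
    per_invocation = []
    for _ in range(invocations):        # repeat to expose nondeterminism
        for _ in range(warmup_iters):   # let the system reach steady state
            workload()
        times = []
        for _ in range(timed_iters):    # steady-state measurements only
            start = time.perf_counter()
            workload()
            times.append(time.perf_counter() - start)
        per_invocation.append(min(times))
    return per_invocation

samples = measure(lambda: sum(i * i for i in range(100_000)))
print(f"mean {statistics.mean(samples):.4f}s, "
      f"stdev {statistics.stdev(samples):.4f}s over {len(samples)} invocations")
```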

Slide 21

Wake Up: Analysis

To obtain meaningful data from noisy measurements, data must be collected and aggregated.

Current practices sometimes lack statistical rigor.
Presenting all the results from the suite (as opposed to one number) will reduce “cherry picking”.
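As one way to add that rigor, the sketch below reports a per-benchmark mean with an approximate confidence interval rather than a single aggregate number; the benchmark names and timings are fabricated for illustration.

```python
import statistics

def mean_with_ci(samples, t_crit=2.0):
    """Mean and approximate 95% CI half-width. t ~= 2 is a rough critical
    value; for small n use the exact Student's t quantile instead."""
    half = t_crit * statistics.stdev(samples) / len(samples) ** 0.5
    return statistics.mean(samples), half

# Fabricated per-benchmark timings (seconds) from repeated runs.
results = {
    "bench_a": [1.02, 0.98, 1.05, 1.01, 0.99],
    "bench_b": [2.40, 2.55, 2.47, 2.61, 2.38],
}
for name, samples in results.items():
    m, half = mean_with_ci(samples)
    print(f"{name}: {m:.2f} +/- {half:.2f} s")
```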