
Evaluating Test-Suite Reduction - PowerPoint Presentation

Uploaded On 2019-03-15


Evaluating Test-Suite Reduction in Real Software Evolution. August Shi, Alex Gyori, Suleman Mahmood, Peiyuan Zhao, Darko Marinov. ISSTA 2018, Amsterdam, Netherlands, July 16, 2018. CCF-1409423, CCF-1421503, CNS-1646305.



Presentation Transcript

Slide1

Evaluating Test-Suite Reduction in Real Software Evolution

August Shi, Alex Gyori, Suleman Mahmood, Peiyuan Zhao, Darko Marinov
ISSTA 2018
Amsterdam, Netherlands
July 16, 2018

CCF-1409423, CCF-1421503, CNS-1646305

Slide2

Regression Testing

[Diagram: Code Under Test versions V1, V2, V3, …, each run against the full test suite test1, test2, test3, test4, …, testN-1, testN]

Many Tests x Many Versions = Costly!

Slide3

Test-Suite Reduction (TSR)

[Diagram: Code Under Test versions V1, V2, V3, each shown with test1 through test6, of which only a reduced subset is kept]

Kept Size: [formula not preserved in this transcript]

Slide4

Traditional Evaluation

[Diagram: coverage-based TSR reduces the suite test1 through test6 on Code Under Test V1 with 0% coverage loss. Mutants V1', V1'', V1''' (Mutant 1, Mutant 2, Mutant 3) are then run against the tests to measure mutant loss.]

All measured on single version!

Slide5

How effective are reduced test suites at detecting faults in real software evolution?

Slide6

Developer Usage of TSR

[Diagram: Code Under Test versions V1, V2, V3 run against test1 through test6, of which only a reduced subset is kept. Annotations: "Missed test failure!" (miss-build); "Kept test failure, build fails"; "Detects same fault?" (miss-build?)]

Slide7

Failed-Build Detection Loss (FBDL)

Given the set of future failed builds and the set of failed builds where the reduced test suite does not detect the same faults the original test suite detects (miss-builds):

FBDL = (number of miss-builds) / (number of future failed builds)

FBDL is based on what developers expect from using a reduced test suite in regression testing
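As a minimal sketch of this definition, assuming FBDL is the fraction of future failed builds that are miss-builds, the computation could look like the following. The function name `fbdl` and the build identifiers are hypothetical and chosen only for illustration.

```python
def fbdl(failed_builds, miss_builds):
    """Failed-Build Detection Loss as a fraction in [0, 1].

    failed_builds: identifiers of future failed builds (after the reduction point).
    miss_builds:   the subset of failed_builds where the reduced test suite
                   does not detect the same faults as the original suite.
    """
    failed = set(failed_builds)
    missed = set(miss_builds)
    if not failed:
        raise ValueError("FBDL is undefined without any failed build")
    if not missed <= failed:
        raise ValueError("every miss-build must also be a failed build")
    return len(missed) / len(failed)


# Hypothetical history: four failed builds, one of which is a miss-build.
example = fbdl(["V3", "V5", "V6", "V7"], ["V6"])
print(f"FBDL = {example:.1%}")
```

Which builds count as miss-builds is the hard part; the function above takes that set as given.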

Slide8

FBDL Example

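[Diagram: build history V1 through V7 with a marked reduction point, illustrating how FBDL is computed over the builds after that point; the formulas and values shown on the slide are not preserved in this transcript]

Slide9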

Miss-builds

When is a failed build a miss-build?
Can developers find same faults using a subset of test failures?
Define miss-builds using heuristics for how likely groupings of failed tests are due to the same fault
Parsing test failures from build logs

Slide10

Classifying Failed Builds

[Diagram: builds V1 through V7 with a reduction point; each failed build is shown with its failed tests among test1 through test6 (plus a new test7) and labeled HIT, DEFMISS, or NEWONLY]

HIT: All failed tests in reduced test suite
DEFMISS: No failed test in reduced test suite
NEWONLY: Only new tests failed

Slide11

Classifying miss-builds for FBDL

The miss-builds are defined based on classifications
Developers decide which FBDL makes sense

FBDL_S: DEFMISS builds
    Failures map to same fault
FBDL_P: DEFMISS and LIKELYMISS builds
    Failures in same package map to same fault
FBDL_C: DEFMISS, LIKELYMISS, SAMEPACK builds
    Failures in same class map to same fault
FBDL_U: All builds except HIT and NEWONLY builds
    Failures map to unique fault each
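The classification and the four FBDL variants can be sketched together as follows. The classification rules follow the backup slide "Classifying Failed Builds" (DEFMISS, HIT, NEWONLY, SAMECL, SAMEPACK, LIKELYMISS). Everything else is an assumption made for this sketch: test names are taken to have the form `package.Class.method`, the function names are hypothetical, and the denominator is taken to be all classified failed builds.

```python
from collections import Counter

# Build classifications counted as miss-builds under each FBDL variant.
MISS_CLASSES = {
    "FBDL_S": {"DEFMISS"},
    "FBDL_P": {"DEFMISS", "LIKELYMISS"},
    "FBDL_C": {"DEFMISS", "LIKELYMISS", "SAMEPACK"},
    "FBDL_U": {"DEFMISS", "LIKELYMISS", "SAMEPACK", "SAMECL"},
}


def class_of(test_name):
    # Assumed naming scheme: "pkg.AT.test1" -> "pkg.AT"
    return test_name.rsplit(".", 1)[0]


def package_of(test_name):
    # "pkg.AT.test1" -> "pkg"; empty string if there is no package part
    cls = class_of(test_name)
    return cls.rsplit(".", 1)[0] if "." in cls else ""


def classify_build(failed, kept, removed):
    """Classify one failed build relative to a reduced test suite.

    failed:  failed tests of the build
    kept:    tests in the reduced suite at the reduction point
    removed: tests dropped by the reduction; tests in neither set are NEW
    """
    failed = set(failed)
    if not failed:
        raise ValueError("a failed build needs at least one failed test")
    removed_failed = failed & set(removed)
    other_failed = failed - removed_failed  # KEPT or NEW failed tests
    if removed_failed == failed:
        return "DEFMISS"
    if not removed_failed:
        return "HIT" if failed & set(kept) else "NEWONLY"
    if {class_of(t) for t in removed_failed} <= {class_of(t) for t in other_failed}:
        return "SAMECL"
    if {package_of(t) for t in removed_failed} <= {package_of(t) for t in other_failed}:
        return "SAMEPACK"
    return "LIKELYMISS"


def fbdl_variants(classifications):
    """Compute every FBDL variant from a list of build classifications."""
    counts = Counter(classifications)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one classified failed build")
    return {
        variant: sum(counts[c] for c in classes) / total
        for variant, classes in MISS_CLASSES.items()
    }
```

Because each variant's miss-build set contains the previous one, the values are ordered: FBDL_S is never larger than FBDL_P, which is never larger than FBDL_C, which is never larger than FBDL_U.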

Slide12

Classifying Failed Builds

[Diagram: builds V1 through V7 with a reduction point; tests are named by test class (AT.test1, AT.test2, AT.test3, BT.test4, CT.test5, CT.test6); failed builds are labeled HIT, DEFMISS, SAMECL, or NEWONLY]

SAMECL: Failed tests not in reduced test suite are in same test class as those kept
SAMEPACK: for tests in same package
LIKELYMISS: all remaining instances

Slide13

Classifying miss-builds for FBDL

The miss-builds are defined based on classifications
Developers decide which FBDL makes sense

FBDL_S: DEFMISS builds
    Failures map to same fault
FBDL_P: DEFMISS and LIKELYMISS builds
    Failures in same package map to same fault
FBDL_C: DEFMISS, LIKELYMISS, SAMEPACK builds
    Failures in same class map to same fault
FBDL_U: All builds except HIT and NEWONLY builds
    Failures map to unique fault each

Slide14

Research Questions

RQ1: What is the FBDL of TSR?
RQ2: How well can the FBDL of TSR be predicted?
RQ3: How does distance from TSR reduction point affect FBDL?

See paper

Slide15

Evaluation

Failed builds, test failures taken from Travis CI
32 projects, 1478 failed builds
321 reduction points
Four TSR algorithms (GREEDY, GE, GRE, HGS)
Two types of test requirements (coverage, mutants)
Coverage-based TSR: 51.9% kept, 2.7% mutants loss
Mutant-based TSR: 61.1% kept, 2.2% coverage loss
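To make the "kept" and "loss" numbers concrete, here is a minimal sketch of a greedy reduction and the two traditional metrics. It assumes GREEDY is the usual greedy set-cover heuristic (repeatedly keep the test that satisfies the most still-unsatisfied requirements); the slides do not spell the algorithm out, and the function names and toy data are hypothetical.

```python
def greedy_reduce(requirements):
    """Greedy set-cover reduction.

    requirements: dict mapping test name -> set of requirements it satisfies
                  (e.g. covered statements, or killed mutants).
    Returns the kept tests, which together satisfy every requirement
    that the original suite satisfies.
    """
    unsatisfied = set().union(*requirements.values())
    kept = []
    while unsatisfied:
        # Ties are broken by test name so the result is deterministic.
        best = max(sorted(requirements),
                   key=lambda t: len(requirements[t] & unsatisfied))
        kept.append(best)
        unsatisfied -= requirements[best]
    return kept


def kept_fraction(requirements, kept):
    """Fraction of the original tests that remain in the reduced suite."""
    return len(kept) / len(requirements)


def requirement_loss(requirements, kept):
    """Fraction of requirements satisfied by the original suite
    that the reduced suite no longer satisfies."""
    full = set().union(*requirements.values())
    reduced = set().union(*(requirements[t] for t in kept))
    return 1 - len(reduced) / len(full) if full else 0.0


# Toy data: reduce on statement coverage, then measure loss on mutants.
coverage = {"test1": {"s1", "s2"}, "test2": {"s2"},
            "test3": {"s3"}, "test4": {"s1", "s3"}}
mutants = {"test1": {"m1"}, "test2": {"m2"},
           "test3": {"m3"}, "test4": {"m1"}}

kept = greedy_reduce(coverage)
print(kept, kept_fraction(coverage, kept), requirement_loss(mutants, kept))
```

Reducing on one kind of requirement and measuring loss on the other mirrors the pairing in the list above (coverage-based TSR reported with mutant loss, and the reverse).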

Computed reductions match with prior work

Slide16

RQ1: FBDL of TSR

Higher losses than suggested by traditional test-requirements loss

[Chart: Mut-Loss and Cov-Loss shown alongside the FBDL variants; values not preserved in this transcript]

FBDL_S – failed tests to same fault
FBDL_P – same package failed tests to same fault
FBDL_C – same class failed tests to same fault
FBDL_U – failed tests to unique fault

Slide17

RQ2: Prediction of FBDL (Size)

How well does reduced test suite size predict FBDL?
Intuition: the greater the kept percentage, the smaller the FBDL (keeping more failed tests)

[Plot: Coverage-based GREEDY TSR]

Slide18

RQ2: Prediction of FBDL (Requirements)

How well does test-requirement loss predict FBDL?
Intuition: the larger the loss in test-requirements, the worse the FBDL

[Plot: Coverage-based GREEDY TSR]

Slide19

RQ2: Prediction of FBDL (History)

How well does historical FBDL predict future FBDL?
Measure FBDL of a reduced test suite for first half of failed builds, then correlate with FBDL of second half
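The split-half measurement can be sketched as below. This is an illustration only: the slide does not say how an odd number of builds is split or which correlation statistic is used, so the midpoint rule and the Pearson coefficient here are assumptions, and all names are hypothetical.

```python
from math import sqrt


def split_half_fbdl(is_miss):
    """FBDL of the first and second half of a chronological build history.

    is_miss: list of booleans, one per failed build in chronological order,
             True where the build is a miss-build.
    """
    mid = len(is_miss) // 2
    first, second = is_miss[:mid], is_miss[mid:]
    if not first or not second:
        raise ValueError("need at least two failed builds")
    return sum(first) / len(first), sum(second) / len(second)


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# One (historical, future) FBDL pair per reduced test suite, made-up data.
histories = [
    [True, False, False, False],
    [True, True, False, True],
    [False, False, False, True],
]
pairs = [split_half_fbdl(h) for h in histories]
historical, future = zip(*pairs)
print(pearson(historical, future))
```

A coefficient near 1 would mean historical FBDL is a good predictor of future FBDL; a value near 0 would mean it tells developers little.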

[Plot: Coverage-based GREEDY TSR]

Slide20

Conclusions about TSR

                     Traditional Evaluation    Our Evaluation
Software Change      Mutants                   Real evolution
# "Faults"           One at a time             Multiple faults
Evaluation Metric    Mutant-loss               FBDL (new)
Finding              TSR is effective          TSR is not effective, and is risky/unpredictable

Research in TSR should evaluate with FBDL and improve FBDL, or regression testing research should focus on techniques besides TSR

awshi2@illinois.edu

Slide21

BACKUP

Slide22

Lessons Learned

Parsing logs is difficult (Section 4.4)
    Getting test failures, matching test names
Matching test names across different versions and between tools is challenging (Section 4.4)
Rebuilding old versions of code is challenging
    Even reproducing passed builds is difficult! (Section 4.2)

Slide23

RQ3: Distance from Reduction Point

Correlated distance from reduction point (in units of # builds) with FBDL
Created 10 bins with roughly equal numbers of builds and measured FBDL for each bin (based on the number of failed builds in it)
Found no correlation between FBDL and distance

Slide24

Threats to Validity

Generalization of results
    Evaluated four different algorithms, two different test-requirements
    Beyond these, should still evaluate with FBDL for better measurement of effectiveness
Projects are Java, Maven-based from GitHub/Travis, as many as work with our tools
Choice of reduction points provides diverse distances from failed builds to points
Inherently assumes no test-order dependencies
Inherently assumes build history does not change in simulation of TSR

Slide25

Related Work

Our own prior work explored TSR in evolution
Evaluation only looked at differences in test-requirements loss across different versions, suggesting loss does not change across versions

Slide26

DELETED

Slide27

How good are reduced test suites?

How effective at detecting the same faults the original test suite detects in future versions of code?
    Real evolution, real test failures
Can we predict effectiveness of reduced test suite?
    Even if a reduced test suite may not be effective, can we at least predict when it can be useful?
    How good are traditional TSR metrics at predicting fault-detection effectiveness in future?

Slide28

Failure-to-Fault Map

Given test failures, what faults do they detect?
How to compute the set […]?
[…] is a family of mappings from test failures to faults they detect
Mappings based on heuristics for how likely certain groupings of failed tests are due to the same fault
Requires classifying failed tests and failed builds

Slide29

Classifying Failed Tests

Failed tests in future builds classified w.r.t. reduced test suite computed at prior reduction point
KEPT: Failed test exists in reduced test suite
REMOVED: Failed test was removed from reduced test suite
NEW: Failed test did not exist at reduction point

Slide30

Classifying Failed Tests

Failed tests in future builds classified w.r.t. reduced test suite

[Diagram: at V1, the original test suite is test0 through test4 and the reduced test suite is a subset of it; in V2 (failed), the tests test0 through test5 are labeled KEPT, REMOVED, or NEW]

Slide31

Classifying Failed Builds

Failed builds are classified based on failed tests
DEFMISS: All failed tests are REMOVED
HIT: No failed tests are REMOVED, at least one KEPT
NEWONLY: All failed tests are NEW
SAMECL: All REMOVED tests are in same class as KEPT/NEW ones
SAMEPACK: All REMOVED tests are in same package as KEPT/NEW ones
LIKELYMISS: All remaining failed builds

Slide32

Basic TSR Stats

Slide33

RQ1: FBDL of TSR

Higher losses than suggested by traditional test-requirements loss

Mut Loss: 2.7%
Cov Loss: 2.2%

Slide34

RQ1: FBDL of TSR

FBDL for Greedy, coverage-based TSR

On average, 9.5% for […] and 52.2% for […], which is much higher than suggested by mutant loss