Evaluating Test-Suite Reduction in Real Software Evolution

August Shi, Alex Gyori, Suleman Mahmood, Peiyuan Zhao, Darko Marinov
ISSTA 2018, Amsterdam, Netherlands, July 16, 2018
Grants CCF-1409423, CCF-1421503, CNS-1646305
Regression Testing

[Diagram: test1 through testN run against Code Under Test V1, V2, V3, ...]
Many Tests x Many Versions = Costly!
Test-Suite Reduction (TSR)

[Diagram: at V1, TSR reduces test1 through test6 to a kept subset, labeled with the kept size; only the kept tests run on V2, V3, ...]
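The reduction step above can be sketched with the classic greedy set-cover heuristic, one of the four algorithms the talk evaluates. The test names and statement-coverage sets below are hypothetical, and this is a minimal illustration rather than the paper's implementation:

```python
def greedy_reduce(coverage):
    """Greedy test-suite reduction: repeatedly keep the test that
    covers the most not-yet-covered test requirements."""
    uncovered = set().union(*coverage.values())
    kept = []
    while uncovered:
        # Pick the test covering the most remaining requirements.
        best = max(coverage, key=lambda t: len(coverage[t] & uncovered))
        kept.append(best)
        uncovered -= coverage[best]
    return kept

suite = {
    "test1": {"s1", "s2"},
    "test2": {"s2", "s3"},
    "test3": {"s1", "s2", "s3"},  # subsumes test1 and test2
    "test4": {"s4"},
}
print(greedy_reduce(suite))  # ['test3', 'test4']
```

The reduced suite preserves all coverage of the original, so by the traditional metric nothing is lost, which is exactly the assumption this talk questions.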
Traditional Evaluation

[Diagram: coverage-based TSR reduces test1 through test6 on V1 with 0% coverage loss; mutants 1 through 3 of V1 are then used to measure mutant loss of the reduced suite]
All measured on a single version!
How effective are reduced test suites at detecting faults in real software evolution?
Developer Usage of TSR

[Diagram: the suite reduced at V1 runs on V2 and V3. A removed test's failure on a later version is a missed test failure. When a kept test's failure makes the build fail, does it detect the same fault? If not, the build is a miss-build.]
Failed-Build Detection Loss (FBDL)

Given F, the set of future failed builds, and M ⊆ F, the set of failed builds where the reduced test suite does not detect the same faults the original test suite detects (miss-builds):

FBDL = |M| / |F|

FBDL is based on what developers expect from using the reduced test suite in regression testing.
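A direct reading of the definition, as a sketch; the build identifiers are hypothetical:

```python
def fbdl(failed_builds, miss_builds):
    """FBDL = |M| / |F|: the fraction of future failed builds where the
    reduced test suite misses faults the original suite detects."""
    if not failed_builds:
        return 0.0
    assert miss_builds <= failed_builds  # miss-builds are a subset of failed builds
    return len(miss_builds) / len(failed_builds)

F = {"build3", "build5", "build6", "build7"}  # future failed builds
M = {"build5"}                                # miss-builds among them
print(fbdl(F, M))  # 0.25
```

The hard part, which the next slides address, is deciding which failed builds belong in M.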
FBDL Example

[Diagram: versions V1 through V7 with two reduction points; the first reduction point's FBDL is computed over failed builds V3 through V7, the second's over V6 and V7]
Miss-builds

When is a failed build a miss-build? Can developers find the same faults using a subset of test failures?
Define miss-builds using heuristics for how likely groupings of failed tests are due to the same fault
Parse test failures from build logs
Classifying Failed Builds

[Diagram: a reduction point at V1 keeps a subset of test1 through test6; failed builds at later versions are classified]
HIT: all failed tests are in the reduced test suite
DEFMISS: no failed test is in the reduced test suite
NEWONLY: only new tests (e.g., a test7 added later) failed
Classifying miss-builds for FBDL

The miss-builds are defined based on the classifications; developers decide which FBDL makes sense:
Failures map to same fault: DEFMISS builds
Failures in same package map to same fault: DEFMISS and LIKELYMISS builds
Failures in same class map to same fault: DEFMISS, LIKELYMISS, SAMEPACK builds
Failures map to a unique fault each: all builds except HIT and NEWONLY builds
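The four variants amount to a lookup from fault-mapping assumption to the classifications counted as miss-builds. This sketch uses the classification and FBDL names from the talk; the build data is hypothetical:

```python
# Classifications counted as miss-builds under each fault-mapping assumption.
MISS_CLASSES = {
    "FBDL_S": {"DEFMISS"},                                      # failures map to same fault
    "FBDL_P": {"DEFMISS", "LIKELYMISS"},                        # same package -> same fault
    "FBDL_C": {"DEFMISS", "LIKELYMISS", "SAMEPACK"},            # same class -> same fault
    "FBDL_U": {"DEFMISS", "LIKELYMISS", "SAMEPACK", "SAMECL"},  # unique faults (all but HIT/NEWONLY)
}

def count_miss_builds(classified_builds, variant):
    """classified_builds maps build id -> classification."""
    return sum(c in MISS_CLASSES[variant] for c in classified_builds.values())

builds = {"b1": "HIT", "b2": "DEFMISS", "b3": "SAMECL", "b4": "NEWONLY"}
print(count_miss_builds(builds, "FBDL_S"))  # 1
print(count_miss_builds(builds, "FBDL_U"))  # 2
```

FBDL_S is the most lenient assumption (fewest miss-builds) and FBDL_U the strictest, so the four variants bracket the range of losses a developer could experience.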
Classifying Failed Builds

[Diagram: same setup, with failed tests qualified by test class (AT, BT, CT); failed builds at V2 through V7 are classified]
SAMECL: failed tests not in the reduced test suite are in the same test class as those kept
SAMEPKG: likewise, for tests in the same package
LIKELYMISS: all remaining instances
Research Questions

RQ1: What is the FBDL of TSR?
RQ2: How well can the FBDL of TSR be predicted?
RQ3: How does distance from the TSR reduction point affect FBDL? (see paper)
Evaluation

Failed builds and test failures taken from Travis CI
32 projects, 1478 failed builds, 321 reduction points
Four TSR algorithms (GREEDY, GE, GRE, HGS)
Two types of test requirements (coverage, mutants)
Coverage-based TSR: 51.9% kept, 2.7% mutant loss
Mutant-based TSR: 61.1% kept, 2.2% coverage loss
Computed reductions match prior work
RQ1: FBDL of TSR

Higher losses than suggested by traditional test-requirements loss (Mut-Loss, Cov-Loss)
FBDL_S: failed tests map to same fault
FBDL_P: same-package failed tests map to same fault
FBDL_C: same-class failed tests map to same fault
FBDL_U: failed tests each map to a unique fault
RQ2: Prediction of FBDL (Size)

How well does reduced test-suite size predict FBDL?
Intuition: the greater the kept percentage, the smaller the FBDL (keeping more failed tests)
[Plot: coverage-based GREEDY TSR]
RQ2: Prediction of FBDL (Requirements)

How well does test-requirement loss predict FBDL?
Intuition: the larger the test-requirements loss, the worse the FBDL
[Plot: coverage-based GREEDY TSR]
RQ2: Prediction of FBDL (History)

How well does historical FBDL predict future FBDL?
Measure FBDL of a reduced test suite for the first half of failed builds, then correlate with the FBDL of the second half
[Plot: coverage-based GREEDY TSR]
Conclusions about TSR

                    Traditional Evaluation   Our Evaluation
Software Change     Mutants                  Real evolution
# "Faults"          One at a time            Multiple faults
Evaluation Metric   Mutant-loss              FBDL (new)
Finding             TSR is effective         TSR is not effective, and is risky/unpredictable

Research in TSR should evaluate with FBDL and improve FBDL, or regression testing research should focus on techniques besides TSR.

awshi2@illinois.edu
BACKUP
Lessons Learned

Parsing logs is difficult (Section 4.4): getting test failures, matching test names
Matching test names across different versions and between tools is challenging (Section 4.4)
Rebuilding old versions of code is challenging; even reproducing passed builds is difficult! (Section 4.2)
RQ3: Distance from Reduction Point

Correlated distance from the reduction point (in number of builds) with FBDL
Created 10 bins with roughly equal numbers of builds, measured FBDL for each bin (based on the failed builds it contains)
Found no correlation between FBDL and distance
Threats to Validity

Generalization of results: evaluated four algorithms and two types of test requirements; beyond these, others should still be evaluated with FBDL for a better measurement of effectiveness
Projects are Java, Maven-based, from GitHub/Travis (as many as work with our tools)
Choice of reduction points provides diverse distances from reduction points to failed builds
Inherently assumes no test-order dependencies
Inherently assumes build history does not change in the simulation of TSR
Related Work

Our own prior work explored TSR in evolution; its evaluation only looked at differences in test-requirements loss across versions, suggesting the loss does not change across versions
DELETED
How good are reduced test suites?

How effective are they at detecting the same faults the original test suite detects in future versions of the code? (Real evolution, real test failures)
Can we predict the effectiveness of a reduced test suite? Even if a reduced test suite is not effective, can we at least predict when it will be useful?
How good are traditional TSR metrics at predicting future fault-detection effectiveness?
Failure-to-Fault Map

Given test failures, what faults do they detect? To compute the set of miss-builds, use a family of mappings from test failures to the faults they detect
Mappings are based on heuristics for how likely certain groupings of failed tests are due to the same fault
Requires classifying failed tests and failed builds
Classifying Failed Tests

Failed tests in future builds are classified w.r.t. the reduced test suite computed at the prior reduction point:
KEPT: failed test exists in the reduced test suite
REMOVED: failed test was removed from the reduced test suite
NEW: failed test did not exist at the reduction point
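A sketch of this per-test classification; the test names are hypothetical:

```python
def classify_failed_test(test, reduced_suite, original_suite):
    """Classify a failed test w.r.t. the reduced suite computed at the
    reduction point: KEPT, REMOVED, or NEW."""
    if test in reduced_suite:
        return "KEPT"
    if test in original_suite:
        return "REMOVED"  # existed at the reduction point but was reduced away
    return "NEW"          # did not exist at the reduction point

original = {"test0", "test1", "test2", "test3", "test4"}
reduced = {"test0", "test1"}
print(classify_failed_test("test1", reduced, original))  # KEPT
print(classify_failed_test("test3", reduced, original))  # REMOVED
print(classify_failed_test("test5", reduced, original))  # NEW
```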
Classifying Failed Tests

[Diagram: the original suite at V1 has test0 through test4, and the reduced suite keeps test0 and test1; at failed build V2, test0 and test1 are KEPT, test2 through test4 are REMOVED, and a newly added test5 is NEW]
Classifying Failed Builds

Failed builds are classified based on their failed tests:
DEFMISS: all failed tests are REMOVED
HIT: no failed tests are REMOVED, at least one is KEPT
NEWONLY: all failed tests are NEW
SAMECL: all REMOVED tests are in the same class as KEPT/NEW ones
SAMEPACK: all REMOVED tests are in the same package as KEPT/NEW ones
LIKELYMISS: all remaining failed builds
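These rules can be sketched as follows. The 'package.Class.testName' naming scheme and the example tests are assumptions made for illustration, not the paper's actual implementation:

```python
def classify_build(failed_tests, reduced_suite, original_suite):
    """Classify a failed build from its (set of) failed tests, per the
    rules above. Test ids are assumed to look like 'package.Class.testName'."""
    def status(t):
        if t in reduced_suite:
            return "KEPT"
        return "REMOVED" if t in original_suite else "NEW"

    clazz = lambda t: t.rsplit(".", 1)[0]  # 'package.Class'
    pkg = lambda t: t.rsplit(".", 2)[0]    # 'package'

    removed = {t for t in failed_tests if status(t) == "REMOVED"}
    seen = failed_tests - removed          # failed tests that are KEPT or NEW
    kept = {t for t in seen if status(t) == "KEPT"}

    if not removed:
        return "HIT" if kept else "NEWONLY"
    if not seen:
        return "DEFMISS"
    if all(any(clazz(r) == clazz(s) for s in seen) for r in removed):
        return "SAMECL"
    if all(any(pkg(r) == pkg(s) for s in seen) for r in removed):
        return "SAMEPACK"
    return "LIKELYMISS"

original = {"p.AT.test1", "p.AT.test2", "p.BT.test3", "q.CT.test4"}
reduced = {"p.AT.test1"}
print(classify_build({"p.AT.test1"}, reduced, original))                # HIT
print(classify_build({"p.AT.test2"}, reduced, original))                # DEFMISS
print(classify_build({"p.AT.test1", "p.AT.test2"}, reduced, original))  # SAMECL
print(classify_build({"p.AT.test1", "p.BT.test3"}, reduced, original))  # SAMEPACK
print(classify_build({"p.AT.test1", "q.CT.test4"}, reduced, original))  # LIKELYMISS
```

The ordering of the checks matters: SAMECL implies SAMEPACK, so the tighter class-level grouping is tested first.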
Basic TSR Stats

[Table of per-project reduction statistics]
RQ1: FBDL of TSR

Higher losses than suggested by traditional test-requirements loss
Mut-Loss: 2.7%
Cov-Loss: 2.2%
RQ1: FBDL of TSR

FBDL for GREEDY, coverage-based TSR
On average, 9.5% for FBDL_S and 52.2% for FBDL_U, much higher than suggested by mutant loss