
Journal of Machine Learning Research (2008) 2677-2694. Submitted 10/07; Revised 4/08; Published 12/08.

An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons

Salvador García, Francisco Herrera
Department of Computer Science and Artificial Intelligence, University of Granada, Granada, 18071, Spain

Editor: John Shawe-Taylor

©2008 Salvador García and Francisco Herrera.

Abstract

In a recently published paper in JMLR, Demšar (2006) recommends a set of non-parametric statistical tests and procedures which can be safely used for comparing the performance of classifiers over multiple data sets. After studying the paper, we realize that it correctly introduces the basic procedures and some of the most advanced ones when comparing a control method. However, it does not deal with some advanced topics in depth. Regarding these topics, we focus on more powerful proposals of statistical procedures for comparing classifiers. Moreover, we illustrate an easy way of obtaining adjusted and comparable p-values in multiple comparison procedures.

Keywords: statistical methods, non-parametric test, multiple comparisons tests, adjusted p-values, logically related hypotheses

1. Introduction

In the Machine Learning (ML) scientific community there is a need for rigorous and correct statistical analysis of published results, due to the fact that the development or modification of algorithms is a relatively easy task. The main inconvenience related to this necessity is to understand and study the statistics and to know the exact techniques which can or cannot be applied depending on the situation, that is, on the type of results obtained.

In a recently published paper in JMLR by Demšar (2006), a group of useful guidelines are given in order to perform a correct analysis when we compare a set of classifiers over multiple data sets. Demšar recommends a set of non-parametric statistical techniques (Zar, 1999; Sheskin, 2003) for comparing classifiers under these circumstances, given that the sample of results obtained by them does not fulfill the required conditions and is not large enough for making a parametric statistical analysis. He analyzed the behavior of the proposed statistics on classification tasks and he checked that they are more convenient than parametric techniques.

Recent studies apply the guidelines given by Demšar in the analysis of performance of classifiers (Esmeir and Markovitch, 2007; Marrocco et al., 2008). In them, a new proposal or methodology is offered and it is compared with other methods by means of pairwise comparisons. Another type of study assumes an empirical comparison or review of already proposed methods. In these cases, no proposal is offered and a statistical comparison could be very useful in determining the differences among the methods. In the specialized literature, many papers provide reviews on a specific topic and they also use statistical methodology to perform comparisons. For example, in a review of


ensembles of decision trees, non-parametric tests are also applied in the analysis of performance (Banfield et al., 2007). However, only the rankings computed by the Friedman method (Friedman, 1937) are stipulated, and the authors establish comparisons based on them without taking into account significance levels.

Demšar focused his work on the analysis of new proposals, and he introduced the Nemenyi test for making all pairwise comparisons (Nemenyi, 1963). Nevertheless, the Nemenyi test is very conservative and it may not find any difference in most of the experimentations. In recent papers, the authors have used the Nemenyi test in multiple comparisons. Due to the fact that this test possesses low power, authors have to employ many data sets (Yang et al., 2007b) or most of the differences found are not significant (Yang et al., 2007a; Núñez et al., 2007). Although the employment of many data sets could seem beneficial in order to improve the generalization of results, in some specific domains, such as imbalanced classification (Owen, 2007) or multi-instance classification (Murray et al., 2005), data sets are difficult to find. Procedures with more power than the Nemenyi one can be found in the specialized literature. We have based this work on the necessity of applying more powerful procedures in empirical studies in which no new method is proposed and the benefit consists of obtaining more statistical differences among the classifiers compared. Thus, in this paper we describe these procedures and we analyze their behavior by means of the analysis of multiple repetitions of experiments with randomly selected data sets.

On the other hand, we can see other works in which the p-value associated to a comparison between two classifiers is reported (García-Pedrajas and Fyfe, 2007). Classical non-parametric tests, such as Wilcoxon and Friedman (Sheskin, 2003), are incorporated in most statistical packages (SPSS, SAS, R, etc.) and the computation of the final p-value is usually implemented. However, advanced procedures such as Holm (1979), Hochberg (1988), Hommel (1988) and the ones described in this paper are usually not incorporated in statistical packages. The computation of the correct p-value, or Adjusted P-Value (APV) (Westfall and Young, 2004), in a comparison using any of these procedures is not very difficult and, in this paper, we show how to include it with an illustrative example.

The paper is set up as follows. Section 2 presents more powerful procedures for comparing all the classifiers among them in a comparison of multiple classifiers, together with a case study. In Section 3 we describe the procedures for obtaining the APV by considering the post-hoc procedures explained by Demšar and the ones explained in this paper. In Section 4, we perform an experimental study of the behavior of the statistical procedures and we discuss the results obtained. Finally, Section 5 concludes the paper.

2. Comparison of Multiple Classifiers: Performing All Pairwise Comparisons

In the paper by Demšar (2006), referring to carrying out comparisons of more than two classifiers, a set of useful guidelines were given for detecting significant differences among the results obtained, and post-hoc procedures for identifying these differences. The Friedman test is an omnibus test which can be used to carry out these types of comparison. It allows us to detect differences considering the global set of classifiers. Once the Friedman test rejects the null hypothesis, we can proceed with a post-hoc test in order to find the concrete pairwise comparisons which produce differences. Demšar described the use of the Nemenyi test when all classifiers are compared with each other. Then, he focused on procedures that control the family-wise error when comparing with a control classifier, arguing that the objective of a study is to test whether a newly proposed method is better than the existing ones. For this reason, he described and studied in depth more powerful and sophisticated procedures derived from Bonferroni-Dunn, such as Holm's, Hochberg's and Hommel's methods. Nevertheless, we think that performing all pairwise comparisons in an experimental analysis may be useful and interesting in cases other than proposing a new method. For example, it would be interesting to conduct a statistical analysis over multiple classifiers in review works in which no method is proposed. In this case, the repetition of comparisons choosing different control classifiers may lose the control of the family-wise error. Our intention in this section is to give a detailed description of more powerful and advanced procedures derived from the Nemenyi test and to show a case study that uses these procedures.

2.1 Advanced Procedures for Performing All Pairwise Comparisons

A set of pairwise comparisons can be associated with a set or family of hypotheses. Any of the post-hoc tests which can be applied to non-parametric tests (that is, those derived from the Bonferroni correction or similar procedures) work over a family of hypotheses. As Demšar explained, the test statistic for comparing the i-th and j-th classifier is

z = (R_i − R_j) / sqrt( k(k + 1) / (6N) ),

where R_i is the average rank computed through the Friedman test for the i-th classifier, k is the number of classifiers to be compared, and N is the number of data sets used in the comparison. The z value is used to find the corresponding probability (p-value) from the table of the normal distribution, which is then compared with an appropriate level of significance α (Table A1 in Sheskin, 2003). The basic procedures are:

- Nemenyi (1963) procedure: it adjusts the value of α in a single step by dividing it by the number of comparisons performed, m = k(k − 1)/2. This procedure is the simplest, but it also has little power.
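As an illustration, the z statistic and the single-step Nemenyi correction can be sketched in a few lines (a minimal sketch; the function and variable names are ours):

```python
import math

def z_statistic(r_i, r_j, k, n_datasets):
    """z statistic comparing two classifiers from their Friedman average ranks."""
    se = math.sqrt(k * (k + 1) / (6.0 * n_datasets))
    return (r_i - r_j) / se

def nemenyi_alpha(alpha, k):
    """Single-step Nemenyi adjustment: divide alpha by m = k(k-1)/2 comparisons."""
    return alpha / (k * (k - 1) // 2)

# Average ranks taken from the case study of Section 2.2 (k = 5, N = 30):
z = z_statistic(2.100, 4.333, k=5, n_datasets=30)   # C4.5 vs. Kernel
```

With the case-study ranks this gives |z| ≈ 5.47, the value reported for C4.5 vs. Kernel in Table 3.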

- Holm (1979) procedure: it was also described in Demšar (2006), but there it was used for comparisons of multiple classifiers involving a control method. It adjusts the value of α in a step-down manner. Let p_1, ..., p_m be the ordered p-values (smallest to largest) and H_1, ..., H_m be the corresponding hypotheses. The Holm procedure rejects H_1 to H_{i−1} if i is the smallest integer such that p_i > α/(m − i + 1). Other alternatives were developed by Hochberg (1988), Hommel (1988) and Rom (1990). They are easy to perform, but they often have similar power to the Holm procedure (they have more power than the Holm procedure, but the difference between them is not very notable) when considering all pairwise comparisons.

The hypotheses being tested belonging to a family of all pairwise comparisons are logically interrelated, so that not all combinations of true and false hypotheses are possible. As a simple example of such a situation, suppose that we want to test the three hypotheses of pairwise equality associated with the pairwise comparisons of three classifiers. It is easily seen from the relations among the hypotheses that if any one of them is false, at least one other must be false. For example, if classifier 1 is better/worse than classifier 2, then it is not possible that classifier 1 has the same performance as classifier 3 and that classifier 2 has the same performance as classifier 3: classifier 3 must be better/worse than classifier 1, or classifier 2, or than the two classifiers at the same time. Thus, there cannot be one false and two true hypotheses among these three.
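Holm's step-down rule described above can be sketched as follows (a minimal sketch with our own naming):

```python
def holm_rejections(p_sorted, alpha=0.05):
    """Step-down Holm procedure over ordered p-values.

    Walks the family H_1..H_m in increasing p-value order and stops at the
    first i with p_i > alpha/(m - i + 1); returns the number of rejections."""
    m = len(p_sorted)
    rejected = 0
    for i, p in enumerate(p_sorted, start=1):
        if p > alpha / (m - i + 1):
            break
        rejected += 1
    return rejected

# illustrative ordered p-values, not taken from the paper
print(holm_rejections([0.001, 0.005, 0.011, 0.02, 0.3, 0.7]))
```

Applied to the ten ordered p-values of the case study in Section 2.2 (Table 3), this rule yields five rejections, matching the text.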


Based on this argument, Shaffer proposed two procedures which make use of the logical relations among the family of hypotheses for adjusting the value of α (Shaffer, 1986).

- Shaffer's static procedure: following Holm's step-down method, at stage i, instead of rejecting H_i if p_i ≤ α/(m − i + 1), reject H_i if p_i ≤ α/t_i, where t_i is the maximum number of hypotheses which can be true given that any (i − 1) hypotheses are false. It is a static procedure, that is, t_1, ..., t_m are fully determined for the given hypotheses H_1, ..., H_m, independently of the observed p-values. The possible numbers of true hypotheses, and thus the values of t_i, can be obtained from the recursive formula

S(k) = ∪_{j=1,...,k} { C(j,2) + x : x ∈ S(k − j) },

where S(k) is the set of possible numbers of true hypotheses with k classifiers being compared, k ≥ 2, and S(0) = S(1) = {0}.

- Shaffer's dynamic procedure: it increases the power of the first by substituting α/t_i at stage i by the value α/t*_i, where t*_i is the maximum number of hypotheses that could be true, given that the previous hypotheses are false. It is a dynamic procedure, since t*_i depends not only on the logical structure of the hypotheses, but also on the hypotheses already rejected at step i. Obviously, this procedure has more power than the first one. In this paper we have not used this second procedure, given that it is included in a more advanced procedure which we will describe in the following.

In Bergmann and Hommel (1988) a procedure was proposed based on the idea of finding all elementary hypotheses which cannot be rejected. In order to formulate the Bergmann-Hommel procedure, we need the following definition.

Definition. An index set of hypotheses I ⊆ {1, ..., m} is called exhaustive if exactly all H_j, j ∈ I, could be true.

In order to exemplify the previous definition, we will consider the following case: we have three classifiers and we will compare them in a comparison. We will obtain three hypotheses:

- H_1: classifier 1 is equal in behavior to classifier 2.
- H_2: classifier 1 is equal in behavior to classifier 3.
- H_3: classifier 2 is equal in behavior to classifier 3.

and eight possible sets:

1. All are true.
2. H_1 and H_2 are true and H_3 is false.
3. H_1 and H_3 are true and H_2 is false.
4. H_2 and H_3 are true and H_1 is false.
5. H_1 is true and H_2 and H_3 are false.
6. H_2 is true and H_1 and H_3 are false.
7. H_3 is true and H_1 and H_2 are false.
8. All are false.

Sets 1, 5, 6, 7 and 8 are possible, because their hypotheses can be true at the same time, so they are exhaustive sets. Set 2, basing on logically related hypotheses principles, is not possible, because the performance of classifier 1 cannot be equal to those of classifiers 2 and 3 while classifier 2 has a different performance from classifier 3. The same consideration applies to sets 3 and 4, which are not exhaustive sets. Under this definition, the procedure works as follows.

- Bergmann and Hommel (1988) procedure: reject all H_j with j not in A, where the acceptance set

A = ∪ { I : I exhaustive, min{ p_i : i ∈ I } > α / |I| }

is the index set of null hypotheses which are retained.

For this procedure, one has to check for each subset I of {1, ..., m} whether I is exhaustive, which leads to intensive computation. Due to this fact, we will obtain a set, named E, which will contain all the possible exhaustive sets of hypotheses for a certain comparison. A rapid algorithm described in Hommel and Bernhard (1994) allows a substantial reduction in computing time. Once the set E is obtained, the hypotheses that do not belong to the A set are rejected. Figure 1 shows a valid algorithm for obtaining all the exhaustive sets of hypotheses, using as input a list of classifiers C. E is a set of families of hypotheses; likewise, a family of hypotheses is a set of hypotheses.

The most important step in the algorithm is number 6. It performs a division of the classifiers into two subsets, in which the last classifier is always inserted in the second subset and the first subset cannot be empty. In this way, we ensure that no subset yielded in a division is empty and that no repetitions are produced. For example, suppose a set with three classifiers {1, 2, 3}. All possible divisions, without taking into account the previous assumptions, are: (1) ({}, {1,2,3}); (2) ({1}, {2,3}); (3) ({2}, {1,3}); (4) ({3}, {1,2}); (5) ({1,2}, {3}); (6) ({1,3}, {2}); (7) ({2,3}, {1}); (8) ({1,2,3}, {}). Divisions 2 and 7, 3 and 6, and 4 and 5 are equivalent, respectively. Furthermore, divisions 1 and 8 are not interesting. Using the assumptions in step 6 of the algorithm, the possible divisions are: ({1}, {2,3}); ({2}, {1,3}); ({1,2}, {3}). In this case, all the divisions are interesting and no repetitions are yielded.

The computational complexity of the algorithm for obtaining the exhaustive sets grows very quickly with the number of classifiers (see Table 1). However, the computation requirements may be reduced by means of storage capabilities: the exhaustive sets for fewer classifiers can be stored in memory, so there is no necessity of invoking the obtainExhaustive function recursively. Even using storage capabilities, the algorithm still requires intensive computation.
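Shaffer's recursive formula for the possible numbers of true hypotheses, and the resulting t_i values, can be sketched as follows (a minimal sketch; the function names are ours):

```python
def true_hypothesis_counts(k):
    """S(k): possible numbers of simultaneously true pairwise-equality
    hypotheses among k classifiers; S(0) = S(1) = {0}."""
    if k <= 1:
        return {0}
    counts = set()
    for j in range(1, k + 1):
        # a group of j classifiers of equal performance contributes C(j,2) true hypotheses
        for x in true_hypothesis_counts(k - j):
            counts.add(j * (j - 1) // 2 + x)
    return counts

def shaffer_t(k):
    """t_i for i = 1..m: the largest possible number of true hypotheses
    given that any (i - 1) hypotheses are false."""
    m = k * (k - 1) // 2
    s = sorted(true_hypothesis_counts(k))
    return [max(c for c in s if c <= m - i + 1) for i in range(1, m + 1)]
```

For k = 5 this gives S(5) = {0, 1, 2, 3, 4, 6, 10} and t = [10, 6, 6, 6, 6, 4, 4, 3, 2, 1], which reproduces the Shaffer column (α/t_i with α = 0.05) of Table 3 in the case study below.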


An example illustrating the algorithm for obtaining all exhaustive sets is drawn in Figure 2. In it, four classifiers, enumerated from 1 to 4 in the set C, are used. The comparisons or hypotheses are denoted by pairs of numbers without a separation character between them. This illustration does not show the case in which |C| < 2, for simplifying the representation. When |C| < 2, no comparisons can be performed, so the obtainExhaustive function returns an empty set. An edge connecting two boxes represents an invocation of this function. In each box, the list of classifiers given as input and the first initialization of the set E are displayed. The main edges, whose starting point is the initial box, are labeled by the order of invocation. Below the graph, the resulting subset of E in each main edge is denoted. The final E will be composed by the union of these subsets. At the end of the process, 14 distinct exhaustive sets are found:

{12}; {13}; {14}; {23}; {24}; {34}; {12,34}; {13,24}; {14,23}; {12,13,23}; {12,14,24}; {13,14,34}; {23,24,34}; {12,13,14,23,24,34}.

Table 1 gives the number of hypotheses m, the number 2^m of index sets and the number of exhaustive index sets for k classifiers being compared.

Function obtainExhaustive(C = {C_1, ..., C_q}: list of classifiers)
 1. Let E := {}
 2. E := E ∪ {set of all possible and distinct pairwise comparisons using C}
 3. If |C| == 2
 4.   Return E
 5. End if
 6. For all possible divisions of C into two subsets C1 and C2, with C_q ∈ C2 and C1 ≠ {}
 7.   E1 := obtainExhaustive(C1)
 8.   E2 := obtainExhaustive(C2)
 9.   E := E ∪ E1
10.   E := E ∪ E2
11.   For each family of hypotheses F1 of E1
12.     For each family of hypotheses F2 of E2
13.       E := E ∪ {F1 ∪ F2}
14.     End for
15.   End for
16. End for
17. Return E

Figure 1: Algorithm for obtaining all exhaustive sets.

[Figure 2: Example of obtaining the exhaustive sets of hypotheses considering 4 classifiers.]

| k | m | 2^m | Exhaustive sets |
|---|---|-----|-----------------|
| 4 | 6 | 64 | 14 |
| 5 | 10 | 1024 | 51 |
| 6 | 15 | 32768 | 202 |
| 7 | 21 | 2097152 | 876 |
| 8 | 28 | ≈2.7·10^8 | 4139 |
| 9 | 36 | ≈6.9·10^10 | 21146 |

Table 1: All pairwise comparisons of k classifiers.

The following subsection presents a case study of a comparison of some well-known classifiers over thirty data sets. In it, the four procedures explained above are employed.

2.2 Performing All Pairwise Comparisons: A Case Study

In the following, we show an example involving the four procedures described, with a comparison of 5 classifiers: C4.5 (Quinlan, 1993); One Nearest Neighbor (1-NN) with Euclidean distance; NaiveBayes; Kernel (McLachlan, 2004) and, finally, CN2 (Clark and Niblett, 1989). The parameters used are specified in Section 4. We have used ten-fold cross-validation and standard parameters for each algorithm. The results correspond to average accuracy or class error in test data. We have used 30 data sets. Table 2 shows the overall process of computation of the average rankings.

The Friedman (1937) and Iman and Davenport (1980) tests check whether the measured average ranks are significantly different from the mean rank 3. They respectively use the χ² and the F statistical distributions to determine if a distribution of observed frequencies differs from the theoretical expected frequencies. Their statistics use nominal (categorical) or ordinal-level data, instead

of using means and variances. Demšar (2006) detailed the computation of the critical values in each distribution. In this case, the critical values are 9.488 and 2.45, respectively, at α = 0.05, and the Friedman and Iman-Davenport statistics are 39.647 and 14.309, respectively. Due to the fact that the critical values are lower than the respective statistics, we can proceed with the post-hoc tests in order to detect significant pairwise differences among all the classifiers. For this, we have to compute and order the corresponding z statistics and p-values. The standard error in the pairwise comparison between two classifiers is SE = sqrt(k(k + 1)/(6N)) = sqrt(5 · 6 / (6 · 30)) = 0.408.

Table 3 presents the family of hypotheses ordered by their p-value and the adjustment of α by Nemenyi's, Holm's and Shaffer's static procedures. The Nemenyi test rejects the hypotheses [1–4], since the corresponding p-values are smaller than the adjusted α's. The Holm procedure rejects the hypotheses [1–5]. Shaffer's static procedure rejects the hypotheses [1–6]. The Bergmann-Hommel dynamic procedure first obtains the exhaustive index sets of hypotheses; it obtains 51 index sets, which can be seen in Table 4. From the index sets, it computes the acceptance set A, and it rejects all hypotheses H_j with j not in A, so it rejects the hypotheses [1–8]. The Bergmann-Hommel dynamic procedure thus allows us to clearly distinguish three groups of classifiers, attending to their performance:

- Best classifiers: C4.5 and NaiveBayes.
- Middle classifiers: 1-NN and CN2.
- Worst classifier: Kernel.

1. The Kernel method is a bayesian classifier which employs a non-parametric estimation of density functions through a gaussian kernel function. The adjustment of the covariance matrix is performed by an ad-hoc method.
2. NaiveBayes and CN2 are classifiers for discrete domains, so we have discretized the data prior to learning with them. The discretizer algorithm is that of Fayyad and Irani (1993).
3. Data sets marked with '*' have been subsampled to adapt them to slow algorithms, such as CN2.
4. We have considered that each classifier follows the order: C4.5, 1-NN, NaiveBayes, Kernel, CN2. For example, the hypothesis 13 represents the comparison between C4.5 and NaiveBayes.
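The algorithm of Figure 1 can be implemented directly; a minimal Python sketch (our own naming) that reproduces the counts of Table 1 for small k:

```python
from itertools import combinations

def obtain_exhaustive(classifiers):
    """All exhaustive sets of pairwise hypotheses for the given classifiers,
    following the recursive division scheme of Figure 1."""
    c = tuple(classifiers)
    if len(c) < 2:
        return set()                      # no comparisons possible
    full = frozenset(frozenset(p) for p in combinations(c, 2))
    e = {full}
    if len(c) == 2:
        return e
    rest = c[:-1]
    # divisions: the first subset is any non-empty subset of all classifiers
    # but the last; the last classifier always goes to the second subset
    for r in range(1, len(rest) + 1):
        for first in combinations(rest, r):
            second = tuple(x for x in c if x not in first)
            e1 = obtain_exhaustive(first)
            e2 = obtain_exhaustive(second)
            e |= e1 | e2
            e |= {f1 | f2 for f1 in e1 for f2 in e2}
    return e
```

Here `len(obtain_exhaustive(range(4)))` is 14 and `len(obtain_exhaustive(range(5)))` is 51, matching Table 1 and the 51 index sets used in the case study.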


| Data set | C4.5 | 1-NN | NaiveBayes | Kernel | CN2 |
|---|---|---|---|---|---|
| Abalone* | 0.219 (3) | 0.202 (4) | 0.249 (2) | 0.165 (5) | 0.261 (1) |
| Adult* | 0.803 (2) | 0.750 (4) | 0.813 (1) | 0.692 (5) | 0.798 (3) |
| Australian | 0.859 (1) | 0.814 (4) | 0.845 (2) | 0.542 (5) | 0.816 (3) |
| Autos | 0.809 (1) | 0.774 (3) | 0.673 (4) | 0.275 (5) | 0.785 (2) |
| Balance | 0.768 (3) | 0.790 (2) | 0.727 (4) | 0.872 (1) | 0.706 (5) |
| Breast | 0.759 (1) | 0.654 (5) | 0.734 (2) | 0.703 (4) | 0.714 (3) |
| Bupa | 0.693 (1) | 0.611 (3) | 0.572 (4.5) | 0.689 (2) | 0.572 (4.5) |
| Car | 0.915 (1) | 0.857 (3) | 0.860 (2) | 0.700 (5) | 0.777 (4) |
| Cleveland | 0.544 (2) | 0.531 (4) | 0.558 (1) | 0.439 (5) | 0.541 (3) |
| Crx | 0.855 (2) | 0.796 (4) | 0.857 (1) | 0.607 (5) | 0.809 (3) |
| Dermatology | 0.945 (3) | 0.954 (2) | 0.978 (1) | 0.541 (5) | 0.858 (4) |
| German | 0.725 (2) | 0.705 (4) | 0.739 (1) | 0.625 (5) | 0.717 (3) |
| Glass | 0.674 (4) | 0.736 (1) | 0.721 (2) | 0.356 (5) | 0.704 (3) |
| Hayes-Roth | 0.801 (1) | 0.357 (4) | 0.520 (2.5) | 0.309 (5) | 0.520 (2.5) |
| Heart | 0.785 (2) | 0.770 (3) | 0.841 (1) | 0.659 (5) | 0.759 (4) |
| Ion | 0.906 (2) | 0.359 (5) | 0.895 (3) | 0.641 (4) | 0.918 (1) |
| Led7Digit | 0.710 (2) | 0.402 (4) | 0.728 (1) | 0.120 (5) | 0.674 (3) |
| Letter* | 0.691 (2) | 0.827 (1) | 0.667 (3) | 0.527 (5) | 0.638 (4) |
| Lymphography | 0.743 (3) | 0.739 (4) | 0.830 (1) | 0.549 (5) | 0.746 (2) |
| Mushrooms* | 0.990 (1.5) | 0.482 (5) | 0.941 (3) | 0.857 (4) | 0.990 (1.5) |
| OptDigits* | 0.867 (3) | 0.098 (5) | 0.915 (2) | 0.986 (1) | 0.784 (4) |
| Satimage* | 0.821 (3) | 0.872 (2) | 0.815 (4) | 0.885 (1) | 0.778 (5) |
| SpamBase* | 0.893 (2) | 0.824 (4) | 0.902 (1) | 0.739 (5) | 0.885 (3) |
| Splice* | 0.799 (2) | 0.655 (4) | 0.925 (1) | 0.517 (5) | 0.755 (3) |
| Tic-tac-toe | 0.845 (1) | 0.731 (2) | 0.693 (4) | 0.653 (5) | 0.704 (3) |
| Vehicle | 0.741 (1) | 0.701 (2) | 0.591 (5) | 0.663 (3) | 0.619 (4) |
| Vowel | 0.799 (2) | 0.994 (1) | 0.603 (4) | 0.269 (5) | 0.621 (3) |
| Wine | 0.949 (4) | 0.955 (2) | 0.989 (1) | 0.770 (5) | 0.954 (3) |
| Yeast | 0.555 (3) | 0.505 (4) | 0.569 (1) | 0.312 (5) | 0.556 (2) |
| Zoo | 0.928 (2.5) | 0.928 (2.5) | 0.945 (1) | 0.419 (5) | 0.897 (4) |
| Average rank | 2.100 | 3.250 | 2.200 | 4.333 | 3.117 |

Table 2: Computation of the rankings for the algorithms considered in the study over 30 data sets, based on test accuracy using ten-fold cross-validation.
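The Friedman and Iman-Davenport statistics quoted above can be recomputed from the average ranks of Table 2 using the standard formulas (a minimal sketch; the function names are ours):

```python
def friedman_stat(avg_ranks, n_datasets):
    """Friedman chi-square statistic computed from the average ranks."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12.0 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)

def iman_davenport_stat(chi2, k, n_datasets):
    """Iman-Davenport F statistic derived from the Friedman statistic."""
    return (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)

ranks = [2.100, 3.250, 2.200, 4.333, 3.117]   # Table 2, k = 5, N = 30
chi2 = friedman_stat(ranks, 30)
ff = iman_davenport_stat(chi2, 5, 30)
```

With the rounded ranks of Table 2 this gives χ² ≈ 39.64 and F ≈ 14.30, matching the reported values 39.647 and 14.309 up to rounding.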


| i | Hypothesis | z | p | Nemenyi α | Holm α | Shaffer α |
|---|---|---|---|---|---|---|
| 1 | C4.5 vs. Kernel | 5.471 | 4.487·10⁻⁸ | 0.005 | 0.005 | 0.005 |
| 2 | NaiveBayes vs. Kernel | 5.226 | 1.736·10⁻⁷ | 0.005 | 0.0055 | 0.0083 |
| 3 | Kernel vs. CN2 | 2.98 | 0.0029 | 0.005 | 0.0063 | 0.0083 |
| 4 | C4.5 vs. 1NN | 2.817 | 0.0048 | 0.005 | 0.0071 | 0.0083 |
| 5 | 1NN vs. Kernel | 2.654 | 0.008 | 0.005 | 0.0083 | 0.0083 |
| 6 | 1NN vs. NaiveBayes | 2.572 | 0.0101 | 0.005 | 0.01 | 0.0125 |
| 7 | C4.5 vs. CN2 | 2.49 | 0.0128 | 0.005 | 0.0125 | 0.0125 |
| 8 | NaiveBayes vs. CN2 | 2.245 | 0.0247 | 0.005 | 0.0167 | 0.0167 |
| 9 | 1NN vs. CN2 | 0.327 | 0.744 | 0.005 | 0.025 | 0.025 |
| 10 | C4.5 vs. NaiveBayes | 0.245 | 0.8065 | 0.005 | 0.05 | 0.05 |

Table 3: Family of hypotheses ordered by p-value, and adjusted α by the Nemenyi (NM), Holm (HM) and Shaffer static (SH) procedures, considering an initial α = 0.05.

Size 1: (12), **(13)**, (14), (15), (23), (24), **(25)**, (34), (35), (45)
Size 2: (12,34), (12,35), (12,45), (13,24), **(13,25)**, (13,45), (14,23), (14,25), (14,35), (15,23), (15,24), (15,34), (23,45), (24,35), (25,34)
Size 3: (12,13,23), (12,14,24), (12,15,25), (13,14,34), (13,15,35), (14,15,45), (23,24,34), (23,25,35), (24,25,45), (34,35,45)
Size 4: (12,13,23,45), (12,14,24,35), (12,15,25,34), (12,34,35,45), (13,14,25,34), (13,15,24,35), (13,24,25,45), (14,15,23,45), (14,23,25,35), (15,23,24,34)
Size 6: (12,13,14,23,24,34), (12,13,15,23,25,35), (12,14,15,24,25,45), (13,14,15,34,35,45), (23,24,25,34,35,45)
Size 10: (12,13,14,15,23,24,25,34,35,45)

Table 4: Exhaustive sets obtained for the case study, grouped by size. Those belonging to the acceptance set A are typed in bold.
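The Bergmann-Hommel decision for the case study can be checked mechanically: generate the exhaustive sets (equivalently, one per partition of the classifiers into groups of equal performance), then keep every set I whose p-values satisfy min p > α/|I|. A sketch using our own naming and the hypothesis labels of footnote 4:

```python
def partitions(items):
    """All partitions of a list into non-empty groups."""
    if not items:
        yield []
        return
    head, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[head] + part[i]] + part[i + 1:]
        yield part + [[head]]

def exhaustive_sets(k):
    """One exhaustive set per partition: the within-group pairwise hypotheses."""
    sets_ = set()
    for part in partitions(list(range(1, k + 1))):
        hyps = frozenset(f"{a}{b}" for g in part for a in g for b in g if a < b)
        if hyps:
            sets_.add(hyps)
    return sets_

# p-values of Table 3, keyed by the labels of footnote 4
# (1 = C4.5, 2 = 1-NN, 3 = NaiveBayes, 4 = Kernel, 5 = CN2)
P = {"14": 4.487e-8, "34": 1.736e-7, "45": 0.0029, "12": 0.0048,
     "24": 0.008, "23": 0.0101, "15": 0.0128, "35": 0.0247,
     "25": 0.744, "13": 0.8065}

def bergmann_hommel_retained(p, k, alpha=0.05):
    acc = set()
    for i_set in exhaustive_sets(k):
        if min(p[h] for h in i_set) > alpha / len(i_set):
            acc |= i_set
    return acc
```

`bergmann_hommel_retained(P, 5)` returns {'13', '25'}: only C4.5 vs. NaiveBayes and 1-NN vs. CN2 are retained, so the other eight hypotheses are rejected, in agreement with the case study.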


In Demšar (2006), we can find a discussion about the power of the Hochberg and Hommel procedures with respect to Holm's one. They reject more hypotheses than Holm's, but the differences are in practice rather small (Shaffer, 1995). The most powerful procedures detailed in this paper, Shaffer's and Bergmann-Hommel's, work following the same method as Holm's procedure, so it would seem possible to hybridize them with other types of step-up procedures, such as Hochberg's, Hommel's and Rom's methods. When we apply these methods by using the logical relationships among hypotheses in a static way, they do not control the family-wise error (Hochberg and Rom, 1995). In contrast, when applying these methods by detecting dynamical relationships, they do control the family-wise error. In Hochberg and Rom (1995), several extensions were given in this way. Furthermore, a small improvement of power in the Bergmann-Hommel procedure described here can be achieved when using Simes' conjecture (Simes, 1986) in the obtaining of the A set (see Hommel and Bernhard, 1999, for more details).

3. Adjusted P-Values

The smallest level of significance that results in the rejection of the null hypothesis, the p-value, is a useful and interesting datum for many consumers of statistical analysis. A p-value provides information about whether a statistical hypothesis test is significant or not, and it also indicates something about "how significant" the result is: the smaller the p-value, the stronger the evidence against the null hypothesis. Most importantly, it does this without committing to a particular level of significance.

When a p-value is considered within a multiple comparison, as in the example in Table 3, it reflects the probability error of a certain comparison, but it does not take into account the remaining comparisons belonging to the family. One way to solve this problem is to report APVs, which take into account that multiple tests are conducted. An APV can be compared directly with any chosen significance level α. In this paper, we encourage the use of APVs due to the fact that they provide more information in a statistical analysis.

In the following, we will explain how to compute the APVs depending on the post-hoc procedure used in the analysis, following the indications given in Wright (1992) and Hommel and Bernhard (1999). We also include the post-hoc tests explained in Demšar (2006) and others for comparisons with a control classifier. The notation used in the computation of the APVs is the following:

- Indexes i and j each correspond to a concrete comparison or hypothesis in the family of hypotheses, according to an incremental order by their p-values. Index i always refers to the hypothesis in question whose APV is being computed, and index j refers to another hypothesis in the family.
- p_j is the p-value obtained for the j-th hypothesis.
- k is the number of classifiers being compared.
- m is the number of possible comparisons in an all pairwise comparisons design; that is, m = k(k − 1)/2.
- t_j is the maximum number of hypotheses which can be true given that any (j − 1) hypotheses are false (see the description of Shaffer's static procedure in Section 2.1).


The procedures of p-value adjustment can be classified into:

- One-step:
  - Bonferroni APV_i: min{v; 1}, where v = (k − 1) p_i.
  - Nemenyi APV_i: min{v; 1}, where v = m · p_i.
- Step-up:
  - Hochberg APV_i: max{ (k − j) p_j : (k − 1) ≥ j ≥ i }.
  - Hommel APV_i: see the algorithm in Figure 3.
- Step-down:
  - Holm APV_i (using a control classifier): min{v; 1}, where v = max{ (k − j) p_j : 1 ≤ j ≤ i }.
  - Holm APV_i (using it in all pairwise comparisons): min{v; 1}, where v = max{ (m − j + 1) p_j : 1 ≤ j ≤ i }.
  - Shaffer static APV_i: min{v; 1}, where v = max{ t_j p_j : 1 ≤ j ≤ i }.
  - Bergmann-Hommel APV_i: min{v; 1}, where v = max{ |I| · min{ p_j : j ∈ I } : I exhaustive, i ∈ I }.

 1. Set APV_i := p_i, for all i = 1, ..., m.
 2. For each j = m, m − 1, ..., 2 (in that order):
 3.   Let B := {m − j + 1, ..., m}.
 4.   For each i ∈ B:
 5.     Compute the value c_i = (j · p_i) / (j + i − m).
 6.   End for
 7.   Find the smallest c_i, i ∈ B; call it c_min.
 8.   If APV_i < c_min, then APV_i := c_min, for all i ∈ B.
 9.   For each i ∉ B:
10.     Let c_i := min{c_min, j · p_i}.
11.     If APV_i < c_i, then APV_i := c_i.
12.   End for
13. End for

Figure 3: Algorithm for calculating APVs based on Hommel's procedure.

Table 5 shows the results in the final form of APVs for the example considered in this section. As we can see, this example is suitable for observing the difference of power among the test procedures. Also, this table can provide information about the state of retention or rejection of any hypothesis, by comparing its associated APV with the level of significance previously fixed.
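The step-down APVs listed above are essentially a running maximum; a sketch computing the Nemenyi, Holm and Shaffer static APVs for an ordered family (the function names are ours):

```python
def nemenyi_apv(p_sorted):
    """One-step Nemenyi APVs: m * p_i, capped at 1."""
    m = len(p_sorted)
    return [min(m * p, 1.0) for p in p_sorted]

def holm_apv(p_sorted):
    """All-pairwise Holm APVs: running maximum of (m - j + 1) * p_j."""
    m = len(p_sorted)
    apv, running = [], 0.0
    for j, p in enumerate(p_sorted, start=1):
        running = max(running, (m - j + 1) * p)
        apv.append(min(running, 1.0))
    return apv

def shaffer_apv(p_sorted, t):
    """Shaffer static APVs: running maximum of t_j * p_j."""
    apv, running = [], 0.0
    for p, tj in zip(p_sorted, t):
        running = max(running, tj * p)
        apv.append(min(running, 1.0))
    return apv

p = [4.487e-8, 1.736e-7, 0.0029, 0.0048, 0.008,
     0.0101, 0.0128, 0.0247, 0.744, 0.8065]        # ordered p-values of Table 3
t = [10, 6, 6, 6, 6, 4, 4, 3, 2, 1]                # Shaffer t_j for k = 5
```

These reproduce the NM, HM and SH columns of Table 5 up to the rounding of the input p-values (for instance, holm_apv(p)[2] ≈ 0.023 and shaffer_apv(p, t)[2] ≈ 0.0173).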


| i | Hypothesis | p | APV (NM) | APV (HM) | APV (SH) | APV (BH) |
|---|---|---|---|---|---|---|
| 1 | C4.5 vs. Kernel | 4.487·10⁻⁸ | 4.487·10⁻⁷ | 4.487·10⁻⁷ | 4.487·10⁻⁷ | 4.487·10⁻⁷ |
| 2 | NaiveBayes vs. Kernel | 1.736·10⁻⁷ | 1.736·10⁻⁶ | 1.563·10⁻⁶ | 1.042·10⁻⁶ | 1.042·10⁻⁶ |
| 3 | Kernel vs. CN2 | 0.0029 | 0.0288 | 0.023 | 0.0173 | 0.0115 |
| 4 | C4.5 vs. 1NN | 0.0048 | 0.0485 | 0.0339 | 0.0291 | 0.0291 |
| 5 | 1NN vs. Kernel | 0.008 | 0.0796 | 0.0478 | 0.0478 | 0.0319 |
| 6 | 1NN vs. NaiveBayes | 0.0101 | 0.1011 | 0.0506 | 0.0478 | 0.0319 |
| 7 | C4.5 vs. CN2 | 0.0128 | 0.1276 | 0.0511 | 0.0511 | 0.0383 |
| 8 | NaiveBayes vs. CN2 | 0.0247 | 0.2474 | 0.0742 | 0.0742 | 0.0383 |
| 9 | 1NN vs. CN2 | 0.744 | 1.0 | 1.0 | 1.0 | 1.0 |
| 10 | C4.5 vs. NaiveBayes | 0.8065 | 1.0 | 1.0 | 1.0 | 1.0 |

Table 5: APVs obtained in the example by the Nemenyi (NM), Holm (HM), Shaffer static (SH) and Bergmann-Hommel dynamic (BH) procedures.

4. Experimental Framework

In this section, we want to determine the power and behavior of the studied procedures through experiments in which we repeatedly compared the classifiers on sets of ten randomly chosen data sets, recording the number of equivalence hypotheses rejected and the APVs. We follow a method similar to that used in Demšar (2006).

The classifiers used are the same as in the case study of the previous section: C4.5 with the minimum number of item-sets per leaf and the confidence level fitted for optimal accuracy and pruning strategy; a naive Bayesian learner with continuous attributes discretized using Fayyad and Irani (1993) discretization; the classic 1-Nearest-Neighbor classifier with Euclidean distance; CN2 with the Fayyad-Irani discretizer, a fixed star size and 95% of examples to cover; and the Kernel classifier with sigmaKernel = 0.01, which is the inverse value of the variance that represents the radius of the neighborhood. All classifiers are available in the KEEL software (Alcalá-Fdez et al., 2008).

For performing this study, we have compiled a sample of fifty data sets from the UCI machine learning repository (Asuncion and Newman, 2007), all of them valid for a classification task. We measured the performance of each classifier by means of accuracy in test by using ten-fold cross-validation. As Demšar did, when comparing two classifiers, samples of ten data sets were randomly selected, with the probability of a data set being chosen determined by k · d_i, where d_i is the (positive or negative) difference in the classification accuracies on that data set and k is the bias through which we can regulate the differences between the classifiers. With k = 0, the selection is purely random, and as k becomes higher, the selected data sets are more favorable to a particular classifier.

In comparisons of multiple classifiers, samples of data sets have to be selected with the probabilities computed from the differences in accuracy of two classifiers. We have chosen C4.5 and 1-NN, due to the fact that we have found significant differences between them in the study conducted before (Section 2.2), which involved thirty data sets. Note that the repeated comparisons done here only involve ten data sets each time, so the rejection of equivalence of two classifiers is more difficult at the beginning of the process.

5. It is also available at http://www.keel.es
6. The data sets used are: abalone, adult, australian, autos, balance, bands, breast, bupa, car, cleveland, dermatology, ecoli, flare, german, glass, haberman, hayes-roth, heart, iris, led7digit, letter, lymphography, magic, monks, mushrooms, newthyroid, nursery, optdigits, pageblocks, penbased, pima, ring, satimage, segment, shuttle, spambase, splice, tae, thyroid, tic-tac-toe, twonorm, vehicle, vowel, wine, wisconsin, yeast, zoo.
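The data-set selection scheme can be sketched as follows. The exact weighting function is elided in the extracted text, so purely for illustration we assume a logistic weight 1/(1 + e^(−k·d_i)), which is uniform at k = 0 and increasingly favors one classifier as k grows; this functional form is an assumption, not confirmed by the source:

```python
import math
import random

def selection_weights(diffs, k):
    # assumed logistic weighting: uniform when k = 0, biased toward
    # data sets with a positive accuracy difference d_i as k grows
    return [1.0 / (1.0 + math.exp(-k * d)) for d in diffs]

def sample_datasets(diffs, k, size=10, seed=0):
    """Draw `size` distinct data-set indices under the weighted scheme."""
    rng = random.Random(seed)
    pool = list(range(len(diffs)))
    weights = selection_weights(diffs, k)
    chosen = []
    for _ in range(size):
        i = rng.choices(range(len(pool)), weights=weights)[0]
        chosen.append(pool.pop(i))
        weights.pop(i)
    return chosen
```

With k = 0 every data set is equally likely, and repeating `sample_datasets` over many trials reproduces the repeated-sampling design of this section.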


Figure 4 shows the results of this study considering the pairwise comparison between C4.5 and 1-NN. It gives an approximation of the power of the statistical procedures considered in this paper. Figure 4(a) reflects the number of times they rejected the equivalence of C4.5 and 1-NN. Obviously, the Bergmann-Hommel procedure is the most powerful, followed by Shaffer's static procedure. The graphic also informs us about the use of logically related hypotheses, given that the procedures that use this information have a bias towards the same point, and those which do not use this information tend to a lower point than the first. When the selection of data sets is purely random (0), the benefit of using the Bergmann-Hommel procedure is appreciable. Figure 4(b) shows the average APV of the same comparison of classifiers. As we can see, the Nemenyi procedure is too conservative in comparison with the remaining procedures. Again, the benefit of using more sophisticated testing procedures is easily noticeable.

Figure 4: C4.5 vs. 1-NN. (a) Number of hypotheses rejected in pairwise comparisons; (b) average APV in pairwise comparisons.

Figure 5 shows the results of this study considering all possible pairwise comparisons in the set of classifiers. It helps us to compare the overall behavior of the four testing procedures. Figure 5(a) presents the number of times they rejected any comparison belonging to the family. Although it could seem that the selection of data sets determined by the difference in accuracy between two classifiers may not influence the overall comparison, the graphic shows us that it does. Furthermore, the lines drawn follow a parallel behavior, indicating the relation and magnitude of power among the four procedures. In Figure 5(b) we illustrate the average APV for all the comparisons of classifiers. We can notice that the conservatism of the Nemenyi test is obvious with respect to the rest of the procedures. The benefit of using a more advanced testing procedure is similar with respect to the following less-powerful procedure, except for the Holm procedure.

Finally, our recommendation on the usage of a certain procedure depends on the results obtained in this paper and on our experience in understanding and implementing them: we do not recommend the use of the Nemenyi test, because it is a very conservative procedure and many of the obvious differences may not be detected. When we use a considerable number of data sets with regard to the number of classifiers, we could proceed with the Holm procedure.
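The adjusted p-values (APVs) behind these comparisons can be sketched briefly. The following is an illustrative implementation, not the paper's own KEEL code, and the raw p-values in the example are invented: the Nemenyi (one-step) APV multiplies each raw p-value by the number m of comparisons, while the Holm (step-down) APV uses decreasing multipliers together with a running maximum that keeps the APVs monotone.

```python
def nemenyi_apvs(pvalues):
    """One-step Nemenyi/Bonferroni-style APVs: APV_i = min(m * p_i, 1)."""
    m = len(pvalues)
    return [min(m * p, 1.0) for p in pvalues]

def holm_apvs(pvalues):
    """Step-down Holm APVs: APV_(i) = max_{j<=i} min((m - j + 1) * p_(j), 1)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending p-values
    apvs = [0.0] * m
    running_max = 0.0
    for step, idx in enumerate(order):                  # step = j - 1
        running_max = max(running_max, min((m - step) * pvalues[idx], 1.0))
        apvs[idx] = running_max                         # enforce monotonicity
    return apvs

# Invented raw p-values for three pairwise comparisons:
raw = [0.010, 0.040, 0.030]
print(nemenyi_apvs(raw))  # every raw p-value multiplied by m = 3
print(holm_apvs(raw))     # Holm APVs are uniformly at least as small
```

An APV can be compared directly against any significance level, which is what makes the average APV in Figures 4(b) and 5(b) a meaningful summary of each procedure's conservatism.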


Figure 5: All comparisons. (a) Total number of hypotheses rejected; (b) average APV in all comparisons.

However, conducting the Shaffer static procedure means a not very significant increase in difficulty with respect to the Holm procedure. Moreover, the benefit of using information about logically related hypotheses is noticeable, so we strongly encourage the use of this procedure. The Bergmann-Hommel procedure is the best performing one, but it is also the most difficult to understand and the most computationally expensive. We recommend its use when the situation requires it (i.e., when the differences among the classifiers compared are not very significant), given that the results it obtains are as valid as those of any other testing procedure.

5. Conclusions

The present paper is an extension of Demšar (2006). Demšar does not deal in depth with some topics related to multiple comparisons involving all the algorithms and the computation of adjusted p-values. In this paper we describe other advanced testing procedures for conducting all pairwise comparisons in a multiple comparisons analysis: the Shaffer static and Bergmann-Hommel procedures. The advantage that they obtain is due to the incorporation of more information about the hypotheses to be tested: in these comparisons, a logical relationship exists among them. As a general rule, the Bergmann-Hommel procedure is the most powerful one, but it requires intensive computation in comparisons involving numerous classifiers. In these cases, Shaffer's procedure can be used instead of Bergmann-Hommel. Moreover, we present the methods for obtaining the adjusted p-values, which are valid p-values associated with each comparison, can be compared with any level of significance without restrictions, and also provide more information. We have illustrated them with a case study and we have checked that the newly described methods are more powerful than the classical ones, the Nemenyi and Holm procedures.
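The logical relations exploited by Shaffer's static procedure can be sketched as follows (an illustrative implementation with invented example p-values, not the KEEL source from Appendix A). With k classifiers, the number of pairwise-equality hypotheses that can simultaneously be true is restricted to a set S(k); at each step the Holm multiplier m - i + 1 is replaced by the largest member of S(k) not exceeding it:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def possible_true_counts(k):
    """S(k): possible numbers of true equality hypotheses among k classifiers.

    A group of j classifiers with equal performance contributes C(j, 2)
    true hypotheses; the remaining k - j classifiers recurse.
    """
    if k <= 1:
        return frozenset([0])
    counts = set()
    for j in range(1, k + 1):
        for rest in possible_true_counts(k - j):
            counts.add(j * (j - 1) // 2 + rest)
    return frozenset(counts)

def shaffer_static_apvs(pvalues, k):
    """Step-down Shaffer static APVs for all m = k(k-1)/2 pairwise comparisons."""
    m = k * (k - 1) // 2
    assert len(pvalues) == m
    s_k = possible_true_counts(k)
    order = sorted(range(m), key=lambda i: pvalues[i])
    apvs = [0.0] * m
    running_max = 0.0
    for step, idx in enumerate(order, start=1):
        # largest number of hypotheses that can still be true at this step
        t = max(s for s in s_k if s <= m - step + 1)
        running_max = max(running_max, min(t * pvalues[idx], 1.0))
        apvs[idx] = running_max
    return apvs
```

For k = 3 classifiers the multipliers become 3, 1, 1 instead of Holm's 3, 2, 1 (once one equality is rejected, at most one of the remaining hypotheses can be true), which is the source of the extra power seen in the figures.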


Acknowledgments

This research has been supported by the project TIN2005-08386-C05-01. S. García holds an FPU scholarship from the Spanish Ministry of Education and Science. The present paper was submitted as a regular paper to the JMLR journal. After the review process, the action editor Dale Schuurmans encouraged us to submit the paper to the special topic on Multiple Simultaneous Hypothesis Testing. We are very grateful to the anonymous reviewers and both action editors who managed this paper for their valuable suggestions and comments to improve its quality.

Appendix A. Source Code of the Procedures

The source code, written in JAVA, that implements all the procedures described in this paper is available at http://sci2s.ugr.es/keel/multipleTest.zip. The program allows the input of data in CSV format and obtains as output a LaTeX document.

References

J. Alcalá-Fdez, L. Sánchez, S. García, M.J. del Jesus, S. Ventura, J.M. Garrell, J. Otero, C. Romero, J. Bacardit, V.M. Rivas, J.C. Fernández, and F. Herrera. KEEL: a software tool to assess evolutionary algorithms to data mining problems. Soft Computing, doi: 10.1007/s00500-008-0323-y, 2008. In press.

A. Asuncion and D.J. Newman. UCI machine learning repository, 2007. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.

R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. A comparison of decision tree ensemble creation techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1):173–180, 2007.

G. Bergmann and G. Hommel. Improvements of general multiple test procedures for redundant systems of hypotheses. In P. Bauer, G. Hommel, and E. Sonnemann, editors, Multiple Hypotheses Testing, pages 100–115. Springer, Berlin, 1988.

P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3(4):261–283, 1989.

J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.

S. Esmeir and S. Markovitch. Anytime learning of decision trees. Journal of Machine Learning Research, 8:891–933, 2007.

U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022–1029. Morgan-Kaufmann, 1993.

M. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32:675–701, 1937.


N. García-Pedrajas and C. Fyfe. Immune network based ensembles. Neurocomputing, 70(7-9):1155–1166, 2007.

Y. Hochberg. A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75:800–802, 1988.

Y. Hochberg and D. Rom. Extensions of multiple testing procedures based on Simes' test. Journal of Statistical Planning and Inference, 48:141–152, 1995.

S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65–70, 1979.

G. Hommel. A stagewise rejective multiple test procedure. Biometrika, 75:383–386, 1988.

G. Hommel and G. Bernhard. A rapid algorithm and a computer program for multiple test procedures using logical structures of hypotheses. Computer Methods and Programs in Biomedicine, 43:213–216, 1994.

G. Hommel and G. Bernhard. Bonferroni procedures for logically related hypotheses. Journal of Statistical Planning and Inference, 82:119–128, 1999.

R. L. Iman and J. M. Davenport. Approximations of the critical region of the Friedman statistic. Communications in Statistics, pages 571–595, 1980.

C. Marrocco, R. Duin, and F. Tortorella. Maximizing the area under the ROC curve by pairwise feature combination. Pattern Recognition, 41:1961–1974, 2008.

G. J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley Series in Probability and Mathematical Statistics, 2004.

J. Murray, G. Hughes, and K. Kreutz-Delgado. Machine learning methods for predicting failures in hard drives: a multiple-instance application. Journal of Machine Learning Research, 6:783–816, 2005.

P. B. Nemenyi. Distribution-free Multiple Comparisons. PhD thesis, Princeton University, 1963.

M. Núñez, R. Fidalgo, and R. Morales. Learning in environments with unknown dynamics: towards more robust concept learners. Journal of Machine Learning Research, 8:2595–2628, 2007.

A. B. Owen. Infinitely imbalanced logistic regression. Journal of Machine Learning Research, 8:761–773, 2007.

J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kauffman, 1993.

D. M. Rom. A sequentially rejective test procedure based on a modified Bonferroni inequality. Biometrika, 77:663–665, 1990.

J. P. Shaffer. Modified sequentially rejective multiple test procedures. Journal of the American Statistical Association, 81(395):826–831, 1986.

J. P. Shaffer. Multiple hypothesis testing. Annual Review of Psychology, 46:561–584, 1995.


D. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. Chapman & Hall/CRC, 2003.

R. J. Simes. An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73:751–754, 1986.

P. H. Westfall and S. S. Young. Resampling-Based Multiple Testing: Examples and Methods for p-value Adjustment. John Wiley and Sons, 2004.

S. Wright. Adjusted p-values for simultaneous inference. Biometrics, 48:1005–1013, 1992.

Y. Yang, G. Webb, K. Korb, and K. M. Ting. Classifying under computational resource constraints: anytime classification using probabilistic estimators. Machine Learning, 69:35–53, 2007a.

Y. Yang, G. I. Webb, J. Cerquides, K. B. Korb, J. Boughton, and K. M. Ting. To select or to weigh: a comparative study of linear combination schemes for superparent-one-dependence estimators. IEEE Transactions on Knowledge and Data Engineering, 19(12):1652–1665, 2007b.

J. H. Zar. Biostatistical Analysis. Prentice Hall, 1999.

