Journal of Machine Learning Research 9 (2008) 2677-2694. Submitted 10/07; Revised 4/08; Published 12/08

An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons

Salvador García
Francisco Herrera
Department of Computer Science and Artificial Intelligence
University of Granada
Granada, 18071, Spain

Editor: John Shawe-Taylor

© 2008 Salvador García and Francisco Herrera.

Abstract

In a recently published paper in JMLR, Demšar (2006) recommends a set of non-parametric statistical tests and procedures which can be safely used for comparing the performance of classifiers over multiple data sets. After studying the paper, we realize that it correctly introduces the basic procedures and some of the most advanced ones when comparing with a control method. However, it does not deal with some advanced topics in depth. Regarding these topics, we focus on more powerful proposals of statistical procedures for comparing classifiers. Moreover, we illustrate an easy way of obtaining adjusted and comparable p-values in multiple comparison procedures.

Keywords: statistical methods, non-parametric test, multiple comparisons tests, adjusted p-values, logically related hypotheses

1. Introduction

In the Machine Learning (ML) scientific community there is a need for rigorous and correct statistical analysis of published results, due to the fact that the development or modification of algorithms is a relatively easy task. The main inconvenience related to this necessity is to understand and study the statistics and to know the exact techniques which can or cannot be applied depending on the situation, that is, the type of results obtained. In a recently published paper in JMLR by Demšar (2006), a group of useful guidelines is given in order to perform a correct analysis when we compare a set of classifiers over multiple data sets. Demšar recommends a set of non-parametric statistical techniques (Zar, 1999; Sheskin, 2003) for comparing classifiers under these circumstances, given that the sample of results obtained by them does not fulfill the required conditions and is not large enough for a parametric statistical analysis. He analyzed the behavior of the proposed statistics on classification tasks and he checked that they are more convenient than parametric techniques.

Recent studies apply the guidelines given by Demšar in the analysis of the performance of classifiers (Esmeir and Markovitch, 2007; Marrocco et al., 2008). In them, a new proposal or methodology is offered and it is compared with other methods by means of pairwise comparisons. Another type of study assumes an empirical comparison or review of already proposed methods. In these cases, no proposal is offered and a statistical comparison could be very useful in determining the differences among the methods. In the specialized literature, many papers provide reviews on a specific topic and they also use statistical methodology to perform comparisons. For example, in a review of
ensembles of decision trees, non-parametric tests are also applied in the analysis of performance (Banfield et al., 2007). However, only the rankings computed by Friedman's method (Friedman, 1937) are stipulated and the authors establish comparisons based on them, without taking into account significance levels.

Demšar focused his work on the analysis of new proposals, and he introduced the Nemenyi test for making all pairwise comparisons (Nemenyi, 1963). Nevertheless, the Nemenyi test is very conservative and it may not find any difference in most of the experimentations. In recent papers, the authors have used the Nemenyi test in multiple comparisons. Because this test possesses low power, authors have to employ many data sets (Yang et al., 2007b) or most of the differences found are not significant (Yang et al., 2007a; Núñez et al., 2007). Although the employment of many data sets could seem beneficial in order to improve the generalization of results, in some specific domains, that is, imbalanced classification (Owen, 2007) or multi-instance classification (Murray et al., 2005), data sets are difficult to find.

Procedures with more power than Nemenyi's one can be found in the specialized literature. Our starting point is the need to apply more powerful procedures in empirical studies in which no new method is proposed, where the benefit consists of obtaining more statistical differences among the classifiers compared. Thus, in this paper we describe these procedures and we analyze their behavior by means of the analysis of multiple repetitions of experiments with randomly selected data sets.

On the other hand, we can see other works in which the p-value associated with a comparison between two classifiers is reported (García-Pedrajas and Fyfe, 2007). Classical non-parametric tests, such as Wilcoxon and Friedman (Sheskin, 2003), may be incorporated in most of the statistical packages (SPSS, SAS, R, etc.) and the computation of the final p-value is usually implemented. However, advanced procedures such as Holm (1979), Hochberg (1988), Hommel (1988) and the ones described in this paper are usually not incorporated in statistical packages. The computation of the correct p-value, or Adjusted P-Value (APV) (Westfall and Young, 2004), in a comparison using any of these procedures is not very difficult and, in this paper, we show how to include it with an illustrative example.

The paper is set up as follows. Section 2 presents more powerful procedures for comparing all the classifiers with each other in a comparison of multiple classifiers, together with a case study. In Section 3 we describe the procedures for obtaining the APV by considering the post-hoc procedures explained by Demšar and the ones explained in this paper. In Section 4, we perform an experimental study of the behavior of the statistical procedures and we discuss the results obtained. Finally, Section 5 concludes the paper.

2. Comparison of Multiple Classifiers: Performing All Pairwise Comparisons

In Demšar (2006), referring to carrying out comparisons of more than two classifiers, a set of useful guidelines was given for detecting significant differences among the results obtained, along with post-hoc procedures for identifying these differences. Friedman's test is an omnibus test which can be used to carry out these types of comparison. It allows us to detect differences considering the global set of classifiers. Once Friedman's test rejects the null hypothesis, we can proceed with a post-hoc test in order to find the concrete pairwise comparisons which produce differences. Demšar described the use of the Nemenyi test when all classifiers are compared with each other. Then, he focused on procedures that control the family-wise error when comparing with a control classifier, arguing that the objective of a study is to test whether a newly proposed method is better than the existing
ones. For this reason, he described and studied in depth more powerful and sophisticated procedures derived from Bonferroni-Dunn such as Holm's, Hochberg's and Hommel's methods.

Nevertheless, we think that performing all pairwise comparisons in an experimental analysis may be useful and interesting in different cases when proposing a new method. For example, it would be interesting to conduct a statistical analysis over multiple classifiers in review works in which no method is proposed. In this case, the repetition of comparisons choosing different control classifiers may lose the control of the family-wise error.

Our intention in this section is to give a detailed description of more powerful and advanced procedures derived from the Nemenyi test and to show a case study that uses these procedures.

2.1 Advanced Procedures for Performing All Pairwise Comparisons

A set of pairwise comparisons can be associated with a set or family of hypotheses. Any of the post-hoc tests which can be applied to non-parametric tests (that is, those derived from the Bonferroni correction or similar procedures) work over a family of hypotheses. As Demšar explained, the test statistic for comparing the i-th and j-th classifier is

$$z = \frac{R_i - R_j}{\sqrt{\frac{k(k+1)}{6N}}},$$

where $R_i$ is the average rank computed through the Friedman test for the i-th classifier, $k$ is the number of classifiers to be compared and $N$ is the number of data sets used in the comparison.

The $z$ value is used to find the corresponding probability (p-value) from the table of the normal distribution, which is then compared with an appropriate level of significance $\alpha$ (Table A1 in Sheskin, 2003). Two basic procedures are:

- Nemenyi (1963) procedure: it adjusts the value of $\alpha$ in a single step by dividing the value of $\alpha$ by the number of comparisons performed, $m = k(k-1)/2$. This procedure is the simplest but it also has little power.

- Holm (1979) procedure: it was also described in Demšar (2006) but it was used for comparisons of multiple classifiers involving a control method. It adjusts the value of $\alpha$ in a step-down method. Let $p_1, \ldots, p_m$ be the ordered p-values (smallest to largest) and $H_1, \ldots, H_m$ be the corresponding hypotheses. Holm's procedure rejects $H_1$ to $H_{i-1}$ if $i$ is the smallest integer such that $p_i > \alpha/(m - i + 1)$.

Other alternatives were developed by Hochberg (1988), Hommel (1988) and Rom (1990). They are easy to perform, but they often have similar power to Holm's procedure (they have more power than Holm's procedure, but the difference between them is not very notable) when considering all pairwise comparisons.

The hypotheses being tested belonging to a family of all pairwise comparisons are logically interrelated so that not all combinations of true and false hypotheses are possible. As a simple example of such a situation suppose that we want to test the three hypotheses of pairwise equality associated with the pairwise comparisons of three classifiers $C_i$, $i = 1, 2, 3$. It is easily seen from the relations among the hypotheses that if any one of them is false, at least one other must be false. For example, if $C_1$ is better/worse than $C_2$, then it is not possible that $C_1$ has the same performance as $C_3$ and $C_2$ has the same performance as $C_3$. $C_3$ must be better/worse than $C_1$ or $C_2$ or the two classifiers at the same time. Thus, there cannot be one false and two true hypotheses among these three.
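The test statistic $z = (R_i - R_j)/\sqrt{k(k+1)/(6N)}$ and the two basic procedures above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the two-sided p-value is taken from the standard normal distribution via `math.erfc`, and the average ranks in the usage lines are the ones reported for the case study in Table 2 (C4.5, 1-NN, NaiveBayes, Kernel, CN2 over N = 30 data sets).

```python
import math
from itertools import combinations


def pairwise_tests(avg_ranks, n_datasets):
    """Return (i, j, z, p) for every pair of classifiers, sorted by ascending p."""
    k = len(avg_ranks)
    se = math.sqrt(k * (k + 1) / (6.0 * n_datasets))  # standard error of a rank difference
    rows = []
    for i, j in combinations(range(k), 2):
        z = abs(avg_ranks[i] - avg_ranks[j]) / se
        p = math.erfc(z / math.sqrt(2.0))  # two-sided p-value under N(0, 1)
        rows.append((i, j, z, p))
    return sorted(rows, key=lambda row: row[3])


def nemenyi_rejections(p_sorted, alpha=0.05):
    """One-step rule: compare every p-value with alpha / m."""
    m = len(p_sorted)
    return [p <= alpha / m for p in p_sorted]


def holm_rejections(p_sorted, alpha=0.05):
    """Step-down rule: stop at the first i with p_i > alpha / (m - i + 1)."""
    m = len(p_sorted)
    n_rejected = 0
    for i, p in enumerate(p_sorted, start=1):
        if p > alpha / (m - i + 1):
            break
        n_rejected += 1
    return [True] * n_rejected + [False] * (m - n_rejected)


# Usage with the average ranks of the case study (assumed order of classifiers:
# C4.5, 1-NN, NaiveBayes, Kernel, CN2).
ranks = [2.100, 3.250, 2.200, 4.333, 3.117]
rows = pairwise_tests(ranks, n_datasets=30)
p_sorted = [row[3] for row in rows]
print(sum(nemenyi_rejections(p_sorted)), sum(holm_rejections(p_sorted)))
```

With these ranks the sketch flags four hypotheses under Nemenyi's rule and five under Holm's rule, which agrees with the case study of Section 2.2.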
Based on this argument, Shaffer proposed two procedures which make use of the logical relation among the family of hypotheses for adjusting the value of $\alpha$ (Shaffer, 1986).

- Shaffer's static procedure: following Holm's step-down method, at stage $i$, instead of rejecting $H_i$ if $p_i \le \alpha/(m - i + 1)$, reject $H_i$ if $p_i \le \alpha/t_i$, where $t_i$ is the maximum number of hypotheses which can be true given that any $(i - 1)$ hypotheses are false. It is a static procedure, that is, $t_1, \ldots, t_m$ are fully determined for the given hypotheses $H_1, \ldots, H_m$, independent of the observed p-values. The possible numbers of true hypotheses, and thus the values of $t_i$, can be obtained from the recursive formula

$$S(k) = \bigcup_{j=1}^{k} \left\{ \binom{j}{2} + x : x \in S(k - j) \right\},$$

where $S(k)$ is the set of possible numbers of true hypotheses with $k$ classifiers being compared, $k \ge 2$, and $S(0) = S(1) = \{0\}$.

- Shaffer's dynamic procedure: it increases the power of the first by substituting $\alpha/t_i$ at stage $i$ by the value $\alpha/t_i^*$, where $t_i^*$ is the maximum number of hypotheses that could be true, given that the previous hypotheses are false. It is a dynamic procedure since $t_i^*$ depends not only on the logical structure of the hypotheses, but also on the hypotheses already rejected at step $i$. Obviously, this procedure has more power than the first one. In this paper, we have not used this second procedure, given that it is included in an advanced procedure which we will describe in the following.

In Bergmann and Hommel (1988), a procedure was proposed based on the idea of finding all elementary hypotheses which cannot be rejected. In order to formulate Bergmann-Hommel's procedure, we need the following definition.

Definition 1. An index set of hypotheses $I \subseteq \{1, \ldots, m\}$ is called exhaustive if exactly all $H_j$, $j \in I$, could be true.

In order to exemplify the previous definition, we will consider the following case: we have three classifiers, and we will compare them in an all pairwise comparison. We will obtain three hypotheses:

- $H_1$: $C_1$ is equal in behavior to $C_2$.
- $H_2$: $C_1$ is equal in behavior to $C_3$.
- $H_3$: $C_2$ is equal in behavior to $C_3$.

and eight possible sets $S_i$:

- $S_1$: All $H_j$ are true.
- $S_2$: $H_1$ and $H_2$ are true and $H_3$ is false.
- $S_3$: $H_1$ and $H_3$ are true and $H_2$ is false.
- $S_4$: $H_2$ and $H_3$ are true and $H_1$ is false.
- $S_5$: $H_1$ is true and $H_2$ and $H_3$ are false.
- $S_6$: $H_2$ is true and $H_1$ and $H_3$ are false.
- $S_7$: $H_3$ is true and $H_1$ and $H_2$ are false.
- $S_8$: All $H_j$ are false.

Sets $S_1$, $S_5$, $S_6$, $S_7$ and $S_8$ can be possible, because their hypotheses can be true at the same time, so they are exhaustive sets. Set $S_2$, based on the principles of logically related hypotheses, is not possible because the performance of $C_1$ cannot be equal to $C_2$ and $C_3$ whereas $C_2$ has different performance than $C_3$. The same consideration can be made for $S_3$ and $S_4$, which are not exhaustive sets.

Under this definition, it works as follows.

- Bergmann and Hommel (1988) procedure: Reject all $H_j$ with $j \notin A$, where the acceptance set

$$A = \bigcup \{ I : I \text{ exhaustive}, \; \min\{p_i : i \in I\} > \alpha/|I| \}$$

is the index set of null hypotheses which are retained.

For this procedure, one has to check for each subset $I$ of $\{1, \ldots, m\}$ if $I$ is exhaustive, which leads to intensive computation. Due to this fact, we will obtain
a set, named $E$, which will contain all the possible exhaustive sets of hypotheses for a certain comparison. A rapid algorithm which was described in Hommel and Bernhard (1994) allows a substantial reduction in computing time. Once the $E$ set is obtained, the hypotheses that do not belong to the $A$ set are rejected.

Figure 1 shows a valid algorithm for obtaining all the exhaustive sets of hypotheses, using as input a list of classifiers $C$. $E$ is a set of families of hypotheses; likewise, a family of hypotheses is a set of hypotheses. The most important step in the algorithm is the number 6. It performs a division of the classifiers into two subsets, in which the last classifier $k$ always is inserted in the second subset and the first subset cannot be empty. In this way, we ensure that a subset yielded in a division is never empty and no repetitions are produced.

For example, suppose a set $C$ with three classifiers, $C = \{1, 2, 3\}$. All possible divisions $\{C_1, C_2\}$ without taking into account the previous assumptions are:

{{}, {1,2,3}}, {{1}, {2,3}}, {{2}, {1,3}}, {{1,2}, {3}}, {{3}, {1,2}}, {{1,3}, {2}}, {{2,3}, {1}}, {{1,2,3}, {}}.

These eight divisions form four pairs of equivalent divisions (each pair differs only in which subset is called $C_1$). Furthermore, the two divisions that contain an empty subset are not interesting. Using the assumptions in step 6 of the algorithm, the possible divisions are:

{{1}, {2,3}}, {{2}, {1,3}}, {{1,2}, {3}}.

In this case, all the divisions are interesting and no repetitions are yielded. The computational complexity of the algorithm for obtaining exhaustive sets is $O(2^{n^2})$. However, the computation requirements may be reduced by means of using storage capabilities. Exhaustive sets for smaller numbers of classifiers can be stored in memory, so there is no necessity of invoking the obtainExhaustive function recursively. The computational complexity using storage capabilities is $O(2^n)$, so the algorithm still requires intensive computation.
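The division step described above lends itself to a direct recursive implementation. The following Python sketch is one possible reading of that construction, not the authors' code: `obtain_exhaustive` is a hypothetical name, each hypothesis is represented as a pair of classifier labels, and `functools.lru_cache` stands in for the "storage capabilities" mentioned in the text.

```python
from functools import lru_cache
from itertools import combinations


@lru_cache(maxsize=None)
def obtain_exhaustive(classifiers):
    """Return all exhaustive sets for a tuple of classifier labels.

    Each exhaustive set is a frozenset of pairs (a, b), read as the
    hypothesis "a and b perform equally".
    """
    if len(classifiers) < 2:
        return frozenset()  # no comparison is possible
    # The full family of pairwise comparisons is itself exhaustive.
    result = {frozenset(combinations(classifiers, 2))}
    *rest, last = classifiers
    # Divide into C1 (non-empty, never containing the last classifier)
    # and C2 (always containing the last classifier).
    for size in range(1, len(rest) + 1):
        for c1 in combinations(rest, size):
            c2 = tuple(x for x in rest if x not in c1) + (last,)
            e1 = obtain_exhaustive(c1)
            e2 = obtain_exhaustive(c2)
            result |= e1
            result |= e2
            result |= {a | b for a in e1 for b in e2}
    return frozenset(result)


print(len(obtain_exhaustive((1, 2, 3, 4))), len(obtain_exhaustive((1, 2, 3, 4, 5))))
```

For four and five classifiers the sketch yields 14 and 51 distinct exhaustive sets, the counts reported in this section for those sizes.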
An example illustrating the algorithm for obtaining all exhaustive sets is drawn in Figure 2. In it, four classifiers, enumerated from 1 to 4 in the $C$ set, are used. The comparisons or hypotheses are denoted by pairs of numbers without a separation character between them. This illustration does not show the case in which $|C| < 2$, for simplifying the representation. When $|C| < 2$, no comparisons can be performed, so the obtainExhaustive function returns an empty set $E$. An edge connecting two boxes represents an invocation of this function. In each box, the list of classifiers given as input and the first initialization of the $E$ set are displayed. The main edges, whose starting point is the initial box, are labeled by the order of invocation. Below the graph, the resulting subset $E$ in each main edge is denoted. The final $E$ will be composed by the union of these subsets. At the end of the process, 14 distinct exhaustive sets are found:

(12,13,14,23,24,34), (23,24,34), (13,14,34), (12,14,24), (12,13,23), (12), (13), (14), (23), (24), (34), (12,34), (13,24), (23,14).

Table 1 gives the number of hypotheses ($m$), the number ($2^m$) of index sets $I$ and the number of exhaustive index sets for $k$ classifiers being compared.

Function obtainExhaustive($C = \{c_1, \ldots, c_k\}$: list of classifiers)
 1. Let $E = \emptyset$
 2. $E = E \cup$ {set of all possible and distinct pairwise comparisons using $C$}
 3. If $E == \emptyset$
 4.   Return $E$
 5. End if
 6. For all possible divisions of $C$ into two subsets $C_1$ and $C_2$, $c_k \in C_2$ and $C_1 \ne \emptyset$
 7.   $E_1$ = obtainExhaustive($C_1$)
 8.   $E_2$ = obtainExhaustive($C_2$)
 9.   $E = E \cup E_1$
10.   $E = E \cup E_2$
11.   For each family of hypotheses $e_1$ of $E_1$
12.     For each family of hypotheses $e_2$ of $E_2$
13.       $E = E \cup \{e_1 \cup e_2\}$
14.     End for
15.   End for
16. End for
17. Return $E$

Figure 1: Algorithm for obtaining all exhaustive sets

Figure 2: Example of the obtaining of exhaustive sets of hypotheses considering four classifiers

k    m = k(k-1)/2    2^m            exhaustive index sets
4    6               64             14
5    10              1024           51
6    15              32768          202
7    21              2097152        876
8    28              2.7 × 10^8     4139
9    36              6.9 × 10^10    21146

Table 1: All pairwise comparisons of k classifiers

The following subsections present a case study of a comparison of some well-known classifiers over thirty data sets. In it, the four procedures explained above are employed.

2.2 Performing All Pairwise Comparisons: A Case Study

In the following, we show an example involving the four procedures described with a comparison of five classifiers: C4.5 (Quinlan, 1993); One Nearest Neighbor (1-NN) with Euclidean distance,
NaiveBayes, Kernel (McLachlan, 2004) and, finally, CN2 (Clark and Niblett, 1989). The parameters used are specified in Section 4. We have used 10-fold cross validation and standard parameters for each algorithm. The results correspond to average accuracy or 1 − class error in test data. We have used 30 data sets. Table 2 shows the overall process of computation of average rankings.

Friedman (1937) and Iman and Davenport (1980) tests check whether the measured average ranks are significantly different from the mean rank $R_j = 3$. They respectively use the $\chi^2$ and the $F$ statistical distributions to determine if a distribution of observed frequencies differs from the theoretical expected frequencies. Their statistics use nominal (categorical) or ordinal level data, instead of using means and variances. Demšar (2006) detailed the computation of the critical values in each distribution. In this case, the critical values are 9.488 and 2.45, respectively, at $\alpha = 0.05$, and the Friedman and Iman-Davenport statistics are:

$$\chi^2_F = 39.647; \qquad F_F = 14.309.$$

Due to the fact that the critical values are lower than the respective statistics, we can proceed with the post-hoc tests in order to detect significant pairwise differences among all the classifiers. For this, we have to compute and order the corresponding statistics and p-values. The standard error in the pairwise comparison between two classifiers is $SE = \sqrt{\frac{k(k+1)}{6N}} = \sqrt{\frac{5 \cdot 6}{6 \cdot 30}} = 0.408$. Table 3 presents the family of hypotheses ordered by their p-value and the adjustment of $\alpha$ by Nemenyi's, Holm's and Shaffer's static procedures.

- Nemenyi's test rejects the hypotheses [1-4] since the corresponding p-values are smaller than the adjusted $\alpha$'s.
- Holm's procedure rejects the hypotheses [1-5].
- Shaffer's static procedure rejects the hypotheses [1-6].
- Bergmann-Hommel's dynamic procedure first obtains the exhaustive index sets of hypotheses. It obtains 51 index sets. We can see them in Table 4. From the index sets, it computes the $A$ set. It rejects all hypotheses $H_j$ with $j \notin A$, so it rejects the hypotheses [1-8].

Bergmann-Hommel's dynamic procedure allows us to clearly distinguish among three groups of classifiers, attending to their performance:

- Best classifiers: C4.5 and NaiveBayes.
- Middle classifiers: 1-NN and CN2.
- Worst classifier: Kernel.

Footnotes:
1. Kernel method is a Bayesian classifier which employs a non-parametric estimation of density functions through a Gaussian kernel function. The adjustment of the covariance matrix is performed by the ad-hoc method.
2. NaiveBayes and CN2 are classifiers for discrete domains, so we have discretized the data prior to learning with them. The discretizer algorithm is that of Fayyad and Irani (1993).
3. Data sets marked with * have been subsampled, being adapted to slow algorithms, such as CN2.
4. We have considered that each classifier follows the order: C4.5, 1-NN, NaiveBayes, Kernel, CN2. For example, the hypothesis 13 represents the comparison between C4.5 and NaiveBayes.
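The $\alpha/t_i$ thresholds used by Shaffer's static procedure in this case study can be derived from the recursive formula for $S(k)$ given in Section 2.1. The Python sketch below is an illustration under that reading; the function names are hypothetical, and $t_i$ is taken as the largest element of $S(k)$ that does not exceed $m - i + 1$, since at most $m - i + 1$ hypotheses can be true once $i - 1$ of them are false.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def possible_true_counts(k):
    """S(k): possible numbers of true hypotheses when k classifiers are compared."""
    if k <= 1:
        return frozenset({0})
    counts = set()
    for j in range(1, k + 1):
        within_group = j * (j - 1) // 2  # C(j, 2): pairs inside a group of j equal classifiers
        counts |= {within_group + x for x in possible_true_counts(k - j)}
    return frozenset(counts)


def shaffer_t(k):
    """t_1, ..., t_m for Shaffer's static procedure with k classifiers."""
    m = k * (k - 1) // 2
    s = possible_true_counts(k)
    return [max(x for x in s if x <= m - i + 1) for i in range(1, m + 1)]


alpha = 0.05
t = shaffer_t(5)
print(t)
print([round(alpha / ti, 4) for ti in t])
```

For five classifiers this gives the thresholds 0.005, 0.0083 (four times), 0.0125 (twice), 0.0167, 0.025 and 0.05, which can be compared with the Shaffer column of Table 3.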
Data set        C4.5          1-NN          NaiveBayes    Kernel        CN2
Abalone*        0.219 (3)     0.202 (4)     0.249 (2)     0.165 (5)     0.261 (1)
Adult*          0.803 (2)     0.750 (4)     0.813 (1)     0.692 (5)     0.798 (3)
Australian      0.859 (1)     0.814 (4)     0.845 (2)     0.542 (5)     0.816 (3)
Autos           0.809 (1)     0.774 (3)     0.673 (4)     0.275 (5)     0.785 (2)
Balance         0.768 (3)     0.790 (2)     0.727 (4)     0.872 (1)     0.706 (5)
Breast          0.759 (1)     0.654 (5)     0.734 (2)     0.703 (4)     0.714 (3)
Bupa            0.693 (1)     0.611 (3)     0.572 (4.5)   0.689 (2)     0.572 (4.5)
Car             0.915 (1)     0.857 (3)     0.860 (2)     0.700 (5)     0.777 (4)
Cleveland       0.544 (2)     0.531 (4)     0.558 (1)     0.439 (5)     0.541 (3)
Crx             0.855 (2)     0.796 (4)     0.857 (1)     0.607 (5)     0.809 (3)
Dermatology     0.945 (3)     0.954 (2)     0.978 (1)     0.541 (5)     0.858 (4)
German          0.725 (2)     0.705 (4)     0.739 (1)     0.625 (5)     0.717 (3)
Glass           0.674 (4)     0.736 (1)     0.721 (2)     0.356 (5)     0.704 (3)
Hayes-Roth      0.801 (1)     0.357 (4)     0.520 (2.5)   0.309 (5)     0.520 (2.5)
Heart           0.785 (2)     0.770 (3)     0.841 (1)     0.659 (5)     0.759 (4)
Ion             0.906 (2)     0.359 (5)     0.895 (3)     0.641 (4)     0.918 (1)
Led7Digit       0.710 (2)     0.402 (4)     0.728 (1)     0.120 (5)     0.674 (3)
Letter*         0.691 (2)     0.827 (1)     0.667 (3)     0.527 (5)     0.638 (4)
Lymphography    0.743 (3)     0.739 (4)     0.830 (1)     0.549 (5)     0.746 (2)
Mushrooms*      0.990 (1.5)   0.482 (5)     0.941 (3)     0.857 (4)     0.990 (1.5)
OptDigits*      0.867 (3)     0.098 (5)     0.915 (2)     0.986 (1)     0.784 (4)
Satimage*       0.821 (3)     0.872 (2)     0.815 (4)     0.885 (1)     0.778 (5)
SpamBase*       0.893 (2)     0.824 (4)     0.902 (1)     0.739 (5)     0.885 (3)
Splice*         0.799 (2)     0.655 (4)     0.925 (1)     0.517 (5)     0.755 (3)
Tic-tac-toe     0.845 (1)     0.731 (2)     0.693 (4)     0.653 (5)     0.704 (3)
Vehicle         0.741 (1)     0.701 (2)     0.591 (5)     0.663 (3)     0.619 (4)
Vowel           0.799 (2)     0.994 (1)     0.603 (4)     0.269 (5)     0.621 (3)
Wine            0.949 (4)     0.955 (2)     0.989 (1)     0.770 (5)     0.954 (3)
Yeast           0.555 (3)     0.505 (4)     0.569 (1)     0.312 (5)     0.556 (2)
Zoo             0.928 (2.5)   0.928 (2.5)   0.945 (1)     0.419 (5)     0.897 (4)
Average rank    2.100         3.250         2.200         4.333         3.117

Table 2: Computation of the rankings for the five algorithms considered in the study over 30 data sets, based on test accuracy by using ten-fold cross validation
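The Friedman and Iman-Davenport statistics quoted for this case study can be approximately recomputed from the average ranks of Table 2. The sketch below assumes the formulas given by Demšar (2006), $\chi^2_F = \frac{12N}{k(k+1)}\left[\sum_j R_j^2 - \frac{k(k+1)^2}{4}\right]$ and $F_F = \frac{(N-1)\chi^2_F}{N(k-1) - \chi^2_F}$, which are not restated in this paper; the function name is hypothetical.

```python
def friedman_statistics(avg_ranks, n_datasets):
    """Return (chi2_F, F_F) computed from average ranks over n_datasets data sets."""
    k = len(avg_ranks)
    n = n_datasets
    sum_sq = sum(r * r for r in avg_ranks)
    chi2 = 12.0 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)
    ff = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, ff


# Average ranks from Table 2 (rounded to three decimals, so the result
# differs slightly from the statistics reported in the text).
chi2, ff = friedman_statistics([2.100, 3.250, 2.200, 4.333, 3.117], 30)
print(round(chi2, 2), round(ff, 2))
```

Because the ranks in the table are rounded, the values obtained this way agree with the reported 39.647 and 14.309 only to about two significant decimals; both remain well above the critical values 9.488 and 2.45.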
i     Hypothesis                z        p                α_NM     α_HM     α_SH
1     C4.5 vs. Kernel           5.471    4.487 × 10^-8    0.005    0.005    0.005
2     NaiveBayes vs. Kernel     5.226    1.736 × 10^-7    0.005    0.0055   0.0083
3     Kernel vs. CN2            2.98     0.0029           0.005    0.0063   0.0083
4     C4.5 vs. 1NN              2.817    0.0048           0.005    0.0071   0.0083
5     1NN vs. Kernel            2.654    0.008            0.005    0.0083   0.0083
6     1NN vs. NaiveBayes        2.572    0.0101           0.005    0.01     0.0125
7     C4.5 vs. CN2              2.49     0.0128           0.005    0.0125   0.0125
8     NaiveBayes vs. CN2        2.245    0.0247           0.005    0.0167   0.0167
9     1NN vs. CN2               0.327    0.744            0.005    0.025    0.025
10    C4.5 vs. NaiveBayes       0.245    0.8065           0.005    0.05     0.05

Table 3: Family of hypotheses ordered by p-value and adjusting of α by Nemenyi (NM), Holm (HM) and Shaffer (SH) procedures, considering an initial α = 0.05

Size 1    Size 2     Size 3        Size 4           Size 6 or more
(12)      (12,34)    (12,13,23)    (12,13,23,45)    (12,13,14,15,23,24,25,34,35,45)
(13)      (13,24)    (12,14,24)    (12,14,24,35)    (12,13,14,23,24,34)
(23)      (14,23)    (13,14,34)    (12,34,35,45)    (12,13,15,23,25,35)
(14)      (12,35)    (23,24,34)    (13,14,25,34)    (12,14,15,24,25,45)
(24)      (13,25)    (12,15,25)    (13,15,24,35)    (13,14,15,34,35,45)
(34)      (15,23)    (13,15,35)    (13,24,25,45)    (23,24,25,34,35,45)
(15)      (12,45)    (23,25,35)    (14,15,23,45)
(25)      (13,45)    (14,15,45)    (14,23,25,35)
(35)      (23,45)    (24,25,45)    (15,23,24,34)
(45)      (14,25)    (34,35,45)
          (15,24)
          (14,35)
          (24,35)

Table 4: Exhaustive sets obtained for the case study. Those belonging to the Acceptance set (A) are typed in bold.
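As a cross-check of the Bergmann-Hommel step in this case study, the acceptance set can be computed directly from the unadjusted p-values. The sketch below is illustrative only and deliberately swaps in a different enumeration from the algorithm of Figure 1: it generates exhaustive sets from partitions of the classifiers into groups of equal performance, which is assumed here to be an equivalent characterization. All function names are hypothetical, and the p-values are the rounded ones listed in Table 3, keyed by classifier pairs in the order C4.5 = 1, 1-NN = 2, NaiveBayes = 3, Kernel = 4, CN2 = 5.

```python
from itertools import combinations


def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for idx in range(len(part)):  # add `first` to an existing block
            yield part[:idx] + [[first] + part[idx]] + part[idx + 1:]
        yield [[first]] + part        # or give `first` its own block


def exhaustive_sets(k):
    """Exhaustive sets as frozensets of pairs (i, j), i < j, for k classifiers."""
    found = set()
    for part in set_partitions(list(range(1, k + 1))):
        hyps = frozenset(p for block in part for p in combinations(sorted(block), 2))
        if hyps:
            found.add(hyps)
    return found


def bergmann_hommel(p_values, k, alpha=0.05):
    """Return (acceptance set A, rejected hypotheses) for a dict pair -> p-value."""
    acceptance = set()
    for index_set in exhaustive_sets(k):
        if min(p_values[h] for h in index_set) > alpha / len(index_set):
            acceptance |= index_set
    return acceptance, set(p_values) - acceptance


p_values = {
    (1, 4): 4.487e-8, (3, 4): 1.736e-7, (4, 5): 0.0029, (1, 2): 0.0048,
    (2, 4): 0.008, (2, 3): 0.0101, (1, 5): 0.0128, (3, 5): 0.0247,
    (2, 5): 0.744, (1, 3): 0.8065,
}
accepted, rejected = bergmann_hommel(p_values, k=5)
print(sorted(accepted), len(rejected))
```

Under these assumptions only the hypotheses 13 and 25 are retained and the remaining eight are rejected, which agrees with the outcome described for the Bergmann-Hommel procedure in the case study.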
In Demšar (2006), we can find a discussion about the power of Hochberg's and Hommel's procedures with respect to Holm's one. They reject more hypotheses than Holm's, but the differences are in practice rather small (Shaffer, 1995). The most powerful procedures detailed in this paper, Shaffer's and Bergmann-Hommel's, work following the same method as Holm's procedure, so it is possible to hybridize them with other types of step-up procedures, such as Hochberg's, Hommel's and Rom's methods. When we apply these methods by using the logical relationships among hypotheses in a static way, they do not control the family-wise error (Hochberg and Rom, 1995). In contrast, when applying these methods by detecting dynamical relationships, they control the family-wise error. In Hochberg and Rom (1995), several extensions were given in this way.

Furthermore, a small improvement of power in the Bergmann-Hommel procedure described here can be achieved when using Simes' conjecture (Simes, 1986) in the obtaining of the $A$ set (see Hommel and Bernhard, 1999, for more details).

3. Adjusted P-Values

The smallest level of significance that results in the rejection of the null hypothesis, the p-value, is a useful and interesting datum for many consumers of statistical analysis. A p-value provides information about whether a statistical hypothesis test is significant or not, and it also indicates something about how significant the result is: the smaller the p-value, the stronger the evidence against the null hypothesis. Most important, it does this without committing to a particular level of significance.

When a p-value is within a multiple comparison, as in the example in Table 3, it reflects the probability error of a certain comparison, but it does not take into account the remaining comparisons belonging to the family. One way to solve this problem is to report APVs which take into account that multiple tests are conducted. An APV can be compared directly with any chosen significance level $\alpha$. In this paper, we encourage the use of APVs due to the fact that they provide more information in a statistical analysis.

In the following, we will explain how to compute the APVs depending on the post-hoc procedure used in the analysis, following the indications given in Wright (1992) and Hommel and Bernhard (1999). We also include the post-hoc tests explained in Demšar (2006) and others for comparisons with a control classifier. The notation used in the computation of the APVs is the following:

- Indexes $i$ and $j$ correspond each one to a concrete comparison or hypothesis in the family of hypotheses, according to an incremental order by their p-values. Index $i$ always refers to the hypothesis in question whose APV is being computed and index $j$ refers to another hypothesis in the family.
- $p_j$ is the p-value obtained for the j-th hypothesis.
- $k$ is the number of classifiers being compared.
- $m$ is the number of possible comparisons in an all pairwise comparisons design; that is, $m = \frac{k(k-1)}{2}$.
- $t_j$ is the maximum number of hypotheses which can be true given that any $(j - 1)$ hypotheses are false (see the description of Shaffer's static procedure in Section 2.1).
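The procedures of p-value adjustment can be classified into:

- one-step.
  - Bonferroni $APV_i$: $\min\{v; 1\}$, where $v = (k-1)p_i$.
  - Nemenyi $APV_i$: $\min\{v; 1\}$, where $v = m \cdot p_i$.
- step-up.
  - Hochberg $APV_i$: $\min\{(k-j)p_j : (k-1) \ge j \ge i\}$.
  - Hommel $APV_i$: see algorithm at Figure 3.
- step-down.
  - Holm $APV_i$ (using a control classifier): $\min\{v; 1\}$, where $v = \max\{(k-j)p_j : 1 \le j \le i\}$.
  - Holm $APV_i$ (using it in all pairwise comparisons): $\min\{v; 1\}$, where $v = \max\{(m-j+1)p_j : 1 \le j \le i\}$.
  - Shaffer static $APV_i$: $\min\{v; 1\}$, where $v = \max\{t_j p_j : 1 \le j \le i\}$.
  - Bergmann-Hommel $APV_i$: $\min\{v; 1\}$, where $v = \max\{|I| \cdot \min\{p_j : j \in I\} : I \text{ exhaustive}, i \in I\}$.

 1. Set $APV_i = p_i$ for all $i$.
 2. For each $j = k-1, k-2, \ldots, 2$ (in that order)
 3.   Let $B = \emptyset$.
 4.   For each $i$, $i > (k - 1 - j)$
 5.     Compute value $c_i = (j \cdot p_i)/(j + i - k + 1)$.
 6.     $B = B \cup c_i$.
 7.   End for
 8.   Find the smallest $c_i$ value in $B$; call it $c_{min}$.
 9.   If $APV_i < c_{min}$, then $APV_i = c_{min}$.
10.   For each $i$, $i \le (k - 1 - j)$
11.     Let $c_i = \min(c_{min}, j \cdot p_i)$.
12.     If $APV_i < c_i$, then $APV_i = c_i$.
13.   End for

Figure 3: Algorithm for calculating APVs based on Hommel's procedure

Table 5 shows the results in the final form of APVs for the example considered in this section. As we can see, this example is suitable for observing the difference of power among the test procedures. Also, this table can provide information about the state of retainment or rejection of any hypothesis, comparing its associated APV with the level of significance previously fixed.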
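The all pairwise adjustments listed above (Nemenyi, Holm and Shaffer static) reduce to running maxima over the ordered p-values. The following Python sketch is a minimal illustration with hypothetical function names; it takes as input the rounded p-values of Table 3, and the $t_j$ values are those implied by the Shaffer column of that table ($t_j = \alpha/\alpha_{SH}$), so the results match the reported APVs only up to rounding.

```python
def nemenyi_apv(p_sorted):
    """APV_i = min(m * p_i, 1)."""
    m = len(p_sorted)
    return [min(m * p, 1.0) for p in p_sorted]


def holm_apv(p_sorted):
    """APV_i = min(max_{j <= i} (m - j + 1) * p_j, 1)."""
    m = len(p_sorted)
    out, running = [], 0.0
    for j, p in enumerate(p_sorted, start=1):
        running = max(running, (m - j + 1) * p)
        out.append(min(running, 1.0))
    return out


def shaffer_apv(p_sorted, t):
    """APV_i = min(max_{j <= i} t_j * p_j, 1)."""
    out, running = [], 0.0
    for tj, p in zip(t, p_sorted):
        running = max(running, tj * p)
        out.append(min(running, 1.0))
    return out


# Ordered p-values (rounded) and t_j values for k = 5 classifiers.
p = [4.487e-8, 1.736e-7, 0.0029, 0.0048, 0.008, 0.0101, 0.0128, 0.0247, 0.744, 0.8065]
t = [10, 6, 6, 6, 6, 4, 4, 3, 2, 1]
for name, apv in (("NM", nemenyi_apv(p)), ("HM", holm_apv(p)), ("SH", shaffer_apv(p, t))):
    print(name, [round(v, 4) for v in apv])
```

A hypothesis is rejected at level $\alpha$ whenever its APV does not exceed $\alpha$, so the three lists can be read directly against $\alpha = 0.05$.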
i     Hypothesis                p_i              APV_NM           APV_HM           APV_SH           APV_BH
1     C4.5 vs. Kernel           4.487 × 10^-8    4.487 × 10^-7    4.487 × 10^-7    4.487 × 10^-7    4.487 × 10^-7
2     NaiveBayes vs. Kernel     1.736 × 10^-7    1.736 × 10^-6    1.563 × 10^-6    1.042 × 10^-6    1.042 × 10^-6
3     Kernel vs. CN2            0.0029           0.0288           0.023            0.0173           0.0115
4     C4.5 vs. 1NN              0.0048           0.0485           0.0339           0.0291           0.0291
5     1NN vs. Kernel            0.008            0.0796           0.0478           0.0478           0.0319
6     1NN vs. NaiveBayes        0.0101           0.1011           0.0506           0.0478           0.0319
7     C4.5 vs. CN2              0.0128           0.1276           0.0511           0.0511           0.0383
8     NaiveBayes vs. CN2        0.0247           0.2474           0.0742           0.0742           0.0383
9     1NN vs. CN2               0.744            1.0              1.0              1.0              1.0
10    C4.5 vs. NaiveBayes       0.8065           1.0              1.0              1.0              1.0

Table 5: APVs obtained in the example by Nemenyi (NM), Holm (HM), Shaffer's static (SH) and Bergmann-Hommel's dynamic (BH)

4. Experimental Framework

In this section, we want to determine the power and behavior of the studied procedures through experiments in which we repeatedly compared the classifiers on sets of ten randomly chosen data sets, recording the number of equivalence hypotheses rejected and the APVs. We follow a method similar to the one used in Demšar (2006).

The classifiers used are the same as in the case study of the previous subsection: C4.5 with a fixed minimum number of item-sets per leaf, and confidence level fitted for optimal accuracy and pruning strategy; naive Bayesian learner with continuous attributes discretized using Fayyad and Irani's (1993) discretization; classic 1-Nearest-Neighbor classifier with Euclidean distance; CN2 with Fayyad-Irani's discretizer, a fixed star size and 95% of examples to cover; and Kernel classifier with sigmaKernel = 0.01, which is the inverse value of the variance that represents the radius of neighborhood. All classifiers are available in KEEL software (Alcalá-Fdez et al., 2008).

For performing this study, we have compiled a sample of fifty data sets from the UCI machine learning repository (Asuncion and Newman, 2007), all of them valid for the classification task. We measured the performance of each classifier by means of accuracy in test by using ten-fold cross validation. As Demšar did, when comparing two classifiers, samples of ten data sets were randomly selected so that the probability for the data set $i$ being chosen was proportional to $1/(1 + e^{-k d_i})$, where $d_i$ is the (positive or negative) difference in the classification accuracies on that data set and $k$ is the bias through which we can regulate the differences between the classifiers. With $k = 0$, the selection is purely random, and as $k$ grows, the selected data sets are more favorable to a particular classifier.

In comparisons of multiple classifiers, samples of data sets have to be selected with the probabilities computed from the differences in accuracy of two classifiers. We have chosen C4.5 and 1-NN, due to the fact that we have found significant differences between them in the study conducted before (Section 2.2), which involved thirty data sets. Note that the repeated comparisons done here only involve ten data sets each time, so the rejection of equivalence of two classifiers is more difficult at the beginning of the process.

Footnotes:
5. It is also available at
6. The data sets used are: abalone, adult, australian, autos, balance, bands, breast, bupa, car, cleveland, dermatology, ecoli, flare, german, glass, haberman, hayes-roth, heart, iris, led7digit, letter, lymphography, magic, monks, mushrooms, newthyroid, nursery, optdigits, pageblocks, penbased, pima, ring, satimage, segment, shuttle, spambase, splice, tae, thyroid, tic-tac-toe, twonorm, vehicle, vowel, wine, wisconsin, yeast, zoo.
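The biased selection of data sets described in this section can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes weights of the form $1/(1 + e^{-k d_i})$ as in Demšar (2006), assumes that the ten data sets of a sample are drawn without replacement (the text does not say), and uses a hypothetical function name and made-up accuracy differences.

```python
import math
import random


def sample_data_sets(diffs, bias, size=10, rng=random):
    """Draw `size` distinct data-set indices with weight 1 / (1 + exp(-bias * d_i)).

    diffs[i] is the accuracy difference of the two reference classifiers on
    data set i; bias = 0 gives a purely random selection.
    """
    weights = [1.0 / (1.0 + math.exp(-bias * d)) for d in diffs]
    candidates = list(range(len(diffs)))
    chosen = []
    for _ in range(size):
        current = [weights[c] for c in candidates]
        pick = rng.choices(candidates, weights=current, k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)  # assumption: sampling without replacement
    return chosen


# Hypothetical accuracy differences for fifty data sets (illustration only).
random.seed(0)
diffs = [random.uniform(-0.2, 0.2) for _ in range(50)]
print(sorted(sample_data_sets(diffs, bias=0.0)))
print(sorted(sample_data_sets(diffs, bias=20.0)))
```

Repeating such draws for increasing values of the bias, and applying the post-hoc procedures to each sample, gives the kind of curves discussed for this experiment; very large products of bias and difference would need a numerically safer form of the logistic weight.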
Page 14
Figure 4 shows the results of this study for the pairwise comparison between C4.5 and 1-NN. It gives an approximation of the power of the statistical procedures considered in this paper. Figure 4(a) reflects the number of times they rejected the equivalence of C4.5 and 1-NN. Clearly, the Bergmann-Hommel procedure is the most powerful, followed by Shaffer's static procedure. The graphic also informs us about the use of logically related hypotheses: the procedures that use this information tend towards the same point, whereas those that do not use it tend to a lower point. When the selection of data sets is purely random ($k = 0$), the benefit of using the Bergmann-Hommel procedure is appreciable. Figure 4(b) shows the average APV of the same comparison of classifiers. As we can see, the Nemenyi procedure is too conservative in comparison with the remaining procedures. Again, the benefit of using more sophisticated testing procedures is easily noticeable.

Figure 4: C4.5 vs. 1-NN. (a) Number of hypotheses rejected in pairwise comparisons. (b) Average APV in pairwise comparisons.

Figure 5 shows the results of this study considering all possible pairwise comparisons in the set of classifiers. It helps us to compare the overall behavior of the four testing procedures. Figure 5(a) presents the number of times they rejected any comparison belonging to the family. Although it might seem that selecting data sets according to the difference in accuracy between two classifiers would not influence the overall comparison, the graphic shows that it does. Furthermore, the lines drawn follow a parallel behavior, indicating the relation and magnitude of power among the four procedures. In Figure 5(b) we illustrate the average APV for all the comparisons of classifiers. The conservatism of the Nemenyi test is obvious with respect to the rest of the procedures. The benefit of using a more advanced testing procedure over the next less powerful one is similar in each case, except for the Holm procedure.

Figure 5: All comparisons. (a) Total number of hypotheses rejected. (b) Average APV in all comparisons.

Finally, our recommendation on the usage of a certain procedure depends on the results obtained in this paper and on our experience in understanding and implementing them:

- We do not recommend the use of the Nemenyi test, because it is a very conservative procedure and many of the obvious differences may not be detected.
- When we use a considerable number of data sets with regard to the number of classifiers, we could proceed with the Holm procedure.
- However, conducting the Shaffer static procedure entails only a small increase in difficulty with respect to the Holm procedure. Moreover, the benefit of using information about logically related hypotheses is noticeable, thus we strongly encourage the use of this procedure.
- The Bergmann-Hommel procedure is the best performing one, but it is also the most difficult to understand and the most computationally expensive. We recommend its usage when the situation requires it (i.e., the differences among the classifiers compared are not very significant), given that the results it obtains are as valid as those of the other testing procedures.

5. Conclusions

The present paper is an extension of Demšar (2006). Demšar does not deal in depth with some topics related to multiple comparisons involving all the algorithms and computations of adjusted p-values.

In this paper we describe other advanced testing procedures for conducting all pairwise comparisons in multiple comparisons analysis: the Shaffer static and Bergmann-Hommel procedures. Their advantage comes from incorporating more information about the hypotheses to be tested: in all pairwise comparisons, a logical relationship among the hypotheses exists. As a general rule, the Bergmann-Hommel procedure is the most powerful one, but it requires intensive computation in comparisons involving numerous classifiers. The second one, the Shaffer procedure, can be used instead of Bergmann-Hommel in these cases.

Moreover, we present the methods for obtaining the adjusted p-values, which are valid p-values associated with each comparison; they can be compared with any level of significance without restrictions and they also provide more information. We have illustrated them with a case study and we have checked that the newly described methods are more powerful than the classical ones, the Nemenyi and Holm procedures.

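As an illustration of the adjusted p-values (APVs) for all pairwise comparisons of $k$ classifiers, the following Python sketch computes them for the Nemenyi, Holm and Shaffer static procedures. It is a minimal sketch, not the authors' implementation: the function names and the example p-values are hypothetical. With $m = k(k-1)/2$ ordered p-values $p_1 \le \dots \le p_m$, it uses $\min\{m\,p_i, 1\}$ for Nemenyi, $\min\{\max_{j \le i} (m-j+1)\,p_j, 1\}$ for Holm, and $\min\{\max_{j \le i} t_j\,p_j, 1\}$ for Shaffer, where $t_j$ is the maximum number of hypotheses that can be true given that $j-1$ are false (Shaffer, 1986). The Bergmann-Hommel procedure is left out because it requires enumerating exhaustive sets of hypotheses.

```python
from math import comb

def possible_true_counts(k):
    """Set S(k) of possible numbers of true hypotheses among all
    pairwise comparisons of k classifiers (Shaffer's recursion)."""
    S = {0: {0}, 1: {0}}
    for n in range(2, k + 1):
        S[n] = set()
        for j in range(1, n + 1):
            # j classifiers are mutually equivalent: C(j, 2) true hypotheses,
            # plus any feasible count among the remaining n - j classifiers.
            S[n] |= {comb(j, 2) + x for x in S[n - j]}
    return S[k]

def adjusted_p_values(p_values, k):
    """Return (nemenyi, holm, shaffer) APVs, aligned with p_values sorted ascending."""
    m = k * (k - 1) // 2
    if len(p_values) != m:
        raise ValueError("expected one p-value per pairwise comparison")
    p = sorted(p_values)
    counts = possible_true_counts(k)
    nemenyi = [min(m * pi, 1.0) for pi in p]
    holm, shaffer = [], []
    holm_max = shaffer_max = 0.0
    for i, pi in enumerate(p):      # i = 0 is the smallest p-value
        remaining = m - i           # equals m - j + 1 with j = i + 1
        t = max(c for c in counts if c <= remaining)
        holm_max = max(holm_max, remaining * pi)
        shaffer_max = max(shaffer_max, t * pi)
        holm.append(min(holm_max, 1.0))
        shaffer.append(min(shaffer_max, 1.0))
    return nemenyi, holm, shaffer

# Hypothetical unadjusted p-values for the 6 pairwise comparisons of k = 4 classifiers.
example_p = [0.001, 0.004, 0.01, 0.02, 0.03, 0.2]
nemenyi, holm, shaffer = adjusted_p_values(example_p, k=4)
```

For every comparison the Shaffer APV is at most the Holm APV, which in turn is at most the Nemenyi APV, mirroring the ordering of power discussed in the text. For $k = 4$ the recursion gives $S(4) = \{0, 1, 2, 3, 6\}$, so after the first rejection at most three of the remaining five hypotheses can be true.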
Acknowledgments

This research has been supported by the project TIN2005-08386-C05-01. S. García holds an FPU scholarship from the Spanish Ministry of Education and Science. The present paper was submitted as a regular paper to JMLR. After the review process, the action editor Dale Schuurmans encouraged us to submit the paper to the special topic on Multiple Simultaneous Hypothesis Testing. We are very grateful to the anonymous reviewers and to both action editors who managed this paper for their valuable suggestions and comments to improve its quality.

Appendix A. Source Code of the Procedures

The source code, written in Java, that implements all the procedures described in this paper is available at [...]. The program allows the input of data in CSV format and produces a document as output.

References

J. Alcalá-Fdez, L. Sánchez, S. García, M.J. del Jesus, S. Ventura, J.M. Garrell, J. Otero, C. Romero, J. Bacardit, V.M. Rivas, J.C. Fernández, and F. Herrera. KEEL: A software tool to assess evolutionary algorithms to data mining problems. Soft Computing, doi: 10.1007/s00500-008-0323-y, 2008. In press.
A. Asuncion and D.J. Newman. UCI machine learning repository, 2007. URL http://www.ics. mlearn/MLRepository.html
R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. A comparison of decision tree ensemble creation techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1):173-180, 2007.
G. Bergmann and G. Hommel. Improvements of general multiple test procedures for redundant systems of hypotheses. In P. Bauer, G. Hommel, and E. Sonnemann, editors, Multiple Hypotheses Testing, pages 100-115. Springer, Berlin, 1988.
P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3(4):261-283, 1989.
J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1-30, 2006.
S. Esmeir and S. Markovitch. Anytime learning of decision trees. Journal of Machine Learning Research, 8:891-933, 2007.
U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022-1029. Morgan Kaufmann, 1993.
M. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32:675-701, 1937.
N. García-Pedrajas and C. Fyfe. Immune network based ensembles. Neurocomputing, 70(7-9):1155-1166, 2007.
Y. Hochberg. A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75:800-802, 1988.
Y. Hochberg and D. Rom. Extensions of multiple testing procedures based on Simes' test. Journal of Statistical Planning and Inference, 48:141-152, 1995.
S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65-70, 1979.
G. Hommel. A stagewise rejective multiple test procedure. Biometrika, 75:383-386, 1988.
G. Hommel and G. Bernhard. A rapid algorithm and a computer program for multiple test procedures using logical structures of hypotheses. Computer Methods and Programs in Biomedicine, 43:213-216, 1994.
G. Hommel and G. Bernhard. Bonferroni procedures for logically related hypotheses. Journal of Statistical Planning and Inference, 82:119-128, 1999.
R. L. Iman and J. M. Davenport. Approximations of the critical region of the Friedman statistic. Communications in Statistics, pages 571-595, 1980.
C. Marrocco, R. P. W. Duin, and F. Tortorella. Maximizing the area under the ROC curve by pairwise feature combination. Pattern Recognition, 41:1961-1974, 2008.
G. J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley Series in Probability and Mathematical Statistics, 2004.
J. F. Murray, G. F. Hughes, and K. Kreutz-Delgado. Machine learning methods for predicting failures in hard drives: A multiple-instance application. Journal of Machine Learning Research, 6:783-816, 2005.
P. B. Nemenyi. Distribution-free Multiple Comparisons. PhD thesis, Princeton University, 1963.
M. Núñez, R. Fidalgo, and R. Morales. Learning in environments with unknown dynamics: Towards more robust concept learners. Journal of Machine Learning Research, 8:2595-2628, 2007.
A. B. Owen. Infinitely imbalanced logistic regression. Journal of Machine Learning Research, 8:761-773, 2007.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
D. M. Rom. A sequentially rejective test procedure based on a modified Bonferroni inequality. Biometrika, 77:663-665, 1990.
J. P. Shaffer. Modified sequentially rejective multiple test procedures. Journal of the American Statistical Association, 81(395):826-831, 1986.
J. P. Shaffer. Multiple hypothesis testing. Annual Review of Psychology, 46:561-584, 1995.
D. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. Chapman & Hall/CRC, 2003.
R. J. Simes. An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73:751-754, 1986.
P. H. Westfall and S. S. Young. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. John Wiley and Sons, 2004.
S. P. Wright. Adjusted p-values for simultaneous inference. Biometrics, 48:1005-1013, 1992.
Y. Yang, G. Webb, K. Korb, and K. M. Ting. Classifying under computational resource constraints: Anytime classification using probabilistic estimators. Machine Learning, 69:35-53, 2007a.
Y. Yang, G. I. Webb, J. Cerquides, K. B. Korb, J. Boughton, and K. M. Ting. To select or to weigh: A comparative study of linear combination schemes for superparent-one-dependence estimators. IEEE Transactions on Knowledge and Data Engineering, 19(12):1652-1665, 2007b.
J. H. Zar. Biostatistical Analysis. Prentice Hall, 1999.
