
Error-Correcting Output Coding for Text Classification

Adam Berger
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
aberger@cs.cmu.edu

Abstract

This paper applies error-correcting output coding (ECOC) to the task of document categorization. ECOC, of recent vintage in the AI literature, is a method for decomposing a multiway classification problem into many binary classification tasks, and then combining the results of the subtasks into a hypothesized solution to the original problem. There has been much recent interest in the machine learning community in algorithms which integrate "advice" from many subordinate predictors into a single classifier, and error-correcting output coding is one such technique. We provide experimental results on several real-world datasets, extracted from the Internet, which demonstrate that ECOC can offer significant improvements in accuracy over conventional classification algorithms.

1 Introduction

Error-correcting output coding is a recipe for solving multi-way classification problems. It works in two stages: first, independently construct many subordinate classifiers, each responsible for removing some uncertainty about the correct class of the input; second, apply a voting scheme to decide upon the correct class, given the output of each weak learner. Recent experimental work has shown that ECOC offers improvements over standard multiway classification methods in domains ranging from cloud classification [Aha and Bankert, 1997] to speech synthesis [Bakiri and Dietterich, 1999], and a number of theories have been proposed for its success [James, 1998]. In this paper, we explore the application of error-correcting output coding to document categorization.

The idea of "classifying by consensus" using a large number of independently-constructed classifiers has appeared in a number of other guises recently in the machine learning literature.
The technique of bagging, for instance, involves generating multiple training sets by sampling with replacement, learning a classifier from each generated set, and allowing the learned classifiers to vote on the correct class for an unlabeled object [Breiman, 1996a]. Boosting can be viewed as a special case of bagging where the sampling is adaptive, concentrating on misclassified training instances [Freund and Schapire, 1997]. Voting methods have also been applied to combining multiple neural networks trained on the same data [Perrone, 1993] and to applying different types of classifiers to the same problem [Quinlan, 1993].

Why consensus algorithms work so well in practice is still an open question. As a step in that direction, theoretical work has recently established that combining multiple runs of a classification algorithm can reduce its variance [Breiman, 1996b].

Unlike most voting algorithms, the constituent classifiers in error-correcting output coding aren't all solving the same problem; in fact, they are each solving a distinct binary classification problem. [Kong and Dietterich, 1995] have shown that this property of the ECOC algorithm bestows on it, in addition to the variance-reduction property of all voting methods, the ability to correct for bias in the constituent classifiers.

This paper applies ECOC to the problem of text categorization: given a database of documents, each annotated with a label (or set of labels), learn a mapping from documents to labels. Text categorization by computer, such as the automatic assignment of index terms to medical research papers [Yang and Chute, 1994], has been a central concern in the field of bibliometrics for many years, but the recent flow of online text has increased the interest in and applications for text categorization. Internet-related classification research has addressed the problem of learning to collect interesting postings to electronic discussion groups based on a user's predilections [Lang, 1995], automatically classifying web pages by content [Craven et al.,
1998], and suggesting web pages to a user based on his or her expressed preferences [Pazzani et al., 1996].

We focus here on a restricted version of the general classification problem: namely, we imagine documents have exactly one correct labeling, meaning that the mapping from documents to labels is a function. The databases we employ for experimental purposes in Section 4 have this and an additional convenient characteristic: each label is well represented in the data. Under these conditions, the method of Naive Bayes classification is highly competitive. However, Section 4 demonstrates that in this setting, error-correcting output coding consistently outperforms Naive Bayes. Further experiments reported there suggest that ECOC will be of utility in the sparse-data domain as well.

This paper will proceed as follows. The next section introduces the technique of error-correcting output coding and its application to text classification. An ECOC classifier relies on a binary "coding matrix," and Section 3 discusses some considerations in selecting this matrix. Section 4 describes a series of experiments to validate the claim that ECOC offers improvements on standard classification techniques. Section 5 relates ECOC to Naive Bayes and to k-nearest neighbor, another high-performance classification algorithm, and Section 6 concludes by outlining some directions for future work in ECOC-based text categorization.

2 Error-correcting output coding

We describe the technique of error-correcting output coding with a simple example: the task of classifying newswire articles into the m = 4 categories politics, sports, business, arts. To begin, one assigns a unique n-bit vector to each label (where n >= log2 m):

    label       coding
    politics    0110110001
    sports      0001111100
    business    1010101101
    arts        1000011010

One can view the i-th bitvector as a unique coding for label i. For this reason (and others, which will soon become apparent), we'll refer to the set of bitvectors as a code and denote it by C. The i-th row of C we will write as C_i, and the value of the j-th bit in this row as C_ij.

The second step in constructing an ECOC classifier is to build an individual binary classifier for each column of the code: 10 classifiers in all, in this case. The positive instances for classifier j are documents with a label i for which C_ij = 1.
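The code table and the column-to-superclass mapping can be written down directly; a minimal Python sketch (the variable and function names are ours, not the paper's):

```python
# The 10-bit example code from the text, one row per label.
code = {
    "politics": "0110110001",
    "sports":   "0001111100",
    "business": "1010101101",
    "arts":     "1000011010",
}

def positive_superclass(code, j):
    """Labels i with C_ij = 1, i.e. the positive class for PiC j (1-indexed)."""
    return {label for label, bits in code.items() if bits[j - 1] == "1"}

# PiC 3 must separate {politics, business} from {sports, arts}.
print(sorted(positive_superclass(code, 3)))   # → ['business', 'politics']
```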
The third classifier, for instance, has the responsibility of distinguishing between documents whose label is sports or arts and those whose label is politics or business. Heeding convention, we refer generically to an algorithm for predicting the value of a single bit as a "plug-in classifier" (PiC). A PiC, then, is a predictor of whether a document belongs to some fixed subset of the classes. To summarize, training an ECOC classifier consists of learning a set of n independent binary classifiers.

Algorithm 1: Training an ECOC document classifier
Input: Documents with labelings (with m distinct labels); desired code size n >= log2 m
Output: m-by-n coding matrix C; n classifiers
1. Generate an m-by-n coding matrix C.
2. For j in [1, n]:
   - Construct two superclasses, A_j and B_j: A_j consists of all labels i for which C_ij = 1, and B_j is the complement set.
   - Construct a binary classifier λ_j to distinguish A_j from B_j.

With C in hand, one can hypothesize the correct class of an unlabeled document x as follows. Evaluate each independent classifier on x, generating an n-bit vector Λ(x) = ⟨λ_1(x), λ_2(x), ..., λ_n(x)⟩. Most likely the generated bitvector Λ(x) will not be a row of C, but it will certainly be closer (in Hamming distance Δ, say) to some rows than to others. Categorizing the document involves selecting argmin_i Δ(C_i, Λ(x)), the label i which is closest to Λ(x). (If more than one row of C is equidistant to Λ(x), select one arbitrarily.) For instance, if the generated bitvector is Λ(x) = 1010111101, the document would receive the label business.

Figure 1: Decision boundaries for the first three plug-in classifiers corresponding to the code given above. Clockwise from upper left: all decision boundaries, bit 1, bit 2, bit 3.

To the extent that rows of C are well-spaced in Hamming distance, the classifier will be robust to a few errant PiCs.
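The nearest-row decoding step can be sketched in a few lines; the example below replays the Λ(x) = 1010111101 case from the text (helper names are ours):

```python
code = {
    "politics": "0110110001",
    "sports":   "0001111100",
    "business": "1010101101",
    "arts":     "1000011010",
}

def hamming(a, b):
    """Hamming distance between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))

def decode(code, bits):
    """Return the label whose codeword is closest to the generated bitvector."""
    return min(code, key=lambda label: hamming(code[label], bits))

# Λ(x) = 1010111101 is no row of C, but it lies within Hamming distance 1
# of the 'business' row, so the document gets that label: one errant PiC
# did not cause a misclassification.
print(decode(code, "1010111101"))   # → business
```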
This is the idea behind error-correcting codes as well: to transmit a point in the n-dimensional cube reliably over a noisy channel, map it to one of a set of well-separated "fixed points" in a higher-dimensional cube; to recover the original point, find the closest fixed point to the point actually received and take its preimage in the original cube.

In general, λ_j(x) may not be a binary value, but a real-valued probability measuring the classifier's confidence that document x belongs in the j-th superclass. In this case, one can search for the nearest neighbor according to some other distance, rather than Hamming distance. In the experiments reported in Section 4, the plug-in classifiers output a probability, and we compute the nearest neighbor according to L1 distance.

Algorithm 2: Applying an ECOC document classifier
Input: Trained ECOC classifier: m-by-n coding matrix C and n classifiers; unlabeled document x
Output: Hypothesized label for x
1. For j in [1, n]: compute λ_j(x), the confidence with which PiC j believes x belongs to superclass A_j.
2. Calculate Δ(Λ(x), C_i) = Σ_{j=1}^{n} |λ_j(x) − C_ij| for i in [1, m].
3. Output argmin_i Δ(Λ(x), C_i).

2.1 The Naive Bayes classifier

The PiC we relied on most heavily in constructing ECOC classifiers is the Naive Bayes classifier [Lewis, 1998]. Naive Bayes assumes that a document is generated by selecting a label y according to a prior distribution p(y), and then independently selecting words for the document according to a distribution p(w | y). The probability of generating a document x = {w_1, w_2, ..., w_N} of N words from label y is thus

    p(x | y) = Π_{i=1}^{N} p(w_i | y).    (1)

Used for prediction, the Naive Bayes classifier selects for an unlabeled document x the most likely label given x:

    y*(x) = argmax_y p(y | x) = argmax_y p(y) p(x | y) = argmax_y p(y) Π_{i=1}^{N} p(w_i | y),    (2)

where the first equality follows from Bayes' Law.

2.2 Why should ECOC classification work?

Some standard classification algorithms, such as back-propagation [Rumelhart et al., 1986], are best suited to distinguishing between two outcomes. A natural way to combine such algorithms to predict from among m outcomes is to construct m independent predictors, assigning predictor i the task of deciding whether the i-th outcome obtains. To build the classifier, construct m individual classifiers, where the positive examples for classifier i are those documents with label i. To apply the classifier to an unlabeled document x, select argmax_i λ_i(x), the label whose classifier produces the highest score. This is what some call the one-versus-rest strategy. This method is a special case of ECOC classification where C is the m-by-m identity matrix.

To see why one might expect ECOC classification to outperform the one-vs.-rest approach, consider the problem of learning to classify fruit.
Imagine that within the labeled set of examples used to train the m individual one-vs.-rest classifiers, the only yellow fruit are bananas. So λ_banana will learn a strong association between yellow color and bananas. Now provide a yellow grapefruit to the trained one-vs.-rest classifier. The value of λ_grapefruit will likely be close to one; after all, the object in question is round and grapefruit-sized, despite not being red like all the grapefruits encountered in training. But the value of λ_banana will also be very close to one, and the system will misclassify the object as a banana. ECOC classification is less "brittle" than the one-vs.-rest approach: the distributed output representation means one errant subordinate classifier won't necessarily result in a misclassification. This is a circuitous way of saying that ECOC reduces the variance of the individual classifiers.

Many classification algorithms, including decision trees, exponential models, and neural networks, have the capability to directly perform multiway (m > 2) classification. A reasonable classification strategy with these algorithms is to construct a single, monolithic classifier. But the monolithic classifier faces a difficult task. Imagining the m classes as clouds in a large-dimensional feature space, a single classifier must learn all the decision boundaries simultaneously, whereas each PiC of an ECOC classifier learns only a relatively small number of decision boundaries at once.
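The Naive Bayes classifier of Section 2.1, used later as the main PiC, can be sketched as follows; the log-space computation and the add-one smoothing are our choices, not specified in the paper:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes: maximize p(y) * prod_i p(w_i | y) in log space."""

    def fit(self, docs, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)   # word_counts[y][w]
        self.vocab = set()
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc)
            self.vocab.update(doc)
        return self

    def predict(self, doc):
        n_docs = sum(self.label_counts.values())
        best, best_score = None, -math.inf
        for y, c in self.label_counts.items():
            score = math.log(c / n_docs)          # log p(y)
            total = sum(self.word_counts[y].values())
            for w in doc:
                # add-one smoothed log p(w | y); the smoothing is our assumption
                score += math.log((self.word_counts[y][w] + 1)
                                  / (total + len(self.vocab)))
            if score > best_score:
                best, best_score = y, score
        return best

nb = NaiveBayes().fit([["ball", "game"], ["vote", "senate"]], ["sports", "politics"])
print(nb.predict(["game", "ball", "ball"]))   # → sports
```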
Moreover, (assuming n is sufficiently large) an ECOC classifier learns each boundary many times, and is forgiving if a few PiCs place the input on the wrong side of some decision boundaries [Kong and Dietterich, 1995].

3 Choosing a good code

Early work on error-correcting output coding looked to algebraic coding theory, and in particular to the family of linear codes, for a coding matrix C. An n-bit linear error-correcting code, a subspace of the vertices of the n-dimensional cube, can be defined as the span of an n-column binary matrix called a generator matrix. Error-correcting codes are often measured by the minimum distance between any two linear combinations of generator rows. BCH codes [MacWilliams and Sloane, 1977], a popular class of linear algebraic error-correcting codes, have the useful property that their codewords (all different linear combinations of rows of the generator matrix) are well separated. Using such a matrix for ECOC classification is for this reason an attractive possibility, and some ECOC classification work has used BCH codes as a coding matrix.

However, subsequent ECOC work has established that ECOC classification should perform well when the coding matrix is constructed randomly: specifically, by choosing each entry C_ij uniformly at random from {0, 1}. This section provides some statistical and combinatorial arguments for why this should be the case. Section 3.1 summarizes some results from [James, 1998], and Section 3.2 is new.
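Generating the coding matrix randomly, as discussed above, is a one-liner, and the identity-matrix special case recovers one-vs.-rest; a sketch (function names are ours):

```python
import random

def random_code(m, n, seed=0):
    """m-by-n coding matrix with each entry drawn uniformly from {0, 1}."""
    rng = random.Random(seed)
    return [[rng.randrange(2) for _ in range(n)] for _ in range(m)]

def identity_code(m):
    """The m-by-m identity matrix: ECOC then reduces to one-vs.-rest."""
    return [[1 if i == j else 0 for j in range(m)] for i in range(m)]

C = random_code(m=4, n=10)
print(len(C), len(C[0]))   # → 4 10
```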


3.1 A statistical perspective

Definition: Given a database D of (document, label) pairs (x, y) with empirical distribution p̃(x, y), the Bayes classifier is B(x) = argmax_y p̃(y | x).

The Bayes classifier assigns to a document x the label y which appears most often in the database with x. In terms of classification accuracy on D, the Bayes classifier is the best possible strategy. In the present setting, it is reasonable to assume documents don't occur multiple times with different labels in the collection, and so the Bayes classifier simply selects the label of the document in D.

During the training phase, all document labels are available, and so we have access to the Bayes classifier. But in applying the classifier we do not. Yet the Bayes classifier will still turn out to be a useful concept, as the following definition and theorem from [James, 1998] suggest.

Definition: A classification algorithm built from subordinate classifiers is Bayes consistent if, whenever the subordinate classifiers are Bayes classifiers, so too is the combined classifier.

Loosely speaking, a Bayes consistent classifier constructed from accurate PiCs will be accurate. This is a property one would like to achieve in an ECOC classifier. The next theorem states the conditions under which this is achievable.

Theorem 1: Assuming C was constructed randomly, the ECOC classifier becomes Bayes consistent as n → ∞.

This theorem is not saying that with enough bits, an ECOC classifier will do arbitrarily well. Consistency of an ECOC classifier doesn't guarantee correctness, since the PiCs aren't themselves producing Bayes estimates. Still, this theorem suggests why random construction of C performs well.

3.2 A combinatorial perspective

The example code presented earlier has the unfortunate property that the third and tenth columns are equal. Therefore, the corresponding classifiers will learn precisely the same task. This is a permissible situation, though hardly desirable. Not permissible is when two rows of C are equal, for then the code cannot distinguish between the corresponding labels.
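Row separation is easy to measure directly; a sketch computing the minimum pairwise Hamming distance of the example code from Section 2 (helper names are ours):

```python
from itertools import combinations

# Rows of the example code from Section 2.
rows = ["0110110001", "0001111100", "1010101101", "1000011010"]

def min_distance(rows):
    """Minimum pairwise Hamming distance between codewords."""
    return min(sum(a != b for a, b in zip(r, s))
               for r, s in combinations(rows, 2))

d = min_distance(rows)
# Δ_min, and the number of errant binary PiCs the code can always correct.
print(d, (d - 1) // 2)   # → 5 2
```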
Fortunately, for a randomly-generated binary code with sufficiently many columns, the probability of such an event is minuscule: for a code with m labels and n bits, the probability that two rows coincide is

    1 − Π_{i=1}^{m−1} (1 − i/2^n),

which is one for n < log2 m but approaches zero quickly thereafter as n increases.

More generally, if two rows in C are close in Hamming distance, an ECOC classifier built from C is apt to confuse the corresponding labels. We'll write Δ(C_i, C_j) for the Hamming distance between rows i and j of C, and Δ_min for the minimum distance between any two codewords. If the PiCs produce binary outputs, then the ECOC classifier can always recover from at least ⌊(Δ_min − 1)/2⌋ incorrect PiC outputs. The following theorem is a statement about how much row separation one can possibly hope for in a coding matrix.

Theorem 2: For any m-by-n binary matrix C, there exist two rows which differ in at most nm/(2(m − 1)) bits.

Proof: Let d be the minimum distance between any two rows of such a matrix. Select rows i, j ∈ [1, m] with replacement, then select a column k ∈ [1, n]. The probability that C_ik ≠ C_jk is at least ((m − 1)/m)(d/n). Now select a column k ∈ [1, n], and then select rows i, j ∈ [1, m] with replacement. The probability that C_ik ≠ C_jk is no greater than 1/2. Combining these inequalities to solve for d gives the result.

This shows that, as m becomes large, a relative spacing of one half is optimal. If we consider only square matrices, there exists an explicit construction which achieves this bound, namely the Hadamard matrix. For general matrices we are not aware of an explicit construction meeting this bound, but the following result suggests that a random construction is likely to have good separation.

Theorem 3: Define a well row-separated m-by-n binary matrix as one in which all pairs of rows have relative Hamming separation at least 1/2 − √((3 log m)/(2n)). The probability that a randomly-constructed binary matrix is not well row-separated is at most 1/m.

Proof: Given a randomly-constructed C, fix two different rows C_i and C_j. For k ∈ [1, n], define the random variable X_k as +1 if C_ik ≠ C_jk, and −1 otherwise. Let Z = Σ_{k=1}^{n} X_k. For a randomly-constructed C, E[Z] = 0, which corresponds to an n/2 Hamming distance between the rows. We are interested in the probability that Z falls far below 0. Using Chernoff bounds, Pr[Z < −√(6n log m)] ≤ e^{−3 log m} = m^{−3}. There are m rows in C, hence fewer than m² pairs of rows, and so the probability that some pair of rows is too close is at most m² · m^{−3} = 1/m.
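The duplicate-row probability above can be checked numerically; a sketch of the birthday-style product (function name is ours; this illustrates only the duplicate-row case, not the constants of Theorem 3):

```python
def duplicate_row_prob(m, n):
    """Probability that a random m-by-n binary code has two equal rows:
    one minus the probability that all m rows, drawn uniformly from the
    2^n possible rows, are distinct (the birthday-problem product)."""
    p_distinct = 1.0
    for i in range(1, m):
        p_distinct *= max(0.0, 1.0 - i / 2 ** n)
    return 1.0 - p_distinct

# For m = 4 labels: a collision is certain when n < log2(m),
# and the probability decays rapidly as n grows.
for n in (1, 2, 5, 10, 20):
    print(n, round(duplicate_row_prob(4, n), 6))
```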


Although attention in the ECOC literature has generally concentrated on finding C with good row separation, a perhaps equally important desideratum is large separation between columns. Columns that are close give rise to classifiers which are performing nearly the same task; in the extreme case, equal columns correspond to identical classifiers. With only a slight change, Theorem 3 shows that random matrices are likely to have good column separation as well, providing another justification for constructing a code randomly.

In practice, large column separation in C is not quite sufficient to ensure good performance, because of a degeneracy inherent in binary classification. Many classification algorithms treat 0 and 1 symmetrically, and so if two columns of C are complementary (or nearly so), the corresponding PiCs will learn identical (or nearly identical) classification tasks. What we really want, then, is a matrix whose rows are pairwise well-separated, but not too well-separated. The following corollary to Theorem 3 shows that a randomly-selected matrix is, asymptotically, very likely to have this property.

Corollary: Define a strongly well-separated m-by-n binary matrix as a matrix any two rows of which have relative Hamming separation in the range 1/2 ± √((3 log m)/(2n)). The probability that a randomly-constructed binary matrix is not strongly well row-separated is at most 2/m.

4 Experimental results

We applied error-correcting output coding classification to four real-world text collections, all extracted from the Internet. All corpora were subject to the same preprocessing: remove punctuation, convert dates, monetary amounts, and numbers to canonical forms, map all words to uppercase, and remove words occurring twice or less. Table 1 summarizes some salient characteristics of these datasets.

20 Newsgroups: This is a collection of about 20,000 documents, culled from postings to 20 Usenet discussion groups [Lang, 1995]. The documents are approximately evenly distributed among the 20 labels.
Four universities: This (misnamed) dataset contains web pages gathered from a large number of university computer science departments [Craven et al., 1998]. The pages were manually classified into the categories course, department, faculty, staff, student, project, other. (The 20 newsgroups and four universities datasets are publicly available at www.cs.cmu.edu/~textlearning.)

Yahoo science: Following [Baker and McCallum, 1998], we automatically extracted the entire Yahoo science hierarchy in early 1999, and formed a labeled collection containing 41 classes by collapsing the hierarchy to the first level.

Yahoo health: This corpus was collected in the same way as the science collection, but has rather different characteristics. In particular, many of its 36 classes are highly confusable, presenting a difficult task for classification algorithms. For instance, three of the labels in this collection are Health Administration, Hospitals And Medical Centers, and Health Care.

    collection       documents   labels   words
    20 newsgroups    19997       20       60915
    universities     8263        7        29004
    yahoo science    10158       41       69939
    yahoo health     5625        36       48110

Table 1: Particulars on the four training datasets used. Each dataset was partitioned five separate times into a training/test split, and the numbers are statistics from the last of these trials. The last column reports the number of distinct words in the collection, excluding those appearing once or twice.

Figure 2 plots ECOC classification accuracy against code size n for these four corpora. The codes were constructed by selecting entries uniformly at random from {0, 1}, except in the case of the four universities dataset, for which the columns of C were a random permutation of the 126 unique, non-trivial 7-bit vectors. The plots also display the results of standard Naive Bayes classification.

From an implementation standpoint, a larger value of n incurs a penalty in speed. (This may be an issue in high-throughput systems such as text filtering systems designed to route relevant news articles to many users, each with their own preferences.)
However, Figure 2 suggests that, up to a point, larger values of n offer more accurate classification. And beyond that point, accuracy doesn't tail off, as is the case in many other learning algorithms for classification, which are prone to overfitting when the number of parameters becomes large.

The four universities dataset was the only collection on which ECOC classification didn't significantly outperform Naive Bayes one-vs.-rest classification. The ECOC classifier's performance on this collection is almost poignant: error rate steadily decreases until n = 126, at which point there simply are no more unused, non-trivial 7-bit columns to add to C.

In the collections we are considering, each label is well-represented in the data and the models p(w | y) can be well estimated. In this setting the standard Naive Bayes method is highly competitive [Lewis, 1998]. For this reason, we used a Naive Bayes classifier as the PiC in the ECOC classifiers corresponding to Figure 2. However, on datasets with poorly-represented labels, Naive Bayes can starve for lack of data. With an eye towards such collections, we explored using a feature-based classification approach as the ECOC PiC.

Figure 2: Performance of ECOC classification as a function of code size. Naive Bayes classifiers served as the PiCs. Each point reflects an average over five trials, each with a randomized C and a randomized training/test split of the data; the bars measure the standard deviation over these trials. The horizontal line is the behavior of standard one-vs.-rest Naive Bayes.

Specifically, we trained binary decision trees to predict the individual bits in an ECOC code; the questions at the nodes of each tree were of the form Did word w appear in the document? We do not expect such a classifier to match the best reported performance on this dataset, since this algorithm only considers whether a word occurs in a document and not how often. However, Figure 3 does suggest that for sufficiently high n, combining decision trees into an ECOC classifier improves performance over a one-vs.-rest decision tree approach, which augurs well for the application of ECOC to larger, sparse datasets.

Truly meaningful values of n lie in the range [log2 m, 2^m]. A code of size n < log2 m cannot even assign a distinct bitvector to each label; at the other extreme, a code of size n > 2^m must contain duplicate columns, which corresponds to two PiCs learning the same task. (A tighter upper bound is actually (2^m − 1)/2: the −1 comes about since the all-zero vector corresponds to a trivial classifier, and the denominator arises from the degeneracy mentioned above.)

Figure 3: Performance of ECOC classification as a function of code size, for a decision tree PiC with a Bernoulli event model which takes no account of multiple appearances of a word in a document. Each point reflects a single trial using a randomized training/test partition of the 20 newsgroups collection. The horizontal line is the one-vs.-rest decision tree performance.
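The column-counting arguments are easy to verify; a sketch (names are ours) enumerating the 126 non-trivial 7-bit columns used for the four-universities code, and the complement degeneracy noted in the text:

```python
from itertools import product

def nontrivial_columns(m):
    """All m-bit column vectors except the two trivial ones (all-zero and
    all-one), whose superclass is empty on one side."""
    return [bits for bits in product((0, 1), repeat=m)
            if 0 < sum(bits) < m]

cols = nontrivial_columns(7)
print(len(cols))   # → 126, the pool used for the four-universities dataset

# Under the 0/1 symmetry of many PiCs, a column and its complement induce
# the same binary task, halving the number of genuinely distinct PiCs.
distinct = {min(bits, tuple(1 - b for b in bits)) for bits in cols}
print(len(distinct))   # → 63
```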


5 Discussion

The results of the previous section suggest that, up to a point, classifier performance improves with n. A simple calculation shows why this should be so. Assume for the moment that the PiCs only output binary values, and that the errors committed by any two PiCs are independent of one another. Denote by p_i the probability of error by the i-th PiC, and let p = max_i p_i. If the minimum distance of C is Δ_min, then classification is robust to any ⌊(Δ_min − 1)/2⌋ or fewer errors, and so the probability of correct classification, as a function of n, is at least

    Σ_{i=0}^{⌊(Δ_min − 1)/2⌋} (n choose i) p^i (1 − p)^{n−i}.    (3)

The quantity on the right, the leading terms of the binomial expansion of (p + (1 − p))^n, is monotonically increasing in Δ_min, which itself increases with n for a randomly-constructed code.

Section 4 shows that in practice, performance eventually plateaus, which means that the assumption that the errors are uncorrelated is false. This is hardly surprising: after all, the individual classifiers were trained on the same data. One would expect a correlation between, for instance, the second and third columns of the code presented in Section 2.

5.1 Relation to Naive Bayes

We have already seen that the one-vs.-rest strategy is a special case of ECOC classification. It is not difficult to see that the standard Naive Bayes approach is an implementation of ECOC classification. Notice that Naive Bayes is clearly a one-vs.-rest technique: predicting from among m classes requires constructing m classifiers (each consisting of a prior p(y) and a class-specific distribution p(w | y)), and selecting a label via (2). But this just amounts to using as a code the m-by-m identity matrix, and then applying Algorithm 2 using an L1 norm.

5.2 Relation to k-nearest neighbor

A popular approach to text classification, particularly competitive for very large and sparse datasets, is k-nearest neighbor (kNN). kNN relies on a map φ from documents to d-dimensional vectors; the entries of the latter may be word counts or, more generally, a list of feature values.
A kNN classifier stores the images {φ(x)} of all training-set documents in a database. To classify an unlabeled document x, kNN finds the k vectors in the database closest to φ(x), and takes a weighted vote of their labels.

kNN and ECOC have some superficial similarities. Both use for classification a data structure consisting of a set of vectors, and both search this data structure using a nearest-neighbor algorithm, linear in the size of the data structure. One distinction, of particular importance when the size of the training set becomes large, is that while ECOC's data structure consists of a single vector for each label, kNN must store a vector for each document in the training set.

6 Conclusion

We have described an application of error-correcting output coding to the problem of automatic text categorization. The recent explosion in availability of online text lends an extra importance, if not urgency, to this problem, and also suggests a source of experimental data. In fact, the experiments reported in Section 4 were all conducted on data gathered from the Internet. Those experiments offer compelling empirical evidence for the effectiveness of ECOC in text categorization.

This paper reports just some initial proof-of-concept experiments. There is yet much unexplored terrain, and it is our belief that coding theory has more to say about classification. For instance, a useful class of error-correcting codes for digital transmission is erasure codes, which are robust to some fraction of lost bits. If the PiCs produce probabilities, then one could view a classifier which is sufficiently indecisive (λ_j(x) ≈ 1/2) as a "lost bit"; an ECOC classifier containing erasure-correcting machinery could ignore λ_j in attempting to recover the label of the document.

Although we have presented evidence suggesting the benefits of random codes, there are settings in which one would expect a structured code to be preferable. For instance, performing a nearest-neighbor search in a high-dimensional space can be expensive, prohibitively so for high-throughput systems.
However, one might still be able to reap the benefits of high-n error-correcting output coding without actually conducting the full search. Using a deterministic code with some structure, like a BCH code, may allow the user to replace the Θ(nm) exhaustive search with a Θ(n) search, at a slight cost in accuracy. For just this reason, real-world digital encoding/decoding systems, such as modems, CD players, satellites, and digital cell phones, rely on structured codes.

Furthermore, the theoretical arguments in favor of random codes are predicated on the assumption, untenable in most real-world data, that the errors made by the individual predictors are uncorrelated. In fact, textual data often contain strong correlations, which a classifier ignores at its own peril. For instance, the astronomy and space classes in the Yahoo science category have a strong overlap in word usage, evidenced by the confusion matrices of classifiers we have constructed on this data. A promising direction for improvement is to combine the ECOC approach with some form of word or document clustering, designing a code which captures the inherent "clumpiness" of the data. In particular, a well-engineered code could reflect a hierarchical decomposition of the problem: first determine if the document belongs to either astronomy or space, and only then decide which of these classes is most appropriate.

Acknowledgments

The author thanks Tom Dietterich, Adam Kalai, John Lafferty, and Kamal Nigam for suggestions on an early draft, and the "theory lunch" group at CMU for suggestions leading to the material in Section 3.2. This research was supported in part by an IBM Cooperative Fellowship.

References

[Aha and Bankert, 1997] D. Aha and R. Bankert. Cloud classification using error-correcting output codes. Artificial Intelligence Applications: Natural Resources, Agriculture, and Environmental Science, 11:1:13–28, 1997.

[Baker and McCallum, 1998] D. Baker and A. McCallum. Distributional clustering for text classification. In Proceedings of SIGIR, 1998.

[Bakiri and Dietterich, 1999] G. Bakiri and T. Dietterich. Achieving high-accuracy text-to-speech with machine learning. Data Mining in Speech Synthesis, 1999.

[Breiman, 1996a] L. Breiman. Bagging predictors. Machine Learning, 26:2:123–140, 1996.

[Breiman, 1996b] L. Breiman. Bias, variance, and arcing classifiers. Technical Report TR-460, Statistics Department, Stanford University, 1996.

[Craven et al., 1998] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), 1998.

[Freund and Schapire, 1997] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[James and Hastie, 1997] G. James and T. Hastie. The error coding method and PiCTs. Journal of Computational and Graphical Statistics, 7:3:377–387, 1997.

[James, 1998] G. James. Majority Vote Classifiers: Theory and Applications. PhD thesis, Stanford University, 1998.

[Kong and Dietterich, 1995] E. Kong and T. Dietterich. Error-correcting output coding corrects bias and variance. In Proceedings of the 12th International Conference on Machine Learning, pages 313–321, 1995.

[Lang, 1995] K. Lang. NewsWeeder: Learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning, pages 331–339, 1995.

[Lewis, 1998] D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. In Proceedings of the European Conference on Machine Learning, 1998.

[MacWilliams and Sloane, 1977] F. MacWilliams and N. Sloane. The Theory of Error-Correcting Codes. North-Holland, Amsterdam, The Netherlands, 1977.

[Pazzani et al., 1996] M. Pazzani, J. Muramatsu, and D. Billsus. Syskill & Webert: Identifying interesting web sites. In Proceedings of the National Conference on Artificial Intelligence, 1996.

[Perrone, 1993] M. Perrone. Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization. PhD thesis, Brown University, 1993.

[Quinlan, 1993] J. Quinlan. Combining instance-based and model-based learning. In Proceedings of the International Conference on Machine Learning. Morgan Kaufmann, 1993.

[Rumelhart et al., 1986] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.

[Yang and Chute, 1994] Y. Yang and C. Chute. An application of expert network to clinical classification and Medline indexing. In Proceedings of the 18th Annual Symposium on Computer Applications in Medical Care (SCAMC'94), volume 18 (Symp. Suppl.), pages 157–161, 1994.

cmuedu Abstract This pap er applies errorcorrecting output co d ing ECOC to the task of do cumen cate gorization ECOC of recen vin tage in the AI literature is metho for decomp osing ultiw classi64257cation problem in to man bi nary classi64257cation ID: 23359

- Views :
**201**

**Direct Link:**- Link:https://www.docslides.com/tatiana-dople/errorcorrecting-output-co-ding
**Embed code:**

Download this pdf

DownloadNote - The PPT/PDF document "ErrorCorrecting Output Co ding for ext C..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.

Page 1

Error-Correcting Output Coding for Text Classification

Adam Berger
School of Computer Science
Carnegie Mellon University
Pittsburgh 15213
aberger@cs.cmu.edu

Abstract

This paper applies error-correcting output coding (ECOC) to the task of document categorization. ECOC, of recent vintage in the AI literature, is a method for decomposing a multiway classification problem into many binary classification tasks, and then combining the results of the subtasks into a hypothesized solution to the original problem. There has been much recent interest in the machine learning community in algorithms which integrate "advice" from many subordinate predictors into a single classifier, and error-correcting output coding is one such technique. We provide experimental results on several real-world datasets, extracted from the Internet, which demonstrate that ECOC can offer significant improvements in accuracy over conventional classification algorithms.

1 Introduction

Error-correcting output coding is a recipe for solving multi-way classification problems. It works in two stages: first, independently construct many subordinate classifiers, each responsible for removing some uncertainty about the correct class of the input; second, apply a voting scheme to decide upon the correct class, given the output of each weak learner. Recent experimental work has shown that ECOC offers improvements over standard k-way classification methods in domains ranging from cloud classification [Aha and Bankert, 1997] to speech synthesis [Bakiri and Dietterich, 1999], and a number of theories have been proposed for its success [James, 1998]. In this paper, we explore the application of error-correcting output coding to document categorization. The idea of "classifying by consensus" using a large number of independently-constructed classifiers has appeared in a number of other guises recently in the machine learning literature.
The technique of bagging, for instance, involves generating multiple training sets by sampling with replacement, learning a classifier from each generated set, and allowing the learned classifiers to vote on the correct class for an unlabeled object [Breiman, 1996a]. Boosting can be viewed as a special case of bagging where the sampling is adaptive, concentrating on misclassified training instances [Freund and Schapire, 1997]. Voting methods have also been applied to combining multiple neural networks trained on the same data [Perrone, 1993] and to applying different types of classifiers to the same problem [Quinlan, 1993].

Why consensus algorithms work so well in practice is still an open question. As a step in that direction, theoretical work has recently established that combining multiple runs of a classification algorithm can reduce its variance [Breiman, 1996b]. Unlike most voting algorithms, the constituent classifiers in error-correcting output coding aren't all solving the same problem; in fact, they are each solving a distinct binary classification problem. [Kong and Dietterich, 1995] have shown that this property of the ECOC algorithm bestows on it, in addition to the variance-reduction property of all voting methods, the ability to correct for bias in the constituent classifiers.

This paper applies ECOC to the problem of text categorization: given a database of documents, each annotated with a label or set of labels, learn a mapping from documents to labels. Text categorization by computer (such as the automatic assignment of index terms to medical research papers [Yang and Chute, 1994]) has been a central concern in the field of bibliometrics for many years, but the recent flow of online text has increased the interest in and applications for text categorization. Internet-related classification research has addressed the problem of learning to collect interesting postings to electronic discussion groups based on a user's predilections [Lang, 1995], automatically classifying web pages by content [Craven et al., 1998],
and suggesting web pages to a user based on his or her expressed preferences [Pazzani et al., 1996].

We focus here on a restricted version of the general classification problem: namely, we imagine documents have exactly one correct labeling, meaning that the mapping is a function. The databases we employ for experimental purposes in Section 4 have this and an additional convenient characteristic: each label is well represented in the data. Under these conditions, the method of Naive Bayes classification is highly competitive. However, Section 4 demonstrates that in this setting, error-correcting output coding consistently outperforms Naive Bayes. Further experiments reported there suggest that ECOC will be of utility in the sparse-data domain as well.

This paper will proceed as follows. The next section introduces the technique of error-correcting output coding and its application to text classification. An ECOC classifier relies on a binary "coding matrix," and Section 3 discusses some considerations in selecting this matrix. Section 4 describes a series of experiments to validate the claim that ECOC offers improvements on standard classification techniques. Section 5 relates ECOC to Naive Bayes and to k-nearest neighbor, another high-performance classification algorithm, and Section 6 concludes by outlining some directions for future work in ECOC-based text categorization.

2 Error-correcting output coding

We describe the technique of error-correcting output coding with a simple example: the task of classifying newswire articles into the m = 4 categories politics, sports, business, arts. To begin, one assigns a unique n-bit vector to each label (here n = 10; in general n >= log2 m):

    label       coding
    politics    0110110001
    sports      0001111100
    business    1010101101
    arts        1000011010

One can view the i-th bitvector as a unique coding for label i. For this reason (and others, which will soon become apparent), we'll refer to the set of bitvectors as a code and denote it by C. The i-th row of C we will write as C_i, and the value of the j-th bit in this row as C_ij.

The second step in constructing an ECOC classifier is to build an individual binary classifier for each column of the code (10 classifiers in all, in this case). The positive instances for classifier j are documents with label i for which C_ij = 1.
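To make the column-to-task correspondence concrete, here is a minimal sketch; the codewords are the example table above, and the function name `superclass` is my own choice:

```python
# Each column j of the coding matrix induces one binary task:
# labels whose j-th code bit is 1 form the positive "superclass",
# the remaining labels form the negative class.
CODE = {
    "politics": "0110110001",
    "sports":   "0001111100",
    "business": "1010101101",
    "arts":     "1000011010",
}

def superclass(code, j):
    """Return the set of labels forming the positive class for column j (0-based)."""
    return {label for label, bits in code.items() if bits[j] == "1"}
```

For the paper's example, column 3 (index 2) pits {politics, business} against {sports, arts}, exactly the task assigned to the third classifier in the text.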
The third classifier, for instance, has the responsibility of distinguishing between documents whose label is sports or arts and those whose label is politics or business. Heeding convention, we refer generically to an algorithm for predicting the value of a single bit as a "plug-in classifier" (PiC). A PiC, then, is a predictor of whether a document belongs to some fixed subset of the classes.

To summarize, training an ECOC classifier consists of learning a set of n independent binary classifiers:

Algorithm 1: Training an ECOC document classifier
  Input:  documents; labelings (with m distinct labels); desired code size n >= log2 m
  Output: m-by-n coding matrix C; n classifiers
  1. Generate an m-by-n coding matrix C.
  2. For j in [1, n]:
     (a) Construct two superclasses: S_j, consisting of all labels i for which C_ij = 1, and its complement.
     (b) Construct a binary classifier to distinguish S_j from its complement.

With these classifiers in hand, one can hypothesize the correct class of an unlabeled document d as follows. Evaluate each independent classifier on d, generating an n-bit vector Λ(d). Most likely the generated bitvector Λ(d) will not be a row of C, but it will certainly be closer (in Hamming distance Δ, say) to some rows than to others. Categorizing the document involves selecting argmin_i Δ(C_i, Λ(d)), the label i whose codeword is closest to Λ(d). (If more than one row of C is equidistant to Λ(d), select one arbitrarily.) For instance, if the generated bitvector is Λ(d) = 1010111101, the document would receive the label business.

Figure 1: Decision boundaries for the first three plug-in classifiers corresponding to the code given above. Clockwise from upper left: all decision boundaries, bit 1, bit 2, bit 3. [Figure omitted: four panels over the classes politics, arts, business, sports.]

To the extent that rows of C are well-spaced in Hamming distance, the classifier will be robust to a few errant PiCs.
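The nearest-codeword step just described can be sketched directly; `decode` is my name for it. Using L1 distance makes the same function cover the real-valued confidences discussed below, since L1 distance reduces to Hamming distance on binary vectors:

```python
CODE = {
    "politics": "0110110001",
    "sports":   "0001111100",
    "business": "1010101101",
    "arts":     "1000011010",
}

def decode(lam, code=CODE):
    """Return the label whose codeword is nearest to the PiC outputs lam.

    lam may be a string of bits or a sequence of real-valued confidences;
    L1 distance equals Hamming distance when the entries are binary.
    """
    def dist(bits):
        return sum(abs(float(l) - int(b)) for l, b in zip(lam, bits))
    return min(code, key=lambda label: dist(code[label]))
```

The paper's worked example checks out: the bitvector 1010111101 is one bit away from the codeword for business and at least four bits away from every other codeword.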
This is the idea behind error-correcting codes as well: to transmit a point in a k-dimensional cube reliably over a noisy channel, map it to one of a set of well-separated "fixed points" in a higher-dimensional cube; to recover the original point, find the fixed point closest to the point actually received and take its preimage in the original cube.

In general, Λ_j(d) may not be a binary value, but a real-valued probability measuring the j-th classifier's confidence that document d belongs in the j-th superclass. In this case, one can search for the nearest neighbor according to some distance other than Hamming distance.


In the experiments reported in Section 4, the plug-in classifiers output a probability, and we compute the nearest neighbor according to L1 distance:

Algorithm 2: Applying an ECOC document classifier
  Input:  trained ECOC classifier: m-by-n coding matrix C and n classifiers; unlabeled document d
  Output: hypothesized label for d
  1. For j in [1, n]: compute Λ_j(d), the confidence with which PiC j believes d belongs to S_j.
  2. Calculate Δ(Λ(d), C_i) = Σ_{j=1}^{n} |Λ_j(d) − C_ij| for i in [1, m].
  3. Output argmin_i Δ(Λ(d), C_i).

2.1 The Naive Bayes classifier

The PiC we relied on most heavily in constructing ECOC classifiers is the Naive Bayes classifier [Lewis, 1998]. Naive Bayes assumes that a document is generated by selecting a label ℓ according to a prior distribution p(ℓ), and then independently selecting words for the document according to a distribution p(w | ℓ). The probability of generating a document d = {w_1, w_2, ..., w_n} of n words from label ℓ is thus

    p(d | ℓ) = ∏_{i=1}^{n} p(w_i | ℓ)    (1)

Used for prediction, the Naive Bayes classifier selects for an unlabeled document d the most likely label given d:

    argmax_ℓ p(ℓ | d) = argmax_ℓ p(d | ℓ) p(ℓ) = argmax_ℓ p(ℓ) ∏_{i=1}^{n} p(w_i | ℓ)    (2)

where the first equality follows from Bayes' Law.

2.2 Why should ECOC classification work?

Some standard classification algorithms, such as back-propagation [Rumelhart et al., 1986], are best suited to distinguishing between two outcomes. A natural way to combine such algorithms to predict from among m outcomes is to construct m independent predictors, assigning predictor i the task of deciding whether the i-th outcome obtains. To build the classifier, construct m individual classifiers, where the positive examples for classifier i are those documents with label i. To apply the classifier to an unlabeled document d, select argmax_i Λ_i(d), the label whose classifier produces the highest score. This is what some call the "one versus rest" strategy. This method is a special case of ECOC classification where C is the m-by-m identity matrix. To see why one might expect ECOC classification to outperform the one-vs.-rest approach, consider the problem of learning to classify fruit.
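Before the fruit example, here is a minimal log-space sketch of the Naive Bayes classifier of equations (1) and (2). This is an illustration, not the paper's implementation; the toy training data and the add-one smoothing are my own assumptions, since the paper does not specify a smoothing scheme:

```python
# Multinomial Naive Bayes: score(l) = log p(l) + sum_i log p(w_i | l),
# which implements argmax_l p(l) * prod_i p(w_i | l) from equation (2).
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (word_list, label) pairs. Returns the fitted model."""
    labels = Counter(label for _, label in docs)
    counts = {l: Counter() for l in labels}
    for words, label in docs:
        counts[label].update(words)
    vocab = {w for words, _ in docs for w in words}
    log_prior = {l: math.log(c / len(docs)) for l, c in labels.items()}
    # Add-one smoothing (my assumption) keeps unseen words from zeroing p(d | l).
    log_pwl = {
        l: {w: math.log((counts[l][w] + 1) / (sum(counts[l].values()) + len(vocab)))
            for w in vocab}
        for l in labels
    }
    return log_prior, log_pwl, vocab

def classify_nb(model, words):
    log_prior, log_pwl, vocab = model
    def score(l):
        return log_prior[l] + sum(log_pwl[l][w] for w in words if w in vocab)
    return max(log_prior, key=score)
```

Used as a PiC, the two "labels" are simply the superclass S_j and its complement.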
Imagine that within the labeled set of examples used to train the individual one-vs.-rest classifiers, the only yellow fruit are bananas. So the banana classifier will learn a strong association between yellow color and bananas. Now provide a yellow grapefruit to the trained one-vs.-rest classifier. The value of Λ_grapefruit(d) will likely be close to one; after all, the object in question is round and grapefruit-sized, despite not being red like all the grapefruits encountered in training. But the value of Λ_banana(d) will be very close to one, and the system will misclassify the object as a banana. ECOC classification is less "brittle" than the one-vs.-rest approach: the distributed output representation means one errant subordinate classifier won't necessarily result in a misclassification. This is a circuitous way of saying that ECOC reduces the variance of the individual classifiers.

Many classification algorithms, including decision trees, exponential models, and neural networks, have the capability to directly perform multiway (m > 2) classification. A reasonable classification strategy with these algorithms is to construct a single, monolithic classifier. But the monolithic classifier faces a difficult task. Imagining the classes as clouds in a large-dimensional feature space, a single classifier must learn all the decision boundaries simultaneously, whereas each PiC of an ECOC classifier learns only a relatively small number of decision boundaries at once.
Moreover (assuming n is sufficiently large), an ECOC classifier learns each boundary many times, and is forgiving if a few PiCs place the input on the wrong side of some decision boundaries [Kong and Dietterich, 1995].

3 Choosing a good code

Early work on error-correcting output coding looked to algebraic coding theory, and in particular to the family of linear codes, for a coding matrix C. An n-bit linear error-correcting code, a subspace of the vertices of the n-dimensional cube, can be defined as the span of the rows of a binary "generator" matrix with n columns. Error-correcting codes are often measured by the minimum distance between any two linear combinations of the generator's rows. BCH codes [MacWilliams and Sloane, 1977], a popular class of linear algebraic error-correcting codes, have the useful property that their codewords (all different linear combinations of rows of the generator) are well separated. Using such a matrix for ECOC classification is for this reason an attractive possibility, and some ECOC classification work has used BCH codes as a coding matrix.

However, subsequent ECOC work has established that ECOC classification should perform well when the coding matrix is constructed randomly, specifically by choosing each entry C_ij uniformly at random from {0, 1}. This section provides some statistical and combinatorial arguments for why this should be the case. Section 3.1 summarizes some results from [James, 1998], and Section 3.2 is new.
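The random construction just described is easy to sketch. The rejection of degenerate columns below anticipates the column-separation discussion of Section 3.2 and is my own hygiene step; it assumes n <= (2^m − 2)/2 so that the loop terminates:

```python
# Randomly constructed coding matrix: each entry drawn uniformly
# from {0, 1}, with constant, duplicate, and complementary columns
# rejected (those correspond to trivial or redundant PiCs).
import random

def random_code(m, n, rng=random):
    cols, seen = [], set()
    while len(cols) < n:
        col = tuple(rng.randrange(2) for _ in range(m))
        comp = tuple(1 - b for b in col)
        if len(set(col)) < 2 or col in seen or comp in seen:
            continue  # constant, duplicate, or complementary column: redraw
        seen.add(col)
        cols.append(col)
    # Transpose the accepted columns into m rows of n bits each.
    return [tuple(col[i] for col in cols) for i in range(m)]
```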


3.1 A statistical perspective

Definition: Given a database of (document, label) pairs (x, y) with empirical distribution p(x, y), the Bayes classifier is B(x) = argmax_y p(y | x).

The Bayes classifier assigns to document x the label y which appears most often in the database with x. In terms of classification accuracy on the database, the Bayes classifier is the best possible strategy. In the present setting, it is reasonable to assume documents don't occur multiple times with different labels in the collection, and so the Bayes classifier simply selects the label of the document in the database. During the training phase, all document labels are available, and so we have access to the Bayes classifier. But in applying the classifier we do not. Yet the Bayes classifier will still turn out to be a useful concept, as the following definition and theorem from [James, 1998] suggest.

Definition: A classification algorithm built from subordinate classifiers is Bayes consistent if, whenever the subordinate classifiers are Bayes classifiers, so too is the combined classifier.

Loosely speaking, a Bayes consistent classifier constructed from accurate PiCs will be accurate. This is a property one would like to achieve in an ECOC classifier. The next theorem states the conditions under which this is achievable.

Theorem 1: Assuming C was constructed randomly, the ECOC classifier becomes Bayes consistent as n → ∞.

This theorem is not saying that with enough bits, an ECOC classifier will do arbitrarily well. Consistency of an ECOC classifier doesn't guarantee correctness, since the PiCs aren't themselves producing Bayes estimates. Still, this theorem suggests why random construction of C performs well.

3.2 A combinatorial perspective

The example code presented earlier has the unfortunate property that the third and tenth columns are equal. Therefore, the corresponding classifiers will learn precisely the same task. This is a permissible situation, though hardly desirable. Not permissible is when two rows of C are equal, for then the code cannot distinguish between the corresponding labels.
Fortunately, for a randomly-generated binary code with sufficiently many columns, the probability of such an event is minuscule: for a code with m labels and n bits, the probability that some two rows coincide is

    1 − ∏_{i=1}^{m−1} (1 − i · 2^{−n})

which is one for n < log2 m but approaches zero quickly thereafter as n increases.

More generally, if two rows in C are close in Hamming distance, an ECOC classifier built from C is apt to confuse the corresponding labels. We'll write Δ(i, k) for the Hamming distance between rows i and k of C, and Δ_min for the minimum distance between any two codewords. If the PiCs produce binary outputs, then the ECOC classifier can always recover from ⌊(Δ_min − 1)/2⌋ or fewer incorrect PiC outputs. The following theorem is a statement about how much row separation one can possibly hope for in a coding matrix.

Theorem 2: For any m-by-n binary matrix, there exist two rows which differ in at most nm/(2(m − 1)) bits.

Proof: Let d be the minimum distance between any two rows of one such matrix. Select rows i, k ∈ [1, m] with replacement, and select a column j ∈ [1, n]. The probability that C_ij ≠ C_kj is at least (d/n)(1 − 1/m). Now select a column j ∈ [1, n] first, and then select rows i, k ∈ [1, m] with replacement. The probability that C_ij ≠ C_kj is no greater than 1/2. Combining these inequalities to solve for d gives the result.

This shows that, as m becomes large, a relative spacing of one half is optimal. If we consider only square matrices, there exists an explicit construction which achieves this bound, namely the Hadamard matrix. For general matrices we are not aware of an explicit construction meeting this bound, but the following result suggests that a random construction is likely to have good separation.

Theorem 3: Define a well row-separated m-by-n binary matrix as one in which all pairs of rows have relative Hamming separation at least 1/2 − √((2 log m)/n). The probability that a randomly-constructed binary matrix is not well row-separated is at most 1/m².

Proof: Given a randomly-constructed C, fix two different rows C_i and C_k. For j ∈ [1, n], define the random variable X_j as +1 if C_ij = C_kj and −1 otherwise, and let S = Σ_j X_j. For randomly-constructed C, E[S] = 0, which corresponds to an expected n/2 Hamming distance between the rows. We are interested in the probability that S is large, for then the rows are too close.
Using Chernoff bounds, Pr[S ≥ 2√(2n log m)] ≤ exp(−4 log m) = m^{−4}, which corresponds to the relative separation stated in the theorem. There are m rows in C, hence fewer than m² pairs of rows, and so the probability that some pair of rows is too close is at most m² · m^{−4} = 1/m².
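A quick simulation (my own, with illustrative values of m and n) confirms that the minimum relative row separation of a random code does sit close to 1/2:

```python
# Empirical check of the row-separation claim: for a random m-by-n code,
# the minimum relative Hamming separation over all row pairs should
# concentrate near 1/2 as n grows.
import random

def min_relative_separation(m, n, rng):
    rows = [[rng.randrange(2) for _ in range(n)] for _ in range(m)]
    dists = [
        sum(a != b for a, b in zip(rows[i], rows[k])) / n
        for i in range(m) for k in range(i + 1, m)
    ]
    return min(dists)
```

With m = 20 labels and n = 500 bits, the worst pair of rows already differs in a bit under half the positions, in line with Theorem 3.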


Although attention in the ECOC literature has generally concentrated on finding a C with good row separation, a perhaps equally important desideratum is large separation between columns. Columns that are close give rise to classifiers which are performing nearly the same task; in the extreme case, equal columns correspond to identical classifiers. With only a slight change, Theorem 3 shows that random matrices are likely to have good column separation as well, providing another justification for constructing a code randomly.

In practice, large column separation in C is not quite sufficient to ensure good performance, because of a degeneracy inherent in binary classification. Many classification algorithms treat 0 and 1 symmetrically, and so if two columns of C are complementary (or nearly so), the corresponding PiCs will learn identical (or nearly identical) classification tasks. What we really want, then, is a matrix whose rows are pairwise well-separated, but not too well-separated. The following corollary to Theorem 3 shows that a randomly-selected matrix is, asymptotically, very likely to have this property.

Corollary: Define a strongly well-separated m-by-n binary matrix as a matrix any two rows of which have relative Hamming separation in the range [1/2 − √((2 log m)/n), 1/2 + √((2 log m)/n)]. The probability that a randomly-constructed binary matrix is not strongly well-separated is at most 2/m².

4 Experimental results

We applied error-correcting output coding classification to four real-world text collections, all extracted from the Internet. All corpora were subject to the same preprocessing: remove punctuation, convert dates and monetary amounts and numbers to canonical forms, map all words to uppercase, and remove words occurring twice or less. Table 1 summarizes some salient characteristics of these datasets.

20 Newsgroups: This is a collection of about 20,000 documents, culled from postings to 20 Usenet discussion groups [Lang, 1995]. The documents are approximately evenly distributed among the 20 labels.
Four Universities: This (misnamed) dataset contains web pages gathered from a large number of university computer science departments [Craven et al., 1998]. The pages were manually classified into the categories course, department, faculty, staff, student, project, other.

Yahoo Science: Following [Baker and McCallum, 1998], we automatically extracted the entire Yahoo science hierarchy in early 1999, and formed a labeled collection containing 41 classes by collapsing the hierarchy to the first level.

Yahoo Health: This corpus was collected in the same way as the science collection, but has rather different characteristics. In particular, many of its 36 classes are highly confusable, presenting a difficult task for classification algorithms. For instance, three of the labels in this collection are Health Administration, Hospitals And Medical Centers, and Health Care.

(The 20 Newsgroups and Four Universities datasets are publicly available at www.cs.cmu.edu/textlearning.)

    collection          documents   labels   words
    20 Newsgroups       19997       20       60915
    Four Universities   8263        7        29004
    Yahoo Science       10158       41       69939
    Yahoo Health        5625        36       48110

Table 1: Particulars on the four training datasets used. Each dataset was partitioned five separate times into a training/test split, and the numbers above are statistics from the last of these trials. The last column reports the number of distinct words in the collection, excluding those appearing once or twice.

Figure 2 plots ECOC classification accuracy against code size n for these four corpora. The codes were constructed by selecting entries uniformly at random from {0, 1}, except in the case of the Four Universities dataset, for which the columns of C were a random permutation of the 126 unique, non-trivial 7-bit vectors. The plots also display the results of standard Naive Bayes classification.

From an implementation standpoint, a larger value of n incurs a penalty in speed. (This may be an issue in high-throughput systems, such as text filtering systems designed to route relevant news articles to many users, each with their own preferences.)
However, Figure 2 suggests that, up to a point, larger values of n offer more accurate classification. And beyond that point, accuracy doesn't tail off, as is the case in many other learning algorithms for classification, which are prone to overfitting when the number of parameters becomes large.

Figure 2: Performance of ECOC classification as a function of code size on the four collections (20 Newsgroups, Four Universities, Yahoo Science, Yahoo Health), plotting % error against the number of bits. Naive Bayes classifiers served as the PiCs. Each point reflects an average over five trials with randomized C and randomized training/test splits of the data, and the bars measure the standard deviation over these trials. The horizontal line is the behavior of standard one-vs.-rest Naive Bayes. [Plots omitted.]

The Four Universities dataset was the only collection on which ECOC classification didn't significantly outperform Naive Bayes one-vs.-rest classification. The ECOC classifier's performance on this collection is almost poignant: error rate steadily decreases until n = 126, at which point there simply are no more unused, non-trivial 7-bit columns to add to C.

In the collections we are considering, each label is well-represented in the data and the models p(w | ℓ) can be well estimated. In this setting the standard Naive Bayes method is highly competitive [Lewis, 1998]. For this reason, we use a Naive Bayes classifier as the PiC in the ECOC classifiers corresponding to Figure 2. However, on datasets with poorly-represented labels, Naive Bayes can starve for lack of data. With an eye towards such collections, we explored using a feature-based classification approach as the ECOC PiC. Specifically, we trained binary decision trees to predict the individual bits in an ECOC code; the questions at the nodes of each tree were of the form "Did word w appear in the document?" We do not expect such a classifier to match the best reported performance on this dataset, since this algorithm only considers whether a word occurs in a document, and not how often. However, Figure 3 does suggest that for sufficiently high n, combining decision trees into an ECOC classifier improves performance over a one-vs.-rest decision tree approach, which augurs well for the application of ECOC to larger, sparse datasets.

Truly meaningful values of n lie in the range [log2 m, 2^m]. A code of size n < log2 m cannot even assign a distinct bitvector to each label; at the other extreme, a code of size n > 2^m must contain duplicate columns, which corresponds to two PiCs learning the same task. (A tighter upper bound is actually (2^m − 2)/2: the subtraction removes the all-zero vector, which corresponds to a trivial classifier, along with its complement, and the denominator arises from the degeneracy mentioned above.)

Figure 3: Performance of ECOC classification as a function of code size, for a decision tree PiC with a Bernoulli event model which takes no account of multiple appearances of a word in a document, plotting % error against the number of bits on the 20 Newsgroups collection. Each point reflects a single trial using a randomized training/test partition. The horizontal line is the one-vs.-rest decision tree performance. [Plot omitted.]
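A simplified sketch of such a word-presence decision tree follows. The greedy information-gain splitting, depth limit, and toy data are my assumptions; the paper does not specify its tree-induction procedure:

```python
# Decision tree PiC over binary word-presence features: each internal
# node asks "did word w appear in the document?".
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def grow(docs, labels, depth=3):
    """docs: list of word sets; labels: parallel list of 0/1 code bits."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or len(set(labels)) == 1:
        return majority
    def gain(w):
        yes = [l for d, l in zip(docs, labels) if w in d]
        no = [l for d, l in zip(docs, labels) if w not in d]
        if not yes or not no:
            return -1.0
        split = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
        return entropy(labels) - split
    w = max(sorted(set().union(*docs)), key=gain)  # sorted: deterministic ties
    if gain(w) <= 0:
        return majority
    yes_idx = [i for i, d in enumerate(docs) if w in d]
    no_idx = [i for i, d in enumerate(docs) if w not in d]
    return (w,
            grow([docs[i] for i in yes_idx], [labels[i] for i in yes_idx], depth - 1),
            grow([docs[i] for i in no_idx], [labels[i] for i in no_idx], depth - 1))

def predict(tree, doc):
    while isinstance(tree, tuple):
        w, yes, no = tree
        tree = yes if w in doc else no
    return tree
```

One such tree is trained per column of the code, exactly as the Naive Bayes PiCs were.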


5 Discussion

The results of the previous section suggest that, up to a point, classifier performance improves with n. A simple calculation shows why this should be so. Assume for the moment that the PiCs output only binary values, and that the errors committed by any two PiCs are independent of one another. Denote by β_j the probability of error by the j-th PiC, and let β = max_j β_j. If the minimum distance of C is Δ_min, then classification is robust to ⌊(Δ_min − 1)/2⌋ or fewer errors, and so the probability of correct classification, as a function of n, is at least

    Σ_{i=0}^{⌊(Δ_min−1)/2⌋} C(n, i) β^i (1 − β)^{n−i}    (3)

The quantity on the right, the first terms of the binomial expansion of (β + (1 − β))^n, is monotonically increasing in Δ_min, which itself increases with n for a randomly-constructed code. Section 4 shows that in practice, performance eventually plateaus, which means that the assumption that the errors are uncorrelated is false. This is hardly surprising: after all, the individual classifiers were trained on the same data. One would expect a correlation between, for instance, the second and third columns of the code presented in Section 2.

5.1 Relation to Naive Bayes

We have already seen that the one-vs.-rest strategy is a special case of ECOC classification. It is not difficult to see that the standard Naive Bayes approach is an implementation of ECOC classification. Notice that Naive Bayes is clearly a one-vs.-rest technique: predicting from among m classes requires constructing m classifiers (each consisting of a prior p(ℓ) and a class-specific distribution p(w | ℓ)) and selecting a label via (2). But this just amounts to using as the code C the m-by-m identity matrix, and then applying Algorithm 2 using an L1 norm.

5.2 Relation to k-nearest neighbor

A popular approach to text classification, particularly competitive for very large and sparse datasets, is k-nearest neighbor (kNN). kNN relies on a map Φ from documents to feature vectors, whose entries may be word counts or, more generally, a list of feature values.
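Stepping back to equation (3) for a moment, the bound is straightforward to evaluate numerically; the error rate β and minimum distance Δ_min below are illustrative choices of mine:

```python
# Lower bound from equation (3): the probability that at most
# floor((dmin - 1) / 2) of n independent PiCs err, each with
# error rate at most beta.
from math import comb

def p_correct_lower_bound(n, dmin, beta):
    t = (dmin - 1) // 2  # number of PiC errors the code can absorb
    return sum(comb(n, i) * beta**i * (1 - beta)**(n - i) for i in range(t + 1))
```

Holding n and β fixed, the bound grows with Δ_min, matching the monotonicity argument above.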
A kNN classifier stores the images Φ(d) of all training set documents in a database. To classify an unlabeled document d, kNN finds the k vectors in the database closest to Φ(d), and takes a weighted vote of their labels.

kNN and ECOC have some superficial similarities. Both use for classification a data structure consisting of a set of vectors, and both search this data structure using a nearest-neighbor algorithm, linear in the size of the data structure. One distinction, of particular importance when the size of the training set becomes large, is that while ECOC's data structure consists of a single vector for each label, kNN must store a vector for each document in the training set.

6 Conclusion

We have described an application of error-correcting output coding to the problem of automatic text categorization. The recent explosion in the availability of online text lends an extra importance, if not urgency, to this problem, and also suggests a source of experimental data. In fact, the experiments reported in Section 4 were all conducted on data gathered from the Internet. Those experiments offer compelling empirical evidence for the effectiveness of ECOC in text categorization.

This paper reports just some initial proof-of-concept experiments. There is yet much unexplored terrain, and it is our belief that coding theory has more to say about classification. For instance, a useful class of error-correcting codes for digital transmission is erasure codes, which are robust to some fraction of lost bits. If the PiCs produce probabilities, then one could view a classifier which is sufficiently indecisive (Λ_j(d) ≈ 1/2) as a "lost bit"; an ECOC classifier built on an erasure code could ignore bit j in attempting to recover the label of the document.

Although we have presented evidence suggesting the benefits of random codes, there are settings in which one would expect a structured code to be preferable. For instance, performing a nearest-neighbor search in a high-dimensional space can be expensive, prohibitively so for high-throughput systems.
However, one might still be able to reap the benefits of high-n error-correcting output coding without actually conducting the full search. Using a deterministic code with some structure, like a BCH code, may allow the user to replace the Θ(nm) exhaustive search with a much faster structured search, at a slight cost in accuracy. For just this reason, real-world digital encoding/decoding systems, such as modems, CD players, satellites, and digital cell phones, rely on structured codes.

Furthermore, the theoretical arguments which argue in favor of random codes are predicated on the assumption, untenable in most real-world data, that the errors made by the individual predictors are uncorrelated. In fact, textual data often contain strong correlations, which a classifier ignores at its own peril. For instance, the astronomy and space classes in the Yahoo Science category have strong overlap in word usage, evidenced by the confusion matrices of classifiers we have constructed on this data. A promising direction for improvement is to combine the ECOC approach with some form of word or document clustering, designing a code which captures the inherent "clumpiness" of the data. In particular, a well-engineered code could reflect a hierarchical decomposition of the problem: first determine if the document belongs to either astronomy or space, and only then decide which of these classes is most appropriate.

Acknowledgments

The author thanks Tom Dietterich, Adam Kalai, John Lafferty, and Kamal Nigam for suggestions on an early draft, and the "theory lunch" group at CMU for suggestions leading to the material in Section 3.2. This research was supported in part by an IBM Cooperative Fellowship.

References

[Aha and Bankert, 1997] D. Aha and R. Bankert. Cloud classification using error-correcting output codes. Artificial Intelligence Applications: Natural Resources, Agriculture, and Environmental Science, 11:1:13–28, 1997.

[Baker and McCallum, 1998] D. Baker and A. McCallum. Distributional clustering for text classification. In Proceedings of SIGIR, 1998.

[Bakiri and Dietterich, 1999] G. Bakiri and T. Dietterich. Achieving high-accuracy text-to-speech with machine learning. Data Mining in Speech Synthesis, 1999.

[Breiman, 1996a] L. Breiman. Bagging predictors. Machine Learning, 26:2:123–140, 1996.

[Breiman, 1996b] L. Breiman. Bias, variance, and arcing classifiers. Technical report TR-460, Statistics Department, Stanford University, 1996.

[Craven et al., 1998] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), 1998.

[Freund and Schapire, 1997] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[James and Hastie, 1997] G. James and T. Hastie. The error coding method and PiCTs. Journal of Computational and Graphical Statistics, 7:3:377–387, 1997.

[James, 1998] G. James. Majority vote classifiers: theory and applications. PhD thesis, Stanford University, 1998.

[Kong and Dietterich, 1995] E. Kong and T. Dietterich. Error-correcting output coding corrects bias and variance. In Proceedings of the 12th International Conference on Machine Learning, pages 313–321, 1995.

[Lang, 1995] K. Lang. Newsweeder: Learning to filter news. In Proceedings of the 12th International Conference on Machine Learning, pages 331–339, 1995.

[Lewis, 1998] D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. In Proceedings of the European Conference on Machine Learning, 1998.

[MacWilliams and Sloane, 1977] F. MacWilliams and N. Sloane. The Theory of Error-Correcting Codes. North Holland: Amsterdam, The Netherlands, 1977.

[Pazzani et al., 1996] M. Pazzani, J. Muramatsu, and D. Billsus. Syskill & Webert: Identifying interesting web sites. In Proceedings of the National Conference on Artificial Intelligence, 1996.

[Perrone, 1993] M. Perrone. Improving regression estimation: Averaging methods for variance reduction with extensions to general convex measure optimization. PhD thesis, Brown University, 1993.

[Quinlan, 1993] J. Quinlan. Combining instance-based and model-based learning. In Proceedings of the International Conference on Machine Learning. Morgan Kaufman, 1993.

[Rumelhart et al., 1986] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.

[Yang and Chute, 1994] Y. Yang and C. Chute. An application of expert network to clinical classification and Medline indexing. In Proceedings of the 18th Annual Symposium on Computer Applications in Medical Care (SCAMC'94), volume 18 (Symp. Suppl), pages 157–161, 1994.
