Mining Complex Models from Arbitrarily Large Databases in Constant Time

Geoff Hulten
Dept. of Computer Science & Engineering, University of Washington, Box 352350, Seattle, WA 98195-2350, U.S.A.
ghulten@cs.washington.edu

Pedro Domingos
Dept. of Computer Science & Engineering, University of Washington, Box 352350, Seattle, WA 98195-2350, U.S.A.
pedrod@cs.washington.edu

ABSTRACT
In this paper we propose a scaling-up method that is applicable to essentially any induction algorithm based on discrete search. The result of applying the method to an algorithm is that its running time becomes independent of the size of the database, while the decisions made are essentially identical to those that would be made given infinite data. The method works within pre-specified memory limits and, as long as the data is iid, only requires accessing it sequentially. It gives anytime results, and can be used to produce batch, stream, time-changing and active-learning versions of an algorithm. We apply the method to learning Bayesian networks, developing an algorithm that is faster than previous ones by orders of magnitude, while achieving essentially the same predictive performance. We observe these gains on a series of large databases generated from benchmark networks, on the KDD Cup 2000 e-commerce data, and on a Web log containing 100 million requests.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - data mining; I.2.6 [Artificial Intelligence]: Learning - induction; I.5.1 [Pattern Recognition]: Models - statistical; I.5.2 [Pattern Recognition]: Design Methodology - classifier design and evaluation

General Terms
Scalable learning algorithms, subsampling, Hoeffding bounds, discrete search, Bayesian networks

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGKDD 02 Edmonton, Alberta, Canada. Copyright 2002 ACM 1-58113-567-X/02/0007 ...$5.00.

1. INTRODUCTION
Much work in KDD has focused on scaling machine learning and statistical algorithms to large databases. The goal is generally to obtain algorithms whose running time is linear (or near-linear) in the size of the database, and that only access the data sequentially. So far this has been done mainly for one algorithm at a time, in a slow and laborious process. We believe that this state of affairs can be overcome by developing scaling methods that are automatically (or nearly automatically) applicable to broad classes of learning algorithms, and that scale up to databases of arbitrary size by limiting the quantity of data used at each step, while guaranteeing that the decisions made do not differ significantly from those that would be made given infinite data. This paper describes one such method, based on generalizing the ideas initially proposed in our VFDT algorithm for scaling up decision tree induction [3]. The method is applicable to essentially any learning algorithm based on discrete search, where at each search step a number of candidate models or model components are considered, and the best one or ones are selected based on their performance on an iid sample from the domain of interest. Search types it is applicable to include greedy hill-climbing, beam, multiple-restart, lookahead, best-first, genetic, etc. It is applicable to common algorithms for decision tree and rule induction, instance-based learning, feature selection, model selection, parameter setting, probabilistic classification, clustering, and probability estimation in discrete spaces, etc., as well as their combinations.

We demonstrate the method's utility by using it to scale up Bayesian network learning [8]. Bayesian networks are a powerful method for representing the joint distribution of a set of variables, but learning their structure and parameters from data is a computationally costly process, and to our knowledge has not previously been successfully attempted on databases of more than tens of thousands of examples. With our method, we have been able to mine millions of examples per minute. We demonstrate the scalability and predictive performance of our algorithms on a set of benchmark networks and on three large Web data sets.

2. GENERAL METHOD FOR SCALING UP LEARNING ALGORITHMS
Consider the following simple problem. We are given two classifiers A and B, and an infinite database of iid (independent and identically distributed) examples. We wish to
determine whic of the classiers is more accurate on the database. If an to absolutely sure of making the correct decision, ha no hoice but to apply the classiers to ev ery example in the database, taking innite time. If, ho ev er, are willing to accommo date prob- abilit of ho osing the wrong classier, can generally mak decision in nite time, taking adv an tage of statis-
Page 2
tical results that giv condence in terv als for the mean of ariable. One suc result is kno wn as Ho eding ounds or additive Cherno ounds [9]. Consider

real-v alued ran- dom ariable whose range is Supp ose ha made indep enden observ ations of this ariable, and computed their mean One form of Ho eding ound states that, with probabilit the true mean of the ariable is at least where ln(1 = Let the dierence in ac- curacy et een the classiers on the rst examples in the database, and assume without loss of generalit that it is ositiv e. Then, if the Ho eding ound guaran tees that with probabilit the classier with highest accu- racy on those examples is also the most accurate one on the en tire

innite sample. In other ords, in order to mak the correct decision with error probabilit it is sucien to observ enough examples to mak smaller than The only case in whic this pro cedure do es not yield decision in nite time ccurs when the classiers ha exactly the same accuracy No um er of examples will then suce to nd winner. Ho ev er, in this case do not care whic classier wins. If stipulate minim um dierence in accuracy elo whic are indieren as to whic classier is hosen, the pro cedure ab is guaran teed to

terminate after seeing at most $n = \lceil \frac{1}{2} (R/\tau)^2 \ln(1/\delta) \rceil$ examples. In other words, the time required to choose a classifier is constant, independent of the size of the database.

The Hoeffding bound has the very attractive property that it is independent of the probability distribution generating the observations. The price of this generality is that the bound is more conservative than distribution-dependent ones (i.e., it will take more observations to reach the same $\delta$ and $\epsilon$). An alternative is to use a normal bound, which assumes $\bar{x}$ is normally distributed. (By the central limit theorem, this will always be approximately true after some number of samples.) In this case, $\epsilon$ is the value of $\bar{x} - \mu$ for which $\Phi((\bar{x} - \mu)/\sigma_{\bar{x}}) = 1 - \delta$, where $\mu$ is the true mean and $\Phi()$ is the standard normal distribution function. (If the variance is estimated from data, the Student $t$ distribution is used instead.) Either way, we can view a bound as a function $f(n, \delta)$ that returns the maximum $\epsilon$ by which the true mean of $x$ is smaller than the sample mean $\bar{x}$, given a desired confidence $1 - \delta$ and sample size $n$.

Suppose now that, instead of two classifiers, we wish to find the best one among $b$. Making the correct choice requires that each of the $b - 1$ comparisons between the best classifier and all others have the correct outcome (i.e., that the classifier that is best on finite data also be best on infinite data). If the probability of error in each of these decisions is $\delta$, then by the union bound the probability of error in any of them, and thus in the global decision, is at most $(b - 1)\delta$. Thus, if we wish to choose the best classifier with error probability at most $\delta^*$, it suffices to use $\delta = \delta^*/(b - 1)$ in the bound function $f(n, \delta)$ for each comparison. Similarly, if we wish to find the best $a$ classifiers among $b$ with probability of error at most $\delta^*$, $a(b - a)$ comparisons need to have the correct outcome, and to ensure that this is the case with probability of error at most $\delta^*$, it suffices to require that the probability of error for each individual comparison be at most $\delta^*/[a(b - a)]$.

Suppose now that we wish to find the best classifier by a search process composed of $d$ steps, at each step considering at most $b$ candidates and choosing the best $a$. For the search process with finite data to output the same classifier as with infinite data with probability at least $1 - \delta^*$, it suffices to use $\delta = \delta^*/[da(b - a)]$ in each comparison. Thus, in each search step, we need to use enough examples $n_i$ to make $\epsilon_i = f(n_i, \delta^*/[da(b - a)]) < r_i$, where $r_i$ is the difference in accuracy
between the $a$th and $(a+1)$th best classifiers (on $n_i$ examples, at the $i$th step). As an example, for a hill-climbing search of depth $d$ and breadth $b$, the required $\delta$ would be $\delta^*/db$. This result is independent of the process used to generate candidate classifiers at each search step, as long as this process does not itself access the data, and of the order in which search steps are performed. It is applicable to randomized search processes, if we assume that the outcomes of random decisions are the same in the finite- and infinite-data cases.

In general, we do not know in advance how many examples will be required to make $\epsilon_i < r_i$ for all $i$. Instead, we can scan $\Delta n$ examples from the database or data stream at a time, and check whether the goal has been met. Let $f^{-1}(\epsilon, \delta)$ be the inverse of the bound function, yielding the number of samples $n$ needed to reach the desired $\epsilon$ and $\delta$. (For the Hoeffding bound, $n = \lceil \frac{1}{2} (R/\epsilon)^2 \ln(1/\delta) \rceil$.) Recall that $\tau$ is the threshold of indifference below which we do not care which classifier is more accurate. Then, in any given search step, at most $c = \lceil f^{-1}(\tau, \delta)/\Delta n \rceil$ goal checks will be made. Since a mistake could be made in any one of them (i.e., an incorrect winner could be chosen), we need to use enough examples $n_i$ at each step $i$ to make $\epsilon_i = f(n_i, \delta^*/[cda(b - a)]) < r_i$.

Notice that nothing in the treatment above requires that the models being compared be classifiers, or that accuracy be the evaluation function. The treatment is applicable to any inductive model, be it for classification, regression, probability estimation, clustering, ranking, etc., and to comparisons between components of models as well as between whole models. The only property required of the evaluation function is that it be decomposable into an average (or sum) over training examples. This leads to the general method for scaling up learning algorithms shown in Table 1. The key properties of this method are summarized in the following theorem (a proof can be found in [10]). Let $t_{Gen}$ be the total time spent by $L^*$ generating candidates, and let $t_{Sel}$ be the total time spent by it in calls to SelectCandidates($\mathcal{M}$), including data access. Let $L^\infty$ be $L^*$ with $|T| = \infty$, and $U$ be the first terminating condition of the "repeat" loop in SelectCandidates($\mathcal{M}$) (see Table 1).

Theorem 1. If $t_{Sel} > t_{Gen}$ and $|T| > c\Delta n$, the time complexity of $L^*$ is $O(db(a + c\Delta n))$. With probability at least $1 - \delta^*$, $L^*$ and $L^\infty$ choose the same candidates at every step for which $U$ is satisfied. If $U$ is satisfied at all steps, $L^*$ and $L^\infty$ return the same model with probability at least $1 - \delta^*$.
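
As an illustration of the quantities used in this section, the following minimal Python sketch instantiates the bound function $f(n, \delta)$ and its inverse $f^{-1}(\epsilon, \delta)$ for the Hoeffding case, together with the per-comparison $\delta = \delta^*/[cda(b - a)]$. The function names and the default range of 1.0 are assumptions made for this example and do not come from the paper.

```python
import math


def hoeffding_epsilon(n, delta, value_range=1.0):
    """f(n, delta): Hoeffding bound on how far the true mean may lie
    below the sample mean after n observations, with confidence 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def hoeffding_samples(epsilon, delta, value_range=1.0):
    """f^{-1}(epsilon, delta): number of observations needed so that the
    Hoeffding bound is no larger than epsilon."""
    return math.ceil(0.5 * (value_range / epsilon) ** 2 * math.log(1.0 / delta))


def per_comparison_delta(delta_star, d, a, b, c=1):
    """Union-bound split of the global error probability delta_star over
    c goal checks, d search steps, and a * (b - a) comparisons per step."""
    return delta_star / (c * d * a * (b - a))


if __name__ == "__main__":
    # Worst-case sample size for an indifference threshold of 1% accuracy.
    n_max = hoeffding_samples(epsilon=0.01, delta=0.05)
    print(n_max, hoeffding_epsilon(n_max, 0.05))
    print(per_comparison_delta(0.01, d=10, a=1, b=5))
```

Note that the sample size returned by `hoeffding_samples` depends only on the range, the threshold and $\delta$, not on the size of the database, which is the property the theorem relies on.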

the running time of the mo died algorithm is indep enden of the size of the database, as long as the latter is greater than cda )]). tie et een candidates at some searc step ccurs if is reac hed or the database is exhausted efore is satised. With probabilit all decisions made without ties are the same that ould made with innite data. If there are no ties in an This excludes sim ulated annealing, ecause in this case the outcome probabilities dep end on the observ ed dierences in erformance et een candidates. Extending our treatmen to this case is matter for

future ork.
Page 3
able 1: Metho for scaling up learning algorithms. Giv en: An iid sample real-v alued ev aluation function ), where is mo del (or mo del comp onen t) and is an example. learning algorithm that nds mo del er- forming at most searc steps, at eac step considering at most candidates and selecting the with highest ). desired maxim um error probabilit threshold of indierence ound function n; ). blo size Mo dify yielding at eac step replacing the selection of the candidates with highest with call to Sele ctCandidates ), where is the set of candidates at

that step. Pro cedure Sele ctCandidates Let 0. or eac let ( 0. Rep eat If then let Else let or eac Let ( ( +1 ). Let ( =n Let has one of the highest in Mg cda )]) Un til n; cda )]) )] or n; cda )]) or Return of the searc steps, the mo del pro duced is, with probabilit the same that ould pro duced with innite data. An alternativ use of our metho is to rst ho ose maxi- um error probabilit to used at eac step, and then re- ort the global error probabilit ac hiev ed, =1 ), where is the um er of goal hec ks erformed in step is the um er of candidates

considered at that step, and is the um er selected. Ev en if is computed from and the desired as in able 1, rep orting the actual ac hiev ed is recommended, since it will often uc etter than the original target. urther time can sa ed at eac step, dropping candidate from consideration as so on as its score is lo er than at least others at least n; cda )]). Sele ctCandidates is an an ytime pro cedure in the sense that, at an oin after pro cessing the rst examples, it is ready to return the est candidates according to the data scanned so far (in incremen ts of ). If the learning algorithm
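
The SelectCandidates procedure of Table 1 can be sketched in Python roughly as follows. This is an illustrative reading under stated assumptions rather than the authors' implementation: the sample is an in-memory list, the Hoeffding bound with range 1.0 stands in for the bound function, the `delta` argument is the already-divided per-comparison value $\delta^*/[cda(b - a)]$, and all identifiers are invented for the example.

```python
import math


def hoeffding_epsilon(n, delta, value_range=1.0):
    """Bound function f(n, delta) in its Hoeffding form (assumed range 1.0)."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def select_candidates(candidates, examples, evaluate, a, delta, tau,
                      block_size, bound=hoeffding_epsilon):
    """Return the a best candidates and the number of examples consumed.

    evaluate(model, x) plays the role of E(M, x); examples must be non-empty.
    """
    totals = [0.0] * len(candidates)          # running sums, Sigma(M)
    n = 0
    while True:
        n_next = min(n + block_size, len(examples))
        block = examples[n:n_next]
        for idx, model in enumerate(candidates):
            totals[idx] += sum(evaluate(model, x) for x in block)
        n = n_next
        means = [total / n for total in totals]
        ranked = sorted(range(len(candidates)),
                        key=lambda i: means[i], reverse=True)
        winners, losers = ranked[:a], ranked[a:]
        eps = bound(n, delta)
        # Smallest gap between any selected and any rejected candidate.
        gap = means[winners[-1]] - means[losers[0]] if losers else math.inf
        if eps < gap or eps < tau or n == len(examples):
            return [candidates[i] for i in winners], n


if __name__ == "__main__":
    # Toy "models": thresholds scored by how often an example falls below them.
    data = [(i % 10) / 10 for i in range(100000)]
    best, used = select_candidates(
        [0.1, 0.9, 0.5], data, lambda m, x: 1.0 if x < m else 0.0,
        a=1, delta=1e-6, tau=0.01, block_size=1000)
    print(best, used)
```

In this toy run the gap between the best two candidates is large, so the loop stops after the first block even though the sample holds 100,000 examples.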

If the learning algorithm $L$ is such that successive search steps progressively refine the model to be returned, $L^*$ itself is an anytime procedure.

The method in Table 1 is equally applicable to databases (i.e., samples of fixed size that can be scanned multiple times) and data streams (i.e., samples that grow without limit and can only be scanned once). In the database case, we start a new scan whenever we reach the end of the database, and ensure that no search step uses the same example twice. In the data stream case, we simply continue scanning the stream after the winners for step $i$ are chosen, using the new examples to choose the winners in the next step, and so on until the search terminates.

When learning large, complex models it is often the case that the structure, parameters, and sufficient statistics used by all the candidates at a given step exceed the available memory. This can lead to severe slowdowns due to repeated swapping of memory pages. Our method can be easily adapted to avoid this, as long as $m > a$, where $m$ is the maximum number of candidates that fits within the available RAM. We first form a set $\mathcal{M}_1$ composed of any $m$ elements of $\mathcal{M}$, and run SelectCandidates($\mathcal{M}_1$). We then add another $m - a$ candidates to the $a$ selected, yielding
$\mathcal{M}_2$, and run SelectCandidates($\mathcal{M}_2$). We continue in this way until $\mathcal{M}$ is exhausted, returning the $a$ candidates selected in the last iteration as the overall winners. As long as all calls to SelectCandidates() start scanning $T$ at the same example and ties are broken consistently, these winners are guaranteed to be the same that would be obtained by the single call SelectCandidates($\mathcal{M}$). If $k$ iterations are carried out, this modification increases the running time of SelectCandidates($\mathcal{M}$) by a factor of at most $k$, with $k < b$. Notice that the "new" candidates in each $\mathcal{M}_i$ do not need to be generated until SelectCandidates($\mathcal{M}_i$) is to be run, and thus they never need to be swapped to disk.

The running time of an algorithm scaled up with our method is often dominated by the time spent reading data from disk. When a subset of the search steps are independent (i.e., the results of one step are not needed to perform the next one, as when growing the different nodes on the fringe of a decision tree), much time can be saved by using a separate search for each independent model component and (conceptually) performing them in parallel. This means that each data block needs to be read only once for all of the interleaved searches, greatly reducing I/O
requirements. When these searches do not all fit simultaneously into memory, we maintain an inactive queue from which steps are transferred to memory as it becomes available (because some other search completes or is inactivated).

The processes that generate massive data sets and open-ended data streams often span months or years, during which the data-generating distribution can change significantly, violating the iid assumption made by most learning algorithms. A common solution to this is to repeatedly apply the learner to a sliding window of examples, which can be very inefficient. Our method can be adapted to efficiently accommodate time-changing data as follows. Maintain $\Sigma(M)$ throughout time for every candidate $M$ considered at every search step. After the first $w$ examples, where $w$ is the window width, subtract the oldest example from $\Sigma(M)$ whenever a new one is added. After every $\Delta n$ new examples, determine again the best $a$ candidates at every previous search decision point. If one of them is better than an old winner by $\delta^*/s$ ($s$ is the maximum number of candidate comparisons expected during the entire run) then there has probably been some concept drift. In these cases, begin an alternate search starting from the new winners. Periodically use a number of new examples as a validation set to compare the performance of the models produced by the new and old searches. Prune the old search when the new models are on average better than the old ones, and prune the new search if after a maximum number of validations its models have failed to become more accurate on average than the old ones. If more than a maximum number of new searches is in progress, prune the lowest-performing ones. This approach to handling time-changing data is a generalization of the one successfully applied to the VFDT decision-tree induction algorithm [11].

Active learning is a powerful type of subsampling where the learner actively selects the examples that would cause the most progress [1]. Our method has a natural extension to this case when different examples are relevant to different search steps, and some subset of these steps is independent of each other (as in decision tree induction, for example): choose the next $\Delta n$ examples to be relevant to the step where $U$ is currently farthest from being achieved (i.e., where $f(n, \delta^*/[cda(b - a)]) - \min_{ij} \{\max\{\tau, E(M_i, T') - E(M_j, T')\}\}$ is highest).
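
The sliding-window bookkeeping described earlier in this section for time-changing data (add each new example to $\Sigma(M)$ and, once $w$ examples have been seen, subtract the oldest) can be illustrated with a small Python sketch. The class and method names are hypothetical, and the sketch tracks a single candidate's running sum only; drift detection and search pruning are left out.

```python
from collections import deque


class WindowedScore:
    """Running sum of E(M, x) for one candidate over the last `width` examples."""

    def __init__(self, width):
        self.width = width        # window width w
        self.values = deque()     # per-example scores currently in the window
        self.total = 0.0          # Sigma(M) restricted to the window

    def add(self, value):
        """Add the newest example's score, evicting the oldest if needed."""
        self.values.append(value)
        self.total += value
        if len(self.values) > self.width:
            self.total -= self.values.popleft()

    def mean(self):
        """Average score over the examples currently in the window."""
        return self.total / len(self.values)


if __name__ == "__main__":
    score = WindowedScore(width=3)
    for v in [1.0, 2.0, 3.0, 4.0, 5.0]:
        score.add(v)
    print(score.total, score.mean())
```

Keeping the per-example scores in a queue makes each update constant-time, at the cost of storing $w$ values per candidate.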

In the remainder of this paper we apply our method to learning Bayesian networks, and evaluate the performance of the resulting algorithms.

3. LEARNING BAYESIAN NETWORKS
We now briefly introduce Bayesian networks and methods for learning them. See Heckerman et al. [8] for a more complete treatment. A Bayesian network encodes the joint probability distribution of a set of $d$ variables, $\{x_1, \ldots, x_d\}$, as a directed acyclic graph and a set of conditional probability tables (CPTs). (In this paper we assume all variables are discrete.) Each node corresponds to a variable, and the CPT associated with it contains the probability of each state of the variable given every possible combination of states of its parents. The set of parents of $x_i$, denoted $par(x_i)$, is the set of nodes with an arc to $x_i$ in the graph. The structure of the network encodes the assertion that each node is conditionally independent of its non-descendants given its parents. Thus the probability of an arbitrary event $X = (x_1, \ldots, x_d)$ can be computed as $P(X) = \prod_{i=1}^{d} P(x_i \mid par(x_i))$. In general, encoding the joint distribution of a set of $d$ discrete variables requires space exponential in $d$; Bayesian networks reduce this to space exponential in $\max_{i \in \{1, \ldots, d\}} |par(x_i)|$.

In this paper we consider learning
the structure of Bayesian networks when no values are missing from training data. A number of algorithms for this have been proposed; perhaps the most widely used one is described by Heckerman et al. [8]. It performs a search over the space of network structures, starting from an initial network which may be random, empty, or derived from prior knowledge. At each step, the algorithm generates all variations of the current network that can be obtained by adding, deleting or reversing a single arc, without creating cycles, and selects the best one using the Bayesian Dirichlet (BD) score (see Heckerman et al. [8]). The search ends when no variation achieves a higher score, at which point the current network is returned. This algorithm is commonly accelerated by caching the many redundant calculations that arise when the BD score is applied to a collection of similar networks. This is possible because the BD score is decomposable into a separate component for each variable. In the remainder of this paper we scale up Heckerman et al.'s algorithm, which we will refer to as HGC throughout.

4. SCALING UP BAYESIAN NETWORKS
At each search step, HGC considers all examples in the training set $T$ when computing the BD score of each candidate structure. Thus its running time grows without limit as the training set size $|T|$ increases. By applying our method, HGC's running time can be made independent of $|T|$, for $|T| > f^{-1}(\tau, \delta^*/[cda(b - a)])$, with user-determined $\tau$ and $\delta^*$. In order to do this, we must first decompose the BD score into an average of some quantity over the training sample. This is made possible by taking the logarithm of the BD score, and discarding terms that become insignificant when $n \to \infty$, because the goal is to make the same decisions that would be made with infinite data. This yields (see Hulten & Domingos [10] for the detailed derivation):

$$\log BD_\infty(S, T) = \sum_{e=1}^{n} \sum_{i=1}^{d} \log \hat{P}(x_{ie} \mid S, par(x_{ie})) \qquad (1)$$

where $x_{ie}$ is the value of the $i$th variable in the $e$th example, and $\hat{P}(x_{ie} \mid S, par(x_{ie}))$ is the maximum-likelihood estimate of the probability of $x_{ie}$ given its parents in structure $S$, equal to $n_{ijk}/n_{ij}$ if in example $e$ the variable is in its $k$th state and its parents are in their $j$th state. Effectively, when $n \to \infty$, the log-BD score converges to the log-likelihood of the data given the network structure and maximum-likelihood parameter estimates, and choosing the network with highest BD score becomes equivalent to choosing the maximum-likelihood network. The quantity $\sum_{i=1}^{d} \log \hat{P}(x_{ie} \mid S, par(x_{ie}))$ is the log-likelihood of an example $X_e$ given the structure $S$ and corresponding parameter estimates. When comparing candidate structures $S_1$ and $S_2$, we compute the mean difference in this quantity between them:

$$\Delta L(S_1, S_2) = \frac{1}{n} \sum_{e=1}^{n} \sum_{i=1}^{d} \log \hat{P}(x_{ie} \mid S_1, par(x_{ie})) - \frac{1}{n} \sum_{e=1}^{n} \sum_{i=1}^{d} \log \hat{P}(x_{ie} \mid S_2, par(x_{ie})) \qquad (2)$$

Notice that the decomposability of the BD score allows this computation to be accelerated by only considering the components corresponding to the two or four variables with different parents in $S_1$ and $S_2$. We can apply either the normal bound or the Hoeffding bound to $\Delta L(S_1, S_2)$. In order to apply the Hoeffding bound, the quantity being averaged must have a finite range.
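
To make the comparison in Equation 2 concrete, the following Python sketch computes the mean per-example log-likelihood of a structure under maximum-likelihood parameters ($n_{ijk}/n_{ij}$) and the difference between two structures. It is a simplified illustration, not the authors' code: the data representation (tuples of discrete values), the `parents` dictionaries and the function names are assumptions, and for clarity it sums over all $d$ variables rather than only those whose parents differ.

```python
import math
from collections import Counter


def mean_loglik(examples, parents):
    """Average per-example log-likelihood under ML estimates n_ijk / n_ij.

    examples: list of tuples of discrete values, one entry per variable.
    parents:  dict mapping each variable index to a tuple of parent indices.
    """
    joint = Counter()   # n_ijk: variable i, parent state j, own state k
    marg = Counter()    # n_ij:  variable i, parent state j
    for x in examples:
        for i, pa in parents.items():
            j = tuple(x[p] for p in pa)
            joint[(i, j, x[i])] += 1
            marg[(i, j)] += 1
    total = 0.0
    for x in examples:
        for i, pa in parents.items():
            j = tuple(x[p] for p in pa)
            total += math.log(joint[(i, j, x[i])] / marg[(i, j)])
    return total / len(examples)


def mean_loglik_difference(examples, parents_1, parents_2):
    """Mean difference in per-example log-likelihood between two structures."""
    return mean_loglik(examples, parents_1) - mean_loglik(examples, parents_2)


if __name__ == "__main__":
    # Two binary variables that always agree: the arc 0 -> 1 should help.
    data = [(0, 0), (1, 1)] * 50
    with_arc = {0: (), 1: (0,)}
    without_arc = {0: (), 1: ()}
    print(mean_loglik_difference(data, with_arc, without_arc))
```

A positive difference favors the first structure; in a scaled-up learner this value would be compared against the bound $f(n, \delta)$ rather than used directly.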

We estimate this range by measuring the minimum non-zero probability at each node, and use as the range $R$ a sum over the $d$ nodes of a logarithmic term in that probability, scaled by a small integer constant. (A value may have zero true probability given some parent state, but this is not a problem, since such a value and parent combination will never occur in the data.)

After the structure is learned, the final $n_{ijk}$ and $n_{ij}$ counts must be estimated. In future work we will use the bound from [6] to determine the needed sample size; the current algorithm simply uses a single pass over training data. Together with the parameters of the Dirichlet prior, these counts induce a posterior distribution over the parameters of the network. Prediction of the log-likelihood of new examples is carried out by integrating over this distribution, as in HGC. We call this algorithm VFBN1.

HGC and VFBN1 share a major limitation: they are unable to learn Bayesian networks in domains with more than 100 variables or so. The reason is that the space and time complexity of their search increases quadratically with the number of variables (since, at each search step, each of the $d$ variables considers on the order of one change with respect to each of the other $d$ variables). This complexity can be greatly reduced by
noticing that, because of the decomposability of the BD score, many of the alternatives considered in a search step are independent of each other. That is, except for avoiding cycles, the best arc change for a particular variable will still be best after changes are made to other variables. The VFBN2 algorithm exploits this by carrying out a separate search for each variable in the domain, interleaving all the searches. VFBN2 takes advantage of our method's ability to reduce I/O time by reading a data block just once for all of its searches. The generality of our scaling-up method is illustrated by the fact that it is applied in VFBN2 as easily as it was in VFBN1.

Two issues prevent VFBN2 from treating these searches completely independently: arc reversal and the constraint that a Bayesian network contain no cycles. VFBN2 avoids the first problem by simply disallowing arc reversal; our experiments show this is not detrimental to its performance. Cycles are avoided in a greedy manner by, whenever a new arc is added, removing from consideration in all other searches all alternatives that would form a cycle with the new arc.

VFBN2 uses our method's ability to cycle through searches when they do not all simultaneously fit into memory. All searches are initially on a queue of inactive searches. An inactive search uses only a small constant amount of memory to hold its current state. The main loop of VFBN2 transfers searches from the head of the inactive queue to the active set until memory is filled. An active search for variable $x_i$ considers removing the arc between each variable in $par(x_i)$ and $x_i$, adding an arc to $x_i$ from each variable not already in $par(x_i)$, and making no change to $par(x_i)$, all under the constraint that no change add a cycle to $S$. Each time a block of data is read, the candidates' scores in each active search are updated and a winner is tested for as in SelectCandidates() (see Table 1). If there is a winner (or one is selected by tie-breaking because $\tau$ has been reached or $T$ has been exhausted) its change is applied to the network structure. Candidates in other searches that would create cycles if added to the updated network structure are removed from consideration. A search is finished if making no change to the associated variable's parents is better than any of the alternatives. If a search takes a step and is not finished, it is appended to the back of the inactive queue, and will be reactivated when
memory is available. VFBN2 terminates when all of its searches finish.

HGC and VFBN1 require $O(d^2 v^k)$ memory, where $k$ is the maximum number of parents and $v$ is the maximum number of values of any variable in the domain. This is used to store the CPTs that differ between each alternative structure and $S$. VFBN2 improves on this by using an inactive queue to temporarily deactivate searches when RAM is short. This gives VFBN2 the ability to make progress with $O(d v^k)$ memory, turning a quadratic dependence on $d$ into a linear one. VFBN2 is thus able to learn on domains with many more variables than HGC or VFBN1. Further, HGC and VFBN1 perform redundant work by repeatedly evaluating alternatives, selecting the best one, and discarding the rest. This is wasteful because some of the discarded alternatives are independent of the one chosen, and will be selected as winners at a later step. VFBN2 avoids much of this redundant work and learns structure up to a factor of $d$ faster than VFBN1 and HGC.

We experimented with several policies for breaking ties, and obtained the best results by selecting the alternative with highest observed BD score. First, VFBN2 limits the number of times an arc can be added or removed between each pair of variables to two. Second, to make sure that at least one search will fit in the active set, it does not consider any alternative that requires more than $M/d$ MB of memory, where $M$ is the system's total available memory in MB. Third, it can limit the number of parameters used in any variable's CPT to be less than a user-supplied threshold.

5. EMPIRICAL EVALUATION
We compared VFBN1, VFBN2, and our implementation of HGC on data generated from several benchmark networks, and on three massive Web data sets. All experiments were carried out on a cluster of GHz-class Pentium machines, with a memory limit of 200MB for the benchmark
networks and 300MB for the Web domains. In order to stay within these memory limits, VFBN1 and HGC discarded any search alternative that required more than $M/d$ MB of RAM. The block size $\Delta n$ was set to 10,000. VFBN1 and VFBN2 used the normal bound with variance estimated from training data. To control complexity we limited the size of each CPT to 10,000 parameters (one parameter for every 500 training samples). VFBN1 and VFBN2 used $\delta^* = 10^{\ldots}$ and $\tau = 0.05\%$. All algorithms started their search from empty networks (we also experimented with prior networks, but space precludes reporting on them).

Benchmark Networks. For the benchmark study, we generated data sets of five million examples each from the networks contained in the repository at http://www.cs.h…il/labs/compbio/Repository/. The networks used and number of variables they contained were: (Insurance, 27), (Water, 32), (Alarm, 37), (Hailfinder, 56), (Munin1, 189), (Pigs, 441), (Link, 724), and (Munin4, 1041). Predictive performance was measured by the log-likelihood on 100,000 independently generated test examples. HGC's parameters were set as described in Hulten & Domingos [10]. We limited the algorithms to spend at most five days of CPU time (7200 minutes) learning structure, after which the best structure found up to that point was returned.

Table 2 contains the results of our experiments. All three systems were able to run on the small networks (Insurance, Water, Alarm, Hailfinder). Because of RAM limitations only VFBN2 could run on the remaining benchmark networks. The systems achieve approximately the same likelihoods for the small networks. Further, their likelihoods were very close to those of the true networks, indicating that our scaling method can achieve high quality models. VFBN1 and VFBN2 both completed all four runs
within the allotted five days, while HGC did so only twice. VFBN2 was an order of magnitude faster than VFBN1, which was an order of magnitude faster than HGC. VFBN2 spent less than five minutes on each run. This was less than the time it spent doing a single scan of data to estimate parameters, by a factor of five to ten. VFBN2's global confidence bounds were better than VFBN1's by at least an order of magnitude in each experiment. This was caused by VFBN1's redundant search forcing it to remake many decisions, thus requiring that many more statistical bounds hold. We also ran HGC on a random sample of 10,000 training examples. This variation always had worse likelihood than both VFBN1 and VFBN2. It also always spent more time learning structure than did VFBN2.

For the large networks, we found VFBN2 to improve significantly on the initial networks, and to learn networks with likelihoods similar to those of the true networks. Not surprisingly, we found that many more changes were required to learn large networks than to learn small ones (on these, VFBN2 made between 566 and 7133 changes to the prior networks). Since HGC and VFBN1 require one search step for each change, this suggests that even with sufficient RAM they would learn much more slowly compared to VFBN2 than they did on the small networks. We watched the active set and inactive queue during the runs on the large data sets. We found the proportion of searches that were active was high near the beginning of each run. As time progressed, however, the size of the CPTs being learned tended to increase, leaving less room for searches to be active. In fact, during the majority of the large network runs, only a small fraction of the remaining searches were active at any one time. Recall that VFBN1 and HGC can only run when all
remaining searc hes t in the activ set. or these net orks the 200 MB allo cation falls far short. or example, on yp- ical run on large net ork found that at the orst oin only 31% of the remaining searc hes ere activ e. Assuming they all tak ab out the same RAM, HGC and VFBN1 ould ha required nearly 15 GB to run. eb Applications In order to ev aluate VFBN2ís erfor- mance on large real-w orld problems, ran it on large eb traces. The rst as the data set used in the KDD Cup 2000 comp etition [12]. The second as trace of all requests made to the eb site of the Univ ersit of ash-

ingtonís Departmen of Computer Science and Engineering et een Jan uary 2000 and Jan uary 2002. The KDD Cup data consists of 777,000 eb page requests collected from an e-commerce site. Eac request is anno- tated with requester session ID and large collection of attributes describing the pro duct in the requested page. fo cused on one of the elds in the log, \Assortmen Lev el 4", whic con tained 65 categorizations of pages in to pro duct yp es (including \none"). or eac session pro duced training example with one Bo olean attribute er category \T rue" if the session visited page with

that category There ere 235,000 sessions in the log, of whic held out 35,000 for testing. The UW-CSE-80 and UW-CSE-400 data sets ere created from log of ev ery request made to our departmen tís eb site et een late Jan uary 2000 and late Jan uary 2002. The log con tained on the order of one undred million requests. extracted the training sets from this log in manner ery similar the KDD Cup data set. or the rst iden- tied the 80 most commonly visited lev el directories on the site (e.g. /homes/faculty/ and /e duc ation/under gr ads/ ). or the second iden tied the 400 most

commonly visited Web objects (excluding most images, style sheets, and scripts). In both cases we broke the log into approximate sessions, with each session containing all the requests made by a single host until an idle period of 10 minutes; there were 8.3 million sessions. We held out the last week of data for testing.

Table 2: Empirical results. Samples is the total number of examples read from disk while learning structure, in millions. Times in bold exceeded our five day limit and the corresponding runs were stopped before converging.

Network     Algorithm  Log-Likel  Samples  Minutes
Insurance   True          13.048
            HGC           13.048   320.00  2446.08
            VFBN1         13.069    16.69    39.72
            VFBN2         13.070     0.52     1.02
Water       True          12.781
            HGC           12.783   375.00  5897.53
            VFBN1         12.805    35.80   360.45
            VFBN2         12.795     0.88     1.85
Alarm       True          10.448
            HGC           10.455   279.93  7200.15
            VFBN1         10.439    15.07    87.87
            VFBN2         10.447     0.81     2.92
Hailfinder  True          49.115
            HGC           54.403   123.12  7200.62
            VFBN1         48.885    23.80   194.97
            VFBN2         48.889     0.17     3.22
Munin1      True          37.849
            VFBN2         38.417     0.95    47.38
Pigs        True         330.277
            VFBN2        319.559     1.24   636.98
Link        True         210.153
            VFBN2        232.449    12.29  2451.95
Munin4      True         171.874
            VFBN2        173.757     4.61  3003.28
KDD-Cup     Empty          2.446     0.00
            HGC            2.282    69.59  6345.13
            VFBN1          2.272    12.05   335.37
            VFBN2          2.301     0.42    15.40
UW-80       Empty          1.611     0.00
            HGC            1.346    75.53  7201.05
            VFBN1          1.273    14.93   457.18
            VFBN2          1.269     1.98    12.08
UW-400      Empty          7.002     0.00
            VFBN2          4.556     3.63   905.30

The UW-CSE-400 domain was too large for VFBN1 or HGC. HGC ran for nearly five days on the other data sets, while VFBN2 took less than 20 minutes for each. The systems achieved similar likelihoods when they could run, and always improved on their starting network. Examining the networks produced, we found many potentially interesting patterns. For example, in the e-commerce domain we found products that were much
more likely to be visited when a combination of related products was visited than when only one of those products was visited.

6. RELATED WORK

Our method falls in the general category of sequential analysis [17], which determines at run time the number of examples needed to satisfy a given quality criterion. Other recent examples of this approach include Maron and Moore's
racing algorithm for model selection [13], Greiner's PALO algorithm for probabilistic hill-climbing [7], Scheffer and Wrobel's sequential sampling algorithm [16], and Domingo et al.'s AdaSelect algorithm [2]. Our method goes beyond these in applying to any type of discrete search, providing new formal results, working within pre-specified memory limits, supporting interleaving of search steps, learning from time-changing data, etc. A related approach is progressive sampling [14, 15], where successively larger samples are tried, a learning curve is fit to the results, and this curve is used to decide when to stop. This may lead to stopping earlier than with our method, but stopping can also occur prematurely due to the difficulty in reliably extrapolating learning curves.

Friedman et al.'s Sparse Candidate algorithm [5] alternates between heuristically selecting a small group of potential relatives for each variable and doing a search step limited to considering changing arcs between a variable and its potential relatives. This procedure avoids the quadratic dependency on the number of variables in a domain. Friedman et al. evaluated it on data sets containing 10,000 samples and networks with up to 800 variables.

This paper describes a general method for scaling up learners based on discrete search. We have also developed a related method for scaling up learners based on search in
continuous spaces [4].

7. CONCLUSION AND FUTURE WORK

Scaling up learning algorithms to the massive data sets that are increasingly common is a fundamental challenge for KDD research. This paper proposes that the time used by a learning algorithm should depend only on the complexity of the model being learned, not on the size of the available training data. We present a framework for semi-automatically scaling any learning algorithm that performs a discrete search over model space to be able to learn from an arbitrarily large database in constant time. Our framework further allows transforming the algorithm to work incrementally, to give results anytime, to fit within memory constraints, to support interleaved search, to adjust to time-changing data, and to support active learning. We use our method to develop a new algorithm for learning large Bayesian networks from arbitrary amounts of data. Experiments show that this algorithm is orders of magnitude faster than previous ones, while learning models of essentially the same quality.

Directions for future work on our scaling framework include combining it with the one we have developed for search in continuous spaces, improving the bounds
by taking candidate dependencies into account, constructing a programming library to facilitate our framework's application, and scaling up additional algorithms using it. Future work on VFBN includes extending it to handle missing data values, developing better mechanisms for controlling complexity when data is abundant, and scaling VFBN further by learning local structure at each node (e.g., in the form of a decision tree).

Acknowledgments. We thank Blue Martini and Corin Anderson for providing the Web data sets and the donors and maintainers of the Bayesian network repository. This research was partly supported by NSF CAREER and IBM Faculty awards to the second author, and a gift from the Ford Motor Co.

8. REFERENCES

[1] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15:201-221, 1994.
[2] C. Domingo, R. Gavalda, and O. Watanabe. Adaptive sampling methods for scaling up knowledge discovery algorithms. Data Mining and Knowledge Discovery, 6:131-152, 2002.
[3] P. Domingos and G. Hulten. Mining high-speed data streams. In Proc. 6th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pp. 71-80, Boston, MA, 2000.
[4] P. Domingos and G. Hulten. Learning from infinite data in finite time. In Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA, 2002.
[5] N. Friedman, I. Nachman, and D. Pe'er. Learning Bayesian network structure from massive datasets: The "sparse candidate" algorithm. In Proc. 15th Conf. on Uncertainty in Artificial Intelligence, pp. 206-215, Stockholm, Sweden, 1999.
[6] N. Friedman and Z. Yakhini. On the sample complexity of learning Bayesian networks. In Proc. 12th Conf. on Uncertainty in Artificial Intelligence, pp. 274-282, Portland, OR, 1996.
[7] R. Greiner. PALO: A probabilistic hill-climbing algorithm. Artificial Intelligence, 84:177-208, 1996.
[8] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243, 1995.
[9] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963.
[10] G. Hulten and P. Domingos. A general method for scaling up learning algorithms and its application to Bayesian networks. Technical report, Department of Computer Science and Engineering, University of Washington, Seattle, WA, 2002.
[11] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proc. 7th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pp. 97-106, San Francisco, CA, 2001.
[12] R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000 organizers' report: Peeling the onion. SIGKDD Explorations, 2(2):86-98, 2000.
[13] O. Maron and A. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Advances in Neural Information Processing Systems. Morgan Kaufmann, San Mateo, CA, 1994.
[14] C. Meek, B. Thiesson, and D. Heckerman. The learning-curve method applied to model-based clustering. Journal of Machine Learning Research, 2:397-418, 2002.
[15] F. Provost, D. Jensen, and T. Oates. Efficient progressive sampling. In Proc. 5th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pp. 23-32, San Diego, CA, 1999.
[16] T. Scheffer and S. Wrobel. Incremental maximization of non-instance-averaging utility functions with applications to knowledge discovery problems. In Proc. 18th International Conf. on Machine Learning, pp. 481-488, Williamstown, MA, 2001.
[17] A. Wald. Sequential analysis. Wiley, New York, 1947.
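
Illustrative sketch. The Web Applications experiments describe breaking a request log into approximate sessions (all requests made by a single host until an idle period of 10 minutes) and producing one Boolean attribute per category for each session. The following is a minimal sketch of that preprocessing, not the authors' code. The function names `sessionize` and `to_example`, the `(host, timestamp, path)` record layout, and the treatment of a gap of exactly 10 minutes are assumptions made for illustration; only the 10-minute idle rule and the one-Boolean-per-category encoding come from the text.

```python
from collections import defaultdict

IDLE_SECONDS = 10 * 60  # idle period stated in the text: 10 minutes


def sessionize(requests, idle_seconds=IDLE_SECONDS):
    """Split (host, timestamp_seconds, path) records into approximate sessions.

    A session holds consecutive requests from one host. A gap strictly longer
    than idle_seconds starts a new session (assumption: the text does not say
    how a gap of exactly 10 minutes is handled).
    """
    by_host = defaultdict(list)
    for host, ts, path in requests:
        by_host[host].append((ts, path))

    sessions = []
    for host, hits in by_host.items():
        hits.sort()  # order each host's requests by time
        current = [hits[0][1]]
        last_ts = hits[0][0]
        for ts, path in hits[1:]:
            if ts - last_ts > idle_seconds:
                sessions.append((host, current))
                current = []
            current.append(path)
            last_ts = ts
        sessions.append((host, current))
    return sessions


def to_example(session_paths, categories):
    """One Boolean attribute per category: True if the session visited it."""
    visited = set(session_paths)
    return [c in visited for c in categories]


# Hypothetical log: host "a" pauses for more than 10 minutes before its third hit.
log = [("a", 0, "/x"), ("a", 300, "/y"), ("a", 1000, "/x"), ("b", 10, "/z")]
sessions = sessionize(log)
examples = [to_example(paths, ["/x", "/z"]) for _, paths in sessions]
```

Here paths stand in directly for categories; in the experiments the categories were product types (KDD Cup) or the most commonly visited directories and Web objects (UW-CSE), so a real pipeline would first map each request to its category.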