Anthill: A Scalable Run-Time Environment for Data Mining Applications

Renato Ferreira, Wagner Meira Jr., Dorgival Guedes (Speed-DCC-UFMG)
Lucia Drumond (IC-UFF)

Abstract

Data mining techniques are becoming increasingly popular as a reasonable means to collect summaries from the rapidly growing datasets in many areas. However, as the size of the raw data increases, parallel data mining algorithms are becoming a necessity. In this paper we present a run-time support system that was designed to allow the efficient implementation of data-mining algorithms on heterogeneous distributed environments. We believe that the run-time framework is suitable for a broader class of applications, beyond data mining. We also present, through an example, a parallelization strategy that is supported by the run-time system. We show scalability results for two different data-mining algorithms that were parallelized using our approach and our run-time support. Both applications scale very close to linearly with a large number of nodes.

1. Introduction

Very large datasets are becoming commonplace in many areas. This fact is a consequence of both the continuous drop in the cost of data storage and the continuous increase in the sophistication of the equipment and algorithms that collect and store such data. Analyzing these huge datasets in their raw form is rapidly becoming impractical, and data mining techniques have increased in popularity lately as a means of collecting meaningful summarized information from these huge datasets. However, even the fastest sequential algorithms may not be enough to summarize such a volume of data, and therefore the development of efficient parallel algorithms for such tasks is crucial.

At the same time, Grid Computing [5] is emerging as an alternative to very expensive supercomputers. The Grid is a large distributed system created by connecting together several clusters of machines on different sites through WAN connections. The clusters are sets of homogeneous machines connected by a LAN. While the potential computing power made available by the Grid is very large, exploiting this power is not trivial.

Much work has been done in developing parallel data mining algorithms over the years [11, 7]. The main limitation in all those algorithms, to the best of our knowledge, is that they have not been shown to scale well to a very large number of processors. We have recently published a parallel implementation of the Frequent Itemset Mining problem for large heterogeneous distributed environments, and our experiments have shown it to scale very well [9]. In the process of creating this implementation, we developed both a parallelization strategy for a larger class of data-mining algorithms and a run-time framework to support that strategy. In this paper we focus on these two issues. The framework is referred to as Anthill, and we show two new data mining algorithms that were implemented using the same strategy as the one for the earlier Frequent Itemset Mining problem. The two new algorithms are k-means for clustering and ID3 for classification. Our experiments with these new applications have shown high scalability, similar to the earlier algorithm. In particular, the applications are shown to scale very close to linearly up to dozens of distributed nodes.

We believe that our framework exposes a convenient programming abstraction which is suitable for designing efficient parallel versions of algorithms in several areas besides data mining. The run-time assumes that the applications eventually run on large heterogeneous distributed environments (grids). The starting point of Anthill is Datacutter [3], a data-flow based run-time environment for distributed architectures, but Anthill supports a richer programming model that allows a wide range of parallelizations to be efficiently implemented.

The remainder of the paper is organized as follows. In Section 2 we present the Anthill run-time environment and its programming model. Section 3 describes the ID3 algorithm and its parallelization, and Section 4 presents some experimental results. We conclude and present some future directions in Section 6.
2. Run-time framework

In this section we describe Anthill, our run-time support framework for scalable applications on grid environments. Building applications that may efficiently exploit such an environment while maintaining good performance is a challenge. In this scenario, the datasets are usually distributed across several machines in the Grid. Moving the data to where the processing is about to take place is often inefficient. Usually, for such applications, the resulting data is many times smaller than the input. The alternative is to bring the computation to where the data resides. Success in this approach depends on the application being divided into portions that may be instantiated on different nodes of the Grid for execution. Each of these portions performs part of the transformation on the data, starting from the input dataset and leading to the resulting dataset.

The discussion above indicates that a good parallelization of an application in such an environment should consider both data parallelism and task parallelism at the same time. Our strategy uses these two approaches together with a third approach that works over the time dimension, allowing some degree of asynchronous execution of independent sub-tasks. The benefits of these three dimensions combined produce the high speedups observed in our experiments.

We based Anthill on an earlier run-time support framework for distributed environments called Datacutter, which is in turn based upon the filter-stream programming model. We now describe Datacutter and the extensions that are supported in Anthill.

2.1. Datacutter

The filter-stream programming model was originally proposed for Active Disks [1]. The idea was to create the concept of disklets, or little pieces of the application computation that could be off-loaded to the processors within the disks. In that context, the disklets, or filters, are entities that perceive streams of data flowing in and, after some computation, generate streams of data flowing out. Later, this concept was extended as a programming model suitable for Grid environments, and a runtime system was developed that supported such a model [4]. This runtime system is called Datacutter, and there has been a considerable amount of effort put into various aspects of this system [8].

In Datacutter, streams are abstractions for communication which allow fixed-size, untyped data buffers to be transferred from one filter to another. In a sense, it is very similar to the concept of UNIX pipes. The difference is that while pipes only have one stream of data coming in and one going out, in the proposed model arbitrary graphs with any number of input and output streams are possible.

Creating an application that runs in Datacutter is a process referred to as decomposition into filters. In this process, the application is modeled as a dataflow computation and broken into a network of filters. At execution time, the filters that compose the application are instantiated on several machines comprising a Grid, and the streams are connected from source to destination. To execute an application, a description of the filters and the streams that connect them needs to be provided to the run-time environment. With that information, a number of copies of each of the filters are instantiated on different nodes of the distributed environment. These are referred to as transparent copies of a filter.
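To make the decomposition-into-filters idea concrete, the sketch below shows a toy two-stage pipeline in the filter-stream style. All names here (Stream, ReaderFilter, SummarizerFilter) are hypothetical, invented for illustration only; Datacutter's actual interface is a C++ API and is not reproduced here.

```python
# Minimal sketch of a filter-stream decomposition (hypothetical
# Python API for illustration; not the real Datacutter interface).

class Stream:
    """A unidirectional channel of untyped buffers between two filters."""
    def __init__(self):
        self.buffers = []

    def send(self, buf):
        self.buffers.append(buf)


class ReaderFilter:
    """First stage: reads raw records and pushes them downstream."""
    def process(self, records, out):
        for r in records:
            out.send(r)


class SummarizerFilter:
    """Second stage: consumes buffered records and emits a summary."""
    def process(self, stream):
        return len(stream.buffers)  # trivial 'summary': a record count


# The runtime would instantiate several transparent copies of each
# filter on different nodes and connect the streams between them;
# here everything runs in one process for illustration.
records_stream = Stream()
ReaderFilter().process(["rec1", "rec2", "rec3"], records_stream)
print(SummarizerFilter().process(records_stream))  # -> 3
```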

2.2. Anthill

In this paper we present our parallel programming model and discuss how it supports highly scalable distributed computing. Our approach is based on the simple observation that applications can be decomposed into a pipeline of operations, which represents task parallelism. Further, for many applications, the execution consists of multiple iterations of this pipeline. The application starts with an initial set of possible solutions, and as these possibilities are passed down the pipeline, new possible solutions are created. In our experience, we noticed that many applications fit this model. Also, this strategy allows asynchronous execution, in the sense that several possible solutions are being tested simultaneously at run-time.

Our proposed model, therefore, consists of exploiting the maximum parallelism of the applications by using all three possibilities discussed above: task parallelism, data parallelism, and asynchrony. Because the actual compute units are copies of pipeline stages, we can have very fine-grained parallelism, and since all of these are happening asynchronously, the execution will be mostly bottleneck free. In order to reduce latency, the grain of the parallelism should be defined by the application designer at run-time. Three important issues arise from this proposed model:

1. The transparent copies mechanism allows every stage of the pipeline to be distributed across many nodes of a parallel machine, and the data that goes through that stage can be partitioned across the transparent copies, which represents data parallelism. Sometimes it is necessary for a certain data block to reach one specific copy of a stage of the pipeline;

2. These distributed stages often have some state, which needs to be maintained globally;

3. Because of the nature of the application decomposition, it can be very tricky for the filters to detect that the computation is finished.

These issues are discussed in the following subsections.
2.3. Labeled stream

The labeled stream abstraction is designed to provide a convenient way for the application to allow customized routing of the message buffers to specific transparent copies of the receiving filter. As mentioned earlier, each stage of the application pipeline is actually executed on several different nodes of the distributed environment. Each copy of a pipeline stage is different from the others in the sense that they do data parallelism, meaning that each transparent copy will handle a distinct and independent portion of the space comprised by the stage's input data.

Which copy should handle any particular data buffer depends upon the data itself. With the labeled stream we add a label to every message that traverses the stream, thus creating a tuple (l, m), where l is the label and m is the original message. Instead of just sending the message down the stream, the application will now send the entire tuple.

Associated with each labeled stream there is also a hash function. For every message tuple traversing the stream, the hash function is called with the tuple as parameter. The output of this hash function indicates to the system the particular copy to which that message should be delivered. This mechanism gives the application total control over the messages. Because the hash function is called at run-time, the actual routing decision is taken individually for each message and can change dynamically as the execution progresses. This feature is convenient because it allows dynamic reconfiguration, which is particularly useful to balance the load on dynamic and irregular applications. The hash function is also a little bit relaxed, in the sense that the output does not necessarily have to be one single filter. Instead, it can output a set of filters, and in that case a message can be replicated and sent to multiple instances. This is particularly useful for applications in which one single input data element influences several output data elements. Broadcast is one instance of this situation.
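The following is a minimal sketch of this routing mechanism (hypothetical Python code; the route function, the hash-function signatures, and the inbox lists are our own illustrative names, not Anthill's API):

```python
# Sketch of labeled-stream routing: every message travels as a tuple
# (label, message), and a per-stream hash function maps the label to
# one or more transparent copies of the receiving filter.

def route(label, message, hash_fn, copies):
    """Deliver (label, message) to the copies selected by hash_fn.

    hash_fn may return a single copy index or a set of indices, in
    which case the message is replicated (e.g., for broadcasts)."""
    targets = hash_fn(label, len(copies))
    if isinstance(targets, int):
        targets = {targets}
    for t in targets:
        copies[t].append((label, message))

def partition_by_label(label, n):   # plain data partitioning
    return hash(label) % n

def broadcast(label, n):            # replicate to every copy
    return set(range(n))

copies = [[] for _ in range(4)]     # inboxes of 4 transparent copies
route("itemset-42", b"count=7", partition_by_label, copies)
route("new-round", b"sync", broadcast, copies)
```

Because the hash function is ordinary application code evaluated per message, repartitioning the label space at run-time is enough to shift load between copies, which is the dynamic reconfiguration mentioned above.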

2.4. Global persistent storage

As mentioned earlier, each stage of the pipeline is distributed across many nodes of a Grid environment. Often these stages are stateful. This means that the stage has an internal state which changes as more and more of the computation chunks are passed down the pipeline. Anthill needs a mechanism that allows the set of transparent copies of any filter to share global state. For some applications, once the stage is partitioned across the nodes, the state variables of each transparent copy will reside locally on it. But in many cases, this state may need to be updated from different locations. This is particularly true for situations where the application is dynamically reconfiguring itself to balance the workload, or in failure situations in which the system is automatically recovering.

If we consider the fault tolerance scenario, we add interesting features to this state. It now needs to be stable, in the sense that it has a transactional property: once a change is committed on the state, it has to be maintained. So, having multiple copies of the state on separate hosts is very important for the sake of safety.

As described above, two features are very important for any data structure maintaining the global state. First, the several data points within this state need to migrate conveniently from one filter to the other as the computation progresses. Second, it has to be stable, in the sense that the data stored there needs to be preserved even in the case of failures on individual hosts running filter copies. We are implementing a tuple space, similar to a Linda tuple space, to maintain the state. Such a structure seems very convenient for our purposes. Whenever a filter copy updates a data element, it is updated on the tuple space. A copy of it is maintained on the same filter for performance reasons, while another copy is forwarded to a different host for safekeeping. The system then allows some degree of fault tolerance, in the sense that once a copy of a tuple is safely stored on a different host, the update is assumed to be committed.
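Since the stable-storage layer was still being implemented at the time (see Section 6), the following is only a sketch of the intended commit semantics, with hypothetical names (TupleSpaceState, BackupHost):

```python
# Sketch of the intended commit semantics: an update is applied to a
# local copy for fast access and forwarded to a backup host; it is
# considered committed once safely replicated there.

class BackupHost:
    def __init__(self):
        self.replica = {}

    def store(self, key, value):   # would be a remote call in practice
        self.replica[key] = value
        return True                # acknowledgement

class TupleSpaceState:
    def __init__(self, backup):
        self.local = {}            # copy kept on the filter itself
        self.backup = backup       # copy on a different host

    def update(self, key, value):
        self.local[key] = value
        acked = self.backup.store(key, value)
        return acked               # committed once the backup has it

state = TupleSpaceState(BackupHost())
state.update(("filter-3", "partial-count"), 1729)
```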

2.5. Termination problem

In Anthill, applications are modeled as generic directed graphs. As long as such graphs remain acyclic, termination detection for an application is straightforward: whenever the data in a stream ends, the filter reading from it is notified. When it finishes processing any outstanding tasks, it may propagate the information that the flow ended to its outgoing streams and terminate.

When application graphs have cycles, however, the problem is not that simple. It may be the case that filters operating in a cycle cannot decide by themselves whether a stream has ended or not. Remember that each filter may have any number of copies, all completely independent. Therefore, although a filter copy may have local information indicating its job is done, processing taking place in another copy may produce new data that might travel through the loop and reach the first filter again.

When such a situation happens in an application in Anthill, leaving the task of detecting the termination condition to the programmer is not a reasonable solution. One of the goals of the system is exactly to simplify the development of the application, which would be compromised in that case. In order to avoid that, Anthill implements a complete distributed termination detection protocol which may be relied upon by applications that require it. The protocol is implemented in the run-time system, so applications need not be concerned with it. Whenever the programmer designs a filter graph with cycles, all that is required is an indication of which stream should be chosen by the run-time system to insert an end-of-stream notification. The insertion of that message breaks the cycle and causes the filters to propagate the information accordingly. We now describe our termination algorithm in more detail.

2.5.1. Termination Detection Protocol

For the sake of the termination protocol, the filter graph is replaced by the graph with all filter copies, where each copy is seen as connected to all copies of the filters that are connected to its filter, in both directions (a copy is not connected to the other copies of the same filter, since that is a premise of the original filter-stream model). The algorithm works in rounds: at some point a filter copy suspects the computation has completed and begins to contact its neighbors. If some filter is still computing and produces new data, that round fails and the algorithm proceeds to another round sometime in the future. If all copies of all filters agree on termination during a single round, termination is reached. The final decision is left to a process leader responsible for collecting information from all filters.

Three types of message are exchanged in the protocol: copies that suspect termination has been reached send SUSPECT(R) to their neighbors, stating that they suspect termination in round R; when a copy reaches an agreement with all its neighbors, it notifies the process leader using a TERMINATE(R) message, also identifying the round number; if the leader collects TERMINATE messages from all filter copies for the same round, it broadcasts an END message back to them. Although streams are unidirectional at the application level, the run-time system uses them bidirectionally and guarantees that the communication channels are reliable and that messages between two filter copies are always delivered in the order they were sent. Besides the round counter, each filter copy keeps a list of the neighbors which suspect the same termination round has been reached.

The core of the protocol is illustrated by the extended Finite State Machine (FSM) in Figure 1. Each filter copy may be in one of two states: as long as the copy is running and/or there is data still to be read from its input streams, it remains in the running state. If the filter code blocks waiting for data from streams that are empty, that is an indication that it is done computing (in fact, it waits for a short interval before taking that decision, in case data is about to arrive).

While a filter is computing and/or has data in its input streams still to be read, it does not propagate messages for the termination protocol. If it receives data from another filter (an application message), it removes the sender from its list of neighbors suspected of having terminated, since the sender is obviously computing. If it receives a SUSPECT(R') message from another node, it first checks whether the indicated round R' is the same it is in; if that is not the case and the round R' is newer (R' > R), it updates to the new value and resets its list of suspects. After that, it adds the sender of the SUSPECT message to its list.

Figure 1. Extended FSM for the termination algorithm.

If the run-time system in a filter copy detects it has been idle for some time (no computation taking place, and the copy is blocked waiting for data in an empty input stream), it moves to the suspecting-termination state and notifies all its neighbors, sending them a SUSPECT message with its current round number. It keeps the list of suspected neighbors it collected while in the running state, since they were considered to be in the same termination round as itself.

When in the suspecting state, a copy keeps track of which of its neighbors are in the same round as itself and have also reached a possible termination state. As it receives SUSPECT messages from its neighbors, it adds them to the list of suspects. No reply message is needed since, as that copy is already in round R, it must have sent SUSPECT messages to all neighbors when it entered that state. If the copy receives a SUSPECT message with a larger round number R', this indicates other copies may have gone farther in their processing while that copy remained waiting for consensus from its neighbors. It must therefore update its round counter to the new value and clear its list of suspects (since all entries there belonged to what is now a previous round). Only the sender of the message will be added to the list at that point. Whenever a copy has collected SUSPECT from all its neighbors for a given round, there is widespread suspicion that termination has been reached, although that may be true for just the vicinity of that copy. At that point the process leader must be informed with a TERMINATE(R) message.

While in the suspecting state, application messages may still arrive; it may be the case that one of its neighbors had just been computing for a longer time before it had any data to send. When that happens, the arrival of data in a stream is bound to cause new computation to start in that filter copy, so it must give up its suspicions about termination and get back to work. At that point, it must clear its list of suspected neighbors, prepare itself for a new (future) round by incrementing its round counter, and switch back to the running state.

The process leader, in its turn, must keep track of the newest termination round it has heard of (call it R*). Whenever it receives a TERMINATE(R) message from a filter copy, it must compare R* and R: if R is lower, the message may be simply discarded, since it relates to a round that is already known to have passed; if R is larger than R*, a new round has started and R* is not relevant anymore, so the list of terminated copies must be cleared and just the sender of that message added to it; finally, if they are equal, another filter copy has joined the group of processes that suspect termination has been reached, so it must be added to the list of processes in termination. When that list is complete with all processes, the leader can declare the application over. At that point it broadcasts the END message and all processes take their final steps toward the end. In the filter-stream model, that leads to the reinitialization of the termination protocol, and the stream selected by the user is closed, delivering an end-of-stream notification to the filter reading from it.

Since the filter graph describing an application is required to be strongly connected (although directed), the graph created by the relation of each filter copy to its neighbors will also be, especially since the neighbor relationship is always bidirectional. That way, when all filter copies do reach global termination, they will all be in the suspecting state. Since the value of the round counter grows monotonically and only gets incremented when a copy switches back to the running state, all copies are bound to converge to the same value of R when termination is agreed upon.
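A compact sketch of this per-copy state machine is shown below (illustrative Python; the class names are our own, message transport, the idle-detection timer, and the END broadcast are stubbed out, and re-announcing SUSPECT after a round update is omitted for brevity):

```python
# Sketch of the termination FSM: each filter copy is either running or
# suspecting; rounds fail when application data reappears, and the
# leader declares termination when all copies agree on one round.

RUNNING, SUSPECTING = "running", "suspecting"

class FilterCopy:
    def __init__(self, name, neighbors, leader):
        self.name, self.neighbors, self.leader = name, neighbors, leader
        self.state, self.round, self.suspects = RUNNING, 0, set()

    def on_app_message(self, sender):
        # New data: the sender is obviously still computing.
        self.suspects.discard(sender)
        if self.state == SUSPECTING:
            self.suspects.clear()      # this round has failed;
            self.round += 1            # prepare a new (future) round
            self.state = RUNNING       # and get back to work

    def on_idle(self):
        # Blocked on empty input streams: suspect termination and
        # announce the suspicion to every neighbor.
        self.state = SUSPECTING
        for n in self.neighbors:
            n.on_suspect(self, self.round)

    def on_suspect(self, sender, r):
        if r > self.round:             # a newer round supersedes ours
            self.round, self.suspects = r, set()
        if r == self.round:
            self.suspects.add(sender)  # older rounds are ignored
        if self.state == SUSPECTING and self.suspects >= set(self.neighbors):
            self.leader.on_terminate(self, self.round)

class Leader:
    def __init__(self, all_copies):
        self.copies, self.round, self.done = set(all_copies), 0, set()

    def on_terminate(self, copy, r):
        if r > self.round:             # a newer round: forget old votes
            self.round, self.done = r, set()
        if r == self.round:
            self.done.add(copy)
        if self.done == self.copies:
            print("END: close the user-selected stream to break the cycle")
```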

3. Parallelization of ID3

In this section we describe our parallel implementation of the ID3 decision tree algorithm. In a decision tree, the leaf nodes are the individual data elements. The internal nodes contain an attribute, and each descending pointer encodes a possible value for the attribute mapped on the node, which will distinguish the descendants. The depth of such a tree is the maximum number of questions about attribute values that need to be asked about a data element in order to find one single element in the data. The basic idea of the ID3 algorithm is to use a top-down and greedy search on the data to find the most discriminating attribute on each level of the tree.

For the sake of the filter definition, we distinguish three main tasks to insert a node in the decision tree:

1. For each value of each attribute, count the number of instances that have that value;

2. Compute the information gain of each attribute;

3. Find the attribute with the highest information gain.

The starting point of the ID3 algorithm is a set of tuples, each containing instances of the attributes and one out of the possible classes. Each attribute may assume a number of distinct values. The tree generation process is based on discriminants. Each discriminant is a test on a specific attribute that is used to divide a set of tuples into two or more subsets (depending on the number of different values that occur for the discriminant). Initially there is no discriminant and the partition is the whole set of tuples. Then, as we find new discriminants, we recursively partition the set of tuples into subsets, until all tuples in one partition belong to the same class.

The pseudo-code of the algorithm is presented in Figure 2.

 1  p.atr ← None
 2  p.instance ← None
 3  p.dataset ← D; Partitions ← {p}
 4  while Partitions ≠ ∅ do
 5      for each p ∈ Partitions do
 6          T ← {t ∈ p.dataset : t[p.atr] = p.instance}
 7          for each a ∈ Attributes, v ∈ Values do
 8              for each c ∈ Classes do
 9                  prob[c] ← |{t ∈ T : t.a = v ∧ t.c = c}| / |{t ∈ T : t.a = v}|
10              inf[a,v] ← − Σ_{c ∈ Classes} prob[c] × log(prob[c])
11          for each a ∈ Attributes do
12              prob[v] ← |{t ∈ T : t.a = v}| / |T|
13              gain[a] ← Σ_{v ∈ Values} inf[a,v] × prob[v]
14          disc ← a such that gain[a] is maximum
15          for each v ∈ Values of disc do
16              p′.atr ← disc
17              p′.instance ← v
18              p′.dataset ← T
19              Partitions ← Partitions ∪ {p′}

Figure 2. ID3 Algorithm

Lines 1-3 are the initialization of the Partitions, the first one being the entire database. The loop in line 4 will execute until there are no new partitions. For each of them (line 5), we select the tuples that will compose the partition in line 6. Then we compute the information (using an entropy metric) for each attribute and instance value in lines 7-10. Lines 11-13 compute the information gain for each attribute and, in the final step, we find the attribute that yields the maximum information gain and insert the corresponding partitions into Partitions.
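For concreteness, the entropy metric of lines 7-13 can be sketched in plain Python as follows (this mirrors the pseudocode, not Anthill's filter code; tuples are modeled as dictionaries with a "class" key):

```python
import math
from collections import Counter

def conditional_entropy(tuples, attribute, classes):
    """Entropy metric of lines 7-13: the weighted entropy of the class
    distribution after splitting on `attribute`."""
    total = len(tuples)
    by_value = Counter(t[attribute] for t in tuples)
    score = 0.0
    for v, n_v in by_value.items():
        inf_av = 0.0                      # inf[a,v], lines 7-10
        for c in classes:
            n_vc = sum(1 for t in tuples
                       if t[attribute] == v and t["class"] == c)
            if n_vc:
                p = n_vc / n_v            # prob[c], line 9
                inf_av -= p * math.log2(p)
        score += (n_v / total) * inf_av   # weighted sum, lines 12-13
    return score

# The chosen discriminant minimizes this score, which is equivalent to
# maximizing the information gain, since the dataset entropy is fixed.
data = [{"outlook": "sunny", "class": "no"},
        {"outlook": "rain",  "class": "yes"},
        {"outlook": "sunny", "class": "no"}]
print(conditional_entropy(data, "outlook", {"yes", "no"}))  # -> 0.0
```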
In terms of parallelization, there are two multi-target reductions, in lines 10 and 13, which identify boundaries among filters. We divide the processing into three filters. The first filter, named Counter, performs the operations associated with lines 6 to 9 and is responsible for counting the number of instances that each value of each attribute has. The second filter, Attribute, performs the operations associated with lines 10 to 12, which correspond to computing the information gain for each of the attributes tested in the previous filter. The third, Decision, performs the remaining operations, which correspond to communicating the decision of the appropriate attribute back to the first filter, where the process will continue, selecting new discriminating attributes for each of the produced classes.

In our algorithm we exploited two dimensions of parallelism: base partitioning and decision tree node pipelining. Base partitioning is achieved by running several instances of the Counter filter, so that each instance processes a subset of the database; pipelining happens when one filter is processing one node of the tree while the others are processing other nodes, so all nodes stay busy practically all the time. The granularity of the base-partitioning parallelism is very fine: we can assign a single instance to a filter with no changes to the code. Another source of parallelism is asynchrony: the filters may be working on multiple partitions simultaneously in an attempt to achieve maximum efficiency.
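Continuing the illustrative style of the earlier sketches, the cyclic three-filter graph for ID3 might be declared along the following lines (purely hypothetical notation; the routing-policy strings and the "termination_stream" field are our own names, not Anthill syntax):

```python
# Illustrative declaration of ID3's cyclic filter graph: Counter feeds
# Attribute, Attribute feeds Decision, and Decision broadcasts the
# chosen discriminant back to all Counter copies, closing the cycle.

id3_graph = {
    "filters": ["counter", "attribute", "decision"],
    "streams": [
        # (source, destination, routing of message labels)
        ("counter",   "attribute", "hash on (attribute, value)"),  # counts
        ("attribute", "decision",  "hash on attribute"),           # gains
        ("decision",  "counter",   "broadcast"),   # discriminants close
    ],                                             # the cycle
    # the stream on which the runtime injects the end-of-stream
    # notification once the termination protocol declares the app over
    "termination_stream": ("decision", "counter"),
}
```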

4. Experimental Results

In this section we evaluate the implementation of two data mining algorithms in Anthill, focusing on their efficiency and scalability. The experiments were run on a 16-node cluster of PCs, connected using switched Fast Ethernet. Each node is a 3 GHz Pentium IV with 1 GB of main memory, running Linux 2.6.

4.1. ID3

We start with an evaluation of our ID3 implementation, as described earlier. In these experiments, we run the Decision filter alone on a separate node. The other nodes run both Counter and Attribute filters.

To evaluate our algorithm, we used synthetic datasets that are described in [10]. In particular, we used two classification functions with different complexity: functions 2 and 7. Function 2 is simpler and produces smaller decision trees when compared with function 7. In Table 1 we show the characteristics of the datasets generated for these two functions. The notation Fx-Ay-DzK is used to denote a dataset generated with function x, containing y attributes and z thousand instances.

Dataset         DB Size (MB)   No. Levels   Max Leaves/Level
F2-A32-D1000K   172            5            612
F7-A32-D750K    129            8            195

Table 1. Dataset characteristics

We start by analyzing the speedups. Figures 3 and 4 show the speedups for datasets F2 and F7, respectively. We can observe that the executions on the F2 dataset scale better than those using F7, and both show superlinear behavior. F2 scales better because it is simpler and demands less memory, which seems to affect the speedup significantly.

Figure 3. Speedup, F2-A32-D1000K (speedup vs. number of processors).

Figure 4. Speedup, F7-A32-D750K (speedup vs. number of processors).

In order to understand the superlinear speedups, we performed a detailed analysis of the processor cache usage as we change the number of processors. We used PAPI (the Performance Application Programming Interface [6]) to measure the number of cache misses for each configuration. Figures 5 and 6 show that there is a substantial drop in the number of cache misses as processors are added to the execution, which is expected and explains the superlinear behavior. For instance, increasing the number of processors from 4 to 14 (a factor of 3.5) resulted in a reduction of the total number of cache misses by a factor of 11.

Figure 5. Total L2 cache misses vs. number of processors, F2-A32-D1000K.

Figure 6. Total L2 cache misses vs. number of processors, F7-A32-D750K.

We now focus our analysis on three criteria, to demonstrate how the various aspects of Anthill contribute to the observed speedups.

4.1.1. Task Analysis

Each task in our algorithm is associated with analyzing and determining the discriminant for a decision tree node. Asynchrony arises from overlapping the processing of several tree nodes, which may belong to the same tree level or not. Notice that all tasks from the same tree level are independent, so the parallelism in this case is trivial and has been exploited in other contexts.

We are interested in verifying whether we may observe tasks from more than one level being executed simultaneously, thus exploiting all the potential parallelism present in the algorithm. We evaluate the level of asynchrony by plotting the number of active tasks from each tree level across time. Figure 7 shows the task behavior during the execution running on 16 processors using F7 as input. We clearly see tasks from more than one tree level overlapping during the whole experiment (e.g., at execution time 325), explaining the algorithm's efficiency.

Figure 7. Active tasks vs. execution time (levels 1-8 and total).

4.1.2. Filter Analysis

Table 2 shows the breakdown of the task execution time per filter. We consider executions from 9 to 12 nodes and F7 as the input. We observe that the majority of the processing time of a task (over 95%) occurs in the Counter filter.

Number of Processors   Counter   Attribute   Decision
 9                     96.17     3.36        0.46
10                     95.27     4.20        0.52
11                     96.13     3.30        0.56
12                     96.10     3.36        0.52

Table 2. Percent of the time in each filter. Test F7-A32-D750K.

We confirm the higher demand imposed by the Counter filter by checking the message counters for the same configurations, as presented in Table 3, where we observe how the amount of data to be processed decreases as we go from the Counter to the Decision filter.

Number of Processors   Counter    Attribute   Decision
 9                     36785592   408649      15185
10                     40872880   408649      15185
11                     44960168   408649      15185
12                     49047456   408649      15185

Table 3. Number of sent messages. Test F7-A32-D750K.

Finally, it is interesting to notice that there is a significant amount of parallelism still to explore, since the elapsed time for executing the tasks is usually larger than the elapsed time in each processor (Figure 8), which demonstrates that a larger number of processors would allow even better results.

Figure 8. Elapsed time per filter and overall.

It is interesting to note that, despite the high variability in the demand imposed by the different filters, the application scaled very well. Other task parallelization schemes, such as pipelining, may suffer significant performance degradation as a result of such imbalance, but our application was not affected at all.

4.1.3. Filter Instance Analysis

Finally, we evaluated the performance of the filter instances, in order to assess the imbalance that is generated by the data skewness, by the labeled stream, or even both. One basic metric for such an evaluation is the variability of the execution time of each filter instance. Since the execution time may vary significantly among tasks, comparing task times directly does not quantify the load imbalance among filter instances. We therefore calculated, for each task and for each filter, the relative standard deviation among the execution times of the filter instances. The resulting values are presented in Table 4, where we observe a slight increase in the variability with the number of instances, as expected, since the load assigned to each instance is reduced and skewness may have a stronger impact.

Procs   Counter   Attribute
 9      40.82%    28.45%
10      45.10%    28.59%
11      46.59%    29.65%
12      50.39%    29.79%

Table 4. Variability of execution times among filter instances.

We also observe that the relative standard deviations are quite high, although there is a compensating effect taking place: the instance that works more for a given task works less later, showing that the labeled stream performs well as both a load distribution and a load balancing mechanism.

4.2. Association Analysis

Association analysis determines association rules, which show attribute instances (usually called items) that occur frequently together and present a causality relation between them. We divide the problem of determining association rules into two phases: determining the frequent itemsets and building the rules from them. Since the first phase is much more computationally intensive, we parallelize just the first phase. In this case, the input is a set of transactions, where each transaction contains the objects that occur simultaneously, and the output is the set of items that occur more often than a frequency threshold known as the support.

Most of the association rule algorithms are based on a simple and powerful principle: for an itemset of k items to be frequent, all of its subsets of size k-1 must also be frequent. Based on this principle we may easily build the itemset dependency graph, which makes the dependencies among tasks explicit. Each task is divided into three partitioned reductions: counter, verifier, and candidate generator. The counter reduction just counts the number of occurrences of a given itemset, and these counts are forwarded to the verifier filter. The verifier filter receives the partial counts and adds them, verifying whether the itemset is frequent or not considering the whole database. Whenever the verifier finds a frequent itemset, it informs the candidate generator. The candidate generator keeps track of the itemsets found frequent, so that it is able to check whether an itemset may be counted and verified, according to the task graph.
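A minimal sketch of the verifier and candidate-generator reductions follows (plain Python with our own function names; the real filters operate incrementally on streams of partial counts rather than on complete dictionaries):

```python
from itertools import combinations

def verifier(partial_counts, min_support, total_transactions):
    """Adds per-partition counts and keeps itemsets above the support
    threshold (the 'verifier' reduction)."""
    totals = {}
    for counts in partial_counts:            # one dict per counter copy
        for itemset, n in counts.items():
            totals[itemset] = totals.get(itemset, 0) + n
    return {s for s, n in totals.items()
            if n / total_transactions >= min_support}

def candidate_generator(frequent_k):
    """Emits (k+1)-itemsets whose every k-subset is frequent, i.e. the
    Apriori principle that defines the task dependency graph."""
    items = sorted({i for s in frequent_k for i in s})
    k = len(next(iter(frequent_k))) if frequent_k else 0
    out = set()
    for combo in combinations(items, k + 1):
        if all(frozenset(sub) in frequent_k for sub in combinations(combo, k)):
            out.add(frozenset(combo))
    return out

counts = {frozenset({"a"}): 3, frozenset({"b"}): 2, frozenset({"c"}): 1}
frequent = verifier([counts], min_support=0.5, total_transactions=4)
print(candidate_generator(frequent))   # -> {frozenset({'a', 'b'})}
```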

In these experiments we used different synthetic databases, with sizes ranging from 560 MB to 2.24 GB, generated using the procedure described in [9]. These databases mimic the transactions in a retailing environment. All the experiments were performed with a minimum support value of 0.1%. A sensitivity analysis was conducted on data distribution, data size, and degree of parallelism.

To better understand the effects of data distribution, we distributed the transactions among the partitions in two different ways:

Random Transaction Distribution: Transactions are randomly distributed among equal-sized partitions. This strategy tends to reduce data skewness, since all partitions have an equal probability of taking a given transaction.

Original Transaction Distribution: The database is simply split into partitions, preserving its original data skewness.

We evaluate the parallel performance of the data mining application by means of two metrics: speedup and scaleup. As can be seen in Figure 9, better speedup numbers are achieved with the original transaction distribution. The reason is that the baseline time (i.e., the execution time with the smallest number of counter filters) is much larger when the database has its original (skewed) transaction distribution. However, as we increase the number of counter filters, the execution times for the different data distributions tend to approach each other (since the partitions get smaller and parallel opportunities are reduced). For the scaleup experiment we increase (in the same proportion) the database size and the number of counter filters. As we can see, our parallel data mining application scales very well, even for skewed databases.

Figure 9. Parallel performance of the Apriori: execution times (T20D3.2MI1K and T20D6.4MI1K), speedup, and scaleup (T20D3.2M-D12.8MI1K), all with minsupp=0.1%, for the original and random distributions.

In order to better understand the dynamics of the applications, we defined some metrics that may be used to analyze them. We will focus on the data mining algorithm, but the same applies to the vision application. We may divide the determination of the support of an itemset into four phases:

Activation: The various notifications necessary for an itemset to become a candidate may not arrive at the same time, and the verification filter has to wait until the conditions for an itemset to be considered a candidate are satisfied.
Contention: After the itemset is considered a good candidate, it may wait in the processing queue of the counter filter.

Counting: The counter filters may not start simultaneously, and the counting phase is characterized by the counter filters calculating the support of a candidate itemset in each partition.

Checking: The support counters of each counter filter may not arrive at the same time at the support checker filter, and the checking phase is the time period during which those notifications arrive.

Next we analyze the duration of these phases in both the speedup and scaleup experiments.

The analysis of the speedup experiments explains the efficiency achieved, while the analysis of the scaleup experiments shows the scalability.

In Table 5 we show the duration of the phases just described for configurations employing 8, 16, and 32 processors. The rightmost column also shows the average processing cost for counting an itemset, where we can see that this cost falls as the number of processors increases, as expected. The same may be observed for all phases, except for the Activation phase, whose duration seems to reach a limit of around one second. The problem in this case is that the number of processors involved is high, and the asynchronous nature of the algorithm makes the reduction of the Activation time very difficult.

Verifying the timings for the scaleup experiments in Table 6, we confirm the scalability of our algorithm. We can see that an increase in the number of processors and in the size of the database does not significantly affect the measurements; that is, the algorithm implementation does not saturate system resources (mainly communication) when scaled.

Proc   Activation   Contention   Counting    Checking   Processing
 8     2.741046     5.564751     9.412093    8.469050   0.001645
16     1.264842     2.058052     4.893773    4.691232   0.000759
32     1.229330     0.273229     1.129718    1.986129   0.000369

Table 5. Speedup experiments: profiling of itemset processing.

Proc   Activation   Contention   Counting    Checking   Processing
 8     2.741046     5.564751     9.412093    8.469050   0.001645
16     2.628118     5.538353     9.349371    8.403360   0.001596
32     2.439369     5.021002     10.311501   8.906631   0.001594

Table 6. Scaleup experiments: profiling of itemset processing.

5. Clustering

Cluster analysis partitions and determines groups of objects that are similar regarding user-given similarity criteria. In this section we discuss the parallelization of a popular clustering algorithm, k-means.

The algorithm is based on the concept of centroids, which represent the objects that compose a cluster. Each iteration of the algorithm assigns each object to the closest centroid, updating the centroid's value accordingly. The algorithm ends when no object changes cluster or a maximum number of iterations is reached.

Since there is a single task that determines the clusters, there is no task graph. We express the algorithm using two reductions: the assigner and the centroid calculator. The assigner holds the objects that must be clustered based on the centroids and determines which centroid is closest to each object. The list of objects assigned to each centroid is then sent to the centroid calculator, which recalculates the centroids.
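A sketch of the two reductions in plain Python is shown below (illustrative only; in Anthill each would run as several transparent copies, with the labeled stream partitioning the points among the assigner copies):

```python
# Sketch of the two k-means reductions: assigner maps points to their
# closest centroid; centroid_calculator recomputes the centroids.

def assigner(points, centroids):
    """Assigns each point to the index of its closest centroid."""
    def dist2(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    clusters = {i: [] for i in range(len(centroids))}
    for p in points:
        i = min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
        clusters[i].append(p)
    return clusters

def centroid_calculator(clusters, old_centroids):
    """Recomputes each centroid as the mean of its assigned points."""
    new = []
    for i, old in enumerate(old_centroids):
        pts = clusters[i]
        if pts:
            dim = len(old)
            new.append(tuple(sum(p[d] for p in pts) / len(pts)
                             for d in range(dim)))
        else:
            new.append(old)   # keep empty clusters' centroids unchanged
    return new

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.8)]
cents = [(0.0, 0.0), (5.0, 5.0)]
for _ in range(3):            # or iterate until assignments stop changing
    cents = centroid_calculator(assigner(points, cents), cents)
print(cents)
```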

Our clustering evaluation is based on two synthetic datasets, containing 400,000 and 800,000 points to be clustered. Each point has 50 dimensions. We performed experiments evaluating both the scaleup and the speedup. The scaleup is linear and close to 1; that is, the application scales perfectly. The execution times and respective speedups are shown in Figure 10, where we can see the time per iteration of the algorithm for the two datasets and the speedup when varying the number of processors. The speedup is almost linear when clustering the 400,000-point dataset and is super-linear for the 800,000-point dataset. This super-linear behavior comes from the reduction of memory requirements as we increase the number of processors.

Figure 10. Parallel performance of the k-means algorithm: speedup and time per iteration vs. number of processors, for the 400,000- and 800,000-point datasets.

6. Conclusions and Future Work

In this paper we have described a run-time support framework which was developed to support the efficient implementation of a significant class of applications on heterogeneous distributed environments. We have also shown our parallelization strategy, which is the approach used to perform the decomposition of the applications. We believe that the approach can be applied to a large class of applications.

Our experimental results have shown that two applications designed using our strategy and our run-time support scale almost linearly to a large number of nodes. Our experiments, however, were limited to the number of compute nodes we had available for experimentation. In fact, we see no reason why the applications should not keep the same behavior on much larger configurations.

Our run-time support system is a significant evolution on top of Datacutter. In particular, we have implemented the labeled stream infrastructure and the termination detection algorithm. We still do not have an actual implementation of the stable storage for the distributed state. As future work, we are in the process of incorporating fault tolerance into Anthill. A robust implementation of the distributed state is a requirement for such a system.

References

[1] Anurag Acharya, Mustafa Uysal, and Joel Saltz. Active disks: Programming model, algorithms and evaluation. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VIII), pages 81-91. ACM Press, Oct 1998.

[2] M. Beynon, C. Chang, U. Catalyurek, T. Kurc, A. Sussman, H. Andrade, R. Ferreira, and J. Saltz. Processing large-scale multi-dimensional data in parallel and distributed environments. Parallel Computing, 28(5):827-859, 2002.

[3] Michael Beynon, Renato Ferreira, Tahsin M. Kurc, Alan Sussman, and Joel H. Saltz. Datacutter: Middleware for filtering very large scientific datasets on archival storage systems. In IEEE Symposium on Mass Storage Systems, pages 119-134, 2000.

[4] Michael Beynon, Tahsin Kurc, Alan Sussman, and Joel Saltz. Design of a framework for data-intensive wide-area applications. In Heterogeneous Computing Workshop (HCW), pages 116-130. IEEE Computer Society Press, May 2000.

[5] I. Foster and C. Kesselman. The GRID: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1999.

[6] Philip Mucci. The performance API PAPI. White paper of the University of Tennessee, March 2001.

[7] Clark Olson. Parallel algorithms for hierarchical clustering. Parallel Computing, 21(8):1313-1325, 1995.

[8] Matthew Spencer, Renato Ferreira, Michael Beynon, Tahsin Kurc, Umit Catalyurek, Alan Sussman, and Joel Saltz. Executing multiple pipelined data analysis operations in the grid. In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, pages 1-18. IEEE Computer Society Press, 2002.

[9] A. Veloso, W. Meira Jr., R. Ferreira, D. Guedes, and S. Parthasarathy. Asynchronous and anticipatory filter-stream based parallel algorithm for frequent itemset mining. In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), 2004.

[10] M. Zaki, C. Ho, and R. Agrawal. Parallel classification for data mining on shared-memory multiprocessors. In ICDE '99: Proceedings of the 15th International Conference on Data Engineering, page 198, Washington, DC, USA, 1999. IEEE Computer Society.

[11] Mohammed Javeed Zaki and Ching-Tien Ho. Large-scale parallel data mining. Springer-Verlag, 2000.