# A Scalable Auto-tuning Framework for Compiler Optimization



Ananta Tiwari, Chun Chen, Jacqueline Chame, Mary Hall and Jeffrey K. Hollingsworth

University of Maryland, Department of Computer Science, College Park, MD 20740 ({tiwari, hollings}@cs.umd.edu); University of Utah, School of Computing, Salt Lake City, UT 84112 ({chunchen, mhall}@cs.utah.edu); University of Southern California, Information Sciences Institute, Marina del Rey, CA 90292 (jchame@isi.edu)

### Abstract

We describe a scalable and general-purpose framework for auto-tuning compiler-generated code. We combine Active Harmony's parallel search backend with the CHiLL compiler transformation framework to generate in parallel a set of alternative implementations of computation kernels and automatically select the one with the best-performing implementation. The resulting system achieves performance of compiler-generated code comparable to the fully automated version of the ATLAS library for the tested kernels. Performance for various kernels is 1.4 to 3.6 times faster than the native Intel compiler without search. Our search algorithm simultaneously evaluates different combinations of compiler optimizations and converges to solutions in only a few tens of search steps.

## 1 Introduction

The complexity and diversity of today's parallel architectures overly burden application programmers in porting and tuning their code. At the very high end, processor utilization is notoriously low, and the high cost of wasting these precious resources motivates application programmers to devote significant time and energy to tuning their codes. This tuning process must be largely repeated to move from one architecture to another, as all too often, code that performs well on one architecture faces bottlenecks on another. As we are entering the era of petascale systems, the challenges facing application programmers in obtaining acceptable performance on their codes will only grow.

(This work was done when the author was at USC/ISI.)
To assist the application programmer in managing this complexity, much research in the last few years has been devoted to auto-tuning software that employs empirical techniques to evaluate a set of alternative mappings of computation kernels to an architecture and select the mapping that obtains the best performance. Auto-tuning software can be grouped into three categories: (1) self-tuning library generators such as ATLAS, PhiPAC and OSKI for linear algebra and FFTW and SPIRAL for signal processing [21, 3, 20, 9, 22]; (2) compiler-based auto-tuners that automatically generate and search a set of alternative implementations of a computation [7, 24, 11]; and (3) application-level auto-tuners that automate empirical search across a set of parameter values proposed by the application programmer [8, 16]. What is common across all these different categories of auto-tuners is the need to search a range of possible implementations to identify the one that performs comparably to the best-performing solution. The resulting search space of alternative implementations can be prohibitively large. Therefore, a key challenge that faces auto-tuners, especially as we expand the scope of their capabilities, involves scalable search among alternative implementations. As we look to the future, full applications will likely include a mix of auto-tuning software from the above

Page 2

three categories: automatically-generated libraries, compiler-generated code, and application-level parameters exposed to auto-tuning environments. Thus, applications of the future will demand a cohesive environment that can seamlessly combine these different kinds of auto-tuning software and that employs scalable empirical search to manage the cost of the search process. In this paper, we take an important step in the direction of building such an environment. We begin with Active Harmony [8], which permits application programmers to express application-level parameters, and automates the process of searching among a set of alternative implementations. We combine Active Harmony with CHiLL [5], a compiler framework that is designed to support convenient automatic generation of code variants and parameters from compiler-generated or user-specified transformation recipes. In combining these systems, we have produced a unique and powerful framework for auto-tuning compiler-generated code that explores a richer space than compiler-based systems do today, and that can empower application programmers to develop self-tuning applications that include compiler transformations.

A unique feature of our system is a powerful parallel search algorithm which leverages parallel architectures to search across a set of optimization parameter values. Multiple, sometimes unrelated, points in the search space are evaluated at each timestep. With this approach, we both explore multiple parameter interactions at each iteration and also have different nodes of the parallel system evaluate different configurations to converge to a solution faster.

*Figure 1. Parameter search space for tiling and unrolling: runtime of matrix multiplication (N=800) as a function of tile size and unroll amount. (Figure is easier to see in color.)*
In support of this search process, CHiLL provides a convenient high-level scripting interface to the compiler that simplifies code generation and the varying of optimization parameter values.

The remainder of the paper is organized as follows. The next section motivates the need for an effective search algorithm to explore compiler-generated parameter spaces. Section 3 describes our search algorithm, which is followed by a high-level description of CHiLL in section 4. In section 5, we give an overview of the tuning workflow in our framework. Section 6 presents an experimental evaluation of our framework. We discuss related work in section 7. Finally, section 8 provides concluding remarks and future implications of this work.

## 2 Motivation

Today's complex architecture features and deep memory hierarchies require applying non-trivial optimization strategies on loop nests to achieve high performance. This is even true for a simple loop nest like Matrix Multiply. Although naively tiling all three loops of Matrix Multiply would significantly increase its performance, the performance is still well below hand-tuned libraries. Chen et al. [7] demonstrate that automatically-generated optimized code can achieve performance comparable to hand-tuned libraries using a more complex tiling strategy combined with other optimizations such as data copy and unroll-and-jam. Combining optimizations, however, is not an easy task because loop transformation strategies interact with each other in complex ways. Different loop optimizations usually have different

Page 3

goals, and when combined they might have unexpected (and sometimes undesirable) effects on each other. Even optimizations with similar goals but targeting different resources, such as unroll-and-jam plus scalar replacement targeting data reuse in registers, and loop tiling plus data copying for reuse in caches, must be carefully combined. Unroll-and-jam generally has more impact on performance than tiling for caches, since reuse in registers reduces the number of loads and stores. In addition, on architectures with SIMD units, unroll-and-jam can be used to expose SIMD parallelism. The unroll factors must be tuned so that reuse and SIMD are exploited without causing register spilling or instruction cache misses. On the other hand, tiling plus data copying for reuse in caches changes the iteration order and data layout, and may affect reuse in registers and SIMD parallelism. When combining unroll-and-jam and tiling, both unroll and tile sizes must be tuned so that performance gains are complementary.

Figure 1 illustrates these complex interactions by showing the performance of square matrix multiplication (of size 800x800) as a function of tiling and unrolling factors. Tiling factors range up to 80 and unrolling factors up to 32. We see a corridor of the best-performing combinations along the x-y diagonal, where tiling and unrolling factors are equal, and smaller corridors where tile factors are multiples of unroll factors. The best-performing code variant used a tiling factor of 24 and an unrolling factor of 24 and achieves performance of 845 MFLOPS.

Empirical optimization can compensate for the lack of precise analytical models by performing systematic search over a collection of automatically generated code variants. Each variant exposes a set of parameters that controls the application of different transformation strategies.
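To make the two interacting parameters concrete, here is a small Python sketch (illustrative only, not code from the paper) of a square matrix multiply whose inner loops are governed by a tile size and an unroll factor, the two knobs plotted in Figure 1:

```python
# Illustrative sketch (not the paper's code): matrix multiply with a tiling
# factor `tile` and an unrolling factor `unroll`, the two tuning knobs of
# Figure 1. A remainder loop handles iterations left over by unrolling.

def matmul_tiled(A, B, n, tile, unroll):
    """Return C = A * B for n x n matrices, with tiled I/J/K loops and a
    manually unrolled K loop."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):                      # tile-controlling loops
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        acc = C[i][j]
                        k, k_end = kk, min(kk + tile, n)
                        while k + unroll <= k_end:    # unrolled portion
                            for u in range(unroll):   # stands in for the unrolled body
                                acc += A[i][k + u] * B[k + u][j]
                            k += unroll
                        while k < k_end:              # remainder loop
                            acc += A[i][k] * B[k][j]
                            k += 1
                        C[i][j] = acc
    return C
```

Any (tile, unroll) pair yields the same result; only the memory access pattern, and hence the runtime on a real machine, changes, which is exactly the space that Figure 1 plots.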
Parameter configurations for variants serve as points in the search space, and the objective function values associated with the points are gathered by actually running the variants on the target architecture. (The objective function value associated with a point in the search space can be any desired metric of performance, for example time per timestep, MFLOPS, cache utilization, etc.) The success of empirical search is largely driven by how well the chosen search algorithm navigates the search space. The search space shown in Figure 1 is not smooth and contains multiple minima and maxima. The best and the worst configurations are a factor of six apart.

Active Harmony, an automated performance tuning infrastructure supporting both online and offline tuning for scientific applications, provides a selection of search algorithms designed specifically to deal with search spaces where an explicit definition of the objective function is not available. Finding a good set of loop transformation parameters is a good example of the type of search that the Harmony system is designed to address. In the next section, we describe our parameter tuning algorithm for compiler-generated parameter spaces.

**Algorithm 1** PRO for Compiler Optimization

```
 1: Start with an initial simplex of K vertices and evaluate all vertices in parallel.
 2: k <- 0
 3: while stopping criteria not valid do
 4:   Reorder the simplex vertices so that the best-performing vertex comes first.
 5:   Compute reflection points (projected by P) and their function values in parallel.  // Reflection step
 6:   Select the most promising point, i.e. the reflected point with the best value.
 7:   if the most promising point improves on the best vertex then
 8:     Compute expansion points and their function values in parallel.   // Expansion-checking step
 9:     if the expanded most promising point improves on the reflected simplex then  // Accept expansion
10:       accept the expanded simplex as simplex k+1
11:     else  // Send HALT signal to all processes and accept reflection
12:       accept the reflected simplex as simplex k+1
13:     end if
14:   else  // Accept shrink
15:     Compute the shrunken simplex and its function values in parallel.  // Shrink step
16:   end if
17:   k <- k + 1
18: end while
```

## 3 Parameter Tuning Algorithm

As previously shown, the loop transformation parameters interact with each other in complex ways.
The search algorithm used to explore the parameter spaces of compiler-optimized computations must take into account such interactions and be able to tune the parameters simultaneously. The simultaneous tuning, however, leads to added dimensions in the search space. For our purposes, we use a modified version of the Parallel Rank Ordering (PRO) algorithm proposed by Tabatabaee et al. [19]. (Online tuning refers to adapting performance-related parameters during run time. Offline tuning refers to tuning parameters that can be selected at compile or launch time but remain fixed throughout the execution.)
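The decision structure of Algorithm 1 can be sketched as follows. This is a minimal Python reconstruction for illustration only, not the authors' implementation: `reflect`, `expand` and `shrink` stand in for the geometric simplex operations, `evaluate` is the objective function (lower is better), and a thread pool stands in for the parallel nodes.

```python
# Minimal reconstruction (not the authors' code) of one PRO-C iteration:
# reflect all vertices in parallel; if that improves on the best vertex,
# check the expansion; otherwise shrink the simplex around the best vertex.

from concurrent.futures import ThreadPoolExecutor

def pro_c_iteration(simplex, values, reflect, expand, shrink, evaluate):
    """simplex: list of points; values: their objective values (lower is better).
    Returns the next simplex and its objective values."""
    best = min(values)
    with ThreadPoolExecutor() as pool:
        reflected = reflect(simplex, values)
        r_vals = list(pool.map(evaluate, reflected))       # reflection step
        if min(r_vals) < best:                             # reflection successful
            expanded = expand(simplex, values)
            e_vals = list(pool.map(evaluate, expanded))    # expansion-checking step
            promising = r_vals.index(min(r_vals))          # most promising point
            if e_vals[promising] < min(r_vals):
                return expanded, e_vals                    # accept expansion
            return reflected, r_vals                       # accept reflection (HALT others)
        shrunk = shrink(simplex, values)                   # accept shrink
        return shrunk, list(pool.map(evaluate, shrunk))
```

In the real system the evaluations run on independent nodes and can be halted early by a signal; the thread pool here simply waits for all of them.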


Although the original PRO algorithm can effectively deal with high-dimensional search spaces with unknown objective functions, there are important differences between the type of search PRO was designed for and the type of search we want to conduct. First, PRO was designed for online tuning of SPMD-based parallel applications, while our approach needs an offline search. Secondly, Tabatabaee et al. only looked at (hyper-)rectangular search spaces instead of the more general parameter spaces used in our compiler optimization. In addition, we modified the initial simplex construction method to better suit our goal of using all available parallelism. We describe each modification in detail later in this section. We will refer to the modified algorithm as PRO-C (PRO for Compiler Optimization).

The parameter tuning algorithm is given in Algorithm 1. For a function of N variables, PRO-C maintains a set of K points forming the vertices of a simplex in an N-dimensional space. Each simplex transformation step (lines 5, 8 and 15) of the algorithm generates up to K new vertices by reflecting, expanding, or shrinking the simplex around the best vertex. After each transformation step, the objective function values associated with the newly generated points are calculated in parallel. The reflection step is considered successful if at least one of the new points has a better value than the best point in the simplex. If the reflection step is not successful, the simplex is shrunk around the best point. A successful reflection step is followed by an expansion check step (line 9). If the expansion check step is successful, the expanded simplex is accepted. Otherwise, the reflected simplex is accepted and the search moves on to the next iteration. A graphical illustration of the reflection, expansion and shrink steps is shown in Figure 2 for a 2-dimensional search space and a 4-point simplex.
In the remainder of this section, we describe the modifications that we made to the original PRO algorithm to make it suitable for searching compiler-generated parameter spaces.

### 3.1 Parallelizing the Expansion Check Step

Recall that each simplex transformation step generates up to K new vertices. The time required to complete the parallel evaluation of these new vertices is the time taken by the worst-performing vertex. (Each simplex transformation is considered to be a search step within one search iteration. One iteration of the search algorithm consists of all the simplex transformations that happen between successive reflection steps.) The decision to introduce the expansion-check step in PRO was motivated by the observation that there are some expansion points with very poor performance. For online tuning of SPMD-based parallel applications, such configurations slow down not only the search but also the execution of the application itself. To avoid these time-consuming instances, before evaluating all expansion points, PRO first calculates the performance of only the most promising expansion point, at the expense of parallelism. If the expansion checking step is successful, the algorithm performs the expansion of the other points in the simplex. Assuming we have enough nodes available, each iteration of PRO therefore takes at most three search steps (reflection, expansion check and expansion).

*Figure 2. Simplex transformation steps.*

In an offline parallel search, however, the processors participating in the search are independent, which allows us to take full advantage of the underlying parallelism while still avoiding expansion points with poor performance. To that end, PRO-C evaluates all expansion points, and the decision to accept or reject the expanded simplex is based on the performance of the most promising case.
If the performance reported by the most promising case is worse than that of the best point in the reflected simplex, our system sends a signal to all the other processors to stop the evaluation of their candidate configurations and accepts the reflected simplex. (The most promising point is the point in the original simplex whose reflection around the best point returns the better function value.) The expansion of the simplex is accepted if the performance of the most promising case is better than the best vertex in the reflected simplex. With this modification, we not only reduce the number of steps within one iteration of the search algorithm to at


most two (reflection-expansion and reflection-shrink), but also increase parallelism.

### 3.2 Projection Operator for an Arbitrary Space

Offline tuning of loop transformation parameters is a constrained optimization problem. Therefore, in each step we have to make sure that the computed points are admissible, i.e. that they satisfy the constraints. The projection operator, the function P used in the pseudocode, takes care of this problem by mapping points that are not admissible to admissible points. PRO uses a simple method that independently maps the computed value of each parameter to its lower or upper limit, whichever is closer. This method works well for hyper-rectangular search spaces, but not when we have an arbitrarily shaped space defined by (possibly non-linear) constraints on parameter values. Our projection operator accommodates such arbitrarily shaped spaces by projecting an inadmissible point to its nearest admissible neighbor. We define distance between points using the L1 distance, which is the sum of the absolute differences of their coordinates. The nearest neighbor of an inadmissible point (calculated in terms of L1 distance) will thus be a legal point with the least amount of change (in terms of parameter values) summed over all dimensions.

Computing the least distance unfortunately involves finding nearest neighbors in a high-dimensional space, which is a computationally intensive task. After experimenting with multiple nearest-neighbor algorithms, we adopted the Approximate Nearest Neighbor (ANN) [2] algorithm for two reasons. First, for approximate neighbors, ANN has linear space requirements and logarithmic time complexity in the number of points in the search space. Second, an efficient implementation of the ANN library is available [15]. The library supports a variety of metrics to define distance between points, including the L1 distance metric. We set the error bound ε to 0.5, which, for L1 distance, means an error of at most one along at most one dimension is tolerated; this is a fairly small price to pay for logarithmic query time.
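The projection described above can be illustrated with a brute-force Python sketch. The paper uses the ANN library for scalability; the exhaustive search below only shows the semantics, and the capacity constraint is a hypothetical stand-in:

```python
# Illustrative sketch of the projection operator of Section 3.2. The paper
# uses the ANN library; this exact brute-force version only demonstrates the
# semantics: map an inadmissible point to its nearest legal point under L1.

def l1(p, q):
    """L1 (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def project(point, admissible):
    """Return `point` if it is admissible, else its nearest admissible neighbor."""
    if point in admissible:
        return point
    return min(admissible, key=lambda q: l1(point, q))

# Hypothetical constraint (a toy stand-in for a cache-capacity constraint):
# tile sizes TI, TJ in 1..16 with TI * TJ <= 16.
admissible = [(ti, tj) for ti in range(1, 17) for tj in range(1, 17)
              if ti * tj <= 16]
print(project((8, 4), admissible))   # -> (8, 2): legal, and only 2 away in L1
```

Because the legal region is not a box, clamping each coordinate independently (PRO's method) could land on an illegal point such as (8, 4) itself; the L1 nearest-neighbor projection cannot.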
(Given an ε > 0, a (1+ε)-approximate nearest neighbor of a query point q is a point p such that dist(p, q) is at most (1+ε) times the distance from q to its true nearest neighbor.)

### 3.3 Simplex Construction and Size

The initial simplex, with size K, needs to be non-degenerate so that it can span the whole parameter space; therefore, K must be at least N+1, where N is the number of tunable parameters. For a discrete parameter space, PRO's simplex construction method can generate only a limited number of points. In PRO-C, we extend the method to generate K points for any K ≥ N+1. To exploit all available parallelism, K can be set to the number of resources/processors available. Unlike PRO's strategy of starting the search at the center of the search space (which is hard to ascertain in a high-dimensional constrained space), we randomly select points at the start of the algorithm. The first iteration of the algorithm evaluates these random configurations. The initial simplex is then constructed by randomly sampling K points at a fixed L1 distance from the best-performing point. The set of search directions (vectors from the initial best point to the sampled points) generated in this fashion is guaranteed to be a linearly independent set, which is important because this property gives us unique parameter interactions.

In section 4, we describe CHiLL, our loop transformation and code generation framework.

## 4 CHiLL: A Framework for Composing High-Level Loop Transformations

Automatic tuning requires a compiler to be able to generate different code rapidly during the search by adjusting parameter values, without costly compiler reanalysis. It also demands that the compiler have a clean interface to a separate parameter search engine. CHiLL [5, 6], a polyhedral loop transformation and code generation framework, provides such a capability for composing high-level loop transformations, with a script interface to describe the transformations and the search space to the search engine. The polyhedral representation of loops allows compilers to compose complex loop transformations in a mathematically rigorous way to ensure code correctness.
However, existing polyhedral frameworks are often too limited in supporting the wide array of loop transformations (for both perfect and imperfect loop nests) required to achieve high performance on today's computer architectures. CHiLL employs new design features such as iteration space alignment and auxiliary loops to greatly expand the capability of a polyhedral framework. Further, its high-level script interface allows compilers or application programmers to use a common interface to describe parameterized code transformations to be applied to a computation, whose parameters can be instantiated by an external search engine to find the best-performing implementation. We now briefly describe CHiLL's new features.


```fortran
      DO I=2,N
s1      SUM(I)=0
        DO J=1,I-1
s2        SUM(I)=SUM(I)+A(J,I)*B(J)
s3      B(I)=B(I)-SUM(I)
```

*Figure 3. Representing loop nests and transformations: (a) original code (above); (b) aligned iteration spaces; (c) dependence graph with the flow, anti and output dependences among s1, s2 and s3; (d) transformation relations to generate the original loop nest in (a).*

### 4.1 Polyhedral Representation

In a polyhedral representation, a loop nest is represented by the collection of the iteration spaces of the statements inside the loop nest. Each statement has its own iteration space, derived from its enclosing loops. Thus, for imperfect loop nests, the number of dimensions of the iteration spaces of individual statements as initially derived may be different. An additional iteration space alignment brings each statement to be represented in the same unified iteration space. To generate imperfectly nested transformed loops, auxiliary loops are added to determine the lexicographical order among loops at each loop level. We will discuss both concepts in detail below.

Iteration space alignment can be thought of as a generalization of code sinking and loop fusion. For an imperfect loop nest such as the one in Figure 3(a), CHiLL extracts the iteration space for each statement as in Figure 3(b). Note that in CHiLL's representation every statement in the loop nest has the same number of dimensions in its iteration space. Although s1 and s3 are only surrounded by one loop, their iteration spaces are still 2-dimensional; more precisely, each represents a line aligned in a 2-dimensional iteration space. Once the iteration spaces of all statements are aligned in the same iteration space, CHiLL can transform perfect and imperfect loop nests systematically, and the legality of a transformation can be determined in the same way as for perfect loop nests, i.e., from the data dependences (e.g.
Figure 3(c)) prior to the transformation. The complete algorithm for iteration space alignment can be found in [5].

Auxiliary loops are introduced to allow a systematic code generation strategy for both perfect and imperfect loop nests. If the aligned iteration spaces only included dimensions for each loop level, there would be no information available as to the relationship or required execution order among statements, or how loops and statements would be organized at a specific loop level. To keep a simple and robust polyhedral scanning strategy for code generation, an auxiliary loop is associated with each loop level in the original nest. Each auxiliary loop carries the execution order of statements and loops at its associated level. An additional auxiliary loop is associated with the statements within the deepest level of the iteration space, and carries the execution order of these statements. By setting different constant integer values for these auxiliary loops, CHiLL establishes the lexicographical order of loops at each loop level as well as the lexicographical order of statements in the innermost loop. So for an n-deep loop nest, we have (2n+1)-dimensional iteration spaces [a0, i1, a1, ..., in, an], where the a's are auxiliary loops. Each loop transformation from an n-deep loop nest to a new m-deep loop nest is represented as a set of relations mapping [a0, i1, ..., in, an] to [a0', i1', ..., im', am']. Figure 3(d) shows the transformation relations that generate the original loop nest, with the initial auxiliary loop values as yet unknown. Since only constant values are allowed in auxiliary loops, no actual loops are generated for them in the final transformed code.

### 4.2 Code Transformation Recipes

CHiLL takes as input the original code and a loop transformation recipe (a CHiLL script) describing how


to optimize the code. Each line of the script describes a transformation to be applied to an existing loop representation. For illustration purposes, we list some of the most common high-level loop transformations below. As a general rule, each loop transformation affects the set of statements within the specified loop.

- **permute([stmt],order)**: the loop order of stmt is permuted to the new order, which is represented by a sequence of integers identifying the loops. If permute does not have the stmt parameter, the loop order of all statements is permuted.
- **tile(stmt,loop,size,[outer-loop])**: tile the loop at level loop of stmt, with the tile-controlling loop at loop level outer-loop (default value 1) and tile size size.
- **unroll(stmt,loop,size)**: unroll stmt's loop at level loop by unroll factor size. For all unrolled statements, the inner loop bodies below loop level loop are jammed together.
- **datacopy(stmt,loop,array,[index])**: for the specified array in stmt, a temporary array copy is introduced for all array accesses touched within loop level loop. The index (default value 0) specifies which subscript in array corresponds to the new temporary array's first index (assuming Fortran array layout). The array accesses in stmt are replaced by the appropriate temporary array accesses.
- **split(stmt,loop,condition)**: split stmt's loop level loop into multiple loops according to condition. The original stmt's iteration space will satisfy condition; the iteration spaces satisfying the complement of condition are split into new statements.
- **nonsingular(matrix)**: transform the perfect loop nest according to the nonsingular matrix. This includes both unimodular and non-unimodular transformations.

In the next section, we describe how the CHiLL and Active Harmony frameworks interact with each other to generate a set of alternative implementations of computation kernels and to automatically search for and select the best-performing implementation.
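A recipe like those in Table 1 is, in effect, a parameterized script whose unbound names (TI, UJ, and so on) a search engine later fills in. The following toy Python sketch illustrates that interaction; it is not CHiLL's actual interface, and the recipe below follows the MM recipe listed in Table 1:

```python
# Toy sketch (not CHiLL's actual interface): a transformation recipe as data,
# with unbound parameter names that the search engine instantiates before
# asking the code generator for a variant. Commands follow Table 1's MM recipe.

MM_RECIPE = [
    ("permute",  ([3, 1, 2],)),
    ("tile",     (0, 2, "TJ")),
    ("tile",     (0, 2, "TI")),
    ("tile",     (0, 5, "TK")),
    ("datacopy", (0, 3, 2, 1)),
    ("datacopy", (0, 4, 3)),
    ("unroll",   (0, 4, "UI")),
    ("unroll",   (0, 5, "UJ")),
]

def instantiate(recipe, params):
    """Substitute concrete values for the unbound parameter names in a recipe."""
    return [(cmd, tuple(params.get(a, a) if isinstance(a, str) else a
                        for a in args))
            for cmd, args in recipe]

variant = instantiate(MM_RECIPE, {"TI": 24, "TJ": 24, "TK": 64, "UI": 4, "UJ": 4})
# variant[1] is now ("tile", (0, 2, 24)); each such binding of the unbound
# names defines one point in the search space that the search engine explores.
```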
## 5 Overall System Workflow

Figure 4 shows the overall workflow of our system. In the proposed framework, code transformation recipes and parameter specifications (i.e. parameter domains and constraints) can either be generated by the compiler automatically or by the users tuning their application code. With this flexibility, our approach can support both fully automated compiler optimizations and user-directed tuning. For our experiments, we translate loop transformation sequences from the algorithms presented by Chen et al. [7] to CHiLL scripts. Specifications for unbound parameters in the scripts are derived using simple heuristics based on architectural parameters (e.g., we consider cache capacity to generate constraints for tile sizes). We elaborate more on parameter specification in the next section. If a user with domain knowledge wants more control over what part of the parameter space to focus on, he or she can provide additional constraints to fine-tune the search space.

*Figure 4. Overall system workflow diagram.*

Using the parameter specifications, we normalize the domain of each parameter onto our internal integer-based coordinate system. This step is necessary to ensure that the differences in the ranges of values that parameters can take in different dimensions do not unduly influence the L1 distance metric. Parameters that appear in one or more constraints are considered to be interdependent and are evaluated as sets. For example, tile-size parameters for multiple loops may appear in one or more cache capacity constraints. A simple constraint solver is then used to enumerate points for each of these sets. Projection of an inadmissible point to a valid point in the search space is done (by the projection server) separately for the different groups of parameters.

At each search step, Active Harmony's search kernel requests CHiLL's code generator to generate code variants with given sets of parameters for the loop transformations.
The CHiLL-generated code variants are then compiled and run in parallel on the target architecture by the optimization driver. Measured performance values are consumed by the search kernel to make simplex transformation decisions.
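The normalization step mentioned above, mapping each parameter's domain onto an internal integer coordinate system so that unequal ranges do not skew the L1 distance, can be sketched as follows. The choice of 100 steps is an illustrative assumption, not a value from the paper; the two ranges are those used for the kernels in Table 1.

```python
# Illustrative sketch of domain normalization: each parameter's range is
# mapped onto integer coordinates 0..steps so that a unit move means a
# comparable relative change in every dimension of the L1 metric.

def normalizer(lo, hi, steps=100):
    """Return a pair of functions mapping [lo, hi] onto 0..steps and back."""
    span = hi - lo
    def to_coord(v):
        return round((v - lo) * steps / span)
    def from_coord(c):
        return lo + c * span / steps
    return to_coord, from_coord

# Parameters from Table 1: tile sizes in [0, 512] and unroll factors in [1, 16].
to_tile, from_tile = normalizer(0, 512)
to_unroll, from_unroll = normalizer(1, 16)
# A mid-range tile size (256) and a mid-range unroll factor now land near the
# same coordinate, so neither dimension dominates the distance computation.
```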


**Table 1. Kernels used for the experiments.**

**MM**, naive code:

```fortran
DO I=1,N
  DO J=1,N
    DO K=1,N
      C[I,J] = C[I,J] + A[I,K]*B[K,J]
```

Transformation recipe: `permute([3,1,2])`, `tile(0,2,TJ)`, `tile(0,2,TI)`, `tile(0,5,TK)`, `datacopy(0,3,2,1)`, `datacopy(0,4,3)`, `unroll(0,4,UI)`, `unroll(0,5,UJ)`. Constraints: tile sizes in [0, 512], unroll factors in [1, 16].

**TRSM**, naive code:

```fortran
DO I=1,N
  DO J=1,N
    DO K=1,N
      B(I,J) = B(I,J) - B(K,J)*A(I,K)
```

Transformation recipe: `permute([1,3,2])`, `tile(0,3,TK)`, `split(0,2,L3>=L1+TK)`, `tile(0,3,TI,2)`, `tile(0,3,TJ,2)`, `datacopy(0,3,2)`, `datacopy(0,4,3,1)`, `unroll(0,4,UJ1)`, `unroll(0,5,UI1)`, `datacopy(1,2,3,1)`, `unroll(1,2,UJ2)`, `unroll(1,3,UI2)`. Constraints: tile sizes in [0, 512], unroll factors in [1, 16].

**Jacobi**, naive code:

```fortran
DO K=2,N-1
  DO J=2,N-1
    DO I=2,N-1
      A(I,J,K) = C*(B(I-1,J,K)+B(I+1,J,K)+B(I,J-1,K)+B(I,J+1,K)+B(I,J,K-1)+B(I,J,K+1))
```

Transformation recipe: `original()`, `tile(0,3,TI)`, `tile(0,3,TJ)`, `tile(0,3,TK)`, `unroll(0,5,UJ)`. Constraints: tile sizes in [0, 512], unroll factors in [1, 16].

## 6 Experimental Results

In this section, we present an experimental evaluation of our framework. First, we use a Matrix Multiplication kernel to explore the effectiveness of PRO-C on the search space for loop transformation parameters. We study how the size of the initial simplex (and hence the degree of parallelism) affects the convergence and performance of the search algorithm. In the second part, we use our framework to optimize additional computational kernels: Triangular Solver (TRSM) and Jacobi. The use of the linear algebra kernels Matrix Multiplication and Triangular Solver was motivated by our goal to compare the effectiveness of our framework to well-tuned codes. The results for the Jacobi kernel show that our underlying polyhedral framework is a general-purpose loop transformation tool, which can handle arbitrary code beyond the linear algebra domain. In addition, MM, TRSM and Jacobi all exhibit complex parameter interactions (discussed in section 2) on today's computer architectures. For all the kernels, we provide the original code, the transformation recipe and the constraints on unbound parameters in Table 1. The experiments were performed on a 64-node Linux cluster.
Each node is equipped with dual Intel Xeon 2.66 GHz (SSE2) processors. L1-cache and L2-cache sizes are 128 KB and 4096 KB respectively. We compare the performance of our code versions with those of the native compiler (ifort 10.0.026, compiled with -O3 -xN). When compiling our transformed code, we turn off the native compiler's loop transformations to prevent them from interfering with our optimizations. For Matrix Multiplication and Triangular Solver, we present the performance of the ATLAS (version 3.8) self-tuning libraries. In addition to near-exhaustive sampling of the search space, ATLAS uses carefully hand-tuned BLAS routines contributed by expert programmers. To make a meaningful comparison, we also provide the performance of the search-only version of ATLAS, i.e., code generated by the ATLAS Code Generator via pure empirical search. The search-only version was generated by disabling the use of architectural defaults and turning off the use of hand-coded BLAS routines. For all our experiments, unroll factors and tile sizes are constrained by the storage capacity of their associated memory hierarchy levels. In addition, for tile sizes, we use a simple heuristic which tries to fit references with temporal reuse into half of the cache, leaving the other half for references with spatial or no reuse.

### 6.1 Performance of PRO-C

In this section, we use Matrix Multiplication (MM) to demonstrate the effectiveness of parallel search. The optimization strategy reflected in the transformation recipe in Table 1 exploits the reuse of the output matrix in registers, and the reuse of the two input matrices in caches (both inputs have the same amount of temporal reuse, carried by different loops). The transformation recipe applies tiling to place one input matrix in the L1 cache and the other in the L2


cache. Data copying is applied to avoid conflict misses. In addition, to expose SSE optimization opportunities to the Intel compiler, the copying of one input matrix transposes the data into the temporary array. The values of the five unbound parameters TI, TJ, TK, UI and UJ are determined by the search algorithm.

*Figure 5. Effects of different degrees of parallelism on the convergence of PRO-C: speedup over the native compiler across search steps, for 2N (10 nodes), 4N (20 nodes), 8N (40 nodes) and 12N (60 nodes) simplices.*

To study the effect of simplex size, we considered four alternative simplex sizes: 2N (10 nodes), 4N (20 nodes), 8N (40 nodes) and 12N (60 nodes), where N = 5 is the number of unbound parameters for this experiment. Each simplex was constructed around the same initial point, which was randomly selected from the search space at the beginning of the experiment. The search algorithm was run for a square matrix of size 800x800. The results for this experiment are summarized in Table 2. Figure 5 shows the performance of the best point in the simplex across search steps. Searches conducted with the 12N and 8N simplices clearly use fewer search steps than the searches conducted with the smaller simplices. Recall from our discussion in section 2 and from Figure 1 that the loop transformation parameter space is not smooth and contains multiple local minima and maxima. The existence of long stretches of consecutive search steps with minimal or no performance improvement (marked by arrows in Figure 5) in the 2N and 4N cases shows that more search steps are required to get out of local minima with smaller simplices. At the same time, by effectively harnessing the underlying parallelism, the 8N and 12N simplices evaluate more unique parameter configurations (see Table 2) and get out of local minima at a faster rate.

*Figure 6. Performance distribution for randomly chosen MM configurations: percentage of the total samples whose performance exceeds a given MFLOPS value (1.7% of the 100K samples at the high end).*

**Table 2. MM results for alternate simplex sizes.**

| Simplex size | 2N | 4N | 8N | 12N |
|---|---|---|---|---|
| Number of function evals. | 276 | 571 | 750 | 961 |
| Number of search steps | 49 | 32 | 22 | 18 |
| Speedup over native | 2.30 | 2.33 | 2.32 | 2.33 |

The results summarized in Table 2 also show that as the simplex size increases, the number of search steps decreases, thereby confirming the effectiveness of increased parallelism. Using the 12N initial simplex, the search converges to a solution 2.7 times faster than using the 2N initial simplex.

The next question regarding the effectiveness of our framework relates to the quality of the search result. To answer this question, we selected 100,000 uniformly distributed samples from the search space, which has over 70 million total points, and evaluated the performance associated with all the samples. The performance distribution is shown in Figure 6. Approximately 1.7% of the total samples report performance greater than 3 GFLOPS. The best performance (3.22 GFLOPS) was associated with the configuration 160, 6, 162, and 6. For the same problem size, our code delivers 3.17 GFLOPS. The result demonstrates PRO-C's effectiveness on compiler-generated search spaces.

Finally, Figure 7 shows the performance of the code variant produced by the 12N simplex across a range of

Page 10

500 1000 1500 2000 2500 3000 3500 1.5 2.5 3.5 4.5 Matrix Size(N) GFLOPS Matrix Multiplication Results Ifort ATLAS search−only Harmony−CHiLL ATLAS Full Figure 7. Results or MM ernel problem sizes along with the erformance of nativ compiler, TLAS' searc h-only and full ersion. Our co de ersion erforms, on erage, 2.36 times faster than the nativ compiler. The erformance is 1.66 times faster than the searc h-only ersion of TLAS. Our co de arian also erforms within 20% of TLAS' full ersion (with pro cessor-sp ecic hand co ded assem- bly). 6.2 riangular Solv er (TRSM) The optimization strategy for the TRSM ernel is outlined in its transformation recip pro vided in able 1. Tw inner lo ops are erm uted to reuse in registers, and lo ops and are unrolled. or data reuse in cac he, lo op is tiled rst. The splitting con- dition is based on the decision to separate read ac- cess from write access ). After split- ting, one sublo op has non-o erlapping read and write accesses and it is optimized in the same as matrix ultiplication. The other sublo op has only one non- erlapping read access ), for whic data cop is applied to reduce cac he conict misses caused this arra reference. Un ound parameters in the transformation recip 1, 1, and form sev en dimensional parameter space. PR O-C used 60-p oin simplex and con erged to solution in 55 steps ev alu- ating 1,579 unique parameter congurations. Figure sho ws the erformance of the co de arian along with the erformance of the Nativ compiler and oth T- LAS ersions. The parameter conguration selected PR O-C erforms, on erage, 3.62 times faster than 500 1000 1500 2000 2500 3000 0.5 1.5 2.5 3.5 Matrix Size(N) GFLOPS Triangular Solver Results Ifort ATLAS searchonly HarmonyCHiLL ATLAS Full Figure 8. Results or TRSM ernel 50 100 150 200 250 300 350 400 450 350 400 450 500 550 600 650 700 750 800 Matrix Size(N) MFLOPS Jacobi Results Ifort HarmonyCHiLL Figure 9. Results or Jacobi ernel the nativ In tel compiler. 
The erformance, on v- erage, is 1.07 times faster than the searc h-only er- sion of TLAS. Ho ev er, TLAS full-v ersion (with pro cessor-sp ecic hand-tuned assem bly) erformance is 1.55 times faster than our co de-v arian t. 6.3 Jacobi The transformation recip pro vided in able out- lines the optimization strategy use for this ernel. Since only arra has reuse on three dimensions, the lo ops are tiled on three dimensions for reuse in L1 or L2 cac he. Arra ys and access data in the lo op nest in the same order as the dimensionalit of the iteration
space; thus the original loop order is best for spatial reuse in cache and TLB. Finally, the innermost loop is unrolled for register reuse. The four unbound parameters in the script form a four-dimensional parameter space.

PRO-C took 23 steps (870 unique function evaluations) to converge to the parameter values 0, 22, and 1. The tile sizes of 0 suggest that no tiling is needed for two of the loops; tiling only the remaining loop produces the best performance. Also, no unrolling is performed (unroll factor 1). We suspect that the native compiler's scalar replacement cannot take advantage of the available register reuse across the unrolled dimension, so there is little benefit to unrolling. Figure 9 shows the performance of our code variant: on average, it performs 1.35 times faster than the native Intel compiler.

7 Related Work

There are many research projects working on empirical optimization of linear algebra kernels and domain-specific libraries. ATLAS [21] uses this technique to generate highly optimized BLAS routines; it uses near-exhaustive orthogonal search (searching one dimension at a time while keeping the rest of the parameters fixed). The OSKI (Optimized Sparse Kernel Interface) library [20] provides automatically tuned computational kernels for sparse matrices. FFTW [9] and SPIRAL [22] are domain-specific libraries: FFTW combines static models with empirical search to optimize FFTs, and SPIRAL generates empirically tuned Digital Signal Processing (DSP) libraries. Rather than focusing on one particular domain, our framework aims at providing a general-purpose, compiler-based approach to tuning code.

Recently, many research projects on compiler transformation frameworks have focused on facilitating the exploration of the large optimization space of possible compiler transformations and their parameter values. TLOG [13] is a code generator for parameterized tiled loops, where tile sizes are symbolic parameters. Symbolic tile sizes enable static or run-time tile-size optimization without repeatedly generating the code and recompiling it for each tile size.
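The advantage such parameterization provides — changing a tile size without regenerating or recompiling the loop nest — can be illustrated with a toy loop whose tile size is an ordinary runtime variable (an illustrative sketch, not TLOG output; `tiled_sum` is a made-up example):

```python
def tiled_sum(a, tile):
    """Sum a list in tiles of a runtime-chosen size.

    Because `tile` is a plain variable, trying a different tile size
    needs no code regeneration or recompilation of the loop nest.
    """
    total = 0
    for start in range(0, len(a), tile):          # iterate over tiles
        for i in range(start, min(start + tile, len(a))):  # within a tile
            total += a[i]
    return total

data = list(range(100))
print(tiled_sum(data, 8) == sum(data) == tiled_sum(data, 32))  # → True
```

The same result is produced for every tile size; only the traversal order (and hence cache behavior, in a compiled setting) changes.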
POET [23] is a transformation scripting language embedded in an arbitrary programming language; it is interpreted by the POET compiler to apply source-to-source code transformations. The Interactive Compilation Interface (ICI) [10] provides a flexible and portable interface to internal compiler optimizations so that iterative optimization [1] can be applied at the loop or instruction level by adjusting optimization decisions externally. WRaP-IT [11] and Petit [12] are both polyhedral loop transformation frameworks that support composition of transformations. Beyond supporting many high-level loop transformations on perfect loop nests in a single transformation step, and composing many low-level transformations on each individual loop, they also support arbitrary loop transformations on imperfect loop nests. LeTSeE [17] is an iterative optimization tool based on the polyhedral model; it finds all legal affine schedules of a loop nest and explores this space to find the best schedule and parameter values. Pluto [4] is an automatic parallelization and locality optimization tool also based on the polyhedral model.

There is also some work on using search techniques to explore compiler-generated parameter spaces. Kisuki et al. [14] address the problem of selecting tile sizes and unroll factors simultaneously. Different search algorithms are used to search the parameter space: genetic algorithms, simulated annealing, pyramid search, window search, and random search. Qasem et al. [18] use a modified version of a pattern-based direct search algorithm to explore the same search space. Our work considers a much broader range of loop transformations. Also, Kisuki et al. report converging to a solution in hundreds of iterations; by effectively utilizing the underlying parallel infrastructure, we converge to solutions in a few tens of iterations.
8 Conclusion

In this paper, we integrated the capabilities of Active Harmony and CHiLL to create a unique and powerful framework that is capable of both fully automated code transformation and parameter search, as well as user-assisted transformation combined with automatic parameter search. The resulting framework employs a parallel search technique to simultaneously evaluate different combinations of compiler optimizations. Our system is demonstrated on three computational kernels for automatic compilation and tuning in parallel, achieving performance that greatly exceeds the Intel compiler and is comparable to (and sometimes exceeds) the near-exhaustive search of the ATLAS library system.

Our work on this topic is just beginning. In the near term, we plan to explore optimizing larger programs within our framework. We also plan to combine our current offline optimization approach with online optimization of application parameters.

Acknowledgements

This work was supported in part by DOE grants DE-FC02-01ER25489, DE-FG02-01ER25510, DE-FC02-06ER25763, DE-FC02-06ER25765 and DE-FG02-08ER25834, NSF awards EIA-0080206 and CSR-0615412, and a gift from Intel Corporation.

References

[1] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. P. O'Boyle, J. Thomson, M. Toussaint, and C. K. I. Williams. Using machine learning to focus iterative optimization. In Proceedings of the International Symposium on Code Generation and Optimization, Mar. 2004.
[2] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. J. ACM, 45(6):891-923, 1998.
[3] J. Bilmes, K. Asanović, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of the 1997 ACM International Conference on Supercomputing, June 1997.
[4] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral program optimization system. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2008.
[5] C. Chen. Model-Guided Empirical Optimization for Memory Hierarchy. PhD thesis, University of Southern California, 2007.
[6] C. Chen, J. Chame, and M. Hall. CHiLL: A framework for composing high-level loop transformations. Technical report, University of Southern California, 2008.
[7] C. Chen, J. Chame, and M. W. Hall. Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy. In Proceedings of the International Symposium on Code Generation and Optimization, Mar. 2005.
[8] I.-H. Chung and J. K. Hollingsworth. Using information from prior runs to improve automated tuning systems. In SC '04: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, page 30, Washington, DC, USA, 2004. IEEE Computer Society.
[9] M. Frigo. A fast Fourier transform compiler. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, May 1999.
[10] G. Fursin and A. Cohen. Building a practical iterative compiler. In Workshop on Statistical and Machine Learning Approaches to Architectures and Compilation (SMART'07), Jan. 2007.
[11] S. Girbal, N. Vasilache, C. Bastoul, A. Cohen, D. Parello, M. Sigler, and O. Temam. Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies. International Journal of Parallel Programming, 34(3):261-317, June 2006.
[12] W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott. The Omega Library interface guide. Technical Report CS-TR-3445, University of Maryland at College Park, Mar. 1995.
[13] D. Kim, L. Renganarayanan, D. Rostron, S. Rajopadhye, and M. M. Strout. Multi-level tiling: M for the price of one. In SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pages 1-12, New York, NY, USA, 2007. ACM.
[14] T. Kisuki, P. M. W. Knijnenburg, and M. F. P. O'Boyle. Combined selection of tile sizes and unroll factors using iterative compilation. In PACT '00: Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques, page 237, Washington, DC, USA, 2000. IEEE Computer Society.
[15] D. M. Mount. http://www.cs.umd.edu/~mount/ANN/ [last accessed: Feb 09, 2009].
[16] Y. Nelson, B. Bansal, M. Hall, A. Nakano, and K. Lerman. Model-guided performance tuning of parameter values: A case study with molecular dynamics visualization. In Parallel and Distributed Processing, 2008 (IPDPS 2008), IEEE International Symposium on, pages 1-8, April 2008.
[17] L.-N. Pouchet, C. Bastoul, A. Cohen, and J. Cavazos. Iterative optimization in the polyhedral model: Part II, multidimensional time. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'08), pages 90-100, Tucson, Arizona, June 2008. ACM Press.
[18] A. Qasem, K. Kennedy, and J. Mellor-Crummey. Automatic tuning of whole applications using direct search and a performance-based transformation system. J. Supercomput., 36(2):183-196, 2006.
[19] V. Tabatabaee, A. Tiwari, and J. K. Hollingsworth. Parallel parameter tuning for applications with performance variability. In SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, page 57, Washington, DC, USA, 2005. IEEE Computer Society.
[20] R. Vuduc, J. W. Demmel, and K. A. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series, 16:521-530, June 2005.
[21] R. C. Whaley and J. Dongarra. Automatically tuned linear algebra software. In Proceedings of Supercomputing '98, Nov. 1998.
[22] J. Xiong, J. Johnson, R. Johnson, and D. Padua. SPL: A language and compiler for DSP algorithms. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2001.
[23] Q. Yi, K. Seymour, H. You, R. Vuduc, and D. Quinlan. POET: Parameterized optimizations for empirical tuning. In Parallel and Distributed Processing Symposium, 2007 (IPDPS 2007), IEEE International, pages 1-8, March 2007.
[24] K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, and P. Stodghill. Is search really necessary to generate high-performance BLAS? Proceedings of the IEEE: Special Issue on Program Generation, Optimization, and Platform Adaptation, 93(2):358-386, Feb. 2005.

To assist the application programmer in managing this complexity, much research in the last few years has been devoted to auto-tuning software that employs empirical techniques to evaluate a set of alternative mappings of computation kernels to an architecture and select the mapping that obtains the best performance. Auto-tuning software can be grouped into three categories: (1) self-tuning library generators such as ATLAS, PHiPAC and OSKI for linear algebra, and FFTW and SPIRAL for signal processing [21, 3, 20, 9, 22]; (2) compiler-based auto-tuners that automatically generate and search a set of alternative implementations of a computation [7, 24, 11]; and (3) application-level auto-tuners that automate empirical search across a set of parameter values proposed by the application programmer [8, 16]. What is common across all these different categories of auto-tuners is the need to search a range of possible implementations to identify one that performs comparably to the best-performing solution. The resulting search space of alternative implementations can be prohibitively large. Therefore, a key challenge that faces auto-tuners, especially as we expand the scope of their capabilities, involves scalable search among alternative implementations.

As we look to the future, full applications will likely include a mix of auto-tuning software from the above three categories: automatically-generated libraries, compiler-generated code, and application-level parameters exposed to auto-tuning environments. Thus, applications of the future will demand a cohesive environment that can seamlessly combine these different kinds of auto-tuning software and that employs scalable empirical search to manage the cost of the search process.

[Figure 1. Parameter search space for tiling and unrolling: runtime of MM (N=800) as a function of tile size and unroll amount (the figure is easier to see in color).]

In this paper, we take an important step in the direction of building such an environment. We begin with Active Harmony [8], which permits application programmers to express application-level parameters and automates the process of searching among a set of alternative implementations. We combine Active Harmony with CHiLL [5], a compiler framework that is designed to support convenient automatic generation of code variants and parameters from compiler-generated or user-specified transformation recipes. In combining these systems, we have produced a unique and powerful framework for auto-tuning compiler-generated code that explores a richer space than compiler-based systems are exploring today, and that can empower application programmers to develop self-tuning applications that include compiler transformations.

A unique feature of our system is a powerful parallel search algorithm which leverages parallel architectures to search across a set of optimization parameter values. Multiple, sometimes unrelated, points in the search space are evaluated at each timestep. With this approach, we both explore multiple parameter interactions at each iteration and have different nodes of the parallel system evaluate different configurations to converge to a solution faster.
In support of this search process, CHiLL provides a convenient high-level scripting interface to the compiler that simplifies code generation and the varying of optimization parameter values.

The remainder of the paper is organized as follows. The next section motivates the need for an effective search algorithm to explore compiler-generated parameter spaces. Section 3 describes our search algorithm, which is followed by a high-level description of CHiLL in Section 4. In Section 5, we give an overview of the tuning workflow in our framework. Section 6 presents an experimental evaluation of our framework. We discuss related work in Section 7. Finally, Section 8 provides concluding remarks and future implications of this work.

2 Motivation

Today's complex architecture features and deep memory hierarchies require applying non-trivial optimization strategies to loop nests to achieve high performance. This is true even for a simple loop nest like Matrix Multiply. Although naively tiling all three loops of Matrix Multiply would significantly increase its performance, the performance would still be well below hand-tuned libraries. Chen et al. [7] demonstrate that automatically-generated optimized code can achieve performance comparable to hand-tuned libraries by using a more complex tiling strategy combined with other optimizations such as data copy and unroll-and-jam. Combining optimizations, however, is not an easy task, because loop transformation strategies interact with each other in complex ways.

Different loop optimizations usually have different goals, and when combined they might have unexpected (and sometimes undesirable) effects on each other. Even optimizations with similar goals but targeting different resources, such as unroll-and-jam plus scalar replacement targeting data reuse in registers, and loop tiling plus data copy for reuse in caches, must be carefully combined. Unroll-and-jam generally has more impact on performance than tiling for caches, since reuse in registers reduces the number of loads and stores. In addition, on architectures with SIMD units, unroll-and-jam can be used to expose SIMD parallelism. The unroll factors must be tuned so that reuse and SIMD are exploited without causing register spilling or instruction cache misses. On the other hand, tiling plus data copying for reuse in caches changes the iteration order and data layout, and may affect reuse in registers and SIMD parallelism. When combining unroll-and-jam and tiling, both unroll and tile sizes must be tuned so that the performance gains are complementary. Figure 1 illustrates these complex interactions by showing the performance of square matrix (of size 800x800) multiplication as a function of tiling and unrolling factors. Tiling factors range from 1 to 80 and unrolling factors from 1 to 32. We see a corridor of best-performing combinations along the x-y diagonal where tiling and unrolling factors are equal, and smaller corridors where tile factors are multiples of unroll factors. The best-performing code variant used a tiling factor of 24 and an unrolling factor of 24, and achieves a performance of 845 MFLOPS.

Empirical optimization can compensate for the lack of precise analytical models by performing systematic search over a collection of automatically generated code variants. Each variant exposes a set of parameters that controls the application of different transformation strategies.
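Such a systematic search amounts to timing each parameter configuration of a variant and keeping the best. A minimal sketch of that driver, with hypothetical names (`empirical_search`, `run_variant`, `fake_runtime`) and a synthetic stand-in objective in place of a real compile-and-measure step:

```python
import itertools

def empirical_search(run_variant, tile_sizes, unroll_factors):
    """Exhaustively time every (tile, unroll) configuration and return
    the best-performing one -- the brute-force baseline that scalable
    search methods such as PRO-C try to avoid."""
    best_cfg, best_time = None, float("inf")
    for cfg in itertools.product(tile_sizes, unroll_factors):
        t = run_variant(cfg)          # stands in for compile + execute
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time

# Synthetic objective: fastest when tile == unroll and the tile size is
# a multiple of the unroll factor, mimicking the "diagonal corridor"
# and secondary corridors of Figure 1.
def fake_runtime(cfg):
    tile, unroll = cfg
    return 1.0 + abs(tile - unroll) / 8.0 + (tile % unroll != 0) * 0.5

best, t = empirical_search(fake_runtime, range(8, 81, 8), (1, 2, 4, 8))
print(best)   # → (8, 8), the smallest point on the tile == unroll diagonal
```

Exhaustive enumeration like this is exactly what becomes intractable as the number of parameters grows, motivating the guided search described next.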
Parameter configurations for variants serve as points in the search space, and the objective function values associated with those points are gathered by actually running the variants on the target architecture. (The objective function value associated with a point can be any desired metric of performance: for example, time per timestep, MFLOPS, or cache utilization.) The success of empirical search is largely driven by how well the chosen search algorithm navigates the search space. The search space shown in Figure 1 is not smooth and contains multiple minima and maxima; the best and worst configurations differ by a factor of six.

Active Harmony, an automated performance tuning infrastructure supporting both online and offline tuning for scientific applications (online tuning adapts performance-related parameters during run time; offline tuning selects parameters at compile or launch time that remain fixed throughout execution), provides a selection of search algorithms designed specifically to deal with search spaces where an explicit definition of the objective function is not available. Finding a good set of loop transformation parameters is a good example of the type of search that the Harmony system is designed to address. In the next section, we describe our parameter tuning algorithm for compiler-generated parameter spaces.

3 Parameter Tuning Algorithm

As previously shown, the loop transformation parameters interact with each other in complex ways. The search algorithm used to explore the parameter spaces of compiler-optimized computations must take such interactions into account and be able to tune the parameters simultaneously. The simultaneous tuning, however, leads to added dimensions in the search space. For our purposes, we use a modified version of the Parallel Rank Ordering (PRO) algorithm proposed by Tabatabaee et al. [19]. Although the original PRO algorithm can effectively deal with high-dimensional search spaces with unknown objective functions, there are two main differences between the type of search PRO was designed for and the type of search we want to conduct. First, PRO was designed for online tuning of SPMD-based parallel applications, while our approach needs an offline search. Second, Tabatabaee et al. only looked at (hyper-)rectangular search spaces instead of the more general parameter spaces used in our compiler optimization. In addition, we modified the initial simplex construction method to better suit our goal of using all available parallelism. We describe each modification in detail later in this section. We will refer to the modified algorithm as PRO-C (PRO for Compiler Optimization).

Algorithm 1 PRO for Compiler Optimization
 1: Start with an initial simplex S_0 with K vertices and evaluate f at each vertex in parallel.
 2: k <- 0
 3: while stopping criteria not valid do
 4:   Reorder the simplex vertices so that f(v_1) <= ... <= f(v_K).
 5:   Compute the reflection points and their function values in parallel.   {Reflection step}
 6:   m <- arg min_i f(r_i)   {Most promising point}
 7:   if f(r_m) < f(v_1) then
 8:     Compute the expansion points and their function values in parallel.   {Expansion-checking step}
 9:     if f(e_m) < f(r_m) then
10:       S_{k+1} <- expanded simplex   {Accept expansion}
11:     else
12:       S_{k+1} <- reflected simplex   {Send HALT signal to all processes and accept reflection}
13:     end if
14:   else
15:     Compute the shrunken simplex S_{k+1} and evaluate f in parallel.   {Shrink step}
16:   end if
17:   k <- k + 1
18: end while

The parameter tuning algorithm is given in Algorithm 1. For a function of N variables, PRO-C maintains a set of K points forming the vertices of a simplex in the N-dimensional space. Each simplex transformation step (lines 5, 8 and 15) of the algorithm generates up to K new vertices by reflecting, expanding, or shrinking the simplex around the best vertex. After each transformation step, the objective function values associated with the newly generated points are calculated in parallel. The reflection step is considered successful if at least one of the new points has a better function value than the best point in the simplex. If the reflection step is not successful, the simplex is shrunk around the best point. A successful reflection step is followed by an expansion check step (line 9). If the expansion check step is successful, the expanded simplex is accepted; otherwise, the reflected simplex is accepted and the search moves on to the next iteration. A graphical illustration of the reflection, expansion and shrink steps is shown in Figure 2 for a 2-dimensional search space and a 4-point simplex.
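A serial sketch of the simplex update in Algorithm 1, using a toy analytic objective in place of compile-and-run measurements (the function `pro_step` and its notation are ours, not the paper's; in PRO-C each evaluation runs on a separate node):

```python
import numpy as np

def pro_step(simplex, f):
    """One PRO-style iteration: reflect every vertex through the best
    vertex; if the best reflected point improves on the best vertex,
    try the expanded simplex, otherwise shrink toward the best vertex."""
    vals = np.array([f(v) for v in simplex])           # evaluated in parallel in PRO-C
    best = simplex[np.argmin(vals)]
    reflected = [best + (best - v) for v in simplex]   # reflection step
    r_vals = np.array([f(v) for v in reflected])
    if r_vals.min() < vals.min():                      # successful reflection
        expanded = [best + 2 * (best - v) for v in simplex]  # expansion check
        e_vals = np.array([f(v) for v in expanded])
        if e_vals.min() < r_vals.min():
            return expanded                            # accept expansion
        return reflected                               # accept reflection
    return [best + 0.5 * (v - best) for v in simplex]  # shrink around best

f = lambda p: float(np.sum(p ** 2))                    # toy objective
simplex = [np.array(v, dtype=float) for v in ((4, 0), (0, 4), (4, 4))]
for _ in range(20):
    simplex = pro_step(simplex, f)
print(min(f(v) for v in simplex))                      # → 0.0
```

Because the best vertex is preserved by all three transformations, the best value found never worsens from one iteration to the next.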
In the remainder of this section, we describe the modifications that we made to the original PRO algorithm to make it suitable for searching compiler-generated parameter spaces.

3.1 Parallelizing the Expansion Check Step

Recall that each simplex transformation step generates up to K new vertices. The time required to complete the parallel evaluation of these new vertices is the time taken by the worst-performing vertex. (Each simplex transformation is considered to be a search-step within one search iteration; one iteration of the search algorithm consists of all the simplex transformations that happen between successive reflection steps.)

[Figure 2. Simplex transformation steps.]

The decision to introduce the expansion-check step in PRO was motivated by the observation that some expansion points have very poor performance. For online tuning of SPMD-based parallel applications, such configurations slow down not only the search but also the execution of the application itself. To avoid these time-consuming instances, before evaluating all expansion points, PRO first calculates the performance of only the most promising expansion point, at the expense of parallelism. If the expansion-checking step is successful, the algorithm performs the expansion of the other points in the simplex. Assuming we have K nodes available, each iteration of PRO therefore takes at most three search steps (reflection, expansion check and expansion).

In an offline parallel search, however, the processors participating in the search are independent, which allows us to take full advantage of the underlying parallelism while still avoiding expansion points with poor performance. To that end, PRO-C evaluates all expansion points, and the decision to accept or reject the expanded simplex is based on the performance of the most promising case. (The most promising point is the point in the original simplex whose reflection around the best point returns the best function value.) If the performance reported by the most promising case is worse than that of the best point in the reflected simplex, our system sends a signal to all the other processors to stop the evaluation of their candidate configurations, and accepts the reflected simplex. The expansion of the simplex is accepted if the performance of the most promising case is better than the best vertex in the reflected simplex. With this modification, we not only reduce the number of steps within one iteration of the search algorithm to at most two (reflection-expansion and reflection-shrink) but also increase parallelism.
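The accept-or-reject decision — evaluate all expansion points concurrently, but decide from the most promising one and halt the rest on rejection — can be sketched with a thread pool standing in for the parallel nodes (`check_expansion` and its arguments are hypothetical; the real system sends an explicit HALT signal to remote runs rather than cancelling futures):

```python
from concurrent.futures import ThreadPoolExecutor

def check_expansion(expansion_pts, most_promising, best_reflected_val, f):
    """Launch every expansion evaluation at once, but accept or reject
    the whole expanded simplex based only on the most promising point,
    halting outstanding evaluations when the expansion is rejected."""
    with ThreadPoolExecutor(max_workers=len(expansion_pts)) as pool:
        futures = {p: pool.submit(f, p) for p in expansion_pts}
        if futures[most_promising].result() < best_reflected_val:
            # accept expansion: collect every point's measured value
            return True, {p: fut.result() for p, fut in futures.items()}
        for fut in futures.values():   # reject: halt evaluations not yet started
            fut.cancel()
        return False, {}

f = lambda p: p * p                    # toy objective
ok, vals = check_expansion((3, -2, 5), most_promising=-2,
                           best_reflected_val=10.0, f=f)
print(ok)   # → True, since f(-2) = 4 beats the reflected best of 10
```

Note that `Future.cancel` only prevents evaluations that have not yet started; a cluster implementation would interrupt running jobs as well.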
most (reection-expansion and reection-shrink) but also increase parallelism. 3.2 Pro jection Op erator for Arbitrary Space Oine tuning of lo op transformation parameters is constrained optimization problem. Therefore in eac step ha to mak sure that the computed oin ts are admissible, i.e. they satisfy the constrain ts. The pro jection op erator, function ( (used in the pseudo- co de), tak es care of this problem mapping oin ts that are not admissible to admissible oin ts. PR uses simple metho that indep enden tly maps the computed alue of the parameter to its lo er or up- er limit, whic hev er is closer. This metho orks ell for yp er-rectangular searc spaces, but not when ha an arbitrarily shap ed space dened (p ossibly non-linear) constrain ts on parameter alues. Our pro- jection op erator accommo dates suc arbitrarily shap ed spaces pro jecting an inadmissible oin to its near- est admissible neigh or. dene distance et een oin ts using distance, whic is the sum of the absolute dierences of their co ordinates. The nearest neigh or of an inadmissible oin (calculated in terms of will th us legal oin with the least amoun of change (in terms of parameter alues) summed er all dimensions. Computing the least distance unfortunately in- olv es nding the nearest neigh ors in high dimen- sional space, whic is computationally in tensiv task. After exp erimen ting with ultiple nearest-neigh or al- gorithms, adopted the Appro ximate Nearest Neigh- or (ANN) [2 algorithm for reasons. First, for appro ximate neigh ors, ANN has linear space require- men ts and logarithmic time complexit on the um er of oin ts in the searc space. Second, an ecien im- plemen tation of the ANN library is ailable [15 ]. The library supp orts ariet of metrics to dene distance et een oin ts, including distance metric. set 5, whic h, for distance, means error of at most one along at most one dimension is tolerated, whic is fairly small price to pa for logarithmic query time. 
3.3 Simplex Construction and Size

The initial simplex, with size K, needs to be non-degenerate so that it can span the whole parameter space; therefore, K must be at least N + 1, where N is the number of tunable parameters. (Given ε > 0, a (1 + ε)-nearest neighbor of q is a point p such that dist(p, q) ≤ (1 + ε) dist(p*, q), where p* is the true nearest neighbor.) For a discrete parameter space, PRO's simplex construction method can generate only up to 2N points. In PRO-C, we extend the method to generate mN points for any m ≥ 1. To exploit all available parallelism, the simplex size can be set to the number of resources/processors available. Unlike PRO's strategy of starting the search at the center of the search space (which is hard to ascertain in a high-dimensional constrained space), we randomly select points at the start of the algorithm. The first iteration of the algorithm evaluates these random configurations. The initial simplex is constructed by randomly sampling points at distance d (L1 distance) from the best-performing point. The set of search directions/vectors (from the initial best point to the sampled points) generated in this fashion is guaranteed to be a linearly independent set, which is important because this property gives us unique parameter interactions.

In section 4, we describe CHiLL, our loop transformation and code generation framework.

4 CHiLL: A Framework for Composing High-Level Loop Transformations

Automatic tuning requires a compiler to be able to generate different code variants rapidly during the search by adjusting parameter values, without costly compiler reanalysis. It also demands that the compiler have a clean interface to a separate parameter search engine. CHiLL [5, 6], a polyhedral loop transformation and code generation framework, provides such a capability for composing high-level loop transformations, with a script interface to describe the transformations and the search space to the search engine. A polyhedral representation of loops allows compilers to compose complex loop transformations in a mathematically rigorous way to ensure code correctness.
However, existing polyhedral frameworks are often too limited in supporting the wide array of loop transformations (for both perfect and imperfect loop nests) required to achieve high performance on today's computer architectures. CHiLL employs new design features such as iteration space alignment and auxiliary loops to greatly expand the capability of the polyhedral framework. Further, its high-level script interface allows compilers or application programmers to use a common interface to describe parameterized code transformations to be applied to a computation, whose parameters can be instantiated by an external search engine to find the best-performing implementation. We now briefly describe CHiLL's new features.


      DO I=2,N
s1      SUM(I)=0
        DO J=1,I-1
s2        SUM(I)=SUM(I)+A(J,I)*B(J)
s3      B(I)=B(I)-SUM(I)

(a) Original code. (b) Aligned iteration spaces. (c) Dependence graph: flow, anti and output dependence edges among s1, s2 and s3. (d) Transformation relations to generate the original loop nest in (a).

Figure 3. Representing Loop Nests and Transformations.

4.1 Polyhedral Representation

In a polyhedral representation, a loop nest is represented by the collection of iteration spaces of the statements inside the loop nest. Each statement has its own iteration space, derived from its enclosing loops. Thus, for imperfect loop nests, the number of dimensions of the iteration spaces of individual statements, as derived initially, may be different. An additional iteration space alignment step brings each statement to be represented in the same unified iteration space. To generate imperfectly nested transformed loops, auxiliary loops are added to determine the lexicographical order among loops at each loop level. We will discuss both concepts in detail below.

Iteration space alignment can be thought of as a generalization of code sinking and loop fusion. For an imperfect loop nest such as the one in Figure 3(a), CHiLL extracts the iteration space for each statement as in Figure 3(b). Note that in CHiLL's representation every statement in the loop nest has the same number of dimensions in its iteration space. Although s1 and s3 are only surrounded by one loop, their iteration spaces are still 2-dimensional; more precisely, each represents a line aligned in a 2-dimensional iteration space. Once the iteration spaces of all statements are aligned in the same iteration space, CHiLL can transform perfect and imperfect loop nests in a systematic way, and the legality of a transformation can be determined in the same way as for perfect loop nests, i.e., from data dependences (e.g.
Figure 3(c)) prior to the transformation. The complete algorithm for iteration space alignment can be found in [5].

Auxiliary loops are introduced to allow a systematic code generation strategy for both perfect and imperfect loop nests. If the aligned iteration spaces only included dimensions for each loop level, there would be no information available as to the relationship or required execution order among statements, or how loops and statements would be organized at a specific loop level. To keep a simple and robust polyhedral scanning strategy for code generation, an auxiliary loop is associated with each loop level in the original nest. Each auxiliary loop carries the execution order of statements and loops at its associated level. An additional auxiliary loop is associated with the statements within the deepest level of the iteration space, and carries the execution order of these statements. By setting different constant integer values for these auxiliary loops, CHiLL establishes the lexicographical order of loops at each loop level as well as the lexicographical order of statements in the innermost loop. So for an n-deep loop nest, we have (2n + 1)-dimensional iteration spaces [a_1, i_1, a_2, i_2, ..., a_n, i_n, a_{n+1}], where the a_j's are auxiliary loops. Each loop transformation from an n-deep loop nest to a new m-deep loop nest is represented as a set of relations:

  [a_1, i_1, ..., a_n, i_n, a_{n+1}] -> [a'_1, i'_1, ..., a'_m, i'_m, a'_{m+1}]

Figure 3(d) shows the transformation relations to generate the original loop nest, with the initial auxiliary loop values unknown as yet. Since only constant values are allowed in auxiliary loops, no loops are generated for them in the final transformed code.

4.2 Code Transformation Recipes

CHiLL takes as input the original code and a loop transformation recipe (a CHiLL script) describing how


to optimize the code. Each line of the script describes a transformation to be applied to an existing loop representation. For illustration purposes, we list some of the most common high-level loop transformations below. As a general rule, each loop transformation affects a set of statements within the specified loop.

permute([stmt],order): the loop order of stmt is permuted to the new order, which is represented by a sequence of integers identifying the loops. If permute does not have a stmt parameter, it indicates that the loop order of all statements should be permuted.

tile(stmt,loop,size,[outer-loop]): tile the loop at level loop of stmt, with the tile-controlling loop at loop level outer-loop (default value 1) and with tile size size.

unroll(stmt,loop,size): unroll stmt's loop at level loop by unroll factor size. For all unrolled statements, the inner loop bodies below loop level loop are jammed together.

datacopy(stmt,loop,array,[index]): for the specified array in stmt, a temporary array copy construction is introduced for all array accesses touched within loop level loop. The index (default value 0) specifies which subscript of array corresponds to the new temporary array's first index (assuming Fortran array layout). The array accesses in stmt are replaced by appropriate temporary array accesses.

split(stmt,loop,condition): split stmt's loop at level loop into multiple loops according to condition. The original stmt's iteration space will satisfy condition. The iteration space satisfying the complement of condition will be split into new statements.

nonsingular(matrix): transform the perfect loop nest according to the nonsingular matrix. This includes both unimodular and non-unimodular transformations.

In the next section, we describe how the CHiLL and Active Harmony frameworks interact with each other to generate a set of alternative implementations of computation kernels and automatically search for and select the best-performing implementation.
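To illustrate the semantics of unroll with jamming, the sketch below applies the transformation by hand to a simple 2-deep matrix-vector nest. This mimics the effect of the unroll command in Python; it is not CHiLL output, and the function names are ours.

```python
# Hand-written illustration of unroll-and-jam (not CHiLL output): the
# outer i loop is unrolled by a factor of 2 and the copies of the inner
# j loop are fused ("jammed") into one, as the unroll command does for
# statement bodies below the unrolled level.

def matvec_original(a, x, y):
    for i in range(len(y)):
        for j in range(len(x)):
            y[i] += a[i][j] * x[j]

def matvec_unroll_jam(a, x, y):
    u = 2  # unroll factor; the loop body below is written for u = 2
    n, m = len(y), len(x)
    for i in range(0, n - n % u, u):   # unrolled outer loop
        for j in range(m):             # single jammed inner loop
            y[i] += a[i][j] * x[j]
            y[i + 1] += a[i + 1][j] * x[j]
    for i in range(n - n % u, n):      # cleanup iterations
        for j in range(m):
            y[i] += a[i][j] * x[j]
```

Jamming the copies of the inner loop lets the two accumulations share each loaded x[j], which is the register/locality benefit the unroll command exposes.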
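The auxiliary-loop encoding of section 4.1 can be made concrete on the nest of Figure 3(a). The sketch below is our own reading of the scheme: for n = 2 loop levels, each dynamic statement instance gets a 5-dimensional coordinate [a1, i, a2, j, a3], where the constant auxiliary values (our assumed choice of 0/1/2 at the second level) fix the lexicographic order of s1, s2 and s3.

```python
# Hypothetical encoding (our illustration, not CHiLL's internal data)
# of the statements of Figure 3(a) in the (2n+1)-dimensional space
# [a1, i, a2, j, a3] for n = 2 loop levels. The auxiliary values are
# constants fixing lexicographic order; the particular constants used
# here are assumptions.

def schedule(n):
    points = []
    for i in range(2, n + 1):                   # DO I = 2, N
        points.append((0, i, 0, 0, 0))          # s1: SUM(I)=0
        for j in range(1, i):                   # DO J = 1, I-1
            points.append((0, i, 1, j, 0))      # s2: accumulate
        points.append((0, i, 2, 0, 0))          # s3: B(I)=B(I)-SUM(I)
    return points

# Scanning the 5-dimensional points in lexicographic order reproduces
# the original execution order of s1, s2 and s3.
pts = schedule(4)
```

Because the auxiliary coordinates are constants, polyhedral scanning of this space yields no actual loops for them, matching the observation that auxiliary loops never appear in the generated code.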
5 Overall System Workflow

Figure 4 shows the overall workflow of our system. In the proposed framework, code transformation recipes and parameter specifications (i.e., parameter domains and constraints) can be generated either by the compiler automatically or by users tuning their application code. With this flexibility, our approach can support both fully automated compiler optimizations and user-directed tuning. For our experiments, we translate loop transformation sequences from the algorithms presented by Chen et al. [7] into CHiLL scripts.

Figure 4. Overall System Workflow Diagram.

Specifications for unbound parameters in the scripts are derived using simple heuristics based on architectural parameters (e.g., we consider cache capacity to generate constraints for tile sizes). We elaborate more on parameter specification in the next section. If a user with domain knowledge wants more control over which part of the parameter space to focus on, he/she can provide additional constraints to fine-tune the search space.

Using the parameter specifications, we normalize the domain of each parameter onto our internal integer-based coordinate system. This step is necessary to ensure that differences in the ranges of values parameters can take in different dimensions do not unduly influence the distance metric. Parameters that appear in one or more constraints are considered to be interdependent and are evaluated as sets. For example, tile-size parameters for multiple loops may appear in one or more cache-capacity constraints. A simple constraint solver is then used to enumerate the points for each of these sets. Projection of an inadmissible point to a valid point in the search space is done (by the projection server) separately for the different groups of parameters.

At each search step, Active Harmony's search kernel requests CHiLL's code generator to generate code variants with given sets of parameters for loop transformations.
The CHiLL-generated code variants are then compiled and run in parallel on the target architecture by the optimization driver. Measured performance values are consumed by the search kernel to make simplex transformation decisions.
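The normalization step can be sketched as follows; this is an assumed implementation, and the grid resolution of 100 and the function names are illustrative.

```python
# Hedged sketch of mapping heterogeneous parameter domains onto a
# common integer coordinate system, so that a tile size in [0, 512]
# and an unroll factor in [1, 16] contribute comparably to the L1
# distance metric used by the search.

def make_normalizer(lo, hi, steps=100):
    """Map [lo, hi] onto the integer grid {0, ..., steps} and back."""
    def to_grid(v):
        return round((v - lo) * steps / (hi - lo))
    def from_grid(g):
        return lo + g * (hi - lo) / steps
    return to_grid, from_grid

tile_to, tile_from = make_normalizer(0, 512)      # tile-size domain
unroll_to, unroll_from = make_normalizer(1, 16)   # unroll-factor domain

# A mid-range tile size (256) and a mid-range unroll factor (8.5) both
# land near the middle of the shared grid, so neither parameter
# dominates the distance metric merely because of its raw value range.
```

Without this step, a one-unit move in tile size and a one-unit move in unroll factor would be treated as equally large, even though the former is a far smaller fraction of its domain.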


Table 1. Kernels used for the experiments.

Kernel: MM (Matrix Multiplication)
  Naive code:
    DO I=1,N
      DO J=1,N
        DO K=1,N
          C[I,J] = C[I,J] + A[I,K]*B[K,J]
  Transformation recipe:
    permute([3,1,2])  tile(0,2,TJ)  tile(0,2,TI)  tile(0,5,TK)
    datacopy(0,3,2,1)  datacopy(0,4,3)  unroll(0,4,UI)  unroll(0,5,UJ)
  Constraints:
    tile sizes TI, TJ, TK in [0, 512]; unroll factors UI, UJ in [1, 16]

Kernel: TRSM (Triangular Solver)
  Naive code:
    DO J=1,N
      DO K=1,N
        DO I=K+1,N
          B(I,J) = B(I,J) - B(K,J)*A(I,K)
  Transformation recipe:
    permute([1,3,2])  tile(0,3,TK)  split(0,2,L3>=L1+TK)
    tile(0,3,TI,2)  tile(0,3,TJ,2)  datacopy(0,3,2)  datacopy(0,4,3,1)
    unroll(0,4,UJ1)  unroll(0,5,UI1)  datacopy(1,2,3,1)
    unroll(1,2,UJ2)  unroll(1,3,UI2)
  Constraints:
    tile sizes TK, TI, TJ in [0, 512]; unroll factors UJ1, UI1, UJ2, UI2 in [1, 16]

Kernel: Jacobi
  Naive code:
    DO K=2,N-1
      DO J=2,N-1
        DO I=2,N-1
          A(I,J,K) = C*(B(I-1,J,K)+B(I+1,J,K)+B(I,J-1,K)+
                        B(I,J+1,K)+B(I,J,K-1)+B(I,J,K+1))
  Transformation recipe:
    original()  tile(0,3,TI)  tile(0,3,TJ)  tile(0,3,TK)  unroll(0,5,UJ)
  Constraints:
    tile sizes TI, TJ, TK in [0, 512]; unroll factor UJ in [1, 16]

6 Experimental Results

In this section, we present an experimental evaluation of our framework. First, we use a Matrix Multiplication kernel to explore the effectiveness of PRO-C on the search space for loop transformation parameters. We study how the size of the initial simplex (and hence the degree of parallelism) affects the convergence and performance of the search algorithm. In the second part, we use our framework to optimize two additional computational kernels: Triangular Solver (TRSM) and Jacobi. The use of the linear algebra kernels Matrix Multiplication and Triangular Solver was motivated by our goal of comparing the effectiveness of our framework to well-tuned codes. The results for the Jacobi kernel show that our underlying polyhedral framework is a general-purpose loop transformation tool, which can handle arbitrary code beyond the linear algebra domain. In addition, MM, TRSM and Jacobi all exhibit complex parameter interactions (discussed in section 2) on today's computer architectures. For all the kernels, we provide the original code, the transformation recipe and the constraints on unbound parameters in Table 1. The experiments were performed on a 64-node Linux cluster.
Each node is equipped with dual Intel Xeon 2.66 GHz (SSE2) processors. The L1-cache and L2-cache sizes are 128 KB and 4096 KB respectively. We compare the performance of our code versions with those of the native compiler (ifort 10.0.026, compiled with -O3 -xN). When compiling our transformed code, we turn off the native compiler's loop transformations to prevent them from interfering with our optimizations. For Matrix Multiplication and Triangular Solver, we present the performance of the ATLAS (version 3.8) self-tuning libraries. In addition to near-exhaustive sampling of the search space, ATLAS uses carefully hand-tuned BLAS routines contributed by expert programmers. To make a meaningful comparison, we provide the performance of the search-only version of ATLAS: code generated by the ATLAS Code Generator via pure empirical search. The search-only version was generated by disabling the use of architectural defaults and turning off the use of hand-coded BLAS routines.

For all our experiments, unroll factors and tile sizes are constrained by the storage capacity of their associated memory hierarchy levels. In addition, for tile sizes, we use a simple heuristic which tries to fit references with temporal reuse into half of the cache, leaving the other half for references with spatial or no reuse.

6.1 Performance of PRO-C

In this section, we use Matrix Multiplication (MM) to demonstrate the effectiveness of the parallel search. The optimization strategy reflected in the transformation recipe in Table 1 exploits the reuse of C in registers, and the reuse of A and B in caches (A and B have the same amount of temporal reuse, carried by different loops). The transformation recipe applies tiling to keep one input matrix in the L1 cache and the other in the L2


Figure 5. Effects of Different Degrees of Parallelism on the Convergence of PRO-C (best-point speedup over the native compiler vs. search steps, for 2N, 4N, 8N and 12N simplices on 10, 20, 40 and 60 nodes).

cache. Data copying is applied to avoid conflict misses. In addition, to expose SSE optimization opportunities to the Intel compiler, the copying transposes the data into the temporary array. The values for the five unbound parameters TI, TJ, TK, UI and UJ are determined by the search algorithm.

To study the effect of simplex size, we considered four alternative simplex sizes: 2N (10 nodes), 4N (20 nodes), 8N (40 nodes) and 12N (60 nodes), where N (= 5) is the number of unbound parameters for this experiment. Each simplex was constructed around the same initial point, which was randomly selected from the search space at the beginning of the experiment. The search algorithm was run for a square matrix of size 800 x 800.

The results for this experiment are summarized in Table 2. Figure 5 shows the performance of the best point in the simplex across search steps. Searches conducted with the 12N and 8N simplices clearly use fewer search steps than searches conducted with the smaller simplices. Recall from our discussion and the figure in section 2 that a loop transformation parameter space is not smooth and contains multiple local minima and maxima. The existence of long stretches of consecutive search steps with minimal or no performance improvement (marked by arrows in Figure 5) in the 2N and 4N cases shows that more search steps are required to get out of local minima with smaller simplices. At the same time, by effectively harnessing the underlying parallelism, the 8N and 12N simplices evaluate more unique parameter configurations (see Table 2) and get out of

Figure 6.
Performance Distribution for Randomly Chosen MM Configurations.

Table 2. MM Results
  Simplex size:                 2N    4N    8N    12N
  Number of function evals:     276   571   750   961
  Number of search steps:       49    32    22    18
  Speedup over native:          2.30  2.33  2.32  2.33

local minima at a faster rate. The results summarized in Table 2 also show that as the simplex size increases, the number of search steps decreases, thereby confirming the effectiveness of increased parallelism. Using the 12N initial simplex, the search converges to a solution 2.7 times faster than using the 2N initial simplex.

The next question regarding the effectiveness of our framework relates to the quality of the search result. To answer this question, we selected 100,000 uniformly distributed samples from the search space, which has over 70 million total points, and evaluated the performance associated with all the samples. The performance distribution is shown in Figure 6. Approximately 1.7% of the total samples report performance greater than 3 GFLOPS. The best performance (3.22 GFLOPS) was associated with the configuration (160, 6, 162, 6). For the same problem size, our code delivers 3.17 GFLOPS. The result demonstrates PRO-C's effectiveness on compiler-generated search spaces.

Finally, Figure 7 shows the performance of the code variant produced by the 12N simplex across a range of


Figure 7. Results for the MM Kernel (GFLOPS vs. matrix size N; Ifort, ATLAS search-only, Harmony-CHiLL, ATLAS full).

problem sizes, along with the performance of the native compiler and of ATLAS' search-only and full versions. Our code version performs, on average, 2.36 times faster than the native compiler. The performance is 1.66 times faster than the search-only version of ATLAS. Our code variant also performs within 20% of ATLAS' full version (with processor-specific hand-coded assembly).

6.2 Triangular Solver (TRSM)

The optimization strategy for the TRSM kernel is outlined in its transformation recipe provided in Table 1. The two inner loops are permuted to reuse data in registers, and the resulting inner loops are unrolled. For data reuse in cache, loop K is tiled first. The splitting condition is based on the decision to separate read accesses from write accesses. After splitting, one subloop has non-overlapping read and write accesses, and it is optimized in the same way as matrix multiplication. The other subloop has only one non-overlapping read access, for which data copy is applied to reduce the cache conflict misses caused by this array reference.

The unbound parameters in the transformation recipe (TK, TI, TJ, UJ1, UI1, UJ2 and UI2) form a seven-dimensional parameter space. PRO-C used a 60-point simplex and converged to a solution in 55 steps, evaluating 1,579 unique parameter configurations. Figure 8 shows the performance of the code variant along with the performance of the native compiler and both ATLAS versions. The parameter configuration selected by PRO-C performs, on average, 3.62 times faster than the native Intel compiler.

Figure 8. Results for the TRSM Kernel (GFLOPS vs. matrix size N; Ifort, ATLAS search-only, Harmony-CHiLL, ATLAS full).

Figure 9. Results for the Jacobi Kernel (MFLOPS vs. matrix size N; Ifort, Harmony-CHiLL).
The performance, on average, is 1.07 times faster than the search-only version of ATLAS. However, the ATLAS full version (with processor-specific hand-tuned assembly) performs 1.55 times faster than our code variant.

6.3 Jacobi

The transformation recipe provided in Table 1 outlines the optimization strategy we use for this kernel. Since only array B has reuse on three dimensions, the loops are tiled on three dimensions for reuse in the L1 or L2 cache. Arrays A and B access data in the loop nest in the same order as the dimensions of the iteration


space, and thus the original loop order is best for spatial reuse in cache and TLB. Finally, one loop is unrolled for register reuse. The four unbound parameters in the script (TI, TJ, TK and UJ) form a four-dimensional parameter space.

PRO-C took 23 steps (870 unique function evaluations) to converge to the configuration TI = 0, TJ = 22, TK = 0 and UJ = 1. The results for TI and TK suggest that no tiling is needed for the I and K loops; tiling only the J loop produces the best performance. Also, no unrolling is performed. We suspect that the native compiler's scalar replacement cannot take advantage of the available register reuse across the unrolled dimension, so there is little benefit from unrolling. Figure 9 shows the performance of our code variant. On average, our code variant performs 1.35 times faster than the native Intel compiler.

7 Related Work

There are many research projects working on empirical optimization of linear algebra kernels and domain-specific libraries. ATLAS [21] uses empirical optimization to generate highly optimized BLAS routines. It uses a near-exhaustive orthogonal search (searching in one dimension at a time, keeping the rest of the parameters fixed). The OSKI (Optimized Sparse Kernel Interface) [20] library provides automatically tuned computational kernels for sparse matrices. FFTW [9] and SPIRAL [22] are domain-specific libraries. FFTW combines static models with empirical search to optimize FFTs. SPIRAL generates empirically tuned Digital Signal Processing (DSP) libraries. Rather than focusing on one particular domain, our framework aims at providing a general-purpose, compiler-based approach to tuning code.

Recently, many research projects on compiler transformation frameworks have focused on facilitating the exploration of the large optimization space of possible compiler transformations and their parameter values. TLOG [13] is a code generator for parameterized tiled loops where tile sizes are symbolic parameters. Symbolic tile sizes enable static or run-time tile-size optimization without repeatedly generating the code and recompiling it for each tile size.
POET [23] is a transformation scripting language embedded in an arbitrary programming language. It is interpreted by the POET compiler to apply source-to-source code transformations. The Interactive Compilation Interface (ICI) [10] provides a flexible and portable interface to internal compiler optimizations so that iterative optimization [1] can be applied at the loop or instruction level by adjusting optimization decisions externally. WRaP-IT [11] and Petit [12] are both polyhedral loop transformation frameworks that support composition of transformations. They support many high-level loop transformations on perfect loop nests in a single transformation step, and by composing many low-level transformations on each individual loop, they also support arbitrary loop transformations on imperfect loop nests. LeTSeE [17] is an iterative optimization tool based on the polyhedral model. It finds all legal affine schedules of a loop nest and explores this space to find the best schedule and parameter values. Pluto [4] is an automatic parallelization and locality optimization tool, also based on the polyhedral model.

There is also some work on using search techniques to explore compiler-generated parameter spaces. Kisuki et al. [14] address the problem of selecting tile sizes and unroll factors simultaneously. Different search algorithms are used to search the parameter space: genetic algorithms, simulated annealing, pyramid search, window search and random search. Qasem et al. [18] use a modified version of a pattern-based direct search algorithm to explore the same search space. Our work considers a much broader range of loop transformations. Also, Kisuki et al. report converging to a solution in hundreds of iterations. By effectively utilizing the underlying parallel infrastructure, we converge to solutions in a few tens of iterations.
8 Conclusion

In this paper, we integrated the capabilities of Active Harmony and CHiLL to create a unique and powerful framework that is capable of both fully automated code transformation and parameter search, as well as user-assisted transformation combined with automatic parameter search. The resulting framework employs a parallel search technique to simultaneously evaluate different combinations of compiler optimizations. Our system is demonstrated on three computational kernels for automatic compilation and tuning in parallel, achieving performance that greatly exceeds the Intel compiler and is comparable to (and sometimes exceeds) the near-exhaustive search of the ATLAS library system.

Our work on this topic is just beginning; in the near term we plan to explore optimizing larger programs within our framework. We also plan to combine our current offline optimization approach with online optimization of application parameters.

Acknowledgements

This work was supported in part by DOE grants DE-CFC02-01ER25489, DE-FG02-01ER25510, DE-FC02-06ER25763, DE-FC02-06ER25765 and DE-FG02-08ER25834, NSF awards EIA-0080206 and CSR-0615412, and a gift from In-


tel Corporation.

References

[1] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. O'Boyle, J. Thomson, M. Toussaint, and C. K. I. Williams. Using machine learning to focus iterative optimization. In Proceedings of the International Symposium on Code Generation and Optimization, Mar. 2004.
[2] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. J. ACM, 45(6):891-923, 1998.
[3] J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of the 1997 ACM International Conference on Supercomputing, June 1997.
[4] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral program optimization system. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2008.
[5] C. Chen. Model-Guided Empirical Optimization for Memory Hierarchy. PhD thesis, University of Southern California, 2007.
[6] C. Chen, J. Chame, and M. Hall. CHiLL: A framework for composing high-level loop transformations. Technical report, University of Southern California, 2008.
[7] C. Chen, J. Chame, and M. W. Hall. Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy. In Proceedings of the International Symposium on Code Generation and Optimization, Mar. 2005.
[8] I.-H. Chung and J. K. Hollingsworth. Using information from prior runs to improve automated tuning systems. In SC '04: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, page 30, Washington, DC, USA, 2004. IEEE Computer Society.
[9] M. Frigo. A fast Fourier transform compiler. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, May 1999.
[10] G. Fursin and A. Cohen. Building a practical iterative compiler.
In Workshop on Statistical and Machine Learning Approaches to Architectures and Compilation (SMART'09), Jan. 2007.
[11] S. Girbal, N. Vasilache, C. Bastoul, A. Cohen, D. Parello, M. Sigler, and O. Temam. Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies. International Journal of Parallel Programming, 34(3):261-317, June 2006.
[12] W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott. The Omega Library interface guide. Technical Report CS-TR-3445, University of Maryland at College Park, Mar. 1995.
[13] D. Kim, L. Renganarayanan, D. Rostron, S. Rajopadhye, and M. M. Strout. Multi-level tiling: M for the price of one. In SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pages 1-12, New York, NY, USA, 2007. ACM.
[14] T. Kisuki, P. M. W. Knijnenburg, and M. F. O'Boyle. Combined selection of tile sizes and unroll factors using iterative compilation. In PACT '00: Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques, page 237, Washington, DC, USA, 2000. IEEE Computer Society.
[15] D. M. Mount. http://www.cs.umd.edu/~mount/ANN/ [last accessed: Feb 09, 2009].
[16] Y. Nelson, B. Bansal, M. Hall, A. Nakano, and K. Lerman. Model-guided performance tuning of parameter values: A case study with molecular dynamics visualization. In Parallel and Distributed Processing, 2008 (IPDPS 2008), IEEE International Symposium on, pages 1-8, April 2008.
[17] L.-N. Pouchet, C. Bastoul, A. Cohen, and J. Cavazos. Iterative optimization in the polyhedral model: Part II, multidimensional time. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'08), pages 90-100, Tucson, Arizona, June 2008. ACM Press.
[18] A. Qasem, K. Kennedy, and J. Mellor-Crummey. Automatic tuning of whole applications using direct search and a performance-based transformation system. J. Supercomput., 36(2):183-196, 2006.
[19] V. Tabatabaee, A. Tiwari, and J. K.
Hollingsworth. Parallel parameter tuning for applications with performance variability. In SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, page 57, Washington, DC, USA, 2005. IEEE Computer Society.
[20] R. Vuduc, J. W. Demmel, and K. A. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series, 16:521-530, June 2005.
[21] R. C. Whaley and J. Dongarra. Automatically tuned linear algebra software. In Proceedings of Supercomputing '98, Nov. 1998.
[22] J. Xiong, J. Johnson, R. Johnson, and D. Padua. SPL: A language and compiler for DSP algorithms. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2001.
[23] Q. Yi, K. Seymour, H. You, R. Vuduc, and D. Quinlan. POET: Parameterized optimizations for empirical tuning. In Parallel and Distributed Processing Symposium, 2007 (IPDPS 2007), IEEE International, pages 1-8, March 2007.
[24] K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, and P. Stodghill. Is search really necessary to generate high-performance BLAS? Proceedings of the IEEE: Special Issue on Program Generation, Optimization, and Platform Adaptation, 93(2):358-386, Feb. 2005.

