# Large-Scale Parallel Breadth-First Search

luanne-stotts | 2014-12-12 | General

### Presentation text content in Large-Scale Parallel Breadth-First Search

Large-Scale Parallel Breadth-First Search

Richard E. Korf and Peter Schultze
Computer Science Department
University of California, Los Angeles
Los Angeles, CA 90095
korf@cs.ucla.edu, petersch@cs.ucla.edu

#### Abstract

Recently, best-first search algorithms have been introduced that store their nodes on disk, to avoid their inherent memory limitation. We introduce several improvements to the best of these, including parallel processing, to reduce their storage and time requirements. We also present a linear-time algorithm for bijectively mapping permutations to integers in lexicographic order. We use breadth-first searches of sliding-tile puzzles as testbeds. On the 3x5 Fourteen Puzzle, we reduce both the storage and time needed by a factor of 3.5 on two processors. We also performed the first complete breadth-first search of the 4x4 Fifteen Puzzle, with over $10^{13}$ states.

#### Introduction

Breadth-first search is a basic search algorithm. It is used in model checking, to show that certain states are reachable or unreachable, and to determine the radius of a problem space, or the longest shortest path from any given state. It is also used to compute pattern-database heuristics (Culberson & Schaeffer 1998). Breadth-first heuristic search (Zhou & Hansen 2004a) is a space-efficient version of A* (Hart, Nilsson, & Raphael 1968) for problems with unit edge costs. It implements A* as a series of breadth-first search iterations, with each iteration generating all nodes whose costs do not exceed a threshold for that iteration. Other methods for extending disk-based breadth-first search to A* have also been implemented (Korf 2004; Edelkamp, Jabbar, & Schroedl 2004), and the techniques described in this paper apply to such heuristic searches as well.

Breadth-first search is often much more efficient than depth-first search, because the latter cannot detect duplicate nodes representing the same state, and generates all paths to a given state.
For example, with a branching factor of 2.13, a depth-first search of the Fifteen Puzzle, the 4x4 sliding-tile puzzle, to the average solution depth of 53 moves would generate on the order of $10^{17}$ nodes, whereas the entire problem space contains only about $10^{13}$ unique states.

Our goal is to increase the size of feasible searches. The primary limitation of best-first search is the memory needed to store nodes, in order to detect duplicate nodes. Several recent advances have addressed this problem.

Copyright 2005, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

#### Previous Work

##### Frontier Search

Frontier search (Korf 1999; Korf & Zhang 2000; Korf et al. 2005) stores the Open list of generated nodes, but not the Closed list of expanded nodes. This reduces the memory required for breadth-first search from the size of the problem space to the width of the problem space, or the maximum number of nodes at any depth. For the Fifteen Puzzle, for example, this reduces storage by a factor of over 13.

##### Disk Storage

The next advance is storing nodes on magnetic disks (Roscoe 1994; Stern & Dill 1998; Korf 2003; 2004; Zhou & Hansen 2004b; Edelkamp, Jabbar, & Schroedl 2004). Disks cost less than \$1 per gigabyte, compared to \$200 per gigabyte for memory. Disks must be accessed sequentially, however, since disk latency is orders of magnitude greater than memory latency.

There exists a body of work on algorithms for graphs stored explicitly on disk, which focuses on asymptotic I/O complexity. See (Katriel & Meyer 2003), for example. By contrast, we are interested in search algorithms for very large implicit graphs defined by a root node and successor function, which cannot be explicitly stored even on disk.

**Sorting-Based DDD.** In the first of these algorithms (Roscoe 1994; Korf 2003; 2004), each level of breadth-first search starts with a file containing the nodes at the current depth. All these nodes are expanded, and their children are written to another file, without any duplicate checking.
Next, the file of child nodes is sorted by their state representation, bringing duplicate nodes together. A single pass of the sorted file then merges any duplicate nodes. We refer to this as sorting-based DDD, for delayed duplicate detection.

**Hash-Based DDD.** To avoid the time complexity of sorting, hash-based DDD (Korf 2004) uses two orthogonal hash functions, and alternates expansion with merging. In the expansion phase, we expand all the nodes at a given depth, and write the child nodes to different files, based on the value of a first hash function. Any duplicate nodes will be mapped to the same file. Frontier search guarantees that all child nodes are either one level deeper than their parents, in the case of the sliding-tile puzzles, or possibly at the same depth in general.

AAAI-05 / 1380
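As a rough illustration of the expansion phase just described, the following Python sketch routes each child to a bucket chosen by a first hash function, with no duplicate checking. In-memory lists stand in for disk files, and the toy grid problem and the names `expand_to_buckets`, `grid_successors`, and `file_hash` are hypothetical, chosen for this example rather than taken from the paper.

```python
from collections import defaultdict


def expand_to_buckets(parents, successors, file_hash):
    """Expand every parent and append each child to the bucket selected by
    file_hash. No duplicate checking happens here; it is deferred to a
    later merge pass over each bucket."""
    buckets = defaultdict(list)  # stand-in for one disk file per hash value
    for node in parents:
        for child in successors(node):
            buckets[file_hash(child)].append(child)
    return buckets


def grid_successors(state):
    """Toy implicit graph (an assumption of this sketch): step right or up."""
    x, y = state
    return [(x + 1, y), (x, y + 1)]


def file_hash(state):
    """Toy first hash function: the x coordinate modulo 4."""
    return state[0] % 4


# (1, 0) and (0, 1) both generate (1, 1), so that child is written twice,
# but both copies land in the same bucket, where a merge pass can find them.
level = expand_to_buckets([(1, 0), (0, 1)], grid_successors, file_hash)
```

Because duplicates of a state always hash to the same bucket, each bucket can later be merged independently of the others.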

In the merge phase, we process each file, hashing its nodes into memory using a second hash function, thus detecting any duplicate nodes. Finally, we write one copy of each child node back to disk to begin the next iteration. This algorithm was described in (Korf 2004), but only implemented for the 4-peg Towers of Hanoi problem, where ideal hash functions are trivial to compute.

**Structured Duplicate Detection.** Structured duplicate detection (Zhou & Hansen 2004b) detects duplicates as soon as they are generated. All nodes must be divisible into subsets, such that the children of nodes in one subset fall into a small number of subsets. In a sliding-tile puzzle, for example, subsets may be based on the blank position. The children of nodes with one blank position can have at most four other blank positions. Furthermore, all the states in any subset, plus all its child subsets, must fit in memory simultaneously. When expanding nodes in one subset, the children are looked up in the corresponding child subsets in memory. When expanding nodes in another subset, currently resident subsets may have to be swapped out to disk to make room for new parent and child subsets in memory.

##### Symmetry

For some sliding-tile puzzles, symmetry can reduce the space needed by a factor of two (Culberson & Schaeffer 1998). For the Fifteen Puzzle, for example, starting with the blank in a corner, every state has a mirror state computed by reflecting the puzzle about a diagonal passing through the initial blank position, and renumbering the tiles using the same transformation. Some states equal their mirror reflections. The reduction in time is less than a factor of two, due to the overhead of computing the mirror states.

#### Overview of Paper

This work is based on hash-based DDD (Korf 2004), the most promising algorithm for large-scale problems. Hash-based DDD is faster than sorting-based DDD, and preferable to structured duplicate detection for several reasons.
The first is that it doesn't require the subset structure described above. A second reason is that it requires relatively little memory, whereas structured duplicate detection must be able to hold a parent subset and all its child subsets in memory simultaneously. Hash-based DDD is easily parallelized, and can be interrupted and resumed. Finally, hash-based DDD only reads and writes each node at most twice.

We describe a number of improvements designed to reduce both the storage needed and the running time of hash-based DDD. These include efficient state encoding for permutation problems, interleaving expansion and merging, not storing nodes that have no children, parallel processing, and fault tolerance. On the Fourteen Puzzle, we reduce both the storage needed and the running time by a factor of 3.5, using two processors. We also completed a breadth-first search of the Fifteen Puzzle, with over $10^{13}$ states.

#### Efficient Permutation Encoding

Problems such as the sliding-tile puzzles and Rubik's Cube are permutation problems, in that a state represents a permutation of elements. The simplest representation of a permutation is to list the position of each element. For example, a Fifteen Puzzle state can be represented as a 16-digit hexadecimal number, where each digit represents the position of one tile or the blank. This occupies 64 bits of storage.

A more efficient encoding saves storage and reduces I/O time. Ideally, we would map a permutation of $n$ elements to a unique integer from zero to $n! - 1$. For the Fifteen Puzzle, this requires only 45 bits. For example, we could map each permutation to its index in a lexicographic ordering of all such permutations. For permutations of three elements, this mapping is: 012-0, 021-1, 102-2, 120-3, 201-4, 210-5.

An algorithm for this mapping starts with a sequence of positions, and maps it to a number in factorial base of the form:

$$d_{n-1}\cdot(n-1)! + d_{n-2}\cdot(n-2)! + \cdots + d_2\cdot 2! + d_1\cdot 1!$$
Digit $d_i$, the coefficient of $i!$, can range from $0$ to $i$, resulting in a unique representation of each integer. Given a permutation as a sequence of digits in factorial base, we perform the indicated arithmetic operations to compute the actual integer value. To map a permutation to a sequence of factorial digits, we subtract from each element the number of original elements to its left that are less than it. For example, the mapping from permutations of three elements to factorial base digits is: 012-000, 021-010, 102-100, 120-110, 201-200, 210-210. By reducing these factorial digits to an integer, we obtain the desired values: 012-000-0, 021-010-1, 102-100-2, 120-110-3, 201-200-4, 210-210-5.

This algorithm takes $O(n^2)$ time to compute the digits in factorial base. Next we provide an $O(n)$ algorithm. We scan the permutation from left to right, constructing a bit string of length $n$, indicating which elements of the permutation we have seen so far. Initially the string is all zeros. As each element of the permutation is encountered, we use it as an index into the bit string and set the corresponding bit to one. When we encounter element $k$ in the permutation, to determine the number of elements less than $k$ to its left, we need to know the number of ones in the first $k$ bits of our bit string. We extract the first $k$ bits by right shifting the string by $n - k$. This reduces the problem to: given a bit string, count the number of one bits in it.

We solve this problem in constant time by using the bit string as an index into a precomputed table, containing the number of ones in the binary representation of each index. For example, the initial entries of this table are: 0-0, 1-1, 2-1, 3-2, 4-1, 5-2, 6-2, 7-3. The size of this table is $O(2^{n-1})$, where $n$ is the number of permutation elements. Such a table for the Fifteen Puzzle would contain 32,768 entries.
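The two ranking procedures described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, elements are assumed to be the integers $0$ to $n-1$, and the bit string is held in a Python integer whose most significant of $n$ bits corresponds to element 0.

```python
from itertools import permutations


def make_popcount_table(n):
    """Table of one-bit counts for every index below 2**(n-1)."""
    size = 1 << max(n - 1, 0)
    table = [0] * size
    for i in range(1, size):
        table[i] = table[i >> 1] + (i & 1)
    return table


def rank_quadratic(perm):
    """Lexicographic rank, counting smaller left neighbours by rescanning."""
    n = len(perm)
    rank = 0
    for i, k in enumerate(perm):
        smaller_seen = sum(1 for e in perm[:i] if e < k)
        # Horner's rule: the digit at position i ends up weighted by (n-1-i)!
        rank = rank * (n - i) + (k - smaller_seen)
    return rank


def rank_linear(perm, table):
    """Lexicographic rank using a seen-elements bit string and table lookup."""
    n = len(perm)
    seen = 0   # bit (n-1-e) is set once element e has been encountered
    rank = 0
    for i, k in enumerate(perm):
        # Shifting right by n-k keeps only the bits for elements 0..k-1.
        smaller_seen = table[seen >> (n - k)]
        rank = rank * (n - i) + (k - smaller_seen)
        seen |= 1 << (n - 1 - k)
    return rank


table3 = make_popcount_table(3)
ranks = [rank_linear(p, table3) for p in permutations(range(3))]
# ranks == [0, 1, 2, 3, 4, 5], matching the three-element mapping in the text
```

Both functions accumulate the factorial-base digits with Horner's rule, so no factorials are computed explicitly; they differ only in how the count of smaller elements to the left is obtained.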
This gives us a linear-time algorithm for mapping permutations of $n$ elements in lexicographic order to unique integers from zero to $n! - 1$. We implemented both the quadratic and linear algorithms above, and tested them by mapping all permutations of up to 14 elements. For 14 elements, the linear algorithm was seven times faster, and this ratio increases with increasing problem size, as expected.

This mapping is also used in heuristic searches of permutation problems using pattern databases (Culberson & Schaeffer 1998). By mapping each permutation of the pattern to a unique integer, the permutation doesn't have to be stored with the heuristic value, and each location corresponds to
a valid permutation, making efficient use of memory.

To expand a node, we need to regenerate the original permutation from its integer encoding. This can be done in linear time as well, but requires more memory, and is not significantly faster than the quadratic algorithm. The reason is that mapping the integer to a permutation requires integer division and remaindering, which is much more expensive than multiplication, dominating the cost of the mapping.

There exist other algorithms for mapping between permutations and integers in linear time and linear space (Myrvold & Ruskey 2001), but not in lexicographic order. In fact, (Myrvold & Ruskey 2001) claim that, "... it seems that a major breakthrough will be required to do that computation in linear time, if indeed it is possible at all." Our algorithm runs in linear time, but uses $O(2^{n-1})$ space. The space is for a table that is only computed once for a given value of $n$.

#### Perfect Hashing

As described above, hash-based DDD makes use of two hash functions. When a node is expanded, its children are written to a particular file, based on the first hash value. For the Fifteen Puzzle, we map the positions of the blank and tiles 1, 2, and 3 to a unique integer in the range zero to $16 \cdot 15 \cdot 14 \cdot 13 - 1 = 43{,}679$. This value forms part of the name of the file. The diagonal symmetry mentioned above reduces the actual number of files to 21,852.

Since all nodes in any one file have the blank and first three tiles in the same positions, we only have to specify the positions of the remaining twelve tiles. Since only half the initial states of a sliding-tile puzzle are solvable, the positions of the last two tiles are determined by the positions of the other tiles. Thus, we only specify the positions of ten tiles, by mapping their positions to a unique integer from zero to $12!/2 - 1 = 239{,}500{,}799$, requiring 28 bits.

To avoid regenerating expanded nodes, frontier search stores with each node its used operators, which lead to neighboring nodes that have already been expanded.
Since a sliding-tile puzzle state has at most four operators, moving a tile up, down, left, or right, we need four used-operator bits. Thus, a Fifteen Puzzle state can be stored in $28 + 4 = 32$ bits, which is half the storage needed without this encoding.

Since the states in a file are already encoded in a 28-bit integer, this is used as the second hash value. To merge the duplicate nodes in a given file, we set up a hash table in memory with 239,500,800 4-bit locations, initialized to all zeros. We then read each node from the file, map it to its unique location in the hash table, and OR its used-operator bits to those already stored in the table, if any, thus taking the union of used-operator bits of duplicate nodes. Finally, we write a single copy of each node, along with its used-operator bits, to a merged file. A perfect hash function that maps each state to a unique value saves a great deal of memory, since we don't have to store the state in the table, nor use empty locations or pointers to handle collisions.

After merging the nodes in one file, we need to zero the hash table. We could zero every entry in sequential order, but this is expensive if there are only a small number of non-zero entries. Alternatively, we can scan the input buffer and only zero those entries that were set to a non-zero value. Zeroing the states in the order they appear in the input buffer may lead to poor cache performance, however. Our solution to this dilemma is that if only a small number of table entries were set, we explicitly zero those entries, and otherwise we sequentially zero the entire table. In our table of 239 million elements, the break-even point is about 2.5 million entries.

#### Interleaving Expansion and Merging

In our previous hash-based DDD algorithm (Korf 2004), all parent files at a given depth were expanded before any child files at the next depth were merged. The disadvantage of this approach is that at the end of the expansion phase, all nodes generated at the next level are stored on disk, including their duplicates.
The storage required is thus proportional to the maximum number of nodes generated at any depth. If we merge child files as soon as possible, however, we only have to store approximately the maximum number of unique states at any level.

In order to merge each child file only once, we defer merging it until all the parent files that could contribute to it have been expanded. At that point, the child file is placed on a queue for merging. If any files are eligible for merging, they take priority over expanding files.

To minimize the time that a child file exists, when we expand a parent file, we would like to expand other neighbors of its children as soon as possible. As a heuristic for this, we expand parent files in the order in which states in that file would first be generated in a breadth-first search.

For the Fourteen Puzzle, with symmetry, the maximum number of nodes generated at any depth and the maximum number of unique states at any depth are both on the order of $10^{10}$, and storing only the unique states saves 30%. For the Fifteen Puzzle, with symmetry, both maxima are on the order of $10^{11}$, and the saving is 34%.

#### Not Storing Sterile Nodes

A fertile node has children that first appear at the next search depth, whereas all the neighbors of a sterile node appear at the previous depth. In our algorithm, sterile nodes are detected during merging when all their used-operator bits are set. Rather than writing sterile nodes to a file to be expanded in the next iteration, we simply count and delete these nodes.

For the Fourteen Puzzle, this reduces the maximum number of stored nodes, which is on the order of $10^{10}$, by 16%. For the Fifteen Puzzle, it reduces the number of nodes stored, which is on the order of $10^{11}$, by 12%.

A second advantage is that we save some time by not writing sterile nodes to disk, and not reading them back in, particularly in the latter stages of the search, where most nodes are sterile. For both the Fourteen and Fifteen Puzzles, this reduces the total I/O by about 5%.
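The merge step, including the union of used-operator bits, the hybrid table-zeroing policy, and the dropping of sterile nodes, can be sketched as below. This is an illustrative simplification under stated assumptions: `merge_file` is a hypothetical name, the table uses one byte per entry instead of the packed 4-bit entries in the text, every incoming node is assumed to carry at least one used-operator bit (the operator leading back to its parent), and the 1% zeroing cutoff is a placeholder rather than the measured break-even point.

```python
ALL_OPERATORS_USED = 0b1111  # four used-operator bits per sliding-tile state


def merge_file(nodes, table):
    """Merge the duplicate nodes of one child file.

    nodes: iterable of (index, used_bits) pairs, where index is the
           perfect-hash value of the state within this file and used_bits
           is assumed to be non-zero.
    table: bytearray of zeros with one entry per possible index.
    Returns (fertile_nodes, sterile_count) and leaves the table zeroed.
    """
    touched = []
    for index, used_bits in nodes:
        if table[index] == 0:
            touched.append(index)      # first copy of this state
        table[index] |= used_bits      # union of used-operator bits

    fertile, sterile_count = [], 0
    for index in touched:
        bits = table[index]
        if bits == ALL_OPERATORS_USED:
            sterile_count += 1         # counted, but never written back
        else:
            fertile.append((index, bits))

    # Zero only the touched entries when few were set; otherwise clear the
    # whole table sequentially. The 1% cutoff is a placeholder value.
    if len(touched) * 100 < len(table):
        for index in touched:
            table[index] = 0
    else:
        table[:] = bytes(len(table))
    return fertile, sterile_count


table = bytearray(16)
child_file = [(3, 0b0001), (7, 0b0011), (3, 0b0100), (7, 0b1100), (5, 0b1000)]
fertile, sterile = merge_file(child_file, table)
# State 7 accumulates all four bits and is dropped as sterile.
```

Since the table index identifies the state, only the bits need to be stored, which is what makes the perfect hash function so economical.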
#### Multi-Threading

Paradoxically, even on a single processor, multi-threading is important to maximize the performance of disk-based algorithms. The reason is that a single-threaded implementation will run until it has to read from or write to disk. At that point it will block until the I/O operation has completed. The
operating system will use the CPU briefly to set up the I/O transfer, but then the CPU will be idle until the I/O completes. Furthermore, many workstations are available with dual processors for a small additional price. For very large searches, machines with many processors will be required.

Hash-based DDD is ideally suited to multi-threading. Within an iteration, most file expansions and merges can be done independently. If we simultaneously expand two parent files that have a child file in common, the two expansions will interleave their output to the child file. While this is acceptable, we avoid it for simplicity.

To implement our parallel algorithm, we use the parallel primitives of POSIX threads (Nichols, Butler, & Farrell 1996). All threads share the same data space, and mutual exclusion is used to temporarily lock data structures.

Our algorithm maintains a work queue, which contains parent files waiting to be expanded, and child files waiting to be merged. At the start of each iteration, the queue is initialized to contain all parent files. Once all the neighbors of a child file have been expanded, it is placed at the head of the queue to be merged. To minimize the maximum storage needed, file merging takes precedence over file expansion.

Each thread works as follows. It first locks the work queue. If there is a child file to merge, it unlocks the queue, merges the file, and returns to the queue for more work. If there are no child files to merge, it considers the first parent file in the queue. Two parent files conflict if they can generate nodes that hash to the same child file. It checks whether the first parent file conflicts with any other file currently being expanded. If so, it scans the queue for a parent file with no conflicts. It swaps the position of that file with the one at the head of the queue, grabs the non-conflicting file, unlocks the queue, and expands the file. For each child file it generates, it checks to see if all of its parents have been expanded.
If so, it puts the child file at the head of the queue for merging, and then returns to the queue for more work.

If there is no more work in the queue, any idle threads wait for the current iteration to complete. At the end of each iteration, various node counts are printed, and the work queue is initialized to contain all parent files for the next iteration.

Each thread needs its own hash table for merging, which takes about 114 megabytes, and space to buffer its file I/O. Our program worked best with relatively small I/O buffers, a total of only 20 megabytes per thread.

#### External Disk Storage

At four bytes per node, a complete search of the Fifteen Puzzle requires a maximum of 1.4 terabytes of storage. The largest single disks currently available hold 400 gigabytes, but only a few will fit inside a typical workstation, leading us to consider external disk storage. (When referring to disk storage, gigabyte and terabyte refer to $10^9$ and $10^{12}$ bytes respectively, rather than $2^{30}$ and $2^{40}$ as in the case of memory.)

There are many choices on the market, varying in cost per byte, maximum transfer rate, and reliability. To allow others to reproduce our results, we chose the least expensive solution. We purchased four LaCie Big Disk Extreme units, plus a Firewire 800 (IEEE 1394b) interface card for each. Each unit packages two 250 gigabyte drives striped together non-redundantly, with a maximum transfer rate of 88 megabytes per second. The cost of each unit plus the card was less than a dollar per gigabyte. By plugging each disk into its own card on the PCI bus, we can potentially multiply the total bandwidth by the number of disks. In addition to the Firewire disks, we also have two 300 gigabyte and one 400 gigabyte serial ATA (SATA) disks inside our workstation.

Since hash-based DDD uses a large number of different files, the simplest way to use multiple disks is to partition the files among the different disks.
This also gives the best performance, since the overhead of striping the data is avoided, and multiple threads can access files simultaneously if they are on different disks. This configuration was used for the Fourteen Puzzle experiments reported below.

#### Fault Tolerance

Using disk storage, breadth-first search can run for weeks. Unlike other large-scale computations, breadth-first search cannot be easily decomposed into independent computations, which can fail and then simply be restarted until they succeed. Rather, it must be tolerant of memory losses, due to system crashes or power failures, and loss of disk data.

##### Loss of Memory

The simplest solution to memory loss is to keep all the nodes of one iteration until the next iteration completes. This allows restarting from the last completed iteration. This requires twice the disk space, however, since two complete levels of the search must be stored at once.

In fact, our program is interruptible with no storage overhead. When interrupted, the file system will contain parent files at the previous depth waiting to be expanded, child files at the current depth waiting to be merged, and child files at the current depth that have already been merged. If a parent file is being expanded when the program is interrupted, it will still exist, but may have output some of its children to child files. The resumed program expands all the nodes in the parent file, creating additional copies of any child nodes already output, which will eventually be merged as duplicates. Duplicate nodes due to reexpanding the same nodes could be detected, since they have identical used-operator bits. In any case, the number of unique states will not be affected. If a child file is being merged when the program is interrupted, that file will still exist, and a partially merged output file may also exist. In that case, we delete the output file, and remerge the child file.

We never delete an input file until after the output files it generates have been written.
Data written to disks is cached in memory on the disk controller, however, and even a blocking write call returns before the data has been magnetically committed to disk. As a result, during a power failure, cached data was lost before it was committed to disk, but after the input files that generated it had already been deleted. The solution to this problem is an uninterruptible power supply (UPS), with a battery sufficient to power the computer long enough for a clean shutdown, flushing all file buffers, in the event of a power failure.
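The ordering rule described in this section, writing the output completely before deleting the input and discarding any partial output on restart, can be illustrated with a small Python sketch. The function name, the `.partial` suffix, and the `merge_bytes` callable are assumptions made for this example. Note that `os.fsync` only asks the operating system to push data to the device; as the text explains, a controller cache can still lose data in a power failure, which is why the search described here relied on a UPS.

```python
import os


def merge_then_delete(in_path, out_path, merge_bytes):
    """Write the merged output completely before deleting the input.

    merge_bytes is a hypothetical callable mapping the input file's raw
    bytes to the merged bytes. Output goes to a temporary name first, so
    any file found at out_path is always a complete merge.
    """
    partial = out_path + ".partial"
    if os.path.exists(partial):
        os.remove(partial)            # leftover from an interrupted merge
    with open(in_path, "rb") as src:
        merged = merge_bytes(src.read())
    with open(partial, "wb") as dst:
        dst.write(merged)
        dst.flush()
        os.fsync(dst.fileno())        # ask the OS to push data to the device
    os.replace(partial, out_path)     # rename over the final name
    os.remove(in_path)                # only now is the input expendable
```

If the process is interrupted at any point before the final `os.remove`, the input file still exists and the merge can simply be repeated.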

| Number of parallel threads | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Time (hours:minutes) | 52:13 | 28:45 | 26:08 | 25:11 | 24:52 | 24:50 | 24:59 | 25:11 | 25:13 | 25:30 |

Table 1: Fourteen Puzzle Runtimes vs. Number of Parallel Threads with Two Processors

##### Unrecoverable Disk Errors

Several attempts to complete the Fifteen Puzzle search on disk configurations optimized for speed failed, due to transient disk errors. Disk manufacturers specify non-recoverable read error rates between one in $10^{13}$ and one in $10^{15}$ bits. While single-bit errors are routinely corrected by error correcting codes, uncorrectable multiple-bit errors within a parity block do occur. If such an error occurs in a user file, that file cannot be read, but if it occurs in certain critical data, the entire file system can be corrupted.

Most people are not aware of this failure mode of disks, because it occurs so rarely. For example, at an error rate of one in $10^{14}$ bits, we would expect such errors on the central file server in our department to be years apart. The complete Fifteen Puzzle search reads and writes a total of more than $10^{14}$ bits, however, and we saw these errors almost weekly.

The solution to this problem is RAID, or Redundant Array of Inexpensive Disks (Patterson, Gibson, & Katz 1988). In a simple RAID, an extra disk holds the exclusive OR of the corresponding bits on the other disks. In the event of an unrecoverable error on any one disk, its data can be reconstructed from the others, without even interrupting the program. In the case of complete loss of a disk, the bad disk can be unplugged and replaced, and its data reconstructed from the other disks, again without interrupting the program.

For our successful Fifteen Puzzle search, we used a level-5 redundant software RAID composed of four external Firewire disks and two internal SATA disks. This increased the running time of our program by almost 50%, compared to a non-redundant disk array, due to the redundant output, and the CPU cycles needed to compute this output.
It completely eliminated our disk error problem, however.

#### Experiments

##### Fourteen Puzzle

Previously, the largest sliding-tile puzzle searched completely breadth-first was the Fourteen Puzzle (Korf 2004). Using sorting-based DDD with symmetry, it required 259 gigabytes of storage at eight bytes per node, and almost 18 days on a 440 megahertz Sun Ultra-10 workstation. On an IBM IntelliStation A Pro workstation with dual two-gigahertz, 64-bit AMD Opteron processors, two gigabytes of memory, and a single Firewire disk, it took 88 hours.

We ran the program described here on the Fourteen Puzzle with three Firewire disks and two internal SATA disks, varying the number of parallel threads. The maximum amount of storage used was 75 gigabytes. The file hash function was based on the positions of the blank and first two tiles. Table 1 shows the results, with the number of threads on top, and times in hours and minutes on the bottom.

With one thread, our hash-based DDD program ran for over 52 hours, a factor of 1.7 faster than our sorting-based DDD program, using a factor of 3.5 less storage. With six threads on two processors, our program took less than 25 hours to run, a parallel speedup of 2.1. Increasing the number of threads beyond six increased the running time on two processors, presumably due to coordination overhead.

With five disks and two processors, our program is CPU-bound. Changing the number of disks slightly doesn't significantly affect performance, but increasing the number of processors should improve it. Most analyses of disk-based algorithms assume they are I/O bound, however, ignoring CPU time and only counting disk I/O.

##### Fifteen Puzzle

Our main goal was a complete breadth-first search of the Fifteen Puzzle. We learned all the reliability lessons described above the hard way, as the program failed several times due to unrecoverable disk errors, until we diagnosed that problem, and once due to a power failure.
Using six disks in a level-5 software RAID, and a UPS, we eventually completed the search in more than 28 days, using a maximum of 1.4 terabytes of disk storage. Since the RAID generated more I/O, and consumed CPU cycles, the best performance was achieved with three parallel threads on two processors.

Our results confirmed that the radius of the problem space, starting with the blank in a corner, is 80 moves, which was first determined by (Brungger et al. 1999) using a more complex method. We also found that there are exactly 17 states at depth 80, more than was previously known. Table 2 shows the number of unique states at each depth. The fact that the total number of states found is exactly $16!/2$ gives us additional confidence that the search is correct.

#### Conclusions

We presented a linear-time algorithm for bijectively mapping permutations to integers in lexicographic order. On permutations of 14 elements, our algorithm is seven times faster than an existing quadratic algorithm. We improved our disk-based search algorithm (Korf 2004) by interleaving expansion and merging, not storing sterile nodes, and introducing multi-threading, which improves its performance even on a single processor. On the Fourteen Puzzle, these improvements reduce both the storage needed and the running time by a factor of 3.5 on two processors, compared to the previous state of the art. Contrary to the usual assumption in the literature of disk-based algorithms, our program is CPU-bound rather than I/O-bound, even on two processors.

We learned the hard way that a program running for a month, reading and writing a total of 3.5 terabytes of data per day, must be fault tolerant. Rare unrecoverable disk errors can be solved by a redundant array of inexpensive disks. Power failures require an algorithm that can be interrupted and resumed, plus a backup power supply to allow a clean system shutdown, flushing all file buffers. A complete search of
the Fifteen Puzzle required more than 28 days, and 1.4 terabytes of storage. To our knowledge, this is the largest best-first search ever completed.

#### Acknowledgements

This research was supported by NSF grant No. EIA-0113313, and by IBM, which donated the workstation. Thanks to Satish Gupta of IBM, and Eddie Kohler and Yuval Tamir of UCLA, for their support and help with this work.

#### References

- Brungger, A.; Marzetta, A.; Fukuda, K.; and Nievergelt, J. 1999. The parallel search bench ZRAM and its applications. Annals of Operations Research 90:45-63.
- Culberson, J., and Schaeffer, J. 1998. Pattern databases. Computational Intelligence 14(3):318-334.
- Edelkamp, S.; Jabbar, S.; and Schroedl, S. 2004. External A*. In Proceedings of the German Conference on Artificial Intelligence, 226-240.
- Hart, P.; Nilsson, N.; and Raphael, B. 1968. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics SSC-4(2):100-107.
- Katriel, I., and Meyer, U. 2003. Elementary graph algorithms in external memory. In Algorithms for Memory Hierarchies, LNCS 2625. Springer-Verlag. 62-84.
- Korf, R., and Zhang, W. 2000. Divide-and-conquer frontier search applied to optimal sequence alignment. In Proceedings of the National Conference on Artificial Intelligence (AAAI-2000), 910-916.
- Korf, R.; Zhang, W.; Thayer, I.; and Hohwald, H. 2005. Frontier search. Journal of the Association for Computing Machinery (JACM), to appear.
- Korf, R. 1999. Divide-and-conquer bidirectional search: First results. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-99), 1184-1189.
- Korf, R. 2003. Delayed duplicate detection: Extended abstract. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-03), 1539-1541.
- Korf, R. 2004. Best-first frontier search with delayed duplicate detection. In Proceedings of the National Conference on Artificial Intelligence (AAAI-2004), 650-657.
- Myrvold, W., and Ruskey, F. 2001. Ranking and unranking permutations in linear time. Information Processing Letters 79:281-284.
- Nichols, B.; Butler, D.; and Farrell, J. 1996. Pthreads Programming. O'Reilly.
- Patterson, D.; Gibson, G.; and Katz, R. 1988. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the ACM SIGMOD International Conference on Management of Data, 109-116.
- Roscoe, A. 1994. Model-checking CSP. In Roscoe, A., ed., A Classical Mind, Essays in Honour of CAR Hoare. Prentice-Hall.
- Stern, U., and Dill, D. 1998. Using magnetic disk instead of main memory in the Mur(phi) verifier. In Proceedings of the 10th International Conference on Computer-Aided Verification, 172-183.
- Zhou, R., and Hansen, E. 2004a. Breadth-first heuristic search. In Proceedings of the 14th International Conference on Automated Planning and Scheduling (ICAPS-2004), 92-100.
- Zhou, R., and Hansen, E. 2004b. Structured duplicate detection in external-memory graph search. In Proceedings of the National Conference on Artificial Intelligence (AAAI-2004), 683-688.
| depth | states | depth | states |
|---|---|---|---|
| 0 | 1 | 41 | 83,099,401,368 |
| 1 | 2 | 42 | 115,516,106,664 |
| 2 | 4 | 43 | 156,935,291,234 |
| 3 | 10 | 44 | 208,207,973,510 |
| 4 | 24 | 45 | 269,527,755,972 |
| 5 | 54 | 46 | 340,163,141,928 |
| 6 | 107 | 47 | 418,170,132,006 |
| 7 | 212 | 48 | 500,252,508,256 |
| 8 | 446 | 49 | 581,813,416,256 |
| 9 | 946 | 50 | 657,076,739,307 |
| 10 | 1,948 | 51 | 719,872,287,190 |
| 11 | 3,938 | 52 | 763,865,196,269 |
| 12 | 7,808 | 53 | 784,195,801,886 |
| 13 | 15,544 | 54 | 777,302,007,562 |
| 14 | 30,821 | 55 | 742,946,121,222 |
| 15 | 60,842 | 56 | 683,025,093,505 |
| 16 | 119,000 | 57 | 603,043,436,904 |
| 17 | 231,844 | 58 | 509,897,148,964 |
| 18 | 447,342 | 59 | 412,039,723,036 |
| 19 | 859,744 | 60 | 317,373,604,363 |
| 20 | 1,637,383 | 61 | 232,306,415,924 |
| 21 | 3,098,270 | 62 | 161,303,043,901 |
| 22 | 5,802,411 | 63 | 105,730,020,222 |
| 23 | 10,783,780 | 64 | 65,450,375,310 |
| 24 | 19,826,318 | 65 | 37,942,606,582 |
| 25 | 36,142,146 | 66 | 20,696,691,144 |
| 26 | 65,135,623 | 67 | 10,460,286,822 |
| 27 | 116,238,056 | 68 | 4,961,671,731 |
| 28 | 204,900,019 | 69 | 2,144,789,574 |
| 29 | 357,071,928 | 70 | 868,923,831 |
| 30 | 613,926,161 | 71 | 311,901,840 |
| 31 | 1,042,022,040 | 72 | 104,859,366 |
| 32 | 1,742,855,397 | 73 | 29,592,634 |
| 33 | 2,873,077,198 | 74 | 7,766,947 |
| 34 | 4,660,800,459 | 75 | 1,508,596 |
| 35 | 7,439,530,828 | 76 | 272,198 |
| 36 | 11,668,443,776 | 77 | 26,638 |
| 37 | 17,976,412,262 | 78 | 3,406 |
| 38 | 27,171,347,953 | 79 | 70 |
| 39 | 40,271,406,380 | 80 | 17 |
| 40 | 58,469,060,820 | | |

Table 2: States as a Function of Depth for Fifteen Puzzle
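The paper cites the fact that the per-depth counts total exactly 16!/2 as extra evidence that the search is correct. The following is a minimal Python sketch of that check, assuming the counts are transcribed correctly from Table 2. The names `STATES_BY_DEPTH` and `summarize` are illustrative and do not come from the paper. The counts for depths 0 to 2 (1, 2, 4) are reconstructed rather than read from the extracted text: they follow from a blank in a corner having two moves, each of which leads to two new states.

```python
from math import factorial

# Unique states at each depth 0..80, transcribed from Table 2.
# Assumption: depths 0-2 (1, 2, 4) are reconstructed, not read from the text.
STATES_BY_DEPTH = [
    1, 2, 4, 10, 24, 54, 107, 212, 446, 946,
    1_948, 3_938, 7_808, 15_544, 30_821,
    60_842, 119_000, 231_844, 447_342, 859_744,
    1_637_383, 3_098_270, 5_802_411, 10_783_780, 19_826_318,
    36_142_146, 65_135_623, 116_238_056, 204_900_019, 357_071_928,
    613_926_161, 1_042_022_040, 1_742_855_397, 2_873_077_198,
    4_660_800_459, 7_439_530_828, 11_668_443_776, 17_976_412_262,
    27_171_347_953, 40_271_406_380, 58_469_060_820,
    83_099_401_368, 115_516_106_664, 156_935_291_234, 208_207_973_510,
    269_527_755_972, 340_163_141_928, 418_170_132_006, 500_252_508_256,
    581_813_416_256, 657_076_739_307, 719_872_287_190, 763_865_196_269,
    784_195_801_886, 777_302_007_562, 742_946_121_222, 683_025_093_505,
    603_043_436_904, 509_897_148_964, 412_039_723_036, 317_373_604_363,
    232_306_415_924, 161_303_043_901, 105_730_020_222, 65_450_375_310,
    37_942_606_582, 20_696_691_144, 10_460_286_822, 4_961_671_731,
    2_144_789_574, 868_923_831, 311_901_840, 104_859_366,
    29_592_634, 7_766_947, 1_508_596, 272_198, 26_638, 3_406, 70, 17,
]


def summarize(counts):
    """Return (total states, radius, depth of the widest level)."""
    total = sum(counts)
    radius = len(counts) - 1  # deepest depth that still has states
    widest_depth = max(range(len(counts)), key=counts.__getitem__)
    return total, radius, widest_depth


total, radius, widest_depth = summarize(STATES_BY_DEPTH)
reachable = factorial(16) // 2  # only half of the 16! permutations are solvable

print("total states:", total)
print("matches 16!/2:", total == reachable)
print("radius:", radius)
print("widest depth:", widest_depth, "with", STATES_BY_DEPTH[widest_depth], "states")
print("space / width ratio:", round(total / STATES_BY_DEPTH[widest_depth], 2))
```

The last line gives the ratio between the size of the whole problem space and its widest level. This is the storage saving that frontier search offers over storing every node, and it agrees with the factor of over 13 quoted in the paper.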

Page 1

Lar ge-Scale arallel Br eadth-First Sear ch Richard E. orf and eter Schultze Computer Science Department Uni ersity of California, Los Angeles Los Angeles, CA 90095 orf@cs.ucla.edu, petersch@cs.ucla.edu Abstract Recently best-ﬁrst search algorithms ha been introduced that store their nodes on disk, to oid their inherent memory limitation. introduce se eral impro ements to the best of these, including parallel processing, to reduce their storage and time requirements. also present linear -time algo- rithm for bijecti ely mapping permutations to inte gers in le x- icographic order use breadth-ﬁrst searches of sliding-tile puzzles as testbeds. On the 3x5 ourteen Puzzle, we reduce both the storage and time needed by actor of 3.5 on tw processors. also performed the ﬁrst complete breadth-ﬁrst search of the 4x4 Fifteen Puzzle, with er 10 13 states. Intr oduction Breadth-˛rst search is basic search algorithm. It is used in model checking, to sho that certain states are reach- able or unreachable, and to determine the radius of prob- lem space, or the longest shortest path from an gi en state. It is also used to compute pattern-database heuris- tics (Culberson Schaef fer 1998). Br eadth-ﬁr st heuris- tic sear (Zhou Hansen 2004a) is space-ef ˛cient er sion of A*(Hart, Nilsson, Raphael 1968) for problems with unit edge costs. It implements A* as series of breadth-˛rst search iterations, with each iteration generat- ing all nodes whose costs donot xceed threshold for that iteration. Other methods for xtending disk-based breadth- ˛rst search to A* ha also been implemented (K orf 2004; Edelkamp, Jabbar Schroedl 2004), and the techniques de- scribed in this paper apply to such heuristic searches as well. Breadth-˛rst search is often much more ef ˛cient than depth-˛rst search, because the latter can detect duplicate nodes representing the same state, and generates all paths to gi en state. 
or xample, with branching actor of 2.13, depth-˛rst search of the Fifteen Puzzle, the sliding- tile puzzle, to the erage solution depth of 53 mo es ould generate 10 17 nodes, whereas the entire problem space only contains 10 13 unique states. Our goal is to increase the size of feasible searches. The primary limitation of best-˛rst search is the memory needed to store nodes, in order to detect duplicate nodes. Se eral recent adv ances ha addressed this problem. Cop yright 2005, American Association for Artiﬁcial Intelli- gence (www .aaai.or g). All rights reserv ed. Pr vious ork Fr ontier Sear ch Frontier search (K orf 1999; orf Zhang 2000; orf et al. 2005) stores the Open list of generated nodes, ut not the Closed list of xpanded nodes. This reduces the memory required for breadth-˛rst search from the size of the problem space to the width of the problem space, or the maximum number of nodes at an depth. or the Fifteen Puzzle, for xample, this reduces storage by actor of er 13. Disk Storage The ne xt adv ance is storing nodes on magnetic disks (Roscoe 1994; Stern Dill 1998; orf 2003; 2004; Zhou Hansen 2004b; Edelkamp, Jabbar Schroedl 2004). Disks costs less than $1 per gig abyte, compared to $200 per gi- abyte for memory Disks must be accessed sequentially ho we er since disk latenc is 10 times memory latenc There xists body of ork on algorithms for graphs stored xplicitly on disk, which focuses on asymptotic I/O comple xity See (Katriel Me yer 2003), for xample. By contrast, we are interested in search algorithms for ery lar ge implicit graphs de˛ned by root node and successor func- tion, which can be xplicitly stored en on disk. Sorting-Based DDD In the ˛rst of these algorithms (Roscoe 1994; orf 2003; 2004), each le el of breadth- ˛rst search starts with ˛le containing the nodes at the cur rent depth. All these nodes are xpanded, and their children are written to another ˛le, without an duplicate checking. 
Ne xt, the ˛le of child nodes is sorted by their state represen- tation, bringing duplicate nodes together single pass of the sorted ˛le then mer ges an duplicate nodes. refer to this as sorting-based DDD for delayed duplicate detection. Hash-based DDD oid the time comple xity of sort- ing, hash-based DDD (K orf 2004) uses tw orthogonal hash functions, and alternates xpansion with mer ging. In the xpansion phase, we xpand all the nodes at gi en depth, and write the child nodes to dif ferent ˛les, based on the alue of ˛rst hash function. An duplicate nodes will be mapped to the same ˛le. Frontier search guarantees that all children nodes are either one le el deeper than their par ents, in the case of the sliding-tile puzzles, or possibly at the same depth in general. AAAI-05 / 1380

Page 2

In the mer ge phase, we process each ˛le, hashing its nodes into memory using second hash function, thus detecting an duplicate nodes. Finally we write one cop of each child node back to disk to be gin the ne xt iteration. This algo- rithm as described in (K orf 2004), ut only implemented for the 4-pe wers of Hanoi problem, where ideal hash functions are tri vial to compute. Structur ed Duplicate Detection Structur ed duplicate de- tection (Zhou Hansen 2004b) detects duplicates as soon as the are generated. All nodes must be di visible into sub- sets, such that the children of nodes in one subset all into small number of subsets. In sliding-tile puzzle, for x- ample, subsets may be based on the blank position. The children of nodes with one blank position can ha at most four other blank positions. Furthermore, all the states in an subset, plus all its child subsets, must ˛t in memory simulta- neously When xpanding nodes in one subset, the children are look ed up in the corresponding child subsets in memory When xpanding nodes in another subset, currently resident subsets may ha to be sw apped out to disk to mak room for ne parent and child subsets in memory Symmetry or some sliding-tile puzzles, symmetry can reduce the space needed by actor of tw (Culberson Schaef fer 1998). or the Fifteen Puzzle, for xample, starting with the blank in corner ery state has mirror state computed by reﬂecting the puzzle about diagonal passing through the initial blank position, and renumbering the tiles using the same transformation. Some states equal their mirror reﬂec- tions. The reduction in time is less than actor of tw o, due to the erhead of computing the mirror states. Ov er view of aper This ork is based on hash-based DDD (K orf 2004), the most promising algorithm for lar ge-scale problems. Hash- based DDD is aster than sorting-based DDD, and prefer able to structured duplicate detection for se eral reasons. 
The ˛rst is that it doesn require the subset structure de- scribed abo e. second reason is that it requires relati ely little memory whereas structured duplicate detection must be able to hold parent subset and all its child subsets in memory simultaneously Hashed-based DDD is easily par allelized, and can be interrupted and resumed. Finally hash- based DDD only reads and writes each node at most twice. describe number of impro ements designed to re- duce both the storage needed and the running time of hash- based DDD. These include ef ˛cient state encoding for per mutation problems, interlea ving xpansion and mer ging, not storing nodes that ha no children, parallel processing, and ault tolerance. On the ourteen Puzzle, we reduce both the storage needed and the running time by actor of 3.5, using tw processors. also completed breadth-˛rst search of the Fifteen Puzzle, with er 10 13 states. Efﬁcient ermutation Encoding Problems such as the sliding-tile puzzles and Rubik Cube are permutation pr oblems in that state represents permu- tation of elements. The simplest representation of permu- tation is to list the position of each element. or xample, Fifteen Puzzle state can be represented as 16-digit he x- adecimal number where each digit represents the position of one tile or the blank. This occupies 64 bits of storage. more ef ˛cient encoding sa es storage and reduces I/O time. Ideally we ould map permutation of elements to unique inte ger from zero to or the Fifteen Puzzle, this requires only 45 bits. or xample, we could map each permutation to its inde in le xicographic ordering of all such permutations. or permutations of three elements, this mapping is: 012-0, 021-1, 102-2, 120-3, 201-4, 210-5. An algorithm for this mapping starts with sequence of positions, and maps it to number in factorial base of the form: 1)! 2)! 2! 1! 
Digit can range from to resulting in unique representation of each inte ger Gi en permutation as se- quence of digits in actorial base, we perform the indicated arithmetic operations to compute the actual inte ger alue. map permutation to sequence of actorial digits, we subtract from each element the number of original elements to its left that are less than it. or xample, the mapping from permutations of three elements to actorial base digits is: 012-000, 021-010, 102-100, 120-110, 201-200, 210-210. By reducing these actorial digits to an inte ger we obtain the desired alues: 012-000-0, 021-010-1, 102-100-2, 120-110- 3, 201-200-4, 210-210-5. This algorithm tak es time to compute the digits in actorial base. Ne xt we pro vide an algorithm. scan the permutation from left to right, constructing bit string of length indicating which elements of the permutation we seen so ar Initially the string is all zeros. As each element of the permutation is encountered, we use it as an inde into the bit string and set the corresponding bit to one. When we encounter element in the permutation, to determine the number of elements less than to its left, we need to kno the number of ones in the ˛rst bits of our bit string. xtract the ˛rst bits by right shifting the string by This reduces the problem to: gi en bit string, count the number of one bits in it. solv this problem in constant time by using the bit string as an inde into precomputed table, containing the number of ones in the binary representation of each inde x. or xample, the initial entries of this table are: 0-0, 1-1, 2- 1, 3-2, 4-1, 5-2, 6-2, 7-3. The size of this table is (2 where is the number of permutation elements. Such table for the Fifteen Puzzle ould contain 32 768 entries. 
This gi es us linear -time algorithm for mapping permu- tations of elements in le xicographic order to unique inte- gers from zero to implemented both the quadratic and linear algorithms abo e, and tested them by mapping all permutations of up to 14 elements. or 14 elements, the lin- ear algorithm as se en times aster and this ratio increases with increasing problem size, as xpected. This mapping is also used in heuristic searches of permu- tation problems using pattern databases (Culberson Scha- ef fer 1998). By mapping each permutation of the pattern to unique inte ger the permutation doesn ha to be stored with the heuristic alue, and each location corresponds to AAAI-05 / 1381

Page 3

alid permutation, making ef ˛cient use of memory xpand node, we need to re generate the original per mutation from its inte ger encoding. This can be done in linear time as well, ut requires more memory and is not signi˛cantly aster than the quadratic algorithm. The reason is that mapping the inte ger to permutation requires inte ger di vision and remaindering, which is much more xpensi than multiplication, dominating the cost of the mapping. There xist other algorithms for mapping between permu- tations and inte gers in linear time and linear space (Myrv old Rusk 2001), ut not in le xicographic order In act, (Myrv old Rusk 2001) claim that, ... it seems that major breakthrough will be required to do that computation in linear time, if indeed it is possible at all. Our algorithm runs in linear time, ut uses (2 space. The space is for table that is only computed once for gi en alue of erfect Hashing As described abo e, hash-based DDD mak es use of tw hash functions. When node is xpanded, its children are written to particular ˛le based on the ˛rst hash alue. or the Fifteen Puzzle, we map the positions of the blank and tiles 1, 2, and 3, to unique inte ger in the range zero to 16 15 14 13 43 679 This alue forms part of the name of the ˛le. The diagonal symmetry mentioned abo reduces the actual number of ˛les to 21 852 Since all nodes in an one ˛le ha the blank and ˛rst three tiles in the same positions, we only ha to specify the positions of the remaining twelv tiles. Since only half the initial states of sliding-tile puzzle are solv able, the posi- tions of the last tw tiles are determined by the positions of the other tiles. Thus, we only specify the positions of ten tiles, by mapping their positions to unique inte ger from zero to 12! 239 500 799 requiring 28 bits. oid re generating xpanded nodes, frontier search stores with each node its used oper ator which lead to neighboring nodes that ha already been xpanded. 
Since sliding-tile puzzle state has at most four operators, mo ving tile up, do wn, left, or right, we need four used-operator bits. Thus, Fifteen-Puzzle state can be stored in 28 32 bits, which is half the storage needed without this encoding. Since the states in ˛le are already encoded in 28-bit inte ger this is used as the second hash alue. mer ge the duplicate nodes in gi en ˛le, we set up hash table in memory with 239 500 800 4-bit locations, initialized to all zeros. then read each node from the ˛le, map it to its unique location in the hash table, and OR its used-operator bits to those already stored in the table, if an thus taking the union of used-operator bits of duplicate nodes. Finally we write single cop of each node, along with its used- operator bits, to mer ged ˛le. perfect hash function that maps each state to unique alue sa es great deal of mem- ory since we don ha to store the state in the table, nor use empty locations or pointers to handle collisions. After mer ging the nodes in one ˛le, we need to zero the hash table. could zero ery entry in sequential order ut this is xpensi if there are only small number of non-zero entries. Alternati ely we can scan the input uf fer and only zero those entries that were set to non-zero alue. Zeroing the states in the order the appear in the input uf fer may lead to poor cache performance, ho we er Our solution to this dilemma is that if only small number of table entries were set, we xplicitly zero those entries, and otherwise we sequentially zero the entire table. In our table of 239 million elements, the break-e en point is about 2.5 million entries. Interlea ving Expansion and Mer ging In our pre vious hash-based DDD algorithm (K orf 2004), all parent ˛les at gi en depth being xpanded before an child ˛les at the ne xt depth were mer ged. The disadv antage of this approach is that at the end of the xpansion phase, all nodes generated at the ne xt le el are stored on disk, including their duplicates. 
The storage required is thus proportional to the maximum number of nodes generated at an depth. If we mer ge child ˛les as soon as possible, ho we er we only ha to store approximately the maximum number of unique states at an le el. In order to mer ge each child ˛le only once, we defer mer ging it until all the parent ˛les that could contrib ute to it ha been xpanded. At that point, the child ˛le is placed on queue for mer ging. If an ˛les are eligible for mer ging, the tak priority er xpanding ˛les. minimize the time that child ˛le xists, when we x- pand parent ˛le, we lik to xpand other neighbors of its children as soon as possible. As heuristic for this, we xpand parent ˛les in the order in which states in that ˛le ould ˛rst be generated in breadth-˛rst search. or the ourteen Puzzle, with symmetry the maximum number of nodes generated at an depth is 10 10 while the maximum number of unique states is 10 10 sa ving 30%. or the Fifteen Puzzle, with symmetry the maximum number of nodes generated is 10 11 while the maxi- mum number of unique nodes is 10 11 sa ving 34%. Not Storing Sterile Nodes fertile node has children that ˛rst appear at the ne xt search depth, whereas all the neighbors of sterile node appear at the pre vious depth. In our algorithm, sterile nodes are de- tected during mer ging when all their used-operator bits are set. Rather than writing sterile nodes to ˛le to be xpanded in the ne xt iteration, we simply count and delete these nodes. or the ourteen Puzzle, this reduces the maximum num- ber of stored nodes from 10 10 to 10 10 16% sa vings. or the Fifteen Puzzle, this reduces the number of nodes stored from 10 11 to 10 11 12% sa vings. second adv antage is that we sa some time by not writ- ing sterile nodes to disk, and not reading them back in, par ticularly in the latter stages of the search, where most nodes are sterile. or both the ourteen and Fifteen Puzzles, this reduces the total I/O by about 5%. 
Multi-Thr eading aradoxically en on single processor multi-threading is important to maximize the performance of disk-based algo- rithms. The reason is that single-threaded implementation will run until it has to read from or write to disk. At that point it will block until the I/O operation has completed. The AAAI-05 / 1382

Page 4

operating system will use the CPU brieﬂy to set up the I/O transfer ut then the CPU will be idle until the I/O com- pletes. Furthermore, man orkstations are ailable with dual processors for small additional price. or ery lar ge searches, machines with man processors will be required. Hash-based DDD is ideally suited to multi-threading. ithin an iteration, most ˛le xpansions and mer ges can be done independently If we simultaneously xpand tw par ent ˛les that ha child ˛le in common, the tw xpansions will interlea their output to the child ˛le. While this is ac- ceptable, we oid it for simplicity implement our parallel algorithm, we use the paral- lel primiti es of POSIX threads (Nichols, Butler arrell 1996). All threads share the same data space, and mutual xclusion is used to temporarily lock data structures. Our algorithm maintains ork queue, which contains parent ˛les aiting to be xpanded, and child ˛les aiting to be mer ged. At the start of each iteration, the queue is initialized to contain all parent ˛les. Once all the neighbors of child ˛le ha been xpanded, it is placed at the head of the queue to be mer ged. minimize the maximum storage needed, ˛le mer ging tak es precedence er ˛le xpansion. Each thread orks as follo ws. It ˛rst locks the ork queue. If there is child ˛le to mer ge, it unlocks the queue, mer ges the ˛le, and returns to the queue for more ork. If there are no child ˛les to mer ge, it considers the ˛rst parent ˛le in the queue. parent ˛les conﬂict if the can generate nodes that hash to the same child ˛le. It checks whether the ˛rst parent ˛le conﬂicts with an other ˛le cur rently being xpanded. If so, it scans the queue for parent ˛le with no conﬂicts. It sw aps the position of that ˛le with the one at the head of the queue, grabs the non-conﬂicting ˛le, unlocks the queue, and xpands the ˛le. or each child ˛le it generates, it checks to see if all of its parents ha been xpanded. 
If so, it puts the child ˛le at the head of the queue for xpansion, and then returns to the queue for more ork. If there is no more ork in the queue, an idle threads ait for the current iteration to complete. At the end of each iter ation, arious node counts are printed, and the ork queue is initialized to contain all parent ˛les for the ne xt iteration. Each thread needs its wn hash table for mer ging, which tak es about 114 me abytes, and space to uf fer its ˛le I/O. Our program ork ed best with relati ely small I/O uf fers, total of only 20 me abytes per thread. Exter nal Disk Storage At four bytes per node, complete search of the Fifteen Puz- zle requires maximum of 1.4 terabytes of storage. The lar gest single disks currently ailable hold 400 gig abytes, ut only fe will ˛t inside typical orkstation, leading us to consider xternal disk storage. There are man choices on the mark et, arying in cost per byte, maximum transfer rate, and reliability allo others to reproduce our re- sults, we chose the least xpensi solution. purchased four LaCie Big Disk Extreme units, plus Fire wire 800 When referring to disk storage, gig abyte and terabyte refer to 10 and 10 12 bytes respecti ely rather than 30 and 40 as in the case of memory (IEEE 1394b) interf ace card for each. Each unit packages tw 250 gig abyte dri es striped together non-redundantly with maximum transfer rate of 88 me abytes per second. The cost of each unit plus the card as less than dollar per gig abyte. By plugging each disk into its wn card on the PCI us, we can potentially multiply the total bandwidth by the number of disks. In addition to the Fire wire disks, we also ha tw 300 gig abyte, and one 400 gig abyte serial (SA A) disks inside our orkstation. Since hash-based DDD uses lar ge number of dif ferent ˛les, the simplest ay to use multiple disks is to partition the ˛les among the dif ferent disks. 
This also gi es the best per formance, since the erhead of striping the data is oided, and multiple threads can access ˛les simultaneously if the are on dif ferent disks. This con˛guration as used for the ourteen Puzzle xperiments reported belo ault olerance Using disk storage, breadth-˛rst search can run for weeks. Unlik other lar ge-scale computations, breadth-˛rst search cannot be easily decomposed into independent computa- tions, which can ail and then simply be restarted until the succeed. Rather it must be tolerant of memory losses, due to system crashes or po wer ailures, and loss of disk data. Loss of Memory The simplest solution to memory loss is to eep all the nodes of one iteration until the ne xt iteration completes. This al- lo ws restarting from the last completed iteration. This re- quires twice the disk space, ho we er since tw complete le els of the search must be stored at once. In act, our program is interruptible with no storage er head. When interrupted, the ˛le system will contain parent ˛les at the pre vious depth aiting to be xpanded, child ˛les at the current depth aiting to be mer ged, and child ˛les at the current depth that ha already been mer ged. If parent ˛le is being xpanded when the program is interrupted, it will still xist, ut may ha output some of its children to child ˛les. The resumed program xpands all the nodes in the parent ˛le, creating additional copies of an child nodes already output, which will entually be mer ged as duplicates. Duplicate nodes due to ree xpanding the same nodes could be detected, since the ha identical used-operator bits. In an case, the number of unique states will not be af fected. If child ˛le is being mer ged when the program is interrupted, that ˛le will still xist, and partially mer ged output ˛le may also xist. In that case, we delete the output ˛le, and remer ge the child ˛le. ne er delete an input ˛le until after the output ˛les it generates ha been written. 
Data written to disks is cached in memory on the disk controller ho we er and en block- ing write call returns before the data has been magnetically committed to disk. As result, during po wer ailure, cached data as lost before it as committed to disk, ut af- ter the input ˛les that generated it had already been deleted. The solution to this problem is an uninterrupted po wer supply (UPS), with battery suf ˛cient to po wer the com- puter long enough for clean shutdo wn, ﬂushing all ˛le uf fers, in the ent of po wer ailure. AAAI-05 / 1383

Page 5

number of parallel threads 10 time in hours and minutes 52:13 28:45 26:08 25:11 24:52 24:50 24:59 25:11 25:13 25:30 able 1: ourteen Puzzle Runtimes vs. Number of arallel Threads with Processors Unr eco erable Disk Err ors Se eral attempts to complete the Fifteen Puzzle search on disk con˛gurations optimized for speed ailed, due to transient disk errors. Disk manuf acturers specify non- reco erable read error rates between one in 10 13 and one in 10 15 bits. While single-bit errors are routinely corrected by error correcting codes, uncorrectable multiple-bit errors within parity block do occur If such an error occurs in user ˛le, that ˛le cannot be read, ut if it occurs in certain critical data, the entire ˛le system can be corrupted. Most people are not are of this ailure mode of disks, because it occurs so rarely or xample, at an error rate of one in 10 14 bits, we ould xpect such an error on the central ˛le serv er in our department about once ery years. The complete Fifteen Puzzle search reads and writes total of 10 14 bits, ho we er and we sa these errors almost weekly The solution to this problem is RAID, or Redundant Ar ray of Ine xpensi Disks (P atterson, Gibson, Katz 1988). In simple RAID, an xtra disk holds the xclusi OR of the corresponding bits on the other disks. In the ent of an unreco erable error on an one disk, its data can be recon- structed from the others, without en interrupting the pro- gram. In the case of complete loss of disk, the bad disk can be unplugged and replaced, and its data reconstructed from the other disks, ag ain without interrupting the program. or our successful Fifteen Puzzle search, we used le el-5 redundant softw are RAID composed of four xternal Fire wire disks and tw internal SA disks. This increased the running time of our program by almost 50%, compared to non-redundant disk array due to the redundant output, and the CPU ycles needed to compute this output. 
It com- pletely eliminated our disk error problem, ho we er Experiments ourteen Puzzle Pre viously the lar gest sliding-tile puzzled searched com- pletely breadth-˛rst as the ourteen Puzzle (K orf 2004). Using sorting-based DDD with symmetry it required 259 gig abytes of storage at eight bytes per node, and almost 18 days on 440 me ahertz Sun Ultra-10 orkstation. On an IBM Intellistation Pro orkstation with dual tw o- gig ahertz, 64-bit AMD Opteron processors, tw gig abytes of memory and single Fire wire disk, it took 88 hours. ran the program described here on the ourteen Puz- zle with three Fire wire disks, and tw internal SA disks, arying the number of parallel threads. The maximum amount of storage used as 75 gig abytes. The ˛le hash function as based on the positions of the blank and ˛rst tw tiles. able sho ws the results, with number of threads on top, and times in hours and minutes on the bottom. ith one thread, our hash-based DDD program ran for er 52 hours, actor of 1.7 aster than our sorting-based DDD program, using actor of 3.5 less storage. ith six threads on tw processors, our program took less than 25 hours to run, parallel speedup of 2.1. Increasing the num- ber of threads be yond six increased the running time on tw processors, presumably due to coordination erhead. ith disks and tw processors, our program is CPU- bound. Changing the number of disks slightly doesn sig- ni˛cantly af fect performance, ut increasing the number of processors should impro it. Most analyses of disk-based algorithms assume the are I/O bound, ho we er ignoring CPU time and only counting disk I/O. Fifteen Puzzle Our main goal as complete breadth-˛rst search of the Fif- teen Puzzle. learned all the reliability lessons described abo the hard ay as the program ailed se eral times due to unreco erable disk errors, until we diagnosed that prob- lem, and once due to po wer ailure. 
Using six disks in level-5 software RAID, and a UPS, we eventually completed the search in just over 28 days, using a maximum of 1.4 terabytes of disk storage. Since the RAID generated more I/O, and consumed CPU cycles, the best performance was achieved with three parallel threads on two processors. Our results confirmed that the radius of the problem space, starting with the blank in a corner, is 80 moves, which was first determined by (Brungger et al. 1999) using a more complex method. We also found that there are exactly 17 states at depth 80, more than was previously known. Table 2 shows the number of unique states at each depth. The fact that the total number of states found is exactly 16!/2 gives us additional confidence that the search is correct.

Conclusions

We presented a linear-time algorithm for bijectively mapping permutations to integers in lexicographic order. On permutations of 14 elements, our algorithm is seven times faster than an existing quadratic algorithm. We improved our disk-based search algorithm (Korf 2004) by interleaving expansion and merging, not storing sterile nodes, and introducing multi-threading, which improves its performance even on a single processor. On the Fourteen Puzzle, these improvements reduce both the storage needed and the running time by a factor of 3.5 on two processors, compared to the previous state of the art. Contrary to the usual assumption in the literature of disk-based algorithms, our program is CPU-bound rather than I/O-bound, even on two processors.

We learned the hard way that a program running for a month, reading and writing a total of 3.5 terabytes of data per day, must be fault tolerant. Rare unrecoverable disk errors can be solved by a redundant array of inexpensive disks. Power failures require an algorithm that can be interrupted and resumed, plus a backup power supply to allow a clean system shutdown, flushing all file buffers. A complete search of the Fifteen Puzzle required just over 28 days and 1.4 terabytes of storage. To our knowledge, this is the largest best-first search ever completed.

Acknowledgements

This research was supported by NSF grant No. EIA-0113313, and by IBM, which donated the workstation. Thanks to Satish Gupta of IBM, and Eddie Kohler and Yuval Tamir of UCLA, for their support and help with this work.

References

Brungger, A.; Marzetta, A.; Fukuda, K.; and Nievergelt, J. 1999. The parallel search bench ZRAM and its applications. Annals of Operations Research 90:45-63.

Culberson, J., and Schaeffer, J. 1998. Pattern databases. Computational Intelligence 14(3):318-334.

Edelkamp, S.; Jabbar, S.; and Schroedl, S. 2004. External A*. In Proceedings of the German Conference on Artificial Intelligence, 226-240.

Hart, P.; Nilsson, N.; and Raphael, B. 1968. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics SSC-4(2):100-107.

Katriel, I., and Meyer, U. 2003. Elementary graph algorithms in external memory. In Algorithms for Memory Hierarchies, LNCS 2625. Springer-Verlag. 62-84.

Korf, R., and Zhang, W. 2000. Divide-and-conquer frontier search applied to optimal sequence alignment. In Proceedings of the National Conference on Artificial Intelligence (AAAI-2000), 910-916.

Korf, R.; Zhang, W.; Thayer, I.; and Hohwald, H. 2005. Frontier search. Journal of the Association for Computing Machinery (JACM), to appear.

Korf, R. 1999. Divide-and-conquer bidirectional search: First results. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-99), 1184-1189.

Korf, R. 2003. Delayed duplicate detection: Extended abstract. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-03), 1539-1541.

Korf, R. 2004. Best-first frontier search with delayed duplicate detection. In Proceedings of the National Conference on Artificial Intelligence (AAAI-2004), 650-657.

Myrvold, W., and Ruskey, F. 2001. Ranking and unranking permutations in linear time. Information Processing Letters 79:281-284.

Nichols, B.; Butler, D.; and Farrell, J. 1996. Pthreads Programming. O'Reilly.

Patterson, D.; Gibson, G.; and Katz, R. 1988. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the ACM SIGMOD International Conference on Management of Data, 109-116.

Roscoe, A. 1994. Model-checking CSP. In Roscoe, A., ed., A Classical Mind, Essays in Honour of CAR Hoare. Prentice-Hall.

Stern, U., and Dill, D. 1998. Using magnetic disk instead of main memory in the Mur(phi) verifier. In Proceedings of the 10th International Conference on Computer-Aided Verification, 172-183.

Zhou, R., and Hansen, E. 2004a. Breadth-first heuristic search. In Proceedings of the 14th International Conference on Automated Planning and Scheduling (ICAPS-2004), 92-100.

Zhou, R., and Hansen, E. 2004b. Structured duplicate detection in external-memory graph search. In Proceedings of the National Conference on Artificial Intelligence (AAAI-2004), 683-688.
| depth | states | depth | states |
|---|---|---|---|
| 0 | 1 | 41 | 83,099,401,368 |
| 1 | 2 | 42 | 115,516,106,664 |
| 2 | 4 | 43 | 156,935,291,234 |
| 3 | 10 | 44 | 208,207,973,510 |
| 4 | 24 | 45 | 269,527,755,972 |
| 5 | 54 | 46 | 340,163,141,928 |
| 6 | 107 | 47 | 418,170,132,006 |
| 7 | 212 | 48 | 500,252,508,256 |
| 8 | 446 | 49 | 581,813,416,256 |
| 9 | 946 | 50 | 657,076,739,307 |
| 10 | 1,948 | 51 | 719,872,287,190 |
| 11 | 3,938 | 52 | 763,865,196,269 |
| 12 | 7,808 | 53 | 784,195,801,886 |
| 13 | 15,544 | 54 | 777,302,007,562 |
| 14 | 30,821 | 55 | 742,946,121,222 |
| 15 | 60,842 | 56 | 683,025,093,505 |
| 16 | 119,000 | 57 | 603,043,436,904 |
| 17 | 231,844 | 58 | 509,897,148,964 |
| 18 | 447,342 | 59 | 412,039,723,036 |
| 19 | 859,744 | 60 | 317,373,604,363 |
| 20 | 1,637,383 | 61 | 232,306,415,924 |
| 21 | 3,098,270 | 62 | 161,303,043,901 |
| 22 | 5,802,411 | 63 | 105,730,020,222 |
| 23 | 10,783,780 | 64 | 65,450,375,310 |
| 24 | 19,826,318 | 65 | 37,942,606,582 |
| 25 | 36,142,146 | 66 | 20,696,691,144 |
| 26 | 65,135,623 | 67 | 10,460,286,822 |
| 27 | 116,238,056 | 68 | 4,961,671,731 |
| 28 | 204,900,019 | 69 | 2,144,789,574 |
| 29 | 357,071,928 | 70 | 868,923,831 |
| 30 | 613,926,161 | 71 | 311,901,840 |
| 31 | 1,042,022,040 | 72 | 104,859,366 |
| 32 | 1,742,855,397 | 73 | 29,592,634 |
| 33 | 2,873,077,198 | 74 | 7,766,947 |
| 34 | 4,660,800,459 | 75 | 1,508,596 |
| 35 | 7,439,530,828 | 76 | 272,198 |
| 36 | 11,668,443,776 | 77 | 26,638 |
| 37 | 17,976,412,262 | 78 | 3,406 |
| 38 | 27,171,347,953 | 79 | 70 |
| 39 | 40,271,406,380 | 80 | 17 |
| 40 | 58,469,060,820 | | |

Table 2: States as a Function of Depth for Fifteen Puzzle
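The consistency check mentioned in the text, that the per-depth counts add up to the full reachable state space, can be reproduced from Table 2. The sketch below is not the authors' code. It assumes the standard parity result that exactly half of the 16! tile arrangements are reachable, and it takes the counts at depths 0 through 2 to be 1, 2, and 4 (the start state with the blank in a corner, its two neighbors, and their four successors). The names `STATES_AT_DEPTH` and `summarize` are hypothetical.

```python
from math import factorial

# Unique states at each depth, index = depth (0..80), copied from Table 2.
# Assumption: depths 0-2 are taken as 1, 2 and 4 states.
STATES_AT_DEPTH = [
    1, 2, 4, 10, 24, 54, 107, 212, 446, 946,
    1948, 3938, 7808, 15544, 30821, 60842, 119000, 231844, 447342, 859744,
    1637383, 3098270, 5802411, 10783780, 19826318,
    36142146, 65135623, 116238056, 204900019, 357071928,
    613926161, 1042022040, 1742855397, 2873077198, 4660800459,
    7439530828, 11668443776, 17976412262, 27171347953, 40271406380,
    58469060820, 83099401368, 115516106664, 156935291234, 208207973510,
    269527755972, 340163141928, 418170132006, 500252508256, 581813416256,
    657076739307, 719872287190, 763865196269, 784195801886, 777302007562,
    742946121222, 683025093505, 603043436904, 509897148964, 412039723036,
    317373604363, 232306415924, 161303043901, 105730020222, 65450375310,
    37942606582, 20696691144, 10460286822, 4961671731, 2144789574,
    868923831, 311901840, 104859366, 29592634, 7766947,
    1508596, 272198, 26638, 3406, 70, 17,
]


def summarize(counts):
    """Return (total states, radius, depth holding the most states)."""
    total = sum(counts)
    radius = len(counts) - 1  # deepest level that still contains states
    peak_depth = max(range(len(counts)), key=counts.__getitem__)
    return total, radius, peak_depth


total, radius, peak_depth = summarize(STATES_AT_DEPTH)
print(total == factorial(16) // 2)                  # True
print(radius, STATES_AT_DEPTH[radius], peak_depth)  # 80 17 53
```

The total comes to 10,461,394,944,000, which agrees with the "over 10^13 states" quoted in the abstract.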
