/
Paper 12-27- 1 -Paul M. Dorfman, Independent ConsultantGregg P. Snell, Paper 12-27- 1 -Paul M. Dorfman, Independent ConsultantGregg P. Snell,

Paper 12-27- 1 -Paul M. Dorfman, Independent ConsultantGregg P. Snell, - PDF document

tatiana-dople
tatiana-dople . @tatiana-dople
Follow
388 views
Uploaded On 2015-10-08

Paper 12-27- 1 -Paul M. Dorfman, Independent ConsultantGregg P. Snell, - PPT Presentation

09 temporary 1 2 3 0 4 and we want to know if the value of VARARRAY isever equal to 7 You might notice that the values ofthe table are not sorted and contain duplicates ID: 153647

0:9) _temporary_ 1

Share:

Link:

Embed:

Download Presentation from below link

Download Pdf The PPT/PDF document "Paper 12-27- 1 -Paul M. Dorfman, Indepen..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

Paper 12-27- 1 -Paul M. Dorfman, Independent ConsultantGregg P. Snell, Data Savant ConsultingABSTRACTYou know hashing works—and fast! You’ve been tothe presentations, even read the papers. But youstill have not tried any hashing techniques. Why? Isit because you had a hard time understanding allthose hkeys, collisions, and prime modulo thingies?Well, you’re not alone. Fortunately, just as you donot need to be an auto mechanic to learn how todrive, you do not need to be a mathematician tolearn how to use hashing. 0:9) _temporary_ 1 2 3 0 4 and we want to know if the value of VAR_ARRAY isever equal to 7. You might notice that the values ofthe table are not sorted and contain duplicates—arather common phenomenon in the real world. Asearch for 7 might look like this: - 2 - % _null_; 0:9) _temporary_ 1 2 3 0 4 5 6 9 7); 0 to 9; then do; 'hit'; 'miss'; While this example is absurdly small, it illustrates theAchilles heel of a comparison search: Some (or evenall) values in the table are compared to our targetvalue of 7 until (or if) we find a match. But how canwe use direct addressing in this example by directlyreferencing the array element that contains 7?hisper&#xw7.8;sppissst! This is where the magic happens…it’scalled key-indexing!hisper&#x/w8.;ကKEY-INDEXINGKey-indexing is the concept of using the value of atable’s key variables as the index into that table.This enables us to directly address the table, as weknow exactly what the addresses are for each value!Consider the following: data _null_; 0:9) _temporary_ 0 1 2 3 4 5 6 . 9); . 'hit'; put 'miss'; It doesn’t take a rocket scientist to realize that thiscode is simpler and will execute much faster as wehave eliminated all but one comparison! Now, at thispoint, you may be saying to yourself “Hold on there!You guys changed the order of the table!” Yes,we did and thank you for noticing. By definition, thevalues in a key-indexed table must be stored suchthat the key is the index. However, this is extremely to do programmatically and there are some fringe benefits as well!So with that in mind, let us do it for our zip codeproblem, shall we? To use key-indexing, we want anarray (table) large enough to hold all of the SUVcounts for all possible zip codes(ARRAY SUV(0:99999) _TEMPORARY_). Then wewant to load the array from the demographic filesuch that the value of SUV(zip) is either missing (.)or contains the SUV count for that customer’s zipcode. For example…SUV(0)=.SUV(1)=.SUV(27513)=214SUV(32258)=88SUV(66216)=67SUV(99999)=.So our code might look something like this: data ** 0:99999) _temporary_; until(eof1); /* source not sorted */ . That’s it! Now we have an array, called SUV, whichcontains either missing values (no zip or the count ismissing) or the number of SUVs for that zip. Did younotice some of the fringe benefits of loading a key-indexed array? The source data did not requiresorting because, by definition, we always knowexactly where the values will be stored in the array.Also, our table is naturally de-duped because weonly set the array value once. The rest of ourprogram is now very simple and straight-forward: ** until(eof2); /* We used the key variable ZIPCODE as the the table and were able to assign the SUV countswithout any comparisons by directly addressingkey-indexed table. Neither data set required sortingand had there been duplicates in our demographic - 3 -data, they would not have presented a problem!Pretty slick, eh?Unfortunately, there are two primary limitations tousing key-indexing—memoryindex sizethis example, we only needed enough memory tohold a temporary array that was 8-byte numeric by100,000 rows, or 800K. But as the size of the datato be added gets larger, so does the memoryrequirement.The second primary limitation to using key-indexingis how large the range of the key variable canbecome before exceeding memory (not to mention itmust always be integer). 100,000 zips may not be aproblem, but what about 9-digit Social SecurityNumbers (SSN)? That would require 8 GB, which isnot chump change even with contemporarymemories. And adding just one digit to the keyrange would increase memory usage ten-fold! Howcan one possibly use key-indexing when SSN (orany other large or non-integer value) is the key?Since memory and index size represent our primarylimitations, our solution must address one or both ofthese issues.hisper&#xw7.8;sppissst! This is where the REAL magichappens… actually, there are several pieces ofmagic but we won’t try to explain them all, justenough to get you going…hisper&#x/w8.;ကBITMAPPINGBitmapping dramatically improves memory utilizationby using a single bit (1 bit) rather than 8 bytes (64bits) of memory to the presence of absence of anumericlookup key. Given the same memoryresource, bitmapping has the addressable key range53 to 56 times that of key-indexing alone. Whileusing bitmapping for adding satellite data to a masterfile can be challenging, it is a champion when weonly need to rapidly find out if the record with a givenkey should be selected, and if memory resources aresufficient for mapping the entire key range intomemory bits.In this paper, we have chosen to not expand on thebitmapping technique. A full and detailedexplanation can be found in “Last Year’s Paper”.HASHINGHashing actually addresses the memory andindex size issues. Now, before we go any farther,does everybody know what hashing means?Hashing is the process of converting a long-rangekey (numeric or character) to a smaller-range integernumber with a mathematical algorithm or function:LONG-RANGE KEY HASHSMALLER-RANGEINTEGERHere, let us explain it to you with a simple examplebefore we attempt to use it on our CUSTOMERdatabase. Let us presume our database has only 10rows and the key is 3-digits like this: data Now, on such a small key, we could use key-indexing with a simple array of 1,000 elements(ARRAY KEY(0:999)) but only 10 cells out of 1,000would be occupied while the rest is wasted memory.But what if we could “hash” these ten 3-digit keys tofit in a smaller array, take up less memory, and stillmake use of direct addressing?Ideally, if we had a hash function that would mapeach of the ten 3-digit values to its own hashaddress from 0 to 9 we would have the perfect hashfunction. However, it would take a great deal of timeto develop such a function, which might turn out tobe quite slow to compute and would also fail as soonas the keys changed or if you have more than 10records. In reality, we will use a less-than-perfectfunction. It will not guarantee a one-to-one mappingbetween the keys and their hash addresses, i.e. itwill allow some keys to map to the same hashaddress – a phenomenon called a collisionHowever, the function will be able to not pile toomany keys on one address, it will be fast tocalculate, and good for any input set of keys, not justone. Any collisions that ensue will be dealt withseparately by invoking a collision resolution policyA fairly fast, simple, and uniform hash function canbe obtained by dividing the key (or its numericrepresentation if the key is character) by a prime - 4 -number and using the remainder as a hash address.A detailed discussion of such a division method canbe found in “Last Year’s Paper”. Essentially, thereare only three things you will need to know in orderto effectively use this method.First, what is the load factor of the hashing array? Inother words, after all the keys have been hashed andloaded into the array, how full do you want it to be?The sparser the array is populated (i.e. smaller loadfactor) the faster searching will be but at the expenseof using more memory for the larger array. A moredensely populated array is somewhat slower (due tomore collisions) but requires less memory. Formany applications of this technique, a load factor of0.8 yields a good balance of performance andmemory requirements. Let us assign that to a macrovariable: Secondly, you need to know what the optimal arraysize should be for the given load factor. This is oneof those “magic” things we won’t spend much timeon since it requires some knowledge of numbertheory to really understand. Fortunately, the authorof this function has already provided us with a shortand extremely fast SAS program to dynamicallycalculate and store the optimal value in a macrovariable called HASH_SIZE. Let us use this programon the SMALL data set we used previously: data _null_; by 1 until(j=up+1); 2 to up until(not mod( end; ' ,left best So, if we use the load factor of 0.8, our optimal arraysize would be 13. Now, let us see what effect thishas on the hashed values in our example: data _null_; 1; 1 then put 'KEY HASH_ADDR'; Z3. �' === ' Z2.; �185 === 04 �971 === 10 �400 === 11 �260 === 01 �922 === 13 �970 === 09 �543 === 11 �532 === 13 �050 === 12 �067 === 03 From this example you can see that our hashfunction has produced two collisions: 400 and 543both hashed into 11 and 922 and 532 hashed intoThat brings up the third thing you need to knowabout this method: how to resolve collisions. Butbefore we go any farther, you need to be aware thatwe will be deviating from pure direct addressingbecause of the potential collisions.For with each collision, we will be altering the originalhashed number in order to move our index to anempty node (un-collided address) in the array.Consequently, we can no longer rely upon the simplefact that ARRAY(HASH_ADDR) is not missing toidentify a match. We will have to store the KEY ineach node to enable further confirmation thatARRAY(HASH_ADDR)=KEY. And should this initialcomparison fail, we will not yet know if KEY ismissing or simply stored elsewhere due to acollision, until we follow our chosen resolution policyto its logical end. This makes hashing a hybridalgorithm because it combines direct addressing withshort, and hence very fast, bursts of sequentialsearching.Now, hang with us folks! We know this part is a littletougher to follow but it is the key to understandinghow to use hashing and we are only going to discusstwo methods: linear probing and coalesced chaining.COLLISION RESOLUTION POLICY:LINEAR PROBINGCollision resolution with Linear Probing is, perhaps,the simplest method of all. As the name implies,when a collision occurs, the remaining cells in thetable are simply probed in a linear fashion(decrementing the index by 1) until the next emptyaddress is found. If our index becomes less than 0then we have stepped off the bottom of the table andneed to reset the index to the top, and continue untilencountering either a duplicate key or emptyaddress. The simplicity of this method becomes - 5 -quite evident when you realize this entire wrap-around cycle can be coded in a single DO loop.A program using SMALL might look like this: data _null_; 0:&hash_size) _ ; ** until(eof1); /* 1 1 until . or /* 0 then 1; /* /* ' ; 0 to & '(' ')=' (00)=. (01)=260 (02)=. (03)=067 (04)=185 (05)=. (06)=. (07)=050 (08)=543 (09)=970 (10)=971 (11)=400 (12)=532 (13)=922 Our first collision occurred when 543 hashed into 11because HASH_TABLE(11) was greater thenmissing. Visualizing our code at that point might looklike this:KEY HASH_ADDR HASH_TABLEThe main advantage of linear probing is its uttersimplicity. And, if the table is sparse enough, itperforms quite well! As a rule of thumb, linearprobing performs best if about half of all nodes in thetable are left empty, i.e. with the load factor of about0.5. However, this represents a significant problemwhen you consider that a smaller load factorcorresponds to a larger array and dramatically largermemory requirements. Also, linear probingperformance rapidly deteriorates, as the table getsfuller.Which leads us naturally to what might beconsidered the best collision resolution scheme:COLLISION RESOLUTION POLICY:COALESCED CHAININGCoa-what? Coalesce simply means to unite into awhole. Earlier, we have seen that the key(s) hashingto the same address form a “chain”, which we needto traverse sequentially. We can create one array foreach chain and write very simple code to implementsuch separate chaining. But since each array mustbe allocated to its maximum possible size, thesimplicity comes at the expense of a huge memorycost. In coalesced chaining, we avoid this bycoalescing all the chains together into a “whole”array. The idea is fairly simple in that we willmaintain a parallel array, called LINK_TO_NEXT thatwill contain one of 3 distinct values: missing if the corresponding node ofHASH_TABLE is empty, 0 if the corresponding node ofHASH_TABLE contains a hashed value andno other keys have “collided” with it, or a number that points to the next cell with ahashed number.HASH LOOPCOLLISION FOUND MISSING, STORE IT - 6 -Let us go back to our SMALL example. The first sixvalues hash without collisions, so stepping throughour process:1. hash KEY into HASH_ADDR2. check to see ifLINK_TO_NEXT(HASH_ADDR) is missing3. found a missing LINK_TO_NEXT cell so thecorresponding HASH_TABLE cell is alsomissing; store 0 in LINK_TO_NEXT to markit as filled4. store KEY in HASH_TABLE(HASH_ADDR)and our arrays would look like this:KEY HASH_ADDRHASH_ADDR LINK_TO_NEXTLINK_TO_NEXT HASH_TABLEThe next KEY of 543 hashes into 11, but 11 isalready occupied because LINK_TO_NEXT(11) isgreater than missing. So our process now becomes:1. hash KEY into HASH_ADDR2. check to see ifLINK_TO_NEXT(HASH_ADDR) is missinga. if not, see if the key already exists byfollowing the chained addresses inLINK_TO_NEXT until we find it inHASH_TABLE, or find a 0.b. If not found, then beginning with the topof LINK_TO_NEXT, search backwarduntil we find a missing value.c. Store this index in the originalLINK_T0_NEXT(HASH_ADDR) (it was a0) before changing HASH_ADDR to this3. found a missing LINK_TO_NEXT cell so thecorresponding HASH_TABLE cell is alsomissing; store 0 in LINK_TO_NEXT to markit as filled4. store KEY in HASH_TABLE(HASH_ADDR)SAS code for this process becomes thus: data _null_; 0:&hash_size) ; 0:&hash_size) ; ** until(eof1); /* STEP 1 (hash the key) */ 1; 0; /* STEP 2 (collision?) */ . do; /* STEP 2.a (check for duplicates) */ then continue; by -1 .); /* STEP 3 (mark link as occupied) */ 0; /* STEP 4 (store the key) */ ' ; 0 to & '(' I z2. ')=' 15 '(' ')=' /* 1; if 0 do; (00)= . (00)= . (01)=00 (01)=260 (02)= . (02)= . (03)=00 (03)=067 (04)=00 (04)=185 (05)= . (05)= . (06)= . (06)= . (07)=00 (07)=050 (08)=00 (08)=532 (09)=00 (09)=970 (10)=00 (10)=971 (11)=12 (11)=400 (12)=07 (12)=543 (13)=08 (13)=922 STEP 1 STEP 2 STEP 3 STEP 4 - 7 -Visualizing the process during the first collision mightlook like this:KEY HASH_ADDRHASH_ADDR LINK_TO_NEXTLINK_TO_NEXT HASH_TABLEAnd during the subsequent collision of 532:KEY HASH_ADDRHASH_ADDR LINK_TO_NEXTLINK_TO_NEXT HASH_TABLEA piece of cake, right? Come on now, it’s not thatbad. Besides, you have to get those spouse agesadded to CUSTOMER by this afternoon and hashingis the only way you can pull-off such a super-humanfeat! Recalling the three things we need to knowabout hashing, let us apply them to our problem:1. The load factor (how full do you want thearray to be?)The spouse age file is pretty big so we betterstart with the recommended factor of 0.8 andif we run out of memory, up it to 0.9 or higher(we may also have to invoke SAS withMEMSIZE= option greater than the default of2. The optimal array size given our load factorThis part is very easy because we just runthe magic code provided in “Last Year’sPaper”.Big keys (SSN) and a big table point us tousing coalesced chaining.Here we go…UP, UP and AWAY! creeeec&#xs-8.;䀀hNope. Wait. Hold everything! We forgot somethingvery necessary for this particular problem. TheSPOUSE_AGE, remember? How are we going toadd this field if we have to store SSN in theHASH_TABLE because of collisions? Simple! Justadd another array that will hold the ages, but we onlyhave to load or unload it the collision stuff is allover and we are ready to work with the key.Ok, now we are ready… % _ by 1 until(j=up+1); 2 to up until(not mod( ' ,left best 0:&hash_size) _ 0:&hash_size) _ ARRAY :&hash_size) _ ** until(eof1); /* STEP 1 (hash the key) */ 1; 0; . do; then continue; /* STEP 2.b (follow by -1 .); /* STEP 2.c (change original link /* ( STEP 1 STEP 2 STEP 2.c STEP 4 STEP 1 STEP 2STEP 2.aSTEP 2.bSTEP 3 STEP 2.aSTEP 2.bSTEP 2.c STEP 3 STEP 4 - 8 - /* STEP 3 (mark link as occupied) */ 0; /* ** until(eof2); 0; 1; . then do; /* /* 1; if 0 do; Now, wasn’t that easy? Outside of comments, weonly had to change about 12 lines of code (seeupper-case).Hopefully we have been able to remove at leastsome of the confusion that may have prevented youfrom using key-indexing and hashing in the past.Most of the code is extremely straight-forward andrequires but a few minor changes to address mostproblems. And, at the sake of sounding like abroken record, we strongly encourage you to rereadthe SUGI 26 Paper “Table Look-Up by DirectAddressing: Key-Indexing – Bitmapping – Hashing”.We have re-addressed some of the originalconcepts, but there are more we did not coverand certainly not in as much detail.Now that you have a better understanding of what“hashing” is all about, we hope you will try it. But,how do you decide to use key-indexing orhashing over the more traditional approaches?Under certain conditions, as in the examples wepresented, the decision is made for you. Butassuming you could choose between a traditionalSORT and MERGE or Hashing, what other criteriashould you consider? SpeedNothing is faster (not even a speeding bullet)!Whether your problem is a one time hurry-up-and-get-it-done job for the boss, or a web-deployeddatabase application that requires sub secondresponse time, hashing will save the day and you toocould become a hero! - 9 -1.D. E. Knuth, The Art of Computer Programming,2.D. D. Knuth, The Art of Computer Programming,3.R. Sedgewick, Algorithms in C, Parts 1-4.4.T. A. Standish. Data Structures, Algorithms andSoftware Principles in C.5.P.M. Dorfman. Table Look-Up by DirectAddressing: Key-Indexing– Bitmapping–Hashing.SUGI 26.6.J. Morris, Data Structures and Algorithms, HashTables(http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210 /hash_tables.html ACKNOWLEDGMENTSP.M.D.: Special thanks to Karsten M. Self, IanWhitlock, F. Joseph Kelley, Sigurd Hermansen, andBase SAS R&D team for their support of direct-addressing methods in SAS, valuable discussionsfull of ideas, wit, and vigor, and giving the author anopportunity to apply the techniques to solve practicalproblems. The author gratefully acknowledges thecontribution of all individuals who have, directly orindirectly, encouraged the author and supported hisefforts of making direct addressing an accepted andpractically used DATA step philosophy.G.P.S.: A very special thanks to Paul Dorfman, author of “the” paper and a very gracious andforgiving coauthor as well a consummate expatiatorof the English language ;-) You are my Hashinghero!CONTACT INFORMATIONPaul M. Dorfman4437 Summer Walk Ct.Jacksonville, FL 32258(904) 260-6509 (h)(904) 905-5428 (o)pdorfman@bellsouth.net paul_dorfman@hotmail.com paul.dorfman@bcbsfl.com Gregg P. SnellData Savant ConsultingShawnee, KS 66216(913) 638-4640 (o)(208) 977-1943 (f)gsnell@datasavantconsulting.com www.datasavantconsulting.com TRADEMARK CITATIONSAS and all other SAS Institute Inc. product orservice names are registered trademarks ortrademarks of SAS Institute Inc. in the USA andother countries. ® Indicates USA registration.Other brand and product names are registeredtrademarks or trademarks of their respectivecompanies.