Further Discussion of Summarizing Impossibly Large SAS


Presentation Transcript

Abstract

Michael Raithel presented at the conference a concept of summarizing the large data set in SAS. The concept is innovative and important in the age of data warehousing. Both authors have been extensively involved in processing large [...] especially interested in [...]

The task is to summarize a [...] level or [...]

5. Sort the output data set of step 4 and summarize by [...]

[...] can summarize impossibly large data sets as long as the SORT procedure in step 1 can succeed. The algorithm can also be modified easily for any number of variables. It does not require [...]

[...] concept, we do however believe [...] extensive steps. In his algorithm, there are 2**(N+1)+1 steps; 1 for [...] the lowest level or highest _TYPE_ of classes is summarized twice: once in steps 1 and 2 and once in steps 4 to 6 [...] performance of horizontal summary by [...] steps of his algorithm, and adding a new step. This is what resulted in our first SAS algorithm.

The first algorithm is a revision of Raithel's. We eliminated his steps 1, 4 and 6. We also replaced his step 3 [...] codes to generate names of variables so as to [...] complete sample codes of our SAS algorithm for 6 [...]

1. Sort the input data set and [...] BY X1-X&N, where &N is the number of CLASS variables.

2. Loop steps 3 and 4 in sequence. For 6 [...] loop 63 [...] from 0 to 62. In the sample, &BICOUNT is equal to the value of the _TYPE_ variable in the loop because it [...]

3. For each _TYPE_ in the loop, assign the names of the corresponding variables to a variable &X. We use the _NULL_ [...] because there is no input or output of data sets for [...] output is the variable [...] KEYS is to contain the names of the selected variables. Its length should be at least the total length of all variable names, plus the number of variables minus one. In the sample, the total length of all variable names is 12 and the number of variables is 6. Therefore the length of KEYS should be at least 17 (=12+6-1). The names of variables 'X1' to 'X6' are initiated to a temporary array XX. The next DO [...] element of the correspondence between the current _TYPE_ [...] defined as [...] the Ith CLASS variable is defined by 2**(6-I).
If [...] equals 0, meaning no [...], the Ith element of XX is [...]. Otherwise, the Ith element of XX is the name of the Ith CLASS variable. Then all elements of XX are assigned to [...] sends the value of [...] variable &X. In [...] the 'BY &X' statements are used to [...] the names of CLASS variables to process. Thus the composition and decomposition of STRING in [...]

4. Sort the output data set of step 2 and summarize by &X, where &X gives a list of variables corresponding to the current value of &BICOUNT or _TYPE_. The output data set of this step is the summary at [...] corresponding to the [...]

The major difference between our algorithm and Raithel's is that ours [...] steps to manipulate values of the CLASS variables [...] STRING [...] so avoided. Of course [...] have to process 2**N-1 times on _NULL_ in step 3. But [...] 1, 4 and 6 in Raithel's algorithm, step 3 in our algorithm is extremely trivial. As with Raithel's algorithm, the revised [...] can also be modified easily for any number of [...]

[...] system of algorithms. The numbers of CLASS variables used are 6 and 7. The numbers of observations of the data sets are 1.3 million and 10.5 million respectively. We [...]

Our second algorithm is based on our premise [...] SAS, the number of SORT procedures required for summarizing N CLASS variables is no more than C(N,K), where K is the integer part of N divided by 2, and C(N,K) is the number of combinations of K elements out of N samples. [...] the number of [...] Raithel's algorithm, the number of [...] for N CLASS variables is at least 2**N, which translates to 64 for 6 CLASS variables. Therefore, 45 SORT procedures can [...] for 6 CLASS variables. This is where we can reach [...]

To demonstrate the premise, let's use an example of 3 variables, X1 to X3.
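As a worked check of the premise, the following evaluates the bound for a few values of N. It assumes the bound is C(N,K) in the usual binomial sense, with K the integer part of N/2 as defined above; the values of N are chosen to match the cases mentioned in the text.

```latex
\[
K = \left\lfloor \tfrac{N}{2} \right\rfloor, \qquad
C(N,K) = \frac{N!}{K!\,(N-K)!}
\]
\[
\begin{array}{c|c|c|c}
N & K & C(N,K) & 2^{N} \\ \hline
3 & 1 & 3  & 8   \\
4 & 2 & 6  & 16  \\
6 & 3 & 20 & 64  \\
7 & 3 & 35 & 128
\end{array}
\]
```

The row for N = 4 agrees with the count of 6 SORT procedures reported for the four-variable sample.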
The number of SORT procedures required [...]

1.1 [...] the output of step 1.0 by X1, X2 and X3, for [...]
1.2 [...] the output of step 1.1 by X1 and X2, for [...]
2.1 [...] step 2.0 by X2 and X3, for [...]
3.1 [...] the output of step 3.0 by X3 and X1, for [...]
3.3 [...] the output of step 3.2 without BY statement, [...]

In steps 1.1, 1.2, 2.1, 2.2, 3.1, 3.2 and 3.3, there are no [...] already in sorting [...]

1 Sort from the original data set, and summarize by X1 X2 X3,
2 Sort from step 1 and summarize by nothing,
3 Sort from step 1 and summarize by X3, for [...]
4 Sort from step 1 and summarize by X2, for [...]
5 Sort from step 1 and summarize by X2 X3,
6 Sort from step 1 and summarize by X1, for [...]
7 Sort from step 1 and summarize by X1 X3,
8 Sort from step 1 and summarize by X1 X2,
9 Sort from step 1 and summarize by X1 X2 [...]

[...] algorithm needs 8 SORT procedures at [...] By comparison, our algorithm not only saves 6 procedures, but also the subsequent [...] for each SORT block. Thus it [...]

The CPU reduction of our algorithm is achieved through rearranging the order of variables in BY statements. If the number of variables is relatively small, such as the case of 3 variables, one can easily figure out the necessary order of CLASS variables and then simply type in the SAS codes as in the example described. However, if the number of variables is large, such as 5 or more, not only does the order of CLASS variables for the minimum sorting become less [...], but the number of statements also becomes too many for one to type. Therefore, it is good to have some tools to facilitate implementation of our algorithm. The first of these tools is to create a combination table. An example of the [...]

[...]
1 4 2
1 3 4
2 3 4
4 2
[...]

In this table, the numbers 1 to 4 represent CLASS variables X1 to X4. Each row of the table represents the order of CLASS variables in a SORT procedure. For instance, the 2nd row has values 1, 4, 2, which represents a 'BY X1 X4 X2' statement in the SORT procedure. Once you sort the data set by X1 X4 X2, you can summarize by X1 X4 X2, then by X1 X4, and then by X1.
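The idea of one sort serving several summaries can be sketched in SAS as below. This is a minimal illustration, not the authors' appendix code: the analysis variable AMOUNT and the output data set names S142, S14 and S1 are hypothetical, the libref IN is assumed to be assigned already, and IN.DATA and TEMP simply follow the names in the Appendix 3 fragment. Each later step reads the previous step's output, which is how the numbered steps in the three-variable example appear to work.

```sas
/* Sketch only. AMOUNT, S142, S14 and S1 are hypothetical names. */

/* One sort, ordered as row 2 of the combination table (1, 4, 2). */
PROC SORT DATA=IN.DATA OUT=TEMP;
  BY X1 X4 X2;
RUN;

/* Data sorted BY X1 X4 X2 is also in order for every leading    */
/* prefix of that list, so no further PROC SORT is needed below. */
PROC SUMMARY DATA=TEMP;
  BY X1 X4 X2;
  VAR AMOUNT;
  OUTPUT OUT=S142 SUM=;
RUN;

/* Roll the previous summary up one level: sums of sums.         */
PROC SUMMARY DATA=S142;
  BY X1 X4;
  VAR AMOUNT;
  OUTPUT OUT=S14 SUM=;
RUN;

PROC SUMMARY DATA=S14;
  BY X1;
  VAR AMOUNT;
  OUTPUT OUT=S1 SUM=;
RUN;
```

Because the later steps read already-summarized data, the automatic _FREQ_ variable in S14 and S1 counts summary rows rather than original observations; sum the earlier _FREQ_ as an analysis variable if original counts are needed.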
[...] the first column includes all combinations of C(4,1); the first two columns include all combinations of C(4,2); the first three columns include all combinations of C(4,3); and the first 4 columns include all combinations of C(4,4). In general, the first j columns include all combinations of C(N, j). In this way, summary at all combinations of variables can be reached through [...]

After creation of the table, we [...] the SAS program displayed in Appendix 2 to quickly generate another SAS program for horizontal summary. In the sample program of Appendix 2, the variables X1 to X4 are [...] table in Table one. The variables Y1 to Y4 are _TYPE_ values for variables in horizontal [...] For example, in row 4, X1 is 2 and Y1 equals 4, which means summarizing at the second CLASS variable is for _TYPE_ 4. In row 4, X2 is 3 and Y2 is 6, which means summarizing [...] and the third CLASS variables corresponds to _TYPE_ 6. The [...] MISS is the number of missing values of X1 to X4 in [...] columns of X1 to X4 need to drop [...] summarizing. In rows 2 and 3, the values of [...] which means not to [...]

Appendix 3 displays the SAS codes, the output from the sample program in Appendix 2, to execute the second algorithm. As [...] the number of [...] procedures used is 6, which is the number of combinations of 2 elements out of 4. The number of SUMMARY procedures is [...]

We conducted an experiment on [...] system for 6 and 7 variables, and 1.3 million and 10.5 million [...] CPU is 49% and 72% less than Raithel's [...] However, creation of [...] consume a lot of time for large numbers of CLASS variables. This is a disadvantage of our algorithm. The authors have worked out a way to [...] variables, but we are still in [...] put it in [...] data sets it is worth investing time on creating [...]

We presented 2 [...] summarization presented by Mr. Michael Raithel. We [...] description, analysis and results of experiments to [...] algorithms. From a theoretical point of view, our algorithms appear to be able to enhance Raithel's algorithm. Results of our experiment also supported the theoretical analysis.
We wish this paper could benefit [...]

[...] to SAS Technical Report P-222, [...]
Michael Raithel, "Summarizing [...] Sets For The Data Warehouse Server Using [...] Summarization", Proceedings Of The 22nd Annual [...]
SAS Institute Inc. [...] Version 6, [...]
[...] Guide Version 6, [...] Inc.

[...] 660 West Germantown Pike, Plymouth Meeting, PA 19462

*Step 2: Loop steps 3 and 4*;
DATA P; DO I=2 TO (4-MISS);

Appendix 3. The Second SAS Algorithm, Output

PROC SORT OUT=TEMP DATA=IN.DATA;