Multi-Cache Profiling of Parallel Processing Programs Using Simics

Nikrouz Faroughi
Computer Science Department, Sacramento State University, Sacramento, CA 95819-6021
faroughi@csus.edu

Abstract
This paper presents a multi-cache profiler for shared memory multiprocessor systems. For each of a program's static data structures, the profiler outputs the read- and write-miss frequencies that are due to cache line migrations. It identifies those static data structures whose manipulation results in excessive cache line migrations, a potential source of excessive false misses.

The frequency of line migrations from cache to cache may be inherent in the algorithm or may depend on the coding. The paper illustrates that the profiled data are useful both for analyzing algorithms and programs from a cache-performance point of view and for code optimizations that reduce cache line migrations. The profiler is created by extending the single-cache profiling capabilities of Simics; it combines the profiled data from the individual cache memories with the virtual memory addresses assigned to each static data structure of a parallel processing program.

Keywords: parallel processing, program instrumentation, multiprocessor simulation, multi-cache memory profiling.

1. Introduction
Simulation tools have long been used to study system behavior, performance, and scalability. Numerous simulation environments and extensions, such as MPI, Augmint, LIMES, TangoLite, SimOS, MulSim, and Simics [1-8], have been developed for this purpose. Simulation has been used to evaluate cache memory access patterns in commercial workloads, such as those in database applications. In such applications, for example, simulation was effective in

revealing the application's performance problem areas. It was shown, for example, that data accessed by a sequential scan operation was cached only once, and that marking such data un-cacheable improves performance [10]. Simulation has also been used to show that cache accesses triggered by load-store sequences, or accesses to migratory cache lines, can be improved if read-shared requests are replaced with read-for-ownership requests [11]. These studies used Simics, a system-level instruction-set simulator that can simulate the execution of a real

executable binary [9]. In this paper, we describe a multi-cache profiler, based on Simics, that tracks the static data structures of a parallel processing program. This coarse-grained cache profiler (data-structure level rather than block level) collects cache line migration frequencies for each static program data structure; non-migratory data structures are ignored. The multi-cache profiling of a Parallel Radix Sort program is used as an example to demonstrate the application of the profiler. Section 2 presents a brief description of the simulation environment,

the multiprocessor simulation configuration, program instrumentation, and cache memory profiling in Simics. Section 3 introduces our multi-cache profiler and discusses its outputs. Section 4 demonstrates its application to improve the Parallel Radix Sort implementation. Section 5 concludes the paper.

2. Simics Simulation Environment
Simics is a full system simulator that runs on a variety of host machines and can simulate a user-defined target machine running a contemporary operating system such as RedHat Linux. The simulated target system operates like a real system except that it runs slower; the simulation speed depends on the speed of the host machine.

The target system has read-only access to all the files on the host system. One of the profiling techniques supported in Simics is the "magic" (no-op) instruction, which adds an overhead of only a few assembly instructions. When a magic instruction instrumented into the target program executes, Simics stops the simulation, and various data on the state of the simulated system, including the cache memories, are collected and processed. In this paper we use the Simics Enterprise system configuration.

The Enterprise configuration simulates x86-p4 (Pentium 4) processors running RedHat Linux 7.3. For simplicity and illustration purposes, our simulated multiprocessor system consists of two processors, each with single hyperthreading, and 256MB of shared memory. Each processor connects to the shared memory via a 16KB cache memory. The cache memories are configured as 4-way set associative with a 32B line size and implement the MESI cache coherency protocol with write-allocate and LRU line replacement policies. The single-cache memory profilers in Simics are also configured to operate on virtual addresses; this simplifies the tracking of cache read/write activities.

2.1 Simics Single-Cache Profiling Tool
The basic source-program instrumentation in Simics is the "magic" instruction, defined as a macro, MAGIC(0) [7]. When the magic instruction executes, the simulation stops and a pre-assigned, user-defined Python profiling management function is invoked. From within this function, additional Python functions are called to initialize, collect, and process the profiled data. The only acceptable argument to MAGIC(0) on the x86 platform is zero. However, another macro, for example one named MY_MAGIC(n) that calls MAGIC(0), can be included in the target program; it allows other values of n to be stored in a register (e.g., "eax") and read from within the Python management function [7]. In our study, MY_MAGIC(1) and MY_MAGIC(2) are called, respectively, before and after the profiled sections of each parallel thread to initialize, and then to collect and process, the profiled data; this is a standard step when using the Simics tools. For a tutorial on multiprocessor simulation and cache profiling in Simics, refer to [12].
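The Python side of this mechanism can be sketched as follows. The hap name and the register-access calls follow the Simics scripting API [7], while init_profilers and collect_profiles are hypothetical placeholders for our actual (not shown) profiling functions; this is a minimal sketch, not the full profiler.

def magic_callback(user_arg, cpu, magic_arg):
    # On x86 only MAGIC(0) is allowed, so the MY_MAGIC value n is
    # recovered from the register the macro loaded it into ("eax").
    n = SIM_read_register(cpu, SIM_get_register_number(cpu, "eax"))
    if n == 1:
        init_profilers()       # reset the per-cache miss profilers
    elif n == 2:
        collect_profiles()     # dump and combine the profiled data

SIM_hap_add_callback("Core_Magic_Instruction", magic_callback, None)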

2.2 Multi-Cache Profiling
For each individual cache memory in our virtual system, we add the Simics data read-miss and write-miss virtual-address profilers. These individual cache profilers keep track of all the cache lines, and their access frequencies, associated with data accesses that result in either a read-miss or a write-miss. We achieve multi-cache memory profiling through the following process (a sketch of steps 1, 3, and 4 follows the list):
1. We use the Linux utility "objdump" to generate the symbol table of the target program from its executable binary. From this table, for each static data structure in the target program, we extract the assigned virtual address and the data structure's size. This information defines the data-space address boundaries used in our multi-cache memory profiler.
2. We collect the individual read-miss and write-miss profiled data for all the cache memories in the system during the execution of the parallel threads.
3. Using the information extracted in step 1, we organize the data from step 2 by program data structure and by cache memory.
4. We combine the data from step 3 to generate the multi-cache read- and write-miss data organized by program data structure.
The data from step 4 contain the frequencies of migratory cache lines for each shared program data structure. This frequency information is used to identify those data structures in the target program that produce high frequencies of migratory cache lines; the areas of the target program source code that manipulate these data structures can then be re-examined for potential optimizations. In the next section, we discuss the application of our multi-cache profiler to an implementation of the Parallel Radix Sort algorithm.
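Steps 1, 3, and 4 amount to a few lines of Python. The following minimal sketch assumes a Linux host with GNU objdump and 32-bit ELF output; parse_symbols, classify, and combine are hypothetical helper names, and the per-CPU dictionaries of address-to-count mappings are an illustrative layout for the miss data, not the exact form the Simics profilers emit.

import subprocess

def parse_symbols(binary):
    # Step 1: extract (start, end, name) for each static data object.
    # In objdump -t output, data objects carry the 'O' flag, and the
    # last two fields of a symbol line are its size and name.
    out = subprocess.check_output(["objdump", "-t", binary], text=True)
    bounds = []
    for line in out.splitlines():
        parts = line.split()
        if len(parts) >= 5 and "O" in parts[1:-2]:
            start = int(parts[0], 16)
            size = int(parts[-2], 16)
            bounds.append((start, start + size, parts[-1]))
    return bounds

def classify(addr, bounds):
    # Step 3: map a missed virtual address to its data structure.
    for start, end, name in bounds:
        if start <= addr < end:
            return name
    return "other"                 # thread-local or stack data

def combine(misses_per_cpu, bounds):
    # Step 4: fold per-cache miss counts into per-structure totals.
    totals = {}
    for cpu, addr_counts in misses_per_cpu.items():
        for addr, count in addr_counts.items():
            name = classify(addr, bounds)
            totals.setdefault(name, {})
            totals[name][cpu] = totals[name].get(cpu, 0) + count
    return totals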

3. Parallel Radix Sort
We have selected a two-thread version of the Parallel Radix Sort algorithm to illustrate the results obtained from our multi-cache profiler. Each thread executes on a separate processor in our two-processor virtual system. The Parallel Radix Sort algorithm used here is discussed in [14, 18]. Below is the set of shared data structures used in the implementation; the multi-cache profiled data are charted using these names:

"buffer" – Consists of two buffers, called the input and output buffers, each with room for n entries (keys). The input buffer is divided between the two threads and holds the input keys for the current cycle. Both threads concurrently read the keys from their respective sections of the input buffer and write them to different but unique locations in the output buffer; a thread can write to any location in the output buffer. The input and output buffers switch roles at each new cycle.

"globalCnt" – In our implementation, keys are examined one 8-bit slice at a time, with one slice per sorting cycle. During each sorting cycle, each thread independently processes its keys and counts how many of them share each value of the current 8-bit slice. These counts are combined into a global histogram. For 8-bit slices, globalCnt is a 256-entry histogram.

"lower" – For a given thread with id number k, this data structure holds the number of keys counted in "globalCnt" that will be processed by all the threads whose id numbers are less than k. For 8-bit slices, this data structure is also a 256-entry histogram. Note that, in the two-thread implementation, k takes the values 0 and 1; thus "lower" is generated only for the thread with id = 1.

"table" – A global offset table compiled from the data in the "globalCnt" histogram. For each key in the current input buffer of each thread, "table" and "lower" together identify a unique location in the output buffer where the key is stored.

For 32-bit keys and 8-bit slices, Parallel Radix Sort takes four cycles to sort n keys. In each cycle, both threads independently and concurrently write keys into the output buffer, thereby invalidating common cache lines in the different cache memories. Additionally, since each thread writes its keys to different but unique locations in the output buffer, the process causes many false misses. After the last sorting cycle, the output buffer contains the keys in sorted order. A small sequential walk-through of this bookkeeping is sketched below.
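To make the roles of "counters", "globalCnt", "table", and "lower" concrete, the following sequential Python sketch reproduces the bookkeeping of one sorting cycle for two simulated "threads". The names mirror the program's data structures, but the parallel execution, locking, and barriers are omitted; this illustrates the arithmetic only.

def one_cycle(keys, num_threads=2, pass_no=0):
    n = len(keys)
    shift = pass_no << 3                      # select the 8-bit slice
    chunks = [keys[:n // 2], keys[n // 2:]]   # thread 0 and thread 1 halves

    # Per-thread local-count histograms ("counters" in the program).
    counters = [[0] * 256 for _ in range(num_threads)]
    for pid, chunk in enumerate(chunks):
        for k in chunk:
            counters[pid][(k >> shift) & 0xFF] += 1

    # "globalCnt": the combined histogram of all threads.
    globalCnt = [sum(counters[pid][c] for pid in range(num_threads))
                 for c in range(256)]

    # "table": exclusive prefix sum of globalCnt (the global offsets).
    table = [0] * 256
    for c in range(1, 256):
        table[c] = table[c - 1] + globalCnt[c - 1]

    # "lower": same-slice keys owned by lower-numbered threads.
    lower = [[0] * 256 for _ in range(num_threads)]
    for pid in range(1, num_threads):
        for c in range(256):
            lower[pid][c] = lower[pid - 1][c] + counters[pid - 1][c]

    # Scatter: each key lands at table[c] + lower[pid][c], as in
    # program lines 56 through 60 of the appendix listing.
    out = [None] * n
    for pid, chunk in enumerate(chunks):
        for k in chunk:
            c = (k >> shift) & 0xFF
            out[table[c] + lower[pid][c]] = k
            lower[pid][c] += 1
    return out

print(one_cycle([23, 7, 7, 200, 5, 23, 9, 7]))  # [5, 7, 7, 7, 9, 23, 23, 200]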

We illustrate the application of our multi-cache profiler by examining the multi-cache profiled data for different versions of the Parallel Radix Sort program. Version I of the program (partially listed at the end of this paper for convenience) contains no specific improvements. The barrier-related functions used in the program are not shown. The other shared data structures used in the program are "bar", "lock", "final[0]", and "flag[0]". Both "flag" and "final" are defined large enough that each falls in a separate cache line, and we assume that the line size (i.e., 32B) divides the size of "buffer"; this simplifies the discussion. The remaining data structures used by the program are local to each thread. These local data structures are shown as "other" in the charts and are included in case their accesses result in conflict-misses when the shared data spaces are accessed.

For an example of how to use our multi-cache profiler to analyze the cache memory accesses of parallel processing programs, consider the profiled data presented in Figure 1. It shows the multi-cache data read-miss frequencies collected from our two-processor virtual system during the execution of the Version I program with n=32 keys; the keys are 32-bit integers generated at random. Figure 1 shows only the read-misses that are caused by migratory cache lines. Cache lines that cause read-misses in only one cache memory are not considered migratory and are not tracked by the multi-cache profiler (a sketch of this filter follows).
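The filter itself is simple. The following sketch, with illustrative names rather than Simics API calls, counts a line as migratory only when misses on it were observed in more than one cache:

LINE_SIZE = 32

def migratory_lines(misses_per_cpu):
    # Map each 32B-aligned line address to the set of caches that
    # missed on it; a line is migratory only if that set has size > 1.
    seen = {}
    for cpu, addr_counts in misses_per_cpu.items():
        for addr in addr_counts:
            seen.setdefault(addr & ~(LINE_SIZE - 1), set()).add(cpu)
    return {line for line, cpus in seen.items() if len(cpus) > 1}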

As shown in Figure 1, accesses to the "lock" and "globalCnt" data structures cause the largest numbers of migratory read-misses. A closer examination of program lines 26 through 31 reveals the source of this relatively large number of read-misses: both threads aggressively attempt to add their local-count histogram entries to the "globalCnt" histogram. During these attempts, both cache memories access cache lines from the "globalCnt" data structure, and each time one cache memory writes to one of these lines it invalidates the copy of the line in the other cache memory. These cache lines therefore keep migrating from one cache memory to the other and back. Depending on the type of operations performed on each program data structure, the migration of cache lines can be due to true sharing and/or false sharing; in this study, our focus is to identify those program data structures that produce highly migratory cache lines.

Figure 2 shows which program data structures, when accessed, map to the same set in the cache memories. For n=32 keys, each set contains cache lines from at most two of the data structures, and, as shown in Figure 3, at most two of the four cache lines available in each set are utilized. This implies that, for n=32 keys, the program data structures did not produce conflict-misses and all the data shown in Figure 1 are read-misses due to migratory cache lines. Note that our Parallel Radix Sort implementation is suitable for systems with a small number of processors; for a system with a large number of processors, different techniques, such as parallel prefix computation, are more efficient for generating the "globalCnt" and "lower" histograms [15, 16].

Figure 1. Data structures and the number of read-misses (rm) due to migratory cache lines for executing the Version I Parallel Radix Sort program for n=32 keys. Y-axis: number of read-misses due to migration.

Figure 2. Number of sets containing cache lines from the specific data structures for n=32 keys. In this case, cache lines from at most two data structures map to the same set. Y-axis: number of sets.

4. Radix Sort Read-Miss Improved
The multi-cache performance of the Version I Parallel Radix Sort program can be improved if program lines 26 through 31 are replaced with the following lines:

pthread_mutex_lock(&lock);   //lock
for (i=0; i<256; i++)
    globalCnt[i] = globalCnt[i] + counters[i];
pthread_mutex_unlock(&lock); //unlock

In the modified program (named Version II and not included here), the fine-grained word-level synchronization used in Version I is replaced with data-structure-level synchronization: each thread locks the entire "globalCnt" histogram and releases it only after it has added all of its local-count histogram entries to "globalCnt". Figure 4 displays the multi-cache read-miss profiled data for the improved Version II program. The number of migratory read-misses for the "lock" and "globalCnt" data structures is reduced from between 772 and 1015 read-misses in Figure 1 to between 8 and 157 read-misses in Figure 4. These results are from a single simulation run; we assume that results from multiple runs would be similar and not subject to the large variations found in commercial workloads [17].

Figure 3. Number of cache lines utilized in each set. At most two of the four cache lines in each set are utilized for n=32 keys. Y-axis: number of sets; X-axis: number of cache lines utilized.

Figure 5 shows the multi-cache write-miss profiled data for the Version II program. The write-miss chart for the Version I program is similar to the one in Figure 5 and is therefore not shown. Note the change of scale in Figure 5.

As program lines 56 through 60 illustrate, the output-buffer entries are filled in an irregular, scattered way based on the 8-bit slice used during each cycle (pass). This irregular, scattered writing of keys to the output buffer increases the chances of false misses. To analyze the migratory cache lines of the "buffer" data structure, "buffer" is divided into two halves, labeled "buffer_0" and "buffer_1" in the charts; both halves contain input- and output-buffer entries. Each thread (e.g., thread 0) reads keys from its half of the input buffer (i.e., the "buffer_0" input) and scatters them to unique, distinct locations in one or both halves of the output buffer (the "buffer_0" and "buffer_1" outputs), potentially producing migratory cache lines due to false misses. The frequency of these migratory cache lines depends on the initial order of the keys; partially sorted keys result in fewer write-miss migratory cache lines. Figure 5 shows the migratory cache lines for randomly selected keys. In this case, both halves of the output buffer are written by both threads (shown as wm_cpu0 and wm_cpu1), causing some write-miss migratory cache lines for the "buffer" data structure.
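A quick back-of-the-envelope check, sketched below with illustrative names, shows why the scatter phase produces such lines: with 32B lines and 4-byte keys, eight consecutive output-buffer entries share one cache line, so writes by the two threads to nearby output positions frequently collide on a line.

LINE_SIZE, KEY_SIZE = 32, 4

def same_line(i, j):
    # True when output-buffer entries i and j fall in one cache line.
    return (i * KEY_SIZE) // LINE_SIZE == (j * KEY_SIZE) // LINE_SIZE

# An entry written by thread 0 and a nearby entry written by thread 1
# often share a line; whichever cache writes second invalidates the
# other's copy even though the two threads wrote different words.
print(same_line(12, 14))   # True: entries 12 and 14 both sit in line 1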

Figure 4. Data structures and the number of read-misses (rm) due to migratory cache lines for executing the Version II program for n=32 keys. Y-axis: number of read-misses due to migration.

Figure 5. Data structures and the number of write-misses (wm) due to migratory cache lines for executing the Version II Parallel Radix Sort program for n=32 random keys. Y-axis: number of write-misses due to migration.

Figures 6 and 7 show the simulation results for n=1024 randomly selected keys using the modified (Version II) program. As expected, and as shown in Figure 6, the multi-cache read-miss frequencies increase for the larger "buffer" compared to the data in Figure 4. However, compared to the data in Figure 1, the Version II program produces far fewer read-misses for n=1024 keys than Version I produced for n=32 keys. Such source-code optimizations may not be obvious to programmers at coding time without data-structure-oriented multi-cache profiling. Additionally, as shown in Figure 7, the Parallel Radix Sort algorithm produces a proportionally larger number of write-misses for "buffer" than the read-misses in Figure 6; depending on the cache coherency protocol used, this can further influence performance.

As n increases, it is expected that cache lines from various groups of program data structures will map to the same sets in the cache memories, potentially resulting in some conflict-misses. This is illustrated in Figure 8, which shows the program data structures in groups; when the structures in a group are accessed, the corresponding cache lines map to the same group of sets in each cache memory.
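The set-mapping view of Figures 2 and 8 can be reproduced directly from the address boundaries gathered in Section 2.2. The sketch below uses illustrative names and the cache geometry of Section 2 (16KB, 4-way, 32B lines, hence 16384 / (4 * 32) = 128 sets):

CACHE_SIZE, ASSOC, LINE_SIZE = 16 * 1024, 4, 32
NUM_SETS = CACHE_SIZE // (ASSOC * LINE_SIZE)    # 128 sets

def set_index(addr):
    return (addr // LINE_SIZE) % NUM_SETS

def structures_per_set(bounds):
    # For every data structure (bounds as produced in Section 2.2),
    # record which structures have lines mapping into each set.
    per_set = {}
    for start, end, name in bounds:
        for line in range(start // LINE_SIZE, (end - 1) // LINE_SIZE + 1):
            per_set.setdefault(line % NUM_SETS, set()).add(name)
    return per_set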

Further extensions would be necessary for the profiler to collect and organize the multi-cache profiled data by cache-miss type, such as true-miss, false-miss, and conflict-miss, for each data structure in the program. Additionally, our profiler presently tracks the multi-cache memory activities of the static program data structures, not those that are dynamically allocated. Yet another valuable extension would be to organize the profiled data by the different sections of the program source code, so that inefficient code is automatically identified and possibly modified.

Figure 6. Data structures and the number of read-misses (rm) for executing the two-thread Version II Parallel Radix Sort program with 1024 random keys, excluding non-migratory cache lines and non-migratory data structures. Y-axis: number of read-misses.

Figure 7. Data structures and the number of write-misses (wm) for executing the two-thread Version II Parallel Radix Sort program with 1024 keys, excluding non-migratory cache lines and non-migratory data structures. Y-axis: number of write-misses.

Figure 8. Program data structures that map to the same sets in the cache memories for n=1024. Y-axis: number of sets.

5. Conclusion
This paper presents a multi-cache profiling tool built on Simics, a computer simulation tool. The single-cache profiling capabilities of Simics were extended to collect cache line migration frequencies and to organize them by the static data structures of a parallel processing program. The profiler identifies those data structures whose manipulation results in excessive cache line migrations, a potential source of excessive false misses. As an example, our multi-cache profiler was applied to two implementations of the Parallel Radix Sort algorithm. The profiler shows that, during the construction of the global histogram of key frequencies, word-level synchronization produces many more data migrations in the caches, something that might not be obvious to programmers during coding. By changing the word-level synchronization to data-structure-level synchronization, the frequency of data migrations in the cache memories, as expected, drops drastically. The profiler also reveals the cache line migrations that are inherent in the implementation of the algorithm (for example, during each sorting cycle, when each thread in the Parallel Radix Sort program distributes its keys among several threads, it causes cache line migrations); such migrations do not depend on decisions made by programmers during coding. This kind of algorithm and program analysis is possible when the cache memory activities are monitored at the data-structure level, as is done in our multi-cache profiler.

References
[1] MPI: The Message Passing Interface standard.
[2] Augmint: A multiprocessor simulator.
[3] LIMES: A Multiprocessor Simulation Tool for PC Platforms.
[4] Tango multitasking simulator, developed at Stanford and based on Unix shared memory; TangoLite is its successor.
[5] SimOS: The Complete Machine Simulator.
[6] MulSim: Multiprocessor Simulator.
[7] Simics: system-level instruction-set simulator, available from Virtutech: www.virtutech.com.
[8] Irina Chihaia (chihaia@inf.ethz.ch), "A Processor Based Classification of the Instrumentation and Simulation Tools," www.inf.ethz.ch/personal/chihaia/instr-sim.html.
[9] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A Full System Simulation Platform," IEEE Computer, Feb. 2002, pp. 50-58.
[10] Jim Nilsson, Fredrik Dahlgren, Magnus Karlsson, Peter Magnusson, and Per Stenström, "Computer System Evaluation with Commercial Workloads," in Proceedings of the 1998 IASTED Conference on Modeling and Simulation, pp. 293-297, May 1998.
[11] Jim Nilsson and Fredrik Dahlgren, "Reducing Ownership Overhead for Load-Store Sequences in Cache-Coherent Multiprocessors," in Proceedings of the 2000 International Parallel and Distributed Processing Symposium, Cancun, Mexico, May 2000.
[12] Nikrouz Faroughi, "Profiling of Parallel Processing Programs on Shared Memory Multiprocessors Using Simics," Workshop on Binary Instrumentation and Applications, held in conjunction with the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT-14), St. Louis, MO, September 2005.
[13] Dinesh C. Suresh, Walid A. Najjar, Frank Vahid, Jason R. Villarreal, and Greg Stitt, "Profiling Tools for Hardware/Software Partitioning of Embedded Applications," in Proceedings of the 2003 ACM SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2003), June 2003.
[14] David Culler and Jaswinder Pal Singh, "Parallel Computer Architecture: A Hardware/Software Approach," Morgan Kaufmann Publishers, 1999.
[15] Blelloch et al., "Comparison of Sorting Algorithms for the Connection Machine CM-2," in Proceedings of the Symposium on Parallel Algorithms and Architectures, July 1991, pp. 3-16.
[16] Culler et al., "Parallel Programming in Split-C," in Proceedings of Supercomputing, November 1993, pp. 262-273.
[17] Alaa R. Alameldeen, Carl J. Mauer, Min Xu, Pacia J. Harper, Milo M. K. Martin, Daniel J. Sorin, Mark D. Hill, and David A. Wood, "Evaluating Non-deterministic Multi-threaded Commercial Workloads," in Proceedings of the Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-02), February 2, 2002.

[18] Pierre Terdiman.

# Target Program Version I with only two threads. Partially shown.

#define MY_MAGIC(n) \
{ \
    asm volatile ("movl %0, %%eax" \
        : /* no outputs */ \
        : "g" (n) \
        : "eax"); \
    MAGIC(0); \
}

void *sortSubList(void *arg) {
16  MY_MAGIC(1); // call Python profiling function to initialize cache memories
17  for(pass=0; pass<4; pass++){
18    memset(counters, 0, 256*sizeof(int));            // set local counts to 0
19    memset(&lower[pid][0], 0, 256*sizeof(int));      // reset
20    memset(&globalCnt[pid*128], 0, 128*sizeof(int)); // reset
21    for(i=minIdx; i<maxIdx; i++){                // loop over the input array
22      c = (buffer[inBuff][i]>>(pass<<3)) & 0xFF; // get current byte
23      counters[c]++;                             // ...and update counter
24    }
25    barRet=barrier_wait(bar, 1); // barrier: finish local-count histograms
26    for (i=0; i<256; i++)
27    {
28      pthread_mutex_lock(&lock);   // lock
29      globalCnt[i]=globalCnt[i]+counters[i];
30      pthread_mutex_unlock(&lock); // unlock
31    }
32    barRet=barrier_wait(bar, 1); // barrier: finish generating global-count
33    if(pid==0){ // pid=0: do the lower half of the global-offset table
34      table[0]=0;
35      for(i=1; i<128; i++){
36        table[i] = table[i-1] + globalCnt[i-1];
37      }
38    }
39    else{ // pid=1: do the other half of the table
40      sumCntForPid1=0;
41      for(i=0; i<128; i++)
42        sumCntForPid1=sumCntForPid1 + globalCnt[i];
43      table[128]=sumCntForPid1;
44      for(i=129; i<256; i++){
45        table[i] = table[i-1] + globalCnt[i-1];
46      }
47    }
48    for(i=0; i<256; i++){
49      for(j=pid; j<1; j++){ // bound reconstructed: in the two-thread version
                              // only thread 0 runs the body, updating lower[1]
50        pthread_mutex_lock(&lock);   // lock
51        lower[j+1][i]=lower[j+1][i] + counters[i];
52        pthread_mutex_unlock(&lock); // unlock
53      }
54    }
55    barRet=barrier_wait(bar, 1); // barrier: finish global-offset table and
                                   // lower-count histograms
56    for(i=minIdx; i<maxIdx; i++){ // loop condition reconstructed from line 21
57      c = (buffer[inBuff][i]>>(pass<<3)) & 0xFF;
58      buffer[outBuff][table[c]+lower[pid][c]] = buffer[inBuff][i];
59      lower[pid][c]++;
60    }
61    barRet = barrier_wait(bar, 1); // barrier: finish one pass
62    if(pid==0) // only one thread needs to do this
63      final[0]=outBuff; // identify which buffer has the sorted numbers
                          // (could do without)
64    t=inBuff;
65    inBuff=outBuff;
66    outBuff = t;
67  }
68  MY_MAGIC(2); // done; call Python function to create multi-cache profiled data
}

main() {
10  pthread_create(&threadID, NULL, sortSubList, &info1); // create a thread
11  info2.minIdx = (int)(Nitems/2);
12  info2.maxIdx = Nitems;
13  info2.pid = 1;
14  while(flag[0]==0){}; // wait until thread with pid=0 starts
15  sortSubList(&info2);
16  pthread_join(threadID, &exit_status);
}