Slide1

Experiences parallelising the mixed C-Fortran Sussix BPM post-processor

H. Renshall, BE Dept associate, Jan 2012

Using appendix material from CERN-ATS-Note-2011-052 MD (July 2011)

23/01/2012

BE-NAG Meeting: Parallelisation of C-Fortran SUSSIX

Slide2

The Problem:

SUSSIX is a FORTRAN program for the post-processing of turn-by-turn Beam Position Monitor (BPM) data. It computes the frequency, amplitude, and phase of tunes and resonant lines to a high degree of precision through the use of an interpolated FFT. Analysis of such data is a vital component of many linear and non-linear dynamics measurements.

For analysis of LHC BPM data a specific version, sussix4drive, run through the C steering code Drive_God_lin, has been implemented in the CCC by the beta-beating team.

Analysis of all LHC BPMs, however, is a major real-time computational bottleneck in the control room, which has prevented truly on-line study of the BPM data. In response, an effort has been underway to reduce the real (wall-clock) computation time of the C and Fortran codes by parallelising them, with a factor of 10 as the target.

Slide3

Solutions considered:

Since the application is run on dedicated servers in the CCC the obvious technique is to profit from the current multi-core hardware: 24/48 cores are now typical.

The first idea was to use a parallelised FFT from the NAG fsl6i2dcl library for SMP and multicore, together with the Intel 64-bit Fortran compiler and the Intel Math Kernel Library recommended by NAG.

As a learning exercise, various NAG installation validation examples of enhanced routines were run, including multi-dimensional FFTs, on a fairly idle 16-core lxplus machine. As the number of cores was increased, all took about the same real time but increasing user CPU time. This is not surprising, since the examples only take milliseconds, comparable to the overhead of launching a new thread.

Slide4

The Sussix application calls cfft (D704 in the CERN program library), which maps onto NAG c06ecf, which has not yet been enhanced. On a simple test case c06ecf was 10% slower than cfft while giving the same numerical results, probably due to extra housekeeping and extra numerical controls.

At the same time, profiling the Sussix application (with gprof) showed that only 7.5% of the total CPU time was spent in cfft, with less than 10 msec per individual call, so one could expect little or no real-time speedup from using a parallelised version.
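The expectation of little speedup can be checked with Amdahl's law, which the slides do not name but which bounds the gain from parallelising only part of a program. The sketch below is a minimal illustration, assuming the profiled fraction is perfectly parallelisable and that threads cost nothing to start (both optimistic). The function names are hypothetical.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction p of the run time
 * (0 <= p <= 1) is spread perfectly over n cores and the rest stays serial. */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Upper bound as n grows without limit: only the serial part remains. */
double amdahl_limit(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

With p = 0.075 (the cfft share), amdahl_limit gives about 1.08, so even an ideally parallel FFT could shorten the run by no more than about 8%. A factor of 10 needs at least 90% of the run time to be parallelised.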

The profile showed that 70% of the CPU time was spent in a function calcr, which searches for the maxima of the Fourier spectra with large numbers of executions of a compact reverse inner loop over the number of turns of BPM data.

Slide5

This reverse loop over maxd, the number of LHC turns measured by an individual BPM, could not be improved. In a real case maxd is typically 1000 and this loop is executed 10 million times:

      double complex zp,zpp,zv
      zpp=zp(maxd)
      do np=maxd-1,1,-1
        zpp=zpp*zv+zp(np)
      enddo
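The recurrence in this loop is Horner's rule for evaluating a polynomial with coefficients zp at the point zv. Each pass needs the zpp produced by the previous pass, so the iterations cannot simply be handed to different threads. The C sketch below shows the same loop for illustration only. The function name `horner_reverse` and the 0-based indexing are assumptions, not taken from the SUSSIX source.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Evaluate zp[0] + zp[1]*zv + ... + zp[maxd-1]*zv^(maxd-1) by Horner's rule.
 * Mirrors the Fortran loop, with 0-based indices instead of 1-based. */
double complex horner_reverse(const double complex *zp, size_t maxd,
                              double complex zv)
{
    assert(zp != NULL && maxd > 0);
    double complex zpp = zp[maxd - 1];
    /* Loop-carried dependency: each pass reads the zpp written by the
     * previous pass, so the passes must run one after another. */
    for (size_t np = maxd - 1; np-- > 0; ) {
        zpp = zpp * zv + zp[np];
    }
    return zpp;
}
```

For coefficients 1, 2, 3 and zv = 2 the function returns 1 + 2*2 + 3*4 = 17.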

It was decided to try to parallelise using, like NAG, the OpenMP implementation supported by the Intel compiler. Examining the granularity revealed that the highest level of independent code execution was the processing of individual BPM data.

Slide6

The pure FORTRAN offline version was parallelised first by adding OpenMP parallelisation directives round the main BPM loop. The data for each BPM is in a separate file:

!$OMP PARALLEL DO PRIVATE(n,iunit,filename,nturn)
!$OMP& SHARED(isix,ntot,iana,iconv,nt1,nt2,narm,istune,etune,tunex,
!$OMP& tuney,tunez,nsus,idam,ntwix,ir,imeth,nrc,eps,nline,
!$OMP& lr,mr,kr,idamx,ifin,isme,iusme,inv,iinv,icf,iicf)
      do n=1,ntot   ! Parallel loop over all BPMs (typically 500)
        call datspe(iunit,idam,ir,nt1,nt2,nturn,imeth,narm,iana)
        call ordres(eps,narm,nrc,idam,n,nturn)
      enddo
!$OMP END PARALLEL DO

In addition !$OMP THREADPRIVATE directives were added for all non-shareable variables in the called subroutine trees.
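The effect of THREADPRIVATE can be illustrated in C, where OpenMP offers the equivalent `#pragma omp threadprivate`. The sketch below is hypothetical: the names `scratch`, `process_bpm`, `process_all`, and `MAX_TURNS` do not come from SUSSIX. A file-scope work buffer, playing the role of a Fortran COMMON block or saved variable, gets one copy per thread so that concurrent calls do not overwrite each other's data. Compiled without OpenMP support, the pragmas are ignored and the code runs serially with the same result.

```c
#include <assert.h>

#define MAX_TURNS 1000

/* File-scope work buffer; without threadprivate every thread would share it. */
static double scratch[MAX_TURNS];
#pragma omp threadprivate(scratch)

/* Hypothetical per-BPM routine: fills the scratch buffer, then reduces it. */
double process_bpm(int bpm, int nturn)
{
    assert(nturn > 0 && nturn <= MAX_TURNS);
    for (int i = 0; i < nturn; i++)
        scratch[i] = (double)(bpm + i);
    double sum = 0.0;
    for (int i = 0; i < nturn; i++)
        sum += scratch[i];
    return sum;
}

/* Process all BPMs in parallel; each result goes to its own slot. */
void process_all(double *result, int nbpm, int nturn)
{
    #pragma omp parallel for
    for (int n = 0; n < nbpm; n++)
        result[n] = process_bpm(n, nturn);
}
```

If the threadprivate line is removed and the code is built with OpenMP, the threads race on `scratch` and the sums can come out wrong. That is the failure the directives guard against.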

Slide7

This gave good scaling up to 10 cores on a non-dedicated 16-core lxplus machine (reported at the 24th ICE section meeting of 2011), so it was worth extending to the target mixed C and Fortran version to be run in the control room.

The BPM data is read into memory from a single file, then a BPM loop is called from C code with a different but similar OpenMP syntax to give the same scaling result:

#pragma omp parallel private(i,ii,ij,kk)
#pragma omp for
for (i=pickstart; i<=maxcounthv; i++) {
    sussix4drivenoise_(&doubleToSend[0], &tune[0], &amplitude[0]);
    #pragma omp critical
    {
        /* here I/O C-code in the loop needing sequential execution */
    }
}
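A self-contained sketch of this pattern follows. It is an illustration under assumptions, not the control-room code: `analyse_bpm` is a hypothetical stand-in for the Fortran call, and appending to a shared array stands in for the sequential I/O.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-BPM analysis routine. */
static double analyse_bpm(int bpm)
{
    return 0.25 + 0.001 * (double)bpm;
}

/* Analyse BPMs first..last (inclusive) in parallel and append each result
 * to the output arrays; returns the number of entries written.
 * The append order depends on thread scheduling. */
int analyse_range(int first, int last, int *out_bpm, double *out_tune)
{
    assert(first <= last);
    int count = 0;                      /* shared between threads */
    #pragma omp parallel for
    for (int i = first; i <= last; i++) {
        double tune = analyse_bpm(i);   /* independent work, runs concurrently */
        #pragma omp critical
        {
            /* One thread at a time: shared output must not interleave. */
            out_bpm[count] = i;
            out_tune[count] = tune;
            count++;
        }
    }
    return count;
}
```

Only the body of the critical block is serialised, so it should stay small compared with the analysis itself. Otherwise the threads spend their time queuing for it.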

The Fortran datspe and ordres call trees were unchanged.

Slide8

The OpenMP directives multi-thread the code, and the threads then map onto physical cores in a multi-core machine. The run-time environment variable OMP_NUM_THREADS tells OpenMP how many threads, and hence cores, it can use for an execution, which enables easy measurement of the scaling.
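A program can report the thread count it will actually use, which is handy when recording scaling measurements. This is a minimal sketch: the wrapper name `max_threads` is hypothetical, while `omp_get_max_threads` is the standard OpenMP runtime call. The `_OPENMP` guard lets the same file build without OpenMP, in which case one thread is reported.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Upper bound on the threads a parallel region may use; this reflects
 * OMP_NUM_THREADS when the variable is set in the environment. */
int max_threads(void)
{
    int n = 1;                 /* serial build: exactly one thread */
#ifdef _OPENMP
    n = omp_get_max_threads();
#endif
    assert(n >= 1);
    return n;
}
```

Running the same executable with OMP_NUM_THREADS set to 1, 2, 4 and so on, and timing each run, gives the scaling curve without recompiling.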

Since the order of processing of individual BPMs is arbitrary, the results file is post-processed by a Unix sort as part of the application to give the same results as a non-parallel execution.
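The slides use an external sort on the results file. An alternative, shown here only as a sketch, is to reorder the results in memory with the standard C `qsort` before writing them. It assumes each result record carries the index of the BPM it belongs to. The `bpm_result` type, its fields, and `sort_results` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical result record: one entry per processed BPM. */
typedef struct {
    int bpm;        /* position of the BPM in the input order */
    double tune;    /* illustrative payload */
} bpm_result;

/* Comparison callback for qsort: ascending BPM index. */
static int compare_by_bpm(const void *a, const void *b)
{
    const bpm_result *ra = a;
    const bpm_result *rb = b;
    return (ra->bpm > rb->bpm) - (ra->bpm < rb->bpm);
}

/* Restore the serial ordering after a parallel run. */
void sort_results(bpm_result *results, size_t n)
{
    assert(results != NULL || n == 0);
    qsort(results, n, sizeof results[0], compare_by_bpm);
}
```

Either way, sorting on a key that is unique per BPM makes the output independent of thread scheduling.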

A test case of real 1000-turn LHC BPM data, analysed to find 160 lines, was run on a reserved 24-core machine, cs-ccr-spareb7, in the CCC. A normal run of this test case takes about 50 seconds on this machine. The observed wall-time speedup of C-Fortran Sussix as a function of the number of cores (from E. Maclean) is shown on the final slide.

Slide9

About a factor of 10 improvement in the real computation time has been realised for this test case, saturating at 12 cores, probably due to memory bandwidth limits. For the study of amplitude detuning reported in CERN-ATS-Note-2011-052, the parallelised C-Fortran SUSSIX was utilised within the beta-beat GUI and the target tenfold real-time reduction was verified in practice.
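These figures translate directly into speedup and parallel efficiency. The sketch below shows that arithmetic with hypothetical helper names. It assumes the "normal run" of about 50 seconds is the single-core time. The 5-second parallel time used in the example is derived from the quoted factor of 10, not a separately reported measurement.

```c
#include <assert.h>

/* Speedup: serial wall time divided by parallel wall time. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency: speedup per core used (1.0 would be ideal scaling). */
double efficiency(double t_serial, double t_parallel, int cores)
{
    assert(cores >= 1);
    return speedup(t_serial, t_parallel) / (double)cores;
}
```

For example, speedup(50.0, 5.0) is 10, and at the 12 cores where the gain saturates, efficiency(50.0, 5.0, 12) is about 0.83.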

This technique could be of interest to other applications.