Slide1
Experiences parallelising the mixed C-Fortran Sussix BPM post-processor
H. Renshall, BE Dept associate, Jan 2012
Using appendix material from CERN-ATS-Note-2011-052 MD (July 2011)
23/01/2012
BE-NAG Meeting: Parallelisation of C-Fortran SUSSIX
Slide2
The Problem:
SUSSIX is a FORTRAN program for the post-processing of turn-by-turn Beam Position Monitor (BPM) data. It computes the frequency, amplitude and phase of tunes and resonant lines to a high degree of precision through the use of an interpolated FFT. Analysis of such data is a vital component of many linear and non-linear dynamics measurements.
For analysis of LHC BPM data a specific version, sussix4drive, run through the C steering code Drive_God_lin, has been implemented in the CCC by the beta-beating team.
Analysis of all LHC BPMs, however, represents a major real-time computational bottleneck in the control room, which has prevented truly on-line study of the BPM data. In response to this limitation an effort has been underway to decrease the real computation time of the C and Fortran codes by parallelising them, with a factor of 10 as the target.
Slide3
Solutions considered:
Since the application is run on dedicated servers in the CCC, the obvious technique is to profit from the current multi-core hardware: 24/48 cores are now typical.
The first idea was to use a parallelised FFT from the NAG fsl6i2dcl library for SMP and multicore, together with the Intel 64-bit Fortran compiler and the Intel Math Kernel Library recommended by NAG.
As a learning exercise various NAG installation validation examples of enhanced routines were run, including multi-dimensional FFTs, on a fairly idle 16-core lxplus machine. As the number of cores was increased, all took about the same real time but increasing user CPU time. This is not surprising, since the examples take only milliseconds, comparable to the overhead of launching a new thread.
Slide4
The Sussix application calls cfft (D704 in the CERN program library), which maps onto NAG c06ecf, which has not yet been enhanced. On a simple test case giving the same numerical results, c06ecf was 10% slower than cfft, probably due to extra housekeeping and extra numerical controls.
At the same time, profiling the Sussix application (with gprof) showed that only 7.5% of the total CPU time was spent in cfft, with less than 10 milliseconds per individual call. Hence one could expect little or no real-time speedup from using a parallelised version.
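The expectation of little real-time gain can be checked with Amdahl's law (a standard result, not quoted in the slides). Taking the parallelisable fraction as p = 0.075 of the CPU time, the speedup stays bounded even for an unlimited number of cores N, and this bound ignores the thread-launch overhead noted above:

```latex
S(N) = \frac{1}{(1-p) + p/N}
\quad\Longrightarrow\quad
\lim_{N\to\infty} S(N) = \frac{1}{1-p} = \frac{1}{1-0.075} \approx 1.08
```

So a perfectly parallel FFT alone could shorten the run by about 8% at best.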
The profile showed that 70% of the CPU time was spent in a function calcr, searching for the maxima of the Fourier spectra with large numbers of executions of a compact reverse inner loop over the number of turns of BPM data.
Slide5
This reverse loop over maxd, the number of LHC turns measured by an individual BPM, could not be improved. In a real case maxd is typically 1000 and this loop is executed 10 million times:

    double complex zp,zpp,zv
    zpp=zp(maxd)
    do np=maxd-1,1,-1
      zpp=zpp*zv+zp(np)
    enddo
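The loop above has the shape of Horner's rule for evaluating a polynomial with complex coefficients zp at the point zv. A minimal C sketch of the same recurrence follows; the function name horner_eval and the 0-based indexing are assumptions made for illustration, not part of SUSSIX. Each iteration needs the zpp of the previous one, which is presumably why the loop itself offered no parallelism.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Evaluate zp[0] + zp[1]*zv + ... + zp[maxd-1]*zv^(maxd-1) by Horner's rule.
   Hypothetical C analogue of the Fortran loop, using 0-based indexing. */
double complex horner_eval(const double complex *zp, size_t maxd,
                           double complex zv)
{
    assert(zp != NULL && maxd > 0);

    /* Start from the highest-order coefficient, as zpp = zp(maxd) does. */
    double complex zpp = zp[maxd - 1];

    /* Walk down to zp[0]; every step depends on the previous zpp. */
    for (size_t np = maxd - 1; np-- > 0; ) {
        zpp = zpp * zv + zp[np];
    }
    return zpp;
}
```

For example, with coefficients {1, 2, 3} and zv = 2 the recurrence gives 3, then 3*2+2 = 8, then 8*2+1 = 17.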
It was decided to try to parallelise using OpenMP, as NAG does, in the implementation supported by the Intel compiler. Examining the granularity revealed that the highest level of independent code execution was the processing of individual BPM data.
Slide6
The pure FORTRAN offline version was parallelised first by adding OpenMP parallelisation directives round the main BPM loop. The data for each BPM is in a separate file:

    !$OMP PARALLEL DO PRIVATE(n,iunit,filename,nturn)
    !$OMP& SHARED(isix,ntot,iana,iconv,nt1,nt2,narm,istune,etune,tunex,
    !$OMP& tuney,tunez,nsus,idam,ntwix,ir,imeth,nrc,eps,nline,
    !$OMP& lr,mr,kr,idamx,ifin,isme,iusme,inv,iinv,icf,iicf)
          do n=1,ntot   ! Parallel loop over all BPMs (typically 500)
            call datspe(iunit,idam,ir,nt1,nt2,nturn,imeth,narm,iana)
            call ordres(eps,narm,nrc,idam,n,nturn)
          enddo
    !$OMP END PARALLEL DO

In addition, !$OMP THREADPRIVATE directives were added for all non-shareable variables in the called subroutine trees.
Slide7
This gave good scaling up to 10 cores on a non-dedicated 16-core lxplus machine (reported at the 24th ICE section meeting of 2011), so it was worth extending to the target mixed C and Fortran version to be run in the control room.
The BPM data is read into memory from a single file, then a BPM loop is called from C code with a different but similar OpenMP syntax to give the same scaling result:

    #pragma omp parallel private(i,ii,ij,kk)
    #pragma omp for
    for (i=pickstart; i<=maxcounthv; i++) {
        sussix4drivenoise_(&doubleToSend[0], &tune[0], &amplitude[0]);
        #pragma omp critical
        {
            /* here I/O C-code in the loop needing sequential execution */
        }
    }

The Fortran datspe and ordres call trees were unchanged.
Slide8
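The pattern in the C excerpt (independent per-BPM work in parallel, with a critical section around the part that must run sequentially) can be sketched in a self-contained form. Everything named here is hypothetical: analyse_bpm is a trivial stand-in for the Fortran call, and bpm_result and process_bpms are illustration only. The sketch also runs serially if compiled without OpenMP support, since the pragmas are then ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result record: one entry per processed BPM. */
typedef struct { int bpm; double mean; } bpm_result;

/* Stand-in for the per-BPM analysis (sussix4drivenoise_ in the slides);
   here it only averages the turn-by-turn samples. */
static double analyse_bpm(const double *turns, int nturns)
{
    double sum = 0.0;
    for (int t = 0; t < nturns; t++) {
        sum += turns[t];
    }
    return sum / nturns;
}

/* Process nbpm independent records of nturns samples each.
   Results are appended to out in completion order, which is arbitrary
   when several threads are used. Returns the number of records written. */
int process_bpms(const double *data, int nbpm, int nturns, bpm_result *out)
{
    assert(data != NULL && out != NULL && nturns > 0);
    int written = 0;               /* shared between threads */

    #pragma omp parallel for
    for (int i = 0; i < nbpm; i++) {
        /* Independent work: no shared state is touched here. */
        double r = analyse_bpm(data + (size_t)i * nturns, nturns);

        /* Sequential part: one thread at a time appends its result. */
        #pragma omp critical
        {
            out[written].bpm = i;
            out[written].mean = r;
            written++;
        }
    }
    return written;
}
```

Keeping the critical section short matters for scaling, because threads queue there while the analysis itself proceeds in parallel.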
The OpenMP directives multi-thread the code and the threads are then mapped onto physical CPUs in a multi-core machine. The run-time environment variable OMP_NUM_THREADS tells OpenMP how many threads, and hence cores, it can use for an execution, which makes it easy to measure the scaling.
Since the order in which individual BPMs are processed is arbitrary, the results file is post-processed by a unix sort as part of the application to give the same results as a non-parallel execution.
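The same idea, restoring a deterministic order after parallel processing, can be illustrated in memory with the C library's qsort. This is a swapped-in analogue and not the unix sort step the application actually uses; the record layout line_record and the function names are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record: one line of output tagged with its BPM index. */
typedef struct { int bpm; double tune; } line_record;

/* Comparator ordering records by ascending BPM index. */
static int by_bpm(const void *a, const void *b)
{
    const line_record *ra = a;
    const line_record *rb = b;
    return (ra->bpm > rb->bpm) - (ra->bpm < rb->bpm);
}

/* Sort records written in arbitrary (thread completion) order so that
   the output matches what a sequential run would have produced. */
void sort_by_bpm(line_record *recs, size_t n)
{
    assert(recs != NULL || n == 0);
    if (n > 1) {
        qsort(recs, n, sizeof recs[0], by_bpm);
    }
}
```

Sorting on a unique key such as the BPM index makes the final order independent of how the threads were scheduled.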
A test case of real 1000-turn LHC BPM data, analysed to find 160 lines, was run on a reserved 24-core machine, cs-ccr-spareb7, in the CCC. A normal run of this test case takes about 50 seconds on this machine. The observed wall-time speedup of C-Fortran Sussix as a function of the number of cores (from E. Maclean) is shown on the final slide.
Slide9
About a factor of 10 improvement in the real computation time has been realised for this test case, saturating at 12 cores, probably due to memory bandwidth limits. For the study of amplitude detuning reported in CERN-ATS-Note-2011-052 the parallelised C-Fortran SUSSIX was utilised within the beta-beat GUI and the target tenfold real-time reduction was verified in practice.
This technique could be of interest to other applications.