Slide 1
Performance Analysis, Profiling and Optimization of the Weather Research and Forecasting (WRF) Model
Negin Sobhani 1,2, Davide Del Vento 2, David Gill 2, Sam Elliot 3,2, and Srinath Vadlamani 4
1 University of Iowa, 2 National Center for Atmospheric Research (NCAR), 3 University of Colorado at Boulder, 4 ParaTools Inc.
Slide 2: Outline
- Introduction
- WRF MPI scalability
- Hybrid parallelization
- Profiling WRF: Intel VTune Amplifier XE and TAU tools
- Identifying hotspots and suggested areas for improvement
Slide 3: The Weather Research & Forecasting (WRF) Model
- Numerical weather prediction system
- Designed for both operational forecasting and atmospheric research
- Community model with a large user base: more than 30,000 users in 150 countries
Figure from the WRF-ARW Technical Note
Slide 4: Previous Scaling Studies
WRF has been benchmarked on many different systems.
Figures from cisl.ucar.edu
Slide 5: TACC Stampede Supercomputer
- Aggregate peak performance: ~10 PFLOPS (PF)
- 6,400+ Dell PowerEdge C8220z server nodes
- Two Intel Xeon E5 (Sandy Bridge) processors and one Intel Xeon Phi coprocessor (MIC architecture) per compute node
- Each compute node has 32 GB of "host" memory, with an additional 8 GB on the Xeon Phi coprocessor card
- 2.2 PF from the Xeon E5 processors and 7.4 PF from the Xeon Phi coprocessors
Figures from tacc.utexas.edu
Slide 6: Hurricane Sandy Benchmark
- Coarse resolution: 40 km (50 × 50 grid), 180 s time step
- Fine resolution: 4 km (500 × 500 grid), 20 s time step
- Time period for both simulations: 54-hour forecast, 2012 Oct 27 12:00 UTC through 2012 Oct 29 18:00 UTC
- 60 vertical layers
Slide 7: Scalability Assessment (MPI Only)
- 500 × 500 horizontal grid
- (Chart regions: MPI bound vs. compute bound)
- Simulation speed is the amount of simulated time per unit of wall-clock time
Slide 8: Scalability Assessment (MPI Only)
- Allinea Performance Reports
- Separate netCDF output file per MPI task (io_form_history = 102 in the WRF namelist)
- 87% of total time spent on computation
- 79% of total time spent on MPI
Slide 9: Domain Decomposition (MPI Only)
- Per grid: 1/4 the computation but 1/2 the MPI communication (computation scales with a subdomain's area, communication with its perimeter)
Slide 10: AVX Compiler Flag
- AVX (Intel Advanced Vector Extensions) is a 256-bit SIMD instruction set extension
- Enables more aggressive optimization
- Not working with Intel compiler 15; this issue has been reported to Intel
Slide 11: Hybrid Parallelization
- Hybrid: distributed plus shared memory parallelism (dmpar + smpar)
- As the number of threads increases, performance decreases
- The cores were never oversubscribed
- Binding increases performance significantly (I_MPI_PROCESSOR_LIST=p1,p2; TACC's tacc_affinity script)
Slide 12: Intel VTune Amplifier XE
- Intel profiling and performance analysis tool
- Profiling includes stack sampling, thread profiling, and hardware event sampling
- Collects performance statistics for different parts of the code
Slide 13: What Makes WRF Expensive?
Physics configuration used:
- Longwave radiation: RRTMG scheme (ra_lw_physics = 4)
- Shortwave radiation: CAM scheme (ra_sw_physics = 3)
- Microphysics: Thompson et al. 2008 scheme (mp_physics = 8)
(Chart: time spent per module, %)
But is this case representative of the significant effect of the dynamics on performance?
Slide 14: Microphysics Options Summary

Scheme                  | mp_physics | Simulation speed | # of variables | timesteps/s
Kessler                 |  1         | 2493.6           |  3             | 13.8
Purdue Lin et al.       |  2         | 2043.8           |  6             | 11.3
WSM-3                   |  3         | 2263.8           |  3             | 12.5
WSM-5                   |  4         | 2012.3           |  5             | 11.2
Ferrier (current NAM)   |  5         | 2451.2           |  5             | 13.6
WSM-6                   |  6         | 1859.5           |  6             | 10.3
Goddard 6-class         |  7         | 1929.9           |  6             | 10.6
Thompson et al.         |  8         | 1739.8           |  7             |  9.7
Milbrandt-Yau 2-moment  |  9         | 1189.3           | 13             |  6.6
Morrison 2-moment       | 10         | 1475.5           | 10             |  8.2
WDM-5                   | 14         | 1478.6           |  8             |  8.2
WDM-6                   | 16         | 1358.8           |  9             |  7.5

Thompson microphysics is among the most expensive microphysics schemes, and it is widely used.
Slide 15: TAU Tools
- TAU (Tuning and Analysis Utilities) is a program and performance analysis tool framework for high-performance parallel and distributed computing
- TAU can automatically instrument source code, using a package called PDT, for routines, loops, I/O, memory, phases, etc.
- TAU uses wall-clock time and PAPI metrics to read hardware counters for profiling and tracing
Slide 16: Using TAU/PAPI for the Advection Module
1. PDT instrumentation for module_advect_em
2. Manually instrumented code for higher granularity on the loops of interest

TAU/PAPI variables analyzed:
- Time
- L1 and L2 data cache misses (DCM)
- Mispredicted conditional branch instructions
- Floating-point instructions and operations
- Single- and double-precision vector/SIMD instructions
Slide 17: Identified Hotspots
1. Positive-definite advection loop (32 lines)
   - High time
   - High cache misses (both L1 and L2)
   - High branch misprediction
2. x, y, z flux 5 advection equation loops
   - High time
   - High cache misses
   - Repeated through the code for the different advection schemes
Slide 18: Moisture Transport in ARW
- Until recently, many weather models did not conserve moisture because of the numerical challenges in advection schemes, leading to a high bias in precipitation
- The WRF-ARW scheme is conservative, but not all of its advection options are; a non-conservative scheme introduces new mass into the system
- Advection schemes can introduce both positive and negative errors, particularly at sharp gradients
Figure from Skamarock and Dudhia 2012
Slide 19: Advection Options in WRF (moist_adv_opt = 0, 1, 2)
- Explicit IFs to remove negative values and overshoots
- Explicit IFs to remove oscillations
- The high number of explicit IFs causes many branch mispredictions
Figure from Skamarock and Dudhia 2012
Slide 20: The Effect of Optimizing the Advection Module
1. Optimizing the positive-definite advection module
- Test 1: WRF-only case
- Test 2: WRF-Chem case

Case     | Advected variables                                                        | Maximum performance increase
WRF      | Moisture                                                                  | 13%
WRF-Chem | Moisture, tracers, species, scalars, chemical concentrations, particles   | 21% *

This hotspot has the potential to be optimized and provides a significant improvement in performance.
* The performance increase will be even higher for dust- and particle-only WRF-Chem cases.
Slide 21: Identified Hotspots
1. Positive-definite advection loop
   - High time
   - High cache misses (both L1 and L2)
   - High branch misprediction
2. x, y, z flux 5 advection equation loops
   - High time
   - High cache misses
   - Repeated through the code for the different advection schemes
Slide 22: The Effect of Optimizing the Advection Equations
2. Flux 5 advection equations
- High time and high L1 and L2 data cache misses
- This loop is repeated throughout the code for the x, y, and z directions
- A very similar loop is repeated for all the advection schemes
- Test 1: WRF 4-km benchmark with TAU instrumentation
- 58% of the time spent in advection is in these flux equation loops
- Many L1 data cache misses per iteration
- Many L2 data cache misses per iteration
This hotspot has the potential to be optimized and provides a significant improvement in performance.
Slide 23: Conclusion
- WRF shows good MPI scalability, depending on the workload
- Thread binding should be used to improve the performance of WRF hybrid runs
- Intel VTune Amplifier XE and TAU tools were used for performance analysis of the WRF code
- Dynamics is identified as the most expensive part of ARW
- We identified the hotspots of the advection module and estimated the performance increase obtainable by modifying these parts of the WRF code
Slide 24: Ongoing and Future Work
- Performance improvement of the advection module
- Analysis of hardware counters to fix branch mispredictions and cache misses
- Vectorization of the advection module for Intel Xeon Phi coprocessors
- Reducing the memory footprint by decreasing the number of temporary variables
- Exploring performance optimization with different compiler flags
Slide 25: Acknowledgements
Davide Del Vento, Rich Loft, Srinath Vadlamani, Dave Gill, Greg Carmichael, and all SIParCS admins and staff
Slide 26: Microphysics Schemes
- Provide atmospheric heat and moisture tendencies
- Include water vapor, cloud, and precipitation processes
- Microphysical rates
- Surface rainfall
Figure from Mielikainen et al. 2014
Slide27Runge-Kutta loop (steps 1, 2, and 3) (i) advection, p-grad, buoyancy using (ii) physics if step 1, save for steps 2 and 3 (iii) mixing, other non-RK dynamics, save… (iv) assemble dynamics tendencies Acoustic step loop (i) advance U,V, then W, (ii) time-average U,V,W End acoustic loop Advance scalars using time-averaged U,V,WEnd Runge-Kutta loopOther physics (currently microphysics)
Begin time step
End time step
WRF Model Integration Procedure
27