
Slide1

Performance Analysis, Profiling and Optimization of Weather Research and Forecasting (WRF) model

Negin Sobhani 1,2, Davide Del Vento 2, David Gill 2, Sam Elliot 3,2, and Srinath Vadlamani 4

1 University of Iowa, 2 National Center for Atmospheric Research (NCAR), 3 University of Colorado at Boulder, 4 ParaTools Inc.

Slide2

Outline

Introduction
WRF MPI Scalability
Hybrid Parallelization
Profiling WRF: Intel VTune Amplifier XE and TAU tools
Identifying hotspots and suggested areas for improvement

Slide3

The Weather Research & Forecasting (WRF) Model

Numerical weather prediction system
Designed for both operational forecasting and atmospheric research
Community model with large user base: more than 30,000 users in 150 countries

Figure from WRF-ARW Technical Note

Slide4

Previous Scaling Studies

WRF has been benchmarked on different systems.

Figures from cisl.ucar.edu

Slide5

TACC Stampede Supercomputer

Aggregate peak performance: ~10 PFLOPS (PF)
6,400+ Dell PowerEdge (C8220z) server nodes
2 Intel Xeon E5 (Sandy Bridge) processors and an Intel Xeon Phi coprocessor (MIC architecture) per compute node
Each compute node has 32 GB of "host" memory with an additional 8 GB of memory on the Xeon Phi coprocessor card
2.2 PF from the Xeon E5 processors and 7.4 PF from the Xeon Phi coprocessors
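As a sanity check on those figures, 2.2 PF + 7.4 PF ≈ 9.6 PF, consistent with the ~10 PF aggregate peak performance quoted above.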

Figures from tacc.utexas.edu

Slide6

Hurricane Sandy Benchmark

Coarser resolution: 40 km (50 x 50), time step: 180 sec
Finer resolution: 4 km (500 x 500), time step: 20 sec
Time period for both simulations: 54-hour forecast, from 2012 Oct 27 12:00 UTC through 2012 Oct 29 18:00 UTC
60 vertical layers

Slide7

Scalability Assessment (MPI Only)

[Scaling figure: 500x500 horizontal grid, with compute-bound and MPI-bound regimes]

Simulation speed is the duration of simulated time per unit of wall-clock time.
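For example (hypothetical timing, for illustration only): if the 54-hour forecast completes in 2 hours of wall-clock time, the simulation speed is 54 / 2 = 27.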

Slide8


Scalability Assessment (MPI Only)

Allinea Performance Reports
Separate NetCDF output file per MPI task (io_form_history = 102 in the WRF namelist); a namelist sketch follows below

87% of total time spent on computation

79% of total time spent on MPI
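A minimal sketch of the corresponding namelist.input setting, assuming the standard WRF &time_control namelist; the interval and frames values are illustrative placeholders, not the benchmark's actual configuration:

    &time_control
     io_form_history    = 102,   ! split history output: each MPI task writes its own NetCDF file
     history_interval   = 60,    ! illustrative value (minutes)
     frames_per_outfile = 1,     ! illustrative value
    /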

Slide9

Domain Decomposition (MPI only)

Per tile: 1/4 of the computation and 1/2 of the MPI communication
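As an illustrative example: splitting a 100 x 100 tile into four 50 x 50 tiles cuts the points computed per tile to 2,500 out of 10,000 (1/4), while the perimeter over which halos are exchanged only drops from 400 to 200 points (1/2), so communication grows relative to computation as the decomposition gets finer.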

Slide10

AVX compiler flag

AVX (Intel® Advanced Vector Extensions) is a 256-bit instruction set extension
Enables more aggressive optimization
Not working with the Intel 15 compiler
This issue has been reported to Intel

Slide11

Hybrid Parallelization

Hybrid: distributed plus shared memory parallelism (dmpar+smpar)
As the number of threads increases, the performance decreases
The cores were never oversubscribed
Binding increases the performance significantly
I_MPI_PROCESSOR_LIST=p1,p2
tacc_affinity script

Slide12

Intel VTune Amplifier XE

Intel profiling and performance analysis tool
Profiling includes stack sampling, thread profiling, and hardware event sampling
Collects performance statistics for different parts of the code

Slide13

What makes WRF expensive?

Physics configuration of the benchmark:
Longwave radiation: RRTMG scheme (ra_lw_physics = 4)
Shortwave radiation: CAM scheme (ra_sw_physics = 3)
Microphysics: Thompson et al. 2008 (mp_physics = 8)
A namelist sketch of these settings follows below.

[Figure: Time (%) spent in each part of the code]
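A minimal sketch of the &physics namelist entries that select these schemes (option numbers as listed above; all other physics settings omitted):

    &physics
     ra_lw_physics = 4,   ! RRTMG longwave radiation
     ra_sw_physics = 3,   ! CAM shortwave radiation
     mp_physics    = 8,   ! Thompson et al. 2008 microphysics
    /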

But is this case representative of the significant effect of the dynamics on performance?

Slide14

Microphysics options summary

Scheme                  | mp_physics | Simulation Speed | # of Variables | # timesteps/s
Kessler                 | 1          | 2493.6           | 3              | 13.8
Purdue Lin et al.       | 2          | 2043.8           | 6              | 11.3
WSM-3                   | 3          | 2263.8           | 3              | 12.5
WSM-5                   | 4          | 2012.3           | 5              | 11.2
Ferrier (current NAM)   | 5          | 2451.2           | 5              | 13.6
WSM-6                   | 6          | 1859.5           | 6              | 10.3
Goddard 6-class         | 7          | 1929.9           | 6              | 10.6
Thompson et al.         | 8          | 1739.8           | 7              | 9.7
Milbrandt-Yau 2-moment  | 9          | 1189.3           | 13             | 6.6
Morrison 2-moment       | 10         | 1475.5           | 10             | 8.2
WDM-5                   | 14         | 1478.6           | 8              | 8.2
WDM-6                   | 16         | 1358.8           | 9              | 7.5

Thompson microphysics is among the most expensive microphysics schemes, and it is widely used.

Slide15

TAU tools

TAU (Tuning and Analysis Utilities) is a program and performance analysis tool framework for high-performance parallel and distributed computing
TAU can automatically instrument source code, using a package called PDT, for routines, loops, I/O, memory, phases, etc.
TAU uses wall-clock time and PAPI metrics to read hardware counters for profiling and tracing

Slide16

Using Tau/PAPI for Advection Module

1. PDT instrumentation for module_advect_em
2. Manually instrumented code for higher granularity of the desired loops (a sketch of the manual instrumentation follows below)

TAU/PAPI variables analyzed:
Time
L1 and L2 data cache misses (DCM)
Conditional branch instructions mispredicted
Floating point instructions and operations
Single and double precision vector/SIMD instructions
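A minimal sketch of what the manual instrumentation of one loop might look like, using TAU's Fortran manual-instrumentation API (TAU_PROFILE_TIMER / TAU_PROFILE_START / TAU_PROFILE_STOP); the subroutine name, arguments, and loop body are placeholders, not the actual WRF code:

    subroutine advect_section(q, tend, its, ite)
      implicit none
      integer, intent(in)    :: its, ite
      real,    intent(in)    :: q(its:ite)
      real,    intent(inout) :: tend(its:ite)
      integer :: i
      ! TAU manual instrumentation: one saved integer(2) handle per timer
      integer, save :: profiler(2) = 0
      call TAU_PROFILE_TIMER(profiler, 'advect_section inner loop')
      call TAU_PROFILE_START(profiler)
      do i = its, ite
        tend(i) = tend(i) + q(i)      ! placeholder loop body
      end do
      call TAU_PROFILE_STOP(profiler)
    end subroutine advect_section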

Slide17

Identified Hotspots

1. Positive definite advection loop (32 lines)
   High time
   High cache misses (both L1 and L2)
   High branch misprediction
2. x, y, z flux 5 advection equation loops
   High time
   High cache misses
   Repeated throughout the code for the different advection schemes

Slide18

Moisture transport in ARW

Until recently, many weather models did not conserve moisture because of the numerical challenges in advection schemes, leading to a high bias in precipitation.
The WRF-ARW scheme is conservative, but not all of its advection schemes are; non-conservative schemes introduce new mass into the system.

Figure from Skamarock and Dudhia 2012

Advection schemes can introduce both positive and negative errors, particularly at sharp gradients.

Slide19

Advection options in WRF

Explicit IFs to remove negative values and overshoots
Explicit IFs to remove oscillations
The high number of explicit IFs causes high branch mispredictions (a branchless sketch follows below)

[Figure from Skamarock and Dudhia 2012: moisture advection with moist_adv_opt = 0, 1, and 2]
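A hand-written illustration (not the actual WRF code) of the kind of transformation being considered: clipping negative values with an explicit IF, which mispredicts when the sign pattern is irregular, versus the equivalent branchless MAX intrinsic, which compilers can turn into predicated or vector instructions:

    ! Illustration only: array name and bounds are placeholders.
    subroutine clip_negative(fqx, its, ite)
      implicit none
      integer, intent(in)    :: its, ite
      real,    intent(inout) :: fqx(its:ite)
      integer :: i
      ! Branchy form: one conditional per element; costly when fqx
      ! flips sign irregularly and the branch predictor keeps missing.
      do i = its, ite
        if (fqx(i) < 0.) fqx(i) = 0.
      end do
      ! Equivalent branchless form: same result (clipping is idempotent),
      ! but expressed with MAX so no branch is needed in the inner loop.
      do i = its, ite
        fqx(i) = max(fqx(i), 0.)
      end do
    end subroutine clip_negative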

Slide20

The effect of optimization of advection module

1. Optimizing the positive definite advection module
Test 1: WRF-only case
Test 2: WRF-Chem case

Case     | Advected Variables                                                      | Maximum performance increase
WRF      | Moisture                                                                | 13%
WRF-Chem | Moisture, tracers, species, scalars, chemical concentrations, particles | 21% *

This hotspot has the potential to be optimized for a significant improvement in performance.

* The performance increase will be significantly higher still for dust- and particle-only WRF-Chem cases.

Slide21

Identified Hotspots

1. Positive definite advection loop
   High time
   High cache misses (both L1 and L2)
   High branch misprediction
2. x, y, z flux 5 advection equation loops
   High time
   High cache misses
   Repeated throughout the code for the different advection schemes

Slide22

The effect of optimization of advection equations

2. Flux 5 advection equations

High time and high L1 and L2 data cache misses
This loop is repeated throughout the code for the x, y, and z directions
A very similar loop is repeated for all the advection schemes

Test 1: WRF 4-km benchmark with TAU instrumentation
58% of the time spent in advection is in these flux equation loops
Many L1 data cache misses per iteration
Many L2 data cache misses per iteration

This hotspot has the potential to be optimized for a significant improvement in performance; a sketch of the general loop form follows below.
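A rough sketch of the general shape of such a flux loop, written from the published 5th-order upwind flux formula (Wicker and Skamarock) rather than copied from module_advect_em; array names, bounds, and the face-indexing convention are placeholders:

    ! Illustrative 5th-order upwind flux in one direction (not the WRF source).
    ! q is the advected scalar, u the velocity at the flux location.
    subroutine flux5_x(q, u, fqx, its, ite)
      implicit none
      integer, intent(in)  :: its, ite
      real,    intent(in)  :: q(its-3:ite+3), u(its:ite)
      real,    intent(out) :: fqx(its:ite)
      integer :: i
      do i = its, ite
        ! 6th-order centered flux minus a sign(u)-weighted dissipation term,
        ! which together give the 5th-order upwind-biased flux.
        fqx(i) = u(i) * ( ( 37.*(q(i)+q(i-1)) - 8.*(q(i+1)+q(i-2))        &
                              + (q(i+2)+q(i-3)) ) / 60.                   &
                 - sign(1., u(i)) * ( (q(i+2)-q(i-3))                     &
                       - 5.*(q(i+1)-q(i-2)) + 10.*(q(i)-q(i-1)) ) / 60. )
      end do
      ! The wide stencil (six q values per output point, in three directions)
      ! is what drives the L1/L2 data cache misses observed with TAU/PAPI.
    end subroutine flux5_x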

Slide23

Conclusion

WRF shows good MPI scalability, depending on the workload
Thread binding should be used to improve the performance of WRF hybrid runs
Intel VTune Amplifier and TAU tools were used for the performance analysis of the WRF code

Dynamics is identified as the most expensive part of ARW

We identified the hotspots of the advection module and estimated the amount of performance increase from modifying these parts of the WRF code

Slide24

Ongoing and Future Work


Performance Improvement of advection module

Analysis of hardware counters to fix branch mispredictions and cache misses

Advection module vectorization for Intel Xeon Phi Coprocessors

Reducing memory footprint by decreasing the number of temporary variables

Exploring performance optimization with different compiler flags

Slide25

Acknowledgements

Davide Del Vento
Rich Loft
Srinath Vadlamani
Dave Gill
Greg Carmichael
All SIParCS admins and staff

Slide26

Microphysics Schemes

Provides atmospheric heat and moisture tendencies
Includes water vapor, cloud, and precipitation processes
Microphysical rates
Surface rainfall

Mielikainen et al. 2014

Slide27

WRF Model Integration Procedure

Begin time step
  Runge-Kutta loop (steps 1, 2, and 3)
    (i) advection, p-grad, buoyancy using ...
    (ii) physics if step 1, save for steps 2 and 3
    (iii) mixing, other non-RK dynamics, save...
    (iv) assemble dynamics tendencies
    Acoustic step loop
      (i) advance U, V, then W
      (ii) time-average U, V, W
    End acoustic loop
    Advance scalars using time-averaged U, V, W
  End Runge-Kutta loop
  Other physics (currently microphysics)
End time step