Critical Flags, Variables, and Other Important ALCF Minutiae

Uploaded On 2020-08-03


Presentation Transcript

Slide1

Critical Flags, Variables, and Other Important ALCF Minutiae

Jini Ramprakash

Technical Support Specialist

Argonne Leadership Computing Facility

Slide2

Presentation outline

It’s all about your job!
Job management
Job basics
  Submission
  Queuing
  Execution
  Termination
Software environment
Optimization for beginners
ALCF resources, outlined

Argonne Leadership Computing Facility - supported by the Office of Science of the U.S. Department of Energy


Slide3

Job management

Cobalt (the ALCF resource scheduler) is used on all ALCF systems
Similar to PBS but not the same
Find more information at http://trac.mcs.anl.gov/projects/cobalt
Job management commands:
  qsub: submit a job
  qstat: query a job status
  qdel: delete a job
  qalter: alter batched job parameters
  qmove: move job to different queue
  qhold: place queued (non-running) job on hold
  qrls: release hold on job
  showres: show current and future reservations


Slide4

Job basics – submission

Two modes of submitting jobs
  Basic
  Script mode
Get all flags and options by running 'man qsub'
For example:
  qsub -A alchemy -n 40960 --mode c1 -t 720 --env "OMP_NUM_THREADS=4" lead_to_gold
In English: Charge project “Alchemy” for this job. Run on 40960 nodes, with one MPI rank per node. Run for 720 minutes. Set the “OMP_NUM_THREADS” environment variable to 4. Run the “lead_to_gold” binary.
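As a rough illustration of how the pieces of that command line fit together, the sketch below assembles the same argument list in Python and works out the node-hours the request covers. It uses only the flags that appear in the example; the helper names `build_qsub_command` and `node_hours` are made up for this sketch, and whether charges are actually computed in node-hours is an assumption, not something the slide states.

```python
import shlex


def build_qsub_command(project, nodes, mode, minutes, env, binary):
    """Assemble a qsub argument list using only the flags from the slide."""
    return [
        "qsub",
        "-A", project,        # project to charge
        "-n", str(nodes),     # node count
        "--mode", mode,       # c1 = one MPI rank per node
        "-t", str(minutes),   # walltime in minutes
        "--env", env,         # a single NAME=value pair
        binary,               # executable to run
    ]


def node_hours(nodes, minutes):
    """Node-hours covered by the request (nodes x walltime)."""
    return nodes * minutes / 60


cmd = build_qsub_command(
    "alchemy", 40960, "c1", 720, "OMP_NUM_THREADS=4", "lead_to_gold"
)
print(shlex.join(cmd))
print(node_hours(40960, 720))  # 491520.0
```

Building the command as a list rather than a single string avoids shell quoting problems, which is why the quotes around the `--env` value are not needed here.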

Slide5

qsub checks your submission for sanity

Did you specify a node count and walltime? Are they legal?
Is the mode you specified valid?
Did you ask for more than the minimum runtime?
Are you a member of the project you specified? Does that project have a usable allocation?
If so … all systems go! You get a JOBID, and the job goes into the queue
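The checks listed above can be pictured as a simple validation pass. The following is a minimal sketch, not Cobalt's actual implementation: the `Project` class, the `check_submission` function, the minimum runtime of 5 minutes, and the set of valid modes are all assumptions chosen for illustration. Only the node limit (Mira's 49,152 nodes) comes from this presentation.

```python
from dataclasses import dataclass, field

MAX_NODES = 49152                # Mira's node count, per the resources slide
MIN_MINUTES = 5                  # hypothetical minimum runtime
VALID_MODES = {"c1", "script"}   # assumption: only the modes named in these slides


@dataclass
class Project:
    members: set = field(default_factory=set)
    usable_allocation: bool = True


def check_submission(user, project_name, nodes, minutes, mode, projects):
    """Return a list of problems; an empty list means the job can be queued."""
    problems = []
    if nodes is None or minutes is None:
        problems.append("specify both a node count and a walltime")
    else:
        if not 1 <= nodes <= MAX_NODES:
            problems.append("node count is out of range")
        if minutes < MIN_MINUTES:
            problems.append("walltime is below the minimum runtime")
    if mode not in VALID_MODES:
        problems.append("mode is not valid")
    project = projects.get(project_name)
    if project is None or user not in project.members:
        problems.append("user is not a member of the project")
    elif not project.usable_allocation:
        problems.append("project has no usable allocation")
    return problems


# Hypothetical user and project, for illustration only.
projects = {"alchemy": Project(members={"flamel"})}
print(check_submission("flamel", "alchemy", 40960, 720, "c1", projects))
print(check_submission("flamel", "alchemy", 40960, 720, "c9", projects))
```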


Slide6

Not there yet!


Slide7

Job basics - life in the queue

Periodically, your job’s score will increase

Periodically, the scheduler will decide if there are any jobs it wants to run

Check current state with qstat

At some point, your score will be high enough, and it will be YOUR TURN!


Slide8

Score accrual

Large jobs are prioritized

Jobs that have been waiting a long time are prioritized

INCITE/ALCC projects are prioritized

Jobs charged to a negative allocation have a score cap lower than the starting score of other jobs
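To make the interaction of these rules concrete, here is a toy scoring model. It is purely illustrative: the slides do not give Cobalt's real scoring function, so every constant and the `job_score` function itself are hypothetical. The only properties it is meant to reproduce are the four stated above.

```python
# All constants are hypothetical; the slides do not give real values.
STARTING_SCORE = 50.0
NEGATIVE_CAP = 40.0      # must stay below STARTING_SCORE
SIZE_WEIGHT = 1.0        # score per 1,024 nodes requested
WAIT_WEIGHT = 1.0        # score per hour spent in the queue
PROGRAM_BONUS = 100.0    # extra score for INCITE/ALCC projects


def job_score(nodes, hours_waiting, priority_program=False,
              negative_allocation=False):
    """Toy score: grows with job size, wait time, and program priority."""
    score = STARTING_SCORE
    score += SIZE_WEIGHT * nodes / 1024
    score += WAIT_WEIGHT * hours_waiting
    if priority_program:
        score += PROGRAM_BONUS
    if negative_allocation:
        # Capped below the score at which any other job starts.
        score = min(score, NEGATIVE_CAP)
    return score


print(job_score(8192, 1))
print(job_score(512, 1))
print(job_score(49152, 100, negative_allocation=True))
```

In this model a job on a negative allocation can still run, but only when nothing else is eligible, because its capped score never exceeds the score of a freshly submitted job.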

Slide9

Job basics - execution

Book-keeping
  Put a start record in the database
  Output a log file start record
  Send email of job start if --notify was requested
  Start job timers
Fire up the job
  Cobalt boots partition
  runjob starts executable


Slide10

Script mode jobs

All jobs launch via runjob on the service nodes
Script mode jobs launch your script on a special login node
That script is responsible for calling runjob to launch the actual compute-node job
You are charged for the duration of the script
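The last point has a practical consequence: any time the script spends outside runjob (staging files, post-processing) is still charged. A small worked example, assuming charges scale as nodes times elapsed script time (the slide only says the script's duration is charged, so the exact formula is an assumption) and using made-up job sizes and durations:

```python
def charged_node_hours(nodes, script_minutes):
    """Charge based on the whole script's duration, not just the runjob call."""
    return nodes * script_minutes / 60


nodes = 512            # hypothetical job size
runjob_minutes = 48    # time actually spent in runjob
script_minutes = 60    # total script time, including 12 minutes of setup/cleanup

charged = charged_node_hours(nodes, script_minutes)
useful = charged_node_hours(nodes, runjob_minutes)
print(charged)           # 512.0
print(charged - useful)  # 102.4 node-hours spent outside runjob
```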


Slide11

Job basics – termination aka are we there yet?

Your requested wall-time ticks down
Either your runjob returns, or you run out of wall-time and your job is forcibly removed
Job-end cleanup happens
  If your partition wasn’t cleaned up, that happens now
Job-end book-keeping happens
  Database, log file, notify if requested

Slide12

Job basics – Termination, life after your job

If another job depends on yours, it can now be released to run
If your job ended with a non-zero exit code, the dependent job moves to dep_fail instead
That night, the log files will be fed into clusterbank (the ALCF accounting system) to create charges
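The dependency rule can be sketched as a small function. This is an illustration of the rule as stated, not Cobalt code: the `resolve_dependents` name, the job IDs, and the "released" label are hypothetical, while `dep_fail` is the state named on the slide.

```python
def resolve_dependents(exit_code, dependent_job_ids):
    """Map each dependent job to its next state after the parent job ends."""
    # A zero exit code releases dependents; anything else fails them.
    new_state = "released" if exit_code == 0 else "dep_fail"
    return {job_id: new_state for job_id in dependent_job_ids}


print(resolve_dependents(0, [1001, 1002]))
print(resolve_dependents(1, [1001, 1002]))
```

One consequence worth noting: a script-mode job whose script returns a non-zero status for a harmless reason would still push its dependents into dep_fail under this rule.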

Slide13

Non-standard job events

Reservations and/or draining
qsub rejection
Job holds
Job redefinition (qalter)
Job removal (qdel)
Abnormal job failure
Why isn’t this job running?


Slide14

Software environment - SoftEnv

A tool for managing your environment
  Sets your PATH to access desired front-end tools
  Your compiler version can be changed here
Settings:
  Maintained in the file ~/.soft
  Add/remove keywords from ~/.soft to change environment
  Make sure @default is at the very end
Commands:
  softenv: lists all keywords defined on the systems
  resoft: reloads initial environment from ~/.soft file
  soft add|remove keyword: temporarily modify environment by adding/removing keywords

http://www.mcs.anl.gov/hs/software/systems/softenv/softenv-intro.html
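Since a misplaced @default is an easy mistake to make, a quick check of ~/.soft can be useful. The sketch below is one way to do it, under two assumptions: that blank lines and lines starting with `#` can be ignored, and that each remaining line is a single entry. The keyword `+example-compiler` is a placeholder, not a real SoftEnv keyword.

```python
def default_is_last(soft_text):
    """Return True if @default is the final entry of a ~/.soft file."""
    entries = []
    for line in soft_text.splitlines():
        stripped = line.strip()
        # Assumption: blank lines and '#' comment lines carry no settings.
        if stripped and not stripped.startswith("#"):
            entries.append(stripped)
    return bool(entries) and entries[-1] == "@default"


good = "# my environment\n+example-compiler\n@default\n"
bad = "@default\n+example-compiler\n"
print(default_is_last(good))  # True
print(default_is_last(bad))   # False
```

To run it against a real file, read the contents with `pathlib.Path.home().joinpath(".soft").read_text()` and pass the result in.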


Slide15

Software libraries

ALCF supports two sets of libraries:
IBM system and provided libraries: /bgsys/drivers/ppcfloor
  glibc
  mpi
Site supported libraries and programs: /soft/
  PETSc
  ESSL
  And many others
See http://www.alcf.anl.gov/resource-guides/software-and-libraries


Slide16

Compiler wrappers

MPI wrappers for IBM XL cross-compilers:

Wrapper      Thread-Safe Wrapper  Underlying Compiler  Description
mpixlc       mpixlc_r             bgxlc                IBM BG C Compiler
mpixlcxx     mpixlcxx_r           bgxlC                IBM BG C++ Compiler
mpixlf77     mpixlf77_r           bgxlf                IBM BG Fortran 77 Compiler
mpixlf90     mpixlf90_r           bgxlf90              IBM BG Fortran 90 Compiler
mpixlf95     mpixlf95_r           bgxlf95              IBM BG Fortran 95 Compiler
mpixlf2003   mpixlf2003_r         bgxlf2003            IBM BG Fortran 2003 Compiler

MPI wrappers for GNU cross-compilers:

Wrapper  Underlying Compiler         Description
mpicc    powerpc-bgp-linux-gcc       GNU BG C Compiler
mpicxx   powerpc-bgp-linux-g++       GNU BG C++ Compiler
mpif77   powerpc-bgp-linux-gfortran  GNU BG Fortran 77 Compiler
mpif90   powerpc-bgp-linux-gfortran  GNU BG Fortran 90 Compiler

Slide17

Optimization for beginners

Suggested set of optimization levels from least to most optimization:

-O0 # best level for use with a debugger
-O2 # good level for verifying correctness, baseline perf
-O2 -qmaxmem=-1 -qhot=level=0
-O3 -qstrict (preserves program semantics)
-O3
-O3 -qhot=level=1
-O4
-O5
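One way to use this ladder is to build the same source at every level and compare timings and results. The sketch below only generates the compile command lines; it does not run them. The flag sets are taken from the list above and `mpixlc` from the compiler wrappers table, while the source file `app.c`, the output naming scheme, and the `compile_commands` helper are assumptions made for the example.

```python
from pathlib import Path

# Flag sets from the slide, least to most optimization.
OPT_LEVELS = [
    "-O0",
    "-O2",
    "-O2 -qmaxmem=-1 -qhot=level=0",
    "-O3 -qstrict",
    "-O3",
    "-O3 -qhot=level=1",
    "-O4",
    "-O5",
]


def compile_commands(source, compiler="mpixlc"):
    """Return one compile command (as an argument list) per optimization level."""
    stem = Path(source).stem
    commands = []
    for index, flags in enumerate(OPT_LEVELS):
        output = f"{stem}.opt{index}"   # hypothetical naming scheme
        commands.append([compiler, *flags.split(), source, "-o", output])
    return commands


for command in compile_commands("app.c"):
    print(" ".join(command))
```

Keeping one binary per level makes it straightforward to verify correctness against the -O2 build before trusting a faster one.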


Slide18

Optimization tips

-qlistopt generates a listing with all flags used in compilation
-qreport produces a listing, shows how code was optimized
Performance can decrease at higher levels of optimization, especially at -O4 or -O5
May specify different optimization levels for different routines/files

Slide19

ALCF Resources – BG/Q systems

Mira - BG/Q system
  49,152 nodes / 786,432 cores
  786 TB of memory
  Peak flop rate: 10 PF
  Linpack flop rate: 8.1 PF
Cetus (T&D) - BG/Q system
  1,024 nodes / 16,384 cores
  16 TB of memory
  Peak flop rate: 208 TF
Vesta (T&D) - BG/Q system
  2,048 nodes / 32,768 cores
  32 TB of memory
  Peak flop rate: 416 TF
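A few per-node figures follow directly from these numbers. The short calculation below derives cores per node, peak rate per node, and Mira's Linpack efficiency; it uses only the values listed above and treats 1 PF as 1,000 TF and 1 TF as 1,000 GF.

```python
# Figures copied from the slide; peak rates expressed in TF.
SYSTEMS = {
    "Mira":  {"nodes": 49152, "cores": 786432, "peak_tf": 10000.0},
    "Cetus": {"nodes": 1024,  "cores": 16384,  "peak_tf": 208.0},
    "Vesta": {"nodes": 2048,  "cores": 32768,  "peak_tf": 416.0},
}


def cores_per_node(system):
    return system["cores"] // system["nodes"]


def peak_gf_per_node(system):
    return system["peak_tf"] * 1000 / system["nodes"]


for name, system in SYSTEMS.items():
    print(name, cores_per_node(system), round(peak_gf_per_node(system), 1))

# Fraction of Mira's peak achieved on Linpack (8.1 PF of 10 PF).
linpack_efficiency = 8.1 / 10
print(f"{linpack_efficiency:.0%}")  # 81%
```

All three systems come out at 16 cores per node and roughly 203 GF per node, which is consistent with them sharing the same BG/Q node design at different scales.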

Slide20

ALCF Resources – supporting systems

Tukey - NVIDIA system
  100 nodes / 1600 x86 cores / 200 M2070 GPUs
  6.4 TB x86 memory / 1.2 TB GPU memory
  Peak flop rate: 220 TF
Storage
  Scratch: 28.8 PB raw capacity, 240 GB/s bw (GPFS)
  Home: 1.8 PB raw capacity, 45 GB/s bw (GPFS)
  Storage upgrade planned in 2015

Slide21

ALCF Resources

Mira: 48 racks / 768K cores, 10 PF
Cetus (Dev): 1 rack / 16K cores, 208 TF
Vesta (Dev): 2 racks / 32K cores, 416 TF
Tukey (Viz): 100 nodes / 1600 cores, 200 NVIDIA GPUs, 220 TF
Networks: 100Gb (via ESnet, Internet2, UltraScienceNet)

Slide22

Coming up next…

Data Transfers in the ALCF - Robert Scott, ALCF


Slide23

Thank You!

Questions?
