Intermediate SCC Usage
Research Computing Services
Katia Oleinik (koleinik@bu.edu)
Shared Computing Cluster

Shared - a transparent multi-user and multi-tasking environment
Computing - a heterogeneous environment: interactive jobs; single-processor and parallel jobs; graphics jobs
Cluster - a set of computers connected via a fast local area network; a job scheduler coordinates the workload on each node
Shared Computing Cluster

[Photo: rear view of the cluster, showing the compute nodes and the Infiniband and Ethernet interconnects]
SCC resources

Processors: Intel and AMD
CPU architecture: nehalem, sandybridge, ivybridge, bulldozer, haswell, broadwell
Ethernet connection: 1 or 10 Gbps
Infiniband: EDR, FDR, QDR (or none)
GPUs: NVIDIA Tesla P100, K40m, M2070 and M2050
Number of cores: 8, 12, 16, 20, 28, 36, 64
Memory (RAM): 24 GB to 1 TB
Scratch disk: 244 GB to 886 GB

Technical Summary: http://www.bu.edu/tech/support/research/computing-resources/tech-summary/
SCC organization

Around 900 nodes with ~12,000 CPUs and ~200 GPUs, plus ~3.4 PB of storage.

[Diagram: login nodes (SCC1, SCC2, GEO, SCC4) face the public network; the compute nodes and file storage sit on the private network]
SCC general limits

All login nodes are limited to 15 min. of CPU time
Default wall-clock time limit for all jobs - 12 hours
Maximum number of processors - 1000
SCC general limits

1-processor job (batch or interactive) - 720 hours
omp job (16 processors or less) - 720 hours
mpi job (multi-node job) - 120 hours
gpu job - 48 hours
interactive graphics job (virtual GL) - 48 hours
SCC login nodes

Login nodes are designed for light work:
- text editing
- light debugging
- program compilation
- file transfer
Service models - shared and buy-in

Shared: paid for by BU and university-wide grants; free to the entire BU Research Computing community (~45% of nodes).
Buy-In: purchased by individual faculty or research groups through the Buy-In program, with priority access for the purchaser (~55% of nodes).
SCC compute nodes

Buy-in nodes:
- All buy-in nodes have a hard limit of 12 hours for non-member jobs. The time limit for group-member jobs is set by the PI of the group.
- Currently, more than 50% of all nodes are buy-in nodes. Requesting a time limit larger than 12 hours automatically excludes all buy-in nodes from the available resources (see the example below).
- Nodes in a buy-in queue stop accepting new non-member jobs whenever a member of the owning group has a job submitted or running anywhere on the cluster.
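For example, the requested run time determines whether buy-in nodes remain eligible (a sketch; myscript.qsub is a placeholder name):

# Within the 12-hour hard limit: both shared and buy-in nodes are eligible
scc1 % qsub -l h_rt=12:00:00 myscript.qsub
# Longer than 12 hours: only shared nodes will be considered
scc1 % qsub -l h_rt=24:00:00 myscript.qsub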
SCC: running jobs

Types of jobs:
- Interactive job - running an interactive shell: run GUI applications, code debugging, benchmarking of serial and parallel code performance
- Interactive graphics job - for running interactive software with advanced graphics
- Batch job - execution of the program without manual intervention
SCC: interactive jobs

                                                             qsh   qlogin/qrsh
X-forwarding is required                                      ✓        —
Session is opened in a separate window                        ✓        —
Allows for a graphics window to be opened by a program        ✓        ✓
Current environment variables can be passed to the session    ✓        —
Batch-system environment variables ($NSLOTS, etc.) are set    ✓        —
SCC: running interactive jobs (qrsh)

"qrsh" - request from the queue (q) a remote (r) shell (sh):

[koleinik@scc2 ~]$ qrsh -P myproject
[koleinik@scc-pi4 ~]$

Typical uses: interactive shell, GUI applications, code debugging, benchmarking.
SCC: running interactive jobs

Request appropriate resources for the interactive job:
- Some software (like MATLAB or STATA-MP) might use multiple cores.
- Make sure to request enough resources if the program needs more than 8 GB of memory or runs longer than 12 hours (see the sketch below).
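For example, an interactive session with four slots and a 24-hour run time could be requested as follows (a sketch; myproject is a placeholder project name):

scc1 % qrsh -P myproject -pe omp 4 -l h_rt=24:00:00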
SCC: interactive graphics jobs (qvgl)

The majority of graphical applications perform well under VNC. A qvgl job is required only for applications that use OpenGL for 3D hardware acceleration:
- fMRI and similar applications (freesurfer, freeview, SPM, MNE, ...)
- molecular modeling (gview, VMD, PyMOL, maestro, ...)

This job type combines dedicated GPU resources with VNC.
SCC: submitting batch jobs

Using the -b y option (the command is treated as a binary/executable rather than a script):
scc1 % qsub -b y cal -y

Using a script:
scc1 % qsub <script_name>
SCC: batch jobs

Script organization:

#!/bin/bash -l
# The -l option executes a login shell, for proper interpretation of the module commands.

# Scheduler directives:
# Time limit
#$ -l h_rt=12:00:00
# Project name
#$ -P krcs
# Send email report at the end of the job
#$ -m e
# Job name
#$ -N myjob

# Commands to execute:
# Load modules
module load R/R-3.2.3
# Run the program
Rscript my_R_program.R
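Assuming the script above is saved as myjob.qsub (a placeholder name), it is submitted with:

scc1 % qsub myjob.qsub

By default the scheduler writes the job's stdout to a file named <job_name>.o<jobID> (and stderr to <job_name>.e<jobID>, unless -j y is set).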
SCC: requesting resources (job options)

General directives:

-l h_rt=hh:mm:ss
    Hard run-time limit in hh:mm:ss format. The default is 12 hours.
-P project_name
    Project to which this job is to be assigned. This directive is mandatory for all users associated with any Med. Campus project.
-N job_name
    Specifies the job name. The default is the script or command name.
-o outputfile
    File name for the stdout output of the job.
-e errfile
    File name for the stderr output of the job.
-j y
    Merge the error and output stream files into a single file.
-m b|e|a|s|n
    Controls when the batch system sends email to you. The possible values are: when the job begins (b), ends (e), is aborted (a), is suspended (s), or never (n) - the default.
-M user_email
    Overrides the default email address used to send the job report.
-V
    Export all current environment variables to the batch job.
-v env=value
    Set the runtime environment variable env to value.
-hold_jid job_list
    Set up a job dependency list. job_list is a comma-separated list of job IDs and/or job names which must complete before this job can run. See Advanced Batch System Usage for more information.
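A typical script header combines several of these directives, e.g. (a sketch; the names and email address are placeholders):

#!/bin/bash -l
#$ -P myproject
#$ -N analysis
#$ -j y
#$ -o analysis.log
#$ -m e
#$ -M user@bu.edu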
SCC: requesting resources (job options)

Directives to request SCC resources:

-l h_rt=hh:mm:ss
    Hard run-time limit in hh:mm:ss format. The default is 12 hours.
-l mem_total=#G
    Request a node that has at least this amount of memory. Current possible choices include 94G, 125G, 252G (and 504G, for Med. Campus users only).
-l mem_per_core=#G
    Request a node that has at least this amount of memory per core.
-l cpu_arch=ARCH
    Select a processor architecture (sandybridge, nehalem, ...). See the Technical Summary for all available choices.
-l cpu_type=TYPE
    Select a processor type (E5-2670, E5-2680, X5570, X5650, X5670, X5675). See the Technical Summary for all available choices.
-l gpus=G/C
    Request a node with GPUs. G/C specifies the number of GPUs per CPU requested and should be expressed as a decimal number. See Advanced Batch System Usage for more information.
-l gpu_type=GPUMODEL
    Current choices for GPUMODEL are M2050, M2070 and K40m.
-pe omp N
    Request multiple slots for shared-memory applications (OpenMP, pthreads). This option can also be used to reserve a larger amount of memory for the application. N can vary from 1 to 16.
-pe mpi_#_tasks_per_node N
    Select multiple nodes for an MPI job. The number of tasks per node can be 4, 8, 12 or 16, and N must be a multiple of this value. See Advanced Batch System Usage for more information.
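For example, a shared-memory job that needs one GPU on a K40m node could combine these options as follows (a sketch; the G/C ratio is GPUs divided by CPU slots):

# Four CPU slots
#$ -pe omp 4
# One GPU shared across the four slots: G/C = 1/4 = 0.25
#$ -l gpus=0.25
# Restrict to K40m nodes
#$ -l gpu_type=K40m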
SCC: requesting resources (job options)

Directives to request SCC resources (continued):

-l eth_speed=1
    Ethernet speed (1 or 10 Gbps).
-l mem_free=#G
    Request a node that has at least this amount of free memory. Note that the amount of free memory changes!
-l scratch_free=#G
    Request a node that has at least this amount of available disk space in scratch.

List the various resources that can be requested:
scc1 % qconf -sc
scc1 % man qstat
SCC: tracking the jobs

Check the status of batch jobs:
scc1 % qstat -u <userID>

List only running jobs:
scc1 % qstat -u <userID> -s r

Get job information:
scc1 % qstat -j <jobID>

Display the resources requested by the jobs:
scc1 % qstat -u <userID> -r
SCC: tracking the jobs

scc1 % qstat -j 596557
job_number:            596557
exec_file:             job_scripts/596557
submission_time:       Mon Sep 11 10:11:04 2017
owner:                 koleinik
sge_o_home:            /usr1/scv/koleinik
sge_o_log_name:        koleinik
sge_o_path:            /usr/java/default/jre/bin:/usr/java/default/bin:/usr/lib64/...
sge_o_shell:           /bin/bash
sge_o_workdir:         /projectnb/krcs/projects/
sge_o_host:            scc4
account:               sge
cwd:                   /projectnb/krcs/projects/chamongrp
merge:                 y
hard resource_list:    no_gpu=TRUE,h_rt=172800
soft resource_list:    buyin=TRUE
mail_options:          ae
mail_list:             koleinik@scc4.bu.edu
notify:                FALSE
job_name:              sim
jobshare:              0
env_list:              PATH=/usr/java/default/jre/bin:/usr/java/default/bin
script_file:           job.qsub
parallel environment:  omp16 range: 16
project:               krcs
usage 1:               cpu=00:13:38, mem=813.90147 GBs, io=0.01024, vmem=1.013G, maxvmem=1.013G
scheduling info:       (Collecting of scheduler job information is turned off)
SCC: tracking the jobs

1. Log in to the compute node:
scc1 % ssh scc-ca1

2. Run the top command:
scc1 % top -u <userID>

The top command gives a listing of your running processes as well as memory and CPU usage.

3. Exit from the compute node:
scc1 % exit
SCC: completed jobs report (qacct)

qacct - query the accounting system.

Query a job by ID:
scc1 % qacct -j 596557

Query jobs by time of execution (-d: number of days; -o: job owner):
scc1 % qacct -j -d 3 -o koleinik
SCC: completed jobs report (qacct)

qname         p100
hostname      scc-c11.scc.bu.edu
group         scv
owner         koleinik
project       krcs
jobname       myjob
jobnumber     551947
qsub_time     Wed Sep 6 20:08:56 2017
start_time    Wed Sep 6 20:09:37 2017
end_time      Wed Sep 6 23:32:29 2017
granted_pe    NONE
slots         1
failed        0
exit_status   0
cpu           11232.780
mem           611514.460
io            14.138
iow           0.000
maxvmem       71.494G
arid          undefined
SCC: node architecture

Login nodes: Broadwell architecture, 28 cores.

Many compute nodes have older architectures. As a result, programs compiled on a login node with the Intel or PGI compilers using certain optimization options might fail when run on a compute node with a different architecture:

http://www.bu.edu/tech/support/research/software-and-programming/programming/compilers/intel-compiler-flags/
http://www.bu.edu/tech/support/research/software-and-programming/programming/compilers/pgi-compiler-flags/
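One workaround is to restrict such a job to nodes matching the architecture the code was compiled for, using the cpu_arch option from the directives table (a sketch):

#$ -l cpu_arch=broadwell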
My job failed… WHY?
SCC: job analysis

If the job ran with the "-m e" flag, an email report is sent at the end of the job:

Job 7883980 (smooth_spline) Complete
User            = koleinik
Queue           = p-int@scc-pi2.scc.bu.edu
Host            = scc-pi2.scc.bu.edu
Start Time      = 08/29/2015 13:18:02
End Time        = 08/29/2015 13:58:59
User Time       = 01:05:07
System Time     = 00:03:24
Wallclock Time  = 00:40:57
CPU             = 01:08:31
Max vmem        = 6.692G
Exit Status     = 0
SCC: job analysis

The default time limit for interactive and non-interactive jobs on the SCC is 12 hours. Make sure you request enough time for your application to complete; this job was killed when it hit the limit:

Job 9022506 (myJob) Aborted
Exit Status     = 137
Signal          = KILL
User            = koleinik
Queue           = b@scc-bc3.scc.bu.edu
Host            = scc-bc3.scc.bu.edu
Start Time      = 08/18/2014 15:58:55
End Time        = 08/19/2014 03:58:56
CPU             = 11:58:33
Max vmem        = 4.324G
failed assumedly after job because:
job 9022506.1 died through signal KILL (9)
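The fix is to request a hard run-time limit long enough for the job, e.g.:

#$ -l h_rt=24:00:00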
SCC: job analysis

The memory (RAM) varies from node to node: some nodes have only 3 GB of memory per slot, while others have up to 28 GB per slot. It is important to know how much memory the program needs and to request appropriate resources. This job exceeded the node's memory:

Job 1864070 (myBigJob) Complete
User            = koleinik
Queue           = linga@scc-kb8.scc.bu.edu
Host            = scc-kb8.scc.bu.edu
Start Time      = 10/19/2014 15:17:22
End Time        = 10/19/2014 15:46:14
User Time       = 00:14:51
System Time     = 00:06:59
Wallclock Time  = 00:28:52
CPU             = 00:27:43
Max vmem        = 207.393G
Exit Status     = 137

Show the RAM of a node:
scc1 % qhost -h scc-kb8
SCC: job analysis

Currently, the SCC has nodes with:

16 cores & 128 GB = 8 GB/slot       20 cores & 128 GB ~ 6 GB/slot
16 cores & 256 GB = 16 GB/slot      20 cores & 256 GB ~ 12 GB/slot
12 cores &  48 GB = 4 GB/slot       28 cores & 256 GB ~ 9 GB/slot
 8 cores &  24 GB = 3 GB/slot       28 cores & 512 GB ~ 18 GB/slot
 8 cores &  96 GB = 12 GB/slot      36 cores & 1 TB   ~ 28 GB/slot
64 cores & 256 GB = 4 GB/slot
64 cores & 512 GB = 8 GB/slot

(The largest-memory configurations are available only to Med. Campus users.)
SCC: job analysis

Example: a single-processor job needs 20 GB of memory.

# Request a node with enough memory per core
#$ -l mem_per_core=8G
# Request enough slots (3 slots x 8 GB/core = 24 GB >= 20 GB)
#$ -pe omp 3

http://www.bu.edu/tech/support/research/system-usage/running-jobs/batch-script-examples/#MEMORY
SCC: job analysis

Example: a single-processor job needs 200 GB of memory.

# Request a node with enough memory per core
#$ -l mem_per_core=16G
# Request enough slots (16 slots x 16 GB/core = 256 GB >= 200 GB)
#$ -pe omp 16

http://www.bu.edu/tech/support/research/system-usage/running-jobs/batch-script-examples/#LARGEMEMORY
SCC: job analysis

Job 1864070 (myParJob) Complete
User            = koleinik
Queue           = budge@scc-hb2.scc.bu.edu
Host            = scc-hb2.scc.bu.edu
Start Time      = 11/29/2014 00:48:27
End Time        = 11/29/2014 01:33:35
User Time       = 02:24:13
System Time     = 00:09:07
Wallclock Time  = 00:45:08
CPU             = 02:38:59
Max vmem        = 78.527G
Exit Status     = 137

Some applications try to detect the number of cores and parallelize if possible; one common example is MATLAB. Note that the CPU time above is much larger than the wall-clock time, a sign of hidden parallelism. Always read the documentation and the options available for your application, and either disable parallelization or request additional cores. If the program does not let you control the number of cores it uses, request the whole node.
SCC: job analysis

Example: by default, MATLAB will use all available cores.

# Start MATLAB using the single-computational-thread option:
matlab -nodisplay -singleCompThread -r "n=4, rand(n), exit"
SCC: job analysis

Example: running the MATLAB Parallel Computing Toolbox.

# Request 4 cores:
#$ -pe omp 4

matlab -nodisplay -r "matlabpool open 4, s=0; parfor i=1:n, s=s+i; end, matlabpool close, s, exit"
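Note that matlabpool was removed in newer MATLAB releases; with a recent version the equivalent session uses parpool (a sketch, with the loop bound n replaced by a concrete value):

matlab -nodisplay -r "parpool(4), s=0; parfor i=1:10, s=s+i; end, s, delete(gcp), exit"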
SCC: job analysis

Information about past jobs can be retrieved using the qacct command.

Information about all the jobs that ran in the past several days:
scc1 % qacct -o <userID> -d <number of days> -j

Information about a particular job:
scc1 % qacct -j <jobID>
SCC: quota and project quotas

My job used to run fine and now it fails… Why? Check that you have not run out of disk space.

Check your disk usage in the home directory:
scc1 % quota -s

Check the disk usage of your project:
scc1 % pquota -u <project name>
SCC: SU usage

Use acctool to get information about SU (service unit) usage.

My project(s) total usage on all hosts yesterday (short form):
scc1 % acctool y

My project(s) total usage on shared nodes from 1/01/15 through yesterday:
scc1 % acctool -host shared -b 1/01/15 y

My balance for the project scv:
scc1 % acctool -p scv -balance -b 1/01/15 y

My balance for all the projects I belong to:
scc1 % acctool -b y
My job is too slow… How can I speed it up?
SCC: optimization

Before you look into parallelization of your code, optimize it! Parallelized inefficient code is still inefficient code.

There are a number of well-known optimization techniques in every language, and there are also some specifics to running code on the cluster. Several compiler versions are available on the SCC:
GCC (4.8.1, 4.9.2, 5.1.0, 5.3.0)
PGI (13.5, 16.5)
Intel (2015, 2016)
SCC: optimization - I/O

- Reduce the number of I/O operations to the home directory and project space where possible
- Group smaller I/O statements into larger ones where possible
- Utilize the local /scratch space (see the sketch below)
- Optimize the seek pattern to reduce the time spent waiting for disk seeks
- If possible, read and write numerical data in a binary format
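A common pattern is to stage data on the node-local scratch disk for the duration of the job. A minimal sketch, assuming the input lives in project space and that the scheduler sets $TMPDIR to a per-job directory on local scratch (all paths and program names are placeholders):

#!/bin/bash -l
# Copy input from project space to node-local scratch
cp /projectnb/myproject/data/input.dat $TMPDIR/
cd $TMPDIR
# Run the program against the local copy
./my_program input.dat > output.dat
# Copy results back to project space before the job ends
cp output.dat /projectnb/myproject/results/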
SCC: optimization

- Many languages allow operations on whole vectors/matrices; use them instead of explicit loops
- Pre-allocate arrays before accessing them within loops
- Reuse variables when possible, and delete those that are no longer needed
- Access array elements according to the storage order of the language (FORTRAN, MATLAB, R: by column; C, C++: by row)

Email SCC (help@scc.bu.edu): the members of our group will be happy to assist you with tips on how to improve the performance of your code for a specific language or application.
SCC: code development and debugging

Integrated development environments (IDEs):
codeblocks, geany, eclipse

Debuggers:
gdb, ddd, TotalView, OpenSpeedShop
SCC: parallelization

- Running multiple jobs (tasks) simultaneously
- OpenMP/multithreaded jobs (use some or all of the cores on one node)
- MPI (uses multiple cores, possibly across a number of nodes)
- GPU parallelization

SCC tutorials: there are a number of tutorials that cover various parallelization techniques in R, MATLAB, C and FORTRAN.
SCC: parallelization

Copy the simple examples. The examples can be found online:
http://www.bu.edu/tech/support/research/system-usage/running-jobs/advanced-batch/
http://scv.bu.edu/examples/SCC/

Copy the examples to the current directory:
scc1 % cp /project/scv/examples/SCC/depend .
scc1 % cp /project/scv/examples/SCC/many .
scc1 % cp /project/scv/examples/SCC/par .
SCC: array jobs

An array job executes independent copies of the same job script. The number of tasks to be executed is set using the -t option to the qsub command, e.g.:

scc1 % qsub -t 1-10 <my_script>

The above command submits an array job consisting of 10 tasks, numbered from 1 to 10. The batch system sets the SGE_TASK_ID environment variable, which can be used inside the script to pass the task ID to the program:

#!/bin/bash -l
Rscript my_R_program.R $SGE_TASK_ID
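For example, $SGE_TASK_ID can select a per-task input file (a sketch, assuming inputs named input_1.txt through input_10.txt):

#!/bin/bash -l
#$ -t 1-10
Rscript my_R_program.R input_${SGE_TASK_ID}.txt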
SCC: job dependency

Some jobs may be required to run in a specific order. For this application, job dependencies can be controlled using the -hold_jid option:

scc1 % qsub -N job1 script1
scc1 % qsub -N job2 -hold_jid job1 script2
scc1 % qsub -N job3 -hold_jid job2 script3

A job might need to wait until all the other jobs in a group have completed (post-processing). In this example, lastJob won't start until job1, job2, and job3 have completed:

scc1 % qsub -N job1 script1
scc1 % qsub -N job2 script2
scc1 % qsub -N job3 script3
scc1 % qsub -N lastJob -hold_jid "job*" script4
SCC: links

Research Computing website: http://www.bu.edu/tech/support/research/
RCS software: http://sccsvc.bu.edu/software/
RCS examples: http://rcs.bu.edu/examples/
RCS tutorial evaluation: http://scv.bu.edu/survey/tutorial_evaluation.html

Please contact us at help@scc.bu.edu if you have any problems or questions.
SCC: appendix (qstat)

qstat -u user-id              All current jobs submitted by the user user-id
qstat -s r                    List of running jobs
qstat -s p                    List of pending jobs (hw, hqw, Eqw, ...)
qstat -u user-id -r           Display the resources requested by the jobs
qstat -u user-id -s r -t      Display info about sub-tasks of parallel jobs
qstat -explain c -j job-id    Display job status
qstat -g c                    Display the list of queues and load information
qstat -q queue                Display jobs running in a particular queue
SCC: appendix (qselect)

qselect -pe omp 16            List all nodes that can execute a 16-processor job
qselect -l mem_total=252G     List all large-memory nodes
qselect -pe mpi16             List all nodes that can run 16-slot MPI jobs
qselect -l gpus=1             List all nodes with GPUs
SCC: appendix (qdel)

qdel job-id                   Delete the job job-id
qdel -u user-id               Delete all the jobs submitted by the user
SCC: appendix (qhost)

qhost -q                      Display the queues hosted by each host
qhost -j                      Display the jobs running on each host
qhost -F                      Display info about each node