Slurm, First Steps
Introduction
To run tasks that require a large amount of time (more than a few minutes) and/or computational resources (multiple cores, lots of memory, a GPU), you should submit them as jobs through a scheduler. Examples of such tasks include training large neural networks, retrieving or downloading large amounts of data, or generic processing of large datasets.
On this machine, we have installed Slurm as the workload manager (many others exist).
To use the scheduler, you will have to write a script that contains the following:
- Job metadata, where you specify how much time, memory, and how many CPUs and GPUs the job will require.
- A call to the actual program.
To get started in practice, let's look at a minimal example.
Getting Started
Job Script
Consider the file /home/volkerh/Jobs/Examples/Example_01.slurm, viz.
volkerh@ds01:~$ cd Jobs/Examples/
volkerh@ds01:~/Jobs/Examples$ cat Example_01.slurm
#!/bin/bash
#SBATCH --output /home/volkerh/Jobs/Examples/Example_01-%j.out
#SBATCH --job-name Example_01
#SBATCH --partition sintef
#SBATCH --ntasks 1
#SBATCH --mem=128MB
#SBATCH --cpus-per-task=1
#SBATCH --gres gpu:a30:1
#SBATCH --time 00-00:01:00
srun hostname
This is called a job script; it runs a single program (hostname). Observe that we need to launch the program through something called srun.
Before calling the actual program, we define a number of directives that specify the resource requirements for this program. In particular, we define
- The file to which the output of the program is redirected (#SBATCH --output)
- The name of the job (#SBATCH --job-name)
- The partition to which the job is submitted (#SBATCH --partition)
- The number of tasks (#SBATCH --ntasks) and the number of processor cores per task (#SBATCH --cpus-per-task)
- The maximum allowed execution time (#SBATCH --time)
- A request for a single GPU of type A30 (#SBATCH --gres gpu:a30:1)
- The amount of memory required (#SBATCH --mem)
Exceeding memory
If your job exceeds the requested memory, it is automatically terminated.
Time limit reached
If the time limit is reached, the job is terminated even if the program has not finished.
Request only what you need
If you do not need a GPU, do not request one.
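For example, a CPU-only variant of the job script above simply drops the --gres line (a sketch; the output path and job name here are placeholders you should adapt to your own setup):

```shell
#!/bin/bash
#SBATCH --output /home/volkerh/Jobs/Examples/Example_02-%j.out
#SBATCH --job-name Example_02
#SBATCH --partition sintef
#SBATCH --ntasks 1
#SBATCH --mem=128MB
#SBATCH --cpus-per-task=1
#SBATCH --time 00-00:01:00
# No "#SBATCH --gres" line: this job does not reserve a GPU.
srun hostname
```

Jobs that do not reserve a GPU leave it free for other users and may start sooner.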
What now?
To continue, you should copy this script into a directory you have write access to.
Submit Job
Assuming you've put the script somewhere you have write permissions (and updated the #SBATCH --output line accordingly), you can now submit this test script using the sbatch command, viz.
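A submission might look like the following (the job ID in Slurm's reply is assigned at submission time; the one shown here is hypothetical):

```shell
volkerh@ds01:~/Jobs/Examples$ sbatch Example_01.slurm
Submitted batch job 2683
```

Take note of the job ID; it identifies your job in the queue and appears in the output filename via the %j placeholder.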
Monitor Running Jobs
You can check what jobs are running by issuing the squeue command, viz.
volkerh@ds01:~/Jobs/Examples$ squeue
JOBID USER PARTITION NAME EXEC_HOST ST REASON TIME TIME_LEFT CPUS MIN_MEMORY
2680 volkerh sintef 2018/MEPS/Download/Q3 ds01 R None 1:06:20 6-22:53:40 1 2G
2681 volkerh sintef 2018/MEPS/Download/Q4 ds01 R None 1:04:35 6-22:55:25 1 2G
2682 volkerh sintef 2019/MEPS/Download/Q2 ds01 R None 34:43 3-23:25:17 1 2G
Here, the user volkerh is running three jobs, two with a maximum runtime of seven days and one with a maximum runtime of four days. All jobs have been allocated one CPU and 2 GB of memory.
Inspect Output
Once our super-short script has run, let's look at the output.
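Assuming the job was assigned ID 2683 (a hypothetical value; substitute your own), the %j in the --output directive is replaced by that ID, so inspecting the output might look like this:

```shell
volkerh@ds01:~/Jobs/Examples$ cat Example_01-2683.out
ds01
```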
This matches what we expect -- it's exactly what issuing hostname on the shell returns.