Slurm, First Steps
Introduction
To run tasks that require a large amount of time (more than a few minutes) and/or computational resources (multiple cores, lots of memory, a GPU), you should submit them as jobs through a scheduler. Examples of such tasks include training large neural networks, retrieving or downloading large amounts of data, or generic processing of large datasets.
On this machine, we have installed Slurm as the workload manager (many others exist).
To use the scheduler, you will have to write a script that contains the following:
- Job metadata, where you specify how much time, memory, and how many CPUs and GPUs the job will require.
- A call to the actual program.
To get started in practice, let's look at a minimal example.
Getting Started
Job Script
Consider the file /home/volkerh/Jobs/Examples/Example_01.slurm, viz.
volkerh@ds01:~$ cd Jobs/Examples/
volkerh@ds01:~/Jobs/Examples$ cat Example_01.slurm
#!/bin/bash
#SBATCH --output /home/volkerh/Jobs/Examples/Example_01-%j.out
#SBATCH --job-name Example_01
#SBATCH --partition sintef
#SBATCH --ntasks 1
#SBATCH --mem=128MB
#SBATCH --cpus-per-task=1
#SBATCH --gres gpu:a30:1
#SBATCH --time 00-00:01:00
srun hostname
This is called a job script; it runs a single program (hostname). Observe that we need to launch the program through something called srun.
Before calling the actual program, we define a number of directives that specify the resource requirements for this program. In particular, we define
- The file to which the output of the program is redirected (#SBATCH --output)
- The name of the job (#SBATCH --job-name)
- The partition to which the job is submitted (#SBATCH --partition)
- The number of tasks (#SBATCH --ntasks) and the number of processor cores per task (#SBATCH --cpus-per-task)
- The maximum allowed execution time (#SBATCH --time)
- A request for a single GPU of type A30 (#SBATCH --gres gpu:a30:1)
- The amount of memory required (#SBATCH --mem)
Exceeding memory
If your job exceeds the requested memory, it is automatically terminated.
Time limit reached
If the time limit is reached, the job is terminated even if the program has not finished.
Request only what you need
If you do not need a GPU, do not request one.
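For example, a CPU-only variant of the job script above simply drops the --gres line (a sketch; the output path and job name here are placeholders you should adapt to your own setup):

```shell
#!/bin/bash
#SBATCH --output /home/volkerh/Jobs/Examples/Example_02-%j.out
#SBATCH --job-name Example_02
#SBATCH --partition sintef
#SBATCH --ntasks 1
#SBATCH --mem=128MB
#SBATCH --cpus-per-task=1
#SBATCH --time 00-00:01:00
# No "#SBATCH --gres" line: this job does not reserve a GPU.
srun hostname
```

Jobs that do not reserve a GPU leave it free for other users and may start sooner.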
What now?
To continue, you should copy this script into a directory you have write access to.
Submit Job
Assuming you've put the script somewhere you have write permissions (and updated the #SBATCH --output line accordingly), you can now submit this test script using the sbatch command, viz.
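A submission might look like the following (the job ID in Slurm's reply is assigned at submission time; the one shown here is hypothetical):

```shell
volkerh@ds01:~/Jobs/Examples$ sbatch Example_01.slurm
Submitted batch job 2683
```

Take note of the job ID; it identifies your job in the queue and appears in the output filename via the %j placeholder.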
Monitor Running Jobs
You can check what jobs are running by issuing the squeue command, viz.
volkerh@ds01:~/Jobs/Examples$ squeue
JOBID USER PARTITION NAME EXEC_HOST ST REASON TIME TIME_LEFT CPUS MIN_MEMORY
2680 volkerh sintef 2018/MEPS/Download/Q3 ds01 R None 1:06:20 6-22:53:40 1 2G
2681 volkerh sintef 2018/MEPS/Download/Q4 ds01 R None 1:04:35 6-22:55:25 1 2G
2682 volkerh sintef 2019/MEPS/Download/Q2 ds01 R None 34:43 3-23:25:17 1 2G
Here, the user volkerh is running three jobs, two with a maximum runtime of seven days and one with a maximum runtime of four days. All jobs have been allocated one CPU and 2 GB of memory.
Inspect Output
Once our super-short script has run, let's look at the output.
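Assuming the job was assigned ID 2683 (a hypothetical value; substitute your own), the %j in the --output directive is replaced by that ID, so inspecting the output might look like this:

```shell
volkerh@ds01:~/Jobs/Examples$ cat Example_01-2683.out
ds01
```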
This matches what we expect -- it's exactly what issuing hostname on the shell returns.