
Deep Learning

The global Python installation does not provide deep learning frameworks such as TensorFlow and PyTorch due to the exploding support matrix across framework and CUDA versions.

If you want to use such frameworks, you should install the desired version yourself in your Conda environment as described below.

Installation

PyTorch

The following steps will create a new virtual environment, activate it, and then install PyTorch with the CUDA backend.

(base) volkerh@ds01:~$ conda create --name pytorch python=3.9
(base) volkerh@ds01:~$ conda activate pytorch
(pytorch) volkerh@ds01:~$ conda install numpy pandas scikit-learn scipy
(pytorch) volkerh@ds01:~$ conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
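
Once the install finishes, a quick sanity check is to ask PyTorch itself whether the CUDA backend sees a GPU. The following is a hedged sketch: it assumes the pytorch environment above is active, and it degrades gracefully if torch is not installed.

```shell
# Hedged sanity check: reports whether PyTorch is installed
# and, if so, whether its CUDA backend can see a GPU.
python3 - <<'EOF'
import importlib.util

if importlib.util.find_spec("torch") is None:
    print("torch not installed")
else:
    import torch
    print("CUDA available:", torch.cuda.is_available())
EOF
```

If this prints `CUDA available: False` inside a GPU job, the installed build and the driver's CUDA version likely do not match.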

For more information, see the PyTorch website.

TensorFlow

The following steps will create a new virtual environment, activate it, and then install the most recent version of TensorFlow with the CUDA backend.

(base) volkerh@ds01:~$ conda create --name tf python=3.8
(base) volkerh@ds01:~$ conda activate tf
(tf) volkerh@ds01:~$ conda install tensorflow-gpu

If you need a specific version of TensorFlow, you can install it with

(tf) volkerh@ds01:~$ conda install tensorflow-gpu=X.Y

You can list available versions with

(tf) volkerh@ds01:~$ conda search tensorflow-gpu

TensorFlow is greedy

By default, TensorFlow will allocate all available GPU memory even if it does not actually need it. This is terrible for our shared system. Running export TF_FORCE_GPU_ALLOW_GROWTH=true before launching any scripts that use TensorFlow disables this behaviour. Please do so.
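
Concretely, set the variable in the shell you launch from (or in your ~/.bashrc so it is always set); child processes such as your Python script inherit it:

```shell
# Set before starting any TensorFlow script; TensorFlow reads this
# environment variable at import time and child processes inherit it.
export TF_FORCE_GPU_ALLOW_GROWTH=true
echo "$TF_FORCE_GPU_ALLOW_GROWTH"
```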

Usage

Shell

You need to activate your virtual environment after you log in, cf.

(base) volkerh@ds01:~$ conda activate tf

Replace tf with pytorch (or the name of your environment) as needed.

Slurm

You also need to activate the environment in job scripts and request a GPU (if you want one), cf.

volkerh@ds01:/home/volkerh/jobs$ cat tftrain.job
#!/bin/bash
#SBATCH --output /home/volkerh/jobs/tftrain-%j.out
#SBATCH --job-name tftrain
#SBATCH --partition sintef
#SBATCH --ntasks 1
#SBATCH --cpus-per-task=1
#SBATCH --mem=24GB
#SBATCH --gres gpu:a30:1
#SBATCH --time 07-00:00:00

# ENABLE ACCESS TO CONDA ENVIRONMENTS
. "/opt/miniforge3/etc/profile.d/conda.sh"

# ACTIVATE CONDA ENVIRONMENT
conda activate tf

# DON'T BE GREEDY
export TF_FORCE_GPU_ALLOW_GROWTH=true

cd /data/volkerh/tftrain
export DATE=`date +%F_%H%M`
srun python -u /home/volkerh/scripts/tftrain.py > Run_$DATE.log 2>&1

This is based on example 3 from the list of Slurm examples.
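
The last lines of the job script name the log file with a timestamp. To see what Run_$DATE.log expands to, you can run the same two commands by hand:

```shell
# Same timestamp format as in the job script above: YYYY-MM-DD_HHMM
DATE=$(date +%F_%H%M)
echo "Run_${DATE}.log"
```

This gives each run a unique, sortable log name, so repeated submissions do not overwrite each other's output.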

No need for many CPU cores

If you run GPU code, you should request at most two CPU cores! (Unless you know very well what you're doing.)

Jupyter

To make the environments above available from Jupyter, you need to install the kernel specification, cf.

(base) volkerh@ds01:~$ conda activate tf
(tf) volkerh@ds01:~$ conda install ipykernel
(tf) volkerh@ds01:~$ python -m ipykernel install --user --name tf --display-name TensorFlow

See the Jupyter documentation on how to select the right kernel for your notebooks.
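
To confirm the registration worked, you can inspect the user-level kernel directory. This is a hedged check: it assumes the default location ~/.local/share/jupyter/kernels (Jupyter can be configured to use other paths), and it prints a message rather than failing if no user kernels exist yet.

```shell
# List user-installed Jupyter kernel specs, if any exist.
# The path is the ipykernel default for --user installs (an assumption).
python3 - <<'EOF'
import os

d = os.path.expanduser("~/.local/share/jupyter/kernels")
print(sorted(os.listdir(d)) if os.path.isdir(d) else "no user kernels found")
EOF
```

After the ipykernel install above, you should see a tf entry in the list.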

Don't use JupyterLab

You should avoid running the GPU version of TensorFlow or PyTorch from Jupyter. If you absolutely must, kill your kernel when done.
