Deep Learning
The global Python installation does not support deep learning frameworks such as TensorFlow and PyTorch, because the support matrix of framework and CUDA versions is too large to maintain centrally.
If you want to use such frameworks, you should install the desired version yourself in your Conda environment as described below.
Installation
PyTorch
The following steps will create a new virtual environment, activate it, and then install PyTorch with the CUDA backend.
(base) volkerh@ds01:~$ conda create --name pytorch python=3.9
(base) volkerh@ds01:~$ conda activate pytorch
(pytorch) volkerh@ds01:~$ conda install numpy pandas scikit-learn scipy
(pytorch) volkerh@ds01:~$ conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
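To check that the CUDA backend is actually usable, you can run a quick sanity check; it should print True when a GPU is visible to PyTorch:
(pytorch) volkerh@ds01:~$ python -c "import torch; print(torch.cuda.is_available())"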
For more information, see the PyTorch website.
TensorFlow
The following steps will create a new virtual environment, activate it, and then install the most recent version of TensorFlow with the CUDA backend.
(base) volkerh@ds01:~$ conda create --name tf python=3.8
(base) volkerh@ds01:~$ conda activate tf
(tf) volkerh@ds01:~$ conda install tensorflow-gpu
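To check that TensorFlow can see the GPU, you can run a quick sanity check (assuming TensorFlow 2.x); it should print a non-empty device list:
(tf) volkerh@ds01:~$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"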
If you need a specific version of TensorFlow, you can pin it at install time:
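(tf) volkerh@ds01:~$ conda install tensorflow-gpu=<version>
Here <version> is the release you need; which versions are on offer depends on the configured conda channels.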
You can list the available versions with:
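(tf) volkerh@ds01:~$ conda search tensorflow-gpu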
TensorFlow is greedy
By default, TensorFlow allocates all available GPU memory even if it does not actually need it. This is terrible for our shared system. Running
export TF_FORCE_GPU_ALLOW_GROWTH=true
before launching any scripts that use TensorFlow disables this behaviour. Please do so.
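The same behaviour can also be requested from inside Python; a minimal sketch, assuming TensorFlow 2.x, which has to run before the first GPU operation:
import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing everything up front.
# This must run before TensorFlow initialises any GPU devices.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)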
Usage
Shell
You need to activate your virtual environment after you log in, cf. the example below. Replace tf with pytorch (or the name of your environment) as needed.
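(base) volkerh@ds01:~$ conda activate tf
(tf) volkerh@ds01:~$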
Slurm
You also need to activate the environment in job scripts and request a GPU (if you want one), cf. the example below:
volkerh@ds01:/home/volkerh/jobs$ cat tftrain.job
#!/bin/bash
#SBATCH --output /home/volkerh/jobs/tftrain-%j.out
#SBATCH --job-name tftrain
#SBATCH --partition sintef
#SBATCH --ntasks 1
#SBATCH --cpus-per-task=1
#SBATCH --mem=24GB
#SBATCH --gres gpu:a30:1
#SBATCH --time 07-00:00:00
# ENABLE ACCESS TO CONDA ENVIRONMENTS
. "/opt/miniforge3/etc/profile.d/conda.sh"
# ACTIVATE CONDA ENVIRONMENT
conda activate tf
# DON'T BE GREEDY
export TF_FORCE_GPU_ALLOW_GROWTH=true
# RUN THE TRAINING SCRIPT, WRITING A TIMESTAMPED LOG TO THE DATA DIRECTORY
cd /data/volkerh/tftrain
export DATE=$(date +%F_%H%M)
srun python -u /home/volkerh/scripts/tftrain.py > Run_$DATE.log 2>&1
This is based on example 3 from the list of Slurm examples.
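The job is then submitted and monitored with the usual Slurm commands, for example:
volkerh@ds01:/home/volkerh/jobs$ sbatch tftrain.job
volkerh@ds01:/home/volkerh/jobs$ squeue -u volkerh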
No need for many CPU cores
If you run GPU code, you should request at most two CPU cores! (Unless you know very well what you're doing.)
Jupyter
To make the environments above available from Jupyter, you need to install the kernel specification, cf. the example below:
(base) volkerh@ds01:~$ conda activate tf
(tf) volkerh@ds01:~$ conda install ipykernel
(tf) volkerh@ds01:~$ python -m ipykernel install --user --name tf --display-name TensorFlow
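To confirm that the kernel was registered, you can list the installed kernel specifications (the jupyter command should be available once ipykernel is installed):
(tf) volkerh@ds01:~$ jupyter kernelspec list
The same steps register the pytorch environment; just adjust the --name and --display-name arguments.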
See the Jupyter documentation on how to select the right kernel for your notebooks.
Don't use JupyterLab
You should avoid running the GPU versions of TensorFlow or PyTorch from Jupyter. If you absolutely must, shut down your kernel as soon as you are done, so that the GPU memory it holds is released.