GPU jobs

  1. Single GPU CUDA jobs
  2. Single GPU jobs in Python
    1. GPU accelerated Python packages
    2. Using GPUs directly in Python code
  3. Single node, multi-GPU
    1. Multiprocessing pool with shared GPUs
    2. MPI application with one GPU per task
  4. Multi-node GPU jobs

Sulis contains a number of nodes equipped with Nvidia A100 GPUs. These are the 40GB variant of the A100, connected via PCI Express. Three A100s are installed in each node. GPU nodes are accessed by submitting batch scripts to the gpu partition. Such scripts should request one or more GPUs in their SLURM resource request.

Most GPU jobs will require loading a CUDA environment module to make use of GPU acceleration.

Single GPU CUDA jobs

See the compiling section for information on compiling CUDA C codes on Sulis.

An example GPU program in CUDA C.

A trivial example of a CUDA C program.

cuda_hello.cu

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

int main() {

  printf("Hello world from the host\n");
  printf("Checking for CUDA devices...\n");


  int count;
  cudaError_t err;
  err = cudaGetDeviceCount(&count);
  if ( (count==0) || (err!=cudaSuccess) ) {
    printf("No CUDA supported devices are available in this system.\n");
    exit(EXIT_FAILURE);
  } else {
    printf("Found %d CUDA device(s) in this system\n",count);
  }

  cudaDeviceProp prop;
  int idev;
  for (idev=0;idev<count;idev++) {

    // Call another CUDA helper function to populate prop
    err = cudaGetDeviceProperties(&prop,idev);
    if ( err!=cudaSuccess ) {
      printf("Error getting device properties\n");
      exit(EXIT_FAILURE);
    }

    printf("Device %d : %s\n",idev,prop.name);

  }

  err = cudaGetDevice(&idev);
  if ( err!=cudaSuccess ) {
    printf("Error identifying active device\n");
    exit(EXIT_FAILURE);
  }
  printf("Using device %d\n",idev);

  return(EXIT_SUCCESS);

}

This might be compiled into the executable a.out via:

[user@login01(sulis) ~]$ module load GCC/10.2.0 CUDA/11.1.1
[user@login01(sulis) ~]$ nvcc -arch=sm_80 cuda_hello.cu

The following example job script requests a single GPU via the gpu partition. Each GPU node contains 128 CPUs, which do not divide equally over the 3 A100 GPUs in the node. We therefore recommend that single GPU jobs request 42 CPUs per task.

gpu.slurm

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=42
#SBATCH --mem-per-cpu=3850
#SBATCH --gres=gpu:ampere_a100:1
#SBATCH --partition=gpu
#SBATCH --time=08:00:00
#SBATCH --account=suxxx-somebudget

module purge
module load GCC/10.2.0 CUDA/11.1.1

srun ./a.out

The resource request gres=gpu:ampere_a100:1 specifies that we require a single A100 GPU for the job. The request partition=gpu overrides the default partition and tells SLURM the job should run on the partition consisting of GPU-enabled nodes. Both are required.

In this example, a more complicated program than our cuda_hello.cu might be able to make use of the 42 CPUs/cores by spawning additional threads. Some codes (e.g. LAMMPS) instead benefit from multiple MPI tasks sharing a single GPU, in which case the ntasks-per-node part of the resource request should reflect the desired number of tasks, and cpus-per-task should be reduced such that the total number of CPUs/cores requested is no more than 42.
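
For example, a hypothetical code running six MPI tasks that share the single GPU might adjust the relevant lines of the resource request as in the sketch below; the choice of six tasks is purely illustrative and keeps the total CPU count within the recommended 42.

#SBATCH --ntasks-per-node=6
#SBATCH --cpus-per-task=7
#SBATCH --gres=gpu:ampere_a100:1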

GPU jobs are submitted to SLURM in the usual way.

[user@login01(sulis) ~]$ sbatch gpu.slurm
Submitted batch job 212981

Output from the example program:

[user@login01(sulis) ~]$ cat slurm-212981.out
Hello world from the host
Checking for CUDA devices...
Found 1 CUDA device(s) in this system
Device 0 : A100-PCIE-40GB
Using device 0

Note that our program only identified a single GPU device. Only the one GPU requested in the job script is made available to the job, even though the node contains three.

Single GPU jobs in Python

GPU accelerated Python packages

Some Python codes may make use of GPU acceleration. Most commonly this will be via GPU-accelerated packages such as TensorFlow, PyTorch, Magma and many others. Job scripts should ensure that the appropriate versions of environment modules are loaded, i.e. those which have a CUDA module as a dependency.

A suitable script is below, using TensorFlow as an example.

Python script tf_gpu.py to test for GPU support in TensorFlow.

This trivial script imports the TensorFlow package and checks if the imported build of TensorFlow is built to use GPU acceleration.

tf_gpu.py

import tensorflow as tf

if tf.test.is_built_with_cuda():
    print("Imported TensorFlow package was built with GPU support")
else:
    print("Imported TensorFlow package was NOT built with GPU support")

This can be executed on a GPU node with the following SLURM job script.

tf_gpu.slurm

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=42
#SBATCH --mem-per-cpu=3850
#SBATCH --gres=gpu:ampere_a100:1
#SBATCH --partition=gpu
#SBATCH --time=08:00:00
#SBATCH --account=suxxx-somebudget

module purge
module load GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5
module load TensorFlow/2.5.0 

srun python tf_gpu.py

Note that omitting CUDA/11.1.1 from the first module load command would select a toolchain that is not GPU enabled. The second module load would then provide a build of TensorFlow which is not GPU enabled.
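
To additionally confirm that a GPU is actually visible to TensorFlow at runtime, rather than only checking how the package was built, a short check along the following lines could be appended to tf_gpu.py. This is a minimal sketch using the tf.config API available in TensorFlow 2.x.

import tensorflow as tf

# List the GPU devices visible to TensorFlow at runtime. With the job script
# above this should report the single A100 requested from SLURM.
gpus = tf.config.list_physical_devices('GPU')
print("Visible GPU devices: %s" % gpus)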

Using GPUs directly in Python code

Some workflows may involve GPU-accelerated code written in Python. This may take the form of Python functions executed as kernels on the GPU device using Numba, or drop-in replacements for compute-intensive NumPy and SciPy operations such as those implemented in CuPy. These can be executed in job scripts provided the appropriate packages are loaded as environment modules.
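
As an illustration of the first approach, a minimal Numba kernel might look like the sketch below. This assumes a CUDA-enabled build of Numba is available as an environment module; the kernel and variable names are arbitrary.

from numba import cuda
import numpy as np

@cuda.jit
def add_kernel(x, y, out):
    # Each CUDA thread handles one element of the input arrays
    i = cuda.grid(1)
    if i < x.size:
        out[i] = x[i] + y[i]

n = 1000000
x = np.ones(n, dtype=np.float32)
y = np.ones(n, dtype=np.float32)
out = np.zeros_like(x)

# Launch enough blocks of 128 threads to cover all n elements
threads_per_block = 128
blocks = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](x, y, out)

print(out[:5])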

Example script using the CuPy interface to CUDA, cupy_api.py.

This script replicates the compiled CUDA C example above.

cupy_api.py

import cupy as cp

print("Hello world from the host")
print("Checking for CUDA devices...")

count = cp.cuda.runtime.getDeviceCount()

if count<1:
    print("No CUDA supported devices are available in this system.")
else:
    print("Found %d CUDA device(s) in this system." % count)
    
for idev in range(count):
    prop = cp.cuda.runtime.getDeviceProperties(idev)
    print("Device %d : %s" % (idev, prop['name'].decode()))

idev = cp.cuda.runtime.getDevice()
print("Using device %d" % idev)

The following SLURM job script is suitable for a Python code written to use a single GPU in CuPy. Other packages such as Numba or PyCUDA might be used instead.

cupy.slurm

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=42
#SBATCH --mem-per-cpu=3850
#SBATCH --gres=gpu:ampere_a100:1
#SBATCH --partition=gpu
#SBATCH --time=08:00:00
#SBATCH --account=suxxx-somebudget

module purge
module load GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5
module load CuPy/8.5.0 

srun python cupy_api.py

Here the 42 CPUs/cores available to the job (or some subset thereof) could be used for a multiprocessing pool or a set of workers which all share access to the single GPU. This might be appropriate for workloads in which a set of serial calculations benefit from GPU acceleration but cannot effectively make use of a whole A100.

Single node, multi-GPU

Some scientific packages will support use of multiple GPUs out of the box and handle assignment of GPUs to tasks or CPUs/threads automatically or via user input to the software.

In other cases you may need to provide additional information to srun to indicate how the available GPUs should be shared across the elements of your calculation. Two examples follow, which are not intended to be exhaustive. The examples use CuPy to interact with the GPU for illustrative purposes, but other methods will likely be more appropriate in many cases.

Multiprocessing pool with shared GPUs

This example uses a whole GPU node to create a Python multiprocessing pool of 18 workers which equally share the available 3 GPUs within a node.

Example mp_gpu_pool.py.

This trivial example demonstrates a multiprocessing pool in which the available GPUs are shared equally across the pool. Note that the programmer must calculate which device idev is to be used by which member of the pool procid, and set that device as active for the current process. The function f(i) returns which process in the pool and which GPU was used.

The number of processors available for the pool is set by interrogating the environment variable SLURM_CPUS_PER_TASK.

mp_gpu_pool.py

import sys
import os
import multiprocessing as mp
import cupy as cp

def f(i):

    ngpus = cp.cuda.runtime.getDeviceCount()
    proc  = mp.current_process()
    procid = proc._identity[0]    
    idev   = procid%ngpus # which gpu device to use
    cp.cuda.runtime.setDevice(idev)
    print("proc %s processing input %d using GPU %d" % (proc.name, i, idev))
    return (procid, idev)
     
if __name__ == '__main__':

    p = int(os.environ['SLURM_CPUS_PER_TASK'])
    input_list = range(100)
    
    # Evaluate f for all inputs using a pool of processes
    with mp.Pool(p) as my_pool:
        print(my_pool.map(f, input_list))

In SLURM terminology this is a single task, using 18 CPUs and 3 GPUs. Note that the following script sets cpus-per-task=42 in the resource request so that the pool of 18 processes has the entire RAM of the node available.

gpu_pool.slurm

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=42
#SBATCH --mem-per-cpu=3850
#SBATCH --gres=gpu:ampere_a100:3
#SBATCH --partition=gpu
#SBATCH --time=08:00:00
#SBATCH --account=suxxx-somebudget

module purge
module load GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5
module load CuPy/8.5.0 

srun -n 1 -G 3 -c 18 --cpus-per-gpu=6 python mp_gpu_pool.py

MPI application with one GPU per task

Alternatively you may have an MPI program in which each of 3 tasks can effectively utilise an entire GPU.

An example MPI GPU program in Python mpi_gpu.py.

Here each MPI task uses the GPU with id equal to its rank.

mpi_gpu.py

from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
my_rank = comm.Get_rank()
idev = my_rank

cp.cuda.runtime.setDevice(idev)
prop = cp.cuda.runtime.getDeviceProperties(idev)

print("MPI rank %d using GPU : %s_%d" % (my_rank, prop['name'].decode(),idev))

MPI.Finalize()

mpi_gpu.slurm

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=3
#SBATCH --cpus-per-task=42
#SBATCH --mem-per-cpu=3850
#SBATCH --gres=gpu:ampere_a100:3
#SBATCH --partition=gpu
#SBATCH --time=08:00:00
#SBATCH --account=suxxx-somebudget

module purge
module load GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5
module load CuPy/8.5.0 

srun -n 3 -G 3 --gpus-per-task=1 python mpi_gpu.py

Here each MPI task uses only one of the 42 CPUs allocated per task by SLURM, and one GPU. In other scenarios it might be sensible for each MPI task to use multiple CPUs via threading or spawning of subprocesses. See the hybrid jobs section for more information.

Multi-node GPU jobs

Jobs using more than three GPUs are possible by making a SLURM resource request for multiple nodes in the gpu partition. Users should be aware of the following.

  • Use of multiple GPU nodes may be desirable to increase concurrency when processing a large batch of smaller calculations which collectively constitute a single job/workflow. This might be accomplished in a number of ways, e.g. a loosely coupled MPI application or a Python script which uses Dask to distribute independent calculations over a pool of GPU-enabled resources. See the Ensemble Computing section of this documentation for examples, and the sketch after this list for a rough illustration of the Dask approach.

  • The GPU hardware configuration in Sulis is neither optimised nor intended for workloads which require very high bandwidth communication between multiple GPUs. Other tier 2 services, in particular JADE II, Baskerville or Bede, are much more likely to be appropriate for such workflows.
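
As a rough illustration of the Dask approach mentioned in the first point above, a script along the following lines could farm independent GPU calculations out to a set of Dask workers. This is only a sketch: it assumes a Dask scheduler and GPU-enabled workers have already been started within the job (for example via dask-mpi), and the scheduler file name and workload are hypothetical.

from dask.distributed import Client

def process(n):
    # Independent calculation run on whichever GPU SLURM has made visible
    # to the worker executing this task (hypothetical workload)
    import cupy as cp
    x = cp.arange(n)
    return float(x.sum())

if __name__ == "__main__":
    # Connect to a scheduler started elsewhere in the job; the scheduler
    # file name here is an assumption, not a Sulis convention
    client = Client(scheduler_file="scheduler.json")
    futures = client.map(process, range(100))
    print(client.gather(futures))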