TensorFlow on Sulis

  1. Accessing TensorFlow
  2. Compatibility
  3. Job submission
  4. Performance and containerisation
  5. TensorFlow 1.15

These notes constitute a brief guide to using TensorFlow on Sulis, with emphasis on using the GPU hardware. They may be updated occasionally as newer software is deployed.

Accessing TensorFlow

We strongly recommend using the version of TensorFlow provided by the module system. This has been compiled specifically for the hardware in Sulis and subsequently subjected to verification tests.

To search for an appropriate version of TensorFlow:

[user@login01(sulis) ~]$ module spider TensorFlow

This will list the TensorFlow builds that can be added to your environment, for example TensorFlow/2.5.0. Querying this version specifically will provide information on prerequisite modules that must first be loaded.

[user@login01(sulis) ~]$ module spider TensorFlow/2.5.0

-----------------------------------------------------------------------------
  TensorFlow: TensorFlow/2.5.0
-----------------------------------------------------------------------------
    Description:
      An open-source software library for Machine Intelligence


    You will need to load all module(s) on any one of the lines below before
    the "TensorFlow/2.5.0" module is available to load.

      GCC/10.2.0  CUDA/11.1.1  OpenMPI/4.0.5
      GCC/10.2.0  OpenMPI/4.0.5

In this case there are two sets of possible prerequisite modules. The first includes CUDA and should be used for running TensorFlow on the Sulis GPU nodes. The second is for use on the standard compute nodes.

TensorFlow can hence be added to your environment for GPU computation by loading the following modules:

[user@login01(sulis) ~]$ module purge
[user@login01(sulis) ~]$ module load GCC/10.2.0 CUDA/11.1.1  OpenMPI/4.0.5
[user@login01(sulis) ~]$ module load TensorFlow/2.5.0

Loading a TensorFlow module in this way also adds its prerequisites to your terminal and Python environment. For example, in this case the Python 3.8.6 module is loaded, along with a SciPy-bundle module that provides NumPy 1.19.4, SciPy 1.5.4 and Pandas 1.1.4. There is no need to install these via pip. All dependencies of TensorFlow itself are provided.
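To confirm which package versions are visible once the modules are loaded, something like the following sketch may be useful. It uses only the Python standard library. The helper name installed_version is hypothetical, and the distribution names passed to it are assumptions about how the packages are registered in the module-provided environment.

```python
# Sketch: report the versions of packages visible to the current Python
# interpreter, e.g. after loading a TensorFlow module.
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names below are assumed; adjust if they differ on your system.
    for name in ("tensorflow", "numpy", "scipy", "pandas"):
        version = installed_version(name)
        print(f"{name}: {version if version is not None else 'not found'}")
```

A package reported as "not found" is a candidate for a module or pip installation. Anything already listed does not need to be installed again.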

NOTE: Attempting to use TensorFlow loaded in this way will fail unless running on a GPU-enabled node, either in an interactive session or via a SLURM job script.
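A quick way to check that a session is suitable is to ask TensorFlow how many GPUs it can see. The following is a minimal sketch for TensorFlow 2. The function name visible_gpu_count is hypothetical, and it assumes that a failed import surfaces as an ImportError, which may not cover every failure mode on a node without GPU drivers.

```python
# Sketch: count the GPUs visible to TensorFlow 2 in the current session.
def visible_gpu_count():
    """Return the number of GPUs TensorFlow can see, or None if TensorFlow
    cannot be imported in the current environment."""
    try:
        import tensorflow as tf
    except ImportError:
        # Assumed failure mode, e.g. no TensorFlow module loaded or
        # missing GPU libraries on this node.
        return None
    return len(tf.config.list_physical_devices("GPU"))


if __name__ == "__main__":
    count = visible_gpu_count()
    if count is None:
        print("TensorFlow could not be imported in this environment.")
    else:
        print(f"TensorFlow can see {count} GPU(s).")
```

A count of zero inside a job that requested a GPU suggests that the CUDA-enabled set of prerequisite modules was not loaded.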

Compatibility

Many users will wish to use higher-level software that either depends explicitly on TensorFlow or would ideally run in the same environment. Some such software is already available via the module system and can be listed using module av after loading TensorFlow as described above. For the example of TensorFlow/2.5.0:

[user@login01(sulis) ~]$ module av

The output includes (in this case):

OpenCV/4.5.1-contrib
bokeh/2.2.3
dask/2021.2.0
scikit-learn/0.23.2

along with other modules which can be loaded into the same environment as TensorFlow. Additional software can be installed via pip into this environment. However, we recommend discussing this with your research computing team first, particularly for software that will perform significant computations. Similarly, requests to install additional packages as modules should be directed to your local support contact.

Job submission

The following is an illustrative example of a job script which executes a TensorFlow 2 ResNet50 image classification benchmark on a single GPU when submitted with sbatch from within a clone of the repository at:

https://github.com/aime-team/tf2-benchmarks

tensorflow2.slurm

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=42
#SBATCH --mem-per-cpu=3850
#SBATCH --gres=gpu:ampere_a100:1
#SBATCH --partition=gpu
#SBATCH --time=01:00:00
#SBATCH --account=suxxx-somebudget

module purge
module load GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5
module load TensorFlow/2.5.0

srun python tf2-benchmarks.py --model resnet50 --enable_xla --batch_size 256 --num_gpus 1

Performance with this benchmark should be close to 980 images per second.
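The resource requests and the quoted throughput can be related with some simple arithmetic. The sketch below uses only values from the job script and the benchmark figure above. It assumes that --mem-per-cpu is interpreted in megabytes, which is the SLURM default when no unit suffix is given. The variable names are illustrative only.

```python
# Sketch: arithmetic on the values in the example job script.
cpus_per_task = 42        # from --cpus-per-task
mem_per_cpu_mb = 3850     # from --mem-per-cpu (assumed to be in MB)
batch_size = 256          # from --batch_size
images_per_second = 980   # approximate throughput quoted above

# Total memory requested for the single task in the job.
total_mem_mb = cpus_per_task * mem_per_cpu_mb   # 161700 MB

# Approximate wall-clock time to process one batch at the quoted throughput.
seconds_per_batch = batch_size / images_per_second

print(f"Total memory requested: {total_mem_mb} MB")
print(f"Approximate time per batch: {seconds_per_batch:.3f} s")
```

If the benchmark reports a time per step rather than images per second, a value near 0.26 seconds per batch of 256 corresponds to the throughput quoted above.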

Performance and containerisation

TensorFlow performance critically depends on the underlying CUDA Deep Neural Network library (cuDNN). Full support for the Nvidia A100 GPUs in Sulis is only present in cuDNN version 8 and later. Versions of TensorFlow provisioned via the module system have been built to take advantage of this.

Users running containerised workflows via Singularity should ensure that container images use cuDNN 8 or later where possible.
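One possible way to check a TensorFlow 2 build, inside or outside a container, is to inspect the build information TensorFlow reports about itself. This is a sketch only. The helper names cudnn_major and built_with_cudnn8_or_later are hypothetical, and it assumes a TensorFlow version that provides tf.sysconfig.get_build_info() and reports a cudnn_version entry for GPU builds. It is not expected to work with TensorFlow 1.x.

```python
# Sketch: check whether the TensorFlow 2 build in use was compiled against
# cuDNN 8 or later.
def cudnn_major(version):
    """Extract the major version from a cuDNN version string such as '8' or '8.1.0'."""
    return int(str(version).split(".")[0])


def built_with_cudnn8_or_later():
    """Return True or False for the TensorFlow build in use, or None if the
    information is unavailable (TensorFlow missing, too old, or a CPU-only build)."""
    try:
        import tensorflow as tf
        info = tf.sysconfig.get_build_info()
    except (ImportError, AttributeError):
        return None
    version = info.get("cudnn_version")
    if version is None:
        return None
    return cudnn_major(version) >= 8


if __name__ == "__main__":
    print("cuDNN 8 or later:", built_with_cudnn8_or_later())
```

A result of False for a container image suggests looking for a newer image before running on the A100 GPUs.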

TensorFlow 1.15

For workloads that have not yet migrated to TensorFlow 2, we provide a module for TensorFlow 1.15. This is based on builds from the NVIDIA/tensorflow project, which supports newer hardware and improved libraries for NVIDIA GPU users using TensorFlow 1.x. In particular, these builds use cuDNN 8 for optimal performance on A100 GPUs.

The list of corresponding nvidia-tensorflow modules will be extended from time to time. Use the module spider command to list those available:

[user@login01(sulis) ~]$ module spider nvidia-tensorflow

----------------------------------------------------------------------------------------------------------------------------------------------------
  nvidia-tensorflow:
----------------------------------------------------------------------------------------------------------------------------------------------------
    Description:
      An open-source software library for Machine Intelligence. This version of Tensorflow has been developed by NVidia to support newer versions
      of CUDA (11.x onwards) than was support by the original 1.x release in order to benefit from performance improvements.

     Versions:
        nvidia-tensorflow/1.15.5+nv21.10-Python-3.8.2
        nvidia-tensorflow/1.15.5+nv22.01-Python-3.8.2

Each of these depends on particular versions of the GCC and OpenMPI modules, which can be found with a further module spider command:

[user@login01(sulis) ~]$ module spider nvidia-tensorflow/1.15.5+nv22.01-Python-3.8.2

----------------------------------------------------------------------------------------------------------------------------------------------------
  nvidia-tensorflow: nvidia-tensorflow/1.15.5+nv22.01-Python-3.8.2
----------------------------------------------------------------------------------------------------------------------------------------------------
    Description:
      An open-source software library for Machine Intelligence. This version of Tensorflow has been developed by NVidia to support newer versions
      of CUDA (11.x onwards) than was support by the original 1.x release in order to benefit from performance improvements.


    You will need to load all module(s) on any one of the lines below before the "nvidia-tensorflow/1.15.5+nv22.01-Python-3.8.2" module is available to load.

      GCC/9.3.0  OpenMPI/4.0.3

The module can hence be loaded with:

[user@login01(sulis) ~]$ module purge
[user@login01(sulis) ~]$ module load GCC/9.3.0  OpenMPI/4.0.3
[user@login01(sulis) ~]$ module load nvidia-tensorflow/1.15.5+nv22.01-Python-3.8.2

Note that this also loads CUDA into your environment automatically.

Installing TensorFlow 1.15 from PyPI using pip is not advisable. The PyPI builds are based on cuDNN 7 and give significantly worse performance.

For example, in a ResNet50 classification benchmark:

Build                                 FP32 (images/second)   FP16 (images/second)
PyPI build (cuDNN 7)                  625                    N/A
nvidia-tensorflow/nv22.01 (cuDNN 8)   1172                   2399
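The relative differences between these figures can be worked out as follows. This is plain arithmetic on the numbers quoted above, and the variable names are illustrative only.

```python
# Sketch: relative throughput from the ResNet50 figures quoted above
# (all values in images/second).
pypi_fp32 = 625       # PyPI build, cuDNN 7
nvidia_fp32 = 1172    # nvidia-tensorflow/nv22.01, cuDNN 8
nvidia_fp16 = 2399    # nvidia-tensorflow/nv22.01, cuDNN 8

fp32_speedup = nvidia_fp32 / pypi_fp32      # NVIDIA build vs PyPI build at FP32
fp16_vs_fp32 = nvidia_fp16 / nvidia_fp32    # FP16 vs FP32 within the NVIDIA build

print(f"FP32, NVIDIA build over PyPI build: {fp32_speedup:.2f}x")   # 1.88x
print(f"NVIDIA build, FP16 over FP32: {fp16_vs_fp32:.2f}x")         # 2.05x
```

In this benchmark the NVIDIA build is a little under twice as fast as the PyPI build at FP32, and using FP16 roughly doubles throughput again.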

As with TensorFlow 2, we recommend containerised workflows involving TensorFlow 1 use cuDNN 8 or later.