CASTEP on Sulis

  1. Compilation
  2. Running
  3. Performance

These notes are a work in progress and do not represent a final position on how to get best performance from CASTEP on Sulis. We anticipate that most use of CASTEP on Sulis will be via workflows that involve many calculations using a single node (or less). We have not studied multi-node performance in any detail.

Compilation

An out-of-the-box compilation of CASTEP 20.11 against the foss/2020b toolchain will pass all build tests. This has produced the best performance in our testing (see below).

To build CASTEP from source:

[user@login01(sulis) ~]$ cd CASTEP-20.11
[user@login01(sulis) ~]$ module load foss/2020b
[user@login01(sulis) ~]$ export BUILD=fast
[user@login01(sulis) ~]$ export COMMS_ARCH=mpi
[user@login01(sulis) ~]$ export FFT=fftw3
[user@login01(sulis) ~]$ export FFTLIBDIR=$EBROOTFFTW
[user@login01(sulis) ~]$ export MATHLIBS=openblas
[user@login01(sulis) ~]$ export MATHLIBDIR=$EBROOTOPENBLAS
[user@login01(sulis) ~]$ make -e

Job scripts which use CASTEP built in this way should load the foss/2020b module before invoking the castep.mpi executable.

Running

CASTEP benefits from pinning of MPI tasks to CPU cores by rank. We recommend launching CASTEP as per the following example job script.

castep.slurm

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128
#SBATCH --mem-per-cpu=3850
#SBATCH --time=08:00:00
#SBATCH --account=suxxx-somebudget

module purge
module load foss/2020b

srun --cpu-bind=rank castep.mpi <seedname>

Where <seedname> is the usual argument to CASTEP.

Performance

Performance testing has been limited to the Al3x3 (medium) benchmark test. Results when compiled with various compiler/library combinations are below. All tests use FFTW 3.3.8 for Fourier transforms.

toolchain compiler MPI implementation MATHLIBS time (s)
foss/2020b GCC/10.2.0 OpenMPI/4.0.5 OpenBLAS/0.3.12 305.8(1)
gomkl/2019b GCC/8.3.0 OpenMPI/3.1.3 imkl/2019.5.281 1 310(2)
intel/2019b iccifort/2019.5.281 2 impi/2019.7.217 imkl/2019.5.281 1 324.5(5)
iomkl/2019b iccifort/2019.5.281 2 OpenMPI/3.1.4 imkl/2019.5.281 1 334.5(4)
  1. Uses environment variables export MKL_DEBUG_CPU_TYPE=5 ; export MKL_CBWR=COMPATIBLE such that MKL uses appropriate instruction sets for AMD EPYC.
  2. Makefile edited to force OPT_CPU = -march=core-avx2 for use of appropriate instruction set.

Time reported is the “Calculation time” output by the CASTEP internal timer.

Note that Sulis uses turboboost such that performance of the AMD EPYC processors is increased beyond the default clock speed if there is sufficient power and thermal headroom. Performance can vary between nodes and times. All the above measurements were taken on the same node within a few minutes of each other.

At the time of writing, performance when using AOCL for BLAS/LAPACK was significantly worse than anything in the above table.