These notes are a work in progress and do not represent a final position on how to get best performance from CASTEP on Sulis. We anticipate that most use of CASTEP on Sulis will be via workflows that involve many calculations using a single node (or less). We have not studied multi-node performance in any detail.
An out-of-the-box compilation of CASTEP 20.11 against the foss/2020b toolchain will pass all build tests. This has produced the best performance in our testing (see below).
To build CASTEP from source:
[user@login01(sulis) ~]$ cd CASTEP-20.11
[user@login01(sulis) ~]$ module load foss/2020b
[user@login01(sulis) ~]$ export BUILD=fast
[user@login01(sulis) ~]$ export COMMS_ARCH=mpi
[user@login01(sulis) ~]$ export FFT=fftw3
[user@login01(sulis) ~]$ export FFTLIBDIR=$EBROOTFFTW
[user@login01(sulis) ~]$ export MATHLIBS=openblas
[user@login01(sulis) ~]$ export MATHLIBDIR=$EBROOTOPENBLAS
[user@login01(sulis) ~]$ make
Job scripts which use CASTEP built in this way should load the foss/2020b module before invoking the CASTEP executable.
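The build settings above take FFTLIBDIR and MATHLIBDIR from variables that the module system defines when foss/2020b is loaded. A minimal sketch of a sanity check for this is given below. The function name check_build_env is hypothetical and not part of CASTEP or Sulis, and the check only confirms that the two variables exist in the environment, not that they point at valid installations.

```shell
# Hypothetical helper: confirm that the EasyBuild root variables used for
# FFTLIBDIR and MATHLIBDIR are present in the environment.
check_build_env() {
    status=0
    for var in EBROOTFFTW EBROOTOPENBLAS; do
        # printenv exits non-zero when the named variable is not exported.
        if ! printenv "$var" > /dev/null; then
            echo "$var is not set; has foss/2020b been loaded?" >&2
            status=1
        fi
    done
    return $status
}

# Example use before building:
#   check_build_env && make
```

The same check can be placed near the top of a job script to catch a missing module load before the run starts.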
CASTEP benefits from pinning of MPI tasks to CPU cores by rank. We recommend launching CASTEP as per the following example job script.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128
#SBATCH --mem-per-cpu=3850
#SBATCH --time=08:00:00
#SBATCH --account=suxxx-somebudget

module purge
module load foss/2020b

srun --cpu-bind=rank castep.mpi <seedname>
<seedname> is the usual argument to CASTEP, i.e. the common prefix of the <seedname>.cell and <seedname>.param input files.
Performance testing has been limited to the Al3x3 (medium) benchmark test. Results when compiled with various compiler/library combinations are below. All tests use FFTW 3.3.8 for Fourier transforms.
| toolchain | compiler | MPI implementation | MATHLIBS | time (s) |
| intel/2019b | iccifort/2019.5.281 [2] | impi/2019.7.217 | imkl/2019.5.281 [1] | 324.5(5) |
| iomkl/2019b | iccifort/2019.5.281 [2] | OpenMPI/3.1.4 | imkl/2019.5.281 [1] | 334.5(4) |
1. Uses environment variables export MKL_DEBUG_CPU_TYPE=5 ; export MKL_CBWR=COMPATIBLE such that MKL uses appropriate instruction sets for AMD EPYC.
2. Makefile edited to force OPT_CPU = -march=core-avx2 for use of appropriate instruction set.
Time reported is the “Calculation time” output by the CASTEP internal timer.
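The timer value can be extracted from a run's output with standard tools. The sketch below assumes the timing line in the .castep file has the form "Calculation time    =    324.50 s". The file name example.castep and its contents are made up for illustration and written inline so that the snippet runs on its own.

```shell
# Stand-in output file (hypothetical contents, for illustration only).
cat > example.castep <<'EOF'
Initialisation time =      2.10 s
Calculation time    =    324.50 s
Finalisation time   =      0.05 s
Total time          =    326.65 s
EOF

# Keep the value from the last matching line, in case the file holds the
# output of more than one run. The number is the second-to-last field,
# immediately before the trailing unit "s".
calc_time=$(awk '/Calculation time/ { t = $(NF - 1) } END { print t }' example.castep)
echo "$calc_time"
```

If the timing line in your CASTEP version is laid out differently, the awk pattern and field index will need adjusting.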
Note that Sulis has CPU frequency boosting enabled, so the AMD EPYC processors run above their base clock speed when there is sufficient power and thermal headroom. Performance can therefore vary between nodes and over time. All the above measurements were taken on the same node within a few minutes of each other.
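Because of this variability, a single timing is of limited use when comparing builds. One possible way to summarise repeated runs is the mean and the standard error of the mean, sketched below. This is an assumption about a reasonable procedure, not a description of how the figures in the table were obtained, and the three times listed are made-up values rather than measurements.

```shell
# Hypothetical calculation times in seconds from repeated runs (not real data).
times="324.1 325.0 324.4"

# Intentionally unquoted so that each time is printed on its own line.
summary=$(printf '%s\n' $times | awk '
    { x[NR] = $1; sum += $1 }
    END {
        mean = sum / NR
        # Sum of squared deviations from the mean.
        for (i = 1; i <= NR; i++) { d = x[i] - mean; ss += d * d }
        # Sample variance uses NR - 1; the standard error divides by NR again.
        stderr_mean = sqrt(ss / (NR - 1) / NR)
        printf "%.2f %.2f\n", mean, stderr_mean
    }')
echo "mean and standard error (s): $summary"
```

At least two timings are needed, since the sample variance is undefined for a single run.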
At the time of writing, performance when using AOCL for BLAS/LAPACK was significantly worse than anything in the above table.