LAMMPS on Sulis
LAMMPS contains a number of packages for tuning performance and for enabling additional functionality. We have compiled LAMMPS with the most commonly used packages and performed some basic tests, in which the foss-2021b toolchain gave the best performance in many cases. This is especially true when running a CUDA-augmented build on a single GPU. The centrally provided LAMMPS module on Sulis therefore uses this toolchain.
If additional packages are required then users are welcome to compile LAMMPS within their home directory following the example below.
Accessing LAMMPS
First, check the list of existing LAMMPS modules
[user@login01(sulis) ~]$ module spider LAMMPS
Choosing a specific alias from the list, e.g.,
[user@login01(sulis) ~]$ module spider LAMMPS/29Sep2021-kokkos
will print out the modules that must be loaded before LAMMPS, which are GCC/11.2.0 and OpenMPI/4.1.1 for the foss-2021b toolchain. More options are available if the code is built with multiple compiler toolchains. See the correspondence between compiler toolchain names and the versions of the included libraries at this page.
Finally, LAMMPS/29Sep2021-kokkos is loaded into the command line environment with
module load GCC/11.2.0 OpenMPI/4.1.1 LAMMPS/29Sep2021-kokkos
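As an optional sanity check after loading the modules, module list confirms the loaded environment and lmp -h prints, among other information, the list of packages compiled into the provided binary:
# check the loaded modules and the packages built into the lmp binary
module list
lmp -h | less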
LAMMPS modules available so far (May 2022):
module | description |
---|---|
LAMMPS/29Sep2021-kokkos | built with OpenMP backend of Kokkos package |
LAMMPS/29Sep2021-CUDA-11.4.1-kokkos | GPU-accelerated (via Kokkos package) build |
LAMMPS/29Sep2021-CUDA-11.4.1-gpu | GPU-accelerated (via GPU package) build |
Running on CPUs
There are many ways of running LAMMPS, which are described in the corresponding section of the LAMMPS documentation and are not covered here. The next sections demonstrate the usage of the acceleration packages available in the provided builds.
Bare LAMMPS
The header of the submission script is similar for all CPU jobs.
lammps.slurm
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=my_lammps_calculation
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=3850
#SBATCH --time=24:00:00
#SBATCH --account=suxxx-somebudget
# drop all modules
module purge
# load the modules required for LAMMPS execution
module load GCC/11.2.0 OpenMPI/4.1.1 LAMMPS/29Sep2021-kokkos
# adjust the number of OpenMP threads automatically
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# executed command
srun lmp -in in.lammps
Note that OMP_NUM_THREADS has no effect without the OMP package discussed below, but it is included here for completeness.
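For readers who want something to run the scripts against, a minimal Lennard-Jones input in the spirit of our benchmark (LJ particles on an fcc lattice) is sketched below as in.lammps; this is an illustration only, not the exact input used for the timings in the Performance section.
# in.lammps: minimal 3d Lennard-Jones melt on an fcc lattice (illustrative only)
units           lj
atom_style      atomic
lattice         fcc 0.8442
region          box block 0 24 0 24 0 24
create_box      1 box
create_atoms    1 box
mass            1 1.0
velocity        all create 1.44 87287 loop geom
pair_style      lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5
neighbor        0.3 bin
neigh_modify    every 20 delay 0 check no
fix             1 all nve
thermo          100
run             1000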
OPT package
The OPT package gives good performance in the provided builds, provided that suitable accelerated pair styles are available. Adding the opt suffix to the last line in the script above enables the OPT package
lammps.slurm
...
srun lmp -sf opt -in in.lammps
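Equivalently, the suffix can be enabled from inside the input script rather than on the command line; a minimal sketch:
# near the top of in.lammps: equivalent to passing -sf opt on the command line
suffix opt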
OMP package
The OMP package uses multiple threads per MPI task via OpenMP. The parallelisation is controlled by adjusting the --ntasks-per-node
(number of MPI tasks) and --cpus-per-task
(number of threads per MPI task) entries such that their product does not exceed the number of available cores on a node (128 for Sulis). For example, using 4 threads may be invoked by the following change to the script above
lammps.slurm
...
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=4
...
srun lmp -sf omp -in in.lammps
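If you prefer to pass the thread count to LAMMPS explicitly rather than rely on OMP_NUM_THREADS alone, the package command-line option can be used; a sketch assuming the same Slurm settings as above:
# explicit alternative: set the OMP package thread count from the Slurm allocation
srun lmp -sf omp -pk omp ${SLURM_CPUS_PER_TASK} -in in.lammps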
We have noticed that some jobs with 8 threads per MPI task were unreasonably slow when using less than a full node. This was most likely caused by multiple OpenMP threads running on the same core. The issue was solved by using CPU binding:
lammps.slurm
...
srun --cpu-bind=cores lmp -sf omp -in in.lammps
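To confirm how tasks are actually pinned, Slurm can report the binding it applies; a diagnostic sketch:
# report the CPU binding mask for each task while still binding to cores
srun --cpu-bind=verbose,cores lmp -sf omp -in in.lammps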
Kokkos package (CPU)
The Kokkos package is built with the OpenMP backend for use on the standard compute nodes. Execution is controlled in the same fashion as for the OMP package, with a slightly different srun
command
lammps.slurm
...
srun lmp -k on t ${OMP_NUM_THREADS} -sf kk -in in.lammps
The Kokkos package may print warnings stating that the OMP_PROC_BIND and OMP_PLACES variables should be set to “spread” and “threads” respectively (in addition to OMP_NUM_THREADS above). Users may wish to experiment with this; we have not noticed a substantial effect on performance during our limited testing.
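For those who wish to try the settings mentioned in these warnings, a minimal sketch (in our limited testing the effect was small):
# optional thread-affinity settings suggested by the Kokkos warnings
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
srun lmp -k on t ${OMP_NUM_THREADS} -sf kk -in in.lammps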
Running on GPUs
The Kokkos and GPU packages both provide GPU acceleration for LAMMPS simulations which use compatible pair styles. In our testing the Kokkos package resulted in significantly faster performance on a single GPU. Kokkos performance on two GPU cards was similar to that of the GPU package, but significantly slower on three GPUs. The relative performance of the two packages is likely to be simulation-dependent and we encourage users to perform their own testing.
Kokkos package
This build is available via the LAMMPS/29Sep2021-CUDA-11.4.1-kokkos module. The following submission script implements a single-task, single-GPU calculation.
lammps.slurm
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=my_lammps_calculation-gpu
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=42
#SBATCH --mem-per-cpu=3850
#SBATCH --time=24:00:00
#SBATCH --account=suxxx-somebudget
#SBATCH --partition=gpu
#SBATCH --gres=gpu:ampere_a100:1
# drop all modules
module purge
# load the modules required for LAMMPS execution
module load GCC/11.2.0 OpenMPI/4.1.1 LAMMPS/29Sep2021-CUDA-11.4.1-kokkos
# re-adjust the number of OpenMP threads
export OMP_NUM_THREADS=1
# executed command
srun lmp -k on g 1 -sf kk -in in.lammps
It is important to note that g 1 in the srun command line requests a single GPU for LAMMPS; this number should equal the number of GPUs requested via the --gres string (the number after the last :). The --cpus-per-task=42 setting is used here predominantly to adjust the memory request, since OMP_NUM_THREADS is set to one. Note also that the number of requested GPUs must equal the number of MPI tasks. The last two conditions (OMP_NUM_THREADS=1 and SLURM_NTASKS_PER_NODE equal to the number of GPUs per node) gave the best performance in our tests, although users are encouraged to run their own tests for their particular application.
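For example, extending the same pattern to all three GPUs of a node (one MPI task per GPU, one OpenMP thread per task) would change only the lines below; this is a sketch based on the single-GPU script above rather than a tested recipe:
lammps.slurm
...
#SBATCH --ntasks-per-node=3
#SBATCH --cpus-per-task=42
#SBATCH --gres=gpu:ampere_a100:3
...
srun lmp -k on g 3 -sf kk -in in.lammps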
GPU package
The corresponding submission script for the GPU package is below.
lammps.slurm
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=my_lammps_calculation-gpu
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=42
#SBATCH --mem-per-cpu=3850
#SBATCH --time=24:00:00
#SBATCH --account=suxxx-somebudget
#SBATCH --partition=gpu
#SBATCH --gres=gpu:ampere_a100:2
# drop all modules
module purge
# load the modules required for LAMMPS execution
module load GCC/11.2.0 OpenMPI/4.1.1 LAMMPS/29Sep2021-CUDA-11.4.1-gpu
# re-adjust the number of OpenMP threads
export OMP_NUM_THREADS=1
# executed command
srun lmp -sf gpu -pk gpu 2 -in in.lammps
The same conditions as for Kokkos (OMP_NUM_THREADS=1 and SLURM_NTASKS_PER_NODE equal to the number of GPUs per node) showed the best performance in our tests; however, users may wish to experiment with multiple MPI tasks per GPU depending on their simulation details.
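As an illustration of such an experiment, the sketch below runs two MPI tasks per GPU on two GPUs, with --cpus-per-task reduced so the total core count stays within the per-GPU share of the node; we have not benchmarked this configuration ourselves:
lammps.slurm
...
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=21
#SBATCH --gres=gpu:ampere_a100:2
...
srun lmp -sf gpu -pk gpu 2 -in in.lammps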
Performance
Recommendations above are based on limited testing of a single system.
We have tested the foss-2020b, foss-2021b, iomkl-2019b and foss-2020a (for the LAMMPS 3Mar2020 version) toolchains for CPU tasks and, for GPU runs, CUDA-augmented derivatives of foss-2020b (i.e., fosscuda-2020b) and foss-2021b. The tested system is an array of 55296 particles on an fcc lattice interacting via the Lennard-Jones (LJ) potential. All calculations were done on a single node; check this page for more details. The tables below show time in seconds.
The timings were obtained using all 128 CPUs of a Sulis node, which minimises interference between the various calculations running on the cluster at the same time and makes the test results less biased. One should note, however, that the parallel efficiency T(1)/[N T(N)], where T(N) is the calculation time when using N processors, is around ~50% for the chosen test case in this configuration, which is quite low (see the plots here). This may change when considering a different system, running other types of calculations or using more complicated force fields. Our advice in this situation, however, is to use fewer cores (64 or less), targeting a parallel efficiency of 75% or higher. We also encourage users to estimate the parallel efficiency prior to full-scale runs and to find an optimal configuration for their particular situation.
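As a worked illustration with round numbers (not measured serial timings): if a serial run took T(1) = 6400 s and the same job on N = 128 cores took T(128) = 100 s, the parallel efficiency would be 6400/(128 × 100) = 0.5, i.e. 50%.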
The best CPU performance was obtained with the OMP package using two threads per MPI task. The OPT and OMP packages perform similarly well as single-threaded applications with 1 CPU per MPI task. Kokkos appears to be slightly slower, which is consistent with the documentation.
The foss-2021b toolchain provides the quickest runs on a single GPU via the Kokkos package. As mentioned earlier, one MPI rank (on one CPU) should be allocated per GPU. The GPU package gives better scaling with GPU count and may be considered as well.
CPU runs (MPI)
toolchain | foss-2020b | foss-2021b | iomkl-2019b | foss-2020a* |
---|---|---|---|---|
bare LAMMPS | 100 | 99 | 109 | 103 |
OPT package | 89 | 90 | 97 | 93 |
CPU runs (MPI + OpenMP)
toolchain | 1 thread | 2 threads | 4 threads | 8 threads |
---|---|---|---|---|
OMP package | ||||
foss-2020b | 91 | 79 | 95 | 132 |
foss-2021b | 93 | 79 | 107 | 115 |
iomkl-2019b | 98 | 87 | 89 | 119 |
foss-2020a | 94 | 81 | 83 | 124 |
Kokkos package | ||||
foss-2020b | 101 | 101 | 100 | 137 |
foss-2021b | 100 | 89 | 92 | 126 |
iomkl-2019b | 115 | 106 | 109 | 145 |
foss-2020a | 105 | 102 | 102 | 140 |
* LAMMPS version 3Mar2020
GPU runs
toolchain | 1 GPU | 2 GPUs | 3 GPUs |
---|---|---|---|
Kokkos package | |||
foss-2020b | 208 | 167 | 158 |
foss-2021b | 183 | 146 | 148 |
GPU package | |||
foss-2020b | 268 | 143 | 120 |
foss-2021b | 277 | 141 | 120 |
Building LAMMPS
The following instructions were tested on LAMMPS 29Sep2021 release available via the link.
CPU build
First, an interactive session must be requested on the CPU partition
salloc --account=suxxx-somebudget -N 1 -n 1 -c 42 --mem-per-cpu=3850 --time=8:00:00
Next, load the necessary modules providing the compiler toolchain, CMake and Python
module load foss/2021b CMake/3.21.1 Python/3.9.6
According to the LAMMPS documentation, the recommended way to build is with CMake. Inside the source directory,
mkdir build
cd build
cmake -C ../cmake/presets/most.cmake -DBUILD_SHARED_LIBS=on ../cmake
The most.cmake preset selects many packages for installation; however, only packages with resolved dependencies will be configured. If a package is not configured, the CMake output normally reports which libraries it requires, and Sulis is likely to have these available via the module system. Loading the missing modules and re-running the cmake command above will then normally lead to a successful configuration. The compilation itself is invoked via
cmake --build . -j 42
where the -j flag defines the parallelisation level of the build process and is set equal to the number of cores requested in the interactive session.
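If a specific package needs to be switched on explicitly on top of a preset, the standard PKG_<NAME> CMake flags can be added at the configure step; a sketch, with PKG_MOLECULE used purely as an illustration:
# enable an extra package on top of the preset, then rebuild
cmake -C ../cmake/presets/most.cmake -D PKG_MOLECULE=on -DBUILD_SHARED_LIBS=on ../cmake
cmake --build . -j 42
After the build completes, running ./lmp -h from the build directory lists the installed packages, which is a convenient way to confirm that the build contains what you need.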
To run the compiled code, modify the “module load” and “srun” parts of a CPU submission script, e.g. from the Bare LAMMPS section above
lammps.slurm
...
module load foss/2021b Python/3.9.6
...
srun /path/to/lammps/source/build/lmp -in in.lammps
GPU build
The next example illustrates the compilation of LAMMPS with the GPU package. Since the setup involves GPU hardware detection, the compilation has to be performed on the GPU partition by requesting an interactive session
salloc --account=suxxx-somebudget -p gpu -N 1 -n 1 -c 42 --mem-per-cpu=3850 --gres=gpu:ampere_a100:1 --time=8:00:00
Next, the compiler toolchain that LAMMPS is to be built with must be loaded together with the CUDA library
module load foss/2021b UCX-CUDA/1.11.2-CUDA-11.4.1 CMake/3.21.1
The UCX-CUDA module loads the CUDA/11.4.1 module and extends OpenMPI to a CUDA-aware version, allowing faster communication between MPI processes and GPUs. This arrangement is used in the foss-2021a and foss-2021b toolchains and will continue in future ones; previously, a separate toolchain with a recompiled OpenMPI library was required (e.g., fosscuda-2020b). To check that the CUDA/11.4.1 module is loaded, enter the module list command. The configuration can be done with a basic LAMMPS preset plus the GPU-related configuration flags.
mkdir build
cd build
cmake -C ../cmake/presets/basic.cmake -D PKG_GPU=on -D GPU_API=cuda -DGPU_ARCH=sm_80 -DBUILD_SHARED_LIBS=ON ../cmake
As with the CPU build, the following compiles the code
cmake --build . -j 42
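Since the interactive session already has a GPU allocated, an optional smoke test can be run directly from the build directory, assuming the standard bench/in.lj input shipped with the LAMMPS source:
# optional single-GPU smoke test from inside the interactive session
srun ./lmp -sf gpu -pk gpu 1 -in ../bench/in.lj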
Finally, modify the “module load” and “srun” parts of the submission script for the GPU package calculations above
lammps.slurm
...
module load foss/2021b UCX-CUDA/1.11.2-CUDA-11.4.1
...
srun /path/to/lammps/source/build/lmp -sf gpu -pk gpu 2 -in in.lammps