- Python environment modules
- Common Python packages
- Accessing additional packages
- Compiling Python packages
- Batch vs interactive computation
- Parallelism in Python
Sulis runs the CentOS 8 operating system, which includes Python 3.6. Many users will instead prefer to use newer versions of Python provided via the environment module system. Scientific packages provided via the module system will be built and configured for these newer versions and not the default CentOS Python installation.
Use `module spider Python` to query the available Python builds, and then load one using (for example):
[user@login01(sulis) ~]$ module load GCCcore/10.2.0 Python/3.8.6
Invoking `python` will now use the newer version.
[user@login01(sulis) ~]$ python --version
Python 3.8.6
Most scientific applications will make use of standard Python packages such as NumPy, SciPy, Pandas etc. A “bundle” of common packages can be imported into your environment via the SciPy-bundle module. These have been installed using the optimal build settings for the Sulis hardware. There should be no need for users to install their own versions of these packages.
In many cases it makes sense to search for and load a SciPy-bundle module rather than a Python module directly. The appropriate version of Python will be loaded automatically as a prerequisite.
[user@login01(sulis) ~]$ module spider SciPy-bundle
This will provide a list of compiler and MPI modules (and, in the case of GPU-accelerated Python packages, perhaps a CUDA module) that must be loaded as prerequisites. For example
[user@login01(sulis) ~]$ module load GCC/10.2.0 OpenMPI/4.0.5 SciPy-bundle/2020.11
will load the SciPy-bundle/2020.11 module and its prerequisites (including the necessary version of Python). With the SciPy-bundle module loaded, one can query the Python version and list the available packages.
[user@login01(sulis) ~]$ python --version
Python 3.8.6
[user@login01(sulis) ~]$ pip list
<long list of available packages>
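The same information can be checked from within Python itself. A minimal sketch (the package names below are just common examples, not a definitive inventory of the bundle):

```python
import importlib.util

# Check which common scientific packages are importable in the
# current environment (a quick alternative to scanning `pip list`).
for pkg in ("numpy", "scipy", "pandas"):
    spec = importlib.util.find_spec(pkg)
    print(pkg, "available" if spec else "missing")
```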
Many other Python packages are available via the environment module system. Please search the software already available via the module system before attempting to install additional packages.
If the package you need is likely to be useful for multiple users at your site or elsewhere then consider requesting a centrally installed build via your local research computing support team (HPC Midlands+ consortium members) or by raising an issue in the sulis-hpc/support repository on GitHub (EPSRC national access users).
For other packages, users can invoke pip with the --user argument to install packages from pypi.org into their home directory. This may be appropriate for packages which do not perform any processor-intensive computation, such that optimisation for the Sulis hardware is not a concern. For example, to install the arrow package:
[user@login01(sulis) ~]$ pip install --user arrow
Do be aware that Python packages are specific to a particular minor version of Python. For example a package installed via pip within a Python 3.7 environment will not be available within a Python 3.8 environment.
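This version-specificity is visible in the per-user install location itself, which embeds the Python minor version. A sketch using only the standard library:

```python
import site
import sys

# The per-user site-packages directory (where `pip install --user`
# places packages) embeds the minor version, e.g. .../python3.8/...,
# so installs made under one minor version are invisible to another.
print("Python", ".".join(str(v) for v in sys.version_info[:2]))
print("User site-packages:", site.getusersitepackages())
```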
Advanced users may wish to use Python virtual environments to manage multiple, possibly conflicting, versions of packages.
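On the command line this is usually done with `python -m venv <dir>`; the same functionality is exposed programmatically via the standard-library `venv` module. A minimal sketch, creating a throwaway environment in a temporary directory:

```python
import tempfile
import venv
from pathlib import Path

# Create a minimal virtual environment (without pip, for speed).
# From the shell you would activate it with: source <env_dir>/bin/activate
env_dir = Path(tempfile.mkdtemp()) / "myenv"
venv.create(env_dir, with_pip=False)

# The environment records its base interpreter in pyvenv.cfg
print((env_dir / "pyvenv.cfg").read_text().splitlines()[0])
```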
We do not recommend use of Anaconda for processor intensive scientific applications on Sulis (or on HPC platforms in general). The Anaconda system modifies your default Python environment in ways which may cause problems for optimised builds of packages provided via the module system. Software distributed via Anaconda is also built for compatibility with the largest range of hardware possible, rather than optimised for the latest hardware.
If processor intensive software needed for use on Sulis is only distributed via Anaconda then please first check with your support contact who may be able to create an optimised build for the Sulis hardware from the software’s source. In some cases the performance difference can be substantial.
If it is genuinely necessary to use Anaconda for a particular project then please be aware that it can be very difficult or impossible to debug compatibility problems with Anaconda-provided software.
In some cases users may need to compile their own Python packages from source. This is usually straightforward if the package provides an appropriate setup script and any prerequisite Python packages have been loaded via the environment module system or otherwise.
A common pitfall occurs when following build/installation guides from blogs that assume all users have root/admin privileges. Be sure to specify that the package should be installed into your home directory when executing the install step, for example:
[user@login01(sulis) src]$ python setup.py build
[user@login01(sulis) src]$ python setup.py install --user
Again, note that packages built/installed from source will only be available within environments that use the same minor Python version as the build environment.
Python users new to HPC platforms may not be accustomed to running Python-based calculations as non-interactive scripts submitted to a batch queue.
Sulis is primarily focused on batch computation. Python-based calculations should therefore be prepared as a script which requires no user input beyond command line arguments or input read from a data file. Plots should be saved to file as there is no display connected to the servers which process your script.
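A minimal pattern for a batch-friendly script uses `argparse` so that all input arrives via the command line. The argument names here are hypothetical:

```python
import argparse

def parse_args(argv=None):
    # All input comes from command line arguments -- no interactive
    # prompts -- so the script can run unattended in a batch queue.
    parser = argparse.ArgumentParser(description="Example batch calculation")
    parser.add_argument("--input", default="data.csv", help="input data file")
    parser.add_argument("--output", default="plot.png",
                        help="plots are saved to file, never displayed")
    parser.add_argument("--samples", type=int, default=100)
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    print(f"Would read {args.input} and write {args.output}")
```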
Help converting interactive Python workflows into scripts suitable for batch computation is available via the HPC for Data Science video lecture series, the creation of which was funded by the Alan Turing Institute Tools, Practices and Systems programme.
In some cases though it can be invaluable to interact directly with the Python interpreter as a debugging aid, or to test algorithms within Jupyter notebooks directly on the Sulis hardware. For such situations see the section on interactive jobs and the application notes on Jupyter.
Python is an evolving ecosystem with many options for parallel computation. The job submission section contains a non-exhaustive list of examples, including:
- joblib
- mpi4py
- mpi4py.futures
- Dask
Of these, only mpi4py and mpi4py.futures can be used for parallel processing across multiple nodes. However, joblib can be used in combination with Dask to create a worker pool spanning many nodes in the cluster.
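The `mpi4py.futures` module follows the same executor interface as the standard-library `concurrent.futures`, so the pattern can be sketched single-node with a stdlib executor (for multi-node runs one would swap in `MPIPoolExecutor` and launch under MPI):

```python
from concurrent.futures import ThreadPoolExecutor

def task(x):
    # A stand-in for a more expensive per-item computation
    return x * x

def run(n):
    # With mpi4py.futures this would be MPIPoolExecutor(), with tasks
    # distributed over MPI ranks, potentially spanning several nodes.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(task, range(n)))

if __name__ == "__main__":
    print(run(5))
```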