AlphaFold

Software name: AlphaFold
Policy 

AlphaFold is freely available to users at HPC2N. See here for how to include acknowledgement of the use of this program in scientific papers.

General 

AlphaFold can predict protein structures with atomic accuracy even where no similar structure is known.

Description 

AlphaFold provides an implementation of the inference pipeline of AlphaFold v2.0. This is a completely new model that was entered in CASP14 and published in Nature.

Availability 

On HPC2N, AlphaFold is available as a module. Binaries are compiled for both CPU-only and GPU use.

Usage at HPC2N 

Please read this entire section before experimenting with AlphaFold, because there are a couple of important things to be aware of.

Two different modules are available for AlphaFold: a CPU-only version, and a GPU version.

The large database (~4.5TB) that AlphaFold requires is available on the HPC2N infrastructure in a central location, so you don't have to download the data yourself: /pfs/data/databases/AlphaFold/20210903
The "/pfs/data/databases" directory resides on a fast parallel filesystem, which makes it directly usable for AlphaFold.
The name of the subdirectory indicates when the data was downloaded (which leaves room for providing updated datasets later).
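
Because the subdirectory names are dates in YYYYMMDD form, they sort chronologically, which makes it easy to find the newest dataset. The following is a minimal sketch and not part of the HPC2N installation: the function name newest_dataset is hypothetical, and only the 20210903 dataset is documented here, so any other dated subdirectories are an assumption.

```shell
#!/bin/bash
# Hypothetical helper: print the newest dated (YYYYMMDD) subdirectory of $1.
newest_dataset() {
    local root="$1"
    [ -d "$root" ] || { echo "not a directory: $root" >&2; return 1; }
    # Keep only 8-digit names; they sort chronologically as plain text.
    ls -1 "$root" | grep -E '^[0-9]{8}$' | sort | tail -n 1
}

# Central database location given above.
db_root=/pfs/data/databases/AlphaFold
if latest="$(newest_dataset "$db_root")" && [ -n "$latest" ]; then
    echo "Newest dataset: $db_root/$latest"
else
    echo "No dated dataset found under $db_root" >&2
fi
```

On a system without the central database directory, the script only prints a message to standard error.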

The AlphaFold installations we provide have been modified slightly to make them easier to use:

  • The location of the AlphaFold data can be specified via the $ALPHAFOLD_DATA_DIR environment variable. The module already sets this variable to the current version of the dataset.
  • A symbolic link named 'alphafold' that points to the run_alphafold.py script is included, so you can just use "alphafold" instead of "run_alphafold.py" or "python run_alphafold.py" after loading the AlphaFold module.
  • The run_alphafold.py script has been slightly modified such that defining $ALPHAFOLD_DATA_DIR is sufficient to pick up all the data provided in that location, so you don't need to use options like --data_dir to specify the location of the data.
  • Similarly, the run_alphafold.py script was tweaked such that the locations of commands like hhblits/hhsearch/jackhmmer/kalign are already correctly set, so options like --hhblits_binary_path are not required.
  • The Python scripts that are used to run hhblits and jackhmmer have been tweaked so you can control how many cores are used for these tools (rather than hardcoding it to 4 and 8 cores, respectively).
    Using the $ALPHAFOLD_HHBLITS_N_CPU environment variable, you can specify how many cores should be used for running hhblits (the default of 4 cores will be used if $ALPHAFOLD_HHBLITS_N_CPU is not defined); likewise for jackhmmer and $ALPHAFOLD_JACKHMMER_N_CPU.
    Tweaking this may or may not be worth it, though: we have noticed that these tools sometimes run slower on more than 4/8 cores (but this may be workload dependent).
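
As a minimal sketch of the core settings described above, lines like the following could be placed in a job script before calling alphafold. The two variable names come from the module; the helper function min, the use of $SLURM_CPUS_PER_TASK, and the caps of 4 and 8 cores (which simply mirror the defaults) are assumptions for illustration, not something the module does for you.

```shell
#!/bin/bash
# Cores granted by Slurm (-c); fall back to 4 outside a job (assumption).
cores="${SLURM_CPUS_PER_TASK:-4}"

# Hypothetical helper: print the smaller of two integers.
min() { if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi; }

# Cap at the default core counts (4 for hhblits, 8 for jackhmmer), since
# more cores do not always help; raise the caps only after measuring.
ALPHAFOLD_HHBLITS_N_CPU="$(min "$cores" 4)"
ALPHAFOLD_JACKHMMER_N_CPU="$(min "$cores" 8)"
export ALPHAFOLD_HHBLITS_N_CPU ALPHAFOLD_JACKHMMER_N_CPU

echo "hhblits: $ALPHAFOLD_HHBLITS_N_CPU cores, jackhmmer: $ALPHAFOLD_JACKHMMER_N_CPU cores"
```

Whether higher values pay off depends on the workload, so it is worth timing a short run before changing the caps.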

Submit file example

To run the T1050 AlphaFold example described below on one V100 card, you can use a submit file like this:

#!/bin/bash
#SBATCH -A <your-project-id>
#SBATCH -J AF-T1050-full_dbs
#SBATCH -t 05:00:00
#SBATCH -c 14
#SBATCH --gres=gpu:v100:1

# Clean the environment from loaded modules
ml purge > /dev/null 2>&1

# Load AlphaFold
ml fosscuda/2020b
ml AlphaFold/2.0.0

export ALPHAFOLD_HHBLITS_N_CPU=14

alphafold --fasta_paths=T1050.fasta --max_template_date=2020-05-14 --preset=full_dbs --output_dir=$PWD --model_names=model_1,model_2,model_3,model_4,model_5

Note: AlphaFold is not an MPI code and can only run on a single node.

Additional info 

We have run some basic tests on the installation, and it seems to be working as expected using the T1050.fasta example that is mentioned in the AlphaFold GitHub README.

Using "--preset=full_dbs", we got the following runtimes:

  • CPU-only, on Kebnekaise, using 14 cores (1/2 node): 11h 50min
  • CPU-only, on Kebnekaise, using 28 cores (1 full node): 12h 17min
  • GPU, on Kebnekaise, using 1 V100 GPU + 14 cores: 2h 29min
  • GPU, on Kebnekaise, using 2 V100 GPUs + 28 cores: 2h 44min

This highlights a couple of important points:

  • Running AlphaFold on GPU is significantly faster than CPU-only (about 5x faster for this particular example).
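  • Using more CPU cores may lead to longer runtimes, so be careful with using full nodes when running AlphaFold CPU-only. The same was seen with a second GPU: the run with 2 V100 GPUs and 28 cores was slower than the run with 1 V100 GPU and 14 cores.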
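
The ratios behind these points can be recomputed from the runtimes listed above. This is a small sketch that uses only those four measurements; the helper names to_min and ratio are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: convert hours and minutes to minutes.
to_min() { echo $(( $1 * 60 + $2 )); }

# Hypothetical helper: print a / b with two decimals.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'; }

cpu14=$(to_min 11 50)   # CPU-only, 14 cores
cpu28=$(to_min 12 17)   # CPU-only, 28 cores
gpu1=$(to_min 2 29)     # 1 V100 GPU + 14 cores
gpu2=$(to_min 2 44)     # 2 V100 GPUs + 28 cores

echo "GPU speedup over CPU-only (14 cores): $(ratio "$cpu14" "$gpu1")x"
echo "Runtime ratio, 28 vs 14 CPU cores:    $(ratio "$cpu28" "$cpu14")"
echo "Runtime ratio, 2 vs 1 GPU:            $(ratio "$gpu2" "$gpu1")"
```

For these measurements the script reports a GPU speedup of 4.77x and runtime ratios of 1.04 (28 vs 14 CPU cores) and 1.10 (2 vs 1 GPU), so the larger allocations were slightly slower in both cases.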

NOTE: We have also tried running this on the K80 nodes on Kebnekaise, and it never finished there. AlphaFold appears to use features that are not available on the K80 nodes, which causes this problem.

Updated: 2021-11-11, 13:50