{ "cells": [ { "cell_type": "markdown", "id": "64e80e67-78d1-4595-8873-dd4d157c469d", "metadata": {}, "source": [ "## 1. PP-HelixFold Introduction\n", "\n", "AlphaFold2 is an accurate protein structure prediction pipeline. PP-HelixFold provides an efficient and improved implementation of the complete AlphaFold2 training and inference pipelines on GPU and DCU. Compared with the computational performance of AlphaFold2 reported in its paper and of OpenFold, which is implemented in PyTorch, PP-HelixFold reduces the training time from about 11 days to 5.12 days, and to only 2.89 days when using hybrid parallelism. A HelixFold model trained from scratch can achieve accuracy competitive with AlphaFold2.\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "## 2. Technical Highlights for Efficient Implementation\n", "\n", "* **Branch Parallelism and Hybrid Parallelism** PP-HelixFold proposes **Branch Parallelism (BP)** to split the computation branches across multiple devices in order to accelerate computation during the initial training phase. The training cost is further reduced by training with **Hybrid Parallelism**, which combines BP with Dynamic Axial Parallelism (DAP) and Data Parallelism (DP).\n", "\n", "* **Operator Fusion and Tensor Fusion to Reduce the Cost of Scheduling** Scheduling a huge number of operators is one of the bottlenecks in training. To reduce the cost of scheduling, **Fused Gated Self-Attention** is used to combine multiple blocks into a single operator, and thousands of tensors are fused into only a few tensors.\n", "\n", "* **Multi-dimensional Memory Optimization** Multiple techniques, including Recompute, BFloat16, In-place memory, and Subbatch (Chunking), are exploited to reduce the memory required for training.\n", "\n", "\n", "## 3. Online Service\n", "\n", "For those who want to try out our model without any installation, we also provide an online interface, [PaddleHelix HelixFold Forecast](https://paddlehelix.baidu.com/app/drug/protein/forecast), as a web service.\n", "\n", "\n", "## 4. Environment\n", "\n", "To reproduce the results reported in our paper, the following environment settings are required.\n", "\n", "- python: 3.7\n", "- cuda: 11.6\n", "- cudnn: 8.4.0\n", "- nccl: 2.14.3\n", "\n", "\n", "## 5. How to Use the Model\n", "\n", "### Installation\n", "\n", "PP-HelixFold depends on [PaddlePaddle](https://github.com/paddlepaddle/paddle).\n", "Python dependencies available through `pip` are listed in `requirements.txt`. PP-HelixFold also depends on `openmm==7.5.1` and `pdbfixer`, which are only available via `conda`. For producing multiple sequence alignments, `kalign`, the [HH-suite](https://github.com/soedinglab/hh-suite) and `jackhmmer` are also needed. 
The download scripts require `aria2c`.\n", "\n", "We provide a script `setup_env` that sets up a `conda` environment and installs all dependencies. You can change the name of the environment and the CUDA version in `setup_env`. Run:\n", "```bash\n", "git clone https://github.com/PaddlePaddle/PaddleHelix.git # download PaddleHelix\n", "cd PaddleHelix/apps/protein_folding/helixfold\n", "wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/HelixFold/paddlepaddle_gpu-0.0.0.post116-cp37-cp37m-linux_x86_64.whl\n", "sh setup_env\n", "conda activate helixfold # activate the conda environment\n", "```\n", "Note: If you use a different version of Python 3 or CUDA, please refer to [here](https://www.paddlepaddle.org.cn/whl/linux/gpu/develop.html) for the compatible PaddlePaddle `dev` package.\n", "\n", "To run scripts in DAP/BP/DP-DAP-BP mode, you also need to install `ppfleetx`. Please refer to [here](https://github.com/PaddlePaddle/PaddleFleetX/tree/release/2.4/projects/protein_folding) for more details.\n", "```bash\n", "wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/HelixFold/ppfleetx-0.0.0-py3-none-any.whl\n", "python -m pip install ppfleetx-0.0.0-py3-none-any.whl # install ppfleetx\n", "```\n", "\n", "### Usage\n", "\n", "To run PP-HelixFold, the genetic databases and model parameters are required.\n", "\n", "You can use the script `scripts/download_all_data.sh`, which is the same as the one in the original AlphaFold, to download and set up all databases and model parameters:\n", "\n", "* Default:\n", "\n", " ```bash\n", " scripts/download_all_data.sh