{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "64e80e67-78d1-4595-8873-dd4d157c469d",
   "metadata": {},
   "source": [
    "## 1. PP-HelixFold Introduction\n",
    "\n",
    "AlphaFold2 is an accurate protein structure prediction pipeline. PP-HelixFold provides an efficient and improved implementation of the complete training and inference pipelines of AlphaFold2 in GPU and DCU. Compared with the computational performance of AlphaFold2 reported in the paper and OpenFold implemented through PyTorch, PP-HelixFold reduces the training time from about 11 days originally to 5.12 days, and only 2.89 days when using hybrid parallelism. Training HelixFold from scratch can achieve competitive accuracy with AlphaFold2.\n",
    "\n",
    "<p align=\"center\">\n",
    "<img src=\"https://github.com/PaddlePaddle/PaddleHelix/blob/dev/.github/HelixFold_computational_perf.png?raw=true\" align=\"middle\" height=\"50%\" width=\"50%\" />\n",
    "<img src=\"https://github.com/PaddlePaddle/PaddleHelix/blob/dev/.github/HelixFold_infer_accuracy.png?raw=true\" align=\"middle\" height=\"60%\" width=\"60%\" />\n",
    "</p>\n",
    "\n",
    "\n",
    "## 2. Technical Highlights for Efficient Implementation\n",
    "\n",
    "* **Branch Parallelism and Hybrid Parallelism** PP-HelixFold proposes **Branch Parallelism (BP)** to split the calculation branch across multiple devices in order to accelerate computation during the initial training phase. The training cost is further reduced by training with **Hybrid Parallelism**, combining BP with Dynamic Axial Parallelism (DAP) and Data Parallelism (DP).\n",
    "\n",
    "* **Operator Fusion and Tensor Fusion to Reduce the Cost of Scheduling** Scheduling a huge number of operators is one of the bottlenecks for the training. To reduce the cost of scheduling, **Fused Gated Self-Attention** is utilized to combine multiple blocks into an operator, and thousands of tensors are fused into only a few tensors.\n",
    "\n",
    "* **Multi-dimensional Memory Optimization** Multiple techniques, including Recompute, BFloat16, In-place memory, and Subbatch (Chunking), are exploited to reduce the memory required for training.\n",
    "\n",
    "\n",
    "## 3. Online Service\n",
    "\n",
    "For those who want to try out our model without any installation, we also provide an online interface [PaddleHelix HelixFold Forecast](https://paddlehelix.baidu.com/app/drug/protein/forecast) through web service.\n",
    "\n",
    "\n",
    "## 4. Environment\n",
    "\n",
    "To reproduce the results reported in our paper, specific environment settings are required as below. \n",
    "\n",
    "- python: 3.7\n",
    "- cuda: 11.6\n",
    "- cudnn: 8.4.0\n",
    "- nccl: 2.14.3\n",
    "\n",
    "\n",
    "## 5. How to Use the Model\n",
    "\n",
    "### Installation\n",
    "\n",
    "PP-HelixFold depends on [PaddlePaddle](https://github.com/paddlepaddle/paddle).\n",
    "Python dependencies available through `pip` is provided in `requirements.txt`. PP-HelixFold also depends on `openmm==7.5.1` and `pdbfixer`, which are only available via `conda`. For producing multiple sequence alignments, `kalign`, the [HH-suite](https://github.com/soedinglab/hh-suite) and `jackhmmer` are also needed. The download scripts require `aria2c`.\n",
    "\n",
    "We provide a script `setup_env` that setup a `conda` environment and installs all dependencies. You can change the name of the environment and CUDA version in `setup_env`. Run:\n",
    "```bash\n",
    "git clone https://github.com/PaddlePaddle/PaddleHelix.git # download PaddleHelix\n",
    "cd /apps/protein_folding/helixfold\n",
    "wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/HelixFold/paddlepaddle_gpu-0.0.0.post116-cp37-cp37m-linux_x86_64.whl\n",
    "sh setup_env\n",
    "conda activate helixfold # activate the conda environment\n",
    "```\n",
    "Note: If you have a different version of python3 and cuda, please refer to [here](https://www.paddlepaddle.org.cn/whl/linux/gpu/develop.html) for the compatible PaddlePaddle `dev` package.\n",
    "\n",
    "In order to run scripts with DAP/BP/DP-DAP-BP mode, you also need to install `ppfleetx`. Please refer to [here](https://github.com/PaddlePaddle/PaddleFleetX/tree/release/2.4/projects/protein_folding) for more details.\n",
    "```bash\n",
    "wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/HelixFold/ppfleetx-0.0.0-py3-none-any.whl\n",
    "python -m pip install ppfleetx-0.0.0-py3-none-any.whl      # install ppfleetx\n",
    "```\n",
    "\n",
    "### Usage\n",
    "\n",
    "In order to run PP-HelixFold, the genetic databases and model parameters are required.\n",
    "\n",
    "You can use a script `scripts/download_all_data.sh`, which is the same as the original AlphaFold that can be used to download and set up all databases and model parameters:\n",
    "\n",
    "*   Default:\n",
    "\n",
    "    ```bash\n",
    "    scripts/download_all_data.sh <DOWNLOAD_DIR>\n",
    "    ```\n",
    "\n",
    "    will download the full databases. The total download size for the full databases is around 415 GB and the total size when unzipped is 2.2 TB.  \n",
    "\n",
    "*   With `reduced_dbs`:\n",
    "\n",
    "    ```bash\n",
    "    scripts/download_all_data.sh <DOWNLOAD_DIR> reduced_dbs\n",
    "    ```\n",
    "\n",
    "    will download a reduced version of the databases to be used with the\n",
    "    `reduced_dbs` preset. The total download size for the reduced databases is around 190 GB and the total size when unzipped is around 530 GB. \n",
    "\n",
    "### Running PP-HelixFold for Inference\n",
    "\n",
    "To run inference on a sequence or multiple sequences using a set of DeepMind's pretrained parameters, run e.g.:\n",
    "\n",
    "*   Inference on single GPU (DP):\n",
    "    ```bash\n",
    "    fasta_file=\"target.fasta\"       # path to the target protein\n",
    "    model_name=\"model_5\"            # the alphafold model name\n",
    "    DATA_DIR=\"data\"                 # path to the databases\n",
    "    OUTPUT_DIR=\"helixfold_output\"   # path to save the outputs\n",
    "\n",
    "    python run_helixfold.py \\\n",
    "      --fasta_paths=${fasta_file} \\\n",
    "      --data_dir=${DATA_DIR} \\\n",
    "      --bfd_database_path=${DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \\\n",
    "      --small_bfd_database_path=${DATA_DIR}/small_bfd/bfd-first_non_consensus_sequences.fasta \\\n",
    "      --uniclust30_database_path=${DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \\\n",
    "      --uniref90_database_path=${DATA_DIR}/uniref90/uniref90.fasta \\\n",
    "      --mgnify_database_path=${DATA_DIR}/mgnify/mgy_clusters_2018_12.fa \\\n",
    "      --pdb70_database_path=${DATA_DIR}/pdb70/pdb70 \\\n",
    "      --template_mmcif_dir=${DATA_DIR}/pdb_mmcif/mmcif_files \\\n",
    "      --obsolete_pdbs_path=${DATA_DIR}/pdb_mmcif/obsolete.dat \\\n",
    "      --max_template_date=2020-05-14 \\\n",
    "      --model_names=${model_name} \\\n",
    "      --output_dir=${OUTPUT_DIR} \\\n",
    "      --preset='reduced_dbs' \\\n",
    "      --jackhmmer_binary_path /opt/conda/envs/helixfold/bin/jackhmmer \\\n",
    "      --hhblits_binary_path /opt/conda/envs/helixfold/bin/hhblits \\\n",
    "      --hhsearch_binary_path /opt/conda/envs/helixfold/bin/hhsearch \\\n",
    "      --kalign_binary_path /opt/conda/envs/helixfold/bin/kalign \\\n",
    "      --random_seed=0\n",
    "    ```\n",
    "\n",
    "*   Inference on multiple GPUs (DAP):\n",
    "    ```bash\n",
    "    fasta_file=\"target.fasta\"       # path to the target protein\n",
    "    model_name=\"model_5\"            # the alphafold model name\n",
    "    DATA_DIR=\"data\"                 # path to the databases\n",
    "    OUTPUT_DIR=\"helixfold_output\"   # path to save the outputs\n",
    "    log_dir=\"demo_log\"              # path to log file\n",
    "\n",
    "    distributed_args=\"--run_mode=collective --log_dir=${log_dir}\"\n",
    "    python -m paddle.distributed.launch ${distributed_args} \\\n",
    "      --gpus=\"0,1,2,3,4,5,6,7\" \\\n",
    "      run_helixfold.py \\\n",
    "      --distributed \\\n",
    "      --dap_degree 8 \\\n",
    "      --fasta_paths=${fasta_file} \\\n",
    "      --data_dir=${DATA_DIR} \\\n",
    "      --bfd_database_path=${DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \\\n",
    "      --small_bfd_database_path=${DATA_DIR}/small_bfd/bfd-first_non_consensus_sequences.fasta \\\n",
    "      --uniclust30_database_path=${DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \\\n",
    "      --uniref90_database_path=${DATA_DIR}/uniref90/uniref90.fasta \\\n",
    "      --mgnify_database_path=${DATA_DIR}/mgnify/mgy_clusters_2018_12.fa \\\n",
    "      --pdb70_database_path=${DATA_DIR}/pdb70/pdb70 \\\n",
    "      --template_mmcif_dir=${DATA_DIR}/pdb_mmcif/mmcif_files \\\n",
    "      --obsolete_pdbs_path=${DATA_DIR}/pdb_mmcif/obsolete.dat \\\n",
    "      --max_template_date=2020-05-14 \\\n",
    "      --model_names=${model_name} \\\n",
    "      --output_dir=${OUTPUT_DIR} \\\n",
    "      --preset='reduced_dbs' \\\n",
    "      --seed 2022 \\\n",
    "      --jackhmmer_binary_path /opt/conda/envs/helixfold/bin/jackhmmer \\\n",
    "      --hhblits_binary_path /opt/conda/envs/helixfold/bin/hhblits \\\n",
    "      --hhsearch_binary_path /opt/conda/envs/helixfold/bin/hhsearch \\\n",
    "      --kalign_binary_path /opt/conda/envs/helixfold/bin/kalign \\\n",
    "      --random_seed=0\n",
    "    ```\n",
    "You can use `python run_helixfold.py -h` to find the description of the arguments.\n",
    "\n",
    "### Running PP-HelixFold for CASP14 Demo\n",
    "\n",
    "For convenience, we also provide a demo script `gpu_infer.sh` for some CASP14 proteins under folder `demo_data/casp14_demo`. To run them, you just need to execute following command:\n",
    "\n",
    "```bash\n",
    "sh gpu_infer.sh T1026\n",
    "```\n",
    "\n",
    "Note that such demo for T1026 and T1037 can work without downloading large MSA datasets, only model parameters are required.\n",
    "\n",
    "\n",
    "## 6. Related papers and citations\n",
    "\n",
    "If you use the code or data in this repos, please cite:\n",
    "\n",
    "```bibtex\n",
    "@article{AlphaFold2021,\n",
    "  author={Jumper, John and Evans, Richard and Pritzel, Alexander and Green, Tim and Figurnov, Michael and Ronneberger, Olaf and Tunyasuvunakool, Kathryn and Bates, Russ and {\\v{Z}}{\\'\\i}dek, Augustin and Potapenko, Anna and Bridgland, Alex and Meyer, Clemens and Kohl, Simon A A and Ballard, Andrew J and Cowie, Andrew and Romera-Paredes, Bernardino and Nikolov, Stanislav and Jain, Rishub and Adler, Jonas and Back, Trevor and Petersen, Stig and Reiman, David and Clancy, Ellen and Zielinski, Michal and Steinegger, Martin and Pacholska, Michalina and Berghammer, Tamas and Bodenstein, Sebastian and Silver, David and Vinyals, Oriol and Senior, Andrew W and Kavukcuoglu, Koray and Kohli, Pushmeet and Hassabis, Demis},\n",
    "  journal={Nature},\n",
    "  title={Highly accurate protein structure prediction with {AlphaFold}},\n",
    "  year={2021},\n",
    "  volume={596},\n",
    "  number={7873},\n",
    "  pages={583--589},\n",
    "  doi={10.1038/s41586-021-03819-2}\n",
    "}\n",
    "\n",
    "@article{wang2022helixfold,\n",
    "  title={HelixFold: An Efficient Implementation of AlphaFold2 using PaddlePaddle},\n",
    "  author={Wang, Guoxia and Fang, Xiaomin and Wu, Zhihua and Liu, Yiqun and Xue, Yang and Xiang, Yingfei and Yu, Dianhai and Wang, Fan and Ma, Yanjun},\n",
    "  journal={arXiv preprint arXiv:2207.05477},\n",
    "  year={2022}\n",
    "}\n",
    "\n",
    "@article{wang2022efficient_alphafold2,\n",
    "  title={Efficient AlphaFold2 Training using Parallel Evoformer and Branch Parallelism},\n",
    "  author={Wang, Guoxia and Wu, Zhihua and Fang, Xiaomin and Xiang, Yingfei and Liu, Yiqun and Yu, Dianhai and Ma, Yanjun},\n",
    "  journal={arXiv preprint arXiv:2211.00235},\n",
    "  year={2022}\n",
    "}\n",
    "```\n",
    "\n",
    "## 7. Copyright\n",
    "\n",
    "PP-HelixFold code is licensed under the Apache 2.0 License, which is same as AlphaFold. However, we use the AlphaFold parameters pretrained by DeepMind, which are made available for non-commercial use only under the terms of the CC BY-NC 4.0 license.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}