{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "64e80e67-78d1-4595-8873-dd4d157c469d",
   "metadata": {},
   "source": [
    "## 1. PP-HelixFold Introduction\n",
    "\n",
    "AlphaFold2 is an accurate protein structure prediction pipeline. PP-HelixFold provides an efficient and improved implementation of the complete AlphaFold2 training and inference pipelines on GPUs and DCUs. Compared with the computational performance of AlphaFold2 reported in the paper and of OpenFold implemented in PyTorch, PP-HelixFold reduces the training time from about 11 days to 5.12 days, and to only 2.89 days when using hybrid parallelism. Training HelixFold from scratch achieves accuracy competitive with AlphaFold2.\n",
    "\n",
    "<p align=\"center\">\n",
    "<img src=\"https://github.com/PaddlePaddle/PaddleHelix/blob/dev/.github/HelixFold_computational_perf.png?raw=true\" align=\"middle\" height=\"50%\" width=\"50%\" />\n",
    "<img src=\"https://github.com/PaddlePaddle/PaddleHelix/blob/dev/.github/HelixFold_infer_accuracy.png?raw=true\" align=\"middle\" height=\"60%\" width=\"60%\" />\n",
    "</p>\n",
    "\n",
    "\n",
    "## 2. Technical Highlights for Efficient Implementation\n",
    "\n",
    "* **Branch Parallelism and Hybrid Parallelism** PP-HelixFold proposes **Branch Parallelism (BP)** to split the calculation branch across multiple devices in order to accelerate computation during the initial training phase. The training cost is further reduced by training with **Hybrid Parallelism**, combining BP with Dynamic Axial Parallelism (DAP) and Data Parallelism (DP).\n",
    "\n",
    "* **Operator Fusion and Tensor Fusion to Reduce the Cost of Scheduling** Scheduling a huge number of operators is one of the bottlenecks in training. To reduce the cost of scheduling, **Fused Gated Self-Attention** combines multiple blocks into a single operator, and thousands of tensors are fused into only a few.\n",
    "\n",
    "* **Multi-dimensional Memory Optimization** Multiple techniques, including Recompute, BFloat16, In-place memory, and Subbatch (Chunking), are exploited to reduce the memory required for training.\n",
    "\n",
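    "The Subbatch (Chunking) idea above can be sketched with a minimal NumPy example (an illustration only, not PP-HelixFold's actual implementation): a memory-hungry function is applied to slices along one axis and the partial results are concatenated, so peak memory is bounded by one chunk at a time.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def subbatch(f, x, chunk_size):\n",
    "    # Apply f to chunks of x along axis 0 and stitch the results back together.\n",
    "    # Peak activation memory is bounded by one chunk instead of the whole input.\n",
    "    chunks = [f(x[i:i + chunk_size]) for i in range(0, x.shape[0], chunk_size)]\n",
    "    return np.concatenate(chunks, axis=0)\n",
    "\n",
    "x = np.arange(12.0).reshape(6, 2)\n",
    "out = subbatch(lambda t: t * 2, x, chunk_size=2)\n",
    "assert np.allclose(out, x * 2)   # identical result to the unchunked call\n",
    "```\n",
    "\n",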
    "\n",
    "## 3. Online Service\n",
    "\n",
    "For those who want to try out our model without any installation, we also provide an online interface, [PaddleHelix HelixFold Forecast](https://paddlehelix.baidu.com/app/drug/protein/forecast), as a web service.\n",
    "\n",
    "\n",
    "## 4. Environment\n",
    "\n",
    "To reproduce the results reported in our paper, the specific environment settings below are required.\n",
    "\n",
    "- python: 3.7\n",
    "- cuda: 11.2\n",
    "- cudnn: 8.10.1\n",
    "- nccl: 2.12.12\n",
    "\n",
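    "As a quick sanity check (a hedged sketch, assuming PaddlePaddle is already installed), you can print the CUDA and cuDNN versions that PaddlePaddle was built against:\n",
    "\n",
    "```bash\n",
    "python3 -c \"import paddle; print(paddle.version.cuda(), paddle.version.cudnn())\"\n",
    "```\n",
    "\n",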
    "\n",
    "## 5. How to Use the Model\n",
    "\n",
    "### Installation\n",
    "\n",
    "PP-HelixFold depends on [PaddlePaddle](https://github.com/paddlepaddle/paddle).\n",
    "Python dependencies available through `pip` are listed in `requirements.txt`. PP-HelixFold also depends on `openmm==7.5.1` and `pdbfixer`, which are only available via `conda`. For producing multiple sequence alignments, `kalign`, the [HH-suite](https://github.com/soedinglab/hh-suite), and `jackhmmer` are also needed. The download scripts require `aria2c`.\n",
    "\n",
    "We provide a script `setup_env` that sets up a `conda` environment and installs all dependencies. You can change the environment name and CUDA version in `setup_env`. Run:\n",
    "```bash\n",
    "git clone https://github.com/PaddlePaddle/PaddleHelix.git   # download PaddleHelix\n",
    "cd PaddleHelix/apps/protein_folding/helixfold\n",
    "wget https://paddle-wheel.bj.bcebos.com/develop/linux/linux-gpu-cuda11.2-cudnn8-mkl-gcc8.2-avx/paddlepaddle_gpu-0.0.0.post112-cp37-cp37m-linux_x86_64.whl\n",
    "sh setup_env\n",
    "conda activate helixfold   # activate the conda environment\n",
    "```\n",
    "Note: If you have a different version of Python 3 or CUDA, please refer to [here](https://www.paddlepaddle.org.cn/whl/linux/gpu/develop.html) for a compatible PaddlePaddle `dev` package.\n",
    "\n",
    "In order to run scripts in DAP/BP/DP-DAP-BP mode, you also need to install `ppfleetx`. Please refer to [here](https://github.com/PaddlePaddle/PaddleFleetX/tree/release/2.4/projects/protein_folding) for more details.\n",
    "```bash\n",
    "git clone https://github.com/PaddlePaddle/PaddleFleetX.git\n",
    "cd PaddleFleetX\n",
    "git checkout release/2.4          # switch to the release/2.4 branch\n",
    "python setup.py develop           # install ppfleetx\n",
    "```\n",
    "\n",
    "### Usage\n",
    "\n",
    "In order to run PP-HelixFold, the genetic databases and model parameters are required.\n",
    "\n",
    "You can use the script `scripts/download_all_data.sh` (the same as in the original AlphaFold) to download and set up all databases and model parameters:\n",
    "\n",
    "*   Default:\n",
    "\n",
    "    ```bash\n",
    "    scripts/download_all_data.sh <DOWNLOAD_DIR>\n",
    "    ```\n",
    "\n",
    "    will download the full databases. The total download size for the full databases is around 415 GB and the total size when unzipped is 2.2 TB.  \n",
    "\n",
    "*   With `reduced_dbs`:\n",
    "\n",
    "    ```bash\n",
    "    scripts/download_all_data.sh <DOWNLOAD_DIR> reduced_dbs\n",
    "    ```\n",
    "\n",
    "    will download a reduced version of the databases to be used with the\n",
    "    `reduced_dbs` preset. The total download size for the reduced databases is around 190 GB and the total size when unzipped is around 530 GB. \n",
    "\n",
    "### Running PP-HelixFold for Inference\n",
    "\n",
    "To run inference on a sequence or multiple sequences using a set of DeepMind's pretrained parameters, run e.g.:\n",
    "\n",
    "*   Inference on single GPU (DP):\n",
    "    ```bash\n",
    "    fasta_file=\"target.fasta\"       # path to the target protein\n",
    "    model_name=\"model_5\"            # the alphafold model name\n",
    "    DATA_DIR=\"data\"                 # path to the databases\n",
    "    OUTPUT_DIR=\"helixfold_output\"   # path to save the outputs\n",
    "\n",
    "    python run_helixfold.py \\\n",
    "      --fasta_paths=${fasta_file} \\\n",
    "      --data_dir=${DATA_DIR} \\\n",
    "      --bfd_database_path=${DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \\\n",
    "      --small_bfd_database_path=${DATA_DIR}/small_bfd/bfd-first_non_consensus_sequences.fasta \\\n",
    "      --uniclust30_database_path=${DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \\\n",
    "      --uniref90_database_path=${DATA_DIR}/uniref90/uniref90.fasta \\\n",
    "      --mgnify_database_path=${DATA_DIR}/mgnify/mgy_clusters_2018_12.fa \\\n",
    "      --pdb70_database_path=${DATA_DIR}/pdb70/pdb70 \\\n",
    "      --template_mmcif_dir=${DATA_DIR}/pdb_mmcif/mmcif_files \\\n",
    "      --obsolete_pdbs_path=${DATA_DIR}/pdb_mmcif/obsolete.dat \\\n",
    "      --max_template_date=2020-05-14 \\\n",
    "      --model_names=${model_name} \\\n",
    "      --output_dir=${OUTPUT_DIR} \\\n",
    "      --preset='reduced_dbs' \\\n",
    "      --jackhmmer_binary_path /opt/conda/envs/helixfold/bin/jackhmmer \\\n",
    "      --hhblits_binary_path /opt/conda/envs/helixfold/bin/hhblits \\\n",
    "      --hhsearch_binary_path /opt/conda/envs/helixfold/bin/hhsearch \\\n",
    "      --kalign_binary_path /opt/conda/envs/helixfold/bin/kalign \\\n",
    "      --random_seed=0\n",
    "    ```\n",
    "\n",
    "*   Inference on multiple GPUs (DAP):\n",
    "    ```bash\n",
    "    fasta_file=\"target.fasta\"       # path to the target protein\n",
    "    model_name=\"model_5\"            # the alphafold model name\n",
    "    DATA_DIR=\"data\"                 # path to the databases\n",
    "    OUTPUT_DIR=\"helixfold_output\"   # path to save the outputs\n",
    "    log_dir=\"demo_log\"              # path to log file\n",
    "\n",
    "    distributed_args=\"--run_mode=collective --log_dir=${log_dir}\"\n",
    "    python -m paddle.distributed.launch ${distributed_args} \\\n",
    "      --gpus=\"0,1,2,3,4,5,6,7\" \\\n",
    "      run_helixfold.py \\\n",
    "      --distributed \\\n",
    "      --dap_degree 8 \\\n",
    "      --fasta_paths=${fasta_file} \\\n",
    "      --data_dir=${DATA_DIR} \\\n",
    "      --bfd_database_path=${DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \\\n",
    "      --small_bfd_database_path=${DATA_DIR}/small_bfd/bfd-first_non_consensus_sequences.fasta \\\n",
    "      --uniclust30_database_path=${DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \\\n",
    "      --uniref90_database_path=${DATA_DIR}/uniref90/uniref90.fasta \\\n",
    "      --mgnify_database_path=${DATA_DIR}/mgnify/mgy_clusters_2018_12.fa \\\n",
    "      --pdb70_database_path=${DATA_DIR}/pdb70/pdb70 \\\n",
    "      --template_mmcif_dir=${DATA_DIR}/pdb_mmcif/mmcif_files \\\n",
    "      --obsolete_pdbs_path=${DATA_DIR}/pdb_mmcif/obsolete.dat \\\n",
    "      --max_template_date=2020-05-14 \\\n",
    "      --model_names=${model_name} \\\n",
    "      --output_dir=${OUTPUT_DIR} \\\n",
    "      --preset='reduced_dbs' \\\n",
    "      --seed 2022 \\\n",
    "      --jackhmmer_binary_path /opt/conda/envs/helixfold/bin/jackhmmer \\\n",
    "      --hhblits_binary_path /opt/conda/envs/helixfold/bin/hhblits \\\n",
    "      --hhsearch_binary_path /opt/conda/envs/helixfold/bin/hhsearch \\\n",
    "      --kalign_binary_path /opt/conda/envs/helixfold/bin/kalign \\\n",
    "      --random_seed=0\n",
    "    ```\n",
    "You can use `python run_helixfold.py -h` to see a description of each argument.\n",
    "\n",
    "### Running PP-HelixFold for CASP14 Demo\n",
    "\n",
    "For convenience, we also provide a demo script `gpu_infer.sh` for some CASP14 proteins under the folder `demo_data/casp14_demo`. To run it, simply execute the following command:\n",
    "\n",
    "```bash\n",
    "sh gpu_infer.sh T1026\n",
    "```\n",
    "\n",
    "Note that the demos for T1026 and T1037 work without downloading the large MSA databases; only the model parameters are required.\n",
    "\n",
    "\n",
    "## 6. Related papers and citations\n",
    "\n",
    "If you use the code or data in this repo, please cite:\n",
    "\n",
    "```bibtex\n",
    "@article{AlphaFold2021,\n",
    "  author={Jumper, John and Evans, Richard and Pritzel, Alexander and Green, Tim and Figurnov, Michael and Ronneberger, Olaf and Tunyasuvunakool, Kathryn and Bates, Russ and {\\v{Z}}{\\'\\i}dek, Augustin and Potapenko, Anna and Bridgland, Alex and Meyer, Clemens and Kohl, Simon A A and Ballard, Andrew J and Cowie, Andrew and Romera-Paredes, Bernardino and Nikolov, Stanislav and Jain, Rishub and Adler, Jonas and Back, Trevor and Petersen, Stig and Reiman, David and Clancy, Ellen and Zielinski, Michal and Steinegger, Martin and Pacholska, Michalina and Berghammer, Tamas and Bodenstein, Sebastian and Silver, David and Vinyals, Oriol and Senior, Andrew W and Kavukcuoglu, Koray and Kohli, Pushmeet and Hassabis, Demis},\n",
    "  journal={Nature},\n",
    "  title={Highly accurate protein structure prediction with {AlphaFold}},\n",
    "  year={2021},\n",
    "  volume={596},\n",
    "  number={7873},\n",
    "  pages={583--589},\n",
    "  doi={10.1038/s41586-021-03819-2}\n",
    "}\n",
    "\n",
    "@article{wang2022helixfold,\n",
    "  title={HelixFold: An Efficient Implementation of AlphaFold2 using PaddlePaddle},\n",
    "  author={Wang, Guoxia and Fang, Xiaomin and Wu, Zhihua and Liu, Yiqun and Xue, Yang and Xiang, Yingfei and Yu, Dianhai and Wang, Fan and Ma, Yanjun},\n",
    "  journal={arXiv preprint arXiv:2207.05477},\n",
    "  year={2022}\n",
    "}\n",
    "\n",
    "@article{wang2022efficient_alphafold2,\n",
    "  title={Efficient AlphaFold2 Training using Parallel Evoformer and Branch Parallelism},\n",
    "  author={Wang, Guoxia and Wu, Zhihua and Fang, Xiaomin and Xiang, Yingfei and Liu, Yiqun and Yu, Dianhai and Ma, Yanjun},\n",
    "  journal={arXiv preprint arXiv:2211.00235},\n",
    "  year={2022}\n",
    "}\n",
    "```\n",
    "\n",
    "## 7. Copyright\n",
    "\n",
    "PP-HelixFold code is licensed under the Apache 2.0 License, which is the same license as AlphaFold's. However, we use the AlphaFold parameters pretrained by DeepMind, which are made available for non-commercial use only under the terms of the CC BY-NC 4.0 license.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}