Unverified commit dde08b36, authored by liuTINA0907, committed by GitHub

add new models and helix files (#5569)

Co-authored-by: liushuangqiao <liushuangqiao@beibeiMacBook-Pro.local>
Parent 52970f04
>T1026 FBNSV, , 172 residues|
MVSNWNWSGKKGRRTPRRGYTRPFKSAVPTTRVVVHQSAVLKKDDVSGSEIKPEGDVARYKIRKVMLSCTLRMRPGELVNYLIVKCSSPIVNWSAAFTAPALMVKESCQDMITIIGKGKVESNGVAGSDCTKSFNKFIRLGAGISQTQHLYVVMYTSEAVKTVLEHRVYIEV
>T1037 S0A2C3d4, , 404 residues|
SKINFYTTTIETLETEDQNNTLTTFKVQNVSNASTIFSNGKTYWNFARPSYISNRINTFKNNPGVLRQLLNTSYGQSSLWAKHLLGEEKNVTGDFVLAGNARESASENRLKSLELSIFNSLQEKDKGAEGNDNGSISIVDQLADKLNKVLRGGTKNGTSIYSTVTPGDKSTLHEIKIDHFIPETISSFSNGTMIFNDKIVNAFTDHFVSEVNRMKEAYQELETLPESKRVVHYHTDARGNVMKDGKLAGNAFKSGHILSELSFDQITQDDNEMLKLYNEDGSPINPKGAVSNEQKILIKQTINKVLNQRIKENIRYFKDQGLVIDTVNKDGNKGFHFHGLDKSIMSEYTDDIQLTEFDISHVVSDFTLNSILASIEYTKLFTGDPANYKNMVDFFKRVPATYTN
!pip install -q gradio
import gradio as gr
import os


def molecule(input_pdb):
    # Wrap the PDB text in a small 3Dmol.js page and return it as an
    # iframe that Gradio can render through a gr.HTML component.
    mol = read_mol(input_pdb)
    x = (
        """<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<style>
body {
    font-family: sans-serif;
}
.mol-container {
    width: 100%;
    height: 600px;
    position: relative;
}
.mol-container select {
    background-image: None;
}
</style>
<script src="https://3Dmol.csb.pitt.edu/build/3Dmol-min.js"></script>
</head>
<body>
<div id="container" class="mol-container"></div>
<script>
let pdb = `"""
        + mol
        + """`
$(document).ready(function () {
    let element = $("#container");
    let config = { backgroundColor: "white" };
    let viewer = $3Dmol.createViewer(element, config);
    viewer.addModel(pdb, "pdb");
    viewer.getModel(0).setStyle({}, { cartoon: { color: "spectrum" } });
    viewer.zoomTo();
    viewer.render();
    viewer.zoom(1, 1000); /* slight zoom */
})
</script>
</body></html>"""
    )
    return f"""<iframe style="width: 100%; height: 600px" name="result" allow="midi; geolocation; microphone; camera;
display-capture; encrypted-media;" sandbox="allow-modals allow-forms
allow-scripts allow-same-origin allow-popups
allow-top-navigation-by-user-activation allow-downloads" allowfullscreen=""
allowpaymentrequest="" frameborder="0" srcdoc='{x}'></iframe>"""


def get_pdb(pdb_code="", filepath=""):
    # Fetch a structure by PDB code from RCSB, or fall back to an
    # uploaded file object; returns a local file path or None.
    if pdb_code is None or pdb_code == "":
        try:
            return filepath.name
        except AttributeError:
            return None
    else:
        os.system(f"wget -qnc https://files.rcsb.org/view/{pdb_code}.pdb")
        return f"{pdb_code}.pdb"


def read_mol(molpath):
    # Read a whole file (PDB or FASTA) into one string.
    with open(molpath, "r") as fp:
        return fp.read()


def update(fastaName="", fastaContent=""):
    # Only the FASTA label is used: the predicted structure is expected
    # to be saved next to the app as "<label>_pred.pdb".
    if fastaName == "":
        return None
    return molecule(fastaName + "_pred.pdb")


demo = gr.Blocks()
with demo:
    gr.Markdown("# PDB viewer using 3Dmol.js")
    with gr.Row():
        with gr.Box():
            fastaName = gr.Textbox(interactive=False, label="Fasta label")
            fastaContent = gr.Textbox(interactive=False, label="Fasta content")
            gr.Examples(
                [["T1026", read_mol("T1026.fasta")], ["T1037", read_mol("T1037.fasta")]],
                [fastaName, fastaContent],
            )
            btn = gr.Button("View")
    mol = gr.HTML()
    btn.click(fn=update, inputs=[fastaName, fastaContent], outputs=mol)
demo.launch()
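# Usage note (a sketch under assumptions not stated in this file): the
# Examples rows expect T1026.fasta and T1037.fasta next to this script, and
# the "View" button expects matching predictions named <label>_pred.pdb
# (e.g. T1026_pred.pdb) produced beforehand, e.g. by the HelixFold demo.
# Outside a notebook, drop the leading "!pip" line and run:
#
#   pip install gradio
#   python app.py    # Gradio serves on http://127.0.0.1:7860 by default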
【PP-HelixFold-App-YAML】
APP_Info:
title: PP-HelixFold-App
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: 3.4.1
app_file: app.py
license: apache-2.0
device: gpu
## 1. Inference Benchmark
### 1.1 Environment
Inference benchmarks for the PP-HelixFold model were run on a single NVIDIA A100 (40G) GPU with batch size 1. To reproduce the results reported in our paper, the specific environment below is required.
* Python: 3.7
* CUDA 11.2
* CUDNN 8.10.1
* NCCL 2.12.12
### 1.2 Datasets
For training, 25% of the PP-HelixFold samples come from RCSB PDB and 75% from a self-distillation dataset. For evaluation, we collected 87 CASP14 domain targets and 371 CAMEO protein targets released between 2021-09-04 and 2022-02-19 as the test set.
### 1.3 Performance
Benchmarks against the original AlphaFold2 and OpenFold, the PyTorch reimplementation from Prof. Mohammed AlQuraishi's team at Columbia University, show that PP-HelixFold improves training performance by 106.97% over AlphaFold2 and by 104.86% over OpenFold, cutting training time from about 11 days to 7.5 days, and further to 5.3 days with hybrid parallelism. Despite the large speedup, PP-HelixFold trained end-to-end from scratch matches the accuracy reported in the AlphaFold2 paper: on the 87-protein CASP14 set and the 371-protein CAMEO set, it reaches TM-scores of 0.8771 and 0.8885 respectively, on par with or better than the original AlphaFold2.
![](https://github.com/PaddlePaddle/PaddleHelix/blob/dev/.github/HelixFold_computational_performance.png)
![](https://github.com/PaddlePaddle/PaddleHelix/blob/dev/.github/HelixFold_accuracy.png)
## 2. Usage
Please refer to: https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold
## 1. Inference Benchmark
### 1.1 Environment
The PP-HelixFold inference benchmark was run on a single NVIDIA A100 (40G) GPU with batch size 1. To reproduce the results reported in our paper, the specific environment settings below are required (a quick sanity check is sketched after the list).
* Python: 3.7
* CUDA 11.2
* CUDNN 8.10.1
* NCCL 2.12.12
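As a quick sanity check of the GPU stack before benchmarking, the sketch below uses PaddlePaddle's built-in installation self-test (`paddle.utils.run_check()`; the version prints are informational):

```python
import paddle

print(paddle.__version__)          # PaddlePaddle build in use
print(paddle.device.get_device())  # e.g. "gpu:0" when CUDA is visible
paddle.utils.run_check()           # verifies that CUDA/cuDNN work end to end
```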
### 1.2 Datasets
For training, the PP-HelixFold model uses 25% of samples from RCSB PDB and 75% self-distillation samples. For evaluation, we collected 87 domain targets from CASP14 and 371 protein targets from CAMEO released between 2021-09-04 and 2022-02-19.
### 1.3 Performance
Compared with the computational performance of AlphaFold2 reported in its paper and of OpenFold implemented in PyTorch, PP-HelixFold improves training performance by 106.97% and 104.86% respectively, reducing the training time from about 11 days to 7.5 days, and further to only 5.3 days when using hybrid parallelism. Training PP-HelixFold from scratch achieves accuracy competitive with AlphaFold2, with TM-scores of 0.8771 on the 87-protein CASP14 set and 0.8885 on the 371-protein CAMEO set.
![](https://github.com/PaddlePaddle/PaddleHelix/blob/dev/.github/HelixFold_computational_performance.png)
![](https://github.com/PaddlePaddle/PaddleHelix/blob/dev/.github/HelixFold_accuracy.png)
## 2. Reference
Ref: https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold
# Downloads of the pretrained model parameters and databases required for inference:
| Model | Description | Parameter Size | Download |
|--------|----|------------|------------|
| AlphaFold2_parameters | Protein structure prediction | ~ 93M | [pretrained model](https://storage.googleapis.com/alphafold/alphafold_params_2021-10-27.tar) |

| Database | Purpose | Size | Download |
|--------|----|------------|------------|
| bfd | MSA search | ~ 1.7 TB (download: 271.6 GB) | [bfd](https://storage.googleapis.com/alphafold-databases/casp14_versions/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz) |
| small_bfd | MSA search | ~ 17 GB (download: 9.6 GB) | [small_bfd](https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz) |
| mgnify | MSA search | ~ 64 GB (download: 32.9 GB) | [mgnify](https://storage.googleapis.com/alphafold-databases/casp14_versions/mgy_clusters_2018_12.fa.gz) |
| uniclust30 | MSA search | ~ 86 GB (download: 24.9 GB) | [uniclust30](https://storage.googleapis.com/alphafold-databases/casp14_versions/uniclust30_2018_08_hhsuite.tar.gz) |
| uniref90 | MSA search | ~ 58 GB (download: 29.7 GB) | [uniref90](ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz) |
| pdb70 | Template search | ~ 56 GB (download: 19.5 GB) | [pdb70](http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/pdb70_from_mmcif_200401.tar.gz) |
| pdb_mmcif | Template search | ~ 206 GB (download: 46 GB) | [pdb_mmcif](rsync.rcsb.org::ftp_data/structures/divided/mmCIF/) |
# Download
| Model | Task | Size | Download |
|--------|----|------------|------------|
| AlphaFold2_parameters | Protein structure prediction | ~ 93M | [pretrained models](https://storage.googleapis.com/alphafold/alphafold_params_2021-10-27.tar) |

| Database | Task | Size | Download |
|--------|----|------------|------------|
| bfd | MSA search | ~ 1.7 TB (download: 271.6 GB) | [bfd](https://storage.googleapis.com/alphafold-databases/casp14_versions/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz) |
| small_bfd | MSA search | ~ 17 GB (download: 9.6 GB) | [small_bfd](https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz) |
| mgnify | MSA search | ~ 64 GB (download: 32.9 GB) | [mgnify](https://storage.googleapis.com/alphafold-databases/casp14_versions/mgy_clusters_2018_12.fa.gz) |
| uniclust30 | MSA search | ~ 86 GB (download: 24.9 GB) | [uniclust30](https://storage.googleapis.com/alphafold-databases/casp14_versions/uniclust30_2018_08_hhsuite.tar.gz) |
| uniref90 | MSA search | ~ 58 GB (download: 29.7 GB) | [uniref90](ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz) |
| pdb70 | Template search | ~ 56 GB (download: 19.5 GB) | [pdb70](http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/pdb70_from_mmcif_200401.tar.gz) |
| pdb_mmcif | Template search | ~ 206 GB (download: 46 GB) | [pdb_mmcif](rsync.rcsb.org::ftp_data/structures/divided/mmCIF/) |
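The repository's `scripts/download_all_data.sh` is the supported way to fetch everything. As an illustration only, a hedged sketch of manually fetching a single entry from the table above (here `small_bfd`) with `aria2c`:

```bash
# Illustrative manual download of small_bfd; the directory layout is arbitrary.
DOWNLOAD_DIR=data
mkdir -p "${DOWNLOAD_DIR}/small_bfd"
aria2c -x 8 -d "${DOWNLOAD_DIR}/small_bfd" \
  https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz
gunzip "${DOWNLOAD_DIR}/small_bfd/bfd-first_non_consensus_sequences.fasta.gz"
```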
【PP-HelixFold-YAML】(Note: this YAML sample only illustrates the data structure; the dev team will provide a frontend configuration tool that generates the YAML file automatically.)
Model_Info:
name: "PP-HelixFold"
description:
description_en:
update_time:
icon:
from_repo: "PaddleHelix"
Task:
-
tag: "生物计算"
tag_en: "Biological Computing"
sub_tag: "蛋白质结构预测"
sub_tag_en: "Protein Structure Prediction"
Example:
-
tag:
tag_en:
sub_tag:
sub_tag_en:
title:
title_en:
url:
url_en:
Datasets: "RCSB PDB, Self-distillation datasets, CAMEO, CASP14"
Publisher: "Baidu"
License: "Apache 2.0"
Paper:
-
title: "HelixFold: An Efficient Implementation of AlphaFold2 using PaddlePaddle"
url: "https://arxiv.org/abs/2207.05477"
-
title: "Efficient AlphaFold2 Training using Parallel Evoformer and Branch Parallelism"
url: "https://arxiv.org/abs/2211.00235"
IfTraining: 1
IfOnlineDemo: 1
{
"cells": [
{
"cell_type": "markdown",
"id": "69972530-dd2b-443c-a1e1-6bcebe1c46b9",
"metadata": {},
"source": [
"## 1. PP-HelixFold模型简介\n",
"\n",
"AlphaFold2是一款高精度的蛋白质结构预测模型。PP-HelixFold基于PaddlePaddle框架在GPU和DCU上完整复现了AlphaFold2的训练和推理流程,并进一步提升模型性能与精度。通过与原版AlphaFold2模型和哥伦比亚大学Mohammed AlQuraishi教授团队基于PyTorch复现的OpenFold模型的性能对比测试显示,PP-HelixFold将训练耗时从约11天减少到7.5天。在性能大幅度提升的同时,PP-HelixFold从头端到端完整训练可以达到AlphaFold2论文媲美的精度。\n",
"\n",
"<p align=\"center\">\n",
"<img src=\"https://github.com/PaddlePaddle/PaddleHelix/blob/dev/.github/HelixFold_computational_performance.png?raw=true\" align=\"middle\" height=\"50%\" width=\"50%\" />\n",
"<img src=\"https://github.com/PaddlePaddle/PaddleHelix/blob/dev/.github/HelixFold_accuracy.png?raw=true\" align=\"middle\" height=\"60%\" width=\"60%\" />\n",
"</p>\n",
"\n",
"\n",
"## 2. 技术创新\n",
"\n",
"* **分支并行与混合并行策略** PP-HelixFold创新性地提出**分支并行 (Branch Parallelism, BP)** 策略,将不同的网络模型分支放在不同的卡上并行计算,从而在initial training阶段大幅提高了模型并行效率和训练速度。并且,分支并行和已有的**动态轴并行 (Dynamic Axial Parallelism, DAP)** 和**数据并行 (Data Parallelism, DP)** 结合使用,通过BP-DAP-DP三维混合并行,进一步加快了模型的整体训练速度。\n",
"\n",
"* **算子融合优化技术和张量融合低频次访存技术** 针对AlphaFold2中Gated Self-Attention小算子组合CPU调度开销大、模型参数小、参数个数多的问题,PP-HelixFold将Gated Self-Attention整个模块融合用一个算子实现,将CPU调度开销优化到极致。同时,将数千个小张量融合成一个连续的大张量,模型参数的梯度、优化器状态都相应更新,大幅减少了访存次数、CPU调度开销和显存碎片,从而提升了训练速度。\n",
"\n",
"* **多维度显存优化方案** 采用Recompute、BFloat16、显存复用、Subbatch(Chunking)等技术,将显存峰值降低到40G以内,同时支持MSA长度为512、ExtraMSA长度为5120、残基序列长度为384的最大模型配置的微调训练,从而解决了模型结构深,中间结果计算量大,ExtraMSAStack输入过长等导致无法训练的问题。\n",
"\n",
"\n",
"## 3. 线上服务\n",
"\n",
"如果您想免安装直接尝试使用我们的模型,我们还提供了线上服务器[PaddleHelix HelixFold Forecast](https://paddlehelix.baidu.com/app/drug/protein/forecast)。\n",
"\n",
"\n",
"## 4. 环境需求\n",
"\n",
"为了能复现我们论文中报告的实验结果,需在特定环境下进行实验。\n",
"\n",
"- python: 3.7\n",
"- cuda: 11.2\n",
"- cudnn: 8.10.1\n",
"- nccl: 2.12.12\n",
"\n",
"\n",
"## 4. 模型如何使用\n",
"\n",
"### 安装\n",
"\n",
"PP-HelixFold基于[PaddlePaddle](https://github.com/paddlepaddle/paddle)实现。\n",
"通过`pip`安装的Python相关库在`requirements.txt`文件中提供,PP-HelixFold需要使用的`openmm==7.5.1`和`pdbfixer`工具,仅可通过`conda`安装。 同时,还需要安装`kalign`、[HH-suite](https://github.com/soedinglab/hh-suite)和`jackhmmer`等工具来生成多序列比对文件。下载脚本需要支持`aria2c`。\n",
"\n",
"我们提供脚本`setup_env`来安装`conda`环境和所需的所有第三方工具库。您可以在`setup_env`中更改环境名字和CUDA版本。运行命令如下:\n",
"```bash\n",
"git clone https://github.com/PaddlePaddle/PaddleHelix.git # download PaddleHelix\n",
"cd https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold\n",
"wget https://paddle-wheel.bj.bcebos.com/develop/linux/linux-gpu-cuda11.2-cudnn8-mkl-gcc8.2-avx/paddlepaddle_gpu-0.0.0.post112-cp37-cp37m-linux_x86_64.whl\n",
"sh setup_env\n",
"conda activate helixfold # activate the conda environment\n",
"```\n",
"注意:如果您环境中的Python3和CUDA版本与我们提供的Paddle whl包不匹配,请参考[这里](https://www.paddlepaddle.org.cn/whl/linux/gpu/develop.html)下载安装对应版本的PaddlePaddle `dev`包。\n",
"\n",
"为了代码运行时支持开启DAP/BP/DP-DAP-BP模式,您还需安装`ppfleetx`。更多详细信息请参考[这里](https://github.com/PaddlePaddle/PaddleFleetX/tree/release/2.4/projects/protein_folding)。\n",
"```bash\n",
"git clone https://github.com/PaddlePaddle/PaddleFleetX.git\n",
"git checkout release/2.4 # change branch\n",
"python setup.py develop # install ppfleetx\n",
"```\n",
"\n",
"### 使用\n",
"\n",
"在运行PP-HelixFold前,需要先下载所需的数据库和预训练模型参数。\n",
"\n",
"与原版AlphaFold2一样,您可以运行脚本`scripts/download_all_data.sh`下载所有所需的数据库和预训练模型参数文件:\n",
"\n",
"* 默认选项:\n",
"\n",
" ```bash\n",
" scripts/download_all_data.sh <DOWNLOAD_DIR>\n",
" ```\n",
"\n",
" 将下载完整版数据库。完整版数据库和预训练模型参数文件的解压前总大小约415 GB,解压后约2.2 TB。\n",
"\n",
"* `reduced_dbs`选项:\n",
"\n",
" ```bash\n",
" scripts/download_all_data.sh <DOWNLOAD_DIR> reduced_dbs\n",
" ```\n",
"\n",
" 将下载精简版数据库。精简版数据库和预训练模型参数文件的解压前总大小约190 GB,解压后约530 GB。\n",
"\n",
"### PP-HelixFold模型推理\n",
"\n",
"可以使用如下脚本运行PP-HelixFold模型推理单个或多个蛋白序列文件:\n",
"\n",
"* 在单卡GPU上推理(DP模式):\n",
" ```bash\n",
" fasta_file=\"target.fasta\" # path to the target protein\n",
" model_name=\"model_5\" # the alphafold model name\n",
" DATA_DIR=\"data\" # path to the databases\n",
" OUTPUT_DIR=\"helixfold_output\" # path to save the outputs\n",
"\n",
" python run_helixfold.py \\\n",
" --fasta_paths=${fasta_file} \\\n",
" --data_dir=${DATA_DIR} \\\n",
" --bfd_database_path=${DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \\\n",
" --small_bfd_database_path=${DATA_DIR}/small_bfd/bfd-first_non_consensus_sequences.fasta \\\n",
" --uniclust30_database_path=${DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \\\n",
" --uniref90_database_path=${DATA_DIR}/uniref90/uniref90.fasta \\\n",
" --mgnify_database_path=${DATA_DIR}/mgnify/mgy_clusters_2018_12.fa \\\n",
" --pdb70_database_path=${DATA_DIR}/pdb70/pdb70 \\\n",
" --template_mmcif_dir=${DATA_DIR}/pdb_mmcif/mmcif_files \\\n",
" --obsolete_pdbs_path=${DATA_DIR}/pdb_mmcif/obsolete.dat \\\n",
" --max_template_date=2020-05-14 \\\n",
" --model_names=${model_name} \\\n",
" --output_dir=${OUTPUT_DIR} \\\n",
" --preset='reduced_dbs' \\\n",
" --jackhmmer_binary_path /opt/conda/envs/helixfold/bin/jackhmmer \\\n",
" --hhblits_binary_path /opt/conda/envs/helixfold/bin/hhblits \\\n",
" --hhsearch_binary_path /opt/conda/envs/helixfold/bin/hhsearch \\\n",
" --kalign_binary_path /opt/conda/envs/helixfold/bin/kalign \\\n",
" --random_seed=0\n",
" ```\n",
"\n",
"* 在多卡GPU上推理(DAP模式):\n",
" ```bash\n",
" fasta_file=\"target.fasta\" # path to the target protein\n",
" model_name=\"model_5\" # the alphafold model name\n",
" DATA_DIR=\"data\" # path to the databases\n",
" OUTPUT_DIR=\"helixfold_output\" # path to save the outputs\n",
" log_dir=\"demo_log\" # path to log file\n",
"\n",
" distributed_args=\"--run_mode=collective --log_dir=${log_dir}\"\n",
" python -m paddle.distributed.launch ${distributed_args} \\\n",
" --gpus=\"0,1,2,3,4,5,6,7\" \\\n",
" run_helixfold.py \\\n",
" --distributed \\\n",
" --dap_degree 8 \\\n",
" --fasta_paths=${fasta_file} \\\n",
" --data_dir=${DATA_DIR} \\\n",
" --bfd_database_path=${DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \\\n",
" --small_bfd_database_path=${DATA_DIR}/small_bfd/bfd-first_non_consensus_sequences.fasta \\\n",
" --uniclust30_database_path=${DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \\\n",
" --uniref90_database_path=${DATA_DIR}/uniref90/uniref90.fasta \\\n",
" --mgnify_database_path=${DATA_DIR}/mgnify/mgy_clusters_2018_12.fa \\\n",
" --pdb70_database_path=${DATA_DIR}/pdb70/pdb70 \\\n",
" --template_mmcif_dir=${DATA_DIR}/pdb_mmcif/mmcif_files \\\n",
" --obsolete_pdbs_path=${DATA_DIR}/pdb_mmcif/obsolete.dat \\\n",
" --max_template_date=2020-05-14 \\\n",
" --model_names=${model_name} \\\n",
" --output_dir=${OUTPUT_DIR} \\\n",
" --preset='reduced_dbs' \\\n",
" --seed 2022 \\\n",
" --jackhmmer_binary_path /opt/conda/envs/helixfold/bin/jackhmmer \\\n",
" --hhblits_binary_path /opt/conda/envs/helixfold/bin/hhblits \\\n",
" --hhsearch_binary_path /opt/conda/envs/helixfold/bin/hhsearch \\\n",
" --kalign_binary_path /opt/conda/envs/helixfold/bin/kalign \\\n",
" --random_seed=0\n",
" ```\n",
"您可以使用命令`python run_helixfold.py -h`查找各参数选项具体描述与定义。\n",
"\n",
"### PP-HelixFold模型在CASP14 Demo上推理\n",
"\n",
"为了使用方便,我们提供一键式运行脚本`gpu_infer.sh`来运行目录`demo_data/casp14_demo`底下的部分CASP14蛋白。您可以运行以下命令来使用:\n",
"\n",
"```bash\n",
"sh gpu_infer.sh T1026\n",
"```\n",
"\n",
"注意:运行demo蛋白T1026和T1037,您无需下载庞大的数据库,仅需下载预训练模型参数即可使用。\n",
"\n",
"\n",
"## 5. 相关论文以及引用信息\n",
"\n",
"如果您使用了该代码库里的任何代码和数据,请引用:\n",
"\n",
"```bibtex\n",
"@article{AlphaFold2021,\n",
" author={Jumper, John and Evans, Richard and Pritzel, Alexander and Green, Tim and Figurnov, Michael and Ronneberger, Olaf and Tunyasuvunakool, Kathryn and Bates, Russ and {\\v{Z}}{\\'\\i}dek, Augustin and Potapenko, Anna and Bridgland, Alex and Meyer, Clemens and Kohl, Simon A A and Ballard, Andrew J and Cowie, Andrew and Romera-Paredes, Bernardino and Nikolov, Stanislav and Jain, Rishub and Adler, Jonas and Back, Trevor and Petersen, Stig and Reiman, David and Clancy, Ellen and Zielinski, Michal and Steinegger, Martin and Pacholska, Michalina and Berghammer, Tamas and Bodenstein, Sebastian and Silver, David and Vinyals, Oriol and Senior, Andrew W and Kavukcuoglu, Koray and Kohli, Pushmeet and Hassabis, Demis},\n",
" journal={Nature},\n",
" title={Highly accurate protein structure prediction with {AlphaFold}},\n",
" year={2021},\n",
" volume={596},\n",
" number={7873},\n",
" pages={583--589},\n",
" doi={10.1038/s41586-021-03819-2}\n",
"}\n",
"\n",
"@article{wang2022helixfold,\n",
" title={HelixFold: An Efficient Implementation of AlphaFold2 using PaddlePaddle},\n",
" author={Wang, Guoxia and Fang, Xiaomin and Wu, Zhihua and Liu, Yiqun and Xue, Yang and Xiang, Yingfei and Yu, Dianhai and Wang, Fan and Ma, Yanjun},\n",
" journal={arXiv preprint arXiv:2207.05477},\n",
" year={2022}\n",
"}\n",
"\n",
"@article{wang2022efficient_alphafold2,\n",
" title={Efficient AlphaFold2 Training using Parallel Evoformer and Branch Parallelism},\n",
" author={Wang, Guoxia and Wu, Zhihua and Fang, Xiaomin and Xiang, Yingfei and Liu, Yiqun and Yu, Dianhai and Ma, Yanjun},\n",
" journal={arXiv preprint arXiv:2211.00235},\n",
" year={2022}\n",
"}\n",
"```\n",
"\n",
"## 6. 版权所有\n",
"\n",
"PP-HelixFold代码使用的是Apache 2.0 License许可文件,该许可与原版AlphaFold2相同。但是,我们使用了由DeepMind提供的AlphaFold2预训练模型参数,根据CC BY-NC 4.0 license许可文件规定,仅可用于非商业用途。\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "markdown",
"id": "64e80e67-78d1-4595-8873-dd4d157c469d",
"metadata": {},
"source": [
"## 1. PP-HelixFold Introduction\n",
"\n",
"AlphaFold2 is an accurate protein structure prediction pipeline. PP-HelixFold provides an efficient and improved implementation of the complete training and inference pipelines of AlphaFold2 in GPU and DCU. Compared with the computational performance of AlphaFold2 reported in the paper and OpenFold implemented through PyTorch, PP-HelixFold reduces the training time from about 11 days to 7.5 days. Training PP-HelixFold from scratch can achieve competitive accuracy with AlphaFold2.\n",
"\n",
"<p align=\"center\">\n",
"<img src=\"https://github.com/PaddlePaddle/PaddleHelix/blob/dev/.github/HelixFold_computational_performance.png?raw=true\" align=\"middle\" height=\"50%\" width=\"50%\" />\n",
"<img src=\"https://github.com/PaddlePaddle/PaddleHelix/blob/dev/.github/HelixFold_accuracy.png?raw=true\" align=\"middle\" height=\"60%\" width=\"60%\" />\n",
"</p>\n",
"\n",
"\n",
"## 2. Technical Highlights for Efficient Implementation\n",
"\n",
"* **Branch Parallelism and Hybrid Parallelism** PP-HelixFold proposes **Branch Parallelism (BP)** to split the calculation branch across multiple devices in order to accelerate computation during the initial training phase. The training cost is further reduced by training with **Hybrid Parallelism**, combining BP with Dynamic Axial Parallelism (DAP) and Data Parallelism (DP).\n",
"\n",
"* **Operator Fusion and Tensor Fusion to Reduce the Cost of Scheduling** Scheduling a huge number of operators is one of the bottlenecks for the training. To reduce the cost of scheduling, **Fused Gated Self-Attention** is utilized to combine multiple blocks into an operator, and thousands of tensors are fused into only a few tensors.\n",
"\n",
"* **Multi-dimensional Memory Optimization** Multiple techniques, including Recompute, BFloat16, In-place memory, and Subbatch (Chunking), are exploited to reduce the memory required for training.\n",
"\n",
"\n",
"## 3. Online Service\n",
"\n",
"For those who want to try out our model without any installation, we also provide an online interface [PaddleHelix HelixFold Forecast](https://paddlehelix.baidu.com/app/drug/protein/forecast) through web service.\n",
"\n",
"\n",
"## 4. Environment\n",
"\n",
"To reproduce the results reported in our paper, specific environment settings are required as below. \n",
"\n",
"- python: 3.7\n",
"- cuda: 11.2\n",
"- cudnn: 8.10.1\n",
"- nccl: 2.12.12\n",
"\n",
"\n",
"## 4. How to Use the Model\n",
"\n",
"### Installation\n",
"\n",
"PP-HelixFold depends on [PaddlePaddle](https://github.com/paddlepaddle/paddle).\n",
"Python dependencies available through `pip` is provided in `requirements.txt`. PP-HelixFold also depends on `openmm==7.5.1` and `pdbfixer`, which are only available via `conda`. For producing multiple sequence alignments, `kalign`, the [HH-suite](https://github.com/soedinglab/hh-suite) and `jackhmmer` are also needed. The download scripts require `aria2c`.\n",
"\n",
"We provide a script `setup_env` that setup a `conda` environment and installs all dependencies. You can change the name of the environment and CUDA version in `setup_env`. Run:\n",
"```bash\n",
"git clone https://github.com/PaddlePaddle/PaddleHelix.git # download PaddleHelix\n",
"cd https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold\n",
"wget https://paddle-wheel.bj.bcebos.com/develop/linux/linux-gpu-cuda11.2-cudnn8-mkl-gcc8.2-avx/paddlepaddle_gpu-0.0.0.post112-cp37-cp37m-linux_x86_64.whl\n",
"sh setup_env\n",
"conda activate helixfold # activate the conda environment\n",
"```\n",
"Note: If you have a different version of python3 and cuda, please refer to [here](https://www.paddlepaddle.org.cn/whl/linux/gpu/develop.html) for the compatible PaddlePaddle `dev` package.\n",
"\n",
"In order to run scripts with DAP/BP/DP-DAP-BP mode, you also need to install `ppfleetx`. Please refer to [here](https://github.com/PaddlePaddle/PaddleFleetX/tree/release/2.4/projects/protein_folding) for more details.\n",
"```bash\n",
"git clone https://github.com/PaddlePaddle/PaddleFleetX.git\n",
"git checkout release/2.4 # change branch\n",
"python setup.py develop # install ppfleetx\n",
"```\n",
"\n",
"### Usage\n",
"\n",
"In order to run PP-HelixFold, the genetic databases and model parameters are required.\n",
"\n",
"You can use a script `scripts/download_all_data.sh`, which is the same as the original AlphaFold that can be used to download and set up all databases and model parameters:\n",
"\n",
"* Default:\n",
"\n",
" ```bash\n",
" scripts/download_all_data.sh <DOWNLOAD_DIR>\n",
" ```\n",
"\n",
" will download the full databases. The total download size for the full databases is around 415 GB and the total size when unzipped is 2.2 TB. \n",
"\n",
"* With `reduced_dbs`:\n",
"\n",
" ```bash\n",
" scripts/download_all_data.sh <DOWNLOAD_DIR> reduced_dbs\n",
" ```\n",
"\n",
" will download a reduced version of the databases to be used with the\n",
" `reduced_dbs` preset. The total download size for the reduced databases is around 190 GB and the total size when unzipped is around 530 GB. \n",
"\n",
"### Running PP-HelixFold for Inference\n",
"\n",
"To run inference on a sequence or multiple sequences using a set of DeepMind's pretrained parameters, run e.g.:\n",
"\n",
"* Inference on single GPU (DP):\n",
" ```bash\n",
" fasta_file=\"target.fasta\" # path to the target protein\n",
" model_name=\"model_5\" # the alphafold model name\n",
" DATA_DIR=\"data\" # path to the databases\n",
" OUTPUT_DIR=\"helixfold_output\" # path to save the outputs\n",
"\n",
" python run_helixfold.py \\\n",
" --fasta_paths=${fasta_file} \\\n",
" --data_dir=${DATA_DIR} \\\n",
" --bfd_database_path=${DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \\\n",
" --small_bfd_database_path=${DATA_DIR}/small_bfd/bfd-first_non_consensus_sequences.fasta \\\n",
" --uniclust30_database_path=${DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \\\n",
" --uniref90_database_path=${DATA_DIR}/uniref90/uniref90.fasta \\\n",
" --mgnify_database_path=${DATA_DIR}/mgnify/mgy_clusters_2018_12.fa \\\n",
" --pdb70_database_path=${DATA_DIR}/pdb70/pdb70 \\\n",
" --template_mmcif_dir=${DATA_DIR}/pdb_mmcif/mmcif_files \\\n",
" --obsolete_pdbs_path=${DATA_DIR}/pdb_mmcif/obsolete.dat \\\n",
" --max_template_date=2020-05-14 \\\n",
" --model_names=${model_name} \\\n",
" --output_dir=${OUTPUT_DIR} \\\n",
" --preset='reduced_dbs' \\\n",
" --jackhmmer_binary_path /opt/conda/envs/helixfold/bin/jackhmmer \\\n",
" --hhblits_binary_path /opt/conda/envs/helixfold/bin/hhblits \\\n",
" --hhsearch_binary_path /opt/conda/envs/helixfold/bin/hhsearch \\\n",
" --kalign_binary_path /opt/conda/envs/helixfold/bin/kalign \\\n",
" --random_seed=0\n",
" ```\n",
"\n",
"* Inference on multiple GPUs (DAP):\n",
" ```bash\n",
" fasta_file=\"target.fasta\" # path to the target protein\n",
" model_name=\"model_5\" # the alphafold model name\n",
" DATA_DIR=\"data\" # path to the databases\n",
" OUTPUT_DIR=\"helixfold_output\" # path to save the outputs\n",
" log_dir=\"demo_log\" # path to log file\n",
"\n",
" distributed_args=\"--run_mode=collective --log_dir=${log_dir}\"\n",
" python -m paddle.distributed.launch ${distributed_args} \\\n",
" --gpus=\"0,1,2,3,4,5,6,7\" \\\n",
" run_helixfold.py \\\n",
" --distributed \\\n",
" --dap_degree 8 \\\n",
" --fasta_paths=${fasta_file} \\\n",
" --data_dir=${DATA_DIR} \\\n",
" --bfd_database_path=${DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \\\n",
" --small_bfd_database_path=${DATA_DIR}/small_bfd/bfd-first_non_consensus_sequences.fasta \\\n",
" --uniclust30_database_path=${DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \\\n",
" --uniref90_database_path=${DATA_DIR}/uniref90/uniref90.fasta \\\n",
" --mgnify_database_path=${DATA_DIR}/mgnify/mgy_clusters_2018_12.fa \\\n",
" --pdb70_database_path=${DATA_DIR}/pdb70/pdb70 \\\n",
" --template_mmcif_dir=${DATA_DIR}/pdb_mmcif/mmcif_files \\\n",
" --obsolete_pdbs_path=${DATA_DIR}/pdb_mmcif/obsolete.dat \\\n",
" --max_template_date=2020-05-14 \\\n",
" --model_names=${model_name} \\\n",
" --output_dir=${OUTPUT_DIR} \\\n",
" --preset='reduced_dbs' \\\n",
" --seed 2022 \\\n",
" --jackhmmer_binary_path /opt/conda/envs/helixfold/bin/jackhmmer \\\n",
" --hhblits_binary_path /opt/conda/envs/helixfold/bin/hhblits \\\n",
" --hhsearch_binary_path /opt/conda/envs/helixfold/bin/hhsearch \\\n",
" --kalign_binary_path /opt/conda/envs/helixfold/bin/kalign \\\n",
" --random_seed=0\n",
" ```\n",
"You can use `python run_helixfold.py -h` to find the description of the arguments.\n",
"\n",
"### Running PP-HelixFold for CASP14 Demo\n",
"\n",
"For convenience, we also provide a demo script `gpu_infer.sh` for some CASP14 proteins under folder `demo_data/casp14_demo`. To run them, you just need to execute following command:\n",
"\n",
"```bash\n",
"sh gpu_infer.sh T1026\n",
"```\n",
"\n",
"Note that such demo for T1026 and T1037 can work without downloading large MSA datasets, only model parameters are required.\n",
"\n",
"\n",
"## 5. Related papers and citations\n",
"\n",
"If you use the code or data in this repos, please cite:\n",
"\n",
"```bibtex\n",
"@article{AlphaFold2021,\n",
" author={Jumper, John and Evans, Richard and Pritzel, Alexander and Green, Tim and Figurnov, Michael and Ronneberger, Olaf and Tunyasuvunakool, Kathryn and Bates, Russ and {\\v{Z}}{\\'\\i}dek, Augustin and Potapenko, Anna and Bridgland, Alex and Meyer, Clemens and Kohl, Simon A A and Ballard, Andrew J and Cowie, Andrew and Romera-Paredes, Bernardino and Nikolov, Stanislav and Jain, Rishub and Adler, Jonas and Back, Trevor and Petersen, Stig and Reiman, David and Clancy, Ellen and Zielinski, Michal and Steinegger, Martin and Pacholska, Michalina and Berghammer, Tamas and Bodenstein, Sebastian and Silver, David and Vinyals, Oriol and Senior, Andrew W and Kavukcuoglu, Koray and Kohli, Pushmeet and Hassabis, Demis},\n",
" journal={Nature},\n",
" title={Highly accurate protein structure prediction with {AlphaFold}},\n",
" year={2021},\n",
" volume={596},\n",
" number={7873},\n",
" pages={583--589},\n",
" doi={10.1038/s41586-021-03819-2}\n",
"}\n",
"\n",
"@article{wang2022helixfold,\n",
" title={HelixFold: An Efficient Implementation of AlphaFold2 using PaddlePaddle},\n",
" author={Wang, Guoxia and Fang, Xiaomin and Wu, Zhihua and Liu, Yiqun and Xue, Yang and Xiang, Yingfei and Yu, Dianhai and Wang, Fan and Ma, Yanjun},\n",
" journal={arXiv preprint arXiv:2207.05477},\n",
" year={2022}\n",
"}\n",
"\n",
"@article{wang2022efficient_alphafold2,\n",
" title={Efficient AlphaFold2 Training using Parallel Evoformer and Branch Parallelism},\n",
" author={Wang, Guoxia and Wu, Zhihua and Fang, Xiaomin and Xiang, Yingfei and Liu, Yiqun and Yu, Dianhai and Ma, Yanjun},\n",
" journal={arXiv preprint arXiv:2211.00235},\n",
" year={2022}\n",
"}\n",
"```\n",
"\n",
"## 6. Copyright\n",
"\n",
"PP-HelixFold code is licensed under the Apache 2.0 License, which is same as AlphaFold. However, we use the AlphaFold parameters pretrained by DeepMind, which are made available for non-commercial use only under the terms of the CC BY-NC 4.0 license.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}