{ "cells": [ { "cell_type": "markdown", "id": "69972530-dd2b-443c-a1e1-6bcebe1c46b9", "metadata": {}, "source": [ "## 1. PP-HelixFold模型简介\n", "\n", "AlphaFold2是一款高精度的蛋白质结构预测模型。PP-HelixFold基于PaddlePaddle框架在GPU和DCU上完整复现了AlphaFold2的训练和推理流程,并进一步提升模型性能与精度。通过与原版AlphaFold2模型和哥伦比亚大学Mohammed AlQuraishi教授团队基于PyTorch复现的OpenFold模型的性能对比测试显示,PP-HelixFold将训练耗时从约11天减少到5.12天,在使用混合并行时只需要2.89 天。在性能大幅度提升的同时,PP-HelixFold从头端到端完整训练可以达到AlphaFold2论文媲美的精度。\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "## 2. 技术创新\n", "\n", "* **分支并行与混合并行策略** PP-HelixFold创新性地提出**分支并行 (Branch Parallelism, BP)** 策略,将不同的网络模型分支放在不同的卡上并行计算,从而在initial training阶段大幅提高了模型并行效率和训练速度。并且,分支并行和已有的**动态轴并行 (Dynamic Axial Parallelism, DAP)** 和**数据并行 (Data Parallelism, DP)** 结合使用,通过BP-DAP-DP三维混合并行,进一步加快了模型的整体训练速度。\n", "\n", "* **算子融合优化技术和张量融合低频次访存技术** 针对AlphaFold2中Gated Self-Attention小算子组合CPU调度开销大、模型参数小、参数个数多的问题,PP-HelixFold将Gated Self-Attention整个模块融合用一个算子实现,将CPU调度开销优化到极致。同时,将数千个小张量融合成一个连续的大张量,模型参数的梯度、优化器状态都相应更新,大幅减少了访存次数、CPU调度开销和显存碎片,从而提升了训练速度。\n", "\n", "* **多维度显存优化方案** 采用Recompute、BFloat16、显存复用、Subbatch(Chunking)等技术,将显存峰值降低到40G以内,同时支持MSA长度为512、ExtraMSA长度为5120、残基序列长度为384的最大模型配置的微调训练,从而解决了模型结构深,中间结果计算量大,ExtraMSAStack输入过长等导致无法训练的问题。\n", "\n", "\n", "## 3. 线上服务\n", "\n", "如果您想免安装直接尝试使用我们的模型,我们还提供了线上服务器[PaddleHelix HelixFold Forecast](https://paddlehelix.baidu.com/app/drug/protein/forecast)。\n", "\n", "\n", "## 4. 环境需求\n", "\n", "为了能复现我们论文中报告的实验结果,需在特定环境下进行实验。\n", "\n", "- python: 3.7\n", "- cuda: 11.6\n", "- cudnn: 8.4.0\n", "- nccl: 2.14.3\n", "\n", "\n", "## 5. 
How to Use the Model\n", "\n", "### Installation\n", "\n", "PP-HelixFold is implemented on [PaddlePaddle](https://github.com/paddlepaddle/paddle).\n", "The Python dependencies installable via `pip` are listed in `requirements.txt`; `openmm==7.5.1` and `pdbfixer`, which PP-HelixFold requires, can only be installed via `conda`. In addition, tools such as `kalign`, [HH-suite](https://github.com/soedinglab/hh-suite), and `jackhmmer` are needed to generate multiple sequence alignment files, and the download scripts require `aria2c`.\n", "\n", "We provide the script `setup_env` to create the `conda` environment and install all required third-party tools. You can change the environment name and the CUDA version in `setup_env`. Run:\n", "```bash\n", "git clone https://github.com/PaddlePaddle/PaddleHelix.git # download PaddleHelix\n", "cd /apps/protein_folding/helixfold\n", "wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/HelixFold/paddlepaddle_gpu-0.0.0.post116-cp37-cp37m-linux_x86_64.whl\n", "sh setup_env\n", "conda activate helixfold # activate the conda environment\n", "```\n", "Note: if the Python 3 or CUDA version in your environment does not match the Paddle wheel we provide, please refer to [this page](https://www.paddlepaddle.org.cn/whl/linux/gpu/develop.html) to download and install a matching PaddlePaddle `dev` package.\n", "\n", "To enable the DAP/BP/DP-DAP-BP modes at runtime, you also need to install `ppfleetx`. See [here](https://github.com/PaddlePaddle/PaddleFleetX/tree/release/2.4/projects/protein_folding) for more details.\n", "```bash\n", "wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/HelixFold/ppfleetx-0.0.0-py3-none-any.whl\n", "python -m pip install ppfleetx-0.0.0-py3-none-any.whl # install ppfleetx\n", "```\n", "\n", "### Usage\n", "\n", "Before running PP-HelixFold, you first need to download the required databases and pretrained model parameters.\n", "\n", "As with the original AlphaFold2, you can run the script `scripts/download_all_data.sh` to download all required databases and pretrained parameter files:\n", "\n", "* Default option:\n", "\n", " ```bash\n", " scripts/download_all_data.sh