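- `Introduction to Tensor <./tensor_introduction_cn.html>`_ : How data is represented in Paddle; an introduction to the Tensor concept.
- `Introduction to Broadcasting <./broadcasting_cn.html>`_ : An introduction to the broadcasting concept in Paddle.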
.. toctree::
Paddle 2.0 Basic Concepts
Let's start by studying the basic concepts of PaddlePaddle:
- `Introduction to Tensor <tensor_introduction_en.html>`_ : An introduction to Tensor, the representation of data in Paddle.
- `Broadcasting <./broadcasting_en.html>`_ : An introduction to broadcasting.
.. toctree::
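- `Introduction to Paddle 2.0 Basic Concepts <./basic_concept/index_cn.html>`_ : An introduction to the basic concepts of the Paddle 2.0 framework.
- `Paddle 2.0 beta Upgrade Guide <./upgrade_guide_cn.html>`_: Describes the main changes in the open-source Paddle 2.0 beta framework and how to upgrade.
- `Version Migration Tool <./migration_cn.html>`_: Describes how to use the paddle1to2 conversion tool.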
.. toctree::
Paddle 2 Introduction
An introduction to Paddle 2.
For more information, see these pages:
- `Paddle 2 basic concepts <./basic_concept/index_en.html>`_ : An introduction to the basic concepts of Paddle 2.
- `Migration tools <./migration_en.html>`_ : How to use the migration tools to upgrade your code.
.. toctree::
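Data preprocessing (vision + text)
Data loading (Dataset + DataLoader, introduction to built-in datasets)
Model building (paddle.nn + paddle.nn.functional, introduction to Model, introduction to built-in models)
Training and prediction (model.fit, evaluate, predict; a step-by-step breakdown of fit, evaluate, and predict)
Single-machine multi-GPU (training + prediction)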
VisualDL Tools
.. toctree::
:maxdepth: 1
VisualDL Tools
.. toctree::
:maxdepth: 1
# Introduction to VisualDL
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/vdl-logo.png" width="70%"/>
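VisualDL supports the following browsers: Chrome (81 and 83), Safari 13, Firefox (77 and 78), and Edge (Chromium version).
VisualDL natively supports Python. Adding a few lines of code to a model's Python configuration provides rich visualization support for the training process.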
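## Contents
* [Key Highlights](#核心亮点)
* [Installation](#安装方式)
* [Usage](#使用方式)
* [Overview of Visualization Features](#可视化功能概览)
* [Open-Source Contributions](#开源贡献)
* [More Details](#更多细节)
* [Community](#技术交流)
## Key Highlights
### Easy to Use
### Feature-Rich
### Highly Compatible
### Comprehensive Support
## Installation
### Install with pip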
pip install --upgrade --pre visualdl
### Install from source
git clone https://github.com/PaddlePaddle/VisualDL.git
cd VisualDL
python setup.py bdist_wheel
pip install --upgrade dist/visualdl-*.whl
## Usage
### 1. Record logs
VisualDL's backend provides a Python SDK. You can create a customized log recorder through LogWriter, whose interface is as follows:
class LogWriter(logdir=None,
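#### Interface parameters
| Parameter       | Type    | Description                                                  |
| --------------- | ------- | ------------------------------------------------------------ |
| logdir          | string  | Path where the log files are stored. VisualDL creates the log file under this path and records to it. If omitted, it defaults to `runs/${CURRENT_TIME}` |
| comment         | string  | Suffix added to the log folder name; ignored if logdir is specified |
| max_queue       | int     | Maximum capacity of the log message queue; when this capacity is reached, the queue is written to the log file immediately |
| flush_secs      | int     | Maximum time that log messages are buffered in the queue; when this time is reached, the queue is written to the log file immediately |
| filename_suffix | string  | Suffix added to the default log file name                    |
| write_to_disk   | boolean | Whether to write to disk                                     |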
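The interaction between `max_queue` and `flush_secs` described in the table can be pictured with a small stand-in. This is a minimal sketch, not VisualDL's implementation: the class `BufferedRecorder`, its default values, and the in-memory `written` list (standing in for the log file) are all hypothetical.

```python
import time


class BufferedRecorder:
    """Hypothetical buffer that flushes on size or age, whichever comes first."""

    def __init__(self, max_queue=10, flush_secs=120, clock=time.monotonic):
        # Default values here are arbitrary, not VisualDL's defaults.
        self.max_queue = max_queue
        self.flush_secs = flush_secs
        self._clock = clock
        self._queue = []
        self._last_flush = clock()
        self.written = []  # stands in for the log file on disk

    def add(self, record):
        self._queue.append(record)
        queue_full = len(self._queue) >= self.max_queue
        too_old = self._clock() - self._last_flush >= self.flush_secs
        if queue_full or too_old:
            self.flush()

    def flush(self):
        # Move everything buffered so far to the "file" and reset the timer.
        self.written.extend(self._queue)
        self._queue.clear()
        self._last_flush = self._clock()


recorder = BufferedRecorder(max_queue=3, flush_secs=1000)
recorder.add("step 1")
recorder.add("step 2")
pending = list(recorder.written)   # nothing flushed yet
recorder.add("step 3")             # queue is full, so all three are written
```

A smaller `max_queue` or `flush_secs` trades write efficiency for fresher data in the panel.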
#### Example
from visualdl import LogWriter
# Create the log file under `./log/scalar_test/train`
with LogWriter(logdir="./log/scalar_test/train") as writer:
    # Use the scalar component to record scalar data
    writer.add_scalar(tag="acc", step=1, value=0.5678)
    writer.add_scalar(tag="acc", step=2, value=0.6878)
    writer.add_scalar(tag="acc", step=3, value=0.9878)
### 2. Launch the panel
#### Launch from the command line
visualdl --logdir <dir_1, dir_2, ... , dir_n> --host <host> --port <port> --cache-timeout <cache_timeout> --language <language> --public-path <public_path> --api-only
| Parameter       | Description                                                  |
| --------------- | ------------------------------------------------------------ |
| --logdir        | Directory containing the logs. Multiple directories can be specified; VisualDL walks through the given directories and their subdirectories and visualizes all experiment results |
| --model         | Path to the model file (not a folder). VisualDL visualizes the model file at this path. Model structures from PaddlePaddle, ONNX, Keras, Core ML, Caffe, and others are currently supported; for details, see [model types supported by graph](https://github.com/PaddlePaddle/VisualDL/blob/develop/docs/components/README.md#Graph--%E7%BD%91%E7%BB%9C%E7%BB%93%E6%9E%84%E7%BB%84%E4%BB%B6) |
| --host          | IP address; defaults to ``                                   |
| --port          | Port; defaults to `8040`                                     |
| --cache-timeout | Backend cache time. Within this time, repeated frontend requests for the same URL are served from the cache; defaults to 20 seconds |
| --language      | Language of the VisualDL panel, either 'EN' or 'ZH'; defaults to the browser language |
| --public-path   | URL path of the VisualDL panel; defaults to '/app', so the panel is served at 'http://&lt;host&gt;:&lt;port&gt;/app' |
| --api-only      | Whether to serve only the API. If set, VisualDL does not serve the pages and provides only the API service, at 'http://&lt;host&gt;:&lt;port&gt;/&lt;public_path&gt;/api'; if public_path is not set, the address defaults to 'http://&lt;host&gt;:&lt;port&gt;/api' |
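When the panel is launched from automation, the flags in the table can be assembled programmatically. The helper below is a hypothetical sketch: `build_visualdl_command` is not part of VisualDL, it only builds an argument list from the flag names documented above, and it takes a single log directory to avoid assuming how several directories are separated.

```python
import shlex


def build_visualdl_command(logdir, host=None, port=None, cache_timeout=None,
                           language=None, public_path=None, api_only=False):
    """Build an argv list for the `visualdl` CLI (hypothetical helper)."""
    cmd = ["visualdl", "--logdir", str(logdir)]
    options = [
        ("--host", host),
        ("--port", port),
        ("--cache-timeout", cache_timeout),
        ("--language", language),
        ("--public-path", public_path),
    ]
    for flag, value in options:
        # Only emit flags the caller actually set, so CLI defaults still apply.
        if value is not None:
            cmd += [flag, str(value)]
    if api_only:
        cmd.append("--api-only")
    return cmd


cmd = build_visualdl_command("./log", port=8080, language="EN")
print(shlex.join(cmd))  # visualdl --logdir ./log --port 8080 --language EN
```

The resulting list can be passed to `subprocess.run(cmd)` once VisualDL is installed.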
visualdl --logdir ./log
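#### Launch from a Python script
| Parameter     | Type                                              | Description                                                  |
| ------------- | ------------------------------------------------- | ------------------------------------------------------------ |
| logdir        | string or list[string_1, string_2, ... , string_n] | Path where the log files are stored. VisualDL recursively searches this path for log files and visualizes them; one or more paths can be specified |
| model         | string                                            | Path to the model file (not a folder). VisualDL visualizes the model file at this path |
| host          | string                                            | IP address on which to start the service; defaults to ``     |
| port          | int                                               | Port on which to start the service; defaults to `8040`       |
| cache_timeout | int                                               | Backend cache time. Within this time, repeated frontend requests for the same URL are served from the cache; defaults to 20 seconds |
| language      | string                                            | Language of the VisualDL panel, either 'en' or 'zh'; defaults to the browser language |
| public_path   | string                                            | URL path of the VisualDL panel; defaults to '/app', so the panel is served at 'http://<host>:<port>/app' |
| api_only      | boolean                                           | Whether to serve only the API. If set, VisualDL does not serve the pages and provides only the API service, at 'http://<host>:<port>/<public_path>/api'; if public_path is not set, the address defaults to 'http://<host>:<port>/api' |
| open_browser  | boolean                                           | Whether to open a browser. If True, a browser opens the VisualDL panel automatically after startup; ignored if api_only is set |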
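The address rules for `public_path` and `api_only` in the table can be written out as a small function. This is only a sketch of the rule as the table states it; `panel_url` is a hypothetical helper, not a VisualDL function, and `localhost` is an illustrative host.

```python
def panel_url(host, port, public_path=None, api_only=False):
    """Return the address implied by the public_path / api_only rules above."""
    if api_only:
        # API-only mode: <public_path>/api, or just /api when no path is set.
        prefix = "" if public_path is None else "/" + public_path.strip("/")
        return f"http://{host}:{port}{prefix}/api"
    # Normal mode: the panel lives under public_path, '/app' by default.
    path = "/app" if public_path is None else "/" + public_path.strip("/")
    return f"http://{host}:{port}{path}"


default_url = panel_url("localhost", 8040)
api_url = panel_url("localhost", 8040, api_only=True)
custom_api_url = panel_url("localhost", 8040, public_path="/vdl", api_only=True)
```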
from visualdl.server import app
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/82786044-67ae9880-9e96-11ea-8a2b-3a0951a6ec19.png" width="60%"/>
## Overview of Visualization Features
### Scalar
#### Dynamic display
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/dynamic_display.gif" width="60%"/>
#### Multi-experiment comparison
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/multi_experiments.gif" width="100%"/>
### Image
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/image-eye.gif" width="60%"/>
### Audio
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/89017647-38605000-d34d-11ea-9d75-7d10b9854c36.gif" width="100%"/>
### Graph
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84483052-5acdd980-accb-11ea-8519-1608da7ee698.png" width="100%"/>
### Histogram
- Offset mode
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86551031-86647c80-bf76-11ea-8ec2-8c86826c8137.png" width="100%"/>
- Overlay mode
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86551033-882e4000-bf76-11ea-8e6a-af954c662ced.png" width="100%"/>
### PR Curve
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86738774-ee46c000-c067-11ea-90d2-a98aac445cca.png" width="100%"/>
### High Dimensional
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/high_dimensional_test.png" width="100%"/>
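## Open-Source Contributions
VisualDL is an open-source project jointly launched by [PaddlePaddle](https://www.paddlepaddle.org/) and [ECharts](https://echarts.apache.org/).
The Graph features are powered by [Netron](https://github.com/lutzroeder/netron).
## More Details
## Community
You are welcome to join the official VisualDL QQ group (1045783368) to discuss VisualDL with the Paddle team and other users.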
# Introduction to VisualDL Toolset
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/docs/images/vs-logo.png" width="60%" />
## Introduction
VisualDL is a deep learning visualization tool that helps with designing deep learning jobs.
It includes features such as scalar, parameter distribution, model structure, and image visualization.
It is under active development, and new features are added continuously.
At present, most DNN frameworks use Python as their primary language, and VisualDL supports Python natively.
Users can get rich visualization results by simply adding a few lines of Python code to their model before training.
Besides the Python SDK, the low-level part of VisualDL is written in C++. It also provides a C++ SDK that
can be integrated into other platforms.
## Components
VisualDL provides the following components:
- scalar
- histogram
- image
- audio
- graph
- high dimensional
### Scalar
Scalar can be used to show error trends during training.
<p align="center">
<img src="https://raw.githubusercontent.com/daming-lu/large_files/master/loss_scalar.gif" width="60%"/>
### Histogram
Histogram can be used to visualize parameter distribution and trends for any tensor.
<p align="center">
<img src="https://raw.githubusercontent.com/daming-lu/large_files/master/histogram.gif" width="60%"/>
### Image
Image can be used to visualize any tensor or intermediate generated image.
<p align="center">
<img src="https://raw.githubusercontent.com/daming-lu/large_files/master/loss_image.gif" width="60%"/>
### Audio
Audio can be used to play input audio samples or generated audio samples.
### Graph
VisualDL graph supports displaying Paddle models and is also compatible with ONNX ([Open Neural Network Exchange](https://github.com/onnx/onnx)).
Together with the Python SDK, VisualDL is compatible with most major DNN frameworks, including
PaddlePaddle, PyTorch, and MXNet.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/docs/images/graph_demo.gif" width="60%" />
To display a Paddle model, all you have to do is:
1. Call the `fluid.io.save_inference_model()` interface to save the Paddle model.
2. Run `visualdl --model_pb [paddle_model_dir]` on the command line to load the Paddle model.
### High Dimensional
High Dimensional can be used to visualize data embeddings by projecting high-dimensional data into 2D / 3D.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/docs/getting_started/high_dimensional_3d.png" width="60%"/>
## Quick Start
To give VisualDL a quick test, use the following commands.
# Install VisualDL, preferably in a virtual environment or Anaconda.
pip install --upgrade visualdl
# Run a demo. vdl_create_scratch_log creates logs for testing.
visualdl --logdir=scratch_log --port=8080
# visit
If you encounter the error `TypeError: __init__() got an unexpected keyword argument 'file'`, the installed protobuf version is older than 3.5. Running `pip install --upgrade protobuf` fixes the issue.
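Before upgrading, it can help to confirm which protobuf version is actually installed. The following is a minimal sketch using only the standard library; the helper names `version_tuple` and `meets_minimum` are hypothetical, and the check compares only the leading numeric parts of the version string.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '3.5.1' into (3, 5, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(version, minimum=(3, 5)):
    """Return True when the version is at least the 3.5 mentioned above."""
    return version_tuple(version) >= minimum


try:
    installed = metadata.version("protobuf")
    status = "ok" if meets_minimum(installed) else "too old, upgrade needed"
    print(f"protobuf {installed}: {status}")
except metadata.PackageNotFoundError:
    print("protobuf is not installed")
```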
If you run into any other issues in the steps above, they may be caused by environment differences between Python or pip versions.
The following installation methods might fix them.
## Install with Virtualenv
[Virtualenv](https://virtualenv.pypa.io/en/stable/) creates an isolated Python environment that prevents interference
from other Python programs on the same machine and makes sure Python and pip are located properly.
On macOS, install pip and virtualenv by:
sudo easy_install pip
pip install --upgrade virtualenv
On Linux, install pip and virtualenv by:
sudo apt-get install python3-pip python3-dev python-virtualenv
Then create a Virtualenv environment with one of the following commands:
virtualenv ~/vdl # for Python2.7
virtualenv -p python3 ~/vdl # for Python 3.x
```~/vdl``` will be your Virtualenv directory; you may choose to install it anywhere.
Activate your Virtualenv environment by:
source ~/vdl/bin/activate
Now you should be able to install VisualDL and run our demo:
pip install --upgrade visualdl
# run a demo, vdl_create_scratch_log will create logs for testing.
visualdl --logdir=scratch_log --port=8080
# visit
If you still have issues installing VisualDL in Virtualenv, try the following installation method.
## Install with Anaconda
Anaconda is a Python distribution with installation and package management tools. It is also an environment manager
that lets you create different Python environments, each with its own settings.
Follow the instructions on the [Anaconda download site](https://www.anaconda.com/download) to download and install Anaconda.
Download the Python 3.6 version of the command-line installer.
Create a conda environment named ```vdl```, or any name you like, with:
conda create -n vdl pip python=2.7 # or python=3.3, etc.
Activate the conda environment by:
source activate vdl
Now you should be able to install VisualDL and run our demo:
pip install --upgrade visualdl
# run a demo, vdl_create_scratch_log will create logs for testing.
visualdl --logdir=scratch_log --port=8080
# visit
If you still have issues installing VisualDL, try installing from source as described in the following section.
### Install from source
#Preferably under a virtualenv or anaconda.
git clone https://github.com/PaddlePaddle/VisualDL.git
cd VisualDL
python setup.py bdist_wheel
pip install --upgrade dist/visualdl-*.whl
If there are still issues regarding the ```pip install```, you can still start Visual DL by starting the dev server
## SDK
VisualDL provides both a Python SDK and a C++ SDK to fit more use cases.
### Python SDK
VisualDL now supports both Python 2 and Python 3.
Below is an example of creating a simple Scalar component and inserting data at different steps:
import random
from visualdl import LogWriter
logdir = "./tmp"
logger = LogWriter(logdir, sync_cycle=10000)
# mark the components with 'train' label.
with logger.mode("train"):
    # create a scalar component called 'scalars/scalar0'
    scalar0 = logger.scalar("scalars/scalar0")
# add some records during DL model running.
for step in range(100):
    scalar0.add_record(step, random.random())
### C++ SDK
Here is the C++ SDK example equivalent to the Python SDK example above:
#include <cstdlib>
#include <string>
#include "visualdl/logic/sdk.h"
namespace vs = visualdl;
namespace cp = visualdl::components;
int main() {
  const std::string dir = "./tmp";
  vs::LogWriter logger(dir, 10000);
  auto tablet = logger.AddTablet("scalars/scalar0");
  cp::Scalar<float> scalar0(tablet);
  for (int step = 0; step < 1000; step++) {
    float v = (float)std::rand() / RAND_MAX;
    scalar0.AddRecord(step, v);
  }
  return 0;
}
## Launch VisualDL
After some logs have been generated during training, users can launch the VisualDL application to see real-time data visualization with:
visualdl --logdir <some log dir>
VisualDL also supports the following optional parameters:
- `--host`: set the IP address
- `--port`: set the port
- `-m / --model_pb`: specify an ONNX-format model file to view its graph
### Contribute
VisualDL was initially created by [PaddlePaddle](http://www.paddlepaddle.org/) and [ECharts](https://echarts.apache.org/).
We welcome everyone to use, comment on, and contribute to VisualDL :)
## More details
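For more details about how to use VisualDL, please take a look at the [documents](https://github.com/PaddlePaddle/VisualDL/tree/develop/demo).
# VisualDL User Guide
### Overview
VisualDL is a visualization tool designed for deep learning tasks. It uses a rich set of charts to present data so that users can view data characteristics and trends more intuitively and clearly, which helps in analyzing data, spotting errors early, and improving the design of neural network models.
Currently, VisualDL supports seven components: scalar, image, audio, graph, histogram, pr curve, and high dimensional. The project is iterating rapidly, so stay tuned for new components.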
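| Component | Chart type | Purpose |
| :-------------------------------------------------: | :--------: | :----------------------------------------------------------- |
| [ Scalar](#Scalar--标量组件) | Line chart | Dynamically displays scalar data such as loss values and accuracy |
| [Image](#Image--图片可视化组件) | Image visualization | Displays images, both input images and processed results, making it easy to see how intermediate steps change |
| [Audio](#Audio--音频播放组件) | Audio playback | Plays audio data from the training process to monitor training for tasks such as speech recognition and synthesis |
| [Graph](#Graph--网络结构组件) | Network structure | Shows the network structure, node attributes, and data flow to help in studying and optimizing the network structure |
| [Histogram](#Histogram--直方图组件) | Histogram | Shows the distribution of tensors such as weights and gradients during training |
| [PR Curve](#PR-Curve--PR曲线组件) | Line chart | Shows the trade-off between precision and recall, making it easy to choose the best threshold |
| [High Dimensional](#High-Dimensional--数据降维组件) | Dimensionality reduction | Maps high-dimensional data into 2D/3D space to visualize embeddings, making it easy to observe correlations between data |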
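## Scalar -- Line Chart Component
### Introduction
The Scalar component takes scalar input data and presents training parameters as line charts. Pass scalar data such as loss values and accuracy to the scalar component to draw a line chart and observe trends.
### Recording interface
The recording interface of the Scalar component is as follows: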
add_scalar(tag, value, step, walltime=None)
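| Parameter | Type   | Description                                                   |
| --------- | ------ | ------------------------------------------------------------- |
| tag       | string | Label of the recorded metric, such as `train/loss`; must not contain `%` |
| value     | float  | Data value to record                                          |
| step      | int    | Step number of the record                                     |
| walltime  | int    | Timestamp of the record; defaults to the current timestamp    |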
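The constraints in the parameter table (no `%` in the tag, `walltime` falling back to the current time) can be checked before data ever reaches the recorder. The helper below is a hypothetical sketch and not part of VisualDL; it assumes the timestamp is in whole seconds.

```python
import time


def make_scalar_record(tag, value, step, walltime=None):
    """Validate and normalize one scalar record (hypothetical helper)."""
    if "%" in tag:
        raise ValueError(f"tag must not contain '%': {tag!r}")
    if walltime is None:
        # Assumption: walltime is a Unix timestamp in seconds.
        walltime = int(time.time())
    return {"tag": tag, "value": float(value), "step": int(step), "walltime": walltime}


record = make_scalar_record("train/loss", 0.25, step=3, walltime=1600000000)
```

Rejecting a bad tag early gives a clearer error than a failure deep inside the logging path.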
### Demo
- Basic usage
The following example records data with the Scalar component. For the code file, see [Scalar component](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/components/scalar_test.py)
from visualdl import LogWriter
if __name__ == '__main__':
    value = [i/1000.0 for i in range(1000)]
    # Initialize a recorder
    with LogWriter(logdir="./log/scalar_test/train") as writer:
        for step in range(1000):
            # Add data with the tag `acc` to the recorder
            writer.add_scalar(tag="acc", step=step, value=value[step])
            # Add data with the tag `loss` to the recorder
            writer.add_scalar(tag="loss", step=step, value=1/(value[step] + 1))
visualdl --logdir ./log --port 8080
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/82397559-478c6d00-9a83-11ea-80db-a0844dcaca35.png" width="100%"/>
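- Comparing multiple experiments
1. Create a sub-log directory to store the parameter data of each experiment
2. When writing data to the scalar component, **use the same tag** to compare the **same type of parameter** across **different experiments**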
from visualdl import LogWriter
if __name__ == '__main__':
    value = [i/1000.0 for i in range(1000)]
    # Step 1: create the parent folder `log` and the subfolder `scalar_test`
    with LogWriter(logdir="./log/scalar_test") as writer:
        for step in range(1000):
            # Step 2: add data with the tag `train/acc` to the recorder
            writer.add_scalar(tag="train/acc", step=step, value=value[step])
            # Step 2: add data with the tag `train/loss` to the recorder
            writer.add_scalar(tag="train/loss", step=step, value=1/(value[step] + 1))
    # Step 1: create the second subfolder `scalar_test2`
    value = [i/500.0 for i in range(1000)]
    with LogWriter(logdir="./log/scalar_test2") as writer:
        for step in range(1000):
            # Step 2: add the accuracy data of scalar_test2 under the same tag `train/acc`
            writer.add_scalar(tag="train/acc", step=step, value=value[step])
            # Step 2: add the loss data of scalar_test2 under the same tag `train/loss`
            writer.add_scalar(tag="train/loss", step=step, value=1/(value[step] + 1))
visualdl --logdir ./log --port 8080
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84644158-5efb3080-af31-11ea-8e64-bbe4078425f4.png" width="100%"/>
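*For an application of multi-experiment comparison, see the AI Studio project [VisualDL 2.0--眼疾识别训练可视化](https://aistudio.baidu.com/aistudio/projectdetail/502834) (VisualDL 2.0: visualizing eye disease recognition training)
### Feature guide
* Data cards support "maximize", "restore", "axis conversion" (logarithmic y-axis), and "download" of the line chart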
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/scalar-icon.png" width="55%"/>
* Hovering over a data point shows detailed information
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/scalar-tooltip.png" width="60%"/>
* Card tags can be searched to show the target charts
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/scalar-searchlabel.png" width="90%"/>
* Data stream tags can be searched to show specific data
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/scalar-searchstream.png" width="40%"/>
* The X-axis has three scales
1. Step: number of iterations
2. Walltime: absolute training time
3. Relative: elapsed training time
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/x-axis.png" width="40%"/>
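The three X-axis scales differ only in how each record's position is derived. The sketch below is illustrative and not VisualDL code: `x_axis_values` is a hypothetical function, and it assumes that Relative is measured from the walltime of the first record.

```python
def x_axis_values(records, scale="step"):
    """records is a list of (step, walltime) pairs; returns the X positions."""
    if scale == "step":
        return [step for step, _ in records]
    if scale == "walltime":
        return [walltime for _, walltime in records]
    if scale == "relative":
        if not records:
            return []
        # Assumption: elapsed time is counted from the first record.
        start = records[0][1]
        return [walltime - start for _, walltime in records]
    raise ValueError(f"unknown scale: {scale!r}")


records = [(0, 1600000000), (1, 1600000030), (2, 1600000090)]
by_step = x_axis_values(records, "step")
by_walltime = x_axis_values(records, "walltime")
by_relative = x_axis_values(records, "relative")
```

Relative is usually the most convenient scale for comparing runs that started at different times.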
* Curve smoothing can be adjusted to better show the overall trend of a parameter
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/scalar-smooth.png" width="37%"/>
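One common way to smooth a training curve is an exponential moving average. The sketch below assumes that technique purely for illustration; the source does not say which smoothing method the panel uses, and `smooth` with its `weight` parameter is a hypothetical name.

```python
def smooth(values, weight=0.6):
    """Exponential moving average; a larger weight gives a smoother curve."""
    if not 0.0 <= weight < 1.0:
        raise ValueError("weight must be in [0, 1)")
    smoothed = []
    last = None
    for value in values:
        # The first point is kept as is; later points blend with the history.
        last = value if last is None else weight * last + (1 - weight) * value
        smoothed.append(last)
    return smoothed


raw = [0.0, 1.0, 0.0, 1.0]
half = smooth(raw, weight=0.5)
unsmoothed = smooth(raw, weight=0.0)
```

With `weight=0.0` the curve is returned unchanged, which matches turning smoothing off.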
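## Image -- Image Visualization Component
### Introduction
The Image component shows how image data changes during training. Pass image data to the Image component during model training to view the images on the VisualDL frontend page.
### Recording interface
The recording interface of the Image component is as follows: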
add_image(tag, img, step, walltime=None)
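| Parameter | Type          | Description                                                   |
| --------- | ------------- | ------------------------------------------------------------- |
| tag       | string        | Label of the recorded metric, such as `train/loss`; must not contain `%` |
| img       | numpy.ndarray | Image represented as an ndarray                               |
| step      | int           | Step number of the record                                     |
| walltime  | int           | Timestamp of the record; defaults to the current timestamp    |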
### Demo
The following example records data with the Image component. For the code file, see [Image component](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/components/image_test.py)
import numpy as np
from PIL import Image
from visualdl import LogWriter
def random_crop(img):
    """Get a random 100x100 crop of the image."""
    img = Image.open(img)
    w, h = img.size
    random_w = np.random.randint(0, w - 100)
    random_h = np.random.randint(0, h - 100)
    r = img.crop((random_w, random_h, random_w + 100, random_h + 100))
    return np.asarray(r)
if __name__ == '__main__':
    # Initialize a recorder
    with LogWriter(logdir="./log/image_test/train") as writer:
        for step in range(6):
            # Add one image record; the tag and "./example.jpg" are placeholders
            writer.add_image(tag="image", img=random_crop("./example.jpg"), step=step)
visualdl --logdir ./log --port 8080
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/image-static.png" width="100%"/>
### Feature guide
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/image-search.png" width="90%"/>
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/image-eye.gif" width="60%"/>
## Audio -- Audio Playback Component
### Introduction
### Recording interface
The recording interface of the Audio component is as follows:
add_audio(tag, audio_array, step, sample_rate)
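| Parameter   | Type          | Description                                                  |
| ----------- | ------------- | ------------------------------------------------------------ |
| tag         | string        | Label of the recorded metric, such as `audio_tag`; must not contain `%` |
| audio_array | numpy.ndarray | Audio represented as an ndarray                              |
| step        | int           | Step number of the record                                    |
| sample_rate | int           | Sample rate; **make sure to pass the original sample rate of the audio** |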
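Because `sample_rate` must match the original audio, it is safer to read it from the WAV header than to hard-code it. The sketch below uses only the standard-library `wave` module; `write_silence` and `wav_sample_rate` are hypothetical helper names, and the temporary file exists only to make the example self-contained.

```python
import os
import tempfile
import wave


def write_silence(path, sample_rate, seconds=1):
    """Write a mono 16-bit WAV file of silence, for demonstration only."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * sample_rate * seconds)


def wav_sample_rate(path):
    """Read the sample rate stored in the WAV header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


with tempfile.TemporaryDirectory() as tmp_dir:
    wav_path = os.path.join(tmp_dir, "silence.wav")
    write_silence(wav_path, sample_rate=8000)
    detected = wav_sample_rate(wav_path)
print(detected)  # 8000
```

The detected value can then be passed as `sample_rate` when recording the audio.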
### Demo
The following example records data with the Audio component. For the code file, see [Audio component](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/components/audio_test.py)
from visualdl import LogWriter
import numpy as np
import wave
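def read_audio_data(audio_path):
    """Get audio data."""
    CHUNK = 4096
    f = wave.open(audio_path, "rb")
    wavdata = []
    chunk = f.readframes(CHUNK)
    while chunk:
        data = np.frombuffer(chunk, dtype='uint8')
        # Collect each decoded chunk; without this the result stays empty
        wavdata.extend(data)
        chunk = f.readframes(CHUNK)
    f.close()
    # 8k sample rate, 16bit frame, 1 channel
    shape = [8000, 2, 1]
    return shape, wavdata
if __name__ == '__main__':
    with LogWriter(logdir="./log") as writer:
        audio_shape, audio_data = read_audio_data("./testing.wav")
        audio_data = np.array(audio_data)
        # Record the audio, following the add_audio signature above
        writer.add_audio(tag="audio_tag", audio_array=audio_data, step=0, sample_rate=audio_shape[0])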
visualdl --logdir ./log --port 8080
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/87659138-b4746880-c78f-11ea-965b-c33804e7c296.png" width="100%"/>
### Feature guide
- Audio tags can be searched to show the corresponding audio data
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/87661431-29956d00-c793-11ea-833b-172d8fc1b221.png" width="100%"/>
- Slide the Step/iteration control to listen to audio from different iterations
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/87661089-a07e3600-c792-11ea-8740-cbe99a64d830.png" width="60%"/>
- Play and pause audio
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/87661130-b3910600-c792-11ea-9f9f-2ae66132e9de.png" width="60%"/>
- Adjust the volume
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/87661497-49c52c00-c793-11ea-9eeb-471543cd2a0b.png" width="60%"/>
- Download audio
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/87661166-c277b880-c792-11ea-8ad7-5c60bb08379b.png" width="60%"/>
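## Graph -- Network Structure Component
### Introduction
### Demo
- Upload a model file by drag and drop in the frontend:
- If you only need the Graph component, no arguments are required; run `visualdl` on the command line to launch the panel and upload the model.
- If you also need other features, specify the log file path on the command line (`./log` in this example) to launch the panel and upload the model: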
visualdl --logdir ./log --port 8080
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84487396-44c31780-acd1-11ea-831a-1632e636613d.png" width="80%"/>
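- Launch Graph from the backend:
- Add the `--model` argument on the command line and specify the path of the **model file** (not a folder) to launch the panel and view the network structure visualization: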
visualdl --model ./log/model --port 8080
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84490149-51e20580-acd5-11ea-9663-1f156892c0e0.png" width="100%"/>
### Feature guide
- One-click model upload
- Supported model formats: PaddlePaddle, ONNX, Keras, Core ML, Caffe, Caffe2, Darknet, MXNet, ncnn, TensorFlow Lite
- Experimentally supported model formats: TorchScript, PyTorch, Torch, ArmNN, BigDL, Chainer, CNTK, Deeplearning4j, MediaPipe, ML.NET, MNN, OpenVINO, Scikit-learn, Tengine, TensorFlow.js, TensorFlow
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84487396-44c31780-acd1-11ea-831a-1632e636613d.png" width="80%"/>
- Drag the model in any direction and zoom in or out
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/89163601-6ab9b980-d5a8-11ea-9c6d-2dc5eaed0d41.gif" width="100%"/>
- Search to locate the corresponding node
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84487694-b9965180-acd1-11ea-8214-34f3febc1828.png" width="30%"/>
- Click to view model properties
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84487751-cadf5e00-acd1-11ea-9ce2-4fdfeeea9c5a.png" width="30%"/>
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84487759-d03ca880-acd1-11ea-9294-520ef7f9e0b1.png" width="30%"/>
- Choose which model information to display
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84487829-ee0a0d80-acd1-11ea-8563-6682a15483d9.png" width="23%"/>
- Export the model structure diagram in PNG or SVG format
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84487884-ff531a00-acd1-11ea-8b12-5221db78683e.png" width="30%"/>
- Click a node to show its attribute information
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84487941-13971700-acd2-11ea-937d-42fb524b9ee1.png" width="30%"/>
- Switch models with one click
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/84487998-27db1400-acd2-11ea-83d7-5d75832ef41d.png" width="25%"/>
## Histogram -- Histogram Component
### Introduction
### Recording interface
The recording interface of the Histogram component is as follows:
add_histogram(tag, values, step, walltime=None, buckets=10)
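| Parameter | Type                  | Description                                                   |
| --------- | --------------------- | ------------------------------------------------------------- |
| tag       | string                | Label of the recorded metric, such as `train/loss`; must not contain `%` |
| values    | numpy.ndarray or list | Data represented as an ndarray or list                        |
| step      | int                   | Step number of the record                                     |
| walltime  | int                   | Timestamp of the record; defaults to the current timestamp    |
| buckets   | int                   | Number of segments in the generated histogram; defaults to 10 |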
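To make the `buckets` parameter concrete, the sketch below splits a list of values into equal-width segments and counts how many values fall into each. It is illustrative only: `bucket_counts` is a hypothetical function, and equal-width bucketing between the minimum and maximum is an assumption, not a description of VisualDL's internals.

```python
def bucket_counts(values, buckets=10):
    """Return (edges, counts) for equal-width buckets over the value range."""
    values = list(values)
    if not values or buckets < 1:
        raise ValueError("need at least one value and one bucket")
    low, high = min(values), max(values)
    # Avoid a zero width when every value is identical.
    width = (high - low) / buckets or 1.0
    counts = [0] * buckets
    for value in values:
        # The maximum value belongs to the last bucket, not a new one.
        index = min(int((value - low) / width), buckets - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(buckets + 1)]
    return edges, counts


edges, counts = bucket_counts(range(11), buckets=5)
```

More buckets give a finer view of the distribution at the cost of noisier counts.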
### Demo
The following example records data with the Histogram component. For the code file, see [Histogram component](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/components/histogram_test.py)
from visualdl import LogWriter
import numpy as np
if __name__ == '__main__':
    values = np.arange(0, 1000)
    with LogWriter(logdir="./log/histogram_test/train") as writer:
        for index in range(1, 101):
            interval_start = 1 + 2 * index / 100.0
            interval_end = 6 - 2 * index / 100.0
            data = np.random.uniform(interval_start, interval_end, size=(10000))
            # Remaining arguments follow the add_histogram signature above
            writer.add_histogram(tag='default tag',
                                 values=data,
                                 step=index,
                                 buckets=10)
visualdl --logdir ./log --port 8080
### Feature guide
- Data cards support "maximize" and histogram "download"
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86535351-42d82700-bf12-11ea-89f0-171280e7c526.png" width="60%"/>
- Choose Offset or Overlay mode
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86535413-c134c900-bf12-11ea-9ad6-f0ad8eafa76f.png" width="30%"/>
- Offset mode
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86536435-2b9d3780-bf1a-11ea-9981-92f837d22ae5.png" width="60%"/>
- Overlay mode
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86536458-5ab3a900-bf1a-11ea-985e-05f06c1b762b.png" width="60%"/>
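- Hovering over a data point shows the parameter value, training step, and frequency
- For example, at training step 240, the weight is -0.0031 and it occurs 2734 times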
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86536482-80d94900-bf1a-11ea-9e12-5bea9f382b34.png" width="60%"/>
- Card tags can be searched to show the target histograms
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86536503-baaa4f80-bf1a-11ea-80ab-cd988617d018.png" width="30%"/>
- Data stream tags can be searched to show specific data streams
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86536639-b894c080-bf1b-11ea-9ee5-cf815dd4bbd7.png" width="30%"/>
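## PR Curve -- PR Curve Component
### Introduction
PR Curve presents the trade-off between precision and recall as a line chart, giving a clear and intuitive view of how well the model is trained and making it easy to judge whether the model meets the desired standard.
### Recording interface
The recording interface of the PR Curve component is as follows: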
add_pr_curve(tag, labels, predictions, step=None, num_thresholds=10)
| Parameter      | Type                  | Description                                                   |
| -------------- | --------------------- | ------------------------------------------------------------- |
| tag            | string                | Label of the recorded metric, such as `train/loss`; must not contain `%` |
| labels         | numpy.ndarray or list | Actual classes, as an ndarray or list                         |
| predictions    | numpy.ndarray or list | Predicted values, as an ndarray or list                       |
| step           | int                   | Step number of the record                                     |
| num_thresholds | int                   | Number of thresholds; defaults to 10, with a maximum of 127   |
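To show what `num_thresholds` controls, the sketch below computes precision and recall at evenly spaced thresholds in plain Python. It is a minimal illustration, not VisualDL's computation: `pr_curve_points` is a hypothetical function, and the even spacing over [0, 1], the `>=` comparison, and reporting precision as 1.0 when nothing is predicted positive are all assumptions.

```python
def pr_curve_points(labels, predictions, num_thresholds=10):
    """Return one dict of TP/FP/TN/FN, precision and recall per threshold."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if num_thresholds < 2:
        raise ValueError("num_thresholds must be at least 2")
    points = []
    for i in range(num_thresholds):
        threshold = i / (num_thresholds - 1)
        tp = fp = tn = fn = 0
        for label, score in zip(labels, predictions):
            predicted_positive = score >= threshold
            if predicted_positive and label:
                tp += 1
            elif predicted_positive:
                fp += 1
            elif label:
                fn += 1
            else:
                tn += 1
        # Convention chosen here: precision is 1.0 with no positive predictions.
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append({"threshold": threshold, "tp": tp, "fp": fp, "tn": tn,
                       "fn": fn, "precision": precision, "recall": recall})
    return points


points = pr_curve_points([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1], num_thresholds=3)
```

More thresholds give a smoother curve because precision and recall are sampled at more operating points.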
### Demo
The following example records data with the PR Curve component. For the code file, see [PR Curve component](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/components/pr_curve_test.py)
from visualdl import LogWriter
import numpy as np
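with LogWriter("./log/pr_curve_test/train") as writer:
    for step in range(3):
        labels = np.random.randint(2, size=100)
        predictions = np.random.rand(100)
        # Record one PR curve per step, following the add_pr_curve signature above
        writer.add_pr_curve(tag='pr_curve', labels=labels, predictions=predictions, step=step)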
visualdl --logdir ./log --port 8080
Then open `` in your browser to view the PR Curve
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86738774-ee46c000-c067-11ea-90d2-a98aac445cca.png" width="100%"/>
### Feature guide
- Data cards support "maximize", "restore", and "download" of the PR curve
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86740067-f18e7b80-c068-11ea-96bf-52cb7da1f799.png" width="60%"/>
- Hovering over a data point shows details: the TP, TN, FP, and FN for the threshold
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86740477-43370600-c069-11ea-93f0-f4d05445fbab.png" width="70%"/>
- Card tags can be searched to show the target charts
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86740670-66fa4c00-c069-11ea-9ee3-0a22e2d0dbec.png" width="50%"/>
- Data stream tags can be searched to show specific data
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86740817-809b9380-c069-11ea-9453-6531e3ff5f43.png" width="50%"/>
- View PR curves at different training steps
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86741057-b04a9b80-c069-11ea-9fef-2dcc16f9cd46.png" width="50%"/>
- The X-axis time display has three scales
- Step: number of iterations
- Walltime: absolute training time
- Relative: elapsed training time
<p align="center">
<img src="https://user-images.githubusercontent.com/48054808/86741304-db34ef80-c069-11ea-86eb-787b49ed3705.png" width="50%"/>
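## High Dimensional -- Dimensionality Reduction Component
### Introduction
The High Dimensional component displays high-dimensional data after dimensionality reduction, for in-depth analysis of the relationships within the data. Two dimensionality reduction algorithms are currently supported:
- PCA : Principal Component Analysis
- t-SNE : t-distributed stochastic neighbor embedding
### Recording interface
The recording interface of the High Dimensional component is as follows: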
add_embeddings(tag, labels, hot_vectors, walltime=None)
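| Parameter   | Type                | Description                                                  |
| ----------- | ------------------- | ------------------------------------------------------------ |
| tag         | string              | Label of the recorded metric, such as `default`; must not contain `%` |
| labels      | numpy.array or list | Labels as a one-dimensional array; each element is a string  |
| hot_vectors | numpy.array or list | Corresponds one-to-one with labels; each element can be seen as the feature of a label |
| walltime    | int                 | Timestamp of the record; defaults to the current timestamp   |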
### Demo
The following example records data with the High Dimensional component. For the code file, see [High Dimensional component](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/components/high_dimensional_test.py)
from visualdl import LogWriter
if __name__ == '__main__':
    hot_vectors = [
        [1.3561076367500755, 1.3116267195134017, 1.6785401875616097],
        [1.1039614644440658, 1.8891609992484688, 1.32030488587171],
        [1.9924524852447711, 1.9358920727142739, 1.2124401279391606],
        [1.4129542689796446, 1.7372166387197474, 1.7317806077076527],
        [1.3913371800587777, 1.4684674577930312, 1.5214136352476377]]
    labels = ["label_1", "label_2", "label_3", "label_4", "label_5"]
    # Initialize a recorder
    with LogWriter(logdir="./log/high_dimensional_test/train") as writer:
        # Pass a set of labels and the corresponding hot_vectors to the recorder
        writer.add_embeddings(tag='default', labels=labels, hot_vectors=hot_vectors)
visualdl --logdir ./log --port 8080
<p align="center">
<img src="http://visualdl.bj.bcebos.com/images/dynamic_high_dimensional.gif" width="100%"/>
# VisualDL user guide
## Overview
VisualDL is a toolkit for visualizing data generated in deep learning tasks. VisualDL makes use of [ECharts](https://echarts.apache.org/en/feature.html) to display the distribution and trends of data, so that users can view data more clearly and intuitively.
To help users analyze the characteristics of data, detect errors, and optimize the neural network model, VisualDL provides seven functional components: scalar, histogram, image, text, audio, high dimensional, and graph.
| Component name | Display chart | Function of component |
|:----:|:----:|:---|
|<a href="#1">scalar</a>| Line Chart | Dynamically displays scalar data, such as loss and accuracy |
|<a href="#2">histogram</a>| Histogram | Dynamically displays the numerical distribution and trends of parameters (such as weight matrix, offset, and gradient) |
|<a href="#3">image</a>| Image | Dynamically displays images, including input images and convolution results, making it convenient to view how intermediate results change |
|<a href="#4">text</a>| Text | Dynamically displays text |
|<a href="#5">audio</a>| Audio | Dynamically displays audio; users can play it directly or choose to download it |
|<a href="#6">high dimensional</a>| Coordinate | Maps high-dimensional data into 2D/3D space, making it easy to observe the correlation of different data |
|<a href="#7">graph</a>| Directed Graph | Displays the neural network |
## Toolkits for adding data
The six components (scalar, histogram, image, text, audio, and high dimensional) are used to add data while the program runs. The LogWriter class must be initialized before adding data, in order to set the storage path and synchronization cycle. The input parameters of each component are saved to a log file on disk, and the log file is then loaded into the front end for display.
### LogWriter
LogWriter is a Python wrapper that writes data to a log file in the data format defined in the protobuf file [storage.proto](https://github.com/PaddlePaddle/VisualDL/blob/develop/visualdl/storage/storage.proto).
The definition of LogWriter:
class LogWriter(dir, sync_cycle)
> :param dir : the directory path where the log files are saved.
> :param sync_cycle : how often the system stores data to the file system; the data is saved once the number of operations reaches sync_cycle.
> :return: a new LogWriter instance.
Demo 1. Create a LogWriter instance
# Create a LogWriter instance named log_writer
log_writer = LogWriter("./log", sync_cycle=10)
The LogWriter class includes the following member functions:
* `mode()`
* `scalar()`, `histogram()`, `image()`, `text()`, `audio()`, `embedding()`
The member function mode() specifies the phase of the running program. The input string is user-defined, such as `train`, `validation`, `test`, `conv_layer1`. Components with the same mode are grouped together, so users can choose which modes to display on the frontend webpage.
The member functions scalar(), histogram(), image(), text(), audio() and embedding() are used to create component instances.
Demo 2. Use LogWriter instance to create component instance
# Set the name of mode to "train", and create a scalar component instance
with log_writer.mode("train") as logger:
    train_scalar = logger.scalar("acc")
# Set the name of mode to "test", and create an image component instance
with log_writer.mode("test") as shower:
    test_image = shower.image("conv_image", 10, 1)
### scalar -- component to draw line charts
The <a name="1">scalar</a> component is used to draw line charts. When scalar data such as loss or accuracy is passed to the scalar component, the frontend webpage displays it as line charts, which helps users follow how the training process changes over time.
To use the scalar component, first call the member function scalar() of a LogWriter instance to create a ScalarWriter instance, then add data through the ScalarWriter's member function add_record().
* The member function `scalar()` of LogWriter instance :
def scalar(tag, type)
> :param tag : The scalar writer will label the data with tag.
> :param type : Data type, one of "float", "double" or "int"; the default is "float".
> :return : A ScalarWriter instance to handle step and value records.
* The member function `add_record()` of ScalarWriter instance :
def add_record(step, value)
> :param step : Step number.
> :param value : Input data.
Demo 3. scalar demo program [Github](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/component/scalar-demo.py)
# coding=utf-8
from visualdl import LogWriter
# Create a LogWriter instance
log_writer = LogWriter("./log", sync_cycle=20)
# Create two ScalarWriter instances, whose mode is set to be "train"
with log_writer.mode("train") as logger:
    train_acc = logger.scalar("acc")
    train_loss = logger.scalar("loss")
# Create a ScalarWriter instance, whose mode is set to be "test"
with log_writer.mode("test") as logger:
    test_acc = logger.scalar("acc")
value = [i/1000.0 for i in range(1000)]
for step in range(1000):
    # Add data
    train_acc.add_record(step, value[step])
    train_loss.add_record(step, 1 / (value[step] + 1))
    test_acc.add_record(step, 1 - value[step])
After running the demo program above, you can start the flask server with command ``visualdl`` :
visualdl --logdir ./log --host --port 8080
Open the address of the server (the host and port passed to ``visualdl``) in your browser, and you will see the interface below.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/demo/component/usage-interface/scalar-interface.png" width=800><br/>
Figure 1. scalar component displays line charts <br/>
</p>
The right sidebar of VisualDL has adjustment options for each component. Taking the scalar component as an example:
* Smoothing : Adjusts the smoothness of the line charts.
* X-axis : The horizontal axis of the line charts; the options are Step, Relative and Wall Time.
* Tooltip sorting : Sorting method of tags; the options are default, descending, ascending and nearest.

At the bottom of the right sidebar there is also a ``RUNNING`` button. While it shows ``RUNNING``, the frontend webpage keeps sending requests to the flask server for data synchronization; switching it to ``Stopped`` pauses the data update.
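As a rough illustration of what a smoothing control does to a curve, the sketch below applies an exponential moving average to a list of scalar values. This is an assumption about the general technique, not a description of VisualDL's actual smoothing formula; the function name `smooth` and the `weight` parameter are hypothetical.

```python
def smooth(values, weight):
    """Exponential moving average; weight in [0, 1), larger means smoother."""
    if not values:
        return []
    smoothed = []
    last = values[0]  # start from the first raw value
    for v in values:
        # Blend the previous smoothed value with the new raw value.
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed


raw = [0.0, 1.0, 0.0, 1.0]
print(smooth(raw, 0.5))  # -> [0.0, 0.5, 0.25, 0.625]
print(smooth(raw, 0.0))  # weight 0 leaves the data unchanged
```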
### histogram -- component to display data distribution
The <a name="2">histogram</a> component draws histograms that show the distribution of the input data. When training parameters such as weight matrices, biases or gradients are passed to the histogram component, the frontend webpage displays them as histograms, which helps users see how the parameter distribution changes.
To use the histogram component, first call the member function `histogram()` of a LogWriter instance to create a HistogramWriter instance, then add data through the HistogramWriter's member function `add_record()`.
* The member function histogram() of LogWriter instance :
def histogram(tag, num_buckets, type)
> :param tag : The histogram writer will label the data with tag.
> :param num_buckets : The number of buckets (bars) in the histogram.
> :param type : Data type, one of "float", "double" or "int"; the default is "float".
> :return : A HistogramWriter instance to record distribution.
* The member function add_record() of HistogramWriter instance :
def add_record(step, value)
> :param step : Step number.
> :param value : Input data, a list of values.
Demo 4. histogram demo program [Github](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/component/histogram-demo.py)
# coding=utf-8
import numpy as np
from visualdl import LogWriter
# Create a LogWriter instance
log_writer = LogWriter('./log', sync_cycle=10)
# Create a HistogramWriter instance, whose mode is set to be "train"
with log_writer.mode("train") as logger:
    param1_histogram = logger.histogram("param1", num_buckets=100)
# Loop
for step in range(1, 101):
    # Create input data
    interval_start = 1 + 2 * step/100.0
    interval_end = 6 - 2 * step/100.0
    data = np.random.uniform(interval_start, interval_end, size=(10000))
    # Use member function add_record() to add data
    param1_histogram.add_record(step, data)
After running the demo program above, you can start the flask server with command ``visualdl`` :
visualdl --logdir ./log --host --port 8080
Open the address of the server (the host and port passed to ``visualdl``) in your browser, and you will see the interface below.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/demo/component/usage-interface/histogram-interface.png" width=800><br/>
Figure 2. histogram component displays histograms <br/>
</p>
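To make the role of `num_buckets` concrete, the following sketch counts how many values fall into each of `num_buckets` equal-width intervals. It is a plain-Python illustration under the assumption of equal-width buckets between the minimum and the maximum of the data; `bucketize` is a hypothetical helper, not the VisualDL implementation.

```python
def bucketize(data, num_buckets):
    """Count how many values fall into each of num_buckets equal-width bins."""
    lo, hi = min(data), max(data)
    # Fall back to a width of 1.0 when all values are equal.
    width = (hi - lo) / num_buckets or 1.0
    counts = [0] * num_buckets
    for x in data:
        # The maximum value is placed in the last bucket.
        index = min(int((x - lo) / width), num_buckets - 1)
        counts[index] += 1
    return counts


print(bucketize([1.0, 1.5, 2.0, 2.5, 3.0, 5.0], num_buckets=4))  # -> [2, 2, 1, 1]
```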
### image -- component to display image
The <a name="3">image</a> component is used to visualize image data. When image data (type numpy.ndarray) is passed to the image component, the frontend webpage displays the image directly.
To use the image component, first call the member function image() of a LogWriter instance to create an ImageWriter instance. Then add data through the ImageWriter's member functions start_sampling(), is_sample_taken(), set_sample() and finish_sampling().
* The member function image() of LogWriter instance :
def image(tag, num_samples, step_cycle)
> :param tag : The image writer will label the image with tag.
> :param num_samples : The number of samples to take in a step.
> :param step_cycle : Store every `step_cycle` as a record, the default value is 1.
> :return: An ImageWriter instance to sample images.
* Start a new sampling cycle, allocate memory space for the sampled data :
def start_sampling()
* Determine whether the picture should be sampled or not. If the return value is -1, it means no sampling, otherwise it should be sampled :
def is_sample_taken()
* Add image data :
def set_sample(index, image_shape, image_data)
> :param index : Combined with tag, used to determine the sub-frame of the image display.
> :param image_shape : The shape of image, [width, height, channel(RGB is 3, GrayScale is 1)].
> :param image_data : Image data with type numpy.ndarray, member function flatten() can turn the shape to row vector.
* End the current sampling period, write the sampled data to disk, and release the memory space :
def finish_sampling()
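The calling order of the sampling functions above can be tried out with a small in-memory stand-in. `SampleBuffer` below is hypothetical, and its use of reservoir sampling to decide which items are kept is an assumption made for illustration; it is not the VisualDL implementation.

```python
import random


class SampleBuffer:
    """Hypothetical in-memory model of the sampling calls; not VisualDL code."""

    def __init__(self, num_samples, seed=0):
        self.num_samples = num_samples
        self.rng = random.Random(seed)
        self.seen = 0
        self.slots = {}

    def start_sampling(self):
        # Begin a new cycle with empty slots.
        self.seen = 0
        self.slots = {}

    def is_sample_taken(self):
        # Reservoir sampling: the first num_samples items always get a slot,
        # later items replace a random slot with decreasing probability.
        self.seen += 1
        if self.seen <= self.num_samples:
            return self.seen - 1
        index = self.rng.randrange(self.seen)
        return index if index < self.num_samples else -1

    def set_sample(self, index, data):
        self.slots[index] = data

    def finish_sampling(self):
        # Hand back the sampled data and release the buffer.
        result, self.slots = self.slots, {}
        return result


buf = SampleBuffer(num_samples=2)
buf.start_sampling()
for item in ["a", "b", "c", "d", "e"]:
    idx = buf.is_sample_taken()
    if idx != -1:  # -1 means this item is not sampled
        buf.set_sample(idx, item)
kept = buf.finish_sampling()
print(sorted(kept))  # slot indices 0 and 1 are always filled
```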
Demo 5. image demo program [Github](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/component/image-demo.py)
# coding=utf-8
import numpy as np
from visualdl import LogWriter
from PIL import Image
def random_crop(img):
    """Return a random block (100*100 pixels) of the image stored at path img."""
    img = Image.open(img)
    w, h = img.size
    random_w = np.random.randint(0, w - 100)
    random_h = np.random.randint(0, h - 100)
    return img.crop((random_w, random_h, random_w + 100, random_h + 100))
# Create a LogWriter instance
log_writer = LogWriter("./log", sync_cycle=10)
# Create an ImageWriter instance
ns = 2
with log_writer.mode("train") as logger:
    input_image = logger.image(tag="test", num_samples=ns)
# The variable sample_num records the number of images sampled in the current cycle
sample_num = 0
for step in range(6):
    # Start a new sampling cycle when no sample has been taken yet
    if sample_num == 0:
        input_image.start_sampling()
    idx = input_image.is_sample_taken()
    # If idx != -1, sample this data, otherwise skip
    if idx != -1:
        # Get image data
        image_path = "test.jpg"
        image_data = np.array(random_crop(image_path))
        # Add data
        input_image.set_sample(idx, image_data.shape, image_data.flatten())
        sample_num += 1
        # If sampling of the present cycle has been completed, end the cycle
        if sample_num % ns == 0:
            input_image.finish_sampling()
            sample_num = 0
After running the demo program above, you can start the flask server with command ``visualdl`` :
visualdl --logdir ./log --host --port 8080
Open the address of the server (the host and port passed to ``visualdl``) in your browser, then click the ``SAMPLES`` option at the top of the webpage, and you will see the interface below.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/demo/component/usage-interface/image-interface.png" width=800><br/>
Figure 3. image component displays images <br/>
</p>
Each subgraph has a horizontal axis which can be dragged to display images of different steps.
### text -- component to display text
The <a name="4">text</a> component is used to visualize text data. When text data (type string) is passed to the text component, the frontend webpage displays the text directly.
To use the text component, first call the member function text() of a LogWriter instance to create a TextWriter instance, then add data through the TextWriter's member function add_record().
* The member function text() of LogWriter instance :
def text(tag)
> :param tag : The text writer will label the text with tag; the tag determines the sub-frame in which the text is displayed.
* The member function add_record() of TextWriter instance :
def add_record(step, str)
> :param step : Step number.
> :param str : Input data, type is string.
Demo 6. text demo program [Github](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/component/text-demo.py)
# coding=utf-8
from visualdl import LogWriter
# Create a LogWriter instance
log_writer = LogWriter("./log", sync_cycle=10)
# Create a TextWriter instance
with log_writer.mode("train") as logger:
    vdl_text_comp = logger.text(tag="test")
# Use member function add_record() to add data
for i in range(1, 6):
    vdl_text_comp.add_record(i, "这是第 %d 个 step 的数据。" % i)
    vdl_text_comp.add_record(i, "This is data %d ." % i)
After running the demo program above, you can start the flask server with command ``visualdl`` :
visualdl --logdir ./log --host --port 8080
Open the address of the server (the host and port passed to ``visualdl``) in your browser, then click the ``SAMPLES`` option at the top of the webpage, and you will see the interface below.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/demo/component/usage-interface/text-interface.png" width=800><br/>
Figure 4. text component displays texts <br/>
</p>
Each subgraph has a horizontal axis which can be dragged to display text of different steps.
### audio -- component to play audio
The <a name="5">audio</a> component is used to play audio. When audio data (type numpy.ndarray) is passed to the audio component, users can play the audio directly or choose to download it.
To use the audio component, first call the member function audio() of a LogWriter instance to create an AudioWriter instance. Then add data through the AudioWriter's member functions start_sampling(), is_sample_taken(), set_sample() and finish_sampling().
* The member function audio() of LogWriter instance :
def audio(tag, num_samples, step_cycle)
> :param tag : The audio writer will label the audio with tag.
> :param num_samples : The number of samples to take in a step.
> :param step_cycle : Store every `step_cycle` as a record, the default value is 1.
> :return: An AudioWriter instance to sample audio.
* Start a new sampling cycle, allocate memory space for the sampled data :
def start_sampling()
* Determine whether the audio should be sampled or not. If the return value is -1, it means no sampling, otherwise it should be sampled :
def is_sample_taken()
* Add audio data :
def set_sample(index, audio_params, audio_data)
> :param index : Combined with tag, used to determine the sub-frame of the audio.
> :param audio_params : The parameters of audio, [sample rate, sample width, channels].
> :param audio_data : Audio data with type numpy.ndarray, member function flatten() can turn the shape to row vector.
* End the current sampling period, write the sampled data to disk, and release the memory space :
def finish_sampling()
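The `audio_params` list can be read from a WAV file instead of being hard-coded. The sketch below uses only Python's standard `wave` module; `read_audio_params` is a hypothetical helper, and the WAV data is generated in memory purely so that the example is self-contained.

```python
import io
import wave


def read_audio_params(wav_file):
    """Return ([sample rate, sample width, channels], raw frame bytes)."""
    with wave.open(wav_file, "rb") as f:
        params = [f.getframerate(), f.getsampwidth(), f.getnchannels()]
        frames = f.readframes(f.getnframes())
    return params, frames


# Build a short silent WAV in memory so the sketch needs no file on disk.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as w:
    w.setnchannels(1)     # mono
    w.setsampwidth(2)     # 16-bit samples
    w.setframerate(8000)  # 8 kHz
    w.writeframes(b"\x00\x00" * 800)
buffer.seek(0)

params, frames = read_audio_params(buffer)
print(params, len(frames))  # -> [8000, 2, 1] 1600
```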
Demo 7. audio demo program [Github](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/component/audio-demo.py)
# coding=utf-8
import numpy as np
import wave
from visualdl import LogWriter
def read_audio_data(audio_path):
    """Read audio data."""
    CHUNK = 4096
    f = wave.open(audio_path, "rb")
    wavdata = []
    chunk = f.readframes(CHUNK)
    while chunk:
        # np.frombuffer replaces the deprecated np.fromstring
        data = np.frombuffer(chunk, dtype='uint8')
        wavdata.extend(data)
        chunk = f.readframes(CHUNK)
    # 8k sample rate, 16bit frame, 1 channel
    shape = [8000, 2, 1]
    return shape, wavdata
# Create a LogWriter instance
log_writer = LogWriter("./log", sync_cycle=10)
# Create an AudioWriter instance
ns = 2
with log_writer.mode("train") as logger:
    input_audio = logger.audio(tag="test", num_samples=ns)
# The variable audio_sample_num records the number of audio clips sampled in the current cycle
audio_sample_num = 0
for step in range(9):
    # Start a new sampling cycle when no sample has been taken yet
    if audio_sample_num == 0:
        input_audio.start_sampling()
    # Get idx
    idx = input_audio.is_sample_taken()
    # If idx != -1, sample this data, otherwise skip
    if idx != -1:
        # Read audio data
        audio_path = "test.wav"
        audio_shape, audio_data = read_audio_data(audio_path)
        # Add data through member function set_sample()
        input_audio.set_sample(idx, audio_shape, audio_data)
        audio_sample_num += 1
        # If sampling of the present cycle has been completed, end the cycle
        if audio_sample_num % ns == 0:
            input_audio.finish_sampling()
            audio_sample_num = 0
After running the demo program above, you can start the flask server with command ``visualdl`` :
visualdl --logdir ./log --host --port 8080
Open the address of the server (the host and port passed to ``visualdl``) in your browser, then click the ``SAMPLES`` option at the top of the webpage, and you will see the interface below.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/demo/component/usage-interface/audio-interface.png" width=800><br/>
Figure 5. audio component displays audios <br/>
</p>
Each subgraph has a horizontal axis which can be dragged to play audio of different steps.
### high dimensional -- component of dimensionality reduction
The role of <a name="6">high dimensional</a> component is to map data into 2D or 3D space for embedding visualization, which helps users understand how different data relate to each other.
The high dimensional component supports the following two dimensionality reduction algorithms :
* PCA : Principal Component Analysis
* [t-SNE](https://lvdmaaten.github.io/tsne/) : t-distributed stochastic neighbor embedding

To use the high dimensional component, first call the member function embedding() of a LogWriter instance to create an EmbeddingWriter instance. Then add data through the EmbeddingWriter's member function add_embeddings_with_word_dict().
* The member function embedding() of LogWriter instance
def embedding()
* The member function add_embeddings_with_word_dict() of EmbeddingWriter instance :
def add_embeddings_with_word_dict(data, Dict)
> :param data : input data , type List[List(float)].
> :param Dict : dictionary, type Dict[str, int].
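For intuition about PCA, the sketch below estimates the first principal direction of a small data set by power iteration on its covariance matrix and projects the points onto it. It is a simplified, pure-Python illustration (one output dimension instead of two or three) with hypothetical function names, and it is not how the component itself is implemented.

```python
import math


def first_principal_component(points, iterations=100):
    """Estimate the leading principal direction of points by power iteration."""
    n, dim = len(points), len(points[0])
    # Center the data.
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]
    # Sample covariance matrix (dim x dim).
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1)
            for b in range(dim)] for a in range(dim)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vec = [1.0] * dim
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break  # all points are identical; no direction to find
        vec = [x / norm for x in nxt]
    return means, vec


def project(points, means, direction):
    """Coordinate of each point along the given direction."""
    return [sum((p[j] - means[j]) * direction[j] for j in range(len(means)))
            for p in points]


# Four 3-D points that all lie on a line with direction (1, 2, 0).
points = [[0.0, 0.0, 1.0], [1.0, 2.0, 1.0], [2.0, 4.0, 1.0], [3.0, 6.0, 1.0]]
means, direction = first_principal_component(points)
coords = project(points, means, direction)
print(direction)
print(coords)
```

Because the sample points lie on a single line, the recovered direction is proportional to (1, 2, 0) and one coordinate per point is enough to describe them.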
Demo 8. high dimensional demo program [Github](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/component/embedding-demo.py)
# coding=utf-8
import numpy as np
from visualdl import LogWriter
# Create a LogWriter instance
log_writer = LogWriter("./log", sync_cycle=10)
# Create an EmbeddingWriter instance
with log_writer.mode("train") as logger:
    train_embedding = logger.embedding()
# Initialize data List[List(float)]
hot_vectors = np.random.uniform(1, 2, size=(10, 3))
word_dict = {
    "label_1": 5,
    "label_2": 4,
    "label_3": 3,
    "label_4": 2,
    "label_5": 1,
}
# Add data through member function add_embeddings_with_word_dict(data, Dict)
train_embedding.add_embeddings_with_word_dict(hot_vectors, word_dict)
After running the demo program above, you can start the flask server with command ``visualdl`` :
visualdl --logdir ./log --host --port 8080
Open the address of the server (the host and port passed to ``visualdl``) in your browser, then click the ``HIGHDIMENSIONAL`` option at the top of the webpage, and you will see the interface below.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/demo/component/usage-interface/embedding-2D.png" width=800><br/>
Figure 6. high dimensional component displays plane coordinates <br/>
</p>
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/demo/component/usage-interface/embedding-3D.png" width=800><br/>
Figure 7. High dimensional component displays Cartesian coordinates <br/>
</p>
## graph -- component to visualize neural network
The role of <a name="7">graph</a> component is to visualize neural networks. This component can display models in Paddle format or [ONNX](https://onnx.ai) format. It helps users understand the model structure of a neural network and troubleshoot network configuration errors.
Unlike the other components, which need to record data, the only prerequisite for using the graph component is specifying the storage path of the model file: add the option --model_pb to the ``visualdl`` command to specify that path, and the corresponding neural network will be shown in the frontend webpage.
Demo 9. graph demo program(How to save a Lenet-5 model by Paddle)[Github](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/component/graph-demo.py)
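# coding=utf-8
import paddle.fluid as fluid
def lenet_5(img):
    """Define the Lenet-5 model."""
    # The arguments of the two conv-pool blocks are illustrative values;
    # the original listing is truncated at these calls.
    conv1 = fluid.nets.simple_img_conv_pool(
        input=img, filter_size=5, num_filters=20,
        pool_size=2, pool_stride=2, act="relu")
    conv1_bn = fluid.layers.batch_norm(input=conv1)
    conv2 = fluid.nets.simple_img_conv_pool(
        input=conv1_bn, filter_size=5, num_filters=50,
        pool_size=2, pool_stride=2, act="relu")
    prediction = fluid.layers.fc(input=conv2, size=10, act="softmax")
    return prediction
# Variable assignment
image = fluid.layers.data(name="img", shape=[1, 28, 28], dtype="float32")
prediction = lenet_5(image)
place = fluid.CPUPlace()
exe = fluid.Executor(place=place)
exe.run(fluid.default_startup_program())
# Save the result to "./paddle_lenet_5_model"
fluid.io.save_inference_model(
    "./paddle_lenet_5_model",
    feeded_var_names=[image.name],
    target_vars=[prediction],
    executor=exe)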
After running the demo program above, you can start the flask server with command ``visualdl`` :
visualdl --logdir ./log --host --port 8080 --model_pb paddle_lenet_5_model
Open the address of the server (the host and port passed to ``visualdl``) in your browser, then click the `GRAPHS` option at the top of the webpage, and you will see the interface below.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/demo/component/usage-interface/graph.png" width=800><br/>
Figure 8. graph component displays the model structure of Lenet-5 <br/>
</p>
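- `Server-side Deployment <inference/index_cn.html>`_ : describes how to deploy models online on servers
- `Mobile Deployment <mobile/index_cn.html>`_ : introduces Paddle-Lite, the deep learning framework for embedded platforms under the PaddlePaddle organization
- `Model Compression <paddleslim/paddle_slim.html>`_ : briefly introduces the features and usage of the PaddleSlim model compression toolkit.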
.. toctree::
Deploy Inference Model
- `Server side Deployment <inference/index_en.html>`_ : Describes how to deploy and release trained models on servers
- `Model Compression <paddleslim/paddle_slim_en.html>`_ : Introduces the features and usage of PaddleSlim, a toolkit for model compression.
.. toctree::
.. _install_or_build_cpp_inference_lib:
Install and Compile the Linux Inference Library
.. csv-table::
:header: "Version description", "Inference library (version 1.8.4)", "Inference library (version 2.0.0-beta0)", "Inference library (develop version)"
:widths: 3, 2, 2, 2
"ubuntu14.04_cpu_avx_mkl", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-cpu-avx-mkl/fluid_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-cpu-avx-mkl/paddle_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-cpu-avx-mkl/paddle_inference.tgz>`_"
"ubuntu14.04_cpu_avx_openblas", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-cpu-avx-openblas/fluid_inference.tgz>`_", ,"`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-cpu-avx-openblas/paddle_inference.tgz>`_"
"ubuntu14.04_cpu_noavx_openblas", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-cpu-noavx-openblas/fluid_inference.tgz>`_", ,"`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-cpu-noavx-openblas/paddle_inference.tgz>`_"
"ubuntu14.04_cuda9.0_cudnn7_avx_mkl", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-gpu-cuda9-cudnn7-avx-mkl/fluid_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-gpu-cuda9-cudnn7-avx-mkl/paddle_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-gpu-cuda9-cudnn7-avx-mkl/paddle_inference.tgz>`_"
"ubuntu14.04_cuda10.0_cudnn7_avx_mkl", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-gpu-cuda10-cudnn7-avx-mkl/fluid_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-gpu-cuda10-cudnn7-avx-mkl/paddle_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-gpu-cuda10-cudnn7-avx-mkl/paddle_inference.tgz>`_"
"ubuntu14.04_cuda10.1_cudnn7.6_avx_mkl_trt6", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-gpu-cuda10.1-cudnn7.6-avx-mkl-trt6%2Ffluid_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-gpu-cuda10.1-cudnn7.6-avx-mkl-trt6%2Fpaddle_inference.tgz>`_",
"nv-jetson-cuda10-cudnn7.5-trt5", "`fluid_inference.tar.gz <https://paddle-inference-lib.bj.bcebos.com/1.7.1-nv-jetson-cuda10-cudnn7.5-trt5/fluid_inference.tar.gz>`_", "`paddle_inference.tar.gz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-nv-jetson-cuda10-cudnn7.5-trt5/paddle_inference.tgz>`_",
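Users can also compile the C++ inference library from the PaddlePaddle core code by configuring the following build options at compile time:

============================  ================  ==================
Option                        Value             Description
============================  ================  ==================
CMAKE_BUILD_TYPE              Release           Build type; Release is sufficient when only the inference library is used
WITH_PYTHON                   OFF(recommended)  Build the Python inference library and whl package
ON_INFER                      ON(recommended)   Used for inference; must be set to ON
WITH_XBYAK                    ON                Build with XBYAK; must be set to OFF when building on Jetson hardware
WITH_NV_JETSON                OFF               Must be set to ON when building on NV Jetson hardware
============================  ================  ==================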
.. code-block:: bash

    git clone https://github.com/paddlepaddle/Paddle
    cd Paddle
    # It is recommended to use git checkout to switch to a stable Paddle version, for example:
    git checkout v1.8.4
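**note**: If you have a multi-GPU machine, installing NCCL is recommended; on a single-GPU machine you can skip this step by explicitly specifying WITH_NCCL=OFF at compile time. Note that if WITH_NCCL=ON and NCCL is not installed, the build will report an error.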
.. code-block:: bash
git clone https://github.com/NVIDIA/nccl.git
cd nccl
make -j4
make install
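.. code-block:: bash

    cd Paddle
    mkdir build
    cd build
    # Configure with the option values from the table above before building
    cmake -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=OFF -DON_INFER=ON ..
    make inference_lib_dist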
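**Building the inference library from source on NVIDIA Jetson embedded hardware**
NVIDIA Jetson is an embedded AI platform from NVIDIA, and Paddle Inference supports building the inference library on NVIDIA Jetson platforms. The steps are as follows:
1. Prepare the environment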
.. code-block:: bash
sudo nvpmodel -m 0 && sudo jetson_clocks
.. code-block:: bash
sudo fallocate -l 5G /var/swapfile
sudo chmod 600 /var/swapfile
sudo mkswap /var/swapfile
sudo swapon /var/swapfile
sudo bash -c 'echo "/var/swapfile swap swap defaults 0 0" >> /etc/fstab'
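2. Build the Paddle Inference library

.. code-block:: bash

    cd Paddle
    mkdir build
    cd build
    # Option values follow the table above for NV Jetson platforms
    cmake .. \
      -DCMAKE_BUILD_TYPE=Release \
      -DON_INFER=ON \
      -DWITH_PYTHON=OFF \
      -DWITH_XBYAK=OFF \
      -DWITH_NV_JETSON=ON
    make -j4
    # Generate the inference libs
    make inference_lib_dist -j4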
3. Test with samples
1. Error:
.. code-block:: bash
ERROR: ../aarch64-linux-gpn/crtn.o: Too many open files.
.. code-block:: bash
ulimit -n 2048
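2. The build hangs
3. Errors about missing virtual destructors for IPluginFactory or IGpuAllocator when using TensorRT
After downloading and installing TensorRT, add virtual destructors to class IPluginFactory and class IGpuAllocator in NvInfer.h:

.. code-block:: c++

    virtual ~IPluginFactory() {};
    virtual ~IGpuAllocator() {};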
.. code-block:: text

    ├── CMakeCache.txt
    ├── paddle
    │   ├── include
    │   │   ├── paddle_anakin_config.h
    │   │   ├── paddle_analysis_config.h
    │   │   ├── paddle_api.h
    │   │   ├── paddle_inference_api.h
    │   │   ├── paddle_mkldnn_quantizer_config.h
    │   │   └── paddle_pass_builder.h
    │   └── lib
    │       ├── libpaddle_fluid.a
    │       └── libpaddle_fluid.so
    ├── third_party
    │   └── install
    │       ├── gflags
    │       ├── glog
    │       ├── mkldnn
    │       ├── mklml
    │       └── protobuf
    └── version.txt
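version.txt records the version information of the inference library, including the Git Commit ID, whether OpenBlas or the MKL math library is used, and the CUDA/CUDNN version numbers. For example: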
.. code-block:: text
GIT COMMIT ID: 0231f58e592ad9f673ac1832d8c495c8ed65d24f
CUDA version: 10.1
CUDNN version: v7
.. _install_or_build_cpp_inference_lib_en:
Install and Compile C++ Inference Library on Linux
Direct Download and Installation
.. csv-table:: c++ inference library list
:header: "version description", "inference library(1.8.4 version)", "inference library(2.0.0-beta0 version)", "inference library(develop version)"
:widths: 3, 2, 2, 2
"ubuntu14.04_cpu_avx_mkl", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-cpu-avx-mkl/fluid_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-cpu-avx-mkl/paddle_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-cpu-avx-mkl/paddle_inference.tgz>`_"
"ubuntu14.04_cpu_avx_openblas", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-cpu-avx-openblas/fluid_inference.tgz>`_", ,"`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-cpu-avx-openblas/paddle_inference.tgz>`_"
"ubuntu14.04_cpu_noavx_openblas", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-cpu-noavx-openblas/fluid_inference.tgz>`_", ,"`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-cpu-noavx-openblas/paddle_inference.tgz>`_"
"ubuntu14.04_cuda9.0_cudnn7_avx_mkl", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-gpu-cuda9-cudnn7-avx-mkl/fluid_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-gpu-cuda9-cudnn7-avx-mkl/paddle_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-gpu-cuda9-cudnn7-avx-mkl/paddle_inference.tgz>`_"
"ubuntu14.04_cuda10.0_cudnn7_avx_mkl", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-gpu-cuda10-cudnn7-avx-mkl/fluid_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-gpu-cuda10-cudnn7-avx-mkl/paddle_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/latest-gpu-cuda10-cudnn7-avx-mkl/paddle_inference.tgz>`_"
"ubuntu14.04_cuda10.1_cudnn7.6_avx_mkl_trt6", "`fluid_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/1.8.4-gpu-cuda10.1-cudnn7.6-avx-mkl-trt6%2Ffluid_inference.tgz>`_", "`paddle_inference.tgz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-gpu-cuda10.1-cudnn7.6-avx-mkl-trt6%2Fpaddle_inference.tgz>`_",
"nv-jetson-cuda10-cudnn7.5-trt5", "`fluid_inference.tar.gz <https://paddle-inference-lib.bj.bcebos.com/1.7.1-nv-jetson-cuda10-cudnn7.5-trt5/fluid_inference.tar.gz>`_", "`paddle_inference.tar.gz <https://paddle-inference-lib.bj.bcebos.com/2.0.0-beta0-nv-jetson-cuda10-cudnn7.5-trt5/paddle_inference.tgz>`_",
Build from Source Code
Users can also compile C++ inference libraries from the PaddlePaddle core code by specifying the following compile options at compile time:
============================  ================  ==================
Option                        Value             Description
============================  ================  ==================
CMAKE_BUILD_TYPE              Release           cmake build type, set to Release if debug messages are not needed
FLUID_INFERENCE_INSTALL_DIR   path              install path of inference libs
WITH_PYTHON                   OFF(recommended)  build python libs and whl package
ON_INFER                      ON(recommended)   build with inference settings
WITH_GPU                      ON/OFF            build inference libs on GPU
WITH_MKL                      ON/OFF            build inference libs supporting MKL
WITH_MKLDNN                   ON/OFF            build inference libs supporting MKLDNN
WITH_XBYAK                    ON                build with XBYAK, must be OFF when building on NV Jetson platforms
WITH_NV_JETSON                OFF               build inference libs on NV Jetson platforms
============================  ================  ==================

It is recommended to configure the options according to the recommended values to avoid linking unnecessary libraries. Other options can be set if necessary.
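The options in the table are passed to cmake as ``-D<OPTION>=<VALUE>`` definitions. As a rough illustration only, the following Python sketch assembles such a command line from a dictionary. The helper name ``cmake_command`` and the install path are hypothetical, and the chosen values simply follow the recommendations in the table.

.. code-block:: python

    def cmake_command(options, source_dir=".."):
        """Build a cmake argument list from an option dictionary (hypothetical helper)."""
        args = ["cmake"]
        for name, value in options.items():
            # Each option becomes a -DNAME=VALUE definition.
            args.append("-D{}={}".format(name, value))
        args.append(source_dir)
        return args


    options = {
        "FLUID_INFERENCE_INSTALL_DIR": "/path/to/install",  # placeholder path
        "CMAKE_BUILD_TYPE": "Release",
        "WITH_PYTHON": "OFF",
        "ON_INFER": "ON",
    }
    print(" ".join(cmake_command(options)))

The resulting list can be joined into a shell command, as in the ``print`` call, or passed to a process-launching function.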
First, pull the latest code from GitHub.
.. code-block:: bash
git clone https://github.com/paddlepaddle/Paddle
cd Paddle
# Use git checkout to switch to stable versions such as v1.8.4
git checkout v1.8.4
**note**: If your environment is a multi-card machine, it is recommended to install nccl; otherwise, you can skip this step by specifying WITH_NCCL = OFF during compilation. Note that if WITH_NCCL = ON, and NCCL is not installed, the compiler will report an error.
.. code-block:: bash
git clone https://github.com/NVIDIA/nccl.git
cd nccl
make -j4
make install
**build inference libs on server**
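The following commands set the configuration and run the build (PADDLE_ROOT should be set to the actual install path of the inference libs, and WITH_NCCL should be adjusted to the actual environment).

.. code-block:: bash

    PADDLE_ROOT=/path/to/install  # placeholder: actual install path of the inference libs
    git clone https://github.com/PaddlePaddle/Paddle.git
    cd Paddle
    mkdir build
    cd build
    cmake -DFLUID_INFERENCE_INSTALL_DIR=$PADDLE_ROOT -DCMAKE_BUILD_TYPE=Release \
          -DWITH_PYTHON=OFF -DON_INFER=ON -DWITH_NCCL=OFF ..
    make inference_lib_dist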
**build inference libs on NVIDIA Jetson platforms**
NVIDIA Jetson is an AI computing platform for embedded systems introduced by NVIDIA. Paddle Inference supports building inference libs on NVIDIA Jetson platforms. The steps are as follows.
1. Prepare environments
Turn on hardware performance mode
.. code-block:: bash
sudo nvpmodel -m 0 && sudo jetson_clocks
If building on Nano hardware, increase the swap memory

.. code-block:: bash

    # Increase the usable memory. The default allocated memory is 16G, which is enough for Xavier. The following steps are for Nano hardware.
    sudo fallocate -l 5G /var/swapfile
    sudo chmod 600 /var/swapfile
    sudo mkswap /var/swapfile
    sudo swapon /var/swapfile
    sudo bash -c 'echo "/var/swapfile swap swap defaults 0 0" >> /etc/fstab'
2. Build paddle inference libs
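.. code-block:: bash

    cd Paddle
    mkdir build
    cd build
    # Option values follow the table above for NV Jetson platforms
    cmake .. \
      -DCMAKE_BUILD_TYPE=Release \
      -DON_INFER=ON \
      -DWITH_PYTHON=OFF \
      -DWITH_XBYAK=OFF \
      -DWITH_NV_JETSON=ON
    make -j4
    # Generate inference libs
    make inference_lib_dist -j4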
3. Test with samples
Please refer to samples on https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/performance_improving/inference_improving/paddle_tensorrt_infer.html#id2
1. Error:
.. code-block:: bash
ERROR: ../aarch64-linux-gpn/crtn.o: Too many open files.
Fix this by increasing the number of files the system can open at the same time to 2048.
.. code-block:: bash
ulimit -n 2048
2. The building process hangs.
The build may be downloading third-party libs. Wait, or kill the build process and start again.
3. Lacking virtual destructors for IPluginFactory or IGpuAllocator when using TensorRT.
After downloading and installing TensorRT, add virtual destructors for IPluginFactory and IGpuAllocator in NvInfer.h:

.. code-block:: c++

    virtual ~IPluginFactory() {};
    virtual ~IGpuAllocator() {};
After successful compilation, the dependencies required by the C++ inference library will be stored in the PADDLE_ROOT directory. These dependencies include: (1) the compiled PaddlePaddle inference library and header files; (2) third-party link libraries and header files; (3) version information and compilation option information.
The directory structure is:
.. code-block:: text

    ├── CMakeCache.txt
    ├── paddle
    │   ├── include
    │   │   ├── paddle_anakin_config.h
    │   │   ├── paddle_analysis_config.h
    │   │   ├── paddle_api.h
    │   │   ├── paddle_inference_api.h
    │   │   ├── paddle_mkldnn_quantizer_config.h
    │   │   └── paddle_pass_builder.h
    │   └── lib
    │       ├── libpaddle_fluid.a
    │       └── libpaddle_fluid.so
    ├── third_party
    │   ├── boost
    │   │   └── boost
    │   ├── eigen3
    │   │   ├── Eigen
    │   │   └── unsupported
    │   └── install
    │       ├── gflags
    │       ├── glog
    │       ├── mkldnn
    │       ├── mklml
    │       ├── protobuf
    │       ├── snappy
    │       ├── snappystream
    │       ├── xxhash
    │       └── zlib
    └── version.txt
The version information of the inference library is recorded in version.txt, including the Git commit ID and the version of OpenBLAS, the MKL math library, or CUDA/CUDNN. For example:
.. code-block:: text
GIT COMMIT ID: cc9028b90ef50a825a722c55e5fda4b7cd26b0d6
CUDA version: 8.0
CUDNN version: v7
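# Introduction to the C Inference API
Fluid provides a highly optimized [C++ inference library](./native_infer.html). For convenience, a C interface that wraps the C++ inference library is also provided. To use the C interface, first `#include "paddle_c_api.h"`. The header `paddle_c_api.h` can be found in the Paddle repository at `paddle/fluid/inference/capi/paddle_c_api.h`, or, after building Paddle under `Paddle/build/`, at `build/fluid_inference_c_install_dir/paddle/include/`. In addition, a project that uses the C API must link against the compiled library `libpaddle_fluid_c.so`. Detailed usage instructions follow.
Note that, unlike the C++ API, the C API sets no default arguments, so that it can be wrapped from other languages; every argument must be provided explicitly by the caller.
## C inference data structures
The C inference API is not identical to the C++ inference API. It mainly consists of `PD_AnalysisConfig`, `PD_DataType`, `PD_Predictor`, `PD_Buffer`, and `PD_ZeroCopyTensor`. The following sections describe these data structures and how to use them in detail, with examples.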
### PD_AnalysisConfig
* `PD_AnalysisConfig* PD_NewAnalysisConfig()`: creates a new `PD_AnalysisConfig` and returns a pointer to it.
* `void PD_DeleteAnalysisConfig(PD_AnalysisConfig* config)`: deletes the `PD_AnalysisConfig` that `config` points to.
* `void PD_SetModel(PD_AnalysisConfig* config, const char* model_dir, const char* params_path)`: sets the model path. The arguments are the `PD_AnalysisConfig`, `model_dir`, and `params_path`. `model_dir` is the path where the model is saved and generally does not include a file name; `params_path` is optional. <strong>Note</strong>:
  - If `params_path` is not given, i.e. `params_path` is `NULL`, the parameters are assumed to be stored under `model_dir`, and the model file and parameter files are assumed to use the default file names; there may be several parameter files in this case. Pass the `model_dir` that holds both the parameters and the model file, i.e. <strong>the directory where the model and parameters are saved</strong>, without a file name, and explicitly set `params_path` to `NULL`.
  - If `params_path` is provided, then, to allow custom file names, `model_dir` must end with the file name of the model file. That is, `model_dir` is <strong>the path of the model file</strong> and `params_path` is <strong>the path of the model parameter file</strong>, both including the file name.
* `const char* PD_ModelDir(const PD_AnalysisConfig* config)`: returns the model directory path if `params_path` was not specified in `PD_SetModel()`.
* `const char* PD_ProgFile(const PD_AnalysisConfig* config)`: returns the model file path if `params_path` was specified in `PD_SetModel()`.
* `const char* PD_ParamsFile(const PD_AnalysisConfig* config)`: returns the parameter file path if `params_path` was specified in `PD_SetModel()`.
* `void PD_SwitchSpecifyInputNames(PD_AnalysisConfig* config, bool x)`: when set to `true`, the model identifies its inputs by name when reading them; otherwise it relies on input order. Setting it to `true` is recommended when using `PD_ZeroCopyTensor` with multiple inputs.
* `void PD_SwitchUseFeedFetchOps(PD_AnalysisConfig* config, bool x)`: sets whether the `feed` and `fetch` ops are used. This option must be set to `false` when using `PD_ZeroCopyTensor`.
* `void PD_EnableUseGpu(PD_AnalysisConfig* config, uint64_t memory_pool_init_size_mb, int device_id)`: enables the GPU and sets the initial GPU memory (in MB) and the device ID.
* `void PD_DisableGpu(PD_AnalysisConfig* config)`: disables the GPU.
* `int PD_GpuDeviceId(const PD_AnalysisConfig* config)`: returns the ID of the GPU device in use.
* `void PD_SwitchIrOptim(PD_AnalysisConfig* config, bool x)`: sets whether IR optimization is enabled for inference.
* `void PD_EnableTensorRtEngine(PD_AnalysisConfig* config, int workspace_size, int max_batch_size, int min_subgraph_size, Precision precision, bool use_static, bool use_calib_mode)`: enables TensorRT. For an explanation of the parameters, see [Inference with the Paddle-TensorRT library](../../performance_improving/inference_improving/paddle_tensorrt_infer.html).
* `void PD_EnableMKLDNN(PD_AnalysisConfig* config)`: enables MKLDNN.
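The two calling conventions of `PD_SetModel` can be summarized in a small sketch. It is not part of the Paddle C API: `ModelFiles` and `ResolveModelFiles` are hypothetical names used only for illustration, and the code merely mirrors the rule in this document (a `NULL` `params_path` means a directory holding a model file with the default name `__model__`; otherwise both arguments are full file paths).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper type (not a Paddle type): the files that a
// (model_dir, params_path) pair refers to under the documented convention.
struct ModelFiles {
  std::string model_file;   // path of the model file
  std::string params_file;  // path of the single parameter file; empty when
                            // parameters are stored as separate files
};

// Hypothetical helper (not a Paddle function).
ModelFiles ResolveModelFiles(const char* model_dir, const char* params_path) {
  assert(model_dir != nullptr);
  ModelFiles files;
  if (params_path == nullptr) {
    // Directory form: the model file uses the default name "__model__".
    std::string dir(model_dir);
    if (!dir.empty() && dir.back() != '/') dir += '/';
    files.model_file = dir + "__model__";
  } else {
    // File form: both arguments already include the file name.
    files.model_file = model_dir;
    files.params_file = params_path;
  }
  return files;
}
```

Under this reading, passing `"./model/"` with a null `params_path` refers to `./model/__model__`, while passing `"./model/model"` and `"./model/params"` uses both paths unchanged.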
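#### Code example
``` C
PD_AnalysisConfig* config = PD_NewAnalysisConfig();
```
* When the model directory contains one model file saved under the default file name and multiple parameter files, pass in the model directory path. The model file name defaults to `__model__`; `params_path` must be explicitly set to `NULL`, and no file name is specified.
``` C
const char* model_dir = "./model/";
PD_SetModel(config, model_dir, NULL);
```
* When the model directory contains only one model file and one parameter file, pass in the model file and the parameter file, including their file names.
``` C
const char* model_path = "./model/model";
const char* params_path = "./params/params";
PD_SetModel(config, model_path, params_path);
```
``` C
PD_EnableUseGpu(config, 100, 0);  // initialize 100 MB of GPU memory and use GPU id 0
PD_GpuDeviceId(config);           // returns the id of the GPU in use
PD_DisableGpu(config);            // disable the GPU
PD_SwitchIrOptim(config, true);   // enable IR optimization
PD_EnableMKLDNN(config);          // enable MKLDNN
PD_SwitchSpecifyInputNames(config, true);
PD_SwitchUseFeedFetchOps(config, false);
```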
### PD_ZeroCopyTensor
* `data - (PD_Buffer)`: the values of the data passed in.
* `shape - (PD_Buffer)`: the shape of the data passed in.
* `lod - (PD_Buffer)`: the `lod` of the data; only level-1 `lod` is currently supported.
* `dtype - (PD_DataType)`: the data type of the data passed in, expressed with the `PD_DataType` enum.
* `name - (char*)`: the name of the data passed in.
* `PD_ZeroCopyTensor* PD_NewZeroCopyTensor()`: creates a new `PD_ZeroCopyTensor` and returns a pointer to it.
* `void PD_DeleteZeroCopyTensor(PD_ZeroCopyTensor*)`: deletes the `PD_ZeroCopyTensor` that the pointer refers to.
* `void PD_InitZeroCopyTensor(PD_ZeroCopyTensor*)`: initializes a `PD_ZeroCopyTensor` with default values and allocates its memory.
* `void PD_DestroyZeroCopyTensor(PD_ZeroCopyTensor*)`: releases the `PD_Buffer` members `data`, `shape`, and `lod` of the `PD_ZeroCopyTensor`.
### PD_DataType
* `PD_FLOAT32`: 32-bit floating point
* `PD_INT32`: 32-bit integer
* `PD_INT64`: 64-bit integer
* `PD_UINT8`: 8-bit unsigned integer
#### Code example
``` C
PD_ZeroCopyTensor input;
input.dtype = PD_FLOAT32;
```
### PD_Buffer
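* `data`: the data, of type `void*`; holds the address where the data starts.
* `length`: the actual length of the data, in <strong>bytes</strong>.
* `capacity`: the size of the memory allocated for the data; always greater than or equal to `length`.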
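Because `length` and `capacity` are byte counts rather than element counts, it can help to compute them from the tensor shape. The sketch below is illustrative only: `DemoBuffer`, `TensorBytes`, and `WrapStorage` are hypothetical names, and `DemoBuffer` is a local stand-in that merely mirrors the three fields described above, not the real `PD_Buffer` definition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in that mirrors the three fields described above.
struct DemoBuffer {
  void* data;            // start address of the data
  std::size_t length;    // bytes actually used
  std::size_t capacity;  // bytes allocated; must be >= length
};

// Bytes needed for a dense tensor with the given shape and element size.
std::size_t TensorBytes(const std::vector<int>& shape, std::size_t elem_size) {
  std::size_t count = 1;
  for (int dim : shape) {
    assert(dim >= 0);
    count *= static_cast<std::size_t>(dim);
  }
  return count * elem_size;
}

// Describes caller-owned float storage as a buffer, with sizes in bytes.
DemoBuffer WrapStorage(std::vector<float>& storage) {
  DemoBuffer buf;
  buf.data = storage.data();
  buf.length = storage.size() * sizeof(float);
  buf.capacity = storage.capacity() * sizeof(float);
  return buf;
}
```

For a `{1, 3, 300, 300}` float tensor, `TensorBytes` yields `270000 * sizeof(float)` bytes, which is the value an input buffer of that shape would need for `length`.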
### Example code
``` C
PD_ZeroCopyTensor input;
// Set the input name
input.name = "data";
// Set the size of the input data
input.data.capacity = sizeof(float) * 1 * 3 * 300 * 300;
input.data.length = input.data.capacity;
input.data.data = malloc(input.data.capacity);
// Set the shape of the input data
int shape[] = {1, 3, 300, 300};
input.shape.data = (int *)shape;
input.shape.capacity = sizeof(shape);
input.shape.length = sizeof(shape);
// Set the data type of the input
input.dtype = PD_FLOAT32;
```
### PD_Predictor
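`PD_Predictor` is a high-performance inference engine. By analyzing the computation graph, it applies a series of optimizations (such as OP fusion, memory/GPU memory optimization, and support for underlying acceleration libraries such as MKLDNN and TensorRT). It mainly provides the following functions:
* `PD_Predictor* PD_NewPredictor(const PD_AnalysisConfig* config)`: creates a new `PD_Predictor` and returns a pointer to it.
* `void PD_DeletePredictor(PD_Predictor* predictor)`: deletes the `PD_Predictor` that `predictor` points to.
* `int PD_GetInputNum(const PD_Predictor* predictor)`: returns the number of model inputs.
* `int PD_GetOutputNum(const PD_Predictor* predictor)`: returns the number of model outputs.
* `const char* PD_GetInputName(const PD_Predictor* predictor, int n)`: returns the name of the `n`-th model input.
* `const char* PD_GetOutputName(const PD_Predictor* predictor, int n)`: returns the name of the `n`-th model output.
* `void PD_SetZeroCopyInput(PD_Predictor* predictor, const PD_ZeroCopyTensor* tensor)`: uses a `PD_ZeroCopyTensor` to set the values, shape, lod, and other information of a model input. Only level-1 lod is currently supported.
* `void PD_GetZeroCopyOutput(PD_Predictor* predictor, PD_ZeroCopyTensor* tensor)`: uses a `PD_ZeroCopyTensor` to get the values, shape, lod, and other information of a model output. Only level-1 lod is currently supported.
* `void PD_ZeroCopyRun(PD_Predictor* predictor)`: runs the inference engine, computing the model outputs from its inputs.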
#### Code example
``` C
PD_AnalysisConfig* config = PD_NewAnalysisConfig();
const char* model_dir = "./model/";
PD_SetModel(config, model_dir, NULL);
PD_SwitchSpecifyInputNames(config, true);  // recommended when using PD_ZeroCopyTensor with multiple inputs
PD_SwitchUseFeedFetchOps(config, false);   // must be false when using PD_ZeroCopyTensor
// Create the predictor first, because the input setup below queries it
PD_Predictor *predictor = PD_NewPredictor(config);
```
``` C
PD_ZeroCopyTensor input;
// Set the input name
input.name = (char *)(PD_GetInputName(predictor, 0));
// Set the size of the input data
input.data.capacity = sizeof(float) * 1 * 3 * 300 * 300;
input.data.length = input.data.capacity;
input.data.data = malloc(input.data.capacity);
// Set the shape of the input data
int shape[] = {1, 3, 300, 300};
input.shape.data = (int *)shape;
input.shape.capacity = sizeof(shape);
input.shape.length = sizeof(shape);
// Set the data type of the input
input.dtype = PD_FLOAT32;
```
``` C
int input_num = PD_GetInputNum(predictor);
printf("Input num: %d\n", input_num);
int output_num = PD_GetOutputNum(predictor);
printf("Output num: %d\n", output_num);
PD_SetZeroCopyInput(predictor, &input);  // only one input here; with multiple inputs, an array can be passed in
PD_ZeroCopyRun(predictor);               // run the inference engine
PD_ZeroCopyTensor output;
output.name = (char *)(PD_GetOutputName(predictor, 0));
PD_GetZeroCopyOutput(predictor, &output);
```
## Complete example
The following is a complete example of running inference with the Fluid C API, using the resnet50 model.
``` C
#include "paddle_c_api.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * The main procedures to run a predictor according to c-api:
 * 1. Create config to set how to process the inference.
 * 2. Prepare the input PD_ZeroCopyTensor for the inference.
 * 3. Set PD_Predictor.
 * 4. Call PD_ZeroCopyRun() to start.
 * 5. Obtain the output.
 * 6. According to the size of the output data, print all the output data.
 */
int main() {
    // Configure PD_AnalysisConfig
    PD_AnalysisConfig* config = PD_NewAnalysisConfig();
    const char* model_path = "./model/model";
    const char* params_path = "./model/params";
    PD_SetModel(config, model_path, params_path);
    PD_SwitchSpecifyInputNames(config, true);
    PD_SwitchUseFeedFetchOps(config, false);

    // Create a new PD_Predictor
    PD_Predictor *predictor = PD_NewPredictor(config);
    // Get the number of inputs and outputs
    int input_num = PD_GetInputNum(predictor);
    printf("Input num: %d\n", input_num);
    int output_num = PD_GetOutputNum(predictor);
    printf("Output num: %d\n", output_num);

    // Set up the input data structure
    PD_ZeroCopyTensor input;
    // Set the input name
    input.name = (char *)(PD_GetInputName(predictor, 0));
    // Set the size of the input data and fill it with zeros
    input.data.capacity = sizeof(float) * 1 * 3 * 318 * 318;
    input.data.length = input.data.capacity;
    input.data.data = malloc(input.data.capacity);
    memset(input.data.data, 0, input.data.capacity);
    // Set the shape of the input data
    int shape[] = {1, 3, 318, 318};
    input.shape.data = (int *)shape;
    input.shape.capacity = sizeof(shape);
    input.shape.length = sizeof(shape);
    // Set the data type of the input
    input.dtype = PD_FLOAT32;
    PD_SetZeroCopyInput(predictor, &input);

    // Run the inference engine
    PD_ZeroCopyRun(predictor);

    // Get the inference output
    PD_ZeroCopyTensor output;
    output.name = (char *)(PD_GetOutputName(predictor, 0));
    // After the output is fetched, data, shape, and other information can be read from this structure
    PD_GetZeroCopyOutput(predictor, &output);
    float* result = (float *)(output.data.data);
    int result_length = output.data.length / sizeof(float);
    for (int i = 0; i < result_length; ++i) {
        printf("%f\n", result[i]);
    }
    return 0;
}
```
To run the code above, copy paddle_c_api.h to a location where the compiler can find it at build time. In addition, add the path of libpaddle_fluid_c.so to the environment variables.
Finally, compile with gcc.
``` shell
gcc ${SOURCE_NAME} \
PaddlePaddle provides C++, C, and Python APIs to support deploying models to production.
.. toctree::
Server-side Deployment
PaddlePaddle provides various methods to support deployment and release of trained models.
.. toctree::
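# Introduction to the C++ Inference API
To make inference deployment simpler and more convenient, PaddlePaddle provides a set of high-level C++ inference APIs. Details follow.
## Contents
- [High-performance inference with Predictor](#使用Predictor进行高性能预测)
- [Managing inference configuration with Config](#使用Config管理预测配置)
- [Managing input/output with Tensor](#使用Tensor管理输入/输出)
- [Multi-threaded inference with PredictorPool](#使用PredictorPool在多线程下进行预测)
- [Building and testing the C++ inference sample](#C++预测样例编译测试)
- [Performance tuning](#性能调优)
- [Inference upgrade guide](#推理升级指南)
- [C++ API](#C++_API)
## <a name="使用Predictor进行高性能预测"> High-performance inference with Predictor</a>
Paddle Inference uses Predictor for inference. Predictor is a high-performance inference engine. By analyzing the computation graph, it applies a series of optimizations (such as OP fusion, memory/GPU memory optimization, and support for underlying acceleration libraries such as MKLDNN and TensorRT), which can greatly improve inference performance.
To show the complete inference workflow, the following is a complete example of inference with Predictor. The concepts and configuration options it involves are explained in detail in later sections.
#### Predictor inference example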
``` c++
#include "paddle_inference_api.h"

#include <functional>
#include <numeric>
#include <string>
#include <vector>

namespace paddle_infer {
void CreateConfig(Config* config, const std::string& model_dirname) {
  // Load the model from disk
  config->SetModel(model_dirname + "/model",
                   model_dirname + "/params");
  // config->SetModel(model_dirname);
  // To load the model from memory, use the SetModelBuffer interface
  // config->SetModelBuffer(prog_buffer, prog_size, params_buffer, params_size);
  config->EnableUseGpu(100 /* initial GPU memory pool size, in MB */, 0 /* GPU ID 0 */);  // enable GPU inference

  /* for cpu
  config->EnableMKLDNN();   // enable MKLDNN acceleration
  */

  config->SwitchIrDebug(true);  // visual debugging option; if enabled, a dot file is generated after each graph optimization pass
  // config->SwitchIrOptim(false);  // true by default; setting it to false turns off all optimizations
  // config->EnableMemoryOptim();   // enable memory/GPU memory reuse
}

void RunAnalysis(int batch_size, std::string model_dirname) {
  // 1. Create the Config
  Config config;
  CreateConfig(&config, model_dirname);

  // 2. Create the predictor from the config and prepare input data (all zeros here)
  auto predictor = CreatePredictor(config);
  int channels = 3;
  int height = 224;
  int width = 224;
  std::vector<float> input(batch_size * channels * height * width, 0.0f);

  // 3. Create the input
  // The ZeroCopy interface avoids redundant CPU copies during inference and improves performance
  auto input_names = predictor->GetInputNames();
  auto input_t = predictor->GetInputHandle(input_names[0]);
  input_t->Reshape({batch_size, channels, height, width});
  input_t->CopyFromCpu(input.data());

  // 4. Run the inference engine
  predictor->Run();

  // 5. Get the output
  std::vector<float> out_data;
  auto output_names = predictor->GetOutputNames();
  auto output_t = predictor->GetOutputHandle(output_names[0]);
  std::vector<int> output_shape = output_t->shape();
  int out_num = std::accumulate(output_shape.begin(), output_shape.end(), 1, std::multiplies<int>());
  out_data.resize(out_num);
  output_t->CopyToCpu(out_data.data());
}
}  // namespace paddle_infer

int main() {
  // Model download URL: http://paddle-inference-dist.cdn.bcebos.com/tensorrt_test/mobilenet.tar.gz
  paddle_infer::RunAnalysis(1, "./mobilenet");
  return 0;
}
```
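## <a name="使用Config管理预测配置"> Managing inference configuration with Config</a>
#### General optimization configuration
``` c++
config->SwitchIrOptim(true);  // enable computation graph analysis and optimization, including OP fusion
config->EnableMemoryOptim();  // enable memory/GPU memory reuse
```
#### Setting the model and parameter paths
* Non-combined form: when the model directory `model_dir` contains one model file and multiple parameter files, pass in the model directory path. The model file name defaults to `__model__`.
``` c++
config->SetModel("./model_dir");
```
* Combined form: when the model directory `model_dir` contains only one model file `model` and one parameter file `params`, pass in the paths of the model file and the parameter file.
``` c++
config->SetModel("./model_dir/model", "./model_dir/params");
```
#### Configuring CPU inference
``` c++
config->DisableGpu();  // disable the GPU
config->EnableMKLDNN();  // enable MKLDNN to accelerate CPU inference
config->SetCpuMathLibraryNumThreads(10);  // set the number of CPU math library threads; can speed up inference when enough CPU cores are available
```
#### Configuring GPU inference
``` c++
config->EnableUseGpu(100, 0);  // initialize 100 MB of GPU memory and use GPU ID 0
config->gpu_device_id();  // returns the GPU ID in use
// Enable TensorRT inference to improve GPU performance; requires an inference library built with TensorRT
config->EnableTensorRtEngine(1 << 20 /*workspace_size*/,
                             batch_size /*max_batch_size*/,
                             3 /*min_subgraph_size*/,
                             PrecisionType::kFloat32 /*precision*/,
                             false /*use_static*/,
                             false /*use_calib_mode*/);
```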
## <a name="使用Tensor管理输入/输出"> Managing input/output with Tensor</a>
``` c++
// Get the input and output tensors from the created Predictor
auto input_names = predictor->GetInputNames();
auto input_t = predictor->GetInputHandle(input_names[0]);
auto output_names = predictor->GetOutputNames();
auto output_t = predictor->GetOutputHandle(output_names[0]);
// Reshape the tensor
input_t->Reshape({batch_size, channels, height, width});
// Use CopyFromCpu to feed CPU data; use CopyToCpu to copy the output data to the CPU
input_t->CopyFromCpu<float>(input_data /* data pointer */);
output_t->CopyToCpu(out_data /* data pointer */);
// Set the LoD
std::vector<std::vector<size_t>> lod_data = {{0}, {0}};
input_t->SetLoD(lod_data);
// Get the tensor data pointers
float *input_d = input_t->mutable_data<float>(PlaceType::kGPU);  // use PlaceType::kCPU on CPU
PlaceType place;
int output_size;
float *output_d = output_t->data<float>(&place, &output_size);
```
## <a name="使用PredictorPool在多线程下进行预测"> Multi-threaded inference with PredictorPool</a>
// Initialize the PredictorPool when the service starts
PredictorPool pool(config, thread_num);
// Get a Predictor by thread id
auto predictor = pool.Retrive(thread_id);
// Run inference with the Predictor
predictor->Run();
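## <a name="C++预测样例编译测试"> Building and testing the C++ inference sample</a>
1. Download or build the Paddle inference library; see [Installing and compiling the C++ inference library](./build_and_install_lib_cn.html).
2. Download the [inference sample](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz) and extract it, then enter the `sample/inference` directory.
The `inference` directory is structured as follows:
``` shell
├── CMakeLists.txt
├── mobilenet_test.cc
├── thread_mobilenet_test.cc
├── mobilenetv1
│ ├── model
│ └── params
├── run.sh
└── run_impl.sh
```
- `mobilenet_test.cc` is the C++ source file for single-threaded inference
- `thread_mobilenet_test.cc` is the C++ source file for multi-threaded inference
- `mobilenetv1` is the model directory
- `run.sh` is the script that runs inference
3. Configure the build and run script
``` shell
# Set whether to enable MKL, GPU, and TensorRT; the GPU must be enabled to use TensorRT
# Set the inference library path, CUDA library path, CUDNN library path, TensorRT path, and model path for your environment
```
4. Build and run the sample
``` shell
sh run.sh
```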
## <a name="性能调优"> Performance tuning</a>
### Inference on CPU
1. If the CPU model allows it, use a build with AVX and MKL.
2. Try Intel MKLDNN acceleration.
3. When enough CPU cores are available, increase the value of `num` in `config->SetCpuMathLibraryNumThreads(num);`.
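One possible way to pick `num` is to divide the available cores among the predictors that run at the same time. This is only a heuristic sketch and not a recommendation from Paddle: `ChooseMathThreads` is a hypothetical helper, and the even-split rule is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical heuristic (an assumption, not Paddle guidance): share the
// available cores evenly among the predictors that run concurrently, and
// never go below one math-library thread.
int ChooseMathThreads(int available_cores, int concurrent_predictors) {
  assert(concurrent_predictors > 0);
  return std::max(1, available_cores / concurrent_predictors);
}
```

The result could then be passed to `config->SetCpuMathLibraryNumThreads(...)`; the core count itself might come from `std::thread::hardware_concurrency()`, which can return 0 when the value is unknown, so the lower bound of one thread matters.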
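### Inference on GPU
1. Try enabling the TensorRT subgraph acceleration engine. Through computation graph analysis, Paddle can automatically fuse parts of the graph into subgraphs and call NVIDIA TensorRT to accelerate them. For details, see [Inference with the Paddle-TensorRT library](../../performance_improving/inference_improving/paddle_tensorrt_infer.html).
### Multi-threaded inference
Paddle Inference supports improving inference performance by running multiple Predictors in different threads, in both CPU and GPU environments.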
sh run.sh
## <a name="推理升级指南"> Inference upgrade guide</a>
The new API is purely additive. The original API remains unchanged for now and will be removed gradually in later versions.
- The namespace changes from `paddle` to `paddle_infer`
- `PaddleTensor`, `PaddleBuf`, and related types are deprecated; `ZeroCopyTensor` becomes the default Tensor type and is renamed `Tensor`
- A new `PredictorPool` utility class simplifies creating predictors for multiple threads; more utilities will be added later
- The return value of `CreatePredictor` (formerly `CreatePaddlePredictor`) changes from `unique_ptr` to `shared_ptr` to avoid incorrect destruction order after Clone
API changes
| Old name | New name | Behavior change |
| ---------------------------- | ---------------------------- | ----------------------------- |
| Header `paddle_infer.h` | Unchanged | Contains the old interfaces; kept for backward compatibility |
| None | `paddle_inference_api.h` | New API; can coexist with the old interfaces |
| `CreatePaddlePredictor` | `CreatePredictor` | Return value becomes shared_ptr |
| `ZeroCopyTensor` | `Tensor` | None |
| `AnalysisConfig` | `Config` | None |
| `TensorRTConfig` | Deprecated | |
| `PaddleTensor` + `PaddleBuf` | Deprecated | |
| `Predictor::GetInputTensor` | `Predictor::GetInputHandle` | None |
| `Predictor::GetOutputTensor` | `Predictor::GetOutputHandle` | None |
| | `PredictorPool` | Simplifies creating multiple predictors |
The workflow with the new C++ API is exactly the same as before; only the names change.
#include "paddle_inference_api.h"
using namespace paddle_infer;
Config config;
auto predictor = CreatePredictor(config);
// Get the handles for the inputs and outputs of the model
auto input0 = predictor->GetInputHandle("X");
auto output0 = predictor->GetOutputHandle("Out");
for (...) {
  // Assign data to input0
  predictor->Run();
  // get data from the output0 handle
}
## <a name="C++_API"> C++ API</a>
##### CreatePredictor
std::shared_ptr<Predictor> CreatePredictor(const Config& config);
// Set up the Config
Config config;
// Create a Predictor from the Config
std::shared_ptr<Predictor> predictor = CreatePredictor(config);
- `config(Config)` - configuration used to build the Predictor
##### GetVersion()
std::string GetVersion();
Returns the version information of Paddle Inference as a string.
- `None`
##### PlaceType
enum class PaddlePlace { kUNK };
using PlaceType = paddle::PaddlePlace;
`{kUNK, kCPU, kGPU}`
##### PrecisionType
enum class Precision { kFloat32 };
using PrecisionType = paddle::AnalysisConfig::Precision;
`{kFloat32, kInt8, kHalf}`
##### DataType
enum class PaddleDType { FLOAT32 };
using DataType = paddle::PaddleDType;
`{FLOAT32, INT64, INT32, UINT8}`
##### GetNumBytesOfDataType
int GetNumBytesOfDataType(DataType dtype);
- `dtype` - a `DataType` enum value
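To make the idea concrete, the mapping from the four listed data types to their element sizes can be sketched as follows. `DemoDataType` and `DemoNumBytes` are hypothetical local names that stand in for the real `DataType` and `GetNumBytesOfDataType`; this is an illustration of the concept, not the library's implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical local stand-in for the DataType values listed above.
enum class DemoDataType { FLOAT32, INT64, INT32, UINT8 };

// Size in bytes of one element of the given type.
int DemoNumBytes(DemoDataType dtype) {
  switch (dtype) {
    case DemoDataType::FLOAT32:
      return static_cast<int>(sizeof(float));
    case DemoDataType::INT64:
      return static_cast<int>(sizeof(std::int64_t));
    case DemoDataType::INT32:
      return static_cast<int>(sizeof(std::int32_t));
    case DemoDataType::UINT8:
      return static_cast<int>(sizeof(std::uint8_t));
  }
  return -1;  // not reached for the enum values above
}
```

Multiplying this element size by the number of elements in a tensor gives the buffer size in bytes.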
##### Predictor
class Predictor;
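`Predictor` is the predictor of Paddle Inference, created by `CreatePredictor` from a `Config`. Through the interfaces that Predictor provides, users can set input data, run model inference, and get the outputs.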
using namespace paddle_infer;
Config config;
auto predictor = CreatePredictor(config);
// Get the handles for the inputs and outputs of the model
auto input0 = predictor->GetInputHandle("X");
auto output0 = predictor->GetOutputHandle("Out");
for (...) {
// Assign data to input0
// get data from the output0 handle
###### GetInputNames()
- `None`
###### GetOutputNames()
- `None`
###### GetInputHandle(const std::string& name)
- `name` - name of the Tensor
###### GetOutputHandle(const std::string& name)
- `name` - name of the Tensor
###### Run()
- `None`
###### ClearIntermediateTensor()
- `None`
###### Clone()
- `None`
##### Tensor
class Tensor;
Tensor is how Paddle Inference organizes data. It wraps the underlying data and provides interfaces to operate on it, including setting the shape, the data, and the LoD information.
// Get the input and output tensors from the created Predictor
auto input_names = predictor->GetInputNames();
auto input_t = predictor->GetInputHandle(input_names[0]);
auto output_names = predictor->GetOutputNames();
auto output_t = predictor->GetOutputHandle(output_names[0]);
// Reshape the tensor
input_t->Reshape({batch_size, channels, height, width});
// Use CopyFromCpu to feed CPU data; use CopyToCpu to copy the output data to the CPU
input_t->CopyFromCpu<float>(input_data /* data pointer */);
output_t->CopyToCpu(out_data /* data pointer */);
// Set the LoD
std::vector<std::vector<size_t>> lod_data = {{0}, {0}};
input_t->SetLoD(lod_data);
// Get the tensor data pointers
float *input_d = input_t->mutable_data<float>(PlaceType::kGPU);  // use PlaceType::kCPU on CPU
PlaceType place;
int output_size;
float *output_d = output_t->data<float>(&place, &output_size);
###### Reshape(shape)
- `shape(const std::vector<int>&)` - dimension information
###### shape()
- `None`
###### CopyFromCpu(data)
template <typename T>
void CopyFromCpu(const T* data);
// float* data = ...;
auto in_tensor = predictor->GetInputHandle("in_name");
- `data(const T*)` - pointer to the CPU data
###### CopyToCpu(data)
template <typename T>
void CopyToCpu(T* data);
std::vector<float> data(100);
auto out_tensor = predictor->GetOutputHandle("out_name");
- `data(T*)` - pointer to the CPU data
###### data<T>(place, size)
template <typename T>
T* data(PlaceType* place, int* size) const;
PlaceType place;
int size;
auto out_tensor = predictor->GetOutputHandle("out_name");
float* data = out_tensor->data<float>(&place, &size);
- `place(PlaceType*)` - receives the PlaceType of the tensor
- `size(int*)` - receives the size of the tensor
###### mutable_data<T>(place)
template <typename T>
T* mutable_data(PlaceType place);
auto in_tensor = predictor->GetInputHandle("in_name");
float* data = in_tensor->mutable_data<float>(PlaceType::kCPU);
data[0] = 1.;
- `place(PlaceType)` - device information
###### SetLoD(lod)
- `lod(const std::vector<std::vector<size_t>>)` - LoD information of the Tensor
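As a rough illustration of what one level of LoD can look like, the sketch below converts per-sequence lengths into cumulative offsets. It assumes the offset-based convention (each level starts at 0 and ends at the total number of rows); `LengthsToOffsets` is a hypothetical helper and not part of the Paddle API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: turns sequence lengths such as {2, 3, 1} into one
// level of offset-based LoD, here {0, 2, 5, 6}.
std::vector<std::size_t> LengthsToOffsets(
    const std::vector<std::size_t>& lengths) {
  std::vector<std::size_t> offsets;
  offsets.reserve(lengths.size() + 1);
  offsets.push_back(0);
  for (std::size_t len : lengths) {
    offsets.push_back(offsets.back() + len);
  }
  return offsets;
}
```

A single-level LoD built this way has the `std::vector<std::vector<size_t>>` shape that `SetLoD` takes, for example `{LengthsToOffsets({2, 3, 1})}`.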
###### lod()
- `None`
###### type()
- `None`
###### name()
- `None`
##### Config
class Config;
Config config;
config.SwitchIrOptim(false);  // true by default; setting it to false turns off all optimizations
config.EnableMemoryOptim();  // enable memory/GPU memory reuse
###### SetModel(const std::string& model_dir)
- `model_dir` - path of the model directory
###### model_dir()
- `None`
###### SetModel(const std::string& prog, const std::string& params)
- `prog` - path of the model file
- `params` - path of the model parameter file
###### SetProgFile(const std::string& prog)
- `prog` - path of the model file
###### prog_file()
- `None`
###### SetParamsFile(const std::string& params)
- `params` - path of the model parameter file
###### params_file()
- `None`
###### SetModelBuffer(const char* prog_buffer, size_t prog_buffer_size, const char* params_buffer, size_t params_buffer_size)
- `prog_buffer` - model structure data in memory
- `prog_buffer_size` - size of the model structure data in memory
- `params_buffer` - model parameter data in memory
- `params_buffer_size` - size of the model parameter data in memory
###### model_from_memory()
- `None`
###### SetOptimCacheDir(const std::string& opt_cache_dir)
- `opt_cache_dir` - cache path
###### DisableFCPadding()
Disables fc padding.
- `None`
###### use_fc_padding()
Checks whether fc padding is enabled.
- `None`
Returns: whether fc padding is enabled
###### EnableUseGpu(uint64_t memory_pool_init_size_mb, int device_id = 0)
- `memory_pool_init_size_mb` - GPU memory allocated at initialization, in MB
- `device_id` - device id
###### DisableGpu()
- `None`
###### use_gpu()
- `None`
###### gpu_device_id()
Gets the device id of the GPU.
- `None`
Returns: the device id of the GPU
###### memory_pool_init_size_mb()
- `None`
###### fraction_of_gpu_memory_for_pool()
- `None`
###### EnableCUDNN()
- `None`
###### cudnn_enabled()
- `None`
###### EnableXpu(int l3_workspace_size)
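- `l3_workspace_size` - size of the memory allocated for the L3 cache
###### SwitchIrOptim(int x=true)
- `x` - whether to enable IR optimization; enabled by default
###### ir_optim()
- `None`
###### SwitchUseFeedFetchOps(int x = true)
Sets whether the feed and fetch ops are used; for internal use only.
- `x` - whether to use the feed and fetch ops
###### use_feed_fetch_ops_enabled()
Whether the feed and fetch ops are used.
- `None`
Returns: whether the feed and fetch ops are used
###### SwitchSpecifyInputNames(bool x = true)
- `x` - whether to specify the names of the input tensors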
###### specify_input_name()
- `None`
###### EnableTensorRtEngine(int workspace_size = 1 << 20, int max_batch_size = 1, int min_subgraph_size = 3, Precision precision = Precision::kFloat32, bool use_static = false, bool use_calib_mode = true)
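- `workspace_size` - size of the workspace used by TensorRT
- `max_batch_size` - maximum batch size; the batch size at run time must not exceed this value
- `min_subgraph_size` - Paddle-TRT runs in the form of subgraphs; to avoid performance loss, Paddle-TRT is used only when the number of nodes inside a subgraph is greater than `min_subgraph_size`
- `precision` - precision used by TRT; FP32 (kFloat32), FP16 (kHalf), and Int8 (kInt8) are supported
- `use_static` - if set to true, the TRT optimization information is serialized to disk on the first run and loaded directly on later runs instead of being regenerated
- `use_calib_mode` - must be set to true to run Paddle-TRT int8 offline quantization calibration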
###### tensorrt_engine_enabled()
- `None`
###### SetTRTDynamicShapeInfo(std::map<std::string, std::vector<int>> min_input_shape, std::map<std::string, std::vector<int>> max_input_shape, std::map<std::string, std::vector<int>> optim_input_shape, bool disable_trt_plugin_fp16 = false)
- `min_input_shape` - minimum shape supported by the TensorRT subgraph with dynamic shape
- `max_input_shape` - maximum shape supported by the TensorRT subgraph with dynamic shape
- `optim_input_shape` - optimal shape for the TensorRT subgraph with dynamic shape
- `disable_trt_plugin_fp16` - makes the TensorRT plugins not run in fp16 precision
###### EnableLiteEngine(AnalysisConfig::Precision precision_mode = Precision::kFloat32, bool zero_copy = false, const std::vector<std::string>& passes_filter = {}, const std::vector<std::string>& ops_filter = {})
- `precision_mode` - running precision of the lite subgraph
- `zero_copy` - enables zero_copy, so that data is shared between the lite subgraph and paddle inference
- `passes_filter` - sets the passes of the lite subgraph
- `ops_filter` - sets the ops that are not run by the lite subgraph
###### lite_engine_enabled()
- `None`
###### SwitchIrDebug(int x = true)
- `x` - whether to print the IR
###### EnableMKLDNN()
- `None`
###### SetMkldnnCacheCapacity(int capacity)
Sets the capacity of the mkldnn cache for different input shapes. For the MKLDNN cache design document, see [this link](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/mkldnn/caching/caching.md).
- `capacity` - cache capacity
###### mkldnn_enabled()
- `None`
###### SetMKLDNNOp(std::unordered_set<std::string> op_list)
- `op_list` - list of ops that preferentially use mkldnn
###### EnableMkldnnQuantizer()
- `None`
###### mkldnn_quantizer_enabled()
- `None`
###### EnableMkldnnBfloat16()
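Enables mkldnn bf16.
- `None`
###### mkldnn_bfloat16_enabled()
Whether mkldnn bf16 is enabled.
- `None`
Returns: whether mkldnn bf16 is enabled
###### mkldnn_quantizer_config()
- `None`
###### SetCpuMathLibraryNumThreads(int cpu_math_library_num_threads)
Sets the number of compute threads of the CPU BLAS library.
- `cpu_math_library_num_threads` - number of compute threads of the BLAS library
###### cpu_math_library_num_threads()
The number of compute threads of the CPU BLAS library.
- `None`
Returns: the number of compute threads of the CPU BLAS library.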
###### ToNativeConfig()
- `None`
###### EnableGpuMultiStream()
- `None`
###### thread_local_stream_enabled()
- `None`
###### EnableMemoryOptim()
- `None`
###### enable_memory_optim()
- `None`
###### EnableProfile()
- `None`
###### profile_enabled()
- `None`
###### DisableGlogInfo()
Disables the logs that Paddle Inference emits while running.
- `None`
###### glog_info_disabled()
- `None`
###### SetInValid()
- `None`
###### is_valid()
- `None`
###### pass_builder()
Config config;
auto pass_builder = config.pass_builder();
pass_builder->DeletePass("fc_fuse_pass");  // remove fc_fuse
- `None`
##### PredictorPool
class PredictorPool;
Config config;
// init config
int thread_num = 4;
PredictorPool pool(config, thread_num);
auto predictor0 = pool.Retrive(0);
auto predictor3 = pool.Retrive(3);
###### Retrive(idx)
- `idx(int)` - thread id
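The pattern behind a predictor pool (build several independent instances once, then hand each worker its own instance by index) can be sketched without the library. `DemoPredictor` and `DemoPool` below are hypothetical stand-ins written for illustration; they do not show how `PredictorPool` is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a predictor; it only stores an id.
struct DemoPredictor {
  explicit DemoPredictor(int id) : id(id) {}
  int id;
};

// Hypothetical pool: creates `size` independent instances up front and
// returns them by index, so each worker thread can use its own instance.
class DemoPool {
 public:
  explicit DemoPool(std::size_t size) {
    for (std::size_t i = 0; i < size; ++i) {
      items_.push_back(std::make_shared<DemoPredictor>(static_cast<int>(i)));
    }
  }

  std::shared_ptr<DemoPredictor> Retrieve(std::size_t idx) const {
    assert(idx < items_.size());
    return items_[idx];
  }

 private:
  std::vector<std::shared_ptr<DemoPredictor>> items_;
};
```

Each index always maps to the same instance, and different indices never share one, which is the property that lets worker threads run without contending for a single predictor.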
# Introduction to C++ Inference API
To make the deployment of inference model more convenient, a set of high-level APIs are provided in Fluid to hide diverse optimization processes in low level.
Details are as follows:
## <a name="Use AnalysisPredictor to perform high-performance inference"> Use AnalysisPredictor to perform high-performance inference</a>
Paddle Fluid uses AnalysisPredictor to perform inference. AnalysisPredictor is a high-performance inference engine. By analyzing the computation graph, the engine applies a series of optimizations (such as OP fusion, memory/GPU memory optimization, and support for underlying acceleration libraries such as MKLDNN and TensorRT), which can greatly improve inference performance.
In order to show the complete inference process, the following is a complete example of using AnalysisPredictor. The specific concepts and configurations involved will be detailed in the following sections.
#### AnalysisPredictor sample
``` c++
#include "paddle_inference_api.h"

#include <functional>
#include <numeric>
#include <string>
#include <vector>

namespace paddle {
void CreateConfig(AnalysisConfig* config, const std::string& model_dirname) {
  // load model from disk
  config->SetModel(model_dirname + "/model",
                   model_dirname + "/params");
  // config->SetModel(model_dirname);
  // use SetModelBuffer if load model from memory
  // config->SetModelBuffer(prog_buffer, prog_size, params_buffer, params_size);
  config->EnableUseGpu(100 /*init GPU memory pool with 100MB*/, 0 /*set GPU ID to 0*/);

  /* for cpu
  config->EnableMKLDNN();   // enable MKLDNN
  */

  // ZeroCopyTensor requires the feed and fetch ops to be disabled
  config->SwitchUseFeedFetchOps(false);
  // set to true if there are multiple inputs
  config->SwitchSpecifyInputNames(true);
  config->SwitchIrDebug(true);  // visual debugging option; if enabled, a dot file is generated after each graph optimization pass
  // config->SwitchIrOptim(false);  // true by default; turns off all optimizations if set to false
  // config->EnableMemoryOptim();   // enable memory/GPU memory reuse
}

void RunAnalysis(int batch_size, std::string model_dirname) {
  // 1. create AnalysisConfig
  AnalysisConfig config;
  CreateConfig(&config, model_dirname);

  // 2. create predictor based on config, and prepare input data (all zeros here)
  auto predictor = CreatePaddlePredictor(config);
  int channels = 3;
  int height = 224;
  int width = 224;
  std::vector<float> input(batch_size * channels * height * width, 0.0f);

  // 3. build inputs
  // uses the ZeroCopy API here to avoid extra copying from CPU, improving performance
  auto input_names = predictor->GetInputNames();
  auto input_t = predictor->GetInputTensor(input_names[0]);
  input_t->Reshape({batch_size, channels, height, width});
  input_t->copy_from_cpu(input.data());

  // 4. run inference
  predictor->ZeroCopyRun();

  // 5. get outputs
  std::vector<float> out_data;
  auto output_names = predictor->GetOutputNames();
  auto output_t = predictor->GetOutputTensor(output_names[0]);
  std::vector<int> output_shape = output_t->shape();
  int out_num = std::accumulate(output_shape.begin(), output_shape.end(), 1, std::multiplies<int>());
  out_data.resize(out_num);
  output_t->copy_to_cpu(out_data.data());
}
}  // namespace paddle

int main() {
  // the model can be downloaded from http://paddle-inference-dist.cdn.bcebos.com/tensorrt_test/mobilenet.tar.gz
  paddle::RunAnalysis(1, "./mobilenet");
  return 0;
}
```
## <a name="Use AnalysisConfig to manage inference configurations"> Use AnalysisConfig to manage inference configurations</a>
AnalysisConfig manages the inference configuration of AnalysisPredictor, providing model path setting, inference engine running device selection, and a variety of options to optimize the inference process. The configuration method is as follows:
#### General optimizing configuration
``` c++
config->SwitchIrOptim(true);  // Enable analysis and optimization of the computation graph, including OP fusion
config->EnableMemoryOptim();  // Enable memory/GPU memory reuse
```
**Note:** Using ZeroCopyTensor requires the following setting:
``` c++
config->SwitchUseFeedFetchOps(false);  // disable feed and fetch OP
```
#### Set model and parameter paths
When loading the model from disk, there are two ways to set the path in AnalysisConfig, depending on how the model and parameter files are stored:
* Non-combined form: when the model folder `model_dir` contains one model file and multiple parameter files, pass in the path of the model folder. The default name of the model file is `__model__`.
``` c++
config->SetModel("./model_dir");
```
* Combined form: when the model folder `model_dir` contains only one model file `model` and one parameter file `params`, pass in the paths of the model file and the parameter file.
``` c++
config->SetModel("./model_dir/model", "./model_dir/params");
```
At build time, link against `libpaddle_fluid.a` or `libpaddle_fluid.so`.
#### Configure CPU inference
``` c++
config->DisableGpu();  // disable GPU
config->EnableMKLDNN();  // enable MKLDNN, accelerating CPU inference
config->SetCpuMathLibraryNumThreads(10);  // set the number of threads of the CPU math library; accelerates CPU inference if enough CPU cores are available
```
#### Configure GPU inference
``` c++
config->EnableUseGpu(100, 0);  // initialize 100 MB of GPU memory, using GPU ID 0
config->gpu_device_id();  // returns the GPU ID being used
// Turn on TensorRT to improve GPU performance. This requires an inference library built with TensorRT
config->EnableTensorRtEngine(1 << 20 /*workspace_size*/,
                             batch_size /*max_batch_size*/,
                             3 /*min_subgraph_size*/,
                             AnalysisConfig::Precision::kFloat32 /*precision*/,
                             false /*use_static*/,
                             false /*use_calib_mode*/);
```
## <a name="Use ZeroCopyTensor to manage I/O"> Use ZeroCopyTensor to manage I/O</a>
ZeroCopyTensor is the input/output data structure of AnalysisPredictor. Using ZeroCopyTensor avoids redundant data copies when preparing inputs and fetching outputs, which improves inference performance.
**Note:** When using ZeroCopyTensor, be sure to set `config->SwitchUseFeedFetchOps(false);`.
``` c++
// get input/output tensor
auto input_names = predictor->GetInputNames();
auto input_t = predictor->GetInputTensor(input_names[0]);
auto output_names = predictor->GetOutputNames();
auto output_t = predictor->GetOutputTensor(output_names[0]);
// reshape tensor
input_t->Reshape({batch_size, channels, height, width});
// Use copy_from_cpu to feed CPU data; use copy_to_cpu to copy the output data to the CPU
input_t->copy_from_cpu<float>(input_data /*data pointer*/);
output_t->copy_to_cpu(out_data /*data pointer*/);
// set LoD
std::vector<std::vector<size_t>> lod_data = {{0}, {0}};
input_t->SetLoD(lod_data);
// get Tensor data pointer
float *input_d = input_t->mutable_data<float>(PaddlePlace::kGPU);  // use PaddlePlace::kCPU when running inference on CPU
PaddlePlace place;
int output_size;
float *output_d = output_t->data<float>(&place, &output_size);
```
## <a name="C++ inference sample"> C++ inference sample</a>
1. Download or compile C++ Inference Library, refer to [Install and Compile C++ Inference Library](./build_and_install_lib_en.html).
2. Download [C++ inference sample](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz) and uncompress it , then enter `sample/inference` directory.
`inference` directory structure is as following:
``` shell
├── CMakeLists.txt
├── mobilenet_test.cc
├── thread_mobilenet_test.cc
├── mobilenetv1
│ ├── model
│ └── params
├── run.sh
└── run_impl.sh
- `mobilenet_test.cc` is the source code for single-thread inference.
- `thread_mobilenet_test.cc` is the source code for multi-thread inference.
- `mobilenetv1` is the model directory.
- `run.sh` is the script for running inference.
3. Configure script:
Before running, configure the script `run.sh` as follows:
``` shell
# set whether to enable MKL, GPU or TensorRT. Enabling TensorRT requires WITH_GPU being ON
# set path to CUDA lib dir, CUDNN lib dir, TensorRT root dir and model dir
Please configure `run.sh` depending on your environment.
4. Build and run the sample.
``` shell
sh run.sh
## <a name="Performance tuning"> Performance tuning</a>
### Tuning on CPU
1. If the CPU model allows, try to use the version with AVX and MKL.
2. You can try to use Intel's MKLDNN acceleration.
3. When the number of CPU cores available is enough, you can increase the num value in the setting `config->SetCpuMathLibraryNumThreads(num);`.
### Tuning on GPU
1. You can try enabling the TensorRT subgraph acceleration engine. Through graph analysis, Paddle can automatically fuse certain subgraphs and call NVIDIA's TensorRT to accelerate them. For details, please refer to [Use Paddle-TensorRT Library for inference](../../performance_improving/inference_improving/paddle_tensorrt_infer_en.html).
### Tuning with multi-thread
Paddle Fluid supports improving inference performance by running multiple AnalysisPredictors on different threads, in both CPU and GPU environments.
A multi-thread sample is `thread_mobilenet_test.cc`, included in the downloaded [sample](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz). Change `mobilenet_test` in `run.sh` to `thread_mobilenet_test` to run inference with multiple threads.
sh run.sh
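The one-predictor-per-thread pattern described above can be sketched in Python using only the standard library. This is a minimal illustration, not the sample's code: `StubPredictor` and `run_in_threads` are hypothetical names, and the stub simply scales its input in place of running a real model.

```python
import threading


class StubPredictor:
    """Hypothetical stand-in for an inference predictor; not a Paddle class."""

    def __init__(self, scale):
        self.scale = scale

    def run(self, batch):
        # Pretend inference: scale every input value.
        return [value * self.scale for value in batch]


def run_in_threads(batches, make_predictor):
    """Run each batch on its own thread, with one predictor per thread."""
    results = [None] * len(batches)

    def worker(index, batch):
        predictor = make_predictor()  # each thread owns its own predictor
        results[index] = predictor.run(batch)

    threads = [
        threading.Thread(target=worker, args=(i, batch))
        for i, batch in enumerate(batches)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


outputs = run_in_threads([[1, 2], [3, 4]], lambda: StubPredictor(2))
```

Each worker writes to its own slot of `results`, so no lock is needed in this sketch; a real predictor would replace the stub.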
# Performance Profiling for TensorRT Library
## Test Environment
- CPU: Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz; GPU: Tesla P4
- TensorRT 4.0, CUDA 8.0, cuDNN v7
- Test models: ResNet50, MobileNet, ResNet101, Inception V3.
## Test Targets
**PaddlePaddle, Pytorch, Tensorflow**
- In the test, PaddlePaddle integrates TensorRT through subgraph optimization. [Model](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models).
- PyTorch uses its native implementation. Models: [address 1](https://github.com/pytorch/vision/tree/master/torchvision/models), [address 2](https://github.com/marvis/pytorch-mobilenet).
- The TensorFlow test covers both native TF and TF-TRT. **The TF-TRT test has not yet reached expectations and will be supplemented later.** Model [address](https://github.com/tensorflow/models).
### ResNet50
|1|4.64117 |16.3|10.878|
|5|6.90622| 22.9 |20.62|
|10|7.9758 |40.6|34.36|
### MobileNet
|1| 1.7541 | 7.8 |2.72|
|5| 3.04666 | 7.8 |3.19|
|10|4.19478 | 14.47 |4.25|
### ResNet101
|1|8.95767| 22.48 |18.78|
|5|12.9811 | 33.88 |34.84|
|10|14.1463| 61.97 |57.94|
### Inception v3
|1|15.1613 | 24.2 |19.1|
|5|18.5373 | 34.8 |27.2|
|10|19.2781| 54.8 |36.7|
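# Python Inference API
## Data structures for Python inference
The Python inference API is similar to the C++ inference API. It mainly consists of `Tensor`, `DataType`, `Config`, and `Predictor`, which correspond to the types of the same names in the C++ API.
### DataType
class paddle.inference.DataType
* `INT64`: 64-bit integer
* `INT32`: 32-bit integer
* `FLOAT32`: 32-bit floating point
### PrecisionType
class paddle.inference.PrecisionType
* `Float32`: run in fp32 mode
* `Half`: run in fp16 mode
* `Int8`: run in int8 mode
### Tensor
class paddle.inference.Tensor
* `copy_from_cpu`: copy the input data the model needs from the CPU
* `copy_to_cpu`: get the model's output
* `lod`: get the LoD information
* `set_lod`: set the LoD information
* `shape`: get the shape
* `reshape`: set the shape
* `type`: get the DataType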
``` python
# Create the predictor
predictor = create_predictor(config)
# Get the input names and the first input handle
input_names = predictor.get_input_names()
input_tensor = predictor.get_input_handle(input_names[0])
# Set the input
fake_input = numpy.random.randn(1, 3, 318, 318).astype("float32")
input_tensor.copy_from_cpu(fake_input)
# Run the predictor
predictor.run()
# Get the output
output_names = predictor.get_output_names()
output_tensor = predictor.get_output_handle(output_names[0])
output_data = output_tensor.copy_to_cpu()  # numpy.ndarray
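### Config
class paddle.inference.Config
* `set_model`: set the model path
* `model_dir`: return the model directory path
* `prog_file`: return the model file path
* `params_file`: return the parameter file path
* `enable_use_gpu`: set the GPU memory size (in MB) and the device ID
* `disable_gpu`: disable the GPU
* `gpu_device_id`: return the ID of the GPU in use
* `switch_ir_optim`: IR optimization (enabled by default)
* `enable_tensorrt_engine`: enable TensorRT
* `enable_mkldnn`: enable MKLDNN
* `disable_glog_info`: disable glog logging during inference
* `delete_pass`: delete the specified pass during inference
#### Code examples
* When the model directory contains one model file and multiple parameter files, pass the model directory path; the model file name defaults to `__model__`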
``` python
config = Config("./model")
* When the model directory contains only one model file and one parameter file, pass the paths of the model file and the parameter file
``` python
config = Config("./model/model", "./model/params")
``` python
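config.enable_use_gpu(100, 0)  # Initialize 100 MB of GPU memory and use GPU 0
config.gpu_device_id()         # Return the ID of the GPU in use
config.disable_gpu()           # Disable the GPU
config.switch_ir_optim(True)   # Enable IR optimization
config.enable_tensorrt_engine(precision_mode=PrecisionType.Float32,
                              use_calib_mode=True)  # Enable TensorRT inference with fp32 precision and int8 offline quantization
config.enable_mkldnn()         # Enable MKLDNN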
### Predictor
class paddle.inference.Predictor
* `run()`: run the inference engine
* `get_input_names()`: get the input names
* `get_input_handle(input_name: str)`: get the `Tensor` for the given input name
* `get_output_names()`: get the output names
* `get_output_handle(output_name: str)`: get the `Tensor` for the given output name
#### Code example
``` python
# After the Config is set up, create the predictor
predictor = create_predictor(config)
# Get the input names and the first input handle
input_names = predictor.get_input_names()
input_handle = predictor.get_input_handle(input_names[0])
# Set the input
fake_input = numpy.random.randn(1, 3, 318, 318).astype("float32")
input_handle.reshape([1, 3, 318, 318])
input_handle.copy_from_cpu(fake_input)
# Run the predictor
predictor.run()
# Get the output
output_names = predictor.get_output_names()
output_handle = predictor.get_output_handle(output_names[0])
## Complete example
The following is a complete example of running inference with the Paddle Inference Python API, using the resnet50 model.
``` bash
python resnet50_infer.py --model_file ./model/model --params_file ./model/params --batch_size 2
The content of `resnet50_infer.py` is:
``` python
import argparse
import numpy as np
from paddle.inference import Config
from paddle.inference import create_predictor


def main():
    args = parse_args()
    # Build the Config
    config = set_config(args)
    # Create the predictor
    predictor = create_predictor(config)
    # Get the input names and the first input handle
    input_names = predictor.get_input_names()
    input_handle = predictor.get_input_handle(input_names[0])
    # Set the input
    fake_input = np.random.randn(1, 3, 318, 318).astype("float32")
    input_handle.reshape([1, 3, 318, 318])
    input_handle.copy_from_cpu(fake_input)
    # Run the predictor
    predictor.run()
    # Get the output
    output_names = predictor.get_output_names()
    output_handle = predictor.get_output_handle(output_names[0])
    output_data = output_handle.copy_to_cpu()  # numpy.ndarray


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_file", type=str, help="model filename")
    parser.add_argument("--params_file", type=str, help="parameter filename")
    parser.add_argument("--batch_size", type=int, default=1, help="batch size")
    return parser.parse_args()


def set_config(args):
    config = Config(args.model_file, args.params_file)
    return config


if __name__ == "__main__":
    main()
## Supported methods
* Tensor
* `copy_from_cpu(input: numpy.ndarray) -> None`
* `copy_to_cpu() -> numpy.ndarray`
* `reshape(input: numpy.ndarray|List[int]) -> None`
* `shape() -> List[int]`
* `set_lod(input: numpy.ndarray|List[List[int]]) -> None`
* `lod() -> List[List[int]]`
* `type() -> PaddleDType`
* Config
* `set_model(model_dir: str) -> None`
* `set_model(prog_file: str, params_file: str) -> None`
* `set_model_buffer(model: str, model_size: int, param: str, param_size: int) -> None`
* `model_dir() -> str`
* `prog_file() -> str`
* `params_file() -> str`
* `model_from_memory() -> bool`
* `set_cpu_math_library_num_threads(num: int) -> None`
* `enable_use_gpu(memory_pool_init_size_mb: int, device_id: int) -> None`
* `use_gpu() -> bool`
* `gpu_device_id() -> int`
* `switch_ir_optim(x: bool = True) -> None`
* `switch_ir_debug(x: int=True) -> None`
* `ir_optim() -> bool`
* `enable_tensorrt_engine(workspace_size: int = 1 << 20,
max_batch_size: int,
min_subgraph_size: int,
precision_mode: AnalysisConfig.precision,
use_static: bool,
use_calib_mode: bool) -> None`
* `set_trt_dynamic_shape_info(min_input_shape: Dict[str, List[int]]={}, max_input_shape: Dict[str, List[int]]={}, optim_input_shape: Dict[str, List[int]]={}, disable_trt_plugin_fp16: bool=False) -> None`
* `tensorrt_engine_enabled() -> bool`
* `enable_mkldnn() -> None`
* `enable_mkldnn_bfloat16() -> None`
* `mkldnn_enabled() -> bool`
* `set_mkldnn_cache_capacity(capacity: int=0) -> None`
* `set_mkldnn_op(ops: Set[str]) -> None`
* `set_optim_cache_dir(dir: str) -> None`
* `disable_glog_info() -> None`
* `pass_builder() -> paddle::PassStrategy`
* `delete_pass(pass_name: str) -> None`
* `cpu_math_library_num_threads() -> int`
* `disable_gpu() -> None`
* `enable_lite_engine(precision: PrecisionType, zero_copy: bool, passes_filter: List[str]=[], ops_filter: List[str]=[]) -> None`
* `lite_engine_enabled() -> bool`
* `enable_memory_optim() -> None`
* `enable_profile() -> None`
* `enable_quantizer() -> None`
* `quantizer_config() -> paddle::MkldnnQuantizerConfig`
* `fraction_of_gpu_memory_for_pool() -> float`
* `memory_pool_init_size_mb() -> int`
* `glog_info_disabled() -> bool`
* `gpu_device_id() -> int`
* `specify_input_name() -> bool`
* `switch_specify_input_names(x: bool=True) -> None`
* `specify_input_name(q) -> bool`
* `switch_use_feed_fetch_ops(x: int=True) -> None`
* `use_feed_fetch_ops_enabled() -> bool`
* `to_native_config() -> paddle.fluid.core_avx.NativeConfig`
* `create_predictor(config: Config) -> Predictor`
* Predictor
* `run() -> None`
* `get_input_names() -> List[str]`
* `get_input_handle(input_name: str) -> Tensor`
* `get_output_names() -> List[str]`
* `get_output_handle(output_name: str) -> Tensor`
* `clear_intermediate_tensor() -> None`
* `clone() -> Predictor`
* PredictorPool
* `retrive(idx: int) -> Predictor`
安装与编译 Windows 预测库
| 版本说明 | 预测库(1.8.3版本) | 编译器 | 构建工具 | cuDNN | CUDA |
| cpu_avx_mkl | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/mkl/cpu/fluid_inference_install_dir.zip) | MSVC 2015 update 3| CMake v3.16.0 |
| cpu_avx_openblas | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/open/cpu/fluid_inference_install_dir.zip) | MSVC 2015 update 3| CMake v3.16.0 |
| cuda9.0_cudnn7_avx_mkl | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/mkl/post97/fluid_inference_install_dir.zip) | MSVC 2015 update 3 | CMake v3.16.0 | 7.3.1 | 9.0 |
| cuda9.0_cudnn7_avx_openblas | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/open/post97/fluid_inference_install_dir.zip) | MSVC 2015 update 3 | CMake v3.16.0 | 7.3.1 | 9.0 |
| cuda10.0_cudnn7_avx_mkl | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/mkl/post107/fluid_inference_install_dir.zip) | MSVC 2015 update 3 | CMake v3.16.0 | 7.4.1 | 10.0 |
### 硬件环境
| CPU | I7-8700K |
| 内存 | 16G |
| 硬盘 | 1T hdd + 256G ssd |
| 显卡 | GTX1080 8G |
测试环境操作系统使用 win10 家庭版本
用户也可以从 PaddlePaddle 核心代码编译C++预测库,只需在编译时配制下面这些编译选项:
|选项 |说明 | 值 |
|CMAKE_BUILD_TYPE | 配置生成器上的构建类型,windows预测库目前只支持Release | Release |
|ON_INFER | 是否生成预测库,编译预测库时必须设置为ON | ON |
|WITH_GPU | 是否支持GPU | ON/OFF |
|WITH_MKL | 是否使用Intel MKL(数学核心库) | ON/OFF |
|WITH_PYTHON | 是否内嵌PYTHON解释器 | OFF(推荐) |
|MSVC_STATIC_CRT|是否使用/MT 模式进行编译,Windows默认使用 /MT 模式进行编译 |ON/OFF|
1. 将PaddlePaddle的源码clone在当下目录的Paddle文件夹中,并进入Paddle目录:
git clone https://github.com/PaddlePaddle/Paddle.git
cd Paddle
2. 执行cmake:
- 编译CPU预测
# 创建并进入build目录
mkdir build
cd build
cmake .. -G "Visual Studio 14 2015" -A x64 -T host=x64 -DCMAKE_BUILD_TYPE=Release -DWITH_MKL=OFF -DWITH_GPU=OFF -DON_INFER=ON -DWITH_PYTHON=OFF
# Windows默认使用 /MT 模式进行编译,如果想使用 /MD 模式,请使用以下命令。如不清楚两者的区别,请使用上面的命令
- 编译GPU预测库:
3. 使用Blend for Visual Studio 2015 打开 `paddle.sln` 文件,选择平台为`x64`,配置为`Release`,编译inference_lib_dist项目。
操作方法:在Visual Studio中选择相应模块,右键选择"生成"(或者"build")
version.txt 中记录了该预测库的版本信息,包括Git Commit ID、使用OpenBlas或MKL数学库、CUDA/CUDNN版本号,如:
GIT COMMIT ID: cc9028b90ef50a825a722c55e5fda4b7cd26b0d6
CUDA version: 8.0
CUDNN version: v7
### 硬件环境
| CPU | I7-8700K |
| 内存 | 16G |
| 硬盘 | 1T hdd + 256G ssd |
| 显卡 | GTX1080 8G |
测试环境操作系统使用 win10 家庭版本。
### 软件要求
**安装Visual Studio 2015 update3**
安装Visual Studio 2015,安装选项中选择安装内容时勾选自定义,选择安装全部关于c,c++,vc++的功能。
### 其他要求
1. 你需要直接下载Windows预测库或者从Paddle源码编译预测库,确保windows预测库存在。
2. 你需要下载Paddle源码,确保demo文件和脚本文件存在:
git clone https://github.com/PaddlePaddle/Paddle.git
### 编译demo
#### 使用脚本编译运行
# path为下载Paddle的目录
cd path\Paddle\paddle\fluid\inference\api\demo_ci
其中,run_windows_demo.bat 的部分选项如下:
gpu_inference=Y #是否使用GPU预测库,默认使用CPU预测库
use_mkl=Y #该预测库是否使用MKL,默认为Y
use_gpu=Y #是否使用GPU进行预测,默认为N。使用GPU预测需要下载GPU版本预测库
paddle_inference_lib=path\fluid_inference_install_dir #设置paddle预测库的路径
cuda_lib_dir=path\lib\x64 #设置cuda库的路径
vcvarsall_dir=path\vc\vcvarsall.bat #设置visual studio #本机工具命令提示符路径
#### 手动编译运行
1. 进入demo_ci目录,创建并进入build目录
# path为下载Paddle的目录
cd path\Paddle\paddle\fluid\inference\api\demo_ci
mkdir build
cd build
2. 执行cmake(cmake可以在[官网进行下载](https://cmake.org/download/),并添加到环境变量中):
- 使用CPU预测库编译demo
# -DDEMO_NAME 是要编译的文件
# -DDPADDLE_LIB是预测库目录,例如-DPADDLE_LIB=D:\fluid_inference_install_dir
cmake .. -G "Visual Studio 14 2015" -A x64 -T host=x64 -DWITH_GPU=OFF -DWITH_MKL=ON -DWITH_STATIC_LIB=ON ^
-DCMAKE_BUILD_TYPE=Release -DDEMO_NAME=simple_on_word2vec -DPADDLE_LIB=path_to_the_paddle_lib -DMSVC_STATIC_CRT=ON
- 使用GPU预测库编译demo
# -DCUDA_LIB CUDA的库目录,例如-DCUDA_LIB=D:\cuda\lib\x64
cmake .. -G "Visual Studio 14 2015" -A x64 -T host=x64 -DWITH_GPU=ON -DWITH_MKL=ON -DWITH_STATIC_LIB=ON ^
3. 使用Blend for Visual Studio 2015 打开 `cpp_inference_demo.sln` 文件,选择平台为`x64`,配置为`Release`,编译simple_on_word2vec项目。
操作方法: 在Visual Studio中选择相应模块,右键选择"生成"(或者"build")
4. [下载模型](http://paddle-inference-dist.bj.bcebos.com/word2vec.inference.model.tar.gz)并解压到当前目录,执行命令:
# 开启GLOG
set GLOG_v=100
# 进行预测,path为模型解压后的目录
Release\simple_on_word2vec.exe --dirname=path\word2vec.inference.model
### 实现一个简单预测demo
1. 创建AnalysisConfig
AnalysisConfig config;
config->SwitchUseFeedFetchOps(false); // 关闭feed和fetch OP使用,使用ZeroCopy接口必须设置此项
// config->EnableUseGpu(100 /*设定GPU初始显存池为MB*/, 0 /*设定GPU ID为0*/); //开启GPU预测
2. 在config中设置模型和参数路径
- 非combined形式:模型文件夹`model_dir`下存在一个模型文件和多个参数文件时,传入模型文件夹路径,模型文件名默认为`__model__`
``` c++
- combined形式:模型文件夹`model_dir`下只有一个模型文件`__model__`和一个参数文件`__params__`时,传入模型文件和参数文件路径。
config->SetModel("path\\model_dir\\__model__", "path\\model_dir\\__params__");
3. 创建predictor,准备输入数据
std::unique_ptr<PaddlePredictor> predictor = CreatePaddlePredictor(config);
int batch_size = 1;
int channels = 3; // channels,height,width三个参数必须与模型中对应输入的shape一致
int height = 300;
int width = 300;
int nums = batch_size * channels * height * width;
float* input = new float[nums];
for (int i = 0; i < nums; ++i) input[i] = 0;
4. 使用ZeroCopyTensor管理输入
// 通过创建的AnalysisPredictor获取输入Tensor,该Tensor为ZeroCopyTensor
auto input_names = predictor->GetInputNames();
auto input_t = predictor->GetInputTensor(input_names[0]);
// 对Tensor进行reshape,将准备好的输入数据从CPU拷贝到ZeroCopyTensor中
input_t->Reshape({batch_size, channels, height, width});
5. 运行预测引擎
6. 使用ZeroCopyTensor管理输出
auto output_names = predictor->GetOutputNames();
auto output_t = predictor->GetOutputTensor(output_names[0]);
std::vector<int> output_shape = output_t->shape();
int out_num = std::accumulate(output_shape.begin(), output_shape.end(), 1,
output_t->copy_to_cpu(out_data.data()); // 将ZeroCopyTensor中数据拷贝到cpu中,得到输出数据
delete[] input;
**Note:** 关于AnalysisPredictor的更多介绍,请参考[C++预测API介绍](./native_infer.html)
Install and Compile C++ Inference Library on Windows
Direct Download and Install
| Version | Inference Libraries(v1.8.3) | Compiler | Build tools | cuDNN | CUDA |
| cpu_avx_mkl | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/mkl/cpu/fluid_inference_install_dir.zip) | MSVC 2015 update 3| CMake v3.16.0 |
| cpu_avx_openblas | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/open/cpu/fluid_inference_install_dir.zip) | MSVC 2015 update 3| CMake v3.16.0 |
| cuda9.0_cudnn7_avx_mkl | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/mkl/post97/fluid_inference_install_dir.zip) | MSVC 2015 update 3 | CMake v3.16.0 | 7.3.1 | 9.0 |
| cuda9.0_cudnn7_avx_openblas | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/open/post97/fluid_inference_install_dir.zip) | MSVC 2015 update 3 | CMake v3.16.0 | 7.3.1 | 9.0 |
| cuda10.0_cudnn7_avx_mkl | [fluid_inference.zip](https://paddle-wheel.bj.bcebos.com/1.8.3/win-infer/mkl/post107/fluid_inference_install_dir.zip) | MSVC 2015 update 3 | CMake v3.16.0 | 7.4.1 | 10.0 |
### Hardware Environment
Hardware Configuration of the experimental environment:
| CPU | I7-8700K |
| Memory | 16G |
| Hard Disk | 1T hdd + 256G ssd |
| Graphics Card | GTX1080 8G |
The experimental environment runs Windows 10 Home edition.
Build From Source Code
Users can also compile C++ inference libraries from the PaddlePaddle core code by specifying the following compile options at compile time:
|Option | Description | Value |
|CMAKE_BUILD_TYPE|Specifies the build type on single-configuration generators, Windows inference library currently only supports Release| Release |
|ON_INFER|Whether to generate the inference library. Must be set to ON when compiling the inference library. | ON |
|WITH_GPU|Whether to support GPU | ON/OFF |
|WITH_MKL|Whether to support MKL | ON/OFF |
|WITH_PYTHON|Whether the PYTHON interpreter is embedded | OFF |
|MSVC_STATIC_CRT|Whether to compile with /MT mode | ON |
|CUDA_TOOLKIT_ROOT_DIR | When compiling the GPU inference library, you need to set the CUDA root directory | YOUR_CUDA_PATH |
For details on the compilation options, see [the compilation options list](../../../beginners_guide/install/Tables_en.html/#Compile)
**Paddle Windows Inference Library Compilation Steps**
1. Clone Paddle source code from GitHub:
git clone https://github.com/PaddlePaddle/Paddle.git
cd Paddle
2. Run Cmake command
- compile CPU inference library
# create build directory
mkdir build
# change to the build directory
cd build
cmake .. -G "Visual Studio 14 2015" -A x64 -T host=x64 -DCMAKE_BUILD_TYPE=Release -DWITH_MKL=OFF -DWITH_GPU=OFF -DON_INFER=ON -DWITH_PYTHON=OFF
# use -DWITH_MKL to select math library: Intel MKL or OpenBLAS
# By default on Windows, /MT is used for the C runtime library. If you want to use /MD, please use the command below
# If you are not sure about the difference between the two, use the command above
- compile GPU inference library
# -DCUDA_TOOLKIT_ROOT_DIR is the CUDA root directory, such as -DCUDA_TOOLKIT_ROOT_DIR="D:\\cuda"
3. Open `paddle.sln` with Visual Studio 2015, choose `x64` for Solution Platforms and `Release` for Solution Configurations, then build the `inference_lib_dist` project in the Solution Explorer (right-click the project and click Build).
The inference library will be installed in `fluid_inference_install_dir`.
version.txt contains the detailed configuration of the library, including the Git commit ID, the math library, and the CUDA and CUDNN versions:
GIT COMMIT ID: cc9028b90ef50a825a722c55e5fda4b7cd26b0d6
CUDA version: 8.0
CUDNN version: v7
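Since `version.txt` is a list of `key: value` lines, it can be read programmatically, for example to check the CUDA version before linking. The following is a minimal Python sketch, assuming every line follows the `key: value` form shown above; `parse_version_txt` is a hypothetical helper, not part of Paddle.

```python
def parse_version_txt(text):
    """Parse `key: value` lines into a dict; lines without a colon are skipped."""
    info = {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        if separator:
            info[key.strip()] = value.strip()
    return info


# Sample content copied from the version.txt excerpt above.
sample = """GIT COMMIT ID: cc9028b90ef50a825a722c55e5fda4b7cd26b0d6
CUDA version: 8.0
CUDNN version: v7"""

info = parse_version_txt(sample)
```

In practice the text would be read from the `version.txt` file inside the installed inference library directory.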
Inference Demo Compilation
### Hardware Environment
Hardware Configuration of the experimental environment:
| CPU | I7-8700K |
| Memory | 16G |
| Hard Disk | 1T hdd + 256G ssd |
| Graphics Card | GTX1080 8G |
The experimental environment runs Windows 10 Home edition.
### Steps to Configure Environment
**Please strictly follow the subsequent steps to install, otherwise the installation may fail**
**Install Visual Studio 2015 update3**
Install Visual Studio 2015. Please choose "customize" for the options of contents to be installed and choose to install all functions relevant to c, c++ and vc++.
### Other requirements
1. You need to download the Windows inference library or compile the inference library from Paddle source code.
2. You need to run the command to get the Paddle source code.
git clone https://github.com/PaddlePaddle/Paddle.git
### Usage of Inference demo
#### Compile with script
Open the windows command line and run the `run_windows_demo.bat`, and input parameters as required according to the prompts.
# Path is the directory of Paddle you downloaded.
cd path\Paddle\paddle\fluid\inference\api\demo_ci
Some options of the script are as follows:
gpu_inference=Y # Use gpu_inference_lib or not(Y/N), default: N.
use_mkl=Y # Use MKL or not(Y/N), default: Y.
use_gpu=Y # Whether to use GPU for prediction, default: N.
paddle_inference_lib=path\fluid_inference_install_dir # Set the path of paddle inference library.
cuda_lib_dir=path\lib\x64 # Set the path of cuda library.
vcvarsall_dir=path\vc\vcvarsall.bat # Set the path of visual studio command prompt.
#### Compile manually
1. Create and change to the build directory
# path is the directory where Paddle is downloaded
cd path\Paddle\paddle\fluid\inference\api\demo_ci
mkdir build
cd build
2. Run Cmake command, cmake can be [downloaded at official site](https://cmake.org/download/) and added to environment variables.
- compile inference demo with CPU inference library
# Path is the directory where you downloaded paddle.
# -DDEMO_NAME is the file to be built
# -DPADDLE_LIB is the path of fluid_install_dir, for example: -DPADDLE_LIB=D:\fluid_install_dir
cmake .. -G "Visual Studio 14 2015" -A x64 -T host=x64 -DWITH_GPU=OFF -DWITH_MKL=OFF -DWITH_STATIC_LIB=ON -DCMAKE_BUILD_TYPE=Release -DDEMO_NAME=simple_on_word2vec -DPADDLE_LIB=path_to_the_paddle_lib -DMSVC_STATIC_CRT=ON
- compile inference demo with GPU inference library
cmake .. -G "Visual Studio 14 2015" -A x64 -T host=x64 -DWITH_GPU=ON -DWITH_MKL=ON -DWITH_STATIC_LIB=ON ^
3. Open `cpp_inference_demo.sln` with Visual Studio 2015, choose `x64` for Solution Platforms and `Release` for Solution Configurations, then build the `simple_on_word2vec` project in the Solution Explorer (right-click the project and click Build).
From the dependent packages provided, copy the openblas and model files under the Release directory into the Release directory generated by the build.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/FluidDoc/develop/doc/fluid/advanced_usage/deploy/inference/image/image8.png">
4. [Download model](http://paddle-inference-dist.bj.bcebos.com/word2vec.inference.model.tar.gz) and decompress it to the current directory. Run the command:
# Open GLOG
set GLOG_v=100
# Start inference; path is the directory where you decompressed the model
Release\simple_on_word2vec.exe --dirname=path\word2vec.inference.model
### Implementing a simple inference demo
[Complete code example](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/api/demo_ci/windows_mobilenet.cc)
This example uses AnalysisConfig to manage the AnalysisPredictor inference configuration. The configuration method is as follows:
1. Create AnalysisConfig
``` c++
AnalysisConfig config;
config.SwitchUseFeedFetchOps(false);  // Turn off the feed and fetch ops; this must be set when using the ZeroCopy interface.
// config.EnableUseGpu(100 /*Set the GPU initial memory pool to 100MB*/, 0 /*Set GPU ID to 0*/); // Turn on GPU prediction
2. Set path of models and parameters
- When the model folder `model_dir` contains one model file and multiple parameter files, pass the model folder path; the model file name defaults to `__model__`.
``` c++
config.SetModel("path\\model_dir");
- When the model folder `model_dir` contains only one model file `__model__` and one parameter file `__params__`, pass the paths of the model file and the parameter file.
config.SetModel("path\\model_dir\\__model__", "path\\model_dir\\__params__");
3. Create predictor and prepare input data
``` C++
std::unique_ptr<PaddlePredictor> predictor = CreatePaddlePredictor(config);
int batch_size = 1;
int channels = 3; // The parameters of channels, height, and width must be the same as those required by the input in the model.
int height = 300;
int width = 300;
int nums = batch_size * channels * height * width;
float* input = new float[nums];
for (int i = 0; i < nums; ++i) input[i] = 0;
4. Manage input with ZeroCopyTensor
auto input_names = predictor->GetInputNames();
auto input_t = predictor->GetInputTensor(input_names[0]);
// Reshape the input tensor, then copy the prepared input data from the CPU to the ZeroCopyTensor
input_t->Reshape({batch_size, channels, height, width});
input_t->copy_from_cpu(input);
5. Run prediction engine
6. Manage output with ZeroCopyTensor
auto output_names = predictor->GetOutputNames();
auto output_t = predictor->GetOutputTensor(output_names[0]);
std::vector<int> output_shape = output_t->shape();
int out_num = std::accumulate(output_shape.begin(), output_shape.end(), 1,
                              std::multiplies<int>());
std::vector<float> out_data(out_num);
output_t->copy_to_cpu(out_data.data()); // Copy data from ZeroCopyTensor to CPU
delete[] input;
**Note:** For more introduction to AnalysisPredictor, please refer to the [introduction of C++ Prediction API](./native_infer_en.html).
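The buffer sizes used in steps 3 and 6 are both the product of a tensor's dimensions. The following minimal Python sketch shows the arithmetic for the `1 x 3 x 300 x 300` input above; `num_elements` is a hypothetical helper, and the 4-byte element size assumes float32 data as in the C++ demo.

```python
from functools import reduce
from operator import mul


def num_elements(shape):
    """Return the number of elements in a tensor with the given shape."""
    return reduce(mul, shape, 1)


# batch_size, channels, height, width, matching the demo's input.
input_shape = [1, 3, 300, 300]
input_count = num_elements(input_shape)

# Assumption: float32 elements, 4 bytes each.
input_bytes = input_count * 4
```

The same product applied to the output shape gives `out_num`, which is what `std::accumulate` with an initial value of 1 computes in step 6.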
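# Developer documentation
## Basic concepts
### Place
* `TargetType target`: the platform the kernel runs on, such as X86, CUDA, or ARM;
* `PrecisionType precision`: the precision of the data the kernel operates on, such as Float, Int8, or Fp16;
* `DataLayoutType layout`: the layout of the data the kernel operates on, such as NCHW or NHWC;
### OpLite
* `CheckShape`: checks whether the dimensions and types of the op's input/output parameters are valid and whether the attributes match the design;
* `InferShape`: sets the shape of the output Tensor;
* `CreateKernels`: creates the related kernels;
* `Attach`: gets parameter pointers from the `Scope` and `OpDesc` and passes them to the kernel;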
class OpLite : public Registry {
OpLite() = default;
explicit OpLite(const std::string &type) : op_type_(type) {}
explicit OpLite(const std::vector<Place> &valid_places)
: valid_places_(valid_places) {}
void SetValidPlaces(const std::vector<Place> &places) {
VLOG(3) << "valid places " << valid_places_.size();
valid_places_ = places;
// Set supported places
const std::vector<Place> &valid_places() const { return valid_places_; }
// Check the shape.
virtual bool CheckShape() const { return true; }
// Inference the outputs' shape.
virtual bool InferShape() const { return true; }
// Run this operator.
virtual bool Run();
// Link the external execution environ to internal context.
bool Attach(const cpp::OpDesc &opdesc, lite::Scope *scope);
// Create all the kernels for the valid targets.
std::vector<std::unique_ptr<KernelBase>> CreateKernels(
const std::vector<Place> &places, const std::string &kernel_type = "");
// Assign op param to kernel.
virtual void AttachKernel(KernelBase *kernel) = 0;
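### KernelLite
To improve kernel support for multiple execution modes such as `Target`, `Precision`, and `DataLayout`, the concept of `KernelLite` is introduced. Its main characteristics are:
* Kernel implementations can be specialized by template for different `Place` values, which strengthens support for different execution modes;
* Lightweight: `KernelLite` is similar to a functor and is only responsible for execution, so it runs more efficiently;
* Each kernel has an explicit execution mode and can take part in analysis at analysis time;
* Simple dependencies, which makes deployment to mobile easy;
* Hardware scheduling information and other `context` is bound to the specific kernel, which makes it easy to customize the behavior of different kernels.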
template <TargetType Target, PrecisionType Precision,
DataLayoutType DataLayout = DataLayoutType::kNCHW>
class KernelLite : public KernelBase {
// Run the kernel.
virtual void Run() { CHECK(false) << "Not Implemented"; }
// Set target
TargetType target() const override { return Target; }
// Set precision
PrecisionType precision() const override { return Precision; }
// Set data layout
DataLayoutType layout() const override { return DataLayout; }
Place place() const override { return Place{Target, Precision, DataLayout}; }
void Touch() {}
KernelLite() = default;
virtual ~KernelLite() = default;
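## Architecture overview
In this upgrade, Mobile moves to the lite architecture, which focuses on multi-hardware and high-performance support. The main design ideas are:
- Introduce a type system to strengthen mixed scheduling across hardware, quantization methods, and data layouts
- Isolate hardware details, so that any supported hardware can be plugged in or removed through compile switches
- Introduce the concept of MIR (Machine IR) to strengthen optimization that is aware of the execution environment
- Strictly separate the optimization phase from the execution phase to keep inference lightweight and efficient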
![Paddle Inference Refactor1.0](https://github.com/Superjomn/_tmp_images/raw/master/images/lite.jpg)
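## How to add a new kernel
- Kernel implementation: define and implement the op's Compute class, which inherits from `KernelLite`. Different kernels are implemented depending on the input data type, the data layout, the device that holds the data, and the third-party library called at run time. Server-side CPU kernels are implemented in .h files.
- Kernel registration: server-side CPU kernels are registered in .cc files.
## Implementing the C++ class
Take the CPU kernel of the mul op as an example. The mul kernel computes the matrix product *Out* = *X* * *Y*, so the computation has two inputs and one output. The input and output parameters are obtained from the op's param. The param of the mul op is defined as follows: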
struct MulParam {
const lite::Tensor* x{};
const lite::Tensor* y{};
lite::Tensor* output{};
int x_num_col_dims{1};
int y_num_col_dims{1};
template <typename T>
class MulCompute : public KernelLite<TARGET(kX86), PRECISION(kFloat)> {
using param_t = operators::MulParam;
void Run() override {
auto& context = ctx_->As<X86Context>();
auto& param = *param_.get_mutable<operators::MulParam>();
// 1. Allocate memory for the output
param.output->template mutable_data<T>();
// 2. Get the inputs and output used in the computation
auto* x = &param.x->raw_tensor();
auto* y = &param.y->raw_tensor();
auto* z = &param.output->raw_tensor();
// 3. Apply any required processing to the input and output data...
Tensor x_matrix, y_matrix;
if (x->dims().size() > 2) {
x_matrix = framework::ReshapeToMatrix(*x, param.x_num_col_dims);
} else {
x_matrix = *x;
// 4. Call the math library to perform the matrix computation...
auto blas = paddle::operators::math::GetBlas<platform::CPUDeviceContext, T>(
blas.MatMul(x_matrix, y_matrix, z);
virtual ~MulCompute() = default;
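The `MulCompute` class inherits from `KernelLite` and takes the following two template parameters:
- `TARGET(kX86)`: `Target` describes the hardware, such as CUDA/X86/ARM/…, and denotes the platform the kernel runs on. This example uses kX86, meaning the mul kernel runs on the X86 platform;
- `PRECISION(kFloat)`: `Precision` describes the data precision the kernel supports. This example uses `kFloat`, meaning the mul kernel supports Float data;
`MulCompute` needs to override the `Run` interface. The kernel's inputs and outputs are obtained through `MulParam`, and the input/output variables are of type `lite::Tensor`.
This completes the forward mul kernel. Next, the kernel needs to be registered in a .cc file.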
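The role of `x_num_col_dims` and `y_num_col_dims` in the kernel above is to decide how a tensor with more than two dimensions is flattened into a matrix before the multiplication. The following Python sketch illustrates that shape arithmetic only. It assumes the leading `num_col_dims` dimensions become the rows and the rest become the columns; `flatten_to_2d` and `matmul_2d_shape` are hypothetical helpers, not Paddle-Lite functions.

```python
from math import prod


def flatten_to_2d(shape, num_col_dims):
    """Collapse a shape into (rows, cols), splitting at num_col_dims."""
    if not 0 < num_col_dims < len(shape):
        raise ValueError("num_col_dims must split the shape into two non-empty parts")
    return (prod(shape[:num_col_dims]), prod(shape[num_col_dims:]))


def matmul_2d_shape(x_shape, y_shape, x_num_col_dims=1, y_num_col_dims=1):
    """Return the shape of the 2-D product of the flattened X and Y."""
    x_rows, x_cols = flatten_to_2d(x_shape, x_num_col_dims)
    y_rows, y_cols = flatten_to_2d(y_shape, y_num_col_dims)
    if x_cols != y_rows:
        raise ValueError("inner dimensions do not match")
    return (x_rows, y_cols)


x_matrix_shape = flatten_to_2d([2, 3, 4, 5], 2)
product_shape = matmul_2d_shape([2, 3, 4], [12, 7])
```

With the default value of 1, a `[2, 3, 4]` input is treated as a `2 x 12` matrix, which is why it can be multiplied by a `12 x 7` weight.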
## Registering the kernel
paddle::lite::kernels::x86::MulCompute<float>, def)
.BindInput("X", {LiteType::GetTensorTy(TARGET(kX86))})
.BindInput("Y", {LiteType::GetTensorTy(TARGET(kX86))})
.BindOutput("Out", {LiteType::GetTensorTy(TARGET(kX86))})
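- `REGISTER_LITE_KERNEL`: registers the MulCompute class with the template parameter specialized to float. The type name is mul, the platform is X86, the data precision is float, and the data layout is NCHW;
- At run time, the framework statically selects a suitable kernel based on the device that holds the input data, the input data type, the data layout, and similar information.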
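The idea of registering kernels under a `Place` and selecting one by target, precision, and layout can be sketched as a small registry. This is a simplified Python illustration, not the Paddle-Lite implementation: `Place`, `register_kernel`, and `pick_kernel` here are hypothetical, and the selection rule (first valid place that has a registered kernel) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    """Hypothetical mirror of the (target, precision, layout) triple."""
    target: str
    precision: str
    layout: str = "NCHW"


KERNELS = {}


def register_kernel(op_type, place, fn):
    """Register fn as the kernel of op_type for the given place."""
    KERNELS[(op_type, place)] = fn


def pick_kernel(op_type, valid_places):
    """Return the first (place, kernel) pair registered for op_type."""
    for place in valid_places:
        kernel = KERNELS.get((op_type, place))
        if kernel is not None:
            return place, kernel
    raise LookupError(f"no kernel registered for {op_type}")


# Register a toy "mul" kernel for X86 / float / NCHW.
register_kernel("mul", Place("x86", "float"), lambda x, y: x * y)

# CUDA is preferred but has no kernel, so the X86 kernel is selected.
place, kernel = pick_kernel("mul", [Place("cuda", "float"), Place("x86", "float")])
```

Because `Place` is a frozen dataclass, it is hashable and can be used directly as part of the registry key.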
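## Development environment
### Mobile development and testing
Run `docker build --file paddle/fluid/lite/tools/Dockerfile.mobile --tag paddle-lite-mobile:latest .` to build the image.
- Cross-compilation environment for Android
- Cross-compilation environment for ARM Linux
- Android emulator environment
- Format checking tools needed for development
#### Related cmake options
- `ARM_TARGET_OS`: the target operating system. "android" and "armlinux" are currently supported; the default is Android
- `ARM_TARGET_ARCH_ABI`: the target architecture. "armv8" and "armv7" are supported; the available choices depend on the OS.
- `-DARM_TARGET_OS="android"`
- "armv8", equivalent to "arm64-v8a". This is the default.
- "armv7", equivalent to "armeabi-v7a".
- `-DARM_TARGET_OS="armlinux"`
- "armv8", equivalent to "arm64". This is the default.
- "armv7hf", equivalent to using `eabihf` with `-march=armv7-a -mfloat-abi=hard -mfpu=neon-vfpv4`
- "armv7", equivalent to using `eabi` with `-march=armv7-a -mfloat-abi=softfp -mfpu=neon-vfpv4`
- `ARM_TARGET_LANG`: the compiler used for the target. The default is gcc; gcc and clang are supported.
Note: ARM Linux currently only supports building and testing on armv8.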
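The mapping between `ARM_TARGET_OS`, `ARM_TARGET_ARCH_ABI`, and the equivalent ABI name can be summarized as a lookup table. The Python sketch below only restates the list above; `resolve_abi` is a hypothetical helper, and the compiler flags for the two armv7 ARM Linux variants are left out for brevity.

```python
# (target OS) -> (ARM_TARGET_ARCH_ABI value) -> equivalent ABI name
ABI_BY_TARGET = {
    "android": {"armv8": "arm64-v8a", "armv7": "armeabi-v7a"},
    "armlinux": {"armv8": "arm64", "armv7hf": "eabihf", "armv7": "eabi"},
}
DEFAULT_ARCH = "armv8"  # default for both operating systems


def resolve_abi(target_os="android", arch=DEFAULT_ARCH):
    """Return the ABI name for an OS/arch pair, or raise ValueError."""
    try:
        return ABI_BY_TARGET[target_os][arch]
    except KeyError:
        raise ValueError(f"unsupported combination: {target_os}/{arch}") from None


android_default = resolve_abi()
armlinux_hf = resolve_abi("armlinux", "armv7hf")
```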
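#### Development
1. Add the math computation: add the corresponding math functions in `paddle/fluid/lite/arm/math`. The focus is on optimizing the code itself and making full use of NEON instructions.
2. Add the kernel declaration and invocation: add the kernel's framework declaration and invocation in `paddle/fluid/lite/kernels/arm`. The focus is on making each kernel match its input and output types strictly.
3. Add unit tests: add the corresponding unit tests in `paddle/fluid/lite/kernels/arm` and make sure they pass on the emulator or on a real device.
#### Testing
# Create the Android avd (armv8)
$ echo n | avdmanager create avd -f -n paddle-armv8 -k "system-images;android-24;google_apis;arm64-v8a"
# Start the Android armv8 emulator
$ ${ANDROID_HOME}/emulator/emulator -avd paddle-armv8 -noaudio -no-window -gpu off -verbose &
# Other normal test steps
# Shut down all emulators
$ adb devices | grep emulator | cut -f1 | while read line; do adb -s $line emu kill; done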
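* `Paddle Lite <mobile_index.html>`_: a brief introduction to the features and usage of Paddle-Lite.
# Paddle-Lite
The full documentation is available at [Paddle-Lite documentation](https://paddle-lite.readthedocs.io/zh/latest/)
## Features
### Lightweight
The dynamic library containing the complete set of 80 ops and 85 kernels is only 800 KB on ARMv7 and 1.3 MB on ARMv8, and can be trimmed further.
### High performance
ARM CPU performance is heavily optimized: kernels are customized for the characteristics of different microarchitectures to get the most out of the available compute, giving a speed advantage on mainstream models.
Quantized models are supported. Combined with the quantization features of the [PaddleSlim model compression toolkit](https://github.com/PaddlePaddle/models/tree/v1.5/PaddleSlim), this provides high-accuracy, high-performance inference.
It also performs well on Huawei NPU and FPGA.
The latest performance data is in the [Benchmark documentation](https://paddle-lite.readthedocs.io/zh/latest/benchmark/benchmark.html)
### Generality
On the hardware side, the Paddle-Lite architecture is designed for multi-hardware compatibility. Besides ARM CPU, Mali GPU, and Adreno GPU, it also supports Huawei NPU and FPGA, which are widely used in edge devices. Support for AI chips including Cambricon and Bitmain is coming soon, and more hardware will be supported in the future.
On the framework side, training frameworks other than PaddlePaddle are also supported. Currently, models trained with Caffe and TensorFlow are supported through the [X2Paddle](https://github.com/PaddlePaddle/X2Paddle) conversion tool. Support for ONNX and other model formats will follow.
## Architecture
The Paddle-Lite architecture focuses on support for multiple hardware and platforms. It strengthens the ability to mix several kinds of hardware within one model, applies performance optimization at multiple levels, and is designed to be lightweight for on-device applications.
The Analysis Phase includes the MIR (Machine IR) modules, which can apply a range of optimizations to the original model's computation graph for a specific hardware list, including operator fusion and computation pruning. The Execution Phase only involves kernel execution and can be deployed on its own to support extremely lightweight deployment.
## Notes on upgrading Paddle-Mobile to Paddle-Lite
The original Paddle-Mobile, a PaddlePaddle inference engine dedicated to embedded platforms, already supports many hardware platforms, including ARM CPU, Mali GPU, Adreno GPU, a GPU Metal implementation for Apple devices, FPGA boards such as ZU5 and ZU9, and arm-linux boards such as Raspberry Pi. It has been validated in a wide range of business scenarios inside Baidu. For the design document, see [mobile/README](https://github.com/PaddlePaddle/Paddle-Lite/blob/develop/mobile/README.md)
After Paddle-Mobile was upgraded, refactored, and renamed Paddle-Lite, most of its underlying capabilities were integrated into the [new architecture](https://github.com/PaddlePaddle/Paddle-Lite/tree/develop/lite). As a transition, the original Paddle-Mobile code is kept for now. The main code is in the `mobile/` directory; it will continue to be maintained for some time until the migration is complete. New features will be developed under the [new architecture](https://github.com/PaddlePaddle/Paddle-Lite/tree/develop/lite).
The metal and web modules are relatively independent and will continue to be developed and maintained in the `./metal` and `./web` directories. If you need the GPU Metal implementation for Apple devices or web front-end inference, go directly to these two directories.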
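## Acknowledgements
Paddle-Lite draws on the following open-source projects:
- [ARM compute library](https://github.com/ARM-software/ComputeLibrary)
- [Anakin](https://github.com/PaddlePaddle/Anakin): some of Anakin's low-level optimizations have been integrated into Paddle-Lite. Anakin is a high-performance inference project under the PaddlePaddle organization; it was forward-looking and contributed significantly to Paddle-Lite. Anakin has been merged into this project and will no longer be upgraded.
## Communication and feedback
* You are welcome to submit questions, reports and suggestions through Github Issues
* WeChat official account: 飞桨PaddlePaddle
* QQ group: 696965088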
<p align="center"><img width="200" height="200" src="https://user-images.githubusercontent.com/45189361/64117959-1969de80-cdc9-11e9-84f7-e1c2849a004c.jpeg"/>&#8194;&#8194;&#8194;&#8194;&#8194;<img width="200" height="200" margin="500" src="https://user-images.githubusercontent.com/45189361/64117844-cb54db00-cdc8-11e9-8c08-24bbe594608e.jpeg"/></p>
<p align="center"> &#8194;&#8194;&#8194;WeChat official account&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;&#8194;Official technical QQ group</p>
* Forum: you are welcome to share the problems and experience you have had with PaddlePaddle on the [PaddlePaddle forum](https://ai.baidu.com/forum/topic/list/168) and to help keep the forum a friendly place
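# Model Compression
## Features
- Model pruning
  - Uniform pruning of convolution channels
  - Sensitivity-based pruning of convolution channels
  - Automated pruning based on evolutionary algorithms
- Fixed-point quantization
  - Quantization during training (training aware)
  - Offline quantization (post training)
- Knowledge distillation
  - Single-process knowledge distillation
  - Multi-process distributed knowledge distillation
- Neural architecture search (NAS)
  - Lightweight neural architecture search based on evolutionary algorithms
  - One-Shot neural architecture search
  - FLOPS / hardware latency constraints
  - Model latency evaluation on multiple platforms
  - User-defined search algorithms and search spaces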
## Installation
Paddle >= 1.7.0
pip install paddleslim -i https://pypi.org/simple
## Usage
- [Quick start](https://paddlepaddle.github.io/PaddleSlim/quick_start/index.html): shows how to get started with PaddleSlim through simple examples.
- [Advanced tutorials](https://paddlepaddle.github.io/PaddleSlim/tutorials/index.html): advanced PaddleSlim tutorials.
- [Model zoo](https://paddlepaddle.github.io/PaddleSlim/model_zoo.html): experimental results of each compression strategy on image classification, object detection and semantic segmentation models, including model accuracy, inference speed and downloadable pretrained models.
- [API documentation](https://paddlepaddle.github.io/PaddleSlim/api_cn/index.html)
- [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection/tree/master/slim): how to use PaddleSlim in the detection library.
- [PaddleSeg](https://github.com/PaddlePaddle/PaddleSeg/tree/develop/slim): how to use PaddleSlim in the segmentation library.
- [PaddleLite](https://paddlepaddle.github.io/Paddle-Lite/): how to deploy models produced by PaddleSlim with the PaddleLite inference library.
## Results of selected compression strategies
### Classification model
Dataset: ImageNet2012; Model: MobileNetV1;

| Compression strategy | Accuracy gain (baseline: 70.91%) | Model size (baseline: 17.0M) |
| --- | --- | --- |
| Knowledge distillation (ResNet50) | **+1.06%** | - |
| Knowledge distillation (ResNet50) + int8 quantization training | **+1.10%** | **-71.76%** |
| Pruning (FLOPs -50%) + int8 quantization training | **-1.71%** | **-86.47%** |

### Object detection model
#### Dataset: Pascal VOC; Model: MobileNet-V1-YOLOv3
| Compression method | mAP (baseline: 76.2%) | Model size (baseline: 94MB) |
| :---------------------: | :------------: | :------------:|
| Knowledge distillation (ResNet34-YOLOv3) | **+2.8%** | - |
| Pruning (FLOPs -52.88%) | **+1.4%** | **-67.76%** |
| Knowledge distillation (ResNet34-YOLOv3) + pruning (FLOPs -69.57%) | **+2.6%** | **-67.00%** |

#### Dataset: COCO; Model: MobileNet-V1-YOLOv3
| Compression method | mAP (baseline: 29.3%) | Model size |
| :---------------------: | :------------: | :------:|
| Knowledge distillation (ResNet34-YOLOv3) | **+2.1%** | - |
| Knowledge distillation (ResNet34-YOLOv3) + pruning (FLOPs -67.56%) | **-0.3%** | **-66.90%** |

### Search
Dataset: ImageNet2012; Model: MobileNetV2

| Hardware | Inference time | Top1 accuracy (baseline: 71.90%) |
| --- | --- | --- |
| RK3288 | **-23%** | +0.07% |
| Android cellphone | **-20%** | +0.16% |
| iPhone 6s | **-17%** | +0.32% |

Model Compression
PaddleSlim is a toolkit for model compression. It contains a collection of compression strategies, such as pruning, fixed point quantization, knowledge distillation, hyperparameter searching and neural architecture search.
PaddleSlim provides compression solutions for computer vision models, such as image classification, object detection and semantic segmentation. Meanwhile, PaddleSlim keeps exploring advanced compression strategies for language models. Furthermore, benchmarks of compression strategies on some open tasks are available for your reference.
PaddleSlim also provides auxiliary and primitive APIs that developers and researchers can use to survey, implement and apply the methods in the latest papers. The PaddleSlim team supports developers with framework capabilities and technical consulting.
Pruning

- Uniform pruning of convolution channels
- Sensitivity-based pruning
- Automated pruning based on an evolutionary search strategy
- Support pruning of various deep architectures such as VGG, ResNet, and MobileNet.
- Support self-defined range of pruning, i.e., layers to be pruned.

Fixed Point Quantization

- Training aware

  - Dynamic strategy: During inference, we quantize models with hyperparameters dynamically estimated from small batches of samples.
  - Static strategy: During inference, we quantize models with the same hyperparameters estimated from training data.
  - Support layer-wise and channel-wise quantization.

- Post training

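
To make the layer-wise versus channel-wise distinction above more concrete, here is a minimal sketch in plain Python. It is not PaddleSlim code: the function names are hypothetical, and symmetric abs-max scaling to int8 is assumed as the way the quantization scale is estimated.

```python
# Illustrative only: symmetric abs-max int8 quantization, not PaddleSlim's implementation.

def quantize_abs_max(values, scale):
    """Map floats to int8 codes in [-127, 127] using the given scale."""
    if scale == 0:
        return [0 for _ in values]
    return [max(-127, min(127, round(v / scale * 127))) for v in values]

def quantize_layer_wise(weights):
    """Layer-wise: a single scale shared by every output channel of the layer."""
    scale = max(abs(v) for row in weights for v in row)
    return [quantize_abs_max(row, scale) for row in weights], scale

def quantize_channel_wise(weights):
    """Channel-wise: one scale per output channel (one per row here)."""
    scales = [max(abs(v) for v in row) for row in weights]
    codes = [quantize_abs_max(row, s) for row, s in zip(weights, scales)]
    return codes, scales

# Two output channels with very different value ranges.
weights = [[0.25, -1.0], [0.005, 0.02]]
layer_codes, layer_scale = quantize_layer_wise(weights)
channel_codes, channel_scales = quantize_channel_wise(weights)
```

With a shared scale, the small-valued second channel collapses to a handful of codes, while a per-channel scale lets it use the full int8 range. That is the usual motivation for channel-wise quantization.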
Knowledge Distillation

- Naive knowledge distillation: transfers dark knowledge by merging the teacher and student model into the same Program
- Paddle large-scale scalable knowledge distillation framework Pantheon: a universal solution for knowledge distillation, more flexible than the naive knowledge distillation, and easier to scale to the large-scale applications.

  - Decouple the teacher and student models: they run in different processes on the same or different nodes, and transfer knowledge via TCP/IP ports or local files;
  - Friendly to assemble multiple teacher models and each of them can work in either online or offline mode independently;
  - Merge knowledge from different teachers and make batch data for the student model automatically;
  - Support the large-scale knowledge prediction of teacher models on multiple devices.

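
As a rough illustration of what "transferring knowledge" from a teacher to a student can mean, the sketch below computes a soft-label loss between teacher and student logits in plain Python. The choice of loss (cross-entropy between temperature-softened distributions) and all names are assumptions for illustration; the text above does not say which loss PaddleSlim uses.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]   # agrees with the teacher's ranking
far_student = [0.5, 1.0, 4.0]     # disagrees with the teacher
loss_close = soft_label_loss(close_student, teacher)
loss_far = soft_label_loss(far_student, teacher)
```

A student whose predictions resemble the teacher's gets a lower loss, so minimizing this term pulls the student toward the teacher's output distribution.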
Neural Architecture Search
- Neural architecture search based on evolution strategy.
- Support distributed search.
- One-Shot neural architecture search.
- Support FLOPs and latency constrained search.
- Support the latency estimation on different hardware and platforms.
Paddle >= 1.7.0

.. code-block:: bash

   pip install paddleslim -i https://pypi.org/simple

- `QuickStart <https://paddlepaddle.github.io/PaddleSlim/quick_start/index_en.html>`_ : Shows how to use PaddleSlim through simple examples.
- `Advanced Tutorials <https://paddlepaddle.github.io/PaddleSlim/tutorials/index_en.html>`_ : Tutorials about advanced usage of PaddleSlim.
- `Model Zoo <https://paddlepaddle.github.io/PaddleSlim/model_zoo_en.html>`_ : Benchmark and pretrained models.
- `API Documents <https://paddlepaddle.github.io/PaddleSlim/api_en/index_en.html>`_
- `PaddleDetection <https://github.com/PaddlePaddle/PaddleDetection/tree/master/slim>`_ : How to use PaddleSlim in the PaddleDetection library.
- `PaddleSeg <https://github.com/PaddlePaddle/PaddleSeg/tree/develop/slim>`_ : How to use PaddleSlim in the PaddleSeg library.
- `PaddleLite <https://paddlepaddle.github.io/Paddle-Lite/>`_ : How to use PaddleLite to deploy models generated by PaddleSlim.
Image Classification
Dataset: ImageNet2012; Model: MobileNetV1;

=====================================================  ===========================  ============================
Method                                                 Accuracy(baseline: 70.91%)   Model Size(baseline: 17.0M)
=====================================================  ===========================  ============================
Knowledge Distillation(ResNet50)                       +1.06%                       -
Knowledge Distillation(ResNet50) + int8 quantization   +1.10%                       -71.76%
Pruning(FLOPs-50%) + int8 quantization                 -1.71%                       -86.47%
=====================================================  ===========================  ============================

Object Detection
Dataset: Pascal VOC; Model: MobileNet-V1-YOLOv3

==============================================================  =====================  ===========================
Method                                                          mAP(baseline: 76.2%)   Model Size(baseline: 94MB)
==============================================================  =====================  ===========================
Knowledge Distillation(ResNet34-YOLOv3)                         +2.8%                  -
Pruning(FLOPs -52.88%)                                          +1.4%                  -67.76%
Knowledge Distillation(ResNet34-YOLOv3)+Pruning(FLOPs-69.57%)   +2.6%                  -67.00%
==============================================================  =====================  ===========================

Dataset: COCO; Model: MobileNet-V1-YOLOv3

==============================================================  =====================  ===========================
Method                                                          mAP(baseline: 29.3%)   Model Size
==============================================================  =====================  ===========================
Knowledge Distillation(ResNet34-YOLOv3)                         +2.1%                  -
Knowledge Distillation(ResNet34-YOLOv3)+Pruning(FLOPs-67.56%)   -0.3%                  -66.90%
==============================================================  =====================  ===========================

Dataset: ImageNet2012; Model: MobileNetV2

===================  ================  ===============================
Device               Infer time cost   Top1 accuracy(baseline:71.90%)
===================  ================  ===============================
RK3288               -23%              +0.07%
Android cellphone    -20%              +0.16%
iPhone 6s            -17%              +0.32%
===================  ================  ===============================

.. _cluster_howto:
.. image:: src/parallelism.png
The RPC communication mode uses `gRPC <https://github.com/grpc/grpc/>`_ , and the Collective communication mode uses
`NCCL2 <https://developer.nvidia.com/nccl>`_ .

.. csv-table::
   :header: "Feature", "Collective", "RPC"

   "Ring-Based communication", "Yes", "No"
   "Asynchronous training", "Yes", "Yes"
   "Distributed model", "No", "Yes"
   "Fault-tolerant training", "No", "Yes"
   "Performance", "Faster", "Fast"

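- Structure of the RPC communication mode:

.. image:: src/dist_train_pserver.png

**Note:** When training on GPUs, the pserver can use the GPU or only the CPU. If the pserver also uses the GPU, the gradient data it receives must additionally be copied from the CPU to the GPU, which in some cases lowers overall training performance.

**Note:** When training on GPUs, if each trainer node has multiple GPU cards, gradients are first aggregated among the cards of each trainer node with NCCL2 and then across nodes through the pserver.

- Structure of the NCCL2 communication mode:

.. image:: src/dist_train_nccl2.png

Training in the parameter server mode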
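Use the :code:`transpiler` API to quickly convert a program that runs on a single machine into one that runs in a distributed manner. On each server node, pass the corresponding arguments to :code:`transpiler` to get the :code:`Program` that the current node should execute.

.. csv-table::
   :header: "Parameter", "Description"

   "role", "\ **required**\ distinguishes whether to start as pserver or trainer; it is not passed into transpile, so other variable names or environment variables can be used as well"
   "trainer_id", "\ **required**\ for a trainer process, the unique id of the current trainer in the task, starting from 0; it must not be repeated within one task"
   "pservers", "\ **required**\ ip:port list string of all pservers in the current task, for example:,"
   "trainers", "\ **required**\ the number of trainer nodes"
   "sync_mode", "\ **optional**\ True for synchronous mode, False for asynchronous mode"
   "startup_program", "\ **optional**\ if startup_program is not the default fluid.default_startup_program(), this parameter needs to be passed in"
   "current_endpoint", "\ **optional**\ this parameter is only required for NCCL2 mode"

For example, suppose there are two nodes, :code:`` and :code:`` , that use port 6170 and start 4 trainers. The code can then be written as:

.. code-block:: python

   import paddle.fluid as fluid

   role = "PSERVER"
   trainer_id = 0  # get actual trainer id from cluster
   pserver_endpoints = ","
   current_endpoint = ""  # get actual current endpoint
   trainers = 4
   t = fluid.DistributeTranspiler()
   t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers)
   if role == "PSERVER":
       pserver_prog = t.get_pserver_program(current_endpoint)
       pserver_startup = t.get_startup_program(current_endpoint, pserver_prog)
   elif role == "TRAINER":
       pass  # start trainer here
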
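Fluid distributed tasks support synchronous and asynchronous training. In synchronous mode, all trainer nodes synchronously merge the gradient data of all nodes for every mini-batch and send it to the parameter server to complete the update. In asynchronous mode, trainers do not wait for each other and update the parameters on the parameter server independently. In general, asynchronous training achieves higher overall throughput than synchronous training when there are many trainer nodes.

When the :code:`transpile` function is called, a synchronous distributed training program is generated by default. Specify :code:`sync_mode=False` to generate an asynchronous training program:

.. code-block:: python

   t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers, sync_mode=False)
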
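The number of embedding feature ids may be very large, and once it reaches a certain size the embedding parameter becomes very large as well. On the one hand, the memory of a single machine may be insufficient, which makes training impossible. On the other hand, the normal training mode synchronizes the complete set of parameters in every iteration, so a very large parameter makes communication, and therefore training, slow.
Fluid supports training embeddings over very large scale sparse features, at the hundred-billion level. The embedding parameter is stored only on the parameter servers, and parameter prefetching together with sparse gradient updates greatly reduces traffic and improves communication speed.
Usage: when configuring the embedding, add the parameters :code:`is_distributed=True` and :code:`is_sparse=True` .
The parameter :code:`dict_size` defines the total number of ids in the data. An id can be any value in the int64 range; as long as the total number of ids is less than or equal to dict_size, it is supported.
So before configuring, you need to estimate the total number of feature ids in the data.

.. code-block:: python

   # `input` stands for the id tensor defined elsewhere in the network
   emb = fluid.layers.embedding(
       is_distributed=True,
       input=input,
       size=[dict_size, embedding_width],
       is_sparse=True)
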
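The :code:`split_method` parameter specifies how parameters are distributed over the parameter servers.
Fluid uses `RoundRobin <https://en.wikipedia.org/wiki/Round-robin_scheduling>`_
by default to distribute parameters over multiple parameter servers. With this method, as long as parameter slicing is left on (the default), the parameters are distributed fairly evenly over all
parameter servers. To use a different method, pass it in. The currently available methods are :code:`RoundRobin` and
:code:`HashName` . You can also use a customized distribution method; refer to
`here <https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/transpiler/ps_dispatcher.py#L44>`_
to write a customized distribution function.

The :code:`slice_var_up` parameter specifies whether to split large parameters (more than 8192 elements) across multiple parameter servers to balance the computational load. It is on by default.
When the sizes of the trainable parameters in the model are relatively uniform, or when a customized distribution method already spreads the parameters evenly over the parameter servers, you can turn slicing off, which avoids the computation and copying overhead of slicing and reassembling:

.. code-block:: python

   t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers, slice_var_up=False)
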
In parameter server distributed training, enabling memory optimization with :code:`memory_optimize` requires following some rules that do not apply on a single machine:

* On the pserver side, \ **don't**\  execute :code:`memory_optimize`
* On the trainer side, execute :code:`fluid.memory_optimize` first and then execute :code:`t.transpile()`
* On the trainer side, call :code:`memory_optimize` with :code:`skip_grads=True` so that the gradients to be sent are not renamed: :code:`fluid.memory_optimize(input_program, skip_grads=True)`

.. code-block:: python

   if role == "TRAINER":
       fluid.memory_optimize(fluid.default_main_program(), skip_grads=True)
   t = fluid.DistributeTranspiler()
   t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers)
   if role == "PSERVER":
       pass  # start pserver here
   elif role == "TRAINER":
       pass  # start trainer here

In NCCL2 mode there is no parameter server role, and the trainers communicate with each other directly. Pay attention to the following points:

* Configure :code:`mode="nccl2"` in :code:`fluid.DistributeTranspilerConfig` .
* When calling :code:`transpile` , pass the endpoints of all trainer nodes as :code:`trainers` and also pass the argument :code:`current_endpoint` .
  In this step, :code:`gen_nccl_id_op` is added to the :code:`startup program` to synchronize the NCCLID information when the multi-node program initializes.
* Initialize :code:`ParallelExecutor` with :code:`num_trainers` and :code:`trainer_id` .
  In this step, :code:`ParallelExecutor` initializes NCCL2 in multi-node mode and can then run
  :code:`allreduce` across nodes on the gradient of every parameter, which performs multi-node synchronous training.

.. code-block:: python

   trainer_id = 0  # get actual trainer id here
   trainers = ","
   current_endpoint = ""
   config = fluid.DistributeTranspilerConfig()
   config.mode = "nccl2"
   t = fluid.DistributeTranspiler(config=config)
   t.transpile(trainer_id, trainers=trainers, current_endpoint=current_endpoint)
   exe = fluid.ParallelExecutor(use_cuda,
       loss_name=loss_name, num_trainers=len(trainers.split(",")), trainer_id=trainer_id)

.. csv-table::
   :header: "Parameter", "Description"

   "trainer_id", "(int) the unique ID of each trainer node in the task, starting at 0; duplicates are not allowed"
   "trainers", "(string) endpoints of all trainer nodes in the task, used to broadcast the NCCL ID when NCCL2 is initialized"
   "current_endpoint", "(string) endpoint of the current node"

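Currently, distributed training using NCCL2 only supports synchronous training. NCCL2 mode is better suited to relatively large models that need synchronous training on GPUs. If the hardware supports RDMA and GPU Direct, it can achieve high distributed training performance.

Start an NCCL2 distributed training job in multi-process mode

Launching an NCCL2 distributed training job in multi-process mode usually gives better training performance. Paddle provides the
:code:`paddle.distributed.launch` module to start multi-process jobs conveniently; once started, each training process uses its own GPU device.

* Number of nodes: set the number of nodes of a job through the environment variable :code:`PADDLE_NUM_TRAINERS` ; this variable is also set in every training process.
* Number of devices per node: use the launch argument :code:`--gpus` to set the number of GPU devices on each node; the sequence number of each process is set automatically in the environment variable :code:`PADDLE_TRAINER_ID` .
* Data sharding: multi-process mode runs one process per device. In general, each process should handle a part of the training data, and all processes together should cover the whole data set.
* Entry file: the entry file is the training script that is actually launched.
* Logs: the log of each training process is saved in the :code:`./mylog` directory by default; you can change this with the :code:`--log_dir` argument.

.. code-block:: bash

   > PADDLE_NUM_TRAINERS=<TRAINER_COUNT> python -m paddle.distributed.launch --gpus <NUM_GPUS_ON_HOSTS> <ENTRYPOINT_SCRIPT> --arg1 --arg2 ...
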
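**Note:** In NCCL2 mode, make sure every node trains on the same amount of data; otherwise the job may fail to exit in the last round of training. There are two common ways to do this:

- Randomly sample some data to top up the nodes that were assigned less data. (This method is recommended, because the complete dataset gets trained.)
- In the Python code, let each node train only a fixed number of batches per pass. If a node has more data than that, the extra data is not trained.

**Note:** In NCCL2 mode, if you only want to use some of the cards on a node, specify them with the environment variable :code:`export CUDA_VISIBLE_DEVICES=0,1,2,3` .

**Note:** If there are multiple network devices in the system, you need to manually specify the device used by NCCL2. Assuming :code:`eth2` is the communication device, set the following environment variable:

.. code-block:: bash

   export NCCL_SOCKET_IFNAME=eth2

In addition, NCCL2 provides other environment variable switches, such as whether to enable GPU Direct or whether to use RDMA. For details, please refer to
`ncclknobs <https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/index.html#ncclknobs>`_ .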
.. _cluster_howto_en:
Manual for Distributed Training with Fluid
Basic Idea Of Distributed Training
Distributed deep learning training usually uses one of two parallelization methods: data parallelism or model parallelism. Refer to the following figure:
.. image:: src/parallelism.png
In model parallelism mode, the layers and parameters of the model are distributed over multiple nodes. During the forward and backward passes of a mini-batch, the model goes through multiple communications across nodes, and each node holds only a part of the entire model.
In data parallelism mode, each node holds the complete layers and parameters of the model. Each node performs the forward and backward computation on its own, then the gradients are aggregated and the parameters are updated on all nodes synchronously.
The current version of Fluid only provides data parallelism mode. Implementations of special cases of model parallelism (e.g. large sparse model training) will be explained in subsequent documents.
In data parallelism mode, Fluid offers two communication modes to meet the needs of different training tasks: RPC communication and Collective communication. RPC communication uses `gRPC <https://github.com/grpc/grpc/>`_ , and Collective communication uses `NCCL2 <https://developer.nvidia.com/nccl>`_ .

.. csv-table:: Comparison of RPC communication and Collective communication
   :header: "Feature", "Collective", "RPC"

   "Ring-Based Communication", "Yes", "No"
   "Asynchronous Training", "Yes", "Yes"
   "Distributed Model", "No", "Yes"
   "Fault-tolerant Training", "No", "Yes"
   "Performance", "Faster", "Fast"

- Structure of RPC Communication Method:
.. image:: src/dist_train_pserver.png
Data-parallel distributed training in RPC communication mode starts multiple pserver processes and multiple trainer processes. Each pserver process stores a part of the model parameters, receives the gradients sent by the trainers, and updates those parameters. Each trainer process holds a copy of the complete model, trains on a part of the data, sends the gradients to the pservers, and finally pulls the updated parameters from the pservers.
A pserver process can run on a compute node that is completely separate from the trainers, or it can share a node with a trainer. The number of pserver processes a distributed task needs usually has to be tuned to the actual situation to achieve the best performance, but there are usually no more pserver processes than trainer processes.
**Note:** When using GPU training, the pserver can use the GPU or only the CPU. If the pserver also uses the GPU, the gradient data it receives must additionally be copied from the CPU to the GPU. In some cases this degrades overall training performance.
**Note:** When using GPU training, if each trainer node has multiple GPU cards, gradients are first aggregated among the cards within each node using NCCL2, and then across nodes through the pserver.
- Structure of NCCL2 communication method:
.. image:: src/dist_train_nccl2.png
Distributed training with NCCL2 (the Collective communication method) does not need pserver processes. Each trainer process holds a complete set of model parameters. After the gradients are computed, the trainers communicate with each other to "Reduce" the gradient data across all devices of all nodes, and then each node updates its own parameters.
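
The collective pattern described above can be sketched without any framework. The snippet below is a toy model of an allreduce followed by a local update; it is not how NCCL2 is implemented, the function names are hypothetical, and averaging (rather than summing) the gradients is an assumption made for the example.

```python
def allreduce_mean(per_trainer_grads):
    """Average gradients element-wise and hand every trainer the same result."""
    num_trainers = len(per_trainer_grads)
    summed = [sum(values) for values in zip(*per_trainer_grads)]
    mean = [s / num_trainers for s in summed]
    return [list(mean) for _ in range(num_trainers)]

def sgd_step(params, grads, lr):
    """Plain SGD update applied locally by each trainer."""
    return [p - lr * g for p, g in zip(params, grads)]

# Three trainers start from identical parameters but see different data,
# so their local gradients differ.
params = [[1.0, 2.0] for _ in range(3)]
local_grads = [[0.25, 0.0], [0.5, 0.25], [0.0, 0.5]]

reduced = allreduce_mean(local_grads)
params = [sgd_step(p, g, lr=0.5) for p, g in zip(params, reduced)]
```

Because every trainer applies the same reduced gradient to the same starting parameters, the replicas stay identical without any central parameter server.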
Training in the Parameter Server Manner
Use the :code:`transpiler` API to quickly convert a program that runs on a single machine into one that runs in a distributed manner. On each server node, pass the corresponding arguments to :code:`transpiler` to get the :code:`Program` that the current node should execute:

.. csv-table:: required configuration parameters
   :header: "parameter", "description"

   "role", "\ **required**\ distinguishes whether to start as pserver or trainer; this argument is not passed into ``transpile`` , so you can also use other variable names or environment variables"
   "trainer_id", "\ **required**\ for a trainer process, the unique id of the current trainer in the task, starting from 0; it must not be repeated within one task"
   "pservers", "\ **required**\ ip:port list string of all pservers in current task, for example:,"
   "trainers", "\ **required**\ the number of trainer nodes"
   "sync_mode", "\ **optional**\ True for synchronous mode, False for asynchronous mode"
   "startup_program", "\ **optional**\ If startup_program is not the default fluid.default_startup_program(), this parameter needs to be passed in"
   "current_endpoint", "\ **optional**\ This parameter is only required for NCCL2 mode"

For example, suppose there are two nodes, namely :code:`` and :code:``, use port 6170 to start 4 trainers.
Then the code can be written as:

.. code-block:: python

   import paddle.fluid as fluid

   role = "PSERVER"
   trainer_id = 0  # get actual trainer id from cluster
   pserver_endpoints = ","
   current_endpoint = ""  # get actual current endpoint
   trainers = 4
   t = fluid.DistributeTranspiler()
   t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers)
   if role == "PSERVER":
       pserver_prog = t.get_pserver_program(current_endpoint)
       pserver_startup = t.get_startup_program(current_endpoint, pserver_prog)
   elif role == "TRAINER":
       pass  # start trainer here

Choose Synchronous Or Asynchronous Training
Fluid distributed tasks support synchronous training or asynchronous training.
In synchronous mode, all trainer nodes merge the gradient data of all nodes synchronously for every mini-batch and send it to the parameter server to complete the update.
In asynchronous mode, trainers do not wait for each other and update the parameters on the parameter server independently.
In general, asynchronous training achieves higher overall throughput than synchronous training when there are many trainer nodes.
When the :code:`transpile` function is called, a synchronous distributed training program is generated by default. An asynchronous training program can be generated by specifying the :code:`sync_mode=False` parameter:

.. code-block:: python

   t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers, sync_mode=False)

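
The difference between the two modes can be illustrated with a toy, single-process parameter server. This is only a sketch of the idea, not Fluid's pserver: the class and its methods are hypothetical, and merging gradients by averaging is an assumption.

```python
class ToyParameterServer:
    """Holds one scalar parameter and applies SGD updates from trainers."""

    def __init__(self, param, lr, num_trainers, sync_mode=True):
        self.param = param
        self.lr = lr
        self.num_trainers = num_trainers
        self.sync_mode = sync_mode
        self.pending = []   # gradients waiting for the other trainers (sync only)
        self.updates = 0    # number of parameter updates applied so far

    def push_gradient(self, grad):
        if not self.sync_mode:
            # Asynchronous: apply each gradient as soon as it arrives.
            self._apply(grad)
            return
        # Synchronous: wait until every trainer has reported, then update once.
        self.pending.append(grad)
        if len(self.pending) == self.num_trainers:
            self._apply(sum(self.pending) / self.num_trainers)
            self.pending = []

    def _apply(self, grad):
        self.param -= self.lr * grad
        self.updates += 1


sync_ps = ToyParameterServer(param=1.0, lr=0.5, num_trainers=2, sync_mode=True)
sync_ps.push_gradient(0.5)
updates_after_first_push = sync_ps.updates  # still waiting for trainer 2
sync_ps.push_gradient(1.0)

async_ps = ToyParameterServer(param=1.0, lr=0.5, num_trainers=2, sync_mode=False)
async_ps.push_gradient(0.5)
async_ps.push_gradient(1.0)
```

In the synchronous case the first trainer's gradient sits in a buffer until the second arrives, which is where the waiting cost comes from; in the asynchronous case every push changes the parameter immediately.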
Whether To Use The Distributed Embedding Table For Training
Embedding is widely used in various network structures, especially in text processing models.
In some scenarios, such as recommendation systems or search engines, the number of embedding feature ids may be very large, and once it reaches a certain size the embedding parameter becomes very large as well.
On the one hand, the memory of a single machine may be insufficient, which makes training impossible.
On the other hand, the normal training mode synchronizes the complete set of parameters in every iteration. If the parameter is too large, communication becomes very slow and training slows down with it.
Fluid supports training embeddings over very large scale sparse features, at the hundred-billion level. The embedding parameter is stored only on the parameter servers, and parameter prefetching together with sparse gradient updates greatly reduces traffic and improves communication speed.
This feature is only valid for distributed training and cannot be used on a single machine. It must be used together with sparse updates.
Usage: when configuring the embedding, add the parameters :code:`is_distributed=True` and :code:`is_sparse=True`.
The parameter :code:`dict_size` defines the total number of ids in the data. An id can be any value in the int64 range; as long as the total number of ids is less than or equal to dict_size, it is supported.
So before you configure, you need to estimate the total number of feature ids in the data.

.. code-block:: python

   # `input` stands for the id tensor defined elsewhere in the network
   emb = fluid.layers.embedding(
       is_distributed=True,
       input=input,
       size=[dict_size, embedding_width],
       is_sparse=True)

Select Parameter Distribution Method
The :code:`split_method` parameter specifies how parameters are distributed over the parameter servers.
Fluid uses `RoundRobin <https://en.wikipedia.org/wiki/Round-robin_scheduling>`_ by default to scatter parameters over multiple parameter servers.
With this method, as long as parameter slicing is left on (the default), the parameters are distributed fairly evenly over all parameter servers.
To use a different method, pass it in. The currently available methods are :code:`RoundRobin` and :code:`HashName` . You can also use a customized distribution method; refer to `here <https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/transpiler/ps_dispatcher.py#L44>`_
to write a customized distribution function.
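
To show how the two placement strategies differ, here is a standalone sketch. It does not use or reproduce the classes in ``ps_dispatcher.py``; the function names, the variable names and the endpoint strings are all hypothetical, and ``zlib.crc32`` is an assumed stand-in for whatever hash the real dispatcher uses.

```python
import zlib

def round_robin_dispatch(var_names, endpoints):
    """Assign variables to endpoints in turn, in the order they are listed."""
    return {name: endpoints[i % len(endpoints)]
            for i, name in enumerate(var_names)}

def hash_name_dispatch(var_names, endpoints):
    """Assign each variable by a hash of its name, independent of list order."""
    # crc32 is used because Python's built-in hash() of str is salted per process.
    return {name: endpoints[zlib.crc32(name.encode("utf-8")) % len(endpoints)]
            for name in var_names}

endpoints = ["ps0:6170", "ps1:6170"]          # hypothetical pserver endpoints
var_names = ["fc_0.w", "fc_0.b", "fc_1.w", "fc_1.b"]

rr_placement = round_robin_dispatch(var_names, endpoints)
hash_placement = hash_name_dispatch(var_names, endpoints)
```

Round-robin guarantees an even count per server but depends on the order of the variable list, while name hashing gives a placement that any process can recompute from the name alone, at the cost of a possibly uneven split.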
Turn Off the slice-up of Parameters
The :code:`slice_var_up` parameter specifies whether to split large parameters (more than 8192 elements) across multiple parameter servers to balance the computational load. It is on by default.
When the sizes of the trainable parameters in the model are relatively uniform, or when a customized distribution method already spreads the parameters evenly over the parameter servers, you can turn slicing off, which avoids the computation and copying overhead of slicing and reassembling:

.. code-block:: python

   t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers, slice_var_up=False)

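
The effect of slicing can be sketched as follows. Only the 8192-element threshold comes from the text above; the function name and the exact splitting rule (near-equal blocks, at most one per pserver, each at least the threshold size) are assumptions and may differ from what the transpiler actually does.

```python
MIN_BLOCK_SIZE = 8192  # threshold quoted in the text above

def slice_parameter(numel, num_pservers, min_block_size=MIN_BLOCK_SIZE):
    """Return the block sizes that a parameter with `numel` elements is cut into."""
    if numel <= min_block_size:
        return [numel]  # small parameters stay whole
    # At most one block per pserver, and no block smaller than the threshold.
    num_blocks = min(num_pservers, numel // min_block_size)
    base, extra = divmod(numel, num_blocks)
    return [base + (1 if i < extra else 0) for i in range(num_blocks)]

small = slice_parameter(4096, num_pservers=4)
medium = slice_parameter(20000, num_pservers=4)
large = slice_parameter(100000, num_pservers=4)
```

A large embedding or fully connected weight is spread over several servers, while small biases are left intact, which is why slicing matters most when parameter sizes are very uneven.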
Turn On Memory Optimization
In parameter server distributed training, enabling memory optimization with :code:`memory_optimize` requires following some rules that do not apply on a single machine:

- On the pserver side, **don't** execute :code:`memory_optimize`
- On the trainer side, execute :code:`fluid.memory_optimize` first and then execute :code:`t.transpile()`
- On the trainer side, call :code:`memory_optimize` with :code:`skip_grads=True` so that the gradients to be sent are not renamed: :code:`fluid.memory_optimize(input_program, skip_grads=True)`

.. code-block:: python

   if role == "TRAINER":
       fluid.memory_optimize(fluid.default_main_program(), skip_grads=True)
   t = fluid.DistributeTranspiler()
   t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers)
   if role == "PSERVER":
       pass  # start pserver here
   elif role == "TRAINER":
       pass  # start trainer here

Training Using NCCL2 Communication
In NCCL2 mode there is no parameter server role, and the trainers communicate with each other directly. Pay attention to the following points:

* Configure :code:`mode="nccl2"` in :code:`fluid.DistributeTranspilerConfig` .
* When calling :code:`transpile`, pass the endpoints of all trainer nodes as :code:`trainers` and also pass the argument :code:`current_endpoint` .
  In this step, :code:`gen_nccl_id_op` is added to the :code:`startup program` to synchronize the NCCLID information when the multi-node program initializes.
* Initialize :code:`ParallelExecutor` with :code:`num_trainers` and :code:`trainer_id` .
  In this step, :code:`ParallelExecutor` initializes NCCL2 in multi-node mode and can then run :code:`allreduce` across nodes on the gradient of every parameter, which performs multi-node synchronous training.

For example:

.. code-block:: python

   trainer_id = 0  # get actual trainer id here
   trainers = ","
   current_endpoint = ""
   config = fluid.DistributeTranspilerConfig()
   config.mode = "nccl2"
   t = fluid.DistributeTranspiler(config=config)
   t.transpile(trainer_id, trainers=trainers, current_endpoint=current_endpoint)
   exe = fluid.ParallelExecutor(use_cuda,
       loss_name=loss_name, num_trainers=len(trainers.split(",")), trainer_id=trainer_id)


.. csv-table:: Description of the necessary parameters for NCCL2 mode
   :header: "parameter", "description"

   "trainer_id", "(int) The unique ID of each trainer node in the task, starting at 0; duplicates are not allowed"
   "trainers", "(string) endpoints of all trainer nodes in the task, used to broadcast the NCCL ID when NCCL2 is initialized"
   "current_endpoint", "(string) endpoint of the current node"

Currently, distributed training using NCCL2 only supports synchronous training. NCCL2 mode is better suited to relatively large models that need synchronous training on GPUs. If the hardware supports RDMA and GPU Direct, it can achieve high distributed training performance.
Start Up NCCL2 Distributed Training in Multi-Process Mode
Launching an NCCL2 distributed training job in multi-process mode usually gives better training performance. Paddle provides the :code:`paddle.distributed.launch` module to start multi-process jobs; once started, each training process uses its own GPU device.
Points to note:

* Number of nodes: set the number of nodes of a job through the environment variable :code:`PADDLE_NUM_TRAINERS` ; this variable is also set in every training process.
* Number of devices per node: use the launch argument :code:`--gpus` to set the number of GPU devices on each node; the sequence number of each process is set automatically in the environment variable :code:`PADDLE_TRAINER_ID` .
* Data sharding: multi-process mode runs one process per device. In general, each process should handle a part of the training data, and all processes together should cover the whole data set.
* Entry file: the entry file is the training script that is actually launched.
* Logs: the log of each training process is saved in the :code:`./mylog` directory by default; you can change this with the :code:`--log_dir` argument.

Startup example:

.. code-block:: bash

   > PADDLE_NUM_TRAINERS=<TRAINER_COUNT> python -m paddle.distributed.launch --gpus <NUM_GPUS_ON_HOSTS> <ENTRYPOINT_SCRIPT> --arg1 --arg2 ...

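
The data sharding point above is left to the training script. One simple way to do it is sketched below: each process reads its sequence number from :code:`PADDLE_TRAINER_ID` and takes every N-th file. The helper name and the file names are hypothetical, and the total number of processes is written as a constant here because how you obtain it (for example nodes times GPUs per node) depends on your own setup.

```python
import os

def shard_files(files, process_id, num_processes):
    """Return the files that the process with index `process_id` should read."""
    return files[process_id::num_processes]

# PADDLE_TRAINER_ID is the variable named in the text; fall back to 0 when unset.
process_id = int(os.environ.get("PADDLE_TRAINER_ID", "0"))
num_processes = 4  # assumption: supplied by your own configuration

all_files = ["train/part-%02d" % i for i in range(10)]  # hypothetical file names
my_files = shard_files(all_files, process_id % num_processes, num_processes)
```

Strided slicing gives disjoint shards whose union is the full file list, so together the processes cover the whole data set exactly once per pass. Shard sizes can still differ by one file, which is worth keeping in mind in NCCL2 mode.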
Important Notes on NCCL2 Distributed Training
**Note:** In NCCL2 mode, if you only want to use some of the cards on a node, specify them with the environment variable :code:`export CUDA_VISIBLE_DEVICES=0,1,2,3` .

**Note:** In NCCL2 mode, make sure every node trains on the same amount of data; otherwise the job may fail to exit in the last round of training. There are two common ways to do this:

- Randomly sample some data to top up the nodes that were assigned less data. (This method is recommended, because the complete dataset gets trained.)
- In the Python code, let each node train only a fixed number of batches per pass. If a node has more data than that fixed amount, the extra data is not trained.

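
Both ways of equalizing the amount of data can be sketched in a few lines of plain Python. The helper names are hypothetical, and topping a shard up by re-sampling its own records is an assumption; the text does not say where the extra samples should come from.

```python
import random

def pad_by_sampling(samples, target_len, seed=0):
    """Top up a short shard by randomly re-sampling its own records."""
    rng = random.Random(seed)
    padded = list(samples)
    while len(padded) < target_len:
        padded.append(rng.choice(samples))
    return padded

def truncate_to_batches(samples, batch_size, num_batches):
    """Keep only a fixed number of full batches and ignore the rest."""
    return samples[: batch_size * num_batches]

# Two nodes that were assigned different amounts of data.
shards = {"node0": list(range(10)), "node1": list(range(100, 107))}

# Option 1: pad every shard up to the size of the largest one.
target = max(len(s) for s in shards.values())
padded = {name: pad_by_sampling(s, target) for name, s in shards.items()}

# Option 2: every node trains exactly 3 batches of 2 samples per pass.
truncated = {name: truncate_to_batches(s, 2, 3) for name, s in shards.items()}
```

Either way every node ends up running the same number of steps, so no trainer is left waiting in a collective operation for a peer that has already finished.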
**Note:** If there are multiple network devices in the system, you need to manually specify the device used by NCCL2.
Assuming :code:`eth2` is the communication device, set the following environment variable:

.. code-block:: bash

   export NCCL_SOCKET_IFNAME=eth2

In addition, NCCL2 provides other environment variable switches, such as whether to enable GPU Direct or whether to use RDMA. For details, please refer to
`ncclknobs <https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/index.html#ncclknobs>`_ .
.. _cluster_quick_start:
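Distributed training with Fleet API
Since Paddle Fluid `Release 1.5.1 <https://github.com/PaddlePaddle/Paddle/releases/tag/v1.5.1>`_ , Fleet API is the officially recommended way to do distributed training. For an introduction to Fleet API, see `Fleet Design Doc <https://github.com/PaddlePaddle/Fleet>`_
[x] Paddle Fluid is installed. If it is not, see `Quick start <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.7/beginners_guide/quick_start_cn.html>`_
[x] You know the basics of single-node training. If not, study the single-card training described in `Single-node training <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/user_guides/howto/training/single_node.html>`_
This article uses a simple example, a click-through rate (CTR) estimation task, to show how to configure distributed training with Fleet API, and gives a running example that simulates a distributed environment on a single machine. The source code of the example comes from `CTR with Fleet <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/ctr>`_
To make it easier to learn, the example given here mixes single-machine and multi-machine code; you can start a single-machine or multi-machine job with different launch commands. For how the data is obtained and preprocessed, see the source code and description of `CTR with Fleet <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/ctr>`_ ; it is not covered in detail here.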
.. code-block:: python
from __future__ import print_function
from args import parse_args
import os
import paddle.fluid as fluid
import sys
from network_conf import ctr_dnn_model_dataset
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerConfig
dense_feature_dim = 13
sparse_feature_dim = 10000001
batch_size = 100
thread_num = 10
embedding_size = 10
args = parse_args()
def main_function(is_local):
# common code for local training and distributed training
dense_input = fluid.layers.data(
name="dense_input", shape=[dense_feature_dim], dtype='float32')
sparse_input_ids = [
fluid.layers.data(name="C" + str(i), shape=[1], lod_level=1,
dtype="int64") for i in range(1, 27)]
label = fluid.layers.data(name="label", shape=[1], dtype="int64")
dataset = fluid.DatasetFactory().create_dataset()
dataset.set_use_var([dense_input] + sparse_input_ids + [label])
pipe_command = "python criteo_reader.py %d" % sparse_feature_dim
whole_filelist = ["raw_data/part-%d" % x
for x in range(len(os.listdir("raw_data")))]
loss, auc_var, batch_auc_var = ctr_dnn_model_dataset(
dense_input, sparse_input_ids, label, embedding_size,
exe = fluid.Executor(fluid.CPUPlace())
def train_loop(epoch=20):
for i in range(epoch):
# local training
def local_train():
optimizer = fluid.optimizer.SGD(learning_rate=1e-4)
# distributed training
def dist_train():
role = role_maker.PaddleCloudRoleMaker()
strategy = DistributeTranspilerConfig()
strategy.sync_mode = False
optimizer = fluid.optimizer.SGD(learning_rate=1e-4)
optimizer = fleet.distributed_optimizer(optimizer, strategy)
if fleet.is_server():
elif fleet.is_worker():
if is_local:
if __name__ == '__main__':
* Note: The example uses dataset as its IO method; for documentation and usage, see `Dataset API <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.7/api_cn/dataset_cn.html>`_ . For the ``train_from_dataset`` interface used in the example, see `Executor API <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.7/api_cn/executor_cn.html>`_ . The line ``from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet`` selects the parameter server architecture for distributed training. For more Fleet API options and examples, see `Fleet API <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.6/user_guides/howto/training/fleet_api_howto_cn.html>`_
.. code-block:: bash
python train.py --is_local 1
.. code-block:: bash
python -m paddle.distributed.launch_ps --worker_num 2 --server_num 2 train.py
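Task logs are written to the logs directory under the working directory. Once you can simulate distributed training on a single machine, you can move on to real multi-node distributed training. We recommend referring directly to the `example of running distributed tasks on Baidu Cloud <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/user_guides/howto/training/deploy_ctr_on_baidu_cloud_cn.html>`_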
Quick start for distributed training
Distributed training with Fleet API
Since Paddle Fluid `Release
1.5.1 <https://github.com/PaddlePaddle/Paddle/releases/tag/v1.5.1>`__,
it is officially recommended to use the Fleet API for distributed
training. For the introduction of the Fleet API, please refer to `Fleet
Design Doc <https://github.com/PaddlePaddle/Fleet>`__.
- [x] Install Paddle Fluid. If not already installed, please refer to the
`Beginner's Guide <https://www.paddlepaddle.org.cn/documentation/docs/en/1.7/beginners_guide/index_en.html>`__.
- [x] Master the most basic single node training method. Please refer
to the single card training described in `Single-node
training <https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/user_guides/howto/training/single_node_en.html>`__.
Click-through rate prediction
Here, we use a simple example, a click-through rate prediction task,
to illustrate how to configure the Fleet API for distributed training, and
give a runnable example that uses a single-node environment to simulate a
distributed environment. The source code of the example comes from `CTR with
Fleet <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/ctr>`__.
In order to facilitate learning, the example given here is a mixed code
of single node and multi node. You can start single node or multi node
tasks through different startup commands. For the part of obtaining data
and the logic of data preprocessing, please refer to the source code and
description of `CTR with
Fleet <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/ctr>`__.
.. code:: python
from __future__ import print_function
from args import parse_args
import os
import paddle.fluid as fluid
import sys
from network_conf import ctr_dnn_model_dataset
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerConfig
dense_feature_dim = 13
sparse_feature_dim = 10000001
batch_size = 100
thread_num = 10
embedding_size = 10
args = parse_args()
def main_function(is_local):
# common code for local training and distributed training
dense_input = fluid.layers.data(
name="dense_input", shape=[dense_feature_dim], dtype='float32')
sparse_input_ids = [
fluid.layers.data(name="C" + str(i), shape=[1], lod_level=1,
dtype="int64") for i in range(1, 27)]
label = fluid.layers.data(name="label", shape=[1], dtype="int64")
dataset = fluid.DatasetFactory().create_dataset()
dataset.set_use_var([dense_input] + sparse_input_ids + [label])
pipe_command = "python criteo_reader.py %d" % sparse_feature_dim
whole_filelist = ["raw_data/part-%d" % x
for x in range(len(os.listdir("raw_data")))]
loss, auc_var, batch_auc_var = ctr_dnn_model_dataset(
dense_input, sparse_input_ids, label, embedding_size,
exe = fluid.Executor(fluid.CPUPlace())
def train_loop(epoch=20):
for i in range(epoch):
# local training
def local_train():
optimizer = fluid.optimizer.SGD(learning_rate=1e-4)
# distributed training
def dist_train():
role = role_maker.PaddleCloudRoleMaker()
strategy = DistributeTranspilerConfig()
strategy.sync_mode = False
optimizer = fluid.optimizer.SGD(learning_rate=1e-4)
optimizer = fleet.distributed_optimizer(optimizer, strategy)
if fleet.is_server():
elif fleet.is_worker():
if is_local:
if __name__ == '__main__':
- Note: The IO method used in this example is dataset; please refer to `Dataset
API <https://www.paddlepaddle.org.cn/documentation/docs/en/1.7/api/dataset.html>`__
for documentation and usage. For the ``train_from_dataset``
interface, please refer to `Executor
API <https://www.paddlepaddle.org.cn/documentation/docs/en/1.7/api/executor.html>`__.
``from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet``
in this example selects the parameter server architecture for
distributed training. Please refer to `Fleet
API <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.6/user_guides/howto/training/fleet_api_howto_cn.html>`__
for more options and examples of the Fleet API.
Start command of single node training
.. code:: bash
python train.py --is_local 1
Start command for simulating distributed training on a single machine
Here we use launch\_ps, a built-in launcher of Paddle, which lets users
specify the number of workers and servers when starting a parameter server task:
.. code:: bash
python -m paddle.distributed.launch_ps --worker_num 2 --server_num 2 train.py
The task logs can be viewed in the logs directory under the working
directory. Once you can simulate distributed training on a single
machine, you can move on to real multi-node distributed training. We
recommend that users refer directly to the
`example of running distributed tasks on Baidu Cloud <https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/user_guides/howto/training/deploy_ctr_on_baidu_cloud_cn.html>`__.
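FleetAPI Design Notes
API interfaces can be defined flexibly. For the design principles, see the `Fleet
API design document <https://github.com/PaddlePaddle/Fleet/blob/develop/README.md>`_ . FleetAPI currently lives under the paddle.fluid.incubate directory and will move to paddle.fluid once its functionality is complete; stay tuned.
Fleet API quick start example
The two most common usage scenarios of the API are shown with a single model as the example, so that users have a template for getting started quickly. The quick start source code is available at `Fleet Quick Start <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/quick-start>`_ .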
.. code-block:: python

    import paddle.fluid as fluid

    def mlp(input_x, input_y, hid_dim=128, label_dim=2):
        fc_1 = fluid.layers.fc(input=input_x, size=hid_dim, act='tanh')
        fc_2 = fluid.layers.fc(input=fc_1, size=hid_dim, act='tanh')
        prediction = fluid.layers.fc(input=[fc_2], size=label_dim, act='softmax')
        cost = fluid.layers.cross_entropy(input=prediction, label=input_y)
        avg_cost = fluid.layers.mean(x=cost)
        return avg_cost

.. code-block:: python

    import numpy as np

    def gen_data():
        return {"x": np.random.random(size=(128, 32)).astype('float32'),
                "y": np.random.randint(2, size=(128, 1)).astype('int64')}

.. code-block:: python

    import paddle.fluid as fluid
    from nets import mlp
    from utils import gen_data

    input_x = fluid.data(name="x", shape=[None, 32], dtype='float32')
    input_y = fluid.data(name="y", shape=[None, 1], dtype='int64')

    cost = mlp(input_x, input_y)
    optimizer = fluid.optimizer.SGD(learning_rate=0.01)
    # Append the backward pass and parameter update ops to the program.
    optimizer.minimize(cost)
    place = fluid.CUDAPlace(0)

    exe = fluid.Executor(place)
    # Initialize parameters before running the main program.
    exe.run(fluid.default_startup_program())
    step = 1001
    for i in range(step):
        cost_val = exe.run(feed=gen_data(), fetch_list=[cost.name])
        print("step%d cost=%f" % (i, cost_val[0]))

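Parameter Server training
The parameter server approach is well suited to parallel training of simple models on large-scale data. Based on the single-node model definition, an example of training with Parameter Server follows: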
.. code-block:: python
import paddle.fluid as fluid
from nets import mlp
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.incubate.fleet.base import role_maker
from utils import gen_data
input_x = fluid.data(name="x", shape=[None, 32], dtype='float32')
input_y = fluid.data(name="y", shape=[None, 1], dtype='int64')
cost = mlp(input_x, input_y)
optimizer = fluid.optimizer.SGD(learning_rate=0.01)
role = role_maker.PaddleCloudRoleMaker()
optimizer = fleet.distributed_optimizer(optimizer)
if fleet.is_server():
elif fleet.is_worker():
place = fluid.CPUPlace()
exe = fluid.Executor(place)
step = 1001
for i in range(step):
cost_val = exe.run(
print("worker_index: %d, step%d cost = %f" %
(fleet.worker_index(), i, cost_val[0]))
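Collective Training is typically used for multi-node, multi-GPU training and is common when training complex models. Based on the single-node model definition above, an example of distributed training with the Collective method follows: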
.. code-block:: python
import paddle.fluid as fluid
from nets import mlp
from paddle.fluid.incubate.fleet.collective import fleet
from paddle.fluid.incubate.fleet.base import role_maker
from utils import gen_data
input_x = fluid.data(name="x", shape=[None, 32], dtype='float32')
input_y = fluid.data(name="y", shape=[None, 1], dtype='int64')
cost = mlp(input_x, input_y)
optimizer = fluid.optimizer.SGD(learning_rate=0.01)
role = role_maker.PaddleCloudRoleMaker(is_collective=True)
optimizer = fleet.distributed_optimizer(optimizer)
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
step = 1001
for i in range(step):
cost_val = exe.run(
print("worker_index: %d, step%d cost = %f" %
(fleet.worker_index(), i, cost_val[0]))
`Click-through rate prediction <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/distribute_ctr>`_
`Semantic matching <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/simnet_bow>`_
`Word embedding learning <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/word2vec>`_
`Image classification with Resnet50 <https://github.com/PaddlePaddle/Fleet/tree/develop/benchmark/collective/resnet>`_
`Machine translation with Transformer <https://github.com/PaddlePaddle/Fleet/tree/develop/benchmark/collective/transformer>`_
`Semantic representation learning with Bert <https://github.com/PaddlePaddle/Fleet/tree/develop/benchmark/collective/bert>`_
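Fleet API interface reference
Fleet API interfaces
* init(role_maker=None)
* Initializes fleet. Must be called before any other fleet interface; it defines the multi-node environment configuration.
* is_worker()
* Used in Parameter Server training. Returns True if the current node is a Worker node, otherwise False.
* is_server(model_dir=None)
* Used in Parameter Server training. Returns True if the current node is a Server node, otherwise False.
* init_server()
* In Parameter Server training, fleet loads the model parameters saved in model_dir to initialize the parameter server.
* run_server()
* Used in Parameter Server training to start the server-side service.
* init_worker()
* Used in Parameter Server training to start the worker-side service.
* stop_worker()
* Stops the worker after training finishes.
* distributed_optimizer(optimizer, strategy=None)
* Decorator for distributed optimization. Users pass in a single-node optimizer and a distributed training strategy, and get back a distributed optimizer.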
.. code-block:: python
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.incubate.fleet.base import role_maker
role = role_maker.MPISymetricRoleMaker()
.. code-block:: bash
mpirun -np 2 python trainer.py
Parameter Server training example:
.. code-block:: python
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.incubate.fleet.base import role_maker
role = role_maker.PaddleCloudRoleMaker()
.. code-block:: bash
python -m paddle.distributed.launch_ps --worker_num 2 --server_num 2 trainer.py
.. code-block:: python
from paddle.fluid.incubate.fleet.collective import fleet
from paddle.fluid.incubate.fleet.base import role_maker
role = role_maker.PaddleCloudRoleMaker(is_collective=True)
.. code-block:: bash
python -m paddle.distributed.launch trainer.py
.. code-block:: python
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.incubate.fleet.base import role_maker
role = role_maker.UserDefinedRoleMaker(
role=role_maker.Role.WORKER if bool(int(os.getenv("IS_WORKER")))
else role_maker.Role.SERVER,
* Parameter Server Training
* Sync_mode
* Collective Training
* LocalSGD
* ReduceGrad
Fleet Mode
Parameter Server Training
.. code-block:: python
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
Collective Training
.. code-block:: python
from paddle.fluid.incubate.fleet.collective import fleet
.. _user_guide_distribute_en:
Distributed Training
Multi-node Training
from __future__ import print_function
import paddle.fluid.core as core
import math
import os
import sys
import numpy
import paddle
import paddle.fluid as fluid
def loss_net(hidden, label):
prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
loss = fluid.layers.cross_entropy(input=prediction, label=label)
avg_loss = fluid.layers.mean(loss)
acc = fluid.layers.accuracy(input=prediction, label=label)
return prediction, avg_loss, acc
def conv_net(img, label):
conv_pool_1 = fluid.nets.simple_img_conv_pool(
conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
conv_pool_2 = fluid.nets.simple_img_conv_pool(
return loss_net(conv_pool_2, label)
def train(use_cuda, role, endpoints, current_endpoint, trainer_id, trainers):
if use_cuda and not fluid.core.is_compiled_with_cuda():
img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
prediction, avg_loss, acc = conv_net(img, label)
test_program = fluid.default_main_program().clone(for_test=True)
optimizer = fluid.optimizer.Adam(learning_rate=0.001)
t = fluid.DistributeTranspiler()
t.transpile(trainer_id, pservers=endpoints, trainers=trainers)
if role == "pserver":
prog = t.get_pserver_program(current_endpoint)
startup = t.get_startup_program(current_endpoint, pserver_program=prog)
exe = fluid.Executor(fluid.CPUPlace())
elif role == "trainer":
prog = t.get_trainer_program()
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
train_reader = paddle.batch(
paddle.reader.shuffle(paddle.dataset.mnist.train(), buf_size=500),
test_reader = paddle.batch(
paddle.dataset.mnist.test(), batch_size=BATCH_SIZE)
feeder = fluid.DataFeeder(feed_list=[img, label], place=place)
for pass_id in range(PASS_NUM):
for batch_id, data in enumerate(train_reader()):
acc_np, avg_loss_np = exe.run(
prog, feed=feeder.feed(data), fetch_list=[acc, avg_loss])
if (batch_id + 1) % 10 == 0:
'PassID {0:1}, BatchID {1:04}, Loss {2:2.2}, Acc {3:2.2}'.
format(pass_id, batch_id + 1,
float(avg_loss_np.mean()), float(
if __name__ == '__main__':
if len(sys.argv) != 6:
"Usage: python %s role endpoints current_endpoint trainer_id trainers"
% sys.argv[0])
role, endpoints, current_endpoint, trainer_id, trainers = \
train(True, role, endpoints, current_endpoint,
int(trainer_id), int(trainers))
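# Best Practices for Mixed Precision Training
Automatic Mixed Precision (AMP) is a technique that automatically mixes half precision (FP16) and single precision (FP32) to speed up model training. AMP lets users quickly convert a model trained in FP32 to mixed precision training, and uses black and white lists together with dynamic `loss scaling` to keep training numerically stable and avoid Infinite or NaN (Not a Number) gradients. Drawing on the Tensor Cores in recent NVIDIA GPUs, PaddlePaddle AMP speeds up training of models such as ResNet50 and Transformer by 1.5x to 2.9x relative to FP32 training.
### The half-precision floating-point type FP16
As shown in Figure 1, half precision (FP16) is a relatively new floating-point type that is stored in 2 bytes (16 bits). In the IEEE 754-2008 standard it is also called binary16. Compared with the single-precision (FP32) and double-precision (FP64) types commonly used in computation, FP16 is better suited to scenarios with modest precision requirements.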
<figure align="center">
<img src="https://paddleweb-static.bj.bcebos.com/images/fp16.png" width="600" alt='missing'/>
<figcaption><center>Figure 1. Half-precision and single-precision data layout</center></figcaption>
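The storage size and limited range of FP16 described above can be checked with a short sketch. It relies only on the `e` (binary16) format of Python's standard `struct` module; the helper name `to_fp16` is an illustrative assumption and not part of PaddlePaddle.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# One FP16 value occupies 2 bytes.
fp16_size_bytes = struct.calcsize("<e")
print(fp16_size_bytes)

# 65504 is the largest finite FP16 value and survives the round trip.
print(to_fp16(65504.0))

# With an 11-bit significand, integers above 2048 are no longer all representable.
print(to_fp16(2049.0))

# Very small magnitudes, such as tiny gradients, underflow to zero.
print(to_fp16(1e-8))
```

The last line hints at why loss scaling is needed later in this document: small gradient values can vanish entirely when stored in FP16.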
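### FP16 compute capability of NVIDIA GPUs
* FP16 halves memory bandwidth and storage requirements, which lets researchers use larger and more complex models and larger batch sizes on the same hardware.
* FP16 can take full advantage of the Tensor Cores in NVIDIA Volta and Turing GPUs. On the same GPU hardware, the FP16 throughput of Tensor Cores is 8 times that of FP32.
### PaddlePaddle AMP: a first try
As noted above, using the FP16 data type may lose computational precision. In deep learning, however, not every computation needs high precision: some local precision loss has very little effect on the final training result while substantially improving throughput and training speed. This is what motivates mixed precision computation. Concretely, operations that are insensitive to precision loss and can be accelerated by Tensor Cores run in half precision during training, while precision-sensitive parts stay in FP32, in order to maximize memory-access and compute efficiency.
To spare users from hand-designing and trialing a precision mix for every model, the PaddlePaddle framework provides automatic mixed precision (AMP) training. Using AMP in PaddlePaddle is easy: adding one line of code turns existing FP32 training into AMP training. The following uses `MNIST` as an example of how to use PaddlePaddle AMP.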
import paddle.fluid as fluid
def MNIST(data, class_dim):
conv1 = fluid.layers.conv2d(data, 16, 5, 1, act=None, data_format='NHWC')
bn1 = fluid.layers.batch_norm(conv1, act='relu', data_layout='NHWC')
pool1 = fluid.layers.pool2d(bn1, 2, 'max', 2, data_format='NHWC')
conv2 = fluid.layers.conv2d(pool1, 64, 5, 1, act=None, data_format='NHWC')
bn2 = fluid.layers.batch_norm(conv2, act='relu', data_layout='NHWC')
pool2 = fluid.layers.pool2d(bn2, 2, 'max', 2, data_format='NHWC')
fc1 = fluid.layers.fc(pool2, size=64, act='relu')
fc2 = fluid.layers.fc(fc1, size=class_dim, act='softmax')
return fc2
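When building CV (Computer Vision) models, note the following three points to get higher training performance:
* `conv2d`, `batch_norm`, `pool2d`, and similar layers should set the data layout to `NHWC`, which helps Tensor Cores accelerate the computation<sup><a href="#fn1" id="ref1">1</a></sup>.
* When FP16 is used to accelerate convolution, Tensor Cores require the input/output channel counts of conv2d to be multiples of 8<sup><a href="#fn2" id="ref2">2</a></sup>, so it is recommended to set the input/output channels of conv2d layers to multiples of 8 when designing the network.
* When FP16 is used to accelerate matrix multiplication, Tensor Cores require the matrix row and column counts to be multiples of 8<sup><a href="#fn3" id="ref3">3</a></sup>, so it is recommended to set the size parameter of fc layers to a multiple of 8.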
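The multiple-of-8 recommendation above can be applied mechanically when choosing layer sizes. The following is a minimal sketch; `round_up_to_multiple` is a hypothetical helper introduced here for illustration, not a PaddlePaddle function.

```python
def round_up_to_multiple(value, multiple=8):
    """Return the smallest multiple of `multiple` that is >= value."""
    if value <= 0 or multiple <= 0:
        raise ValueError("value and multiple must be positive")
    return ((value + multiple - 1) // multiple) * multiple


# Candidate channel counts or fc sizes, padded up to the next multiple of 8.
for size in (10, 16, 20, 64):
    print(size, "->", round_up_to_multiple(size))
```

Whether padding a dimension is acceptable depends on the model; for an output layer, the extra units would have to be ignored or masked by the surrounding code.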
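**FP32 training**
To train the MNIST network, a loss function is also needed to update the weights; the optimizer used here is SGDOptimizer. For brevity, the iterative training code is omitted and only the loss function and optimizer definitions are shown.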
import paddle
import numpy as np
data = fluid.layers.data(
name='image', shape=[None, 28, 28, 1], dtype='float32')
label = fluid.layers.data(name='label', shape=[None, 1], dtype='int64')
out = MNIST(data, class_dim=10)
loss = fluid.layers.cross_entropy(input=out, label=label)
avg_loss = fluid.layers.mean(loss)
sgd = fluid.optimizer.SGDOptimizer(learning_rate=1e-3)
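Compared with FP32 training, users only need to wrap the original SGDOptimizer with the `fluid.contrib.mixed_precision.decorate` function provided by PaddlePaddle, and then use the wrapped optimizer (mp_sgd) to update the parameters. The code is as follows:
sgd = fluid.optimizer.SGDOptimizer(learning_rate=1e-3)
# Wrap sgd with fluid.contrib.mixed_precision.decorate to obtain mp_sgd, the optimizer used
# for AMP training, and call mp_sgd.minimize(avg_loss) in place of sgd.minimize(avg_loss).
mp_sgd = fluid.contrib.mixed_precision.decorator.decorate(sgd)
mp_sgd.minimize(avg_loss)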
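export FLAGS_conv_workspace_size_limit=1024 # MB; set according to the GPU memory capacity and the model. Larger values make it more likely that a faster convolution algorithm is selected
export FLAGS_cudnn_exhaustive_search=1 # use exhaustive search to select a fast convolution algorithm
export FLAGS_cudnn_batchnorm_spatial_persistent=1 # triggers the fusion of batch_norm and relu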
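That is the simplest way to use PaddlePaddle AMP. An AMP training example for the ResNet50 model is available [here](https://github.com/PaddlePaddle/models/blob/develop/PaddleCV/image_classification/README.md#%E6%B7%B7%E5%90%88%E7%B2%BE%E5%BA%A6%E8%AE%AD%E7%BB%83); other models use PaddlePaddle AMP in a similar way. If AMP training fails to converge, for example with repeated loss nan, try debugging with the [check nan inf tool](https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/flags/check_nan_inf_cn.html#span-id-speed-span).
### PaddlePaddle AMP: advanced usage
The previous subsection covered the default AMP training behavior. Users can also change some default parameter settings to suit specific training scenarios. The following sections describe the user-configurable parameters of PaddlePaddle AMP, that is, advanced usage.
#### Custom black and white lists
Based on the numerical stability and speedup of FP16 computation, PaddlePaddle AMP defines black and white lists of operators (Ops) inside the framework. Specifically, Ops that are FP16-friendly and can use Tensor Cores are placed on the white list, Ops whose FP16 computation is numerically unstable are placed on the black list, and Ops that are largely unaffected by FP16 are placed on the gray list. Framework developers cannot anticipate every network model, however, especially models used in special scenarios. Users can change the default FP16 behavior by passing custom black and white lists when calling the `fluid.contrib.mixed_precision.decorate` function.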
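sgd = fluid.optimizer.SGDOptimizer(learning_rate=1e-3)
# list1 is the white-list op list, list2 is the black-list op list, and list3 is the black-list var_name list (any op that has one of these var_names as an input or output is treated as a black-list op)
amp_list = fluid.contrib.mixed_precision.AutoMixedPrecisionLists(custom_white_list=list1, custom_black_list=list2, custom_black_varnames=list3)
mp_sgd = fluid.contrib.mixed_precision.decorator.decorate(sgd, amp_list)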
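#### Automatic loss scaling
To avoid Infinite or NaN gradients, PaddlePaddle AMP can automatically adjust the loss scale value according to the gradient values seen during training. Users can also change the loss scaling parameters when calling the `fluid.contrib.mixed_precision.decorate` function, for example: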
sgd = SGDOptimizer(learning_rate=1e-3)
mp_sgd = fluid.contrib.mixed_precision.decorator.decorate(sgd,
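Parameters such as `init_loss_scaling`, `incr_every_n_steps`, and `decr_every_n_nan_or_inf` control the behavior of automatic loss scaling. They take effect only when `use_dynamic_loss_scaling` is set to True. Their meanings are as follows:
* init_loss_scaling(float): the initial loss scaling value.
* incr_every_n_steps(int): the loss scaling value is increased only after incr_every_n_steps consecutive steps with normal gradients.
* decr_every_n_nan_or_inf(int): the loss scaling value is decreased only after decr_every_n_nan_or_inf consecutive steps with invalid gradients (nan or inf).
* incr_ratio(float): the factor by which the loss scaling value is multiplied on each increase; a float greater than 1.
* decr_ratio(float): the factor by which the loss scaling value is multiplied on each decrease; a float less than 1.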
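How these parameters interact can be sketched as a small state machine. This is a minimal illustration of the update rule described in the list above, not PaddlePaddle's actual implementation; the class name `DynamicLossScaler`, the default values, and the lower bound of 1.0 on the scale are all assumptions made for this sketch.

```python
class DynamicLossScaler:
    """Illustrative dynamic loss scaling rule (assumed, not Paddle's code)."""

    def __init__(self, init_loss_scaling=2.0 ** 15, incr_every_n_steps=1000,
                 decr_every_n_nan_or_inf=2, incr_ratio=2.0, decr_ratio=0.5):
        self.loss_scaling = init_loss_scaling
        self.incr_every_n_steps = incr_every_n_steps
        self.decr_every_n_nan_or_inf = decr_every_n_nan_or_inf
        self.incr_ratio = incr_ratio
        self.decr_ratio = decr_ratio
        self.good_steps = 0  # consecutive steps with finite gradients
        self.bad_steps = 0   # consecutive steps with nan/inf gradients

    def update(self, found_nan_or_inf):
        """Update the scale after one training step and return it."""
        if found_nan_or_inf:
            self.good_steps = 0
            self.bad_steps += 1
            if self.bad_steps == self.decr_every_n_nan_or_inf:
                # Assumption: never let the scale drop below 1.0.
                self.loss_scaling = max(self.loss_scaling * self.decr_ratio, 1.0)
                self.bad_steps = 0
        else:
            self.bad_steps = 0
            self.good_steps += 1
            if self.good_steps == self.incr_every_n_steps:
                self.loss_scaling *= self.incr_ratio
                self.good_steps = 0
        return self.loss_scaling


scaler = DynamicLossScaler(init_loss_scaling=8.0, incr_every_n_steps=3,
                           decr_every_n_nan_or_inf=2)
history = [scaler.update(flag) for flag in
           (False, False, False, True, False, True, True)]
print(history)
```

In this run the scale doubles after three consecutive good steps, is unaffected by a single isolated bad step, and halves only after two consecutive bad steps.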
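### Optimization for multi-GPU training
PaddlePaddle AMP is deeply optimized for multi-GPU training. As Figure 2 shows, before the optimization the gradients were computed in FP16 but were still transferred between GPUs as FP32.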
<figure align="center">
<img src="https://paddleweb-static.bj.bcebos.com/images/transfer_fp32_grad.png" width="500" alt='missing'/>
<figcaption><center>Figure 2. Gradients transferred between GPUs in FP32 (before optimization)</center></figcaption>
<figure align="center">
<img src="https://paddleweb-static.bj.bcebos.com/images/transfer_fp16_grad.png" width="500" alt='missing'/>
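<figcaption><center>Figure 3. Gradients transferred between GPUs in FP16 (after optimization)</center></figcaption>
### Training performance comparison (AMP vs. FP32)
PaddlePaddle AMP gives a substantial training speedup over FP32 on models such as ResNet50 and Transformer. The speedups of AMP training over FP32 training for ResNet50 and ERNIE Large are shown below.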
<table align="center">
<caption align="bottom"><center>Figure 4. Paddle AMP training speedup (the horizontal axis is the number of GPUs; for example, 8*8 means 8 nodes with 8 GPUs each)</center></caption>
<td> <img src="https://paddleweb-static.bj.bcebos.com/images/resnet50.png" alt='missing'/> </td>
<td> <img src="https://paddleweb-static.bj.bcebos.com/images/ernie.png" alt='missing'/> </td>
As Figure 4 shows, AMP training of ResNet50 reaches a speedup of more than $2.8 \times$ over FP32 training, and AMP training of ERNIE Large reaches a speedup of $1.7 \times$ to $2.1 \times$.
### References
* <p> <a href="https://arxiv.org/abs/1710.03740"> Mixed Precision Training </a> </p>
* <p> <a href="https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=cn9312-%e4%bd%bf%e7%94%a8%e8%87%aa%e5%8a%a8%e6%b7%b7%e5%90%88%e7%b2%be%e5%ba%a6%e5%8a%a0%e9%80%9f+paddlepaddle+%e8%ae%ad%e7%bb%83"> 使用自动混合精度加速 PaddlePaddle 训练 </a> </p>
* <p id="fn1"> <a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#tensor-layout"> Tensor Layouts In Memory: NCHW vs NHWC </a> <sup> <a href="#ref1"></a> </sup> </p>
* <p id="fn2"> <a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#channels"> Channels In And Out Requirements </a> <sup> <a href="#ref2"></a> </sup> </p>
* <p id="fn3"> <a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc"> Matrix-Matrix Multiplication Requirements </a> <sup> <a href="#ref3"></a> </sup> </p>
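Validating a deep learning framework covers two phases, training and testing, whose metrics differ slightly; this document covers only training-phase metrics. The training phase focuses on model accuracy on the training set. Because the training set is complete, the focus is on training speed at large batch\_size, that is, throughput; image models commonly use batch\_size=128, and larger values with multiple GPUs. The inference phase focuses on accuracy on the test set. Because test data for online services cannot be collected in advance, the focus is on inference speed at small batch\_size, that is, latency; inference services commonly use batch\_size=1, 4, and so on.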
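The two metrics named above, throughput in samples/s and per-step latency, can be measured with a small timing harness. This is a minimal sketch: `measure_throughput` and `fake_step` are hypothetical names, and the sleep call only stands in for one real training or inference step.

```python
import time


def measure_throughput(run_step, batch_size, num_steps, warmup_steps=5):
    """Return (samples per second, seconds per step) for run_step."""
    # Warm-up steps are excluded so one-time setup cost does not skew timing.
    for _ in range(warmup_steps):
        run_step()
    start = time.perf_counter()
    for _ in range(num_steps):
        run_step()
    elapsed = time.perf_counter() - start
    return batch_size * num_steps / elapsed, elapsed / num_steps


def fake_step():
    """Stand-in for one training step (assumption for illustration)."""
    time.sleep(0.001)


throughput, latency = measure_throughput(fake_step, batch_size=128, num_steps=20)
print("throughput: %.1f samples/s, latency: %.4f s/step" % (throughput, latency))
```

For a training benchmark one would pass a function that runs a single batch through the executor; for a latency benchmark, the same harness applies with batch_size set to 1 or 4.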
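GPU performance varies enormously across architectures. When validating training performance on GPU, use the NVIDIA command ```nvidia-smi``` to check the GPU model in use. When testing multi-GPU training performance, confirm whether the hardware interconnect is [nvlink](https://zh.wikipedia.org/zh/NVLink) or [PCIe](https://zh.wikipedia.org/zh-hans/PCI_Express). Likewise, the CPU model greatly affects training performance on CPU; read the parameters in `/proc/cpuinfo` to confirm the CPU model in use.
Download the Cuda Tool Kit and Cudnn that match your GPU, or use the nvidia-docker image officially released by NVIDIA, [nvidia-docker](https://github.com/NVIDIA/nvidia-docker), which includes Cuda and Cudnn; this document takes the latter approach. The Cuda Tool Kit contains the base libraries used by GPU code and affects the runtime performance of the Fluid binaries built on top of it.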
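Task type| Model name| Network structure| Dataset
:---:|:---:|:---:|:---:
Image generation| CycleGAN| GAN| horse2zebra
Image classification| SE-ResNeXt50| Resnet-50| image-net
Semantic segmentation| DeepLab_V3+| ResNets| cityscapes
Natural language| Bert| Transformer| Wikipedia
Machine translation| Transformer| Attention| Wikipedia

CycleGAN, SE-ResNeXt50, and DeepLab_V3+ are CNN models; Bert and Transformer are NLP models that perform better than traditional RNN models.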
- GPU single-node, single-card test
This tutorial uses Cuda9 and Cudnn7.0.1, from the image ```nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04```
nvidia-docker run -it --name CASE_NAME --security-opt seccomp=unconfined -v $PWD/benchmark:/benchmark -v /usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu paddlepaddle/paddle:latest-dev /bin/bash
Then set the code to use CUDAPlace. If you use the scripts in the Paddle code repository, passing the command line argument use_gpu=True is enough.
>>> import paddle.fluid as fluid
>>> place = fluid.CUDAPlace(0)  # 0 selects GPU 0
This tutorial compares the performance of Fluid1.4, Pytorch1.1.0, and TensorFlow1.12.0 in the same environment.
The hardware environment is CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, GPU: Tesla v100(volta) 21729MiB x 1, Nvidia-Driver 384.66.
The system environment is Ubuntu 16.04.3 LTS; this document uses a docker environment, version nvidia-docker17.05.0-ce.
- GPU single-node, single-card test results

Model|Fluid GPU| TensorFlow/Pytorch GPU
:---:|:---:|:---:
CycleGAN| 7.3 samples/s| 6.1 samples/s
SE-ResNeXt50| 169.4 samples/s | 153.1 samples/s
DeepLab_V3+| 12.8 samples/s | 6.4 samples/s
Bert| 4.0 samples/s | 3.4 samples/s
Transformer| 4.9 samples/s | 4.7 samples/s
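# CPU Performance Tuning
This tutorial introduces how to use Python's cProfile package, the Python library yep, and Google perftools for profiling and performance tuning.
Profiling means finding performance bottlenecks. The real bottlenecks in a system can differ greatly from what developers imagine. Tuning means eliminating bottlenecks. Performance optimization is usually a repeated cycle of profiling and tuning.
PaddlePaddle users generally write deep learning programs by calling the Python API. Most Python API calls go into libpaddle.so, which is written in C++. PaddlePaddle profiling and tuning therefore has two parts:
* Profiling Python code
* Profiling mixed Python and C++ code
## Profiling Python code
### Generating the profiling file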
python -m cProfile -o profile.out main.py
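Here `main.py` is the program to profile, and `-o` names the output file that stores the profiling result. If no file is specified, `cProfile` prints to standard output.
### Viewing the profiling file
`cProfile` writes `profile.out` after main.py finishes. We can use [`cprofilev`](https://github.com/ymichael/cprofilev) to view the profiling result. `cprofilev` is a third-party Python library that starts an HTTP service and presents the profiling result as a web page: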
cprofilev -a -p 3214 -f profile.out main.py
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.284 0.284 29.514 29.514 main.py:1(<module>)
4696 0.128 0.000 15.748 0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/executor.py:20(run)
4696 12.040 0.003 12.040 0.003 {built-in method run}
1 0.144 0.144 6.534 6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14(<module>)
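<th>Meaning </th>
<td> ncalls</td>
<td> the number of calls to the function</td>
<td> the total time spent in the function itself, excluding time spent in other functions it calls</td>
<td> percall </td>
<td> average tottime per call</td>
<td> cumtime</td>
<td> the total time of the function, including time spent in other functions it calls</td>
<td> percall</td>
<td> average cumtime per call</td>
<td> filename:lineno(function) </td>
<td> file name, line number, and function name </td>
### Finding performance bottlenecks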
4696 12.040 0.003 12.040 0.003 {built-in method run}
300005 0.874 0.000 1.681 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader)
107991 0.676 0.000 1.519 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:219(__init__)
4697 0.626 0.000 2.291 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp)
1 0.618 0.618 0.618 0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/__init__.py:1(<module>)
Called By:
Ordered by: internal time
List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
Function was called by...
ncalls tottime cumtime
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp) <- 4697 0.626 2.291 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp)
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp) <- 4696 0.019 2.316 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:487(clone)
1 0.000 0.001 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:534(append_backward)
Ordered by: internal time
List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
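## Profiling mixed Python and C++ code
### Generating the profiling file
There are many C++ profiling tools; common ones include `gprof`, `valgrind`, and `google-perftools`. Profiling a dynamic library used from Python, however, is considerably more complicated than profiling a plain binary. Fortunately, the third-party Python library `yep` provides a convenient way to work with `google-perftools`, so `yep` is used here to profile mixed Python and C++ code.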
apt update
apt install libgoogle-perftools-dev
pip install yep
python -m yep -v main.py
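1. Pass `-g` at build time to generate debug information. With cmake, set CMAKE_BUILD_TYPE to `RelWithDebInfo`.
2. Always build with optimization enabled. A plain `Debug` build performs very differently from `-O2` or `-O3`, so performance tests in `Debug` mode are meaningless.
3. When profiling, start with a single thread, then enable multiple threads, and then multiple nodes, since single-threaded runs are easier to debug. Set the environment variable `OMP_NUM_THREADS=1` to turn off openmp optimization.
### Viewing the profiling file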
go get github.com/google/pprof
pprof -http= `which python` ./main.py.prof
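In this command, `-http` starts the HTTP service. `which python` yields the full path of the current Python binary and so specifies the Python executable. `./main.py.prof` is the profiling result used as input.
### Finding performance bottlenecks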
# Tune CPU performance
This tutorial introduces techniques we use to profile and tune the
CPU performance of PaddlePaddle. We will use Python packages
`cProfile` and `yep`, and Google's `perftools`.
Profiling is the process that reveals performance bottlenecks,
which could be very different from what's in the developers' mind.
Performance tuning is done to fix these bottlenecks. Performance optimization
repeats the steps of profiling and tuning alternately.
PaddlePaddle users program AI applications by calling the Python API, which calls
into `libpaddle.so`, written in C++. In this tutorial, we focus on
the profiling and tuning of
1. the Python code and
1. the mixture of Python and C++ code.
## Profiling the Python Code
### Generate the Performance Profiling File
We can use the Python standard
package, [`cProfile`](https://docs.python.org/2/library/profile.html),
to generate a Python profiling file. For example:
python -m cProfile -o profile.out main.py
where `main.py` is the program we are going to profile and `-o` specifies
the output file. Without `-o`, `cProfile` would print to standard output.
### Look into the Profiling File
`cProfile` generates `profile.out` after `main.py` completes. We can
use [`cprofilev`](https://github.com/ymichael/cprofilev) to look into
the details:
cprofilev -a -p 3214 -f profile.out main.py
where `-a` specifies the HTTP IP, `-p` specifies the port, `-f`
specifies the profiling file, and `main.py` is the source file.
Open a Web browser and point it to the local IP and the specified
port; we will see output like the following:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.284 0.284 29.514 29.514 main.py:1(<module>)
4696 0.128 0.000 15.748 0.003 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/executor.py:20(run)
4696 12.040 0.003 12.040 0.003 {built-in method run}
1 0.144 0.144 6.534 6.534 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/__init__.py:14(<module>)
where each line corresponds to a Python function, and the meaning of
each column is as follows:
<th>meaning </th>
<td> ncalls</td>
<td> the number of calls into a function</td>
<td> the total execution time of the function, not including the execution time of other functions called by the function</td>
<td> percall </td>
<td> tottime divided by ncalls</td>
<td> cumtime</td>
<td> the total execution time of the function, including the execution time of other functions being called</td>
<td> percall</td>
<td> cumtime divided by ncalls</td>
<td> filename:lineno(function) </td>
<td> where the function is defined </td>
### Identify Performance Bottlenecks
Usually, `tottime` and the related `percall` time are what we want to
focus on. We can sort the profiling file above by tottime:
4696 12.040 0.003 12.040 0.003 {built-in method run}
300005 0.874 0.000 1.681 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/v2/dataset/mnist.py:38(reader)
107991 0.676 0.000 1.519 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:219(__init__)
4697 0.626 0.000 2.291 0.000 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp)
1 0.618 0.618 0.618 0.618 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/__init__.py:1(<module>)
We can see that the most time-consuming function is the `built-in
method run`, which is a C++ function in `libpaddle.so`. We will
explain how to profile C++ code in the next section. For now,
let's look into the function `sync_with_cpp`, which is a
Python function. We can click it to understand more about it:
Called By:
Ordered by: internal time
List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
Function was called by...
ncalls tottime cumtime
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:428(sync_with_cpp) <- 4697 0.626 2.291 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp)
/home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:562(sync_with_cpp) <- 4696 0.019 2.316 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:487(clone)
1 0.000 0.001 /home/yuyang/perf_test/.env/lib/python2.7/site-packages/paddle/fluid/framework.py:534(append_backward)
Ordered by: internal time
List reduced from 4497 to 2 due to restriction <'sync_with_cpp'>
The lists of the callers of `sync_with_cpp` might help us understand
how to improve the function definition.
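The same two views (ranking by `tottime`, then listing callers) can also be produced without the web UI. The following is a minimal sketch that uses only the standard-library `cProfile` and `pstats` modules; `inner` and `outer` are hypothetical stand-ins for real training code, not Paddle functions.

```python
import cProfile
import io
import pstats


def inner(n):
    # Burn some CPU so that this function shows up in the profile.
    return sum(i * i for i in range(n))


def outer():
    # Hypothetical caller; in a real program this would be the training loop.
    return [inner(20000) for _ in range(5)]


profiler = cProfile.Profile()
profiler.enable()
outer()
profiler.disable()

# Collect the report in memory instead of printing it directly.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)

# Sort by internal time (tottime) and keep only the top 5 entries.
stats.sort_stats("tottime").print_stats(5)

# List the callers of every function whose name matches "inner".
stats.print_callers("inner")

print(report.getvalue())
```

Sorting by `"cumtime"` instead of `"tottime"` would rank functions by the time spent in them including their callees.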
## Profiling Python and C++ Code
### Generate the Profiling File
To profile a mixture of Python and C++ code, we can use a Python
package, `yep`, that can work with Google's `perftools`, which is a
commonly-used profiler for C/C++ code.
In Ubuntu systems, we can install `yep` and `perftools` by running the
following commands:
apt update
apt install libgoogle-perftools-dev
pip install yep
Then we can run the following command
python -m yep -v main.py
to generate the profiling file. The default filename is `main.py.prof`.
Please note the `-v` command line option, which prints the
analysis results after generating the profiling file. By examining
the printed result, we can tell whether debug
information was stripped from `libpaddle.so` at build time. The following hints
help make sure that the analysis results are readable:
1. Use GCC command line option `-g` when building `libpaddle.so` so as to
include the debug information. The standard building system of
PaddlePaddle is CMake, so you might want to set
1. Use GCC command line option `-O2` or `-O3` to generate optimized
binary code. It doesn't make sense to profile `libpaddle.so`
without optimization, because it would anyway run slowly.
1. Profile the single-threaded binary before the
multi-threaded version, because the latter often generates tangled
profiling results. You might want to set the environment
variable `OMP_NUM_THREADS=1` to prevent OpenMP from automatically
starting multiple threads.
### Examining the Profiling File
The tool we used to examine the profiling file generated by
`perftools` is [`pprof`](https://github.com/google/pprof), which
provides a Web-based GUI like `cprofilev`.
We can rely on the standard Go toolchain to retrieve the source code
of `pprof` and build it:
go get github.com/google/pprof
Then we can use it to examine `main.py.prof` generated in the previous section:
pprof -http= `which python` ./main.py.prof
Where `-http` specifies the IP and port of the HTTP service.
Directing our Web browser to the service, we would see something like
the following:
### Identifying the Performance Bottlenecks
Similar to how we work with `cprofilev`, we'd focus on `tottime` and
We can see that the execution time of multiplication and the computing
of the gradient of multiplication takes 2% to 4% of the total running
time, and `MomentumOp` takes about 17%. Obviously, we'd want to
optimize `MomentumOp`.
`pprof` would mark performance critical parts of the program in
red. It's a good idea to follow the hints.
# Heap Memory Profiling and Optimization
Any program can leak memory. Generally, a **memory leak** is caused by heap memory that the program allocates but never releases. As the memory occupied by the program grows, it affects the stability of the program: it may run more slowly or fail with OOM (Out of Memory). It can even compromise the stability of the host machine and lead to *downtime*.
There are many memory leak analysis tools. Typical ones include [valgrind](http://valgrind.org/docs/manual/quick-start.html#quick-start.intro) and [gperftools](https://gperftools.github.io/gperftools/).
Because Fluid runs a C++ core driven by Python, it is very difficult to analyze directly with valgrind. You need to compile a debug build and a dedicated Python build with valgrind support, and most of the output consists of Python's own symbols and call information. In addition, valgrind makes the program run very slowly, so it is not recommended.
Here we mainly introduce the use of [gperftools](https://gperftools.github.io/gperftools/) .
gperftool mainly supports four functions:
- thread-caching malloc
- heap-checking using tcmalloc
- heap-profiling using tcmalloc
- CPU profiler
Paddle also provides a [tutorial on CPU performance analysis](./cpu_profiling_en.html) based on gperftools.
For heap analysis, we mainly use thread-caching malloc and heap-profiling using tcmalloc.
## Environment
This tutorial uses the Docker development image paddlepaddle/paddle:latest-dev provided by Paddle, which is based on Ubuntu 16.04.4 LTS.
## Manual
- Install google-perftools
apt-get install libunwind-dev
apt-get install google-perftools
- Install pprof
go get -u github.com/google/pprof
- Configure Running Environment
export PPROF_PATH=/root/gopath/bin/pprof
export PPROF_BINARY_PATH=/root/gopath/bin/pprof
export LD_PRELOAD=/usr/lib/libtcmalloc.so.4
- Run the Python program with heap profiling enabled. In essence, this takes a snapshot of the heap allocation periodically.
# HEAPPROFILE sets the directory and file prefix of the generated heap profile files
# HEAP_PROFILE_ALLOCATION_INTERVAL sets how many bytes must be allocated between two dumps; the default is 1GB
env HEAPPROFILE="./perf_log/test.log" HEAP_PROFILE_ALLOCATION_INTERVAL=209715200 python trainer.py
As the program runs, a lot of files will be generated in the perf_log folder as follows:
-rw-r--r-- 1 root root 1.0M Jun 1 15:00 test.log.0001.heap
-rw-r--r-- 1 root root 1.0M Jun 1 15:00 test.log.0002.heap
-rw-r--r-- 1 root root 1.0M Jun 1 15:00 test.log.0003.heap
-rw-r--r-- 1 root root 1.0M Jun 1 15:00 test.log.0004.heap
-rw-r--r-- 1 root root 1.0M Jun 1 15:00 test.log.0005.heap
-rw-r--r-- 1 root root 1.0M Jun 1 15:00 test.log.0006.heap
- Analyze the heap files with pprof. There are two modes of analysis:
- Complete mode. An analysis of the current heap is performed, showing some of the call paths for the current allocation of memory.
pprof --pdf python test.log.0012.heap
The command above generates a file named profile00x.pdf, which can be opened directly; see, for example, [memory_cpu_allocator](https://github.com/jacquesqiao/Paddle/blob/bd2ea0e1f84bb6522a66d44a072598153634cade/doc/fluid/howto/optimization/memory_cpu_allocator.pdf). As the chart shows, when running the CPU version of Fluid, the CPUAllocator module accounts for most of the allocated memory. Other modules allocate relatively little memory and are therefore omitted. This mode is inconvenient for finding memory leaks, because a leak develops gradually and cannot be seen in a single snapshot.
- Diff mode. You can do diff on the heap at two moments, which removes some modules whose memory allocation has not changed, and displays the incremental part.
pprof --pdf --base test.log.0010.heap python test.log.1045.heap
The generated result: [`memory_leak_protobuf`](https://github.com/jacquesqiao/Paddle/blob/bd2ea0e1f84bb6522a66d44a072598153634cade/doc/fluid/howto/optimization/memory_leak_protobuf.pdf)
As the figure shows, the ProgramDesc structure grew by more than 200MB between the two snapshots, so a memory leak is very likely here, and the final result did confirm a leak.
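When many heap dumps have accumulated, picking the two files to diff by hand is error-prone. The sketch below builds the diff-mode command from the earliest and latest dumps in a directory. It is only an illustration: `pprof_diff_command` is a hypothetical helper, and it assumes the `test.log.NNNN.heap` naming shown above with zero-padded dump numbers. It also checks that the interval used earlier, 209715200 bytes, equals 200 MiB.

```python
import glob
import os


def pprof_diff_command(log_dir, prefix="test.log", binary="python"):
    """Build the `pprof --base` command comparing the first and last heap dumps."""
    pattern = os.path.join(log_dir, prefix + ".*.heap")
    # Dump numbers are zero-padded, so a lexical sort is also chronological
    # (this assumption breaks once the numbers outgrow the padding width).
    dumps = sorted(glob.glob(pattern))
    if len(dumps) < 2:
        raise ValueError("need at least two heap dumps to diff")
    return ["pprof", "--pdf", "--base", dumps[0], binary, dumps[-1]]


# The allocation interval used in the command above, expressed in MiB.
interval_bytes = 200 * 1024 * 1024
```

The returned list can be passed to `subprocess.call` once enough dumps exist in `./perf_log`.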
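This module introduces how to tune performance when using Fluid, including:
- `CPU profiling <cpu_profiling_cn.html>`_: how to use the cProfile package, the yep library, and Google perftools for performance analysis and tuning
- `Heap memory profiling and optimization <host_memory_profiling_cn.html>`_: how to use gperftool for heap memory profiling and optimization to solve memory leaks
- `Introduction to the Timeline tool <timeline_cn.html>`_ : how to use the Timeline tool for performance analysis and tuning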
Performance Profiling and Optimization
.. toctree::
This section illustrates how to optimize performance of Fluid:
- `CPU profiling <cpu_profiling_en.html>`_: How to use cProfile, yep, and Google perftools to profile and optimize model performance
- `Heap Memory Profiling and Optimization <host_memory_profiling_en.html>`_: Use gperftool to perform heap memory profiling and optimization to solve memory leaks.
- `How to use timeline tool to do profiling <timeline_en.html>`_ : How to use the timeline tool for profiling and optimization
# Introduction to the timeline tool
## <span id="local">Local usage</span>
1. Add `profiler.start_profiler(...)` and `profiler.stop_profiler(...)` around the main training loop. After the run, the code generates a profile record file at `/tmp/profile`.
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid import profiler
place = fluid.CPUPlace()
def reader():
    for i in range(100):
        yield [np.random.random([4]).astype('float32'), np.random.random([3]).astype('float32')],
main_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(main_program, startup_program):
    data_1 = fluid.layers.data(name='data_1', shape=[1, 2, 2])
    data_2 = fluid.layers.data(name='data_2', shape=[1, 1, 3])
    out = fluid.layers.fc(input=[data_1, data_2], size=2)
    # ...
feeder = fluid.DataFeeder([data_1, data_2], place)
exe = fluid.Executor(place)
# Initialize the parameters before training.
exe.run(startup_program)
pass_num = 10
for pass_id in range(pass_num):
    for batch_id, data in enumerate(reader()):
        # Profile only batches 5 through 9 of the first pass.
        if pass_id == 0 and batch_id == 5:
            profiler.start_profiler("All")
        elif pass_id == 0 and batch_id == 10:
            profiler.stop_profiler("total", "/tmp/profile")
        outs = exe.run(program=main_program,
                       feed=feeder.feed(data),
                       fetch_list=[out])
1. Run `python paddle/tools/timeline.py` to process `/tmp/profile`. By default it generates a `/tmp/timeline` file; you can change this path with a command-line argument. See [timeline.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/timeline.py) for details.
python Paddle/tools/timeline.py --profile_path=/tmp/profile --timeline_path=timeline
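1. Open the Chrome browser, visit <chrome://tracing/>, and use the `load` button to load the generated `timeline` file.
1. The result looks like the figure below; you can zoom in to see the details of the timeline.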
![chrome timeline](./timeline.jpeg)
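## Distributed usage
1. Enable the profiler on the trainer in the same way as step 1 of [Local usage](#local).
1. The pserver profiler can be enabled by adding two environment variables, for example: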
FLAGS_rpc_server_profile_period=10 FLAGS_rpc_server_profile_path=./tmp/pserver python train.py
3. Merge the profile files of the pservers and trainers into one timeline file, for example:
python /paddle/tools/timeline.py
--profile_path trainer0=local_profile_10_pass0_0,trainer1=local_profile_10_pass0_1,pserver0=./pserver_0,pserver1=./pserver_1
--timeline_path ./dist.timeline
4. Load the dist.timeline file in Chrome, in the same way as step 4 of [Local usage](#local).
# How to use the timeline tool for profiling
## <span id="local">Local</span>
1. Add `profiler.start_profiler(...)` and `profiler.stop_profiler(...)` to the main training loop. After the run, the code generates a profile record file `/tmp/profile`. **Warning**: Do not run too many batches when using the profiler to record timeline information, because the profile record grows with the number of batches.
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid import profiler
place = fluid.CPUPlace()
def reader():
    for i in range(100):
        yield [np.random.random([4]).astype('float32'), np.random.random([3]).astype('float32')],
main_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(main_program, startup_program):
    data_1 = fluid.layers.data(name='data_1', shape=[1, 2, 2])
    data_2 = fluid.layers.data(name='data_2', shape=[1, 1, 3])
    out = fluid.layers.fc(input=[data_1, data_2], size=2)
    # ...
feeder = fluid.DataFeeder([data_1, data_2], place)
exe = fluid.Executor(place)
# Initialize the parameters before training.
exe.run(startup_program)
pass_num = 10
for pass_id in range(pass_num):
    for batch_id, data in enumerate(reader()):
        # Profile only batches 5 through 9 of the first pass.
        if pass_id == 0 and batch_id == 5:
            profiler.start_profiler("All")
        elif pass_id == 0 and batch_id == 10:
            profiler.stop_profiler("total", "/tmp/profile")
        outs = exe.run(program=main_program,
                       feed=feeder.feed(data),
                       fetch_list=[out])
2. Run `python paddle/tools/timeline.py` to process `/tmp/profile`; by default it generates another
file, `/tmp/timeline`. You can change the path with a command-line parameter; see
[timeline.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/timeline.py) for details.
python Paddle/tools/timeline.py --profile_path=/tmp/profile --timeline_path=timeline
3. Open Chrome, visit <chrome://tracing/>, and use the `load` button to load the generated `timeline` file.
4. The resulting timeline should look like this:<a name="local_step_4"></a>
![chrome timeline](./timeline.jpeg)
## Distributed
This tool also supports distributed training programs (pserver and trainer).
1. Enable the trainer profiler in the same way as in [local](#local).
2. Enable the pserver profiler by adding two environment variables, e.g.:
FLAGS_rpc_server_profile_period=10 FLAGS_rpc_server_profile_path=./tmp/pserver python train.py
3. Merge the pservers' and trainers' profile files, e.g.:
python /paddle/tools/timeline.py
--profile_path trainer0=local_profile_10_pass0_0,trainer1=local_profile_10_pass0_1,pserver0=./pserver_0,pserver1=./pserver_1
--timeline_path ./dist.timeline
4. Load `dist.timeline` in chrome just like the [fourth step in Local](#local_step_4)
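With more than a handful of trainers and pservers, the `--profile_path` value in step 3 becomes tedious to type. The sketch below assembles it from name-to-file pairs. It is only an illustration: `build_profile_path` is a hypothetical helper, and the `name=path,name=path` format is inferred from the example command above rather than from the `timeline.py` source.

```python
from collections import OrderedDict


def build_profile_path(profiles):
    """Join name -> file pairs into the `name=path,name=path` form shown above."""
    for name, path in profiles.items():
        if "," in name or "=" in name or "," in path:
            raise ValueError(
                "names must not contain ',' or '=' and paths must not contain ','")
    return ",".join("%s=%s" % (name, path) for name, path in profiles.items())


# The four profile files from the example command above.
profiles = OrderedDict([
    ("trainer0", "local_profile_10_pass0_0"),
    ("trainer1", "local_profile_10_pass0_1"),
    ("pserver0", "./pserver_0"),
    ("pserver1", "./pserver_1"),
])

profile_arg = build_profile_path(profiles)
command = ["python", "/paddle/tools/timeline.py",
           "--profile_path", profile_arg,
           "--timeline_path", "./dist.timeline"]
print(" ".join(command))
```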
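# Switching devices at runtime
## How to avoid running out of GPU memory
In the example code below, the `size` parameter of the `embedding` layer contains two elements: the first is `vocab_size` (the vocabulary size) and the second is `emb_size` (the dimension of the `embedding` layer). In real scenarios the vocabulary can be very large. In the example code, the vocabulary size is set to 10000000. When running in GPU mode, the weight matrix created by this layer has shape (10000000, 150), so this layer alone needs 5.59G of GPU memory. If the vocabulary grows further, the program is very likely to run out of GPU memory.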
import paddle.fluid as fluid
data = fluid.layers.fill_constant(shape=[1], value=128, dtype='int64')
label = fluid.layers.fill_constant(shape=[1, 150], value=0.5, dtype='float32')
emb = fluid.embedding(input=data, size=(10000000, 150), dtype='float32')
out = fluid.layers.l2_normalize(x=emb, axis=-1)
cost = fluid.layers.square_error_cost(input=out, label=label)
avg_cost = fluid.layers.mean(cost)
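sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_cost)
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
# Initialize the parameters before running the main program.
exe.run(fluid.default_startup_program())
result = exe.run(fluid.default_main_program(), fetch_list=[avg_cost])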
To avoid this, use `device_guard` to place the `embedding` layer on the CPU, as in the code below; the other layers still run on the GPU.
import paddle.fluid as fluid
data = fluid.layers.fill_constant(shape=[1], value=128, dtype='int64')
label = fluid.layers.fill_constant(shape=[1, 150], value=0.5, dtype='float32')
with fluid.device_guard("cpu"):
    emb = fluid.embedding(input=data, size=(10000000, 150), dtype='float32')
out = fluid.layers.l2_normalize(x=emb, axis=-1)
cost = fluid.layers.square_error_cost(input=out, label=label)
avg_cost = fluid.layers.mean(cost)
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_cost)
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
# Initialize the parameters before running the main program.
exe.run(fluid.default_startup_program())
result = exe.run(fluid.default_main_program(), fetch_list=[avg_cost])
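The 5.59G figure quoted for the embedding layer can be checked with a few lines of arithmetic. The sketch below assumes a dense float32 weight matrix (4 bytes per element) and counts only the weights, not gradients or optimizer state; `embedding_bytes` is a hypothetical helper, not a Paddle API.

```python
def embedding_bytes(vocab_size, emb_size, bytes_per_element=4):
    """Size in bytes of a dense (vocab_size, emb_size) weight matrix."""
    return vocab_size * emb_size * bytes_per_element


# Values taken from the example: size=(10000000, 150), dtype='float32'.
size = embedding_bytes(10000000, 150)
size_gib = size / float(2 ** 30)
print("%d bytes = %.2f GiB" % (size, size_gib))
```

Doubling the vocabulary doubles this number, which is why the text warns that larger vocabularies quickly exceed the memory of a single GPU.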
## How to reduce data transfer
### Use the profile tool to confirm whether data transfer occurs
import paddle.fluid as fluid
import paddle.fluid.compiler as compiler
import paddle.fluid.profiler as profiler
data1 = fluid.layers.fill_constant(shape=[1, 3, 8, 8], value=0.5, dtype='float32')
data2 = fluid.layers.fill_constant(shape=[1, 3, 5, 5], value=0.5, dtype='float32')
shape = fluid.layers.shape(data2)
shape = fluid.layers.slice(shape, axes=[0], starts=[0], ends=[4])
out = fluid.layers.crop_tensor(data1, shape=shape)
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
compiled_prog = compiler.CompiledProgram(fluid.default_main_program())
with profiler.profiler('All', 'total') as prof:
    for i in range(10):
        result = exe.run(program=compiled_prog, fetch_list=[out])
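After the program finishes, the profile report is printed automatically. In the profile report below, `GpuMemCpy Summary` lists the time spent on two kinds of data transfer calls. During OP execution, if the device that holds an input Tensor differs from the device on which the OP runs, a `GpuMemcpySync` occurs; this is usually the item we can optimize directly. Looking further, `GpuMemcpySync` occurs during the execution of both `slice` and `crop_tensor`. Although the program is set to run in GPU mode, some OPs in the framework, such as shape, place their output on the CPU.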
-------------------------> Profiling Report <-------------------------
Note! This Report merge all thread info into one.
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Total time: 26.6328
Computation time Total: 13.3133 Ratio: 49.9884%
Framework overhead Total: 13.3195 Ratio: 50.0116%
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 30 Total: 1.47508 Ratio: 5.5386%
GpuMemcpyAsync Calls: 10 Total: 0.443514 Ratio: 1.66529%
GpuMemcpySync Calls: 20 Total: 1.03157 Ratio: 3.87331%
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
FastThreadedSSAGraphExecutorPrepare 10 9.16493 9.152509 (0.998645) 0.012417 (0.001355) 0.025192 8.85968 0.916493 0.344122
shape 10 8.33057 8.330568 (1.000000) 0.000000 (0.000000) 0.030711 7.99849 0.833057 0.312793
fill_constant 20 4.06097 4.024522 (0.991025) 0.036449 (0.008975) 0.075087 0.888959 0.203049 0.15248
slice 10 1.78033 1.750439 (0.983212) 0.029888 (0.016788) 0.148503 0.290851 0.178033 0.0668471
GpuMemcpySync:CPU->GPU 10 0.45524 0.446312 (0.980388) 0.008928 (0.019612) 0.039089 0.060694 0.045524 0.0170932
crop_tensor 10 1.67658 1.620542 (0.966578) 0.056034 (0.033422) 0.143906 0.258776 0.167658 0.0629515
GpuMemcpySync:GPU->CPU 10 0.57633 0.552906 (0.959357) 0.023424 (0.040643) 0.050657 0.076322 0.057633 0.0216398
Fetch 10 0.919361 0.895201 (0.973721) 0.024160 (0.026279) 0.082935 0.138122 0.0919361 0.0345199
GpuMemcpyAsync:GPU->CPU 10 0.443514 0.419354 (0.945526) 0.024160 (0.054474) 0.040639 0.059673 0.0443514 0.0166529
ScopeBufferedMonitor::post_local_exec_scopes_process 10 0.341999 0.341999 (1.000000) 0.000000 (0.000000) 0.028436 0.057134 0.0341999 0.0128413
eager_deletion 30 0.287236 0.287236 (1.000000) 0.000000 (0.000000) 0.005452 0.022696 0.00957453 0.010785
ScopeBufferedMonitor::pre_local_exec_scopes_process 10 0.047864 0.047864 (1.000000) 0.000000 (0.000000) 0.003668 0.011592 0.0047864 0.00179718
InitLocalVars 1 0.022981 0.022981 (1.000000) 0.000000 (0.000000) 0.022981 0.022981 0.022981 0.000862883
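### Locate the data transfer through the log
The example program above is simple, so the profile report alone tells us which operators trigger data transfer. For a more complex model, more detailed debugging information may be needed; you can print the runtime log to determine exactly where the transfer occurs. Taking the program above as an example again, running `GLOG_vmodule=operator=3 python test_case.py` produces the log below, which shows two data transfers:
- The output of `shape` is on the CPU; when `slice` runs, the output of `shape` is copied to the GPU.
- The result of `slice` is on the GPU; when `crop_tensor` runs, it is copied to the CPU.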
I0406 14:56:23.286592 17516 operator.cc:180] CUDAPlace(0) Op(shape), inputs:{Input[fill_constant_1.tmp_0:float[1, 3, 5, 5]({})]}, outputs:{Out[shape_0.tmp_0:int[4]({})]}.
I0406 14:56:23.286628 17516 eager_deletion_op_handle.cc:107] Erase variable fill_constant_1.tmp_0 on CUDAPlace(0)
I0406 14:56:23.286725 17516 operator.cc:1210] Transform Variable shape_0.tmp_0 from data_type[int]:data_layout[NCHW]:place[CPUPlace]:library_type[PLAIN] to data_type[int]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0406 14:56:23.286763 17516 scope.cc:169] Create variable shape_0.tmp_0
I0406 14:56:23.286784 17516 data_device_transform.cc:21] DeviceTransform in, src_place CPUPlace dst_place: CUDAPlace(0)
I0406 14:56:23.286867 17516 tensor_util.cu:129] TensorCopySync 4 from CPUPlace to CUDAPlace(0)
I0406 14:56:23.287099 17516 operator.cc:180] CUDAPlace(0) Op(slice), inputs:{EndsTensor[], EndsTensorList[], Input[shape_0.tmp_0:int[4]({})], StartsTensor[], StartsTensorList[]}, outputs:{Out[slice_0.tmp_0:int[4]({})]}.
I0406 14:56:23.287140 17516 eager_deletion_op_handle.cc:107] Erase variable shape_0.tmp_0 on CUDAPlace(0)
I0406 14:56:23.287220 17516 tensor_util.cu:129] TensorCopySync 4 from CUDAPlace(0) to CPUPlace
I0406 14:56:23.287473 17516 operator.cc:180] CUDAPlace(0) Op(crop_tensor), inputs:{Offsets[], OffsetsTensor[], Shape[slice_0.tmp_0:int[4]({})], ShapeTensor[], X[fill_constant_0.tmp_0:float[1, 3, 8, 8]({})]}, outputs:{Out[crop_tensor_0.tmp_0:float[1, 3, 5, 5]({})]}.
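For a long log, the copy events can be pulled out with a small script instead of being read by eye. The sketch below is only an illustration: `find_sync_copies` is a hypothetical helper, and the pattern is derived solely from the `TensorCopySync ... from ... to ...` lines shown above, so other Paddle versions may format these lines differently. The field after `TensorCopySync` is kept as a raw string because its meaning is not stated in the log.

```python
import re

# Matches lines such as:
#   ... TensorCopySync 4 from CPUPlace to CUDAPlace(0)
COPY_RE = re.compile(r"TensorCopySync (.+?) from (\S+) to (\S+)")


def find_sync_copies(log_text):
    """Return (raw_field, src_place, dst_place) for every TensorCopySync line."""
    return COPY_RE.findall(log_text)


# Three lines copied from the log above; only two of them are copy events.
sample_log = """\
I0406 14:56:23.286867 17516 tensor_util.cu:129] TensorCopySync 4 from CPUPlace to CUDAPlace(0)
I0406 14:56:23.287140 17516 eager_deletion_op_handle.cc:107] Erase variable shape_0.tmp_0 on CUDAPlace(0)
I0406 14:56:23.287220 17516 tensor_util.cu:129] TensorCopySync 4 from CUDAPlace(0) to CPUPlace
"""

copies = find_sync_copies(sample_log)
for raw_field, src, dst in copies:
    print("%s -> %s (%s)" % (src, dst, raw_field))
```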
### Use device_guard to avoid unnecessary data transfer
import paddle.fluid as fluid
import paddle.fluid.compiler as compiler
import paddle.fluid.profiler as profiler
data1 = fluid.layers.fill_constant(shape=[1, 3, 8, 8], value=0.5, dtype='float32')
data2 = fluid.layers.fill_constant(shape=[1, 3, 5, 5], value=0.5, dtype='float32')
shape = fluid.layers.shape(data2)
with fluid.device_guard("cpu"):
    shape = fluid.layers.slice(shape, axes=[0], starts=[0], ends=[4])
out = fluid.layers.crop_tensor(data1, shape=shape)
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
compiled_prog = compiler.CompiledProgram(fluid.default_main_program())
with profiler.profiler('All', 'total') as prof:
    for i in range(10):
        result = exe.run(program=compiled_prog, fetch_list=[out])
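Looking at `GpuMemCpy Summary` in the profile report again, we can see that `GpuMemcpySync` has been eliminated. In a real model, if `GpuMemcpySync` calls take a large share of the time and can be avoided by setting `device_guard`, doing so brings a performance improvement.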
-------------------------> Profiling Report <-------------------------
Note! This Report merge all thread info into one.
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Total time: 14.5345
Computation time Total: 4.47587 Ratio: 30.7948%
Framework overhead Total: 10.0586 Ratio: 69.2052%
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 10 Total: 0.457033 Ratio: 3.14447%
GpuMemcpyAsync Calls: 10 Total: 0.457033 Ratio: 3.14447%
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
FastThreadedSSAGraphExecutorPrepare 10 7.70113 7.689066 (0.998433) 0.012064 (0.001567) 0.032657 7.39363 0.770113 0.529852
fill_constant 20 2.62299 2.587022 (0.986287) 0.035968 (0.013713) 0.071097 0.342082 0.13115 0.180466
shape 10 1.93504 1.935040 (1.000000) 0.000000 (0.000000) 0.026774 1.6016 0.193504 0.133134
Fetch 10 0.880496 0.858512 (0.975032) 0.021984 (0.024968) 0.07392 0.140896 0.0880496 0.0605797
GpuMemcpyAsync:GPU->CPU 10 0.457033 0.435049 (0.951898) 0.021984 (0.048102) 0.037836 0.071424 0.0457033 0.0314447
crop_tensor 10 0.705426 0.671506 (0.951916) 0.033920 (0.048084) 0.05841 0.123901 0.0705426 0.0485346
slice 10 0.324241 0.324241 (1.000000) 0.000000 (0.000000) 0.024299 0.07213 0.0324241 0.0223084
eager_deletion 30 0.250524 0.250524 (1.000000) 0.000000 (0.000000) 0.004171 0.016235 0.0083508 0.0172365
ScopeBufferedMonitor::post_local_exec_scopes_process 10 0.047794 0.047794 (1.000000) 0.000000 (0.000000) 0.003344 0.014131 0.0047794 0.00328831
InitLocalVars 1 0.034629 0.034629 (1.000000) 0.000000 (0.000000) 0.034629 0.034629 0.034629 0.00238254
ScopeBufferedMonitor::pre_local_exec_scopes_process 10 0.032231 0.032231 (1.000000) 0.000000 (0.000000) 0.002952 0.004076 0.0032231 0.00221755
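### Summary
- Use the profile tool to analyze the model and check whether GpuMemcpySync calls take time. If they do, analyze further why the data transfer occurs.
- The profile report shows which OPs trigger GpuMemcpySync. If needed, print the log to find exactly where GpuMemcpySync occurs.
- Try using `device_guard` to set the device on which some OPs run, so as to reduce GpuMemcpySync calls.
- Finally, compare the profile reports of the model before and after the change, or other performance metrics, to confirm whether the change improved performance.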
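The first and last checks in this summary can be automated by parsing the `GpuMemCpy Summary` lines of two reports. The sketch below is only an illustration: `memcpy_summary` is a hypothetical helper, and the pattern follows the report layout shown in this document, which other Paddle versions may not share. The sample lines are copied from the two reports above.

```python
import re

# Matches summary lines such as:
#   GpuMemcpySync Calls: 20 Total: 1.03157 Ratio: 3.87331%
SUMMARY_RE = re.compile(
    r"^\s*(GpuMemcpy\w*)\s+Calls:\s*(\d+)\s+Total:\s*([\d.]+)\s+Ratio:\s*([\d.]+)%",
    re.MULTILINE)


def memcpy_summary(report_text):
    """Map each GpuMemcpy entry to its call count, total time (ms), and ratio (%)."""
    return {
        name: {"calls": int(calls), "total_ms": float(total), "ratio_pct": float(ratio)}
        for name, calls, total, ratio in SUMMARY_RE.findall(report_text)
    }


before = memcpy_summary("""\
GpuMemcpy Calls: 30 Total: 1.47508 Ratio: 5.5386%
GpuMemcpyAsync Calls: 10 Total: 0.443514 Ratio: 1.66529%
GpuMemcpySync Calls: 20 Total: 1.03157 Ratio: 3.87331%
""")
after = memcpy_summary("""\
GpuMemcpy Calls: 10 Total: 0.457033 Ratio: 3.14447%
GpuMemcpyAsync Calls: 10 Total: 0.457033 Ratio: 3.14447%
""")

print("sync copies before:", "GpuMemcpySync" in before)
print("sync copies after:", "GpuMemcpySync" in after)
```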
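# Inference with the Paddle-TensorRT library
NVIDIA TensorRT is a high-performance deep learning inference library that provides low latency and high throughput for deep learning inference applications. PaddlePaddle integrates TensorRT in the form of subgraphs, so we can use this module to improve the inference performance of Paddle models. The module is still under active development; the currently supported models are listed in the table below:
1. When compiling from source, the TensorRT inference library currently supports GPU builds only, and the build option TENSORRT_ROOT must be set to the path where TensorRT is located.
2. Windows support requires TensorRT 5.0 or later.
3. Paddle-TRT currently supports fixed input shapes only.
4. After downloading and installing TensorRT, you need to manually add virtual destructors to `class IPluginFactory` and `class IGpuAllocator` in the `NvInfer.h` file: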
``` c++
virtual ~IPluginFactory() {};
virtual ~IGpuAllocator() {};
## Contents
- [Paddle-TRT usage](#Paddle-TRT使用介绍)
- [Building and testing the Paddle-TRT sample](#Paddle-TRT样例编译测试)
- [Paddle-TRT INT8 usage](#Paddle-TRT_INT8使用)
- [How Paddle-TRT subgraphs work](#Paddle-TRT子图运行原理)
- [Paddle-TRT performance test](#Paddle-TRT性能测试)
## <a name="Paddle-TRT使用介绍">Paddle-TRT usage</a>
``` c++
config->EnableTensorRtEngine(1 << 20 /* workspace_size*/,
batch_size /* max_batch_size*/,
3 /* min_subgraph_size*/,
AnalysisConfig::Precision::kFloat32 /* precision*/,
false /* use_static*/,
false /* use_calib_mode*/);
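- **`workspace_size`**: type: int, default 1 << 20. Specifies the workspace size that TensorRT may use; TensorRT selects suitable kernels for inference within this limit.
- **`max_batch_size`**: type: int, default 1. The maximum batch size must be set in advance; the batch size at runtime must not exceed it.
- **`min_subgraph_size`**: type: int, default 3. Paddle-TRT runs in the form of subgraphs; to avoid performance loss, Paddle-TRT is used only when the number of nodes in a subgraph is greater than `min_subgraph_size`.
- **`precision`**: type: `enum class Precision {kFloat32 = 0, kHalf, kInt8,};`, default `AnalysisConfig::Precision::kFloat32`. Specifies the precision used by TRT; FP32 (kFloat32), FP16 (kHalf), and Int8 (kInt8) are supported. To use Paddle-TRT int8 offline quantization calibration, set `precision` to `AnalysisConfig::Precision::kInt8` and `use_calib_mode` to true.
- **`use_static`**: type: bool, default false. If set to true, the TRT optimization information is serialized to disk on the first run and loaded directly on later runs instead of being regenerated.
- **`use_calib_mode`**: type: bool, default false. Set this option to true to run Paddle-TRT int8 offline quantization calibration.
**Note:** Paddle-TRT currently supports only fixed-shape inputs; inputs with varying shapes are not supported.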
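## <a name="Paddle-TRT样例编译测试">Building and testing the Paddle-TRT sample</a>
1. Download or build the Paddle inference library with TensorRT support; see [Install and Compile C++ Inference Library](../../inference_deployment/inference/build_and_install_lib_cn.html).
3. Download the [inference sample](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz) and extract it, then enter the `sample/paddle-TRT` directory.
The `paddle-TRT` directory structure is as follows: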
├── CMakeLists.txt
├── mobilenet_test.cc
├── fluid_generate_calib_test.cc
├── fluid_int8_test.cc
├── mobilenetv1
│ ├── model
│ └── params
├── run.sh
└── run_impl.sh
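- `mobilenet_test.cc` is the C++ source file for inference with Paddle-TRT
- `fluid_generate_calib_test.cc` is the C++ source file for TRT int8 offline quantization calibration
- `fluid_int8_test.cc` is the C++ source file for int8 inference with TRT
- `mobilenetv1` is the model directory
- `run.sh` is the script that runs inference
Here we assume the sample is located in `SAMPLE_BASE_DIR/sample/paddle-TRT`
4. Configure the build and run script
# Set whether to enable MKL, GPU, and TensorRT; using TensorRT requires GPU to be enabled
# Set the inference library path, CUDA library path, CUDNN library path, TensorRT path, and model path according to your environment
5. Build and run the sample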
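## <a name="Paddle-TRT_INT8使用">Paddle-TRT INT8 usage</a>
1. Introduction to Paddle-TRT INT8
1) **Generate the calibration table**: prepare about 500 real input samples and feed them to the model. Paddle-TRT collects the range of the input and output values of each op in the model and records it in the calibration table. This information effectively reduces the information loss during model conversion.
2) After the calibration table is generated, run the model again; **Paddle-TRT loads the calibration table automatically** and runs inference in INT8 mode.
2. Build and test the INT8 sample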
``` shell
sh run.sh
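This runs the sample that generates the calibration table. In this sample we randomly generate 500 inputs to simulate the process; in real applications, we recommend using real samples. After the run finishes, a file named trt_calib_* appears in the `SAMPLE_BASE_DIR/sample/paddle-TRT/build/mobilenetv1/_opt_cache` model directory; this is the calibration table.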
``` shell
cp -rf SAMPLE_BASE_DIR/sample/paddle-TRT/build/mobilenetv1/ SAMPLE_BASE_DIR/sample/paddle-TRT/mobilenetv1_calib
``` shell
sh run.sh
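## <a name="Paddle-TRT子图运行原理">How Paddle-TRT subgraphs work</a>
PaddlePaddle integrates TensorRT in the form of subgraphs. After a model is loaded, the neural network can be represented as a computation graph composed of variables and operator nodes. Paddle TensorRT scans the whole graph, finds the subgraphs that TensorRT can optimize, and replaces them with TensorRT nodes. During inference, when Paddle encounters a TensorRT node it calls the TensorRT library to optimize that node, while the other nodes use Paddle's native implementation. During inference, TensorRT can fuse Ops horizontally and vertically, filter out redundant Ops, and select suitable kernels for specific Ops on specific platforms, which speeds up model inference.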
<p align="center">
<img src="https://raw.githubusercontent.com/NHZlX/FluidDoc/add_trt_doc/doc/fluid/user_guides/howto/inference/image/model_graph_original.png" width="600">
<p align="center">
<img src="https://raw.githubusercontent.com/NHZlX/FluidDoc/add_trt_doc/doc/fluid/user_guides/howto/inference/image/model_graph_trt.png" width="600">
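In the original network, green nodes are nodes supported by TensorRT, red nodes are variables in the network, and yellow nodes are nodes that can only be executed by Paddle's native implementation. The green nodes in the original network are extracted and grouped into a subgraph, which is replaced by a single TensorRT node: the `block-25` node in the transformed network. When this node is encountered while the network runs, Paddle calls the TensorRT library to execute it.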
## <a name="Paddle-TRT性能测试">Paddle-TRT performance test</a>
### Test environment
- CPU: Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz GPU: Tesla P4
- TensorRT4.0, CUDA8.0, CUDNNV7
- Test models: ResNet50, MobileNet, ResNet101, Inception V3.
### Test targets
**PaddlePaddle, Pytorch, Tensorflow**
- In the tests, PaddlePaddle integrates TensorRT through subgraph optimization; model [link](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models)
- Pytorch uses its native implementation; models [link 1](https://github.com/pytorch/vision/tree/master/torchvision/models), [link 2](https://github.com/marvis/pytorch-mobilenet)
- The TensorFlow tests cover both native TF and TF-TRT. **The TF-TRT tests did not achieve the expected results and will be supplemented later**; model [link](https://github.com/tensorflow/models)
#### ResNet50
|batch_size|PaddlePaddle|Pytorch|Tensorflow|
|---|---|---|---|
|1|4.64117 |16.3|10.878|
|5|6.90622| 22.9 |20.62|
|10|7.9758 |40.6|34.36|
#### MobileNet
|batch_size|PaddlePaddle|Pytorch|Tensorflow|
|---|---|---|---|
|1| 1.7541 | 7.8 |2.72|
|5| 3.04666 | 7.8 |3.19|
|10|4.19478 | 14.47 |4.25|
#### ResNet101
|batch_size|PaddlePaddle|Pytorch|Tensorflow|
|---|---|---|---|
|1|8.95767| 22.48 |18.78|
|5|12.9811 | 33.88 |34.84|
|10|14.1463| 61.97 |57.94|
#### Inception v3
|batch_size|PaddlePaddle|Pytorch|Tensorflow|
|---|---|---|---|
|1|15.1613 | 24.2 |19.1|
|5|18.5373 | 34.8 |27.2|
|10|19.2781| 54.8 |36.7|
# Use Paddle-TensorRT Library for inference
NVIDIA TensorRT is a platform for high-performance deep learning inference. It delivers low latency and high throughput for deep learning inference applications.
PaddlePaddle integrates TensorRT in the form of subgraphs, which allows the TensorRT module to improve the inference performance of Paddle models. The module is still under development. The currently supported models are as follows:
This document describes how to obtain and use the Paddle-TensorRT library, and how it works.
1. When compiling from source, the TensorRT library currently supports GPU builds only, and you need to set the compilation option TENSORRT_ROOT to the path where TensorRT is located.
2. Windows support requires TensorRT version 5.0 or higher.
3. Paddle-TRT currently only supports fixed input shape.
4. After downloading and installing TensorRT, you need to manually add virtual destructors for `class IPluginFactory` and `class IGpuAllocator` in the `NvInfer.h` file:
``` c++
virtual ~IPluginFactory() {};
virtual ~IGpuAllocator() {};
## <a name="Paddle-TRT interface usage">Paddle-TRT interface usage</a>
When using AnalysisPredictor, we enable Paddle-TRT by setting
``` c++
config->EnableTensorRtEngine(1 << 20 /* workspace_size*/,
batch_size /* max_batch_size*/,
3 /* min_subgraph_size*/,
AnalysisConfig::Precision::kFloat32 /* precision*/,
false /* use_static*/,
false /* use_calib_mode*/);
The parameters of this interface are as follows:
- **`workspace_size`**: type: int, default is 1 << 20. Sets the maximum workspace size of TRT. TensorRT chooses kernels under this constraint.
- **`max_batch_size`**: type: int, default is 1. Sets the maximum batch size. Batch sizes at runtime cannot exceed this value.
- **`min_subgraph_size`**: type: int, default is 3. TensorRT is integrated into PaddlePaddle in the form of subgraphs. To avoid performance loss, Paddle-TRT is used only when the number of nodes in the subgraph is greater than `min_subgraph_size`.
- **`precision`**: type: `enum class Precision {kFloat32 = 0, kHalf, kInt8,};`, default is `AnalysisConfig::Precision::kFloat32`. Sets the precision of TRT, supporting FP32 (kFloat32), FP16 (kHalf), and Int8 (kInt8). Using Paddle-TRT int8 calibration requires setting `precision` to `AnalysisConfig::Precision::kInt8` and `use_calib_mode` to true.
- **`use_static`**: type: bool, default is false. If set to true, Paddle-TRT serializes the optimization information to disk on the first run and loads it on later runs instead of optimizing again.
- **`use_calib_mode`**: type: bool, default is false. Using Paddle-TRT int8 calibration requires setting this option to true.
**Note:** Paddle-TRT currently only supports fixed input shape.
## <a name="Paddle-TRT example compiling test">Paddle-TRT example compiling test</a>
1. Download or compile Paddle Inference with TensorRT support, refer to [Install and Compile C++ Inference Library](../../inference_deployment/inference/build_and_install_lib_en.html).
2. Download NVIDIA TensorRT(with consistent version of cuda and cudnn in local environment) from [NVIDIA TensorRT](https://developer.nvidia.com/nvidia-tensorrt-download) with an NVIDIA developer account.
3. Download [Paddle Inference sample](https://paddle-inference-dist.bj.bcebos.com/tensorrt_test/paddle_inference_sample_v1.7.tar.gz) and uncompress, and enter `sample/paddle-TRT` directory.
The `paddle-TRT` directory structure is as follows:
├── CMakeLists.txt
├── mobilenet_test.cc
├── fluid_generate_calib_test.cc
├── fluid_int8_test.cc
├── mobilenetv1
│ ├── model
│ └── params
├── run.sh
└── run_impl.sh
- `mobilenet_test.cc` is the c++ source code of inference using Paddle-TRT
- `fluid_generate_calib_test.cc` is the c++ source code of inference using Paddle-TRT int8 calibration to generate calibration table
- `fluid_int8_test.cc` is the c++ source code of inference using Paddle-TRT int8
- `mobilenetv1` is the model dir
- `run.sh` is the script for running inference
Here we assume that the current directory is `SAMPLE_BASE_DIR/sample/paddle-TRT`.
# set whether to enable MKL, GPU or TensorRT. Enabling TensorRT requires WITH_GPU being ON
# set path to CUDA lib dir, CUDNN lib dir, TensorRT root dir and model dir
Please configure `run.sh` depending on your environment.
4. Build and run the sample.
sh run.sh
## <a name="Paddle-TRT INT8 usage">Paddle-TRT INT8 usage</a>
1. Paddle-TRT INT8 introduction
The parameters of a neural network are redundant to some extent. In many tasks, we can convert a Float32 model into an Int8 model while preserving accuracy. At present, Paddle-TRT supports converting a trained Float32 model into an Int8 model offline. The process is as follows:
1) **Create the calibration table**. Prepare about 500 real input samples and feed them to the model. Paddle-TRT collects the range of the input and output values of each op in the model and records it in the calibration table. This information reduces the information loss during model conversion.
2) After creating the calibration table, run the model again; **Paddle-TRT loads the calibration table automatically** and runs inference in INT8 mode.
2. Compile and test the INT8 example
Change `mobilenet_test` in `run.sh` to `fluid_generate_calib_test` and run
sh run.sh
The sample generates 500 input samples to simulate the process; we suggest using real data for your own experiments. After the run finishes, a new file named trt_calib_* appears under the `SAMPLE_BASE_DIR/sample/paddle-TRT/build/mobilenetv1/_opt_cache` directory. This file is the calibration table.
Then copy the model directory, which now contains the calibration information, to a new path:
cp -rf SAMPLE_BASE_DIR/sample/paddle-TRT/build/mobilenetv1/ SAMPLE_BASE_DIR/sample/paddle-TRT/mobilenetv1_calib
Change `fluid_generate_calib_test` in `run.sh` to `fluid_int8_test`, change the model directory path to `SAMPLE_BASE_DIR/sample/paddle-TRT/mobilenetv1_calib`, and run:
sh run.sh
## <a name="Paddle-TRT subgraph operation principle">Paddle-TRT subgraph operation principle</a>
Paddle-TRT integrates TensorRT in the form of subgraphs. After a model is loaded, the neural network is represented as a computation graph composed of variables and computing nodes. Paddle-TRT scans the whole graph, finds the subgraphs that TensorRT can optimize, and replaces each of them with a TensorRT node. During inference, Paddle calls the TensorRT library to optimize and run TensorRT nodes and uses its native implementation for the other nodes. TensorRT can fuse Ops horizontally and vertically, remove redundant Ops, and choose an appropriate kernel for each Op on the specific platform, which speeds up inference.
A simple model illustrates the process:
**Original Network**
<p align="center">
<img src="https://raw.githubusercontent.com/NHZlX/FluidDoc/add_trt_doc/doc/fluid/user_guides/howto/inference/image/model_graph_original.png" width="600">
**Transformed Network**
<p align="center">
<img src="https://raw.githubusercontent.com/NHZlX/FluidDoc/add_trt_doc/doc/fluid/user_guides/howto/inference/image/model_graph_trt.png" width="600">
In the original network, green nodes are nodes supported by TensorRT, red nodes are variables in the network, and yellow nodes are nodes that can only be executed by Paddle's native implementation. The green nodes in the original network are extracted into a subgraph, which is replaced by a single TensorRT node, shown as the `block-25` node in the transformed network. When this node is encountered at runtime, the TensorRT library is called to execute it.
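The replacement step can be pictured with a small sketch. It assumes, for simplicity, that the network is a linear sequence of ops (a real computation graph is a DAG), and it uses hypothetical op names and a hypothetical `partition_ops` helper; it is not Paddle's actual subgraph pass.

```python
def partition_ops(ops, supported):
    """Group maximal runs of supported ops in a linear op sequence.

    Returns a list whose items are either a native op name (str) or a
    tuple of op names that would be fused into one engine node.
    """
    result, run = [], []
    for op in ops:
        if op in supported:
            run.append(op)
        else:
            if run:
                # Close the current run of supported ops.
                result.append(tuple(run))
                run = []
            result.append(op)
    if run:
        result.append(tuple(run))
    return result

# Hypothetical network and hypothetical set of engine-supported ops.
ops = ["conv2d", "batch_norm", "relu", "custom_op", "fc", "softmax"]
supported = {"conv2d", "batch_norm", "relu", "fc", "softmax"}
plan = partition_ops(ops, supported)
```

In this toy example, `custom_op` stays a native node and splits the supported ops into two groups, each of which would become one engine node.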
## <a name="Paddle-TRT benchmark">Paddle-TRT benchmark</a>
### Test Environment
- CPU: Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz, GPU: Tesla P4
- TensorRT 4.0, CUDA 8.0, CUDNN V7
- models: ResNet50,MobileNet,ResNet101, Inception V3.
### Test set
**PaddlePaddle, Pytorch, Tensorflow**
- PaddlePaddle integrates TensorRT with subgraph, model [link](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification/models)
- Pytorch uses original kernels, model [link1](https://github.com/pytorch/vision/tree/master/torchvision/models), [link2](https://github.com/marvis/pytorch-mobilenet)
- We tested TF original and TF-TRT, model [link](https://github.com/tensorflow/models)
#### ResNet50
.. _api_guide_cpu_training_best_practice:
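Improving CPU utilization relies mainly on :code:`ParallelExecutor`, which can make full use of the computing power of multiple CPUs to speed up computation.
For detailed API usage, see :ref:`cn_api_fluid_ParallelExecutor` . A simple example: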
.. code-block:: python
# Configure the execution strategy, mainly to set the number of threads
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.num_threads = 8
# Configure the build strategy; for CPU training, use the Reduce mode
build_strategy = fluid.BuildStrategy()
if int(os.getenv("CPU_NUM")) > 1:
build_strategy.reduce_strategy = fluid.BuildStrategy.ReduceStrategy.Reduce
pe = fluid.ParallelExecutor(
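- :code:`num_threads` : the number of threads used for model training, preferably close to the number of physical CPU cores on the training machine
- :code:`reduce_strategy` : for CPU training, choose fluid.BuildStrategy.ReduceStrategy.Reduce
- :code:`CPU_NUM` : the number of model replicas, preferably the same as num_threads
The main way to reduce the amount of communicated data and improve communication speed is to use sparse updates. :ref:`api_guide_sparse_update` is currently supported mainly by :ref:`cn_api_fluid_layers_embedding` .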
.. code-block:: python
data = fluid.layers.data(name='ids', shape=[1], dtype='int64')
fc = fluid.layers.embedding(input=data, size=[dict_size, 16], is_sparse=True)
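- :code:`is_sparse` : configures the embedding to use sparse updates. If the dict_size of the embedding is large but each batch contains only a small amount of data, sparse updates are recommended.
To improve data IO speed in CPU distributed training, first consider reading data with the dataset API. Dataset is a multi-producer, multi-consumer data reading method. By default, the data reading threads are coupled with the training threads, and dataset shows a significant performance advantage in multi-threaded training.
Finally, use the :code:`train_from_dataset` interface to train the network: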
.. code-block:: python
dataset = fluid.DatasetFactory().create_dataset()
exe = fluid.Executor(fluid.CPUPlace())
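The key to further improving the speed of CPU distributed training is choosing an appropriate distributed training strategy, such as the communication strategy, the compilation strategy and the execution strategy. PaddlePaddle released the :code:`DistributedStrategy` feature in v1.7, which makes it flexible and convenient to specify the distributed running strategy.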
.. code-block:: python
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler.distributed_strategy_factory import DistributedStrategyFactory
.. code-block:: python
# step1: get a distributed training strategy for CPU
# Synchronous training strategy
strategy = DistributedStrategyFactory.create_sync_strategy()
# Half-asynchronous training strategy
strategy = DistributedStrategyFactory.create_half_async_strategy()
# Asynchronous training strategy
strategy = DistributedStrategyFactory.create_async_strategy()
# GEO training strategy
strategy = DistributedStrategyFactory.create_geo_strategy(update_frequency=400)
# step2: define the node role
role = role_maker.PaddleCloudRoleMaker()
# step3: build the distributed training program
optimizer = fluid.optimizer.SGD(learning_rate) # take the SGD optimizer as an example
optimizer = fleet.distributed_optimizer(optimizer, strategy)
# step4.1: start the parameter server node (Server)
if fleet.is_server():
# step4.2: start the training node (Trainer)
elif fleet.is_worker():
# Do training
- The build_strategy and exec_strategy needed to create compiled_program can be obtained directly from strategy
.. code-block:: python
compiled_program = fluid.compiler.CompiledProgram(fleet.main_program).with_data_parallel(
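- Training strategy details can be customized. Custom configuration is supported for DistributeTranspilerConfig, TrainerRuntimeConfig, ServerRuntimeConfig, fluid.ExecutionStrategy and fluid.BuildStrategy. Taking DistributeTranspilerConfig as an example, it can be modified as follows: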
.. code-block:: python
strategy = DistributedStrategyFactory.create_sync_strategy()
# Option 1 (recommended):
config = strategy.get_program_config()
config.min_block_size = 81920
# Option 2: call set_program_config to modify the network-building configuration; both DistributeTranspilerConfig and dict are supported
config = DistributeTranspilerConfig()
config.min_block_size = 81920
# config = dict()
# config['min_block_size'] = 81920
.. _api_guide_cpu_training_best_practice_en:
Best practices of distributed training on CPU
To improve the speed of distributed training on CPU, consider four aspects:
1. Improve the training speed, mainly by improving CPU utilization;
2. Improve the communication speed, mainly by reducing the amount of data transmitted during communication;
3. Improve the data IO speed by using the dataset API;
4. Improve the distributed training speed by changing the distributed training strategy.
Improve CPU utilization
Improving CPU utilization relies mainly on :code:`ParallelExecutor`, which can make full use of the computing power of multiple CPUs to speed up computation.
For detailed API usage, please refer to :ref:`api_fluid_ParallelExecutor` . A simple example:
.. code-block:: python
# Configure the execution strategy, mainly to set the number of threads
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.num_threads = 8
# Configure the build strategy; for CPU training, use the Reduce mode.
build_strategy = fluid.BuildStrategy()
if int(os.getenv("CPU_NUM")) > 1:
    build_strategy.reduce_strategy = fluid.BuildStrategy.ReduceStrategy.Reduce
pe = fluid.ParallelExecutor(
Among the parameters above:
- :code:`num_threads` : the number of threads used by the model training. It is preferably close to the number of the physical CPU cores of the machine where the training is performed.
- :code:`reduce_strategy` : For CPU training, you should choose fluid.BuildStrategy.ReduceStrategy.Reduce
Configuration of general environment variables:
- :code:`CPU_NUM`: The number of replicas of the model, preferably the same as num_threads
Improve communication speed
The main way to reduce the amount of communicated data and improve communication speed is to use sparse updates. Currently, `sparse update <../layers/sparse_update_en.html>`_ is supported mainly by :ref:`api_fluid_layers_embedding`.
.. code-block:: python
data = fluid.layers.data(name='ids', shape=[1], dtype='int64')
fc = fluid.layers.embedding(input=data, size=[dict_size, 16], is_sparse=True)
Among the parameters above:
- :code:`is_sparse`: configures the embedding to use sparse updates. If the dict_size of the embedding is large but each batch contains only a small amount of data, sparse updates are recommended.
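As a rough illustration of why sparse updates reduce traffic (the numbers are assumed for this example, not measured): take an embedding table with :math:`10^6` rows and 16 columns, and a batch that touches 1,000 distinct ids. A dense gradient covers every row, whereas a sparse gradient only covers the rows that were looked up:

.. math::

    \text{dense: } 10^{6} \times 16 = 1.6 \times 10^{7} \text{ values}, \qquad
    \text{sparse: } 10^{3} \times 16 = 1.6 \times 10^{4} \text{ values (plus } 10^{3} \text{ row indices)}

Under these assumptions, the sparse gradient is about three orders of magnitude smaller than the dense one.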
Improve data IO speed
To improve data IO speed in CPU distributed training, first consider using the dataset API to read data. Dataset is a multi-producer, multi-consumer data reading method. By default, the data reading threads and the training threads are coupled. In multi-threaded training, dataset shows a significant performance advantage.
Refer to this page for API introduction: https://www.paddlepaddle.org.cn/documentation/docs/en/api/dataset/QueueDataset.html
Combined with the actual model CTR-DNN, you can learn more about how to use dataset: https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleRec/ctr/dnn
Finally, use the :code:`train_from_dataset` interface to train the network:
.. code-block:: python
dataset = fluid.DatasetFactory().create_dataset()
exe = fluid.Executor(fluid.CPUPlace())
Change distributed training strategy
The key to further improving the speed of CPU distributed training is choosing an appropriate distributed training strategy, such as the communication strategy, the compilation strategy and the execution strategy. PaddlePaddle released the :code:`DistributedStrategy` API in v1.7, which makes it flexible and convenient to specify the distributed running strategy.
First, import the relevant libraries:
.. code-block:: python
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler.distributed_strategy_factory import DistributedStrategyFactory
At present, there are four training strategies: synchronous training, asynchronous training, half-asynchronous training and GEO training. For details of the different strategies, see the design documents.
The default configuration of each strategy can be obtained with the following code:
.. code-block:: python
# step1: get distributed strategy
# Sync
strategy = DistributedStrategyFactory.create_sync_strategy()
# Half-Async
strategy = DistributedStrategyFactory.create_half_async_strategy()
# Async
strategy = DistributedStrategyFactory.create_async_strategy()
# GEO
strategy = DistributedStrategyFactory.create_geo_strategy(update_frequency=400)
# step2: define role of node
role = role_maker.PaddleCloudRoleMaker()
# step3: get distributed training program
optimizer = fluid.optimizer.SGD(learning_rate) # take the SGD optimizer as an example
optimizer = fleet.distributed_optimizer(optimizer, strategy)
# step4.1: run parameter server node
if fleet.is_server():
# step4.2: run worker node
elif fleet.is_worker():
# Do training
PaddlePaddle supports adjusting the details of the training strategy:
- The build_strategy and exec_strategy used to create compiled_program can be obtained directly from strategy:
.. code-block:: python
compiled_program = fluid.compiler.CompiledProgram(fleet.main_program).with_data_parallel(
- Training strategy details can be customized. PaddlePaddle supports custom configuration of DistributeTranspilerConfig, TrainerRuntimeConfig, ServerRuntimeConfig, fluid.ExecutionStrategy and fluid.BuildStrategy. Taking DistributeTranspilerConfig as an example, it can be modified as follows:
.. code-block:: python
strategy = DistributedStrategyFactory.create_sync_strategy()
# Mode 1 (recommended):
config = strategy.get_program_config()
config.min_block_size = 81920
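# Mode 2: call set_program_config to modify the network-building configuration; both DistributeTranspilerConfig and dict are supported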
config = DistributeTranspilerConfig()
config.min_block_size = 81920
# config = dict()
# config['min_block_size'] = 81920
.. _best_practice_dist_training_gpu:
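PaddlePaddle Fluid supports high-performance distributed training on clusters of modern GPU [#]_ servers. The methods below can usually improve training performance in a multi-machine, multi-GPU environment. When tuning performance, check each optimization and verify the improvement it brings, so that the final performance improves.
A simple way to check whether the current training program needs further performance tuning is to look at the GPU compute utilization [#]_ , usually with the :code:`nvidia-smi` command. If GPU utilization is low, there may be considerable room for optimization. The following covers commonly used optimizations for distributed GPU training from three angles: data preparation, training strategy settings and training methods.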
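- Use :code:`DataLoader` . See `here <https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api_cn/io_cn/DataLoader_cn.html#dataloader>`_ for how to use DataLoader; enabling :code:`use_double_buffer` is recommended.
- Return uint8 data from the reader. Decoded images are generally stored as uint8. Converting them to float in the reader makes the data four times larger. Returning uint8 data directly and converting it to float on the GPU for training improves data reading efficiency.
- Reduce reader initialization time (infinite read). When the first pass of a training job starts, the reader begins to asynchronously read data from disk or other storage, preprocess it, and fill a queue with the processed data for computation. Filling this queue from empty until data can be supplied continuously takes some warm-up time, so refilling the queue in every pass adds overhead. When using DataLoader, you can therefore let the reader function keep producing data until the training loop ends: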
.. code-block:: python
def infinite_reader(file_path):
while True:
with open(file_path) as fn:
for line in fn:
yield process(line)
def train():
for pass_id in range(NUM_PASSES):
if pass_id == 0:
for batch_id in range(iters_per_pass):
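In addition, the DALI library can be used to improve data processing performance. DALI is a data loading library developed by NVIDIA; for more information, see the `official documentation <https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html>`_ . For how to use DALI with PaddlePaddle, see this `example <https://github.com/PaddlePaddle/Fleet/tree/develop/benchmark/collective/resnet>`_ .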
.. csv-table::
:header: "Option", "Type", "Default", "Description"
:widths: 3, 3, 3, 5
":code:`num_threads`", "int", "1", "Number of CPU threads"
":code:`nccl_comm_num`", "int", "1", "Number of NCCL communicators"
":code:`fuse_all_reduce_ops`", "bool", "False", "Fuse AllReduce operations in multi-GPU training"
":code:`use_hierarchical_allreduce` ", "bool", "False", "Hierarchical AllReduce"
":code:`num_iteration_per_drop_scope`", "int", "1", "Scope drop frequency: the number of batch iterations between scope cleanups"
":code:`fetch_frequency`", "int", "1", "Fetch refresh frequency"
":code:`fuse_bn_act_ops`", "bool", "False", "Whether to fuse batch normalization with activation functions"
":code:`fuse_elewise_add_act_ops`", "bool", "False", "Whether to fuse elementwise add with activation functions"
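- Setting a suitable number of CPU threads :code:`num_threads` and NCCL communicators :code:`nccl_comm_num` . PaddlePaddle Fluid uses a "thread pool" [#]_ model to schedule and execute Ops. An Op usually needs help from the CPU before it launches GPU computation; however, if the Op itself takes very little time, the thread pool model adds extra scheduling overhead. In multi-process mode, if the nodes of the network's computation graph [#]_ have a high degree of concurrency, using multiple threads can increase GPU utilization further even though each process runs on only one GPU. The number of NCCL communicators :code:`nccl_comm_num` can speed up communication between GPUs; we suggest 1 for a single machine and 2 for multiple machines. For the number of CPU threads :code:`num_threads` , we suggest 1 for a single machine and :code:`nccl_comm_num` +1 for multiple machines.
- AllReduce fusion :code:`fuse_all_reduce_ops` . By default, the AllReduce operations for the gradients of the parameters in the same layer are merged into one. For example, :code:`fluid.layers.fc` has two parameters, Weight and Bias; with this option enabled, the two AllReduce operations that were originally needed become a single one. In addition, to support gradient fusion at a larger granularity, Paddle provides two environment variable options, :code:`FLAGS_fuse_parameter_memory_size` and :code:`FLAGS_fuse_parameter_groups_size` . You can specify the number of gradient bytes of each fused AllReduce operation; for example, to transfer 16MB of gradients per AllReduce call, use :code:`export FLAGS_fuse_parameter_memory_size=16` . An empirical value is one tenth of the total communication volume. You can also specify the maximum number of layers per AllReduce operation, so that AllReduce is performed once this number of layers is reached; for example, for 50 layers use :code:`export FLAGS_fuse_parameter_groups_size=50` . Note: gradients of sparse parameters are currently not supported.
- Hierarchical AllReduce :code:`use_hierarchical_allreduce` . In multi-machine mode, Ring AllReduce is inefficient when communicating small amounts of data; Hierarchical AllReduce can solve this problem.
- Lowering the scope drop frequency :code:`num_iteration_per_drop_scope` and the fetch frequency :code:`fetch_frequency` . Dropping scopes and fetching less often reduces frequent allocation, release and copying of variable memory, which improves performance.
- Operation fusion: fusion can improve training performance.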
.. code-block:: python
dist_strategy = DistributedStrategy()
dist_strategy.nccl_comm_num = 2 # 2 is suggested for multiple machines, 1 for a single machine
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.num_threads = 3 # nccl_comm_num+1 is suggested for multiple machines, 1 for a single machine
exec_strategy.num_iteration_per_drop_scope = 30 # scope drop frequency
dist_strategy.exec_strategy = exec_strategy
dist_strategy.fuse_all_reduce_ops = True # whether to fuse AllReduce
with fluid.program_guard(main_prog, startup_prog): # build the network
params = model.params
optimizer = optimizer_setting(params)
dist_optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy)
for pass_id in range(PASS_NUM):
batch_id = 0
while True:
if batch_id % fetch_frequency == 0: # fetch frequency
fetched = exe.run(main_prog, fetch_list)
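1. Local SGD
In synchronous multi-machine, multi-GPU training, slow trainers are a problem: in every step, the faster trainers have to wait for the slower ones during synchronous communication. Because the rank of the slow trainer in each step is random, we use a locally asynchronous training method, Local SGD, which spreads the cost of slow trainers over several steps of asynchronous training (without blocking communication) and thereby improves synchronous training performance. Local SGD has three main parameters: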
.. csv-table::
:header: "Option", "Type", "Allowed values", "Description"
:widths: 3, 3, 3, 5
":code:`use_local_sgd`", "bool", "False/True", "Whether to enable Local SGD; disabled by default"
":code:`local_sgd_is_warm_steps`", "int", "Greater than 0", "Number of training rounds before Local SGD training starts"
":code:`local_sgd_steps`", "int", "Greater than 0", "Local SGD step length"
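- The Local SGD warm-up length :code:`local_sgd_is_warm_steps` affects the generalization ability of the final model. Local SGD training should generally start only after the model parameters have stabilized. As a rule of thumb, use the epoch at which the learning rate first decays as the warm-up length, and start Local SGD training after that.
- The Local SGD step length :code:`local_sgd_steps` . In general, the larger this value, the fewer the communications and the faster the training, but model accuracy drops as a result. An empirical setting is 2 or 4.
For concrete Local SGD training code, see: https://github.com/PaddlePaddle/Fleet/tree/develop/examples/local_sgd/resnet
V100 GPUs provide `Tensor Core <https://www.nvidia.com/en-us/data-center/tensorcore/>`_ , which can greatly improve performance in mixed-precision computation. For an example of mixed-precision computation, see: https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification#using-mixed-precision-training
Paddle currently provides mixed-precision implementations for only two models (ResNet, BERT) and supports static loss scaling. Other models can also use mixed precision by following the implementations above for validation.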
.. [#] Modern GPU: a GPU that supports at least `CUDA <https://developer.nvidia.com/cuda-downloads>`_ version 7.5 or later
.. [#] GPU utilization: here, the percentage of the GPU's computing capability that is in use
.. [#] https://en.wikipedia.org/wiki/Thread_pool
.. [#] https://en.wikipedia.org/wiki/Data-flow_diagram
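# Distributed GPU Training on Low-Bandwidth Networks
## 1. Background
Large-scale distributed training needs high network bandwidth to aggregate and update gradients, which limits the scalability of multi-node training and requires expensive high-bandwidth equipment. Distributed training becomes even worse in environments such as low-bandwidth cloud networks. The [Deep Gradient Compression](https://arxiv.org/abs/1712.01887) study shows that 99.9% of the gradient exchange in distributed SGD is redundant; deep gradient compression can select the important gradients for communication, reducing the communication volume and the dependence on bandwidth. Paddle currently implements DGC's sparse communication, which enables effective distributed GPU training on low-bandwidth networks. The following describes how to use DGC sparse communication, where it applies, and its basic principles.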
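As a rough illustration of selecting only the important gradients, the sketch below keeps the largest-magnitude entries of a gradient and drops the rest. The helper `top_k_sparsify` and the sample values are hypothetical and only illustrate the idea; they are not Paddle's DGC implementation.

```python
def top_k_sparsify(gradient, sparsity):
    """Keep the largest-magnitude entries of `gradient`.

    `sparsity` is the fraction of entries to drop; the result maps the
    index of each kept entry to its value.
    """
    keep = max(1, round(len(gradient) * (1.0 - sparsity)))
    ranked = sorted(range(len(gradient)), key=lambda i: abs(gradient[i]), reverse=True)
    return {i: gradient[i] for i in sorted(ranked[:keep])}

# Hypothetical gradient with 8 entries; 75% sparsity keeps 2 of them.
gradient = [0.01, -0.9, 0.02, 0.5, -0.03, 0.04, 0.0, -0.05]
sparse = top_k_sparsify(gradient, sparsity=0.75)
```

Only the index/value pairs in `sparse` would need to be communicated, instead of the full gradient.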
## 2. Usage
``` python
import paddle.fluid as fluid
# optimizer = fluid.optimizer.Momentum(learning_rate=0.001, momentum=0.9)
# Replace the Momentum optimizer and add the parameters required by DGC
optimizer = fluid.optimizer.DGCMomentumOptimizer(
    learning_rate=0.001, momentum=0.9, rampup_begin_step=0)
```
## 3. Tuning & Applicable Scenarios
### 3.1 Warm-up Tuning
<div align=center>
![DGC Resnet50 acc1](images/dgc_resnet50_acc1.png)
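The paper uses a strategy of gradually increasing sparsity: 75%, 93.75%, 98.4375%, 99.6%, 99.9%. Because Paddle uses AllGather for sparse gradient aggregation, the communication volume grows with the number of GPUs, so warm-up training at low sparsity is not recommended when there are many GPUs. For example, at 75% sparsity each GPU selects 25% of its gradients for communication; with 32 GPUs the communication volume is 32\*(1-0.75)=8 times that of normal dense communication, so it is better to use normal dense communication for the first few epochs. The configurations in this section can be used as a reference.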
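The communication estimate above can be reproduced with a few lines of Python. The helper `allgather_volume_ratio` is a hypothetical name, and the estimate ignores the cost of transmitting the indices of the selected gradients.

```python
def allgather_volume_ratio(num_cards, sparsity):
    """Communication volume relative to one dense gradient.

    With AllGather, every card receives the selected fraction
    (1 - sparsity) of the gradient from each of the `num_cards` cards.
    """
    return num_cards * (1.0 - sparsity)

ratio_75 = allgather_volume_ratio(32, 0.75)    # low sparsity during warm-up
ratio_999 = allgather_volume_ratio(32, 0.999)  # final sparsity
```

With 32 cards, 75% sparsity costs more than dense communication, whereas 99.9% sparsity needs only a small fraction of it.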
``` python
# 1. With 1252 steps per epoch, use normal dense communication for the first 2 epochs, then gradually raise sparsity to 99.9% over the next 3 epochs
optimizer = fluid.optimizer.DGCMomentumOptimizer(
    learning_rate=0.001, momentum=0.9, rampup_begin_step=1252*2,
    rampup_step=1252*3, sparsity=[0.984375, 0.996, 0.999])
# 2. Use dense communication for the first 4 epochs, then run at the default sparsity of 0.999
optimizer = fluid.optimizer.DGCMomentumOptimizer(
    learning_rate=0.001, momentum=0.9, rampup_begin_step=1252*4)
```
``` python
# Start DGC sparse communication from step 0
optimizer = fluid.optimizer.DGCMomentumOptimizer(
    learning_rate=0.001, momentum=0.9, rampup_begin_step=0)
```
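### 3.2 Applicable Scenarios
## 4. Principles
The principles in this section come mostly from the [Deep Gradient Compression](https://arxiv.org/abs/1712.01887) paper, partly interpreted and translated here. Readers are encouraged to read the paper directly.
### 4.1 Gradient Sparsification
From a theoretical point of view, local gradient accumulation is equivalent to increasing the batch size over time (in DGC, each gradient effectively has its own batch size). Let $F(w)$ be the loss function to optimize; the update rule of synchronous distributed SGD with N training nodes is then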
$$F(w)=\frac{1}{|\chi|}\sum_{x\in\chi}f(x, w), \qquad w_{t+1}=w_{t}-\eta\frac{1}{N b}\sum_{k=1}^{N}\sum_{x\in\mathcal{B}_{k,t}}\nabla f\left(x, w_{t}\right) \tag{1}$$
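Here $\chi$ is the training set, $w$ denotes the network weights, $f(x, w)$ is the loss of each sample $x \in \chi$, $\eta$ is the learning rate, N is the number of training nodes, and $\mathcal{B}_{k, t}$ is the minibatch of size b on the $k$-th node at iteration $t$.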
$$w_{t+T}^{(i)}=w_{t}^{(i)}-\eta T \cdot \frac{1}{N b T} \sum_{k=1}^{N}\left(\sum_{\tau=0}^{T-1} \sum_{x \in \mathcal{B}_{k, t+\tau}} \nabla^{(i)} f\left(x, w_{t+\tau}\right)\right) \tag{2}$$
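Equation 2 shows that local gradient accumulation can be seen as increasing the batch size from $Nb$ to $NbT$, where T is the sparse communication interval between two updates of $w^{(i)}$.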
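A small simulation may help to see why local accumulation loses no gradient information: values that are not sent stay in a local residual and are sent later. The sketch assumes a single node, a fixed number of entries sent per step, and a hypothetical helper `accumulate_and_send`; it is not Paddle's implementation.

```python
def accumulate_and_send(residual, gradient, keep):
    """Add the new gradient to the local residual, then send the `keep`
    largest-magnitude entries and clear them from the residual."""
    for i, g in enumerate(gradient):
        residual[i] += g
    ranked = sorted(range(len(residual)), key=lambda i: abs(residual[i]), reverse=True)
    sent = {}
    for i in ranked[:keep]:
        sent[i] = residual[i]
        residual[i] = 0.0
    return sent

# Hypothetical gradients of three consecutive steps for a 3-element parameter.
gradients = [[0.5, 0.125, 0.25], [0.25, 0.125, 0.5], [0.125, 0.25, 0.125]]
residual = [0.0, 0.0, 0.0]
applied = [0.0, 0.0, 0.0]  # what the receiving side has accumulated so far
for grad in gradients:
    for i, value in accumulate_and_send(residual, grad, keep=1).items():
        applied[i] += value
```

After the loop, `applied` plus `residual` equals the element-wise sum of all gradients, so every gradient value is either already applied or still waiting locally.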
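### 4.2 Improvements to Local Gradient Accumulation
Normally, sparse updates severely hurt convergence. DGC uses momentum correction and local gradient clipping to solve this problem.
#### 4.2.1 Momentum Correction
Vanilla momentum SGD in distributed training with N nodes is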
$$u_{t}=m u_{t-1}+\sum_{k=1}^{N}\left(\nabla_{k, t}\right), \quad w_{t+1}=w_{t}-\eta u_{t} \tag{3}$$
where $m$ is the momentum factor, $N$ is the number of nodes, and $\nabla_{k, t}=\frac{1}{N b} \sum_{x \in \mathcal{B}_{k, t}} \nabla f\left(x, w_{t}\right)$.
$$w_{t+T}^{(i)}=w_{t}^{(i)}-\eta\left[\cdots+\left(\sum_{\tau=0}^{T-2} m^{\tau}\right) \nabla_{k, t+1}^{(i)}+\left(\sum_{\tau=0}^{T-1} m^{\tau}\right) \nabla_{k, t}^{(i)}\right] \tag{4}$$
$$v_{k, t}=v_{k, t-1}+\nabla_{k, t}, \quad u_{t}=m u_{t-1}+\sum_{k=1}^{N} \operatorname{sparse}\left(v_{k, t}\right), \quad w_{t+1}=w_{t}-\eta u_{t} \tag{5}$$
$$w_{t+T}^{(i)}=w_{t}^{(i)}-\eta\left(\cdots+\nabla_{k, t+1}^{(i)}+\nabla_{k, t}^{(i)}\right) \tag{6}$$
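Compared with conventional momentum SGD, Equation 6 lacks the accumulated discounting factor $\sum_{\tau=0}^{T-1} m^{\tau}$, which causes a loss of convergence accuracy. As shown in figure A below, a normal gradient update moves from point A to point B, whereas Equation 6 moves from point A to point C. When sparsity is high, this significantly degrades model performance, so the gradient needs to be corrected on top of Equation 5.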
<div align=center>
<img src=./images/dgc_without_momentum_correction.png width=400>
<img src=./images/dgc_with_momentum_correction.png width=400>
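If the velocity term $u_t$ in Equation 3 is treated as the "gradient", the second part of Equation 3 can be seen as applying conventional SGD to the "gradient" $u_t$, and local gradient accumulation has already been shown to work for conventional SGD. Equation 5 can therefore be corrected by locally accumulating the velocity term $u_t$ from Equation 3 instead of the true gradient $\nabla_{k, t}$: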
$$u_{k, t}=m u_{k, t-1}+\nabla_{k, t}, \quad v_{k, t}=v_{k, t-1}+u_{k, t}, \quad w_{t+1}=w_{t}-\eta \sum_{k=1}^{N} \operatorname{sparse}\left(v_{k, t}\right) \tag{7}$$
#### 4.2.2 Local Gradient Clipping
$$thr_{G^{k}}=N^{-1 / 2} \cdot thr_{G} \tag{8}$$
### 4.3 Overcoming the Staleness Effect
#### 4.3.1 Momentum Factor Masking
$$Mask \leftarrow\left|v_{k, t}\right|>thr, \quad v_{k, t} \leftarrow v_{k, t} \odot \neg Mask, \quad u_{k, t} \leftarrow u_{k, t} \odot \neg Mask \tag{9}$$
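Equations 7 and 9 can be combined into one local update per node. The following is a minimal single-node sketch with a fixed threshold `threshold` (in DGC the threshold follows from the target sparsity); the function name `dgc_local_step` and the sample values are assumptions made for illustration.

```python
def dgc_local_step(u, v, grad, m, threshold):
    """One local step on a node: momentum correction (Eq. 7) followed by
    momentum factor masking (Eq. 9). `u` and `v` are updated in place;
    the returned list holds the values selected for communication."""
    sent = [0.0] * len(grad)
    for i, g in enumerate(grad):
        u[i] = m * u[i] + g        # local velocity
        v[i] = v[i] + u[i]         # accumulate velocity, not the raw gradient
        if abs(v[i]) > threshold:  # entry is selected for communication
            sent[i] = v[i]
            v[i] = 0.0             # clear the accumulated velocity
            u[i] = 0.0             # masking: also clear the momentum
    return sent

u, v = [0.0, 0.0], [0.0, 0.0]
first = dgc_local_step(u, v, [1.0, 0.25], m=0.5, threshold=1.0)
second = dgc_local_step(u, v, [1.0, 0.25], m=0.5, threshold=1.0)
```

In this example nothing is sent in the first step; in the second step the first entry exceeds the threshold, is sent, and has both its accumulated velocity and its momentum reset.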
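#### 4.3.2 Warm-up Training
### 4.4 Correction of the Regularization (Weight Decay) Term
Paddle implements regularization in the form of weight decay. Taking L2Decay as an example, conventional momentum in Equation 3 with weight decay added becomes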
$$G_{t}=\sum_{k=1}^{N}\left(\nabla_{k, t}\right)+\lambda w_{t}, \quad u_{t}=m u_{t-1}+G_{t}, \quad w_{t+1}=w_{t}-\eta u_{t} \tag{10}$$
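where $\lambda$ is the weight decay coefficient and $G_{t}$ is the aggregated gradient after adding the L2Decay term. Because Equation 7 applies a local momentum correction, the same idea is used to apply a corrected weight decay term to the local gradient: add a local weight decay term to the local gradient, as in the following equation.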
$$\nabla_{k, t}=\nabla_{k, t}+\frac{\lambda}{N} w_{t} \tag{11}$$
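In practice, the weight decay coefficient is usually set to $\lambda=10^{-4}$. With many GPUs, for example 4 machines with 32 GPUs, the local weight decay coefficient is $\frac{\lambda}{N}=\frac{10^{-4}}{32}=3.125 \times 10^{-6}$, which is low in numerical precision and costs some accuracy in training tests. The local weight decay term therefore needs a numerical correction, as in the following equation: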
$$\nabla_{k, t}^{'}=N \nabla_{k, t}+\lambda w_{t}, \quad G_{t}^{'}=\sum_{k=1}^{N}\left(\nabla_{k, t}^{'}\right)=N\sum_{k=1}^{N}\left(\nabla_{k, t}\right)+N\lambda w_{t}, \quad G_{t}=\frac{G_{t}^{'}}{N}=\sum_{k=1}^{N}\left(\nabla_{k, t}\right)+\lambda w_{t} \tag{12}$$
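Concretely, multiply the local gradient by the number of GPUs to obtain $\nabla_{k, t}^{'}$; the $\lambda$ term then no longer needs to be divided by the number of GPUs. Aggregate the gradients to obtain $G_{t}^{'}$, and divide the aggregated gradient by the number of GPUs to obtain $G_{t}$.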
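The equivalence of Equations 10 and 12 can be checked numerically. The helper names and the sample values below are hypothetical and serve only as an illustration.

```python
def aggregate_eq10(local_grads, lam, w):
    """Aggregated gradient with weight decay, as in Equation 10."""
    return sum(local_grads) + lam * w

def aggregate_eq12(local_grads, lam, w):
    """Same quantity computed as in Equation 12: scale each local gradient
    by N, add the undivided decay term locally, aggregate, divide by N."""
    n = len(local_grads)
    scaled = [n * g + lam * w for g in local_grads]
    return sum(scaled) / n

# Hypothetical local gradients of one weight on 4 cards.
local_grads = [0.25, 0.5, -0.125, 0.375]
g10 = aggregate_eq10(local_grads, lam=1e-4, w=2.0)
g12 = aggregate_eq12(local_grads, lam=1e-4, w=2.0)
```

Both functions return the same value up to floating-point rounding, while the second one never has to represent the small quantity $\lambda / N$ on its own.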
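As the amount of training data grows, training larger and deeper deep learning models has become a mainstream trend. Current deep learning training usually has to keep the hidden-layer results of the forward computation, and the number of results to keep grows linearly with the number of model layers, which is a challenge for the memory size of currently available AI chips. Forward Recomputation Backpropagation (FRB) can significantly increase the number of layers and the width of a model, as well as the training batch size, at the cost of a small amount of extra computation.
- **Forward computation**: run forward operators (Operator) to compute the values of intermediate hidden layers (Variable)
- **Backward computation**: run backward operators to compute the gradients of parameters (Parameter)
- **Optimization**: apply an optimization algorithm to update the parameter values
occupy a large amount of memory. Paddle's `Garbage Collection mechanism <https://paddlepaddle.org.cn/documentation/docs/zh/advanced_usage/best_practice/memory_optimize.html>`_
Take a simple example: we define a network made of mul operators, whose forward computation is: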
.. math::
y = W_1 * x
z = W_2 * y
where :math:`x, y, z` are vectors and :math:`W_1, W_2` are matrices. It is easy to see that the backward computation for the gradient of :math:`W_2` is:
.. math::
W_{2}^{'} = z^{'} / y
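The backward computation uses the variable :math:`y` produced by the forward computation, so :math:`y` must stay in memory until this backward operator has finished. As the model gets deeper, there are many such " :math:`y` " variables, occupying a large amount of memory.
The idea of Forward Recomputation Backpropagation (FRB) is to split the deep learning network into k parts (segments). For each segment: in the forward computation, all intermediate results are deleted except for a small number of Variables that must stay in memory (these special Variables are discussed later); in the backward computation, the forward operators are first recomputed to obtain the intermediate results, and then the backward operators are run. In short, compared with a normal network iteration, FRB computes the forward operators one extra time.
So how should checkpoints be chosen? Since the FRB method was proposed \ :sup:`[1], [2]`, many researchers have studied this key question.
Mitsuru Kusumoto \ :sup:`[3]` et al. proposed an algorithm based on dynamic programming,
The figure below shows a network made of 4 fc layers, 3 relu layers, 1 sigmoid layer and 1 log-loss layer in series: the left column is its forward computation, the middle column is the normal forward and backward computation, and the right column is the forward and backward computation with FRB. Boxes represent operators (Operator), red dots represent intermediate results of the forward computation, and blue dots represent checkpoints.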
.. image:: images/recompute.png
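Note: the complete code of this example is at `source <https://github.com/PaddlePaddle/examples/blob/master/community_examples/recompute/demo.py>`_
You can learn more from its `source code <https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/optimizer.py>`_
`document <https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/optimizer_cn/RecomputeOptimizer_cn.html>`_
calling it directly and using it in the Fleet API. For single-machine, single-GPU training or CPU training, we suggest calling RecomputeOptimizer directly;
for multi-GPU or multi-machine training, we suggest using Recompute in the Fleet API.
**1. Direct call**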
.. code-block:: python
import paddle.fluid as fluid
# Define the network
def mlp(input_x, input_y, hid_dim=128, label_dim=2):
fc_1 = fluid.layers.fc(input=input_x, size=hid_dim)
prediction = fluid.layers.fc(input=[fc_1], size=label_dim, act='softmax')
cost = fluid.layers.cross_entropy(input=prediction, label=input_y)
sum_cost = fluid.layers.reduce_mean(cost)
return sum_cost, fc_1, prediction
input_x = fluid.layers.data(name="x", shape=[32], dtype='float32')
input_y = fluid.layers.data(name="y", shape=[1], dtype='int64')
cost, fc_1, pred = mlp(input_x, input_y)
# Define RecomputeOptimizer
sgd = fluid.optimizer.Adam(learning_rate=0.01)
sgd = fluid.optimizer.RecomputeOptimizer(sgd)
# Set checkpoints
sgd._set_checkpoints([fc_1, pred])
# Run the optimization algorithm
**2. Using Recompute in the Fleet API**
`Fleet API <https://github.com/PaddlePaddle/Fleet>`_
is a high-level API for distributed computing based on Fluid. Adding RecomputeOptimizer in the Fleet API
- set dist_strategy.forward_recompute to True;
- set dist_strategy.recompute_checkpoints.
.. code-block:: python
from paddle.fluid.incubate.fleet.collective import fleet, DistributedStrategy
dist_strategy = DistributedStrategy()
dist_strategy.forward_recompute = True
optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy)
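To help you get started quickly with Recompute in the Fleet API, we provide some examples,
- Bert fine-tuning with Recompute: `source <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/recompute/bert>`_
- Object detection with Recompute: under development.
- **Are Ops with randomness supported?**
dropout Operator, and recomputation is guaranteed to give the same result as the first computation.
- **Are there more official Recompute examples?**
More Recompute examples will be added under `examples <https://github.com/PaddlePaddle/examples/tree/master/community_examples/recompute>`_
and `Fleet <https://github.com/PaddlePaddle/Fleet>`_ ; stay tuned.
- **Any suggestions for adding checkpoints?**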
[1] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin . Training deep nets with sublinear memory cost.
arXiv preprint, arXiv:1604.06174, 2016.
[2] Audrunas Gruslys , Rémi Munos , Ivo Danihelka , Marc Lanctot , and Alex Graves. Memory efficient
backpropagation through time. In Advances in Neural Information Processing Systems (NIPS), pages 4125 4133,
[3] Kusumoto, Mitsuru, et al. "A Graph Theoretic Framework of Recomputation Algorithms for Memory-Efficient Backpropagation." arXiv preprint arXiv:1905.11722 (2019).
Recompute: Training with bigger batch size
As the amount of training data increases, training deeper neural network models becomes more and more popular. Current deep-learning training usually keeps the hidden layer outputs in memory during the forward propagation,
and the number of outputs increases linearly with
the increase of the number of model layers,
which becomes a challenge of the memory size
for common devices.
As we know, a training process of a deep-learning network contains 3 steps:
- **Forward Propagation**: run forward operators and generate temporary variables as output
- **Backward Propagation**: run backward operators to compute the gradients of parameters
- **Optimization**: apply an optimization algorithm to update parameters
When the model becomes deeper, the number of temporary variables
generated in the forward propagation process can reach tens
of thousands, occupying a large amount of memory.
The `Garbage Collection mechanism <https://paddlepaddle.org.cn/documentation/docs/zh/advanced_usage/best_practice/memory_optimize.html>`_
in Paddle can delete useless variables for the sake of saving memory.
However, some variables serve as inputs of backward operators,
so they must be kept in memory until the corresponding operator finishes.
As a simple example, define a network that contains two `mul` operators;
the forward propagation works as follows:
.. math::
y = W_1 * x
z = W_2 * y
where :math:`x, y, z` are vectors and :math:`W_1, W_2` are matrices. It is easy to derive that the gradient of :math:`W_2` is:
.. math::
W_{2}^{'} = z^{'} / y
We can see that :math:`y` is used in the backward propagation process,
so it must be kept in memory until that backward operator has finished.
As the network grows deeper, more such variables need to be stored,
which increases the memory requirement.
Forward Recomputation Backpropagation(FRB) splits a deep network to k segments.
For each segment, in forward propagation,
most of the temporary variables are erased in time,
except for some special variables (we will talk about that later);
in backward propagation, the forward operators will be recomputed
to get these temporary variables before running backward operators.
In short, FRB runs the forward operators twice.
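A rough cost model, following the analysis in Chen et al. (2016) and assuming a chain of :math:`n` layers whose outputs all have the same size, shows why the number of segments matters. With :math:`k` equal segments, the :math:`k` checkpoints are kept in memory and at most one segment of :math:`n/k` intermediate results is recomputed at a time:

.. math::

    M(k) \approx k + \frac{n}{k}, \qquad
    \frac{dM}{dk} = 1 - \frac{n}{k^{2}} = 0 \;\Rightarrow\; k = \sqrt{n}, \qquad
    M(\sqrt{n}) \approx 2\sqrt{n}

Under these assumptions, memory grows with :math:`\sqrt{n}` rather than :math:`n`, at the price of one extra forward pass.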
But how to split the network? A deep learning network usually consists
of connecting modules in series:
ResNet-50 contains 16 blocks and Bert-Large contains 24 transformers.
It is a good choice to treat such modules as segments.
The variables between segments are
called checkpoints.
The following picture is a network with 4 fc layers, 3 relu layers,
1 sigmoid layer and 1 log-loss layer in series.
The left column is the forward propagation,
the middle column is the normal backward propagation,
and the right column is the FRB.
Rectangular boxes represent the operators, red dots represent
the intermediate variables in forward computation, blue dots
represent checkpoints and arrows represent the dependencies between operators.
.. image:: images/recompute.png
Note: the complete source code of this example: `source <https://github.com/PaddlePaddle/examples/blob/master/community_examples/recompute/demo.py>`_
After applying FRB, the forward computation only needs to store
2 variables (the blue dots) instead of 4 variables (the red
dots), saving the corresponding memory. Note that
recomputing operators also generates new intermediate variables,
so there is a trade-off to consider.
According to our experiments, however,
FRB usually reduces rather than increases the memory load.
We have implemented the FRB algorithm named "RecomputeOptimizer"
based on Paddle. More information about this algorithm can
be learned by the `source code <https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/optimizer.py>`_
and the
`document <https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/optimizer_cn/RecomputeOptimizer_cn.html>`_
of RecomputeOptimizer.
There are two ways to apply RecomputeOptimizer in your Paddle
program: call RecomputeOptimizer directly or use it with the Fleet
API. For single-GPU or CPU training, we recommend
calling it directly; for multi-GPU training, we
recommend using it with the Fleet API.
**1. Directly calling**
Calling RecomputeOptimizer is very easy: first, define a classic
optimizer, such as Adam; second, wrap it with RecomputeOptimizer;
third, set the checkpoints.
.. code-block:: python
import paddle.fluid as fluid
# Define the network
def mlp(input_x, input_y, hid_dim=128, label_dim=2):
fc_1 = fluid.layers.fc(input=input_x, size=hid_dim)
prediction = fluid.layers.fc(input=[fc_1], size=label_dim, act='softmax')
cost = fluid.layers.cross_entropy(input=prediction, label=input_y)
sum_cost = fluid.layers.reduce_mean(cost)
return sum_cost, fc_1, prediction
input_x = fluid.layers.data(name="x", shape=[32], dtype='float32')
input_y = fluid.layers.data(name="y", shape=[1], dtype='int64')
cost, fc_1, pred = mlp(input_x, input_y)
# define RecomputeOptimizer
sgd = fluid.optimizer.Adam(learning_rate=0.01)
sgd = fluid.optimizer.RecomputeOptimizer(sgd)
# set checkpoints
sgd._set_checkpoints([fc_1, pred])
# apply optimization
In principle, Recompute works with all kinds of optimizers in Paddle.
**2. Using Recompute in Fleet API**
`Fleet API <https://github.com/PaddlePaddle/Fleet>`_
is a high-level API for distributed training in Fluid. Adding
RecomputeOptimizer in the Fleet API takes two steps:
- set dist_strategy.forward_recompute to True
- set dist_strategy.recompute_checkpoints
.. code-block:: python
from paddle.fluid.incubate.fleet.collective import fleet, DistributedStrategy
dist_strategy = DistributedStrategy()
dist_strategy.forward_recompute = True
optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy)
We supply some examples of using recompute in Fleet API for users.
We also post corresponding training speed,
test results and memory usages of these examples for reference.
- Fine-tuning Bert Large model with recomputing: `source <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/recompute/bert>`_
- Training object detection models with recomputing: under development.
- **Does RecomputeOptimizer support operators with random outputs?**
The dropout operator is the only operator with random results that we have found so far,
and RecomputeOptimizer keeps the outputs of the
first computation and the recomputation consistent.
- **Are there more official examples of Recompute?**
More examples will be added to `examples <https://github.com/PaddlePaddle/examples/tree/master/community_examples/recompute>`_
and `Fleet <https://github.com/PaddlePaddle/Fleet>`_ . Feel free to
open an issue if you run into any problem with these examples.
- **How should I set checkpoints?**
The position of the checkpoints matters:
we suggest choosing the variables that sit between sub-models,
that is, make a variable a checkpoint if it
separates the network into two parts with no shortcut connections between them.
The number of checkpoints matters as well:
too few checkpoints reduce the memory saved by recomputing, while
too many checkpoints occupy a lot of memory themselves.
We will add a tool that estimates the memory usage for a given set of checkpoints
to help users choose checkpoint variables.
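
The trade-off in the number of checkpoints can be illustrated with a toy cost model. This is only a sketch under simplifying assumptions, not the estimation tool mentioned above: ``peak_activation_units`` is a hypothetical helper, every layer output is given an abstract size, checkpointed outputs stay in memory, and the outputs between two checkpoints are recomputed one segment at a time.

```python
def peak_activation_units(layer_sizes, checkpoints):
    """Toy estimate of peak activation memory under recomputation.

    Checkpointed activations stay resident; the activations between two
    checkpoints are recomputed one segment at a time during backward.
    """
    checkpoints = set(checkpoints)
    kept = sum(layer_sizes[i] for i in checkpoints)
    largest_segment = 0
    current = 0
    for i, size in enumerate(layer_sizes):
        if i in checkpoints:
            # A checkpoint closes the current segment.
            largest_segment = max(largest_segment, current)
            current = 0
        else:
            current += size
    largest_segment = max(largest_segment, current)
    return kept + largest_segment


layers = [1] * 16                                      # 16 layers of equal size
few = peak_activation_units(layers, [])                # no checkpoints
balanced = peak_activation_units(layers, [3, 7, 11])   # a few, evenly spaced
many = peak_activation_units(layers, range(16))        # every layer is a checkpoint
```

In this model both extremes cost 16 units while the evenly spaced choice costs 7, which matches the guidance that too few and too many checkpoints both waste memory.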
[1] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
[2] Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. Memory-efficient backpropagation through time. In Advances in Neural Information Processing Systems (NIPS), pages 4125-4133, 2016.
[3] Mitsuru Kusumoto, et al. A graph theoretic framework of recomputation algorithms for memory-efficient backpropagation. arXiv preprint arXiv:1905.11722, 2019.
.. _api_guide_memory_optimize:
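1. GPU Memory Allocation Strategies in PaddlePaddle
1.1. AutoGrowth Strategy
The native CUDA system calls :code:`cudaMalloc` and :code:`cudaFree` are both synchronous and very time-consuming, so the AutoGrowth strategy caches the memory it has already allocated:
- During the first few allocations, the framework calls :code:`cudaMalloc` to allocate memory on demand. When memory is released, it is not returned to the GPU through :code:`cudaFree` but cached inside the framework.
- During later allocations, the framework first checks whether the cached memory contains a suitable block. If it does, the required space is split from that block and returned; otherwise the framework calls :code:`cudaMalloc` to allocate directly from the GPU. Memory released afterwards is cached as well for later allocations.
Therefore, the AutoGrowth strategy allocates slowly during the first few training batches (because :code:`cudaMalloc` is called frequently) and has almost no effect on training speed afterwards.
1.2. Pre-Allocation Strategy
Besides the AutoGrowth strategy, PaddlePaddle also provides a Pre-Allocation strategy, which was the default memory allocation strategy before PaddlePaddle 1.7.
Here, chunk_size is determined by the environment variable :code:`FLAGS_fraction_of_gpu_memory_to_use` and is computed as:

.. code-block:: python

    chunk_size = FLAGS_fraction_of_gpu_memory_to_use * currently available memory of a single GPU card

The default value of :code:`FLAGS_fraction_of_gpu_memory_to_use` is 0.92, that is, the framework pre-allocates 92% of the GPU's currently available memory.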
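- When allocating memory of requested_size:
    - If requested_size <= chunk_size, the framework first pre-allocates a memory pool (chunk) of chunk_size and splits off a block of requested_size to return. Every later request is served from the chunk.
    - If requested_size > chunk_size, the framework calls :code:`cudaMalloc` directly to allocate memory of requested_size and returns it.
- When freeing memory of free_size:
    - If free_size <= chunk_size, the framework puts the memory back into the pre-allocated chunk instead of returning it to CUDA.
    - If free_size > chunk_size, the framework calls :code:`cudaFree` directly to return the memory to CUDA.
If other tasks on your GPU card occupy memory, you can reduce :code:`FLAGS_fraction_of_gpu_memory_to_use` appropriately so that the framework can pre-allocate a block of suitable size, for example:

.. code-block:: shell

    export FLAGS_fraction_of_gpu_memory_to_use=0.4 # Pre-allocate 40% of the GPU memory

If :code:`FLAGS_fraction_of_gpu_memory_to_use` is set to 0, every allocation and release calls :code:`cudaMalloc` and :code:`cudaFree` , which seriously degrades performance and is not recommended.
Set :code:`FLAGS_fraction_of_gpu_memory_to_use` to 0 only when you want to measure the actual memory usage of a network, and then observe the usage reported by nvidia-smi.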
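1.3. Choosing the Memory Allocation Strategy
Since version 1.6+, PaddlePaddle supports both the AutoGrowth strategy and the Pre-Allocation strategy, selected through the environment variable :code:`FLAGS_allocator_strategy` .

.. code-block:: shell

    export FLAGS_allocator_strategy=auto_growth # Select the AutoGrowth strategy

.. code-block:: shell

    export FLAGS_allocator_strategy=naive_best_fit # Select the Pre-Allocation strategy

In addition, since version 1.7.2+, PaddlePaddle provides the environment variable :code:`FLAGS_gpu_memory_limit_mb` , which limits the maximum GPU memory a single task process may allocate, in MB. The default value is 0, which means no limit: all GPU memory can be allocated. If it is set to a value greater than 0, an error is raised once the allocated memory exceeds the limit, even if the system still has free GPU memory.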
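2. Memory Optimization Strategies in PaddlePaddle
2.1. GC Strategy: Eager Collection of Memory Garbage
GC (Garbage Collection) releases the memory of variables that are no longer needed while the network is running, in order to save memory. GC applies to training and inference with Executor or ParallelExecutor, but not to the C++ inference library interface.
- :code:`FLAGS_eager_delete_tensor_gb`
The switch that enables the GC strategy, of type double. The default value is -1 in versions < 1.6 and 0 in versions 1.6+. The GC strategy accumulates a certain amount of memory garbage and then releases it all at once; :code:`FLAGS_eager_delete_tensor_gb` is the threshold for that garbage, in GB. **We recommend setting** :code:`FLAGS_eager_delete_tensor_gb=0` .
If :code:`FLAGS_eager_delete_tensor_gb=0` , garbage is collected as soon as it appears, which saves the most memory.
If :code:`FLAGS_eager_delete_tensor_gb=1` , collection is triggered only after 1 GB of garbage has accumulated.
If :code:`FLAGS_eager_delete_tensor_gb<0` , the GC strategy is disabled.
- :code:`FLAGS_memory_fraction_of_eager_deletion`
GC sorts the variables in descending order of the memory they occupy and collects only the memory of the top :code:`FLAGS_memory_fraction_of_eager_deletion` of them. **We recommend keeping the default value**, :code:`FLAGS_memory_fraction_of_eager_deletion=1` .
If :code:`FLAGS_memory_fraction_of_eager_deletion=0.6` , only the memory of the largest 60% of the variables is collected.
If :code:`FLAGS_memory_fraction_of_eager_deletion=0` , no variable's memory is collected and the GC strategy is disabled.
If :code:`FLAGS_memory_fraction_of_eager_deletion=1` , the memory of all variables is collected.
- :code:`FLAGS_fast_eager_deletion_mode`
The switch for the fast GC strategy, of type bool. The default value is True, which enables fast GC. Fast GC releases GPU memory immediately without waiting for the CUDA kernel to finish. **We recommend keeping the default value**, :code:`FLAGS_fast_eager_deletion_mode=True` .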
2.2. Inplace Strategy: Output Reuses Input Inside an Op
The Inplace strategy applies when ParallelExecutor or CompiledProgram+with_data_parallel is used, and it is configured through :code:`BuildStrategy` . It does not support single-card training with Executor+Program or the C++ inference library interface.

.. code-block:: python

    build_strategy = fluid.BuildStrategy()
    build_strategy.enable_inplace = True  # Enable the Inplace strategy
    compiled_program = fluid.CompiledProgram(train_program).with_data_parallel(
        loss_name=loss.name, build_strategy=build_strategy)

.. code-block:: python

    loss.persistable = True
    acc.persistable = True

3. Memory Optimization Best Practice
- Enable the GC strategy: set :code:`FLAGS_eager_delete_tensor_gb=0` .
- Enable the Inplace strategy: set :code:`build_strategy.enable_inplace = True` , and in versions < 1.6 set :code:`var.persistable = True` for the variables in fetch_list.
.. _api_guide_memory_optimize_en:
Memory Allocation and Optimization
1. Memory Allocation Strategy
1.1. AutoGrowth Strategy
Since version 1.6+, PaddlePaddle supports the AutoGrowth strategy, which allocates memory on demand.
AutoGrowth strategy has been enabled by default in version 1.7+, making it convenient for users to
run multiple tasks on the same GPU card at the same time.
Because the native CUDA system calls :code:`cudaMalloc` and :code:`cudaFree` are synchronous and
very time-consuming, the AutoGrowth strategy caches allocated memory for subsequent allocations.
It works as follows:
- During the first few memory allocations, the PaddlePaddle framework calls :code:`cudaMalloc` and allocates memory on demand. When that memory is released, the framework does not call :code:`cudaFree` to return it to the GPU but caches it internally.
- During subsequent allocations, the framework first checks whether the cached memory contains a suitable block (one large enough for the requested size). If it does, the framework splits the requested amount from that block and returns it. Otherwise, it calls :code:`cudaMalloc` to allocate memory from the GPU. This memory is also cached when released, for use by later allocations.
Therefore, the AutoGrowth strategy may slow down the first few batches of model training,
but it does not affect the speed of the rest of the training process.
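
The caching behaviour described above can be sketched with a toy allocator. This is an illustration only, not PaddlePaddle's allocator: ``ToyAutoGrowthAllocator`` is a hypothetical class, block sizes are plain integers, the search is a simple first fit, and a counter stands in for real :code:`cudaMalloc` calls.

```python
class ToyAutoGrowthAllocator:
    """Toy model of a caching allocator; sizes are plain integers."""

    def __init__(self):
        self.cached_blocks = []       # sizes of freed blocks kept for reuse
        self.device_malloc_calls = 0  # stands in for cudaMalloc calls

    def allocate(self, size):
        # First-fit search in the cache for a block that is large enough.
        for i, block in enumerate(self.cached_blocks):
            if block >= size:
                del self.cached_blocks[i]
                if block > size:
                    # Split the block and keep the remainder cached.
                    self.cached_blocks.append(block - size)
                return size
        # Nothing suitable is cached: fall back to the device allocator.
        self.device_malloc_calls += 1
        return size

    def free(self, size):
        # Memory is never handed back to the device; it is cached instead.
        self.cached_blocks.append(size)


allocator = ToyAutoGrowthAllocator()
for _ in range(3):  # three identical "batches"
    blocks = [allocator.allocate(s) for s in (256, 128, 64)]
    for b in blocks:
        allocator.free(b)
```

Only the first batch reaches the device allocator (three calls); the later batches are served entirely from the cache, which is why the slowdown is limited to the start of training.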
1.2. Pre-Allocation Strategy
In addition to the AutoGrowth strategy, PaddlePaddle also provides a Pre-Allocation strategy,
which was the default memory allocation strategy before PaddlePaddle 1.7.
The Pre-Allocation strategy allocates one large chunk at the first allocation, and most subsequent allocations are served from this pre-allocated chunk.
The chunk size is determined by the environment variable :code:`FLAGS_fraction_of_gpu_memory_to_use` and is computed as:

.. code-block:: python

    chunk_size = FLAGS_fraction_of_gpu_memory_to_use * currently available memory of a single GPU card

The default value of :code:`FLAGS_fraction_of_gpu_memory_to_use` is 0.92, that is, the framework pre-allocates
92% of the currently available memory of the GPU card.
The Pre-Allocation strategy allocates GPU memory as follows:
- When allocating memory of requested_size:
    - If requested_size <= chunk_size, the framework first allocates a memory chunk of chunk_size, then splits off a block of requested_size and returns it. Every subsequent allocation is served from the chunk.
    - If requested_size > chunk_size, the framework calls :code:`cudaMalloc` to allocate a block of requested_size and returns it.
- When freeing memory of free_size:
    - If free_size <= chunk_size, the framework puts the memory block back into the pre-allocated chunk instead of returning it to the GPU.
    - If free_size > chunk_size, the framework calls :code:`cudaFree` and returns the memory to the GPU.
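
The chunk-size formula and the allocation rule above can be written down as a small sketch. ``preallocation_plan`` is a hypothetical helper used only for illustration; it models the very first allocation and ignores what happens once the chunk fills up.

```python
def preallocation_plan(requested_size, available_memory, fraction=0.92):
    """Return (chunk_size, source) for a first allocation under Pre-Allocation."""
    chunk_size = fraction * available_memory
    if requested_size <= chunk_size:
        # Served from the pre-allocated chunk.
        return chunk_size, "chunk"
    # Too large for the chunk: allocated directly from the device.
    return chunk_size, "cudaMalloc"


GiB = 1024 ** 3
# Default fraction on a card with 16 GiB currently available.
chunk, source = preallocation_plan(2 * GiB, 16 * GiB)
# A smaller fraction leaves room for other tasks, but large requests bypass the chunk.
small_chunk, big_request = preallocation_plan(8 * GiB, 16 * GiB, fraction=0.4)
```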
If other tasks on your GPU card occupy memory, you can decrease :code:`FLAGS_fraction_of_gpu_memory_to_use` appropriately
so that the framework can pre-allocate a memory block of suitable size, for example:

.. code-block:: shell

    export FLAGS_fraction_of_gpu_memory_to_use=0.4 # Pre-allocate 40% memory of a single GPU card

If :code:`FLAGS_fraction_of_gpu_memory_to_use` is set to 0, the framework calls :code:`cudaMalloc` and :code:`cudaFree` every time memory is allocated and released, which seriously degrades performance and is not recommended. Set :code:`FLAGS_fraction_of_gpu_memory_to_use` to 0 only when you want to measure the actual memory usage of the network, and then observe the memory usage reported by the nvidia-smi command.
1.3. Configuration of memory allocation strategy
Since version 1.6+, PaddlePaddle supports both the AutoGrowth strategy and the Pre-Allocation strategy, and selects the strategy to use through
the environment variable :code:`FLAGS_allocator_strategy`.
Use AutoGrowth strategy:

.. code-block:: shell

    export FLAGS_allocator_strategy=auto_growth # Use AutoGrowth strategy

Use Pre-Allocation strategy:

.. code-block:: shell

    export FLAGS_allocator_strategy=naive_best_fit # Use Pre-Allocation strategy

In addition, since version 1.7.2+, PaddlePaddle provides the environment variable :code:`FLAGS_gpu_memory_limit_mb`, which limits the maximum GPU memory that a process can allocate.
If it is 0, there is no limit and all GPU memory is available to the process. If it is larger than 0, the process raises an out-of-memory error when the allocated
memory exceeds the limit, even if the GPU card still has free memory. The unit is MB and the default value is 0.
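
The same flags can also be set from Python instead of the shell. The snippet below is a minimal sketch that assumes the framework reads the ``FLAGS_*`` environment variables when it is first imported, so the variables are set before the import; the 4096 MB limit is an arbitrary value chosen for illustration.

```python
import os

# Configure the allocator for this process only (values are illustrative).
os.environ["FLAGS_allocator_strategy"] = "auto_growth"
os.environ["FLAGS_gpu_memory_limit_mb"] = "4096"  # cap this process at 4096 MB

# Import the framework only after the flags are in place (assumption, see above):
# import paddle.fluid as fluid
```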
2. Memory Optimization Strategy
PaddlePaddle provides several general memory optimization methods to reduce the memory usage of your network (both host memory and GPU memory).
2.1. GC Strategy: memory garbage eager collection
GC (Garbage Collection) eagerly releases the memory of variables that are no longer needed while the network is running,
in order to save memory. GC applies to training and inference with Executor or ParallelExecutor, but not to the C++ inference library.
**Since version 1.6+, GC Strategy is enabled by default.**
GC Strategy is controlled by 3 environment variables:
- :code:`FLAGS_eager_delete_tensor_gb`
Variable that enables GC; its data type is double. The default value is -1 in PaddlePaddle versions < 1.6
and 0 in versions >= 1.6. GC Strategy accumulates a certain amount of memory garbage and then releases it all at once.
:code:`FLAGS_eager_delete_tensor_gb` is the threshold for the accumulated garbage, in GB. **It is recommended to set** :code:`FLAGS_eager_delete_tensor_gb=0`.
If :code:`FLAGS_eager_delete_tensor_gb=0`, memory garbage is collected as soon as it appears, which saves the most memory.
If :code:`FLAGS_eager_delete_tensor_gb=1`, memory garbage is collected once the accumulated amount reaches 1GB.
If :code:`FLAGS_eager_delete_tensor_gb<0`, GC Strategy is disabled.
- :code:`FLAGS_memory_fraction_of_eager_deletion`
Variable that controls GC Strategy; its data type is double. The default value is 1 and the valid range is [0,1]. It only applies to ParallelExecutor or CompiledProgram+with_data_parallel.
GC sorts the variables in descending order of the memory they occupy
and only collects the memory of the top :code:`FLAGS_memory_fraction_of_eager_deletion` of them.
**It is recommended to keep the default value**, that is :code:`FLAGS_memory_fraction_of_eager_deletion=1`.
If :code:`FLAGS_memory_fraction_of_eager_deletion=0.6`, the top 60% of the variables are collected.
If :code:`FLAGS_memory_fraction_of_eager_deletion=0`, no variable is collected and GC Strategy is disabled.
If :code:`FLAGS_memory_fraction_of_eager_deletion=1`, all variables are collected.
- :code:`FLAGS_fast_eager_deletion_mode`
Variable that enables fast GC Strategy; its type is bool. The default value is True, which means fast GC Strategy is used.
Fast GC Strategy collects memory garbage immediately instead of waiting for the CUDA kernel to finish. **It is recommended to keep the default value**, that is :code:`FLAGS_fast_eager_deletion_mode=True`.
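
The effect of the :code:`FLAGS_eager_delete_tensor_gb` threshold can be illustrated with a toy simulation. ``simulate_eager_deletion`` is a hypothetical helper, not framework code: it only tracks how much garbage is pending at any time and how often a release is triggered.

```python
def simulate_eager_deletion(garbage_sizes_gb, threshold_gb):
    """Return (peak pending garbage in GB, number of releases) for a run."""
    if threshold_gb < 0:
        # GC disabled: nothing is released during the run.
        return sum(garbage_sizes_gb), 0
    pending = 0.0
    peak = 0.0
    releases = 0
    for size in garbage_sizes_gb:
        pending += size
        peak = max(peak, pending)
        if pending >= threshold_gb:
            # Threshold reached: release everything accumulated so far.
            pending = 0.0
            releases += 1
    return peak, releases


garbage = [0.25, 0.5, 0.25, 0.5, 0.5]             # garbage produced step by step
immediate = simulate_eager_deletion(garbage, 0)   # release as soon as garbage appears
batched = simulate_eager_deletion(garbage, 1)     # release once 1 GB has accumulated
disabled = simulate_eager_deletion(garbage, -1)   # GC switched off
```

With a threshold of 0 the pending garbage never exceeds the largest single variable, which is why that setting is the recommended one.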
2.2. Inplace Strategy: output reuses input inside operator
The idea behind the Inplace strategy is that the output of some operators can reuse the memory of their input.
For example, the output and input of operator :code:`reshape` can share the same memory.
Inplace Strategy applies to ParallelExecutor or CompiledProgram+with_data_parallel and is configured through :code:`BuildStrategy`.
It does not apply to Executor+Program or the C++ inference library.
**Since version 1.6+, Inplace Strategy is enabled by default.**
The Inplace strategy is enabled as follows:
.. code-block:: python

    build_strategy = fluid.BuildStrategy()
    build_strategy.enable_inplace = True  # Enable Inplace Strategy
    compiled_program = fluid.CompiledProgram(train_program).with_data_parallel(
        loss_name=loss.name, build_strategy=build_strategy)

In PaddlePaddle versions < 1.6, due to some design problems, when the Inplace Strategy is enabled,
the variables in fetch_list of the subsequent :code:`exe.run` must be persistable.
That is, if the variables you want to fetch are loss and acc, you must set:

.. code-block:: python

    loss.persistable = True
    acc.persistable = True

**Since version 1.6+, setting variables in fetch_list to persistable is not needed.**
3. Memory Optimization Best Practice
We recommend the following memory optimization settings:
- Enable GC strategy: set :code:`FLAGS_eager_delete_tensor_gb=0`.
- Enable Inplace strategy: set :code:`build_strategy.enable_inplace = True`, and set variables in fetch_list to persistable using :code:`var.persistable = True` when the version of PaddlePaddle < 1.6.
**Since version 1.6+, the settings above are enabled by default and setting variables in fetch_list to persistable is not needed.**
.. _api_guide_singlenode_training_best_practice:
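PaddlePaddle Fluid supports training on modern CPU and GPU platforms. If single-node training with Fluid is slow for you, the suggestions in this document can help you optimize your Fluid program.
1. Configuration Optimization During Network Construction
1.1 Choosing cuDNN Operations
cuDNN is NVIDIA's library for deep neural network computation and contains many operators commonly used in neural networks. Some Paddle Ops call cuDNN under the hood, for example :code:`conv2d` :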
.. code-block:: python
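When :code:`use_cudnn=True` , the framework calls the convolution operation in cuDNN.
Operations provided by cuDNN usually perform well and are clearly faster than Paddle's native CUDA implementation, :code:`conv2d` being one example. Some cuDNN operations perform poorly, however, such as :code:`conv2d_transpose` with :code:`batch_size=1` and :code:`pool2d` with :code:`global_pooling=True` . In these cases the cuDNN implementation is slower than Paddle's CUDA implementation, and we recommend setting :code:`use_cudnn=False` manually.
1.2 Reducing the Number of Layers in the Model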
(1) :code:`fluid.layers.softmax_with_cross_entropy` is in fact the combination of :code:`fluid.layers.softmax` and :code:`fluid.layers.cross_entropy` , so a model that contains the first pattern below can use the single call in the second block instead:

.. code-block:: python

    logits = fluid.layers.softmax(logits)
    loss = fluid.layers.cross_entropy(logits, label, ignore_index=255)

.. code-block:: python

    loss = fluid.layers.softmax_with_cross_entropy(logits, label, ignore_index=255, numeric_stable_mode=True)

(2) If the model needs to normalize its data, use :code:`fluid.layers.data_norm` directly instead of building the normalization from a series of layers.
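2. Data Preparation Optimization
Users set up these two parts according to the needs of their own model; all that is required in the end is a Data Reader interface. A Data Reader returns an iterable object that yields either one sample or a group of samples at a time. For example:

.. code-block:: python

    import numpy as np

    def data_reader(width, height):
        def reader():
            while True:
                yield np.random.uniform(-1, 1, size=width*height), np.random.randint(0, 10)
        return reader

    train_data_reader = data_reader(32, 32)

Paddle provides two ways to read data from a Data Reader: :ref:`user_guide_use_numpy_array_as_train_data` and :ref:`user_guides_use_py_reader` . For details, see :ref:`user_guide_prepare_data` .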
2.1 Synchronous Data Reading

.. code-block:: python

    import time

    import paddle
    import paddle.fluid as fluid

    image = fluid.data(name="image", shape=[None, 1, 28, 28], dtype="float32")
    label = fluid.data(name="label", shape=[None, 1], dtype="int64")
    # Model definition
    # ...
    prediction = fluid.layers.fc(input=image, size=10)
    loss = fluid.layers.cross_entropy(input=prediction, label=label)
    avg_loss = fluid.layers.mean(loss)
    # ...
    # Read data
    # paddle.dataset.mnist.train() returns a Reader that yields one sample at a time; batch_size is 128
    train_reader = paddle.batch(paddle.dataset.mnist.train(), 128)
    end = time.time()
    for batch_id, batch in enumerate(train_reader()):
        data_time = time.time() - end
        # Train the network
        executor.run(feed={...}, fetch_list=[...])
        batch_time = time.time() - end
        end = time.time()

The user first defines the model inputs with :code:`fluid.data` , then builds the model on top of them, and finally fetches one batch of data from the user-defined Reader function and passes it to the executor.
With synchronous data reading, the Python timing function :code:`time.time()` can be used to measure how much time is spent on data preparation and how much on execution.
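
The timing pattern described above can be tried without any framework. The sketch below is a stand-alone illustration: ``timed_epoch`` and ``slow_reader`` are hypothetical names, the training step is a placeholder for :code:`executor.run` , and ``time.perf_counter`` is used in place of ``time.time`` because it is monotonic.

```python
import time


def timed_epoch(reader, train_step):
    """Split each iteration into data-preparation time and total batch time."""
    data_times, batch_times = [], []
    end = time.perf_counter()
    for batch in reader():
        data_times.append(time.perf_counter() - end)   # time spent waiting for data
        train_step(batch)                              # placeholder for executor.run(...)
        batch_times.append(time.perf_counter() - end)  # data time + compute time
        end = time.perf_counter()
    return data_times, batch_times


def slow_reader():
    for i in range(3):
        time.sleep(0.01)  # pretend preprocessing work
        yield [i] * 4


data_times, batch_times = timed_epoch(slow_reader, lambda batch: sum(batch))
```

If ``data_times`` accounts for most of ``batch_times`` , data preparation is the bottleneck and asynchronous reading is worth trying.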
2.2 Asynchronous Data Reading
Paddle implements asynchronous data reading with the paddle.fluid.io. :ref:`cn_api_fluid_io_DataLoader` interface, for example:

.. code-block:: python

    image = fluid.data(name="image", shape=[None, 1, 28, 28], dtype="float32")
    label = fluid.data(name="label", shape=[None, 1], dtype="int64")
    data_loader = fluid.io.DataLoader.from_generator(
        feed_list=[image, label],
        capacity=64,             # size of the data queue; illustrative value
        use_double_buffer=True,
        iterable=False)
    # Model definition
    # ...
    prediction = fluid.layers.fc(input=image, size=10)
    loss = fluid.layers.cross_entropy(input=prediction, label=label)
    avg_loss = fluid.layers.mean(loss)
    # ...
    # Read data
    train_reader = paddle.batch(paddle.dataset.mnist.train(), 128)
    data_loader.set_batch_generator(train_reader, places=places)
    # Start data_loader
    data_loader.start()
    batch_id = 0
    try:
        end = time.time()
        while True:
            print("queue size: ", data_loader.queue.size())
            loss, = executor.run(fetch_list=[...])
            # ...
            batch_time = time.time() - end
            end = time.time()
            batch_id += 1
    except fluid.core.EOFException:
        # All data in the Reader has been consumed
        data_loader.reset()

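The user first creates a DataLoader object with :code:`fluid.io.DataLoader.from_generator` and binds the custom Reader to it with the :code:`set_batch_generator` method.
If the DataLoader is defined as non-iterable ( :code:`iterable=False` ), call the :code:`start()` method before training begins to start data reading.
When the data is exhausted, :code:`executor.run` throws :code:`fluid.core.EOFException` , which indicates that training has gone through all the data in the Reader.
Some of the FLAGS provided by Paddle are also helpful for performance analysis. To evaluate the model's performance with no data reading overhead at all, set the environment variable :code:`FLAGS_reader_queue_speed_test_mode` . When it is True, the C++ side does not remove data from the data queue after fetching it, so the queue is never empty and the C++ side never waits for data.
**Note in particular that** :code:`FLAGS_reader_queue_speed_test_mode` **may only be enabled for performance analysis and must be disabled for normal training.**
To reduce the overall training time, we recommend asynchronous data reading with :code:`use_double_buffer=True` . The size of the data queue can be set according to the model.
A common approach is to **prepare data with multiple Python processes** ; a simple example can be found in `YOLOv3 <https://github.com/PaddlePaddle/models/blob/develop/PaddleCV/yolov3/reader.py>`_ .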
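
The idea behind asynchronous reading, a producer that fills a bounded queue while the consumer trains, can be sketched in plain Python. Note that this sketch uses a background thread rather than the multiple processes recommended above, and that ``prefetch`` is a hypothetical helper, not part of the DataLoader API.

```python
import queue
import threading


def prefetch(reader, capacity=4):
    """Yield items from `reader()` while a background thread fills a bounded queue."""
    buffer = queue.Queue(maxsize=capacity)
    sentinel = object()  # marks the end of the data

    def producer():
        for item in reader():
            buffer.put(item)  # blocks while the queue is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()   # blocks while the queue is empty
        if item is sentinel:
            return
        yield item


def reader():
    for i in range(5):
        yield i * i


batches = list(prefetch(reader))
```

The ``capacity`` argument plays the role of the data queue size: a larger queue absorbs more jitter in preprocessing time at the cost of more buffered data.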
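3. Optimizations Related to Model Training
3.1 Introduction to the Executors
Paddle's Python API currently provides :code:`fluid.compiler.CompiledProgram` , which compiles the program passed to it.
For data-parallel training, simply call :code:`with_data_parallel` on the object returned by :code:`CompiledProgram` ; in every case the compiled_program is run through :code:`executor.run(…)` .
Although everything goes through the :code:`executor.run(…)` interface, there are two underlying execution strategies, corresponding to two C++ executors, :code:`Executor` and :code:`ParallelExecutor` . In data-parallel mode the C++ side uses :code:`ParallelExecutor` ; in all other cases it uses :code:`Executor` .

.. csv-table::
    :header: "Executor", "Object executed", "Execution strategy"
    :widths: 3, 3, 5

    ":code:`Executor`", ":code:`Program`", "Runs the Operators in the :code:`Program` one after another, in the order they are defined."
    ":code:`ParallelExecutor`", "SSA Graph", "Runs the nodes of the Graph with multiple threads, according to the dependencies between them."

As the table shows, the internal logic of :code:`Executor` is very simple, but its performance may be lower because it executes the operations in the program serially.
:code:`ParallelExecutor` first converts the program into a computation graph and analyzes the connections between its nodes; nodes (OPs) that do not depend on each other are executed in parallel by multiple threads.
:code:`Executor` is therefore a lightweight executor, currently used mainly for parameter initialization, model saving and model loading.
:code:`ParallelExecutor` is an upgraded version of :code:`Executor` and is currently used mainly for model training, including single-node single-card, single-node multi-card and multi-node multi-card training.
Before executing the computation graph, :code:`ParallelExecutor` can optimize it, for example by making some operations In-place or by fusing the parameter update operations.
Users can also adjust some settings of :code:`ParallelExecutor` at run time, such as the number of threads that execute the graph. These settings are made through the build strategy (BuildStrategy) and the execution strategy (ExecutionStrategy).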
.. code-block:: python

    build_strategy = fluid.BuildStrategy()
    build_strategy.enable_inplace = True

    exec_strategy = fluid.ExecutionStrategy()
    exec_strategy.num_threads = 4

    train_program = fluid.compiler.CompiledProgram(main_program).with_data_parallel(
        loss_name=loss.name,
        build_strategy=build_strategy,
        exec_strategy=exec_strategy)

    place = fluid.CUDAPlace(0)
    exe = fluid.Executor(place)
    # Data is read through DataLoader, so no feed is needed when running
    fetch_outs = exe.run(train_program, fetch_list=[loss.name])

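3.2 Build Strategy (BuildStrategy) Options
BuildStrategy provides several strategies for optimizing the computation graph. They can speed up training to varying degrees, but some of them depend on the structure of the model; for example, :code:`fuse_all_optimizer_ops` does not support sparse gradients. We are actively improving these strategies and plan to enable them by default in the next release.

.. csv-table::
    :header: "Option", "Type", "Default", "Description"
    :widths: 3, 3, 3, 5

    ":code:`reduce_strategy`", ":code:`fluid.BuildStrategy.ReduceStrategy`", ":code:`fluid.BuildStrategy.ReduceStrategy.AllReduce`", "Whether data-parallel training uses :code:`AllReduce` mode or :code:`Reduce` mode."
    ":code:`enable_backward_optimizer_op_deps`", "bool", "True", "Adds dependencies between the backward operations and the parameter update operations, so that parameter updates start only after all backward operations have finished."
    ":code:`fuse_all_optimizer_ops`", "bool", "False", "Fuses the parameter update operations in the model."
    ":code:`fuse_all_reduce_ops`", "bool", "False", "Fuses the all_reduce operations in multi-card training."
    ":code:`fuse_relu_depthwise_conv`", "bool", "False", "If the model contains relu and depthwise_conv connected as relu->depthwise_conv, this option merges the two operations into one."
    ":code:`fuse_broadcast_ops`", "bool", "False", "In :code:`Reduce` mode, fuses the final Broadcast operations into one."
    ":code:`mkldnn_enabled_op_types`", "list", "{}", "For CPU training, :code:`mkldnn_enabled_op_types` specifies which operations in the model may use the MKLDNN library. By default, every operation used in the model that is on Paddle's list of mkldnn-capable operations calls the mkldnn library."
    ":code:`debug_graphviz_path`", "str", "{}", "Writes the Graph in graphviz format to the file given by debug_graphviz_path."
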
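(1) About :code:`reduce_strategy` : :code:`ParallelExecutor` supports two parameter update modes for data parallelism, :code:`AllReduce` and :code:`Reduce` . In :code:`AllReduce` mode, after each node has computed its gradients, an :code:`AllReduce` operation aggregates the gradients on every node, and each node then updates the parameters itself. In :code:`Reduce` mode, the parameter updates are distributed evenly across the nodes: after each node has computed its gradients, the gradients are reduced onto a designated node, that node performs the update, and the updated parameters are finally broadcast to the other nodes. For example, if the model has 100 parameters to update and training uses 4 nodes, then in :code:`AllReduce` mode every node updates all 100 parameters, while in :code:`Reduce` mode each node updates 25 parameters and then broadcasts the updated parameters to the other nodes. Note: for data-parallel training on CPU in Reduce mode, the parameters on different CPUPlaces are shared, so the updated parameters do not need to be broadcast to the other CPUPlaces.
(2) About :code:`enable_backward_optimizer_op_deps` : enabling this option may speed up multi-card training.
(3) About :code:`fuse_all_optimizer_ops` : only the SGD, Adam and Momentum algorithms are currently supported. **Note: sparse parameter gradients are not supported at present** .
(4) About :code:`fuse_all_reduce_ops` : in multi-GPU training, :code:`AllReduce` operations can be fused to reduce the number of :code:`AllReduce` calls. By default, the :code:`AllReduce` operations for the gradients of the parameters in one layer are merged into one. For example, :code:`fluid.layers.fc` has two parameters, Weight and Bias; with this option enabled, one :code:`AllReduce` operation is needed instead of two. To support fusion at a coarser granularity, Paddle also provides the :code:`FLAGS_fuse_parameter_memory_size` option, which sets the amount of gradient data carried by each :code:`AllReduce` operation after fusion. For example, to transfer 64MB of gradients per :code:`AllReduce` call, use :code:`export FLAGS_fuse_parameter_memory_size=64` . **Note: sparse parameter gradients are not supported at present** .
(5) About :code:`mkldnn_enabled_op_types` : the Paddle Ops that can currently use the mkldnn library are transpose, sum, softmax, requantize, quantize, pool2d, lrn, gaussian_random, fc, dequantize, conv2d_transpose, conv2d, conv3d, concat, batch_norm, relu, tanh, sqrt and abs.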
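
The 100-parameter, 4-node example in item (1) can be reproduced with a toy count of how many parameters each node updates. ``updates_per_node`` is a hypothetical helper, and the round-robin assignment used for :code:`Reduce` mode is an assumption made for illustration; the text above only says that the updates are distributed evenly.

```python
def updates_per_node(num_params, num_nodes, mode):
    """Count how many parameters each node updates in one step (toy model)."""
    if mode == "AllReduce":
        # Every node holds the aggregated gradients and updates every parameter.
        return [num_params] * num_nodes
    if mode == "Reduce":
        # Parameters are assigned round-robin; each node updates its own share
        # and then broadcasts the result to the other nodes.
        counts = [0] * num_nodes
        for p in range(num_params):
            counts[p % num_nodes] += 1
        return counts
    raise ValueError("unknown mode: %s" % mode)


all_reduce = updates_per_node(100, 4, "AllReduce")
reduce_mode = updates_per_node(100, 4, "Reduce")
```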
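3.3 Execution Strategy (ExecutionStrategy) Options

.. csv-table::
    :header: "Option", "Type", "Default", "Description"
    :widths: 3, 3, 5, 5

    ":code:`num_iteration_per_drop_scope`", "INT", "100", "Number of iterations between two cleanups of the local execution scope."
    ":code:`num_threads`", "INT", "CPU: 2*dev_count; GPU: 4*dev_count (an empirical value)", "Size of the thread pool that :code:`ParallelExecutor` uses to execute all Ops."

(1) About :code:`num_iteration_per_drop_scope` : the framework creates temporary variables while running, and by default cleans them up after every batch. Because the GPU is an asynchronous device, all GPUs must be synchronized once before the cleanup, which takes a long time. We therefore added the :code:`num_iteration_per_drop_scope` option to execution_strategy, which lets users specify after how many iterations the cleanup takes place.
(2) About :code:`num_threads` : :code:`ParallelExecutor` determines the execution order of Ops from the dependencies between them: once all inputs of an Op are ready, the Op is put into a queue to wait for execution. :code:`ParallelExecutor` contains a task scheduling thread and a thread pool; the scheduling thread takes all ready Ops from the queue and puts them into the thread queue. :code:`num_threads` is the size of the thread pool. In our experience, :code:`num_threads=2*dev_count` works well for CPU tasks and :code:`num_threads=4*dev_count` works well for GPU tasks. **Note: a larger thread pool is not always better** .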
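
The dependency-driven scheduling described in item (2) can be sketched with the standard library. This is a simplified illustration, not the actual :code:`ParallelExecutor` : ``run_graph`` is a hypothetical function that runs ready Ops in waves, whereas a real scheduler can dispatch an Op as soon as its own inputs are ready.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def run_graph(ops, deps, num_threads=4):
    """Run `ops` (name -> callable) respecting `deps` (name -> prerequisite names)."""
    remaining = {name: set(deps.get(name, ())) for name in ops}
    ready = deque(name for name, d in remaining.items() if not d)
    order = []
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        while ready:
            wave = list(ready)
            ready.clear()
            # Ops in one wave have no unmet dependencies, so they may run concurrently.
            list(pool.map(lambda name: ops[name](), wave))
            order.extend(wave)
            for done in wave:
                del remaining[done]
            for name, d in remaining.items():
                d.difference_update(wave)
                if not d:
                    ready.append(name)  # all inputs are now ready
    return order


log = []
ops = {n: (lambda n=n: log.append(n)) for n in ("a", "b", "c", "d")}
deps = {"c": ["a", "b"], "d": ["c"]}  # c needs a and b; d needs c
order = run_graph(ops, deps, num_threads=2)
```

Here ``a`` and ``b`` can run in parallel, while ``c`` and ``d`` form a chain, so adding more than two threads would not help this graph, in line with the note that a larger pool is not always better.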
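4. Runtime FLAGS Optimization
(1) :code:`FLAGS_cudnn_exhaustive_search` makes calls to cuDNN convolution use exhaustive search, based on information such as the shape of the input data, to pick a faster convolution algorithm from the algorithm library and thus speed up the convolutions in the model. Note that:
- The search itself needs a fair amount of GPU memory. If the model contains many convolutions or the GPU has little memory, the process may run out of GPU memory.
- Once an algorithm has been selected by exhaustive search, it is stored in a cache, so that later runs with unchanged input shape and related information use the cached algorithm directly.
(2) :code:`FLAGS_enable_cublas_tensor_op_math` controls whether TensorCore is used to accelerate operations in NVIDIA libraries such as cuBLAS. Note that this environment variable only applies to Tesla V100 and newer GPUs and may cause some loss of precision, which usually does not affect the convergence of the model.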
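5. Best Practices
(1) Wherever possible, implement the required operation with a single layer provided by PaddlePaddle.
(2) Use asynchronous data reading.
(3) Optimizations related to model training:
- Use ParallelExecutor as the underlying executor. The with_data_parallel method can also be called for single-card training. For example:

.. code-block:: python

    # `main_program` and `loss` are assumed to be defined earlier
    compiled_prog = compiler.CompiledProgram(
        main_program).with_data_parallel(loss_name=loss.name)

- If all parameter gradients in the model are non-sparse, enable the fuse_all_optimizer_ops option to fuse multiple parameter update operations into one.
- For multi-card training, enable the enable_backward_optimizer_op_deps and fuse_all_reduce_ops options. To set the amount of data handled by each AllReduce operation, set :code:`FLAGS_fuse_parameter_memory_size` ; for example, :code:`export FLAGS_fuse_parameter_memory_size=1` means that each AllReduce call transfers 1MB of gradients.
- For data-parallel training on CPU, Reduce mode is recommended: in that mode the parameters on different CPUPlaces are shared, so the updated parameters do not need to be broadcast to the other CPUPlaces after the update, which also helps speed considerably.
- In Reduce mode, the fuse_broadcast_ops option can be enabled.
- If the model is small, such as mnist or language_model, num_threads can be set to 1.
- If there is enough GPU memory, we recommend setting :code:`exec_strategy.num_iteration_per_drop_scope` to a fairly large value, such as 100, to avoid allocating and freeing memory repeatedly.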
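
The effect of a per-call size limit such as :code:`FLAGS_fuse_parameter_memory_size` can be illustrated with a greedy bucketing sketch. How the framework actually groups gradients is not described here, so the grouping rule below is an assumption, and ``fuse_into_buckets`` is a hypothetical helper.

```python
def fuse_into_buckets(grad_sizes_bytes, bucket_mb):
    """Greedily group consecutive gradients so each group holds at most `bucket_mb` MB.

    A single gradient larger than the limit gets a group of its own.
    """
    limit = bucket_mb * 1024 * 1024
    buckets, current, current_bytes = [], [], 0
    for index, size in enumerate(grad_sizes_bytes):
        if current and current_bytes + size > limit:
            # The next gradient would overflow the bucket: close it.
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets


MB = 1024 * 1024
sizes = [MB // 2, MB // 4, MB // 2, MB // 4, 2 * MB]  # five gradients
buckets = fuse_into_buckets(sizes, bucket_mb=1)
```

In this example five gradients are transferred with three fused calls instead of five separate ones.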
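(4) FLAGS settings

.. code-block:: bash

    export FLAGS_cudnn_exhaustive_search=True
    export FLAGS_enable_cublas_tensor_op_math=True

6. Performance Analysis with the Profile Tools
To help users find performance bottlenecks in their programs, Paddle provides several Profile tools. For a detailed introduction and usage instructions, see :ref:`api_guide_analysis_tools` .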
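# How to Write a Custom C++ OP Outside the Framework
If the PaddlePaddle Operator (OP) library does not contain the operation you need, first try to compose it from existing OPs. If that is not possible, you can try `fluid.layers.py_func`, or follow this tutorial to write a custom C++ OP. A custom C++ OP is also an option when an operation composed from several OPs does not perform well enough.
1. Implement and register the OP exactly as you would inside the framework, following the conventions and steps in "How to write a new C++ OP". Implementing the Gradient OP is optional.
2. Compile a dynamic library.
3. Wrap the OP in a Python interface.
4. Write a unit test for the OP.
The rest of this document walks through these steps with a concrete example, an implementation of the relu op.
## Implementing the Custom OP
The OP is implemented as described in the "How to write a new C++ OP" tutorial. In short: 1) define the OP's ProtoMaker, which describes its inputs, outputs and attributes; 2) implement the OP definition, InferShape and the kernel function, and likewise for the backward OP; 3) register the OP and its compute functions.
CPU implementation of the ReLU OP, file ``relu_op.cc``: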

    // relu_op.cc
    #include <algorithm>
    #include "paddle/fluid/framework/op_registry.h"

    namespace paddle {
    namespace operators {

    // Input X, output Y and attributes of the forward OP
    class Relu2OpMaker : public framework::OpProtoAndCheckerMaker {
     public:
      void Make() override {
        AddInput("X", "The input tensor.");
        AddOutput("Y", "Output of relu_op");
        AddComment(R"DOC(
    Relu Operator.
    Y = max(X, 0)
    )DOC");
      }
    };

    // Definition of the forward OP and its InferShape, which sets the shape of Y
    class Relu2Op : public framework::OperatorWithKernel {
     public:
      using framework::OperatorWithKernel::OperatorWithKernel;

      void InferShape(framework::InferShapeContext* ctx) const override {
        auto in_dims = ctx->GetInputDim("X");
        ctx->SetOutputDim("Y", in_dims);
      }
    };

    // Kernel function of the forward OP: Y = max(0, X)
    using Tensor = framework::Tensor;
    template <typename DeviceContext, typename T>
    class Relu2Kernel : public framework::OpKernel<T> {
     public:
      void Compute(const framework::ExecutionContext& ctx) const override {
        auto* in_t = ctx.Input<Tensor>("X");
        auto* out_t = ctx.Output<Tensor>("Y");
        auto x = in_t->data<T>();
        // mutable_data allocates memory and returns the pointer
        auto y = out_t->mutable_data<T>(ctx.GetPlace());
        for (int i = 0; i < in_t->numel(); ++i) {
          y[i] = std::max(static_cast<T>(0.), x[i]);
        }
      }
    };

    // Inputs Y and dY, output dX and attributes of the backward OP
    template <typename T>
    class Relu2GradMaker : public framework::SingleGradOpMaker<T> {
     public:
      using framework::SingleGradOpMaker<T>::SingleGradOpMaker;

      void Apply(GradOpPtr<T> op) const override {
        op->SetType("relu2_grad");
        op->SetInput("Y", this->Output("Y"));
        op->SetInput(framework::GradVarName("Y"), this->OutputGrad("Y"));
        op->SetOutput(framework::GradVarName("X"), this->InputGrad("X"));
      }
    };

    // Definition of the backward OP and its InferShape, which sets the shape of dX
    class Relu2GradOp : public framework::OperatorWithKernel {
     public:
      using framework::OperatorWithKernel::OperatorWithKernel;

      void InferShape(framework::InferShapeContext* ctx) const override {
        auto in_dims = ctx->GetInputDim(framework::GradVarName("Y"));
        ctx->SetOutputDim(framework::GradVarName("X"), in_dims);
      }
    };

    // Kernel function of the backward OP: dx = dy * (y > 0. ? 1. : 0)
    template <typename DeviceContext, typename T>
    class Relu2GradKernel : public framework::OpKernel<T> {
     public:
      void Compute(const framework::ExecutionContext& ctx) const override {
        auto* dy_t = ctx.Input<Tensor>(framework::GradVarName("Y"));
        auto* y_t = ctx.Input<Tensor>("Y");
        auto* dx_t = ctx.Output<Tensor>(framework::GradVarName("X"));
        auto dy = dy_t->data<T>();
        auto y = y_t->data<T>();
        auto dx = dx_t->mutable_data<T>(ctx.GetPlace());
        for (int i = 0; i < y_t->numel(); ++i) {
          dx[i] = dy[i] * (y[i] > static_cast<T>(0) ? 1. : 0.);
        }
      }
    };

    }  // namespace operators
    }  // namespace paddle

    namespace ops = paddle::operators;
    using CPU = paddle::platform::CPUDeviceContext;
    // Register the forward and backward ops.
    // To distinguish it from the framework's built-in relu, the OP type registered here is relu2.
    REGISTER_OPERATOR(relu2,
                      ops::Relu2Op,
                      ops::Relu2OpMaker,
                      ops::Relu2GradMaker<paddle::framework::OpDesc>,
                      ops::Relu2GradMaker<paddle::imperative::OpBase>);
    REGISTER_OPERATOR(relu2_grad, ops::Relu2GradOp);
    // Register the CPU kernels
    REGISTER_OP_CPU_KERNEL(relu2,
                           ops::Relu2Kernel<CPU, float>,
                           ops::Relu2Kernel<CPU, double>);
    REGISTER_OP_CPU_KERNEL(relu2_grad,
                           ops::Relu2GradKernel<CPU, float>,
                           ops::Relu2GradKernel<CPU, double>);

GPU implementation of the ReLU OP, file ``relu_op.cu``:

    // relu_op.cu
    #include "paddle/fluid/framework/op_registry.h"

    namespace paddle {
    namespace operators {

    using Tensor = framework::Tensor;

    template <typename T>
    __global__ void KeRelu2(const T* x, const int num, T* y) {
      int gid = blockIdx.x * blockDim.x + threadIdx.x;
      for (int i = gid; i < num; i += blockDim.x * gridDim.x) {
        y[i] = max(x[i], static_cast<T>(0.));
      }
    }

    // GPU implementation of the forward OP kernel
    template <typename DeviceContext, typename T>
    class Relu2CUDAKernel : public framework::OpKernel<T> {
     public:
      void Compute(const framework::ExecutionContext& ctx) const override {
        auto* in_t = ctx.Input<Tensor>("X");
        auto* out_t = ctx.Output<Tensor>("Y");
        auto x = in_t->data<T>();
        auto y = out_t->mutable_data<T>(ctx.GetPlace());

        auto& dev_ctx = ctx.template device_context<DeviceContext>();
        int num = in_t->numel();
        int block = 512;
        int grid = (num + block - 1) / block;
        KeRelu2<T><<<grid, block, 0, dev_ctx.stream()>>>(x, num, y);
      }
    };

    template <typename T>
    __global__ void KeRelu2Grad(const T* y, const T* dy, const int num, T* dx) {
      int gid = blockIdx.x * blockDim.x + threadIdx.x;
      for (int i = gid; i < num; i += blockDim.x * gridDim.x) {
        dx[i] = dy[i] * (y[i] > 0 ? 1. : 0.);
      }
    }

    // GPU implementation of the backward OP kernel
    template <typename DeviceContext, typename T>
    class Relu2GradCUDAKernel : public framework::OpKernel<T> {
     public:
      void Compute(const framework::ExecutionContext& ctx) const override {
        auto* dy_t = ctx.Input<Tensor>(framework::GradVarName("Y"));
        auto* y_t = ctx.Input<Tensor>("Y");
        auto* dx_t = ctx.Output<Tensor>(framework::GradVarName("X"));
        auto dy = dy_t->data<T>();
        auto y = y_t->data<T>();
        auto dx = dx_t->mutable_data<T>(ctx.GetPlace());

        auto& dev_ctx = ctx.template device_context<DeviceContext>();
        int num = dy_t->numel();
        int block = 512;
        int grid = (num + block - 1) / block;
        KeRelu2Grad<T><<<grid, block, 0, dev_ctx.stream()>>>(y, dy, num, dx);
      }
    };

    }  // namespace operators
    }  // namespace paddle

    using CUDA = paddle::platform::CUDADeviceContext;
    // Register the forward GPU kernel
    REGISTER_OP_CUDA_KERNEL(relu2,
                            paddle::operators::Relu2CUDAKernel<CUDA, float>,
                            paddle::operators::Relu2CUDAKernel<CUDA, double>);
    // Register the backward GPU kernel
    REGISTER_OP_CUDA_KERNEL(relu2_grad,
                            paddle::operators::Relu2GradCUDAKernel<CUDA, float>,
                            paddle::operators::Relu2GradCUDAKernel<CUDA, double>);

1. The type of the OP must not be the same as that of an existing PaddlePaddle OP; otherwise an error is reported when it is used from Python.
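## Compiling the Custom OP
Compilation needs to include PaddlePaddle's header files, such as `paddle/fluid/framework/op_registry.h` in the code above, and to link against PaddlePaddle's libraries. Their locations can be obtained as follows:

    # python
    >>> import paddle
    >>> print(paddle.sysconfig.get_include())
    >>> print(paddle.sysconfig.get_lib())

The dynamic library can then be built with the following commands:

    include_dir=$( python -c 'import paddle; print(paddle.sysconfig.get_include())' )
    lib_dir=$( python -c 'import paddle; print(paddle.sysconfig.get_lib())' )

    echo $include_dir
    echo $lib_dir

    # PaddlePaddle >= 1.6.1 only needs ${include_dir} and ${include_dir}/third_party on the include path
    nvcc relu_op.cu -c -o relu_op.cu.o -ccbin cc -DPADDLE_WITH_CUDA -DEIGEN_USE_GPU -DPADDLE_USE_DSO -DPADDLE_WITH_MKLDNN -Xcompiler -fPIC -std=c++11 -Xcompiler -fPIC -w --expt-relaxed-constexpr -O3 -DNVCC \
        -I ${include_dir} \
        -I ${include_dir}/third_party

    g++ relu_op.cc relu_op.cu.o -o relu2_op.so -shared -fPIC -std=c++11 -O3 -DPADDLE_WITH_MKLDNN \
        -I ${include_dir} \
        -I ${include_dir}/third_party \
        -L /usr/local/cuda/lib64 \
        -L ${lib_dir} -lpaddle_framework -lcudart

1. When compiling CUDA sources with NVCC, add the options `-DPADDLE_WITH_CUDA -DEIGEN_USE_GPU -DPADDLE_USE_DSO`. The framework source uses these macros for conditional compilation, so the options used to compile a custom C++ OP must match those used to build the core framework. `EIGEN_USE_GPU`, for example, is required when the GPU implementation of the Eigen math library is used.
2. If the PaddlePaddle package you installed does not contain the MKLDNN library, remove the option `-DPADDLE_WITH_MKLDNN`. The core framework source (tensor.h, for example) uses this macro for conditional compilation, so whether it is set must also match the build of the core framework. The default PaddlePaddle package contains the MKLDNN library.
3. Several OPs can be compiled into the same dynamic library.
4. PaddlePaddle installed through pip is built with GCC 4.8. Because the **C++11 ABI of GCC 4.8 is incompatible** with that of GCC 5 and later, your custom OP must be compiled with GCC 4.8. To use a custom OP in an environment with GCC 5 or later, we recommend [installing PaddlePaddle with Docker](https://www.paddlepaddle.org.cn/install/doc/docker), so that Paddle and the custom OP are compiled with the same GCC version.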
## 封装Python Layer接口
需要使用 `fluid.load_op_library` 接口调用加载动态库,使得PaddlePaddle的主进程中可以使用用户自定义的OP。
# custom_op.py
import paddle.fluid as fluid
# Load the dynamic library with load_op_library
fluid.load_op_library('relu2_op.so')
from paddle.fluid.layer_helper import LayerHelper
def relu2(x, name=None):
    # The type of relu2 must match the type defined in the OP
    helper = LayerHelper("relu2", **locals())
    # Create the output Variable
    out = helper.create_variable_for_type_inference(dtype=x.dtype)
    helper.append_op(type="relu2", inputs={"X": x}, outputs={"Y": out})
    return out
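1. A dynamic library only needs to be loaded once with `fluid.load_op_library`, after `paddle.fluid` has been imported.
2. The Python interface is wrapped in the same way as inside the PaddlePaddle framework. For more examples, read the code in `python/paddle/fluid/layers/nn.py` in the source tree.
## Unit test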
import numpy as np
import paddle.fluid as fluid
from custom_op import relu2
data = fluid.layers.data(name='data', shape=[32], dtype='float32')
relu = relu2(data)
use_gpu = True # or False
place = fluid.CUDAPlace(0) if use_gpu else fluid.CPUPlace()
exe = fluid.Executor(place)
x = np.random.uniform(-1, 1, [4, 32]).astype('float32')
out, = exe.run(feed={'data': x}, fetch_list=[relu])
assert np.allclose(out, np.maximum(x, 0.))
## How to use it in the C++ inference library
## FAQ
1. Q: What should I do if errors like `relu2_op.so: cannot open shared object file: No such file or directory` and `libpaddle_framework.so: cannot open shared object file: No such file or directory` appear?
A: Add the directory containing `relu2_op.so` and the directory containing `libpaddle_framework.so` (the path returned by `paddle.sysconfig.get_lib()`) to the `LD_LIBRARY_PATH` environment variable:
# Suppose relu2_op.so is located in `paddle/test`; on Linux, set:
export LD_LIBRARY_PATH=paddle/test:$( python -c 'import paddle; print(paddle.sysconfig.get_lib())'):$LD_LIBRARY_PATH
- `How to write a new C++ op <./new_op.html>`_
- `Notes on C++ ops <./op_notes.html>`_
- `How to write a new Python op <./new_python_op.html>`_
- `How to customize a C++ op outside the framework <./custom_op.html>`_
.. toctree::
Write New Operators
This section guides you through adding an operator and includes some necessary notes.
- `How to write new operator <new_op_en.html>`_ : a guide to writing new operators
- `op notes <op_notes_en.html>`_ : notes on developing new operators
.. toctree::
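# How to write a new C++ OP
## Concepts
- `framework::OperatorBase`: the base class of Operator (abbreviated as Op).
- `framework::OpKernel`: the base class of an Op's computation function, called the Kernel.
- `framework::OperatorWithKernel`: inherits from OperatorBase; an Op of this kind has a computation function and is said to have a Kernel.
- `framework::OpProtoAndCheckerMaker`: describes the Op's inputs, outputs, attributes, and comments; mainly used to generate the Python API.
- An Op with a Kernel inherits from `OperatorWithKernel`. Its implementation depends on the input data type, the data layout, the device holding the data, and the third-party library the Op calls. Take ConvOp as an example: on the CPU it is usually implemented by calling the matrix multiplication in the mkl library; on the GPU it is usually implemented by calling the matrix multiplication in the cublas library, or by directly calling the convolution in the cudnn library.
- An Op without a Kernel inherits from `OperatorBase`, because its implementation does not depend on the device or the input data. Examples are WhileOp and IfElseOp.
<td>OpProtoMake definition </td>
<td>.cc file </td>
<td>Op definition </td>
<td> .cc file</td>
<td>Kernel implementation </td>
<td> A Kernel shared by CPU and CUDA is implemented in the .h file; otherwise, the CPU implementation is in the .cc file and the CUDA implementation is in the .cu file.</td>
<td>Registering the Op </td>
<td> The Op is registered in the .cc file; for Kernel registration, the CPU implementation is in the .cc file and the CUDA implementation is in the .cu file</td>
## Implementing the C++ classes
### Defining the ProtoMaker class
The formula of matrix multiplication is $Out = X * Y$, so the computation consists of two inputs and one output.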
class MulOpMaker : public framework::OpProtoAndCheckerMaker {
void Make() override {
AddInput("X", "(Tensor), The first input tensor of mul op.");
AddInput("Y", "(Tensor), The second input tensor of mul op.");
AddOutput("Out", "(Tensor), The output tensor of mul op.");
"(bool, default false) Only used in mkldnn kernel")
R"DOC((int, default 1), The mul_op can take tensors with more than two
dimensions as its inputs. If the input $X$ is a tensor with more
than two dimensions, $X$ will be flattened into a two-dimensional
matrix first. The flattening rule is: the first `num_col_dims`
will be flattened to form the first dimension of the final matrix
(the height of the matrix), and the rest `rank(X) - num_col_dims`
dimensions are flattened to form the second dimension of the final
matrix (the width of the matrix). As a result, height of the
flattened matrix is equal to the product of $X$'s first
`x_num_col_dims` dimensions' sizes, and width of the flattened
matrix is equal to the product of $X$'s last `rank(x) - num_col_dims`
dimensions' size. For example, suppose $X$ is a 6-dimensional
tensor with the shape [2, 3, 4, 5, 6], and `x_num_col_dims` = 3.
Thus, the flattened matrix will have a shape [2 x 3 x 4, 5 x 6] =
[24, 30].
R"DOC((int, default 1), The mul_op can take tensors with more than two,
dimensions as its inputs. If the input $Y$ is a tensor with more
than two dimensions, $Y$ will be flattened into a two-dimensional
matrix first. The attribute `y_num_col_dims` determines how $Y$ is
flattened. See comments of `x_num_col_dims` for more details.
"scale_x to be used for int8 mul input data x. scale_x has the"
"same purpose as scale_in in OPs that support quantization."
"Only to be used with MKL-DNN INT8")
"scale_y to be used for int8 mul input data y. scale_y has the"
"same purpose as scale_weights in OPs that support quantization."
"Only to be used with MKL-DNN INT8")
"scale_out to be used for int8 output data."
"Only used with MKL-DNN INT8")
"(bool, default false) Force quantize kernel output FP32, only "
"used in quantized MKL-DNN.")
Mul Operator.
This operator is used to perform matrix multiplication for input $X$ and $Y$.
The equation is:
$$Out = X * Y$$
Both the input $X$ and $Y$ can carry the LoD (Level of Details) information,
or not. But the output only shares the LoD information with input $X$.
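### Defining the GradOpMaker class
Take `relu` as an example. Its forward operation is `out.device(d) = x.cwiseMax(static_cast<T>(0));` and its backward operation is `dx.device(d) = dout * (out > static_cast<T>(0)).template cast<T>();`. Clearly the backward operation uses only `out`, `dout`, and `dx`, and never `x`. Because the default `DefaultGradOpMaker` passes all forward inputs and outputs to the backward Op and unused variables waste memory, using it is usually not recommended.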
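To make the point about unused inputs concrete, here is a minimal pure-Python sketch. The function names `relu_forward` and `relu_backward` are illustrative only and are not Paddle APIs. It shows that the ReLU gradient can be computed from `out` and `dout` alone, so a customized grad maker does not have to hand `x` to the backward Op.

```python
def relu_forward(x):
    """Element-wise ReLU on a flat list: out = max(x, 0)."""
    return [max(v, 0.0) for v in x]


def relu_backward(out, dout):
    """Gradient of ReLU with respect to its input.

    dx = dout * (out > 0); the forward input `x` is never needed.
    """
    return [g if o > 0 else 0.0 for o, g in zip(out, dout)]


x = [-1.5, 0.0, 2.0, 3.5]
out = relu_forward(x)
dx = relu_backward(out, [1.0, 1.0, 1.0, 1.0])
```

Since `relu_backward` never reads `x`, the forward input does not need to be kept alive for the backward pass in this sketch.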
template <typename T>
class MulOpGradMaker : public framework::SingleGradOpMaker<T> {
using framework::SingleGradOpMaker<T>::SingleGradOpMaker;
void Apply(GradOpPtr<T> retv) const override {
retv->SetInput("X", this->Input("X"));
retv->SetInput("Y", this->Input("Y"));
retv->SetInput(framework::GradVarName("Out"), this->OutputGrad("Out"));
retv->SetOutput(framework::GradVarName("X"), this->InputGrad("X"));
retv->SetOutput(framework::GradVarName("Y"), this->InputGrad("Y"));
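- Some Ops have the same forward and backward logic, for example [`ScaleOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/scale_op.cc). In this case the forward Op and the backward Op can share the same Kernel.
- Some forward Ops may correspond to more than one backward Op, for example [`SumOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/sum_op.cc). In this case `GradMaker` needs to inherit from `framework::GradOpDescMakerBase`.
- The backward of some Ops is the forward of another Op, for example [`SplitOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/split_op.h). In this case the type of the backward Op of `SplitOp` defined in [`SplitGradMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/split_op.h#L157) is `concat`.
- To support both the imperative programming mode (dynamic graph) and the declarative programming mode (static graph) efficiently, `SingleGradOpMaker` is a class template. When registering the Operator, both `MulOpGradMaker<OpDesc>` (used in declarative mode) and `MulOpGradMaker<OpBase>` (used in imperative mode) must be registered.
### Defining the Operator class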
class MulOp : public framework::OperatorWithKernel {
using framework::OperatorWithKernel::OperatorWithKernel;
void InferShape(framework::InferShapeContext* ctx) const override {
ctx->HasInput("X"), true,
platform::errors::NotFound("Input(X) of MulOp should not be null."));
ctx->HasInput("Y"), true,
platform::errors::NotFound("Input(Y) of MulOp should not be null."));
ctx->HasOutput("Out"), true,
platform::errors::NotFound("Output(Out) of MulOp should not be null."));
auto x_dims = ctx->GetInputDim("X");
auto y_dims = ctx->GetInputDim("Y");
int x_num_col_dims = ctx->Attrs().Get<int>("x_num_col_dims");
int y_num_col_dims = ctx->Attrs().Get<int>("y_num_col_dims");
VLOG(3) << "mul operator x.shape=" << x_dims << " y.shape=" << y_dims
<< " x_num_col_dims=" << x_num_col_dims
<< " y_num_col_dims=" << y_num_col_dims;
PADDLE_ENFORCE_NE(framework::product(y_dims), 0,
"The Input variable Y(%s) has not "
"been initialized. You may need to confirm "
"if you put exe.run(startup_program) "
"after optimizer.minimize function.",
x_dims.size(), x_num_col_dims,
"The input tensor X's dimensions of MulOp "
"should be larger than x_num_col_dims. But received X's "
"dimensions = %d, X's shape = [%s], x_num_col_dims = %d.",
x_dims.size(), x_dims, x_num_col_dims));
y_dims.size(), y_num_col_dims,
"The input tensor Y's dimensions of MulOp "
"should be larger than y_num_col_dims. But received Y's "
"dimensions = %d, Y's shape = [%s], y_num_col_dims = %d.",
y_dims.size(), y_dims, y_num_col_dims));
auto x_mat_dims = framework::flatten_to_2d(x_dims, x_num_col_dims);
auto y_mat_dims = framework::flatten_to_2d(y_dims, y_num_col_dims);
x_mat_dims[1], y_mat_dims[0],
"After flatten the input tensor X and Y to 2-D dimensions "
"matrix X1 and Y1, the matrix X1's width must be equal with matrix "
"Y1's height. But received X's shape = [%s], X1's shape = [%s], "
"X1's "
"width = %s; Y's shape = [%s], Y1's shape = [%s], Y1's height = "
x_dims, x_mat_dims, x_mat_dims[1], y_dims, y_mat_dims,
std::vector<int64_t> output_dims;
static_cast<size_t>(x_num_col_dims + y_dims.size() - y_num_col_dims));
for (int i = 0; i < x_num_col_dims; ++i) {
for (int i = y_num_col_dims; i < y_dims.size(); ++i) {
ctx->SetOutputDim("Out", framework::make_ddim(output_dims));
ctx->ShareLoD("X", /*->*/ "Out");
framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext& ctx) const {
framework::LibraryType library = framework::LibraryType::kPlain;
framework::DataLayout layout = framework::DataLayout::kAnyLayout;
int customized_type_value =
auto input_data_type = OperatorWithKernel::IndicateVarDataType(ctx, "X");
if (library == framework::LibraryType::kPlain &&
platform::CanMKLDNNBeUsed(ctx)) {
library = framework::LibraryType::kMKLDNN;
layout = framework::DataLayout::kMKLDNN;
if (input_data_type == framework::DataTypeTrait<int8_t>::DataType() ||
input_data_type == framework::DataTypeTrait<uint8_t>::DataType()) {
customized_type_value = kMULMKLDNNINT8;
return framework::OpKernelType(input_data_type, ctx.GetPlace(), layout,
library, customized_type_value);
using framework::OperatorWithKernel::OperatorWithKernel;
MulOp(const std::string &type, const framework::VariableNameMap &inputs,
const framework::VariableNameMap &outputs,
const framework::AttributeMap &attrs)
: OperatorWithKernel(type, inputs, outputs, attrs) {}
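In addition, an Operator class usually needs to override the `InferShape` interface and, when necessary, the `GetExpectedKernelType` interface. `InferShape` is a const function and cannot modify the Op's member variables. Its parameter is `framework::InferShapeContext* ctx`, through which the inputs, outputs, and attributes can be accessed. Its responsibilities are:
- Validate and report errors as early as possible: check whether the dimensions, types, and so on of the input data are legal.
- Set the shape and LoD information of the output Tensor.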
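The two responsibilities of `InferShape` can be sketched in plain Python for the `mul` case. This is only an illustration under simplifying assumptions: `flatten_to_2d` and `infer_mul_shape` are hypothetical helper names mirroring the C++ logic above, all dimensions are assumed to be known (no -1), and LoD is ignored.

```python
from functools import reduce
from operator import mul


def flatten_to_2d(dims, num_col_dims):
    """Collapse `dims` into [height, width], splitting at `num_col_dims`."""
    height = reduce(mul, dims[:num_col_dims], 1)
    width = reduce(mul, dims[num_col_dims:], 1)
    return [height, width]


def infer_mul_shape(x_dims, y_dims, x_num_col_dims=1, y_num_col_dims=1):
    """Check the inputs early, then return the output shape of `mul`."""
    if len(x_dims) <= x_num_col_dims:
        raise ValueError("X's rank must be larger than x_num_col_dims")
    if len(y_dims) <= y_num_col_dims:
        raise ValueError("Y's rank must be larger than y_num_col_dims")
    x_mat = flatten_to_2d(x_dims, x_num_col_dims)
    y_mat = flatten_to_2d(y_dims, y_num_col_dims)
    if x_mat[1] != y_mat[0]:
        raise ValueError(
            "width of flattened X (%d) must equal height of flattened Y (%d)"
            % (x_mat[1], y_mat[0]))
    # The output keeps X's leading dims and Y's trailing dims.
    return list(x_dims[:x_num_col_dims]) + list(y_dims[y_num_col_dims:])
```

For instance, `flatten_to_2d([2, 3, 4, 5, 6], 3)` gives `[24, 30]`, matching the example in the `x_num_col_dims` description.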
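`GetExpectedKernelType` is the method of the OperatorWithKernel class used to obtain the OpKernel for a given device (for example CPU or GPU) and a given data type (for example double or float). For how to override it, see [Notes on writing C++ OPs](op_notes.html#getexpectedkerneltype).
### Distinguishing compile time and run time in InferShape
In a declarative-mode network, `InferShape` is called at both [compile time and run time](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/getstarted/Developer's_Guide_to_Paddle_Fluid.md#%E8%AE%A9%E6%88%91%E4%BB%AC%E5%9C%A8fluid%E7%A8%8B%E5%BA%8F%E5%AE%9E%E4%BE%8B%E4%B8%AD%E5%8C%BA%E5%88%86%E7%BC%96%E8%AF%91%E6%97%B6%E5%92%8C%E8%BF%90%E8%A1%8C%E6%97%B6). At compile time the real dimensions are unknown and the framework represents them with -1; at run time the actual dimensions are used. The value of a dimension may therefore differ between compile time and run time, and if `InferShape` compares or computes with dimensions, it must distinguish the two.
The following two cases need to distinguish compile time from run time.
**1. Checks**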
auto x_dim = ctx->GetInputDim("X");
int i = xxx;
PADDLE_ENFORCE_GT( x_dim[i] , 10)
At compile time, x_dim[i] may be -1, which makes this PADDLE_ENFORCE_GT fail and exit.
PADDLE_ENFORCE_EQ ( x_dim[i] , 10)
PADDLE_ENFORCE_NE ( x_dim[i] , 10)
PADDLE_ENFORCE_GT ( x_dim[i] , 10)
PADDLE_ENFORCE_GE ( x_dim[i] , 10)
PADDLE_ENFORCE_LT ( x_dim[i] , 10)
PADDLE_ENFORCE_LE ( x_dim[i] , 10)
All of these checks need to distinguish compile time from run time.
**2. Arithmetic**
auto x_dim = ctx->GetInputDim("X");
int i = xxx;
y_dim[0] = x_dim[i] + 10
At compile time, x_dim[i] may be -1, which makes y_dim[0] equal to 9; this is not logical.
y_dim[i] = x_dim[i] + 10
y_dim[i] = x_dim[i] - 10
y_dim[i] = x_dim[i] * 10
y_dim[i] = x_dim[i] / 10
y_dim[i] = x_dim[i] + z_dim[i]
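All of these operations need to distinguish compile time from run time.
- Checks: at compile time, do not check dimensions that are -1; check them at run time instead.
- Arithmetic: any operation between -1 and another number must yield -1.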
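The two rules above can be sketched in a few lines of Python. The names `UNKNOWN`, `add_dims`, and `check_dim_greater` are hypothetical and only illustrate the idea; they are not framework functions.

```python
UNKNOWN = -1  # placeholder for a dimension that is unknown at compile time


def add_dims(a, b):
    """Add two dimension sizes; the result is unknown if either operand is."""
    if a == UNKNOWN or b == UNKNOWN:
        return UNKNOWN
    return a + b


def check_dim_greater(dim, bound, is_runtime):
    """Enforce dim > bound, but skip unknown dimensions at compile time."""
    if not is_runtime and dim == UNKNOWN:
        return  # cannot decide yet; the check is repeated at run time
    if dim <= bound:
        raise ValueError("expected dimension > %d, received %d" % (bound, dim))
```

With this sketch, `add_dims(-1, 10)` stays -1 instead of becoming 9, and `check_dim_greater(-1, 10, is_runtime=False)` does not raise.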
1. For an implementation of the check, see [cross_entropy_op](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/cross_entropy_op.cc#L39). cross_entropy_op requires that its two inputs, X and labels, have identical dimensions except for the last one.
bool contain_unknown_dim = framework::contain_unknown_dim(x_dims) ||
bool check = ctx->IsRuntime() || !contain_unknown_dim;
if (check) {
PADDLE_ENFORCE_EQ(framework::slice_ddim(x_dims, 0, rank - 1),
framework::slice_ddim(label_dims, 0, rank - 1),
"Input(X) and Input(Label) shall have the same shape "
"except the last dimension.");
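2. For an implementation of the arithmetic, see [concat_op](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/concat_op.cc#L59). In InferShape, concat calls `ComputeAndCheckShape`, which checks that all dimensions other than the concat axis are identical. When producing the output dimensions, the sizes along the concat axis are summed and the other dimensions are kept the same as the inputs.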
const size_t n = inputs_dims.size();
auto out_dims = inputs_dims[0];
size_t in_zero_dims_size = out_dims.size();
for (size_t i = 1; i < n; i++) {
for (size_t j = 0; j < in_zero_dims_size; j++) {
if (j == axis) {
if (is_runtime) {
out_dims[axis] += inputs_dims[i][j];
} else {
if (inputs_dims[i][j] == -1) {
out_dims[axis] = -1;
} else {
out_dims[axis] += inputs_dims[i][j];
} else {
bool check_shape =
is_runtime || (out_dims[j] > 0 && inputs_dims[i][j] > 0);
if (check_shape) {
// check all shape in run time
inputs_dims[0][j], inputs_dims[i][j],
"ShapeError: Dimension %d in inputs' shapes must be equal. "
"But recevied input[0]'s shape = "
"[%s], input[%d]'s shape = [%s].",
j, inputs_dims[0], i, inputs_dims[i]);
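For readers who prefer to experiment outside the framework, the same shape logic can be approximated in Python. This is a sketch, not the framework implementation: `concat_out_dims` is a hypothetical name, and, as an extra assumption not shown in the C++ excerpt, it also keeps the concat axis unknown once an earlier input was unknown.

```python
def concat_out_dims(inputs_dims, axis, is_runtime):
    """Compute the output shape of concat along `axis`.

    At compile time (-1 means unknown) the concat axis becomes -1 as soon
    as one input is unknown, and equality checks on the other axes are
    skipped for unknown dimensions.
    """
    out_dims = list(inputs_dims[0])
    for dims in inputs_dims[1:]:
        for j, d in enumerate(dims):
            if j == axis:
                if is_runtime:
                    out_dims[axis] += d
                elif d == -1 or out_dims[axis] == -1:
                    out_dims[axis] = -1
                else:
                    out_dims[axis] += d
            else:
                check_shape = is_runtime or (out_dims[j] > 0 and d > 0)
                if check_shape and out_dims[j] != d:
                    raise ValueError(
                        "Dimension %d in inputs' shapes must be equal: "
                        "%s vs %s" % (j, inputs_dims[0], dims))
    return out_dims
```

At run time every dimension is known, so all non-concat axes are checked; at compile time only the dimensions that are already known are compared.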
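### Defining the OpKernel class
- `typename DeviceContext`: the device type. Add this template parameter when different devices (CPU, CUDA) share the same Kernel; omit it when they do not. An example that does not share is [`SGDOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/optimizers/sgd_op.h).
- `typename T`: the data type, such as `float`, `double`, or `int16`.
- `Compute` takes one input parameter: `const framework::ExecutionContext& context`.
- The `Compute` function implements the actual computation logic of the `OpKernel`.
**Note:** If an input or output variable of the op is a `LoDTensor` (in fluid, every `Tensor` is a `LoDTensor` by default), write `ExecutionContext::Input<LoDTensor>()` and `ExecutionContext::Output<LoDTensor>()` rather than `ExecutionContext::Input<Tensor>()` and `ExecutionContext::Output<Tensor>()`. If the actual variable type is `SelectedRows`, the `Input<Tensor>()` and `Output<Tensor>()` methods would specialize the `SelectedRows` type to `Tensor`, causing a potential error.
The implementation of `Compute` in `MulKernel` is as follows: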
template <typename DeviceContext, typename T>
class MulKernel : public framework::OpKernel<T> {
void Compute(const framework::ExecutionContext& context) const override {
const Tensor* x = context.Input<Tensor>("X");
const Tensor* y = context.Input<Tensor>("Y");
Tensor* z = context.Output<Tensor>("Out");
const Tensor x_matrix =
x->dims().size() > 2
? framework::ReshapeToMatrix(
*x, context.template Attr<int>("x_num_col_dims"))
: *x;
const Tensor y_matrix =
y->dims().size() > 2
? framework::ReshapeToMatrix(
*y, context.template Attr<int>("y_num_col_dims"))
: *y;
auto z_dim = z->dims();
if (z_dim.size() != 2) {
z->Resize({x_matrix.dims()[0], y_matrix.dims()[1]});
auto blas = math::GetBlas<DeviceContext, T>(context);
blas.MatMul(x_matrix, y_matrix, z);
if (z_dim.size() != 2) {
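To make the computation in an `OpKernel` easier to write and to reuse code between CPU and CUDA, we usually rely on the Eigen unsupported Tensor module to implement the `Compute` interface. For how to use the Eigen library in PaddlePaddle, see the [usage document](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/dev/use_eigen_cn.md).
### Registering the Operator
- In the `.cc` file, register the forward and backward Op classes and the CPU Kernels.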
namespace ops = paddle::operators;
REGISTER_OPERATOR(mul, ops::MulOp, ops::MulOpMaker, ops::MulOpInferVarType,
REGISTER_OPERATOR(mul_grad, ops::MulGradOp);
ops::MulKernel<paddle::platform::CPUDeviceContext, float>,
ops::MulKernel<paddle::platform::CPUDeviceContext, double>);
ops::MulGradKernel<paddle::platform::CPUDeviceContext, float>,
ops::MulGradKernel<paddle::platform::CPUDeviceContext, double>);
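- In the `.cu` file, register the CUDA Kernels.
- Note that if the CUDA Kernel is implemented with the Eigen unsupported module, add the macro definition `#define EIGEN_USE_GPU` at the beginning of the `.cu` file. For example: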
// if use Eigen unsupported module before include head files
namespace ops = paddle::operators;
ops::MulKernel<paddle::platform::CUDADeviceContext, float>,
ops::MulKernel<paddle::platform::CUDADeviceContext, double>);
ops::MulGradKernel<paddle::platform::CUDADeviceContext, float>,
ops::MulGradKernel<paddle::platform::CUDADeviceContext, double>);
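When an Op runs, the framework automatically selects a suitable OpKernel based on information such as the device holding the input data and the input data type. For example, if the input data is on the GPU and has type `float`, the framework selects `ops::MulKernel<paddle::platform::CUDADeviceContext, float>` registered by `REGISTER_OP_CUDA_KERNEL`. To specify which OpKernel can be called at run time, override the `GetExpectedKernelType` function of `framework::OperatorWithKernel`. For example, `MulOp` decides whether to call the mkldnn library for the computation depending on whether the attribute `use_mkldnn` is `false` or `true`.
### Compilation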
make mul_op
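## Python binding
### Building a Layer with the mul operation in Python
$$Out = Act({X*W + b})$$
## Implementing unit tests
### Unit test for the forward Operator
> Note: Configure inputs and outputs as `ndarray`s. To configure an input or output with `LOD`, pass a `tuple` containing two `ndarray` elements: the first is the actual data and the second is the `LOD`.
2. Generate random input data.
3. Implement the same computation logic as the forward operator in the Python script to obtain the output, and compare it with the output of the operator's forward computation.
4. Backward computation is already integrated into the test framework; just call the corresponding interface.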
import unittest
import numpy as np
from op_test import OpTest
class TestMulOp(OpTest):
    def setUp(self):
        self.op_type = "mul"
        self.inputs = {
            'X': np.random.random((32, 84)).astype("float32"),
            'Y': np.random.random((84, 100)).astype("float32")
        }
        self.outputs = {'Out': np.dot(self.inputs['X'], self.inputs['Y'])}
    def test_check_output(self):
        self.check_output()
    def test_check_grad_normal(self):
        self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.5)
    def test_check_grad_ingore_x(self):
        self.check_grad(
            ['Y'], 'Out', max_relative_error=0.5, no_grad_set=set("X"))
    def test_check_grad_ingore_y(self):
        self.check_grad(
            ['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y'))
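- `self.op_type = "mul"`: defines the type, which must match the type registered for the operator.
- `self.inputs`: defines the inputs, of type `numpy.array`, and initializes them.
- `self.outputs`: defines the outputs; the same computation as the operator is carried out in the Python script, and the Python result is returned.
### Unit test for the backward operator
- `test_check_grad_normal` calls `check_grad`, which uses a numerical method to check the correctness and stability of the gradients.
- The first parameter `["X", "Y"]`: specifies that the gradients of the input variables `X` and `Y` are checked.
- The second parameter `"Out"`: specifies the final output target variable `Out` of the forward network.
- The third parameter `max_relative_error`: specifies the maximum error that can be tolerated when checking gradients.
- `test_check_grad_ingore_x` and `test_check_grad_ingore_y` test the cases where the gradient of only one input needs to be computed.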
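The idea behind a numerical gradient check can be sketched without the framework. The snippet below is an illustration only: it assumes the scalar objective is the sum of all elements of `X * Y`, uses central finite differences, and defines its own relative-error formula; the helper names are hypothetical and the exact objective and error formula used by `OpTest.check_grad` may differ.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def loss(x, y):
    """Scalar objective: the sum of all elements of X * Y."""
    return sum(sum(row) for row in matmul(x, y))


def numeric_grad_x(x, y, delta=1e-3):
    """Central finite differences of `loss` with respect to each X[i][j]."""
    grad = [[0.0] * len(x[0]) for _ in x]
    for i in range(len(x)):
        for j in range(len(x[0])):
            orig = x[i][j]
            x[i][j] = orig + delta
            plus = loss(x, y)
            x[i][j] = orig - delta
            minus = loss(x, y)
            x[i][j] = orig  # restore the perturbed entry
            grad[i][j] = (plus - minus) / (2 * delta)
    return grad


def analytic_grad_x(x, y):
    """d(sum(X * Y)) / dX[i][j] equals the sum of row j of Y."""
    row_sums = [sum(row) for row in y]
    return [[row_sums[j] for j in range(len(x[0]))] for _ in x]


def max_relative_error(numeric, analytic):
    """Largest |numeric - analytic| relative to the analytic value."""
    worst = 0.0
    for n_row, a_row in zip(numeric, analytic):
        for n, a in zip(n_row, a_row):
            worst = max(worst, abs(n - a) / max(abs(a), 1e-3))
    return worst


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
y = [[0.5, -1.0], [2.0, 0.25], [1.5, 1.0]]
error = max_relative_error(numeric_grad_x(x, y), analytic_grad_x(x, y))
```

A test would then assert that `error` stays below the chosen tolerance, which plays the role of `max_relative_error` in the unit test above.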
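### Compiling and running
Any new `test_*.py` unit test added under `python/paddle/fluid/tests/unittests/` is automatically added to the project build.
Note that **running unit tests requires building the whole project**, with `WITH_TESTING` turned on at build time, that is, `cmake -DWITH_TESTING=ON ..`. After a successful build, run the unit test with the following command: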
make test ARGS="-R test_mul_op -V"
ctest -R test_mul_op
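## Notes
- The type name used when registering an Op must match the name of the Op. That is, registering `REGISTER_OPERATOR(B, ...)` in `A_op.cc` is not allowed and causes unit tests to fail.
- If an Op has no CUDA Kernel, do not create an empty `*_op.cu`; this causes unit tests to fail.
- If several Ops depend on shared functions, put them in a file whose name does not follow the `*_op.*` pattern, such as `gather.h`.
PADDLE_ENFORCE(expression, error message)
PADDLE_ENFORCE_EQ(object A, object B, error message)
#### General principles
Every use of PADDLE_ENFORCE or PADDLE_ENFORCE_XX must carry an explanatory message with an appropriate level of detail! <font color="#FF0000">**The error message must not be empty!**</font>
#### Standard for writing messages
1. [required] What went wrong, and why?
- For example: `ValueError: Mismatched label shape`
2. [optional] What input was expected, and what input was actually received?
- For example: `Expected labels dimension=1. Received 4.`
3. [optional] Can a fix be suggested?
- For example: `Suggested Fix: If your classifier expects one-hot encoding label, check your n_classes argument to the estimator and/or the shape of your label. Otherwise, check the shape of your label.`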
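The three-part standard can be illustrated with a small Python helper. `build_error_message` is a hypothetical function written only for this sketch; it is not part of PaddlePaddle, whose checks are written with the C++ `PADDLE_ENFORCE` macros.

```python
def build_error_message(what, expected=None, received=None, fix=None):
    """Assemble a message: what went wrong, expected vs. received, a fix."""
    if not what:
        # Mirrors the rule that the error message must not be empty.
        raise ValueError("the error description is required")
    parts = [what]
    if expected is not None and received is not None:
        parts.append("Expected %s. Received %s." % (expected, received))
    if fix is not None:
        parts.append("Suggested Fix: %s" % fix)
    return " ".join(parts)


message = build_error_message(
    "ValueError: Mismatched label shape.",
    expected="labels dimension=1",
    received="4",
    fix="check the shape of your label.")
```

Only the first part is mandatory; the expected/received pair and the suggested fix are appended when they are available.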
#### FAQ: typical problems
1. No error message, or one too simple to give users a useful hint!
Problem example 1: no message written
PADDLE_ENFORCE(ctx->HasInput("X"), "");
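Problem example 2: the message is too simple
PADDLE_ENFORCE(i != nullptr, "i must be set"); // What is i?
2. Developer-defined variable abbreviations are used in the error message and are hard to understand!
PADDLE_ENFORCE(forward_pd != nullptr,
"Fail to find eltwise_fwd_pd in device context"); // users may not understand eltwise_fwd_pd
3. Illegal interface called inside an OP: Output = ShareDataWith(Input) appears inside the Op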
auto *out = ctx.Output<framework::LoDTensor>("Out");
auto *in = ctx.Input<framework::LoDTensor>("X");
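If Output = ShareDataWith(Input) appears inside an Op, the operator graph effectively has a hidden edge connecting Input and Output. This edge cannot be expressed in graph analysis and causes errors in graph-based optimizations.
4. Performance practices for OP implementations
Calling Eigen operations such as broadcast and chop can be several times slower than a hand-written CUDA kernel. In this case the CPU implementation can reuse Eigen, while the GPU implementation can use a CUDA kernel.
#### Special notes on messages for OP InferShape checks
- When checking input and output variables, follow this format consistently
`Input(variable name) of OP name operator should not be null.`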
"Input(Input) of LSTMP operator should not be null.");
- For input and output checks of a backward Op, state the name of the backward Op
"Input(X) of LoDResetGrad operator should not be null.");
# How to write a new operator
<a name="Background"></a>
## Background
Here are the base types needed. For details, please refer to the design docs.
- `class OpProtoAndCheckerMaker`: Describes an Operator's input, output, attributes and description, mainly used to interface with Python API.
- `framework::OperatorBase`: Operator (Op) base class.
- `framework::OpKernel`: Base class for Op computation kernel.
- `framework::OperatorWithKernel`: Inherited from OperatorBase, describing an operator with computation kernels.
Operators can be categorized into two groups: operators with kernel(s) and operators without kernel(s). An operator with kernel(s) inherits from `OperatorWithKernel`, while one without kernel(s) inherits from `OperatorBase`. This tutorial focuses on implementing operators with kernels. In short, an operator includes the following information:
<th> Where is it defined</th>
<td>OpProtoMake definition </td>
<td> `.cc` files. A backward Op does not need an OpProtoMake interface. </td>
<td>Op definition </td>
<td> `.cc` files</td>
<td>Kernel implementation </td>
<td> The kernel methods shared between CPU and CUDA are defined in `.h` files. CPU-specific kernels live in `.cc` files, while CUDA-specific kernels are implemented in `.cu` files.</td>
<td>Registering the Op </td>
<td> Ops are registered in `.cc` files; For Kernel registration, `.cc` files contain the CPU implementation, while `.cu` files contain the CUDA implementation.</td>
New Operator implementations are added to the directory [paddle/operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/operators), with file names in the format `*_op.h` (if applicable), `*_op.cc`, `*_op.cu` (if applicable). **The system will use the naming scheme to automatically build operators and their corresponding Python extensions.**
Let's take matrix multiplication operator, [MulOp](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc), as an example to introduce the writing of an Operator with Kernel.
<a name="Implementing C++ Types"></a>
## Implementing C++ Types
<a name="Defining ProtoMaker"></a>
### Defining ProtoMaker
Matrix Multiplication can be written as $Out = X * Y$, meaning that the operation consists of two inputs and one output.
First, define `ProtoMaker` to describe the Operator's input, output, and additional comments:
class MulOpMaker : public framework::OpProtoAndCheckerMaker {
MulOpMaker(OpProto *proto, OpAttrChecker *op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) {
AddInput("X", "(Tensor), 2D tensor of size (M x K)");
AddInput("Y", "(Tensor), 2D tensor of size (K x N)");
AddOutput("Out", "(Tensor), 2D tensor of size (M x N)");
Two Element Mul Operator.
The equation is: Out = X * Y
[`MulOpMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc#L76-L127) inherits from `framework::OpProtoAndCheckerMaker`. Its constructor takes two parameters:
- `framework::OpProto` stores the Operator's inputs, outputs, and attributes, and is used for generating Python API interfaces.
- `framework::OpAttrChecker` is used to validate attributes.
The constructor utilizes `AddInput` to add input parameter, `AddOutput` to add output parameter, and `AddComment` to add comments for the Op, so that the corresponding information will be added to `OpProto`.
The code above adds two inputs `X` and `Y` to `MulOp`, an output `Out`, and their corresponding descriptions. Names are given in accordance to Paddle's [naming convention](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/dev/name_convention.md).
An additional example [`ScaleOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/scale_op.cc#L38-L55) is implemented as follows:
template <typename AttrType>
class ScaleOpMaker : public framework::OpProtoAndCheckerMaker {
ScaleOpMaker(OpProto *proto, OpAttrChecker *op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) {
AddInput("X", "(Tensor) Input tensor of scale operator.");
AddOutput("Out", "(Tensor) Output tensor of scale operator.");
Scale operator
$$Out = scale*X$$
"(float, default 1.0)"
"The scaling factor of the scale operator.")
Note that `AddAttr<AttrType>("scale", "...").SetDefault(1.0);` adds the constant `scale` as an attribute and sets its default value to 1.0.
<a name="Defining the GradProtoMaker class"></a>
### Defining the GradProtoMaker class
Each Op must have a corresponding GradProtoMaker. If GradProtoMaker corresponding to the forward Op is not customized, Fluid provides DefaultGradProtoMaker. The default registration will use all input and output, including Input, Output, Output@Grad and so on. Using unnecessary variables will cause waste of memory.
The following example defines ScaleOp's GradProtoMaker.
class ScaleGradMaker : public framework::SingleGradOpDescMaker {
using framework::SingleGradOpDescMaker::SingleGradOpDescMaker;
std::unique_ptr<framework::OpDesc> Apply() const override {
auto *grad_op = new framework::OpDesc();
grad_op->SetInput("X", OutputGrad("Out"));
grad_op->SetOutput("Out", InputGrad("X"));
grad_op->SetAttr("scale", GetAttr("scale"));
return std::unique_ptr<framework::OpDesc>(grad_op);
<a name="Defining Operator"></a>
### Defining Operator
The following code defines the interface for MulOp:
class MulOp : public framework::OperatorWithKernel {
using framework::OperatorWithKernel::OperatorWithKernel;
void InferShape(const framework::InferShapeContext &ctx) const override {
// Never use Input<Tensor> or Output<Tensor> if you want to get a LoDTensor.
auto dim0 = ctx.Input<LoDTensor>("X")->dims();
auto dim1 = ctx.Input<LoDTensor>("Y")->dims();
PADDLE_ENFORCE_EQ(dim0.size(), 2,
"input X(%s) should be a tensor with 2 dims, a matrix",
PADDLE_ENFORCE_EQ(dim1.size(), 2,
"input Y(%s) should be a tensor with 2 dims, a matrix",
dim0[1], dim1[0],
"First matrix's width must be equal with second matrix's height.");
ctx.Output<LoDTensor>("Out")->Resize({dim0[0], dim1[1]});
[`MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc#L24) is inherited from `OperatorWithKernel`. Its `public` member
using framework::OperatorWithKernel::OperatorWithKernel;
expresses an operator constructor using base class `OperatorWithKernel`, alternatively written as
MulOp(const std::string &type, const framework::VariableNameMap &inputs,
const framework::VariableNameMap &outputs,
const framework::AttributeMap &attrs)
: OperatorWithKernel(type, inputs, outputs, attrs) {}
The `InferShape` interface needs to be overridden. `InferShape` is a const method and cannot modify the Op's member variables. Its parameter `const framework::InferShapeContext &ctx` can be used to extract inputs, outputs, and attributes. Its functions are to:
- 1). validate and report errors early: check input data dimensions and types.
- 2). configure the tensor shape of the output.
Usually `OpProtoMaker` and `Op` definitions are written in `.cc` files, which also include the registration methods introduced later.
<a name="Defining OpKernel"></a>
### Defining OpKernel
`MulKernel` is derived from `framework::OpKernel`, which includes the following templates:
- `typename DeviceContext` denotes device context type. When different devices, namely the CPU and the CUDA, share the same kernel, this template needs to be added. If they don't share kernels, this must not be added. An example of a non-sharing kernel is [`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/cross_entropy_op.h#L43).
- `typename T` denotes data type, such as `float` or `double`.
`MulKernel` needs to override the `Compute` interface.
- `Compute` takes one input parameter: `const framework::ExecutionContext& context`.
- Compared with `InferShapeContext`, `ExecutionContext` includes device types, and can similarly extract input, output, and attribute variables.
- The `Compute` function implements the computation logic of an `OpKernel`.
The input and output of Op can be obtained by `ExecutionContext::Input<T>()` and `ExecutionContext::Output<T>()` respectively.
**Note:** If the input/output variable type of op is `LoDTensor` (In Fluid, all Tensors are LoDTensor type by default), please write `ExecutionContext::Input<LoDTensor>()` and `ExecutionContext:: Output<LoDTensor>()`, do not write `ExecutionContext::Input<Tensor>()` and `ExecutionContext::Output<Tensor>()`. Because if the actual variable type is `SelectedRows`, the `Input<Tensor>()` and `Output<Tensor>()` methods will specialize the `SelectedRows` type to `Tensor`, causing a potential error.
`MulKernel`'s implementation of `Compute` is as follows:
template <typename DeviceContext, typename T>
class MulKernel : public framework::OpKernel<T> {
void Compute(const framework::ExecutionContext& context) const override {
auto* X = context.Input<LoDTensor>("X");
auto* Y = context.Input<LoDTensor>("Y");
auto* Z = context.Output<LoDTensor>("Out");
auto& device_context = context.template device_context<DeviceContext>();
math::matmul<DeviceContext, T>(*X, false, *Y, false, 1, Z, 0, device_context);
Note that **different devices (CPU, CUDA) share one Op definition; whether or not they share the same `OpKernel` depends on whether the functions called by `Compute` can support both devices.**
`MulOp`'s CPU and CUDA share the same `Kernel`. A non-sharing `OpKernel` example can be seen in [`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/cross_entropy_op.cc).
To ease the writing of `OpKernel` compute, and for reusing code cross-device, [`Eigen-unsupported Tensor`](https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/README.md?fileviewer=file-view-default) module is used to implement `Compute` interface. To learn about how the Eigen library is used in PaddlePaddle, please see [usage document](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/dev/use_eigen_cn.md).
This concludes the forward implementation of an operator. Next its operation and kernel need to be registered in a `.cc` file.
The definition of its corresponding backward operator, if applicable, is similar to that of a forward operator. **Note that a backward operator does not include a `ProtoMaker`**.
<a name="Registering Operator and OpKernel"></a>
### Registering Operator and OpKernel
- In `.cc` files, register forward and backward operator classes and the CPU kernel.
namespace ops = paddle::operators;
REGISTER_OPERATOR(mul, ops::MulOp, ops::MulOpMaker,
REGISTER_OPERATOR(mul_grad, ops::MulGradOp);
REGISTER_OP_CPU_KERNEL(mul, ops::MulKernel<paddle::platform::CPUDeviceContext, float>);
ops::MulGradKernel<paddle::platform::CPUDeviceContext, float>);
In that code block,
- `REGISTER_OPERATOR` registers the `ops::MulOp` class under the type name `mul`, with `ops::MulOpMaker` as its `ProtoMaker`. It also registers `ops::MulGradOp` under the type name `mul_grad`.
- `REGISTER_OP_CPU_KERNEL` registers the `ops::MulKernel` class, specializing its template parameters as `paddle::platform::CPUDeviceContext` and `float`, and also registers `ops::MulGradKernel`.
- Registering CUDA Kernel in `.cu` files
- Note that if the CUDA Kernel is implemented using the `Eigen unsupported` module, the macro definition `#define EIGEN_USE_GPU` is needed at the top of the `.cu` file, such as
// if use Eigen unsupported module before include head files
namespace ops = paddle::operators;
REGISTER_OP_CUDA_KERNEL(mul, ops::MulKernel<paddle::platform::CUDADeviceContext, float>);
ops::MulGradKernel<paddle::platform::CUDADeviceContext, float>);
<a name="Compilation"></a>
### Compilation
In folder `build/paddle/fluid/operators`, run the following commands to compile.
make mul_op
<a name="Python Binding"></a>
## Python Binding
The system will automatically bind the new op to Python and link it to a generated library.
<a name="Unit Tests"></a>
## Unit Tests
Unit tests for an operator include
1. comparing a forward operator's implementations on different devices (CPU, CUDA)
2. comparing a backward operator's implementation on different devices (CPU, CUDA)
3. a gradient test for the backward operator.
Here, we introduce the [unit tests for `MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/unittests/test_mul_op.py).
<a name="Unit Test for Forward Operators"></a>
### Unit Test for Forward Operators
The Op unit test is inherited from `OpTest`. More specific unit tests are done in `TestMulOp`. To test the Operator, you need to:
1. Define input, output, and related property parameters in the `setUp` function.
2. Generate random input data.
3. Implement the same calculation logic as the forward operator in the Python script to get the output, which is to be compared with the output of the forward operator calculation.
4. The backward calculation has been automatically integrated into the test framework and the corresponding interface can be called directly.
import unittest
import numpy as np
from op_test import OpTest
class TestMulOp(OpTest):
    def setUp(self):
        self.op_type = "mul"
        self.inputs = {
            'X': np.random.random((32, 84)).astype("float32"),
            'Y': np.random.random((84, 100)).astype("float32")
        }
        self.outputs = {'Out': np.dot(self.inputs['X'], self.inputs['Y'])}
    def test_check_output(self):
        self.check_output()
    def test_check_grad_normal(self):
        self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.5)
    def test_check_grad_ingore_x(self):
        self.check_grad(
            ['Y'], 'Out', max_relative_error=0.5, no_grad_set=set("X"))
    def test_check_grad_ingore_y(self):
        self.check_grad(
            ['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y'))
if __name__ == "__main__":
    unittest.main()
The code above first imports the required packages. In addition:
- `self.op_type = "mul"` sets the type, which must be identical to the operator's registered type.
- `self.inputs` defines the inputs as `numpy.array` values and initializes them.
- `self.outputs` defines the expected output by repeating the operator's computation in the Python script and storing the result.
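
The comparison that the forward test relies on can be sketched without the framework. The following is only an illustration of the idea, not the `OpTest` implementation: `reference_matmul`, `transpose`, and `outputs_match` are hypothetical helper names, the tolerance is an assumed value, and the "operator output" is simulated by computing the same product through transposes.

```python
import random


def reference_matmul(x, y):
    """Plain-Python reference for Out = X * Y on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def transpose(m):
    return [list(col) for col in zip(*m)]


def outputs_match(actual, expected, atol=1e-5):
    """Element-wise comparison with an absolute tolerance (assumed value)."""
    return all(abs(a - e) <= atol
               for row_a, row_e in zip(actual, expected)
               for a, e in zip(row_a, row_e))


random.seed(0)
x = [[random.random() for _ in range(4)] for _ in range(3)]
y = [[random.random() for _ in range(2)] for _ in range(4)]

# Expected output, computed by the reference implementation.
expected = reference_matmul(x, y)

# Stand-in for the operator under test: (Y^T * X^T)^T equals X * Y.
actual = transpose(reference_matmul(transpose(y), transpose(x)))

print(outputs_match(actual, expected))
```

In a real test the `actual` value would come from running the registered operator; only the reference computation and the tolerance-based comparison carry over.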
<a name="Unit test for backward operators"></a>
### Unit Test for Backward Operators
In the backward operator test:
- `check_grad` is called in `test_check_grad_normal` to use numerical methods to detect gradient correctness and stability.
- The first parameter `["X", "Y"]` : specifies gradient check for the input variables `X`, `Y`.
- The second parameter `"Out"` : specifies the final output target variable `Out` of the forward network.
- The third parameter `max_relative_error`: specifies the maximum error value that can be tolerated when checking gradients.
- The `test_check_grad_ingore_x` and `test_check_grad_ingore_y` tests cover the cases where the gradient of only one input needs to be calculated.
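
The idea behind a numerical gradient check can be sketched as follows. This is a simplified illustration under stated assumptions, not how `check_grad` is implemented: the scalar target is assumed to be the sum of all elements of `Out`, and `numeric_grad_x`, `analytic_grad_x`, and `max_relative_error` are hypothetical helpers.

```python
def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def loss(x, y):
    """Scalar target: the sum of all elements of Out = X * Y."""
    return sum(sum(row) for row in matmul(x, y))


def numeric_grad_x(x, y, delta=1e-3):
    """Central-difference estimate of d(loss)/dX."""
    grad = [[0.0] * len(x[0]) for _ in x]
    for i in range(len(x)):
        for j in range(len(x[0])):
            saved = x[i][j]
            x[i][j] = saved + delta
            plus = loss(x, y)
            x[i][j] = saved - delta
            minus = loss(x, y)
            x[i][j] = saved  # restore the original value
            grad[i][j] = (plus - minus) / (2 * delta)
    return grad


def analytic_grad_x(x, y):
    """For loss = sum(X * Y), d(loss)/dX[i][k] is the sum of row k of Y."""
    return [[sum(y[k]) for k in range(len(y))] for _ in x]


def max_relative_error(a, b, eps=1e-12):
    return max(abs(p - q) / max(abs(p), abs(q), eps)
               for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))


x = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
y = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
error = max_relative_error(numeric_grad_x(x, y), analytic_grad_x(x, y))
print(error < 1e-6)
```

A test of this kind passes when the largest relative difference between the numerical and the analytic gradient stays below the chosen tolerance, which is the role `max_relative_error` plays in the calls above.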
<a name="Compiling and Running"></a>
### Compiling and Running
Any new unit test file named `test_*.py` that is added to the directory `python/paddle/fluid/tests/unittests/` is automatically included in the project build.
Note that **running unit tests requires compiling the entire project** with the `WITH_TESTING` flag turned on, that is, `cmake paddle_dir -DWITH_TESTING=ON`.
After successfully compiling the project, run the following command to run unit tests:
make test ARGS="-R test_mul_op -V"
Or,
ctest -R test_mul_op
<a name="Remarks"></a>
## Remarks
- The type with which an operator is registered needs to be identical to the Op's name. Registering `REGISTER_OPERATOR(B, ...)` in `A_op.cc` will cause unit testing failures.
- If the operator does not implement a CUDA kernel, please refrain from creating an empty `*_op.cu` file, or else unit tests will fail.
- If multiple operators rely on some shared methods, a file NOT named `*_op.*` can be created to store them, such as `gather.h`.
<a name="PADDLE_ENFORCE Usage Note"></a>
### PADDLE_ENFORCE Usage Note
To check the validity of data when implementing an Op, use macros such as `PADDLE_ENFORCE` and `PADDLE_ENFORCE_EQ`. The basic format is as follows:
PADDLE_ENFORCE (expression, error message)
PADDLE_ENFORCE_EQ (comparison object A, comparison object B, error message)
If the expression is true, or comparison object A equals comparison object B, the check passes; otherwise the program terminates and the corresponding error message is reported to the user.
To keep these messages user-friendly and easy to understand, developers need to pay attention to how they are written.
<a name="General Principles"></a>
#### General Principles
Every use of `PADDLE_ENFORCE` and `PADDLE_ENFORCE_EQ` must come with a sufficiently detailed error message. The **error message** must not be empty!
<a name="Error Message Standard"></a>
#### Error Message Standard
1. [required] Where does it go wrong? Why is it wrong?
- For example: `ValueError: Mismatched label shape`
2. [optional] What is the expected input? What is the actual input?
- For example: `Expected labels dimension=1. Received 4.`
3. [optional] Can you come up with a suggestion?
- For example: `Suggested Fix: If your classifier expects one-hot encoding label, check your n_classes argument to the estimator and/or the shape of your label. Otherwise, check the shape of your label.`
If some of these points are unnecessary, or a concise description already expresses them clearly, write the message according to actual needs.
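
The three-part structure above can be illustrated with a small helper. This is a hypothetical sketch, not part of Paddle: `build_error_message` and its parameters are names invented here to show how the required part and the two optional parts fit together.

```python
def build_error_message(problem, expected=None, received=None, suggestion=None):
    """Compose an error message from the three parts of the standard.

    problem:    [required] where it goes wrong and why.
    expected /
    received:   [optional] the expected and the actual input.
    suggestion: [optional] a suggested fix.
    """
    if not problem:
        raise ValueError("the error message must not be empty")
    parts = [problem]
    if expected is not None and received is not None:
        parts.append("Expected {}. Received {}.".format(expected, received))
    if suggestion:
        parts.append("Suggested Fix: {}".format(suggestion))
    return " ".join(parts)


message = build_error_message(
    "ValueError: Mismatched label shape.",
    expected="labels dimension=1",
    received=4,
    suggestion="check the shape of your label.",
)
print(message)
```

Making the first part mandatory mirrors the rule that an error message can never be empty, while the optional parts are added only when they carry information.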
<a name="Typical Problems"></a>
#### Typical Problems
1. The error message is missing or too short to tell the user what went wrong.
Problem example 1: missing message
PADDLE_ENFORCE(ctx->HasInput("X"), "");
Problem example 2: the message is too short
PADDLE_ENFORCE(i != nullptr, "i must be set"); // What is i?
2. The error message uses developer-defined variable abbreviations, which are hard to understand.
Problem example:
PADDLE_ENFORCE(forward_pd != nullptr,
"Fail to find eltwise_fwd_pd in device context"); // users may not understand eltwise_fwd_pd
3. The Op calls an illegal interface internally, for example `Output = ShareDataWith(Input)`.
Problem example:
auto *out = ctx.Output<framework::LoDTensor>("Out");
auto *in = ctx.Input<framework::LoDTensor>("X");
out->ShareDataWith(*in);
If an Op contains `Output = ShareDataWith(Input)`, it creates a hidden edge in the operator graph that connects Input and Output. Graph analysis cannot express this edge, so optimizations based on the graph produce errors.
4. Poor performance of the Op implementation. Calling Eigen's broadcast, chop, and similar operations can be several times slower than a handwritten CUDA kernel. In such cases the CPU implementation can reuse Eigen, while the GPU implementation can use a handwritten CUDA kernel.
<a name="Special instructions for OP InferShape check message"></a>
#### Special Instructions for OP InferShape Check Message
- When checking input and output variables, use the following format:
`Input(variable name) of OP name operator should not be null.`
Correct example:
PADDLE_ENFORCE(ctx->HasInput("Input"),
"Input(Input) of LSTMP operator should not be null.");
- When checking the inputs and outputs of a backward Op, write the name of the backward Op.
Correct example:
PADDLE_ENFORCE(ctx->HasInput("X"),
"Input(X) of LoDResetGrad operator should not be null.");
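# How to Write a New Python Op
PaddlePaddle Fluid supports defining custom Ops on the Python side through the `py_func` interface. The design of `py_func` relies on the fact that a LoDTensor in Paddle can easily be converted to and from a numpy array, so a Python Op can be implemented with the numpy API.
## Overview of the py_func Interface
The signature of `py_func` is: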
def py_func(func, x, out, backward_func=None, skip_vars_in_backward_input=None):
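- `x` is the input of the Python Op. It can be a single `Variable`, a `tuple[Variable]`, or a `list[Variable]`. Multiple Variables are passed in as a `tuple[Variable]` or `list[Variable]`, where each Variable is a LoDTensor or Tensor.
- `out` is the output of the Python Op. It can be a single `Variable`, a `tuple[Variable]`, or a `list[Variable]`. Each Variable can be a LoDTensor or Tensor, or a numpy array.
- `func` is the forward function of the Python Op. When the forward pass of the network runs, the framework calls `out = func(*x)` to compute the forward output `out` from the forward input `x`. Inside `func`, it is recommended to convert each LoDTensor to a numpy array first so that numpy operations can be used freely; without the conversion, some operations may be incompatible.
- `backward_func` is the backward function of the Python Op. If `backward_func` is `None`, the Python Op has no backward computation; if `backward_func` is not `None`, the framework calls `backward_func` during the backward pass of the network to compute the gradient of the forward input `x`.
- `skip_vars_in_backward_input` specifies the inputs that the backward function `backward_func` does not need. It can be a single `Variable`, a `tuple[Variable]`, or a `list[Variable]`.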
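## How to Write a Python Op with py_func
The following uses tanh as an example to show how to write a Python Op with `py_func`.
- Step 1: define the forward and backward functions
If the forward function takes the inputs `x_1`, `x_2`, ..., `x_n` and produces the outputs `y_1`, `y_2`, ..., `y_m`, it is defined in the following form:
def forward_func(x_1, x_2, ..., x_n):
    return y_1, y_2, ..., y_m
By default, the arguments of the backward function are ordered as: all forward input variables, then all forward output variables, then the gradients of all forward output variables. The backward function is therefore defined in the following form:
def backward_func(x_1, x_2, ..., x_n, y_1, y_2, ..., y_m, dy_1, dy_2, ..., dy_m):
    return dx_1, dx_2, ..., dx_n
If the backward function does not need some of the forward input or output variables, they can be excluded with `skip_vars_in_backward_input` (step 3 describes how).
Note: `x_1`, ..., `x_n` are multiple input LoDTensors; pass them to `py_func` as a `tuple[Variable]` or `list[Variable]`. It is recommended to convert each LoDTensor to an array with `numpy.array` first; otherwise some Python and numpy operations may not work on a LoDTensor.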
import numpy as np
# Forward function 1: emulates the tanh activation function
def tanh(x):
    # A LoDTensor can be passed directly to np.tanh
    return np.tanh(x)
# Forward function 2: adds two 2-D LoDTensors; multiple LoDTensors are passed in as list[Variable] or tuple(Variable)
def element_wise_add(x, y):
    # The LoDTensors must be converted to numpy arrays first; otherwise numpy's shape operation is not supported
    x = np.array(x)
    y = np.array(y)
    if x.shape != y.shape:
        raise AssertionError("the shape of inputs must be the same!")
    result = np.zeros(x.shape, dtype='int32')
    for i in range(len(x)):
        for j in range(len(x[0])):
            result[i][j] = x[i][j] + y[i][j]
    return result
# Forward function 3: can be used to debug a running network (prints values)
def debug_func(x):
    # A LoDTensor can be passed directly to print
    print(x)
# Backward function for forward function 1; the default argument order is: x, out, gradient of out
def tanh_grad(x, y, dy):
    # The LoDTensors must be converted to numpy arrays first; otherwise operations such as "+/-" cannot be used
    return np.array(dy) * (1 - np.square(np.array(y)))
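Note that the inputs of the forward and backward functions are all of type `LoDTensor`, while the outputs can be numpy arrays or `LoDTensor`.
Because `LoDTensor` implements Python's buffer protocol, a `LoDTensor` can either be converted to a numpy array with `numpy.array` or be passed directly to numpy functions. Converting to a numpy array first is still recommended, because any Python or numpy operation can then be used (for example `+`, `-`, or `shape` on a numpy array).
The backward function of tanh does not need the forward input x, so we can define a backward function without it and exclude x later through `skip_vars_in_backward_input`:
def tanh_grad_without_x(y, dy):
    return np.array(dy) * (1 - np.square(np.array(y)))
- Step 2: create the forward output variable
Call `Program.current_block().create_var` to create the forward output variable. The variable's name (`name`), data type (`dtype`), and shape (`shape`) must be specified when it is created.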
import paddle.fluid as fluid
def create_tmp_var(program, name, dtype, shape):
    return program.current_block().create_var(name=name, dtype=dtype, shape=shape)
in_var = fluid.layers.data(name='input', dtype='float32', shape=[-1, 28, 28])
# Manually create the forward output variable
out_var = create_tmp_var(fluid.default_main_program(), name='output', dtype='float32', shape=[-1, 28, 28])
- Step 3: build the network by calling `py_func`
`py_func` is called as follows:
fluid.layers.py_func(func=tanh, x=in_var, out=out_var, backward_func=tanh_grad)
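If the forward input should not appear among the arguments of the backward function, use `skip_vars_in_backward_input` to exclude it and simplify the backward function's parameter list.
fluid.layers.py_func(func=tanh, x=in_var, out=out_var, backward_func=tanh_grad_without_x,
    skip_vars_in_backward_input=in_var)
This completes the steps for writing a Python Op with `py_func`. The Op can then be used for network training and inference like any other Op.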
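
The argument ordering described in step 1 and the effect of `skip_vars_in_backward_input` can be emulated in plain Python. The sketch below is an assumption-laden illustration rather than the framework's implementation: `backward_args` is a hypothetical helper, variables are represented as `(name, value)` pairs, and scalars stand in for tensors.

```python
import math


def backward_args(inputs, outputs, out_grads, skip=()):
    """Order arguments as: forward inputs, forward outputs, output gradients.

    `inputs` and `outputs` are lists of (name, value) pairs. Names listed in
    `skip` are dropped, mimicking `skip_vars_in_backward_input`. Output
    gradients are never skipped.
    """
    kept = [value for name, value in inputs + outputs if name not in skip]
    return kept + list(out_grads)


def tanh_grad(x, y, dy):
    # x is unused: d tanh(x) / dx = 1 - y^2 depends only on the output y.
    return dy * (1 - y * y)


def tanh_grad_without_x(y, dy):
    return dy * (1 - y * y)


x = 0.5
y = math.tanh(x)
dy = 1.0

full = tanh_grad(*backward_args([("x", x)], [("y", y)], [dy]))
reduced = tanh_grad_without_x(
    *backward_args([("x", x)], [("y", y)], [dy], skip={"x"}))
print(full == reduced)
```

Both calls produce the same gradient; skipping `x` only shortens the parameter list of the backward function.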
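## Notes
- The forward and backward functions of `py_func` should not call `fluid.layers.xxx` internally: the forward and backward functions are called while the network is running and receive C++-side `LoDTensor` objects,
whereas `fluid.layers.xxx` is called while the network is being built and takes Python-side `Variable` objects.
- `skip_vars_in_backward_input` can only skip forward input variables and forward output variables; it cannot skip the gradients of forward outputs.
- If a forward output variable has no gradient, `backward_func` receives `None` for it. If a forward input variable has no gradient, `backward_func` should explicitly return `None` for it.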
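# Notes on C++ Operator Development
## Building Logic of Ops in Fluid
### 1. Building logic of Ops in Fluid
The core method of an Op is Run. Run needs two kinds of resources: data resources and computing resources, which are obtained through `Scope` and `Place` respectively. The framework keeps a global `DeviceContextPool` that records the mapping between `Place` and `DeviceContext`: each `Place` corresponds to exactly one `DeviceContext`, and the `DeviceContext` holds the computing resources of the current device. On a GPU, for example, these resources include `cudnn_handle`, `cublas_handle`, `stream`, and so on. **All computation inside an Op (data copies, CUDA kernels, etc.) must be done in the `DeviceContext`.**
### 2. Op registration logic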
OpCreator creator_;
GradOpMakerFN grad_op_maker_;
proto::OpProto* proto_{nullptr};
OpAttrChecker* checker_{nullptr};
InferVarTypeFN infer_var_type_;
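InferShapeFN infer_shape_;
<table>
<tr><th>Registration Entry</th><th>Type</th><th>Description</th><th>Usage</th></tr>
<tr><td>proto::OpProto</td><td>Class</td><td>Stores the Op's inputs, outputs, attributes, and Op type</td><td>Called at compile time</td></tr>
<tr><td>GradOpMakerFN</td><td>Functor</td><td>Returns the set of OpDescs of the backward Ops corresponding to the current Op, since the backward of one forward Op may consist of several Ops</td><td>Called at compile time</td></tr>
<tr><td>OpAttrChecker</td><td>Class</td><td>Checks the Op's attributes</td><td>Called at compile time</td></tr>
<tr><td>InferVarTypeFN</td><td>Functor</td><td>Infers the type of the output Var, such as LoDTensor, SelectedRows, or others</td><td>Called at compile time</td></tr>
<tr><td>InferShapeFN</td><td>Functor</td><td>Infers the Shape of the Output</td><td>Used at both compile time and runtime. At compile time it is called from the Python side; if the Op derives from OperatorWithKernel, at runtime it is called in op.run</td></tr>
<tr><td>OpCreator</td><td>Functor</td><td>Creates a new OperatorBase on each call</td><td>Called at runtime</td></tr>
</table>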
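1. For every Op, the first three parameters are required: op_type specifies the name of the Op, OperatorBase is the object of the Op, and op_maker_and_checker_maker is the Op's maker together with the checker for its attributes.
2. If the Op has a backward Op, op_grad_opmaker is required, because the backward pass obtains the backward Op's maker from the forward Op.
3. The framework provides a default op_grad_opmaker, `DefaultGradOpDescMaker`. This maker uses the inputs and outputs of the forward Op as the inputs of the backward Op, uses the gradients of the forward Op's inputs as the outputs of the backward Op, and copies the forward Op's attributes. **Note: DefaultGradOpDescMaker uses all inputs and outputs of the forward Op as inputs of the backward Op, even those that are not needed, which prevents memory optimization for the unused variables.**
4. The framework does not provide a default op_infer_var_shape method. If the Op has no OpKernel, the user usually needs to add the corresponding op_infer_var_shape method; if the Op has an OpKernel, the `InferShape` method of `OperatorWithKernel` must be implemented and op_infer_var_shape is not needed. For concrete implementations, see [while_op.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/controlflow/while_op.cc) and [conv_op.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/conv_op.cc).
5. The framework does not provide a default op_infer_var_type method; the user needs to add op_infer_var_type according to the actual situation. Strictly speaking, every Op should register an InferVarType, and op_infer_var_type infers the type and dtype of the output Var from the type and dtype of the input Var. **Note: in the Python-side LayerHelper, the Variable returned by create_variable_for_type_inference holds a LoDTensor; the C++-side InferVarType can modify the type and dtype of the `Variable`.**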
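
The wiring performed by a default gradient maker, as described in item 3, can be sketched with plain dictionaries. This is a conceptual illustration only, not the C++ implementation: `default_grad_op_desc` is a hypothetical function, the `@GRAD` suffix is used here as an assumed naming scheme for gradient variables, and the gradients of the forward outputs are included among the backward inputs.

```python
GRAD_SUFFIX = "@GRAD"  # assumed naming scheme for gradient variables


def default_grad_op_desc(op_type, inputs, outputs, attrs):
    """Build a dict describing the backward op of a forward op.

    inputs / outputs map slot names to lists of variable names.
    """
    grad_inputs = {}
    # Every forward input and output becomes a backward input,
    # whether or not the backward computation needs it.
    for slot, names in list(inputs.items()) + list(outputs.items()):
        grad_inputs[slot] = list(names)
    # The gradients of the forward outputs are backward inputs as well.
    for slot, names in outputs.items():
        grad_inputs[slot + GRAD_SUFFIX] = [n + GRAD_SUFFIX for n in names]
    # The gradients of the forward inputs are the backward outputs.
    grad_outputs = {slot + GRAD_SUFFIX: [n + GRAD_SUFFIX for n in names]
                    for slot, names in inputs.items()}
    return {"type": op_type + "_grad", "inputs": grad_inputs,
            "outputs": grad_outputs, "attrs": dict(attrs)}


desc = default_grad_op_desc(
    "mul", {"X": ["x"], "Y": ["y"]}, {"Out": ["out"]}, {"x_num_col_dims": 1})
print(sorted(desc["inputs"]))
```

Because every forward variable is referenced by the backward op, none of them can be released early, which is the memory cost the note in item 3 warns about.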
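For more details, see [How to write a new Op](new_op.html)
## Notes on Writing an Op
### 1. Input and output types supported by an Op
- `context.Input<Tensor>("Input")` appears frequently in code. It does not mean that the `Variable` of "Input" is a `Tensor`; it means that the `Tensor` is obtained from the `LoDTensor` held in the `Variable` of "Input". If the `Variable` of "Input" is `SelectedRows`, an error is reported.
- If "Input" is `SelectedRows`, `context->GetInputDim("Input")` returns `var->Get<SelectedRows>().GetCompleteDims()` rather than the Dim of the `Tensor` inside the `SelectedRows`.
### 2. Do not modify the input data inside an Op
### 3. Data types that an OpKernel must register
### 4. Overriding the GetExpectedKernelType method
#### 4.1 Override this method only when necessary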
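- [MeanOp](https://github.com/PaddlePaddle/Paddle/blob/3556514e971bdbb98fdf0f556371c527f4dfa98c/paddle/fluid/operators/mean_op.cc#L39): all input variables of this Op should be initialized before Run, so the initialization check is necessary and reasonable
1. The Op has multiple inputs with different data types, for example [AccuracyOp](https://github.com/PaddlePaddle/Paddle/blob/370f0345b6d35a513c8e64d519a0edfc96b9276c/paddle/fluid/operators/metrics/accuracy_op.cc#L80). GetExpectedKernelType must be overridden to specify which input variable determines the kernel type
2. The Op has Dispensable input variables. Such inputs are optional, and it is legitimate for them to be uninitialized when the user does not provide them. For example, [ConvOp](https://github.com/PaddlePaddle/Paddle/blob/250e72d254ccbe3521c29aa2801a1cb15b75ea73/paddle/fluid/operators/conv_op.cc#L206) has optional inputs such as Bias, so GetExpectedKernelType must be overridden to determine the kernel type from a required input variable
3. Some of the Op's input variables may legitimately be uninitialized. For example, in [ConcatOp](https://github.com/PaddlePaddle/Paddle/blob/250e72d254ccbe3521c29aa2801a1cb15b75ea73/paddle/fluid/operators/concat_op.cc#L90) the input variable X contains Tensors to be concatenated, some of which may be uninitialized. GetExpectedKernelType must be overridden so that, when the kernel is determined from X, empty Tensors are reasonably ignored
4. The Op's kernel type does not depend on its input variables (it may be specified by other parameters). For example, [FillOp](https://github.com/PaddlePaddle/Paddle/blob/efbdad059634bef022d4a3f5b00aef6ef8e88ed6/paddle/fluid/operators/one_hot_op.cc#L72) has no inputs and its kernel type is specified by the Op's dtype parameter, so GetExpectedKernelType must be overridden to determine the kernel type from the data type given by that parameter
5. Some parameters of the Op kernel must be set to specific values when certain libraries are used, so GetExpectedKernelType must be overridden to replace the defaults
- Using the CUDNN library: the LibraryType of the OpKernel must be set to kCUDNN, for example [AffineGridOp](https://github.com/PaddlePaddle/Paddle/blob/370f0345b6d35a513c8e64d519a0edfc96b9276c/paddle/fluid/operators/affine_grid_op.cc#L78)
- Using the MKLDNN library: the LibraryType and DataLayout of the OpKernel must be set to kMKLDNN, for example [MulOp](https://github.com/PaddlePaddle/Paddle/blob/250e72d254ccbe3521c29aa2801a1cb15b75ea73/paddle/fluid/operators/mul_op.cc#L89)
#### 4.2 Check that input variables are initialized when overriding this method
When GetExpectedKernelType must be overridden, the kernel's data type is usually obtained from one of the input variables. Use the `OperatorWithKernel::IndicateVarDataType` interface to get the variable's dtype; this method performs the necessary initialization check on the specified input variable. See [Paddle PR #20044](https://github.com/PaddlePaddle/Paddle/pull/20044) for details. An example implementation: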
framework::OpKernelType GetExpectedKernelType(
    const framework::ExecutionContext& ctx) const override {
  return framework::OpKernelType(
      OperatorWithKernel::IndicateVarDataType(ctx, "X"), ctx.GetPlace());
}
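If a method without the initialization check is used, for example calling `Tensor->type()` directly, the error `holder_ should not be null. Tensor not initialized yet when Tensor::type()` may be reported, as in [Paddle issue #19522](https://github.com/PaddlePaddle/Paddle/issues/19522). From this message alone the user cannot tell which Op failed, which makes debugging difficult.
### 5. Op compatibility
Changes to an Op must take compatibility into account: after an Op is modified, previously trained models must still load and run correctly, that is, a new version of the Paddle inference library must be able to load and run models trained with an older version. <font color="#FF0000">**Therefore, the Inputs, Outputs, and Attributes of an Op must not be modified (except for their documentation) or deleted. New Inputs, Outputs, and Attributes may be added, but a new Input or Output must be marked AsDispensable and a new Attribute must have a default value. For more details, see [OP modification rules: Input/Output/Attribute may only be changed compatibly](https://github.com/PaddlePaddle/Paddle/wiki/OP-Input-Output-Attribute-Compatibility-Modification)**</font>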
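
The compatibility rule can be expressed as a small check over two descriptions of the same Op. The sketch below is hypothetical and is not a Paddle tool: the dictionary layout, the `dispensable` and `default` keys, and the function `compatibility_violations` are assumptions chosen for illustration.

```python
LABELS = {"inputs": "input", "outputs": "output", "attrs": "attribute"}


def compatibility_violations(old, new):
    """List the ways in which `new` breaks compatibility with `old`."""
    problems = []
    for kind, label in LABELS.items():
        # Existing entries must be neither deleted nor modified.
        for name, spec in old[kind].items():
            if name not in new[kind]:
                problems.append("{} '{}' was deleted".format(label, name))
            elif new[kind][name] != spec:
                problems.append("{} '{}' was modified".format(label, name))
        # Added entries must be optional for old models.
        for name, spec in new[kind].items():
            if name in old[kind]:
                continue
            if kind == "attrs":
                if "default" not in spec:
                    problems.append(
                        "new attribute '{}' has no default value".format(name))
            elif not spec.get("dispensable", False):
                problems.append(
                    "new {} '{}' is not dispensable".format(label, name))
    return problems


old_op = {
    "inputs": {"X": {"dispensable": False}},
    "outputs": {"Out": {"dispensable": False}},
    "attrs": {"axis": {"default": 0}},
}
compatible = {
    "inputs": {"X": {"dispensable": False}, "Bias": {"dispensable": True}},
    "outputs": {"Out": {"dispensable": False}},
    "attrs": {"axis": {"default": 0}, "use_cudnn": {"default": False}},
}
incompatible = {
    "inputs": {"X": {"dispensable": False}, "Mask": {"dispensable": False}},
    "outputs": {"Out": {"dispensable": False}},
    "attrs": {"mode": {}},
}
violations = compatibility_violations(old_op, incompatible)
print(violations)
```

The `compatible` variant only adds a dispensable input and an attribute with a default, so a model saved with `old_op` still provides everything the new definition requires.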
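### 6. Calling ShareDataWith
### 7. Update method for sparse gradient parameters
### 8. GPU memory optimization
#### 8.1 Register Inplace for Ops that can compute in place
Fluid provides the `DECLARE_INPLACE_OP_INFERER` macro for registering `Inplace`. The first argument of the macro is a class name, such as `ReshapeOpInplaceInToOut`; the second argument is a pair of input and output that reuse the same memory, given in the form `{"X", "Out"}`. The class is then used when calling `REGISTER_OPERATOR`: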
DECLARE_INPLACE_OP_INFERER(ReshapeOpInplaceInToOut, {"X", "Out"});
reshape, ops::ReshapeOp, ops::ReshapeOpMaker,
paddle::framework::DefaultGradOpMaker<paddle::framework::OpDesc, true>,
paddle::framework::DefaultGradOpMaker<paddle::imperative::OpBase, true>,
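#### 8.2 Reduce unneeded variables in an Op
- The `DefaultGradOpMaker` provided by Fluid by default uses all inputs (`Input`) and outputs (`Output`) of the forward Op, together with the gradients of the output variables (`Output@Grad`), as the inputs of the backward Op, and uses the gradients of the forward Op's inputs (`Input@Grad`) as the outputs of the backward Op. When using `DefaultGradOpMaker`, consider whether some of these variables are never used in the computation.
- If `DefaultGradOpMaker` does not meet the requirements, build a `GradOpMaker` manually; see the [related documentation](new_op.html#gradopmaker) for details.
- If a backward Op depends on the Shape or LoD of a forward Op's input or output variable, but not on the Buffer of the Tensor in that variable, and the Shape and LoD cannot be inferred from other variables, the variable (called `X` below) can be registered as `NoNeedBufferVars` in the backward Op through the `DECLARE_NO_NEED_BUFFER_VARS_INFERER` interface. **Once `NoNeedBufferVars` is registered, the backward Op must not read or write the buffer of the Tensor corresponding to that variable; it may only call the Tensor's dims() and lod() methods. In addition, `GetExpectedKernelType()` of the backward Op must be overridden, and it must not call the type() method of the Tensor in variable `X`.** For example, `SliceOpGrad` only uses the Shape information of the variable in `Input`, so `Input` needs to be registered on `SliceOpGrad`: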
namespace paddle {
namespace operators {
// ...
class SliceOpGrad : public framework::OperatorWithKernel {
using framework::OperatorWithKernel::OperatorWithKernel;
void InferShape(framework::InferShapeContext* ctx) const override {
// ...
framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext& ctx) const override {
// Note: don't get data type from ctx.Input<framework::Tensor>("Input");
auto dtype = ctx.Input<framework::Tensor>(framework::GradVarName("Out"))->type();
return framework::OpKernelType( dtype, ctx.GetPlace());
template <typename T>
class SliceOpGradMaker : public framework::SingleGradOpMaker<T> {
using framework::SingleGradOpMaker<T>::SingleGradOpMaker;
void Apply(GradOpPtr<T> bind) const override {
bind->SetInput("Input", this->Input("Input"));
if (this->HasInput("StartsTensor")) {
bind->SetInput("StartsTensor", this->Input("StartsTensor"));
if (this->HasInput("EndsTensor")) {
bind->SetInput("EndsTensor", this->Input("EndsTensor"));
if (this->HasInput("StartsTensorList")) {
bind->SetInput("StartsTensorList", this->Input("StartsTensorList"));
if (this->HasInput("EndsTensorList")) {
bind->SetInput("EndsTensorList", this->Input("EndsTensorList"));
bind->SetInput(framework::GradVarName("Out"), this->OutputGrad("Out"));
bind->SetOutput(framework::GradVarName("Input"), this->InputGrad("Input"));
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OPERATOR(slice, ops::SliceOp, ops::SliceOpMaker,
REGISTER_OPERATOR(slice_grad, ops::SliceOpGrad,
### 9. Mixed device calls
The following device operations are asynchronous with respect to the host:
Kernel launches;
Memory copies within a single device's memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
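- If the data transfer is from the GPU to non-page-locked (non-pinned) CPU memory, the transfer is synchronous even when an asynchronous copy operation is called.
- If the data transfer is from CPU to CPU, the transfer is synchronous even when an asynchronous copy operation is called.
For more details, see [Asynchronous Concurrent Execution](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#asynchronous-concurrent-execution) and [API synchronization behavior](https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior)
### 10. LoD propagation rules inside an Op
[LoD](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/concepts/lod_tensor.md) is the attribute that the Paddle Fluid framework uses to represent variable-length sequence data. Except for Ops that only support padded input data, every Op implementation must consider how LoD is propagated.
Depending on whether the computation uses LoD, Ops that involve LoD propagation fall into two categories: LoD-Transparent and LoD-Based.
<table>
<tr><td>LoD-Transparent</td><td>The computation does not depend on LoD, and whether the input has LoD does not affect the result; usually a position-wise computation</td><td>conv2d_op, batch_norm_op, dropout_op, etc.</td></tr>
<tr><td>LoD-Based</td><td>The computation works sequence by sequence and depends on LoD</td><td>lstm_op, gru_op, sequence_ops, etc.</td></tr>
</table>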
For both categories, LoD propagation must be considered in the forward and the backward pass.
#### Forward propagation
During forward propagation, compared with the LoD of the input, the LoD of the Op's output may be unchanged, changed, or dropped:
- Unchanged: applies to all LoD-Transparent Ops and some LoD-Based Ops. Call `ShareLoD()` in `InferShape` to share the LoD of the input Var directly with the output Var; see [lstm_op](https://github.com/PaddlePaddle/Paddle/blob/a88a1faa48a42a8c3737deb0f05da968d200a7d3/paddle/fluid/operators/lstm_op.cc#L92). If there are multiple inputs that may all carry LoD, the LoD of the first input is usually shared by default, for example [elementwise_ops forward](https://github.com/PaddlePaddle/Paddle/blob/5d6a1fcf16bcb48d2e66306b27d9994d9b07433c/paddle/fluid/operators/elementwise/elementwise_op.h#L69)
- Changed: applies to some LoD-Based Ops. When implementing the OpKernel, the output LoD must be computed correctly. The real LoD is only known after the forward computation finishes, but `ShareLoD()` still needs to be called in `InferShape` so that the LoD Level is propagated correctly at compile time; see [sequence_expand_op](https://github.com/PaddlePaddle/Paddle/blob/565d30950138b9f831caa33904d9016cf53c6c2e/paddle/fluid/operators/sequence_ops/sequence_expand_op.cc)
- Dropped: applies to LoD-Based Ops whose output is no longer sequence data. Forward LoD propagation no longer needs to be considered; see [sequence_pool_op](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/sequence_ops/sequence_pool_op.cc)
- When implementing a LoD-Based Op, handle the boundary cases of LoD propagation, such as zero-length inputs, and provide unit tests that cover empty sequences at the beginning, middle, and end of a batch; see [test_lstm_op.py](https://github.com/PaddlePaddle/Paddle/blob/4292bd8687ababc7737cffbddc0d38ead2138c00/python/paddle/fluid/tests/unittests/test_lstm_op.py#L203-L216)
- For Ops with explicit requirements on the LoD Level, the recommended practice is to check the LoD Level in `InferShape`, for example [sequence_pad_op](https://github.com/PaddlePaddle/Paddle/blob/4292bd8687ababc7737cffbddc0d38ead2138c00/paddle/fluid/operators/sequence_ops/sequence_pad_op.cc#L79)
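
The three forward cases can be made concrete with a toy model of a single-level LoD stored as offsets. The helpers below (`lengths_to_lod`, `share_lod`, `sequence_sum_pool`) are hypothetical names for this sketch and do not reproduce Paddle's kernels; in particular, returning `0.0` for an empty sequence is an assumed padding value.

```python
def lengths_to_lod(lengths):
    """Convert sequence lengths to an offset-based, single-level LoD."""
    lod = [0]
    for n in lengths:
        lod.append(lod[-1] + n)
    return lod


def share_lod(in_lod):
    """'Unchanged' case: the output simply reuses the input LoD."""
    return list(in_lod)


def sequence_sum_pool(data, lod):
    """'Dropped' case: one value per sequence, so the output has no LoD.

    An empty sequence yields 0.0 here (an assumed padding value).
    """
    return [sum(data[start:end], 0.0) for start, end in zip(lod, lod[1:])]


# A batch of three sequences with lengths 2, 0 and 3; the empty sequence
# sits in the middle of the batch.
lod = lengths_to_lod([2, 0, 3])
data = [1.0, 2.0, 3.0, 4.0, 5.0]

out_lod = share_lod(lod)
pooled = sequence_sum_pool(data, lod)
print(lod, pooled)
```

Including a zero-length sequence in the example reflects the boundary case mentioned above: consecutive equal offsets must be handled without indexing errors.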
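#### Backward propagation
In general, the LoD of the gradient GradVar of an Op's input Var should be the same as that of the Var itself, so the Var's LoD should be shared directly with the GradVar; see [the backward of elementwise ops](https://github.com/PaddlePaddle/Paddle/blob/a88a1faa48a42a8c3737deb0f05da968d200a7d3/paddle/fluid/operators/elementwise/elementwise_op.h#L189-L196)
## Op Performance Optimization
### 1. Choosing third-party libraries
### 2. Op performance optimization
The speed of an Op depends on the amount of input data. For some Ops, different computation strategies can be chosen according to the Shape of the input data and the Op's attribute parameters. Take concat_op: when axis>=1, concatenating multiple tensors requires many copies per tensor, and on a GPU each copy is a `cudaMemcpy` call. Relative to the CPU, the GPU is an external device, so every GPU call carries some extra overhead, and the overhead becomes more pronounced as the number of copies grows. The current implementation of concat_op chooses between strategies based on the Shape of the input data and the value of axis: if there are many input tensors and axis is not 0, the many copy operations are converted into a single CUDA Kernel; if there are few input tensors and axis is 0, direct copies are used. The related experiments are described in PR [#8669](https://github.com/PaddlePaddle/Paddle/pull/8669).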
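
The reasoning about copy counts can be sketched numerically. The code below is an illustration under assumptions, not the concat_op implementation: tensors are assumed to be contiguous and row-major, `few_inputs_threshold` is a made-up cut-off, and the mixed cases that the text leaves open (many inputs with axis 0, few inputs with axis other than 0) are assigned to the single-kernel path here.

```python
from functools import reduce
from operator import mul


def copy_calls(shapes, axis):
    """Number of contiguous chunk copies needed to concatenate along `axis`.

    For a row-major tensor, the data along `axis` and all later dimensions
    is contiguous, so each tensor needs prod(shape[:axis]) separate copies.
    """
    return sum(reduce(mul, shape[:axis], 1) for shape in shapes)


def choose_concat_strategy(num_inputs, axis, few_inputs_threshold=10):
    """Pick a strategy name; the threshold is a hypothetical value."""
    if axis == 0 and num_inputs < few_inputs_threshold:
        return "direct_copy"
    return "single_kernel"


shapes = [(32, 64, 8)] * 4
calls_axis0 = copy_calls(shapes, 0)
calls_axis1 = copy_calls(shapes, 1)
calls_axis2 = copy_calls(shapes, 2)
print(calls_axis0, calls_axis1, calls_axis2)
```

With axis 0 each tensor is one contiguous block, while a later axis multiplies the number of copies by the leading dimensions, which is why a single kernel becomes attractive there.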
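Because launching a CUDA Kernel has some extra overhead, an Op that launches many CUDA Kernels may run slowly. For example, the earlier sequence_expand_op contained many CUDA Kernels, each usually processing only a small amount of data, so launching them frequently slowed the Op down. In such cases it is best to merge the small CUDA Kernels into one. This approach was used when optimizing sequence_expand_op (PR [#9289](https://github.com/PaddlePaddle/Paddle/pull/9289)): the optimized sequence_expand_op is roughly twice as fast as the previous implementation on average, and the experimental details are described in that PR ([#9289](https://github.com/PaddlePaddle/Paddle/pull/9289)).
## Numerical Stability of Ops
### 1. Some Ops have numerical stability issues
### 2. Turning WITH_FAST_MATH on and off
## Others
### 1. Error messages
### 2. Mathematical formulas of Ops
If an Op has a mathematical formula, be sure to state it in the code and show it in the Doc of the Python API, because users who compare results across frameworks may need to know how Paddle implements the Op.
### 3. Follow naming conventions for Op variable names
### 4. Parameter order in Python-side Op interfaces
Parameters in the Python API are generally ordered by importance. Take fc as an example: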
def fc(input,
# Notes on operator development
## Building logic of Fluid's op
### 1. Building logic of Fluid's op
All Ops in Fluid are derived from `OperatorBase`, and all Ops are stateless. Each Op contains only four member variables: type, inputs, outputs, and attribute.
The core method of an Op is Run. Run needs two kinds of resources: data resources and computing resources, which are obtained from `Scope` and `Place` respectively. The framework keeps a global `DeviceContextPool` that records the mapping between `Place` and `DeviceContext`: each `Place` corresponds to exactly one `DeviceContext`, and the `DeviceContext` stores the computing resources of the current device. On a GPU, for example, these resources include `cudnn_handle`, `cublas_handle`, `stream`, and so on. All computation inside an Op (data copies, CUDA kernels, etc.) must be done in the `DeviceContext`.
The Fluid framework is designed to run on a variety of devices and third-party libraries, and some Op implementations differ across devices or libraries. Fluid therefore introduces OpKernels: one Op can have multiple OpKernels. Such Ops are derived from `OperatorWithKernel`. A representative example is conv, whose OpKernels are `GemmConvKernel`, `CUDNNConvOpKernel`, and `ConvMKLDNNOpKernel`, each supporting the double and float data types. Ops that do not need an OpKernel include `WhileOp` and others.
Operator inheritance diagram:
For further information, please refer to: [multi_devices](https://github.com/PaddlePaddle/FluidDoc/tree/develop/doc/fluid/design/multi_devices) , [scope](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/concepts/scope.md) , [Developer's_Guide_to_Paddle_Fluid](https://github.com/PaddlePaddle/FluidDoc/blob/release/1.2/doc/fluid/getstarted/Developer's_Guide_to_Paddle_Fluid.md)
### 2. Op's registration logic
The registration entries for each Operator include:
OpCreator creator_;
GradOpMakerFN grad_op_maker_;
proto::OpProto* proto_{nullptr};
OpAttrChecker* checker_{nullptr};
InferVarTypeFN infer_var_type_;
InferShapeFN infer_shape_;
<table>
<tr><th>Registration Entry</th><th>Type</th><th>Description</th><th>Usage</th></tr>
<tr><td>proto::OpProto</td><td>Class</td><td>Stores the input/output/attributes/Op type of the Op</td><td>Called at compile time</td></tr>
<tr><td>GradOpMakerFN</td><td>Functor</td><td>Returns the set of OpDescs of the backward Ops corresponding to the current Op, because the backward of one forward Op may consist of multiple Ops</td><td>Called at compile time</td></tr>
<tr><td>OpAttrChecker</td><td>Class</td><td>Checks the Op's attr</td><td>Called at compile time</td></tr>
<tr><td>InferVarTypeFN</td><td>Functor</td><td>Infers the type of the output Var, such as LoDTensor, SelectedRows, or others</td><td>Called at compile time</td></tr>
<tr><td>InferShapeFN</td><td>Functor</td><td>Infers the Shape of the Output</td><td>Used differently at compile time and runtime. At compile time it is called on the Python side; if the Op is derived from OperatorWithKernel, at runtime it is called in op.run</td></tr>
<tr><td>OpCreator</td><td>Functor</td><td>Creates a new OperatorBase on each call</td><td>Called at runtime</td></tr>
</table>
Usually you need to call REGISTER_OPERATOR when you register an Op:
1. For every Op, the first three parameters are required: op_type specifies the name of the Op, OperatorBase is the object instance of the Op, and op_maker_and_checker_maker is the Op's maker together with the checker for its attributes.
2. If the Op has a backward Op, it must have op_grad_opmaker, because the backward pass obtains the backward Op's maker from the forward Op.
3. The framework provides a default op_grad_opmaker, `DefaultGradOpDescMaker`, which uses the inputs and outputs of the forward Op as the inputs of the backward Op, uses the gradients of the forward Op's inputs as the outputs of the backward Op, and copies the attributes of the forward Op. **Note:** DefaultGradOpDescMaker takes all inputs and outputs of the forward Op as inputs of the backward Op, even those that are not needed, which prevents memory optimization for the unused variables.
4. The framework does not provide a default op_infer_var_shape method. If the Op has no OpKernel, you usually need to add the corresponding op_infer_var_shape method. If the Op has an OpKernel, you need to implement the `InferShape` method of `OperatorWithKernel`, and the op_infer_var_shape method is not needed. For details, refer to [while_op.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/controlflow/while_op.cc), [conv_op.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/conv_op.cc).
5. The framework does not provide a default op_infer_var_type method; the user needs to add op_infer_var_type according to the actual situation. Strictly speaking, every Op should register an InferVarType, and op_infer_var_type infers the type and dtype of the output Var from the type and dtype of the input Var. **Note:** In the Python-side LayerHelper, the create_variable_for_type_inference operation returns a Variable that holds a LoDTensor. The C++-side InferVarType can modify the type and dtype of the `Variable`.
For more details, please refer to: [How to write a new Op](new_op_en.html)
## Notes on Writing an Op
### 1. Input and output types supported by Op
The inputs and outputs of Fluid's Ops are `Variable`s. By design, a `Variable` can store any type, so an Op's input and output `Variable`s may be of any type; in practice a `Variable` usually stores a `LoDTensor` or `SelectedRows`.
- `context.Input<Tensor>("Input")` often appears in the code. It does not mean that the `Variable` of "Input" is a `Tensor`; it means that the `Tensor` is obtained from the `LoDTensor` in the `Variable` of "Input". If the `Variable` of "Input" is `SelectedRows`, an error will be reported.
- If "Input" is `SelectedRows`, `context->GetInputDim("Input")` will return `var->Get<SelectedRows>().GetCompleteDims()` instead of the Dim of the `Tensor` in `SelectedRows`.
### 2. Do not modify the input data inside Op.
Never make any modification of the input data inside Op, as there may be other Ops that need to read this input.
### 3. The data type needs to be registered for OpKernel
Currently all OpKernels are required to register the double and float data types.
### 4. Op compatibility issue
Modifying an Op requires attention to compatibility. After an Op is modified, previously trained models must still load and run normally; that is, a model trained with an old version must load and run with the new version of the Paddle inference library. <font color="#FF0000">**Developers must therefore not modify (except for documentation) or delete the Inputs, Outputs, and Attributes of an Op. New Inputs, Outputs, and Attributes may be added, but added Inputs and Outputs must be marked dispensable, and added Attributes must have a default value. For more details, please refer to [OP Input/Output/Attribute Compatibility Modification](https://github.com/PaddlePaddle/Paddle/wiki/OP-Input-Output-Attribute-Compatibility-Modification(English-Version))**</font>.
### 5. Call ShareDataWith
ShareDataWith makes two Tensors share the same underlying buffer, so it must be called with special care. Inside an Op, ShareDataWith cannot be applied to the Op's output. In other words, the Tensor of the Op's output must be allocated by Malloc.
### 6. Sparse gradient parameter's update method
At present, the sparse gradient will first merge the gradient when updating, which is to add up the gradients of the same parameter, and then update the parameters and additional parameters (such as velocity).
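
The merge-then-update order can be sketched in a few lines. This is an illustrative example under assumptions, not Paddle's optimizer code: each row holds a single scalar for brevity, `merge_rows` and `momentum_update` are hypothetical helpers, and the momentum rule with `lr` and `mu` is an assumed example of an update that carries an additional parameter (velocity).

```python
from collections import defaultdict


def merge_rows(rows, values):
    """Add up the gradient values that refer to the same parameter row."""
    merged = defaultdict(float)
    for row, value in zip(rows, values):
        merged[row] += value
    return dict(merged)


def momentum_update(param, velocity, rows, values, lr=0.1, mu=0.9):
    """Merge the sparse gradient first, then update only the touched rows."""
    for row, grad in merge_rows(rows, values).items():
        velocity[row] = mu * velocity[row] + grad
        param[row] -= lr * velocity[row]
    return param, velocity


# Row 0 appears twice in the sparse gradient; row 1 is not touched at all.
rows = [0, 2, 0]
values = [1.0, 0.5, 2.0]
param, velocity = momentum_update([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], rows, values)
print(param, velocity)
```

Merging before the update means the velocity of row 0 is advanced once with the summed gradient, rather than twice with partial gradients, which would give a different result for stateful update rules.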
### 7. (Video) Memory optimization
If the backward Op does not need all inputs and outputs of the forward Op as its inputs, do not use `DefaultGradOpDescMaker`, because it prevents memory and video memory optimization for the unused variables.
### 8. Calls made on hybrid devices
Because GPU work is executed asynchronously, it may not have run yet when the CPU-side call returns. If an Op creates a temporary variable that is needed while the GPU work runs, the variable may already have been released on the CPU side by the time the GPU starts running, which can cause GPU calculation errors.
Some of the synchronous and asynchronous operations on the GPU:
The following device operations are asynchronous with respect to the host:
Kernel launches;
Memory copies within a single device's memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
Notes on `cudaMemcpy` and `cudaMemcpyAsync`:
- If the data transfer is from the GPU side to non-pinned memory on the CPU side, the data transfer will be synchronous, even if an asynchronous copy operation is called.
- If the data is transferred from the CPU side to the CPU side, the data transfer will be synchronous, even if an asynchronous copy operation is called.
For more information, please refer to: [Asynchronous Concurrent Execution](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#asynchronous-concurrent-execution) , [API synchronization behavior](https://Docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior)
## Op Performance Optimization
### 1. Selection of third-party libraries
When writing an Op, prefer the operations provided by high-performance libraries (such as cudnn, mkldnn, mklml, and eigen), but always benchmark them. Some library operations can be slow in deep learning tasks: the operations in these libraries (eigen, for example) are written to be general, so their performance is not always sufficient, and because the amount of data in a deep learning model is often small, a general-purpose library routine may turn out slower than expected. The Elementwise Ops (forward and backward) are an example. Elementwise operations are called frequently in a model, especially Elementwise_add, which many operations use to add a bias. The previous implementation of Elementwise_op called the Eigen library directly. Elementwise operations often need to broadcast their data, and experiments showed that Eigen's broadcast is slow; the reason is described in PR [#6229](https://github.com/PaddlePaddle/Paddle/pull/6229).
### 2. Op performance optimization
The calculation speed of an Op depends on the amount of input data. For some Ops, different calculation methods can be selected according to the Op's attributes and the Shape of the input data. Take concat_op as an example: when axis>=1, concatenating multiple tensors requires many copies for each tensor, and on the GPU each copy is a cudaMemcpy call. From the CPU's point of view the GPU is an external device, so every call to the GPU carries some overhead, and the more copies are needed, the more prominent the overhead becomes. The current implementation of concat_op therefore selects the calling method according to the Shape of the input data and the value of axis. If there are relatively many input tensors and axis is not equal to 0, the multiple copy operations are converted into a single CUDA Kernel; if there are few input tensors and axis is equal to 0, direct copies are used. The relevant experiment is described in PR [#8669](https://github.com/PaddlePaddle/Paddle/pull/8669).
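To see why the number of copies grows when axis>=1, the following sketch counts the contiguous copies needed to concatenate tensors, assuming a row-major memory layout. The helper `contiguous_copies` is hypothetical and is not part of Paddle; it only illustrates the counting argument.

```python
from math import prod

def contiguous_copies(shapes, axis):
    """Count contiguous memory copies needed to concatenate row-major tensors.

    For a row-major tensor, the block covering dimensions axis..end is
    contiguous, so one copy is needed for every index combination of the
    dimensions before `axis`.
    """
    return sum(prod(shape[:axis]) for shape in shapes)

shapes = [(64, 8), (64, 16), (64, 4)]
copies_axis0 = contiguous_copies(shapes, axis=0)  # one copy per tensor
copies_axis1 = contiguous_copies(shapes, axis=1)  # one copy per row of each tensor
```

With these three inputs, axis 0 needs 3 copies while axis 1 needs 192. If each copy were a separate device call, the per-call overhead would dominate, which is the motivation for doing the work in one kernel.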
Launching a CUDA Kernel has some overhead, so launching many CUDA Kernels inside one Op can reduce the Op's execution speed. For example, the previous sequence_expand_op contained many CUDA Kernels, each of which usually processed only a small amount of data, so the frequent launches slowed the Op down. In such cases it is better to combine the small CUDA Kernels into one. This idea was used to optimize sequence_expand_op, and the optimized version is about twice as fast as the previous implementation; the relevant experiments are described in PR [#9289](https://github.com/PaddlePaddle/Paddle/pull/9289).
Reduce the number of copy and sync operations between the CPU and the GPU. For example, after each iteration updates the model parameters, a fetch operation retrieves the loss. Copying data from the GPU to non-pinned CPU memory is synchronous, so frequently fetching many variables slows down model training.
## Op numerical stability
### 1. Some Ops have numerical stability problems
The main cause of numerical instability is that, when a program is run multiple times, the order in which floating-point data is processed may differ, producing different final results. The GPU accelerates computation through multi-threaded parallelism, so it is commonplace that the order of floating-point operations is not fixed.
So far, the following have been found to be non-deterministic: the convolution operation in cudnn, MaxPooling in cudnn, CudaAtomicXX in CUDA, and the aggregation of parameter gradients in the Reduce mode of ParallelExecutor.
For this reason, several FLAGS have been added to Fluid. For example, FLAGS_cudnn_deterministic forces cudnn to use deterministic algorithms, and FLAGS_cpu_deterministic forces CPU-side calculation to use a deterministic method.
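The effect of operation order on floating-point results can be reproduced on the CPU with a few lines of Python. This is a minimal sketch that has nothing to do with Paddle's kernels; it only shows that floating-point addition is not associative, which is why a parallel reduction whose order changes between runs can produce different results.

```python
import math

values = [0.1, 0.2, 0.3]

# The same three numbers, added in two different orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

# True: the two orders round differently.
order_matters = left_to_right != right_to_left

# math.fsum tracks exact partial sums, so its result does not depend on order.
stable = math.fsum(values) == math.fsum(reversed(values))
```

A deterministic mode removes the variation by fixing the order of operations, usually at some cost in speed.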
### 2. On/Off of WITH_FAST_MATH
If WITH_FAST_MATH is ON, NVCC uses --use_fast_math when compiling Paddle and Eigen. This can make some CUDA operations, such as log, exp, and tanh, faster at the cost of some precision. However, it may also make some operations, such as pow, return wrong results; see [torch/DEPRECEATED-torch7-distro#132](https://github.com/torch/DEPRECEATED-torch7-distro/issues/132) for the specific reasons.
## Other
### 1. Error message
The Enforce message must not be empty. Always write one, because a good error message makes it quicker and easier to find the cause of an error.
### 2. Op's mathematical formula
If an Op has a mathematical formula, be sure to write the formula in the code and display it in the Doc of the Python API, because users may need to understand how Paddle implements the Op when comparing calculation results across frameworks.
**Note:** The formula preview must be done before the merge to the develop branch. Example: [dynamic_lstmp](../../../api/layers/nn.html#dynamic-lstmp).
### 3. The order of parameters in the Python-side Op interface
Parameters in the Python API are generally ordered by importance. Take fc as an example:
def fc(input,
.. _contribute_to_paddle_faq:
.. contents::
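1. What should I do if signing the CLA does not succeed?
Because `CLA <https://github.com/cla-assistant/cla-assistant>`_ is a third-party open-source library, it is sometimes unstable. If you are sure that you have signed the CLA but the CLA check was not triggered, try the following:
* Close and reopen the PR to re-trigger the CLA check: click :code:`Close pull request` , then click :code:`Reopen pull request` , and wait a few minutes.
* If repeating the steps above twice still does not work, open a new PR or leave a message in the comment area.
2. What should I do if CI is not triggered?
* Add the correct CI trigger rule to the commit message:
* For the develop branch, add :code:`test=develop`
* For a release branch, add for example :code:`test=release/1.4` to trigger CI for the release/1.4 branch
* For a documentation preview, add :code:`test=document_preview`
* The CI trigger rule applies per commit: within the same PR, regardless of whether earlier commits already include the rule, every new commit that should trigger CI must include it as well.
* If some CI jobs are still not triggered after the rule is added, close and reopen the PR to re-trigger CI.
3. What should I do if CI fails randomly, that is, the error message is unrelated to this PR?
4. How do I modify API.spec?
For the modification method, see `diff_api.py <https://github.com/PaddlePaddle/Paddle/blob/ddfc823c73934d483df36fa9a8b96e67b19b67b4/tools/diff_api.py#L29-L34>`_ .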
.. toctree::
:maxdepth: 1
How to contribute codes to Paddle
.. toctree::
:maxdepth: 1
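# Local development guide
## Code requirements
- Write code comments in the [Doxygen](http://www.doxygen.nl/) style.
- Make sure the compiler option `WITH_STYLE_CHECK` is on and that the build passes the code style check.
- All code must have unit tests.
- All unit tests must pass.
- Follow the [conventions for submitting code](#提交代码的一些约定).
## [Fork](https://help.github.com/articles/fork-a-repo/)
Go to the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) GitHub home page and click the `Fork` button to create a repository under your own account, such as <https://github.com/USERNAME/Paddle>.
## Clone
Clone the remote repository to your local machine: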
➜ git clone https://github.com/USERNAME/Paddle
cd Paddle
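## Create a local branch
Paddle currently uses the [Git flow branching model](http://nvie.com/posts/a-successful-git-branching-model/) for development, testing, release, and maintenance. For details, see the [Paddle branch specification](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/others/releasing_process.md).
All feature and bug fix work should be done on a new branch, usually created from the `develop` branch.
Use `git checkout -b` to create and switch to a new branch.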
➜ git checkout -b my-cool-stuff
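Note that the working directory of the current branch must be clean before the checkout; otherwise untracked files are carried over to the new branch. You can check this with `git status`.
## Use the `pre-commit` hook
Paddle developers use the [pre-commit](http://pre-commit.com/) tool to manage Git pre-commit hooks. It helps us format source code (C++, Python) and automatically checks some basics before each commit (for example, that every file has only one EOL and that no large files are added to Git).
The `pre-commit` test is part of the unit tests in Travis-CI, and a PR that does not satisfy the hooks cannot be submitted to Paddle. First install it and run it in the current directory: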
➜ pip install pre-commit
➜ pre-commit install
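Paddle uses `clang-format` to format C/C++ source code. Make sure your `clang-format` version is 3.8 or later.
Note: the `yapf` installed through `pip install pre-commit` differs slightly from the one installed through `conda install -c conda-forge pre-commit`. Paddle developers use `pip install pre-commit`.
## Start development
In this example, I deleted a line from README.md and created a new file.
Use `git status` to view the current state, which lists the changes in the current directory. You can also use `git diff` to see exactly what was modified in each file.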
➜ git status
On branch test
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: README.md
Untracked files:
(use "git add <file>..." to include in what will be committed)
no changes added to commit (use "git add" and/or "git commit -a")
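## Build and unit tests
To build PaddlePaddle from source, see [Compile from source](../../../install/compile/fromsource.html) and choose your operating system.
For unit tests, see how to run them in [Op unit tests](../new_op/new_op.html#id7).
## Commit
Next we discard the change to README.md and then commit the newly added test file.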
➜ git checkout -- README.md
➜ git status
On branch test
Untracked files:
(use "git add <file>..." to include in what will be committed)
nothing added to commit but untracked files present (use "git add" to track)
➜ git add test
Every Git commit requires a commit message, which lets others know what the commit changed. Use `git commit` to do this.
➜ git commit
CRLF end-lines remover...............................(no files to check)Skipped
yapf.................................................(no files to check)Skipped
Check for added large files..............................................Passed
Check for merge conflicts................................................Passed
Check for broken symlinks................................................Passed
Detect Private Key...................................(no files to check)Skipped
Fix End of Files.....................................(no files to check)Skipped
clang-formater.......................................(no files to check)Skipped
[my-cool-stuff c703c041] add test file
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 233
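## Keep the local repository up to date
Before opening a Pull Request, synchronize with the latest code in the original repository (<https://github.com/PaddlePaddle/Paddle>).
First use `git remote` to view the name of the current remote repository.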
➜ git remote
➜ git remote -v
origin https://github.com/USERNAME/Paddle (fetch)
origin https://github.com/USERNAME/Paddle (push)
Here origin is the name of the remote repository we cloned, that is, the Paddle repository under your own username. Next we add the original Paddle repository as a remote named upstream.
➜ git remote add upstream https://github.com/PaddlePaddle/Paddle
➜ git remote
Fetch the latest code from upstream and update the current branch.
➜ git fetch upstream
➜ git pull upstream develop
## Push to the remote repository
Push the local changes to GitHub, that is, to https://github.com/USERNAME/Paddle.
# Push to the my-cool-stuff branch of the remote repository origin
➜ git push origin my-cool-stuff
# Local development guide
This document describes how to develop Paddle code in a local environment.
## Coding requirements
- Write code comments in the [Doxygen](http://www.doxygen.nl/) style.
- Make sure the build option `WITH_STYLE_CHECK` is on and that the build passes the code style check.
- All code must have unit tests.
- All unit tests must pass.
- Follow the [rules for submitting code](#regulations of submitting codes).
The following guidance explains how to submit code.
## [Fork](https://help.github.com/articles/fork-a-repo/)
Go to the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) GitHub home page and click the `Fork` button to create a copy of the repository under your own account, such as <https://github.com/USERNAME/Paddle>.
## Clone
Clone the remote repository to your local machine:
➜ git clone https://github.com/USERNAME/Paddle
cd Paddle
## Create local branch
Paddle currently uses the [Git flow branching model](http://nvie.com/posts/a-successful-git-branching-model/) for development, testing, release, and maintenance. For details, see the [Paddle branch specification](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/others/releasing_process.md).
All feature and bug fix work should be done on a new branch, usually created from the `develop` branch.
Create and switch to a new branch with `git checkout -b`.
➜ git checkout -b my-cool-stuff
Note that the working directory of the current branch must be clean before the checkout; otherwise untracked files are carried over to the new branch. You can check this with `git status`.
## Use `pre-commit` hook
Paddle developers use the [pre-commit](http://pre-commit.com/) tool to manage Git pre-commit hooks. It helps us format the source code (C++, Python) and automatically check some basic things before committing (such as having only one EOL per file, not adding large files in Git, etc.).
The `pre-commit` test is part of the unit tests in Travis-CI, and a PR that does not satisfy the hooks cannot be submitted to Paddle. First install `pre-commit` and run it in the current directory:
➜ pip install pre-commit
➜ pre-commit install
Paddle uses `clang-format` to format C/C++ source code. Make sure your `clang-format` version is 3.8 or later.
Note: the `yapf` installed through `pip install pre-commit` differs slightly from the one installed through `conda install -c conda-forge pre-commit`. Paddle developers use `pip install pre-commit`.
## Start development
In this example, I delete a line from README.md and create a new file.
Use `git status` to view the current state, which lists the changes in the current directory. You can also use `git diff` to see exactly what was modified in each file.
➜ git status
On branch test
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: README.md
Untracked files:
(use "git add <file>..." to include in what will be committed)
no changes added to commit (use "git add" and/or "git commit -a")
## Build and test
For more information about building PaddlePaddle from source, see [Compile From Source Code](../../../install/compile/fromsource_en.html).
For more information about running unit tests, see [Op Unit Tests](../new_op/new_op_en.html#unit-tests).
## Commit
Next we discard the change to README.md and commit the newly added test file.
➜ git checkout -- README.md
➜ git status
On branch test
Untracked files:
(use "git add <file>..." to include in what will be committed)
nothing added to commit but untracked files present (use "git add" to track)
➜ git add test
Every Git commit requires a commit message, which lets other developers know what the commit changed. Use `git commit` to do this.
➜ git commit
CRLF end-lines remover...............................(no files to check)Skipped
yapf.................................................(no files to check)Skipped
Check for added large files..............................................Passed
Check for merge conflicts................................................Passed
Check for broken symlinks................................................Passed
Detect Private Key...................................(no files to check)Skipped
Fix End of Files.....................................(no files to check)Skipped
clang-formater.......................................(no files to check)Skipped
[my-cool-stuff c703c041] add test file
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 233
## Keep the latest local repository
Before opening a Pull Request, synchronize with the latest code in the original repository (<https://github.com/PaddlePaddle/Paddle>).
First check the name of the current remote repository with `git remote`.
➜ git remote
➜ git remote -v
origin https://github.com/USERNAME/Paddle (fetch)
origin https://github.com/USERNAME/Paddle (push)
Here origin is the name of the remote repository we cloned, that is, the Paddle repository under your own account. Next we add the original Paddle repository as a remote named upstream.
➜ git remote add upstream https://github.com/PaddlePaddle/Paddle
➜ git remote
Get the latest code of upstream and update current branch.
➜ git fetch upstream
➜ git pull upstream develop
## Push to remote repository
Push the local changes to GitHub (https://github.com/USERNAME/Paddle).
# Push to the my-cool-stuff branch of the remote repository origin
➜ git push origin my-cool-stuff
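# Notes on submitting a PR
## Create an Issue and complete the Pull Request
Create an Issue that describes the problem and note its number.
Switch to the branch you created, then click `New pull request`.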
<img width="295" alt="screen shot 2017-04-26 at 9 09 28 pm" src="https://cloud.githubusercontent.com/assets/11692045/25436054/a6d98c66-2ac4-11e7-9cb1-18dd13150230.png">
<img width="750" alt="screen shot 2017-04-26 at 9 11 52 pm" src="https://cloud.githubusercontent.com/assets/11692045/25436139/f83b1e6c-2ac4-11e7-8c0e-add499023c46.png">
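Writing `resolve #Issue number` in the PR description automatically closes the corresponding Issue after the PR is merged. For details, see [here](https://help.github.com/articles/closing-issues-via-commit-messages/).
Then wait for review. If changes are requested, follow the steps above to update the corresponding branch in origin.
## Sign the CLA and pass the unit tests
### Sign the CLA
When you submit a Pull Request to PaddlePaddle for the first time, you need to sign the CLA (Contributor License Agreement) once so that your code can be merged. The steps are as follows:
- In the Check section of the PR, find license/cla and click detail on the right to go to the CLA website.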
<div align="center">
<img src="https://github.com/PaddlePaddle/FluidDoc/blob/release/1.1/doc/fluid/advanced_usage/development/contribute_to_paddle/img/cla_unsigned.png?raw=true" height="40" width="500">
- On the CLA website, click “Sign in with GitHub to agree”. After you click it, you are redirected back to your Pull Request page.
<div align="center">
<img src="https://github.com/PaddlePaddle/FluidDoc/blob/release/1.1/doc/fluid/advanced_usage/development/contribute_to_paddle/img/sign_cla.png?raw=true" height="330" width="400">
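### Pass the unit tests
Each new commit you push to the Pull Request triggers the CI unit tests. Make sure your commit message includes the required information; see [Commit](local_dev_guide.html#permalink-8--commit-).
Keep an eye on the progress of the CI unit tests in your Pull Request; they finish within a few hours.
If a red cross appears after a required test, the commit failed one of the unit tests. In that case, click detail to view the error, take a screenshot of the cause, and add it to your Pull Request as a comment. Our staff will help you look into it.
## Delete the remote branch
After the PR is merged into the main repository, you can delete the branch of the remote repository on the PR page.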
<img width="775" alt="screen shot 2017-04-26 at 9 18 24 pm" src="https://cloud.githubusercontent.com/assets/11692045/25436457/e4cdd472-2ac5-11e7-9272-badc76c4a23e.png">
You can also delete the remote branch with `git push origin :branch-name`, for example:
➜ git push origin :my-cool-stuff
## Delete the local branch
# Switch to the develop branch
➜ git checkout develop
# Delete the my-cool-stuff branch
➜ git branch -D my-cool-stuff
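## Conventions for submitting code
1) Make sure the unit tests in Travis-CI pass. If they fail, the submitted code has problems and reviewers generally will not review it.
2) Before submitting a Pull Request:
- Pay attention to the number of commits:
Suggestion: keep the number of commits as small as possible in each submission. You can use `git commit --amend` to add to the previous commit. For multiple commits that have already been pushed to the remote repository, see [squash commits after push](http://stackoverflow.com/questions/5667884/how-to-squash-commits-in-git-after-they-have-been-pushed).
- Pay attention to the name of each commit: it should reflect the content of the commit and not be arbitrary.
3) If the Pull Request resolves an Issue, add `fix #issue_number` to the **first** comment box of the Pull Request so that the corresponding Issue is closed automatically when the Pull Request is merged. The keywords are: close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved. Choose an appropriate one. For details, see [Closing issues via commit messages](https://help.github.com/articles/closing-issues-via-commit-messages).
- If you agree with a review comment and have made the change, a simple `Done` is enough.
- If you disagree with a review comment, give your reasons.
- Describe the overall changes.
- Reply using [start a review](https://help.github.com/articles/reviewing-proposed-changes-in-a-pull-request/) rather than replying directly, because every direct reply sends an email and the result is a flood of emails.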
# Guide to submitting a PR on GitHub
## Create an Issue and complete the Pull Request
Create an Issue that describes the problem and note its number.
Switch to the branch you created and click `New pull request`.
<img width="295" alt="screen shot 2017-04-26 at 9 09 28 pm" src="https://cloud.githubusercontent.com/assets/11692045/25436054/a6d98c66-2ac4-11e7-9cb1-18dd13150230.png">
Select the target branch:
<img width="750" alt="screen shot 2017-04-26 at 9 11 52 pm" src="https://cloud.githubusercontent.com/assets/11692045/25436139/f83b1e6c-2ac4-11e7-8c0e-add499023c46.png">
Writing `resolve #Issue number` in the PR description automatically closes the corresponding Issue after the PR is merged. More details can be found [here](https://help.github.com/articles/closing-issues-via-commit-messages/).
Then wait for review. If changes are requested, update the corresponding branch in origin by following the steps above.
## Sign the CLA and pass unit tests
### Sign the CLA
The first time you submit a Pull Request, you need to sign the CLA (Contributor License Agreement) so that your code can be merged. The steps are as follows:
- In the Check section of the PR, find license/cla and click detail on the right to go to the CLA website.
<div align="center">
<img src="https://github.com/PaddlePaddle/FluidDoc/blob/release/1.1/doc/fluid/advanced_usage/development/contribute_to_paddle/img/cla_unsigned.png?raw=true" height="40" width="500">
- On the CLA website, click “Sign in with GitHub to agree”. After you click it, you are redirected back to your Pull Request page.
<div align="center">
<img src="https://github.com/PaddlePaddle/FluidDoc/blob/release/1.1/doc/fluid/advanced_usage/development/contribute_to_paddle/img/sign_cla.png?raw=true" height="330" width="400">
### Pass unit tests
Every new commit in your Pull Request triggers the CI unit tests, so make sure your commit message includes the required information. See [commit](local_dev_guide.html#permalink-8--commit-).
Keep an eye on the CI unit tests in your Pull Request; they finish within several hours.
You only need to pay attention to the CI jobs associated with the branch you submitted to. For example, if you submit code to the develop branch, there is no need to check whether the release/1.1 tests pass.
Green ticks after all tests mean that your commit has passed all unit tests.
A red cross after a test means your commit failed that unit test. Click detail to view the error, take a screenshot of it, and add it as a comment in your Pull Request. Our staff will help you check it.
## Delete remote branch
After your PR is merged into the main repository, you can delete the branch of the remote repository on the PR page.
<img width="775" alt="screen shot 2017-04-26 at 9 18 24 pm" src="https://cloud.githubusercontent.com/assets/11692045/25436457/e4cdd472-2ac5-11e7-9272-badc76c4a23e.png">
You can also delete the branch of the remote repository with `git push origin :the_branch_name`, for example:
➜ git push origin :my-cool-stuff
## Delete local branch
Finally, delete the local branch:
# Switch to develop branch
➜ git checkout develop
# delete my-cool-stuff branch
➜ git branch -D my-cool-stuff
This completes the full process of contributing code.
## Rules for submitting code
So that reviewers can focus on the code itself during code review, please follow these rules every time you submit code:
1) Make sure the unit tests in Travis-CI pass. If they fail, the submitted code has problems and reviewers generally will not review it.
2) Before submitting a Pull Request:
- Pay attention to the number of commits:
Reason: if only one file is modified but a dozen commits are submitted, each with only a few changes, reviewers are burdened considerably. They have to go through the commits one by one to work out what changed, and sometimes have to account for changes that overlap between commits.
Suggestion: keep the number of commits as small as possible in each submission. You can add to the previous commit with `git commit --amend`. For multiple commits that have already been pushed to the remote repository, see [squash commits after push](http://stackoverflow.com/questions/5667884/how-to-squash-commits-in-git-after-they-have-been-pushed).
- Pay attention to the name of each commit: it should reflect the content of the commit and not be arbitrary.
3) If the Pull Request resolves an Issue, add `fix #issue_number` to the *first* comment box of the Pull Request so that the corresponding Issue is closed automatically when the Pull Request is merged. The keywords are: close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved. Choose an appropriate one. For more details, see [Closing issues via commit messages](https://help.github.com/articles/closing-issues-via-commit-messages).
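As a quick local check before submitting, the keyword rule above can be approximated with a regular expression. This is a rough sketch and not GitHub's actual matching logic; the function name `issues_closed_by` and the exact pattern are assumptions made for illustration.

```python
import re

# The closing keywords listed above.
CLOSING_KEYWORDS = ("close", "closes", "closed", "fix", "fixes", "fixed",
                    "resolve", "resolves", "resolved")

# A keyword as a whole word, whitespace, then "#<number>".
_PATTERN = re.compile(
    r"\b(?:" + "|".join(CLOSING_KEYWORDS) + r")\s+#(\d+)", re.IGNORECASE)

def issues_closed_by(text):
    """Return the issue numbers that the text would close, in order."""
    return [int(number) for number in _PATTERN.findall(text)]

closed = issues_closed_by("fix #12, and Resolves #34")
```

A plain mention such as `see #7` contains no keyword and is therefore not treated as closing the Issue.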
In addition, please follow these rules when responding to reviewers' comments:
1) Reply to every review comment (this is basic courtesy in the open source community: when someone helps you, say thank you):
- If you accept the reviewer's suggestion and have made the change, a simple `Done` is enough.
- If you disagree with a suggestion, explain your reasons.
2) If there are many review comments:
- Summarize the overall changes.
- Reply using [start a review](https://help.github.com/articles/reviewing-proposed-changes-in-a-pull-request/) instead of replying directly, because every direct reply sends an email and many replies cause a flood of emails.
......@@ -8,20 +8,26 @@
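- `Introduction to Tensor <./tensor_introduction_cn.html>`_ : How data is represented in PaddlePaddle; an introduction to the Tensor concept.
- `Introduction to broadcasting <./broadcasting_cn.html>`_ : An introduction to the broadcasting concept in PaddlePaddle.
- `PaddlePaddle 2.0beta upgrade guide <./upgrade_guide_cn.html>`_: The main changes in the PaddlePaddle open-source framework 2.0beta and how to upgrade.
- `Version migration tool <./migration_cn.html>`_: How to use the paddle1to2 conversion tool.
- `Dynamic graph to static graph <./dygraph_to_static/index_cn.html>`_: How to convert a PaddlePaddle dynamic graph to a static graph.
- `Model saving and loading <./model_save_load_cn.html>`_: How to save and load PaddlePaddle models and parameters.
- `Overview of PaddlePaddle 2.0 <./01_paddle2.0_introduction/index_cn.html>`_ : The new features of PaddlePaddle 2.0 and the PaddlePaddle 2.0 upgrade guide.
- `Model development with PaddlePaddle 2.0 <./02_paddle2.0_develop/index_cn.html>`_ : The full model development workflow in PaddlePaddle 2.0.
- `Model visualization <./03_VisualDL/index_cn.html>`_ : How to visualize PaddlePaddle models with VisualDL.
- `Dynamic graph to static graph <./04_dygraph_to_static/index_cn.html>`_ : How to convert a PaddlePaddle dynamic graph to a static graph.
- `Inference and deployment <./05_inference_deployment/index_cn.html>`_ : How to run inference with a trained model.
- `Distributed training <./06_distributed_training/index_cn.html>`_ : How to train in a distributed setting.
- `Performance optimization <./07_performance_improving/index_cn.html>`_ : Tuning methods for using PaddlePaddle.
- `Custom OP <./08_new_op/index_cn.html>`_ : How to write a custom OP for PaddlePaddle.
- `Contributing <./09_contribution/index_cn.html>`_ : How to contribute to the development of PaddlePaddle.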
.. toctree::
