未验证 提交 db2c1529 编写于 作者:Daniel Yang 提交者:GitHub

this is a pull-request to revise the docs, test=develop (#1790)

上级 e2841c89
.. _addon_development:
.. _addon_development:
########
二次开发
########
.. toctree::
:hidden:
:maxdepth: 1
design_idea/fluid_design_idea.md
new_op/index_cn.rst
contribute_code/index_cn.rst
contribute-docs/write_docs_cn.md
......@@ -6,11 +6,11 @@ Addon Development
################
.. toctree::
:hidden:
:maxdepth: 1
design_idea/fluid_design_idea_en.md
new_op/index_en.rst
contribute_code/index_en.rst
contribute-docs/write_docs_en.md
......@@ -2,6 +2,8 @@
Write New Operators
###################
This section will guide you through adding an operator, and it also includes some necessary notes.
- `How to write new operator <new_op_en.html>`_ : a guide to writing new operators
- `op notes <op_notes_en.html>`_ : notes on developing new operators
......
.. _user_guide_distribute:
##########
分布式训练
##########
......
......@@ -9,5 +9,4 @@ Distributed Training
cluster_quick_start_en.rst
cluster_howto_en.rst
train_on_baidu_cloud_en.rst
############
模型评估和调试
############
PaddlePaddle Fluid提供了常用的模型评估指标,并提供了VisualDL工具可视化模型效果。
.. toctree::
:maxdepth: 2
metrics
......@@ -4,9 +4,9 @@
本部分包括两篇文档:
- `模型评估 <../evaluation_and_debugging/evaluation/metrics.html>`_:介绍常用模型评估指标的构造方法
- `模型评估 <evaluation/metrics.html>`_:介绍常用模型评估指标的构造方法
- `Visual DL 工具 <../evaluation_and_debugging/debug/visualdl.html>`_:介绍如何利用 Visual DL 工具可视化训练过程
- `Visual DL 工具 <debug/visualdl.html>`_:介绍如何利用 Visual DL 工具可视化训练过程
.. toctree::
:hidden:
......
......@@ -6,9 +6,9 @@ Model Evaluation and Debugging
There are two articles in this section:
- `Model Evaluation <../evaluation_and_debugging/evaluation/metrics_en.html>`_:This section will introduce the construction of common metrics.
- `Model Evaluation <evaluation/metrics_en.html>`_ : This section introduces the construction of common metrics.
- `Visual DL <../evaluation_and_debugging/debug/visualdl_en.html>`_:How to use Visual DL to visualise training process.
- `VisualDL Tools <debug/index_en.html>`_ : How to use VisualDL to visualize the training process.
.. toctree::
:hidden:
......
.. _user_guide_:
########
进阶指南
########
......@@ -14,9 +12,9 @@
- `性能调优 <../advanced_guide/performance_improving/index_cn.html>`_ :介绍飞桨使用过程中的调优方法
- `模型评估调试 <../advanced_guide/evaluation_debugging/index_cn.html>`_ :介绍模型评估与调试的典型方法
- `模型评估/调试 <../advanced_guide/evaluation_debugging/index_cn.html>`_ :介绍模型评估与调试的典型方法
- `二次开发 <../advanced_guide/addon_development/index_cn.html>`_ :介绍如何新增Operator和如何向飞桨社区贡献代码和文档
- `二次开发 <../advanced_guide/addon_development/index_cn.html>`_ :介绍如何新增Operator和如何向飞桨开源社区贡献代码
- `环境变量FLAGS <../advanced_guide/flags/index_cn.html>`_
......
......@@ -8,19 +8,19 @@ Advanced User Guides
So far you have become familiar with PaddlePaddle. Next, read more on:
- `Data Preparing <../advanced_guide/data_preparing/index_cn.html>`_:How to prepare the data efficently.
- `Prepare Data <data_preparing/index_en.html>`_:How to prepare the data efficiently.
- `Distributed Training <../advanced_guide/distributed_training/index_cn.html>`_ :How to apply the distributed training in your projects.
- `Distributed Training <distributed_training/index_en.html>`_ :How to apply the distributed training in your projects.
- `Deploy Inference Model <../advanced_guide/inference_deployment/index_cn.html>`_ :How to deploy the trained network to perform practical inference
- `Deploy Inference Model <inference_deployment/index_en.html>`_ :How to deploy the trained network to perform practical inference
- `Performance Profiling <../advanced_guide/performance_improving/index_cn.html>`_ :How to do profiling for Fluid programs
- `Practice Improving <performance_improving/index_en.html>`_ :How to do profiling for Fluid programs
- `Model Evalution <../advanced_guide/evaluation_debugging/index_cn.html>`_ :How to evalute your program.
- `Model Evaluation and Debugging <evaluation_debugging/index_en.html>`_ :How to evaluate your program.
- `Add on development <../advanced_guide/addon_development/index_cn.html>`_ :How to contribute codes and documentation to our communities
- `Addon Development <addon_development/index_en.html>`_ :How to contribute codes and documentation to our communities
- `Env FLAGS <../advanced_guide/flags/index_cn.html>`_
- `FLAGS <flags_en.html>`_
.. toctree::
......
......@@ -2,9 +2,9 @@
预测部署
########
- `服务器端部署 <inference/index_cn.html>`_ :介绍了支持模型部署上线的Fluid C++ API
- `服务器端部署 <inference/index_cn.html>`_ :介绍了如何在服务器端将模型部署上线
- `移动端部署 <mobile/index_cn.html>`_:介绍了 PaddlePaddle组织下的嵌入式平台深度学习框架Paddle-Lite
- `移动端部署 <mobile/index_cn.html>`_:介绍了 PaddlePaddle 组织下的嵌入式平台深度学习框架Paddle-Lite
.. toctree::
:hidden:
......
......@@ -2,11 +2,10 @@
Deploy Inference Model
#######################
- `Server side Deployment <inference/index_en.html>`_ : This section illustrates Fluid C++ API to support deployment and release of trained models.
- `Server side Deployment <inference/index_en.html>`_ : This section illustrates how to deploy and release trained models on the server side.
- `Paddle Lite <mobile/index_en.html>`_ : Embedded deep learning framework Paddle-Lite organized by PaddlePaddle.
.. toctree::
:hidden:
inference/index_en.rst
inference/index_en.rst
\ No newline at end of file
.. _user_guide_inference:
############
服务器端部署
############
......
......@@ -2,7 +2,7 @@
Server-side Deployment
######################
PaddlePaddle Fluid provides C++ API to support deployment and release of trained models.
PaddlePaddle provides various methods to support deployment and release of trained models.
.. toctree::
:titlesonly:
......@@ -10,5 +10,4 @@ PaddlePaddle Fluid provides C++ API to support deployment and release of trained
build_and_install_lib_en.rst
windows_cpp_inference_en.md
native_infer_en.md
paddle_tensorrt_infer_en.md
paddle_gpu_benchmark_en.md
.. _user_guide_mobile:
##########
移动端部署
##########
本模块介绍了飞桨的端侧推理引擎Paddle-Lite以及模型压缩工具PaddleSlim,包括:
* `项目简介 <mobile_index.html>`_:简要介绍了 Paddle-Lite 特点以及使用说明。
* `Paddle Lite <mobile_index.html>`_:简要介绍了 Paddle-Lite 特点以及使用说明。
* `项目简介 <paddle_slim.html>`_:简要介绍了PaddleSlim 特点以及使用说明。
* `PaddleSlim <paddle_slim.html>`_:简要介绍了PaddleSlim 特点以及使用说明。
.. toctree::
:hidden:
......
##########
性能调优
##########
###############
性能优化分析及工具
###############
.. toctree::
:hidden:
......
.. _performance_improving_:
########
性能调优
########
......
......@@ -3,7 +3,7 @@ Practice Improving
###############
.. toctree::
:hidden:
:maxdepth: 1
multinode_training_improving/cpu_train_best_practice_en.rst
......
......@@ -2,16 +2,16 @@
基本概念
############
本文介绍 Paddle 中的基本概念:
本文介绍飞桨核心框架中的基本概念:
- `编程指南 <./programming_guide/programming_guide.html>`_ : 介绍 Paddle 的基本概念和使用方法。
- `Variable <variable.html>`_ : Variable表示变量,在Paddle中可以包含任何类型的值,在大多数情况下是一个Lod-Tensor。
- `编程指南 <./programming_guide/programming_guide.html>`_ : 介绍飞桨的基本概念和使用方法。
- `Variable <variable.html>`_ : Variable表示变量,在飞桨中可以包含任何类型的值,在大多数情况下是一个LoD-Tensor。
- `Tensor <tensor.html>`_ : Tensor表示数据。
- `LoD-Tensor <lod_tensor.html>`_ : LoD-Tensor是Paddle的高级特性,它在Tensor基础上附加了序列信息,支持处理变长数据。
- `LoD-Tensor <lod_tensor.html>`_ : LoD-Tensor是飞桨的高级特性,它在Tensor基础上附加了序列信息,支持处理变长数据。
- `Operator <operator.html>`_ : Operator表示对数据的操作。
- `Program <program.html>`_ : Program表示对计算过程的描述。
- `Executor <executor.html>`_ : Executor表示执行引擎。
- `DyGraph模式 <./dygraph/DyGraph.html>`_ : Executor表示执行引擎
- `动态图机制-DyGraph <./dygraph/DyGraph.html>`_ : 介绍飞桨动态图执行机制
.. toctree::
:hidden:
......
......@@ -322,6 +322,6 @@ with fluid.layers.control_flow.Switch() as switch:
完成网络搭建后,可以开始在单机上训练您的网络了,详细步骤请参考[单机训练](../../coding_practice/single_node.html)
除此之外,使用文档模块根据开发者的不同背景划分了三个学习阶段:[快速入门](../../index_cn.html)[典型案例](../../../user_guides/index_cn.html)[进阶指南](../../../advanced_guide/index_cn.html)
除此之外,使用文档模块根据开发者的不同背景划分了三个学习阶段:[快速上手](../../index_cn.html)、[典型案例](../../../user_guides/index_cn.html)、[进阶指南](../../../advanced_guide/index_cn.html)
如果您希望阅读更多场景下的应用案例,可以参考[典型案例](../../../user_guides/index_cn.html)。已经具备深度学习基础知识的用户,也可以从[进阶指南](../../../advanced_guide/index_cn.html)开始阅读。
......@@ -2,17 +2,17 @@
快速上手
########
PaddlePaddle (PArallel Distributed Deep LEarning)是一个易用、高效、灵活、可扩展的深度学习框架
PaddlePaddle (PArallel Distributed Deep LEarning)是一个易用、高效、灵活、可扩展的深度学习框架
您可参考PaddlePaddle的 `Github <https://github.com/PaddlePaddle/Paddle>`_ 了解详情,也可阅读 `版本说明 <../release_note_cn.html>`_ 了解新版本的特性
您可参考PaddlePaddle的 `Github <https://github.com/PaddlePaddle/Paddle>`_ 了解详情,也可阅读 `版本说明 <../release_note_cn.html>`_ 了解新版本的特性
让我们从学习PaddlePaddle基本概念这里开始:
- `基本概念 <../beginners_guide/basic_concept/index_cn.html>`_:介绍 Paddle的基本概念和使用方法
如果您已经掌握了飞桨的基本概念,期望可以针对实际问题建模、搭建自己网络,编程实践中提供了一些 Paddle 的使用细节供您参考:
如果您已经掌握了飞桨的基本概念,期望可以针对实际问题建模、搭建自己网络,编程实践中提供了一些 Paddle 的使用细节供您参考:
- `编程实践 <../beginners_guide/coding_practice/index_cn.html>`_
- `编程实践 <../beginners_guide/coding_practice/index_cn.html>`_:介绍如何针对实际问题建模、搭建自己网络
.. toctree::
......
......@@ -5,16 +5,15 @@ Beginner's Guide
PaddlePaddle (PArallel Distributed Deep LEarning) is a
simple, efficient and extensible deep learning framework.
Please refer to `PaddlePaddle Github <https://github.com/PaddlePaddle/Paddle>`_ for details, and `release note <../release_note_en.html>`_ for features incorporated in current version.
Please refer to `PaddlePaddle Github <https://github.com/PaddlePaddle/Paddle>`_ for details, and `Release Note <../release_note_en.html>`_ for features incorporated in current version.
For beginners of PaddlePaddle, the following documentation will tutor you about installing PaddlePaddle:
Let's start by studying the basic concepts of PaddlePaddle:
- `Installation Manuals <../beginners_guide/install/index_en.html>`_ :Installation on Ubuntu/CentOS/Windows/MacOS is supported.
- `Basic Concept <../beginners_guide/basic_concept/index_en.html>`_ : introduces the basic concepts and usage of Paddle
If you have been armed with certain level of deep learning knowledge, and it happens to be the first time to try PaddlePaddle, the following cases of model building will expedite your learning process:
If you have mastered the basic concepts of Paddle and want to model and build your own network for practical problems, you can refer to the usage details of Paddle in the Coding Practice section:
- `Programming with Fluid <../beginners_guide/programming_guide/programming_guide_en.html>`_ : Core concepts and basic usage of Fluid
- `Deep Learning Basics <../beginners_guide/basics/index_en.html>`_: This section encompasses various fields of fundamental deep learning knowledge, such as image classification, customized recommendation, machine translation, and examples implemented by Fluid are provided.
- `Coding Practice <../beginners_guide/coding_practice/index_en.html>`_ : introduce how to model and build your own network for practical problems
.. toctree::
......
......@@ -16,4 +16,4 @@
advanced_guide/index_cn.rst
api_cn/index_cn.rst
faq/index_cn.rst
release_note_en.rst
release_note_cn.md
......@@ -10,4 +10,4 @@
advanced_guide/index_en.rst
api/index_en.rst
faq/index_en.rst
release_note_en.rst
\ No newline at end of file
release_note_en.md
# **使用conda安装**
# **使用Conda安装**
[Anaconda](https://www.anaconda.com/)是一个免费开源的Python和R语言的发行版本,用于计算科学,Anaconda致力于简化包管理和部署。Anaconda的包使用软件包管理系统Conda进行管理。Conda是一个开源包管理系统和环境管理系统,可在Windows、macOS和Linux上运行。
......
Release Notes
==============
## 重要更新
本版本对框架功能层面进行了重点增强,预测部署能力全面提升,分布式发布PLSC支持超大规模分类,并对参数服务器模式进行优化整合。对编译选项、编译依赖以及代码库进行了全面清理优化。模型库持续完善,优化了整体层次结构,增加了动态图模型实现。端到端开发套件和工具组件进一步完善。
**训练框架**:增加自动混合精度训练AMP接口和新控制流接口;优化Tensor使用方式和显存分配策略;新增支持Nvidia DALI GPU数据预处理库;持续优化基础OP的功能和性能;动态图的功能进一步完善,性能大幅提升,对data independent的动态图模型提供转为静态图可预测部署模型的功能;框架调试分析功能和易用性全面提升。
**预测部署**:服务器端预测库的Python API大幅优化,新增R语言、Go语言调用预测库的使用方法和示例,强化了量化支持能力;Paddle Lite支持无校准数据的训练后量化方法生成的模型,加强对OpenCL的支持,支持昆仑XPU的预测;模型压缩库PaddleSlim重构裁剪、量化、蒸馏、搜索接口,新增大规模可扩展知识蒸馏框架 Pantheon,与模型库充分打通。
**分布式方面**:参数服务器模式下针对transpiler的同步、半异步、全异步三种模式,后端实现上统一到communicator中,前端接口统一到fleet中,通过fleet strategy灵活选择不同模式;发布大规模分类库PLSC,通过模型并行支持超多类别的分类任务。
**基础模型库**:发布语音合成库Parakeet,包括多个前沿合成算法;PaddleCV新增14个图像分类预训练模型,3D和跟踪方向模型持续丰富;PaddleNLP的分词和词性标注模型支持jieba分词;PaddleRec增加多任务模型MMoE。模型库整体增加了广泛的动态图模型实现。模型库整体层次结构做了调整优化。
**端到端开发套件**:PaddleDetection和PaddleSeg新增大量模型实现及预训练模型,典型模型的训练速度和精度提升,模型压缩和部署能力大幅提升,使用体验全面优化。发布ElasticRec推荐排序系统,通过K8S进行部署,支持流式训练和在线预测服务。
**工具组件**:PaddleHub新增52个预训练模型,总数超过100,功能和体验持续优化;多任务学习框架PALM升级内核,开放API调用,支持更多的任务类型;联邦学习PaddleFL新增公开数据集。
## 训练框架
- API
- 增加自动混合精度训练AMP接口:能以通用的方式把一个网络转成混合精度训练,同时保证精度波动在正常范围内
- 增加新的控制流接口并推荐使用:新增while_loop(循环控制功能)、cond(条件分支功能)、case和switch_case(分支控制功能)4个控制流OP,更加易用,且支持如下新增功能:
- 支持使用python callable作为控制条件或执行体
- 支持控制流中的不同分支使用不同loss或optimizer
- 支持控制流中的condition部分使用CPU数据或GPU数据
- 部分API参数支持使用变量列表:针对部分API的parameter_list或no_grad_set参数只支持使用字符串列表的情况,增加对变量列表的支持,使用如下API时不再需要提前获取相关变量的name属性:
- fluid.backward.append_backward(loss, parameter_list=None, no_grad_set=None, callbacks=None)
- fluid.backward.gradients(targets, inputs, target_gradients=None, no_grad_set=None)
- 各种Optimizer的minimize方法,如Adam的minimize:minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)
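上述控制流接口的一个最小用法示意如下(仅为草图,示例中的变量与取值均为假设,具体参数请以对应版本的 API 文档为准):

```python
import paddle.fluid as fluid
import paddle.fluid.layers as layers

x = layers.fill_constant(shape=[1], dtype='float32', value=0.3)
y = layers.fill_constant(shape=[1], dtype='float32', value=0.2)

# cond:条件分支,true_fn / false_fn 为 python callable
out = layers.cond(layers.less_than(x, y),
                  true_fn=lambda: layers.elementwise_add(x, y),
                  false_fn=lambda: layers.elementwise_sub(x, y))

# while_loop:循环控制,loop_vars 随每次迭代更新
i = layers.fill_constant(shape=[1], dtype='int64', value=0)
limit = layers.fill_constant(shape=[1], dtype='int64', value=10)
res = layers.while_loop(cond=lambda i: layers.less_than(i, limit),
                        body=lambda i: [layers.increment(i)],
                        loop_vars=[i])

exe = fluid.Executor(fluid.CPUPlace())
print(exe.run(fetch_list=[out, res[0]]))
```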
- 基础功能优化
- 支持使用numpy的float16类型设置Tensor数据,无需先转换为uint16类型。
- 支持直接使用负号,得到Tensor的相反数。
- 显存分配策略:
- 默认策略变为AutoGrowth:在不影响训练速度的情况下,按需申请显存。规避之前的默认显存预分配策略下难以在同一张GPU卡上再起新任务的问题。
- 多卡任务显存分配调整:将不同GPU卡上的显存分配器设置为Lazy初始化的方式。若用户不使用某张卡,则不会在该卡上申请显存。避免当其他GPU卡上有显存占用时,在空闲GPU卡上跑任务若不设置CUDA_VISIBLE_DEVICES导致显存OOM的问题。
- OP功能升级
- elu:该激活函数支持计算二阶梯度。
- prroi_pool:rois参数可以接受Tensor或LoDTensor类型。
- conv2d,pool2d,batch_norm,lrn:反向计算全部支持使用MKL-DNN高性能计算库。
- argsort:支持降序排序(新增descending参数,默认值False)。
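以 argsort 新增的 descending 参数为例,给出一个最小示意(仅为草图,输入数据为假设):

```python
import numpy as np
import paddle.fluid as fluid

x = fluid.layers.assign(np.array([[1.0, 3.0, 2.0]], dtype='float32'))
# descending=True 为本版本新增参数,返回排序结果与对应下标
sorted_out, indices = fluid.layers.argsort(x, axis=-1, descending=True)

exe = fluid.Executor(fluid.CPUPlace())
print(exe.run(fetch_list=[sorted_out, indices]))
```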
- 基础性能优化
- DALI预处理加速
- 增加对Nvidia DALI GPU数据预处理库的支持,可用于加速图片,视频,语音等数据预处理。
- 自动混合精度训练优化
- 实现如下优化策略,并配合DALI数据预处理,ResNet50模型训练吞吐大幅提升:V100单卡混合精度训练吞吐从600+ images/sec提升到1000+ images/sec;单机8卡吞吐达到7840 image/sec,4机32卡吞吐达到28594 images/sec。
- 增强batch_norm和conv2d等op对NHWC数据布局输入的支持,以使用Tensor Core技术加速fp16计算。
- 基于IR Pass机制对模型中的部分op pattern进行融合,如batch_norm和relu等。
- 优化elementwise(add,mul)等op的计算kernel。
- 优化RecomputeOptimizer提升batchsize, 在Bert-large模型上最大batchsize比不使用RecomputeOptimizer增大533.62%,比上一版本提升一倍。
- OP性能优化
- 实现embedding和sequence_pool的融合算子fuse_emb_seq_pool,优化bloom_filter中的murmurhash3_x64_128,有效提升部分NLP模型的训练速度。
- 优化了mean op的GPU性能,输入数据为32*32*8*8的Tensor时,前向计算速度提升2.7倍。
- 优化assign、lod_reset op,避免不需要的显存拷贝和data transform。
- 优化了stack OP的kernel实现,XLnet/Ernie模型GPU单卡性能提升4.1%。
- 动态图
- 功能优化
- 移除了动态图Layers 中的 name_scope 参数,使得用户更方便继承和调用。
- 移除to_variable接口中的block参数,简化了API的使用。
- 针对模型参数依赖数据的问题,移除了 build_once设计,使得Layers在 `__init__` 执行完成之后就可以获取到所有的参数表,方便save/load、参数初始化、参数debug、参数优化等。
- 完善自动剪枝,方便用户组网并减少反向计算量。
- 支持 SelectedRows 操作,使 Embedding 层支持单卡的稀疏更新。
- 针对框架缺少容器类的问题,新增ParameterList、LayerList、Sequential功能,方便用户组网。
- 支持named_sublayers、named_parameters功能,方便用户编程。
- 支持Linear lr warmup decay策略。
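容器类与 named_parameters 的一个用法示意(仅为草图,网络结构与输入为假设,类名以官方 API 文档中的 Sequential / LayerList / ParameterList 为准):

```python
import numpy as np
import paddle.fluid as fluid
from paddle.fluid.dygraph import Linear, Sequential

with fluid.dygraph.guard():
    # Sequential 容器便于按顺序组网
    model = Sequential(Linear(16, 8), Linear(8, 1))
    x = fluid.dygraph.to_variable(np.random.random([4, 16]).astype('float32'))
    y = model(x)
    # named_parameters 便于遍历、调试参数
    for name, param in model.named_parameters():
        print(name, param.shape)
```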
- 性能优化
- 优化了python 与c++ 交互,GradMaker、OperatorBase、allocator等。基于LSTM的语言模型任务在P40机器上性能提升270%。
- 针对optimize中多次调用optimized_guard无用代码导致的性能问题,移除了冗余代码。Transformer模型(batch_size=64)在P40机器上,SGD、Adam等优化器有5%~8%的性能提升。
- 针对AdamOptimizer中额外添加scale_op更新beta参数对性能的影响,将beta更新逻辑融合到adam_op中,减少op kernel调用开销。Dialogue-PLATO模型P40机器上性能提升9.67%。
- 优化动态图异步DataLoader,在Mnist、ResNet、等模型上整体训练速度提升约30%。
- 新增numpy bridge功能,支持在cpu模式下Tensor和ndarray之间共享底层数据,避免创建Variable时numpy输入需要拷贝的问题,提升效率。
- 显存优化:提前删除反向不需要Tensor Buffer的前向变量空间的优化策略,在ResNet等模型上最大batch size提升20%-30%以上。
- 动态图部署
- 支持TracedLayer接口,实现 data independent的动态图模型转为静态图可预测部署的模型。
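TracedLayer 的一个最小用法示意(仅为草图,层结构与保存路径均为假设):

```python
import numpy as np
import paddle.fluid as fluid
from paddle.fluid.dygraph import Linear, TracedLayer

with fluid.dygraph.guard():
    layer = Linear(16, 4)
    in_var = fluid.dygraph.to_variable(np.random.random([2, 16]).astype('float32'))
    # trace 返回动态图输出和可保存为静态图预测模型的 TracedLayer
    out, static_layer = TracedLayer.trace(layer, inputs=[in_var])
    static_layer.save_inference_model('./traced_infer_model')
```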
- 调试分析
- 报错信息优化 :对框架报错信息整体归类,实现报错信息的体系化,同时完成文案优化,帮助用户更快速、准确的定位和解决问题。
- 优化性能分析profile 功能
- 增强profiler的功能和准确性,支持不同级别的profile选项,能够在profile数据中记录事件的调用关系并打印出来。
- 优化nan inf检查调试(通过FLAGS_check_nan_inf生效),性能、功能及输出信息均有较大提升:
- 速度上,v100测试ResNet50模型相比原工具组件约有1000倍性能提升,保持正常训练80%以上的效率。
- 功能上,增加fp16的支持,可设置环境变量跳过op、op_role、op_var的检查,方便fp16模型的调试。
- 输出信息更加翔实,除出错的op及tensor名称外,还会打印出错的nan、inf及正常数值的数量以便于调试。
- 发布cpu训练和预测的轻量级安装包paddlepaddle-tiny,支持window/linux/Mac操作系统以及python27/python35/python36/python37:
- 编译选项:no avx, no ml, no gpu, no unittest
- 裁剪掉slim和部分dataset。
- linux包体积从90M减小到37M;windows包体积从50.8M减小到9.6M;mac包体积从59M减小到19.8M。
- 安装requirements依赖从15个减小到7个。
## 预测部署
- 服务器端预测库
- Python API
- 支持从内存读写模型,以满足模型加密的需求。
- 不再在预测模型最后添加 Scale 算子。
- 新增对ZeroCopy预测的支持,与C++接口基本一致,支持以numpy.ndarray作为输入和输出,在Python端使用更加方便。
- 在AnalysisConfig中增加多个接口,完整覆盖C++预测的功能,包括删除pass、禁用预测glog等。
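Python 端 ZeroCopy 预测的一个用法示意(仅为草图,模型目录与输入 shape 均为假设):

```python
import numpy as np
from paddle.fluid.core import AnalysisConfig, create_paddle_predictor

config = AnalysisConfig('./mobilenet_model')   # 假设目录下包含 __model__ 与参数文件
config.disable_glog_info()                     # 关闭预测 glog 输出
config.switch_use_feed_fetch_ops(False)        # ZeroCopy 预测需关闭 feed/fetch op

predictor = create_paddle_predictor(config)

input_name = predictor.get_input_names()[0]
input_tensor = predictor.get_input_tensor(input_name)
input_tensor.copy_from_cpu(np.random.random([1, 3, 224, 224]).astype('float32'))

predictor.zero_copy_run()

output_name = predictor.get_output_names()[0]
result = predictor.get_output_tensor(output_name).copy_to_cpu()   # numpy.ndarray
```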
- 其他编程语言的支持
- 新增R语言、Go语言调用预测库的使用方法和示例
- 对外提供 ProtoBuf 对应的头文件,方便用户解析模型结构的需求。
- 带TRT编译的预测库不再从thrid_party中提供TensorRT库,需要用户自行到https://developer.nvidia.com/tensorrt 下载
- 功能增强:
- 打通Paddle Lite以子图方式接入,已验证 ResNet50。
- 新增MKL-DNN FC INT8 kernel的支持
- Paddle-TensorRT支持Ernie模型,Ernie模型(seq length=128) 在T4卡上fp16预测速度为3.6ms, 比fp32加速37%。
- 量化:在ERNIE INT8精度相比于FP32 精度提升2%下,ERNIE INT8在第二代至强可扩展平台6271上单线程性能优化提升2.70倍,多线程性能提升1.79倍
- 移动/嵌入式端Paddle Lite(https://github.com/PaddlePaddle/Paddle-Lite)
- 对应发布v2.3版本。
- model_optimize_tool多项功能升级。
- 支持“无校准数据的训练后量化方法”,减小模型存储空间(2~4倍)。
- OpenCL:完成30个Image2D Kernel迁移,涵盖14个OP。
- 对FPGA、NPU的支持进一步加强;支持昆仑XPU的预测。
- 发布全新官网文档;新增“无校准数据的训练后量化方法”使用文档。
- Paddle Serving(https://github.com/PaddlePaddle/Serving):
- 发布bert类语义理解模型的远程文本向量表示预测服务。
- 发布了paddle-gpu-serving whl包,通过pip安装和Python代码即可部署和使用预测服务;
- 支持Paddlehub中的13种语义理解模型,支持单机多卡,使用Ernie_tiny模型在单张P4 GPU下平均样本长度为7时预测速度为869.56样本每秒。
- PaddleSlim(https://github.com/PaddlePaddle/PaddleSlim):
- 拆分PaddleSlim为独立repo。
- 重构裁剪、量化、蒸馏、搜索接口,对用户开放底层接口。
- 量化:
- 新增基于KL散度的离线量化功能,支持对Embedding层量化。
- 新增对FC的QAT MKL-DNN量化策略支持
- 新增PostTrainingQuantization,完整实现训练后量化功能:支持量化30种OP,支持灵活设置需要量化的OP,生成统一格式的量化模型,具有耗时短、易用性强、精度损失较小的优点。
- 量化训练支持设定需要量化的OP类型。
- 裁剪: 重构剪裁实现,方便扩展支持更多类型的网络。
- 搜索:
- 支持SA搜索,增加更多的搜索空间,支持用户自定义搜索空间。
- 新增one-shot搜索算法,搜索速度比上个版本快20倍。
- 新增大规模可扩展知识蒸馏框架 Pantheon
- student 与 teacher 、teacher与 teacher 模型之间充分解耦,可分别独立运行在不同的物理设备上,便于充分利用计算资源;
- 支持 teacher 模型的单节点多设备大规模预测,在 BERT 等模型上测试加速比达到线性;
- 用 TCP/IP 协议实现在线蒸馏模式的通信,支持在同一网络环境下,运行在任意两个物理设备上的 teacher 模型和 student 模型之间进行知识传输;
- 统一在线和离线两种蒸馏模式的 API 接口,不同的 teacher 模型可以工作在不同的模式下;
- 在 student 端自动完成知识的归并与知识数据的 batch 重组,便于多 teacher 模型的知识融合。
- 模型库:
- 发布ResNet50、MobileNet模型的压缩benchmark
- 打通检测库,并发布YOLOv3系列模型的压缩benchmark
- 打通分割库,并发布Deeplabv3+系列分割模型的压缩benchmark
- 完善文档:
- 补充API文档;新增入门教程和高级教程;增加ModelZoo文档,覆盖分类、检测、分割任务。所有文档包含中、英文。
## 分布式
- 参数服务器模式:
- 大幅降低训练过程中的内存占用,在1亿规模embedding任务上,Trainer端内存可以降低90%
- 大幅降低分布式保存模型、加载模型的内存占用, Pserver端内存峰值最大可降低为原先的1/N,N为Pserver节点个数。
- 优化GEO-SGD 稠密参数通信
- 支持分布式AUC指标计算
- 新增分布式Barrier功能
- 非Fleet的transpiler API加入过期警示, 该API计划在PaddlePaddle-Fluid 2.0中移除
- Communicator加入半异步模式和同步模式
- TrainFromDataset训练接口支持半异步模式和同步模式
- Fleet加入DistributedStrategy, 进一步提升分布式易用性, 整合目前分布式相关FLAG
- Fleet pslib模式支持一个program多loss训练,优化训练性能
- 千亿稀疏模式支持k8s环境。
- 大规模分类库PLSC:支持受限于显存容量数据并行无法处理的大规模分类问题(https://github.com/PaddlePaddle/PLSC)
- 内建ResNet50、ResNet101和ResNet152三种模型,并支持自定义模型;单机8张V100 GPU配置下,ResNet50模型百万类别训练速度2,122.56 images/s,相比标准ResNet50模型加速1.3倍;
- 发布模型在线预测服务plsc-serving whl包,预测人脸识别模型的图片语义向量表示,支持使用用户训练的模型进行预测。ResNet50模型(batch size=256)在单张V100 GPU下预测速度为523.47 images/s;
- 发布基于ResNet50网络和MS1M-ArcFace数据集的预训练模型:https://plsc.bj.bcebos.com/pretrained_model/resnet50_distarcface_ms1mv2.tar.gz。
- 发布ResNet50混合精度训练benchmark(单卡、多卡、多机)。
## 基础模型库
(https://github.com/PaddlePaddle/models)
- PaddleNLP
- seq2seq支持RL和GAN等训练模式
- 发布分词和词性标注训练模型,利用知识蒸馏框架 Pantheon,在自有数据集上比PaddleNLP的LAC模型F1值提升1%;合入jieba分词,通过加入use_paddle标签来开启深度学习模型模式;并在jieba中加入paddle版本检测和回退机制,保障用户体验。
- 增加动态图模型实现:word2vec、senta、transformer、bert、seq2seq、LAC。
- PaddleSpeech
- 语音合成:发布合成库Parakeet
- 实现语音合成模型数据预处理、训练和合成等的标准工作流
- 提供对常见数据集的开箱即用的预处理实现
- 提供语音合成领域常用模型组件,为实现模型提供支持
- 发布语音合成模型 DeepVoice3、ClarinNet 、TransformerTTS、FastSpeech、WaveNet、WaveFlow
- PaddleCV
- 图像分类:
- 新增预训练模型SENet-vd、Res2Net、HRNet系列模型总共14个:
- SE_ResNet18_vd,SE_ResNet34_vd,SE_ResNeXt50_vd_32x4d,ResNeXt152_vd_32x4d
- Res2Net50_26w_4s,Res2Net50_14w_8s,Res2Net50_vd_26w_4s
- HRNet_W18_C,HRNet_W30_C,HRNet_W32_C,HRNet_W40_C,HRNet_W44_C,HRNet_W48_C,HRNet_W64_C
- 支持使用DALI加速数据预处理,在ImageNet训练上获得1.5倍(ResNet50)至3倍以上(ShuffleNet)加速,并大幅提升GPU利用率。
- 3D方向:
- 发布模型PointNet++、PointRCNN。
- 跟踪模型库 :
- 发布模型SiamFC、SiamRPN、SiamMASK、ATOM、ATP。
- 增加动态图模型实现: MobileNet-v1/v2、YOLOv3、FasterRCNN、MaskRCNN、视频分类TSM模型、视频动作定位BMN模型。
- PaddleRec
- 发布推荐领域多任务模型MMoE, 适用于工业界大规模多任务联合训练。
- 增加动态图模型实现:gru4rec、deepfm。
## 端到端开发套件
- PaddleDetection(https://github.com/PaddlePaddle/PaddleDetection)
- 进一步提升YOLOv3模型精度,COCO数据上精度达到43.2%,相比上个版本绝对提升1.4%。
- 新增模型实现及预训练模型:
- 新增Google AI Open Images 2019-Object Detection比赛中的最佳单模型CascadeCARCNN-FPN-Dcnv2-Nonlocal ResNet200-vd,同时也发布此算法基于Objects365数据的预训练模型。
- 新增backbone为CBResNet、Res2Net、HRNet的系列预训练模型。
- 新增LibraRCNN算法及预训练模型。
- FasterRCNN R50 FPN模型新增基于GIoU、DIoU、CIoU loss的预训练模型,不降低预测速度的情况下,在COCO数据上精度分别提升1.1%,0.9%,1.3%。
- 新增模块:
- 主干网络: 新增CBResNet、Res2Net、HRNet。
- Loss模块: 新增GIoU loss、 DIoU loss、CIoU loss,以及Libra loss,YOLOv3的loss支持细粒度op组合。
- 后处理模块: 新增softnms,DIOU nms模块。
- 正则模块: 新增DropBlock模块。
- 功能优化和改进:
- 加速YOLOv3数据预处理,整体训练提速40%。
- 优化数据预处理逻辑。
- 增加人脸检测预测benchmark数据。
- 增加Paddle预测库Python API下的预测示例。
- 检测模型压缩 :
- 裁剪: 发布MobileNet-YOLOv3裁剪方案和模型,在VOC数据集上FLOPs - 69.6%, mAP + 1.4%,在COCO数据集上FLOPS-28.8%, mAP + 0.9%; 发布ResNet50vd-dcn-YOLOv3裁剪方案和模型,在COCO数据集上FLOPS - 18.4%, mAP + 0.8%。
- 蒸馏: 发布MobileNet-YOLOv3蒸馏方案和模型,在VOC数据上mAP + 2.8%,在COCO数据上mAP + 2.1%。
- 量化: 发布YOLOv3-MobileNet和BlazeFace的量化模型。
- 裁剪+蒸馏: 发布MobileNet-YOLOv3裁剪+蒸馏方案和模型,在COCO数据集上FLOPS - 69.6%,GPU下预测加速64.5%,mAP - 0.3 %; 发布ResNet50vd-dcn-YOLOv3裁剪+蒸馏方案和模型,基于COCO数据FLOPS - 43.7%,GPU下预测加速24.0%,mAP + 0.6 %。
- 搜索: 开源BlazeFace-Nas的完整搜索方案。
- 预测部署:
- 适配Paddle预测库对TensorRT的支持、对FP16精度的支持。
- 文档:
- 新增数据预处理模块介绍文档、实现自定义数据Reader的文档。
- 新增如何新增算法模型的文档。
- 文档部署到网站: https://paddledetection.readthedocs.io/zh/latest/
- PaddleSeg(https://github.com/PaddlePaddle/PaddleSeg)
- 新增模型
- 适用于车道线分割场景的LaneNet模型。
- 适用于轻量级场景的Fast-SCNN模型。
- 适用于高精度场景的HRNet语义分割模型 。
- 发布基于PaddleSlim的多种模型压缩方案:
- 基于Cityscape的Fast-SCNN裁剪方案和模型。
- 基于Cityscape的Deeplabv3p-Xception和Deeplabv3p-MobilenetV2蒸馏方案。
- 基于Cityscape的Deeplabv3p-MobilenetV2搜索方案。
- 基于Cityscape的Deeplabv3p-Mobilenet量化方案和模型。
- 预测部署能力提升
- 新增Python轻量级部署。
- 新增对 FP16、Int8量化模型的TensorRT预测加速支持。
- 新增DeepLabv3p-MobileNetV2的人像分割Paddle-Lite移动端部署教程和案例。
- 优化模型导出环节,支持图像预处理和后处理的GPU化,性能提升10%~20%。
- 提供U-Net, ICNet, PSPNet, DeepLabv3+等模型在不同尺寸图像上的预测性能Benchmark,便于用户根据性能进行模型选型。
- 体验优化
- 新增学习率warmup功能,支持与不同的学习率Decay策略配合使用,提升Fine-tuning的稳定性。
- 支持对标注图使用伪彩色图像格式的保存,提升标注图片的预览体验。
- 新增自动保存mIoU最优模型的功能。
- 全面优化文档逻辑,提供如工业质检、眼底筛查等工业场景的AIStudio实战教程。
- ElasticRec(https://github.com/PaddlePaddle/ElasticRec)
- 发布了ElasticRec推荐排序系统,通过K8S进行部署,支持流式训练和在线预测服务。
## 工具组件
- PaddleHub(https://github.com/PaddlePaddle/PaddleHub)
- 预训练模型丰富,新增52个预训练模型,目前预训练模型总数100+:
- 语义模型:新增RoBERTa_wwm、BERT_wwm、ERNIE-Tiny等5个语义模型
- 文本分类:新增黄反鉴别模型3个。
- 图像分类:新增ResNext-WSL、EfficientNet等共36个图像分类模型。
- 目标检测:新增行人检测,车辆检测等共5个检测模型。
- 关键点检测:新增人脸关键点检测和人体姿态关键点检测模型2个。
- 人脸口罩检测:新增基于PyramidBox-Lite的人脸口罩检测模型2个。
- 通用人脸检测:新增Ultra Light Fast Generic Face Detector、PyramidBox-Lite等通用人脸检测模型4个。
- 功能:
- 新增基于Paddle Serving的Bert Service文本向量表示服务。
- Task灵活性增强,新增Hook机制可以支持用户自定义代码加载。
- 新增彩色Colorlog,修复日志重复打印问题。
- 优化代码结构,命令行执行速度提升50%。
- 重构Dataset、Reader,适配自定义数据集代码量降低60%。
- 优化AutoFinetune接口,支持多实验的可视化效果显示。
- 体验优化
- 逻辑全面优化,新增丰富的AIStudio教程内容。
- 官网落地页全新升级,提供在线快速体验和教程指导的功能。
- 多任务学习框架PALM(https://github.com/PaddlePaddle/PALM)
- 支持python3和windows
- 升级框架内核和多任务底层机制,开放API调用
- 灵活的模型保存机制,支持单任务保存和全图保存
- 支持连续训练和连续预测,单次执行下可自由切换数据集文件
- 新增模型定制化/自定义功能
- 重构多任务底层kernel,修复若干影响通用性和稳定性的bugs
- 强化多任务学习能力
- 支持多任务场景下每个任务有不同的batch size和sequence length
- 修复了多任务多卡训练时,各个显卡上任务不一致的问题
- 优化了多任务学习调度和终止策略,普遍提升模型泛化能力
- 强化支持的任务的功能和类型
- 匹配任务支持增强,支持pairwise learning和多类别(如NLI句子关系判断)。
- 机器阅读理解任务支持增强,新增用户可控的预处理超参数。
- 新增支持序列标注任务。
- 强化大规模训练/推理能力
- 新增自动多卡预测能力
- 重构异步reader,多卡场景下支持变长padding
- 新增预训练模型管理和下载模块
- 支持BERT、ERNIE、RoBERTa等各预训练模型的管理和下载
- 新增RoBERTa中文预训练模型
- 联邦学习PaddleFL(https://github.com/PaddlePaddle/PaddleFL):
- 新增scheduler与submitter功能:scheduler可用于在训练过程中控制trainer是否参加更新 。submitter可用于完成在MPI集群提交paddleFL任务的功能
- 新增LEAF dataset联邦学习公开数据集,并添加api,用于设置benchmark。支持图像分类,情感分析,字符预测等领域的经典数据集,如MNIST,Sentiment140
- 根据新增组件,在example中修改了原有的样例,并添加了femnist_demo, submitter_demo样例
- 优化fl_distribute_transpiler,使FedAvg strategy新增对adam optimizer支持;
- 新增SecAgg strategy(Secure Aggregation),用于实现安全的参数聚合;
## 代码重构和升级
- 编译
- 增加WITH_NCCL编译选项,单卡用户可显式指定WITH_NCCL=OFF加速编译。
- 新增编译选项WITH_TP_CACHE,缓存第三方源码,避免重复下载,Windows用户可将其设置为ON,加快编译速度并提高编译稳定性。
- `CUDA_ARCH_NAME`默认值设成`Auto`(`All`表示编译所有gpu架构,`Auto`表示只编译当前机器gpu架构),对开发者来说,使用`Auto`相比`All`能节省非常多的编译时间,提高开发效率。
- 减少了冗余的link环节与产物、多余的文件拷贝,加快了Windows下的编译速度。
- 外部依赖库
- 升级MKL-DNN到最新1.1版本。
- 将预测库与`third_party` 解耦,重构了28个第三方依赖的编译代码,便于统一管理外部依赖。
- 移除了第三方依赖的私人仓库2个、无用依赖1个、无用的patch下代码2000+行,提高仓库质量。
- 代码清理、重构和优化
- 去掉无用的`contrib/float16`目录,删除BRPC下无用的snappy/snappystream依赖。
- 在`python/paddle/fluid/layers/nn.py`中,根据API功能拆出`loss.py`和`sequence_lod.py`,减少`nn.py`的代码量,便于阅读。
- 修复`-Wno-error=sign-compare`的warning对应的代码(共100多处),后续所有该类warning会在编译时报错,提高代码质量
- 去掉WindowsMSVC编译的`WarningLnk4006/WarningLnk4221`(共约300处),提高仓库质量。
- 减少reduce_op, expand_op, expand_as_op模版类数量,加速GPU编译和减少whl包70M的空间。
- 动态图下通过代码自动生成每个OP的pybind函数,用于在layers中直接调用,提高动态图性能并减少与静态图的耦合度。
## BUG修复
- 修复基于PaddleDetection的 Faster-RCNN使用Python API预测时MKL-DNN报错问题。
- 修复sum op的GPU实现中,由于部分Tensor没有初始化引起训练挂掉的问题。
- 修复fill_constant中,value设置为大整数时精度损失的问题。
- 修复softmax_with_cross_entropy_op在CUDA上的精度不一致问题。
- 修复clone program时program中的stop_gradient属性不能拷贝到新program的问题。
- 修复elementwise_pow op在整数上的精度损失问题。
- 修复一些 GFLAGS 不能在预测库外进行指定的问题。
- 修复 AnalysisPredictor 多线程下若干 Pass 导致预测随机 core 的问题。(fc_gru_fuse_pass,seqconv_eltadd_relu_fuse_pass,attention_lstm_fuse_pass,embedding_fc_lstm_fuse_pass,fc_lstm_fuse_pass,seq_concat_fc_fuse_pass)
- 修复了在使用 NativePredictor 指定使用 CPU 预测后,在同一进程内使用 AnalysisConfig 指定 GPU 不生效的错误。
- 修复-DWITH_MKL=OFF时编译报错(setup.py拷贝与op_function_cmd出错)的bug。
- 修复py_func OP无法输入tuple(Variable) 的bug,新增如何写PythonOP的代码示例。
- 修复sigmoid cudnn kernel错调用成tanh cudnn kernel的问题。
- 修复部分动态图模式下reshape、depthwiseconv相关的bug;修复网络中部分参数无梯度,导致程序crash 的bug。
- 修复GradientClip在参数服务器模式下运行错误的BUG。
- 修复参数服务器全异步模式下内存泄露的问题。
==============
Release Notes
==============
目录
##########
* 重要更新
* 用户体验提升
* 编程易用性提升
* 默认配置项
* 接口优化
* 报错信息优化
* 文档优化
* 编译优化
* Windows支持增强
* 训练框架
* 性能优化
* OP优化
* Intel N-Graph集成
* 动态图
* 预测部署
* 服务器云端预测库
* 移动、嵌入式端侧预测库
* Paddle Serving
* PaddleSlim
* 分布式训练
* 性能优化
* 容错
* 部署
* 模型建设
* 易用性优化
* PaddleNLP
* PaddleCV
* PaddleSpeech
* PaddleRec
* 工具组件
* PaddleHub
* PGL 图学习框架
* PARL 深度强化学习框架
* PaddleFL 联邦学习
* Paddle2ONNX
* X2Paddle
* BUG修复
重要更新
##########
* 用户体验和易用性专项提升,包含全面的文档优化、报错信息优化、配置项优化、接口优化、编译优化、多平台支持以及编程易用性提升等各方面。
* 训练框架进一步优化了速度,完善了显存优化机制,并支持在框架外部自定义C++/CUDA OP。新增了大量OP,并从多个维度优化了大量存量OP,包括兼容性、行为一致性、功能提升等方面。
* 分布式训练新增LocalSGD、GEO-SGD等策略,大规模同步训练、异步训练速度继续提升,并支持K8S + Volcano任务提交。
* 部署能力强化
* 服务器端预测库增加C API,并支持版本兼容检查,实现了大量性能优化工作。
* 发布PaddleLite,定位高性能、多平台、轻量化的端侧预测引擎,并可作为服务器端预测库的加速库。
* PaddleServing新增超大规模分布式预估服务能力。
* PaddleSlim强化了量化训练功能,增加了基于硬件的小模型搜索功能。
* 模型库易用性和丰富度提升
* PaddleNLP,发布全新seq2seq相关API和文本生成模型样例。语义表示库新增XLNet预训练模型;开源EMNLP2019阅读理解竞赛冠军模型D-NET,同时支持18个不同抽取式阅读理解数据集打榜。发布飞桨多任务学习库PALM (PAddLe Multi-task learning),更便捷支持多任务机器学习调研。
* PaddleCV,发布训练部署端到端的图像分割库PaddleSeg。图像分类新增EfficientNet等43个预训练模型。PaddleDetection新增2019 Objects365 Full Track冠军模型、BlazeFace等人脸检测小模型,行人检测和车辆检测的预训练模型。PaddleVideo新增ActivityNet Challenge 2019夺冠模型,扩展包含video caption、video grounding等模型。
* 发布PaddleSpeech,包含语音识别模型DeepSpeech和语音合成模型 DeepVoice3 ;
* 增加PaddleRec的模型覆盖
* 配套工具组件全面升级:
* PaddleHub新增超参优化Auto Fine-tune功能,并全面提升Fine-tune功能的灵活性和易用性,预训练模型数量大幅增加。
* 飞桨图学习框架PGL正式版发布,易用性、规模性、丰富性全面提升。
* 飞桨深度强化学习框架PARL并行能力进一步提升,支持进化算法。
* Paddle2ONNX和X2Paddle全面升级,飞桨和其他框架的模型互转更加方便。
* 发布飞桨联邦学习框架PaddleFL
用户体验提升
#########
* 编程易用性提升
* fetch变量便利化:针对以往基于变量重命名的存储优化策略必须要求fetch变量设置persistable = True的bug,重构了Inplace复用和跨Operator复用策略,不再强制要求fetch变量必须设置persistable=True,且不会改变任何变量的名称,且均能保证结果的正确性。
* optimizer.minimize和其他接口调用的位置敏感性问题
针对用户搭建网络时易将exe.run(startup_program)置于optimizer.minimize之后执行,从而导致不明报错的问题,在Optimizer类Op中增加了初始化检查以及易于理解的提示信息,使再出现此类问题时用户能够快速定位错误。
* 针对用户搭建网络时易将test_program = main_program.clone(for_test=True)置于optimizer.minimize之后执行,从而导致模型测试结果错误的问题,增加了prune_backward接口对在minimize之后clone的test_program进行反向部分的裁剪,使test_program的clone操作的正确执行不再依赖于optimizer.minimize的先后关系。
* 默认配置项
* 显存Garbage collection开关默认打开(对应FLAGS_eager_delete_tensor_gb环境变量=0)。
* build_strategy的选项:
* build_strategy.enable_inplace:inplace策略默认打开。这样显存Garbage Collection策略和inplace策略全默认打开,默认策略即已验证过的最优策略。
* build_strategy.memory_optimize:跨Op显存复用优化策略的默认行为调整为:在Garbage Collection策略打开时默认关闭(规避两者合用会比只用Garbage Collection策略效果差的问题);而在Garbage Collection策略关闭时默认打开。用户可显式设置build_strategy.memory_optimize = True/False强制打开或关闭跨op显存复用优化策略。
* 提升了一些速度优化策略的普适性,将fuse_all_reduce_ops、fuse_broadcast_ops 选项默认打开,可以减少计算图中的计算节点个数,进而加速计算图执行。
* execution_strategy选项:
* 将num_iteration_per_drop_scope默认值从1改成100,避免每次迭代之后都要进行一次同步操作,提升速度。
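上述默认配置项也可以在代码中显式设置,一个最小示意如下(仅为草图,网络结构为假设):

```python
import paddle.fluid as fluid

x = fluid.data(name='x', shape=[None, 8], dtype='float32')
loss = fluid.layers.reduce_mean(fluid.layers.fc(x, size=1))
fluid.optimizer.SGD(learning_rate=0.01).minimize(loss)

build_strategy = fluid.BuildStrategy()
build_strategy.enable_inplace = True              # inplace 策略,默认已打开
build_strategy.memory_optimize = False            # 可显式强制打开/关闭跨 op 显存复用
build_strategy.fuse_all_reduce_ops = True

exec_strategy = fluid.ExecutionStrategy()
exec_strategy.num_iteration_per_drop_scope = 100  # 新的默认值

compiled_prog = fluid.CompiledProgram(fluid.default_main_program()).with_data_parallel(
    loss_name=loss.name,
    build_strategy=build_strategy,
    exec_strategy=exec_strategy)
```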
* 接口优化
* 针对Python存储优化接口paddle.fluid.memory_optimize优化效果欠佳、不稳定等问题,彻底废弃了此接口,此版本后该接口不会对用户网络进行任何优化,并可能在后续版本中彻底移除,建议用户删除代码中的paddle.fluid.memory_optimize调用。
* 统一DataLoader接口。针对以往Reader接口繁多、名称晦涩难懂等问题,统一了PyReader和Dataset接口,用户可通过fluid.io.DataLoader.from_xxx创建数据加载器,可通过for-range方式迭代,简化使用方法,统一接口形式。
* RecordIO接口移除,不再支持RecordIO接口。
* 优化data接口,新的fluid.data接口相对fluid.layers.data 接口将对输入数据的 shape 和 dtype 进行检查,使用 None 和 -1 支持可变长维度。如果输入的 shape 或者 dtype 不对,将会报错。
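fluid.data 与统一后的 DataLoader 接口的一个用法示意(仅为草图,数据生成器为假设):

```python
import numpy as np
import paddle.fluid as fluid

# fluid.data 会检查输入的 shape/dtype,None 表示可变长维度
x = fluid.data(name='x', shape=[None, 784], dtype='float32')
y = fluid.data(name='y', shape=[None, 1], dtype='int64')

loader = fluid.io.DataLoader.from_generator(feed_list=[x, y], capacity=16, iterable=True)

def fake_reader():   # 假设的样本生成器,仅作演示
    for _ in range(8):
        yield (np.random.random([784]).astype('float32'),
               np.random.randint(0, 10, [1]).astype('int64'))

loader.set_sample_generator(fake_reader, batch_size=4, places=fluid.cpu_places())

for data in loader():       # 以 for-range 方式迭代
    pass
```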
* 报错信息优化
* 简化C++信息栈输出,过滤和paddle函数无关的、对调试几乎没有帮助的栈信息和符号,大幅缩短了信息栈长度,提升了调试体验。
* 对报错信息栈重新排版,添加清晰的分段标识与提示,并将核心提示置于最后,便于用户迅速定位重要信息,提升了调试体验。
* 对34个重点python api增加输入类型检查,能正确报出输入类型不符合的错误,避免误导性报错。
* 增强34个重点Op的维度检查报错信息,能打出详细维度信息,便于用户调试。
* 针对sequence类op输入不含LoD的Tensor时报错不清晰的问题,为sequence类op增加了Input Tensor LoD信息检查,使错误提示更加直观易懂。
* 强化机器自动化报错信息输出,在CI中强制推荐使用PADDLE_ENFORCE_XXX来替换PADDLE_ENFORCE接口,模版化打印出更具体的报错信息,并对应完成存量修复。
* 文档优化
* 全面优化了所有API的中英文文档,保证文档的正确性、规范性、易读性,完善对应示例。
* 增加了动态图中相关的更多文档说明和实例。
* 对预测教程文档进行整体修改,重新组织结构和内容,提高了可读性和实用性。
* 优化了部分指南性文档。
* 编译优化
* 将默认的CMAKE_BUILD_TYPE从RelWithDebInfo改成Release,减少初次接触的开发者的编译目录大小,避免因为编译目录太大导致编译失败。
* 修复inference_lib.cmake编译随机失败的问题。
* 去掉use_fast_math编译选项,避免为了提升性能而降低了CPU/GPU上的精度。
* Windows支持增强
* 支持vs2017编译。
* 编译流程优化,拆分第三方和Paddle的编译依赖关系,不再依赖openblas的预编译库。
* 支持cuda10。
* 增加模型支持,修复之前在windows无法正常运行的模型。
* 支持Paddle CPU 版本离线安装包。
* 支持预测SDK C-API。
训练框架
##########
* 性能优化
* GPU性能优化
* 使用cuRAND库优化dropout的GPU实现,dropout op本身加速3.4倍,Transformer base模型和big模型在V100上的训练分别加速3.8%和3.0%。
* 对smooth_label完成CUDA核函数代替Eigen的实现,smooth_label op本身加速1.47倍。
* 对 recurrent_op 的冗余 tensor copy 进行 share data,和删除运算过的 scope,该优化使得 benchmark 中 RNN 相关模型显存占用减少了 3 - 4 倍,速度有 2% - 数倍的提升。
* CPU性能优化
* BERT优化:新增matmul multi-head MKL的支持。
* 对lookup_table_op和sequence_pool_op (sum类型)做fuse,使用sparse GEMM优化,PyramidDNN模型在CPU上的训练速度获得8%的提升。
* 内存/显存优化
* 新增变长输入下的MKLDNN分层缓存策略和清理策略,修复MKLDNN在变长输入下内存泄漏问题 。
* 添加了控制流 op 多层嵌套情况下的显存优化策略支持。
* Allocator容错机制。针对多线程并发申请显存导致显存可能瞬间峰值超标问题,设计了Allocator重试策略,在第一次申请显存失败后会等待最长10s进行失败重试(若期间有显存释放,会提前触发失败重试)。
* 显存Cache清理。解决了以往TemporaryAllocator和Cudnn workspace单例会cache显存不释放的问题,提高显存利用率。
* 新增AutoGrowth显存分配策略。用户可通过设置环境变量FLAGS_allocator_strategy=auto_growth开启显存自增长策略,按需分配显存,解决了原有预分配92%可用显存策略占用显存过多、难以按需分配的问题,且不影响模型训练速度。
* 显存的Allocator容错机制完善,保证Allocator的稳定性。针对多线程并发申请显存导致显存可能瞬间峰值超标问题,设计了Allocator重试策略,在第一次申请显存失败后会等待最长10s进行失败重试(若期间有显存释放,会提前触发失败重试)。
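显存分配策略等 FLAG 可通过环境变量控制,需在框架初始化(import)之前设置,示意如下(仅为草图):

```python
import os

# 在 import paddle 之前设置,开启显存 AutoGrowth 自增长策略
os.environ['FLAGS_allocator_strategy'] = 'auto_growth'

import paddle.fluid as fluid
print(fluid.is_compiled_with_cuda())
```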
* OP优化
* 支持用户在框架外部、脱离框架自定义C++/CUDA OP。
* 新增OP
* 新增eye_op,用于构建单位矩阵,或一批单位矩阵。
* 新增gather_nd_op,gather_op的高维推广,用于将输入数据中的切片,收集到由索引指定的形状的张量中。
* 新增scatter_nd_op,scatter_op的高维推广,这个操作与scatter_nd_add_op类似,除了相加的张量是通过零初始化的。相应地,scatter_nd(index, updates, shape) 等价于 scatter_nd_add(fluid.layers.zeros(shape, updates.dtype), index, updates)。 用于根据索引indices将更新数据updates散布到新的(初始为零)张量中。
* 新增scatter_nd_add_op:通过对Variable中的单个值或切片应用稀疏加法,从而得到输出的Variable。
* 新增center_loss:用以辅助Softmax Loss进行人脸的训练,利用softmax loss来分开不同类别,利用center loss来压缩同一类别。center loss意思为:为每一个类别提供一个类别中心,最小化mini-batch中每个样本与对应类别中心的距离,从而达到缩小类内距离的目的。
* 新增Lookahead Optimizer:针对Paddle不支持Lookahead优化算法这一问题,我们新增了这一优化算法。它的核心原理是:维护两个参数,快参数正常做前向反向运算,当快参数更新k次后,用它来更新慢参数,使二者同步。它的效果是在某些模型上能收敛更快。
* 新增InstanceNorm op 实例归一化:根据每个样本的每个通道的均值和方差做归一化,一般用在图像生成模型中,把一个样本的风格迁移到另一个样本中。
* 新增PreciseRoiPooling :PrROI Pooling采用积分方式计算每个pool区域的值,这种计算方式将区域中的插值看作是连续的,计算所有插值点求积分得到该区域所包围点的总和,最后除以pool区域面积就得到该区域的值,因此结果更加准确。
* 新增hard_swish_op:hard_swish激活函数,在MobileNetV3架构中被提出,相较于swish激活函数,具有数值稳定性好,计算速度快等优点。
* 新增mse_loss_op:均方损失函数,用于计算两个输入间的均方差。
* 新增elementwise_mod的float/double kernel。
* 新增strided_slice op 。
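以 gather_nd / scatter_nd_add 为例,给出一个最小示意(仅为草图,输入数据为假设):

```python
import numpy as np
import paddle.fluid as fluid

x = fluid.layers.assign(np.arange(6).reshape(2, 3).astype('float32'))
index = fluid.layers.assign(np.array([[0, 2], [1, 0]], dtype='int32'))

# gather_nd:按索引收集输入中的切片
gathered = fluid.layers.gather_nd(x, index)

# scatter_nd_add:将 updates 按索引稀疏地加到原张量上
updates = fluid.layers.assign(np.array([10.0, 20.0], dtype='float32'))
added = fluid.layers.scatter_nd_add(x, index, updates)

exe = fluid.Executor(fluid.CPUPlace())
print(exe.run(fetch_list=[gathered, added]))
```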
* MKLDNN kernel更新:
* 新增Leaky_relu的MKL-DNN kernel 和 conv + activation fusion pass。
* 支持不同axis的softmax MKL-DNN kernel。
* 重构5个op (conv, pooling, batch_norm, softmax,LRN)的FP32 MKL-DNN kernel代码,增强代码可维护性和可读性。
* OP功能优化升级
* 部分op参数升级支持tensor及包含tensor的list,支持常数对应维度的推断
* slice op 涉及参数starts 和ends。
* reshape op 涉及参数shape。
* expand op 涉及参数expand_times。
* pow op 涉及参数factor。
* fill_constant op 涉及参数 shape ,并将calc_gradient接口中使用的fill_constant_batch_size_like替换为fill_constant。
* uniform_random op 涉及参数shape, 支持tensor及包含tensor的list。
* image_resize、resize_nearest、resize_bilinear、resize_trilinear支持out_shape为tensor或者包含tensor的list,支持常数对应维度的推断,scale 参数支持tensor。
* 新增crop_tensor,支持shape参数为tensor或者包含tensor的list,支持常数对应维度的推断。
* 优化部分op输入tensor的维度检查
* 移除huber_loss 、rank_loss和cross_entropy op中输入shape的最后一维强制为1的限制,输出loss的shape与label保持一致。
* 新增fluid.one_hot和fluid.embedding op,移除input参数shape最后一维为1的限制。
* 优化sequence_pad和sequence_unpad op中length的shape,由[n,1]简化为[n]。
* 部分op升级支持channel_last格式输入
* conv2d、conv3d、pool2d、pool3d新增data_format参数,支持channel_last格式输入。
* conv2d_transpose、conv3d_transpose新增data_format参数,支持channel_last格式输入。
* image_resize、resize_nearest、resize_bilinear、resize_trilinear新增data_format参数,支持channel_last格式输入。
* group_norm支持channel_last格式输入。
* 涉及padding操作的OP,支持非对称padding,以及SAME和VALID 两种padding方式
* conv2d、conv3d、pool2d、pool3d支持上述padding方式。
* conv2d_transpose、conv3d_transpose支持上述padding方式。
* 对以下op进行inplace显存优化支持
* elementwise_add_grad_grad, elementwise_sub_grad_grad, elementwise_mul_grad_grad, elementwise_div_grad_grad, relu_grad_grad, leaky_relu_grad_grad, sqrt_grad_grad, square_grad_grad。针对GAN模型梯度惩罚显存占用较高的问题,为二重反向op添加inplace,优化其显存占用。
* 升级部分仅支持LoDTensor输入的OP兼容padding模式,包括linear_crf_op, crf_decoding_op, hash_op, edit_distance_op, chunk_eval_op, warpctc_op, ctc_align_op, row_conv_op。
* Intel N-Graph集成
* 增加了ngraph_subgraph_pass对训练的支持,通过build strategy激活N-Graph提供对parallel executor的支持。
* 修正N-Graph对多线程问题,提供对多线程预测的支持。
* 动态图
* 性能优化
* 对动态图底层执行机制进行了重构,在大部分模型上有30%左右的速度提升 ,显存开销有2%左右下降。
* 功能完善
* 支持基于stop_gradient设置的自动剪枝功能和detach接口,满足冻结部分子网的需求。
* 支持模型在不同设备上执行data_transform, 可以使用less_than/greater_than等功能。
* 重新实现op(unsqueezed_op、unstack_op、flatten_op、fill_constant_op)等,使之能够支持动态图。
* 易用性提升
* 针对部分动态图不支持的接口提供了优化的报错 (包括Variable相关接口和Optimizer相关接口)。
* 针对Layer中的参数提供了可供访问的接口。
* 优化动态图save load接口,旧的dygraph下面的 save_persistables 删除。
* 支持了Layer call()可以使用关键字传入,使得前向执行时可以自定义传入的参数。
预测部署
########
* 服务器云端预测库
* 接口优化
* 增加预测C API。
* 针对设置环境变量GLOG_v=4可以打印出预测过程中包含模型op及op fuse的详细log会暴露较多信息,为AnalysisConfig添加DisableGlogInfo()接口(当前仅支持全局最多调用一次),方便使用者关闭GLOG输出,避免模型结构泄漏。
* 针对用户在使用C++预测库时不易获得模型描述中的输入shape的问题,为AnalysisPredictor添加GetInputTensorShape()接口,方便用户在运行预测引擎之前从模型中拿到输入shape,以避免输入错误的shape。
* 功能优化
* 在模型中添加了模型版本号及算子兼容性信息。在此版本之后,旧版本模型在新版本 Paddle 库上使用 AnalysisPredictor 执行预测时会进行兼容性检查。
* CPU INT8量化预测支持持续加强:支持mobilenet-ssd的训练后量化,精度下降在1%内,在第二代至强可扩展处理器6271上性能提升3倍;新增Mul op的INT8 MKL-DNN kernel。
* 性能优化
* 优化了MobileNetv2, ShuffleNet, EfficientNet 在CUDA GPU下的预测速度,MobileNetv2 从 5.3ms 减至 1.9ms,ShuffleNetv2 从 6.3ms 减至 1.4ms,EfficientNet 从 60ms 减至 32ms。
* 实现一个简化Graph中基础op的Pass,预测时,upscale_in_train类型的dropout op直接移除,downgrade_in_infer类型的dropout op使用scale op代替。该优化使ERNIE模型在P40上的预测速度提升1.8%。
* 实现一个cudnn_placement_pass,将Graph中所有op的use_cudnn设置成true。该优化使ERNIE模型在P40上的预测速度提升10%。
* 实现fc op的GPU Kernel,并支持将激活操作融合到fc op中。该优化使ERNIE模型在P40上的预测速度提升2.1%。
* 实现融合fc+elementwise_add+layer_norm操作的Pass和GPU Kernel。该优化使ERNIE模型在P40上的预测速度提升4%。
* 实现了multihead matmul 融合算法的相关PASS和Kernel。该优化使Ernie模型在P4 GPU上的速度提升超过30%。
* 优化QAT(训练中量化)训练出来的模型在CPU INT8 kernel上执行的速度。通过PASS对训练出的QAT模型进行修改,结合训练后优化的PASS,使QAT训练出的模型可以在MobilenetV1, MobilenetV2, ResNet50,VGG16上精度变化(相比于FP32模拟量化)在0.1%内,ResNet101和VGG19精度变化在0.3%内;相比于原始未优化的QAT模型,这6个模型在第二代至强可扩展处理器6271上可获得4-9倍的性能提升。
* 问题修复
* 针对之前AnalysisPredictor中设置FLAGS_profile无效的问题,为AnalysisConfig添加EnableProfile()接口,现在用户可以调用该接口开启预测的profiler,而无需设置FLAG。
* 对ZeroCopyTensor的copy_from_cpu、mutable_data等方法添加了uint8模板支持,目前ZeroCopyRun已经可以正确地接收uint8输入进行预测。
* 针对Paddle-TRT在包含多个op共享同一参数的模型如retinanet、faster_rcnn、cascade_rcnn中出现的重复设定weight、过早删除参数等bug进行了修复,Paddle-TRT已可以支持上述模型。
* 移动、嵌入式端侧预测库
* 发布PaddleLite,定位高性能、多平台、轻量化的端侧预测引擎,并可作为服务器端飞桨原生预测库的加速库。具体见https://github.com/PaddlePaddle/Paddle-Lite
* Paddle Serving
* 新增支持超大规模分布式预估服务能力
* 发布了来源于百度内部经过海量数据检验的高性能分布式版本kv存储器组件cube,提供稀疏参数的分布式存储和查找,在高并发条件下单位时间吞吐总量是redis的13倍,是单机版kv存储器rocksDB的6倍。
* 发布了Elastic CTR解决方案:针对超大规模稀疏参数的CTR任务,提供了基于k8s集群的分布式训练以及serving分布式参数部署预测的流程文档,并提供了一键式的解决方案。
* PaddleServing编译速度提升
* 预测接口的编译依赖由paddle源码改为paddle inference lib,编译速度提升6倍。
* PaddleServing易用性提升
* 支持Python client
* PaddleSlim
* 添加基于硬件的小模型结构搜索功能。
* 对量化训练、蒸馏和通道裁剪三种策略扩充分类模型示例,添加检测模型示例。
* 新增部分量化功能的支持,目前用户可选择对同一类型的op仅部分进行量化。
* 新增对pool2d、elementwise_add等op的量化训练支持。
分布式训练
############
* 性能优化
* 新增LocalSGD多机训练算法:针对GPU多机多卡同步训练过程中存在trainer速度不一致(随机)导致同步等待问题,设计了局部异步训练策略,通过多步异步训练(无通信阻塞)实现慢trainer时间均摊,从而提升同步训练性能。在4机32块V100 GPU卡的配置下,在Resnet50 Imagenet分类任务上,测试集top5准确率达到93%的情况下,训练吞吐提升8.16%。模型链接: https://github.com/PaddlePaddle/Fleet/tree/develop/examples/local_sgd/resnet 。
* 新增GEO-SGD分布式CPU多线程全异步训练算法:通过训练节点维护独立参数且局部多轮更新,同时全局参数增量更新,大幅降低了训练中的通信占比。在文本匹配Simnet_bow模型上,GEO-SGD相比飞桨1.5全异步模式,在25节点12线程下,训练速度提升2.65倍,保持效果对齐。在Word2Vec模型上,GEO-SGD相比飞桨1.5全异步模式,在4、8、16、32节点16线程下,训练速度分别提升3.79倍、3.92倍、4.69倍、6.88倍,效果保持对齐。
* Fast Resnet:采用可变图像大小、可变batch size和矩形验证图像等策略,显著提升Resnet50模型在ImageNet数据集的训练速度。在4机32块V100 GPU卡的配置下,top5准确率达到93%的时间缩短至35分钟,收敛速度提升2.21倍。在8机64块V100 GPU卡的配置下,top5准确率达到93%的时间缩短至27分钟。模型链接:https://github.com/PaddlePaddle/Fleet/tree/develop/examples/fast_imagenet 。
* 新增超大Batch训练优化器RecomputeOptimizer。在内存固定的情况下,Recompute优化器可以显著提高模型可以运行的batch size,提升为原来的 17%-309%;训练效果是无损的,收敛趋势一致,但实际吞吐会有一定损失。
* 新增Collective Op:all_reduce_op、broadcast_op、all_gather_op、reduce_scatter_op,支持在组网中实现进程通信。
* 容错
* CPU全异步训练模式加入训练节点心跳检查,及时发现异常节点。
* 加入retry机制,修复rpc errorcode 14的错误。
* 部署
* Paddle-K8S-Operator新增支持Volcano Job的提交,支持CPU分布式训练。
模型建设(PaddlePaddle/models)
##############################
* 易用性优化
* 全面优化了PaddleNLP和PaddleCV主要模型(Transformer,BERT,DMTK,PaddleDetection,PaddleGAN,PaddleVideo,ImageClassification)的安装、自定义数据以及对windows平台的支持等功能和体验。
* PaddleNLP
* 发布文本生成库Seq2seq
* 开源多个文本生成模型,包括vanilla seq2seq,seq2seq with memory network,variational seq2seq。
* 升级阅读理解库
* 开源EMNLP2019阅读理解竞赛百度夺冠模型D-Net和相关预训练模型,兼容MRQA2019开放的18个抽取式阅读理解公开数据集的并行训练、高性能评估以及搭建阅读理解serving的相关工作。
* 升级语义表示库
* 新增语义表示模型XLNet。
* 发布开放多任务学习库PALM
* 开源MRQA2019比赛百度夺冠使用的多任务学习框架PALM,只需要几十行代码就可以完成基于ERNIE、BERT等预训练模型的硬共享、层次共享等多任务学习算法。
* PaddleCV
* 发布图像分割库 PaddleSeg:具备丰富数据增强、模块化设计、高性能和端到端部署四大特点。
* 模型
* 新增DeeplabV3+/UNet/PSPNet/ICNet四种网络支持,对应预训练模型共18个。
* 新增车道线分割、人像分割、人体部件分割三个预测模型。
* 功能
* 支持softmax loss、bce loss、dice loss以及损失函数组合配置。
* 支持翻转、旋转、多尺度变换、模糊、色彩饱和度调整等十余种数据增强策略。
* 支持数据检查、边训边评估、模型导出、自动可视化、调参模式等易用性功能。
* 支持FP16混合精度训练以及动态Loss Scaling。
* 支持多进程训练与数据预处理。
* 端到端部署
* 提供多平台(Windows/Linux)的C++高性能预测库编译、开发和部署。
* 基于Paddle Serving提供高性能图像分割服务化部署能力。
* 升级检测库 PaddleDetection
* 新增2019 Objects365 Full Track比赛夺冠模型;新增DeformableConv系列模型;新增VGG-SSD系列模型;新增Cascade+Mask+FPN模型;新增更多基于COCO的两阶段模型;新增行人检测和车辆检测预训练模型;新增人脸检测模型Faceboxes和BlazeFace系列模型,并发布改进版的轻量级模型。
* 功能
* 支持multi-scale的训练、multi-scale测试,支持group norm等。支持FP16训练。增加C++预测部署能力,支持Windows和Linux系统。
* 增加模型压缩量化和剪枝示例。
* 增加中文文档,增加基于小数据的快速开始、迁移学习、模型导出、预测部署等文档,增加预测benchmark文档。
* 完善图像分类模型
* 发布9个EfficientNet预训练模型:EfficientNet-b0,EfficientNet-b1,EfficientNet-b2,EfficientNet-b3,EfficientNet-b4,EfficientNet-b5,EfficientNet-b6,EfficientNet-b7,EfficientNet-small。精度与论文持平。
* 持续新增34个预训练模型:DarkNet53, DenseNet121,Densenet161, DenseNet169, DenseNet201, DenseNet264, SqueezeNet1_0, SqueezeNet1_1, ResNeXt50_vd_32x4d, ResNeXt152_64x4d, ResNeXt101_32x8d_wsl, ResNeXt101_32x16d_wsl, ResNeXt101_32x32d_wsl, ResNeXt101_32x48d_wsl, Fix_ResNeXt101_32x48d_wsl,ResNet18_vd,ResNet34_vd,MobileNetV1_x0_25,MobileNetV1_x0_5,MobileNetV1_x0_75,MobileNetV2_x0_75,MobileNetV3_small_x1_0,DPN68,DPN92,DPN98,DPN107,DPN131,ResNeXt101_vd_32x4d,ResNeXt152_vd_64x4d,Xception65,Xception71,Xception41_deeplab,Xception65_deeplab,SE_ResNet50_vd。
* 升级PaddleVideo
* 新增动作定位模型: BMN和BSN,其中BMN模型是ActivityNet2019比赛的冠军。
* 新增VideoGrounding方向的BaseLine模型:TALL。
* 新增VideoCaption方向的BaseLine模型:ETS。
* 升级PaddleGAN
* 新增SPADE模型。
* 替换InstanceNorm实现,STGAN上判别器速度提升12%左右。
* PaddleSpeech
* 升级语音识别模型 DeepSpeech 至飞桨最新版本。
* 开源语音合成模型 DeepVoice3 。
* PaddleRec
* 新增支持分布式训练的DeepFM、XDeepFM、DeepCrossNetwork。
工具组件
#########
* PaddleHub
* 新增超参优化Auto Fine-tune功能,实现给定超参搜索空间,自动给出较佳的超参组合。
* 支持两种超参优化算法:基于贝叶斯优化的HAZero和哈密尔顿系统的PSHE2。
* 支持两种评估方式:Full-Trail和Population-Based。
* 预训练模型丰富
* 升级ERNIE 1.0中文模型,提升模型在长文本情况下的效果(max_seq_len=512)。
* 升级LAC模型至v2.0.0,保持效果的同时精简模型结构,提升预测速度。
* 新增ERNIE 2.0 英文预训练模型。
* 新增Ultra-Light-Fast-Generic-Face-Detector-1MB人脸检测模型。
* 新增人体部件分割ACE2P模型。
* 新增基于DeepLabv3+的人像分割模型HumanSeg。
* 新增图像生成模型STGAN、AttGAN、StarGAN。
* Fine-tune API升级,灵活性与易用性提升
* 新增阅读理解Fine-tune任务。
* 新增多指标评估功能。
* 优化predict接口,提升预测性能。
* 新增优化策略ULMFiT,包括以下三种配置
* Slanted triangular learning rates:斜三角形学习率微调。
* Discriminative fine-tuning:支持计算图按拓扑序分层采用不同学习率微调。
* Gradual unfreezing:根据计算图的拓扑结构逐层参数解冻。
* PGL 图学习框架
* 对应发布飞桨图学习框架PGL v1.0正式版。
* 易用性:新增异构图的Metapath采样与Message Passing消息传递双机制,支持包含多种类型节点和边特征的异构图建模,新增Metapath2vec、GATNE等异构图算法。同时,文档、API、Tutorial等材料也进一步完善。
* 规模性:新增分布式图引擎和分布式Embedding,可支持十亿节点百亿边的超巨图的多种分布式训练模式。新增distributed deepwalk和distributed graphSage两个分布式样例。
* 丰富性:新增8个、累计13个图学习模型,涵盖了图神经网络和图表征学习的主流模型。新增的8个模型分别是LINE、struc2vec、metapath2vec、GES、GATNE、SGC、Unsup-GraphSage、DGI。
* PARL 深度强化学习框架
* 对应发布飞桨强化学习框架PARL 1.2。
* 更全更完善的并行RL机制,资源调度集群化,进一步降低并行算法实现门槛。
* 支持大规模并行进化算法,可数百个CPU并发搜索(https://github.com/PaddlePaddle/PARL/tree/develop/examples/ES)。
* 上线更加全面的官方PARL文档(https://parl.readthedocs.io/en/latest/)。
* PaddleFL 联邦学习
* 发布飞桨联邦学习框架PaddleFL,方便快捷地支持联邦学习和AI隐私算法研究,并实现了FedAvg算法和基于差分隐私的SGD算法,支持分布式安全共享学习算法调研。https://github.com/PaddlePaddle/PaddleFL
* Paddle2ONNX
* 对应升级paddle2onnx至0.2版本。
* 新增pip安装方式。
* 适配飞桨 v1.6的算子和ONNX v1.5版本。
* 新增精度对齐框架,提供新增代码和模型转换的正确性验证功能。
* 支持ResNet、DenseNet等10个Paddle图像分类模型的转换。
* 支持SSD_MobileNet、YoloV3_DarkNet5等4个Paddle目标检测模型的转换。
* X2Paddle
* 对应升级x2paddle至0.5版本。
* 新增pip安装方式。
* 新增统一的caffe、tensorflow和onnx模型计算图中间表示。
* 支持caffe多分支模型的转换。
* 大幅提升主流框架的模型转换能力,支持44个tensorflow OP,33个caffe Layer和48个onnx OP。
* 为Paddle Lite提供多框架模型部署能力,支持包括图像分类、目标检测和语义分割在内共18个模型的无损转换。
BUG修复
##########
* 修复 rnn_search 模型无法跑起来的bug。
* 修复 save_inference_model 在 prune recurrent_op 时的 bug(该 bug 会导致一些 RNN 模型在 save inference model 后 load 预测出错)。
* 修复了动态图中多个Layer中act和bias等参数不生效的问题(其中包括:BilinearTensorProduct、GRUUnit、Conv2DTranspose、LayerNorm、NCE)、优化器保存的bug、python端内存泄漏的问题、部分参数minimize段错误的问题,以及使用python中has_attr失效的问题。
* 修复FC mkldnn pass在AVX2机器上的精度diff问题。
* 升级MKL-DNN到0.20,并提升MKL-DNN单测覆盖率到90%以上。
* 修复MKL-DNN训练后量化convolution和dequant op的squash问题。
代码重构和升级
#########
* 清理了6个废弃的第三方库recordio,snappystream,snappy,jemalloc,anakin,gzstream。
Release Notes
==============
## Important Updates
In this version, the authors focus on enhancing the framework function level, the forecast deployment capability is fully improved, the distributed release PLSC supports the super-large-scale classification, and the parameter server mode is optimized and integrated. The compilation options, the compilation dependence, and the code library are fully cleaned up and optimized. The model library is continuously improved, the overall hierarchy is optimized, and the implementation of the dynamic graph model is added. The end-to-end development kits and utility components are further perfected.
**Training Framework**: An AMP interface and a new control flow interface are added. The tensor usage method and the GPU memory allocation strategy are optimized. A library that supports the Nvidia DALI GPU data preprocessing is added. The function and performance of the basic OP are continually optimized. The function of the dynamic graph is further perfected and the performance is greatly improved. A function that converts the data independent dynamic graph model into the static graph predictable deployment model is provided. The framework debugging analysis function and the ease of use are fully enhanced.
**Forecast Deployment**: The Python API of the server-side forecast library is significantly optimized. A usage method and example of the R language and Go language call forecast library are added. The quantification support capability is strengthened. Paddle Lite supports a model generated by the post-training quantification method without calibration data. Tailoring, quantification, distillation, and search interfaces are reconstructed for the model compression library PaddleSlim. A large-scale scalable knowledge distillation framework Pantheon is added to fully connect to the model library.
**Distributed Aspect**: In parameter server mode, the back-end implementation is united into the communicator and the front-end interface is united into the fleet for the synchronous, semi-asynchronous, and fully asynchronous modes of the transpiler. Different modes are flexibly selected using the fleet strategy. A large-scale classification library PLSC is released and the classification tasks of a great many classes are supported using model parallel.
**Basic Model Library**: A speech synthesis library Parakeet is released, including several leading-edge synthesis algorithms. 14 image classification pre-training models are added in PaddleCV. The 3D and tracking direction model continues to be enriched. The participle and part-of-speech tagging model of PaddleNLP supports a jieba participle. A multi-task model MMoE is added in PaddleRec. Extensive dynamic graph model implementations are added in the model library as a whole. The overall hierarchy of the model library is adjusted and optimized.
**End-to-End Development Kits**: A large number of model implementations and pre-training models are added in PaddleDetection and PaddleSeg. The training speed and accuracy of typical models are enhanced. The model compression and deployment capabilities are significantly improved. The user experience is fully optimized. A recommended sorting system ElasticRec is released. Deployment is performed via K8S. Streaming training and online forecast services are supported.
**Utility Components**: 52 pre-training models are added in PaddleHub, with a total of more than 100. The function and experience are continuously optimized. The kernel of the multi-task learning framework PALM is upgraded. The API call is open. More task types are supported. An open dataset is added in the federated learning PaddleFL.
## Training Framework
- API
- An AMP interface is added: A network can be converted into mixed accuracy training in a general way while the accuracy fluctuation is ensured to be within the normal range.
- A new control flow interface is added and recommended: Four control flow Ops including while\_loop (loop control function), cond (conditional branch function), case, and switch\_case (branch control function) are added for the ease of use and the following new functions are supported:
- Python callable is used as a control condition or executive.
- Different branches in the control flow use different losses or optimizers.
- Conditions in the control flow partially use CPU or GPU data.
- Parameters of some APIs support the use of a variable list: Support for a variable list is added according to the case that the parameter\_list or no\_grad\_set parameter of some APIs supports only the use of a string list. It is no longer necessary to obtain the name attribute of related variables in advance when using the following APIs:
- fluid.backward.append\_backward(loss, parameter\_list=None, no\_grad\_set=None, callbacks=None)
- fluid.backward.gradients(targets, inputs, target\_gradients=None, no\_grad\_set=None)
- The minimize methods of various optimizers, such as Adam’s minimize: minimize(loss, startup\_program=None, parameter\_list=None, no\_grad\_set=None, grad\_clip=None)
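A minimal sketch of the new control flow interfaces mentioned above (illustrative only; the variables and values are assumed, see the API documentation of the corresponding version for exact signatures):

```python
import paddle.fluid as fluid
import paddle.fluid.layers as layers

x = layers.fill_constant(shape=[1], dtype='float32', value=0.3)
y = layers.fill_constant(shape=[1], dtype='float32', value=0.2)

# cond: conditional branch, true_fn / false_fn are python callables
out = layers.cond(layers.less_than(x, y),
                  true_fn=lambda: layers.elementwise_add(x, y),
                  false_fn=lambda: layers.elementwise_sub(x, y))

# while_loop: loop control, loop_vars are updated on every iteration
i = layers.fill_constant(shape=[1], dtype='int64', value=0)
limit = layers.fill_constant(shape=[1], dtype='int64', value=10)
res = layers.while_loop(cond=lambda i: layers.less_than(i, limit),
                        body=lambda i: [layers.increment(i)],
                        loop_vars=[i])

exe = fluid.Executor(fluid.CPUPlace())
print(exe.run(fetch_list=[out, res[0]]))
```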
- Basic Function Optimization
- The float16 type of numpy can be used to set Tensor data directly, without the necessity of conversion into the uint16 type first.
- The minus sign is directly used to get the opposite number of Tensor.
- GPU memory Allocation Strategy:
- The default policy is changed to AutoGrowth: The GPU memory is applied for as needed without affecting the training speed. This avoids the problem that it is difficult to restart a new task on the same GPU card under the previous default GPU memory pre-allocation strategy.
- GPU memory allocation adjustment for multi-card tasks: The GPU memory allocators on different GPU cards are set to the Lazy initialization mode. If a user does not use a card, no GPU memory will be applied for on this card. This avoids the GPU memory OOM problem caused by running tasks on idle GPU cards without setting CUDA\_VISIBLE\_DEVICES when any GPU memory is occupied on other GPU cards.
- OP Function Upgrade
- elu: This activation function supports the calculation of second-order gradients.
- Prroi\_pool: The rois parameter may accept the Tensor or LoDTensor type.
- Conv2d, pool2d, batch\_norm, LRN: All reverse calculations support the use of the MKL-DNN high-performance calculation library.
- argsort: The descending sort is supported (A descending parameter is added. The default is False).
- Basic Performance Optimization
- DALI Preprocessing Acceleration
- The support for the Nvidia DALI GPU data preprocessing library is added, which can be used to accelerate the preprocessing of data such as images, videos, and speeches.
- Automatic Mixed Precision Training Optimization
- With the implementation of the following optimization strategy as well as DALI data preprocessing, the training throughput of the ResNet50 model is increased substantially: The mixed accuracy training throughput of a single V100 card is increased to 1,000+ images/s from 600+ images/s. The throughput of 8 cards for a single machine is 7,840 image/s. The throughput of 32 cards for 4 machines is 28,594 images/s.
- The support of batch\_norm, conv2d, and other ops for NHWC data layout input is enhanced to accelerate fp16 calculation using Tensor Core technology.
- Some op patterns in the model such as batch\_norm and relu are fused based on the IR Pass mechanism.
- The kernel of elementwise (add, mul) and other ops is optimized.
- RecomputeOptimizer is optimized to improve the batchsize. In the bert-large model, the maximum batchsize is increased by 533.62% compared with that without using RecomputeOptimizer, doubling the maximum batchsize of the previous version.
- OP Performance Optimization
- The fusion operator fuse\_emb\_seq\_pool of embedding and sequence\_pool is implemented and murmurhash3\_x64\_128 in bloom\_filter is optimized. The training speed of some NLP models is effectively improved.
- The GPU performance of mean op is optimized. When the input data is a 32\*32\*8\*8 Tensor, the forward calculation speed is increased by 2.7 times.
- The assign and lod\_reset ops are optimized to avoid unwanted GPU memory copy and data transform.
- The kernel implementation of stack OP is optimized. The performance of a single card of GPU in the XLnet/Ernie model is improved by 4.1%.
- Dynamic Graph
- Function Optimization
- The name\_scope parameter in the dynamic graph Layers is removed to make it easier for users to inherit and call.
- The block parameter in the to\_variable interface is removed to simplify the use of the API.
- As for the problem that model parameters depend on data, the build\_once design is removed so that Layers can get all the parameter tables at the end of `__init__` execution, which is convenient for save/load, parameter initialization, parameter debugging, and parameter optimization.
- Automatic pruning is improved to facilitate user networking and reduce the reverse calculation amount.
- The SelectedRows operation is supported so that the Embedding layer supports sparse update of a single card.
- As for the problem that the framework lacks containers, ParameterList, LayerList, and Sequential functions are added to facilitate user networking.
- Named\_sublayers and named\_parameters functions are supported to facilitate user programming.
- The Linear lr warmup decay strategy is supported.
- Performance Optimization
- The interaction of python with c++, GradMaker, OperatorBase, and allocator is optimized. For the LSTM-based language model task on the P40 machine, the performance is improved by 270%.
- Redundant codes are removed for performance problems caused by calling dead codes of optimized\_guard in optimize for many times. For the Transformer model (batch\_size=64) on the P40 machine, the performance of optimizers such as SGD and Adam is improved by 5% to 8%.
- For the performance impact caused by adding scale\_op extra to update the beta parameter in AdamOptimizer, the beta updating logic is fused into adam\_op to reduce the call overhead of the op kernel. For the Dialogue-PLATO model on the P40 machine, the performance is improved by 9.67%.
- The asynchronous DataLoader of the dynamic graph is optimized. The overall training speed is improved by about 30% in the Mnist, ResNet, and other models.
- The numpy bridge function is added. Sharing the underlying data between Tensor and ndarray in CPU mode is supported to avoid the problem of needing to copy a numpy input when creating variables, and to improve efficiency.
- GPU memory optimization: Optimization strategy of deleting in advance the forward variable space that does not require Tensor Buffer in reverse. The maximum batch size is increased by more than 20%-30% in the ResNet and other models.
- Dynamic Graph Deployment
- The TracedLayer interface is supported. The conversion of the dynamic graph model into the static graph predictable deployment model is implemented.
- Debugging Analysis
- Error message optimization: Framework error messages are classified as a whole to achieve the systematization of error messages. Copywriting optimization is finished to help users locate and solve problems more quickly and accurately.
- Optimization of the Performance Analysis Profile Function
- The function and accuracy of the profiler is enhanced. Profile options at different levels are supported. The call relation of events can be recorded in the profile data and printed.
- The nan inf check and debugging are optimized (effective through FLAGS\_check\_nan\_inf) and the performance, function, and output information are all greatly improved:
- In terms of speed, the v100 test ResNet50 model has a performance improvement of about 1000 times compared with the original utility components, and maintains an over 80% efficiency for normal training.
- In terms of function, the support for fp16 is added and environment variables can be set to skip the inspection of op, op\_role, and op\_var to facilitate the debugging of the fp16 model.
- The output information is detailed and accurate. Besides wrong op and tensor names, the quantity of wrong nan, inf, and normal numerical values are printed to facilitate debugging.
- A lightweight installation package paddlepaddle-tiny for CPU training and forecast is released and the Windows/Linux/Mac operating systems and python27/python35/python36/python37 are supported:
- The following options are compiled: no avx, no ml, no gpu, no unittest
- The slim and some datasets are pruned off.
- The Linux package size is reduced to 37 M from 90 M. The Windows package size is reduced to 9.6 M from 50.8 M. The MAC package size is reduced to 19.8 M from 59 M.
- The number of installation requirement dependencies are reduced to 7 from 15.
## Forecast Deployment
- Server-side Forecast Library
- Python API
- The read and write model from the memory is supported to meet the model encryption requirements.
- The Scale operator is no longer added at the end of the forecast model.
- The support for ZeroCopy forecast is added. The interface is basically the same as the C++ interface and supports numpy.ndarray as input and output. It is easier to use on the Python side.
- Multiple interfaces are added in AnalysisConfig to completely cover the C++ forecast functions, including removing pass and disabling forecast glog.
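A minimal sketch of the Python ZeroCopy forecast workflow described above (illustrative only; the model directory and input shape are assumed):

```python
import numpy as np
from paddle.fluid.core import AnalysisConfig, create_paddle_predictor

config = AnalysisConfig('./mobilenet_model')   # assumed directory containing __model__ and param files
config.disable_glog_info()                     # disable forecast glog output
config.switch_use_feed_fetch_ops(False)        # required for ZeroCopy run

predictor = create_paddle_predictor(config)

input_name = predictor.get_input_names()[0]
input_tensor = predictor.get_input_tensor(input_name)
input_tensor.copy_from_cpu(np.random.random([1, 3, 224, 224]).astype('float32'))

predictor.zero_copy_run()

output_name = predictor.get_output_names()[0]
result = predictor.get_output_tensor(output_name).copy_to_cpu()   # numpy.ndarray
```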
- Support for Other Programming Languages
- The usage method and example of the R language and Go language call forecast library are added.
- The corresponding header file of ProtoBuf is provided to external users to facilitate users to analyze the requirements for the model structure.
- For a forecast library with TRT compilation, a TensorRT library is not provided from third\_party any more and needs to be downloaded by users at https://developer.nvidia.com/tensorrt.
- Function Enhancement:
- Access to Paddle Lite as a subgraph is achieved and ResNet50 has been verified.
- The support for MKL-DNN FC INT8 kernel is added.
- Paddle-TensorRT supports the Ernie model. For the Ernie model (seq length = 128) on the T4 card, the fp16 forecast speed is 3.6 ms, which is faster than the fp32 forecast speed by 37%.
- Quantification: Under the 2% improvement of the ERNIE INT8 accuracy compared with the FP32 accuracy, the single-threaded performance and the multi-threaded performance are improved by 2.79 times and 1.79 times for ERNIE INT8 on the second-generation Xeon scalable platform 6271 respectively.
- Mobile/Embedded End-side Paddle Lite (https://github.com/PaddlePaddle/Paddle-Lite)
- Version v2.3 is released.
- Multiple functions of Model\_optimize\_tool are upgraded.
- “The post-training quantification method without calibration data” is supported. The model storage space is reduced (by 2 to 4 times).
- OpenCL: The migration of 30 Image2D kernels is finished and 14 ops are covered.
- The support for FPGA and NPU is further strengthened. The forecast of Kunlun XPU is supported.
- A new official website document is released. A "post-training quantification method without calibration data" usage document is added.
- Paddle Serving (https://github.com/PaddlePaddle/Serving):
- The forecast service of remote text vector representation of the bert-type semantic understanding model is released.
- A paddle-gpu-serving WHL package is released. The forecast service can be deployed and used through pip installation and Python codes.
- 13 semantic understanding models in Paddlehub are supported. The single-machine multi-card mode is supported. The forecast speed is 869.56 samples/s when the average sample length is 7 under a single P4 GPU using the Ernie\_tiny model.
- PaddleSlim (https://github.com/PaddlePaddle/PaddleSlim):
- PaddleSlim is split into an independent repo.
- The tailoring, quantification, distillation and search interfaces are reconstructed. The underlying interfaces are open to users.
- Quantification:
- An offline quantification function based on KL divergence is added. The quantification of the Embedding layer is supported.
- The QAT MKL-DNN quantification strategy support for FC is added.
- PostTrainingQuantization is added to fully implement the post-training quantification function: the quantization of 30 kinds of ops is supported, the ops to be quantized can be set flexibly, and quantized models are generated in a unified format. It has the advantages of short time consumption, ease of use, and small precision loss. A hedged usage sketch follows this quantification list.
- Quantitative training supports setting the type of OP to be quantified.
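The sketch below assumes the quant\_post entry point from the PaddleSlim repo as a wrapper around the post-training quantification described above; the model paths, the calibration reader, and the exact argument names are placeholders and may differ between versions.

```python
import numpy as np
import paddle.fluid as fluid
import paddleslim

def sample_generator():
    # Placeholder calibration reader: yield samples shaped like the model input.
    for _ in range(32):
        yield [np.random.rand(3, 224, 224).astype("float32")]

exe = fluid.Executor(fluid.CPUPlace())

# Post-training quantization: no retraining, only a small calibration pass
# over the exported FP32 inference model.
paddleslim.quant.quant_post(
    executor=exe,
    model_dir="./fp32_infer_model",
    quantize_model_path="./quant_model",
    sample_generator=sample_generator,
    batch_size=16,
    batch_nums=2)
```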
- Tailoring: The tailoring implementation is reconstructed to support more types of networks.
- Search:
- SA search is supported. More search space is added. User-defined search space is supported.
- A one-shot search algorithm is added. The search speed is 20 times faster than that of the previous version.
- A large-scale scalable knowledge distillation framework Pantheon is added.
- Full decoupling is achieved between student and teacher models and between teacher models. They can independently run on different physical devices respectively to make full use of computing resources.
- The single-node multi-device large-scale forecast of the teacher model is supported. The acceleration ratio is tested to be linear on BERT and other models.
- TCP/IP protocol is used to achieve communication in online distillation mode. Knowledge transmission between teacher and student models running on any two physical devices in the same network environment is supported.
- API interfaces in online and offline distillation modes are unified. Different teacher models may operate in different modes.
- The merging of knowledge and the batch reorganization of knowledge data are completed automatically on the student side to facilitate the knowledge fusion of the multi-teacher model.
- Model Library:
- The compression benchmark of ResNet50 and MobileNet models is released.
- The detection library is connected and the compression benchmark for the YOLOv3 series of models is released.
- The segmentation library is connected and the compression benchmark for the DeepLabv3+ series of segmentation models is released.
- Document Improvement:
- An API document is supplemented. An introductory tutorial and an advanced tutorial are added. A ModelZoo document is added to cover classification, detection, and segmentation tasks. All documents contain Chinese and English.
## Distributed
- Parameter Server Mode:
- The memory usage is greatly reduced during training. On 100 million embedding tasks, the Trainer-side memory can be reduced by 90%.
- The memory usage of distributed saving and loading models is greatly reduced. The Pserver-side peak memory can be reduced to 1/N of the original value, where N is the number of Pserver nodes.
- The geo-sgd dense parameter communication is optimized.
- The distributed AUC index calculation is supported.
- A distributed barrier function is added.
- A deprecation warning is added in the non-Fleet transpiler API. This API is planned to be removed in PaddlePaddle-Fluid 2.0.
- Semi-asynchronous and synchronous modes are added in Communicator.
- The TrainFromDataset training interface supports semi-asynchronous and synchronous modes.
- DistributedStrategy is added in Fleet to further improve the distributed ease of use and integrate the current distributed related flags.
- The Fleet pslib mode supports single-program multi-loss training to optimize the training performance.
- 100 billion sparse mode supports the k8s environment.
- Large-scale classification library PLSC: It supports the large-scale classification problem that data parallel cannot solve due to the limitation of video memory capacity (https://github.com/PaddlePaddle/PLSC).
- Three built-in models ResNet50, ResNet101, and ResNet152 are available and user-defined models are supported. Under the single-machine eight-V100 GPU configuration, the ResNet50 model has a million-class training speed of 2,122.56 images/s, which is 1.3 times faster than that of the standard ResNet50 model.
- A plsc-serving whl package for the model online forecast service is released to forecast the image semantic vector representation of the face recognition model. Making forecasts with a user-trained model is supported. The forecast speed of the ResNet50 model (batch size=256) under a single V100 GPU is 523.47 images/s.
- A pre-training model based on the ResNet50 network and the MS1M-ArcFace dataset is released: https://plsc.bj.bcebos.com/pretrained\_model/resnet50\_distarcface\_ms1mv2.tar.gz.
- The benchmark for ResNet50 mixed precision training (single-card, multi-card, and multi-machine) is released.
## Basic Model Library
(https://github.com/PaddlePaddle/models)
- PaddleNLP
- Seq2seq supports training modes such as RL and GAN.
- A training model for word segmentation and part-of-speech tagging is released. A knowledge distillation framework Pantheon is used, and the F1 value on its own dataset is 1% higher than that of paddleNLP LAC. Jieba word segmentation is incorporated, and the deep learning model mode is enabled by adding a use\_paddle flag (see the sketch after this list). In addition, a paddle version detection and rollback mechanism is added in jieba to ensure user experience.
- Dynamic graph model implementations are added: word2vec, senta, transformer, Bert, seq2seq, LAC.
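As a small illustration of the jieba integration mentioned above, the hedged sketch below assumes a jieba version with Paddle support installed alongside paddlepaddle.

```python
import jieba

# Enable the deep-learning (Paddle) segmentation mode added to jieba; jieba
# checks the installed paddle version and falls back to the classic mode if
# the version is not supported.
jieba.enable_paddle()

words = jieba.cut("飞桨是一个开源深度学习平台", use_paddle=True)
print("/".join(words))
```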
- PaddleSpeech
- Speech synthesis: A synthesis library Parakeet is released.
- A standard workflow for data preprocessing, training, and synthesis of the speech synthesis model is implemented.
- The out-of-the-box pre-processing implementation of typical datasets is provided.
- Commonly-used model components in the speech synthesis field are provided to support the model implementation.
- Speech synthesis models DeepVoice3, ClarinNet, TransformerTTS, FastSpeech, WaveNet, and WaveFlow are released.
- PaddleCV
- Image Classification:
- A total of 14 pre-training models including SENet-vd, Res2Net, and HRNet series of models are added:
- SE\_ResNet18\_vd, SE\_ResNet34\_vd, SE\_ResNeXt50\_vd\_32x4d, ResNeXt152\_vd\_32x4d
- Res2Net50\_26w\_4s, Res2Net50\_14w\_8s, Res2Net50\_vd\_26w\_4s
- HRNet\_W18\_C, HRNet\_W30\_C, HRNet\_W32\_C, HRNet\_W40\_C, HRNet\_W44\_C, HRNet\_W48\_C, HRNet\_W64\_C
- Accelerating data preprocessing with DALI is supported. For ImageNet training, a speedup between 1.5 times (ResNet50) and more than 3 times (ShuffleNet) is obtained and the GPU utilization is greatly improved.
- 3D Direction:
- The models PointNet++ and PointRCNN are released.
- Tracking Model Library:
- The models SiamFC, SiamRPN, SiamMASK, ATOM, and ATP are released.
- Dynamic graph model implementations are added: MobileNet-v1/v2, YOLOv3, FasterRCNN, MaskRCNN, video classification TSM model, and video motion positioning BMN model.
- PaddleRec
- A multi-task model MMoE for the recommendation field is released. It applies to large-scale multi-task joint training in industrial settings.
- Dynamic graph model implementations are added: gru4rec, deepfm.
## End-To-End Development Kits
- PaddleDetection (https://github.com/PaddlePaddle/PaddleDetection)
- The precision of the YOLOv3 model is further improved. The precision for the COCO data reaches 43.2%, an absolute increase of 1.4% compared with the previous version.
- Model implementations and pre-training models are added:
- The best single model CascadeCARCNN-FPN-Dcnv2-Nonlocal ResNet200-vd in the Google AI Open Images 2019-Object Detection competition is added. A pre-training model of this algorithm based on Objects365 data is also released.
- Backbone is added as CBResNet, Res2Net, and HRNet series of pre-training models.
- A LibraRCNN algorithm and a pre-training model are added.
- GIoU, DIoU, and CIoU loss-based pre-training models are added in the FasterRCNN R50 FPN model. Without reducing the forecast speed, the precision for the COCO data is improved by 1.1%, 0.9%, and 1.3% respectively.
- Added Modules:
- Backbone network: CBResNet, Res2Net, and HRNet are added.
- Loss modules: GIoU loss, DIoU loss, and CIoU loss are added. Libra loss and YOLOv3 loss support a fine-grained op combination.
- Postprocessing modules: The softnms and DIOU nms modules are added.
- Regular module: A DropBlock module is added.
- Functional Optimization and Improvement:
- YOLOv3 data preprocessing is accelerated. The overall training speeds up by 40%.
- The data preprocessing logic is optimized.
- The benchmark data for face detection forecast is added.
- Forecast examples under the Paddle forecast library Python API are added.
- Detection Model Compression:
- Tailoring: A MobileNet-YOLOv3 tailoring solution and model are released, with FLOPs -69.6% and mAP +1.4% on the VOC dataset, and FLOPs -28.8% and mAP +0.9% on the COCO dataset. A ResNet50vd-dcn-YOLOv3 tailoring solution and model are released, with FLOPs -18.4% and mAP +0.8% on the COCO dataset.
- Distillation: A MobileNet-YOLOv3 distillation solution and model are released, with mAP + 2.8% for the VOC data and mAP + 2.1% for the COCO data.
- Quantification: YOLOv3-MobileNet and BlazeFace quantitative models are released.
- Tailoring + Distillation: A MobileNet-YOLOv3 tailoring + distillation solution and model are released, with FLOPs -69.6%, a GPU forecast speedup of 64.5%, and mAP -0.3% on the COCO dataset. A ResNet50vd-dcn-YOLOv3 tailoring + distillation solution and model are released, with FLOPs -43.7%, a GPU forecast speedup of 24.0%, and mAP +0.6% on the COCO data.
- Search: A complete search solution for the open-source BlazeFace-NAS is provided.
- Forecast Deployment:
- The support of the Paddle forecast library for TensorRT and FP16 precision is adapted.
- Documents:
- A document for introducing the data preprocessing module and a document for implementing the user-defined data Reader are added.
- A document about how to add an algorithm model is added.
- Documents are deployed to the website: https://paddledetection.readthedocs.io/zh/latest/
- PaddleSeg (https://github.com/PaddlePaddle/PaddleSeg)
- Added Models
- LaneNet model applicable to lane segmentation scenarios.
- Fast-SCNN model applicable to lightweight scenarios.
- HRNet semantic segmentation model applicable to high-precision scenarios.
- Multiple PaddleSlim-based model compression solutions are released:
- Cityscape-based Fast-SCNN tailoring solution and model.
- Cityscape-based Deeplabv3p-Xception and Deeplabv3p-MobilenetV2 distillation solutions.
- Cityscape-based Deeplabv3p-MobilenetV2 search solution.
- Cityscape-based Deeplabv3p-Mobilenet quantitative solution and model.
- Enhancement of the Forecast Deployment Capability
- Lightweight deployment of Python is added.
- The TensorRT forecast acceleration support for FP16 and Int8 quantitative models is added.
- Tutorials and cases for portrait segmentation Paddle-Lite mobile-side deployment of DeepLabv3p-MobileNetV2 are added.
- Model export is optimized. GPU implementation of image preprocessing and postprocessing is supported. The performance is improved by 10%-20%.
- The benchmark for the forecast performance of U-Net, ICNet, PSPNet, DeepLabv3+, and other models for images of different sizes is provided to facilitate users to select models based on performance.
- Experience Optimization
- A learning rate warmup function is added. It supports the use with different learning rate decay strategies to improve Fine-tuning stability.
- Annotated images can be saved in pseudo-color image format to improve the preview experience.
- The function of automatically saving an optimal mIoU model is added.
- The document logic is comprehensively optimized. An AIStudio practical tutorial on industrial scenarios such as industrial quality inspection and fundus screening is provided.
- ElasticRec (https://github.com/PaddlePaddle/ElasticRec)
- An ElasticRec recommended sorting system is released. It is deployed through K8S. Streaming training and online forecast service are supported.
## Utility Components
- PaddleHub (https://github.com/PaddlePaddle/PaddleHub)
- The pre-training models are rich, with 52 added pre-training models. Currently, the total number of pre-training models is 100+:
- Semantic models: Five semantic models such as RoBERTa\_wwm, BERT\_wwm, and ERNIE-Tiny are added.
- Text classification: Three anti-porn (inappropriate content) detection models are added.
- Image classification: A total of 36 image classification models such as ResNext-WSL and EfficientNet are added.
- Target detection: Five detection models such as pedestrian detection and vehicle detection are added.
- Key point detection: Two models for key point detection of face and body posture are added.
- Face mask detection: Two PyramidBox-Lite-based face mask detection models are added.
- Universal face detection: Four universal Face detection models such as Ultra Light Fast Generic Face Detector and PyramidBox-Lite are added.
- Function:
- A Bert Service text vector representation service based on Paddle Serving is added.
- Task flexibility is enhanced. An added hook mechanism supports the loading of user-defined codes.
- Colored logging (colorlog) is added. The problem of repeated log printing is fixed.
- Code is optimized. The command line execution speed is increased by 50%.
- Dataset and Reader are refactored. The amount of code needed to adapt a user-defined dataset is reduced by 60%.
- The AutoFinetune interface is optimized. Multi-experiment visualization effect display is supported.
- Experience Optimization
- The logic is fully optimized. Rich AIStudio tutorial contents are added.
- The landing page of the official website has been fully upgraded to provide the function of quick online experience and tutorial guidance.
- Multi-task learning framework PALM (https://github.com/PaddlePaddle/PALM)
- Python3 and Windows are supported.
- The framework kernel and the underlying multi-tasking mechanism are upgraded. The API calls are opened up.
- The flexible model saving mechanism supports single-task saving and whole-graph saving.
- Continuous training and forecast are supported. Dataset files can be switched over freely under a single execution.
- A model customization/self-definition function is added.
- The multi-task underlying kernel is reconstructed. Some bugs that affect universality and stability are fixed.
- The multi-task learning ability is strengthened.
- Different batch sizes and sequence lengths per task are supported in multi-task scenarios.
- The problem of inconsistent tasks across GPUs during multi-task multi-card training is fixed.
- The multi-task learning scheduling and termination strategies are optimized to generally improve the model generalization ability.
- The function and type of supported tasks are strengthened.
- Matching task support is enhanced. Pairwise learning and multiple categories (e.g. NLI sentence relation judgment) are supported.
- The support for machine reading comprehension tasks is enhanced. User controllable preprocessing hyper-parameters are added.
- The support for sequence labeling tasks is added.
- The large-scale training/inference capability is strengthened.
- The automatic multi-card forecast capability is added.
- An asynchronous reader is supported. A variable-length padding is supported in multi-card scenarios.
- A module for the management and downloading of pre-training models is added.
- The management and downloading of pre-training models such as BERT, ERNIE, and RoBERTa are supported.
- A RoBERTa Chinese pre-training model is added.
- Federated Learning PaddleFL (https://github.com/PaddlePaddle/PaddleFL):
- The scheduler and submitter functions are added: The scheduler is used to control whether a trainer participates in updates during training. The submitter is used to submit PaddleFL tasks in an MPI cluster.
- The LEAF open dataset for federated learning is added, together with an API to set benchmarks. Classical datasets in image classification, sentiment analysis, character prediction, and other fields, such as MNIST and Sentiment140, are supported.
- The original samples under example are updated for the added components, and the femnist\_demo and submitter\_demo examples are added.
- fl\_distribute\_transpiler is optimized to add support for the Adam optimizer in the FedAvg strategy.
- SecAgg strategy (Secure Aggregation) is added to achieve secure parameter aggregation.
## Code Reconstruction and Upgrade
- Compilation
- A compilation option WITH\_NCCL is added. Single-card users can explicitly specify WITH\_NCCL=OFF to accelerate compilation.
- A compilation option WITH\_TP\_CACHE is added to cache third-party source codes to avoid repeated downloading. Windows users can set it to ON to speed up compilation and improve compilation stability.
- The `CUDA_ARCH_NAME` default value is set to `Auto` (`All` indicates compiling all GPU architectures and `Auto` indicates compiling only the current machine GPU architecture). For developers, a lot of compilation time is saved using `Auto` than using `All`, thus improving development efficiency.
- Redundant links and products and needless file copying are reduced, thus speeding up the compilation in Windows.
- External Dependency Library
- MKL-DNN is upgraded to the latest Version 1.1.
- The forecast library is decoupled from `third_party` and 28 third-party-dependent compilation codes are refactored to facilitate the unified management of external dependencies.
- Two private third-party dependency repositories, one unnecessary dependency, and 2000+ lines of unnecessary code under the patch are removed to improve the repository quality.
- Code Cleanup, Refactoring, and Optimization
- The unnecessary `contrib/float16` directory is removed. The unnecessary snappy/snappystream dependency under the BRPC is deleted.
- `loss.py` and `sequence_lod.py` are split out of `python/paddle/fluid/layers/nn.py` according to the API functions, thus reducing the code quantity of `nn.py` and facilitating reading.
- The codes corresponding to the warnings of `-Wno-error=sign-compare` (at a total of more than 100 points) are fixed. An error will be reported for all subsequent warnings of this kind during compilation, thus improving the code quality.
- The `WarningLnk4006/WarningLnk4221` warnings generated by the Windows MSVC compiler (about 300 in total) are removed to improve the repository quality.
- The quantity of reduce\_op, expand\_op, and expand\_as\_op templates is reduced to accelerate GPU compilation and reduce whl package space by 70 M.
- The pybind function of every OP is automatically generated under the dynamic graph using codes and directly called in layers to improve the dynamic graph performance and reduce the coupling degree with the static graph.
## Bug Fixes
- Fix the problem of MKL-DNN error when PaddleDetection-based Faster-RCNN uses the Python API to make a forecast.
- Fix the problem of training suspension in the GPU implementation of sum op because some Tensors are not initialized.
- Fix the problem of precision loss when the value in fill\_constant is set to a large integer.
- Fix the problem of precision inconsistency of softmax\_with\_cross\_entropy\_op with regard to the CUDA.
- Fix the problem that when a program is cloned, the stop\_gradient attribute in the program cannot be copied to the new program.
- Fix the problem of precision loss of elementwise\_pow op with regard to integers.
- Fix the problem that some GFLAGS cannot be specified outside the forecast library.
- Fix the problem of random forecast core dumps caused by some passes when AnalysisPredictor is used with multiple threads. (fc\_gru\_fuse\_pass, seqconv\_eltadd\_relu\_fuse\_pass, attention\_lstm\_fuse\_pass, embedding\_fc\_lstm\_fuse\_pass, fc\_lstm\_fuse\_pass, seq\_concat\_fc\_fuse\_pass)
- Fix the error that specifying a GPU in the same process using AnalysisConfig does not take effect after NativePredictor is used to specify the use of CPU forecast.
- Fix the bug of compilation error (setup.py copy and op\_function\_cmd error) in the case of -DWITH\_MKL=OFF.
- Fix the bug that tuple (Variable) cannot be entered in the py\_func OP; add an example of how to write PythonOP codes.
- Fix the problem of the sigmoid cudnn kernel being called as the tanh cudnn kernel by mistake.
- Fix some bugs related to reshape and depthwiseconv in dynamic graph mode; fix the problem that some parameters in the network have no gradient, which caused the program to crash.
- Fix the bug of running error of GradientClip in parameter server mode.
- Fix the problem of memory leak in the full asynchronous mode of the parameter server.
==============
Release Notes
==============
Table of Contents
#####################################
* Highlights
* Fundamental framework updates
* Installation
* Optimization on Intermediate Representation IR and Pass
* IO optimization
* Execution optimization
* Video memory optimization
* Refine CPU JITKernel
* Low-level Intel CPU computing optimization
* Intel nGraph graph compiling engine integration
* Adjustments to basic framework functionality
* Accomplished basic functions in the preview version of dynamic graph Inference engine
* Inference engine
* Server-side Inference Engine
* Mobile Inference Engine
* Deployment tools
* Distributed training
* Model construction
* PaddleCV Intelligent Vision
* PaddleNLP intelligent text processing
* PaddleRec intelligent recommendation
* Tools and Components
* Bug fixes notes
Highlights
#####################################
* Significant improvement has been made on training speed and memory management of the fundamental framework. Full support for quantitative training has been incorporated. Integration of Intel nGraph is also accomplished. Besides, the basic functions of single-card and single-node in the preview version of dynamic graph are perfectly implemented.
* We have officially released the model compression toolkit `PaddleSlim <https://github.com/PaddlePaddle/models/tree/develop/PaddleSlim>`_ and the model inference service `Paddle Serving <https://github.com/PaddlePaddle/Serving>`_ to broadly enhance the PaddlePaddle deployment capabilities.
* Boosted distributed IO interfaces and the stream read capability of remote file systems. Synchronous multi-machine multi-card GPU training promotes bandwidth-insensitive training through enabling sparse communication. For low-bandwidth network, such as network of 10G, synchronous training is 10 times faster.
* Support for the K8S ecosystem is smoothened through Paddle-K8S-Operator support in industrial environments; Kubeflow supports paddle-job.
* We have officially released the `video classification toolkit <https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/video>`_ which covers mainstream video classification models, including Non-Local, TSM, Attention Cluster, NeXtVLAD, Attention LSTM, StNet, TSN.
* `ERNIE <https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE>`_ , a Chinese semantic representation model is introduced, which attains accuracy with absolute 1-2 percentage points higher than BERT on multiple Chinese language tasks. Generic dialogue comprehension model DGU is incorporated, with support for 5 types of dialogue tasks, and reaches SOTA in 3 public datasets.
* The Recommendation Model Based on `Graph Neural Network <https://github.com/PaddlePaddle/models/tree/develop/PaddleRec/gnn>`_ (GNN) is carried out, for which Benchmark expectation has been reproduced on public dataset.
* `PaddleHub <https://github.com/PaddlePaddle/PaddleHub>`_ , a management tool for pre-trained models, has been officially released, offering three functions: pre-trained model management, command-line one-click manipulation and transfer learning. It strives to facilitate model management and conduct transfer learning more efficiently.
* Open source `AutoDL Design <https://github.com/PaddlePaddle/AutoDL/tree/master/AutoDL%20Design>`_ is officially released to enable automatic network design.
* Latest upgrades on the parallelization-oriented `PARL1.1 <https://github.com/PaddlePaddle/PARL>`_ . Users are allowed to implement parallelized reinforcement learning algorithms by using a decorator.
* The model conversion tool `X2Paddle <https://github.com/PaddlePaddle/X2Paddle>`_ has been officially published, which enables transfer of inference models in other deep learning frameworks to PaddlePaddle without any compromise.
Fundamental Framework Updates
#####################################
* Installation
* install\_check.run\_check() interface is introduced to provide a more graceful check on whether the installation is successful (see the snippet below).
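A minimal usage sketch:

.. code-block:: python

    import paddle.fluid as fluid

    # Runs a small program to verify that the installation works on this machine.
    fluid.install_check.run_check()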
* Optimization on Intermediate Representation IR and Pass
* The encapsulation of IrGraph, IrNode, IrVarNode, and IrOpNode is fulfilled. Writing IR Passes in Python is also enabled.
* IO optimization
* PyReader optimization: the brand-new interface reader = fluid.io.PyReader(..., iterable=True, ...) makes it possible to create a reader that is iterable by a 'for' loop, whose data is then sent to the network through the 'feed' method (a sketch follows this item).
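A hedged sketch of the iterable PyReader usage; the network, batch size, and randomly generated data are placeholders.

.. code-block:: python

    import numpy as np
    import paddle
    import paddle.fluid as fluid

    image = fluid.layers.data(name="image", shape=[784], dtype="float32")
    label = fluid.layers.data(name="label", shape=[1], dtype="int64")
    pred = fluid.layers.fc(input=image, size=10, act="softmax")
    loss = fluid.layers.mean(fluid.layers.cross_entropy(input=pred, label=label))

    reader = fluid.io.PyReader(feed_list=[image, label], capacity=4, iterable=True)

    def sample_generator():
        # Placeholder samples; replace with a real dataset reader.
        for _ in range(256):
            yield (np.random.rand(784).astype("float32"),
                   np.random.randint(0, 10, (1,)).astype("int64"))

    place = fluid.CPUPlace()
    reader.decorate_sample_list_generator(
        paddle.batch(sample_generator, batch_size=32), places=place)

    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())

    for data in reader():                       # iterate with a plain 'for' loop
        exe.run(feed=data, fetch_list=[loss])   # data is fed through 'feed'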
* Execution optimization
* The 'places' parameter of with\_data\_parallel can be set to specify on which GPU cards to run the model when executing single-process multi-card training tasks (see the sketch after this list).
* The scheduling strategy applied on the multi-card executor is optimized, which is proved on the performance that execution speed on the ResNet50 and Transformer models has witnessed an increase of 8%~19%.
* For Multi-card environment, grouped Fuse for AllReduce is developed. With this manner in place, ResNet model on multi-card is accelerated by 8%~30% (the figure varies with the number of cards). Moreover, Transformer model running on multiple cards picks up speed by 4%.
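A hedged sketch of pinning a compiled program to specific GPU cards through the 'places' argument; the tiny network and random data below are placeholders.

.. code-block:: python

    import numpy as np
    import paddle.fluid as fluid

    x = fluid.layers.data(name="x", shape=[8], dtype="float32")
    loss = fluid.layers.mean(fluid.layers.fc(input=x, size=1))
    fluid.optimizer.SGD(learning_rate=0.01).minimize(loss)

    exe = fluid.Executor(fluid.CUDAPlace(0))
    exe.run(fluid.default_startup_program())

    # Restrict the single-process data-parallel run to GPU cards 0 and 1.
    compiled_prog = fluid.CompiledProgram(
        fluid.default_main_program()).with_data_parallel(
            loss_name=loss.name, places=fluid.cuda_places([0, 1]))

    exe.run(compiled_prog,
            feed={"x": np.random.rand(16, 8).astype("float32")},
            fetch_list=[loss])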
* Video Memory optimization
* GC strategy optimization: the Eager Deletion strategy supports timely deletion of internal while\_op variables; a partial Eager Deletion strategy is supported, and users can set FLAGS\_memory\_fraction\_of\_eager\_deletion=0.xx to control the fraction of memory deleted immediately in real time.
* Op optimization: the backward registration mechanism of cross entropy, expand, layer\_norm, dropout, etc. is optimized, irrelevant variable dependencies are removed, and video memory usage is improved.
* Two new FLAGS (FLAGS\_initial\_gpu\_memory\_in\_mb and FLAGS\_reallocate\_gpu\_memory\_in\_mb) are added to allow users to specify the initial memory pool capacity and the re-allocated memory pool capacity.
* Adjust the inplace\_op\_pass strategy to increase the coverage of the inplace strategy.
* Removed the logic for doing activation op inplace optimization on the python side, and included it to inplace\_op\_pass.
* Memory Profile function is provided.
* Refine CPU JITKernel
* Modify the manner to call JITKernel, employ cache mechanism and interfaces to get all functions of the same type, which is convenient for developers to flexibly call desired interfaces.
* As JITKernel is adopted to optimize the SGD algorithm, the equivalent OP part speed is increased by 44% and the overall training speed is increased by 12% in the PyramidDNN model; On the other hand, JITKernel is used to optimize fused\_embedding\_seq\_pool, and the backward versions of corresponding ops in the PyramidDNN model is accelerated by 18% and overall training speeds up by 6%.
* low-level Intel CPU computing optimization
* MKLDNN is upgraded to v0.18 and includes various performance boosts (e.g. GEMM-based convolution operations/INT8 convolution operations, etc.).
* GELU OP is accelerated by MKL. After optimization, the OP performance attains 3 times of the previous.
* Unit testing of MKLDNN-related Kernels are refined.
* Intel nGraph graph compiling engine is integrated to facilitate support for more hardware backends in PaddlePaddle
* The subgraphs are transferred to the nGraph core via ngraph\_engine OP, and then optimized with graph algorithms, after which they will be dispatched to execute on CPUs. nGraph can be called at runtime with the environment variable set as FLAGS\_use\_ngraph=true.
* Training and inference of the ResNet50 model on the CPU is fulfilled. The performance of the ResNet50 training and inference on CPU gains notable increase compared with the direct optimization by MKLDNN.
* Adjustments to basic framework functionality
* Synchronized Batch Norm operation becomes available; specifying axis in softmax is allowed; new operators are in place: spectral norm, range, acos, asin, atanh; Npair Loss is adopted for feature learning.
* cosine\_decay , a new learning rate strategy, is implemented.
* Users can use sampled\_softmax\_with\_cross\_entropy to improve training efficiency in large dictionaries.
* Fuse is possible between SGD and Adam optimization algorithms. If enabled, on the Transformer model, the speed can increase by 2%, while on the Cycle GAN model, the gain turns out to be 6%.
* A more sophisticated lstmp is provided, which is able to clip the internal cell and to initialize the cell state and hidden state.
* A more adjustable adagrad by which users can initialize cumulative momentum.
* Users are allowed to index and slice a Tensor through the \_\_getitem\_\_ method (see the sketch after this list).
* QuantizationFreezePass, ConvertToInt8Pass, and TransformForMobilePass are introduced with comprehensive support for both dynamic and static quantitative training methods and saving corresponding model.
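A minimal sketch of slicing a Tensor with Python indexing:

.. code-block:: python

    import paddle.fluid as fluid

    x = fluid.layers.data(name="x", shape=[6, 4], dtype="float32")
    # __getitem__ builds slice ops on the Variable, so the result can be used
    # like any other intermediate Tensor in the network.
    y = x[:, 1:3]
    z = fluid.layers.mean(y)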
* Accomplished basic functions in the preview version of dynamic graph
* Basic functions: LRDecay, single GPU card and single-node CPU model training and evaluation.
* API: expose the rudimentary interfaces of dynamic graph to users; reconstruct current Layers; build Layers such as GRU, LayerNorm, NCE, PRelu.
* Performance: performance evaluated on the ResNet, MNIST model is essentially the same as the static graph.
* Dynamic graph implementation of models such as Transformer, MNIST, SE-ResNeXt.
Inference Engine
#####################################
Server-side Inference Engine
+++++++++++++++++++++++++++++++++++++
* Inference library is currently integrated with PaddlePaddle/Anakin to unify interfaces for a more efficient inference process
* able to handle Anakin GPU subgraphs and CPU subgraphs.
* The Python inference interface now accepts Anakin subgraphs.
* Significant inference acceleration on ResNet, VGG, GoogleNet, MobileNet, ShuffleNet, Faster R-CNN, YOLO, SSD, and other models
* Inference framework optimization. Inference of small models expedites noticeably
* Through configuring runtime\_context\_cache\_pass, focal models have obtained a speed-up of 17%.
* The infershape of 5 OPs are refined, so that the focal models accelerate by 13%.
* The ZeroCopy interface is upgraded to avoid redundant CPU copies when using AnalysisPredictor.
* Reinforce INT8 quantitative Inference
* More inclusive support for INT8 Quantization through TensorRT, applicable for AlexNet, Googlenet, VGG, MobileNet, ShuffleNet and more. Utilize the information on TensorRT in an optimal manner to perform the serialization and deserialization so that a model will be initialized more speedily.
* Implement the INT8 quantization framework based on C++ Pass. A few new INT8 OP Kernels: Transpose, Concat, Requantize. By fine-tuning the quantization strategy in MkldnnQuantizerConfig, users can promptly get the INT8 quantization model that meets the accuracy requirements. The INT8 quantized ResNet-50/MobileNet v1 model achieved a performance 7 times/3 times higher compared with the original FP32 model (tested on the Xeon 6271 server supporting the AVX512-DL Boost instruction set).
Mobile Inference Engine
+++++++++++++++++++++++++++++++++++++
* ARM CPU
* Paddle Mobile has reconstructed and enhanced efficiency of the matrix operation library sgemm and sgemv, which gives rise to performance boost of 10%~100% on most models.
* 19 new operators are provided in this version such as while, sequence\_expand, sequence\_pool, sequence\_softmax, gru\_unit, beam\_search, and beam\_search\_decode. Apart from that, there has also been a large amount of optimization, and support for attention-based end-to-end model prediction is added.
* Winograd implementation for ARM v8: higher inference performance on v8 hardware on iOS; Winograd supports operator fusion to ensure higher efficiency after operator fusion.
* Direct convolution for kernel with a 3x3 sliding window, which will be more efficient than winograd and gemm on the condition that the number of channels is small.
* Reconstructed and optimized depthwise convolution with the kernel size 3x3: in contrast to previous versions, it supports arbitrary padding, and attains better performance and returns more reliable calculation results.
* Depthwise convolution with the kernel size 5x5 on armv8: the NAS model prediction speeds up by more than 30%.
* Complete the efficiency optimization of the deconvolution conv2d\_transpose.
* Consolidated with memory reuse strategy based on graph optimization. When the strategy is applied, most models can reduce memory usage by nearly 50%. It is automatically turned on for the ARM CPU (not compatible with FPGA and GPU).
* ARM GPU
* Paddle Mobile completes the convolution optimization for the kernel with size 1x1, and MobileNet v1 has an average inference performance improvement of 35% on Qualcomm Adreno GPUs.
* Paddle Inference has preliminarily unified the Paddle Mobile and Anakin interfaces. Further integration is pending.
Deployment Tools
+++++++++++++++++++++++++++++++++++++
* Model compression toolkit PaddleSlim
* Model clipping compression strategy: users can select sensitivity or uniform modes, apply it for various models such as VGG, ResNet, MobileNet, and customize clipping range.
* Quantitative training model compression strategy: there are two quantitative training modes, dynamic mode and static mode. Channel quantization or overall quantization of parameters is also selectable. Users can save models with float type simulating the int8 value domain, with int8 type, or with formats compatible with Paddle Mobile.
* Model distillation compression strategy: users are permitted to add combined loss at any layer in the teacher network and student network. FSP Loss, L2 Loss, Softmax with Cross-entropy Loss are all available methods.
* Other functions: Users can configure hyper-parameters of file compression task, and are allowed to combine multiple compression strategies. Moreover, checkpoints function is also applicable for distillation and clipping compression process.
* Paddle Serving
* Remote paddle inference deployment is accomplished.
* The server allows users to add data processing Operator, or define inference logic, and it supports model hot-loading.
* The client side offers a C++ SDK which can be called from business logic if needed. Users are allowed to customize protobuf to define network data transfer protocols, and A/B testing capabilities are provided.
* Provides sample templates for classic tasks in paddle serving, including text classification and image classification tasks.
* Benchmarks for latency and throughput for text classification tasks.
Distributed training
#####################################
* Distributed IO optimization
* Pipe Reader Interface Optimization: high-efficiency IO methods are in place while maintaining the flexibility of data pre-processing. Enterprise-class Linux system customization is supported. High-performance IO components are implemented. Unified maintenance is carried out in the procedure of off-line data preprocessing. Remote file system stream read capability is enhanced to support the modes in which data are loaded to memory and distributed shuffling.
* Integration of Executor and distributed IO
* AsyncExecutor is integrated into Executor, equipped with a new train\_from\_dataset/infer\_from\_dataset interface. It supports Pipe Reader-based training, and accepts user-defined PipeLine program on the condition of maintaining multi-queue IO function, and provides flexible python-side data processing.
* bandwidth insensitive training ability of synchronous multi-node multi-card GPU training
* Sync GPU training is capable of sparse communication and adopts sparse all reduce.
* Guarantee model convergence from the algorithm perspective and introduce DGCOptimizer through control of communication sparsity.
* Experiments on ResNet50 on imagenet prove that: in terms of model convergence, for 90 rounds of ResNet50, convergence remains stable; in a high-speed interconnected network environment, sparse communication does not compromise training speed; for a low-bandwidth network environment (such as a 10G network), sparse communication has notable advantages in training speed, where synchronous training is 10 times faster than that of dense communication.
* Collective Operator mode
* Collective Operator mode is available. Multiple all reduce operations are allowed under GPU. Incorporating collective op into Program through the Python API makes the development of distributed optimization algorithms much more flexible.
* Convergence speed optimization for ResNet50 on Imagenet
* Dynamic BatchSize, dynamic ImageSize, and rectangular crop can be used. With FP32 precision, on v100 single-node 8 card testing environment, the convergence speed increases by 68% (acc1\>=75.9%, acc5=93.0%).
* K8S Ecosystem Support
* Kubeflow has supported paddle-job and contributed to the kubeflow community.
* The Paddle-K8S-Operator for industrial application is supported. It can collaborate with kubeflow.
* The K8S environment is suitable for beginners to submit task scripts, of which reproducible tutorials are given on Baidu Cloud.
Model Construction
#####################################
* PaddleCV Intelligent Vision
* Video Classification Toolkit is formally released. It covers mainstream video classification models, including Non-Local, TSM, Attention Cluster, NeXtVLAD, Attention LSTM, StNet, TSN, and attains the level of mainstream implementations.
* New pre-trained ImageNet-based model: GoogleNet, ShuffleNetv2, ResNet18, ResNet34.
* New target detection YOLOv3 model. The effect is equivalent to the finest open implementation (mAP is 7 percentage points higher than the original author).
* The Simple Baselines human pose estimation model based on COCO and MPII data is realized. The effect is able to parallel mainstream implementation.
* npair loss is introduced to feature learning models, and raises recall@1 to 79.03% (+0.78%) based on the pre-trained model (arcmargin loss).
* PaddleNLP intelligent text processing
* The Chinese semantic representation ELMo model is available. It supports multi-card training, and the training speed is twice as fast as mainstream implementation. It has been verified that the F1 value is increased by absolute 1.1% in Chinese lexical analysis tasks, and the Rouge-L value increases by 1% in Chinese reading comprehension tasks.
* The Chinese semantic representation model ERNIE is implemented, which has improved the accuracy by absolute 1% ~ 2% compared with the BERT Chinese model in Chinese tasks such as natural language inference, semantic similarity, named entity recognition, sentiment analysis, and question and answer matching.
* The reading comprehension model is upgraded by optimizing data pre-processing and document selection. The effect is that Rouge-L is raised to 65 (baseline 39.29) on the DuReader validation dataset.
* A knowledge-aware dialogue model is added. Compared with the baseline generation dialog model, it outperforms by an average of 1 percentage point on the F1, BLEU1, and BLEU2 metrics.
* The dialogue model toolkit is available. It consists of Deep Attention Matching Net, a new automatic dialogue assessment tool and the BERT-based generic dialog understanding model DGU (Dialogue General Understanding), which supports five types of dialogue tasks, namely dialogue semantic matching, DA, DST, slot analysis and intention recognition, and attains the effect of SOTA on three public datasets.
* The PaddleNLP toolkit is released to unify the modeling of NLP tasks such as text classification, text matching, sequence labeling, reading comprehension, and intelligent dialogue. And their corresponding industrial pre-trained models are also open to use.
* PaddleRec intelligent recommendation
* Deep Interest Network (DIN): DIN is implemented in this version. Its effect is reproduced on a public dataset, and single/multi-card training is supported in both CPU and GPU modes. DIN is appropriate for ranking scenarios in recommendation (such as CTR prediction). Its main feature is combining the estimated target information in the process of modeling the historical sequence.
* Graph Neural Network (GNN): a session-based graph neural network recommendation model is introduced. Effect has been reproduced on public dataset. It supports single-node single-card training in both CPU and GPU mode. The model is suitable for the recall scenario in the recommendation. Using GNN to model the user's historical information can capture more complex transformation relationships underlying item sequences.
* Word2vec: word2vec sampling strategy is adjusted. The effect is reproduced on the public dataset. Multi-machine training support is included as well.
Tools and Components
#####################################
* Open source AutoDL Design is officially released to enable automatic network design
* A series of neural networks generated with the AutoDL Design, and a total of six models trained on CIFAR10 data have saved the network structures and involved weights. Therefore, any developer or researcher interested in deep learning can easily work on PaddlePaddle and public CIFAR10 data to perform inference and model fusion on these six models, which have attained an accuracy over 98%.
* The source code for the encoder and the critic is made open source. The source code is based on the PaddlePaddle platform and the PARL framework developed entirely by Baidu. The code also comes with Chinese documentation and some brief demos that make it easy for users to run it effortlessly (for example, with "How many 1s are generated by the RNN" as a standard, you can quickly verify the correctness of the entire framework). Moreover, users can download, install, run, and try to generate their own original neural network structures.
* Latest upgrades on the parallelization-oriented PARL1.1. Users are allowed to implement parallelized reinforcement learning algorithms by using a decorator
* Parallelization can be achieved simply with a decorator (@parl.remote_class). Once computation-intensive tasks, such as data preprocessing and simulator simulation, are wrapped by this decorator, they are automatically deployed to the specified computing resources and no longer occupy the computing resources of the main thread.
* Support parallelization algorithms such as IMPALA, A2C, and GA3C.
* PaddleHub, a pre-trained model management tool, is released and strives to help users manage models and conduct transfer learning more efficiently
* **Pre-trained model management:** Pre-trained model download, search, version management and other functions in the PaddlePaddle ecosystem can be completed through the hub command line.
* **One-click command line:** Free from code, you can use the pre-trained model to infer straight through the command line, and quickly examine the effect of the training model. The current version supports the following models: lexical analysis LAC; sentiment analysis Senta; target detection SSD; image classification ResNet, MobileNet.
* **Transfer Learning:** Provides a Finetune API based on pre-trained models. Users can complete transfer learning with a small amount of code. The API mainly includes BERT/ERNIE text classification, sequence labeling, image classification transfer.
* The X2Paddle model conversion tool is officially released to transfer prediction models implemented in other deep learning frameworks to PaddlePaddle without loss. The tool also comes with detailed API comparison documents for TensorFlow and Caffe to help users transform models to PaddlePaddle more easily
BUG fixes notes
#####################################
* Fixed precision inconsistency in BFS that occurred in backward computation.
* Fixed redundant backward inputs created by optimizer minimize.
* Fixed Paddle-TRT occupying too much video memory.
* Fixed bugs in AllReduceDepPass.
* Fixed bugs in FastThreadedExecutor.
* Fixed bugs in Op such as Reshape, cross\_entropy, arg\_min\_max, recurrent, etc.
* Fixed problems with VarBase construction
* Fixed a number of problems and bugs in memory\_optimize\_pass: adjusted the multiplexing logic from \>= to =, reduced fragmentation caused by Variable multiplexing, and removed the dependency of memory\_optimize\_pass on BlockDesc. Fixed a bug that different types of Variables would be reused mutually.
* Fixed an issue with util.plot in python3.
* Improved the stability of the Profiler and introduced Memory Profile function.
* Fixed the problem that multithreading was effective only when C++ inference had been cloned within the thread.
* fix bugs of some ops in InferShape.
* fix bugs of some ops with input LoD length = 0.
* fix bugs of recurrent op for StaticRNN.
* fix bugs of dygraph when saving and loading model checkpoint.
\ No newline at end of file
......@@ -3,80 +3,8 @@
################
本章由1篇文档组成,将指导您如何使用PaddlePaddle完成基础的计算机视觉深度学习任务
本章文档涉及大量了深度学习基础知识,也介绍了如何使用PaddlePaddle实现这些内容,请参阅以下说明了解如何使用:
内容简介
======================
您现在在看的这本书是一本“交互式”电子书 —— 每一章都可以运行在一个Jupyter Notebook里。
.. toctree::
:titlesonly:
gan/README.cn.md
我们把Jupyter、PaddlePaddle、以及各种被依赖的软件都打包进一个Docker image了。所以您不需要自己来安装各种软件,只需要安装Docker即可。对于各种Linux发行版,请参考 https://www.docker.com 。如果您使用 `Windows <https://www.docker.com/docker-windows>`_ 或者 `Mac <https://www.docker.com/docker-mac>`_,可以考虑 `给Docker更多内存和CPU资源 <http://stackoverflow.com/a/39720010/724872>`_ 。
使用方法
======================
本书默认使用CPU训练,若是要使用GPU训练,使用步骤会稍有变化,请参考下文“使用GPU训练”
使用CPU训练
>>>>>>>>>>>>
只需要在命令行窗口里运行:
.. code-block:: shell
docker run -d -p 8888:8888 paddlepaddle/book
即可从DockerHub.com下载和运行本书的Docker image。阅读和在线编辑本书请在浏览器里访问 http://localhost:8888
如果您访问DockerHub.com很慢,可以试试我们的另一个镜像docker.paddlepaddlehub.com:
::
docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book
使用GPU训练
>>>>>>>>>>>>>
为了保证GPU驱动能够在镜像里面正常运行,我们推荐使用 `nvidia-docker <https://github.com/NVIDIA/nvidia-docker>`_ 来运行镜像。请先安装nvidia-docker,之后请运行:
::
nvidia-docker run -d -p 8888:8888 paddlepaddle/book:latest-gpu
或者使用国内的镜像请运行:
::
nvidia-docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book:latest-gpu
还需要将以下代码
.. code-block:: python
use_cuda = False
改成:
.. code-block:: python
use_cuda = True
贡献新章节
=============
您要是能贡献新的章节那就太好了!请发Pull Requests把您写的章节加入到 :code:`pending` 下面的一个子目录里。当这一章稳定下来,我们一起把您的目录挪到根目录。
为了写作、运行、调试,您需要安装Python 2.x和Go >1.5, 并可以用 `脚本程序 <https://github.com/PaddlePaddle/book/blob/develop/.tools/convert-markdown-into-ipynb-and-test.sh>`_ 来生成新的Docker image。
**Please Note:** We also provide `English Readme <https://github.com/PaddlePaddle/book/blob/develop/README.md>`_ for PaddlePaddle book
......@@ -2,92 +2,9 @@
Computer Vision
############################
This section consists of one document, which will guide you through basic deep learning tasks for computer vision in PaddlePaddle.
The documentation in this chapter covers a lot of deep learning basics and shows how to implement them with PaddlePaddle. See the instructions below on how to use it:
Overview
======================
The book you are reading is an "interactive" e-book - each chapter can be run in a Jupyter Notebook.
.. toctree::
:titlesonly:
gan/README.md
We packaged Jupyter, PaddlePaddle, and various dependency software into a Docker image, so you don't have to install them yourself; you only need to install Docker. For various Linux versions, please refer to https://www.docker.com . If you use Docker on `Windows <https://www.docker.com/docker-windows>`_ or `Mac <https://www.docker.com/docker-mac>`_ , consider `allocating more memory and CPU resources to Docker <http://stackoverflow.com/a/39720010/724872>`_ .
Instructions
======================
This book assumes you are performing CPU training by default. If you want to use GPU training, the steps will vary slightly. Please refer to "GPU Training" below.
CPU training
>>>>>>>>>>>>
Just run these in shell:
.. code-block:: shell
docker run -d -p 8888:8888 paddlepaddle/book
It downloads the Docker image for running books from DockerHub.com.
To read and edit this book on-line, please visit http://localhost:8888 in your browser.
If the Internet connection to DockerHub.com is slow or unstable, try our mirror registry docker.paddlepaddlehub.com:
::
docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book
GPU training
>>>>>>>>>>>>>
To ensure that the GPU driver works properly in the image, we recommend running the image with `nvidia docker <https://github.com/NVIDIA/nvidia-docker>`_ . Please install nvidia-docker first, then run:
::
nvidia-docker run -d -p 8888:8888 paddlepaddle/book:latest-gpu
Or use an image source in China:
::
nvidia-docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book:latest-gpu
You also need to modify the following code
.. code-block:: python
use_cuda = False
into:
.. code-block:: python
use_cuda = True
Contribute to Book
===================
We highly appreciate your original contributions of new chapters to the Book! Just send Pull Requests adding your chapter to a sub-directory under :code:`pending` . When the chapter becomes stable, we'll gladly move it to the root directory.
For writing, running, and debugging, you need to install Python 2.x and Go >1.5, and you can use this `script <https://github.com/PaddlePaddle/book/blob/develop/.tools/convert-markdown-into-ipynb-and-test.sh>`_ to generate a new Docker image.
**Please Note:** We also provide `English Readme <https://github.com/PaddlePaddle/book/blob/develop/README.md>`_ for PaddlePaddle book
......@@ -6,13 +6,23 @@
如果您已经掌握了快速上手阶段的内容,期望可以针对实际问题建模、搭建自己网络,本模块提供了一些 Paddle 的具体典型案例供您参考:
本章文档将指导您如何使用PaddlePaddle完成基础的深度学习任务
本章文档涉及大量了深度学习基础知识,也介绍了如何使用PaddlePaddle实现这些内容,请参阅以下说明了解如何使用:
内容简介
======================
- `简单案例 <../user_guides/simple_case/index_cn.html>`_ :介绍了 Paddle 的基本案例
- `计算机视觉 <../user_guides/cv_case/index_cn.html>`_ :介绍使用 Paddle 解决计算机视觉领域的案例
- `自然语言处理 <../user_guides/nlp_case/index_cn.html>`_: 介绍使用 Paddle 实现自然语言处理方向的案例
- `推荐 <../user_guides/rec_case/index_cn.html>`_:介绍如何使用 Paddle 完成推荐领域任务的案例
- `模型库 <../user_guides/models/index_cn.html>`_:介绍了 Paddle 经典的模型库
- `工具组件 <../user_guides/tools/index_cn.html>`_:介绍在 Paddle 工具组件的使用案例
......@@ -23,4 +33,71 @@
cv_case/index_cn.rst
nlp_case/index_cn.rst
rec_case/index_cn.rst
models/index_cn.rst
tools/index_cn.rst
我们把Jupyter、PaddlePaddle、以及各种被依赖的软件都打包进一个Docker image了。所以您不需要自己来安装各种软件,只需要安装Docker即可。对于各种Linux发行版,请参考 https://www.docker.com 。如果您使用 `Windows <https://www.docker.com/docker-windows>`_ 或者 `Mac <https://www.docker.com/docker-mac>`_,可以考虑 `给Docker更多内存和CPU资源 <http://stackoverflow.com/a/39720010/724872>`_ 。
使用方法
======================
本书默认使用CPU训练,若是要使用GPU训练,使用步骤会稍有变化,请参考下文“使用GPU训练”
使用CPU训练
>>>>>>>>>>>>
只需要在命令行窗口里运行:
.. code-block:: shell
docker run -d -p 8888:8888 paddlepaddle/book
即可从DockerHub.com下载和运行本书的Docker image。阅读和在线编辑本书请在浏览器里访问 http://localhost:8888
如果您访问DockerHub.com很慢,可以试试我们的另一个镜像docker.paddlepaddlehub.com:
::
docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book
使用GPU训练
>>>>>>>>>>>>>
为了保证GPU驱动能够在镜像里面正常运行,我们推荐使用 `nvidia-docker <https://github.com/NVIDIA/nvidia-docker>`_ 来运行镜像。请先安装nvidia-docker,之后请运行:
::
nvidia-docker run -d -p 8888:8888 paddlepaddle/book:latest-gpu
或者使用国内的镜像请运行:
::
nvidia-docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book:latest-gpu
还需要将以下代码
.. code-block:: python
use_cuda = False
改成:
.. code-block:: python
use_cuda = True
贡献新章节
=============
您要是能贡献新的章节那就太好了!请发Pull Requests把您写的章节加入到 :code:`pending` 下面的一个子目录里。当这一章稳定下来,我们一起把您的目录挪到根目录。
为了写作、运行、调试,您需要安装Python 2.x和Go >1.5, 并可以用 `脚本程序 <https://github.com/PaddlePaddle/book/blob/develop/.tools/convert-markdown-into-ipynb-and-test.sh>`_ 来生成新的Docker image。
**Please Note:** We also provide `English Readme <https://github.com/PaddlePaddle/book/blob/develop/README.md>`_ for PaddlePaddle book
......@@ -8,19 +8,22 @@ If you have got the hang of Beginner's Guide, and wish to model practical proble
you with some detailed operations:
- `LoD-Tensor Concepts <../user_guides/howto/basic_concept/index_en.html>`_ :It explains basic concepts of Fluid LoD-Tensor.
This section collects several documents, arranged from the simplest to the most challenging, which will guide you through the basic deep learning tasks in PaddlePaddle.
- `Prepare Data <../user_guides/howto/prepare_data/index_en.html>`_ :This section introduces data types supported and data transmission methods when you are training your networks with Fluid.
The documentation in this chapter covers a lot of deep learning basics and how to implement them with PaddlePaddle. See the instructions below for how to use:
- `Set up Simple Model <../user_guides/howto/configure_simple_model/index_en.html>`_: This section illustrates how to model practical problems and build networks with related operators of Fluid.
- `Train Neural Networks <../user_guides/howto/training/index_en.html>`_:This section will guide you to perform single-node training, multi-node training, and save or load model variables.
Overview
======================
- `Model Evaluation and Debugging <../user_guides/howto/evaluation_and_debugging/index_en.html>`_:It introduces the model evaluation and debugging methods in Fluid
- `Simple Case <../user_guides/simple_case/index_en.html>`_ :introduces basic cases of Paddle
Reproduced classic models of multiple directions in Fluid:
- `Natural Language Processing <../user_guides/nlp_case/index_en.html>`_:introduces cases of using paddle to realize Natural Language Processing tasks
- `Fluid Model Library <../user_guides/models/index_en.html>`_
- `Recommend <../user_guides/rec_case/index_en.html>`_:introduces cases of using paddle to realize Recommend tasks
- `Models Zoo <../user_guides/models/index_en.html>`_:introduces the models zoo of Paddle
.. toctree::
:hidden:
......@@ -28,4 +31,80 @@ Reproduced classic models of multiple directions in Fluid:
simple_case/index_en.rst
nlp_case/index_en.rst
rec_case/index_en.rst
tools/index_en.rst
models/index_cn.rst
We packaged Jupyter, PaddlePaddle, and various dependency software into a Docker image, so you don't have to install them yourself; you only need to install Docker. For various Linux versions, please refer to https://www.docker.com . If you use Docker on `Windows <https://www.docker.com/docker-windows>`_ or `Mac <https://www.docker.com/docker-mac>`_ , consider `allocating more memory and CPU resources to Docker <http://stackoverflow.com/a/39720010/724872>`_ .
Instructions
======================
This book assumes you are performing CPU training by default. If you want to use GPU training, the steps will vary slightly. Please refer to "GPU Training" below.
CPU training
>>>>>>>>>>>>
Just run these in shell:
.. code-block:: shell
docker run -d -p 8888:8888 paddlepaddle/book
It downloads the Docker image for running books from DockerHub.com.
To read and edit this book on-line, please visit http://localhost:8888 in your browser.
If the Internet connection to DockerHub.com is slow or unstable, try our mirror registry docker.paddlepaddlehub.com:
::
docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book
GPU training
>>>>>>>>>>>>>
To ensure that the GPU driver works properly in the image, we recommend running the image with `nvidia docker <https://github.com/NVIDIA/nvidia-docker>`_ . Please install nvidia-docker first, then run:
::
nvidia-docker run -d -p 8888:8888 paddlepaddle/book:latest-gpu
Or, to use the image mirror hosted in China, run:
::
nvidia-docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book:latest-gpu
You also need to change the following code
.. code-block:: python
use_cuda = False
to:
.. code-block:: python
use_cuda = True
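As a quick, minimal sketch (not taken from the original chapters), the ``use_cuda`` flag is typically consumed like this to pick the execution device in Fluid; the surrounding variable names are illustrative only:
.. code-block:: python

    import paddle.fluid as fluid

    use_cuda = True  # set to False to train on CPU only
    # pick GPU 0 when CUDA is enabled, otherwise fall back to the CPU
    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
    exe = fluid.Executor(place)  # executor that runs the training program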
Contribute to Book
===================
We highly appreciate your original contributions of new chapters to the Book! Just open a Pull Request that adds your chapter as a sub-directory under :code:`pending` . Once the chapter becomes stable, we will gladly move it to the root directory.
For writing, running and debugging, you need Python 2.x and Go >1.5, and you can use this `script <https://github.com/PaddlePaddle/book/blob/develop/.tools/convert-markdown-into-ipynb-and-test.sh>`_ to generate a new Docker image.
**Please Note:** We also provide an `English Readme <https://github.com/PaddlePaddle/book/blob/develop/README.md>`_ for the PaddlePaddle book.
......@@ -2,16 +2,6 @@
自然语言处理
################
本章由3篇文档组成,将指导您如何使用PaddlePaddle完成自然语言处理领域的基础深度学习任务。
本章文档涉及了大量深度学习基础知识,也介绍了如何使用PaddlePaddle实现这些内容,请参阅以下说明了解如何使用:
内容简介
======================
您现在在看的这本书是一本“交互式”电子书 —— 每一章都可以运行在一个Jupyter Notebook里。
.. toctree::
:titlesonly:
......@@ -19,67 +9,3 @@
label_semantic_roles/README.cn.md
machine_translation/README.cn.md
我们把Jupyter、PaddlePaddle、以及各种被依赖的软件都打包进一个Docker image了。所以您不需要自己来安装各种软件,只需要安装Docker即可。对于各种Linux发行版,请参考 https://www.docker.com 。如果您使用 `Windows <https://www.docker.com/docker-windows>`_ 或者 `Mac <https://www.docker.com/docker-mac>`_,可以考虑 `给Docker更多内存和CPU资源 <http://stackoverflow.com/a/39720010/724872>`_ 。
使用方法
======================
本书默认使用CPU训练,若是要使用GPU训练,使用步骤会稍有变化,请参考下文“使用GPU训练”
使用CPU训练
>>>>>>>>>>>>
只需要在命令行窗口里运行:
.. code-block:: shell
docker run -d -p 8888:8888 paddlepaddle/book
即可从DockerHub.com下载和运行本书的Docker image。阅读和在线编辑本书请在浏览器里访问 http://localhost:8888
如果您访问DockerHub.com很慢,可以试试我们的另一个镜像docker.paddlepaddlehub.com:
::
docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book
使用GPU训练
>>>>>>>>>>>>>
为了保证GPU驱动能够在镜像里面正常运行,我们推荐使用 `nvidia-docker <https://github.com/NVIDIA/nvidia-docker>`_ 来运行镜像。请先安装nvidia-docker,之后请运行:
::
nvidia-docker run -d -p 8888:8888 paddlepaddle/book:latest-gpu
或者使用国内的镜像请运行:
::
nvidia-docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book:latest-gpu
还需要将以下代码
.. code-block:: python
use_cuda = False
改成:
.. code-block:: python
use_cuda = True
贡献新章节
=============
您要是能贡献新的章节那就太好了!请发Pull Requests把您写的章节加入到 :code:`pending` 下面的一个子目录里。当这一章稳定下来,我们一起把您的目录挪到根目录。
为了写作、运行、调试,您需要安装Python 2.x和Go >1.5, 并可以用 `脚本程序 <https://github.com/PaddlePaddle/book/blob/develop/.tools/convert-markdown-into-ipynb-and-test.sh>`_ 来生成新的Docker image。
**Please Note:** We also provide an `English Readme <https://github.com/PaddlePaddle/book/blob/develop/README.md>`_ for the PaddlePaddle book.
......@@ -2,15 +2,6 @@
Natural Language Processing
############################
This section collects 3 documents, arranged from the simplest to the most challenging, which will guide you through the basic deep learning tasks in PaddlePaddle.
The documentation in this chapter covers many deep learning basics and how to implement them with PaddlePaddle. See the instructions below for how to use it:
Overview
======================
The book you are reading is an "interactive" e-book - each chapter can be run in a Jupyter Notebook.
.. toctree::
:titlesonly:
......@@ -19,78 +10,3 @@ The book you are reading is an "interactive" e-book - each chapter can be run in
label_semantic_roles/README.md
machine_translation/README.md
We packaged Jupyter, PaddlePaddle, and all the required dependencies into a Docker image, so you do not need to install them yourself; just install Docker. For various Linux distributions, please refer to https://www.docker.com . If you use Docker on `Windows <https://www.docker.com/docker-windows>`_ or `Mac <https://www.docker.com/docker-mac>`_ , consider `allocating more memory and CPU resources to Docker <http://stackoverflow.com/a/39720010/724872>`_ .
Instructions
======================
This book assumes you are performing CPU training by default. If you want to use GPU training, the steps will vary slightly. Please refer to "GPU Training" below.
CPU training
>>>>>>>>>>>>
Just run these in shell:
.. code-block:: shell
docker run -d -p 8888:8888 paddlepaddle/book
It downloads the Docker image for running the book from DockerHub.com.
To read and edit this book on-line, please visit http://localhost:8888 in your browser.
If your connection to DockerHub.com is slow or unstable, try our mirror at docker.paddlepaddlehub.com:
::
docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book
GPU training
>>>>>>>>>>>>>
To ensure that the GPU driver works properly in the image, we recommend running the image with `nvidia docker <https://github.com/NVIDIA/nvidia-docker>`_ . Please install nvidia-docker first, then run:
::
nvidia-docker run -d -p 8888:8888 paddlepaddle/book:latest-gpu
Or, to use the image mirror hosted in China, run:
::
nvidia-docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book:latest-gpu
You also need to change the following code
.. code-block:: python
use_cuda = False
to:
.. code-block:: python
use_cuda = True
Contribute to Book
===================
We highly appreciate your original contributions of new chapters to the Book! Just open a Pull Request that adds your chapter as a sub-directory under :code:`pending` . Once the chapter becomes stable, we will gladly move it to the root directory.
For writing, running and debugging, you need Python 2.x and Go >1.5, and you can use this `script <https://github.com/PaddlePaddle/book/blob/develop/.tools/convert-markdown-into-ipynb-and-test.sh>`_ to generate a new Docker image.
**Please Note:** We also provide an `English Readme <https://github.com/PaddlePaddle/book/blob/develop/README.md>`_ for the PaddlePaddle book.
......@@ -2,81 +2,8 @@
推荐
################
本章由1篇文档组成,将指导您如何使用PaddlePaddle完成推荐领域的基础深度学习任务。
本章文档涉及了大量深度学习基础知识,也介绍了如何使用PaddlePaddle实现这些内容,请参阅以下说明了解如何使用:
内容简介
======================
您现在在看的这本书是一本“交互式”电子书 —— 每一章都可以运行在一个Jupyter Notebook里。
.. toctree::
:titlesonly:
recommender_system/README.cn.md
我们把Jupyter、PaddlePaddle、以及各种被依赖的软件都打包进一个Docker image了。所以您不需要自己来安装各种软件,只需要安装Docker即可。对于各种Linux发行版,请参考 https://www.docker.com 。如果您使用 `Windows <https://www.docker.com/docker-windows>`_ 或者 `Mac <https://www.docker.com/docker-mac>`_,可以考虑 `给Docker更多内存和CPU资源 <http://stackoverflow.com/a/39720010/724872>`_ 。
使用方法
======================
本书默认使用CPU训练,若是要使用GPU训练,使用步骤会稍有变化,请参考下文“使用GPU训练”
使用CPU训练
>>>>>>>>>>>>
只需要在命令行窗口里运行:
.. code-block:: shell
docker run -d -p 8888:8888 paddlepaddle/book
即可从DockerHub.com下载和运行本书的Docker image。阅读和在线编辑本书请在浏览器里访问 http://localhost:8888
如果您访问DockerHub.com很慢,可以试试我们的另一个镜像docker.paddlepaddlehub.com:
::
docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book
使用GPU训练
>>>>>>>>>>>>>
为了保证GPU驱动能够在镜像里面正常运行,我们推荐使用 `nvidia-docker <https://github.com/NVIDIA/nvidia-docker>`_ 来运行镜像。请先安装nvidia-docker,之后请运行:
::
nvidia-docker run -d -p 8888:8888 paddlepaddle/book:latest-gpu
或者使用国内的镜像请运行:
::
nvidia-docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book:latest-gpu
还需要将以下代码
.. code-block:: python
use_cuda = False
改成:
.. code-block:: python
use_cuda = True
贡献新章节
=============
您要是能贡献新的章节那就太好了!请发Pull Requests把您写的章节加入到 :code:`pending` 下面的一个子目录里。当这一章稳定下来,我们一起把您的目录挪到根目录。
为了写作、运行、调试,您需要安装Python 2.x和Go >1.5, 并可以用 `脚本程序 <https://github.com/PaddlePaddle/book/blob/develop/.tools/convert-markdown-into-ipynb-and-test.sh>`_ 来生成新的Docker image。
**Please Note:** We also provide an `English Readme <https://github.com/PaddlePaddle/book/blob/develop/README.md>`_ for the PaddlePaddle book.
......@@ -2,92 +2,9 @@
Recommend
############################
This section contains 1 document, which will guide you through the basic deep learning tasks in PaddlePaddle.
The documentation in this chapter covers many deep learning basics and how to implement them with PaddlePaddle. See the instructions below for how to use it:
Overview
======================
The book you are reading is an "interactive" e-book - each chapter can be run in a Jupyter Notebook.
.. toctree::
:titlesonly:
recommender_system/README.md
We packaged Jupyter, PaddlePaddle, and all the required dependencies into a Docker image, so you do not need to install them yourself; just install Docker. For various Linux distributions, please refer to https://www.docker.com . If you use Docker on `Windows <https://www.docker.com/docker-windows>`_ or `Mac <https://www.docker.com/docker-mac>`_ , consider `allocating more memory and CPU resources to Docker <http://stackoverflow.com/a/39720010/724872>`_ .
Instructions
======================
This book assumes you are performing CPU training by default. If you want to use GPU training, the steps will vary slightly. Please refer to "GPU Training" below.
CPU training
>>>>>>>>>>>>
Just run these in shell:
.. code-block:: shell
docker run -d -p 8888:8888 paddlepaddle/book
It downloads the Docker image for running the book from DockerHub.com.
To read and edit this book on-line, please visit http://localhost:8888 in your browser.
If your connection to DockerHub.com is slow or unstable, try our mirror at docker.paddlepaddlehub.com:
::
docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book
GPU training
>>>>>>>>>>>>>
To ensure that the GPU driver works properly in the image, we recommend running the image with `nvidia docker <https://github.com/NVIDIA/nvidia-docker>`_ . Please install nvidia-docker first, then run:
::
nvidia-docker run -d -p 8888:8888 paddlepaddle/book:latest-gpu
Or, to use the image mirror hosted in China, run:
::
nvidia-docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book:latest-gpu
You also need to change the following code
.. code-block:: python
use_cuda = False
to:
.. code-block:: python
use_cuda = True
Contribute to Book
===================
We highly appreciate your original contributions of new chapters to the Book! Just open a Pull Request that adds your chapter as a sub-directory under :code:`pending` . Once the chapter becomes stable, we will gladly move it to the root directory.
For writing, running and debugging, you need Python 2.x and Go >1.5, and you can use this `script <https://github.com/PaddlePaddle/book/blob/develop/.tools/convert-markdown-into-ipynb-and-test.sh>`_ to generate a new Docker image.
**Please Note:** We also provide an `English Readme <https://github.com/PaddlePaddle/book/blob/develop/README.md>`_ for the PaddlePaddle book.
......@@ -2,16 +2,6 @@
简单案例
################
本章由4篇文档组成,将指导您如何使用PaddlePaddle完成基础的深度学习任务。
本章文档涉及了大量深度学习基础知识,也介绍了如何使用PaddlePaddle实现这些内容,请参阅以下说明了解如何使用:
内容简介
======================
您现在在看的这本书是一本“交互式”电子书 —— 每一章都可以运行在一个Jupyter Notebook里。
.. toctree::
:titlesonly:
......@@ -19,67 +9,3 @@
recognize_digits/README.cn.md
image_classification/README.cn.md
word2vec/README.cn.md
我们把Jupyter、PaddlePaddle、以及各种被依赖的软件都打包进一个Docker image了。所以您不需要自己来安装各种软件,只需要安装Docker即可。对于各种Linux发行版,请参考 https://www.docker.com 。如果您使用 `Windows <https://www.docker.com/docker-windows>`_ 或者 `Mac <https://www.docker.com/docker-mac>`_,可以考虑 `给Docker更多内存和CPU资源 <http://stackoverflow.com/a/39720010/724872>`_ 。
使用方法
======================
本书默认使用CPU训练,若是要使用GPU训练,使用步骤会稍有变化,请参考下文“使用GPU训练”
使用CPU训练
>>>>>>>>>>>>
只需要在命令行窗口里运行:
.. code-block:: shell
docker run -d -p 8888:8888 paddlepaddle/book
即可从DockerHub.com下载和运行本书的Docker image。阅读和在线编辑本书请在浏览器里访问 http://localhost:8888
如果您访问DockerHub.com很慢,可以试试我们的另一个镜像docker.paddlepaddlehub.com:
::
docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book
使用GPU训练
>>>>>>>>>>>>>
为了保证GPU驱动能够在镜像里面正常运行,我们推荐使用 `nvidia-docker <https://github.com/NVIDIA/nvidia-docker>`_ 来运行镜像。请先安装nvidia-docker,之后请运行:
::
nvidia-docker run -d -p 8888:8888 paddlepaddle/book:latest-gpu
或者使用国内的镜像请运行:
::
nvidia-docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book:latest-gpu
还需要将以下代码
.. code-block:: python
use_cuda = False
改成:
.. code-block:: python
use_cuda = True
贡献新章节
=============
您要是能贡献新的章节那就太好了!请发Pull Requests把您写的章节加入到 :code:`pending` 下面的一个子目录里。当这一章稳定下来,我们一起把您的目录挪到根目录。
为了写作、运行、调试,您需要安装Python 2.x和Go >1.5, 并可以用 `脚本程序 <https://github.com/PaddlePaddle/book/blob/develop/.tools/convert-markdown-into-ipynb-and-test.sh>`_ 来生成新的Docker image。
**Please Note:** We also provide an `English Readme <https://github.com/PaddlePaddle/book/blob/develop/README.md>`_ for the PaddlePaddle book.
......@@ -2,15 +2,6 @@
Simple Case
############################
This section collects 4 documents, arranged from the simplest to the most challenging, which will guide you through the basic deep learning tasks in PaddlePaddle.
The documentation in this chapter covers many deep learning basics and how to implement them with PaddlePaddle. See the instructions below for how to use it:
Overview
======================
The book you are reading is an "interactive" e-book - each chapter can be run in a Jupyter Notebook.
.. toctree::
:titlesonly:
......@@ -20,77 +11,3 @@ The book you are reading is an "interactive" e-book - each chapter can be run in
image_classification/README.md
word2vec/README.md
We packaged Jupyter, PaddlePaddle, and all the required dependencies into a Docker image, so you do not need to install them yourself; just install Docker. For various Linux distributions, please refer to https://www.docker.com . If you use Docker on `Windows <https://www.docker.com/docker-windows>`_ or `Mac <https://www.docker.com/docker-mac>`_ , consider `allocating more memory and CPU resources to Docker <http://stackoverflow.com/a/39720010/724872>`_ .
Instructions
======================
This book assumes you are performing CPU training by default. If you want to use GPU training, the steps will vary slightly. Please refer to "GPU Training" below.
CPU training
>>>>>>>>>>>>
Just run these in shell:
.. code-block:: shell
docker run -d -p 8888:8888 paddlepaddle/book
It downloads the Docker image for running the book from DockerHub.com.
To read and edit this book on-line, please visit http://localhost:8888 in your browser.
If your connection to DockerHub.com is slow or unstable, try our mirror at docker.paddlepaddlehub.com:
::
docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book
GPU training
>>>>>>>>>>>>>
To ensure that the GPU driver works properly in the image, we recommend running the image with `nvidia docker <https://github.com/NVIDIA/nvidia-docker>`_ . Please install nvidia-docker first, then run:
::
nvidia-docker run -d -p 8888:8888 paddlepaddle/book:latest-gpu
Or, to use the image mirror hosted in China, run:
::
nvidia-docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book:latest-gpu
You also need to change the following code
.. code-block:: python
use_cuda = False
to:
.. code-block:: python
use_cuda = True
Contribute to Book
===================
We highly appreciate your original contributions of new chapters to the Book! Just open a Pull Request that adds your chapter as a sub-directory under :code:`pending` . Once the chapter becomes stable, we will gladly move it to the root directory.
For writing, running and debugging, you need Python 2.x and Go >1.5, and you can use this `script <https://github.com/PaddlePaddle/book/blob/develop/.tools/convert-markdown-into-ipynb-and-test.sh>`_ to generate a new Docker image.
**Please Note:** We also provide an `English Readme <https://github.com/PaddlePaddle/book/blob/develop/README.md>`_ for the PaddlePaddle book.
.. ctr:
.. role:: raw-html-m2r(raw)
:format: html
ELASTIC CTR
......@@ -14,7 +14,8 @@ ELASTIC CTR
* `4. 查看结果 <#head4>`_
* `5. 二次开发指南 <#head5>`_
`<span id='head_1'>1. 总体概览</span>`
:raw-html-m2r:`<span id='head_1'>1. 总体概览</span>`
-------------
本项目提供了端到端的CTR训练和二次开发的解决方案,主要特点:
......@@ -65,12 +66,13 @@ ELASTIC CTR
**第5节 二次开发** 提出本一键部署方案可定制改善的部分,给出具体修改位置等
`<span id='head2'>2. 前置需求</span>`
:raw-html-m2r:`<span id='head2'>2. 前置需求</span>`
------------
运行本方案前,需要用户已经搭建好k8s集群,并安装好volcano组件。k8s环境部署比较复杂,本文不展开介绍;百度智能云CCE容器引擎申请后即可使用,下文仅以在百度云上创建k8s集群为例。
2.1 创建k8s集群
---------------
^^^^^^^^^^^^
请参考
`百度智能云CCE容器引擎帮助文档-创建集群 <https://cloud.baidu.com/doc/CCE/GettingStarted/24.5C.E5.88.9B.E5.BB.BA.E9.9B.86.E7.BE.A4.html#.E6.93.8D.E4.BD.9C.E6.AD.A5.E9.AA.A4>`_\ ,在百度智能云上建立一个集群,节点配置需要满足如下要求
......@@ -89,7 +91,7 @@ ELASTIC CTR
创建完成后,即可参考\ `百度智能云CCE容器引擎帮助文档-查看集群 <https://cloud.baidu.com/doc/CCE/GettingStarted.html#.E6.9F.A5.E7.9C.8B.E9.9B.86.E7.BE.A4>`_\ ,查看刚刚申请的集群信息。
2.2 如何操作集群
----------------
^^^^^^^^^^^^^
集群的操作可以通过百度云web或者通过kubectl工具进行,推荐用kubectl工具。
......@@ -119,7 +121,7 @@ ELASTIC CTR
* 关于kubectl的其他信息,可以参考\ `Overview of kubectl <https://kubernetes.io/docs/reference/kubectl/overview/>`_\ 。
2.3 设置访问权限
----------------
^^^^^^^^^^
建立分布式任务需要pod间有API互相访问的权限,可以按如下步骤
......@@ -130,7 +132,7 @@ ELASTIC CTR
注意:--namespace 指定的 default 为创建集群时的名称
2.4 安装Volcano
---------------
^^^^^^^^^^
我们使用volcano作为训练阶段的批量任务管理工具。关于volcano的详细信息,请参考\ `官方网站 <https://volcano.sh/>`_\ 的Documentation。
......@@ -146,15 +148,16 @@ ELASTIC CTR
:alt: image
3.`<span id='head3'>分布式训练+Serving方案一键部署</span>`
3. :raw-html-m2r:`<span id='head3'>分布式训练+Serving方案一键部署</span>`
---------------------------------
3.1 下载部署方案脚本文件
------------------------
^^^^^^^^^^^^
请将\ `本方案所需所有脚本文件 <https://github.com/PaddlePaddle/edl/tree/develop/example/ctr/script>`_\ 下载到本地
3.2 一键部署
------------
^^^^^^^^^^^
执行以下脚本,一键将所有组件部署到k8s集群。
......@@ -169,7 +172,7 @@ ELASTIC CTR
**注**\ :以下\ **3.3-3.8节所述内容已经在一键部署脚本中包含,无需手动执行**\ 。但为方便理解,将该脚本的每一步执行过程给出说明。
3.3 选择一个node作为输出节点
----------------------------
^^^^^^^^^^^^^^^^
.. code-block:: bash
......@@ -178,7 +181,7 @@ ELASTIC CTR
这条命令的作用是给该node打上标记,之后的文件服务和模型产出都会被强制调度到这个node上;使用时只需将 \$NODE_NAME 替换为该节点NAME列的字符串即可。
3.4 启动文件服务器
------------------
^^^^^^^^^^^^^^
.. code-block:: bash
......@@ -209,7 +212,7 @@ ELASTIC CTR
3.5 启动Cube稀疏参数服务器
--------------------------
^^^^^^^^^^^^^^^^
.. code-block:: bash
......@@ -230,7 +233,7 @@ ELASTIC CTR
**注**\ :分片数量可根据稀疏字典大小灵活修改,参考5.3节。
3.6 启动Paddle Serving
----------------------
^^^^^^^^^^^^^^^
.. code-block:: bash
......@@ -259,7 +262,7 @@ ELASTIC CTR
3.7 启动Cube稀疏参数服务器配送工具
----------------------------------
^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash
......@@ -288,7 +291,7 @@ ELASTIC CTR
3.8 执行Paddle CTR分布式训练
----------------------------
^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash
......@@ -308,10 +311,11 @@ ELASTIC CTR
:alt: image
4. `<span id='head4'>`\ 查看结果\ :raw-html-m2r:`<span>`
4. :raw-html-m2r:`<span id='head4'>`\ 查看结果\ :raw-html-m2r:`<span>`
-------------------------------------------
4.1 查看训练日志
----------------
^^^^^^^^^^^^^
百度云容器引擎CCE提供了web操作台方便查看pod的运行状态。
......@@ -334,7 +338,7 @@ pserver日志示例:
4.2 验证Paddle Serving预测结果
------------------------------
^^^^^^^^^^^^^^^^^^^
执行
......@@ -360,10 +364,11 @@ pserver日志示例:
:alt: image
5. `<span id='head5'>二次开发指南</span>`
5. :raw-html-m2r:`<span id='head5'>二次开发指南</span>`
-----------------------------
5.1 指定数据集的输入和读取方式
------------------------------
^^^^^^^^^^^^^^^^^^^
现有的数据通过edldemo镜像中的/workspace/ctr/data/download.sh脚本下载,下载之后会解压到/workspace/ctr/data/raw文件夹中,包含train.txt和test.txt。所有数据的每一行由空格分隔的40个属性组成。
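下面给出一个读取上述数据格式的最小示意代码(非原始训练脚本;字段划分为假设:第1列为label,随后13列为整数特征,最后26列为类别特征,与Criteo数据集的常见布局一致):
.. code-block:: python

    # 示意:解析一行由空格分隔的40个字段
    # 假设:第1列为label,第2~14列为整数(dense)特征,第15~40列为类别(sparse)特征
    def parse_line(line):
        fields = line.strip().split(' ')
        assert len(fields) == 40, "每行应包含40个字段"
        label = int(fields[0])
        dense_feature = fields[1:14]    # 13个整数特征
        sparse_feature = fields[14:40]  # 26个类别特征
        return label, dense_feature, sparse_feature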
......@@ -398,7 +403,7 @@ pserver日志示例:
推荐使用百度云提供的镜像仓库,这里是说明文档\ `推送镜像到镜像仓库 <https://cloud.baidu.com/doc/CCE/s/Yjxppt74z/#%E6%8E%A8%E9%80%81%E9%95%9C%E5%83%8F%E5%88%B0%E9%95%9C%E5%83%8F%E4%BB%93%E5%BA%93>`_\
5.2 指定训练规模
----------------
^^^^^^^^^^^^^^
在ctr.yaml文件中可以看到,训练任务是在volcano框架下定义的Job。Job里面给出了Pserver和Trainer的定义,整个Job还定义了MinAvailable数量。Pserver和Trainer各自有自己的Replicas,环境变量中有PSERVER_NUM、TRAINER_MODEL和TRAINER_NUM。通常MinAvailable = PServer Num + Trainer Num,这样我们就可以启动相应的服务。
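下面用一段简单的示意代码说明上述MinAvailable的计算规则(数值仅为假设,请以实际ctr.yaml中的配置为准):
.. code-block:: python

    # 示意:MinAvailable 应等于 PServer 副本数与 Trainer 副本数之和
    pserver_num = 2   # 假设 PSERVER_NUM 为 2
    trainer_num = 2   # 假设 TRAINER_NUM 为 2
    min_available = pserver_num + trainer_num
    print("minAvailable 应设置为:", min_available)  # 输出 4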
......@@ -427,7 +432,7 @@ pserver日志示例:
如上图所示
5.3 指定Cube参数服务器的分片数量和副本数量
------------------------------------------
^^^^^^^^^^^^^^^^^^^^
在cube.yaml文件当中,我们可以看到每一个Cube的节点的定义,有一个\ ``cube server pod``\ 和\ ``cube server service``\ 。如果我们需要增加cube的副本数和分片数,只需要在yaml文件中复制相关的定义和环境变量即可。
......@@ -446,7 +451,7 @@ pserver日志示例:
以上两个图片,一个是对Cube POD的定义,一个是对CubeSERVICE的定义。如果需要扩展Cube分片数量,可以复制POD和SERVICE的定义,并重命名它们。示例程序给出的是2个分片,复制之后第3个可以命名为cube-2。
5.4 Serving适配新的模型
-----------------------
^^^^^^^^^^^^^^^^^^^
在本示例中,我们如果按照5.1节的方式,修改了CTR模型训练脚本的feed数据格式,就需要相应修改Serving的代码,以适应新的feed样例字段数量和数据类型。
......@@ -462,7 +467,7 @@ pserver日志示例:
注释
----------
注1. :raw-html-m2r:`<span id='annotation_1'>Cube和Redis性能对比测试环境</span>`
-----------------------------------------------------------------------------------
......@@ -580,4 +585,4 @@ client端为基于\ `redisplusplus <https://github.com/sewenew/redis-plus-plus>`
在扩展性方面,Redis受制于单线程模型,随并发数增加,响应时间加倍增加,而总吞吐在1000qps左右即不再上涨;而Cube则随着压测并发数增加,总的qps一直上涨,说明Cube能够较好处理并发请求,具有良好的扩展能力。
RocksDB在线程数较少的时候,平均响应时间和qps慢于Redis,但是在16以及更多线程的测试当中,RocksDB提供了更快的响应时间和更大的qps。
RocksDB在线程数较少的时候,平均响应时间和qps慢于Redis,但是在16以及更多线程的测试当中,RocksDB提供了更快的响应时间和更大的qps。
\ No newline at end of file
......@@ -2,12 +2,6 @@
工具组件
################
本章由1篇文档组成,将指导您如何使用PaddlePaddle工具组件完成深度学习任务。
本章文档涉及了大量深度学习基础知识,也介绍了如何使用PaddlePaddle实现这些内容,请参阅以下说明了解如何使用:
.. toctree::
:titlesonly:
......
############################
Basic Deep Learning Models
############################
This section collects 8 documents, arranged from the simplest to the most challenging, which will guide you through the basic deep learning tasks in PaddlePaddle.
The documentation in this chapter covers many deep learning basics and how to implement them with PaddlePaddle. See the instructions below for how to use it:
Overview
======================
The book you are reading is an "interactive" e-book - each chapter can be run in a Jupyter Notebook.
.. toctree::
:titlesonly:
fit_a_line/README.md
recognize_digits/README.md
image_classification/index_en.md
word2vec/index_en.md
recommender_system/index_en.md
understand_sentiment/index_en.md
label_semantic_roles/index_en.md
machine_translation/index_en.md
We packaged Jupyter, PaddlePaddle, and all the required dependencies into a Docker image, so you do not need to install them yourself; just install Docker. For various Linux distributions, please refer to https://www.docker.com . If you use Docker on `Windows <https://www.docker.com/docker-windows>`_ or `Mac <https://www.docker.com/docker-mac>`_ , consider `allocating more memory and CPU resources to Docker <http://stackoverflow.com/a/39720010/724872>`_ .
Instructions
======================
This book assumes you are performing CPU training by default. If you want to use GPU training, the steps will vary slightly. Please refer to "GPU Training" below.
CPU training
>>>>>>>>>>>>
Just run these in shell:
.. code-block:: shell
docker run -d -p 8888:8888 paddlepaddle/book
It downloads the Docker image for running the book from DockerHub.com.
To read and edit this book on-line, please visit http://localhost:8888 in your browser.
If your connection to DockerHub.com is slow or unstable, try our mirror at docker.paddlepaddlehub.com:
::
docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book
GPU training
>>>>>>>>>>>>>
To ensure that the GPU driver works properly in the image, we recommend running the image with `nvidia docker <https://github.com/NVIDIA/nvidia-docker>`_ . Please install nvidia-docker first, then run:
::
nvidia-docker run -d -p 8888:8888 paddlepaddle/book:latest-gpu
Or, to use the image mirror hosted in China, run:
::
nvidia-docker run -d -p 8888:8888 docker.paddlepaddlehub.com/book:latest-gpu
You also need to change the following code
.. code-block:: python
use_cuda = False
to:
.. code-block:: python
use_cuda = True
Contribute to Book
===================
We highly appreciate your original contributions of new chapters to the Book! Just open a Pull Request that adds your chapter as a sub-directory under :code:`pending` . Once the chapter becomes stable, we will gladly move it to the root directory.
For writing, running and debugging, you need Python 2.x and Go >1.5, and you can use this `script <https://github.com/PaddlePaddle/book/blob/develop/.tools/convert-markdown-into-ipynb-and-test.sh>`_ to generate a new Docker image.
**Please Note:** We also provide an `English Readme <https://github.com/PaddlePaddle/book/blob/develop/README.md>`_ for the PaddlePaddle book.