Unverified commit 9f4a4002, authored by adaxiadaxi, committed by GitHub

update_fluiddoc_0303,test=develop (#1872)

Parent fcbf40fa
...@@ -28,5 +28,5 @@
    performance_improving/index_cn.rst
    evaluation_debugging/index_cn.rst
    addon_development/index_cn.rst
    flags/flags_cn.rst
...@@ -20,7 +20,7 @@ So far you have already been familiar with PaddlePaddle. And the next expectatio
- `Addon Development <addon_development/index_en.html>`_ : How to contribute codes and documentation to our communities
- `FLAGS <flags/flags_en.html>`_

.. toctree::
...@@ -32,6 +32,6 @@ So far you have already been familiar with PaddlePaddle. And the next expectatio
    performance_improving/index_en.rst
    evaluation_debugging/index_en.rst
    addon_development/index_en.rst
    flags/flags_en.rst
...@@ -118,7 +118,7 @@ NVIDIA Jetson is an embedded AI platform launched by NVIDIA; Paddle Inference supports running on NVI
    make inference_lib_dist -j4
3. Test with samples
   Please refer to the official sample: https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/performance_improving/inference_improving/paddle_tensorrt_infer.html#id2
**FAQ**
......
...@@ -121,7 +121,7 @@ NVIDIA Jetson is an AI computing platform in embedded systems introduced by NVID
    make inference_lib_dist -j4
3. Test with samples
   Please refer to the samples on https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/performance_improving/inference_improving/paddle_tensorrt_infer.html#id2
**FAQ**
......
# Release Notes

## Important Updates

This release focuses on enhancing framework functionality. Inference deployment capabilities are comprehensively improved; distributed training releases PLSC to support super-large-scale classification, and the parameter server mode is optimized and consolidated. Compilation options, compilation dependencies, and the code base are thoroughly cleaned up and optimized. The model library continues to improve, with an optimized overall structure and new dynamic graph implementations. End-to-end development kits and utility components are further refined.

**Training Framework**: Adds the AMP (automatic mixed precision) training interface and new control flow interfaces; optimizes tensor usage and the GPU memory allocation strategy; adds support for the Nvidia DALI GPU data preprocessing library; continues to optimize the functionality and performance of basic OPs; further improves dynamic graph functionality with large performance gains, and adds conversion of data-independent dynamic graph models into static graph models deployable for inference; comprehensively improves framework debugging, profiling, and usability.

**Inference Deployment**: Greatly optimizes the Python API of the server-side inference library; adds usage instructions and examples for calling the inference library from R and Go; strengthens quantization support. Paddle Lite supports models generated by the post-training quantization method without calibration data, strengthens OpenCL support, and supports inference on Kunlun XPU. The model compression library PaddleSlim refactors its pruning, quantization, distillation, and search interfaces, fully integrates with the model library, and adds Pantheon, a large-scale scalable knowledge distillation framework.

**Distributed Training**: In parameter server mode, the transpiler's semi-asynchronous, fully asynchronous, and GEO modes are unified into the communicator on the back end and into fleet on the front end, with flexible mode selection through the fleet strategy. Releases PLSC, a large-scale classification library that supports classification tasks with an extremely large number of classes via model parallelism.

**Basic Model Library**: Releases Parakeet, a speech synthesis library including several leading-edge synthesis algorithms; PaddleCV adds 14 image classification pre-training models and continues to enrich the 3D and tracking models; the word segmentation and part-of-speech tagging models of PaddleNLP support jieba; PaddleRec adds the multi-task model MMoE. The model library adds a wide range of dynamic graph implementations, and its overall structure is adjusted and optimized.

**End-to-End Development Kits**: PaddleDetection and PaddleSeg add a large number of model implementations and pre-training models; the training speed and accuracy of typical models are improved; model compression and deployment capabilities are greatly enhanced; the user experience is comprehensively optimized. Releases ElasticRec, a recommendation ranking system deployed via K8S that supports streaming training and online inference services.

**Utility Components**: PaddleHub adds 52 pre-training models, bringing the total to over 100, with continued improvements in functionality and experience; the multi-task learning framework PALM upgrades its kernel, opens API calls, and supports more task types; federated learning PaddleFL adds public datasets. The deep reinforcement learning framework PARL and the PaddlePaddle graph learning framework PGL also release upgraded versions, supporting more functions and opening more algorithms and baselines.

## Training Framework

- API
  - Adds the AMP (automatic mixed precision) training interface, which converts a network into mixed precision training in a general way while keeping the accuracy fluctuation within the normal range
  - Adds and recommends new control flow interfaces: four control flow OPs, while_loop (loop control), cond (conditional branch), and case and switch_case (branch control), which are easier to use and support the following new features:
...@@ -29,7 +18,7 @@ Release Notes
    - Supports using CPU data or GPU data in the condition part of control flow
  - Some API arguments support variable lists: for APIs whose parameter_list or no_grad_set argument previously accepted only a list of strings, adds support for lists of variables, so the name attribute of the relevant variables no longer needs to be fetched in advance when using the following APIs:
    - fluid.backward.append_backward(loss, parameter_list=None, no_grad_set=None, callbacks=None)
    - fluid.backward.gradients(targets, inputs, target_gradients=None, no_grad_set=None)
    - The minimize method of each Optimizer, e.g. Adam's minimize: minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)
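As a rough illustration of the `while_loop` calling convention described above, the following pure-Python sketch mimics the interface in which both the condition and the body are callables over the loop variables. This is not PaddlePaddle code; the `while_loop` helper defined here is a hypothetical stand-in for the real OP:

```python
def while_loop(cond, body, loop_vars):
    """Minimal pure-Python mock of a while_loop-style control flow OP:
    repeatedly applies `body` while `cond` holds over the loop variables."""
    while cond(*loop_vars):
        loop_vars = body(*loop_vars)
    return loop_vars

# Sum the integers 0..9 with the loop expressed as two callables.
i, total = while_loop(
    cond=lambda i, total: i < 10,
    body=lambda i, total: (i + 1, total + i),
    loop_vars=(0, 0),
)
print(i, total)  # 10 45
```

In the real API the loop variables are Tensors and the loop is captured into the static graph, but the functional shape of the call is the same.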
- Basic functionality optimization
  - Supports setting Tensor data with numpy's float16 type directly, without first converting to uint16.
...@@ -41,7 +30,7 @@ Release Notes
  - elu: this activation function supports computing second-order gradients.
  - prroi_pool: the rois argument accepts either Tensor or LoDTensor.
  - conv2d, pool2d, batch_norm, lrn: backward computation fully supports the MKL-DNN high-performance library.
  - argsort: supports descending sort (new descending argument, default False).
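The semantics of the new `descending` flag can be illustrated with NumPy (a stand-in here, not the fluid API itself): a descending argsort is equivalent to an ascending argsort of the negated values.

```python
import numpy as np

x = np.array([3.0, 1.0, 2.0])

# Ascending argsort (the default behavior, descending=False).
asc = np.argsort(x)    # indices [1, 2, 0] -> values [1.0, 2.0, 3.0]

# Descending argsort, as enabled by the new flag: argsort the negated values.
desc = np.argsort(-x)  # indices [0, 2, 1] -> values [3.0, 2.0, 1.0]

print(asc.tolist(), desc.tolist())
```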
- Basic performance optimization
  - DALI preprocessing acceleration
    - Adds support for the Nvidia DALI GPU data preprocessing library, which can be used to accelerate preprocessing of image, video, and speech data.
...@@ -53,7 +42,7 @@ Release Notes
  - Optimizes RecomputeOptimizer to increase batch size; on the Bert-large model, the maximum batch size is 533.62% larger than without RecomputeOptimizer, double that of the previous version.
  - OP performance optimization
    - Implements fuse_emb_seq_pool, a fused operator of embedding and sequence_pool, and optimizes murmurhash3_x64_128 in bloom_filter, effectively improving the training speed of some NLP models.
    - Optimizes the GPU performance of the mean op; for an input Tensor of shape [32, 32, 8, 8], forward computation is 2.7x faster.
    - Optimizes the assign and lod_reset ops to avoid unnecessary GPU memory copies and data transforms.
    - Optimizes the kernel implementation of the stack OP; single-GPU performance of the XLNet/Ernie models improves by 4.1%.
- Dynamic graph
...@@ -70,6 +59,7 @@ Release Notes
  - Optimizes the Python/C++ interaction, including GradMaker, OperatorBase, and the allocator. The LSTM-based language model task improves performance by 270% on P40 machines.
  - Removes redundant code that caused repeated useless calls to optimized_guard in optimizers. For the Transformer model (batch_size=64) on P40 machines, optimizers such as SGD and Adam gain 5%~8% in performance.
  - To reduce the performance impact of the extra scale_op added in AdamOptimizer to update the beta parameters, fuses the beta update logic into adam_op, reducing op kernel call overhead. The Dialogue-PLATO model improves performance by 9.67% on P40 machines.
  - Optimizes the asynchronous DataLoader of dynamic graph; single-card training speed of CV model tasks such as Mnist and ResNet improves by over 40% on P40 machines.
  - Adds a numpy bridge that shares the underlying data between Tensor and ndarray in CPU mode, avoiding the copy of numpy inputs when creating Variables and improving efficiency.
  - GPU memory optimization: a strategy that releases in advance the forward variable space whose Tensor Buffer is not needed by the backward pass, increasing the maximum batch size by more than 20%-30% on models such as ResNet.
...@@ -90,7 +80,6 @@ Release Notes
  - Reduces the installation requirements dependencies from 15 to 7.

## Inference Deployment

- Server-side inference library
  - Python API
    - Supports reading and writing models from memory to meet model encryption needs.
...@@ -98,7 +87,7 @@ Release Notes
    - Adds ZeroCopy APIs, largely consistent with the C++ interface, supporting numpy.ndarray as input and output for more convenient use on the Python side.
    - Adds several interfaces in AnalysisConfig to fully cover the C++ inference functionality, including deleting passes and disabling inference glog.
  - Support for other programming languages
    - Adds inference APIs for R and Go, along with usage instructions and examples
  - Externally provides the header files corresponding to ProtoBuf, making it convenient for users to parse the model structure.
  - The inference library built with TRT no longer ships the TensorRT library in third_party; users need to download it from https://developer.nvidia.com/tensorrt
  - Feature enhancements:
...@@ -106,18 +95,18 @@ Release Notes
    - Adds support for the MKL-DNN FC INT8 kernel
    - Paddle-TensorRT supports the Ernie model; fp16 inference of Ernie (seq length=128) takes 3.6 ms on a T4 card, 37% faster than fp32.
    - Quantization: ERNIE INT8 accuracy drops slightly compared with FP32, but single-threaded performance improves 2.70x and multi-threaded performance 1.79x on the second-generation Xeon Scalable Platform 6271.
- Mobile/embedded [Paddle Lite](https://github.com/PaddlePaddle/Paddle-Lite)
  - Releases the corresponding v2.3 version.
  - Upgrades multiple features of model_optimize_tool.
  - Supports post-training quantization without calibration data, reducing model storage by 2~4x.
  - OpenCL: completes migration of 30 Image2D kernels, covering 14 OPs.
  - Further strengthens support for FPGA and NPU; supports inference on Kunlun XPU.
  - Releases brand-new official website documentation; adds usage documentation for post-training quantization without calibration data.
- [Paddle Serving](https://github.com/PaddlePaddle/Serving)
  - Releases a remote text vector representation inference service for bert-type semantic understanding models.
  - Releases the paddle-gpu-serving whl package; inference services can be deployed and used via pip install and Python code;
  - Supports 13 semantic understanding models from PaddleHub and single-machine multi-card serving; with the Ernie_tiny model on a single P4 GPU, inference speed reaches 869.56 samples per second at an average sample length of 7.
- [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim)
  - Splits PaddleSlim into an independent repo.
  - Refactors the pruning, quantization, distillation, and search interfaces, opening low-level interfaces to users.
  - Quantization:
...@@ -131,7 +120,7 @@ Release Notes
  - Adds a one-shot search algorithm, 20x faster than the previous version.
  - Adds Pantheon, a large-scale scalable knowledge distillation framework
    - Fully decouples student and teacher models, and teacher models from each other; they can run independently on different physical devices, making full use of computing resources;
    - Supports single-node multi-device large-scale inference of teacher models; the tested speedup is linear on complex models such as BERT;
    - Implements communication for the online distillation mode over TCP/IP, supporting knowledge transfer between teacher and student models running on any two physical devices in the same network environment;
    - Unifies the API of the online and offline distillation modes; different teacher models can work in different modes;
    - Automatically merges knowledge and re-batches knowledge data on the student side, facilitating knowledge fusion from multiple teacher models.
...@@ -143,7 +132,6 @@ Release Notes
  - Supplements the API documentation; adds beginner and advanced tutorials; adds ModelZoo documentation covering classification, detection, and segmentation tasks. All documentation is available in Chinese and English.

## Distributed Training

- Parameter server mode:
  - Greatly reduces memory usage during training; on a 100-million-scale embedding task, Trainer-side memory can drop by 90%
  - Greatly reduces the memory usage of distributed model saving and loading; the peak Pserver-side memory can drop to 1/N of the original, where N is the number of Pserver nodes.
...@@ -156,103 +144,98 @@ Release Notes
  - Fleet adds DistributedStrategy to further improve distributed usability and consolidate the current distributed-related FLAGS
  - The Fleet pslib mode supports training multiple losses in one program, optimizing training performance
  - The hundred-billion-scale sparse mode supports k8s environments.
- [Large-scale classification library PLSC](https://github.com/PaddlePaddle/PLSC): supports large-scale classification problems that data parallelism cannot handle due to GPU memory limits
  - Builds in three models, ResNet50, ResNet101, and ResNet152, and supports custom models; with 8 V100 GPUs on a single machine, the ResNet50 model trains a million-class task at 2,122.56 images/s, 1.3x faster than the standard ResNet50;
  - Releases the plsc-serving whl package for online model inference, predicting the image semantic vector representation of face recognition models and supporting inference with user-trained models. The ResNet50 model (batch size=256) runs inference at 523.47 images/s on a single V100 GPU;
  - Releases a pre-trained model based on the ResNet50 network and the MS1M-ArcFace dataset: https://plsc.bj.bcebos.com/pretrained_model/resnet50_distarcface_ms1mv2.tar.gz.
  - Releases ResNet50 mixed precision training benchmarks (single-card, multi-card, and multi-machine).
## Basic Model Library

- [Model library repo](https://github.com/PaddlePaddle/models)
- PaddleNLP
  - seq2seq supports training modes such as RL and GAN
  - Releases word segmentation and part-of-speech tagging models; using the Pantheon knowledge distillation framework, the F1 score on in-house datasets improves by 1% over LAC in PaddleNLP; merges jieba word segmentation, whose deep learning model mode is enabled with the use_paddle flag; adds a paddle version check and fallback mechanism in jieba to protect the user experience.
  - Adds dynamic graph model implementations: word2vec, senta, transformer, bert, seq2seq, and LAC.
- PaddleSpeech
  - Releases Parakeet (Paddle PARAllel text-to-speech toolkit), a speech synthesis library
    - Implements a standard workflow for data preprocessing, training, and synthesis of speech synthesis models
    - Provides out-of-the-box preprocessing implementations for common datasets
    - Provides model components commonly used in the speech synthesis field to support model implementation
    - Releases the speech synthesis models DeepVoice3, ClariNet, TransformerTTS, FastSpeech, WaveNet, and WaveFlow
- PaddleCV
  - Image classification:
    - Adds 14 pre-training models in the SENet-vd, Res2Net, and HRNet series:
      - SE_ResNet18_vd, SE_ResNet34_vd, SE_ResNeXt50_vd_32x4d, ResNeXt152_vd_32x4d
      - Res2Net50_26w_4s, Res2Net50_14w_8s, Res2Net50_vd_26w_4s
      - HRNet_W18_C, HRNet_W30_C, HRNet_W32_C, HRNet_W40_C, HRNet_W44_C, HRNet_W48_C, HRNet_W64_C
    - Supports DALI-accelerated data preprocessing, achieving speedups of 1.5x (ResNet50) to more than 3x (ShuffleNet) in ImageNet training and greatly improving GPU utilization.
  - 3D:
    - Releases the models PointNet++ and PointRCNN.
  - Tracking model library:
    - Releases the models SiamFC and ATOM.
  - Adds dynamic graph model implementations: MobileNet-v1/v2, YOLOv3, FasterRCNN, MaskRCNN, the TSM video classification model, and the BMN video action localization model.
- PaddleRec
  - Releases MMoE, a multi-task model for the recommendation field, suitable for large-scale multi-task joint training in industry.
  - Adds dynamic graph model implementations: gru4rec, deepfm.
## End-to-End Development Kits

- [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection)
  - Further improves the accuracy of the YOLOv3 model, reaching 43.2% on COCO data, an absolute improvement of 1.4% over the previous version.
  - New model implementations and pre-training models:
    - Adds CascadeCARCNN-FPN-Dcnv2-Nonlocal ResNet200-vd, the best single model in the Google AI Open Images 2019-Object Detection competition, and releases a pre-trained model of this algorithm based on Objects365 data.
    - Adds series of pre-training models with CBResNet, Res2Net, and HRNet backbones.
    - Adds the LibraRCNN algorithm and pre-training models.
    - Adds pre-training models for FasterRCNN R50 FPN based on GIoU, DIoU, and CIoU losses; without reducing inference speed, accuracy on COCO data improves by 1.1%, 0.9%, and 1.3% respectively.
  - New modules:
    - Backbone networks: adds CBResNet, Res2Net, and HRNet.
    - Loss modules: adds GIoU loss, DIoU loss, CIoU loss, and Libra loss; the YOLOv3 loss supports fine-grained op composition.
    - Post-processing modules: adds softnms and DIOU nms.
    - Regularization modules: adds DropBlock.
  - Feature optimization and improvements:
    - Accelerates YOLOv3 data preprocessing, speeding up overall training by 40%.
    - Optimizes the data preprocessing logic.
    - Adds face detection inference benchmark data.
    - Adds an inference example under the Python API of the Paddle inference library.
  - Detection model compression:
    - Pruning: releases the MobileNet-YOLOv3 pruning scheme and model, with FLOPs - 69.6% and mAP + 1.4% on the VOC dataset, and FLOPs - 28.8% and mAP + 0.9% on the COCO dataset; releases the ResNet50vd-dcn-YOLOv3 pruning scheme and model, with FLOPs - 18.4% and mAP + 0.8% on the COCO dataset.
    - Distillation: releases the MobileNet-YOLOv3 distillation scheme and model, with mAP + 2.8% on VOC data and mAP + 2.1% on COCO data.
    - Quantization: releases quantized YOLOv3 and BlazeFace models.
    - Pruning + distillation: releases the MobileNet-YOLOv3 pruning + distillation scheme and model, with FLOPs - 69.6% on the COCO dataset, 64.5% faster GPU inference, and mAP - 0.3%; releases the ResNet50vd-dcn-YOLOv3 pruning + distillation scheme and model, with FLOPs - 43.7% on COCO data, 24.0% faster GPU inference, and mAP + 0.6%.
    - Search: open-sources the complete BlazeFace-NAS search scheme.
  - Inference deployment:
    - Adapts the Paddle inference library's support for TensorRT and for FP16 precision.
  - Documentation:
    - Adds documentation introducing the data preprocessing module and implementing a custom data Reader.
    - Adds documentation on how to add a new algorithm model.
    - Deploys documentation to the website: https://paddledetection.readthedocs.io/zh/latest/
- [PaddleSeg](https://github.com/PaddlePaddle/PaddleSeg)
  - New models
    - LaneNet model for lane segmentation scenarios.
    - Lightweight Fast-SCNN model.
    - HRNet semantic segmentation model for high-accuracy scenarios.
  - Releases multiple model compression schemes based on PaddleSlim:
    - Fast-SCNN pruning scheme and model based on Cityscapes.
    - Deeplabv3p-Xception and Deeplabv3p-MobilenetV2 distillation schemes based on Cityscapes.
    - Deeplabv3p-MobilenetV2 search scheme based on Cityscapes.
    - Deeplabv3p-Mobilenet quantization scheme and model based on Cityscapes.
  - Improved inference deployment capabilities
    - Adds lightweight Python deployment.
    - Adds TensorRT inference acceleration support for FP16 and Int8 quantized models.
    - Adds a Paddle-Lite mobile deployment tutorial and demo for DeepLabv3p-MobileNetV2 portrait segmentation.
    - Optimizes the model export step, supporting GPU-based image preprocessing and postprocessing, with a 10%~20% performance improvement.
    - Provides inference performance benchmarks of U-Net, ICNet, PSPNet, DeepLabv3+, and other models on images of different sizes, facilitating model selection based on performance.
  - Experience optimization
    - Adds a learning rate warmup feature that can be combined with different learning rate decay strategies, improving fine-tuning stability.
    - Supports saving annotation maps in pseudo-color image format, improving the preview experience of annotated images.
    - Adds automatic saving of the model with the best mIoU.
    - Comprehensively optimizes the documentation logic and provides AIStudio hands-on tutorials for industrial scenarios such as industrial quality inspection and fundus screening.
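The warmup-plus-decay idea behind the learning rate feature above can be sketched in plain Python. This is a generic illustration, not PaddleSeg's actual scheduler; the linear warmup and polynomial decay choices (and the `lr_at_step` helper) are assumptions for the example:

```python
def lr_at_step(step, base_lr=0.01, warmup_steps=100, total_steps=1000, power=0.9):
    """Linear warmup from 0 to base_lr, then polynomial decay toward 0."""
    if step < warmup_steps:
        # Warmup phase: ramp the learning rate up linearly.
        return base_lr * step / warmup_steps
    # Decay phase: polynomial decay over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * (1.0 - progress) ** power

print(lr_at_step(50))    # mid-warmup: 0.005
print(lr_at_step(100))   # warmup finished: 0.01
print(lr_at_step(1000))  # fully decayed: 0.0
```

Starting from a near-zero rate avoids large, destabilizing updates at the beginning of fine-tuning, which is why warmup composes well with any decay strategy.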
- [ElasticRec](https://github.com/PaddlePaddle/ElasticRec)
  - Releases the ElasticRec recommendation ranking system, deployed via K8S, supporting streaming training and online inference services.

## Utility Components

- [PaddleHub](https://github.com/PaddlePaddle/PaddleHub)
  - Rich pre-training models: adds 52 pre-training models, bringing the total to 100+:
    - Semantic models: adds 5 semantic models, including RoBERTa_wwm, BERT_wwm, and ERNIE-Tiny
    - Text classification: adds 3 porn and abuse detection models.
...@@ -264,58 +247,55 @@ Release Notes
  - Features:
    - Adds the Bert Service text vector representation service based on Paddle Serving.
    - Enhances Task flexibility; adds a Hook mechanism that supports loading user-defined code.
    - Adds colored Colorlog and fixes the duplicate log printing problem.
    - Optimizes the code structure; command line execution is 50% faster.
    - Refactors Dataset and Reader, reducing the code needed to adapt custom datasets by 60%.
    - Optimizes the AutoFinetune interface, supporting visualization of multiple experiments.
  - Experience optimization
    - Comprehensively optimizes the logic and adds rich AIStudio tutorial content.
    - Fully upgrades the official website landing page, providing quick online experience and tutorial guidance.
- Multi-task learning framework [PALM](https://github.com/PaddlePaddle/PALM)
  - Supports python3 and windows
  - Upgrades the framework kernel and the underlying multi-task mechanism, opening API calls
  - Flexible model saving mechanism, supporting single-task saving and whole-graph saving
  - Supports continuous training and continuous inference, with free switching of dataset files in a single run
  - Adds model customization features
  - Refactors the underlying multi-task kernel, fixing several bugs affecting generality and stability
  - Strengthens multi-task learning capability
    - Supports a different batch size and sequence length for each task in multi-task scenarios
    - Fixes the problem of inconsistent tasks across GPUs in multi-task multi-card training
    - Optimizes the multi-task scheduling and termination strategies, generally improving model generalization
  - Strengthens the functions and types of supported tasks
    - Enhances matching task support, adding pairwise learning and multi-class support (e.g. NLI sentence relation classification).
    - Enhances machine reading comprehension tasks, adding user-controllable preprocessing hyperparameters.
    - Adds support for sequence labeling tasks.
  - Strengthens large-scale training/inference capability
    - Adds automatic multi-card inference
    - Refactors the asynchronous reader, supporting variable-length padding in multi-card scenarios
  - Adds a pre-trained model management and download module
    - Supports management and download of pre-trained models such as BERT, ERNIE, and RoBERTa
    - Adds a RoBERTa Chinese pre-trained model
- Federated learning [PaddleFL](https://github.com/PaddlePaddle/PaddleFL)
  - Adds the scheduler and submitter: the scheduler controls whether a trainer participates in updates during training, and the submitter submits PaddleFL tasks on MPI clusters
  - Adds the LEAF dataset, a public federated learning dataset, together with an API for setting benchmarks. Supports classic datasets in image classification, sentiment analysis, character prediction, and other fields, such as MNIST and Sentiment140
  - Updates the existing examples for the new components and adds the femnist_demo and submitter_demo examples
  - Optimizes fl_distribute_transpiler, adding adam optimizer support to the FedAvg strategy;
  - Adds the SecAgg strategy (Secure Aggregation) for secure parameter aggregation;
- Deep reinforcement learning framework [PARL](https://github.com/PaddlePaddle/PARL)
  - Releases v1.3.
  - Adds support for Multi-Agent RL algorithms, including MADDPG.
  - Adds support for multi-card training and releases a multi-card DQN algorithm example.
  - Open-sources TD3 and SAC, SOTA algorithms in the continuous control field.
  - Open-sources the champion model implementation and training scheme of the NeurIPS2019 reinforcement learning challenge, releasing the trained models.
- PaddlePaddle graph learning framework [PGL](https://github.com/PaddlePaddle/PGL)
  - Releases v1.1:
    - Adds support for OGB, an authoritative graph learning benchmark, fully supporting the three task types of node property prediction, link prediction, and graph property prediction, and releases SOTA baselines.
    - Releases PGL-Rec, a graph recommendation solution, and PGL-KE, a knowledge graph embedding algorithm set.
    - Usability improvements: releases PGL high-level APIs.
    - Other upgrades: multi-process graph sampling optimization, accelerating GraphSAGE-type models by 3x; adds LoD-Tensor-based Graph Batch and Graph Pooling operators; the Model Zoo adds models including distributed heterogeneous graph algorithms, GraphZoom, and PinSage.
## Code Refactoring and Upgrades

- Compilation
  - Adds the WITH_NCCL compile option; single-card users can explicitly set WITH_NCCL=OFF to speed up compilation.
  - Adds the WITH_TP_CACHE compile option to cache third-party source code and avoid repeated downloads; Windows users can set it to ON to speed up compilation and improve its stability.
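For example, the two new options might be passed to CMake as follows (an illustrative invocation; the out-of-source `build` directory layout is an assumption, only the `WITH_NCCL` and `WITH_TP_CACHE` flags come from the notes above):

```shell
# From a build directory inside the Paddle source tree (assumed layout):
# skip NCCL for single-card builds, and cache third_party downloads.
cmake .. -DWITH_NCCL=OFF -DWITH_TP_CACHE=ON
make -j$(nproc)
```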
...@@ -334,7 +314,6 @@ Release Notes
  - Automatically generates a pybind function for each OP in dynamic graph mode for direct calls in layers, improving dynamic graph performance and reducing coupling with static graph.

## Bug Fixes

- Fixes the MKL-DNN error when running Faster-RCNN inference with the Python API based on PaddleDetection.
- Fixes a training crash in the GPU implementation of the sum op caused by some uninitialized Tensors.
- Fixes precision loss in fill_constant when value is set to a large integer.
......
# Release Notes

## Important Updates

This version focuses on enhancing the framework functions, including improving the inference deployment capability, releasing PLSC for super-large-scale classification training tasks, and optimizing the parameter server mode. In addition, the compilation options, compilation dependencies, and code library are fully cleaned up and optimized. The model library is improved by adjusting its structure and adding dynamic graph models. The development kits and utility components are upgraded.

### Training Framework

- Adds AMP (Automatic Mixed Precision) interfaces and control flow interfaces.
- Optimizes tensor usage and the GPU memory allocation strategy.
- Supports the Nvidia DALI GPU data preprocessing library.
- Optimizes the functions and performance of basic OPs.
- Enhances dynamic graph functionality, including performance improvements and new APIs that convert data-independent dynamic graph models into static graph models for inference deployment.
- Improves the user experience of the debugging functions.
### Inference Deployment

- Paddle Serving
  - Optimizes the Python API.
  - Supports APIs for new programming languages, such as R and Go.
  - Enhances the quantization capability.
- Paddle Lite
  - Supports deploying models generated by the post-training quantization method without calibration data.
  - Enhances the OpenCL capability.
  - Supports Kunlun XPU.
- PaddleSlim
  - Optimizes the pruning, quantization, distillation, and NAS (Network Architecture Search) APIs to adapt to the model library.
  - Supports Pantheon, a large-scale knowledge distillation framework.
### Distributed Training

- Unifies the implementation of the semi-asynchronous, fully asynchronous, and GEO modes in parameter server mode. The back end is unified into the communicator, and the front-end interface into fleet; different modes are selected by configuring the fleet strategy.
- Releases PLSC for super-large-scale classification training tasks.

### Model Construction

- Releases Parakeet, a text-to-speech model library that includes several leading-edge text-to-speech algorithms.
- Adds 14 image classification pre-training models in PaddleCV and enriches the 3D and tracking models.
- Supports jieba word segmentation in PaddleNLP.
...@@ -42,347 +36,309 @@ This version focuses on enhancement of the framework functions, includes improvi ...@@ -42,347 +36,309 @@ This version focuses on enhancement of the framework functions, includes improvi
- Adds more dynamic graph models. - Adds more dynamic graph models.
- Adjusts and optimizes the structure of model library. - Adjusts and optimizes the structure of model library.
**Development Kits**: ### Development Kits
- Optimizes PaddleDetection and PaddleSeg by adding a large number of models and pre-training models, enhancing the training speed and accuracy of typical models, and strengthening the model compression and deployment capabilities.
- Releases the recommended sorting system called ElasticRec, which can be deployed via K8S and supports streaming training and online forecast services.
### Utility Components
- Adds 52 pre-training models, enriching the models up to 100+, and improves the function experience.
- Upgrades the kernel of PALM, opens APIs, and supports more task types.
- Adds an open dataset in PaddleFL (federated learning framework).
- Upgrades the versions of PARL (deep reinforcement learning framework) and PGL (graph learning framework). Opens more algorithms and supports more functions.
## Training Framework
- API
- Adds AMP (Automatic Mixed Precision) APIs, which can convert a network training mode into mixed precision mode in a general way, while ensuring that the accuracy fluctuation stays within the normal range.
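A quick way to see why mixed precision needs the loss-scaling machinery that AMP APIs manage: tiny fp32 gradients underflow to zero in float16. A minimal numpy illustration of the idea (not the Paddle API itself):

```python
import numpy as np

# A gradient around 1e-8 underflows to zero when cast to float16:
tiny_grad = 1e-8
assert float(np.float16(tiny_grad)) == 0.0

# Scaling the loss (and hence every gradient) by a constant before the
# backward pass keeps the value representable in float16; it is then
# unscaled in float32 after the backward pass.
loss_scale = 1024.0
scaled = np.float16(tiny_grad * loss_scale)      # survives in float16
recovered = float(np.float32(scaled)) / loss_scale  # ~1e-8 again
```

The `loss_scale` value here is illustrative; AMP implementations typically adjust it dynamically.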
- Adds control flow OPs such as while_loop, cond, case and switch_case. The new APIs are easier to use and are recommended. The following functions are supported:
- Supports using Python callables as the control condition or executed objects.
- Supports using different losses or optimizers in different branches of the control flow.
- Supports using CPU data or GPU data in the condition of the control flow.
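The callable-based style of these OPs can be illustrated with a plain-Python analogue; the real `cond`/`while_loop` build the branch and loop into the static graph rather than executing eagerly, so this sketch only mirrors the calling convention:

```python
def cond(pred, true_fn, false_fn):
    # Only the selected branch callable is evaluated, as with the graph OP.
    return true_fn() if pred else false_fn()

def while_loop(cond_fn, body_fn, loop_vars):
    # Repeatedly apply body_fn while cond_fn holds, threading loop_vars through.
    while cond_fn(*loop_vars):
        loop_vars = body_fn(*loop_vars)
    return loop_vars

branch = cond(3 > 2, lambda: "a", lambda: "b")   # selects the true branch
i, total = while_loop(lambda i, s: i < 5,        # sum the integers 0..4
                      lambda i, s: (i + 1, s + i),
                      (0, 0))
```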
- Supports using variable lists as parameters for some APIs, which previously only supported string lists as the `parameter_list` or `no_grad_set`. There is no need to obtain the `name` attribute of variables when using the following APIs:
- fluid.backward.append_backward(loss, parameter_list=None, no_grad_set=None, callbacks=None)
- fluid.backward.gradients(targets, inputs, target_gradients=None, no_grad_set=None)
- The minimize methods of optimizers, such as Adam's minimize: minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)
- Basic Function Optimization
- Supports configuring tensor data with the numpy float16 data type, with no need to convert to uint16 first.
- Supports using the minus sign to express a tensor's opposite.
- GPU Memory Allocation Strategy:
- Changes the default policy to `AutoGrowth`, in which GPU memory is applied for on demand without affecting training speed. This avoids the earlier problem that the pre-allocation strategy made it difficult to start another task on the same GPU.
- Adjusts GPU memory allocation for multi-card tasks: the GPU memory allocators on different GPU cards are set to the `Lazy` initialization mode, so that no memory is applied for on unused cards. This avoids the OOM problem that could occur when tasks run on idle GPU cards without setting CUDA_VISIBLE_DEVICES while memory is occupied on other cards.
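The float16 support mentioned above can be checked with plain numpy: a float16 array and its uint16 bit view share the same buffer, which is why the manual uint16 conversion is no longer needed:

```python
import numpy as np

x = np.array([1.0, -2.5], dtype=np.float16)
bits = x.view(np.uint16)             # same buffer, reinterpreted bitwise
assert bits[0] == 0x3C00             # IEEE 754 half-precision encoding of 1.0
assert bits.view(np.float16)[1] == np.float16(-2.5)  # lossless round trip
```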
- OP Function Upgrade
- elu: this activation function supports the calculation of second-order gradients.
- prroi_pool: the `rois` parameter supports the `Tensor` or `LoDTensor` type.
- conv2d, pool2d, batch_norm, lrn: support using the MKL-DNN library to perform gradient calculation of these OPs.
- argsort: supports descending order. A new parameter `descending` is added, with default value `False`.
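The semantics of the new `descending` parameter of argsort can be mirrored in numpy (an illustration only, not the Paddle API):

```python
import numpy as np

scores = np.array([0.3, 0.9, 0.5])
asc = np.argsort(scores)   # ascending indices, the old default behavior
desc = asc[::-1]           # what descending=True now returns directly
```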
- Basic Performance Optimization
- DALI Preprocessing Acceleration
- Supports the Nvidia DALI GPU data preprocessing library, which can be used to accelerate the preprocessing of data such as images, videos, and speech.
- Automatic Mixed Precision Training Optimization
- Implements the following optimization strategies to increase the training throughput of the ResNet50 model, along with the DALI data preprocessing module. The mixed precision training throughput of a single V100 card is increased from 600+ images/s to 1,000+ images/s. The throughput of 8 cards on a single machine is increased to 7,840 images/s. The throughput of 32 cards on 4 machines is increased to 28,594 images/s.
- Supports NHWC data layout inputs for some OPs such as batch_norm and conv2d. Accelerates fp16 calculation speed using Tensor Core technology.
- Fuses some op patterns in the model, such as batch_norm and relu, based on the IR Pass mechanism.
- Optimizes the kernels of some elementwise OPs, such as add and mul.
- Optimizes the `RecomputeOptimizer` to enable a bigger batch size. The batch size of the Bert-large model increases by 533.62% when using the `RecomputeOptimizer`.
- OP Performance Optimization
- Implements the fusion operator `fuse_emb_seq_pool` of `embedding` and `sequence_pool`. Optimizes the `murmurhash3_x64_128` in `bloom_filter`. These optimizations increase the training speed of some NLP models.
- Optimizes the GPU performance of `mean op`. When the input is a 32x32x8x8 tensor, the forward calculation speed is increased by 2.7 times.
- Optimizes the `assign` and `lod_reset` OPs to avoid unnecessary GPU memory copies and data transforms.
- Optimizes the kernel implementation of the stack OP. The single-card GPU performance in the XLnet/Ernie model is improved by 4.1%.
- Dynamic Graph
- Function Optimization
- Removes the `name_scope` parameter in `Layers` to make it easier to inherit and call.
- Removes the `block` parameter in `to_variable` to simplify the use of the API.
- Removes `build_once` to address the problem that model parameters depend on data, so that `Layers` can get all the parameter tables when `init` is executed. This is convenient for saving and loading, parameter initialization, parameter debugging, and parameter optimization.
- Optimizes the automatic pruning function to facilitate networking and reduce the reverse calculation amount.
- Supports the `SelectedRows` operation so that the Embedding layer supports sparse update on a single card.
- Adds containers such as ParameterList, LayerList, and Sequential to address the framework's lack of containers, making networking more convenient.
- Supports functions such as named_sublayers and named_parameters to facilitate programming.
- Supports the `Linear lr warmup decay` strategy.
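A minimal sketch of a linear-warmup, linear-decay schedule of the kind this strategy implements (the exact Paddle formulation may differ):

```python
def warmup_linear_decay(step, warmup_steps, total_steps, base_lr):
    # Ramp up linearly from 0 to base_lr, then decay linearly back to 0.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

lrs = [warmup_linear_decay(s, warmup_steps=100, total_steps=1000, base_lr=0.1)
       for s in range(0, 1001, 100)]
```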
- Performance Optimization
- Optimizes the interaction of Python with C++, GradMaker, OperatorBase, and the allocator. The performance is improved by 270% for the LSTM-based language model task on the P40 machine.
- Removes redundant codes to fix performance problems caused by calling dead codes of `optimized_guard`. The performance of optimizers such as SGD and Adam is improved by 5% to 8% for the Transformer model (batch_size=64) on the P40 machine.
- Integrates the updating logic of `beta` into `adam_op` to reduce the cost of calling the op kernel, avoiding the performance impact of adding an extra `scale_op` to update the beta parameter in `AdamOptimizer`. The performance is improved by 9.67% on the P40 machine.
- Optimizes the asynchronous DataLoader of the dynamic graph. For Mnist, ResNet and other CV models, the single-card training speed is improved by more than 40% on the P40 machine.
- Adds a numpy bridge function to support sharing the underlying data between Tensor and ndarray in CPU mode. This avoids the copy problem of numpy input when creating variables, and improves efficiency.
- Optimizes GPU memory with the forward variable space strategy, which can delete in advance the Tensor Buffers not required in reverse calculation. The maximum batch size is increased by more than 20%-30% in some models such as ResNet.
- Dynamic Graph Deployment
- Supports the `TracedLayer` interface to convert a dynamic graph model into a static graph.
- Debugging Analysis
- Optimizes the error messages. Classifies the framework error messages and optimizes the message descriptions, making it more convenient to solve problems according to the messages.
- Optimizes the performance analysis profile function.
- Enhances the functions and accuracy of the profiler. Supports profile options at different levels. The call relation of events can be recorded in the profile data and printed.
- Optimizes the checking and debugging functions of `nan inf`, which are enabled through `FLAGS_check_nan_inf`. The performance, function, and output information are all greatly improved.
- In terms of speed, the ResNet50 model tested on V100 has a performance improvement of about 1000 times compared with the original utility components, and maintains an over 80% efficiency for normal training.
- In terms of function, support for fp16 is added and environment variables can be set to skip the inspection of op, op_role, and op_var to facilitate the debugging of fp16 models.
- The output information is detailed and accurate. Besides wrong op and tensor names, the quantities of wrong nan, inf, and normal numerical values are printed to facilitate debugging.
- Releases the lightweight installation package `paddlepaddle-tiny` for CPU training and forecast, supporting installation on Windows/Linux/Mac OS and python27/python35/python36/python37.
- Supports the following compile functions: no avx, no ml, no gpu, no unittest.
- Removes slim and some datasets.
- Reduces the Linux package size from 90M to 37M, the Windows package size from 50.8M to 9.6M, and the MAC package size from 59M to 19.8M.
- Reduces the number of installation requirement dependencies from 15 to 7.
## Inference Deployment
- Server-side Inference Library
- Python API
- Supports reading and writing the model from memory to meet model encryption requirements.
- The Scale operator is no longer added at the end of the inference model.
- Adds the ZeroCopy API, which is basically the same as the C++ API. Supports using numpy.ndarray as input and output, which is convenient for Python scenarios.
- Adds several interfaces in AnalysisConfig to completely cover the C++ inference functions, including removing passes and disabling inference glog.
- Support for Other Programming Languages
- Adds inference APIs for R and Go, along with usage methods and examples.
- Provides the corresponding ProtoBuf header file to facilitate analysis of model structure.
- For an inference library compiled with TRT, the TensorRT library is no longer provided from third_party and needs to be downloaded by users at https://developer.nvidia.com/tensorrt.
- Functional Enhancement:
- Supports accessing Paddle Lite in submap mode; ResNet50 has been verified.
- Supports the MKL-DNN FC INT8 kernel.
- Supports the Ernie model in Paddle-TensorRT. For the Ernie model (seq length = 128) on the T4 card, the latency of fp16 inference is 3.6 ms, which is 37% faster than fp32 inference.
- Quantization: the single-threaded and multi-threaded performance of ERNIE INT8 are improved by 2.79 times and 1.79 times respectively on the second-generation Xeon scalable platform 6271, while the Ernie INT8 model has only a slight precision decline compared with the FP32 model.
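For reference, the symmetric INT8 scheme that quantized inference of this kind relies on can be sketched in a few lines of numpy (a simplified illustration, not the Paddle kernels):

```python
import numpy as np

def quantize(x, scale):
    # Symmetric per-tensor quantization to signed 8-bit.
    return np.clip(np.round(x / scale * 127.0), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale / 127.0

x = np.array([-0.5, 0.0, 0.1, 0.5], dtype=np.float32)
scale = float(np.abs(x).max())            # max-abs calibration
x_hat = dequantize(quantize(x, scale), scale)
max_err = float(np.abs(x_hat - x).max())  # bounded by scale / 254
```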
- Mobile/Embedded End-side [Paddle Lite](https://github.com/PaddlePaddle/Paddle-Lite)
- Releases version v2.3.
- Upgrades the functions of the model_optimize_tool.
- Supports "the post-training quantization method without calibration data". The model storage space can be reduced by 2 to 4 times.
- OpenCL: the migration of 30 Image2D kernels is finished and 14 OPs are covered.
- Strengthens the capability with FPGA and NPU. Supports Kunlun XPU for inference.
- Releases a new official website document. Adds the document of "post-training quantization method without calibration data".
- [Paddle Serving](https://github.com/PaddlePaddle/Serving):
- Releases the forecast service of remote text vector representation of the bert-type semantic understanding model.
- Releases the paddle-gpu-serving WHL package. Supports pip installation and Python codes.
- Supports 13 semantic understanding models in PaddleHub. Supports the single-machine multi-card mode. The forecast speed is 869.56 samples per second using the Ernie_tiny model, when the average sample length is 7 under a single P4 GPU.
- [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim):
- Moves PaddleSlim to an independent repo.
- Refactors the pruning, quantization, distillation and NAS APIs. Provides more low-level APIs for developers.
- Quantization:
- Adds a post-training quantization strategy based on KL divergence. Supports quantization of the embedding layer.
- Supports quantization of the MKL-DNN FC layer based on QAT.
- Adds post-training quantization that supports 30 kinds of operators, with the option to skip quantization for some operators.
- Supports skipping some operators in the training-aware strategy.
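The KL-divergence strategy picks a clipping threshold whose coarsely quantized distribution stays closest to the original activation histogram. A much-simplified sketch of the idea (the `choose_threshold` helper below is illustrative, not PaddleSlim's implementation):

```python
import numpy as np

def kl_divergence(p, q):
    # KL(p || q) over histogram bins; both are normalized first.
    p = p / p.sum()
    q = q / q.sum()
    q = np.where(q > 0, q, 1e-10)    # floor to avoid division by zero
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def choose_threshold(activations, num_bins=128, num_levels=8):
    # Try clipping the |activation| histogram at each bin edge; keep the
    # threshold whose coarsely re-binned version minimizes KL divergence.
    hist, edges = np.histogram(np.abs(activations), bins=num_bins)
    best_t, best_kl = edges[-1], float("inf")
    for i in range(num_levels, num_bins + 1):
        ref = hist[:i].astype(np.float64)
        ref[-1] += hist[i:].sum()                 # fold clipped tail into last bin
        chunks = np.array_split(ref, num_levels)  # coarse "quantized" histogram
        cand = np.concatenate([np.full(len(c), c.mean()) for c in chunks])
        kl = kl_divergence(ref, cand)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t

acts = np.concatenate([np.linspace(-1.0, 1.0, 1000), [6.0]])  # one outlier
threshold = choose_threshold(acts)
```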
- Pruning: refactors and enhances the pruning code to support more kinds of networks.
- NAS:
- Supports NAS based on simulated annealing. Provides more predefined search spaces and supports custom search spaces.
- Adds a one-shot algorithm for NAS. The search speed is 20 times faster than that of the previous version.
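A generic simulated-annealing loop of the kind such a searcher is built on can be sketched as follows (a toy reward stands in for measured model accuracy; this is not the SANAS implementation):

```python
import math
import random

def simulated_annealing(init, neighbor, reward, steps=300, t0=1.0, decay=0.98):
    # Always accept better candidates; accept worse ones with prob exp(dr / T).
    cur, cur_r = init, reward(init)
    best, best_r = cur, cur_r
    temp = t0
    for _ in range(steps):
        cand = neighbor(cur)
        cand_r = reward(cand)
        if cand_r >= cur_r or random.random() < math.exp((cand_r - cur_r) / temp):
            cur, cur_r = cand, cand_r
        if cur_r > best_r:
            best, best_r = cur, cur_r
        temp *= decay                 # cool down: accept fewer bad moves
    return best, best_r

random.seed(0)
# Toy search space: 5 architecture tokens, each an int in [0, 7]; the toy
# reward prefers larger tokens, so the optimum is [7, 7, 7, 7, 7].
def mutate(tokens):
    return [random.randrange(8) if random.random() < 0.3 else v for v in tokens]

best, best_r = simulated_annealing([0, 0, 0, 0, 0], mutate, lambda s: sum(s) / 35.0)
```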
- Releases the large-scale scalable knowledge distillation framework called Pantheon.
- Achieves full decoupling between student and teacher models and among teacher models. They can run independently on different physical devices to make full use of computing resources.
- Supports multi-device large-scale inference of the teacher model on a single node. The acceleration ratio is tested to be linear on BERT-like complex models.
- Supports knowledge transmission between teacher and student models running on any two physical devices in the same network environment, using the TCP/IP protocol for communication in online distillation mode.
- Unifies the API interfaces for online and offline distillation modes, enabling different teacher models to operate in different distillation modes.
- The merging of knowledge and the batch reorganization of knowledge data are completed automatically on the student side to facilitate knowledge fusion of multi-teacher models.
- Model Zoo:
- Releases benchmarks of image classification models such as ResNet50 and MobileNet.
- Adapts the PaddleDetection library and releases benchmarks of YOLOv3 models with different backbones.
- Adapts the PaddleSeg library and releases benchmarks of Deeplabv3+ models with different backbones.
- Refines Documents:
- Refines the API documents. Adds QuickStart and advanced tutorials. Adds a model zoo document containing models for image classification, object detection, and semantic segmentation. Translates all documents into English.
## Distributed ## Distributed
- Parameter Server Mode: - Parameter Server Mode:
- Reduces the memory usage greately during training. On 100 million embedding trainging tasks, the Trainer-side memory can be reduced by 90%. - Reduces the memory usage greatly during training. On 100 million embedding training tasks, the Trainer-side memory can be reduced by 90%.
- Reduces the memory usage of distributed saving and loading models greatly. The Pserver-side memory peak value can be minimized to 1/N of the original value, where N is the number of Pserver nodes. - Reduces the memory usage of distributed saving and loading models greatly. The Pserver-side memory peak value can be minimized to $1/N $ of the original value, where N is the number of Pserver nodes.
- Optimizes the dense parameter communication in GEO mode. - Optimizes the dense parameter communication in GEO mode.
- Supports distributed AUC index calculation. - Supports distributed AUC index calculation.
- Adds distributed barrier functions. - Adds distributed barrier functions.
- Adds Semi-asynchronous modes in Communicator. - Adds Semi-asynchronous modes in Communicator.
- Supports semi-asynchronous modes of the ‘TrainFromDataset’ training interface. - Supports semi-asynchronous modes of the `TrainFromDataset` training interface.
- Adds ‘DistributedStrategy’ in ‘Fleet’ to improve the convenient usage. Integrates the current distributed related flags. - Adds `DistributedStrategy` in `Fleet` to improve the convenient usage. Integrates the current distributed related flags.
- Supports single-program multi-loss training modes in ‘Fleet pslib’ to optimize the training performance. - Supports single-program multi-loss training modes in `Fleet pslib` to optimize the training performance.
- Supports k8s environment in 100 billion sparse mode. - Supports k8s environment in 100 billion sparse mode.
- [Large-scale classification library PLSC](https://github.com/PaddlePaddle/PLSC): It supports large-scale classification problems that data parallelism cannot solve due to the limitation of video memory capacity.
  - Supports three built-in models, ResNet50, ResNet101, and ResNet152, as well as user-defined models. Under a single-machine eight-V100-GPU configuration, the ResNet50 model has a million-class training speed of 2,122.56 images/s, which is 1.3 times faster than that of the standard ResNet50 model.
  - Releases a `plsc-serving` whl package for online model inference service. It can predict the image semantic vector representation of the face recognition model. Making predictions with a user-trained model is supported. The inference speed of the ResNet50 model (batch size=256) is 523.47 images/s under a single V100 GPU.
  - Releases the pre-training model based on the ResNet50 network and the MS1M-ArcFace dataset: https://plsc.bj.bcebos.com/pretrained_model/resnet50_distarcface_ms1mv2.tar.gz.
  - Releases the benchmark for ResNet50 mixed precision training (single-card, multi-card, and multi-machine).
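To see why plain data parallelism breaks down for million-class problems, consider the memory footprint of the final FC layer alone. The numbers below are illustrative assumptions, not PLSC defaults:

```python
# Rough memory estimate for the last FC layer of a face-recognition
# classifier; feature_dim and num_classes are illustrative assumptions.
feature_dim = 512          # embedding size
num_classes = 1_000_000    # million-class problem
bytes_per_fp32 = 4

weight_bytes = feature_dim * num_classes * bytes_per_fp32
print(weight_bytes / 2**30)  # ~1.9 GiB for the weights alone
```

With gradients and optimizer state this multiplies several times over per GPU under data parallelism, which is why PLSC shards the classification layer across devices instead of replicating it.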
## Basic Model Library
- [Models repo github](https://github.com/PaddlePaddle/models)
- PaddleNLP
  - Seq2seq supports training modes such as RL and GAN in the static graph of Paddle.
  - A training model for word segmentation and part-of-speech tagging is released. With the knowledge distillation framework Pantheon, the F1 score of this model on its own dataset is improved by 1% over that of PaddleNLP LAC. This model is merged into the jieba repo, with a flag `use_paddle` added to enable the deep learning model mode. In addition, a paddle version detection and rollback mechanism is added in jieba to ensure user experience.
  - Adds dynamic graph model implementations for these models: word2vec, senta, transformer, Bert, seq2seq, and LAC.
- PaddleSpeech
  - Releases the text-to-speech toolkit Parakeet (Paddle PARAllel text-to-speech toolkit).
    - Implements the standard workflow for data preprocessing, training, and synthesis of TTS models.
    - Provides out-of-the-box preprocessing implementations of typical datasets.
    - Provides the model components commonly used in the TTS field to facilitate model implementation.
    - Releases the TTS models DeepVoice3, ClariNet, TransformerTTS, FastSpeech, WaveNet, and WaveFlow.
- PaddleCV
  - Image Classification:
    - Adds 14 pre-training models including the SENet-vd, Res2Net, and HRNet series of models:
      - SE_ResNet18_vd, SE_ResNet34_vd, SE_ResNeXt50_vd_32x4d, ResNeXt152_vd_32x4d
      - Res2Net50_26w_4s, Res2Net50_14w_8s, Res2Net50_vd_26w_4s
      - HRNet_W18_C, HRNet_W30_C, HRNet_W32_C, HRNet_W40_C, HRNet_W44_C, HRNet_W48_C, HRNet_W64_C
    - Supports accelerating data preprocessing by using DALI. On ImageNet training, a speedup of 1.5x (ResNet50) to more than 3x (ShuffleNet) is obtained and GPU utilization is greatly improved.
  - 3D Vision:
    - Releases PointNet++ and PointRCNN models.
  - Tracking Model Library:
    - Releases SiamFC and ATOM models.
  - Adds dynamic graph model implementations for the following models: MobileNet-v1/v2, YOLOv3, Faster-RCNN, Mask-RCNN, the video classification TSM model, and the video action localization BMN model.
- PaddleRec
  - Releases a multi-task model called MMoE for the recommendation field. It can be applied to large-scale multi-task joint training in industry.
  - Adds dynamic graph model implementations for the following models: gru4rec, deepfm.
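The MMoE idea mentioned above — per-task softmax gates mixing a shared pool of experts — can be sketched in a few lines of NumPy. This is a toy illustration with made-up shapes, not the PaddleRec implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))            # batch of 8 samples, feature dim 16

n_experts, n_tasks, hidden = 4, 2, 32   # illustrative sizes
experts = [rng.normal(size=(16, hidden)) for _ in range(n_experts)]
gates = [rng.normal(size=(16, n_experts)) for _ in range(n_tasks)]

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Each expert transforms the shared input once...
expert_out = np.stack([x @ w for w in experts], axis=1)  # (8, 4, 32)

# ...and each task mixes the expert outputs with its own learned gate.
task_outputs = []
for g in gates:
    weights = softmax(x @ g)            # (8, 4), each row sums to 1
    task_outputs.append(np.einsum('be,beh->bh', weights, expert_out))

print([o.shape for o in task_outputs])  # [(8, 32), (8, 32)]
```

Because the experts are shared while only the gates are task-specific, the tasks can be trained jointly at scale without duplicating the bulk of the network per task.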
## End-To-End Development Kits
- [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection)
  - The precision of the YOLOv3 model is further improved. The precision for the COCO data reaches 43.2%, an absolute increase of 1.4% over the previous version.
  - Adds the following model implementations and pre-training models:
    - Adds the best single model, CascadeCARCNN-FPN-Dcnv2-Nonlocal ResNet200-vd, from the Google AI Open Images 2019 Object Detection competition. Releases a pre-training model of this algorithm based on Objects365 data.
    - Adds a series of CBResNet, Res2Net, and HRNet pre-training models.
    - Adds a LibraRCNN algorithm and its pre-training models.
    - Adds GIoU, DIoU, and CIoU loss-based pre-training models for the Faster-RCNN R50 FPN model. Without reducing the inference speed, the precision for the COCO data is improved by 1.1%, 0.9%, and 1.3% respectively.
  - Adds modules:
    - Backbone networks: CBResNet, Res2Net, and HRNet are added.
    - Loss modules: GIoU loss, DIoU loss, and CIoU loss are added. Libra loss and YOLOv3 loss support a fine-grained op combination.
    - Postprocessing modules: The soft-NMS and DIoU NMS modules are added.
    - Regularization module: A DropBlock module is added.
  - Functional Optimization and Improvement:
    - YOLOv3 data preprocessing is accelerated. The overall training speeds up by 40%.
    - The data preprocessing logic is optimized.
    - The benchmark data for face detection inference is added.
    - Inference examples under the Paddle inference library Python API are added.
  - Detection Model Compression:
    - Pruning: A MobileNet-YOLOv3 pruning solution and model are released, with FLOPs -69.6% and mAP +1.4% for the VOC dataset, and FLOPs -28.8% and mAP +0.9% for the COCO dataset. A ResNet50vd-dcn-YOLOv3 pruning solution and model are released, with FLOPs -18.4% and mAP +0.8% for the COCO dataset.
    - Distillation: A MobileNet-YOLOv3 distillation solution and model are released, with mAP +2.8% for the VOC data and mAP +2.1% for the COCO data.
    - Quantization: YOLOv3 and BlazeFace quantitative models are released.
    - Pruning + Distillation: A MobileNet-YOLOv3 pruning + distillation solution and model are released, with FLOPs -69.6%, an inference speedup of 64.5% on GPU, and mAP -0.3% for the COCO dataset. A ResNet50vd-dcn-YOLOv3 pruning + distillation solution and model are released, with FLOPs -43.7%, an inference speedup of 24.0% on GPU, and mAP +0.6% based on the COCO data.
    - Search: A complete search solution for the open-source BlazeFace-NAS.
  - Inference Deployment:
    - The support of the Paddle inference library for TensorRT and FP16 precision is adapted.
  - Documents:
    - Adds documents introducing the data preprocessing module and a document on implementing user-defined data readers.
    - Adds documents about how to add an algorithm model.
    - Documents are deployed to the website: https://paddledetection.readthedocs.io/zh/latest/
- [PaddleSeg](https://github.com/PaddlePaddle/PaddleSeg)
  - Adds models:
    - LaneNet model applicable to lane segmentation scenarios.
    - Lightweight Fast-SCNN model applicable to high-performance scenarios.
    - HRNet semantic segmentation model applicable to high-precision scenarios.
  - Releases multiple PaddleSlim-based model compression solutions:
    - Fast-SCNN tailoring solution and model on the Cityscapes dataset.
    - Deeplabv3p-Xception and Deeplabv3p-MobilenetV2 distillation solutions on the Cityscapes dataset.
    - Deeplabv3p-MobilenetV2 search solution on the Cityscapes dataset.
    - Deeplabv3p-Mobilenet quantitative solution and model on the Cityscapes dataset.
  - Enhances the deployment capability:
    - Adds lightweight Python-based deployment.
    - The TensorRT acceleration support for FP16 and Int8 quantitative models is added.
    - Adds tutorials for Paddle-Lite mobile deployment of the DeepLabv3p-MobileNetV2 human portrait segmentation model.
    - Optimizes the model export step. Supports GPU implementation of image preprocessing and postprocessing. The performance is improved by 10%-20%.
    - Provides a benchmark of the prediction performance of U-Net, ICNet, PSPNet, DeepLabv3+, and other models for images of different sizes to facilitate model selection based on performance.
  - Experience Optimization:
    - Adds a learning rate function called warmup. It can be used with different learning rate decay strategies to improve fine-tuning stability.
    - Marked images can be saved in pseudo-color image format to improve their preview experience.
    - Adds the function of automatically saving an optimal mIoU model.
    - The document logic is comprehensively optimized. An AIStudio practical tutorial on industrial scenarios such as industrial quality inspection and fundus screening is provided.
- [ElasticRec](https://github.com/PaddlePaddle/ElasticRec)
  - An ElasticRec recommendation ranking system is released. It is deployed through K8S. Streaming training and online inference service are supported.
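The warmup-plus-decay combination described for PaddleSeg follows a common pattern: ramp the learning rate up linearly for a few steps, then hand over to a decay strategy. A minimal sketch with hypothetical step counts (not PaddleSeg's API):

```python
def lr_at_step(step, base_lr=0.01, warmup_steps=100, total_steps=1000):
    """Linear warmup followed by linear decay (illustrative schedule)."""
    if step < warmup_steps:
        # Ramp from base_lr/warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # After warmup, decay linearly to zero over the remaining steps.
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * (1.0 - frac)

print(lr_at_step(0))     # tiny LR at the start stabilizes fine-tuning
print(lr_at_step(99))    # reaches base_lr at the end of warmup
print(lr_at_step(1000))  # fully decayed at the end of training
```

The small initial learning rate prevents the randomly initialized head from destroying pre-trained weights in the first updates, which is exactly the fine-tuning stability the release note refers to.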
## Utility Components
- [PaddleHub](https://github.com/PaddlePaddle/PaddleHub)
  - 52 new pre-trained models are added, bringing the total number of pre-training models to 100+:
    - Semantic models: Five semantic models such as RoBERTa_wwm, BERT_wwm, and ERNIE-Tiny are added.
    - Text classification: Three anti-porn models are added.
    - Image classification: A total of 36 image classification models such as ResNeXt-WSL and EfficientNet are added.
    - Object detection: Five detection models such as pedestrian detection and vehicle detection are added.
    - Key point detection: Two models for key point detection of face and body posture are added.
    - Face mask detection: Two PyramidBox-Lite-based face mask detection models are added.
    - Universal face detection: Four universal face detection models such as Ultra Light Fast Generic Face Detector and PyramidBox-Lite are added.
  - Function:
    - Bert Service, a text vector representation service based on Paddle Serving, is added.
    - Task flexibility is enhanced. A hook mechanism that supports the loading of user-defined code is added.
    - Code is optimized. The command line execution speed is increased by 50%.
    - Dataset and Reader are refactored. The quantity of adaptive user-defined dataset code is reduced by 60%.
    - The AutoFinetune interface is optimized. Multi-experiment visualization effect display is supported.
  - Experience Optimization:
    - The logic is fully optimized. Rich AIStudio tutorial contents are added.
    - The landing page of the official website has been fully upgraded to provide the function of quick online experience and tutorial guidance.
- Multi-task learning framework [PALM](https://github.com/PaddlePaddle/PALM)
  - Python3 and Windows are supported.
  - Release APIs and the multi-task learning kernel are upgraded:
    - Supports an independent task saver.
    - Continuous training and inference are supported. Dataset files can be switched over freely under a single execution.
    - Supports model customization.
    - The multi-task learning kernel is refactored and some bugs are fixed.
  - Upgrades the multi-task learning ability:
    - Supports independent settings of batch size and sequence length across tasks.
    - Fixes the inconsistency problem of tasks on multiple GPUs.
    - The multi-task learning scheduling and termination strategies are optimized to generally improve the model generalization ability.
  - Upgrades the ability and types of pre-defined tasks:
    - Upgrades the matching task. Adds pairwise learning and support for multiple categories.
    - The support for machine reading comprehension tasks is enhanced. User-controllable preprocessing hyper-parameters are added.
    - The support for sequence labeling tasks is added.
  - The large-scale training/inference capability is strengthened:
    - Adds automatic multi-GPU inference.
    - Refactors the asynchronous reader. Supports dynamic padding length for multi-task learning running on multiple GPUs.
  - A module for the management and downloading of pre-training models is added:
    - The management and downloading of pre-training models such as BERT, ERNIE, and RoBERTa are supported.
    - A RoBERTa Chinese pre-training model is added.
  - Releases the version v1.3.
- Federated Learning [PaddleFL](https://github.com/PaddlePaddle/PaddleFL):
  - The scheduler and submitter functions are added: The scheduler is used to control whether the trainer participates in updates during training. The submitter is used to submit PaddleFL tasks in an MPI cluster.
  - A LEAF dataset federated learning open dataset is added, together with an API to set a benchmark. Classical datasets in image classification, emotion analysis, character inference, and other fields, such as MNIST and Sentiment140, are supported.
  - According to the added components, the original samples in example are modified and the femnist_demo and submitter_demo examples are added.
  - Fl_distribute_transpiler is optimized to add the support of the FedAvg strategy for the adam optimizer.
  - The SecAgg strategy (Secure Aggregation) is added to achieve secure parameter aggregation.
- Deep Reinforcement Learning Framework [PARL](https://github.com/PaddlePaddle/PARL)
  - Version v1.3 is released.
  - The support for Multi-Agent RL algorithms including MADDPG is added.
  - The support for multi-card training is added. An example of a multi-card DQN algorithm is released.
  - Open-sources the SOTA algorithms TD3 and SAC in the continuous control field.
  - Open-sources the implementation and training solution for the NeurIPS2019 reinforcement learning challenge champion model, with trained models released.
- Paddle Graph Learning Framework [PGL](https://github.com/PaddlePaddle/PGL)
  - Version v1.1 is released:
    - The support for the authoritative graph learning database OGB is added. Three types of tasks, node property prediction, link prediction, and graph property prediction, are fully supported. A SOTA baseline is released.
    - A graph solution PGL-Rec and a knowledge graph embedding algorithm set PGL-KE are released.
    - An improvement on ease of use is made. A high-order API of PGL is released.
    - Other upgrade points: Sampling of a multi-process graph is optimized and GraphSAGE-type models are accelerated by three times. LoD Tensor-based Graph Batch and Graph Pooling operators are added. Models including a distributed heterogeneous task graph algorithm, GraphZoom, and PinSage are added to the Model Zoo.
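The SecAgg idea noted for PaddleFL above — the server only ever sees masked parameters, yet the masks cancel in the aggregate — can be demonstrated with a toy two-client example. This is a conceptual sketch, not PaddleFL's actual protocol (which derives masks from key exchange and handles dropouts):

```python
import numpy as np

rng = np.random.default_rng(42)
w_a = np.array([1.0, 2.0, 3.0])   # client A's local model update
w_b = np.array([4.0, 5.0, 6.0])   # client B's local model update

# A and B agree on a shared random mask (via key agreement in the real
# protocol); A adds it to its update, B subtracts it from its own.
mask = rng.normal(size=3)
upload_a = w_a + mask             # what the server sees from A
upload_b = w_b - mask             # what the server sees from B

# Neither upload reveals the raw update, but their sum is exact.
aggregate = upload_a + upload_b
print(np.allclose(aggregate, w_a + w_b))  # True
```

With more clients, each pair shares such a mask, so every individual upload stays hidden while the server still recovers the exact sum needed for FedAvg.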
## Code Reconstruction and Upgrade
- Compilation
  - A compilation option `WITH_NCCL` is added. Single-card users can explicitly specify `WITH_NCCL=OFF` to accelerate compilation.
  - A compilation option `WITH_TP_CACHE` is added to cache third-party source codes and avoid repeated downloading. Windows users can set it to ON to speed up compilation and improve compilation stability.
  - The `CUDA_ARCH_NAME` default value is set to `Auto` (`All` indicates compiling for all GPU architectures; `Auto` indicates compiling only for the current machine's GPU architecture). For developers, this saves a lot of compilation time compared with `All`, thus improving development efficiency.
  - Redundant links, products, and needless file copying are reduced, thus speeding up compilation on Windows.
- External Dependency Library
  - MKL-DNN is upgraded to the latest version 1.1.
  - The inference library is decoupled from `third_party` and 28 third-party-dependent compilation codes are refactored to facilitate the unified management of external dependencies.
  - Two third-party-dependent private code repositories, one unnecessary external dependency, and 2000+ lines of unnecessary code under the patch are removed to improve the code repository quality.
- Code Cleanup, Refactoring, and Optimization
  - The unnecessary `contrib/float16` directory is removed. The unnecessary snappy/snappystream dependency under BRPC is deleted.
  - `loss.py` and `sequence_lod.py` are split out of `python/paddle/fluid/layers/nn.py` according to the API functions, thus reducing the code quantity of `nn.py` and facilitating reading.
  - The code corresponding to the warnings of `-Wno-error=sign-compare` (more than 100 occurrences in total) is fixed. An error will be reported for all subsequent warnings of this kind during compilation, thus improving the code quality.
  - `WarningLnk4006`/`WarningLnk4221` warnings (about 300) from Windows MSVC compilation are removed to improve the code repository quality.
  - The quantity of reduce_op, expand_op, and expand_as_op templates is reduced to accelerate GPU compilation and reduce whl package size by 70 MB.
  - The pybind function of every OP under the dynamic graph is automatically generated using code and directly called in layers, to improve dynamic graph performance and reduce the coupling with the static graph.
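The compilation options described in this section would typically be passed to CMake on the command line. The flag names come from the notes above; the build directory layout and `make` invocation are illustrative:

```shell
# Single-card developer build: skip NCCL, cache third-party sources,
# and compile only for the GPU architecture of the local machine.
cmake .. -DWITH_NCCL=OFF \
         -DWITH_TP_CACHE=ON \
         -DCUDA_ARCH_NAME=Auto
make -j"$(nproc)"
```

`CUDA_ARCH_NAME=Auto` is what makes the biggest difference for iteration speed, since `All` compiles device code for every supported GPU architecture.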
## Bug Fixes
- Fix the problem of an MKL-DNN error when PaddleDetection-based Faster-RCNN uses the Python API to make an inference.
- Fix the problem of training suspension in the GPU implementation of sum op because some Tensors are not initialized.
- Fix the problem of precision loss when the value in fill_constant is set to a large integer.
- Fix the problem of the sigmoid cudnn kernel being called as the tanh cudnn kernel by mistake.
- Fix some bugs related to reshape and Conv2D depthwise conv in dynamic graph mode; fix the problem of some parameters in the network having no gradient, causing a program crash.
- Fix the bug of a running error of GradientClip in parameter server mode.
- Fix the problem of memory leak in full asynchronous mode of the parameter server.
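The fill_constant fix above addresses a generic pitfall: float32 cannot represent every large integer exactly, so storing one in a single-precision tensor silently rounds it. The effect can be shown without Paddle at all:

```python
import numpy as np

big = 2**24 + 1                      # 16777217: first integer float32 cannot hold
as_f32 = np.float32(big)

print(int(as_f32) == big)            # False: value was rounded to 16777216
print(int(np.float64(big)) == big)   # True: float64 still has the precision
```

Any framework that funnels an integer constant through a float32 intermediate hits the same rounding, which is why such values need an integer or float64 path.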