未验证 提交 9b39af02 编写于 作者: A adaxiadaxi 提交者: GitHub

update_release_not_0312,test=develop (#1906)

上级 e061bc64
无法预览此类型文件
...@@ -2,11 +2,17 @@ ...@@ -2,11 +2,17 @@
## 重要更新 ## 重要更新
本版本对框架功能层面进行了重点增强,预测部署能力全面提升,分布式训练发布PLSC支持超大规模分类,并对参数服务器模式进行优化整合。对编译选项、编译依赖以及代码库进行了全面清理优化。模型库持续完善,优化了整体层次结构,增加了动态图模型实现。端到端开发套件和工具组件进一步完善。 本版本对框架功能层面进行了重点增强,预测部署能力全面提升,分布式训练发布PLSC支持超大规模分类,并对参数服务器模式进行优化整合。对编译选项、编译依赖以及代码库进行了全面清理优化。模型库持续完善,优化了整体层次结构,增加了动态图模型实现。端到端开发套件和工具组件进一步完善。
**训练框架:**增加自动混合精度训练AMP接口和新控制流接口;优化Tensor使用方式和显存分配策略;新增支持Nvidia DALI GPU数据预处理库;持续优化基础OP的功能和性能;动态图的功能进一步完善,性能大幅提升,对data independent的动态图模型提供转为静态图可预测部署模型的功能;框架调试分析功能和易用性全面提升。
**训练框架**:增加自动混合精度训练AMP接口和新控制流接口;优化Tensor使用方式和显存分配策略;新增支持Nvidia DALI GPU数据预处理库;持续优化基础OP的功能和性能;动态图的功能进一步完善,性能大幅提升,对data independent的动态图模型提供转为静态图可预测部署模型的功能;框架调试分析功能和易用性全面提升。
**预测部署**:服务器端预测库的Python API大幅优化,新增R语言、Go语言调用预测库的使用方法和示例,强化了量化支持能力;Paddle Lite支持无校准数据的训练后量化方法生成的模型,加强对OpenCL的支持,支持昆仑XPU的预测;模型压缩库PaddleSlim重构裁剪、量化、蒸馏、搜索接口,与模型库充分打通,新增大规模可扩展知识蒸馏框架 Pantheon。 **预测部署**:服务器端预测库的Python API大幅优化,新增R语言、Go语言调用预测库的使用方法和示例,强化了量化支持能力;Paddle Lite支持无校准数据的训练后量化方法生成的模型,加强对OpenCL的支持,支持昆仑XPU的预测;模型压缩库PaddleSlim重构裁剪、量化、蒸馏、搜索接口,与模型库充分打通,新增大规模可扩展知识蒸馏框架 Pantheon。
**分布式训练**:参数服务器模式下针对transpiler半异步、全异步、GEO三种模式,后端实现上统一到communicator中,前端接口统一到fleet中,通过fleet strategy灵活选择不同模式;发布大规模分类库PLSC,通过模型并行支持超多类别的分类任务。 **分布式训练**:参数服务器模式下针对transpiler半异步、全异步、GEO三种模式,后端实现上统一到communicator中,前端接口统一到fleet中,通过fleet strategy灵活选择不同模式;发布大规模分类库PLSC,通过模型并行支持超多类别的分类任务。
**基础模型库**:发布语音合成库Parakeet,包括多个前沿合成算法;PaddleCV新增14个图像分类预训练模型,3D和跟踪方向模型持续丰富;PaddleNLP的分词和词性标注模型支持jieba分词;PaddleRec增加多任务模型MMoE。模型库整体增加了广泛的动态图模型实现。模型库整体层次结构做了调整优化。 **基础模型库**:发布语音合成库Parakeet,包括多个前沿合成算法;PaddleCV新增14个图像分类预训练模型,3D和跟踪方向模型持续丰富;PaddleNLP的分词和词性标注模型支持jieba分词;PaddleRec增加多任务模型MMoE。模型库整体增加了广泛的动态图模型实现。模型库整体层次结构做了调整优化。
**端到端开发套件**:PaddleDetection和PaddleSeg新增大量模型实现及预训练模型,典型模型的训练速度和精度提升,模型压缩和部署能力大幅提升,使用体验全面优化。发布ElasticRec推荐排序系统,通过K8S进行部署,支持流式训练和在线预测服务。 **端到端开发套件**:PaddleDetection和PaddleSeg新增大量模型实现及预训练模型,典型模型的训练速度和精度提升,模型压缩和部署能力大幅提升,使用体验全面优化。发布ElasticRec推荐排序系统,通过K8S进行部署,支持流式训练和在线预测服务。
**工具组件**:PaddleHub新增52个预训练模型,总数超过100,功能和体验持续优化;多任务学习框架PALM升级内核,开放API调用,支持更多的任务类型;联邦学习PaddleFL新增公开数据集。深度强化学习框架PARL和飞桨图学习框架PGL也对应版本升级,支持更多功能,开放更多算法和基线。 **工具组件**:PaddleHub新增52个预训练模型,总数超过100,功能和体验持续优化;多任务学习框架PALM升级内核,开放API调用,支持更多的任务类型;联邦学习PaddleFL新增公开数据集。深度强化学习框架PARL和飞桨图学习框架PGL也对应版本升级,支持更多功能,开放更多算法和基线。
## 训练框架 ## 训练框架
...@@ -42,7 +48,7 @@ ...@@ -42,7 +48,7 @@
- 优化RecomputeOptimizer提升batchsize, 在Bert-large模型上最大batchsize比不使用RecomputeOptimizer增大533.62%,比上一版本提升一倍。 - 优化RecomputeOptimizer提升batchsize, 在Bert-large模型上最大batchsize比不使用RecomputeOptimizer增大533.62%,比上一版本提升一倍。
- OP性能优化 - OP性能优化
- 实现embedding和sequence_pool的融合算子fuse_emb_seq_pool,优化bloom_filter中的murmurhash3_x64_128,有效提升部分NLP模型的训练速度。 - 实现embedding和sequence_pool的融合算子fuse_emb_seq_pool,优化bloom_filter中的murmurhash3_x64_128,有效提升部分NLP模型的训练速度。
- 优化了mean op的GPU性能,输入数据为32*32*8*8的Tensor时,前向计算速度提升2.7倍。 - 优化了mean op的GPU性能,输入数据为32 * 32 * 8 * 8的Tensor时,前向计算速度提升2.7倍。
- 优化assign、lod_reset op,避免不需要的显存拷贝和data transform。 - 优化assign、lod_reset op,避免不需要的显存拷贝和data transform。
- 优化了stack OP的kernel实现,XLnet/Ernie模型GPU单卡性能提升4.1%。 - 优化了stack OP的kernel实现,XLnet/Ernie模型GPU单卡性能提升4.1%。
- 动态图 - 动态图
...@@ -59,7 +65,6 @@ ...@@ -59,7 +65,6 @@
- 优化了python 与c++ 交互,GradMaker、OperatorBase、allocator等。基于LSTM的语言模型任务p在P40机器上性能提升提升270%。 - 优化了python 与c++ 交互,GradMaker、OperatorBase、allocator等。基于LSTM的语言模型任务p在P40机器上性能提升提升270%。
- 针对optimize中多次调用optimized_guard无用代码导致的性能问题,移除了冗余代码。Transformer模型(batch_size=64)在P40机器上,SGD、Adam等优化器有5%~8%%的性能提升。 - 针对optimize中多次调用optimized_guard无用代码导致的性能问题,移除了冗余代码。Transformer模型(batch_size=64)在P40机器上,SGD、Adam等优化器有5%~8%%的性能提升。
- 针对AdamOptimizer中额外添加scale_op更新beta参数对性能的影响,将beta更新逻辑融合到adam_op中,减少op kernel调用开销。Dialogue-PLATO模型P40机器上性能提升9.67%。 - 针对AdamOptimizer中额外添加scale_op更新beta参数对性能的影响,将beta更新逻辑融合到adam_op中,减少op kernel调用开销。Dialogue-PLATO模型P40机器上性能提升9.67%。
- To reduce the performance impact caused by adding extra `scale_op` to update the beta parameter in `AdamOptimizer`.To reduce the performance impact caused by adding extra `scale_op` to update the beta parameter in `AdamOptimizer`, Iintegrate the updating logic of `beta` into `adam_op` to reduce the cost of calling op kernel. The performance 偶发of is improved by 9.67% on the P40 machine.
- 优化动态图异步DataLoader,对于Mnist、ResNet等CV模型任务在P40机器上单卡训练速度提升超过40%。 - 优化动态图异步DataLoader,对于Mnist、ResNet等CV模型任务在P40机器上单卡训练速度提升超过40%。
- 新增numpy bridge功能,支持在cpu模式下Tensor和ndarray之间共享底层数据,避免创建Variable时numpy输入需要拷贝的问题,提升效率。 - 新增numpy bridge功能,支持在cpu模式下Tensor和ndarray之间共享底层数据,避免创建Variable时numpy输入需要拷贝的问题,提升效率。
- 显存优化:提前删除反向不需要Tensor Buffer的前向变量空间的优化策略,在ResNet等模型上最大batch size提升20%-30%以上。 - 显存优化:提前删除反向不需要Tensor Buffer的前向变量空间的优化策略,在ResNet等模型上最大batch size提升20%-30%以上。
...@@ -228,7 +233,6 @@ ...@@ -228,7 +233,6 @@
- 体验优化 - 体验优化
- 新增学习率warmup功能,支持与不同的学习率Decay策略配合使用,提升Fine-tuning的稳定性。 - 新增学习率warmup功能,支持与不同的学习率Decay策略配合使用,提升Fine-tuning的稳定性。
- 支持对标注图使用伪彩色图像格式的保存,提升标注图片的预览体验。 - 支持对标注图使用伪彩色图像格式的保存,提升标注图片的预览体验。
- Marked imaged can be saved in pseudo-color image format to improve their preview experience.• Optimizes the logic of documents. Provides AIStudio practical tutorial on industrial scenarios such as industrial quality inspection and fundus screening.
- 新增自动保存mIoU最优模型的功能。 - 新增自动保存mIoU最优模型的功能。
- 全面优化文档逻辑,提供如工业质检、眼底筛查等工业场景的AIStudio实战教程。 - 全面优化文档逻辑,提供如工业质检、眼底筛查等工业场景的AIStudio实战教程。
- [ElasticRec](https://github.com/PaddlePaddle/ElasticRec) - [ElasticRec](https://github.com/PaddlePaddle/ElasticRec)
...@@ -276,9 +280,7 @@ ...@@ -276,9 +280,7 @@
- 新增RoBERTa中文预训练模型 - 新增RoBERTa中文预训练模型
- 联邦学习[PaddleFL](https://github.com/PaddlePaddle/PaddleFL) - 联邦学习[PaddleFL](https://github.com/PaddlePaddle/PaddleFL)
- 新增scheduler与submitter功能:scheduler可用于在训练过程中控制trainer是否参加更新 。submitter可用于完成在MPI集群提交paddleFL任务的功能 - 新增scheduler与submitter功能:scheduler可用于在训练过程中控制trainer是否参加更新 。submitter可用于完成在MPI集群提交paddleFL任务的功能
- The scheduler and submitter functions are added: The scheduler is used to control whether the trainer participates in update during training. The submitter is used to complete the function of submitting paddleFL tasks in the MPI clus– Supports the models NeurIPS2019, which is the reforcement learning challenge champion modelReleases the version v1.1:
- 新增LEAF dataset联邦学习公开数据集,并添加api,用于设置benchmark。支持图像分类,情感分析,字符预测等领域的经典数据集,如MNIST,Sentiment140 - 新增LEAF dataset联邦学习公开数据集,并添加api,用于设置benchmark。支持图像分类,情感分析,字符预测等领域的经典数据集,如MNIST,Sentiment140
- A LEAF dataset federated learning open dataset is added. An API is added to set a benchmark. Classical datasets in the image classification, emotion analysis, character inference, and other fields , such as MNIST and Sentiment140, are supported.– Releases a garaph solution called PGL-Rec and a knowledge graph embedding algorithm set called PGL-KE.– Releases a high-order API of PGL.
- 根据新增组件,在example中修改了原有的样例,并添加了femnist_demo, submitter_demo样例 - 根据新增组件,在example中修改了原有的样例,并添加了femnist_demo, submitter_demo样例
- 优化fl_distribute_transpiler,使FedAvg strategy新增对adam optimizer支持; - 优化fl_distribute_transpiler,使FedAvg strategy新增对adam optimizer支持;
- 新增SecAgg strategy(Secure Aggregation),用于实现安全的参数聚合; - 新增SecAgg strategy(Secure Aggregation),用于实现安全的参数聚合;
......
...@@ -79,7 +79,7 @@ This version focuses on enhancement of the framework functions, includes improvi ...@@ -79,7 +79,7 @@ This version focuses on enhancement of the framework functions, includes improvi
- Optimize the `RecomputeOptimizer` to enable bigger batchsize. The batchsize of Bert-large model increases by 533.62% while using the `RecomputeOptimizer`. - Optimize the `RecomputeOptimizer` to enable bigger batchsize. The batchsize of Bert-large model increases by 533.62% while using the `RecomputeOptimizer`.
- OP Performance Optimization - OP Performance Optimization
- Implements the fusion operator called `fuse_emb_seq_pool` of `embedding` and `sequence_pool`. Optimizes the `murmurhash3_x64_128` in `bloom_filter`. These optimization increases the training speed of some NLP models. - Implements the fusion operator called `fuse_emb_seq_pool` of `embedding` and `sequence_pool`. Optimizes the `murmurhash3_x64_128` in `bloom_filter`. These optimization increases the training speed of some NLP models.
- Optimizes the GPU performance of `mean op`. When a data of 3232 8 *8 tensor is input, the forward calculation speed is increased by 2.7 times. - Optimizes the GPU performance of `mean op`. When a data of 32 *32 *8 *8 tensor is input, the forward calculation speed is increased by 2.7 times.
- Optimizes OPs of `assign` and `lod_reset`, to avoid nnecessary GPU memory copy and data transform. - Optimizes OPs of `assign` and `lod_reset`, to avoid nnecessary GPU memory copy and data transform.
- Optimizes the kernel implementation of stack OP. The performance of a single card of GPU in the XLnet/Ernie model is improved by 4.1%. - Optimizes the kernel implementation of stack OP. The performance of a single card of GPU in the XLnet/Ernie model is improved by 4.1%.
- Dynamic Graph - Dynamic Graph
...@@ -98,6 +98,7 @@ This version focuses on enhancement of the framework functions, includes improvi ...@@ -98,6 +98,7 @@ This version focuses on enhancement of the framework functions, includes improvi
- Optimizes asynchronous DataLoader of the dynamic graph. For the Mnist, ResNet and other CV models , the single card training speed is improved by more than 40% on the P40 machine. - Optimizes asynchronous DataLoader of the dynamic graph. For the Mnist, ResNet and other CV models , the single card training speed is improved by more than 40% on the P40 machine.
- Adds numpy bridge function, to support sharing the underlying data between Tensor and ndarray in CPU mode. This can avoid the copy problem of numpy input when creating variables, and improve efficiency. - Adds numpy bridge function, to support sharing the underlying data between Tensor and ndarray in CPU mode. This can avoid the copy problem of numpy input when creating variables, and improve efficiency.
- Optimizes the GPU memory by the forward variable space strategy, which can delete the Tensor Buffer not required in reverse calculation in advance. The maximum batch size is increased by more than 20%-30% in some models such as ResNet. - Optimizes the GPU memory by the forward variable space strategy, which can delete the Tensor Buffer not required in reverse calculation in advance. The maximum batch size is increased by more than 20%-30% in some models such as ResNet.
- To reduce the performance impact caused by adding extra `scale_op` to update the beta parameter in `AdamOptimizer`. Iintegrate the updating logic of `beta` into `adam_op` to reduce the cost of calling op kernel. The performance of is improved by 9.67% on the P40 machine.
- Dynamic Graph Deployment - Dynamic Graph Deployment
- Supports the `TracedLayer` interface to convert the dynamic graph model into the static graph. - Supports the `TracedLayer` interface to convert the dynamic graph model into the static graph.
- Debugging Analysis - Debugging Analysis
...@@ -259,6 +260,7 @@ This version focuses on enhancement of the framework functions, includes improvi ...@@ -259,6 +260,7 @@ This version focuses on enhancement of the framework functions, includes improvi
- Adds a learning rate function called warmup. Supports using with different learning rate decay strategies to improve fine-tuning stability. - Adds a learning rate function called warmup. Supports using with different learning rate decay strategies to improve fine-tuning stability.
- Adds the function of automatically saving an optimal mIoU model. - Adds the function of automatically saving an optimal mIoU model.
- The document logic is comprehensively optimized. An AIStudio practical tutorial on industrial scenarios such as industrial quality inspection and fundus screening is provided. - The document logic is comprehensively optimized. An AIStudio practical tutorial on industrial scenarios such as industrial quality inspection and fundus screening is provided.
- Marked imaged can be saved in pseudo-color image format to improve their preview experience.
- [ElasticRec](https://github.com/PaddlePaddle/ElasticRec) - [ElasticRec](https://github.com/PaddlePaddle/ElasticRec)
- An ElasticRec recommended sorting system is released. It is deployed through K8S. Streaming training and online inference service are supported. - An ElasticRec recommended sorting system is released. It is deployed through K8S. Streaming training and online inference service are supported.
...@@ -306,6 +308,8 @@ This version focuses on enhancement of the framework functions, includes improvi ...@@ -306,6 +308,8 @@ This version focuses on enhancement of the framework functions, includes improvi
- According to the added components, the original samples are modified in example and the femnist_demo and submitter_demo examples are added - According to the added components, the original samples are modified in example and the femnist_demo and submitter_demo examples are added
- Fl_distribute_transpiler is optimized to add the support of FedAvg strategy for the adam optimizer. - Fl_distribute_transpiler is optimized to add the support of FedAvg strategy for the adam optimizer.
- SecAgg strategy (Secure Aggregation) is added to achieve secure parameter aggregation. - SecAgg strategy (Secure Aggregation) is added to achieve secure parameter aggregation.
- The scheduler and submitter functions are added: The scheduler is used to control whether the trainer participates in update during training. The submitter is used to complete the function of submitting paddleFL tasks in the MPI clus
- A LEAF dataset federated learning open dataset is added. An API is added to set a benchmark. Classical datasets in the image classification, emotion analysis, character inference, and other fields , such as MNIST and Sentiment140, are supported.
- Deep Reinforcement Learning Framework [PARL](https://github.com/PaddlePaddle/PARL) - Deep Reinforcement Learning Framework [PARL](https://github.com/PaddlePaddle/PARL)
- Version v1.3 is released. - Version v1.3 is released.
- The support for the Multi-Agent RL algorithm including MADDPG is added. - The support for the Multi-Agent RL algorithm including MADDPG is added.
...@@ -314,7 +318,7 @@ This version focuses on enhancement of the framework functions, includes improvi ...@@ -314,7 +318,7 @@ This version focuses on enhancement of the framework functions, includes improvi
- Implementation and training solution for the open source NeurIPS2019 reforcement learning challenge champion model. Trained models are open (Consideration can be given to open class) - Implementation and training solution for the open source NeurIPS2019 reforcement learning challenge champion model. Trained models are open (Consideration can be given to open class)
- Paddle Graph Learning Framework [PGL](https://github.com/PaddlePaddle/PGL) - Paddle Graph Learning Framework [PGL](https://github.com/PaddlePaddle/PGL)
- Version v1.1 is released: - Version v1.1 is released:
- The support for the authoritative graph learning database OGB is added. Three types of tasks including nodepropered, linkpred, and graphpropered are fully supported. A SOTA baseline is released.C Decouples the forecast library from third_party. Refactors 28 third-party-dependent compilation codes to facilitate the unified management of external dependencies. - The support for the authoritative graph learning database OGB is added. Three types of tasks including nodepropered, linkpred, and graphpropered are fully supported. A SOTA baseline is released.C Decouples the forecast library from third_party. Refactors 28 third-party-dependent compilation codes to facilitate the unified management of external dependencies.
- A graph solution PGL-Rec and a knowledge graph embedding algorithm set PGL-KE are released. - A graph solution PGL-Rec and a knowledge graph embedding algorithm set PGL-KE are released.
- An improvement on ease of use is made. A high-order API of PGL is released. - An improvement on ease of use is made. A high-order API of PGL is released.
- Other upgrade points: Sampling of a multi-process graph is optimized and a GraphSAGE kind of models is accelerated by three times. Lod Tensor-based Graph Batch and Graph Pooling operators are added. Models including distributed heterogeneous task graph algorithm, GraphZoom, and PinSage are added for Model Zoo. - Other upgrade points: Sampling of a multi-process graph is optimized and a GraphSAGE kind of models is accelerated by three times. Lod Tensor-based Graph Batch and Graph Pooling operators are added. Models including distributed heterogeneous task graph algorithm, GraphZoom, and PinSage are added for Model Zoo.
...@@ -322,7 +326,7 @@ This version focuses on enhancement of the framework functions, includes improvi ...@@ -322,7 +326,7 @@ This version focuses on enhancement of the framework functions, includes improvi
## Code Reconstruction and Upgrade ## Code Reconstruction and Upgrade
- Compilation - Compilation
- A compilation thus improving the code quality. - A compilation thus improving the code quality.
C Fixes the codes corresponding to the warnings of -Wno-error=sign-compare (at a total of more than 100 points). An error will be reported for all subsequent warnings of this kind during compilation, option WITH_NCCL is added. Single-card users can display and specify WITH_NCCL=OFF to accelerate compilation. C Fixes the codes corresponding to the warnings of -Wno-error=sign-compare (at a total of more than 100 points). An error will be reported for all subsequent warnings of this kind during compilation, option WITH_NCCL is added. Single-card users can display and specify WITH_NCCL=OFF to accelerate compilation.
- A compilation option WITH_TP_CACHE is added to cache third-party source codes to avoid repeated downloading. Windows users can set it to ON to speed up compilation and improve compilation stability. - A compilation option WITH_TP_CACHE is added to cache third-party source codes to avoid repeated downloading. Windows users can set it to ON to speed up compilation and improve compilation stability.
- The `CUDA_ARCH_NAME` default value is set to `Auto` (`All` indicates compiling all GPU architectures and `Auto` indicates compiling only the current machine GPU architecture). For developers, a lot of compilation time is saved using `Auto` than using `All`, thus improving development efficiency. - The `CUDA_ARCH_NAME` default value is set to `Auto` (`All` indicates compiling all GPU architectures and `Auto` indicates compiling only the current machine GPU architecture). For developers, a lot of compilation time is saved using `Auto` than using `All`, thus improving development efficiency.
- Redundant links and products and needless file copying are reduced, thus speeding up the compilation in Windows. - Redundant links and products and needless file copying are reduced, thus speeding up the compilation in Windows.
......
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册