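**Training framework**: Adds the automatic mixed precision (AMP) training interface and a new control flow interface; optimizes Tensor usage and the GPU memory allocation strategy; adds support for the Nvidia DALI GPU data preprocessing library; continues to optimize the functionality and performance of basic OPs; further improves dynamic graph functionality and greatly improves its performance, and provides the ability to convert data-independent dynamic graph models into static graph models that can be deployed for inference; comprehensively improves the framework's debugging and analysis functions and ease of use.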
- To reduce the performance impact of the extra `scale_op` used to update the beta parameter in `AdamOptimizer`, the `beta` update logic is integrated into `adam_op`, which reduces the cost of calling op kernels. Performance is improved by 9.67% on the P40 machine.
- Marked images can be saved in pseudo-color image format to improve their preview experience.
- Optimizes the logic of the documents. Provides an AIStudio practical tutorial on industrial scenarios such as industrial quality inspection and fundus screening.
- The scheduler and submitter functions are added: the scheduler controls whether a trainer participates in updates during training, and the submitter submits PaddleFL tasks to the MPI cluster.
- Supports the NeurIPS2019 reinforcement learning challenge champion model.
- Releases version v1.1:
- A LEAF federated learning open dataset is added. An API is added to set a benchmark. Classical datasets in image classification, sentiment analysis, character inference, and other fields, such as MNIST and Sentiment140, are supported.
- Releases a graph solution called PGL-Rec and a knowledge graph embedding algorithm set called PGL-KE.
- Releases a high-order API of PGL.
@@ -79,7 +79,7 @@ This version focuses on enhancement of the framework functions, includes improvi
- Optimizes `RecomputeOptimizer` to enable larger batch sizes. With `RecomputeOptimizer`, the batch size of the Bert-large model increases by 533.62%.
- OP Performance Optimization
- Implements `fuse_emb_seq_pool`, a fusion operator of `embedding` and `sequence_pool`, and optimizes `murmurhash3_x64_128` in `bloom_filter`. These optimizations increase the training speed of some NLP models.
- Optimizes the GPU performance of `mean op`. For an input tensor of shape `32 * 32 * 8 * 8`, the forward calculation speed is increased by 2.7 times.
- Optimizes the `assign` and `lod_reset` OPs to avoid unnecessary GPU memory copies and data transforms.
- Optimizes the kernel implementation of the stack OP. Single-GPU performance of the XLnet/Ernie model is improved by 4.1%.
- Dynamic Graph
...
@@ -98,6 +98,7 @@ This version focuses on enhancement of the framework functions, includes improvi
- Optimizes the asynchronous DataLoader of the dynamic graph. For Mnist, ResNet, and other CV models, single-card training speed is improved by more than 40% on the P40 machine.
- Adds a numpy bridge function that lets Tensor and ndarray share the underlying data in CPU mode. This avoids copying numpy input when creating variables and improves efficiency.
- Optimizes GPU memory with the forward variable space strategy, which frees in advance the Tensor Buffers that the backward calculation does not need. The maximum batch size is increased by 20%-30% or more in some models such as ResNet.
- To reduce the performance impact of the extra `scale_op` used to update the beta parameter in `AdamOptimizer`, the `beta` update logic is integrated into `adam_op`, which reduces the cost of calling op kernels. Performance is improved by 9.67% on the P40 machine.
- Dynamic Graph Deployment
- Supports the `TracedLayer` interface to convert a dynamic graph model into a static graph.
- Debugging Analysis
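
As an illustration of the `AdamOptimizer` item in the list above, here is a minimal pure-Python sketch of an Adam step that advances the `beta` powers inside the update itself rather than in a separate scaling step. It is not Paddle's `adam_op` kernel: the names `new_state` and `adam_step` and the layout of the `state` dictionary are hypothetical, and the update is shown for a single scalar parameter.

```python
import math


def new_state(beta1=0.9, beta2=0.999):
    # Hypothetical state layout: beta1_pow / beta2_pow hold beta**t
    # for the step that is about to run (t starts at 1).
    return {"m": 0.0, "v": 0.0, "beta1_pow": beta1, "beta2_pow": beta2}


def adam_step(param, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one scalar Adam update and advance the beta powers in place."""
    # First and second moment estimates.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad * grad
    # Bias-corrected step size for the current step.
    lr_t = lr * math.sqrt(1.0 - state["beta2_pow"]) / (1.0 - state["beta1_pow"])
    new_param = param - lr_t * state["m"] / (math.sqrt(state["v"]) + eps)
    # Fused part: the beta powers are advanced here, inside the same call,
    # so no extra scaling operation has to be scheduled per step.
    state["beta1_pow"] *= beta1
    state["beta2_pow"] *= beta2
    return new_param


state = new_state()
updated = adam_step(1.0, 1.0, state)
```

Keeping the power update inside the step means one call per parameter per iteration instead of three, which is the kind of launch overhead the release note refers to.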
...
@@ -259,6 +260,7 @@ This version focuses on enhancement of the framework functions, includes improvi
- Adds a warmup learning rate function that can be combined with different learning rate decay strategies to improve fine-tuning stability.
- Adds automatic saving of the model with the best mIoU.
- The document logic is comprehensively optimized. An AIStudio practical tutorial on industrial scenarios such as industrial quality inspection and fundus screening is provided.
- Marked images can be saved in pseudo-color image format to improve their preview experience.
- ElasticRec, a recommendation ranking system, is released. It is deployed through K8S and supports streaming training and online inference services.
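
The warmup item in the list above combines a warmup phase with an arbitrary decay strategy. A minimal sketch of that idea in plain Python follows; the helper names `exponential_decay` and `with_warmup` are hypothetical, and restarting the decay schedule at step 0 once warmup ends is an assumption of this sketch, not a statement about how the released function behaves.

```python
def exponential_decay(base_lr, decay_rate, decay_steps):
    """Return a schedule: base_lr * decay_rate ** (step / decay_steps)."""
    def schedule(step):
        return base_lr * decay_rate ** (step / decay_steps)
    return schedule


def with_warmup(schedule, warmup_steps, start_lr, end_lr):
    """Wrap any decay schedule with a linear warmup phase."""
    def warmed(step):
        if step < warmup_steps:
            # Linear ramp from start_lr towards end_lr.
            return start_lr + (end_lr - start_lr) * step / warmup_steps
        # Assumption: the decay schedule starts counting after warmup.
        return schedule(step - warmup_steps)
    return warmed


lr = with_warmup(exponential_decay(0.1, 0.5, 100), warmup_steps=10,
                 start_lr=0.0, end_lr=0.1)
rates = [lr(step) for step in (0, 5, 10, 110)]
```

Because the wrapper only needs a function from step to learning rate, the same warmup can be placed in front of any other decay strategy.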
...
@@ -306,6 +308,8 @@ This version focuses on enhancement of the framework functions, includes improvi
- To match the added components, the original samples in example are modified and the femnist_demo and submitter_demo examples are added.
- Fl_distribute_transpiler is optimized to support the FedAvg strategy for the adam optimizer.
- The SecAgg (Secure Aggregation) strategy is added to achieve secure parameter aggregation.
- The scheduler and submitter functions are added: the scheduler controls whether a trainer participates in updates during training, and the submitter submits PaddleFL tasks to the MPI cluster.
- A LEAF federated learning open dataset is added. An API is added to set a benchmark. Classical datasets in image classification, sentiment analysis, character inference, and other fields, such as MNIST and Sentiment140, are supported.
- Deep Reinforcement Learning Framework [PARL](https://github.com/PaddlePaddle/PARL)
- Version v1.3 is released.
- Support for Multi-Agent RL algorithms, including MADDPG, is added.
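
For readers unfamiliar with the FedAvg strategy mentioned in the list above, the core aggregation step can be sketched as a sample-weighted average of the parameters reported by each trainer. This is a generic illustration in plain Python, not PaddleFL code: the function name `fed_avg` and the dictionary-of-scalars parameter format are hypothetical.

```python
def fed_avg(client_params, client_sizes):
    """Average client parameters, weighting each client by its sample count."""
    if len(client_params) != len(client_sizes):
        raise ValueError("one sample count is required per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    averaged = {}
    for name in client_params[0]:
        weighted_sum = sum(params[name] * size
                           for params, size in zip(client_params, client_sizes))
        averaged[name] = weighted_sum / total
    return averaged


# Two hypothetical trainers holding 1 and 3 local samples respectively.
global_params = fed_avg(
    [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}],
    [1, 3],
)
```

With an optimizer such as adam, the same averaging would also have to cover whatever optimizer state the strategy chooses to synchronize; the sketch only covers model parameters.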
...
@@ -314,7 +318,7 @@ This version focuses on enhancement of the framework functions, includes improvi
- The implementation and training solution of the NeurIPS2019 reinforcement learning challenge champion model are open-sourced. Trained models are open (Consideration can be given to open class)
- Support for the authoritative graph learning benchmark dataset OGB is added. Three types of tasks, nodeproppred, linkproppred, and graphproppred, are fully supported. A SOTA baseline is released.
- Decouples the inference library from third_party. Refactors the compilation code of 28 third-party dependencies to facilitate unified management of external dependencies.
- A graph solution PGL-Rec and a knowledge graph embedding algorithm set PGL-KE are released.
- Ease of use is improved. A high-order API of PGL is released.
- Other upgrade points: multi-process graph sampling is optimized, and GraphSAGE-type models are accelerated by three times. Lod Tensor-based Graph Batch and Graph Pooling operators are added. Models including a distributed heterogeneous task graph algorithm, GraphZoom, and PinSage are added to the Model Zoo.
...
@@ -322,7 +326,7 @@ This version focuses on enhancement of the framework functions, includes improvi
## Code Reconstruction and Upgrade
- Compilation
- Fixes the code that triggered warnings under `-Wno-error=sign-compare` (more than 100 places in total). All subsequent warnings of this kind are reported as errors during compilation, thus improving the code quality.
- A compilation option WITH_NCCL is added. Single-card users can explicitly specify WITH_NCCL=OFF to accelerate compilation.
- A compilation option WITH_TP_CACHE is added to cache third-party source code and avoid repeated downloading. Windows users can set it to ON to speed up compilation and improve compilation stability.
- The default value of `CUDA_ARCH_NAME` is set to `Auto` (`All` compiles for all GPU architectures, and `Auto` compiles only for the GPU architecture of the current machine). For developers, `Auto` saves a lot of compilation time compared with `All`, thus improving development efficiency.
- Redundant links and products and needless file copying are reduced, thus speeding up compilation on Windows.
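
The compilation options listed above are passed to CMake as `-D` definitions. The small Python helper below only assembles such a command line so the combination of options is visible in one place; the helper name `build_cmake_args`, its parameters, and the `..` source directory are assumptions for illustration, and the helper does not run CMake.

```python
def build_cmake_args(with_nccl=False, tp_cache=True, cuda_arch="Auto",
                     source_dir=".."):
    """Assemble a cmake command line from the options named in the notes."""
    def on_off(flag):
        return "ON" if flag else "OFF"

    return [
        "cmake", source_dir,
        # Single-card users can turn NCCL off to shorten the build.
        "-DWITH_NCCL=" + on_off(with_nccl),
        # Cache third-party sources to avoid repeated downloads.
        "-DWITH_TP_CACHE=" + on_off(tp_cache),
        # "Auto" targets only the GPU architecture of the current machine.
        "-DCUDA_ARCH_NAME=" + cuda_arch,
    ]


args = build_cmake_args()
command_line = " ".join(args)
```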