make inference_lib_dist -j4
3. Test with samples
Please refer to samples on https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/performance_improving/inference_improving/paddle_tensorrt_infer.html#id2
This version focuses on enhancement of the framework functions, including improving the inference deployment capability, releasing PLSC for super-large-scale classification training tasks, and optimizing the parameter server mode. In addition, the compilation options, compilation dependencies, and code library are fully cleaned up and optimized. The model library is optimized by adjusting its structure and adding dynamic graph models. The development kits and utility components are upgraded.
### Training Framework
- Adds AMP (Automatic Mixed Precision) interfaces and control flow interfaces.
- Optimizes tensor usage and the GPU memory allocation strategy.
- Supports the Nvidia DALI GPU data preprocessing library.
- Optimizes the functions and performance of basic OPs.
- Enhances dynamic graph functionality, including performance improvements and new APIs that convert data-independent dynamic graph models into static graph models.
- Improves the debugging functions and the overall ease of use of the framework.
### Inference Deployment
- Paddle Serving
  - Optimizes the Python API.
  - Supports APIs for new programming languages, such as R and Go.
  - Enhances the quantization capability.
- Paddle Lite
  - Supports deploying models generated by the post-training quantization method without calibration data.
  - Enhances the OpenCL capability.
  - Supports Kunlun XPU.
- Paddle Slim
  - Optimizes the pruning, quantization, distillation, and NAS (Network Architecture Search) APIs to adapt to the model library.
  - Supports Pantheon, a large-scale knowledge distillation framework.
### Distributed Training
- Unifies the implementations of the semi-asynchronous, fully asynchronous, and GEO modes in parameter server mode. The back end is unified into the communicator, the front-end interface is unified into fleet, and different modes are selected by configuring the fleet strategy.
- Releases PLSC for super-large-scale classification training tasks.
### Model Construction
- Releases a text-to-speech model library called Parakeet, including several leading-edge text-to-speech algorithms.
- Adds 14 image classification pre-training models in PaddleCV and enriches the 3D and tracking models.
- Supports Jieba word segmentation in PaddleNLP.
- Adds more dynamic graph models.
- Adjusts and optimizes the structure of the model library.
### Development Kits
- Optimizes PaddleDetection and PaddleSeg by adding a large number of models and pre-training models, enhancing the training speed and accuracy of typical models, and strengthening model compression and deployment capabilities.
- Releases a recommendation ranking system called ElasticRec, which can be deployed via K8S and supports streaming training and online inference services.
### Utility Components
- Adds 52 pre-training models, enriching the total to more than 100, and improves the functional experience.
- Upgrades the kernel of PALM, opens APIs, and supports more task types.
- Adds an open dataset in PaddleFL (federated learning framework).
- Upgrades the versions of PARL (deep reinforcement learning framework) and PGL (graph learning framework), opening more algorithms and supporting more functions.
## Training Framework
- API
  - Adds AMP (Automatic Mixed Precision) APIs, which can convert a network's training mode into mixed precision mode in a general way while keeping the accuracy fluctuation within the normal range (see the sketches after this API list).
  - Adds control flow OPs such as while_loop, cond, case, and switch_case. The new APIs are recommended because they are much easier to use. The following functions are supported:
    - Supports using Python callables as the control condition or executed objects.
    - Supports using different losses or optimizers in different branches of the control flow.
    - Supports using CPU data or GPU data in the condition of the control flow.
  - Supports using variable lists as parameters for some APIs that previously supported only string lists as `parameter_list` or `no_grad_set`. There is no need to obtain the `name` attribute of variables when using the following APIs:
    - The `minimize` methods of optimizers, such as Adam's: `minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)`
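The two sketches below are minimal, hedged illustrations of these APIs in static-graph mode; the tiny FC network, shapes, and constants are placeholders rather than recommended settings. The first wraps an optimizer with the AMP decorator:

```python
import paddle.fluid as fluid

# A minimal AMP sketch: decorate an optimizer so the network trains in
# mixed precision while loss scaling keeps the accuracy fluctuation small.
data = fluid.data(name='x', shape=[None, 784], dtype='float32')
label = fluid.data(name='y', shape=[None, 1], dtype='int64')
pred = fluid.layers.fc(input=data, size=10, act='softmax')
loss = fluid.layers.mean(fluid.layers.cross_entropy(input=pred, label=label))

optimizer = fluid.optimizer.SGD(learning_rate=0.001)
optimizer = fluid.contrib.mixed_precision.decorate(optimizer)  # enable AMP
optimizer.minimize(loss)
```

The second shows the new `cond` control flow OP, where the condition is a tensor computed in the graph and each branch is a Python callable:

```python
import paddle.fluid as fluid

# A minimal cond sketch: pick between two branches at run time.
def true_fn():
    return fluid.layers.fill_constant(shape=[1], dtype='float32', value=1.0)

def false_fn():
    return fluid.layers.fill_constant(shape=[1], dtype='float32', value=0.0)

x = fluid.layers.fill_constant(shape=[1], dtype='float32', value=0.3)
y = fluid.layers.fill_constant(shape=[1], dtype='float32', value=0.5)
out = fluid.layers.cond(fluid.layers.less_than(x, y), true_fn, false_fn)
```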
- Basic Function Optimization
  - Supports configuring tensor data with the numpy float16 data type, with no need to convert it to uint16 first.
  - Supports using the minus sign to express a tensor's negation.
  - GPU Memory Allocation Strategy:
    - Changes the default policy to `AutoGrowth`, in which GPU memory is allocated on demand without affecting training speed. The previous pre-allocation strategy made it difficult to start another task on the same GPU; this change avoids that problem.
    - Adjusts the GPU memory allocation for multi-card tasks: the GPU memory allocators on different GPU cards are set to the `Lazy` initialization mode, so that no GPU memory is allocated on a card that is not used. Previously, running a task on idle GPU cards without setting CUDA_VISIBLE_DEVICES could cause OOM when GPU memory was occupied on other cards; this change avoids that problem.
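As a hedged illustration of the strategies above, the allocator behavior can be steered through environment flags set before paddle is imported; the flag names follow Paddle's `FLAGS_*` convention and the commented value is illustrative:

```python
import os

# 'auto_growth' is the new on-demand default described above; the old
# pre-allocation behavior was governed by the fraction flag below.
os.environ['FLAGS_allocator_strategy'] = 'auto_growth'
# os.environ['FLAGS_fraction_of_gpu_memory_to_use'] = '0.92'

import paddle.fluid as fluid  # flags take effect at import time
```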
- OP Function Upgrade
  - elu: This activation function supports the calculation of second-order gradients.
  - prroi_pool: The parameter `rois` supports the `Tensor` or `LoDTensor` type.
  - conv2d, pool2d, batch_norm, lrn: Support using the MKL-DNN library to perform gradient calculation of these OPs.
  - argsort: Supports descending order. A new parameter `descending` is added, with the default value `False` (see the sketch after this list).
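A minimal dygraph sketch of the new `descending` parameter; the input values are illustrative only:

```python
import numpy as np
import paddle.fluid as fluid

# argsort returns both the sorted values and the corresponding indices.
with fluid.dygraph.guard():
    x = fluid.dygraph.to_variable(
        np.array([[5., 8., 9.], [0., 0., 1.]], dtype='float32'))
    out, indices = fluid.layers.argsort(x, axis=-1, descending=True)
    print(out.numpy())  # [[9. 8. 5.], [1. 0. 0.]]
```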
- Basic Performance Optimization
  - DALI Preprocessing Acceleration
    - Supports the Nvidia DALI GPU data preprocessing library, which can be used to accelerate the preprocessing of data such as images, videos, and speech.
  - Automatic Mixed Precision Training Optimization
    - Implements the following optimization strategies to increase the training throughput of the ResNet50 model, along with the DALI data preprocessing module. The mixed precision training throughput of a single V100 card increases from 600+ images/s to 1,000+ images/s; the throughput of 8 cards on a single machine increases to 7,840 images/s; and the throughput of 32 cards on 4 machines increases to 28,594 images/s.
      - Supports the NHWC data layout for inputs of some OPs such as batch_norm and conv2d, to accelerate fp16 calculation with Tensor Core technology.
      - Fuses some op patterns in the model, such as batch_norm and relu, based on the IR Pass mechanism.
      - Optimizes the kernels of some elementwise OPs, such as add and mul.
  - Optimizes `RecomputeOptimizer` to enable a bigger batch size. The batch size of the Bert-large model increases by 533.62% when using `RecomputeOptimizer`.
  - OP Performance Optimization
    - Implements a fusion operator called `fuse_emb_seq_pool` for `embedding` and `sequence_pool`, and optimizes `murmurhash3_x64_128` in `bloom_filter`. These optimizations increase the training speed of some NLP models.
    - Optimizes the GPU performance of the `mean` op. When a 32*32*8*8 tensor is input, the forward calculation speed increases by 2.7 times.
    - Optimizes the `assign` and `lod_reset` OPs to avoid unnecessary GPU memory copies and data transforms.
    - Optimizes the kernel implementation of the stack OP. The single-GPU performance of the XLNet/Ernie model improves by 4.1%.
- Dynamic Graph
  - Function Optimization
    - Removes the `name_scope` parameter in `Layers` to make it easier to inherit and call.
    - Removes the `block` parameter in `to_variable` to simplify the use of the API.
    - Removes the `build_once` design, which addressed the problem of model parameters depending on data, so that `Layers` can get all the parameter tables once `__init__` finishes executing. This is convenient for saving and loading, parameter initialization, parameter debugging, and parameter optimization.
    - Optimizes the automatic pruning function to facilitate network construction and reduce the backward computation.
    - Supports the `SelectedRows` operation so that the Embedding layer supports sparse updates on a single card.
    - Adds container classes such as ParameterList, LayerList, and Sequential to address the framework's lack of containers, making network construction more convenient.
    - Supports functions such as named_sublayers and named_parameters to facilitate programming.
    - Supports the `Linear lr warmup decay` strategy.
  - Performance Optimization
    - Optimizes the interaction of Python with C++, GradMaker, OperatorBase, and the allocator. The performance improves by 270% for the LSTM-based language model task on the P40 machine.
    - Removes redundant code to fix the performance problems caused by calling dead code in `optimized_guard`. The performance of optimizers such as SGD and Adam improves by 5% to 8% for the Transformer model (batch_size=64) on the P40 machine.
    - To reduce the performance impact of the extra `scale_op` added to update the beta parameter in `AdamOptimizer`, the updating logic of `beta` is integrated into `adam_op` to reduce the cost of calling the op kernel. The performance improves by 9.67% on the P40 machine.
    - Optimizes the asynchronous DataLoader of the dynamic graph. For Mnist, ResNet, and other CV models, the single-card training speed improves by more than 40% on the P40 machine.
    - Adds a numpy bridge function to support sharing the underlying data between Tensor and ndarray in CPU mode. This avoids copying numpy inputs when creating variables and improves efficiency (see the sketch after this list).
    - Optimizes GPU memory usage with a forward variable space strategy, which deletes in advance the Tensor buffers not required by the backward computation. The maximum batch size increases by 20%-30% in some models such as ResNet.
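A hedged sketch of the numpy bridge mentioned above; the `zero_copy` parameter of `to_variable` is assumed from the 1.7 dygraph API, and the array contents are illustrative:

```python
import numpy as np
import paddle.fluid as fluid

# In CPU mode, to_variable can share the ndarray's underlying buffer
# instead of copying it, so variable creation avoids the copy cost.
with fluid.dygraph.guard(fluid.CPUPlace()):
    data = np.ones([2, 3], dtype='float32')
    var = fluid.dygraph.to_variable(data, zero_copy=True)
```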
  - Dynamic Graph Deployment
    - Supports the `TracedLayer` interface to convert a dynamic graph model into a static graph model (see the sketch below).
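A minimal sketch of tracing a data-independent dygraph model into a deployable static graph; the Linear layer, shapes, and save path are illustrative only:

```python
import numpy as np
import paddle.fluid as fluid
from paddle.fluid.dygraph import TracedLayer

# Trace a dygraph layer with an example input, then save the resulting
# static graph program as an inference model.
with fluid.dygraph.guard():
    fc = fluid.dygraph.Linear(10, 3)
    x = fluid.dygraph.to_variable(np.random.rand(4, 10).astype('float32'))
    out, static_layer = TracedLayer.trace(fc, inputs=[x])
    static_layer.save_inference_model('./traced_fc')
```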
- Debugging Analysis
  - Optimizes the error messages: classifies the framework error messages and refines the message descriptions so that problems can be solved more conveniently based on the messages.
  - Optimizes the performance analysis profile function.
    - Enhances the functionality and accuracy of the profiler. Supports profile options at different levels. The call relations of events can be recorded in the profile data and printed.
  - Optimizes the checking and debugging function for `nan inf`, which is enabled through `FLAGS_check_nan_inf` (see the sketch after this list). The performance, functionality, and output information are all greatly improved.
    - In terms of speed, the ResNet50 model tested on V100 shows a performance improvement of about 1000 times over the original utility, while maintaining over 80% of normal training efficiency.
    - In terms of functionality, support for fp16 is added, and environment variables can be set to skip the inspection of op, op_role, and op_var to facilitate debugging fp16 models.
    - The output information is detailed and accurate: besides the wrong op and tensor names, the counts of nan, inf, and normal numerical values are printed to facilitate debugging.
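A hedged sketch of enabling the check named above (the skip-list environment variables are not shown because their exact names are not given here):

```python
import os

# Enable nan/inf checking on op outputs; training aborts with a detailed
# report (op name, tensor name, value counts) when nan/inf is found.
os.environ['FLAGS_check_nan_inf'] = '1'

import paddle.fluid as fluid  # the flag takes effect at import time
```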
- Releases the lightweight installation package `paddlepaddle-tiny` for CPU training and inference, supporting installation on Windows/Linux/Mac OS with python2.7/python3.5/python3.6/python3.7.
  - Compiled with the following options: no avx, no ml, no gpu, no unittest.
  - Removes the slim module and some datasets.
  - Reduces the Linux package size from 90M to 37M, the Windows package size from 50.8M to 9.6M, and the Mac package size from 59M to 19.8M.
  - Reduces the number of installation dependencies from 15 to 7.
## Inference Deployment
- Server-side Inference Library
  - Python API
    - Supports reading and writing models from memory to meet model encryption requirements.
    - The Scale operator is no longer added at the end of the inference model.
    - Adds the ZeroCopy API, which is basically the same as the C++ API and supports numpy.ndarray as input and output. It is convenient for Python scenarios (see the sketch after this list).
    - Adds several interfaces in AnalysisConfig to fully cover the C++ inference functions, including removing a pass and disabling inference glog.
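A minimal sketch of the Python ZeroCopy workflow; the model/params paths and the input shape are placeholders:

```python
import numpy as np
from paddle.fluid.core import AnalysisConfig, create_paddle_predictor

# Feed/fetch ops must be disabled for the ZeroCopy run mode; data is then
# exchanged directly as numpy.ndarray.
config = AnalysisConfig('model', 'params')  # placeholder file paths
config.switch_use_feed_fetch_ops(False)
predictor = create_paddle_predictor(config)

input_tensor = predictor.get_input_tensor(predictor.get_input_names()[0])
input_tensor.copy_from_cpu(np.random.rand(1, 3, 224, 224).astype('float32'))

predictor.zero_copy_run()

output_tensor = predictor.get_output_tensor(predictor.get_output_names()[0])
result = output_tensor.copy_to_cpu()  # numpy.ndarray
```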
  - Support for Other Programming Languages
    - Adds inference APIs for R and Go, together with related usage methods and examples.
  - Provides the corresponding ProtoBuf header files to facilitate analyzing the structure of models.
  - For an inference library compiled with TRT, the TensorRT library is no longer provided under third_party and needs to be downloaded by users from https://developer.nvidia.com/tensorrt.
  - Functional Enhancement:
    - Supports accessing Paddle Lite in subgraph mode; ResNet50 has been verified.
    - Supports the MKL-DNN FC INT8 kernel.
    - Supports the Ernie model in Paddle-TensorRT. For the Ernie model (seq length = 128) on the T4 card, the latency of fp16 inference is 3.6 ms, 37% faster than fp32 inference.
    - Quantization: the single-threaded and multi-threaded performance of ERNIE INT8 on the second-generation Xeon Scalable platform (6271) improves by 2.79x and 1.79x respectively, while the ERNIE INT8 model shows only a slight precision decline compared with the FP32 model.
- Releases a remote text vector representation inference service for BERT-type semantic understanding models.
- Releases the paddle-gpu-serving whl package, supporting pip installation and Python code.
- Supports 13 semantic understanding models in PaddleHub and supports the single-machine multi-card mode. Using the Ernie_tiny model with an average sample length of 7, the inference speed reaches 869.56 samples per second on a single P4 GPU.
- Refactors the pruning, quantization, distillation, and NAS APIs. Provides more low-level APIs for developers.
- Quantization:
  - Adds a post-training quantization strategy based on KL divergence. Supports quantization of the embedding layer.
  - Supports quantization of the MKL-DNN FC layer based on QAT.
  - Adds post-training quantization that supports 30 kinds of operators. Supports skipping quantization for some operators.
  - Supports skipping some operators in the quantization-aware training strategy.
- Pruning: Refactors and enhances the pruning code to support more kinds of networks.
- NAS:
  - Supports NAS based on simulated annealing. Provides more predefined search spaces and supports custom search spaces.
  - Adds a one-shot NAS algorithm. The search speed is 20 times faster than the previous version.
- Releases the large-scale scalable knowledge distillation framework called Pantheon.
  - Achieves full decoupling between student and teacher models and among teacher models. They can run independently on different physical devices to make full use of computing resources.
  - Supports multi-device large-scale inference of the teacher model in a single node. The acceleration ratio is tested to be linear on BERT-like complex models.
  - Uses the TCP/IP protocol for communication in online distillation mode, supporting knowledge transmission between teacher and student models running on any two physical devices in the same network environment.
  - Unifies the API interfaces of the online and offline distillation modes, enabling different teacher models to operate in different distillation modes (see the sketch after this list).
  - The merging of knowledge and the batch reorganization of knowledge data are completed automatically on the student side to facilitate knowledge fusion of multi-teacher models.
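A hedged sketch of the student side of Pantheon; the knowledge file paths and the merge-strategy key are placeholders, and the API names are assumed from the paddleslim.pantheon module:

```python
from paddleslim.pantheon import Student

# Register two offline teachers and merge their knowledge on the student
# side; in online mode, in_address would be used instead of in_path.
student = Student(merge_strategy={'knowledge': 'mean'})
student.register_teacher(in_path='./teacher1_knowledge.dat')
student.register_teacher(in_path='./teacher2_knowledge.dat')
student.start()

knowledge_generator = student.get_knowledge_generator(
    batch_size=32, drop_last=False)
```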
- Model Zoo:
  - Releases benchmarks of image classification models such as ResNet50 and MobileNet.
  - Adapts the PaddleDetection library and releases benchmarks of YOLOv3 models with different backbones.
  - Adapts the PaddleSeg library and releases benchmarks of Deeplabv3+ models with different backbones.
- Refines Documents:
  - Refines the API documents. Adds QuickStart and advanced tutorials. Adds a model zoo document covering models for image classification, object detection, and semantic segmentation. Translates all documents into English.
## Distributed
- Parameter Server Mode:
  - Greatly reduces the memory usage during training. On 100-million-scale embedding training tasks, the Trainer-side memory can be reduced by 90%.
  - Greatly reduces the memory usage of distributed saving and loading of models. The Pserver-side memory peak can be minimized to 1/N of the original value, where N is the number of Pserver nodes.
  - Optimizes the dense parameter communication in GEO mode.
  - Supports distributed AUC index calculation.
  - Adds distributed barrier functions.
  - Adds a semi-asynchronous mode in Communicator.
  - Supports the semi-asynchronous mode of the `TrainFromDataset` training interface.
  - Adds `DistributedStrategy` in `Fleet` to improve convenience and integrate the current distributed-related flags (see the sketch after this list).
  - Supports single-program multi-loss training modes in `Fleet pslib` to optimize the training performance.
  - Supports the k8s environment in 100-billion-scale sparse mode.
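A hedged sketch of selecting a parameter-server mode through the fleet strategy; the incubate import paths and the `StrategyFactory` helper are assumed from the 1.7 API, and the one-layer network is a placeholder:

```python
import paddle.fluid as fluid
from paddle.fluid.incubate.fleet.base import role_maker
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler.distributed_strategy import StrategyFactory

# A trivial regression network standing in for a real model.
x = fluid.data(name='x', shape=[None, 13], dtype='float32')
y = fluid.data(name='y', shape=[None, 1], dtype='float32')
loss = fluid.layers.mean(fluid.layers.square_error_cost(
    input=fluid.layers.fc(input=x, size=1), label=y))

fleet.init(role_maker.PaddleCloudRoleMaker())

# The mode is chosen by the strategy: sync, async, half_async, or geo,
# all running on the unified communicator back end.
strategy = StrategyFactory.create_geo_strategy(400)  # GEO, push every 400 steps

optimizer = fleet.distributed_optimizer(
    fluid.optimizer.SGD(learning_rate=0.01), strategy)
optimizer.minimize(loss)
```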
- [Large-scale classification library PLSC](https://github.com/PaddlePaddle/PLSC): supports large-scale classification problems that data parallelism cannot solve due to the limitation of GPU memory capacity.
  - Supports three built-in models, ResNet50, ResNet101, and ResNet152, as well as user-defined models. Under a single-machine eight-V100-GPU configuration, the ResNet50 model achieves a million-class training speed of 2,122.56 images/s, 1.3 times faster than the standard ResNet50 model.
  - Releases a `plsc-serving whl` package for the online model inference service. It can predict the image semantic vector representation of face recognition models and supports prediction with user-trained models. The inference speed of the ResNet50 model (batch size=256) is 523.47 images/s on a single V100 GPU.
  - Releases pre-training models based on the ResNet50 network and the MS1M-ArcFace dataset: https://plsc.bj.bcebos.com/pretrained_model/resnet50_distarcface_ms1mv2.tar.gz.
  - Releases the benchmark for ResNet50 mixed precision training (single-card, multi-card, and multi-machine).
- Seq2seq supports training modes such as RL and GAN in the static graph of Paddle.
- Releases a training model for word segmentation and part-of-speech tagging. With the knowledge distillation framework Pantheon, the F1 score of this model on its own dataset improves by 1% over that of PaddleNLP LAC. The model is merged into the jieba repo, with a flag use_paddle added to enable the deep learning model mode (see the sketch after this list). In addition, a paddle version detection and rollback mechanism is added in jieba to ensure the user experience.
- Adds dynamic graph model implementations for these models: word2vec, senta, transformer, Bert, seq2seq, and LAC.
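A minimal sketch of the flag mentioned above, assuming a jieba release that bundles the paddle mode and a local paddlepaddle installation; the sample sentence is illustrative:

```python
import jieba

# enable_paddle() activates the deep learning segmentation mode; jieba
# falls back to its default mode if no usable paddle install is found.
jieba.enable_paddle()
words = jieba.cut("我来到北京清华大学", use_paddle=True)
print('/'.join(words))
```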
- Implements the standard workflow for data preprocessing, training, and synthesis of TTS models.
- Provides out-of-the-box pre-processing implementations of typical datasets.
- Provides commonly-used model components in the TTS field to facilitate model implementation.
- Releases the TTS models DeepVoice3, ClariNet, TransformerTTS, FastSpeech, WaveNet, and WaveFlow.
- PaddleCV
  - Image Classification:
    - Adds 14 pre-training models, including the SENet-vd, Res2Net, and HRNet series of models.
    - Supports accelerating data preprocessing by using DALI. On ImageNet training, a speedup of 1.5x (ResNet50) to more than 3x (ShuffleNet) is obtained, and GPU utilization is greatly improved.
  - 3D Vision:
    - Releases the PointNet++ and PointRCNN models.
  - Tracking Model Library:
    - Releases the SiamFC and ATOM models.
  - Adds dynamic graph model implementations for the following models: MobileNet-v1/v2, YOLOv3, FasterRCNN, MaskRCNN, the video classification TSM model, and the video motion localization BMN model.
- PaddleRec
  - Releases a multi-task model called MMoE for the recommendation field, which can be applied to large-scale multi-task joint training in industry.
  - Adds dynamic graph model implementations for the following models: gru4rec, deepfm.
- Improves the precision of the YOLOv3 model. The precision on the COCO data reaches 43.2%, an absolute increase of 1.4% over the previous version.
- Adds the following model implementations and pre-training models:
  - Adds the best single model, CascadeCARCNN-FPN-Dcnv2-Nonlocal ResNet200-vd, from the Google AI Open Images 2019-Object Detection competition. Releases a pre-training model of this algorithm based on Objects365 data.
  - Adds a series of CBResNet, Res2Net, and HRNet pre-training models.
  - Adds a LibraRCNN algorithm and its pre-training models.
  - Adds GIoU, DIoU, and CIoU loss-based pre-training models for the FasterRCNN R50 FPN model. Without reducing the inference speed, the precision on the COCO data improves by 1.1%, 0.9%, and 1.3% respectively.
- Adds Modules:
  - Backbone networks: CBResNet, Res2Net, and HRNet are added.
  - Loss modules: GIoU loss, DIoU loss, and CIoU loss are added. Libra loss and YOLOv3 loss support a fine-grained op combination.
  - Postprocessing modules: The softnms and DIoU nms modules are added.
  - Regularization module: A DropBlock module is added.
- Functional Optimization and Improvement:
  - Accelerates YOLOv3 data preprocessing. The overall training speeds up by 40%.
  - Optimizes the data preprocessing logic.
  - Adds benchmark data for face detection inference.
  - Adds inference examples under the Paddle inference library Python API.
- Detection Model Compression:
  - Pruning: Releases a MobileNet-YOLOv3 pruning solution and model, with FLOPs -69.6% and mAP +1.4% on the VOC dataset, and FLOPs -28.8% and mAP +0.9% on the COCO dataset. Releases a ResNet50vd-dcn-YOLOv3 pruning solution and model, with FLOPs -18.4% and mAP +0.8% on the COCO dataset.
  - Distillation: Releases a MobileNet-YOLOv3 distillation solution and model, with mAP +2.8% on the VOC data and mAP +2.1% on the COCO data.
  - Quantization: Releases YOLOv3 and BlazeFace quantization models.
  - Pruning + Distillation: Releases a MobileNet-YOLOv3 pruning + distillation solution and model, with FLOPs -69.6%, inference speedup of 64.5% on GPU, and mAP -0.3% on the COCO dataset. Releases a ResNet50vd-dcn-YOLOv3 pruning + distillation solution and model, with FLOPs -43.7%, inference speedup of 24.0% on GPU, and mAP +0.6% on the COCO data.
  - Search: Open-sources a complete search solution for BlazeFace-nas.
- Inference Deployment:
  - Adapts the Paddle inference library's support for TensorRT and FP16 precision.
- Documents:
  - Adds documents introducing the data preprocessing module and a document on implementing user-defined data readers.
  - Adds documents about how to add an algorithm model.
  - Deploys the documents to the website: https://paddledetection.readthedocs.io/zh/latest/
- Adds Models:
  - LaneNet model, applicable to lane segmentation scenarios.
  - Lightweight Fast-SCNN model, applicable to high-performance scenarios.
  - HRNet semantic segmentation model, applicable to high-precision scenarios.
- Releases multiple PaddleSlim-based model compression solutions:
  - Fast-SCNN pruning solution and model on the Cityscapes dataset.
  - Deeplabv3p-Xception and Deeplabv3p-MobilenetV2 distillation solutions on the Cityscapes dataset.
  - Deeplabv3p-MobilenetV2 search solution on the Cityscapes dataset.
  - Deeplabv3p-Mobilenet quantization solution and model on the Cityscapes dataset.
- Enhances the deployment capability:
  - Adds lightweight deployment for Python.
  - Adds TensorRT acceleration support for FP16 and INT8 quantized models.
  - Adds tutorials for Paddle-Lite mobile deployment of the DeepLabv3p-MobileNetV2 human portrait segmentation model.
  - Optimizes the model export step and supports GPU implementations of image preprocessing and postprocessing. The performance improves by 10%-20%.
  - Provides benchmarks for the inference performance of U-Net, ICNet, PSPNet, DeepLabv3+, and other models on images of different sizes, to facilitate model selection based on performance.
- Experience Optimization
  - Adds a learning rate warmup function, which can be used together with different learning rate decay strategies to improve fine-tuning stability.
  - Marked images can be saved in pseudo-color image format to improve their preview experience.
  - Adds the function of automatically saving the best mIoU model.
  - Comprehensively optimizes the document logic and provides AIStudio practical tutorials on industrial scenarios such as industrial quality inspection and fundus screening.
- The release APIs and the multi-task learning kernel are upgraded:
  - Supports an independent task saver.
  - Supports continuous training and inference; dataset files can be switched freely within a single execution.
  - Supports model customization.
  - Refactors the multi-task learning kernel and fixes some bugs.
- Upgrades the multi-task learning ability:
  - Supports independent settings of batch size and sequence length across tasks.
  - Fixes the inconsistency problem of tasks on GPUs.
  - Optimizes the multi-task learning scheduling and termination strategies to generally improve the model generalization ability.
- Upgrades the ability and types of pre-defined tasks:
  - Upgrades the machine reading comprehension tasks and adds preprocessing hyper-parameters.
- Adds a module for managing and downloading pre-training models, supporting the management and downloading of pre-training models such as BERT, ERNIE, and RoBERTa.
- The scheduler and submitter functions are added: the scheduler is used to control whether a trainer participates in updates during training, and the submitter is used to submit PaddleFL tasks in an MPI cluster.
- A LEAF federated learning open dataset is added, together with an API to set benchmarks. Classical datasets in fields such as image classification, sentiment analysis, and character inference, for example MNIST and Sentiment140, are supported.
- According to the added components, the original samples in example are modified, and the femnist_demo and submitter_demo examples are added.
- Optimizes fl_distribute_transpiler to add support of the FedAvg strategy for the Adam optimizer.
- Adds the SecAgg strategy (Secure Aggregation) to achieve secure parameter aggregation.
- Deep Reinforcement Learning Framework [PARL](https://github.com/PaddlePaddle/PARL)
  - Releases version v1.3.
  - Adds support for Multi-Agent RL algorithms, including MADDPG.
  - Adds support for multi-card training. Releases an example of a multi-card DQN algorithm.
  - Open-sources the SOTA algorithms TD3 and SAC in the continuous control field.
  - Open-sources the implementation and training solution of the NeurIPS2019 reinforcement learning challenge champion model, with trained models released.
- The support for the authoritative graph learning benchmark OGB (Open Graph Benchmark) is added. Three types of tasks, namely node property prediction, link property prediction, and graph property prediction, are fully supported. A SOTA baseline is released.
- A graph solution PGL-Rec and a knowledge graph embedding algorithm set PGL-KE are released.
- An improvement in ease of use is made: a high-order API of PGL is released.
- Other upgrade points: multi-process graph sampling is optimized, accelerating GraphSAGE-style models by three times. `LoDTensor`-based Graph Batch and Graph Pooling operators are added. Models including a distributed heterogeneous-task graph algorithm, GraphZoom, and PinSage are added to the Model Zoo. A toy sketch of the neighbor-sampling step follows this list.
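The sampling step that the multi-process optimization targets can be pictured with the toy sketch below: GraphSAGE-style models repeatedly draw a fixed number of neighbors per seed node, which is why faster sampling translates directly into faster training. The adjacency-list layout and function name are illustrative assumptions, not PGL's API.

```python
import random

def sample_neighbors(adj, nodes, num_samples):
    """For each seed node, sample up to `num_samples` neighbors with replacement."""
    batch = {}
    for n in nodes:
        neighbors = adj.get(n, [])
        batch[n] = [random.choice(neighbors) for _ in range(num_samples)] if neighbors else []
    return batch

# A tiny undirected graph as adjacency lists.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(sample_neighbors(adj, nodes=[0, 2], num_samples=2))
```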
## Code Reconstruction and Upgrade
- Compilation
- A compilation option WITH_NCCL is added. Single-card users can explicitly set WITH_NCCL=OFF to accelerate compilation.
- A compilation option WITH_TP_CACHE is added to cache third-party source codes and avoid repeated downloading. Windows users can set it to ON to speed up compilation and improve its stability.
- The `CUDA_ARCH_NAME` default value is set to `Auto` (`All` compiles for all GPU architectures; `Auto` compiles only for the current machine's GPU architecture). For developers, `Auto` saves a lot of compilation time compared with `All`, improving development efficiency.
- Redundant links, build products, and needless file copying are reduced, speeding up compilation on Windows.
- External Dependency Library
- MKL-DNN is upgraded to the latest Version 1.1.
- The inference library is decoupled from `third_party` and 28 third-party-dependent compilation codes are refactored to facilitate the unified management of external dependencies.
- Two third-party-dependent private code repositories, one unnecessary external dependency, and 2000+ lines of unnecessary code under the patch are removed to improve the code repository quality.
- Code Cleanup, Refactoring, and Optimization
- The unnecessary `contrib/float16` directory is removed. The unnecessary snappy/snappystream dependency under BRPC is deleted.
- `loss.py` and `sequence_lod.py` are split out of `python/paddle/fluid/layers/nn.py` according to the API functions, reducing the size of `nn.py` and making it easier to read.
- The codes corresponding to the warnings of `-Wno-error=sign-compare` (more than 100 in total) are fixed. Any subsequent warning of this kind will be reported as an error during compilation, improving the code quality.
- About 300 `WarningLnk4006`/`WarningLnk4221` warnings from Windows MSVC compilation are removed to improve the code repository quality.
- The number of `reduce_op`, `expand_op`, and `expand_as_op` templates is reduced, accelerating GPU compilation and shrinking the whl package by 70 MB.
- The pybind function of every OP under the dynamic graph is automatically generated by code and called directly in layers, improving dynamic graph performance and reducing the coupling with the static graph; a hedged sketch of the code-generation idea follows.
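The code-generation idea can be pictured with the sketch below: one thin Python wrapper is emitted per operator name from a string template, so layers can call the kernel entry point directly instead of going through static-graph machinery. The template, the op-name list, and the `_call_cpp_kernel` stand-in are all hypothetical, not Paddle's actual generator.

```python
# Illustrative only: emit a wrapper function per operator from a template.
OP_TEMPLATE = '''
def {op_name}(*args, **kwargs):
    """Auto-generated wrapper that dispatches '{op_name}' to its kernel."""
    return _call_cpp_kernel("{op_name}", *args, **kwargs)
'''

def _call_cpp_kernel(op_name, *args, **kwargs):
    # Stand-in for the real pybind entry point into the C++ op kernel.
    raise NotImplementedError(f"would dispatch '{op_name}' to its C++ kernel")

generated = {}
for op_name in ["elementwise_add", "matmul", "relu"]:  # hypothetical registry
    namespace = {"_call_cpp_kernel": _call_cpp_kernel}
    exec(OP_TEMPLATE.format(op_name=op_name), namespace)
    generated[op_name] = namespace[op_name]

print(sorted(generated))  # ['elementwise_add', 'matmul', 'relu']
```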
## Bug Fixes
- Fix an MKL-DNN error when PaddleDetection-based Faster-RCNN uses the Python API for inference.
- Fix training hangs in the GPU implementation of the sum op caused by some Tensors not being initialized.
- Fix precision loss when the value in fill_constant is set to a large integer (a numpy reproduction of this failure mode follows this list).
- Fix the sigmoid cudnn kernel being mistakenly called as the tanh cudnn kernel.
- Fix bugs related to reshape and depthwise Conv2D in dynamic graph mode; fix a program crash caused by some parameters in the network having no gradient.
- Fix a running error of GradientClip in parameter server mode.
- Fix a memory leak in the full asynchronous mode of the parameter server.
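Regarding the fill_constant precision fix above, the sketch below is a plausible reproduction of this class of error in plain numpy, not the exact fixed case: float32 cannot represent every integer above 2^24, so storing a large integer constant in a float32 tensor silently rounds it.

```python
import numpy as np

big = 16777217  # 2**24 + 1, the first integer float32 cannot represent exactly
as_float32 = np.full((1,), big, dtype=np.float32)
print(int(as_float32[0]) == big)  # False: the value was rounded to 16777216
```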