Release Notes
==============

## Important Updates

This version focuses on strengthening core framework functionality. Forecast deployment capability is improved across the board, the distributed release adds PLSC for super-large-scale classification, and the parameter server mode is optimized and consolidated. The compilation options, compilation dependencies, and code base are cleaned up and optimized. The model library continues to grow, its overall hierarchy is reorganized, and dynamic graph model implementations are added. The end-to-end development kits and utility components are further refined.

**Training Framework**: An AMP interface and a new control flow interface are added. Tensor usage and the GPU memory allocation strategy are optimized. Support for the NVIDIA DALI GPU data preprocessing library is added. The function and performance of basic Ops are continually optimized. The dynamic graph is further refined and its performance is greatly improved. A function is provided that converts a data-independent dynamic graph model into a static graph model that can be deployed for forecast. The framework's debugging and analysis functions and its ease of use are enhanced.

**Forecast Deployment**: The Python API of the server-side forecast library is significantly optimized. Usage instructions and examples for calling the forecast library from the R and Go languages are added. The quantification support capability is strengthened. Paddle Lite supports models generated by the post-training quantification method without calibration data. The tailoring, quantification, distillation, and search interfaces of the model compression library PaddleSlim are reconstructed. A large-scale scalable knowledge distillation framework Pantheon is added to fully connect to the model library.

**Distributed Aspect**: In parameter server mode, the back-end implementation is unified into the communicator and the front-end interface is unified into the fleet for the synchronous, semi-asynchronous, and fully asynchronous modes of the transpiler. Different modes are selected flexibly using the fleet strategy. A large-scale classification library PLSC is released; it supports classification tasks with a very large number of classes using model parallelism.

**Basic Model Library**: A speech synthesis library Parakeet is released, including several leading-edge synthesis algorithms. 14 image classification pre-training models are added in PaddleCV. The 3D and tracking model families continue to be enriched. The word segmentation and part-of-speech tagging model of PaddleNLP is supported in jieba word segmentation. A multi-task model MMoE is added in PaddleRec. Extensive dynamic graph model implementations are added in the model library as a whole. The overall hierarchy of the model library is adjusted and optimized.

**End-to-End Development Kits**: A large number of model implementations and pre-training models are added in PaddleDetection and PaddleSeg. The training speed and accuracy of typical models are enhanced. The model compression and deployment capabilities are significantly improved. The user experience is fully optimized. A recommendation ranking system ElasticRec is released. It is deployed via K8S and supports streaming training and online forecast services.

**Utility Components**: 52 pre-training models are added in PaddleHub, bringing the total to more than 100. The function and experience are continuously optimized. The kernel of the multi-task learning framework PALM is upgraded, its API call is open, and more task types are supported. An open dataset is added in the federated learning framework PaddleFL.

## Training Framework

- API
  - An AMP interface is added: A network can be converted to mixed precision training in a general way while the accuracy fluctuation stays within the normal range.
  - A new control flow interface is added and recommended: Four control flow Ops, while\_loop (loop control), cond (conditional branch), case, and switch\_case (branch control), are added for ease of use, and the following new functions are supported:
    - A Python callable can be used as a control condition or as the body to execute.
    - Different branches in the control flow can use different losses or optimizers.
    - Conditions in the control flow can partially use CPU or GPU data.
  - Parameters of some APIs support the use of a variable list: The parameter\_list and no\_grad\_set parameters of some APIs previously accepted only a string list; they now also accept a variable list. It is no longer necessary to obtain the name attribute of the related variables in advance when using the following APIs:
    - fluid.backward.append\_backward(loss, parameter\_list=None, no\_grad\_set=None, callbacks=None)
    - fluid.backward.gradients(targets, inputs, target\_gradients=None, no\_grad\_set=None)
    - The minimize methods of various optimizers, such as Adam’s minimize: minimize(loss, startup\_program=None, parameter\_list=None, no\_grad\_set=None, grad\_clip=None)
- Basic Function Optimization
  - numpy float16 data can be set to a Tensor directly, without first converting it to the uint16 type.
  - The minus sign can be used directly to negate a Tensor.
  - GPU Memory Allocation Strategy:
    - The default policy is changed to AutoGrowth: GPU memory is allocated as needed without affecting the training speed. This avoids the problem that, under the previous default pre-allocation strategy, it was difficult to start a new task on the same GPU card.
    - GPU memory allocation adjustment for multi-card tasks: The GPU memory allocators on different GPU cards are set to lazy initialization mode. If a user does not use a card, no GPU memory is allocated on it. This avoids the OOM problem that occurred when a task was run on idle GPU cards without setting CUDA\_VISIBLE\_DEVICES while GPU memory on other cards was occupied.
  - OP Function Upgrade
    - elu: This activation function supports the calculation of second-order gradients.
    - prroi\_pool: The rois parameter accepts either the Tensor or the LoDTensor type.
    - conv2d, pool2d, batch\_norm, lrn: The backward calculations all support the MKL-DNN high-performance calculation library.
    - argsort: The descending sort is supported (A descending parameter is added. The default is False).
- Basic Performance Optimization
  - DALI Preprocessing Acceleration
    - Support for the NVIDIA DALI GPU data preprocessing library is added. It can be used to accelerate the preprocessing of data such as images, video, and speech.
  - Automatic Mixed Precision Training Optimization
    - With the following optimization strategies combined with DALI data preprocessing, the training throughput of the ResNet50 model is increased substantially: The mixed precision training throughput of a single V100 card is increased from 600+ images/s to 1,000+ images/s. The throughput of 8 cards on a single machine is 7,840 images/s. The throughput of 32 cards on 4 machines is 28,594 images/s.
      - The support of batch\_norm, conv2d, and other ops for NHWC data layout input is enhanced to accelerate fp16 calculation using Tensor Core technology.
      - Some op patterns in the model such as batch\_norm and relu are fused based on the IR Pass mechanism.
      - The kernel of elementwise (add, mul) and other ops is optimized.
  - RecomputeOptimizer is optimized to allow a larger batch size. In the bert-large model, the maximum batch size is increased by 533.62% compared with that without using RecomputeOptimizer, doubling the maximum batch size of the previous version.
  - OP Performance Optimization
    - The fusion operator fuse\_emb\_seq\_pool of embedding and sequence\_pool is implemented and murmurhash3\_x64\_128 in bloom\_filter is optimized. The training speed of some NLP models is effectively improved.
    - The GPU performance of the mean op is optimized. When the input data is a 32\*32\*8\*8 Tensor, the forward calculation speed is increased by 2.7 times.
    - The assign and lod\_reset ops are optimized to avoid unwanted GPU memory copies and data transforms.
    - The kernel implementation of stack OP is optimized. The performance of a single card of GPU in the XLnet/Ernie model is improved by 4.1%.
- Dynamic Graph
  - Function Optimization
    - The name\_scope parameter in the dynamic graph Layers is removed to make it easier for users to inherit and call.
    - The block parameter in the to\_variable interface is removed to simplify the use of the API.
    - As for the problem that model parameters depend on data, the build\_once design is removed so that Layers can get all the parameter tables at the end of \_\_init\_\_ execution, which is convenient for saving and loading, parameter initialization, parameter debugging, and parameter optimization.
    - Automatic pruning is improved to simplify network construction and reduce the amount of backward calculation.
    - The SelectedRows operation is supported so that the Embedding layer supports sparse updates on a single card.
    - As for the problem that the framework lacks containers, ParameterList, LayerList, and Sequential are added to simplify network construction.
    - Named\_sublayers and named\_parameters functions are supported to facilitate user programming.
    - The Linear lr warmup decay strategy is supported.
  - Performance Optimization
    - The interaction between Python and C++, GradMaker, OperatorBase, and the allocator are optimized. For the LSTM-based language model task on the P40 machine, the performance is improved by 270%.
    - Redundant code is removed to fix the performance problem caused by repeatedly calling dead code in optimized\_guard during optimize. For the Transformer model (batch\_size=64) on the P40 machine, the performance of optimizers such as SGD and Adam is improved by 5% to 8%.
    - To remove the performance cost of the extra scale\_op that updated the beta parameter in AdamOptimizer, the beta updating logic is fused into adam\_op to reduce the call overhead of the op kernel. For the Dialogue-PLATO model on the P40 machine, the performance is improved by 9.67%.
    - The asynchronous DataLoader of the dynamic graph is optimized. The overall training speed is improved by about 30% in the Mnist, ResNet, and other models.
    - The numpy bridge function is added. Tensor and ndarray can share the underlying data in CPU mode, which avoids copying a numpy input when creating variables and improves efficiency.
    - GPU memory optimization: Forward variables whose Tensor buffers are not needed in the backward pass are freed early. The maximum batch size is increased by 20%-30% or more in ResNet and other models.
  - Dynamic Graph Deployment
    - The TracedLayer interface is supported. It converts a dynamic graph model into a static graph model that can be deployed for forecast.
- Debugging Analysis
  - Error message optimization: Framework error messages are classified as a whole to make them systematic. Their wording is also revised to help users locate and solve problems more quickly and accurately.
  - Optimization of the Performance Analysis Profile Function
    - The function and accuracy of the profiler are enhanced. Profile options at different levels are supported. The call relation of events can be recorded in the profile data and printed.
  - The nan inf check and debugging are optimized (enabled through FLAGS\_check\_nan\_inf) and the performance, function, and output information are all greatly improved:
    - In terms of speed, the check is about 1000 times faster than the original tool when testing the ResNet50 model on a V100, and training keeps over 80% of its normal efficiency.
    - In terms of function, support for fp16 is added and environment variables can be set to skip the inspection of op, op\_role, and op\_var to facilitate the debugging of fp16 models.
    - The output information is detailed and accurate. Besides the names of the faulty op and tensor, the counts of nan, inf, and normal numerical values are printed to facilitate debugging.
- A lightweight installation package paddlepaddle-tiny for CPU training and forecast is released. The Windows/Linux/macOS operating systems and python27/python35/python36/python37 are supported:
  - The following options are compiled: no avx, no ml, no gpu, no unittest
  - The slim module and some datasets are removed.
  - The Linux package size is reduced to 37 M from 90 M. The Windows package size is reduced to 9.6 M from 50.8 M. The macOS package size is reduced to 19.8 M from 59 M.
  - The number of installation requirement dependencies is reduced to 7 from 15.
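
The semantics of the new control flow Ops listed under API can be illustrated with a plain-Python emulation. This is a minimal sketch, not PaddlePaddle code: the function names mirror the Ops named above, but the argument names (`loop_vars`, `true_fn`, `branch_fns`, and so on) and the eager evaluation are assumptions for illustration; the real Ops operate on Tensors inside a network.

```python
# Plain-Python emulation of the control flow semantics (illustrative only).

def while_loop(cond, body, loop_vars):
    """Repeat body while cond holds; both are Python callables over loop_vars."""
    while cond(*loop_vars):
        loop_vars = body(*loop_vars)
    return loop_vars


def cond(pred, true_fn, false_fn):
    """Run exactly one of two callables depending on pred."""
    return true_fn() if pred else false_fn()


def case(pred_fn_pairs, default):
    """Run the callable paired with the first true predicate, else default."""
    for pred, fn in pred_fn_pairs:
        if pred:
            return fn()
    return default()


def switch_case(branch_index, branch_fns, default):
    """Select a callable by integer index, falling back to default."""
    return branch_fns.get(branch_index, default)()


# Sum the integers 0..9 with a loop whose condition and body are callables.
i, total = while_loop(
    cond=lambda i, total: i < 10,
    body=lambda i, total: [i + 1, total + i],
    loop_vars=[0, 0],
)
```

Because every branch and loop body is a callable, each one can build its own computation, which is what allows different branches to carry different losses or optimizers.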

## Forecast Deployment

- Server-side Forecast Library
  - Python API
    - Reading and writing models from memory is supported to meet model encryption requirements.
    - The Scale operator is no longer added at the end of the forecast model.
    - The support for ZeroCopy forecast is added. The interface is basically the same as the C++ interface and supports numpy.ndarray as input and output. It is easier to use on the Python side.
    - Multiple interfaces are added in AnalysisConfig to completely cover the C++ forecast functions, including removing pass and disabling forecast glog.
  - Support for Other Programming Languages
    - Usage instructions and examples for calling the forecast library from the R and Go languages are added.
  - The corresponding ProtoBuf header file is provided to external users who need to parse the model structure.
  - For a forecast library with TRT compilation, a TensorRT library is no longer provided from third\_party and needs to be downloaded by users at https://developer.nvidia.com/tensorrt.
  - Function Enhancement:
    - Access to Paddle Lite through subgraphs is achieved and has been verified on ResNet50.
    - The support for MKL-DNN FC INT8 kernel is added.
    - Paddle-TensorRT supports the Ernie model. For the Ernie model (seq length = 128) on the T4 card, the fp16 forecast speed is 3.6 ms, which is faster than the fp32 forecast speed by 37%.
    - Quantification: With the ERNIE INT8 accuracy 2% higher than the FP32 accuracy, the single-threaded and multi-threaded performance of ERNIE INT8 on the second-generation Xeon Scalable platform 6271 are improved by 2.79 times and 1.79 times respectively.
- Mobile/Embedded End-side Paddle Lite (https://github.com/PaddlePaddle/Paddle-Lite)
  - Version v2.3 is released.
  - Multiple functions of Model\_optimize\_tool are upgraded.
  - The "post-training quantification method without calibration data" is supported. The model storage space is reduced (by 2 to 4 times).
  - OpenCL: The migration of 30 Image2D Kernels is finished and 14 Ops are covered.
  - The support for FPGA and NPU is further strengthened. The forecast of Kunlun XPU is supported.
  - A new official website document is released. A "post-training quantification method without calibration data" usage document is added.
- Paddle Serving (https://github.com/PaddlePaddle/Serving):
  - The forecast service of remote text vector representation of the bert-type semantic understanding model is released.
  - A paddle-gpu-serving WHL package is released. The forecast service can be deployed and used through pip installation and Python code.
  - 13 semantic understanding models in PaddleHub are supported. The single-machine multi-card mode is supported. The forecast speed is 869.56 samples/s when the average sample length is 7 under a single P4 GPU using the Ernie\_tiny model.
- PaddleSlim (https://github.com/PaddlePaddle/PaddleSlim):
  - PaddleSlim is split into an independent repository.
  - The tailoring, quantification, distillation and search interfaces are reconstructed. The underlying interfaces are open to users.
    - Quantification:
      - An offline quantification function based on KL divergence is added. The quantification of the Embedding layer is supported.
      - The QAT MKL-DNN quantification strategy support for FC is added.
      - PostTrainingQuantization is added to fully implement the post-training quantification function: The quantization of 30 kinds of Ops is supported. The Ops to be quantified can be set flexibly. Quantized models are generated in a unified format. It has the advantages of short time consumption, ease of use, and small precision loss.
      - Quantification training supports setting the types of Ops to be quantified.
    - Tailoring: The tailoring implementation is reconstructed to support more types of networks.
    - Search:
      - SA search is supported. More search space is added. User-defined search space is supported.
      - A one-shot search algorithm is added. The search speed is 20 times faster than that of the previous version.
  - A large-scale scalable knowledge distillation framework Pantheon is added.
    - Full decoupling is achieved between student and teacher models and between teacher models. They can independently run on different physical devices respectively to make full use of computing resources.
    - The single-node multi-device large-scale forecast of the teacher model is supported. The acceleration ratio is tested to be linear on BERT and other models.
    - TCP/IP protocol is used to achieve communication in online distillation mode. Knowledge transmission between teacher and student models running on any two physical devices in the same network environment is supported.
    - API interfaces in online and offline distillation modes are unified. Different teacher models may operate in different modes.
    - The merging of knowledge and the batch reorganization of knowledge data are completed automatically on the student side to facilitate the knowledge fusion of the multi-teacher model.
  - Model Library:
    - The compression benchmark of ResNet50 and MobileNet models is released.
    - The detection library is connected and the compression benchmark for the YOLOv3 series of models is released.
    - The segmentation library is connected and the compression benchmark for the DeepLabv3+ series of segmentation models is released.
  - Document Improvement:
    - The API documentation is supplemented. An introductory tutorial and an advanced tutorial are added. A ModelZoo document is added to cover classification, detection, and segmentation tasks. All documents are available in both Chinese and English.
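
The storage saving quoted for the post-training quantification method without calibration data follows from storing weights in fewer bits. The following is a minimal sketch, assuming per-tensor abs-max scaling to int8; the scheme, the function names, and the sample weights are assumptions for illustration and are not taken from Paddle Lite's implementation.

```python
from array import array


def quantize_weights(weights, num_bits=8):
    """Map float weights to signed integers using one abs-max scale per tensor."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for int8
    # Fall back to 1.0 so an all-zero tensor does not divide by zero.
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    quantized = [int(round(w / scale)) for w in weights]
    return quantized, scale


def dequantize_weights(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.031, 0.9, -0.44]               # hypothetical fp32 weights
quantized, scale = quantize_weights(weights)
restored = dequantize_weights(quantized, scale)

# Bytes needed to store the tensor: 4 per fp32 value, 1 per int8 value.
fp32_bytes = len(array("f", weights).tobytes())
int8_bytes = len(array("b", quantized).tobytes())
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

In this sketch the int8 copy takes one quarter of the fp32 storage, which matches the upper end of the 2 to 4 times reduction. No input data is needed because only the weights themselves are inspected.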

## Distributed

- Parameter Server Mode:
  - The memory usage is greatly reduced during training. On embedding tasks at the 100-million scale, the Trainer-side memory can be reduced by 90%.
  - The memory usage of distributed saving and loading models is greatly reduced. The Pserver-side memory peak value can be minimized to $1/N$ of the original value, where $N$ is the number of Pserver nodes.
  - The geo-sgd dense parameter communication is optimized.
  - The distributed AUC index calculation is supported.
  - A distributed barrier function is added.
  - A deprecation warning is added in the non-Fleet transpiler API. This API is planned to be removed in PaddlePaddle-Fluid 2.0.
  - Semi-asynchronous and synchronous modes are added in Communicator.
  - The TrainFromDataset training interface supports semi-asynchronous and synchronous modes.
  - DistributedStrategy is added in Fleet to further improve the distributed ease of use and integrate the current distributed related flags.
  - The Fleet pslib mode supports single-program multi-loss training to optimize the training performance.
  - The 100-billion-scale sparse mode supports the K8S environment.
- Large-scale classification library PLSC: It supports large-scale classification problems that data parallelism cannot solve because of limited GPU memory capacity (https://github.com/PaddlePaddle/PLSC).
  - Three built-in models, ResNet50, ResNet101, and ResNet152, are available and user-defined models are supported. Under the single-machine eight-V100 GPU configuration, the ResNet50 model has a million-class training speed of 2,122.56 images/s, which is 1.3 times faster than that of the standard ResNet50 model.
  - A plsc-serving whl package for online model forecast service is released. It forecasts the image semantic vector representation of a face recognition model and supports making forecasts with a user-trained model. The forecast speed of the ResNet50 model (batch size=256) under a single V100 GPU is 523.47 images/s.
  - A pre-training model based on the ResNet50 network and the MS1M-ArcFace dataset is released: https://plsc.bj.bcebos.com/pretrained\_model/resnet50\_distarcface\_ms1mv2.tar.gz.
- The benchmark for ResNet50 mixed precision training (single-card, multi-card, and multi-machine) is released.
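
Model parallel classification, as PLSC uses for very large class counts, splits the final classification layer by class so that each GPU holds only a slice of the logits. The following is a minimal sketch of a softmax computed over such slices, with plain Python lists standing in for GPUs. The shard layout and function name are assumptions for illustration; PLSC's actual implementation is not described in these notes.

```python
import math


def sharded_softmax(logit_shards):
    """Softmax over classes that are split across shards (one list per device)."""
    # Step 1: each shard reports its local maximum; the global maximum is their max
    # (on real hardware this would be an all-reduce with the max operator).
    global_max = max(max(shard) for shard in logit_shards)

    # Step 2: each shard exponentiates locally, then the partial sums are combined
    # (an all-reduce with the sum operator).
    exp_shards = [[math.exp(x - global_max) for x in shard] for shard in logit_shards]
    global_sum = sum(sum(shard) for shard in exp_shards)

    # Step 3: each shard normalizes its own slice; full logits are never gathered.
    return [[e / global_sum for e in shard] for shard in exp_shards]


# Six classes split across three hypothetical devices, two classes each.
shards = [[2.0, 1.0], [0.5, 3.0], [-1.0, 0.0]]
probs = sharded_softmax(shards)
```

Only two scalars per sample (a maximum and a sum) cross device boundaries, so the per-device memory for the classification layer shrinks with the number of devices.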

## Basic Model Library

(https://github.com/PaddlePaddle/models)

- PaddleNLP

  - Seq2seq supports training modes such as RL and GAN.
  - A training model for word segmentation and part-of-speech tagging is released. It uses the knowledge distillation framework Pantheon, and its F1 value on its own dataset is 1% higher than that of PaddleNLP LAC. It is incorporated into jieba word segmentation, where the deep learning model mode is enabled by adding the use\_paddle flag. In addition, a paddle version detection and rollback mechanism is added in jieba to protect the user experience.
  - Dynamic graph model implementations are added: word2vec, senta, transformer, Bert, seq2seq, LAC.

- PaddleSpeech

  - Speech synthesis: A synthesis library Parakeet is released.
    - A standard workflow for data preprocessing, training, and synthesis of the speech synthesis model is implemented.
    - The out-of-the-box pre-processing implementation of typical datasets is provided.
    - Commonly-used model components in the speech synthesis field are provided to support the model implementation.
    - Speech synthesis models DeepVoice3, ClariNet, TransformerTTS, FastSpeech, WaveNet, and WaveFlow are released.

- PaddleCV

  - Image Classification:
    - A total of 14 pre-training models including SENet-vd, Res2Net, and HRNet series of models are added:
      - SE\_ResNet18\_vd, SE\_ResNet34\_vd, SE\_ResNeXt50\_vd\_32x4d, ResNeXt152\_vd\_32x4d
      - Res2Net50\_26w\_4s, Res2Net50\_14w\_8s, Res2Net50\_vd\_26w\_4s
      - HRNet\_W18\_C, HRNet\_W30\_C, HRNet\_W32\_C, HRNet\_W40\_C, HRNet\_W44\_C, HRNet\_W48\_C, HRNet\_W64\_C
    - Accelerating data preprocessing with DALI is supported. On ImageNet training, a speedup of 1.5 times (ResNet50) to more than 3 times (ShuffleNet) is obtained and the GPU utilization is greatly improved.
  - 3D Direction:
    - The models PointNet++ and PointRCNN are released.
  - Tracking Model Library:
    - The models SiamFC, SiamRPN, SiamMask, ATOM, and ATP are released.
  - Dynamic graph model implementations are added: MobileNet-v1/v2, YOLOv3, FasterRCNN, MaskRCNN, the video classification TSM model, and the video action localization BMN model.

- PaddleRec

  - A multi-task model MMoE for the recommendation field is released. It applies to large-scale multi-task joint training in industry.
  - Dynamic graph model implementations are added: gru4rec, deepfm.
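
MMoE (multi-gate mixture-of-experts) shares a set of expert networks across tasks and gives each task its own gate that mixes the expert outputs. The following is a minimal forward-pass sketch with scalar experts. The weights, sizes, and function names are hypothetical and chosen only to show the data flow; this is not PaddleRec's implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(weights, features):
    """Inner product of two equally long lists."""
    return sum(w * x for w, x in zip(weights, features))


def mmoe_forward(features, expert_weights, gate_weights):
    """Return one mixed representation per task.

    expert_weights: one weight vector per shared expert.
    gate_weights:   for each task, one weight vector per expert.
    """
    # Every expert sees the same input; a ReLU keeps the sketch nonlinear.
    expert_outputs = [max(0.0, dot(w, features)) for w in expert_weights]

    task_outputs = []
    for task_gate in gate_weights:
        # Each task has its own gate, so tasks can weight experts differently.
        mix = softmax([dot(w, features) for w in task_gate])
        task_outputs.append(sum(m * e for m, e in zip(mix, expert_outputs)))
    return task_outputs


features = [1.0, 2.0]
expert_weights = [[0.5, 0.5], [1.0, -1.0], [0.0, 1.0]]      # three shared experts
gate_weights = [
    [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]],                   # gate for task A
    [[0.0, 0.0], [0.0, 0.0], [0.0, 1.0]],                   # gate for task B
]
task_a, task_b = mmoe_forward(features, expert_weights, gate_weights)
```

In a full model each task output would feed a task-specific tower; the sketch stops at the gated mixture, which is the part that distinguishes MMoE from a single shared bottom.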

## End-To-End Development Kits

- PaddleDetection (https://github.com/PaddlePaddle/PaddleDetection)

  - The precision of the YOLOv3 model is further improved. The precision for the COCO data reaches 43.2%, an absolute increase of 1.4% compared with the previous version.
  - Model implementations and pre-training models are added:
    - The best single model CascadeCARCNN-FPN-Dcnv2-Nonlocal ResNet200-vd in the Google AI Open Images 2019-Object Detection competition is added. A pre-training model of this algorithm based on Objects365 data is also released.
    - Pre-training models with CBResNet, Res2Net, and HRNet series backbones are added.
    - A LibraRCNN algorithm and a pre-training model are added.
    - GIoU, DIoU, and CIoU loss-based pre-training models are added in the FasterRCNN R50 FPN model. Without reducing the forecast speed, the precision for the COCO data is improved by 1.1%, 0.9%, and 1.3% respectively.
  - Added Modules:
    - Backbone network: CBResNet, Res2Net, and HRNet are added.
    - Loss modules: GIoU loss, DIoU loss, and CIoU loss are added. Libra loss and YOLOv3 loss support a fine-grained op combination.
    - Postprocessing modules: The Soft-NMS and DIoU NMS modules are added.
    - Regularization module: A DropBlock module is added.
  - Functional Optimization and Improvement:
    - YOLOv3 data preprocessing is accelerated. The overall training speeds up by 40%.
    - The data preprocessing logic is optimized.
    - The benchmark data for face detection forecast is added.
    - Forecast examples under the Paddle forecast library Python API are added.
  - Detection Model Compression:
    - Tailoring: A MobileNet-YOLOv3 tailoring solution and model are released, with FLOPs - 69.6%, mAP + 1.4% for the VOC dataset, and FLOPs - 28.8%, mAP + 0.9% for the COCO dataset. A ResNet50vd-dcn-YOLOv3 tailoring solution and model are released, with FLOPs - 18.4%, mAP + 0.8% for the COCO dataset.
    - Distillation: A MobileNet-YOLOv3 distillation solution and model are released, with mAP + 2.8% for the VOC data and mAP + 2.1% for the COCO data.
    - Quantification: YOLOv3-MobileNet and BlazeFace quantized models are released.
    - Tailoring + Distillation: A MobileNet-YOLOv3 tailoring + distillation solution and model are released, with FLOPs - 69.6%, forecast speedup 64.5% under the GPU, mAP - 0.3% for the COCO dataset. A ResNet50vd-dcn-YOLOv3 tailoring + distillation solution and model are released, with FLOPs - 43.7%, forecast speedup 24.0% under the GPU, mAP + 0.6% based on the COCO data.
    - Search: A complete search solution for BlazeFace-NAS is open-sourced.
  - Forecast Deployment:
    - The support of the Paddle forecast library for TensorRT and FP16 precision is adapted.
  - Documents:
    - A document for introducing the data preprocessing module and a document for implementing the user-defined data Reader are added.
    - A document about how to add an algorithm model is added.
    - Documents are deployed to the website: https://paddledetection.readthedocs.io/zh/latest/

- PaddleSeg (https://github.com/PaddlePaddle/PaddleSeg)

  - Added Models
    - LaneNet model applicable to lane segmentation scenarios.
    - Fast-SCNN model applicable to lightweight scenarios.
    - HRNet semantic segmentation model applicable to high-precision scenarios.
  - Multiple PaddleSlim-based model compression solutions are released:
    - Cityscape-based Fast-SCNN tailoring solution and model.
    - Cityscape-based Deeplabv3p-Xception and Deeplabv3p-MobilenetV2 distillation solutions.
    - Cityscape-based Deeplabv3p-MobilenetV2 search solution.
    - Cityscape-based Deeplabv3p-Mobilenet quantitative solution and model.
  - Enhancement of the Forecast Deployment Capability
    - Lightweight deployment of Python is added.
    - The TensorRT forecast acceleration support for FP16 and Int8 quantitative models is added.
    - Tutorials and cases for portrait segmentation Paddle-Lite mobile-side deployment of DeepLabv3p-MobileNetV2 are added.
    - Model export is optimized. GPU implementation of image preprocessing and postprocessing is supported. The performance is improved by 10%-20%.
    - The benchmark for the forecast performance of U-Net, ICNet, PSPNet, DeepLabv3+, and other models on images of different sizes is provided to help users select models based on performance.
  - Experience Optimization
    - A learning rate warmup function is added. It can be combined with different learning rate decay strategies to improve fine-tuning stability.
    - Labeled images can be saved in pseudo-color image format to make them easier to preview.
    - The function of automatically saving an optimal mIoU model is added.
    - The document logic is comprehensively optimized. An AIStudio practical tutorial on industrial scenarios such as industrial quality inspection and fundus screening is provided.

- ElasticRec (https://github.com/PaddlePaddle/ElasticRec)

  - The ElasticRec recommendation ranking system is released. It is deployed through K8S. Streaming training and online forecast service are supported.
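
The GIoU and DIoU losses added to PaddleDetection extend plain IoU with a penalty that stays informative when boxes barely overlap. The following is a minimal sketch of the three overlap measures for axis-aligned boxes in `(x1, y1, x2, y2)` form, following their published definitions. The function name and box format are assumptions for illustration, both boxes are assumed to have positive area, and each loss is commonly taken as one minus the corresponding measure.

```python
def iou_family(box_a, box_b):
    """Return (IoU, GIoU, DIoU) for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Overlap area (zero when the boxes do not intersect).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union

    # Smallest box enclosing both inputs.
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    enclose_area = (cx2 - cx1) * (cy2 - cy1)

    # GIoU subtracts the share of the enclosing box not covered by the union.
    giou = iou - (enclose_area - union) / enclose_area

    # DIoU subtracts the squared center distance over the squared enclosing diagonal.
    center_dist_sq = (((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2
                      + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2)
    diagonal_sq = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2
    diou = iou - center_dist_sq / diagonal_sq

    return iou, giou, diou


iou, giou, diou = iou_family((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

For two disjoint boxes IoU is zero regardless of how far apart they are, while GIoU and DIoU keep decreasing with distance, which gives the regression a usable gradient.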

## Utility Components

- PaddleHub (https://github.com/PaddlePaddle/PaddleHub)

  - 52 pre-training models are added, bringing the total number of pre-training models to 100+:
    - Semantic models: Five semantic models such as RoBERTa\_wwm, BERT\_wwm, and ERNIE-Tiny are added.
    - Text classification: Three models for identifying pornographic text are added.
    - Image classification: A total of 36 image classification models such as ResNext-WSL and EfficientNet are added.
    - Object detection: Five detection models such as pedestrian detection and vehicle detection are added.
    - Key point detection: Two models for key point detection of face and body posture are added.
    - Face mask detection: Two PyramidBox-Lite-based face mask detection models are added.
    - Universal face detection: Four universal face detection models such as Ultra Light Fast Generic Face Detector and PyramidBox-Lite are added.
  - Function:
    - A Bert Service text vector representation service based on Paddle Serving is added.
    - Task flexibility is enhanced. An added hook mechanism supports the loading of user-defined codes.
    - Colored log output (Colorlog) is added. The problem of logs being printed repeatedly is fixed.
    - Code results are optimized. The command line execution speed is increased by 50%.
    - Dataset and Reader are reconstructed. The amount of code needed to adapt a user-defined dataset is reduced by 60%.
    - The AutoFinetune interface is optimized. Multi-experiment visualization effect display is supported.
  - Experience Optimization
    - The logic is fully optimized. Rich AIStudio tutorial contents are added.
    - The landing page of the official website has been fully upgraded to provide the function of quick online experience and tutorial guidance.

- Multi-task learning framework PALM (https://github.com/PaddlePaddle/PALM)

  - Python3 and Windows are supported.
  - The framework kernel and the underlying multi-task mechanism are upgraded. The API call is open.
    - The flexible model saving mechanism supports single-task saving and full-graph saving.
    - Continuous training and forecast are supported. Dataset files can be switched over freely under a single execution.
    - A model customization/self-definition function is added.
    - The multi-task underlying kernel is reconstructed. Some bugs that affect universality and stability are fixed.
  - The multi-task learning ability is strengthened.
    - Each task can have a different batch size and sequence length in a multi-task scenario.
    - The problem of inconsistent tasks across GPU cards during multi-task multi-card training is fixed.
    - The multi-task learning scheduling and termination strategies are optimized to generally improve the model generalization ability.
  - The function and type of supported tasks are strengthened.
    - Matching task support is enhanced. Pairwise learning and multiple categories (e.g. NLI sentence relation judgment) are supported.
    - The support for machine reading comprehension tasks is enhanced. User controllable preprocessing hyper-parameters are added.
    - The support for sequence labeling tasks is added.
  - The large-scale training/inference capability is strengthened.
    - The automatic multi-card forecast capability is added.
    - An asynchronous reader is supported. A variable-length padding is supported in multi-card scenarios.
  - A module for managing and downloading pre-trained models is added.
    - Managing and downloading pre-trained models such as BERT, ERNIE, and RoBERTa is supported.
    - A RoBERTa Chinese pre-trained model is added.

- Federated Learning PaddleFL (https://github.com/PaddlePaddle/PaddleFL)

  - The scheduler and submitter functions are added. The scheduler controls whether a trainer participates in updates during training. The submitter submits PaddleFL tasks to the MPI cluster.
  - LEAF, an open dataset for federated learning, is added. An API is added to set a benchmark. Classical datasets in image classification, sentiment analysis, character prediction, and other fields, such as MNIST and Sentiment140, are supported.
  - To match the added components, the original examples are modified and the femnist\_demo and submitter\_demo examples are added.
  - Fl\_distribute\_transpiler is optimized to support the FedAvg strategy with the Adam optimizer.
  - The SecAgg (Secure Aggregation) strategy is added to achieve secure parameter aggregation.
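
The SecAgg item above refers to secure aggregation. As a rough illustration of the general idea only, and not of PaddleFL's implementation or API, the following sketch uses pairwise additive masks: every pair of clients shares a random vector that one client adds and the other subtracts, so the server only sees masked updates, yet the masks cancel in the sum. All names here (`MODULUS`, `pairwise_masks`, `mask_update`, `aggregate`) and the integer-valued updates are assumptions made for this example.

```python
import random

MODULUS = 2 ** 32  # hypothetical modulus for the masked integer arithmetic


def pairwise_masks(num_clients, vector_len, seed=0):
    """Return one mask vector per client; the masks sum to zero mod MODULUS."""
    rng = random.Random(seed)
    masks = [[0] * vector_len for _ in range(num_clients)]
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            # Clients i and j share one random vector: i adds it, j subtracts it.
            shared = [rng.randrange(MODULUS) for _ in range(vector_len)]
            for k in range(vector_len):
                masks[i][k] = (masks[i][k] + shared[k]) % MODULUS
                masks[j][k] = (masks[j][k] - shared[k]) % MODULUS
    return masks


def mask_update(update, mask):
    """Hide a client's update by adding its mask element-wise."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]


def aggregate(masked_updates):
    """Sum the masked updates; the pairwise masks cancel in the total."""
    total = [0] * len(masked_updates[0])
    for masked in masked_updates:
        total = [(t + v) % MODULUS for t, v in zip(total, masked)]
    return total


# Three hypothetical clients with small integer updates.
updates = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
masks = pairwise_masks(len(updates), vector_len=3)
masked = [mask_update(u, m) for u, m in zip(updates, masks)]
result = aggregate(masked)
print(result)  # [111, 222, 333]
```

A real protocol would derive the shared vectors from a key exchange between clients rather than a single seeded generator, and would handle clients that drop out; both are omitted here to keep the cancellation property visible.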

## Code Reconstruction and Upgrade

- Compilation
  - A compilation option WITH\_NCCL is added. Single-card users can explicitly specify WITH\_NCCL=OFF to accelerate compilation.
  - A compilation option WITH\_TP\_CACHE is added to cache third-party source code and avoid repeated downloading. Windows users can set it to ON to speed up compilation and improve compilation stability.
  - The `CUDA_ARCH_NAME` default value is set to `Auto` (`All` compiles for all GPU architectures and `Auto` compiles only for the GPU architecture of the current machine). For developers, `Auto` saves considerable compilation time compared with `All`, thus improving development efficiency.
  - Redundant linking, redundant build artifacts, and unnecessary file copying are reduced, thus speeding up compilation on Windows.
- External Dependency Library
  - MKL-DNN is upgraded to the latest version, 1.1.
  - The forecast library is decoupled from `third_party`, and the build code of 28 third-party dependencies is refactored to facilitate the unified management of external dependencies.
  - Two private third-party dependency repositories, one unnecessary dependency, and 2000+ lines of unnecessary code under the patch are removed to improve the repository quality.
- Code Cleanup, Refactoring, and Optimization
  - The unnecessary `contrib/float16` directory is removed. The unnecessary snappy/snappystream dependency under BRPC is deleted.
  - `loss.py` and `sequence_lod.py` are split out of `python/paddle/fluid/layers/nn.py` according to the API functions, thus reducing the size of `nn.py` and making it easier to read.
  - The code that triggered the warnings covered by `-Wno-error=sign-compare` (more than 100 places in total) is fixed. All subsequent warnings of this kind are reported as errors during compilation, thus improving the code quality.
  - The `WarningLnk4006/WarningLnk4221` warnings raised by MSVC builds on Windows (about 300 places in total) are eliminated to improve the repository quality.
  - The number of reduce\_op, expand\_op, and expand\_as\_op templates is reduced to accelerate GPU compilation and reduce the whl package size by 70 MB.
  - In dynamic graph mode, the pybind function of every OP is generated automatically by code and called directly in layers, which improves dynamic graph performance and reduces the coupling with the static graph.

## Bug Fixes

- Fix the MKL-DNN error that occurs when the PaddleDetection-based Faster-RCNN makes a forecast through the Python API.
- Fix the training hang in the GPU implementation of the sum op caused by some Tensors not being initialized.
- Fix the problem of precision loss when the value in fill\_constant is set to a large integer.
- Fix the precision inconsistency of softmax\_with\_cross\_entropy\_op on CUDA.
- Fix the problem that the stop\_gradient attribute in a program is not copied to the new program when the program is cloned.
- Fix the problem of precision loss of elementwise\_pow op for integers.
- Fix the problem that some GFLAGS cannot be specified from outside the forecast library.
- Fix the random core dumps during forecast caused by some passes when Analysistor runs with multiple threads (fc\_gru\_fuse\_pass, seqconv\_eltadd\_relu\_fuse\_pass, attention\_lstm\_fuse\_pass, embedding\_fc\_lstm\_fuse\_pass, fc\_lstm\_fuse\_pass, seq\_concat\_fc\_fuse\_pass).
- Fix the error that specifying a GPU through AnalysisConfig does not take effect after NativePredictor has been set to CPU forecast in the same process.
- Fix the compilation error (setup.py copy and op\_function\_cmd error) in the case of -DWITH\_MKL=OFF.
- Fix the bug that a tuple (Variable) cannot be passed as input to the py\_func OP; add an example of how to write Python OP code.
- Fix the problem of the sigmoid cudnn kernel being called as the tanh cudnn kernel by mistake.
- Fix some bugs related to reshape and depthwiseconv in dynamic graph mode; fix the program crash caused by some parameters in the network having no gradient.
- Fix the runtime error of GradientClip in parameter server mode.
- Fix the memory leak in fully asynchronous mode of the parameter server.
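
The fill\_constant item above concerns precision loss for large integer values. These notes do not state the cause, so the following is only an illustration of one common way such a loss arises, under the assumption that the integer passes through a 32-bit float on its way to the tensor. The helper `roundtrip_float32` is hypothetical and uses only the Python standard library, not any Paddle API.

```python
import struct


def roundtrip_float32(value):
    """Store a number as an IEEE 754 32-bit float and read it back."""
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]


# float32 has a 24-bit significand, so integers up to 2**24 are exact.
exact = 2 ** 24          # 16777216
too_big = 2 ** 24 + 1    # 16777217, not representable in float32

print(roundtrip_float32(exact))    # 16777216.0
print(roundtrip_float32(too_big))  # 16777216.0, the low bit is lost
```

Keeping the value in an integer type, or carrying it as a string until it reaches a typed buffer, avoids this kind of rounding.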