2.1.1 Release Note

Important Updates

This release fixes functional and performance issues found in PaddlePaddle 2.1.0 and enhances several features. The key updates are as follows:

  • Optimize the API visibility of the paddle.distributed, paddle.device, and paddle.vision directories.
  • Add support for dynamic-to-static conversion of user code in sublayers of paddle.nn.Sequential containers.
  • Add SyncBatchNorm support for AMP in dynamic graph mode to improve the performance of the dynamic graph SyncBatchNorm layer in AMP mode.

Training Framework

Functional optimization (including distributed)

Basic API

  • Optimize the API visibility of paddle.distributed, paddle.device, and paddle.vision; for more information, please see the 2.1.0 Release Note. (#33420)
  • Add paddle.is_compiled_with_rocm. (#33228)
  • Add bool type input support for paddle.strided_slice. (#33373)
  • Add bool type input support for paddle.equal_all, paddle.equal, paddle.greater_equal, paddle.greater_than, paddle.less_equal, paddle.less_than, and paddle.not_equal (a short usage sketch follows this list). (#33551)
  • Fix the issue that paddle.utils.download does not retry on ConnectionError. (#33454)
  • Fix the infershape error of paddle.gather when axis is not equal to 0. (#33553)
  • Fix the segmentation fault in paddle.io.DataLoader when num_workers=0 and the Dataset generates GPU Tensors that are fed into the DataLoader. (#33487, #33249)
  • Fix the issue that when a slice result is used as the left value of an inplace operation, the backward error message is unrelated to the actual error. (#32981)
  • Fix the failure of paddle.concat with uint8 inputs in dynamic graph mode. (#33667)
  • Fix the GPU memory overflow and abnormal output of paddle.grid_sample. (#33100, #33232)
  • Fix the bug in roi_align in align=True mode: when the input width or height of the RoIs is 0, the output feature should be 0. (#33446)
  • Fix the bug that log_softmax modified its input to nan in certain corner cases. (#32937)
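
As a quick illustration of the bool-type support and the new ROCm query mentioned above, here is a minimal sketch (the tensor values are illustrative only):

```python
import paddle

# Query whether this PaddlePaddle build was compiled with ROCm support.
print(paddle.is_compiled_with_rocm())

# The comparison APIs now accept bool-type inputs.
x = paddle.to_tensor([True, False, True])
y = paddle.to_tensor([True, True, False])
print(paddle.equal(x, y))      # [True, False, False]
print(paddle.equal_all(x, y))  # False

# paddle.strided_slice also supports bool-type input.
z = paddle.to_tensor([[True, False, True, False]])
print(paddle.strided_slice(z, axes=[1], starts=[0], ends=[4], strides=[2]))
```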

Dynamic Graphs to Static Graphs

  • Add support for dynamic-to-static conversion of user code in sublayers of paddle.nn.Sequential containers (a short sketch follows this list). (#33065)
  • Fix the incorrect handling of Subscript syntax during static type analysis of variables in control-flow for statement conversion. (#32969)
  • Refactor the dynamic-to-static param_guard logic to comprehensively solve Tensor type conversion between dynamic and static graphs. (#32985)
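
A minimal sketch of the paddle.nn.Sequential scenario referenced above; the layer sizes and input shape are illustrative only:

```python
import paddle

# A container whose sublayers are now covered by dynamic-to-static conversion.
net = paddle.nn.Sequential(
    paddle.nn.Linear(10, 20),
    paddle.nn.ReLU(),
    paddle.nn.Linear(20, 1),
)

# Convert the dynamic-graph model to a static-graph program and run it.
static_net = paddle.jit.to_static(net)
out = static_net(paddle.randn([4, 10]))
print(out.shape)  # [4, 1]
```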

Distributed Training

  • Fix the error in paddle.distributed.spawn when using the default nprocs argument (a launch sketch follows this list). (#33249)
  • Fix the hang at training startup caused by inconsistent creation of the pipeline parallel communication group. (#32890, #33473)
  • Fix the failure to save parameters in mixed parallelism. (#33595, #33588)
  • Fix the issue that the Fleet API cannot run a Program directly. (#33511)
  • Fix the hang caused by uneven sample bucketing in the pure GPU training mode of the heterogeneous parameter server. (#32957)
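
For reference, a minimal launch sketch for paddle.distributed.spawn with the default nprocs; the train function body is only a placeholder:

```python
import paddle
import paddle.distributed as dist

def train():
    # Initialize the parallel environment inside each spawned process.
    dist.init_parallel_env()
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} started")

if __name__ == "__main__":
    # Omitting nprocs uses all visible devices by default (the case fixed above).
    dist.spawn(train)
```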

Hybrid Parallelism with Dynamic Graph

  • Fix the accuracy error of TensorParallel. Change the parameter initialization method of TensorParallel to ensure the randomness of parameters after slicing. (#33087)
  • Fix the accuracy error of PipeLineParallel caused by incorrect use of microbatch. (#33097)
  • Fix the hang of the new_group API when creating multiple communication groups. (#33553)

Mixed Precision Training

  • Add SyncBatchNorm support for AMP in dynamic graph mode to improve the performance of the dynamic graph SyncBatchNorm layer in AMP mode; the 8-card AMP speedup ratio improves by 19% on the DeepLabV3P model of PaddleSeg (a usage sketch follows). (#33709)
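
A minimal single-process sketch of running SyncBatchNorm under dynamic-graph AMP; the model, data, and hyperparameters are illustrative, and a real job would be launched with multi-card data parallelism:

```python
import paddle

# SyncBatchNorm now runs under AMP in dynamic graph mode.
model = paddle.nn.Sequential(
    paddle.nn.Conv2D(3, 8, kernel_size=3, padding=1),
    paddle.nn.SyncBatchNorm(8),
    paddle.nn.ReLU(),
)
opt = paddle.optimizer.Momentum(learning_rate=0.01, parameters=model.parameters())
scaler = paddle.amp.GradScaler(init_loss_scaling=1024)

data = paddle.randn([2, 3, 32, 32])
with paddle.amp.auto_cast():
    loss = model(data).mean()
scaled = scaler.scale(loss)   # scale the loss to avoid fp16 underflow
scaled.backward()
scaler.minimize(opt, scaled)  # unscale gradients and apply the update
opt.clear_grad()
```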

Custom OP

  • Remove the dependency on the PADDLE_WITH_MKLDNN macro for custom OP compilation. (#32903)
  • Set GLIBCXX_USE_CXX11_ABI=1 by default to resolve compile-time errors that may be caused by low GCC versions. (#33185)
  • Add support for C++14 syntax features and enable the -std=c++14 compile option by default (a minimal build sketch follows this list). (#33227)
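
For context, a hypothetical minimal setup.py for building a custom OP with paddle.utils.cpp_extension; the source file name is a placeholder, and -std=c++14 no longer needs to be passed explicitly since it is now on by default:

```python
from paddle.utils.cpp_extension import CppExtension, setup

setup(
    name="custom_relu",
    ext_modules=CppExtension(
        # "relu_op.cc" is a placeholder for your custom operator source file.
        sources=["relu_op.cc"],
        # -std=c++14 is now added by default, so no extra compile args are needed.
    ),
)
```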

Others

  • Fix the random segmentation fault during training when LoDTensorArray is the input of an Op under multi-threading. (#32984)
  • Fix the issue that parameter regularization is executed twice when both the regularizer of paddle.ParamAttr and the weight_decay of paddle.optimizer.Momentum are specified as L2Decay (see the configuration sketch after this list). (#32881)
  • Fix the garbled warning messages on Windows. (#33689)
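
To make the double-regularization case above concrete, a minimal configuration sketch (coefficients are illustrative); after the fix, the weight is regularized once per step instead of twice:

```python
import paddle

# L2 regularization specified both on the parameter and on the optimizer.
linear = paddle.nn.Linear(
    10, 10,
    weight_attr=paddle.ParamAttr(
        regularizer=paddle.regularizer.L2Decay(1e-4)),
)
opt = paddle.optimizer.Momentum(
    learning_rate=0.01,
    parameters=linear.parameters(),
    weight_decay=paddle.regularizer.L2Decay(1e-4),
)
# With the fix, L2Decay is no longer applied twice to the same parameter.
```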

Inference Deployment

Model Quantization

  • Fix the issue of skipping OP quantization in the dynamic graph quantization training function. (#32879)
  • Fix the issue that layer_norm does not save the out_threahold attribute when the quantized model is saved. (#33610)

Paddle Inference

Function Upgrades

  • Add converter/plugin for gather_nd and reduce_sum in Paddle-TRT. (#33365)
  • Add reshape support in Paddle-TRT. (#33372)

Performance Optimization

  • Add the TensorRT layer_norm dynamic shape plugin to improve dynamic shape inference performance (see the sketch below). (#33448)
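
A sketch of enabling TensorRT dynamic-shape inference through the Python inference API, assuming a saved model with a single input named "x"; the paths and shape ranges are placeholders:

```python
from paddle.inference import Config, PrecisionType, create_predictor

# "model_dir/..." paths are placeholders for a saved inference model.
config = Config("model_dir/__model__", "model_dir/__params__")
config.enable_use_gpu(100, 0)  # 100 MB initial GPU memory pool on card 0
config.enable_tensorrt_engine(
    workspace_size=1 << 30,
    max_batch_size=1,
    min_subgraph_size=3,
    precision_mode=PrecisionType.Float32,
    use_static=False,
    use_calib_mode=False,
)
# Register min/max/optimal input shapes for dynamic-shape inference.
config.set_trt_dynamic_shape_info(
    {"x": [1, 3, 224, 224]},
    {"x": [1, 3, 1024, 1024]},
    {"x": [1, 3, 512, 512]},
)
predictor = create_predictor(config)
```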

Ease of Use Optimization

  • Add a prediction example document for the ROCm version of Paddle Inference, and add ROCm-related version information to version.txt of the C++ prediction library. (#33290)
  • Update the XPU compilation options; please refer to #33581 for the specific options.

Bug Fixes

  • Fix the calculation error of fused_fc_elementwise_layernorm caused by an excessive number of threads on Hygon DCU. (#33299)
  • Fix the failure of the yolov3 model when GPU is enabled on Jetson Nano and Jetson TX2. (#33442)
  • Fix the computation error of the Paddle-TRT multihead_matmul plugin when seq_len > 1024. (#33365)
  • Fix the incorrect output of the variable-length ERNIE model caused by inconsistent input order. (#33622)
  • Fix the prediction error of the OCR model on GPU. (#33431)
  • Fix the issue that paddle.static.io.normalize_program failed to export paddle.static.normalize_program. (#33408)
  • Fix the failure of conv with stride > 1 on TensorRT 6.0 and below. (#33198)
  • Fix the out-of-bounds GPU memory access when batch predicting images. (#33370, #33531)
  • Fix the issue that the MKLDNN cache size setting does not take effect on X86 CPU. (#33571)
  • Fix the incorrect dimension setting in the TensorRT conv2d_transpose op converter; models with the conv2d_transpose op can now run normally on TensorRT. (#33242)
  • Fix the incorrect results of prediction libraries compiled per CUDA Arch on Jetson devices. This version releases per-Arch Jetson prediction libraries for users who need a smaller prediction library binary. (#33269)
  • Fix the issue that loading a PaddleSlim quantized model from memory for prediction still reports an error because the calibration table path is not set. (#33629)
  • Fix the cuda error 400 reported by BERT/ERNIE when using TensorRT prediction on cards other than card 0. (#33706)
  • Fix the cmake syntax error caused by setting custom compilation parameters under Linux. (#33621)
  • Optimize the calculation accuracy of layer_norm and fix the issue of NaN outputs for large input data. (#33420)

Environment Adaptation

Compile and install

Support of new hardware training

Support of Kunlun Chips

  • Fix the gather op and add support for the logsumexp op. (#32931)

Thanks to our Contributors

This release contains contributions from: Aurelius84, cc, ceci3, Chen Weihang, danleifeng, feng_shuai, houj04, jiangcheng, JZ-LIANG, Kaipeng Deng, lidanqing, LielinJiang, Lijunhui, lilong12, liuyuhui, liym27, Pei Yang, Peihan, Qi Li, Ren Wei (任卫), Roc, Shang Zhizhou, ShenLiang, Shibo Tao, TeslaZhao, tianshuo78520a, TTerror, wangguanzhong, Wangzheee, wawltor, WeiXin, wenbin, Wenyu, whs, Wilber, wuhuanzhou, Zhang Ting, zhiboniu, Zhou Wei, zhoujun, 李季, 王明冬
