this is version 2.3.1

2.3.1 Release Note

1. Important Updates

  • Version 2.3.1 builds on version 2.3: it fixes known issues and adds precompiled binaries that support CUDA 11.6.

2. Training Framework (distributed included)

(1) Function Optimization

API

  • Modify the paddle.nn.initializer.KaimingUniform and paddle.nn.initializer.KaimingNormal initializers so that they support multiple types of activation functions. (#43721, #43827)
  • Optimize data pre-fetching in paddle.io.DataLoader so that prefetch_factor sets how much pre-fetched data is cached. This avoids IO blocking when reading large blocks of data. (#43674)
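
To make the initializer change concrete, here is a minimal standard-library sketch of how a Kaiming-style scale can depend on the activation function. The gain values are the ones commonly quoted for these activations; they are an assumption for illustration and are not read from the Paddle source. The function and parameter names (gain_for, kaiming_normal_std, kaiming_uniform_bound, negative_slope) are hypothetical.

```python
import math


def gain_for(nonlinearity, negative_slope=0.0):
    """Return a commonly used gain for the given activation (assumed values)."""
    if nonlinearity in ("linear", "sigmoid"):
        return 1.0
    if nonlinearity == "tanh":
        return 5.0 / 3.0
    if nonlinearity == "relu":
        return math.sqrt(2.0)
    if nonlinearity == "leaky_relu":
        return math.sqrt(2.0 / (1.0 + negative_slope ** 2))
    raise ValueError(f"unsupported nonlinearity: {nonlinearity}")


def kaiming_normal_std(fan_in, nonlinearity="relu", negative_slope=0.0):
    """Standard deviation of the normal distribution: gain / sqrt(fan_in)."""
    return gain_for(nonlinearity, negative_slope) / math.sqrt(fan_in)


def kaiming_uniform_bound(fan_in, nonlinearity="relu", negative_slope=0.0):
    """Half-width of the uniform distribution: gain * sqrt(3 / fan_in)."""
    return gain_for(nonlinearity, negative_slope) * math.sqrt(3.0 / fan_in)
```

With a fixed fan_in, switching the activation only changes the gain, so the same layer shape gets a different initial spread for tanh than for ReLU.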

New dynamic graph execution mechanism

  • Modify the initialization method of optional type Tensor in the new dynamic graph API logic to prevent data exceptions caused by early destruction. (#42561)

New static graph executor

  • Defer initialization of the thread pools in the executor, to avoid creating thread pools for programs that execute only once (e.g., save, load, startup_program). (#43768)
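
The idea of deferring thread-pool creation can be sketched with the Python standard library. This is only an illustration of lazy initialization, not the executor's actual implementation: the LazyExecutor class and its methods are hypothetical, and how the real executor decides which programs need a pool is not described in this note.

```python
from concurrent.futures import ThreadPoolExecutor


class LazyExecutor:
    """Hypothetical wrapper that creates its thread pool on first use."""

    def __init__(self, max_workers=4):
        self._max_workers = max_workers
        self._pool = None  # nothing is allocated at construction time

    @property
    def pool_created(self):
        return self._pool is not None

    def run_inline(self, fn, *args):
        # One-shot work runs on the calling thread and never touches the pool.
        return fn(*args)

    def run_async(self, fn, *args):
        # The pool is created only when asynchronous work is first submitted.
        if self._pool is None:
            self._pool = ThreadPoolExecutor(max_workers=self._max_workers)
        return self._pool.submit(fn, *args)

    def shutdown(self):
        if self._pool is not None:
            self._pool.shutdown(wait=True)
            self._pool = None
```

A caller that only ever uses run_inline never pays for thread creation, which is the saving the item above describes for single-run programs.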

Mixed precision training

  • Disable the state_dict hook in set_state_dict of paddle.nn.Layer. (#43407)

Distributed training

  • Support tensor model parallelism in paddle.incubate.nn.functional.fused_attention and paddle.incubate.nn.functional.fused_feedforward. (#43505)

Others

  • Adjust print format of the framework operator kernels to facilitate automated splitting and parsing. (#42931)
  • Update the model quantization API to support round-to-nearest, ties-to-even rounding and the quantization range [-128, 127]. (#43829)
  • Support AMP mixed precision training in quantization-aware training. (#43689)
  • Add a progress bar at the start of quantization-aware training, so that the progress of quantization initialization is easy to check. Skip the scale op when collecting out_threshold statistics to speed up initialization. (#43454)
  • Support conv and bn fusion in dynamic graph quantization training. Support setting skip_tensor_list in static graph offline quantization to exclude certain layers from quantization. (#43301)
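
The rounding mode and value range mentioned in the quantization item can be illustrated in plain Python, whose built-in round() rounds exact ties to the nearest even integer. This is a sketch of the arithmetic only; the function name quantize_int8 and the scale parameter are hypothetical and do not correspond to the Paddle quantization API.

```python
def quantize_int8(value, scale):
    """Quantize one float with round-half-to-even, then clamp to [-128, 127]."""
    # Python's built-in round() rounds exact ties to the nearest even integer.
    q = round(value / scale)
    return max(-128, min(127, q))


samples = [0.5, 1.5, 2.5, -0.5, -1.5, 300.0, -300.0]
quantized = [quantize_int8(v, scale=1.0) for v in samples]
# quantized == [0, 2, 2, 0, -2, 127, -128]
```

Both 1.5 and 2.5 map to 2, which is what distinguishes ties-to-even from the "round half away from zero" rule, and out-of-range inputs saturate at the ends of [-128, 127].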

(2) Performance Optimization

  • Optimize the paddle.incubate.nn.functional.fused_attention and paddle.incubate.nn.functional.fused_feedforward operators. Add the add_residual attribute to control whether the residual is added in the last step. The performance of the CAE model improves by 7.7%. (#43719)
  • Optimize the linspace operator. Initialize the three input Tensors start, stop, and num on the CPU to avoid a GPU->CPU copy inside the operator. The performance of the SOLOv2 model improves by 6%. (#43746)
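
As a rough picture of what an add_residual switch controls, the sketch below applies a sublayer and optionally adds the input back in the last step. It uses plain Python lists instead of tensors; feedforward_block and sublayer are hypothetical names, and the real fused operators do considerably more than this.

```python
def feedforward_block(x, sublayer, add_residual=True):
    """Apply `sublayer` elementwise, then optionally add the input back."""
    out = [sublayer(v) for v in x]
    if add_residual:
        # Final step: residual connection, skipped when add_residual is False.
        out = [o + v for o, v in zip(out, x)]
    return out
```

Turning the flag off lets a caller that handles the residual elsewhere skip the extra addition.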

(3) Bug Fixes

API

  • Fix an occasional error in paddle.io.DataLoader with return_list=True caused by a multi-thread conflict. (#43691)
  • Fix the error where the to method of paddle.nn.Layer reports that NoneType has no device attribute when one of the layer's parameters is None. (#43597)
  • Fix the bug that cumsum op computes wrong results for some shapes. (#42500, #43777)
  • Fix the bug that the output of Tensor.__getitem__ has dimension 0 in the network-building stage when a bool index is used in the static graph. (#43246)
  • Fix the exception raised when paddle.slice and paddle.strided_slice handle negative arguments. (#43432)
  • Fix the bug that set_value op produces a wrong assignment result when the slice step is negative. (#43694)
  • Fix the bug that the C++ copy interface cannot copy between devices in multi-card setups. (#43728)
  • Fix the inference-time bug caused by attribute naming in paddle.incubate.nn.functional.fused_attention and paddle.incubate.nn.functional.fused_feedforward. (#43505)
  • Fix an exception in ConditionalBlockGrad op when processing a Tensor that does not require grad. (#43034)
  • Fix the increase in device memory caused by the backward speed optimization of the C++ einsum op, and enable that optimization by default. (#43397)
  • Fix the bug that data is not reproducible when paddle.io.DataLoader reads data with multiple processes under a fixed random seed on a single card. (#43702)
  • Fix the bug that softmax op triggers CUDNN_STATUS_NOT_SUPPORT when the Tensor has more than 2G elements. (#43719)
  • Fix the bug that the trace op Event string is identical across different operators, which makes performance analysis inconvenient. (#42789)
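
For the negative-step item in the list above, the behaviour one would expect can be shown with ordinary Python list slicing. This assumes Tensor slice assignment follows the usual Python/NumPy slice semantics; the release note does not show the failing case, and assign_slice is a hypothetical helper, not a Paddle function.

```python
def assign_slice(values, start, stop, step, new_values):
    """Return a copy of `values` with values[start:stop:step] replaced."""
    result = list(values)
    # Extended-slice assignment: new_values must match the slice length.
    result[start:stop:step] = new_values
    return result


updated = assign_slice([0, 1, 2, 3, 4, 5], 5, 0, -2, [10, 20, 30])
# The slice 5:0:-2 visits indices 5, 3, 1 in that order,
# so updated == [0, 30, 2, 20, 4, 10]
```

The new values are written in traversal order, so the first new value lands on the highest index.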

Others

  • Fix the device memory overflow caused by repeated deepcopy and save operations in dynamic-to-static conversion. (#43141)
  • Fix the bug that the device id is wrong in multi-card scenarios, introduced by the upgrade of the PlaceType used in custom operators. (#43830)
  • Optimize the timeline visualization logic of paddle.profiler.Profiler: events defined in Python scripts are now shown under the Python fold instead of the C++ fold. (#42790)

3. Deployment (Paddle Inference)

(1) New Features

New functions

  • Add support for PaddleSlim quantized models in the ONNX Runtime backend on CPU. (#43774, #43796)

(2) Underlying Optimization

CPU performance optimization

  • Remove gpu_cpu_reshape2_matmul_fuse_pass from the EnableMkldnn configuration to fix the ResNet50 performance degradation. (#43750)

GPU performance optimization

  • Add TensorRT converter support for bilinear_interp_v2. (#43618)
  • Add matmul_scale_fuse_pass and multihead_matmul_fuse_pass_v3 to the GPU passes, and add unit tests. (#43765)
  • Add support for deferred initialization of GPU handles. (#43661)

(3) Bug Fixes

Framework and API fixes

  • Fix the compilation error when building together with Paddle-Lite XPU. (#43178)
  • Fix the bug that the ERNIE 3.0 pass is triggered by mistake. (#43948)
  • Fix the bug that the int8 quantization attribute in multihead op cannot be read. (#43020)

Backend capability fixes

  • Fix the crash of the elementwise_mul and matmul ops in MKLDNN during quantized inference. (#43725)
  • Fix a bug where TensorRT subgraph serialization files are repeatedly generated for the same model during inference. (#42945, #42633)
  • Fix a conflict between the ONNX Runtime backend and an externally used protobuf. (#43159, #43742)
  • Fix an inference error in the Python prediction library with the ONNX Runtime backend when the model has multiple inputs. (#43621)

4. Environment Adaptation

Compile and install

  • Complete verification and adaptation for CUDA 11.6, and release precompiled binaries for CUDA 11.6 on the official website. (#43935, #44005)
  • Fix a cub error when compiling with CUDA 11.6 on Windows. (#43935, #44005)
  • Fix the long compilation time of the elementwise and reduce ops. (#43202, #42779, #43205)

New hardware adaptation

  • Cambricon MLU supports PaddlePaddle Profiler. (#42115)
  • GraphCore IPU supports visualization of compilation progress. (#42078)
