This is version 2.2.1.


2.2.1 Release Note

1. Important Updates

This release mainly fixes functional and performance issues in PaddlePaddle 2.2.0 and enhances some features. The highlights are as follows:

  • Add paddle.linalg.triangular_solve to solve systems of linear equations whose coefficient matrix is triangular.
  • Add the paddle.device.cuda.graphs.CUDAGraph API, which supports NVIDIA's CUDA Graph feature. Note that this API is still experimental and not yet stable.
  • Fix known issues in basic APIs and Tensor indexing.
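A triangular system does not need a general factorization: it can be solved directly by substitution, one unknown per row, in O(n^2) operations. The plain-Python sketch below shows back substitution for the upper-triangular case. It only illustrates the underlying algorithm; it is not Paddle's implementation, and the function name upper_triangular_solve is hypothetical.

```python
from typing import List


def upper_triangular_solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b for x, where a is square and upper triangular.

    Back substitution: the last row has a single unknown, so solve it first,
    then move upwards, subtracting the terms whose unknowns are already known.
    """
    n = len(a)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        if a[i][i] == 0:
            raise ValueError("singular matrix: zero on the diagonal")
        # Contribution of the unknowns solved in previous iterations.
        known = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - known) / a[i][i]
    return x


if __name__ == "__main__":
    a = [[2.0, 1.0, 1.0],
         [0.0, 3.0, 2.0],
         [0.0, 0.0, 4.0]]
    b = [9.0, 13.0, 8.0]
    print(upper_triangular_solve(a, b))  # [2.0, 3.0, 2.0]
```

A lower-triangular system is handled symmetrically by forward substitution, starting from the first row.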

2. Training Framework (Distributed Included)

(1) New Features

API

  • Add the paddle.linalg.triangular_solve API to solve systems of linear equations whose coefficient matrix is triangular. (#36714)
  • Add the paddle.device.cuda.graphs.CUDAGraph API, which supports NVIDIA's CUDA Graph feature: all GPU computation can be captured into a single CUDA Graph and replayed many times afterwards, removing the framework's extra overhead and improving runtime performance. Note that this API is still experimental and not yet stable. (#37109)
  • Add the paddle.incubate.graph_send_recv API for graph learning. It reduces the GPU or host memory consumed by intermediate variables during message passing and supports four update modes: SUM, MEAN, MIN, and MAX. (#37205)
  • Add the paddle.incubate.operators.ResNetUnit API, which fuses the convolution, batch normalization, and shortcut/bottleneck operations of ResNet networks. (#37109)
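The message passing that graph_send_recv performs can be read as "gather the rows x[src_index], then reduce them into the rows dst_index". The plain-Python reference below models those semantics for the four update modes. It is a sketch under stated assumptions: the name send_recv_reference and its parameters are hypothetical and are not the Paddle signature, and rows that receive no message are assumed to be zero. Note that this reference materializes every gathered message in per-node buckets, which is exactly the kind of intermediate storage the fused operator is meant to avoid.

```python
from typing import List

Row = List[float]


def send_recv_reference(x: List[Row], src_index: List[int],
                        dst_index: List[int], pool_type: str = "sum") -> List[Row]:
    """Gather x[src] for each edge (src, dst) and reduce the messages into out[dst]."""
    n, width = len(x), len(x[0])
    # Collect the incoming messages of every destination row.
    buckets: List[List[Row]] = [[] for _ in range(n)]
    for s, d in zip(src_index, dst_index):
        buckets[d].append(x[s])

    out: List[Row] = []
    for msgs in buckets:
        if not msgs:
            # Assumption: a row with no incoming message is filled with zeros.
            out.append([0.0] * width)
            continue
        cols = list(zip(*msgs))  # column-wise view of the received messages
        if pool_type == "sum":
            out.append([float(sum(c)) for c in cols])
        elif pool_type == "mean":
            out.append([sum(c) / len(c) for c in cols])
        elif pool_type == "max":
            out.append([float(max(c)) for c in cols])
        elif pool_type == "min":
            out.append([float(min(c)) for c in cols])
        else:
            raise ValueError(f"unknown pool_type: {pool_type}")
    return out


if __name__ == "__main__":
    x = [[0.0, 2.0, 3.0], [1.0, 4.0, 5.0], [2.0, 6.0, 7.0]]
    src, dst = [0, 1, 2, 0], [1, 2, 1, 0]
    print(send_recv_reference(x, src, dst, "sum"))
    # [[0.0, 2.0, 3.0], [2.0, 8.0, 10.0], [1.0, 4.0, 5.0]]
```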

(2) Feature Improvements

API

  • paddle.incubate.FusedTransformerEncoderLayer now supports src_mask=None and pure fp16. (#37229)

IR (Intermediate Representation)

  • Dynamic Graph to Static Graph
    • When @paddle.jit.to_static decorates a standalone function, train() and eval() are provided to switch it to train or eval mode. (#37383)

Distributed Training

  • The heterogeneous parameter server improves its ability to split the graph an arbitrary number of times and adds pipeline training, which increases training throughput. (#37446)

Others

  • Strengthen the out-of-bounds check on the index of paddle.scatter, where an out-of-range index used to cause a core dump, and improve the corresponding error message. (#37431)

(3) Performance Optimization

  • Optimize paddle.top_k so that it selects an implementation based on k and input_width: the cub implementation when k >= 75% of input_width, and a handwritten kernel otherwise. (#37325)
  • Optimize paddle.fluid.optimizer.LarsMomentumOptimizer by fusing the optimizer operators and using CUDA Cooperative Groups, which improves OP performance. (#37109)
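The paddle.top_k dispatch rule can be sketched in plain Python. Everything below is a hypothetical illustration, not Paddle code: choose_topk_impl mirrors only the 75% threshold stated in the note, and the two strategies are CPU stand-ins chosen by analogy (a full sort, which pays off when k is close to the row width, versus a partial selection, which is cheaper when k is small). Whichever path is taken, the result must be the same.

```python
import heapq
from typing import List


def choose_topk_impl(k: int, input_width: int) -> str:
    """Hypothetical mirror of the rule: 'cub' when k >= 75% of input_width."""
    # 4 * k >= 3 * input_width is the same test as k >= 0.75 * input_width,
    # written in integer arithmetic to avoid floating-point rounding.
    return "cub" if 4 * k >= 3 * input_width else "handwritten"


def topk_full_sort(row: List[float], k: int) -> List[float]:
    # Stand-in for a sort-based path: sort everything, keep the first k.
    return sorted(row, reverse=True)[:k]


def topk_selection(row: List[float], k: int) -> List[float]:
    # Stand-in for a selection-based path: keep only k candidates.
    return heapq.nlargest(k, row)


def topk(row: List[float], k: int) -> List[float]:
    impl = choose_topk_impl(k, len(row))
    return topk_full_sort(row, k) if impl == "cub" else topk_selection(row, k)


if __name__ == "__main__":
    row = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
    print(choose_topk_impl(3, len(row)), topk(row, 3))  # handwritten [9.0, 6.0, 5.0]
    print(choose_topk_impl(7, len(row)), topk(row, 7))  # cub [9.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
```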

(4) Bug Fixes

API

  • Fix the formula of paddle.nn.ELU and paddle.nn.functional.elu, which gave wrong results when alpha < 0. Note that the in-place version paddle.nn.functional.elu_ does not support alpha < 0 and raises an error in that case. (#37437)
  • Fix the out_of_range error raised during the backward pass of paddle.slice. (#37584)
  • paddle.shape has no backward pass, so its stop_gradient is now explicitly set to True. (#37412)
  • paddle.arange has no backward pass, so its stop_gradient is now explicitly set to True. (#37486)
  • paddle.shard_index reports an error if the last dimension of the input data is not 1. (#37421)
  • Fix the wrong dimension used in dequantization when paddle.matmul uses int8 quantization. (#36982)
  • Fix the issue that paddle.nn.Dropout does not compute gradients in eval mode. (#37305)
  • Fix the error raised by paddle.nn.functional.dropout in static graph mode when the input Tensor's shape contains -1 and that dimension is specified to be dropped. (#37223)
  • Fix incorrect backward computation of multi-layer RNNs (with dropout set to 0) during CPU training with the RNN APIs paddle.nn.LSTM, paddle.nn.GRU, and paddle.nn.SimpleRNN. (#37086)
  • Fix several issues in paddle.incubate.FusedTransformerEncoderLayer: wrong gradients in the backward pass, incorrect handling of pre_layer_norm, incorrect parameter handling, missing parameters, and wrong add_bias computation. (#37229)
  • Fix the issue that paddle.incubate.fused_multi_head_attention does not support bias being None. (#37411, #37566)
  • Fix the issue that data loaded by paddle.vision.datasets.Cifar10 and paddle.vision.datasets.Cifar100 was not in order. (#37528)
  • Fix the erroneous dimension-check error raised when a one-dimensional Tensor is indexed with an ellipsis (...). (#37192)
  • Fix the issue that gradient attributes could not propagate through Tensor index assignment (setitem); see the issue for details. (#37028)
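For reference, the standard ELU is elu(x) = x for x > 0 and alpha * (exp(x) - 1) otherwise. The note does not say what the previous Paddle formula was, so the second function below is purely hypothetical: it shows one plausible-looking rewrite, max(0, x) + min(0, alpha * (exp(x) - 1)), that agrees with ELU only when alpha >= 0, which illustrates why negative alpha needs explicit care.

```python
import math


def elu_reference(x: float, alpha: float) -> float:
    """Standard ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def elu_min_max(x: float, alpha: float) -> float:
    """Hypothetical rewrite, not Paddle's old kernel.

    Equals elu_reference only when alpha >= 0. With alpha < 0 the sign of
    alpha * (exp(x) - 1) flips: for x < 0 it is positive and min() clips it
    to 0, and for x > 0 it is negative and gets added to x.
    """
    return max(0.0, x) + min(0.0, alpha * (math.exp(x) - 1.0))


if __name__ == "__main__":
    for alpha in (1.0, -0.5):
        for x in (-1.0, 0.5):
            print(alpha, x, elu_reference(x, alpha), elu_min_max(x, alpha))
```

With alpha = -0.5, elu_reference(-1.0, -0.5) is positive (about 0.316), while the min/max form returns 0.0.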

IR (Intermediate Representation)

  • Dynamic Graph to Static Graph
    • Models converted from dynamic to static graph can now call paddle.flops to count model parameters correctly. (#36852)
    • The dynamic-to-static module now correctly converts loop statements of the form for i in [1, 2, 3]. (#37259)

Distributed Training

  • fleet.load_model: Fix the issue that the model-loading API was unusable in parameter server mode. (#37461)
  • fleet.save_inference_model: Fix the issue that, in parameter server mode, dense parameters were not pulled from the server before the model was saved. (#37461)

Others

  • Fix an in-place operation issue in dynamic graph mode: if backward was executed immediately after an in-place operation on a non-leaf node, the gradients of that node and of the nodes before it were computed incorrectly. (#37420)

3. Paddle Inference

(1) Bug Fixes

  • Further remove redundant debug logs when logging is explicitly disabled. (#37212)
  • Fix the memory/GPU memory optimization strategies to avoid incorrect inference results or crashes caused by improper memory/GPU memory optimization. (#37324, #37123)
  • Fix the scale computation error in QkvToContextPluginDynamic for the fused MultiHead structure of Transformer models, which was caused by incorrect block and thread settings of a CUDA function. (#37096)
  • Register all inference OPs for int8 quantization, fixing the issue that, for historical reasons, some inference OPs were not registered for int8 quantization. (#37266)
