提交 · f72d52e79b79c7b98f52d35dba9c63d2e9ddba8c · PaddlePaddle / Paddle

22 9月, 2021 1 次提交
- W
  [cherry-pick] trt engine dtor when the last predictor dtor (#35881) · f72d52e7
  由 Wilber 提交于 9月 22, 2021
```
* cherry-pick 32842
```
  f72d52e7
18 9月, 2021 4 次提交

由 Feiyu Chan 提交于 9月 18, 2021

* 1. add interface for fft;
2. add data type predicate;
3. fix paddle.roll.

* add fft c2c cufft kernel

* implement argument checking & op calling parts for fft_c2c and fftn_c2c

* add operator and opmaker definitions

* only register float and double for cpu.

* add common code for implementing FFT, add pocketfft as a dependency

* add fft c2c cufft kernel function

* fix bugs in python interface

* add support for c2r, r2c operators, op makers, kernels and kernel functors.

* test and fix bugs

* 1. fft_c2c function: add support for onesided=False;
2. add complex<float>, complex<double> support for concat and flip.

* 1. fft: fix python api bugs;
2. shape_op: add support for complex data types.

* fft c2c cufft kernel done with complie and link

* fix shape_op, add mkl placeholder

* remove mkl

* complete fft c2c in gpu

* 1. implement mkl-based fft, FFTC2CFunctor and common function exec_fft;
2. change the design, add input and output typename as template parameter for all FFTFunctors, update pocketfft-based implementation.

* complete fft c2c on gpu in ND

* complete fft c2c on gpu in ND

* complete fft c2c backward in ND

* fix MKL-based implementation

* Add frame op and CPU/GPU kernels.

* Add frame op forward unittest.

* Add frame op forward unittest.

* Remove axis parameter in FrameFunctor.

* Add frame op grad CPU/GPU kernels and unittest.

* Add frame op grad CPU/GPU kernels and unittest.

* Update doc string.

* Update after review and remove librosa requirement in unittest.

* Update grad kernel.

* add fft_c2r op

* Remove data allocation in TransCompute function.

* add fft r2c onesided with cpu(pocketfft/mkl) and gpu

* last fft c2r functor

* fix C2R and R2C for cufft, becase the direction is not an option in these cases.

* add fft r2c onesided with cpu(pocketfft/mkl) and gpu

* fix bugs in python APIs

* fix fft_c2r grad kernal

* fix bugs in python APIs

* add cuda fft c2r grad kernal functor

* clean code

* fix fft_c2r python API

* fill fft r2c result with conjugate symmetry (#19)

fill fft r2c result with conjugate symmetry

* add placeholder for unittests (#24)

* simple parameterize test function by auto generate test case from parm list (#25)

* miscellaneous fixes for python APIs (#26)

* add placeholder for unittests

* resize fft inputs before computation is n or s is provided.

* add complex kernels for pad and pad_grad

* simplify argument checking.

* add type promotion

* add int to float or complex promotion

* fix output data type for static mode

* fix fft's input dtype dispatch, import fft to paddle

* fix typos in axes checking (#27)

* fix typos in axes checking

* fix argument checking (#28)

* fix argument checking

* Add C2R Python layer normal and abnormal use cases (#29)

* documents and single case

* test c2r case

* New C2R Python layer normal and exception use cases

* complete rfft,rfft2,rfftn,ihfft,ihfft2,ihfftn unittest and doc string (#30)

* Documentation of the common interfaces of c2r and c2c (#31)

* Documentation of the common interfaces of c2r and c2c

* clean c++ code  (#32)

* clean code

* Add numpy-based implementation of spectral ops (#33)

* add numpy reference implementation of spectral ops

* Add fft_c2r numpy based implementation for unittest. (#34)

* add fft_c2r numpy implementation

* Add deframe op and stft/istft api. (#23)

* Add frame api

* Add deframe op and kernels.

* Add stft and istft apis.

* Add deframe api. Update stft and istft apis.

* Fix bug in frame_from_librosa function when input dims >= 3

* Rename deframe to overlap_add.

* Update istft.

* Update after code review.

* Add overlap_add op and stft/istft api unittest (#35)

* Add overlap_add op unittest.

* Register complex kernels of squeeze/unsquuze op.

* Add stft/istft api unittest.

* Add unittest for fft helper functions (#36)

* add unittests for fft helper functions. add complex kernel for roll op.

* complete static graph unittest for all public api (#37)

* Unittest of op with FFT C2C, C2R and r2c added (#38)

* documents and single case

* test c2r case

* New C2R Python layer normal and exception use cases

* Documentation of the common interfaces of c2r and c2c

* Unittest of op with FFT C2C, C2R and r2c added
Co-authored-by: lijiaqi <lijiaqi0612@163.com>

* add fft related options to CMakeLists.txt

* fix typos and clean code (#39)

* fix invisible character in mkl branch and fix error in error message

* clean code: remove docstring from unittest for signal.py.

* always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype. (#40)

* always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype.

* fix CI Errors: numpy dtype comparison, thrust when cuda is not available (#41)

1. always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype.
2. promote floating point tensor to complex tensor ior fft_c2c and fft_c2r;
3. fix unittest to catch UnImplementedError and RuntimeError;
4. fix compile error by avoid using thrust when cuda is not available.
5.  fix sample code, use paddle.fft instead of paddle.tensor.fft

* remove inclusion of thrust, add __all__ list for fft (#42)

* Add api doc and update unittest. (#43)

* Add doc strings.
* Update overlap_add op unittest

* fix MKL-based FFT implementation (#44)

* fix MKL-based FFT implementation, MKL CDFT's FORWARD DOMAIN is always REAL for R2C and C2R

* remove code for debug (#45)

* use dynload for cufft (#46)

* use std::ptrdiff_t as datatype of stride (instead of int64_t) to avoid argument mismatch on some platforms.

* add complex support for fill_zeros_like

* use dynload for cufft

* Update doc and unittest. (#47)

* Add doc of frame op and overlap_add op.

* Update unittest.

* use dynload for cufft (#48)

1. use dynload for cufft
2. fix unittest;
3. temporarily disable Rocm.

* fix conflicts and merge upstream (#49)

fix conflicts and merge upstream

* fix compile error: only link dyload_cuda when cuda is available (#50)

* fix compile error: only link dyload_cuda when cuda is available

* fix dynload for cufft on windows (#51)

1. fix dynload for cufft on windows;
2. fix unittests.

* add NOMINMAX to compile on windows (#52)

 add NOMINMAX to compile on windows

* explicitly specify capture mode for lambdas (#55)

 explicitly specify capture mode for lambdas

* fix fft sample (#53)

* fix fft sample

* update scipy and numpy version for unittests of fft (#56)

update scipy and numpy version for unittests of fft

* Add static graph unittests of frame and overlap_add api. (#57)

* Remove cache of cuFFT & Disable ONEMKL (#59)

1. replace numpy.fft with scipy.fft as numpy<1.20 not support ortho norm
2. remove cache of cufft plans;
3. enhance error checking.
4. default WITH_ONEMKL to OFF
Co-authored-by: Njeff41404 <jeff41404@gmail.com>
Co-authored-by: Nroot <root@bjyz-sys-gpu-kongming9.bjyz.baidu.com>
Co-authored-by: NKP <109694228@qq.com>
Co-authored-by: lijiaqi <lijiaqi0612@163.com>
Co-authored-by: NXiaoxu Chen <chenxx_id@163.com>
Co-authored-by: Nlijiaqi0612 <33169170+lijiaqi0612@users.noreply.github.com>

11518a43

A
Clean ParseMemInfo and Fix unittest failed under multi-thread (#35840) · 2fff5a58
由 Aurelius84 提交于 9月 18, 2021
```
* Clean ParaseMemInfo and fix unittest with multi-thread

* fix declare
```
2fff5a58

[oneDNN] Disable caching of Reorder operation (#35664) · e4c2a854

由 Jacek Czaja 提交于 9月 18, 2021

* - REorder disabling caching

* - compilation fix

* - another compilation fix

* - another compilation fix

* - compilation fix

* - Fix

* - yet another compilation fix

* - suppresingly another compilation fix

* - lint

* - fix after review

* - fix

e4c2a854

Add new API "eigvals" in linalg (#35720) · d411a038

由 From00 提交于 9月 18, 2021

* Add linalg.eigvals API

* pre-commit check

* Adjust code style

* Fix conflict

* Improve code style

* Modify the test code to ignore testing CUDA kernel

* Sort ouput data before checking in test code

* Set timeout value for UT

* Improve API example code to pass CI

* Fix bug for None fetch_list in Windows

* Delete grad Op

d411a038

17 9月, 2021 4 次提交

Z

change to PADDLE_DEFINE_EXPORTED (#35841) · d22914fd
由 Zeng Jinle 提交于 9月 17, 2021

d22914fd

add inplace op support to prune, scale_op is no longer need in jit.save (#35730) · 21921936

由 Haipeng Wang 提交于 9月 17, 2021

* add scale_op in model save step is not necessary, just fix the prune method to support static graph and inplace op

* fix jit.save, no need to add scale_op to each outputvar anymore.
fix prune_with_input, now it supports inplace op

* temporarily disable test_trt_dynamic_shape.TRTDynamicShapeOutOfBound2Test

21921936

Intergrate MultiThreadedWorkQueue to execute program ops (#35356) · a0871194

由 Aurelius84 提交于 9月 17, 2021

* format code

* format interface

* polish interface

* Remove std::memory_order

* modify into SpinLock

* remove fetch_context_pool_

* fix comment

* modify into WorkQueueGroup

* refine code

* fix pointer

* fix paddle_enforce

* split into AsyncWorkQueue

* polish code

* specify std::memory_relax

* fix atomic fetch_sub

* fix num_thread

a0871194

GeneratePass for Python Pass (#35708) · f6db9806

由 wuhuanzhou 提交于 9月 17, 2021

#### 背景

#35602 提供Python侧开发子图替换类Pass的方式：

- 利用Paddle Python API或者辅助类型定义子图program用来匹配/替换图；
- Python侧注册Pass时，将注册函数最终转换为protobuf定义的PassDesc数据形式，供C++侧进行解析完成Pass实例注册。

本PR即为根据PassDesc规则描述解析生成Pass实例。

#### 方案设计

##### Pass规则验证

在以往的Pass开发中，会存在随着算子迭代引发的匹配失效或者错误匹配的问题，该问题可以通过扫描算子支持的参数设置及参数类型等来判断是否应该使用该Pass或者给出提示需要修改Pass代码。

当前Pass开发中提供了算子兼容性OpCompatSensiblePass用于解决上述问题。但同时还存在不足：由于以往Pass开发在运行时才能获取到pattern信息，所以需要在执行Pass时才可以判断。

使用PassDesc表示的Pass可以在执行Pass前验证上述问题，这个过程在VerifyDesc中完成。

##### 根据匹配子图构造pattern

GeneratePass对于图匹配和替换使用GraphPatternDecetor完成，构造匹配pattern实际上就是将对应对象成员PDPattern中添加PDNode和边关系。该过程在函数`InitGeneratePattern`中完成，该函数没有作为GeneratePass的成员方法，主要出于后续可能开发新的Decetor考虑，GeneratePass与Decetor的操作是没有关联的。

初始化pattern主要通过遍历匹配子图program的全部算子实现：

1. 添加当前算子对应PDNode及限制条件（算子类型、属性限制等）；
2. 遍历当前算子对应输入并从pattern中尝试获取PDNode：
   - 在pattern中获取到PDNode且为输出节点：表示属于匹配子图的中间节点，将该PDNode设置为中间节点；
   - 在pattern中没有获取到PDNode：添加该输入PDNode并设置作为输入节点；
   - 设置输入到算子的边关系；
3. 遍历当前算子对应输出：
   - 在pattern中获取到PDNode且为输入节点：表示属于匹配子图的中间节点，将该PDNode设置为中间节点；
   - 在pattern中没有获取到PDNode：添加该输入PDNode并设置作为输出节点；
   - 设置算子到输出的边关系；

##### 根据替换子图操作graph

替换子图操作的过程在`GetGenerateRewrite`函数中完成，与`InitGeneratePattern`类似没有作为GeneratePass的成员方法。

生成替换子图操作过程如下：

1. 判断冗余替换子图；
2. 遍历替换子图program的全部算子添加替换子图Node：
   1. 添加当前算子的Node及属性设置；
   2. 遍历当前算子对应输入，添加中间variable节点；
   3. 遍历当前算子对应输出，添加中间variable节点；
   4. 添加输入/输出节点与算子节点的边关系；
3. 删除匹配图中属于中间节点的Node；

##### 优化子图验证

对于替换子图或者替换后的计算图是否可以正确运行等，可以在执行Pass时验证，从而防止在后续执行计算图时出现异常。

当前Pass执行直接修改计算图，验证失败时无法很好的完成还原操作，目前子图验证暂时默认成功，留到后续改进。

f6db9806

16 9月, 2021 3 次提交

fix bug in pscore (#35698) · e64fed86

由 wangguanqun 提交于 9月 16, 2021

* add trainer desc config to distributed strategy

* code style modified

* data_feed set lod

* fix bug

* code style

* fix bug

e64fed86

L

rename the auto parallel suffix (#35765) · 2c70b844
由 lilong12 提交于 9月 16, 2021

2c70b844

Python support register pass via PassDesc (#35602) · bab39eb2

由 wuhuanzhou 提交于 9月 16, 2021

PR主要功能：针对fusion等子图替换场景，支持Python侧开发并注册Pass。

背景
Pass是指输入一个深度学习计算图Graph，依照一定条件进行修改，输出修改后的Graph的过程；
当前PaddlePadle框架编写Pass代码存在以下问题：
用户需要手写Graph的条件匹配、在Graph上的修改代码；
对Graph操作需要深入底层框架代码，了解Graph的结构，并且知道相关Pass写法；
我们提出了针对fusion等子图替换类Pass的优化方案以支持用户在Python侧开发注册Pass，提升二次开发体验：
用户只需要输入匹配和替换的子图描述，由深度学习框架编写的代码来生成匹配和替换的逻辑，不需要用户对Graph进行匹配和替换操作；
API级别的替换，用户可以通过Paddle的Python API构造子图，从而不需要知道Graph的结构，也能写Paddle的Graph Pass代码

bab39eb2

15 9月, 2021 3 次提交
- W
  add inplace logic into new_executor (#35618) · bd79ae09
  由 wanghuancoder 提交于 9月 15, 2021
```
* add inplace logic into new_executor, test=develop

* check shape and add inplace FLAGS, test=develop

* refine, test=develop

* refine, test=develop
```
  bd79ae09
- 王
  clip op extra information when export model. (#35447) · 4d236354
  由王明冬提交于 9月 15, 2021
```
* clip op extra information when export model,test=ocr

* rename clip_extra parameter to kwargs in save_inference_model, test=ocr
```
  4d236354
- W
  
  [hybrid] out data parallel as optimizer sharding parallel (#35593) · 78465703
  由 WangXi 提交于 9月 15, 2021
  
  78465703
14 9月, 2021 6 次提交
- Y
  
  [HETERPS]fix make error when pscore equals on · 3493c46e
  由 yaoxuefeng 提交于 9月 14, 2021
  
  3493c46e
- J
  Add dropout convert test (#35488) · 6327c33b
  由 JingZhuangzhuang 提交于 9月 14, 2021
```
* add dropout convert test

* modify dropout convert test
Co-authored-by: Nxiaoxiaohehe001 <hiteezsf@163.com>
```
  6327c33b
- W
  
  skip sharelod, test=develop (#35625) · 16e40513
  由 wanghuancoder 提交于 9月 14, 2021
  
  16e40513
- A
  
  Refactor StreamAnalyzer and EventManager from InterpreterCore (#35711) · 8528dd9f
  由 Aurelius84 提交于 9月 14, 2021
  
  8528dd9f
- Y
  
  [hybrid performance] Optimize Pipeline Scheduler (#35680) · 04fdb10a
  由 Yuang Liu 提交于 9月 14, 2021
  
  04fdb10a
- A
  Intergrate StandaloneExecutor in Static.Executor Interface with... · 4bc08530
  由 Aurelius84 提交于 9月 14, 2021
```
Intergrate StandaloneExecutor in Static.Executor Interface with FLAGS_USE_STANDALONE_EXECUTOR (#35628)

* Intergrate StandaloneExecutor in Static.Executor Interface with FLAGS_USE_STANDALONE_EXECUTOR

* Enhance unittest and clean code in StandaloneExecutor

* polish unittest
```
  4bc08530
13 9月, 2021 1 次提交
- L
  
  add lstm qat models scales (#35382) · 1ee237c1
  由 lidanqing 提交于 9月 13, 2021
  
  1ee237c1
12 9月, 2021 1 次提交
- 王
  
  fix some op extra error, test=develop (#35667) · 83424033
  由王明冬提交于 9月 12, 2021
  
  83424033
11 9月, 2021 2 次提交
- W
  refactor gc (#35525) · adaa207b
  由 wanghuancoder 提交于 9月 10, 2021
```
* refactor gc, test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

* gc each tensor, test=develop

* refine, test=develop
```
  adaa207b
- 王
  
  register the with_quant_attr attribute for all operattor. test=develop (#35591) · 8412d6c0
  由王明冬提交于 9月 11, 2021
  
  8412d6c0
10 9月, 2021 2 次提交

add cumprod op (#35185) · 4e509f46

由 hlygit66666 提交于 9月 10, 2021

* add test_cumprod_op

* Revert "add test_cumprod_op"

This reverts commit c96cf6dff5d09ae7d8cc72c1e8ae4369a153aa19.

* recommit

* add error message

* test input(x) initialize

* test use cpu

* update test code

* add test type

* add test case

* solve ci problem

* add complex case test

* add complex case test

* fix review problem

* fix conflict

* fix some docs

* change test case

* change test case

* fix review problems again

* fix docs

* fix inclusivescan bug

4e509f46

import ska flat_hash_map (#34464) · 3d9603dc

由 chentianyu03 提交于 9月 10, 2021

* import ska flat_hash_map

* add define NOMINMAX macro to fix windows build failed bug

* add brackets to std::max in flat_hash_map

* move flat_hash_map directions

* modify namespace to paddle

* modify namespace to paddle

* modify namespace to paddle

* modify namespace to paddle

* rm not used map.h and replace with op_info

3d9603dc

08 9月, 2021 5 次提交

[Auto Parallel] Integrate all modules (#35483) · 12155358

由 Yulong Ao 提交于 9月 08, 2021

* add auto_parallel dir

* mv to paddle.distributed

* add shard_xx api

* add distributed attrs for var

* add ut, test=develop

* add dist

* update

* update

* update

* update

* update

* update, test=develop

* update, test=develop

* update, test=develop

* update, test=develop

* update, test=develop

* update, test=develop

* update, test=develop

* update

* update

* update

* update

* update

* update, test=develop

* update, test=develop

* update

* update

* delete unused proto

* resotre op_desc

* restore type_defs

* update var_desc

* remove dimss_mapping for proto_pybind

* update interface.py

* update framework.py

* update

* update

* add auto_parallel dir

* mv to paddle.distributed

* add shard_xx api

* add distributed attrs for var

* add ut, test=develop

* [WIP] Add the auto completion feature and related codes

* [WIP] Improve the auto completion and related codes

* [WIP] Make the auto completion to support data-parallel

* [WIP] Make the completion support mp and dp+mp

* [WIP] Refactor auto completion unit test for MLP

* [WIP] Refactor the implementation of DistributedOperatorImpl

* [WIP] Improve dims_mapping update rule and fix a bug

* [WIP] Support auto completion for one transformer decoder layer

* [WIP] Add a minor change

* [WIP] Fix a bug within the uint test

* Shard XShape tensor, add embedding completion and refactor code

* Add the distributed_operators dir to setup.py.in

* Improve the completion process and add the unittest for gpt

* fix process_mesh ut

* fix process_mesh ut

* update

* update, test=develop

* Add support for automatically completing distributed attrs of special ops

* update

* update

* update

* fix doc sample codes, test=develop

* improve coverage, test=develop

* add static_mode check, test=develop

* Model the cluster for cost model and physical mapping

* update, test=develop

* add set_placement, test=develop

* Add the check to make sure the candidate tensors' size is great than zero

* update doc, test=develop

* update doc, test=develop

* update doc, test=develop

* update doc, test=develop

* update, test=develop

* Auto mark dist attrs annotated by user

* update ndarray to nested list, test=develop

* update, test=develop

* Add auto-completion module for auto-parallel (based on PR#33804)

* Remove unnecessary files

* Remove unrelated files for the auto completion pr

* Update the unit test to improve the coverage

* Modify codes based on reviews

* Minor changes for CI

* Improve some codes based on new comments

* Fix bugs caused by shallow copy in attributes.py
* Imporve amend_distributed_attr_for_program in context.py
* Other changes for weihang's comments

* support shard reader

* support shard reader

* add parallel mode

* update process mesh

* add method to compute comm_group

* implement dist_embedding forward func

* implement dist matmul forward func

* implement dist reshape forward func

* add transpiler framework

* add transpiler forward

* implement transpiler forward

* implement transpiler backward & update

* add process

* add unitest

* chmod

* chmod

* chmod

* update unitest

* add unitest for gpt

* remove unused print

* rename transpiler --> partitioner

* rename transpiler --> partitioner

* chmod

* chmod

* bug fixed

* remove amp function

* update case for dp mode

* update case for dp mode

* [Auto Parallel] Integrate all parts with the newest code

* Integrate all parts of auto parallel and improve codes

* Integrate all parts by AutoParallelizer
* Add unit test for AutoParallelizer
* Improve auto completion module for pipeline parallel
* Add support for matmul_v2 in dist_matmul
* Correct the typo "stratergy" to "strategy"

* Modify distributed_strategy.proto to conform the main stream

* Restore parts of distributed_strategy to conform the develop branch
Co-authored-by: Nsandyhouse <lilong12@baidu.com>
Co-authored-by: NJZ-LIANG <jianzhongliang10@gmail.com>

12155358

refactor new executor (#35537) · 0eb7c942

由 wanghuancoder 提交于 9月 08, 2021

* refactor new executor, test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

0eb7c942

Enable program passes on Fleet APIs (#34955) · 5f369881

由 Zeng Jinle 提交于 9月 08, 2021

* add fleet api for program pass

* turn on apply pass for CI test

* fix disable fuse_all_optimizer bug

* try to test ci

* fix CI

* fill unspecified op role

* fix fuse_allreduce

* add ut to improve coverage

* remove useless change

* improve c++ coverage

* follow some comments

* test ir pass pipeline

* update doc

* reduce ut time again

5f369881

Work queue group (#35470) · a53460aa

由 liutiexing 提交于 9月 08, 2021

* Split Tracker and WorkQueue

* add WorkQueueGroup

* add unittest

* fix

* update

* update

* fix compile

a53460aa

W

[NPU] add get_float_status op and refine NPU check_nan_inf (#35274) · c727ec4a
由 WangXi 提交于 9月 08, 2021

c727ec4a

07 9月, 2021 2 次提交
- Y
  
  support multi-node (#35396) · c6e0cedc
  由 yaoxuefeng 提交于 9月 07, 2021
  
  c6e0cedc
- C
  
  fix int8 (#35504) · ed97be09
  由 ceci3 提交于 9月 07, 2021
  
  ed97be09
06 9月, 2021 1 次提交

Add fusion_lstm INT8 PTQ (#35334) · 7ef04da6

由 joanna.wozna.intel 提交于 9月 06, 2021

* Add fusion_lstm INT8 PTQ

* Correct mkldnn_cache_capacity and enable fc_lstm_fuse_pass only for this test

* Change mkldnn_cache_capacity

7ef04da6

03 9月, 2021 1 次提交

modify gc logic, use new device_event (#35208) · 80c0cc97

由 wanghuancoder 提交于 9月 03, 2021

* modify gc logic, use new device_event, test=develop

* use GenerateDeviceEventFlag, test=develop

* refine, test=develop

* fix test_standalone_executor.py, test=develop

* refine, test=develop

80c0cc97

01 9月, 2021 4 次提交
- T
  [HeterPs] merge dense && data norm && g2sum (#35029) · a647b80a
  由 Thunderbrook 提交于 9月 01, 2021
```
* merge dense

* log level

* tensor copy sync

* format
```
  a647b80a
- S
  [HybridParallel]Support finetinue model for PipelineParallel (#35287) · 264ff9ef
  由 ShenLiang 提交于 9月 01, 2021
```
* add cache for send_recv

* add eval_batch for pipeline

* add eval batch for pipelineparallel

* add style code
```
  264ff9ef
- W
  modify fetch logic, use D2H Stream (#35191) · c56d6978
  由 wanghuancoder 提交于 9月 01, 2021
```
* modify fetch logic, use D2H Stream, test=develop

* refine, test=develop
```
  c56d6978
- Q
  support KL label smooth (#35177) · 7ca28bb6
  由 QingshuChen 提交于 9月 01, 2021
```
* support KL label smooth

* update UT for KL label_smooth
```
  7ca28bb6

PaddlePaddle / Paddle 1 年多 前同步成功

PaddlePaddle / Paddle
1 年多前同步成功