Commits · 53480c9c3f986265629d804a9dfaf5feca2abe1f · BaiXuePrincess / Paddle

26 Oct, 2021 2 commits

[cherry-pick-2.2] Fused attention op forward (#35905) (#36708) · d2be870a

Authored by Li Min on Oct 26, 2021

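Purpose: this PR aims to improve the computational performance of the attention module.
To reduce the framework's per-op scheduling overhead, this PR implements the attention module by hand in C++ and exposes it as a single large attention op.
To reduce memory-access overhead, this PR applies two optimizations:
(1) when computing q, k, and v, the shared input X is used to cut the gemm, transpose, and bias add at that point from three calls to one;
(2) kernel fusion is used so that data passes between different CUDA kernels through registers.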
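
Optimization (1) can be illustrated with a small sketch. This is not the PR's C++/CUDA code: it is a plain-Python toy, with hypothetical helper names (`matmul`, `add_bias`, `qkv_separate`, `qkv_fused`), that only shows why concatenating the three projection weights lets one GEMM and one bias add over the shared input X produce q, k, and v. The transpose step and all GPU details are left out.

```python
def matmul(a, b):
    """Plain-Python matrix product of two row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_bias(m, bias):
    """Add a bias vector to every row of m."""
    return [[v + b for v, b in zip(row, bias)] for row in m]


def qkv_separate(x, w_q, w_k, w_v, b_q, b_k, b_v):
    """Baseline: three GEMMs and three bias adds, each reading x again."""
    return (
        add_bias(matmul(x, w_q), b_q),
        add_bias(matmul(x, w_k), b_k),
        add_bias(matmul(x, w_v), b_v),
    )


def qkv_fused(x, w_q, w_k, w_v, b_q, b_k, b_v):
    """Fused: concatenate weights column-wise, then one GEMM and one bias add."""
    w_qkv = [rq + rk + rv for rq, rk, rv in zip(w_q, w_k, w_v)]
    b_qkv = b_q + b_k + b_v
    out = add_bias(matmul(x, w_qkv), b_qkv)
    d = len(b_q)  # assumes q, k and v share the same output width
    q = [row[:d] for row in out]
    k = [row[d:2 * d] for row in out]
    v = [row[2 * d:] for row in out]
    return q, k, v
```

Both functions return the same values; the fused variant reads X once instead of three times, which is the source of the memory-access saving described above.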

d2be870a


add slot record dataset (#36200) (#36710) · 3fbb6644
Authored by yaoxuefeng on Oct 26, 2021

3fbb6644

25 Oct, 2021 1 commit

[cherry-pick 2.2] static model parallel dropout support deterministic RandomSeedGenerator (#36682) · 59615fff

Authored by WangXi on Oct 25, 2021

* Revert "Add fused_dropout wrapper to ease use. (#36185) (#36640)"

This reverts commit 05d7e2fd.

* [hybrid] seed and dropout op support force-cpu (#35820)

* [HIP] fix op not support AMD GPU bug, the flag PADDLE_WITH_ROCM is invalid

* [HIP] fix op not support AMD GPU bug, the flag PADDLE_WITH_ROCM is invalid

* [HIP] fix op not support AMD GPU bug

* [hybrid] seed and dropout op support force-cpu

* [hybrid] seed and dropout op support force-cpu

* [hybrid] seed and dropout op support force-cpu

* [hybrid] seed and dropout op support force-cpu

* [hybrid] seed and dropout op support force-cpu

* [hybrid] fix seed ci failed issue

* add AsExtra for force_cpu of seed op

* Add fused_dropout wrapper to ease use. (#36185)

* [hybrid] static model parallel dropout support deterministic RandomSeedGenerator (#36228)

Co-authored-by: xiayanming <41795079@qq.com>
Co-authored-by: Li Min <11663212+limin2021@users.noreply.github.com>

59615fff

20 Oct, 2021 1 commit
- W
  
  [cherry-pick] Inference add type check in copy_from_cpu (#36552) · b5404f09
  Authored by Wilber on Oct 20, 2021
  
  b5404f09
19 Oct, 2021 1 commit

Add operators for async read & async write (#36333) (#36501) · d65f8af8

Authored by Siming Dai on Oct 19, 2021

* fix async_read bug

* change index place to cpu

* add tensor size judge

* add async_read & async_write test

* fix bug in async_write

* fix mac py3 ci

* fix bug for cpu version paddle

* fix windows ci bug

* change input argument error type

* change const_cast to mutable_data

* add async_write out-of-bound check and consumate error hint

* fix a small bug for dst_tensor

* add docs and refine codes

* refine docs

* notest,test=windows_ci

* fix windows ci

* fix require

* fix code-block

* add core.is_compiled_with_cuda()

d65f8af8

11 Oct, 2021 1 commit
- S
  
  dlpack fix (#35817) (#36177) · 31a5829a
  Authored by Siming Dai on Oct 11, 2021
  
  31a5829a
27 Sep, 2021 3 commits
- Y
  Add paddle.device.cuda.get_device_properties (#35875) · cea0bc26
  Authored by Yanxing Shi on Sep 27, 2021
```
* Initial Commit

* fix py2 error

* fix wrong words and doc

* test=document_fix

* fix _gpuDeviceProperties
```
  cea0bc26
- J
  [Cherry-pick] Add new func/class API psroi_pool and UT (#36111) · 81557da6
  Authored by JYChen on Sep 27, 2021
```
cherry-pick from #35352

Add new detection api paddle.vision.ops.psroi_pool and paddle.vision.ops.PSRoIPool
```
  81557da6
- Z
  [cherry pick] Modify adam to adamw in Optimizer AdamW (#36028) (#36103) · 2de7a7f5
  Authored by zhangbo9674 on Sep 27, 2021
```
PR #35521 changed the op used by the AdamW optimizer from adamw to adam, which was inappropriate. This change switches AdamW back from adam to adamw.
```
  2de7a7f5
24 Sep, 2021 1 commit

Basic PR on Cost Model (#35774) (#35915) · efcd108d

Authored by Huihuang Zheng on Sep 24, 2021

Add a basic Cost Model. It uses the executor to run a program and profiles it to get per-op time.

This is an early, basic version; more functions will be added in the future.

efcd108d

22 Sep, 2021 1 commit
- W
  
  [cherry-pick] [Inference] Support NNAdapter and ascend310 (#35882) · 2aaa417e
  Authored by Wilber on Sep 22, 2021
  
  2aaa417e
18 Sep, 2021 2 commits
- A
  split cuda_profiler into .h and .cc (#35821) · 01063218
  Authored by Aurelius84 on Sep 18, 2021
```
* split cuda_profiler into .h and .cc

* fix cmake

* remove inline
```
  01063218
- A
  Clean ParseMemInfo and Fix unittest failed under multi-thread (#35840) · 2fff5a58
  Authored by Aurelius84 on Sep 18, 2021
```
* Clean ParseMemInfo and fix unittest with multi-thread

* fix declare
```
  2fff5a58
17 Sep, 2021 5 commits

[AMP] Support pure fp16 training mode for dygraph (#35521) · adaeee4d

Authored by zhangbo9674 on Sep 17, 2021

* add pure fp16 major function in auto_cast & tracer

* support master weight in dygraph for pure fp16

* check mix dtype of fp16&fp32 for check_finite_and_unscale op

* change pure fp16 function name

* refine some bug in auto_cast

* refine auto_cast interface logic

* add param _casted_by_pure_fp16 for class Layer

* support state_dict hook for save model by user appointed dtype in pure_fp16_decorator

* refine pure_fp16_decorator as decorator

* add unittest

* add comment

* add comment

* support recompute

* add comment for auto_cast and decorator

* support to_static_state_dict for paddle.jit.save

* unlimite models num and optimizers num

* add lookup_table in black_list

* fix momentum and layer state_dict

* fix bug in layer state_dict

* fix bug in layer state_dict_helper

* refine unittest

* refine test_momentun_op

* refine interface and some code

* refine amp_decorator interface

* refine pure fp16 interface

* refine master weight interface

adaeee4d


change to PADDLE_DEFINE_EXPORTED (#35841) · d22914fd
Authored by Zeng Jinle on Sep 17, 2021

d22914fd

Make flag adding easier (#35823) · 2c781455

Authored by Zeng Jinle on Sep 17, 2021

* make flag setter easier

* update

* rename macro name

* fix bug of public/writable

* update to pass CI

* polish

* fix CPU link error

2c781455

expose cuda stream to users (#35813) · 40cfa512
Authored by Leo Chen on Sep 17, 2021
```
* expose cuda stream to users

* add ut
```
40cfa512

GeneratePass for Python Pass (#35708) · f6db9806

Authored by wuhuanzhou on Sep 17, 2021

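#### Background

#35602 provides a way to develop subgraph-replacement passes from Python:

- use the Paddle Python API or helper types to define the subgraph programs used to match and replace parts of the graph;
- when a pass is registered from Python, the registration function is ultimately converted into a PassDesc, a protobuf-defined data form, which the C++ side parses to register the pass instance.

This PR generates the pass instance by parsing the rules described in the PassDesc.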

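#### Design

##### Pass rule verification

In earlier pass development, operator iteration could cause a pass to stop matching or to match incorrectly. This can be detected by scanning the parameters an operator supports and their types, in order to decide whether the pass should be used or to report that the pass code needs to change.

Pass development currently provides the operator-compatibility class OpCompatSensiblePass to address this problem. It still has a shortcoming: because earlier passes only obtain their pattern information at run time, the check can only happen while the pass executes.

A pass expressed as a PassDesc can be checked for these problems before it executes; this is done in VerifyDesc.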

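##### Constructing the pattern from the matching subgraph

GeneratePass uses GraphPatternDetector for graph matching and replacement. Constructing the matching pattern amounts to adding PDNodes and edge relations to the detector's PDPattern member. This is done in the function `InitGeneratePattern`, which is not a member method of GeneratePass, mainly because a new detector may be developed later; the operations of GeneratePass and the detector are not coupled.

The pattern is initialized mainly by iterating over all operators of the matching subgraph program:

1. Add the PDNode for the current operator together with its constraints (operator type, attribute constraints, and so on);
2. Iterate over the current operator's inputs and try to get each PDNode from the pattern:
   - the PDNode is found in the pattern and is an output node: it belongs to the interior of the matching subgraph, so mark this PDNode as an intermediate node;
   - the PDNode is not found in the pattern: add a PDNode for this input and mark it as an input node;
   - add the edge from the input to the operator;
3. Iterate over the current operator's outputs:
   - the PDNode is found in the pattern and is an input node: it belongs to the interior of the matching subgraph, so mark this PDNode as an intermediate node;
   - the PDNode is not found in the pattern: add a PDNode for this output and mark it as an output node;
   - add the edge from the operator to the output;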
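
The three steps above can be sketched in plain Python. This is a toy model, not the C++ `InitGeneratePattern`: the function name `build_pattern`, the `(op_type, inputs, outputs)` tuple layout, and the role strings are all hypothetical, and operator constraints are reduced to the operator type.

```python
def build_pattern(ops):
    """Build a toy pattern from a list of (op_type, inputs, outputs) tuples."""
    roles = {}     # variable name -> "input" | "output" | "intermediate"
    op_nodes = []  # one node per operator, standing in for an operator PDNode
    edges = []     # (source, destination) pairs
    for idx, (op_type, inputs, outputs) in enumerate(ops):
        # Step 1: add the node for the current operator.
        op_node = f"{op_type}_{idx}"
        op_nodes.append(op_node)
        # Step 2: walk the inputs.
        for name in inputs:
            if name not in roles:
                roles[name] = "input"          # first seen as an input
            elif roles[name] == "output":
                roles[name] = "intermediate"   # produced earlier, consumed here
            edges.append((name, op_node))
        # Step 3: walk the outputs.
        for name in outputs:
            if name not in roles:
                roles[name] = "output"         # first seen as an output
            elif roles[name] == "input":
                roles[name] = "intermediate"   # consumed earlier, produced here
            edges.append((op_node, name))
    return {"ops": op_nodes, "roles": roles, "edges": edges}


# Hypothetical two-operator matching subgraph: matmul followed by elementwise_add.
pattern = build_pattern([
    ("matmul", ["x", "w"], ["tmp"]),
    ("elementwise_add", ["tmp", "b"], ["out"]),
])
print(pattern["roles"])
```

In this example `tmp` is first recorded as an output of `matmul` and then promoted to an intermediate node when `elementwise_add` consumes it, which mirrors the first bullet of step 2.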

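##### Rewriting the graph from the replacement subgraph

The replacement is carried out in the function `GetGenerateRewrite`, which, like `InitGeneratePattern`, is not a member method of GeneratePass.

The replacement operation is generated as follows:

1. Check for a redundant replacement subgraph;
2. Iterate over all operators of the replacement subgraph program and add the replacement subgraph's Nodes:
   1. add the Node for the current operator and set its attributes;
   2. iterate over the current operator's inputs and add intermediate variable nodes;
   3. iterate over the current operator's outputs and add intermediate variable nodes;
   4. add the edges between the input/output nodes and the operator node;
3. Delete the Nodes in the matched graph that are intermediate nodes;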
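
Steps 2 and 3 can be sketched on a toy graph. This is not the C++ `GetGenerateRewrite`: the function name `apply_rewrite`, the set-based graph representation, and the `new_` node prefix are hypothetical. Step 1 is omitted because the text does not define the redundancy criterion. The sketch also assumes that the matched operator nodes count as intermediate nodes to delete, and that replacement variable names do not collide with matched intermediate names.

```python
def apply_rewrite(nodes, edges, matched_ops, matched_roles, replace_ops):
    """Return (nodes, edges) after swapping a matched subgraph for a replacement.

    nodes: set of node names; edges: set of (source, destination) pairs.
    matched_ops: operator node names of the matched subgraph.
    matched_roles: variable name -> "input" | "output" | "intermediate".
    replace_ops: list of (op_type, inputs, outputs) tuples.
    """
    nodes, edges = set(nodes), set(edges)
    # Step 2: add the replacement operators, their variables, and the edges.
    for idx, (op_type, inputs, outputs) in enumerate(replace_ops):
        op_node = f"new_{op_type}_{idx}"
        nodes.add(op_node)
        for name in inputs:
            nodes.add(name)
            edges.add((name, op_node))
        for name in outputs:
            nodes.add(name)
            edges.add((op_node, name))
    # Step 3: delete the matched interior (operators and intermediate variables).
    removed = set(matched_ops)
    removed |= {n for n, role in matched_roles.items() if role == "intermediate"}
    nodes -= removed
    edges = {(s, d) for s, d in edges if s not in removed and d not in removed}
    return nodes, edges


# Hypothetical example: replace matmul + elementwise_add with a single fc operator.
nodes = {"x", "w", "b", "tmp", "out", "matmul_0", "elementwise_add_1"}
edges = {("x", "matmul_0"), ("w", "matmul_0"), ("matmul_0", "tmp"),
         ("tmp", "elementwise_add_1"), ("b", "elementwise_add_1"),
         ("elementwise_add_1", "out")}
roles = {"x": "input", "w": "input", "b": "input",
         "tmp": "intermediate", "out": "output"}
new_nodes, new_edges = apply_rewrite(
    nodes, edges, ["matmul_0", "elementwise_add_1"], roles,
    [("fc", ["x", "w", "b"], ["out"])],
)
```

The boundary variables `x`, `w`, `b`, and `out` survive the rewrite and are reconnected to the new operator, while `tmp` and the two matched operators disappear.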

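##### Verifying the optimized subgraph

Whether the replacement subgraph, or the computation graph after replacement, can run correctly can be verified when the pass executes, which prevents exceptions when the graph is executed later.

Pass execution currently modifies the computation graph directly, so the graph cannot be cleanly restored when verification fails. For now, subgraph verification is assumed to succeed by default; this is left for future improvement.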

f6db9806

16 Sep, 2021 2 commits
- 0
  [Dy2stat]fix no_grad context error in dy2stat (#35725) · 3e897489
  Authored by 0x45f on Sep 16, 2021
```
* fix no_grad context error in dy2stat

* remove useless comments

* fix error by drop_kids in python

* add test and fix review
```
  3e897489
- W
  
  add run interface for standalone executor, test=develop (#35761) · 29ef7cc9
  Authored by wanghuancoder on Sep 15, 2021
  
  29ef7cc9
15 Sep, 2021 4 commits
- 王
  clip op extra information when export model. (#35447) · 4d236354
  Authored by 王明冬 on Sep 15, 2021
```
* clip op extra information when export model,test=ocr

* rename clip_extra parameter to kwargs in save_inference_model, test=ocr
```
  4d236354
- Z
  Change the invoking method of setitem from numpy to set_value op when value isn't tensor (#35701) · 86d4af39
  Authored by zyfncg on Sep 15, 2021
```
* Change the invoking method of setitem from numpy to set_value op when value is not tensor

* fix the check logic for inplace in setitem

* fix the unittest problem caused by setitem doesn't support fp16

* modify some code format in setitem
```
  86d4af39
- H
  
  add set-xpu-device-id function for inference config. (#35572) · a74d7fb6
  Authored by houj04 on Sep 15, 2021
  
  a74d7fb6
- S
  Add paddle.cuda.device.stream_guard API (#35623) · 3218075d
  Authored by Siming Dai on Sep 15, 2021
```
Add paddle.cuda.device.stream_guard API 
```
  3218075d
14 Sep, 2021 2 commits

Add api paddle.device.cuda.empty_cache to release idle gpu memory held by allocator. (#35427) · 83932715

Authored by chenenquan on Sep 14, 2021

* Add empty_cache api to release idle gpu memory hold by allocator,test=develop

* Add empty_cache api to release idle gpu memory hold by allocator,test=develop

* Add empty_cache api to release idle gpu memory hold by allocator,test=develop

* Fix test coverage problem for empty_cache

* delete redundant check for empty_cache

* fix the problem of empty_cache's doc

* delete the nvidia-smi comment in doc of empty_cache, test=document_fix

83932715


[Inference] Add tuned trt_dynamic_shape mode. (#34806) · 7c96efed
Authored by Wilber on Sep 14, 2021

7c96efed

11 Sep, 2021 1 commit
- 王
  
  register the with_quant_attr attribute for all operators. test=develop (#35591) · 8412d6c0
  Authored by 王明冬 on Sep 11, 2021
  
  8412d6c0
10 Sep, 2021 1 commit
- L
  change metaclass of Layer from pybind11_builtins.pybind11_type to type (#35538) · 523f46fe
  Authored by Leo Chen on Sep 10, 2021
```
* change metaclass of Layer from pybind11_builtins.pybind11_type to type

* fix cast

* add ut
```
  523f46fe
09 Sep, 2021 1 commit

Add matrix_rank Op and its GPU and CPU kernel (#34823) · eb1fbf12

Authored by 0x45f on Sep 09, 2021

* init matrix_rank op, add matrix_rank CPU code and test

* add GPU kernel, remove svd_eigen.h

* add CPU kernel when tol is tensor

* add cpu and gpu code when tol is tensor

* fix CI-ROCM error

* add matrix_rank API describe, fix PR-CI-Py3 error

* fix PR-CI-Windows error, add matrix_rank API test

* delete useless comments

* fix review

* add my code in svd_helper.h

* update doc comments

* remove spaces

eb1fbf12

08 Sep, 2021 4 commits

Integrate GLOOParallelContext to support Multi-CPU Core for Dygraph DataParallel (#35154) · 51cc73f0

Authored by xiongkun on Sep 08, 2021

* can pass the fake test

* add files

* modify cmake to pass windows-ci

* for ci pass

* WITH_GLOO=ON

* for pass coverage test

* add cpuonly testcase

* add

* disable nccl when compile with cuda

* change python version in cpuonly

* add backend argument

* add required gpu

* add required:gpu

51cc73f0

Enable program passes on Fleet APIs (#34955) · 5f369881

Authored by Zeng Jinle on Sep 08, 2021

* add fleet api for program pass

* turn on apply pass for CI test

* fix disable fuse_all_optimizer bug

* try to test ci

* fix CI

* fill unspecified op role

* fix fuse_allreduce

* add ut to improve coverage

* remove useless change

* improve c++ coverage

* follow some comments

* test ir pass pipeline

* update doc

* reduce ut time again

5f369881

[NPU] release gil before op run (#35370) · db6242e9
Authored by Leo Chen on Sep 08, 2021
```
* release gil before op run

* support npu grad test

* fix op_test
```
db6242e9

[NPU] add get_float_status op and refine NPU check_nan_inf (#35274) · c727ec4a
Authored by WangXi on Sep 08, 2021

c727ec4a

07 Sep, 2021 1 commit
- Y
  
  support multi-node (#35396) · c6e0cedc
  Authored by yaoxuefeng on Sep 07, 2021
  
  c6e0cedc
06 Sep, 2021 1 commit
- W
  support numpy dtype and polish code of list index. (#35404) · 60c5adaa
  Authored by WeiXin on Sep 06, 2021
```
* support numpy dtype and polish code of list index.

* polish code.
```
  60c5adaa
04 Sep, 2021 1 commit
- W
  
  update inference trt ut framework (#35418) · e8772486
  Authored by Wilber on Sep 04, 2021
  
  e8772486
02 Sep, 2021 1 commit
- B
  
  [npu] add update_loss_scaling npu min value (#35270) · 280d7421
  Authored by Baibaifan on Sep 02, 2021
  
  280d7421
01 Sep, 2021 1 commit

Support setitem by Bool index (#35133) · d387820d

Authored by zyfncg on Sep 01, 2021

* Support getitem by Bool index

* delete some debug info of bool index

* support the case that the shape of bool index is different from indexed tensor

* support setitem by bool index

* add the unittest for throwing exception

* merge conflict

* add check for int tensor when index is bool

d387820d

31 Aug, 2021 2 commits

Support CostInfo and MemProfiler in InterpreterCore (#34981) · 572bad8a

Authored by Aurelius84 on Aug 31, 2021

* polish code

* fix unittest on windows

* refine pybind interface

* support statistic MemSize of AllocatorPool

* Replace mutex into atomic

572bad8a

Revert "Revert "Add copy from tensor (#34406)" (#35173)" (#35256) · 6116f9af
Authored by Shang Zhizhou on Aug 31, 2021
```
* Revert "Revert "Add copy from tensor (#34406)" (#35173)"

This reverts commit 32c1ec42.

* add template instantiation
```
6116f9af

BaiXuePrincess / Paddle is in sync with the fork source project
