Commits · e7842ba6670824efa5484b7ccfe9b364949a6fb7 · BaiXuePrincess / Paddle

28 Oct, 2021 1 commit

save/load in ps runtime(the_one_ps) (#36097) · e7842ba6

Authored by wangguanqun on Oct 28, 2021

* add trainer desc config to distributed strategy

* code style modified

* data_feed set lod

* fix bug

* code style

* fix bug

* save load

* save load

* save unittest

* add unittest of the_one_ps

* unittest

* add todo in communicator sendsparse

27 Oct, 2021 1 commit

add fp16 unittests for kl2 (#36583) · 6838a187

Authored by taixiurong on Oct 27, 2021

25 Oct, 2021 2 commits

Add bincount op (#36317) · 39f19127

Authored by smallv0221 on Oct 25, 2021

* Add bincount op

* upload cpu version

* fix unittest

* fix unittest

* fix unittest

* fix en doc

* add more test

* fix en doc

* add more test case

* fix test

* fix input validation

* fix input check

* fix unittest

* fix test

* fix en doc

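add op: fused_feedforward(forward) (#35843) · b18cbfb2

Authored by zhangkaihuo on Oct 25, 2021

This PR contains only the forward code of fused_feedforward.

Related kernel implementations: fused_dropout_act_bias, fused_residual_dropout_bias, fused_layernorm_residual_dropout_bias

fused_feedforward is a fused operator. It fuses and wraps the operators in the feed forward layer of the transformer model so that the front end exposes a single interface. The fusion saves some memory accesses and kernel launch time, which improves performance.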
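For orientation, here is a minimal unfused reference of what a transformer feed forward layer computes. It is a sketch under assumptions the PR text does not state: a post-layer-norm layout, ReLU as the activation, dropout disabled, and layer norm without learned scale or bias. All function names are hypothetical and are not Paddle APIs. A fused operator would produce the same result with fewer kernel launches and memory round trips.

```python
import math


def linear(x, w, b):
    """x: [n][d_in], w: [d_in][d_out], b: [d_out] -> [n][d_out]."""
    return [
        [sum(xi * w[i][j] for i, xi in enumerate(row)) + b[j]
         for j in range(len(b))]
        for row in x
    ]


def relu(x):
    return [[max(0.0, v) for v in row] for row in x]


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no scale/bias)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def feed_forward_reference(x, w1, b1, w2, b2):
    """Unfused feed forward: layer_norm(x + linear2(relu(linear1(x)))).

    Dropout is omitted (treated as rate 0) so the result is deterministic.
    """
    hidden = relu(linear(x, w1, b1))
    out = linear(hidden, w2, b2)
    summed = [[o + r for o, r in zip(row_o, row_x)]
              for row_o, row_x in zip(out, x)]
    return layer_norm(summed)
```

Each helper above corresponds to a separate pass over the data in an unfused implementation, which is the overhead that fusion targets.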

22 Oct, 2021 2 commits

Fused attention op forward (#35905) · d4906214

Authored by Li Min on Oct 22, 2021

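Purpose: this PR aims to improve the computational performance of the attention module.
To reduce the framework's op scheduling overhead, this PR implements the attention module by hand in the C++ layer and exposes it as a single large attention op.
To reduce memory access overhead, this PR applies two optimizations:
(1) When computing q, k, and v, the shared input X is used to reduce the gemm, transpose, and bias add at that point from three calls to one.
(2) Kernel fusion is used to pass data through registers between different CUDA kernels.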
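Optimization (1) can be illustrated with a small sketch. Because q, k, and v are all projections of the same input X, the three weight matrices can be concatenated column-wise so that one matrix multiply replaces three. The code below is plain Python with hypothetical function names, assumes all three projections have the same output width, and leaves out the bias add and transpose for brevity. It is not the PR's C++ implementation.

```python
def matmul(a, b):
    """Multiply a [n][k] matrix by a [k][m] matrix."""
    return [
        [sum(a[i][p] * b[p][j] for p in range(len(b)))
         for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def qkv_separate(x, wq, wk, wv):
    """Three independent projections: three gemm calls."""
    return matmul(x, wq), matmul(x, wk), matmul(x, wv)


def qkv_fused(x, wq, wk, wv):
    """One gemm over column-concatenated weights, then split the result."""
    d = len(wq[0])  # assumed identical output width for q, k, and v
    w_qkv = [rq + rk + rv for rq, rk, rv in zip(wq, wk, wv)]
    out = matmul(x, w_qkv)
    q = [row[:d] for row in out]
    k = [row[d:2 * d] for row in out]
    v = [row[2 * d:] for row in out]
    return q, k, v
```

Both functions return the same values. The fused variant reads X once instead of three times.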

[hapi] support dygraph amp O2 (#36441) · 08248db0

Authored by Leo Chen on Oct 22, 2021

* [hapi] support dygraph amp O2

* fix problem of static pure fp16 in hapi

* fix bug

* fix format

* fix ut

* follow comments

* update ut

* update amp save/load

* fix ut

* refine code format

20 Oct, 2021 1 commit

Add FasterTokenizer Operator (#34491) · 3f2d6a3f

Authored by Steffy-zxf on Oct 20, 2021

Add tokenizer-related functionality for Transformer models so that text processing is consistent between training and prediction.

* support the text string as an input Tensor
* support the "VOCAB" unordered_map<wstring, int> as an input Tensor to look up tokens
* Tokenizer used for BERT. This tokenizer applies end-to-end tokenization, from text string to wordpiece.
* It first applies basic tokenization, followed by wordpiece tokenization.
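The two-stage flow described above (basic tokenization, then wordpiece) can be sketched in a few lines of Python. This is an illustrative sketch, not the operator's C++ code. It assumes the common BERT conventions of lowercasing, a `##` prefix for continuation pieces, and an `[UNK]` token for words that cannot be split. The function names and the vocabulary used to call them are hypothetical.

```python
import string


def basic_tokenize(text):
    """Lowercase, split on whitespace, and isolate ASCII punctuation."""
    tokens, word = [], ""
    for ch in text.lower():
        if ch.isspace() or ch in string.punctuation:
            if word:
                tokens.append(word)
                word = ""
            if not ch.isspace():
                tokens.append(ch)
        else:
            word += ch
    if word:
        tokens.append(word)
    return tokens


def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no prefix matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab):
    """Map a text string to token ids through both stages."""
    return [
        vocab[piece]
        for word in basic_tokenize(text)
        for piece in wordpiece_tokenize(word, vocab)
    ]
```

The vocabulary passed to `tokenize` must contain the `[UNK]` entry, since unknown words are mapped to its id.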

19 Oct, 2021 3 commits

[heterps]edit shrink and unseenday logit for pslib (#36194) · 9e494472

Authored by danleifeng on Oct 19, 2021

Inference add type check in copy_from_cpu (#36429) · be6a8330

Authored by Wilber on Oct 19, 2021
```
* update

* fix ut error

* update ut
```

[hybrid] static model parallel dropout support deterministic RandomSeedGenerator (#36228) · 8cc8e411

Authored by WangXi on Oct 19, 2021

18 Oct, 2021 1 commit

Add operators for async read & async write (#36333) · 3845afff

Authored by Siming Dai on Oct 18, 2021

* fix async_read bug

* change index place to cpu

* add tensor size judge

* add async_read & async_write test

* fix bug in async_write

* fix mac py3 ci

* fix bug for cpu version paddle

* fix windows ci bug

* change input argument error type

* change const_cast to mutable_data

* add async_write out-of-bound check and improve the error hint

* fix a small bug for dst_tensor

* add docs and refine codes

* refine docs

* notest,test=windows_ci

* fix windows ci

* fix require

* fix code-block

* add core.is_compiled_with_cuda()

13 Oct, 2021 2 commits

[Amp] refine code of amp level (#36362) · 59e425cd

Authored by Leo Chen on Oct 13, 2021
```
* refine amp level

* fix typo

* update tracer._amp_level
```

Remove RunFromCinn in PE because We Will Call CinnRunner in Compute of SubgraphOp (#36385) · e051bba0

Authored by Huihuang Zheng on Oct 13, 2021
```
Remove RunFromCinn method in PE because We Will Call CinnRunner in Compute method of SubgraphOp
```

11 Oct, 2021 1 commit

Add use_cinn Flag and RunFromCinn in PE (#36107) · 5690666c

Authored by Huihuang Zheng on Oct 11, 2021

Add use_cinn flag and use it to control whether we run PaddlePaddle using CINN.

Also add:

* Replace PaddlePaddle graph with a CINN graph in a pass
* PE Method to feed data and run the graph by CINN

08 Oct, 2021 2 commits

Support CUDA Graph on ParallelExecutor (#36250) · f9591bb1

Authored by Zeng Jinle on Oct 08, 2021

* support CUDA Graph on PE

* add ut, fix CI compile

* reduce memory consumption

* fix CUDA 10 CI

* improve coverage

* improve python coverage

add python interface of sub_graph (#36120) · a29ff4c7

Authored by huangxu96 on Oct 08, 2021
```
Add python interface of subgraph: 1. all_sub_graphs() 2. get_sub_graph(idx)
```

29 Sep, 2021 2 commits

Add basic support for CUDA Graph (#36190) · 21b93c3d

Authored by Zeng Jinle on Sep 29, 2021

* add basic support for CUDA Graph

* fix ci compile error

* fix LOG print, fix windows CI

* follow comments and update

* small fix for default ctor

* fix rocm compile error

* fix CPU compile error

add slot record dataset (#36200) · 79bd5f90

Authored by yaoxuefeng on Sep 29, 2021

28 Sep, 2021 2 commits

Add paddle.device.cuda.get_device_properties (#35661) · 4cbed9e5

Authored by Yanxing Shi on Sep 28, 2021

* Initial Commit

* add unittest and add error information

* modify doc

* fix some error

* fix some word

* fix bug cudaDeviceProp* and modify error explanation

* fix cudaDeviceProp* error and unittest samples

* fix hip error and PADDLE_WITH_HIP

* update style

* fix error is_compiled_with_cuda

* fix paddle.device.cuda.get_device_properties

* fix error for multi thread safe

* update style

* merge conflict

* modify after mentor review

* update style

* delete word

* fix unittest error for windows

* support string input and modify some code

* modify doc to support string input

* fix error for express information

* fix error for express information

* fix unittest for windows

* fix device.startswith('gpu:')

* format error and doc

* fix after review

* format code

* fix error for doc compile

* fix error for doc compile

* fix error for doc compile

* fix error for doc compile

* fix error for doc compile

* fix py2 error

* fix wrong words and doc

* fix _gpuDeviceProperties

dlpack fix (#35817) · 74ff59cf

Authored by Siming Dai on Sep 28, 2021

26 Sep, 2021 3 commits

[new api] add func/class API psroi_pool and UT (#35352) · e45d64ec

Authored by JYChen on Sep 26, 2021

* add func/class API psroi_pool and UT

* add UT in static mode

* Remove redundant type checks in static mode

* More detailed description for test_psroi_pool_op

* fix code format of UT

* fix en-doc

set file_num in one shard (#35835) · 991dc67d

Authored by Thunderbrook on Sep 26, 2021
```
* set file_num in one shard

* format
```

modify adam to adamw in AdamW (#36028) · 49c8253f

Authored by zhangbo9674 on Sep 26, 2021

* adam to adamw in AdamW

* add lr_ratio in adamw

* refine logic bug in cpu adamw

* delete fix bug for cpu adamw

* delete fix bug for cpu adamw

22 Sep, 2021 2 commits

Fix copy elision warning (#35885) · 47d6bc86

Authored by Tomasz Socha on Sep 22, 2021
```
* Fix copy elision warning

* Remove redundant code
```

[Inference] Support NNAdapter and ascend310 (#35226) · 10e53044

Authored by JingZhuangzhuang on Sep 22, 2021

18 Sep, 2021 3 commits

Basic PR on Cost Model (#35774) · 5ba9fe6e

Authored by Huihuang Zheng on Sep 18, 2021
```
Add a basic Cost Model. It uses the executor to run a program and profiles it to get op time.

This is an early, basic version; more functions will be added in the future.
```

split cuda_profiler into .h and .cc (#35821) · 01063218

Authored by Aurelius84 on Sep 18, 2021
```
* split cuda_profiler into .h and .cc

* fix cmake

* remove inline
```

Clean ParseMemInfo and Fix unittest failed under multi-thread (#35840) · 2fff5a58

Authored by Aurelius84 on Sep 18, 2021
```
* Clean ParseMemInfo and fix unittest with multi-thread

* fix declare
```

17 Sep, 2021 5 commits

[AMP] Support pure fp16 training mode for dygraph (#35521) · adaeee4d

Authored by zhangbo9674 on Sep 17, 2021

* add pure fp16 major function in auto_cast & tracer

* support master weight in dygraph for pure fp16

* check mix dtype of fp16&fp32 for check_finite_and_unscale op

* change pure fp16 function name

* refine some bug in auto_cast

* refine auto_cast interface logic

* add param _casted_by_pure_fp16 for class Layer

* support state_dict hook for saving the model with a user-specified dtype in pure_fp16_decorator

* refine pure_fp16_decorator as decorator

* add unittest

* add comment

* add comment

* support recompute

* add comment for auto_cast and decorator

* support to_static_state_dict for paddle.jit.save

* remove the limit on the number of models and optimizers

* add lookup_table in black_list

* fix momentum and layer state_dict

* fix bug in layer state_dict

* fix bug in layer state_dict_helper

* refine unittest

* refine test_momentum_op

* refine interface and some code

* refine amp_decorator interface

* refine pure fp16 interface

* refine master weight interface

change to PADDLE_DEFINE_EXPORTED (#35841) · d22914fd

Authored by Zeng Jinle on Sep 17, 2021

Make flag adding easier (#35823) · 2c781455

Authored by Zeng Jinle on Sep 17, 2021

* make flag setter easier

* update

* rename macro name

* fix bug of public/writable

* update to pass CI

* polish

* fix CPU link error

expose cuda stream to users (#35813) · 40cfa512

Authored by Leo Chen on Sep 17, 2021
```
* expose cuda stream to users

* add ut
```

GeneratePass for Python Pass (#35708) · f6db9806

Authored by wuhuanzhou on Sep 17, 2021

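#### Background

#35602 provides a way to develop subgraph-replacement passes on the Python side:

- Use the Paddle Python API or helper types to define subgraph programs used to match and replace parts of the graph;
- When a pass is registered on the Python side, the registration function is ultimately converted into the PassDesc data form defined in protobuf, which the C++ side parses to complete registration of the pass instance.

This PR generates pass instances by parsing the rule descriptions in PassDesc.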

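#### Design

##### Pass rule verification

In earlier pass development, changes to operators over time could cause matches to fail or to match incorrectly. This can be handled by scanning the parameter settings and parameter types that an operator supports, either to decide whether the pass should be used or to warn that the pass code needs to be modified.

Pass development currently provides the operator compatibility check OpCompatSensiblePass to address this problem. It still has a shortcoming: because earlier passes could only obtain pattern information at run time, the check can only be made while the pass is executing.

A pass expressed as a PassDesc can verify the above before the pass executes. This is done in VerifyDesc.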

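##### Building the pattern from the matching subgraph

GeneratePass uses GraphPatternDetector for graph matching and replacement. Building the matching pattern amounts to adding PDNodes and edge relations to the PDPattern member of the corresponding object. This is done in the function `InitGeneratePattern`. The function is not a member method of GeneratePass, mainly because a new Detector may be developed later, and the operations of GeneratePass and the Detector are not coupled.

The pattern is initialized mainly by traversing all operators of the matching subgraph program:

1. Add the PDNode for the current operator along with its constraints (operator type, attribute constraints, and so on);
2. Traverse the inputs of the current operator and try to get each PDNode from the pattern:
   - A PDNode is found in the pattern and is an output node: it is an intermediate node of the matching subgraph, so mark the PDNode as an intermediate node;
   - No PDNode is found in the pattern: add a PDNode for the input and mark it as an input node;
   - Set the edge from the input to the operator;
3. Traverse the outputs of the current operator:
   - A PDNode is found in the pattern and is an input node: it is an intermediate node of the matching subgraph, so mark the PDNode as an intermediate node;
   - No PDNode is found in the pattern: add a PDNode for the output and mark it as an output node;
   - Set the edge from the operator to the output;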
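The traversal above can be sketched in Python on a toy program representation. This is an illustration only, not the C++ `InitGeneratePattern` code: the dictionary-based op format, the role strings, and the function name `init_pattern` are all hypothetical, and operator attribute constraints are left out.

```python
def init_pattern(ops):
    """Build a toy pattern from a list of ops.

    Each op is a dict with "type", "inputs", and "outputs" (variable names).
    Returns variable roles, operator nodes, and directed edges.
    """
    var_roles = {}  # variable name -> "input" | "output" | "intermediate"
    op_nodes = []   # (op index, op type)
    edges = []      # (source, destination)
    for idx, op in enumerate(ops):
        # Step 1: add the node for the current operator.
        op_node = (idx, op["type"])
        op_nodes.append(op_node)
        # Step 2: inputs. A variable already known as an output of an
        # earlier op is internal to the subgraph.
        for name in op["inputs"]:
            if name not in var_roles:
                var_roles[name] = "input"
            elif var_roles[name] == "output":
                var_roles[name] = "intermediate"
            edges.append((name, op_node))
        # Step 3: outputs. A variable already known as an input of an
        # earlier op is internal to the subgraph.
        for name in op["outputs"]:
            if name not in var_roles:
                var_roles[name] = "output"
            elif var_roles[name] == "input":
                var_roles[name] = "intermediate"
            edges.append((op_node, name))
    return var_roles, op_nodes, edges
```

For a two-op program in which the first op's output feeds the second op, the shared variable ends up marked as intermediate, while the remaining variables keep their input or output roles.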

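##### Rewriting the graph from the replacement subgraph

The replacement is performed in the function `GetGenerateRewrite`, which, like `InitGeneratePattern`, is not a member method of GeneratePass.

The replacement subgraph is generated as follows:

1. Check for a redundant replacement subgraph;
2. Traverse all operators of the replacement subgraph program and add the replacement subgraph Nodes:
   1. Add the Node for the current operator and set its attributes;
   2. Traverse the inputs of the current operator and add intermediate variable nodes;
   3. Traverse the outputs of the current operator and add intermediate variable nodes;
   4. Add the edges between the input/output nodes and the operator node;
3. Delete the Nodes in the matched graph that are intermediate nodes;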

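##### Verifying the optimized subgraph

Whether the replacement subgraph, or the computation graph after replacement, can run correctly can be verified when the pass executes, which prevents exceptions when the computation graph is executed later.

Pass execution currently modifies the computation graph directly, so a failed verification cannot be rolled back cleanly. For now, subgraph verification defaults to success, and this is left for later improvement.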

16 Sep, 2021 2 commits

[Dy2stat]fix no_grad context error in dy2stat (#35725) · 3e897489

Authored by 0x45f on Sep 16, 2021
```
* fix no_grad context error in dy2stat

* remove useless comments

* fix error by drop_kids in python

* add test and fix review
```

add run interface for standalone executor, test=develop (#35761) · 29ef7cc9

Authored by wanghuancoder on Sep 15, 2021

15 Sep, 2021 4 commits

clip op extra information when export model. (#35447) · 4d236354

Authored by 王明冬 on Sep 15, 2021
```
* clip op extra information when export model,test=ocr

* rename clip_extra parameter to kwargs in save_inference_model, test=ocr
```

Change the invoking method of setitem from numpy to set_value op when value isn't tensor (#35701) · 86d4af39

Authored by zyfncg on Sep 15, 2021
```
* Change the invoking method of setitem from numpy to set_value op when value is not tensor

* fix the check logic for inplace in setitem

* fix the unittest problem caused by setitem doesn't support fp16

* modify some code format in setitem
```

add set-xpu-device-id function for inference config. (#35572) · a74d7fb6

Authored by houj04 on Sep 15, 2021

Add paddle.cuda.device.stream_guard API (#35623) · 3218075d

Authored by Siming Dai on Sep 15, 2021
```
Add paddle.cuda.device.stream_guard API 
```

14 Sep, 2021 1 commit

Add api paddle.device.cuda.empty_cache to release idle gpu memory held by allocator. (#35427) · 83932715

Authored by chenenquan on Sep 14, 2021

* Add empty_cache api to release idle gpu memory held by allocator,test=develop

* Add empty_cache api to release idle gpu memory held by allocator,test=develop

* Add empty_cache api to release idle gpu memory held by allocator,test=develop

* Fix test coverage problem for empty_cache

* delete redundant check for empty_cache

* fix the problem of empty_cache's doc

* delete the nvidia-smi comment in doc of empty_cache, test=document_fix


BaiXuePrincess / Paddle is in sync with the fork source project