提交 · e59524f86d472f0f36e09cc41c3ca882d3fc2841 · BaiXuePrincess / Paddle

11 1月, 2021 6 次提交

[cherry-pick]Elementwise add grad GPU kernel optimization (#30276) · e59524f8

由 wangchaochaohu 提交于 1月 11, 2021

* elementwise_add_grad Op optimization  (#29575)

* optimize for long width for elementwise (#29602)

* refine (#29622)

* delete the code for fp16 optimization because it is not faster than common template code (#29715)

* fix the shape choose of vectorize for cuda

* optimization for fp16 elementwise add (#29744)

* Fix the compiler error for half type (#29799)

* refine the compiler error for half2 operation (#29816)

* fix the compiler error when gcc4 cuda9.0 (#29997)

e59524f8

[Cherry-Pick] Support pure fp16 training for AMP API. (#29544) (#30241) · d8dfef54

由 Zhen Wang 提交于 1月 11, 2021

* Support pure fp16 training for AMP API. (#29544)

* add cast ops before and after unsupported fp16 ops.

* Keep partial net in FP32 pattern.

* Support check_finite_and_unscale and update_loss_scaling for FP16 calculation mode.

* Add fp16 support for adam op.

* add multi precision attr for adam.

* Fix the bug of test_multi_precision_fp16_train UT.

* Code format for CI.

* Fix the redefine error about MPTypeTrait on windows.

* fix bugs of the _create_accumulators func in Momentum.

* fix bug when inserting post cast op.

* Add the update_loss_scaling op in allow_set of UnusedVarCheck.

* Update for ci coverage.

* Add some doc for OptimizerWithMixedPrecision.

* Fix the code style.

* Imporve the doc of `amp_init`.

* Change for fp16 testing if users have the infer program defined in separate way.

* Remove tensor copy in the update_loss_scaling op. (#29426)

* remove tensor copy in the update_loss_scaling op

* not use thrust.

* fix some cuda memory access error.

d8dfef54

A
Skip convert tensor shape while using Paddle.shape (#30223) (#30239) · 55604248
由 Aurelius84 提交于 1月 11, 2021
```
* fix tensor shape bug

* fix op_num

* clean code
```
55604248

[cherry-pick 2.0] optimize gradient merge (#30185) · e283dc6f

由 WangXi 提交于 1月 11, 2021

* Optimization grad merge performance (#29784)

* [fleet] combine amp and gradient merge, test=develop (#30086)

* fix assign_op_xpu concat_op_xpu warining (#30120)
Co-authored-by: Nliuyuhui <liuyuhui@baidu.com>

e283dc6f

C
[Cherry-pick] remove distributed prepare context (#30219) (#30256) · 1fa98c5d
由 Chen Weihang 提交于 1月 10, 2021
```
att, cherry-pick of #30219
```
1fa98c5d
X
[cherry-pick] clean redundant API alias in 2.0 - part 2 (#30244) · 70cbde83
由 XiaoguangHu 提交于 1月 10, 2021
```
* fix dynamic to static error

* delete paddle.nn.functional.assign
```
70cbde83

10 1月, 2021 1 次提交
- W
  
  fix adamw apply gradient (#30130) (#30207) · c4cd99f3
  由 WangXi 提交于 1月 10, 2021
  
  c4cd99f3
08 1月, 2021 8 次提交

[cherry-pick] [Dy2Stat] Don't convert to paddle.shape if var_x.shape is not... · 2ba9bdd7

由 liym27 提交于 1月 08, 2021

[cherry-pick] [Dy2Stat] Don't convert to paddle.shape if var_x.shape is not negetive #29965 (#30235)

* [Cherry-Pick 2.0] [Dy2Stat] Don't convert to paddle.shape if var_x.shape is not negetive (#29965)

1. When x is Variable, call nn.shape(x) only in following cases:
 1）The shape of x is used in control flow condition.
 2）The dim to be used is negetive
2. When x is Variable, but x.shape or x.shape[idx] doesn't contain negetive value, don't convert to paddle.shape()

* [Cherry-Pick 2.0] [Dy2Stat] Use Paddle2.0 api paddle.tensor.array_* (#30156)

2ba9bdd7

[cherry-pick 2.0] Fix bug: In dynamic mode, if start or end is negetive,... · 5fe3da39

由 liym27 提交于 1月 08, 2021

[cherry-pick 2.0] Fix bug: In dynamic mode, if start or end is negetive, __getitem__  return wrong result(#30003) (#30146)

1. when slice_item is a slice:
 1) the start of __getitem__ should be std::max(start, 0) if slice
 2) the start of __getitem__ should be std::min(end, dim)
2. when slice_item is an integer, it should be in [-dim_len, dim_len)
3. Fix error message to use accurate data

5fe3da39

[Cherry-Pick 2.0][setitem] Support Tensor setitem in static mode (#29708) (#30104) · f46ddc0e

由 liym27 提交于 1月 08, 2021

1. Type of index: int, slice(step must be 1).

2. Type of value:
 (1) int32, int64, float32, bool;
 (2) numpy.array(int32, int64, float32, bool);<Note: float64 is not supported>
 (3) paddle.Tensor(int32, int64, float32, float64, bool);

f46ddc0e

Fix beam search bug (#29824) (#30140) · b2ca2cad

由 Jiaqi Liu 提交于 1月 08, 2021

* fix beam search bug

* add dygraph unittest

* update dynamic_decode argument doc

* add warning info for state which has no lengths attribute

b2ca2cad

[Cherry-pick] [Complex] Simplify prepared op impl to improve performance (#30153) (#30215) · 0e3a1d35

由 Chen Weihang 提交于 1月 08, 2021

* simplify prepared op impl to improve performance

* fix kunlun compile error

* continue fix kunlun compile error

* only transform diff place when dtype diff

* fix failed unittests

* remove useless file

* polish impl by review comment

0e3a1d35

【2.0API CherryPick】LookAhead, ModelAverage, IndexSelect (#30205) · 3ce4d34d

由 123malin 提交于 1月 08, 2021

* Add Lookahead and ModelAverage Optimizer (#30004)

* test=develop, add model_average and lookahead

* Improve Index select cuda kernel (#30139)

* test=develop, add index_select_cuda kernel

3ce4d34d

C
fix syncbn convert (#30158) (#30176) · 030d678c
由 ceci3 提交于 1月 08, 2021
```
* fix syncbn convet

* add unittest
```
030d678c

[Cherry-pick] Simplify the options of spawn based on fleetrun (#30144) (#30197) · 39204d56

由 Chen Weihang 提交于 1月 07, 2021

* Simplify the options of spawn based on fleetrun (#30144)

* Simplify the options of spawn based on fleetrun

* polish details

* polish doc details

* cleanup enum test=develop (#29294)
Co-authored-by: Ngongweibao <weibao.gong@gmail.com>

39204d56

07 1月, 2021 5 次提交

[cherry pick] paddle.save/load ,paddle.static.save/load 保存大文件的bug (#30170) · bfb6f613

由 WeiXin 提交于 1月 07, 2021

* Support storage of large parameters (#29988)

* Support storage of large parameters

* Reduce the complexity of the unittest

* Reduce the complexity of the unittest,commented out unittest for

* add unittest for static.save/load

* Increase the timeout threshold of 'test_static_save_load'

* Increase the timeout threshold of 'test_static_save_load'

* Increase the timeout threshold of 'test_static_save_load' and 'test_paddle_save_load'

* Increase the timeout threshold of 'test_static_save_load' and 'test_paddle_save_load'

* Extend the timeout for the (#30151)

bfb6f613

[cherry pick] Some optimizations of elementwise_add, gelu and dropout for AMP (#30152) · 07f68fad

由 Leo Chen 提交于 1月 07, 2021

* Improve performance of elementwise_add grad op (#29187)

* pass stop_gradient for cast op

* improve performance of elementwise_add grad

* use tensor copy async

* dygraph branch

* fix dygraph branch

* add ut

* make gelu fp16 computing more robust (#29484)

* Add fast path for dropout when p == 0  (#29553)

* add fast path for p == 0 in dropout

* add ut

07f68fad

[Cherry-pick] Layer norm fp16 and Nvidia optimize (#29169 #29434 #29522 #29576) (#30110) · 44b81e63

由 furnace 提交于 1月 07, 2021

* Layer norm fp16 (#29169)

* add fp16 for layer_norm op

* revert layernorm api

* fix forward

* fix forward

* fix backward for layernorm with fp16

* fix unit test for layernorm with fp16

* fix with_mkldnn compile error for layernorm with fp16

* 1. revert to PADDLE_ENFORCE_NOT_NULL, 2. change static_cast<float> to static_cast<U>

* fix with_mkldnn compile error for layernorm with fp16

* fix with_mkldnn compile error for layernorm with fp16
Co-authored-by: Nzhiqiu <chenqiuliang@baidu.com>

* fix layer_norm accuracy (#29434)

* Layernorm opt (#29522)

* layernorm fw opt

* layernorm bw opt

* fix typo, test=develop

* remove const dim3 for windows CI compatibility

* merge develop
Co-authored-by: Nzlsh80826 <zlsh80826@gmail.com>

* Fix compile problem when cuda_arch < 6000 (#29576)

* fix compile problem when cuda_arch < 6000

* refine code

* refine code
Co-authored-by: Nzhiqiu <chenqiuliang@baidu.com>
Co-authored-by: Nzlsh80826 <zlsh80826@gmail.com>

44b81e63

T
pre padding in dygraph (#30179) · a2b0357d
由 tangwei12 提交于 1月 07, 2021
```
Change-Id: Ia5279b0cbb6a5b3970aff66e9510e0d85efa70ce
```
a2b0357d

Cherry pick bn (#30136) · 157ff094

由 ceci3 提交于 1月 07, 2021

* fix bn docs (#30096)

* add attribute for batch_norm (#29950)

* add attribute for batch_norm

157ff094

06 1月, 2021 2 次提交

[Cherry-pick]cherry-pick to Release/2.0 (#30076) · 1ad7fcbf

由 huangxu96 提交于 1月 06, 2021

* add fp16 check into max and avg pool (#29479)

* Add ReserveSpace in dygraph batch_norm. (#29221)

* Add ReserveSpace in dygraph batch_norm.

* Add unittest for reservespace

* add float16 into adaptive_avg_pool2d check list. (#29547)

1ad7fcbf

L
[Cherry-pick 2.0] Migrate 4 APIs about array to paddle.tensor.* (#29565) (#30101) · 52caf787
由 liym27 提交于 1月 06, 2021
```
4 APIs: array_length, array_read, array_write, create_array，cherry-pick #29565
```
52caf787

05 1月, 2021 2 次提交

[Cherry-pick 2.0] cherry pick 3 PRs about Dynamic-to-Static (#30100) · faeee3c3

由 liym27 提交于 1月 05, 2021

* [cherry-pick 2.0] Fix unitest test_slice (#29740)

Before this commit, test_slice use old api `dygraph_to_static_func` to use Dynamic-t-Static and use Executor explicitly，which is not recommended to users.
After fixed, use recommended API `paddle.jit.to_static` to replace `dygraph_to_static_func`, which won't trigger the random exception on coverage CI.

* [cherry-pick 2.0][Dy2Stat] Support grammar: for ele in var[idx] (#29541)

Support to transformfor ele in var stms in which var is a slice of Tensor.

* [cherry-pick 2.0][Dy2Stat] Fix bug for loop: a variable is used and created in loop, but used before created (#29769)

faeee3c3

C

[cherry-pick] Add mkldnn interpolate op, support manual enable mkldnn interpolate op (#30083) · 9a6926f5
由 cc 提交于 1月 05, 2021

9a6926f5

04 1月, 2021 1 次提交
- Z
  [cherry pick 2.0]support deepcopy for Layer/Tensor/Paramerbase (#29387) (#29873) · c06350c9
  由 Zhou Wei 提交于 1月 04, 2021
```
* support deepcopy for Layer/Tensor/Paramerbase

* fix some code
```
  c06350c9
31 12月, 2020 3 次提交
- L
  add the paddle.distributed.split api (#29970) (#30041) · 84c2315a
  由 lilong12 提交于 12月 31, 2020
```
* add distributed.split, test=develop
```
  84c2315a
- L
  fix the bug in pipeline data parallelism (#29731) (#29918) · f0e04e1f
  由 lilong12 提交于 12月 31, 2020
```
* update, test=develop
```
  f0e04e1f
- L
  [Cherry-pick] Disable gloo by default #29559 #29805 (#29601) · 640f8cf0
  由 lilong12 提交于 12月 31, 2020
```
* update, test=develop (#29559)

* Disable gloo by default (#29805)

* update, test=develop

* update, test=develop
```
  640f8cf0
29 12月, 2020 4 次提交

[Kunlun] 2.0 cherry-pick:Support for Baidu Kunlun XPU multi card training (#29713) · 847aa172

由 liuyuhui 提交于 12月 29, 2020

* [Kunlun] PR1:Support one Kunlun card training in parallel executor (#29337)

* [Kunlun] PR2: Support MultiDevicePass and BKCL in parallel executor (#29574)

* [Kunlun] bug fix of PR2: Support MultiDevicePass and BKCL in parallel executor  (#29926)

* add bkcl.so in whl for kunlun (#29947)

* [Kunlun] bug fix of PR2: Support MultiDevicePass and BKCL in parallel executor  (#29961)
Co-authored-by: NQingshuChen <qingshu.chen714@gmail.com>

847aa172

[Cherry-pick] Complex network execute support (#29905) · 91ebc460

由 Chen Weihang 提交于 12月 29, 2020

* [Complex] Add support for complex grad accumulated (#29889)

* add support for complex grad accumulated

* add unittest for coverage

* update test dtype

* remove useless blank line

* [Complex] Handle complex to real after type promotion (#29855)

* try to add fwd op input dtypes

* refactor base impl

* return tmp_ins after dygraph prepare data

* fix typo found in debug

* polish comment & add complex net test

* revert detail change

* fix unittest failed

* add complex kernel condition control

* fix xpu test failed & polish comment

* polish details by review comments

* Complex op test (#29753)

* delete no need to calculate inputs in dygraph op_test

* delete no need to calculate inputs in dygraph op_test

* change grad elementwise_mul for complex types (#29757)

* add conj op for complex types

* add conj for complex types

* add more test case

* add conj_op test

* modify conj api and impl

* add complex type for fill_constant_op xpu

* add setConstant for complex type

* remove complex conj test file

* user define grad for test_conj_op

* add test case for static mode of conj api

* modify conj doc

* change input args name to x

* remove useless codes

* conj support real types

* add conj test case for real number

* delete no need to calculate inputs in dygraph op_test

* delete no need to calculate inputs in dygraph op_test

* modify grad of mul for complex types

* fix the grads of inputs args order not match bug

* change the grad of div when complex types (#29804)

* change the grad of div when complex types

* fix the grads of inputs args order not match bug
Co-authored-by: Nchentianyu03 <chentianyu03@baidu.com>

91ebc460

L
Fix Conv2DTanspose bug when padding='same' (#29915) (#29936) · acb29ff8
由 LielinJiang 提交于 12月 29, 2020
```
* fix conv_transpose bug when padding=same
```
acb29ff8

[cherry-pick] clean redundant API alias in 2.0 - part 1 #29928 (#29960) · c9c835b5

由 XiaoguangHu 提交于 12月 28, 2020

* [cherry-pick] cherry-pick of PR#29928

* delete paddle.metric.chunk_eval and paddle.metric.mean_iou

* delete paddle.nn.clip and paddle.nn.clip_by_norm

* delete paddle.nn.functional.activation.hard_sigmoid and paddle.nn.functional.activation.hard_swish

* [cherry-pick] cherry-pick of PR#29928

* fix extension import error

c9c835b5

28 12月, 2020 2 次提交

[Cherry-Pick 2.0][Dy2Stat] 1. Fix bug of for-range stmts. 2. Support that step... · a8b6dd86

由 liym27 提交于 12月 28, 2020

[Cherry-Pick 2.0][Dy2Stat] 1. Fix bug of for-range stmts. 2. Support that step value is negative in for-range stmts (#29519) (#29874)

1. Fix error in _build_cond_stmt of for-range stmts.

2. Support that step value is negative in for-range stmts

3. Fix code because of the diff between Py2 and Py3

a8b6dd86

[Cherry-pick] Cherry-pick of PR#29579 and PR#29617 (#29904) · 63939597

由 Huihuang Zheng 提交于 12月 28, 2020

* [Dy2stat] Enable jit.save to Save Without Running (#29579)

Enable jit.save to Save Without Running.

* Modify CublasHandleHolder to Fix Random Unittest Failure. test=develop (#29617)

Modify CublasHandleHolder from using PADDLE_ENFORCE_CUDA_SUCCESS to PADDLE_RETRY_CUDA_SUCCESS to fix random unittest failure. We checked that the unittest log showed CUDA allocation error at this file, which may due to GPU not enough. We fixed similar failure in the past, so we applied PADDLE_RETRY_CUDA_SUCCESS here.

63939597

25 12月, 2020 2 次提交

Q
feat: support check_nan_inf for kunlun/xpu device (#29694) (#29898) · 41917fb5
由 QingshuChen 提交于 12月 25, 2020
```
* feat: support check_nan_inf for kunlun device

* support kunlun stack

* minor
```
41917fb5

2 0 ps core 2 (#29894) · f781ab08

由 tangwei12 提交于 12月 25, 2020

* add ps table (#29463)

* add ps table

Change-Id: I468a04bd071d21ff52654926fcf4d5f3da19e178

* add service (#29560)

* add service, remove ut on mac

* fix heter_profiler & add heter stop method

* fix code style

* merge pscore

Change-Id: Ie7f60d1cdde6755a0c29db26863c6283e9843d57

* fix cmake

Change-Id: I6773509a7b4ca79139ecc40b7bf3eb318ceff8bb

* fix conflit

Change-Id: I35575be0c96a8520f9d756ea7f1ff0b904a165ba

* fix conflit

Change-Id: Ic926ea0b0d67803226d51241397ba3b510226bfa

f781ab08

22 12月, 2020 2 次提交

add nearest_interp_v2 on kunlun (#29725) (#29822) · 0e2f5bb1

由 QingshuChen 提交于 12月 22, 2020

* add nearest_interp_v2 on kunlun

* add nearest_interp_v2 on kunlun
Co-authored-by: NTTerror <tangzhiyi11@users.noreply.github.com>

0e2f5bb1

[cherry-pick 2.0] gen nccl id socket (#29746) · 0f49e0b7

由 WangXi 提交于 12月 22, 2020

* gen nccl id use socket (#29431)

* fix gen_nccl_id_op_helper compile failed, test=develop (#29614)

0f49e0b7

18 12月, 2020 1 次提交

[Cherry-pick] Add complex api conj, real and imag (#29750) · ab5cc042

由 Chen Weihang 提交于 12月 18, 2020

* Add complex dtype op (add) test example (#29603)


* add op test case for complex

* polish code details

* add xpu set constant support

* fix argument rror

* remove useless pyc file

* [Complex] Add real & imag op and api for complex tensor (#29672)

* add complex real op & api & unittest

* add imag op & api & unittest

* refactor op impl

* revert simplify writing due to complile failed

* polish details

* polish grad op code

* add conj op for complex types (#29527)

* add conj op for complex types

* add conj for complex types

* add more test case

* add conj_op test

* modify conj api and impl

* add complex type for fill_constant_op xpu

* add setConstant for complex type

* remove complex conj test file

* user define grad for test_conj_op

* add test case for static mode of conj api

* modify conj doc

* change input args name to x

* remove useless codes

* conj support real types

* add conj test case for real number
Co-authored-by: Nchentianyu03 <chentianyu03@baidu.com>

ab5cc042

17 12月, 2020 1 次提交

[cherry-pick]fix matmulv2 bug & add rebuild group & fix bug of download (#29726) · df0430dc

由 ShenLiang 提交于 12月 17, 2020

* Fix the dowanload bug in the case of multiple machines (#29551)

* fix the dowanload bug
* add sort for ips

* Fix bug of matmul_v2 for broadcast case (#29599)

* fix bug of matmul_v2 for broadcast

* Rebuild group automatically in dynamic graph distributed (#29255)

* add tensor_indices in AssignGroupBySize

* add rebuild group in reducer

* fix error message of gather nd (#29521)

df0430dc

BaiXuePrincess / Paddle 与 Fork 源项目一致

BaiXuePrincess / Paddle
与 Fork 源项目一致