Commits · b207b8a7bf312f0aaf30b9f4296928a3e0787eaf · Crayon鑫 / Paddle

Jan 12, 2021 · 3 commits

[cherry-pick]memory optimization for fuse pattern of elemwise_add + act (#30303) · b207b8a7

Committed by wangchaochaohu on Jan 12, 2021

* reduce the memory footprint of the fused pattern of the elementwise_add Op and an activation Op (e.g., relu Op) (#29885)

* register OPMaker and Infer Shape Check for fused_elementwise_add (#30259)

b207b8a7

[Cherry-pick]Fix the accuracy problem of allclose op when using float64 data... · 2db79f0a

Committed by Zhen Wang on Jan 12, 2021

[Cherry-pick]Fix the accuracy problem of allclose op when using float64 data type in static mode.(#29890) (#30313)

* Fix the accuracy problem of allclose op when using float64 data type in static mode.

* Format the code style.

2db79f0a

[Cherry-pick] Complex grad for matmul, kron and type promotion (#30304) · 7346edc2

Committed by chentianyu03 on Jan 12, 2021

* complex gradient matmul  (#29966)

* dot op support complex types

* matmul support complex types

* add test case

* matmul broadcast gradient support complex

* move conjFunctor to complex_functor.h

* change the kron gradient when complex types (#29995)

* type promotion for grad (#30177)

* type promotion for grad

* add type promotion for div op

7346edc2

Jan 11, 2021 · 14 commits

[Cherry-Pick] Support vector<double> as type of op attribute and op set_value... · d839761e

Committed by liym27 on Jan 11, 2021

[Cherry-Pick] Support vector<double> as type of op attribute and op set_value supports vector<double> as value (#30126) (#30305)

Cherry-Pick #30126
1. Support vector<float64> as type of op attribute.
2. Op set_value supports float64 numpy.array values.

d839761e

[cherry-pick] Async drop scope in executor (#29714) #30285 · 93ce7f69
Committed by Leo Chen on Jan 11, 2021
```
[cherry-pick] Async drop scope in executor (#29714)
```
93ce7f69

[Cherry-Pick 2.0] Check the rank of input in kernel of set_value op (#30147) (#30301) · a2bbd06a
Committed by liym27 on Jan 11, 2021
```
cherry-pick #30147: for op set_value, check that the input's rank is < 7
```
a2bbd06a

[cherry pick] Add detailed error message for curandStatus_t, cublasStatus_t,... · 04cc659c

Committed by WeiXin on Jan 11, 2021

[cherry pick] Add detailed error message for curandStatus_t, cublasStatus_t, cusolverStatus_t (#30161) (#30280)

Add detailed error messages for curandStatus_t, cublasStatus_t, and cusolverStatus_t.
Original PR: #30161

04cc659c

[Cherry-pick PR 29913], add View(reuse allocation) strategy on squeeze,... · 7c943a65

Committed by pangyoki on Jan 11, 2021

[Cherry-pick PR 29913], add View(reuse allocation) strategy on squeeze, unsqueeze, reshape, flatten op (#29913) (#30258)

* add view strategy on squeeze,unsqueeze,reshape,flatten

* add squeeze unittest

* add unittests

* use View strategy as name rather than Reuse Allocation

* fix view api doc

* fix format

* use core.ops when input of reshape2 is Tensor

* fix test_cross_entropy_loss error because of reshape2

* delete selected_rows

* change op_function

* little change

* solve HandleViewBetweenInputAndOutput

7c943a65

[cherry-pick]add cast cuda kernel (#29352) #30263 · afbc6367
Committed by Zhang Ting on Jan 11, 2021
```
 add cast cuda kernel

cherry-pick #29352
```
afbc6367

[cherry-pick]add support for place string representation #30264 · fb66355e
Committed by wangchaochaohu on Jan 11, 2021
```
cherry-pick #28769, add support for place string representation 
```
fb66355e

[cherry-pick]Elementwise add grad GPU kernel optimization (#30276) · e59524f8

Committed by wangchaochaohu on Jan 11, 2021

* elementwise_add_grad Op optimization  (#29575)

* optimize for long width for elementwise (#29602)

* refine (#29622)

* delete the code for fp16 optimization because it is not faster than common template code (#29715)

* fix the shape choose of vectorize for cuda

* optimization for fp16 elementwise add (#29744)

* Fix the compiler error for half type (#29799)

* refine the compiler error for half2 operation (#29816)

* fix the compiler error when gcc4 cuda9.0 (#29997)

e59524f8

[Cherry-Pick] Support pure fp16 training for AMP API. (#29544) (#30241) · d8dfef54

Committed by Zhen Wang on Jan 11, 2021

* Support pure fp16 training for AMP API. (#29544)

* add cast ops before and after unsupported fp16 ops.

* Keep partial net in FP32 pattern.

* Support check_finite_and_unscale and update_loss_scaling for FP16 calculation mode.

* Add fp16 support for adam op.

* add multi precision attr for adam.

* Fix the bug of test_multi_precision_fp16_train UT.

* Code format for CI.

* Fix the redefine error about MPTypeTrait on windows.

* fix bugs of the _create_accumulators func in Momentum.

* fix bug when inserting post cast op.

* Add the update_loss_scaling op in allow_set of UnusedVarCheck.

* Update for ci coverage.

* Add some doc for OptimizerWithMixedPrecision.

* Fix the code style.

* Improve the doc of `amp_init`.

* Change for fp16 testing if users have the infer program defined in a separate way.

* Remove tensor copy in the update_loss_scaling op. (#29426)

* remove tensor copy in the update_loss_scaling op

* do not use thrust.

* fix some cuda memory access errors.
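This commit wires check_finite_and_unscale and update_loss_scaling into pure FP16 training. As a rough illustration of the dynamic loss-scaling policy such ops typically implement, here is a pure-Python sketch. It assumes the common scheme (unscale gradients; on inf/nan skip the step and shrink the scale; after a run of finite steps grow the scale). All function, class, and parameter names and the default values are illustrative assumptions, not Paddle's kernels or API.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


def check_finite_and_unscale(grads: List[float], scale: float) -> Tuple[List[float], bool]:
    """Hypothetical helper: divide gradients by the loss scale and flag any inf/nan."""
    unscaled = [g / scale for g in grads]
    found_inf = any(not math.isfinite(g) for g in unscaled)
    return unscaled, found_inf


@dataclass
class DynamicLossScaler:
    """Hypothetical dynamic loss scaler; the defaults are illustrative only."""
    scale: float = 2.0 ** 15
    incr_every_n_steps: int = 1000
    decr_every_n_nan_or_inf: int = 2
    incr_ratio: float = 2.0
    decr_ratio: float = 0.5
    good_steps: int = 0
    bad_steps: int = 0

    def update(self, found_inf: bool) -> None:
        """Adjust the scale after one step; the caller skips the optimizer step if found_inf."""
        if found_inf:
            self.good_steps = 0
            self.bad_steps += 1
            if self.bad_steps >= self.decr_every_n_nan_or_inf:
                # Too many overflows in a row: shrink the scale, never below 1.0.
                self.scale = max(self.scale * self.decr_ratio, 1.0)
                self.bad_steps = 0
        else:
            self.bad_steps = 0
            self.good_steps += 1
            if self.good_steps >= self.incr_every_n_steps:
                # A long run of finite gradients: try a larger scale if it stays finite.
                new_scale = self.scale * self.incr_ratio
                if math.isfinite(new_scale):
                    self.scale = new_scale
                self.good_steps = 0
```

In a training loop, one would call `check_finite_and_unscale` on the scaled gradients, apply the optimizer update only when `found_inf` is false, and then call `scaler.update(found_inf)`.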

d8dfef54

[Cherry pick] improve dropout (#30260) · b4931ab1

Committed by Zhang Ting on Jan 11, 2021

* improve dropout (#29465)

* improve dropout

* add VectorizedRandomGeneratorWithGenerator

* fix bug

* modify according to comments

* improve dropout grad (#29605)

* improve grad perf

* fix the bug of dropout_grad (#29813)

b4931ab1

[cherry-pick] softmax optimize (#30279) · b80beb16

Committed by GaoWei8 on Jan 11, 2021

* Softmax vectorization (#29404)

* vec softmax fw

* vec softmax bw

* add a message argument for compiler compatibility

* optimize softmax forward (#30217)

* optimize softmax forward
Co-authored-by: zlsh80826 <zlsh80826@gmail.com>

b80beb16

Cherry-pick 30194 30164 30201(#30202) · 36de178a
Committed by Wilber on Jan 11, 2021

36de178a

[cherry-pick 2.0] optimize gradient merge (#30185) · e283dc6f

Committed by WangXi on Jan 11, 2021

* Optimization grad merge performance (#29784)

* [fleet] combine amp and gradient merge, test=develop (#30086)

* fix assign_op_xpu concat_op_xpu warning (#30120)
Co-authored-by: liuyuhui <liuyuhui@baidu.com>

e283dc6f

add aarch64 and sunway kunlun lib (#30027) (#30237) · eacbd488

Committed by QingshuChen on Jan 11, 2021

* add aarch64 and sunway kunlun lib

* minor

* optimize elementwise_add for kunlun

* update kunlun dependence

* minor

* minor

eacbd488

Jan 08, 2021 · 4 commits

[cherry-pick 2.0] Fix bug: In dynamic mode, if start or end is negative,... · 5fe3da39

Committed by liym27 on Jan 08, 2021

[cherry-pick 2.0] Fix bug: In dynamic mode, if start or end is negative, __getitem__ returns a wrong result (#30003) (#30146)

1. When slice_item is a slice:
 1) the start used by __getitem__ should be std::max(start, 0);
 2) the end used by __getitem__ should be std::min(end, dim).
2. When slice_item is an integer, it should be in [-dim_len, dim_len).
3. Fix the error message so that it reports accurate values.
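The bound handling described in this commit can be sketched in pure Python: negative bounds count from the end of the dimension, the start is clamped to at least 0, the end to at most the dimension length, and an integer index must lie in [-dim_len, dim_len). The helper names are hypothetical, and the two extra clamps marked in the comments are assumptions added so that the result matches Python's own step-1 slicing.

```python
from typing import Tuple


def normalize_slice(start: int, end: int, dim: int) -> Tuple[int, int]:
    """Hypothetical helper: map a step-1 slice [start, end) onto a dimension of length dim."""
    # Negative bounds count from the end of the dimension.
    if start < 0:
        start += dim
    if end < 0:
        end += dim
    # Commit rule: start = max(start, 0). Extra clamp (assumption): start <= dim.
    start = min(max(start, 0), dim)
    # Commit rule: end = min(end, dim). Extra clamp (assumption): end >= 0.
    end = max(min(end, dim), 0)
    return start, end


def normalize_index(index: int, dim: int) -> int:
    """Hypothetical helper: accept an integer index in [-dim, dim) and map it to [0, dim)."""
    if not -dim <= index < dim:
        raise IndexError(f"index {index} is out of range [{-dim}, {dim}) for dimension size {dim}")
    return index + dim if index < 0 else index
```

For a dimension of length 5, `normalize_slice(-7, 10, 5)` gives `(0, 5)`, i.e. the whole dimension, just as `list(range(5))[-7:10]` does.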

5fe3da39

[Cherry-Pick 2.0][setitem] Support Tensor setitem in static mode (#29708) (#30104) · f46ddc0e

Committed by liym27 on Jan 08, 2021

1. Index types: int, slice (step must be 1).

2. Value types:
 (1) int32, int64, float32, bool;
 (2) numpy.array(int32, int64, float32, bool), but not float64;
 (3) paddle.Tensor(int32, int64, float32, float64, bool);
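The type rules listed above can be written down as a small validator. This is a hypothetical sketch of the constraints only (dtype names are plain strings, and the function names are invented for illustration); it is not how Paddle's setitem is implemented.

```python
# Hypothetical validation helpers mirroring the type rules listed in this commit.
NUMPY_VALUE_DTYPES = {"int32", "int64", "float32", "bool"}  # float64 is rejected
TENSOR_VALUE_DTYPES = {"int32", "int64", "float32", "float64", "bool"}


def check_setitem_index(index) -> None:
    """Accept an int, or a slice whose step is 1 (or omitted)."""
    if isinstance(index, int):
        return
    if isinstance(index, slice):
        if index.step not in (None, 1):
            raise ValueError(f"slice step must be 1, got {index.step}")
        return
    raise TypeError(f"unsupported index type: {type(index).__name__}")


def check_setitem_value_dtype(kind: str, dtype: str) -> None:
    """kind is 'numpy' or 'tensor'; dtype is the value's dtype name as a string."""
    if kind == "numpy":
        allowed = NUMPY_VALUE_DTYPES
    elif kind == "tensor":
        allowed = TENSOR_VALUE_DTYPES
    else:
        raise ValueError(f"unknown value kind: {kind}")
    if dtype not in allowed:
        raise TypeError(f"{kind} values of dtype {dtype} are not supported")
```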

f46ddc0e

[Cherry-pick] [Complex] Simplify prepared op impl to improve performance (#30153) (#30215) · 0e3a1d35

Committed by Chen Weihang on Jan 08, 2021

* simplify prepared op impl to improve performance

* fix kunlun compile error

* continue fix kunlun compile error

* only transform diff place when dtype diff

* fix failed unittests

* remove useless file

* polish impl by review comment

0e3a1d35

【2.0API CherryPick】LookAhead, ModelAverage, IndexSelect (#30205) · 3ce4d34d

Committed by 123malin on Jan 08, 2021

* Add Lookahead and ModelAverage Optimizer (#30004)

* test=develop, add model_average and lookahead

* Improve Index select cuda kernel (#30139)

* test=develop, add index_select_cuda kernel

3ce4d34d

Jan 07, 2021 · 5 commits

fix error message (#30135) (#30182) · 9f02c284
Committed by ShenLiang on Jan 07, 2021

9f02c284

[cherry pick] Some optimizations of elementwise_add, gelu and dropout for AMP (#30152) · 07f68fad

Committed by Leo Chen on Jan 07, 2021

* Improve performance of elementwise_add grad op (#29187)

* pass stop_gradient for cast op

* improve performance of elementwise_add grad

* use tensor copy async

* dygraph branch

* fix dygraph branch

* add ut

* make gelu fp16 computing more robust (#29484)

* Add fast path for dropout when p == 0  (#29553)

* add fast path for p == 0 in dropout

* add ut
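The fast path for p == 0 added in #29553 skips work that cannot change the result: nothing is dropped and the rescaling factor 1 / (1 - p) equals 1. A minimal pure-Python sketch, assuming the upscale_in_train dropout mode (kept values are scaled by 1 / (1 - p)); the function name and the list-based signature are hypothetical.

```python
import random
from typing import List, Optional


def dropout(x: List[float], p: float, training: bool = True,
            rng: Optional[random.Random] = None) -> List[float]:
    """Hypothetical upscale_in_train dropout with a fast path for p == 0."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    # Fast path: with p == 0 (or at inference time in this mode) the output
    # equals the input, so no random mask needs to be generated.
    if p == 0.0 or not training:
        return list(x)
    if p == 1.0:
        return [0.0] * len(x)
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - p)
    # Keep each element with probability 1 - p and rescale the kept ones.
    return [v * keep_scale if rng.random() >= p else 0.0 for v in x]
```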

07f68fad

[Cherry-pick] Layer norm fp16 and Nvidia optimize (#29169 #29434 #29522 #29576) (#30110) · 44b81e63

Committed by furnace on Jan 07, 2021

* Layer norm fp16 (#29169)

* add fp16 for layer_norm op

* revert layernorm api

* fix forward

* fix forward

* fix backward for layernorm with fp16

* fix unit test for layernorm with fp16

* fix with_mkldnn compile error for layernorm with fp16

* 1. revert to PADDLE_ENFORCE_NOT_NULL, 2. change static_cast<float> to static_cast<U>

* fix with_mkldnn compile error for layernorm with fp16

* fix with_mkldnn compile error for layernorm with fp16
Co-authored-by: zhiqiu <chenqiuliang@baidu.com>

* fix layer_norm accuracy (#29434)

* Layernorm opt (#29522)

* layernorm fw opt

* layernorm bw opt

* fix typo, test=develop

* remove const dim3 for windows CI compatibility

* merge develop
Co-authored-by: zlsh80826 <zlsh80826@gmail.com>

* Fix compile problem when cuda_arch < 6000 (#29576)

* fix compile problem when cuda_arch < 6000

* refine code

* refine code
Co-authored-by: zhiqiu <chenqiuliang@baidu.com>
Co-authored-by: zlsh80826 <zlsh80826@gmail.com>

44b81e63

add inference api: DisableTensorRtOps (#30109) (#30178) · cb71fea0
Committed by Shang Zhizhou on Jan 07, 2021

cb71fea0

fix xpu pe sync, test=notest (#30095) (#30114) · 85545bbc
Committed by liuyuhui on Jan 07, 2021

85545bbc

Jan 06, 2021 · 2 commits

support dygraph in xpu place (#30051) (#30112) · 285f33e5

Committed by hong on Jan 06, 2021

* support dygraph in xpu place; test=develop

* fix cpu/gpu compile error; test=develop

* fix compile error; test=develop

* fix xpu compile error; testd=develop

285f33e5

[Cherry-Pick 2.0][Dynamic Inplace] Support ShareInplaceVersionCounterWith for... · 743649b5

Committed by liym27 on Jan 06, 2021

[Cherry-Pick 2.0][Dynamic Inplace] Support ShareInplaceVersionCounterWith for C++ Tensor (#29842) (#30105)

Before this PR, SharePlaceHolderWith shared a Tensor between different C++ Variables, which means sharing the data, shape, and inplace_version_counter_ of the Tensor.
But in some cases, sharing data and inplace_version_counter_ without sharing shape is needed. For example, the inplace op reshape cannot share shape.

This PR discards SharePlaceHolderWith and exposes ShareInplaceVersionCounterWith for C++ Tensor.
This reverts commit b10ecd9d.

* Support ShareInplaceVersionCounterWith to share the same inplace version counter for VarBase
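The distinction this commit draws (share storage and the inplace version counter, but let each side keep its own shape) can be illustrated with a toy Python model. The classes and methods below are hypothetical stand-ins for the concept, not Paddle's C++ Tensor API; in Paddle the counter is `inplace_version_counter_` and the method is `ShareInplaceVersionCounterWith`.

```python
import math
from dataclasses import dataclass, field
from typing import List


class VersionCounter:
    """Counts in-place modifications; every holder of the same object sees the same count."""

    def __init__(self) -> None:
        self.version = 0

    def bump(self) -> None:
        self.version += 1


@dataclass
class ToyTensor:
    """Hypothetical stand-in for a C++ Tensor: storage, shape, and a version counter."""
    data: List[float]   # flat storage, may be shared by reference
    shape: List[int]    # metadata owned by this tensor alone
    counter: VersionCounter = field(default_factory=VersionCounter)

    def share_inplace_version_counter_with(self, other: "ToyTensor") -> None:
        # Share only the version counter; the shape stays independent.
        self.counter = other.counter

    def reshape_inplace_view(self, shape: List[int]) -> "ToyTensor":
        """Return a tensor with a new shape over the same storage and counter."""
        if math.prod(shape) != len(self.data):
            raise ValueError("shape does not match the number of elements")
        out = ToyTensor(self.data, list(shape))
        out.share_inplace_version_counter_with(self)
        return out

    def fill_(self, value: float) -> None:
        """In-place write: mutate the shared storage and bump the shared counter."""
        for i in range(len(self.data)):
            self.data[i] = value
        self.counter.bump()
```

With `y = x.reshape_inplace_view([3, 2])` for an `x` of shape `[2, 3]`, an in-place write through `y` is visible through `x` and bumps the single shared counter, while `x.shape` remains `[2, 3]`.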

743649b5

Jan 05, 2021 · 3 commits

add topo-aware in heter-ps (#30087) (#30117) · 7fc2ce50
Committed by Thunderbrook on Jan 05, 2021
```
* add topo aware

* resource.h

* topo aware

* format
```
7fc2ce50

fix large scale memory (#30035) (#30085) · e3975223

Committed by tangwei12 on Jan 05, 2021

* memory holder optimize

Change-Id: Ic91af8ac6f2853336d28a9fbbc5e8d0c57b5d05e

* memory holder optimize

Change-Id: I2fd1c14ecc17f5d5ce88b87890381ea801e6367f

* fix large scale memory holder

Change-Id: Ief0992b02b00220e16c72cc637a56e7b5788140f

* fix large scale memory holder

Change-Id: I910142a3952ead643a5604f8f80955f3e6efe655

e3975223

[cherry-pick] Add mkldnn interpolate op, support manual enable mkldnn interpolate op (#30083) · 9a6926f5
Committed by cc on Jan 05, 2021

9a6926f5

Jan 04, 2021 · 3 commits

[cherry pick 2.0]support deepcopy for Layer/Tensor/Paramerbase (#29387) (#29873) · c06350c9
Committed by Zhou Wei on Jan 04, 2021
```
* support deepcopy for Layer/Tensor/Paramerbase

* fix some code
```
c06350c9

make lite subgraph support multiple tensor precision. (#30055) · 878b6972
Committed by Wilber on Jan 04, 2021

878b6972

fix op version checker of pass bug (#30028) (#30084) · 477b0c46
Committed by Shang Zhizhou on Jan 04, 2021

477b0c46

Dec 31, 2020 · 1 commit

[Cherry-pick] Disable gloo by default #29559 #29805 (#29601) · 640f8cf0

Committed by lilong12 on Dec 31, 2020

* update, test=develop (#29559)

* Disable gloo by default (#29805)

* update, test=develop

* update, test=develop

640f8cf0

Dec 29, 2020 · 5 commits

[Kunlun] 2.0 cherry-pick: Support for Baidu Kunlun XPU multi card training (#29713) · 847aa172

Committed by liuyuhui on Dec 29, 2020

* [Kunlun] PR1: Support one Kunlun card training in parallel executor (#29337)

* [Kunlun] PR2: Support MultiDevicePass and BKCL in parallel executor (#29574)

* [Kunlun] bug fix of PR2: Support MultiDevicePass and BKCL in parallel executor  (#29926)

* add bkcl.so in whl for kunlun (#29947)

* [Kunlun] bug fix of PR2: Support MultiDevicePass and BKCL in parallel executor  (#29961)
Co-authored-by: QingshuChen <qingshu.chen714@gmail.com>

847aa172

[Cherry-pick] Complex network execute support (#29905) · 91ebc460

Committed by Chen Weihang on Dec 29, 2020

* [Complex] Add support for complex grad accumulated (#29889)

* add support for complex grad accumulated

* add unittest for coverage

* update test dtype

* remove useless blank line

* [Complex] Handle complex to real after type promotion (#29855)

* try to add fwd op input dtypes

* refactor base impl

* return tmp_ins after dygraph prepare data

* fix typo found in debug

* polish comment & add complex net test

* revert detail change

* fix unittest failed

* add complex kernel condition control

* fix xpu test failed & polish comment

* polish details by review comments

* Complex op test (#29753)

* delete no need to calculate inputs in dygraph op_test

* delete no need to calculate inputs in dygraph op_test

* change grad elementwise_mul for complex types (#29757)

* add conj op for complex types

* add conj for complex types

* add more test case

* add conj_op test

* modify conj api and impl

* add complex type for fill_constant_op xpu

* add setConstant for complex type

* remove complex conj test file

* user define grad for test_conj_op

* add test case for static mode of conj api

* modify conj doc

* change input args name to x

* remove useless codes

* conj support real types

* add conj test case for real number

* delete no need to calculate inputs in dygraph op_test

* delete no need to calculate inputs in dygraph op_test

* modify grad of mul for complex types

* fix the grads of inputs args order not match bug

* change the grad of div when complex types (#29804)

* change the grad of div when complex types

* fix the grads of inputs args order not match bug
Co-authored-by: chentianyu03 <chentianyu03@baidu.com>

91ebc460

[cherry-pick] #26920 , #22924 (#29948) · bea300dd
Committed by 石晓伟 on Dec 29, 2020

bea300dd

[cherry-pick] map matmul/squeeze2+matmul/reshape2+matmul to mul #29911 (#29980) · 160b3477
Committed by cc on Dec 29, 2020

160b3477

Support mips (#29943) · 5a8d43bb
Committed by Wilber on Dec 29, 2020

5a8d43bb

Crayon鑫 / Paddle is in sync with the upstream project it was forked from
