Commits · 50bfe420893e15f48e0aca9dbbc26cac3ce33bae · BaiXuePrincess / Paddle

29 Apr, 2022 1 commit

[cherry-pick 2.3] Add fused_multi_transformer op to optimize transformer... · 50bfe420

Committed by WangXi on Apr 29, 2022

[cherry-pick 2.3] Add fused_multi_transformer op to optimize transformer generation performance (#42311)

* Add fused_multi_transformer op to optimize transformer generation performance (#41814)

* fix fused_multi_transformer compile failed in cuda arch < sm53 (#42315)

* fix ci timeout

50bfe420

22 Apr, 2022 1 commit

[IPU] add mixed-precision support for ipu (#41733) (#41906) · c09b1d68

Committed by Allen Guo on Apr 22, 2022

add mixed-precision support for ipu

cherry-pick from #41733

c09b1d68

16 Mar, 2022 1 commit

[MLU] support amp O1 of mlu (#40461) · ad81f22c

Committed by qipengh on Mar 16, 2022

ad81f22c

19 Feb, 2022 1 commit

Add the DistributedFusedLamb optimizer (#39148) · 5df3cd61

Committed by sneaxiy on Feb 19, 2022

* add DistributedFusedLamb op

* polish code

* fix compile error

* compatible with pten changes

* fix rocm compile error

* improve coverage

* update upstream/develop

* fix cast_with_ptr.h

* add FLAGS_distributed_lamb_divide_nranks_when_allreduce=1

* fix clip before allreduce

* add use_master_param_norm

* code polish

* fix bug

* fix ROCM ci

5df3cd61

07 Feb, 2022 1 commit

Update BF16 amp list (#39304) · 0c43ce22

Committed by arlesniak on Feb 07, 2022

* amp list updated

* tests updated

* gray list updated

* amp list updated

* test updated

0c43ce22

13 Jan, 2022 1 commit

Added mul BF16/FP32 FWD/BWD oneDNN kernel (#38552) · fc6eed5b

Committed by jakpiase on Jan 13, 2022

* base changes for mul reimplementation

* empty commit

* tmp save

* full implementation of mul bf16/fp32 fwd bwd

* CI fix

* CI rerun

* changed unity build cmake to avoid gpu issues

* removed mul mkldnn from unity build

* added skipping tests if not cpu_bf16

* CI fix

* CI fix

* CI fix

fc6eed5b

28 Dec, 2021 1 commit

Fix scatter_op fp16 perf problem. (#38499) · 33ce249f

Committed by Li Min on Dec 28, 2021

* Fix scatter_op fp16 perf problem.

* Add scatter into black list.

* Add scatter into black list for dygraph.

33ce249f

20 Dec, 2021 1 commit

Support FP16 for more ops (#38123) · 1f445bf3

Committed by sneaxiy on Dec 20, 2021

* support FP16 for more ops

* add amp list tests

* refine reduce_mean_grad

* fix OP benchmark ci

* fix fp16 reduce_mean

* update ut, but still have some problems

* remove mean/reduce_mean fp16 kernel

1f445bf3

17 Dec, 2021 1 commit

Refine some AMP operators for BERT (#37923) · d80fe268

Committed by sneaxiy on Dec 17, 2021

* support multi precision update for LAMB

* hide some api

* fix ci uts

* fix lamb output of dygraph

* remove some changes to some PR

* try to fix Py3 CI compile error

* fix test_imperative_optimizer, add lars ut, add layer_norm ut

* fix ut, fix format

* fix ut

* fix windows ci

d80fe268

27 Oct, 2021 1 commit

Fused transformer encoder layer and fused feedforward layer (#36604) · 9f3613f3

Committed by zhangkaihuo on Oct 27, 2021

This PR adds the layer-level code for fused_transformer, including the layer code for FusedFeedForward and FusedTransformerEncoderLayer.

9f3613f3

14 Oct, 2021 1 commit

Add the complete code and related files of resnet_unit_op (#36366) · 12e6dbbc

Committed by Zhang Zheng on Oct 14, 2021

12e6dbbc

21 Sep, 2021 1 commit

Reuse OneDNN handler for SGD and SUM for SelectedRows input tensors. (#35510) · 799f3861

Committed by Adam Osewski on Sep 20, 2021

* Create stateful OneDNNAXPYHandler object.

This makes it possible to call it multiple times without recreating the
oneDNN primitives every time.

* Prepare SGDOpKernel to reuse its implementation from OneDNN kernel.

* OneDNN SGD kernel.

* Update call to use new OneDNNAXPYHandler object api.

* Setup seed in proper place.

* Enable OneDNN kernel only for single case.

* For dense param and sparse grad.

* Small refactor.

* Enable oneDNN by op attr or by cmd line flag.

* Use int64_t type for number of elements.

* Support dense param and grad from OneDNN kernel.

* Enable SGD OneDNN kernel when using MP BF16 optimizer.

* Force non-copyable/movable OneDNNAXPYHandler.

* Reuse OneDNNAXPYHandler for sparse tensors in SUM op.

* Fix SFINAE rules.

* Remove recording event inside AXPY.

* Get rid of internal primitive caching.

* Stop using the PP cache mechanism to store mem and primitive obj.
* Handler obj stores and reuses needed desc & prim

* Do not derive from MKLDNNHandlerT

799f3861

10 Sep, 2021 1 commit

fix bug of recompute in hybridparallel (#35588) · d53e567a

Committed by ShenLiang on Sep 10, 2021

d53e567a

24 Aug, 2021 1 commit

Update LearningRate for test fit a line BF16 (#34653) · 36f7e751

Committed by Adam Osewski on Aug 24, 2021

* Small corrections.

* Fix lr for bf16.

* Revert some changes.

36f7e751

17 Aug, 2021 1 commit

[NPU]Adamw skip update for npu (#34897) · b4474fb4

Committed by Roc on Aug 17, 2021

b4474fb4

05 Aug, 2021 1 commit

optimize pipeline performance with recompute and amp, test=allcase (#34519) · 911c8593

Committed by WangXi on Aug 05, 2021

911c8593

22 Jul, 2021 2 commits

copy found_inf to cpu in advance to improve performance (#34274) · 781f4028

Committed by Leo Chen on Jul 22, 2021

* copy found_inf to cpu in advance to improve performance

* add npu test

* add npu test

* refine code

* refine memcpy op

* fix adam

781f4028

enable amp unsupported_fp16_list for npu (#34314) · b0a2f005

Committed by Leo Chen on Jul 22, 2021

b0a2f005

19 Jul, 2021 1 commit

[amp] pass found_inf to adam to support skip_update (#34176) · 9bc59673

Committed by Leo Chen on Jul 19, 2021

* pass found_inf to adam

* add unittest

* fix bug

* refine unittest

* change unit test's directory

* disable unittest on cpu
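
The skip_update idea named in this commit can be sketched as follows. This is a minimal illustration under stated assumptions, not Paddle's implementation: it uses plain SGD instead of Adam for brevity, and the function names `unscale_and_check` and `sgd_step` are hypothetical.

```python
import math


def unscale_and_check(scaled_grads, loss_scale):
    """Divide scaled gradients by the loss scale and flag non-finite values."""
    grads = [g / loss_scale for g in scaled_grads]
    found_inf = any(not math.isfinite(g) for g in grads)
    return grads, found_inf


def sgd_step(params, grads, lr, found_inf):
    """Apply one SGD update, or leave params untouched when found_inf is set."""
    if found_inf:
        # skip_update: an overflowed step must not corrupt the parameters
        return list(params)
    return [p - lr * g for p, g in zip(params, grads)]


params = [1.0, 2.0]

# A step whose gradients overflowed: the update is skipped.
bad_grads, bad_flag = unscale_and_check([512.0, math.inf], loss_scale=1024.0)
after_bad = sgd_step(params, bad_grads, lr=0.1, found_inf=bad_flag)

# A healthy step: gradients are unscaled and applied.
ok_grads, ok_flag = unscale_and_check([512.0, -1024.0], loss_scale=1024.0)
after_ok = sgd_step(params, ok_grads, lr=0.1, found_inf=ok_flag)
```

Passing the flag into the optimizer step, rather than branching outside it, keeps the decision next to the update itself, which is one plausible reading of why the commit hands found_inf to adam.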

9bc59673

16 Jul, 2021 1 commit

[NPU] add clear_float_status op (#34190) · 0e4bcede

Committed by Leo Chen on Jul 16, 2021

* add clear_float_status op

* refine infershape

* fix typo

* refine check_finite_and_scale

* refine code

0e4bcede

05 Jul, 2021 1 commit

add `reduce_sum` op into amp black list (#33960) · aa9fdd0d

Committed by jiangcheng on Jul 05, 2021

* make the reduce_sum op default to fp32 and add it to the amp black list

* defaulting reduce_sum to fp32 avoids returning inf when the sum is larger than 65504, the largest finite fp16 value
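
The overflow described above can be reproduced with a small standard-library sketch. It is an illustration only and does not call Paddle: the helpers `to_fp16` and `sum_fp16` are hypothetical, and Python's double-precision `sum` stands in for an fp32 accumulator.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest float16 value (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses finite values that round past the fp16 range;
        # fp16 arithmetic would yield a signed infinity here instead.
        return math.copysign(math.inf, x)


def sum_fp16(values):
    """Accumulate in float16: every partial sum is rounded to float16."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


values = [1000.0] * 100       # true sum is 100000, above 65504
fp16_total = sum_fp16(values)  # overflows to inf part-way through
wide_total = sum(values)       # wider accumulator keeps the finite result
```

Note that the wide result also cannot be cast back to fp16 without overflowing, which is consistent with keeping the whole op in fp32 rather than only its accumulator.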

aa9fdd0d

01 Jul, 2021 1 commit

fix bug DLTP-31078 (#33877) · 3e82a794

Committed by taixiurong on Jul 01, 2021

3e82a794

29 Jun, 2021 1 commit

xpu support amp (#33809) · 4d4fb660

Committed by taixiurong on Jun 29, 2021

4d4fb660

21 Jun, 2021 1 commit

update fp16 gray_list for tensor parallel (#33660) · 1681a2dd

Committed by WangXi on Jun 21, 2021

1681a2dd

16 Jun, 2021 1 commit

fix new ci check errors (#33561) · 16099abf

Committed by zhiboniu on Jun 16, 2021

16099abf

10 Jun, 2021 1 commit

dp c_allreduce_sum_fusion op (#33169) · 003b4616

Committed by Baibaifan on Jun 10, 2021

003b4616

26 May, 2021 1 commit

[Tensor Parallelism] split fix bug (#33015) · 20b9be65

Committed by JZ-LIANG on May 26, 2021

20b9be65

07 May, 2021 1 commit

Mechanism that converts startup_program initializers to BF16 (#32720) · ce2bdb0a

Committed by joanna.wozna.intel on May 07, 2021

* Add casting initializers for bf16 training

* Changes after review

* Correct test and add comment

ce2bdb0a

28 Apr, 2021 1 commit

Added pure_bf16 mode (#32281) · bc379ca3

Committed by arlesniak on Apr 28, 2021

bc379ca3

23 Apr, 2021 1 commit

[NPU] refactor check_finite_and_scale npu kernel (#32407) · 39a59dcf

Committed by Leo Chen on Apr 23, 2021

* refactor_check_finite_and_scale_npu_kernel

* fix compile

* add alloc_float_status op

* add alloc_float_status op

* add FloatStatus for check_finite_and_unscale

* refine code

* remove unnecessary logic

* refine for fleet

39a59dcf

22 Apr, 2021 1 commit

Add fleet get_loss_scaling doc and update alert message (#32419) · d03b0b16

Committed by Yuang Liu on Apr 22, 2021

d03b0b16

21 Apr, 2021 2 commits

fix bug in amp O2 (#32343) · 4be3b057

Committed by huangxu96 on Apr 21, 2021

4be3b057

add get_loss_scaling to fleet (#32401) · 37bb3342

Committed by Yuang Liu on Apr 21, 2021

37bb3342

15 Apr, 2021 1 commit

fix test sync_with_cpp (#32212) · 0c037d2d

Committed by fangshuixun007 on Apr 15, 2021

fix test sync_with_cpp (#32212)

0c037d2d

08 Apr, 2021 1 commit

The unsupported_fp16_list used in AMP will be created automatically during the runtime. (#32102) · 6e65fe02

Committed by Zhen Wang on Apr 08, 2021

* Use the runtime to create the unsupported_fp16_list used in AMP.

* Add more infos about supported ops.

* Add some comments for the function of OpSupportedInfos.

* Fix the unit test of test_multi_precision_fp16_train.

6e65fe02

26 Mar, 2021 1 commit

[3D-parallel] Reformat pipeline parallel (#31786) · c3974d0e

Committed by lilong12 on Mar 26, 2021

* update, test=develop

c3974d0e

22 Mar, 2021 1 commit

[oneDNN] Initial bf16 amp integration (#31093) · 7ccf6b60

Committed by arlesniak on Mar 22, 2021

7ccf6b60

20 Jan, 2021 1 commit

Add fleet amp_init() (#30572) · 13862008

Committed by huangxu96 on Jan 20, 2021

* add fleet amp.init()

* add unittest for fleet_amp_init

13862008

13 Jan, 2021 1 commit

add amp example document (#30314) · 342d62de

Committed by huangxu96 on Jan 13, 2021

342d62de

08 Jan, 2021 1 commit

Support pure fp16 training for AMP API. (#29544) · 7f7dfccf

Committed by Zhen Wang on Jan 08, 2021

* add cast ops before and after unsupported fp16 ops.

* Keep partial net in FP32 pattern.

* Support check_finite_and_unscale and update_loss_scaling for FP16 calculation mode.

* Add fp16 support for adam op.

* add multi precision attr for adam.

* Fix the bug of test_multi_precision_fp16_train UT.

* Code format for CI.

* Fix the redefine error about MPTypeTrait on windows.

* fix bugs of the _create_accumulators func in Momentum.

* fix bug when inserting post cast op.

* Add the update_loss_scaling op in allow_set of UnusedVarCheck.

* Update for ci coverage.

* Add some doc for OptimizerWithMixedPrecision.

* Fix the code style.

* Improve the doc of `amp_init`.

* Change for fp16 testing if users have the infer program defined in a separate way.
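
The first item in this commit message, adding cast ops before and after unsupported fp16 ops, can be pictured with a toy rewrite over a list of op names. This is a sketch under stated assumptions, not the pass in this PR: `insert_casts` is a hypothetical helper, the string labels for cast ops are made up, and which ops lack fp16 kernels is chosen here only for illustration.

```python
def insert_casts(ops, unsupported_fp16):
    """Wrap every op that lacks an fp16 kernel in a pair of casts.

    The input is cast up to fp32 before the op and the output is cast
    back to fp16 after it, so the rest of the program stays in fp16.
    """
    rewritten = []
    for op in ops:
        if op in unsupported_fp16:
            rewritten.append("cast(fp16->fp32)")
            rewritten.append(op)
            rewritten.append("cast(fp32->fp16)")
        else:
            rewritten.append(op)
    return rewritten


# Hypothetical program and unsupported set, for illustration only.
program = ["matmul", "softmax_with_cross_entropy", "relu"]
rewritten = insert_casts(program, unsupported_fp16={"softmax_with_cross_entropy"})
```

Ops that are not in the unsupported set pass through unchanged, so the number of inserted casts grows only with the number of fp32-only ops.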

7f7dfccf
