1. 30 Apr, 2021 · 2 commits
  2. 29 Apr, 2021 · 3 commits
  3. 28 Apr, 2021 · 1 commit
    • [Cherry-pick] Optimize update_loss_scaling_op (#32554) (#32606) · 33703da8
      Authored by jiangcheng
      * optimize update_loss_scaling_op by fusing the for loop into one kernel, test=develop
      
      * remove useless while loop and optimize variable name, test=develop
      
      * optimize variable name from out_addrs_tensor to out_addrs_mem, test=develop
      
      * optimize variable names for readability by changing the prefix identifier from t_ to local_
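The fused kernel described in the entry above boils down to one idea: instead of touching each output tensor in its own loop (one kernel launch per tensor), pack all output buffers once and walk a single flat index space. The host-side sketch below only illustrates that pattern; TensorView, ZeroEach and ZeroAllFused are made-up names, and the real op's arithmetic is simplified to zero-filling here.

```cpp
// Host-side analogue of fusing per-tensor loops into one kernel: instead of
// one pass (one kernel launch) per output tensor, pack the data pointers and
// sizes once and walk a single flat index space.
#include <cstddef>
#include <vector>

struct TensorView {
  float* data;
  size_t numel;
};

// Before: one pass (one kernel launch) per output tensor.
void ZeroEach(const std::vector<TensorView>& outs) {
  for (const auto& t : outs) {
    for (size_t i = 0; i < t.numel; ++i) t.data[i] = 0.f;
  }
}

// After: one fused pass. A flat index is mapped back to (tensor, element)
// through exclusive prefix sums of the sizes, which is what each thread of a
// single fused kernel would do with a packed array of output addresses.
void ZeroAllFused(const std::vector<TensorView>& outs) {
  std::vector<size_t> offsets{0};
  for (const auto& t : outs) offsets.push_back(offsets.back() + t.numel);

  size_t cur = 0;  // index of the tensor that owns the current flat position
  for (size_t idx = 0; idx < offsets.back(); ++idx) {
    while (idx >= offsets[cur + 1]) ++cur;
    outs[cur].data[idx - offsets[cur]] = 0.f;
  }
}
```

On a GPU/NPU the payoff is fewer kernel launches: the fused variant is a single launch whose threads recover (tensor, element) from the packed offsets.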
  4. 27 Apr, 2021 · 2 commits
  5. 26 Apr, 2021 · 5 commits
  6. 25 Apr, 2021 · 9 commits
  7. 23 Apr, 2021 · 8 commits
  8. 22 Apr, 2021 · 2 commits
  9. 21 Apr, 2021 · 6 commits
    • [NPU] Merge NPU ccl code (#32381) · c3158527
      Authored by zhang wenhui
      * add allreduce and broadcast without test (#31024)
      
      * Refactor HCCLCommContext to be compatible with Paddle (#31359)
      
      * [NPU] add npu kernel for communication op (#31437)
      
      * add allreduce and broadcast without test
      
      * add c_broadcast_test case
      
      * build c_comm_init and c_create_group operators
      
      * make the whole thing compile
      
      * add broadcast and init op test case, but it fails to run
      
      * make unit test compile
      
      * fix broadcast test bug and change into hcom for ccl
      
      * change c_comm_init and c_create_group ops accordingly
      
      * make tests compile
      
      * transfer code to 27
      
      * compiled successfully on 28, but failed at runtime
      
      * test broadcast on 28, but it failed
      
      * make hcom primitives work
      
      * change hccl data type for base.h
      
      * fix broadcast bug
      
      * make attributes work
      
      * fix group name bug
      
      * add allreduce but test failed
      
      * allreduce bug for qiuliang
      
      * allreduce finished
      
      * add allgather and reducescatter
      
      * merge all op code
      
      * add allgather test
      
      * finish running all ccl op tests, excluding send/recv
      
      * all ops and tests done, excluding send/recv
      
      * send_v2_npu.cc and recv_v2_npu.cc compiled
      
      * fix ccl core dump bug and test allgather, reducescatter, broadcast op
      
      * fix allreduce bug just for test
      
      * hcom send&recv test pass, without hcom_destroy
      
      * for qiuliang test
      
      * Ascend Send&Recv Test Pass
      
      * all ops (except send/recv) ok
      
      * fix bug
      
      * merge all ccl op
      
      * style merge to PaddlePaddle
      
      * merge style
      
      * new merge style
      
      * merge style 2
      
      * insert an empty line at the end
      
      * disable ctest for hcom to pass ci
      Co-authored-by: void-main <voidmain1313113@gmail.com>
      Co-authored-by: f2hkop <f2huestc@outlook.com>
      
      * Add auto-increasing tag id for Hcom OPs (#31702)
      
      * add c_reduce_sum op (#31793)
      
      * update Ascendrc hccl to 20.3 (#32126)
      
      * fix merge code
      
      * change cmake.txt1
      
      * [NPU] Support npu kernel for c sync stream op (#31386)
      
      * sync stream npu op
      
      * add with_ascend_acl
      
      * update c++ unittest
      
      * compile all failed
      
      * try to pre commit
      
      * after pre commit
      
      * merge&compile&test hccl successfully!
      
      * fix code style
      
      * fix code style
      
      * fix bugs about hccl
      
      * fix some bugs
      
      * fix code style
      
      * fix style
      
      * fix style
      
      * fix
      
      * fixed
      
      * merge develop
      Co-authored-by: lw921014 <liuwei921014@yeah.net>
      Co-authored-by: Void Main <voidmain1313113@gmail.com>
      Co-authored-by: f2hkop <f2huestc@outlook.com>
      Co-authored-by: xiayanming <41795079@qq.com>
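Most of the work merged above is plumbing around a per-ring communicator: c_comm_init / c_create_group set up an HCCLCommContext-style registry, the collective kernels (allreduce, broadcast, allgather, reducescatter, send/recv, c_reduce_sum) look their communicator up by ring id, and an auto-increasing tag id lets matching Hcom send/recv ops agree on a message identifier. The sketch below is a minimal stand-alone rendering of that shape under those assumptions; CommRegistry, FakeAllReduceSum and the other names are illustrative, not Paddle's or HCCL's real API.

```cpp
// Minimal sketch: communicator registry keyed by ring id, a collective
// wrapper that looks the ring up, and an auto-increasing send/recv tag.
#include <atomic>
#include <cstdio>
#include <map>
#include <mutex>
#include <string>
#include <vector>

struct Comm {            // stands in for one HCCL communicator / ring
  int ring_id;
  int nranks;
  int rank;
};

class CommRegistry {     // stands in for an HCCLCommContext-like singleton
 public:
  static CommRegistry& Instance() {
    static CommRegistry r;
    return r;
  }
  void Init(int ring_id, int nranks, int rank) {   // c_comm_init analogue
    std::lock_guard<std::mutex> g(mu_);
    comms_[ring_id] = Comm{ring_id, nranks, rank};
  }
  const Comm& Get(int ring_id) {
    std::lock_guard<std::mutex> g(mu_);
    return comms_.at(ring_id);
  }
  // Auto-increasing tag so that matching send/recv ops agree on a message id.
  std::string NextTag(const std::string& group) {
    return group + "_" + std::to_string(tag_counter_++);
  }

 private:
  std::mutex mu_;
  std::map<int, Comm> comms_;
  std::atomic<long> tag_counter_{0};
};

// Stub for the vendor collective call; a real kernel would hand the device
// buffers and the stream to the ccl library here.
void FakeAllReduceSum(std::vector<float>* buf, const Comm& comm) {
  (void)buf;
  (void)comm;  // single-process sketch: nothing to reduce across
}

// What an allreduce-style op kernel boils down to: translate the ring_id
// attribute into a communicator handle, then issue the collective on it.
void AllReduceSumOp(std::vector<float>* buf, int ring_id) {
  const Comm& comm = CommRegistry::Instance().Get(ring_id);
  FakeAllReduceSum(buf, comm);
}

int main() {
  CommRegistry::Instance().Init(/*ring_id=*/0, /*nranks=*/1, /*rank=*/0);
  std::vector<float> grad{1.f, 2.f, 3.f};
  AllReduceSumOp(&grad, /*ring_id=*/0);
  std::printf("tag for next send/recv: %s\n",
              CommRegistry::Instance().NextTag("group_0").c_str());
  return 0;
}
```

The registry-plus-lookup design keeps the op kernels thin: they only resolve a ring id into a communicator and forward the buffers.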
    • [NPU] register npu finalize on exit (#32390) · 8e4c1936
      Authored by Leo Chen
      * [NPU] register finalize on exit
      
      * fix
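"Register finalize on exit" presumably means hooking NPU runtime teardown into normal process termination so it runs exactly once even when nothing calls it explicitly. A minimal sketch of that pattern with std::atexit, using an illustrative ReleaseNpuRuntime stand-in for the real finalize call:

```cpp
// Make sure the device runtime is torn down exactly once when the process
// terminates, whichever exit path is taken.
#include <cstdio>
#include <cstdlib>
#include <mutex>

void ReleaseNpuRuntime() {
  // The real implementation would call the Ascend runtime's finalize API here.
  static std::once_flag flag;
  std::call_once(flag, [] { std::puts("npu runtime finalized"); });
}

void InitNpuRuntime() {
  // ... initialize the runtime ...
  // Register the teardown so it also runs on normal exit()/return from main.
  std::atexit(ReleaseNpuRuntime);
}

int main() {
  InitNpuRuntime();
  return 0;  // ReleaseNpuRuntime runs automatically after main returns
}
```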
    • remove thrust include files (#32395) · ab6f8745
      Authored by wuhuanzhou
      * remove thrust includes, test=develop
      
      * fix compilation error, test=develop
      
      * fix compilation of truncated_gaussian_random_op, test=develop
    • 5d19f8d8
    • Added oneDNN reduce_op GRAD kernel (#32280) · ead83422
      Authored by jakpiase
  10. 20 Apr, 2021 · 1 commit
  11. 19 Apr, 2021 · 1 commit
    • [NPU] cherry-pick gc/dataloader/save&load/optimization from ascendrc to develop (#32294) · cbe5c9f8
      Authored by Leo Chen
      * [NPU] support GarbageCollector for npu (#31874)
      
      * support GarbageCollector for npu
      
      * fix typo
      
      * fix gather_grad
      
      * disable NPUDefaultStreamGarbageCollector on NPU
      
      * [NPU] support npu for memcpy op (#31808)
      
      * support npu for memcpy op
      
      * add ut
      
      * fix ut
      
      * fix typo
      
      * [NPU] fix bug of using temp vector (#31963)
      
      * fix bug when beta1_pow on cpu (#31995)
      
      * [NPU] support npu profiler (#31684)
      
      * support npu profiler
      
      * add python api
      
      * fix bugs
      
      * add wrapper for incomplete type
      
      * update profile proto
      
      * record npu wait
      
      * add xpu placeholder
      
      * fix adam (#32016)
      
      * [NPU] enable async copy and add wait before sync operation (#31956)
      
      * enable async copy and add wait before sync operation
      
      * remove unnecessary wait
      
      * add FillNpuTensorWithConstant
      
      * refine
      
      * fix fill_constant
      
      * make TensorFromVector/TensorToVector sync
      
      * [NPU] Support dataloader on npu place. (#31867)
      
      * [NPU] Wait on NPUPlace (#32086)
      
      * [NPU] fix cast op (#32121)
      
      * fix npu kernel of cast op to handle casting to same dtype
      
      * add comments
      
      * [NPU] support cann 20.3 (#32044)
      
      * fix compile problem on cann 20.3
      
      * fix ut
      
      * fix test_mul
      
      * fix check_finite_and_scale
      
      * fix lookup_table_v2_grad
      
      * fix cmake
      
      * support print op
      
      * [NPU] Support npu save load (#31893)
      
      * support save load for NPU
      
      * add save load npu unittest
      
      * support np.array transform in NPU
      
      * fix errors
      
      * delete dygraph in unittest
      
      * add Wait
      
      * fix unittest
      
      * fix review comment
      
      * fix unittest problem
      
      * fix little problem
      
      * change aclrtSynchronizeDevice to aclrtSynchronizeStream for better performance (#32196)
      
      * change aclrtSynchronizeDevice to aclrtSynchronizeStream for better performance
      
      * refine code
      
      * fix NPUDeviceContext in all c++ unittest (#32198)
      
      * fix NPUDeviceContext in all c++ unittest
      
      * refine log
      Co-authored-by: pangyoki <pangyoki@126.com>
      
      * [NPU] Remove TensorFromVector and avoid sync copy in npu op kernel for better performance (#31994)
      
      * enable async copy and add wait before sync operation
      
      * remove unnecessary wait
      
      * add FillNpuTensorWithConstant
      
      * refine
      
      * fix fill_constant
      
      * change TensorFromVector to FillNpuTensorWithConstant
      
      * fix ignored api
      
      * delete extra unittest
      
      * fix little error
      
      * fix update_loss_scaling_op_npu and check_finite_and_unscale_op_npu
      
      * change TensorCopySync to TensorCopy
      
      * delete useless Wait and add StreamWait
      
      * fix npu_stream error
      
      * fix check_finite_and_unscale_op_npu TensorCopy
      
      * only save stream wait
      
      * fix NPUDeviceContext in all c++ unittest
      
      * delete wait
      Co-authored-by: zhiqiu <chenqiuliang@baidu.com>
      
      * delete useless unittest file (#32206)
      
      * Fix op test (#32231)
      
      * fix conditional block (#32243)
      
      * fix adam bug again (#32246)
      
      * fix compile
      
      * fix ut
      
      * fix ut
      Co-authored-by: liym27 <33742067+liym27@users.noreply.github.com>
      Co-authored-by: pangyoki <pangyoki@126.com>
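Several items in the entry above share one pattern: issue copies asynchronously on a stream, then wait only on that stream immediately before a synchronous use, instead of synchronizing the whole device (aclrtSynchronizeStream rather than aclrtSynchronizeDevice). The sketch below only models that pattern with plain C++ futures; Stream, MemcpyAsync and SynchronizeStream are illustrative stand-ins, not Paddle's or the Ascend runtime's API.

```cpp
// Async copy, then wait on just the stream you depend on. A stream is modeled
// as a queue of futures; SynchronizeStream is the analogue of waiting on one
// stream, while waiting on every stream would correspond to the heavier
// whole-device synchronization the entry moves away from.
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

using Stream = std::vector<std::future<void>>;   // pending work on one stream

// Enqueue an asynchronous host<->device style copy on a stream.
void MemcpyAsync(Stream* stream, const float* src, float* dst, size_t n) {
  stream->push_back(std::async(std::launch::async,
                               [=] { std::copy(src, src + n, dst); }));
}

// Wait only for the work already issued on this stream (cheap, targeted).
void SynchronizeStream(Stream* stream) {
  for (auto& f : *stream) f.wait();
  stream->clear();
}

int main() {
  std::vector<float> host{1.f, 2.f, 3.f};
  std::vector<float> dev(host.size());

  Stream stream;
  MemcpyAsync(&stream, host.data(), dev.data(), host.size());

  // Before any synchronous use of `dev` (e.g. a blocking read back to host),
  // wait on this stream only, not on every stream of the device.
  SynchronizeStream(&stream);
  return dev[0] == 1.f ? 0 : 1;
}
```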