Commits · ffd4086098b915087794757b07aa83dcae1074e1 · BaiXuePrincess / Paddle

09 Apr 2021 · 1 commit

[NPU] cherry-pick basic NPU components/allocator/operator/executor supports from ascendrc (#32144) · ccf5709d

Committed by Leo Chen on Apr 09, 2021

* [feature] support npu allocator (#30840)

[feature] support npu allocator

* [feature] support npu operator (#30951)

[feature] support npu operator

* [feature] support npu allocator, part 2 (#30972)

* support npu allocator

* add npu device context

* fix some compile problem

* fix some compile problem

* add npu info

* compile ok

* fix include dir

* support naive_best_fit_allocator

* run ut ok, but failed to exit

* call aclrtResetDevice before exit

* fix aclFinalize

* add system allocator test

* add selected_gpus in gtest

* add tensor_test for npu

* support npu op, initial commit

* add npu stream

* add elementwise_add_op

* compile ok

* fix typo

* fix elementwise_add_op_npu_test

* support op run

* test can run but failed

* change aclopExecuteV2 to aclopCompileAndExecute

* support parsing ascend rank table file (#31000)

support parsing ascend rank table file

* Fix reshape on GE graph. (#31084)

Fix reshape on GE graph

* add npu kernel for elementwise_sub and elementwise_sub_grad (#30973)

* add npu sub op

* fix typo

* rename test

* fix bug

* fix bug

* add fp16 kernel

* fix typo

* support sub grad op

* support elementwise_sub_grad op
Co-authored-by: frankwhzhang <frankwhzhang@126.com>

* Fix compilation problem (#31100)

Fix compilation problem (#31100)

* fix compile

* fix code style

* remove const_cast

* support adding correct npu op in pybind.h (#31143)

* support adding correct npu op in pybind.h

* refine code

* [NPU] Support executor with NPU (#31057)

* [NPU] Support executor with NPU

* Fix code according to reviews

* Fix code

* Add unittest for sub op npu

* refactor npu device manager (#31154)

refactor npu device manager (#31154)

* fix selected npus

* fix compile

* fix reading flags from env

* format
Co-authored-by: xiayanming <41795079@qq.com>
Co-authored-by: gongweibao <weibao.gong@gmail.com>
Co-authored-by: frankwhzhang <frankwhzhang@126.com>
Co-authored-by: liym27 <33742067+liym27@users.noreply.github.com>


04 Mar 2021 · 1 commit
- [ROCM] update fluid platform for rocm (part5), test=develop (#31315) · 4d647ec1
  Committed by Qi Li on Mar 04, 2021
24 Feb 2021 · 1 commit
- Add cublas_handle() to expose cublas_handle to ops (#31157) · ae2be49f
  Committed by liu zhengxi on Feb 24, 2021
```
* add get_cublas_handle() api

* update format

* add unittests

* alter function name
```
09 Feb 2021 · 1 commit
- update eigen version on Windows (#30573) · 9b3c80c8
  Committed by wuhuanzhou on Feb 09, 2021
```
* update eigen version on Windows, test=develop

* add /bigobj for cl, test=develop
```
08 Feb 2021 · 1 commit
- [ROCM] update fluid platform for rocm39 (part3), test=develop (#30913) · 93c1d9e7
  Committed by Qi Li on Feb 08, 2021
04 Feb 2021 · 1 commit
- use iwyu clean include second time, test=develop (#30829) · 35c5b23f
  Committed by wanghuancoder on Feb 04, 2021
```
* use iwyu clean include second time, test=develop
```
03 Feb 2021 · 1 commit
- 【kunlun】dygraph supports multi xpu card training (#30671) · b1026f64
  Committed by WangXi on Feb 03, 2021
01 Feb 2021 · 1 commit
- fix malloc L3 failed bug for kunlun (#30745) · c35a9880
  Committed by QingshuChen on Feb 01, 2021
```
* fix malloc L3 failed bug for kunlun

* minor
```
25 Jan 2021 · 1 commit
- [oneDNN] Cache oneDNN stream not to recreate in each oneDNN op (#30358) · 173660be
  Committed by Jacek Czaja on Jan 25, 2021
18 Jan 2021 · 2 commits
- [Kunlun]PR3: add xpu executor, multi xpu card train function optimization (#30317) · 843dc3cd
  Committed by liuyuhui on Jan 18, 2021
- optimize batch_norm & pool op for kunlun (#30490) · 8489d4f7
  Committed by QingshuChen on Jan 18, 2021
13 Jan 2021 · 1 commit
- optimize memcpy perf for kunlun (#30291) · 2c1bba02
  Committed by QingshuChen on Jan 13, 2021
```
* optimize memcpy perf for kunlun

* remove useless unittest for kunlun mean

* minor
```
11 Jan 2021 · 1 commit
- Add tf32 switch for cuDNN (#29192) · 924aac22
  Committed by AshburnLee on Jan 11, 2021
23 Dec 2020 · 1 commit
- [oneDNN] Unit test for checking oneDNN caching (#29606) · c9e874fc
  Committed by Jacek Czaja on Dec 23, 2020
16 Dec 2020 · 1 commit
- [Kunlun] PR1:Support one Kunlun card training in parallel executor (#29337) · f13c3a9c
  Committed by liuyuhui on Dec 16, 2020
15 Dec 2020 · 1 commit
- Add tf32 support for A100 tensor core acceleration for cuBLAS (#28732) · efea540c
  Committed by AshburnLee on Dec 15, 2020
14 Dec 2020 · 1 commit
- Added verbose oneDNN lib version (#29378) · 62d44836
  Committed by arlesniak on Dec 14, 2020
26 Nov 2020 · 1 commit
- Polish CUDA Information stdout (#29109) · 7ae3cb55
  Committed by Aurelius84 on Nov 26, 2020
25 Nov 2020 · 1 commit
- remove eigen threadpool for the speed up · b2c8a007
  Committed by wawltor on Nov 25, 2020
```
remove eigen threadpool for the speed up
```
29 Sep 2020 · 1 commit
- test=develop, optimize geo communicator (#26857) · cc780b19
  Committed by 123malin on Sep 29, 2020
```
* test=develop, optimize geo communicator 
```
17 Sep 2020 · 1 commit
- enhance reduce op which can reduce tensor with arbitrary rank · 63203c4a
  Committed by Jack Zhou on Sep 17, 2020
```
enhance reduce op which can reduce tensor with arbitrary rank 
```
21 Aug 2020 · 2 commits

Add mechanism for blocking oneDNN cache clearing (#26502) · f3909020
Committed by Adam on Aug 21, 2020
```
* Add mechanism for blocking oneDNN cache clearing

* Review changes and Add thread guards
```

support Baidu Kunlun AI Accelerator (#25959) · 138ecf24

Committed by QingshuChen on Aug 21, 2020

* support Baidu AI Accelerator
  * test=kunlun

* minor
  * test=kunlun

* support xpu op in separate file
  * test=kunlun

* update XPU error message and remove duplicated code
  * test=kunlun

* minor
  * test=kunlun

* minor
  * test=kunlun


15 Jul 2020 · 1 commit
- refine PADDLE_ENFORCE (#25456) · c10dcff1
  Committed by GaoWei8 on Jul 15, 2020
```
* Refine PADDLE_ENFORCE in paddle/fluid/platform
test=develop
```
07 Jul 2020 · 1 commit
- Refine PADDLE_ENFORCE (#25369) · ea7e5325
  Committed by GaoWei8 on Jul 07, 2020
```
* refine PADDLE_ENFORCE
test=develop
```
03 Jun 2020 · 1 commit

Replace all errors thrown by LOG(FATAL) with PADDLE_THROW (#24759) · d1062d52

Committed by Chen Weihang on Jun 03, 2020

* remove REPLACE_ENFORCE_GLOG compile option & add ci rule prohibit LOG(FATAL) using, test=develop

* remove ci test case, test=develop

* replace all LOG(FATAL) & polish message, test=develop

* fix typo, test=develop

* polish error info detail, test=develop

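One practical difference between the two is that glog's `LOG(FATAL)` aborts the whole process, whereas a thrown exception can be caught and reported by the caller. A minimal C++ sketch of a throw-based error macro in that spirit follows; `SKETCH_THROW` and `ParseDeviceId` are hypothetical names and do not reproduce `PADDLE_THROW`'s real definition or message format.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical throw-based error macro: appends the call site and throws,
// so callers can catch the error instead of the process aborting.
#define SKETCH_THROW(MSG)                                \
  throw std::runtime_error(std::string(MSG) + " [at " + \
                           __FILE__ + ":" +              \
                           std::to_string(__LINE__) + "]")

// Hypothetical function that reports invalid input through SKETCH_THROW.
int ParseDeviceId(const std::string& text) {
  if (text.empty() ||
      text.find_first_not_of("0123456789") != std::string::npos) {
    SKETCH_THROW("invalid device id: '" + text + "'");
  }
  return std::stoi(text);
}
```

With this sketch, `ParseDeviceId("gpu0")` throws a `std::runtime_error` that a caller can catch and surface, where a `LOG(FATAL)` at the same point would have terminated the program.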

14 May 2020 · 1 commit
- Hide globals & redesign restore PR (#24279) · db2b6b65
  Committed by pawelpiotrowicz on May 14, 2020
```
test=develop
```
11 May 2020 · 1 commit

Add macro BOOST_GET to enrich the error information of boost :: get (#24175) · aa0f254f

Committed by Chen Weihang on May 11, 2020

* add new macro BOOST_GET_SAFELY & unittests, test=develop

* add different macro type, test=develop

* fix get macro type in executor, test=develop

* four macro part change backup

* using one macro for all case, test=develop

* revert attribute change, test=develop

* change to three func to solve gcc4.8 bug, test=develop

* polish some details, test=develop

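The idea behind this macro, reporting which type was expected and where the access happened instead of surfacing a bare `boost::bad_get`, can be sketched as below. This is an illustration under stated assumptions, not Paddle's macro: it swaps in C++17 `std::variant` and `std::get_if` for `boost::variant` and `boost::get`, and the names `SafeGetImpl` and `SAFE_GET` are hypothetical.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <variant>

// Hypothetical helper: like std::get, but the exception message names the
// expected type, the accessed expression and the call site.
template <typename T, typename Variant>
T& SafeGetImpl(Variant& value, const char* type_name, const char* expr,
               const char* file, int line) {
  if (auto* ptr = std::get_if<T>(&value)) {
    return *ptr;
  }
  throw std::runtime_error(std::string("bad variant access: expected ") +
                           type_name + " in `" + expr + "` (held alternative " +
                           std::to_string(value.index()) + ") at " + file +
                           ":" + std::to_string(line));
}

// Stringizes the type and expression so they appear in the error message.
#define SAFE_GET(TYPE, VALUE) \
  SafeGetImpl<TYPE>(VALUE, #TYPE, #VALUE, __FILE__, __LINE__)
```

For example, with `std::variant<int, std::string> attr = 42;`, `SAFE_GET(int, attr)` returns a reference to the held `int`, while `SAFE_GET(std::string, attr)` throws an error whose message contains `std::string`, `attr`, and the file and line of the call.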

28 Apr 2020 · 1 commit
- added reshape transpose matmul fuse pass (#23754) · e1a7a880
  Committed by Sylwester Fraczek on Apr 28, 2020
24 Apr 2020 · 1 commit

Add cholesky_op (#23543) · a8c0fb4e

Committed by Guo Sheng on Apr 24, 2020

* Add cholesky_op forward part. test=develop

* Complete cholesky_op forward part. test=develop

* Add cholesky_op backward part. test=develop

* Complete cholesky_op backward part. test=develop

* Refine cholesky_op error check and docs. test=develop

* Add grad_check unit test for cholesky_op. test=develop

* Fix sample code in cholesky doc. test=develop

* Refine some error messages of cholesky_op. test=develop

* Refine some error messages of cholesky_op. test=develop

* Remove unused input in cholesky_grad. test=develop

* Remove unused input in cholesky_grad. test=develop

* Fix stream for cusolverDnSetStream. test=develop

* Update PADDLE_ENFORCE_CUDA_SUCCESS from cholesky_op to adapt to latest code.
test=develop

* Add CUSOLVER ERROR in enforce.h
test=develop

* Fix the missing return value in cholesky. test=develop


23 Apr 2020 · 1 commit
- declare the stream::Priority as enum class, test=develop (#24013) · 34d7d6ae
  Committed by 石晓伟 on Apr 23, 2020
18 Apr 2020 · 1 commit

Update eigen (#23203) · b89dd86f

Committed by Zhang Ting on Apr 18, 2020

* update eigen, test=develop

* remove patches, test=develop

* add definition of -fabi-version, test=develop

* add patch for TensorBlock.h, test=develop

* test windows, test=develop

* only update eigen for Linux, test=develop

* add code comments, test=develop


17 Apr 2020 · 1 commit

DeviceContext Split, test=develop (#23737) · 2d01cc85

Committed by 石晓伟 on Apr 17, 2020

* supports thread-binding stream, test=develop

* avoid using thread_local variables in dtor, test=develop

* modify the stream priority enum, test=develop

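The thread-binding stream idea in this commit (each calling thread is bound to its own stream, with a scoped priority enum) can be sketched roughly as follows. This is a hypothetical illustration: `FakeStream`, `ThreadStreamRegistry` and the `Priority` enumerators are assumptions, not Paddle's `stream::Priority` or `DeviceContext` API. It keeps per-thread streams in a mutex-guarded map keyed by `std::thread::id` rather than in `thread_local` storage, which is one way to honor the commit's note about avoiding `thread_local` variables in destructors.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>

namespace sketch {

// Hypothetical priority enum; an enum class keeps the enumerators scoped
// and blocks implicit conversion to int.
enum class Priority { kNormal, kHigh };

// Hypothetical stand-in for a device stream handle.
struct FakeStream {
  int id;
  Priority priority;
};

// Binds one stream to each calling thread. The map lives in the registry,
// so destroying the registry does not touch any thread_local state.
class ThreadStreamRegistry {
 public:
  FakeStream& StreamForCurrentThread(Priority priority) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = streams_.find(std::this_thread::get_id());
    if (it == streams_.end()) {
      // First call from this thread: create its stream.
      it = streams_.emplace(std::this_thread::get_id(),
                            FakeStream{next_id_++, priority}).first;
    }
    return it->second;  // std::map references stay valid across inserts
  }

  std::size_t size() {
    std::lock_guard<std::mutex> lock(mu_);
    return streams_.size();
  }

 private:
  std::mutex mu_;
  std::map<std::thread::id, FakeStream> streams_;
  int next_id_ = 0;
};

}  // namespace sketch
```

The first call from a thread creates its stream with the requested priority; later calls from the same thread return the same object regardless of the priority argument, so the binding stays stable for the thread's lifetime.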

01 Apr 2020 · 1 commit
- reverts the commit 23177, test=develop (#23363) · 5c59d213
  Committed by 石晓伟 on Apr 01, 2020
31 Mar 2020 · 1 commit

fix nccl comm double free bug (#23344) · 0471476a

Committed by Yi Liu on Mar 31, 2020

Because the NCCL communicator is not created by CUDADeviceContext, it should be destroyed by its creator, following RAII best practice.

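The ownership rule in this commit message can be illustrated with a minimal C++ sketch. It assumes hypothetical stand-ins (`FakeComm`, `CreateComm`, `DestroyComm`, `DeviceContextSketch`) in place of `ncclComm_t`, NCCL's init/destroy calls and `CUDADeviceContext`; it is not Paddle's actual code. The creator holds the communicator in a `std::unique_ptr` with a custom deleter, while the device context only borrows a raw pointer, so the handle is destroyed exactly once.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for an NCCL communicator handle (ncclComm_t).
struct FakeComm {
  int rank;
};

// Number of live communicators, used to detect leaks and double frees.
static int g_live_comms = 0;

// Hypothetical create/destroy pair, standing in for NCCL's init/destroy calls.
FakeComm* CreateComm(int rank) {
  ++g_live_comms;
  return new FakeComm{rank};
}

void DestroyComm(FakeComm* comm) {
  assert(g_live_comms > 0 && "communicator destroyed more than once");
  --g_live_comms;
  delete comm;
}

// The creator owns the communicator: it is destroyed exactly once,
// when the owning unique_ptr goes out of scope.
struct CommDeleter {
  void operator()(FakeComm* comm) const { DestroyComm(comm); }
};
using CommOwner = std::unique_ptr<FakeComm, CommDeleter>;

// The device context only borrows the handle. It did not create the
// communicator, so its (implicit) destructor must not destroy it.
class DeviceContextSketch {
 public:
  void set_comm(FakeComm* comm) { comm_ = comm; }
  FakeComm* comm() const { return comm_; }

 private:
  FakeComm* comm_ = nullptr;  // non-owning
};
```

If `DeviceContextSketch` also called `DestroyComm` in its destructor, the same handle would be released twice (once by the context, once by `CommOwner`), which is the class of bug the commit title describes.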

30 Mar 2020 · 1 commit
- supports thread-binding stream, test=develop (#23177) · 75ebb48a
  Committed by 石晓伟 on Mar 30, 2020
05 Feb 2020 · 1 commit

add WITH_NCCL option for cmake. (#22384) · 7bc4b095

Committed by Wilber on Feb 05, 2020

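Added a WITH_NCCL option to CMake to explicitly specify whether the NCCL-related code is compiled. WITH_NCCL is ON by default, but it is turned OFF when WITH_GPU is OFF.

Added the PADDLE_WITH_NCCL definition.

Single-machine, single-card setups can disable NCCL compilation; multi-card setups need NCCL, which is ON by default. If NCCL is disabled, only a single card can be used.
Co-authored-by: 石晓伟 <39303645+Shixiaowei02@users.noreply.github.com>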

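The single-card restriction described above can be sketched as C++ code that branches on the `PADDLE_WITH_NCCL` definition this commit adds. `MaxUsableCards` and `CheckDeviceCount` are hypothetical helpers, not Paddle functions, and the sketch assumes the build passes `-DPADDLE_WITH_NCCL` only when WITH_NCCL is ON.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: the number of cards a build can train on.
// Only the PADDLE_WITH_NCCL macro comes from the commit; the rest is illustrative.
int MaxUsableCards(int visible_devices) {
  assert(visible_devices >= 0);
#ifdef PADDLE_WITH_NCCL
  return visible_devices;  // NCCL compiled in: multi-card collectives are possible
#else
  return visible_devices > 0 ? 1 : 0;  // no NCCL: single card only
#endif
}

// Hypothetical check that rejects multi-card requests the build cannot serve.
void CheckDeviceCount(int requested, int visible_devices) {
  const int limit = MaxUsableCards(visible_devices);
  if (requested > limit) {
    throw std::runtime_error("requested " + std::to_string(requested) +
                             " cards, but this build supports at most " +
                             std::to_string(limit));
  }
}
```

Compiled without `-DPADDLE_WITH_NCCL`, `CheckDeviceCount(2, 4)` throws, while the same call succeeds in a build that defines the macro.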

08 Jan 2020 · 1 commit

Refine stack op to improve xlnet performance, test=develop (#22142) · 3d4f2aa6

Committed by zhaoyuchen2018 on Jan 08, 2020

The stack op's wait costs a lot of CPU time; using a CUDA kernel to do the memory copy
reduces CPU time.
Signed-off-by: zhaoyuchen <zhaoyuchen01@baidu.com>


10 Dec 2019 · 1 commit

MKL-DNN 1.0 Update (#20162) · e81f0228

Committed by Adam on Dec 10, 2019

* MKLDNN v1.0 rebase to Paddle 1.6
test=develop

* Add hacky paddle::string::to_string() implementation

* vectorize<int64-t>() -> vectorize() cleanup
test=develop

* PADDLE_ENFORCE and void_cast fixes
test=develop

* Rebase changes
test=develop

* Cosmetics
test=develop

* Delete MKL from mkldnn.cmake
test=develop

* CMake debug commands
test=develop

* Delete MKLDNN_VERBOSE and rebase fixes
test=develop

* Rebase fixes
test=develop

* Temporarily disable int8 resnet101 vgg16 and vgg19 tests
test=develop

* Add libmkldnn.so.1 to python setup
test=develop

* Add libmkldnn.so.1 to inference_lib cmake after rebase
test=develop

* Post rebase fixes + FC int8 changes
test=develop

* Fix LRN NHWC
test=develop

* Fix NHWC conv3d
test=develop

* Windows build fix + next conv3d fix
test=develop

* Fix conv2d on AVX2 machines
test=develop


06 Dec 2019 · 1 commit
- refine dev_ctx.Wait() exception throw, test=develop (#21600) · 97e76cb9
  Committed by Zeng Jinle on Dec 06, 2019

BaiXuePrincess / Paddle is in sync with the upstream project it was forked from.