Commits · 2a3d9eca64b0312a6bf49ffe6f470a084886bbe4 · BaiXuePrincess / Paddle

07 Mar, 2022 · 1 commit

cuBlasLt Epilogue To Fuse Linear + ReLU|GeLU (#39437) · 2a3d9eca

Committed by Ming-Xu Huang on Mar 07, 2022

* Added cuBlasLtHandle_t to device context.

* Added fused_gemm_epilogue op.

1. Added fused_gemm_epilogue op to leverage the cuBlasLt Epilogue.
2. Supports fusing Act(X*Y + bias), where X has at least 2 dimensions and Y must have exactly 2.
3. Act currently supports only ReLU (GeLU will be added in the future).
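
As a rough illustration of what the fused op computes, here is a minimal pure-Python reference for Act(X*Y + bias). It is only a sketch of the math, not the cuBlasLt kernel: the function name `fused_linear_act`, the `"none"` key for the identity activation, and the restriction to a 2-D X are assumptions made for this example.

```python
import math


def relu(v):
    return max(v, 0.0)


def gelu(v):
    # Exact (erf-based) GeLU; whether the real op uses this or an
    # approximation is not stated in the commit message.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


# Hypothetical activation table for this sketch.
ACTIVATIONS = {"none": lambda v: v, "relu": relu, "gelu": gelu}


def fused_linear_act(x, y, bias, activation="relu"):
    """Reference for Act(X*Y + bias) with X of shape (M, K) and Y of shape (K, N)."""
    act = ACTIVATIONS[activation]
    k, n = len(y), len(y[0])
    if any(len(row) != k for row in x):
        raise ValueError("inner dimensions of X and Y must match")
    if len(bias) != n:
        raise ValueError("bias must have one entry per output column")
    return [
        [act(sum(row[i] * y[i][j] for i in range(k)) + bias[j]) for j in range(n)]
        for row in x
    ]


print(fused_linear_act([[1.0, -2.0]], [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5]))
# [[1.5, 0.0]]
```

The point of the fusion is that the bias add and the activation run in the GEMM epilogue instead of as separate kernels; the sketch only pins down the expected numerical result.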

* Added UT to fused_gemm_epilogue op.

* Added LinearAct Pattern

1. Added LinearAct into graph_pattern_detector.* to define (2.)'s
pattern.
2. LinearAct is used to detect act(element_add(matmul_v2(x, w), bias)).
3. act currently supports only ReLU (GeLU will be supported in the future).
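
The act(element_add(matmul_v2(x, w), bias)) chain can be pictured with a toy matcher. This is a sketch over a hypothetical op list of `(type, inputs, output)` tuples, not the graph_pattern_detector API; it uses Paddle's op name `elementwise_add` for what the commit message abbreviates as element_add.

```python
def find_linear_act(ops, activations=("relu", "gelu")):
    """Return one match per act(elementwise_add(matmul_v2(x, w), bias)) chain.

    `ops` is a hypothetical flat list of (op_type, input_names, output_name).
    """
    # Map every tensor name to the op that produces it.
    producer = {out: (op_type, inputs) for op_type, inputs, out in ops}
    matches = []
    for op_type, inputs, out in ops:
        if op_type not in activations:
            continue
        add = producer.get(inputs[0])
        if add is None or add[0] != "elementwise_add":
            continue
        add_inputs = add[1]
        mm = producer.get(add_inputs[0])
        if mm is None or mm[0] != "matmul_v2":
            continue
        x, w = mm[1]
        matches.append(
            {"x": x, "w": w, "bias": add_inputs[1], "act": op_type, "out": out}
        )
    return matches


graph = [
    ("matmul_v2", ["x", "w"], "t0"),
    ("elementwise_add", ["t0", "b"], "t1"),
    ("relu", ["t1"], "y"),
]
print(find_linear_act(graph))
# [{'x': 'x', 'w': 'w', 'bias': 'b', 'act': 'relu', 'out': 'y'}]
```

A real pass would also need to check that the intermediate tensors have no other consumers before replacing the three ops with one fused op; the sketch leaves that out.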

* Added FuseGemmEpiloguePass

1. Added FuseGemmEpiloguePass to handle nn.Linear + Act{ReLU}
fusion (GeLU will be supported in the future).
2. Only supports matmul_v2 from nn.Linear.

* Added pybind to BuildStrategy.fuse_gemm_epilogue_.

* Added UT for fuse_gemm_epilogue_pass.

* GeLU support and EpilogueSingleton

1. Added GeLU support to fused_gemm_epilogue op.
2. Added EpilogueSingleton to cache auxiliary pointer.
3. Added related UTs.

* Rename cublaslt_epilogue_op to gemm_epilogue_op.*.

* Added both train and infer pattern to LinearAct.

1. Added support of fwd graph with grad_ops linking to LinearAct.
2. Added related changes to fuse_gemm_epilogue_pass for above
modification.

* Changed CUDA requirement from 11.4 to 11.6 for fuse_gemm_epilogue_pass.

* Added identity activation support to gemm_epilogue_op.

* Added Linear Fusion (matmul_v2 + ele_add)

1. Added matmul_v2 + ele_add pattern to LinearActPattern.
2. Added matmul_v2 + ele_add support to fuse_gemm_epilogue_pass.

* Rename gemm_epilogue_op.* to fused_gemm_epilogue_op.*

* Add fused_gemm_epilogue_grad op.

1. Added fused_gemm_epilogue_grad to support backward epilogue fusion.

* Add UTs to fused_gemm_epilogue_grad_op.

* Change attribute name in fused_gemm_epilogue_grad_op for clarity.

* Allow DX and DBias to be dispensable in fused_gemm_epilogue_grad op.

* Added ElementwiseAdd+Matmul+Act graph pattern detection.

* Fuse backward of Linear( Act(x))

1. Added backward fusion pass to Linear( Act(x)).
2. Added backward fusion pass to Linear(x).

* Added UTs to backward fusion of Linear(Act(x)).

* Complete document of arguments to fused_gemm_epilogue_op.

* Made arguments of some functions pass by reference.

* Modify code with review comments.

1. Made arguments of some function pass by reference.
2. Removed redundant code.
3. Followed Google code style to change code.

* Made 'const' code style be consistent

* Fixed random seed of python UTs.

* Set compile constraints for cuBlasLt

1. Require CUDA 11.6+
2. Remove fuse_gemm_epilogue related tests when CUDA < 11.6.

* Code Review from Paddle

1. Changed argument name is_first_gemm to without_x_gradient for
clarity.
2. Applied PADDLE_THROW in fused_gemm_epilogue_op.

* Remove EpilogueSingleton

1. Applied ReserveSpace to replace Epilogue for passing auxiliary
pointers between FWD and BWD.
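
To make the forward/backward hand-off concrete, the sketch below returns a "reserve space" from a Linear + ReLU forward pass and consumes it in the backward pass. It is a simplified pure-Python model under stated assumptions: the function names are hypothetical, the reserve space is modelled as a plain boolean mask rather than the auxiliary buffer cuBlasLt actually writes, and the `need_dx` / `need_dbias` flags only loosely mirror the dispensable DX and DBias outputs mentioned earlier in this commit.

```python
def linear_relu_forward(x, y, bias):
    """Return (out, reserve_space) for ReLU(X*Y + bias).

    reserve_space records where the pre-activation was positive, which is
    all the backward pass needs to know about the ReLU.
    """
    k, n = len(y), len(y[0])
    out, reserve_space = [], []
    for row in x:
        z = [sum(row[i] * y[i][j] for i in range(k)) + bias[j] for j in range(n)]
        reserve_space.append([v > 0.0 for v in z])
        out.append([v if v > 0.0 else 0.0 for v in z])
    return out, reserve_space


def linear_relu_backward(x, y, dout, reserve_space, need_dx=True, need_dbias=True):
    """Return (dx, dy, dbias); dx and dbias are None when not requested."""
    m, k, n = len(x), len(y), len(y[0])
    # ReLU backward: pass the gradient only where the forward input was positive.
    dz = [
        [dout[r][j] if reserve_space[r][j] else 0.0 for j in range(n)]
        for r in range(m)
    ]
    # dY = X^T * dZ
    dy = [[sum(x[r][i] * dz[r][j] for r in range(m)) for j in range(n)] for i in range(k)]
    dx = None
    if need_dx:
        # dX = dZ * Y^T
        dx = [[sum(dz[r][j] * y[i][j] for j in range(n)) for i in range(k)] for r in range(m)]
    dbias = None
    if need_dbias:
        # dBias = column sums of dZ
        dbias = [sum(dz[r][j] for r in range(m)) for j in range(n)]
    return dx, dy, dbias
```

Passing the mask explicitly from forward to backward keeps the state local to each op pair, which is the property a global singleton cache does not give.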

* Fix a logical error and enhance UTs.

1. Added act op count checking in UTs.
2. Fix issue with fusing the backward of ReLU(Linear(X)).
3. TODO: solve GELU fusion issues.

* Fix Linear and GeLU fusion issues.

1. Modified graph_pattern_detector to fit both Linear with GeLU and
Linear with ReLU.
2. Modified data range in UTs to allow negative values.

* Removed fused_gemm_epilogue_op.h.

* Rename namespace pten to phi.

* Rename name of arguments in fused_gemm_epilogue_op

1. bias -> Bias.
2. out -> Out.
3. reserve_space -> ReserveSpace.

* Change EpiloguePassActivationCache to a local variable.

1. Removed singleton in EpiloguePassActivationCache.
2. Made EpiloguePassActivationCache an argument to each pass
function.

2a3d9eca

03 Mar, 2022 · 1 commit

[CustomRuntime] migrate CustomRuntime into phi (#39908) · b4665d23

Committed by ronnywang on Mar 03, 2022

b4665d23

23 Feb, 2022 · 2 commits

[phi] move randperm to phi (#39816) · 30992ea0

Committed by Leo Chen on Feb 23, 2022
```
* move randperm to phi

* fix npu

* fix memory::Copy
```
30992ea0

Update record interface using part3 (#39695) · 1fcaab45

Committed by chenjian on Feb 23, 2022

* fix RecordEvent interface

* modify default level to 4

* update interface use

* add const default trace level

* update record event interface using

* update record event interface using

* update record event interface using

* update operator.cc

* update part2

* update part1

* update part3

* fix include profiler.h header in ps server

* fix include profiler.h header in ps server

* fix profiler.h header

* fix profiler.h header

* fix merge buf

* update

* fix bug

* fix bug

1fcaab45

22 Feb, 2022 · 1 commit

[CustomRuntime] fix CustomDeviceContext (#39766) · 60fc555e

Committed by ronnywang on Feb 22, 2022

60fc555e

20 Feb, 2022 · 1 commit

[PTen->Phi PR1] Change pten dirname and namespace to phi (#39748) · dcfe1986

Committed by Chen Weihang on Feb 20, 2022

* rename pten dir to phi

* rename namespace to phi

* rename infrt pten dir to phi

* resolve conflict

* rename pten to phi in cmake

* revert all infrt change

* change needed files

* fix infrt failed

* fix inference failed

dcfe1986

15 Feb, 2022 · 1 commit

[PluggableDevice] Add custom runtime support (#38740) · 3e7825f3

Committed by ronnywang on Feb 15, 2022

* [CustomRuntime] Add DeviceManager

* [CustomRuntime] Add DeviceInterface

* [CustomRuntime] Add Stream, Event, DeviceGuard, CallbackManager

* [CustomRuntime] Add plug-in device

* [CustomRuntime] Memory module support PluggableDevice

* [CustomRuntime] Add WITH_PLUGGABLE_DEVICE cmake option

* update

* [API] update API doc based on comments, test=develop
Co-authored-by: qili93 <qili93@qq.com>

3e7825f3

14 Feb, 2022 · 1 commit

context add generator (#39475) · 463e31f4

Committed by Wilber on Feb 14, 2022
```
* context add generator

* update
```
463e31f4

08 Feb, 2022 · 1 commit

[PTEN] Update gpu_context. (#39359) · 24103cbb

Committed by Wilber on Feb 08, 2022
```
* gpu_context..

* update

* update

* update
```
24103cbb

06 Feb, 2022 · 1 commit

[PTEN] Add Gpu context (#39305) · a821c4a9

Committed by Wilber on Feb 06, 2022

a821c4a9

27 Jan, 2022 · 1 commit

[PTen]Support AllocateFrom in Tensor and Alloc/HostAlloc in Context (#39022) · 5631da9c

Committed by Aurelius84 on Jan 27, 2022

* Support allocate_from in Tensor and allocate_data in Context

* fix #ifdef CUDA

* fix cycle depends

* fix test_xxx_dev_api failed

* fix windows compiling error

* fix unittest

* modify into PImpl

* fix selected rows

* add TODO comment

* refine interface according reviewer

5631da9c

26 Jan, 2022 · 1 commit

[IPU] sync misc changes 01 (#38876) · 4efbebea

Committed by Allen Guo on Jan 26, 2022

* sync misc changes

* apply comments 01

* fix compile error

* remove is_ipu_place check

* add authors
Co-authored-by: Xiaobing Wang <xiaobingw@graphcore.ai>
Co-authored-by: Allen Guo <alleng@graphcore.ai>
Co-authored-by: Zhixin Yao <zhixiny@graphcore.ai>
Co-authored-by: Haicheng Jiang <haichengj@graphcore.ai>
Co-authored-by: Han Zhao <hanzhao@graphcore.ai>

* sync changes

* restore cmake

* update ir cmake and setup.py

* update inference_lib cmake

* split PR
Co-authored-by: Xiaobing Wang <xiaobingw@graphcore.ai>
Co-authored-by: Zhixin Yao <zhixiny@graphcore.ai>
Co-authored-by: Haicheng Jiang <haichengj@graphcore.ai>
Co-authored-by: Han Zhao <hanzhao@graphcore.ai>

4efbebea

25 Jan, 2022 · 1 commit

[PTEN] Add xpu context. (#39098) · c1e5a393

Committed by Wilber on Jan 25, 2022

c1e5a393

24 Jan, 2022 · 1 commit

Removed redundant definitions of likely/unlikely (#38911) · 43919d0a

Committed by Jacek Czaja on Jan 24, 2022

* - more unlikely

* - compilation fix

* - removed redundant definition

* - fix

* - Fixes

* - compilation fix for windows

43919d0a

21 Jan, 2022 · 1 commit

[PTEN] Add cpu context (#38979) · 064bc4b8

Committed by Wilber on Jan 21, 2022

* add cpu_context.

* update

* update

* update

* update

* update

* fix ci problem

* fix npu ci problem

* update

* fix ci compile

064bc4b8

20 Jan, 2022 · 1 commit

fix device_context place print (#39062) · 3dd7f353

Committed by sneaxiy on Jan 20, 2022

3dd7f353

17 Jan, 2022 · 1 commit

[Pten] Replace platform::Place to pten::Place. (#38899) · c48a9ad5

Committed by Wilber on Jan 17, 2022

* add pten::Place data structure.

* update ci problem

* fix ci problem

* update

* using platform::Place=pten::Place

* remove BOOST_GET_CONST for CPUPlace and GPUPlace

* compile pass 25%.

* compile pass 45%

* compile pass 60%

* remove boost_get for xpu npu mlu and ipu

* compile pass on cpu and gpu.

* fix compile problem

* fix compile error.

* update

* fix ci problem

* update

* ci approve

* fix ci problem

* fix ci eager test problem

* remove BOOST_GET_CONST

* fix npu compile

c48a9ad5

04 Jan, 2022 · 1 commit

[XPU] update XPU device info, test=develop (#37884) · e1187e50

Committed by Qi Li on Jan 04, 2022

e1187e50

30 Dec, 2021 · 1 commit

Add cusparse and unittest (#38431) · 667dc9f0

Committed by zhangkaihuo on Dec 30, 2021

Bind the cuSparse handle to DeviceContext so that ops do not create and destroy it themselves.
Add wrappers for the cuSparse APIs that convert between dense and sparse formats.
Add unit tests for the wrapped APIs.
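
The idea of binding a library handle to the device context, rather than creating and destroying it inside every op, can be sketched as follows. Everything here is hypothetical and stands in for the real C++ classes: `FakeSparseHandle` replaces an actual cuSparse handle, and `DeviceContextSketch` is not Paddle's DeviceContext.

```python
class FakeSparseHandle:
    """Stand-in for an expensive library handle; counts how many were created."""

    created = 0

    def __init__(self):
        FakeSparseHandle.created += 1
        self.destroyed = False

    def destroy(self):
        self.destroyed = True


class DeviceContextSketch:
    """Owns at most one handle, created lazily and destroyed with the context."""

    def __init__(self):
        self._sparse_handle = None

    def sparse_handle(self):
        if self._sparse_handle is None:
            self._sparse_handle = FakeSparseHandle()
        return self._sparse_handle

    def close(self):
        if self._sparse_handle is not None:
            self._sparse_handle.destroy()
            self._sparse_handle = None


def run_op(ctx):
    # An op borrows the handle from the context instead of creating its own.
    return ctx.sparse_handle()
```

Calling `run_op` repeatedly with the same context reuses one handle, so the creation cost is paid once per context instead of once per op invocation.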

667dc9f0

23 Dec, 2021 · 2 commits

Make GetBlob assuming elements are cached (#38336) · 7da5368d

Committed by Jacek Czaja on Dec 23, 2021
```
* First set of fixes

* - Make GetBlob more likely to find blobs

* - Lint
```
7da5368d

Support external stream. (#38373) · 15ad7ee4

Committed by Wilber on Dec 23, 2021
```
* support external stream.

* update

* update

* update
```
15ad7ee4

20 Dec, 2021 · 1 commit

[MLU]add mlu backend (#38207) · 76514a1f

Committed by fwenguang on Dec 20, 2021

76514a1f

09 Dec, 2021 · 1 commit

add ipu device p2 (#37840) · cb636a48

Committed by jianghaicheng on Dec 09, 2021

cb636a48

03 Dec, 2021 · 1 commit

refine structure for cuda and rocm (#37202) · a6d2fddb

Committed by ronnywang on Dec 03, 2021
```
* refine structure for cuda and rocm

* update

* update

* update

* update
```
a6d2fddb

27 Nov, 2021 · 1 commit

[NPU] reorganization for device API abstraction (#37110) · 72241a6a

Committed by Aganlengzi on Nov 27, 2021

* [NPU] reorganization for device API abstraction

* [NPU] delete old files

* [NPU] fix npu_collective_helper

* [NPU] fix collective_helper

* [NPU] fix ut

* [NPU] mod memory allocation and hccl_helper

* [NPU] fix place_type

* [NPU] split enforce.h

* move acl* call into npu_info

* merge conflict

* fix merge

* merge conflict

* merge conflict

72241a6a

24 Nov, 2021 · 1 commit

Changed second batch of deprecated mkldnn header and function names to new oneDNN names (#37351) · 7db7a0ec

Committed by piotrekobiIntel on Nov 24, 2021
```
* Add second batch of deprecated mkldnn namespace and macro changes

* Unlock CI

* Fix temporary namespace alias placing
```
7db7a0ec

02 Nov, 2021 · 1 commit

support different precision in kunlun (#36836) · e512aa9a

Committed by QingshuChen on Nov 02, 2021
```
* support different precision in kunlun

* minor

* minor

* minor
```
e512aa9a

01 Nov, 2021 · 1 commit

[new-exec] refine vlog of interpretercore (#36865) · 4c93c4c3

Committed by Leo Chen on Nov 01, 2021
```
* refine vlog of interpretercore

* fix ut
```
4c93c4c3

14 Oct, 2021 · 1 commit

Revert "Implemented LRU based cache clearing (#36290)" (#36426) · 5d18967b

Committed by lidanqing on Oct 14, 2021
```
This reverts commit bf748f24.
```
5d18967b

13 Oct, 2021 · 1 commit

Implemented LRU based cache clearing (#36290) · bf748f24

Committed by Jacek Czaja on Oct 13, 2021
```
- Lint

- Merge with develop

- lint
```
bf748f24
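
For readers unfamiliar with the term, LRU (least recently used) cache clearing evicts the entry that has gone unused the longest once a size limit is exceeded. The sketch below shows the general technique with Python's `OrderedDict`; it is not the oneDNN cache code from this commit, and the `LRUCache` class and its capacity are assumptions for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU cache: the oldest-used entry is evicted when over capacity."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        # A hit makes the entry the most recently used one.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        # Evict from the least recently used end until within capacity.
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)

    def keys(self):
        return list(self._items)


cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now more recent than "b"
cache.put("c", 3)   # evicts "b"
print(cache.keys())
# ['a', 'c']
```
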

15 Sep, 2021 · 1 commit

Add paddle.cuda.device.stream_guard API (#35623) · 3218075d

Committed by Siming Dai on Sep 15, 2021
```
Add paddle.cuda.device.stream_guard API 
```
3218075d

03 Aug, 2021 · 1 commit

support Kunlun2 (#34459) · 2d0f3d9b

Committed by QingshuChen on Aug 03, 2021
```
* support Kunlun2

* support KL2

* support KL2
```
2d0f3d9b

15 Jul, 2021 · 1 commit

Upgrade Executor into ParallelExecutor to apply Graph Optimization in @to_static (#32283) · 2850391d

Committed by Aurelius84 on Jul 15, 2021
```
* Refine Constructor logic of ParallelExecutor

* Replace executor into ParallelExecutor in run_program_op
```
2850391d

09 Jun, 2021 · 1 commit

[oneDNN] First fix to #33021 (#33174) · 1382cd22

Committed by Jacek Czaja on Jun 09, 2021
```
* - First fix to #33021
```
1382cd22

12 May, 2021 · 1 commit

[NPU] Support npu pinned allocator and manage Tensor on NPUPinnedPlace (#32840) · 6b3bb796

Committed by liym27 on May 12, 2021

6b3bb796

28 Apr, 2021 · 1 commit

[oneDNN] Added clearing oneDNN cache per executor (#32499) · ba610761

Committed by Jacek Czaja on Apr 28, 2021
```
* - Added clearing oneDNN per executor

* - Executor does not always have FLAGS_use_mkldnn set to true
```
ba610761

19 Apr, 2021 · 1 commit

[NPU] cherry-pick gc/dataloader/save&load/optimization from ascendrc to develop (#32294) · cbe5c9f8

Committed by Leo Chen on Apr 19, 2021

* [NPU] support GarbageCollector for npu (#31874)

* support GarbageCollector for npu

* fix typo

* fix gather_grad

* disable NPUDefaultStreamGarbageCollector on NPU

* [NPU] support npu for memcpy op (#31808)

* support npu for memcpy op

* add ut

* fix ut

* fix typo

* [NPU] fix bug of using temp vector (#31963)

* fix bug when beta1_pow on cpu (#31995)

* [NPU] support npu profiler (#31684)

* support npu profiler

* add python api

* fix bugs

* add wrapper for incomplete type

* update profile proto

* record npu wait

* add xpu placeholder

* fix adam (#32016)

* [NPU] enable async copy and  add wait before sync operation (#31956)

* enable async copy and  add wait before sync operation

* remove unnecessary wait

* add FillNpuTensorWithConstant

* refine

* fix fill_constant

* make TensorFromVector/TensorToVector sync

* [NPU] Support dataloader on npu place. (#31867)

* [NPU] Wait on NPUPlace (#32086)

* [NPU] fix cast op (#32121)

* fix npu kernel of cast op to handle casting to same dtype

* add comments

* [NPU] support cann 20.3 (#32044)

* fix compile problem on cann 20.3

* fix ut

* fix test_mul

* fix check_finite_and_scale

* fix lookup_table_v2_grad

* fix cmake

* support print op

* [NPU] Support npu save load (#31893)

* support save load for NPU

* add save load npu unittest

* support np.array transform in NPU

* fix errors

* delete dygraph in unittest

* add Wait

* fix unittest

* fix review comment

* fix unittest problem

* fix little problem

* change aclrtSynchronizeDevice to aclrtSynchronizeStream for better performance (#32196)

* change aclrtSynchronizeDevice to aclrtSynchronizeStream for better performance

* refine code

* fix NPUDeviceContext in all c++ unittest (#32198)

* fix NPUDeviceContext in all c++ unittest

* refine log
Co-authored-by: pangyoki <pangyoki@126.com>

* [NPU] Remove TensorFromVector and avoid sync copy in npu op kernel for better performance (#31994)

* enable async copy and  add wait before sync operation

* remove unnecessary wait

* add FillNpuTensorWithConstant

* refine

* fix fill_constant

* change TensorFromVector to FillNpuTensorWithConstant

* fix ignored api

* delete extra unittest

* fix little error

* fix update_loss_scaling_op_npu and check_finite_and_unscale_op_npu

* change TensorCopySync to TensorCopy

* delete useless Wait and add StreamWait

* fix npu_stream error

* fix check_finite_and_unscale_op_npu TensorCopy

* only save stream wait

* fix NPUDeviceContext in all c++ unittest

* delete wait
Co-authored-by: zhiqiu <chenqiuliang@baidu.com>

* delete useless unittest file (#32206)

* Fix op test (#32231)

* fix conditional block (#32243)

* fix adam bug again (#32246)

* fix compile

* fix ut

* fix ut
Co-authored-by: liym27 <33742067+liym27@users.noreply.github.com>
Co-authored-by: pangyoki <pangyoki@126.com>

cbe5c9f8

09 Apr, 2021 · 1 commit

[NPU] cherry-pick basic NPU components/allocator/operator/executor supports from ascendrc (#32144) · ccf5709d

Committed by Leo Chen on Apr 09, 2021

* [feature] support npu allocator (#30840)

[feature] support npu allocator

* [feature] support npu operator (#30951)

[feature] support npu operator

* [feature] support npu allocator, part 2 (#30972)

* support npu allocator

* add npu device context

* fix some compile problem

* fix some compile problem

* add npu info

* compile ok

* fix include dir

* support naive_best_fit_allocator

* run ut ok, but failed to exit

* call aclrtResetDevice before exit

* fix aclFinalize

* add system allocator test

* add selected_gpus in gtest

* add tensor_test for npu

* support npu op, initial commit

* add npu stream

* add elementwise_add_op

* compile ok

* fix typo

* fix elementwise_add_op_npu_test

* support op run

* test can run but failed

* change aclopExecuteV2 to aclopCompileAndExecute

* support parsing ascend rank table file (#31000)

support parsing ascend rank table file

* Fix reshape on GE graph. (#31084)

Fix reshape on GE graph

* add npu kernel for elementwise_sub and elementwise_sub_grad (#30973)

* add npu sub op

* fix typo

* rename test

* fix bug

* fix bug

* add fp16 kernel

* fix typo

* support sub grad op

* support elementwise_sub_grad op
Co-authored-by: frankwhzhang <frankwhzhang@126.com>

* Fix compilation problem (#31100)

Fix compilation problem (#31100)

* fix compile

* fix code style

* remove const_cast

* support adding correct npu op in pybind.h (#31143)

* support adding correct npu op in pybind.h

* refine code

* [NPU] Support executor with NPU (#31057)

* [NPU] Support executor with NPU

* Fix code according to reviews

* Fix code

* Add unittest for sub op npu

* refactor npu device manager (#31154)

refactor npu device manager (#31154)

* fix selected npus

* fix compile

* fix reading flags from env

* format
Co-authored-by: xiayanming <41795079@qq.com>
Co-authored-by: gongweibao <weibao.gong@gmail.com>
Co-authored-by: frankwhzhang <frankwhzhang@126.com>
Co-authored-by: liym27 <33742067+liym27@users.noreply.github.com>

ccf5709d

04 Mar, 2021 · 1 commit

[ROCM] update fluid platform for rocm (part5), test=develop (#31315) · 4d647ec1

Committed by Qi Li on Mar 04, 2021

4d647ec1

24 Feb, 2021 · 1 commit

Add cublas_handle() to expose cublas_handle to ops (#31157) · ae2be49f

Committed by liu zhengxi on Feb 24, 2021
```
* add get_cublas_handle() api

* update format

* add unittests

* alter function name
```
ae2be49f
