Commits · 8676302364924bb190dcf171da7cf30d290aa2a6 · PaddlePaddle / Paddle

01 Aug, 2022 1 commit

Committed by Leo Chen on Aug 01, 2022

* remove cudaDeviceContext

* remove more template

* fix rocm compile

* remove alias name CUDADeviceContext

* fix compile

* fix tests

* revert changes

86763023

28 Jul, 2022 1 commit
- Complete the dtypes for all_gather, add all_gather_object api (#44417) · d4cf02bc
  Committed by LiYuRio on Jul 28, 2022
  
  d4cf02bc
27 Jul, 2022 1 commit
- [DCU] Fix NAN problem when training BERT on DCU platform (#44643) · 28aa0c61
  Committed by Yuang Liu on Jul 27, 2022
  
  28aa0c61
19 Jul, 2022 1 commit

compile phi/backends into one static library (#44373) · 1047cb17

Committed by Leo Chen on Jul 19, 2022

* compile into one static library

* fix xpu compile

* fix xpu compile

* fix inference compile

* fix inference compile

* add custom test

* revert one file

1047cb17

14 Jul, 2022 2 commits

refine allocation cmake (#44241) · dc5a0420

Committed by Leo Chen on Jul 14, 2022

* build into one static library

* move memory/detail to memory/allocation

* fix bug

* fix profiler

* fix framework_proto

* fix deps

* fix inference compilation

* fix rocm compile

* follow comments

* fix buddy_allocator_test

dc5a0420

Compilation optimization (#44242) · 4baf0dbe
Committed by wanghuancoder on Jul 14, 2022
```
* Compilation optimization
```
4baf0dbe

26 Jun, 2022 1 commit
- format all files in fluid using new config (#43776) · 576236a0
  Committed by Sing_chan on Jun 26, 2022
  
  576236a0
24 Jun, 2022 1 commit

record memory and op supplement info (#43550) · 8dd0a3b9

Committed by chenjian on Jun 24, 2022

* record memory and op supplement info

* update

* update

* fix a bug

* fix memory recording

* fix a bug

* update

* update

* fix a bug

* update

* fix a bug

* fix a bug

* fix a bug

* Revert "fix a bug"

This reverts commit c1d4df52762ba9ae7c7e27cd2ba4fc3a7ed9c7a5.

* fix a bug

* fix format

* fix

8dd0a3b9

15 Jun, 2022 1 commit
- add some kernels(csr*dense->csr, dense*dense->csr) of SparseTensor matmul (#42935) · 346efe96
  Committed by zhouweiwei2014 on Jun 15, 2022
```
* add some kernel(csr*dense->csr, dense*dense->csr) of SparseTensor matmul

* fix CI

* fix CI

* fix comment

* fix comment
```
  346efe96
05 Jun, 2022 1 commit
- 【code format check upgrade】 step2：clang-format (#42840) · a3730dc8
  Committed by Sing_chan on Jun 05, 2022
  
  a3730dc8
04 Jun, 2022 1 commit
- 【code format check upgrade】 step2：cmake-format (#43057) · 92568edb
  Committed by Sing_chan on Jun 04, 2022
  
  92568edb
02 Jun, 2022 2 commits
- Fix bug of CUDAGraph kernel parameter comparison (#43163) · 3fcfcd51
  Committed by sneaxiy on Jun 02, 2022
```
* fix cuda graph sizeof

* fix tuple type
```
  3fcfcd51
- Support CUDA Graph for partial graph in dygraph mode (#42786) · d05b940a
  Committed by sneaxiy on Jun 02, 2022
```
* support CUDAGraph for partial graph

* add ut

* fix ci

* fix ut again because of eager mode

* fix kunlun ci

* fix win ci
```
  d05b940a
01 Jun, 2022 1 commit
- support nccl api for bfloat16, required >= cudnn 10.1, nccl >= 2.10.3 (#43147) · 67b9b51b
  Committed by Guoxia Wang on Jun 01, 2022
  
  67b9b51b
30 May, 2022 1 commit
- Implement fused_gate_attention operator for AlphaFold. (#42018) · fdcdbec5
  Committed by crystal on May 30, 2022
  
  fdcdbec5
27 May, 2022 1 commit
- Support memory stats for CPU (#42945) · 21f11d35
  Committed by Ruibiao Chen on May 27, 2022
```
* Support memory stats for CPU

* Add UTs

* Fix typos

* Fix typos
```
  21f11d35
10 May, 2022 2 commits
- [Eager] print gpu mem info (#42616) · 81644145
  Committed by wanghuancoder on May 10, 2022
```
* print mem

* refine

* refine

* refine

* refine
```
  81644145
- fix bug for heter (#42590) · 21b35167
  Committed by lilong12 on May 10, 2022
  
  21b35167
05 May, 2022 1 commit

Print memory peak message for UT (#42092) · 28375ca4

Committed by Ruibiao Chen on May 05, 2022

* Add peak memory log for CI

* Change VLOG to std::cout

* Move print code to test_runner.py and paddle_gtest_main.cc

* Fix typo

* Fix conflicts

* Update message format

* Fix CI errors

* Add FLAGS_enable_gpu_memory_usage_log

* Fix CI errors

28375ca4

12 Apr, 2022 1 commit
- fix_paddle_numel_check (#41607) · 51cae7f7
  Committed by JingZhuangzhuang on Apr 12, 2022
```
* fix_paddle_numel_check

* fix_paddle_numel_check
```
  51cae7f7
09 Apr, 2022 1 commit

Autotune the workspace_size_limit in conv. (#40338) · b937cdc5

Committed by limingshu on Apr 09, 2022

* Using the maximum workspace_size of all algorithms to limit the workspace size in exhaustive search mode.

* Use the system cudaMalloc and cudaFree to allocate workspace during searching.

* Enable switch of two kinds of workspace setting methods.
Co-authored-by: Liu Yiqun <liuyiqun01@baidu.com>

b937cdc5

03 Apr, 2022 1 commit

add maximum limit for grid of index_select (#41127) · af8d2482

Committed by FlyingQianMM on Apr 03, 2022

* limit grid dim for index select

* mv LimitGridDim into gpu_launch_config.h

* fix conflicts

* fix conflicts

* fix code style

* set block to 256

* fix grid setting

* set dtype of block_dim to unsigned int

af8d2482
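
The grid-size limit described in the commit above can be illustrated with a small sketch. This is not Paddle's `LimitGridDim` code: the helper names `limited_launch_config` and `simulate_grid_stride`, the default `max_grid` of 65535 (CUDA's documented limit for `gridDim.y` and `gridDim.z`; the x-dimension limit is larger), and the simulated grid-stride loop are all assumptions made for illustration. Only the block size of 256 comes from the commit message.

```python
def limited_launch_config(numel, block=256, max_grid=65535):
    """Return (grid, block) with the grid clamped to max_grid.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    if numel <= 0:
        raise ValueError("numel must be positive")
    grid = (numel + block - 1) // block  # ceil(numel / block)
    return min(grid, max_grid), block


def simulate_grid_stride(numel, grid, block):
    """Count how often each element is visited by a simulated grid-stride loop.

    With a clamped grid, each simulated thread handles several elements,
    stepping by the total number of launched threads.
    """
    visits = [0] * numel
    stride = grid * block
    for tid in range(stride):  # every simulated thread in the launch
        for i in range(tid, numel, stride):
            visits[i] += 1
    return visits
```

Clamping the grid only stays correct when the kernel loops over its elements, which is why the sketch pairs the clamp with a stride loop: with `numel=1000`, `block=256` and `max_grid=2`, the two launched blocks still visit every element exactly once.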

01 Apr, 2022 1 commit

[Eager] Support pinned (#41035) · f3270fc8

Committed by wanghuancoder on Apr 01, 2022

* support pinned, test=develop

* support async_write, test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

* refine,test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

f3270fc8

30 Mar, 2022 1 commit

Add new APIs for GPU memory monitoring (max_memory_allocated,... · afe02e9d

Committed by From00 on Mar 30, 2022

Add new APIs for GPU memory monitoring (max_memory_allocated, max_memory_reserved, memory_allocated, memory_reserved) (#38657)

* Add new API memory_reserved

* Add memory_allocated, max_memory_reserved and max_memory_allocated

* Fix CI error

* Fix CI error

* Enhance UT

* Add FLAGS_memory_stats_opt

* Add STATS macro functions

* Add StatAllocator

* Fix CI errors

* Add UT

* Fix CI errors

afe02e9d
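
The pairing of `memory_allocated` with `max_memory_allocated` in the commit above suggests a counter that tracks both a current value and a peak. A minimal sketch of that idea follows; the class name `PeakStat` and its `update` method are hypothetical and do not reproduce Paddle's STATS macros or `StatAllocator`.

```python
class PeakStat:
    """Track a running byte count and the highest value it has reached.

    Hypothetical illustration only; not Paddle's implementation.
    """

    def __init__(self):
        self.current = 0  # bytes currently in use
        self.peak = 0     # highest value of `current` seen so far

    def update(self, delta):
        """Apply an allocation (delta > 0) or a free (delta < 0)."""
        new_value = self.current + delta
        if new_value < 0:
            raise ValueError("freed more bytes than were allocated")
        self.current = new_value
        self.peak = max(self.peak, new_value)
        return new_value
```

After allocating 100 and 50 bytes and then freeing 120, `current` is 30 while `peak` stays at 150, which is the distinction the two kinds of API expose.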

25 Mar, 2022 1 commit
- add maximum limit for grid of reduce, elementwise, gather and scatter (#40813) · 608a5f55
  Committed by FlyingQianMM on Mar 25, 2022
```
* add maximum limit for grid of reduce, elementwise and gather

* add {} after if
```
  608a5f55
07 Mar, 2022 1 commit

cuBlasLt Epilogue To Fuse Linear + ReLU|GeLU (#39437) · 2a3d9eca

Committed by Ming-Xu Huang on Mar 07, 2022

* Added cuBlasLtHandle_t to device context.

* Added fused_gemm_epilogue op.

1. Added fused_gemm_epilogue op to leverage cuBlasLt Epilogue.
2. Support fusion Act(X*Y + bias), X's dims >= 2 and Y's dims should be 2.
3. Act currently only supports ReLU. (Will add GeLU in the future).

* Added UT to fused_gemm_epilogue op.

* Added LinearAct Pattern

1. Added LinearAct into graph_pattern_detector.* to define (2.)'s
pattern.
2. LinearAct is used to detect act(element_add(matmul_v2(x, w), bias)).
3. act currently only supports ReLU (Will support GeLU in the future).

* Added FuseGemmEpiloguePass

1. Added FuseGemmEpiloguePass to handle nn.Linear + Act{ReLU}
fusion (GeLU will be supported in the future).
2. Only support matmul_v2 from nn.Linear.

* Added pybind to BuildStrategy.fuse_gemm_epilogue_.

* Added UT for fuse_gemm_epilogue_pass.

* GeLU support and EpilogueSingleton

1. Added GeLU support to fused_gemm_epilogue op.
2. Added EpilogueSingleton to cache auxiliary pointer.
3. Added related UTs.

* Rename cublaslt_epilogue_op to gemm_epilogue_op.*.

* Added both train and infer pattern to LinearAct.

1. Added support of fwd graph with grap_ops linking to LinearAct.
2. Added related changes to fuse_gemm_epilogue_pass for above
modification.

* Changed CUDA requirement from 11.4 to 11.6 for fuse_gemm_epilogue_pass.

* Added identity activation support to gemm_epilogue_op.

* Added Linear Fusion (matmul_v2 + ele_add)

1. Added matmul_v2 + ele_add pattern to LinearActPattern.
2. Added matmul_v2 + ele_add support to fuse_gemm_epilogue_pass.

* Rename gemm_epilogue_op.* to fused_gemm_epilogue_op.*

* Add fused_gemm_epilogue_grad op.

1. Added fused_gemm_epilogue_grad to support backward epilogue fusion.

* Add UTs to fused_gemm_epilogue_grad_op.

* Change attribute name in fused_gemm_epilogue_grad_op for clarity.

* Allow DX and DBias to be dispensable in fused_gemm_epilogue_grad op.

* Added ElementwiseAdd+Matmul+Act graph pattern detection.

* Fuse backward of Linear( Act(x))

1. Added backward fusion pass to Linear( Act(x)).
2. Added backward fusion pass to Linear(x).

* Added UTs to backward fusion of Linear(Act(x)).

* Complete document of arguments to fused_gemm_epilogue_op.

* Made arguments of some functions pass by reference.

* Modify code with review comments.

1. Made arguments of some function pass by reference.
2. Removed redundant code.
3. Followed Google code style to change code.

* Made 'const' code style be consistent

* Fixed random seed of python UTs.

* Set compile constraints for cuBlasLt

1. Require CUDA 11.6+
2. Remove fuse_gemm_epilogue related tests when CUDA < 11.6.

* Code Review from Paddle

1. Changed argument name is_first_gemm to without_x_gradient for
clarity.
2. Applied PADDLE_THROW in fused_gemm_epilogue_op.

* Remove EpilogueSingleton

1. Applied ReserveSpace to replace Epilogue for passing auxiliary
pointers between FWD and BWD.

* Fix a logical error and enhance UTs.

1. Added act op count checking in UTs.
2. Fix issue to fuse backward of ReLU(Linear(X)).
3. TODO: solve GELU fusion issues.

* Fix Linear and GeLU fusion issues.

1. Modified graph_detech_pattern to fit both linear with gelu and linear with
relu.
2. Modified data range in UTs to allow negative values.

* Removed fused_gemm_epilogue_op.h.

* Rename namespace pten to phi.

* Rename name of arguments in fused_gemm_epilogue_op

1. bias -> Bias.
2. out -> Out.
3. reserve_space -> ReserveSpace.

* Change EpiloguePassActivationCache as local variable.

1. Removed singleton in EpiloguePassActivationCache.
2. Made EpiloguePassActivationCache as an argument to each pass
functions.

2a3d9eca
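
As a reading aid for the commit above, the computation that the fused op covers, Act(X*Y + bias), can be written out as a plain reference. This is a pure-Python sketch for 2-D inputs only, not the cuBLASLt kernel: the function name `linear_act_reference`, the `ACTIVATIONS` table, and the list-of-lists layout are assumptions, and the GELU here uses the exact erf form, which may differ from whatever approximation the kernel applies.

```python
import math


def _relu(v):
    return max(v, 0.0)


def _gelu(v):
    # Exact (erf-based) GELU; an assumption, not necessarily the kernel's form.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


# "none" stands in for the identity activation mentioned in the commit.
ACTIVATIONS = {"none": lambda v: v, "relu": _relu, "gelu": _gelu}


def linear_act_reference(x, y, bias, activation="none"):
    """Compute Act(X*Y + bias) for X of shape (m, k), Y of shape (k, n)."""
    act = ACTIVATIONS[activation]
    m, k, n = len(x), len(y), len(y[0])
    if any(len(row) != k for row in x) or len(bias) != n:
        raise ValueError("shape mismatch between x, y and bias")
    out = []
    for i in range(m):
        row = []
        for j in range(n):
            # Matmul accumulation, then the bias add and the activation,
            # which is the part an epilogue folds into the GEMM.
            acc = sum(x[i][p] * y[p][j] for p in range(k)) + bias[j]
            row.append(act(acc))
        out.append(row)
    return out
```

A reference like this is mainly useful for checking a fused result against the unfused sequence of matmul, add, and activation on small inputs.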

02 Mar, 2022 1 commit
- [bf16] add bf16 kernel: softmax & log_softmax (#39999) · 4a4215ff
  Committed by zhangbo9674 on Mar 02, 2022
```
* add softmax log_softmax

* refine rocm

* refine unittest
```
  4a4215ff
01 Mar, 2022 1 commit

[bf16] add bf16 kernel: scale gather sum (#39683) · 6d26b332

Committed by zhangbo9674 on Mar 01, 2022

* add scale gather sum

* refine CUDA_ATOMIC_WRAPPER ADD for bf16

* add gather unittest

* solve conflict

* add scale unittest

* add sum unittest

* solve conflict

* refine gather unittest

* refine unittest

6d26b332

25 Feb, 2022 1 commit
- [Fix bug] fix fp16 atomicAdd compiler error on different cuda_arch. (#39886) · ef96ffb6
  Committed by Li Min on Feb 25, 2022
```
* Fix compile error on cuda_arch less than 700.
```
  ef96ffb6
24 Feb, 2022 2 commits
- [PTen->Phi PR3] Rename pten make target to phi (#39832) · f77019a0
  Committed by Chen Weihang on Feb 24, 2022
```
* rename pten to phi

* fix infrt compile failed

* resolve conflict
```
  f77019a0
- optimize performance of lookup_table_v2_op (#39856) · d6038c22
  Committed by Li Min on Feb 24, 2022
```
* optimize block config  and fp16 atomicAdd perf for lookup_table_v2_grad.
```
  d6038c22
23 Feb, 2022 2 commits
- Add ProcessGroupNCCL for distributed training (#39737) · 0b205817
  Committed by ShenLiang on Feb 23, 2022
```
* add processgroup_nccl
```
  0b205817
- [bf16] add bf16 kernel: elementwise_div (#39602) · ca4df333
  Committed by zhangbo9674 on Feb 23, 2022
```
* add elementwise_div

* refine rocm

* refine code

* refine op register

* solve conflict

* refine unittest

* refine unittest precision

* add rocm
```
  ca4df333
20 Feb, 2022 1 commit

[PTen->Phi PR1] Change pten dirname and namespace to phi (#39748) · dcfe1986

Committed by Chen Weihang on Feb 20, 2022

* rename pten dir to phi

* rename namespace to phi

* rename infrt pten dir to phi

* resolve conflict

* rename pten to phi in cmake

* revert all infrt change

* change needed files

* fix infrt failed

* fix inference failed

dcfe1986

19 Feb, 2022 1 commit

[Pten]Unify paddle/pten::framework::ddim into pten::ddim (#39614) · 2fe04264

Committed by Aurelius84 on Feb 19, 2022

* Unify paddle/pten::framework::ddim into pten::ddim

* fix paddle namespace

* compile successfully

* fix npu src file

* fix conflict

* fix conflict

* fix tensorrt compiler error

* fix conflict

* fix conflict

* fix test file conflict

* fix conflict

* fix mlu file conflict

* fix mlu file conflict

* fix cinn header file conflict

* fix conflict

* fix conflict

* fix conflict

* fix conflict

2fe04264

18 Feb, 2022 1 commit
- fix compile error in jetson (#39669) · c8c24460
  Committed by Wilber on Feb 18, 2022
  
  c8c24460
16 Feb, 2022 1 commit

[bf16] pten matmul cuda kernel support bf16 (#39485) · d5a0d31a

Committed by Leo Chen on Feb 16, 2022

* pten matmul cuda kernel support bf16

* fix pten kernel name

* add matmul_grad bf16 kernel

* add emptylike bf16 kernel

* fix compile

* support rocm

* fix error

* fix rocm

* add bf16 header file

* fix compile

d5a0d31a

15 Feb, 2022 1 commit

[PTen]Migrate proto::VarType outside of Pten (#39411) · 7e7e9404

Committed by Aurelius84 on Feb 15, 2022

* #1 migrate dist-related type()-> dtype()

* move datatype function from pten -> fluid/framework

* change type() in imperative into convert(dtype())

* modify xx_tensor->type into xx_tensor->dtype

* change the set_type interface and the caller

* modify xx_tensor.type into xx_tensor.dtype

* fix mutable_data(place, dtype())

* change caller of mutable_data in pten and distributed

* change the caller of mutable_data in fluid/framework

* change the caller of mutable_data in imperative directory

* mutable_data: inference

* update the call of mutable_data

* transfer MakePenScalarArray MakePtenScalar ResetHolderWithType

* pass the compile. the next step is remove VarType in Pten

* fix all and remove VarType from pten. success in linux. Next task is other platform

* fix conflict with develop

* fix compiled error

* Fix reset conversion

* fix conflict

* fix compiled problem

* fix typo

* Fix << in tensor_utils.cc

* fix type->dtype

* fix unittest

* fix tensor init constructor

* fix DataTypeSize for BFloat16

* fix code style

* fix npu compiled error

* fix npu

* compile npu successfully

* fix conflict

* fix conflict
Co-authored-by: xiongkun <xiongkun03@baidu.com>

7e7e9404

08 Feb, 2022 1 commit

Support allocate CUDA managed memory (#39075) · 42910361

Committed by From00 on Feb 08, 2022

* Rough implementation for experiment

* Support allocate cuda managed memory

* Fix CI error

* Modify UT

* Check whether support memory oversubscription

* Fix ROCM Compile error

* Fix ROCM Compile error

* Fix UT cuda_managed_memory_test

* Set UT timeout to 40

* Add UT OOMExceptionTest

* Set UT timeout to 50

42910361

06 Feb, 2022 1 commit
- [PTEN] Add Gpu context (#39305) · a821c4a9
  Committed by Wilber on Feb 06, 2022
  
  a821c4a9

PaddlePaddle / Paddle: synced successfully more than 1 year ago
