Commits · e0866dc630dc8dc81567d0644c0688976132eb2c · Crayon鑫 / Paddle

09 Mar, 2022 · 1 commit
- [hybrid] fused_feedforward op support tensor model parallel (#40160) · e0866dc6
  Committed by WangXi on Mar 09, 2022
  
  e0866dc6
07 Mar, 2022 · 1 commit

cuBlasLt Epilogue To Fuse Linear + ReLU|GeLU (#39437) · 2a3d9eca

Committed by Ming-Xu Huang on Mar 07, 2022

* Added cuBlasLtHandle_t to device context.

* Added fused_gemm_epilogue op.

1. Added fused_gemm_epilogue op to leverage cuBlasLt Epilogue.
2. Support fusion of Act(X*Y + bias); X's dims must be >= 2 and Y's dims should be 2.
3. Act currently supports only ReLU (GeLU will be added in the future).
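
As an illustration of what the fused op computes, here is a minimal pure-Python reference for Act(X*Y + bias). It is a sketch, not the cuBlasLt kernel: the function name `fused_gemm_epilogue_reference` and the `activation` strings are assumptions made for this example, and X is taken to be already flattened to 2-D.

```python
def matmul(x, y):
    """Multiply a 2-D list x (m x k) by a 2-D list y (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


# Hypothetical activation table for this sketch; "none" is the identity.
ACTIVATIONS = {
    "none": lambda v: v,
    "relu": lambda v: max(v, 0.0),
}


def fused_gemm_epilogue_reference(x, y, bias, activation="none"):
    """Compute Act(X*Y + bias) in one pass over the GEMM result.

    x: m x k, y: k x n, bias: length n (broadcast over rows).
    """
    act = ACTIVATIONS[activation]
    return [[act(v + b) for v, b in zip(row, bias)] for row in matmul(x, y)]
```

The unfused graph would run three ops (matmul, add, activation) and materialize two intermediate tensors; the fused form applies bias and activation while the GEMM output is produced.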

* Added UT to fused_gemm_epilogue op.

* Added LinearAct Pattern

1. Added LinearAct into graph_pattern_detector.* to define (2.)'s
pattern.
2. LinearAct is used to detect act(element_add(matmul_v2(x, w), bias)).
3. act currently supports only ReLU (GeLU will be supported in the future).
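
The pattern described above can be sketched as a toy matcher over a flat op list. This is not the graph_pattern_detector API: the dict-based op representation, the function name `find_linear_act`, and the op-type string `elementwise_add` are assumptions for illustration. A real pass would also verify that the intermediate tensors have no other consumers.

```python
def find_linear_act(ops, act_types=("relu",)):
    """Return (matmul_idx, add_idx, act_idx) for each
    act(elementwise_add(matmul_v2(x, w), bias)) chain found in ops.

    Each op is a dict: {"type": str, "inputs": [names], "out": name}.
    """
    # Map every tensor name to the index of the op that produces it.
    producer = {op["out"]: i for i, op in enumerate(ops)}
    matches = []
    for k, act in enumerate(ops):
        if act["type"] not in act_types:
            continue
        j = producer.get(act["inputs"][0])
        if j is None or ops[j]["type"] != "elementwise_add":
            continue
        i = producer.get(ops[j]["inputs"][0])
        if i is None or ops[i]["type"] != "matmul_v2":
            continue
        matches.append((i, j, k))
    return matches
```

Passing `act_types=("relu", "gelu")` would extend the same matcher to a second activation without touching the traversal.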

* Added FuseGemmEpiloguePass

1. Added FuseGemmEpiloguePass to handle nn.Linear + Act{ReLU}
fusion (GeLU will be supported in the future).
2. Only support matmul_v2 from nn.Linear.

* Added pybind to BuildStrategy.fuse_gemm_epilogue_.

* Added UT for fuse_gemm_epilogue_pass.

* GeLU support and EpilogueSingleton

1. Added GeLU support to fused_gemm_epilogue op.
2. Added EpilogueSingleton to cache auxiliary pointer.
3. Added related UTs.

* Rename cublaslt_epilogue_op to gemm_epilogue_op.*.

* Added both train and infer pattern to LinearAct.

1. Added support of fwd graph with grad_ops linking to LinearAct.
2. Added related changes to fuse_gemm_epilogue_pass for the above modification.

* Changed CUDA requirement from 11.4 to 11.6 for fuse_gemm_epilogue_pass.

* Added identity activation support to gemm_epilogue_op.

* Added Linear Fusion (matmul_v2 + ele_add)

1. Added matmul_v2 + ele_add pattern to LinearActPattern.
2. Added matmul_v2 + ele_add support to fuse_gemm_epilogue_pass.

* Rename gemm_epilogue_op.* to fused_gemm_epilogue_op.*

* Add fused_gemm_epilogue_grad op.

1. Added fused_gemm_epilogue_grad to support backward epilogue fusion.

* Add UTs to fused_gemm_epilogue_grad_op.

* Change attribute name in fused_gemm_epilogue_grad_op for clarity.

* Allow DX and DBias to be dispensable in fused_gemm_epilogue_grad op.
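
For reference, the gradients of Out = X*Y + bias follow from the standard matrix-calculus identities. The sketch below shows them with optional DX and DBias outputs; the function name `linear_grad_reference` and the flags `need_dx` / `need_dbias` are hypothetical and only mirror the idea of dispensable outputs, not the op's actual attributes.

```python
def matmul(a, b):
    """Multiply a 2-D list a (m x k) by a 2-D list b (k x n)."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


def linear_grad_reference(x, y, d_out, need_dx=True, need_dbias=True):
    """Gradients of Out = X*Y + bias given dOut.

    dY    = X^T * dOut           (always computed in this sketch)
    dX    = dOut * Y^T           (skipped when need_dx is False)
    dBias = column sums of dOut  (skipped when need_dbias is False)
    """
    d_y = matmul(transpose(x), d_out)
    d_x = matmul(d_out, transpose(y)) if need_dx else None
    d_bias = [sum(col) for col in zip(*d_out)] if need_dbias else None
    return d_x, d_y, d_bias
```

Skipping DX is useful for the first layer of a network, whose input needs no gradient; skipping DBias covers a Linear layer without bias.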

* Added ElementwiseAdd+Matmul+Act graph pattern detection.

* Fuse backward of Linear(Act(x))

1. Added backward fusion pass to Linear(Act(x)).
2. Added backward fusion pass to Linear(x).

* Added UTs to backward fusion of Linear(Act(x)).

* Complete document of arguments to fused_gemm_epilogue_op.

* Made arguments of some functions pass by reference.

* Modify code with review comments.

1. Made arguments of some function pass by reference.
2. Removed redundant code.
3. Followed Google code style to change code.

* Made 'const' code style consistent

* Fixed random seed of python UTs.

* Set compile constraints for cuBlasLt

1. Require CUDA 11.6+
2. Remove fuse_gemm_epilogue related tests when CUDA < 11.6.

* Code Review from Paddle

1. Renamed argument is_first_gemm to without_x_gradient for clarity.
2. Applied PADDLE_THROW in fused_gemm_epilogue_op.

* Remove EpilogueSingleton

1. Applied ReserveSpace to replace Epilogue for passing auxiliary
pointers between FWD and BWD.
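
The idea of handing auxiliary data from forward to backward can be sketched as follows. This assumes, purely for illustration, that the auxiliary data is a ReLU mask; the function names are hypothetical, and the actual contents of ReserveSpace are determined by cuBlasLt rather than by this layout.

```python
def relu_forward_with_reserve(pre_act):
    """Apply ReLU and also return a mask recording which entries were kept."""
    out = [[v if v > 0 else 0.0 for v in row] for row in pre_act]
    reserve = [[v > 0 for v in row] for row in pre_act]
    return out, reserve


def relu_backward_from_reserve(d_out, reserve):
    """Route gradients using only the saved mask; the forward input is not needed."""
    return [[g if keep else 0.0 for g, keep in zip(g_row, m_row)]
            for g_row, m_row in zip(d_out, reserve)]
```

Carrying the mask as an explicit output ties its lifetime to the tensors of one forward/backward pair, which avoids the shared global state a singleton cache would need.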

* Fix a logical error and enhance UTs.

1. Added act op count checking in UTs.
2. Fix issue in fusing backward of ReLU(Linear(X)).
3. TODO: solve GELU fusion issues.

* Fix Linear and GeLU fusion issues.

1. Modified graph_pattern_detector to fit both Linear with GeLU and Linear with ReLU.
2. Modified data range in UTs to allow negative values.

* Removed fused_gemm_epilogue_op.h.

* Rename namespace pten to phi.

* Rename arguments in fused_gemm_epilogue_op

1. bias -> Bias.
2. out -> Out.
3. reserve_space -> ReserveSpace.

* Change EpiloguePassActivationCache to a local variable.

1. Removed singleton in EpiloguePassActivationCache.
2. Made EpiloguePassActivationCache an argument to each pass function.

2a3d9eca

02 Mar, 2022 · 2 commits

new fleet_desc builder (#39948) · 1c4e3e5d

Committed by ziyoujiyi on Mar 02, 2022

* delete gloo connect retry

* the_one_ps dirs reconstruct

* .

* .

* create the_one_ps dirs

* create the_one_ps dirs

* create the_one_ps dirs

* create the_one_ps dirs

* create the_one_ps dirs

* create the_one_ps dirs

* the one ps dirs modify

* the one ps dirs modify

* the one ps dirs modify

* the one ps dirs modify

* refactor ps optimize

* refactor ps optimize

* refactor ps optimize

* .

* .

* .

* .

* .

* .

* refactor theoneps

* the_one_ps

* add ps pass unittest

* add ps pass unittest

* ps unittest frame

* ps unittest frame

* ps unittest frame

* ps unittest frame

* ps unittest frame

* ps unittest frame

* ps unittest frame

* ps unittest frame

* ps unittest frame

* ps unittest frame

* ps unittest frame

* ps unittest frame

* ps unittest frame

* ps unittest frame

* ps unittest frame

* ps unittest frame

* ps unittest frame

* add cpu_async_ps_mode test

* add cpu_async_ps_mode test

* add cpu_async_ps_mode test

* ps unittest ready

* ps unittest ready

* solve dist_pass init conflict

* solve import CommContext error

* unittest ok

* implement AllocateFrom

* solve setup.py.in conflict

* solve conflict

* solve conflict

* solve conflict

* .

* .

* cpu-async-ps minimize test ok & gpu minimize test ok

* add heter 2stage unittest

* add heter 2stage unittest

* add heter 2stage unittest

* sync/geo test ok & fix heter_worker program ok

* .

* new fleet desc generator

* new fleet_desc builder

* new fleet_desc builder

* .

* .

* correct ps.proto compile

* .

Co-authored-by: zkh2016 <zhangkaihuo@baidu.com>

1c4e3e5d

[Eager] open eager when WITH_PYTHON (#39979) · 9af72957

Committed by wanghuancoder on Mar 02, 2022

* open eager when WITH_PYTHON, test=develop

* refine, test=develop

* refine, test=develop

* add DWITH_PYTHON for gen_fluid_lib, test=develop

9af72957

01 Mar, 2022 · 1 commit
- add test_warpctc_op in mac (#39983) · 25650774
  Committed by zhangchunle on Mar 01, 2022
  
  25650774
28 Feb, 2022 · 1 commit
- PR-CI-Py3 change cpu test (#39659) · 3cb93edf
  Committed by zhangchunle on Feb 28, 2022
```
* update;test=cpu-py3
```
  3cb93edf
23 Feb, 2022 · 1 commit
- Add ProcessGroupNCCL for distributed training (#39737) · 0b205817
  Committed by ShenLiang on Feb 23, 2022
```
* add processgroup_nccl
```
  0b205817
22 Feb, 2022 · 1 commit
- disable some distribute test case when in CPU test env (#39801) · ae8c811a
  Committed by YUNSHEN XIE on Feb 22, 2022
  
  ae8c811a
21 Feb, 2022 · 1 commit

disable some distribute test case when in CPU test env (#39682) · 941bdb41

Committed by wanghuancoder on Feb 21, 2022

* disable some distribute test case when in CPU test env, test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

941bdb41

19 Feb, 2022 · 1 commit

Add the DistributedFusedLamb optimizer (#39148) · 5df3cd61

Committed by sneaxiy on Feb 19, 2022

* add DistributedFusedLamb op

* polish code

* fix compile error

* compatible with pten changes

* fix rocm compile error

* improve coverage

* update upstream/develop

* fix cast_with_ptr.h

* add FLAGS_distributed_lamb_divide_nranks_when_allreduce=1

* fix clip before allreduce

* add use_master_param_norm

* code polish

* fix bug

* fix ROCM ci

5df3cd61

09 Feb, 2022 · 1 commit

Replace EagerTensor with Tensor (#39376) · 945a3ce9

Committed by Jiabin Yang on Feb 09, 2022

* merge legacy to fluid

* Remove legacy code

* Remove legacy code

* Remove DataType test

* Using Tensor directly instead of using EagerTensor

* support gradient_accumulation

* make test_imperative_lod_tensor_to_selected_rows longer

* make test_imperative_lod_tensor_to_selected_rows longer

945a3ce9

28 Jan, 2022 · 1 commit

Resolve unit-test timeout issues (#39292) · 543f3dea

Committed by Weilong Wu on Jan 28, 2022

* implement AllocateFrom

* fix PR-CI-Coverage timeout in 120s

Co-authored-by: zkh2016 <zhangkaihuo@baidu.com>

543f3dea

25 Jan, 2022 · 1 commit
- [fleet_executor] Dist model run method Implementation (#39194) · 20e23e1b
  Committed by Yuang Liu on Jan 25, 2022
  
  20e23e1b
24 Jan, 2022 · 1 commit

Refactored python-level trace_op to call through _C_ops instead of... · c3796061

Committed by Zhanlue Yang on Jan 24, 2022

Refactored python-level trace_op to call through _C_ops instead of Tracer::TraceOp, under eager_mode (#38338)

* Replaced core.ops with _C_ops

* Refactored python-level trace_op to call through _C_ops instead of Tracer::TraceOp, under eager_mode

* Modified trace_op interface

* Refactored trace_op logic for eager mode

* Added Eager Dygraph support for OpTest

* Fixed ci issues

* Fixed CI failures

* Fixed Coverage CI Issues

* Fixed XPU CI Issues

c3796061

21 Jan, 2022 · 1 commit
- [fleet executor] add a tensor wrapper to support python numpy input (#39076) · 08793179
  Committed by Yuang Liu on Jan 21, 2022
  
  08793179
19 Jan, 2022 · 1 commit

ipu python interface p1 (#38096) · 0837a2cc

Committed by jianghaicheng on Jan 19, 2022

* ipu_commit_tests p1

* resolve comments

* resolve comments

* resolve comments

* resolve comments

* resolve comments

* resolve comments

* resolve comments

* update lint and ipustrategy introduction

* update ipu_config

* update __init__ of static

* update doc

* update doc 2

* update doc 3

* update doc 4

* update doc 5

* update doc 5

* update doc 6

* update lint

* update lint 2

* update ipustrategy

* add IpuStrategy to all

* update ipustrategy

* update ipu_shard_guard

* update ipu_shard_guard 2

Co-authored-by: yaozhixin <522190855@qq.com>

0837a2cc

14 Jan, 2022 · 2 commits
- Add dygraph sharding stage3 (#38052) · 4c77a908
  Committed by Baibaifan on Jan 14, 2022
  
  4c77a908
- [MLU]Add mean and reduce_mean op (#38872) · 7f8d5bc8
  Committed by qipengh on Jan 14, 2022
```
* [MLU]: add mean and reduce mean op

* [MLU]add mlu pytest dir in CMakeLists.txt

* [MLU]fix tensor data

* [MLU]fix TensorToPyArray and license
```
  7f8d5bc8
11 Jan, 2022 · 1 commit

【Auto Parallel】New local tensor (#38747) · d3ba1895

Committed by caozhou on Jan 11, 2022

* update dist tensor

* add unittest

* update unittest

* refactor dist tensor

* update dist tensor and unittest

d3ba1895

10 Jan, 2022 · 1 commit
- Add the backward support for QR (#38824) · 657b6742
  Committed by Yulong Ao on Jan 10, 2022
```
* Add the backward support for QR

* Remove unnecessary comments
```
  657b6742
31 Dec, 2021 · 1 commit
- fix timeout (#38612) · 02c17c0b
  Committed by Double_V on Dec 31, 2021
  
  02c17c0b
23 Dec, 2021 · 1 commit
- move distribution.py into distribution package and split into different file... · a3e6f18c
  Committed by Xiaoxu Chen on Dec 23, 2021
```
move distribution.py into distribution package and split into different file for better scalability (#38047)
```
  a3e6f18c
21 Dec, 2021 · 1 commit
- [fleet_executor] Python side fleet executor and task node (#38290) · a4afb97a
  Committed by Yuang Liu on Dec 21, 2021
  
  a4afb97a
20 Dec, 2021 · 1 commit
- [fleet_executor] Remove runtime graph, all scheduler on python side (#38261) · 2f188341
  Committed by Yuang Liu on Dec 20, 2021
  
  2f188341
08 Dec, 2021 · 1 commit
- add update func of auto search (#37867) · 46212b80
  Committed by caozhou on Dec 08, 2021
```
* add update func of auto search

* update unittest
```
  46212b80
07 Dec, 2021 · 1 commit

[Auto para] Relaunch with auto mapping function (#37326) · 506e79d1

Committed by Yulong Ao on Dec 07, 2021

* [Auto Parallel] Add the unified cluster representation

* [Auto Parallel] Add the graph class for physical mapping

* [Auto Parallel] Add the simple physical mapper

* Set the timeout of the mapper

* Merge the upstream develop unittests cmake files

* Fix a bug of the process group

* Remove mapper unittest from platforms that are not GPU

* Move the instantiation of process group after resharding

* Add the local id for devices

* Update the rank mapping format

* [Auto Parallel] Relaunch with the rank mapping file

* Remove the unnecessary json file

* Avoid entering get_device_proc_info for auto mapping

* Correct the mapper unit test

* Add some comments

* Remove the related files about mapping

* Update the unittest for auto mapping

* Remove unused rank_mapping unittest

* Improve the unittest coverage

* Improve the unittest coverage

* Improve the unittest of relaunch

* Fix the unittest problem in CI

* Improve the unittest of relaunch

* Remove unnecessary statements

* Update the unittest cmakefile

* Correct the cmakefile of auto parallel unittests

* Modify codes based on the new elastic change

* Use the GPUs exclusively in the unittest

* Correct the cmakefile

* Set the timeout of the unittest

506e79d1

02 Dec, 2021 · 1 commit
- Add dygraph sharding stage2 (#37707) · 20e19776
  Committed by Baibaifan on Dec 02, 2021
  
  20e19776
30 Nov, 2021 · 1 commit

[Auto Parallel] Do the physical mapping between the process graph and the cluster graph (#37094) · b0dff05d

Committed by Yulong Ao on Nov 30, 2021

* [Auto Parallel] Add the unified cluster representation

* [Auto Parallel] Add the graph class for physical mapping

* [Auto Parallel] Add the simple physical mapper

* Set the timeout of the mapper

* Merge the upstream develop unittests cmake files

* Fix a bug of the process group

* Remove mapper unittest from platforms that are not GPU

* Move the instantiation of process group after resharding

* Add the local id for devices

* Update the rank mapping format

* Add some comments

* Remove the related files about mapping

* Update the unittest for auto mapping

* Remove unused rank_mapping unittest

* Improve the unittest coverage

* Improve the unittest coverage

b0dff05d

27 Nov, 2021 · 1 commit

[Auto Parallel] Add the graph class for the process and cluster (#37482) · 48faf638

Committed by Yulong Ao on Nov 27, 2021

* [Auto Parallel] Add the unified cluster representation

* [Auto Parallel] Add the graph class for physical mapping

* [Auto Parallel] Add the simple physical mapper

* Set the timeout of the mapper

* Merge the upstream develop unittests cmake files

* Fix a bug of the process group

* Remove mapper unittest from platforms that are not GPU

* Move the instantiation of process group after resharding

* Add the local id for devices

* Update the rank mapping format

* Add some comments

* Remove the related files about mapping

* Remove unused rank_mapping unittest

* Improve the unittest coverage

48faf638

26 Nov, 2021 · 1 commit
- fix data parallel when VOCAB var in program (#37543) · e05540f7
  Committed by Steffy-zxf on Nov 26, 2021
```
* fix data parallel when VOCAB var in program
```
  e05540f7
25 Nov, 2021 · 2 commits
- Add InternalStorage and add ShardingOptimizerStage2 (#37489) · 5af64631
  Committed by Baibaifan on Nov 25, 2021
  
  5af64631
- Export task node to python (#37509) · 3f815e76
  Committed by LiYuRio on Nov 25, 2021
  
  3f815e76
15 Nov, 2021 · 1 commit

Add distributed pass framework: including PassBase/PassTest/PassUtils (#36643) · 12339fa0

Committed by Zeng Jinle on Nov 15, 2021

* add split_program

* make ut faster

* increase ut timeout

* make result deterministic

* add fuse_all_reduce pass

* add ut framework, update

* fix ut framework

* remove useless code

* add coverage support

* update

* fix CI

* fix some bugs and fix ci coverage

* fix conflict

12339fa0

12 Nov, 2021 · 3 commits
- [fix]fix the bug of fused_attention and fused_feedforward (#36972) · 6486e242
  Committed by zhangkaihuo on Nov 12, 2021
```
* fix bug:
1. atten: set the default value of attn_dropout_rate to None
2. ffn: add activation parameter
```
  6486e242
- [fleet_executor] handle empty addr for single card train (#37150) · 2c7870e0
  Committed by Yuang Liu on Nov 12, 2021
  
  2c7870e0
- [AutoParallel] Add AutoConvert (#36958) · 1773afd7
  Committed by zhaoyingli on Nov 12, 2021
```
* add AutoConvert

* add unittest

* amend merge&slice

* amend default dist_attr

* update doc&improve coverage

* add interface dist_context

* tiny modify
```
  1773afd7
05 Nov, 2021 · 1 commit
- Optimized the solve op code: renamed var and removed template func (#36981) · bea0c9f5
  Committed by Weilong Wu on Nov 05, 2021
  
  bea0c9f5
03 Nov, 2021 · 1 commit
- executor framework (#36892) · 10b039b7
  Committed by LiYuRio on Nov 03, 2021
  
  10b039b7
02 Nov, 2021 · 1 commit

[AutoParallel] Save&Load Module (#36558) · b9defb4f

Committed by zhaoyingli on Nov 02, 2021

* AutoParallel Save&Load

* tiny modi

* update func name

* tiny fix

* add NotImplementedError

* fix doc

* update func name

* update func param

* update interface

* add unittest & modify make_data_unshard

* update unittest

* update unittest

* fix unittest

* fix cmakelist

* update unittest

b9defb4f

28 Oct, 2021 · 1 commit
- Add lazy distributed launch with rank mapping (#36570) · 7de3f81c
  Committed by Bo Liu on Oct 28, 2021
  
  7de3f81c

Crayon鑫 / Paddle is in sync with the fork source project
