Commits · 3cb93edfc35aeb67ed94b8a46979e27aaef92eed · Crayon鑫 / Paddle

28 Feb, 2022 · 1 commit
- PR-CI-Py3 change cpu test (#39659) · 3cb93edf
  Committed by zhangchunle on Feb 28, 2022
```
* update;test=cpu-py3
```
  3cb93edf
23 Feb, 2022 · 1 commit
- Add ProcessGroupNCCL for distributed training (#39737) · 0b205817
  Committed by ShenLiang on Feb 23, 2022
```
* add processgroup_nccl
```
  0b205817
22 Feb, 2022 · 1 commit
- disable some distribute test case when in CPU test env (#39801) · ae8c811a
  Committed by YUNSHEN XIE on Feb 22, 2022
  
  ae8c811a
21 Feb, 2022 · 1 commit

disable some distribute test case when in CPU test env (#39682) · 941bdb41

Committed by wanghuancoder on Feb 21, 2022

* disable some distribute test case when in CPU test env, test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

941bdb41

19 Feb, 2022 · 1 commit

Add the DistributedFusedLamb optimizer (#39148) · 5df3cd61

Committed by sneaxiy on Feb 19, 2022

* add DistributedFusedLamb op

* polish code

* fix compile error

* compatible with pten changement

* fix rocm compile error

* improve converage

* update upstream/develop

* fix cast_with_ptr.h

* add FLAGS_distributed_lamb_divide_nranks_when_allreduce=1

* fix clip before allreduce

* add use_master_param_norm

* code polish

* fix bug

* fix ROCM ci

5df3cd61

09 Feb, 2022 · 1 commit

Replace EagerTensor with Tensor (#39376) · 945a3ce9

Committed by Jiabin Yang on Feb 09, 2022

* merge legacy to fluid

* Remove legacy code

* Remove legacy code

* Remove DataType test

* Using Tensor directly instead of using EagerTensor

* support gradient_accumulation

* make test_imperative_lod_tensor_to_selected_rows longer

* make test_imperative_lod_tensor_to_selected_rows longer

945a3ce9

28 Jan, 2022 · 1 commit

Resolve unit-test timeout issues (#39292) · 543f3dea

Committed by Weilong Wu on Jan 28, 2022

* implement AllocateFrom

* fix PR-CI-Coverage timeout in 120s

Co-authored-by: zkh2016 <zhangkaihuo@baidu.com>

543f3dea

25 Jan, 2022 · 1 commit
- [fleet_executor] Dist model run method Implementation (#39194) · 20e23e1b
  Committed by Yuang Liu on Jan 25, 2022
  
  20e23e1b
24 Jan, 2022 · 1 commit

Refactored python-level trace_op to call through _C_ops instead of... · c3796061

Committed by Zhanlue Yang on Jan 24, 2022

Refactored python-level trace_op to call through _C_ops instead of Tracer::TraceOp, under eager_mode (#38338)

* Replaced core.ops with _C_ops

* Refactored python-level trace_op to call through _C_ops instead of Tracer::TraceOp, under eager_mode

* Modified trace_op interface

* Refactored trace_op logic for eager mode

* Added Eager Dygraph support for OpTest

* Fixed ci issues

* Fixed CI failures

* Fixed Coverage CI Issues

* Fixed XPU CI Issues

c3796061

21 Jan, 2022 · 1 commit
- [fleet executor] add a tensor wrapper to support python numpy input (#39076) · 08793179
  Committed by Yuang Liu on Jan 21, 2022
  
  08793179
19 Jan, 2022 · 1 commit

ipu python interface p1 (#38096) · 0837a2cc

Committed by jianghaicheng on Jan 19, 2022

* ipu_commit_tests p1

* resolve comments

* resolve comments

* resolve comments

* resolve comments

* resolve comments

* resolve comments

* resolve comments

* update lint and ipustrategy introduction

* update ipu_config

* update __init__ of static

* update doc

* update doc 2

* update doc 3

* update doc 4

* update doc 5

* update doc 5

* update doc 6

* update lint

* update lint 2

* update ipustrategy

* add IpuStrategy to all

* update ipustrategy

* update ipu_shard_guard

* update ipu_shard_guard 2

Co-authored-by: yaozhixin <522190855@qq.com>

0837a2cc

14 Jan, 2022 · 2 commits
- Add dygraph sharding stage3 (#38052) · 4c77a908
  Committed by Baibaifan on Jan 14, 2022
  
  4c77a908
- [MLU]Add mean and reduce_mean op (#38872) · 7f8d5bc8
  Committed by qipengh on Jan 14, 2022
```
* [MLU]: add mean and reduce mean op

* [MLU]add mlu pytest dir in CMakeLists.txt

* [MLU]fix tensor data

* [MLU]fix TensorToPyArray and license
```
  7f8d5bc8
11 Jan, 2022 · 1 commit

[Auto Parallel] New local tensor (#38747) · d3ba1895

Committed by caozhou on Jan 11, 2022

* update dist tensor

* add unitest

* update unitest

* refactor dist tensor

* update dist tensor and unitest

d3ba1895

10 Jan, 2022 · 1 commit
- Add the backward support for QR (#38824) · 657b6742
  Committed by Yulong Ao on Jan 10, 2022
```
* Add the backward support for QR

* Remove unnecessary comments
```
  657b6742
31 Dec, 2021 · 1 commit
- fix timeout (#38612) · 02c17c0b
  Committed by Double_V on Dec 31, 2021
  
  02c17c0b
23 Dec, 2021 · 1 commit
- move distribution.py into distribution package and split into different file... · a3e6f18c
  Committed by Xiaoxu Chen on Dec 23, 2021
```
move distribution.py into distribution package and split into different file for better scalability (#38047)
```
  a3e6f18c
21 Dec, 2021 · 1 commit
- [fleet_executor] Python side fleet executor and task node (#38290) · a4afb97a
  Committed by Yuang Liu on Dec 21, 2021
  
  a4afb97a
20 Dec, 2021 · 1 commit
- [fleet_executor] Remove runtime graph, all scheduler on python side (#38261) · 2f188341
  Committed by Yuang Liu on Dec 20, 2021
  
  2f188341
08 Dec, 2021 · 1 commit
- add update func of auto search (#37867) · 46212b80
  Committed by caozhou on Dec 08, 2021
```
* add update func of auto search

* update unitest
```
  46212b80
07 Dec, 2021 · 1 commit

[Auto para] Relaunch with auto mapping function (#37326) · 506e79d1

Committed by Yulong Ao on Dec 07, 2021

* [Auto Parallel]  Add the unified cluster representation

* [Auto Parallel] Add the graph class for physical mapping

* [Auto Parallel] Add the simple physical mapper

* Set the timeout of the mapper

* Merge the upstream develop unittests cmake files

* Fix a bug of the process group

* Remove mapper unittest from platforms which is not GPU

* Move the instantiation of process group after resharding

* Add the local id for devices

* Update the rank mapping format

* [Auto Parallel] Relaunch with the rank mapping file

* Remove the unnecessary json file

* Avoid entering get_device_proc_info for auto mapping

* Correct the mapper unit test

* Add some comments

* Remove the related files about mapping

* Update the unittest for auto mapping

* Remove unused rank_mapping unittest

* Improve the unittest coverage

* Improve the unittest coverage

* Improve the unittest of relaunch

* Fix the unittest problem in CI

* Improve the unittest of relaunch

* Remove unnecessary statements

* Update the unittest cmakefile

* Correct the cmakefile of auto parallel unittests

* Modify codes based on the new elastic change

* Use the GPUs exclusively in the unittest

* Correct the cmakefile

* Set the timeout of the unittest

506e79d1

02 Dec, 2021 · 1 commit
- Add dygraph sharding stage2 (#37707) · 20e19776
  Committed by Baibaifan on Dec 02, 2021
  
  20e19776
30 Nov, 2021 · 1 commit

[Auto Parallel] Do the physical mapping between the process graph and the cluster graph (#37094) · b0dff05d

Committed by Yulong Ao on Nov 30, 2021

* [Auto Parallel]  Add the unified cluster representation

* [Auto Parallel] Add the graph class for physical mapping

* [Auto Parallel] Add the simple physical mapper

* Set the timeout of the mapper

* Merge the upstream develop unittests cmake files

* Fix a bug of the process group

* Remove mapper unittest from platforms which is not GPU

* Move the instantiation of process group after resharding

* Add the local id for devices

* Update the rank mapping format

* Add some comments

* Remove the related files about mapping

* Update the unittest for auto mapping

* Remove unused rank_mapping unittest

* Improve the unittest coverage

* Improve the unittest coverage

b0dff05d

27 Nov, 2021 · 1 commit

[Auto Parallel] Add the graph class for the process and cluster (#37482) · 48faf638

Committed by Yulong Ao on Nov 27, 2021

* [Auto Parallel]  Add the unified cluster representation

* [Auto Parallel] Add the graph class for physical mapping

* [Auto Parallel] Add the simple physical mapper

* Set the timeout of the mapper

* Merge the upstream develop unittests cmake files

* Fix a bug of the process group

* Remove mapper unittest from platforms which is not GPU

* Move the instantiation of process group after resharding

* Add the local id for devices

* Update the rank mapping format

* Add some comments

* Remove the related files about mapping

* Remove unused rank_mapping unittest

* Improve the unittest coverage

48faf638

26 Nov, 2021 · 1 commit
- fix data parallel when VOCAB var in program (#37543) · e05540f7
  Committed by Steffy-zxf on Nov 26, 2021
```
* fix data parallel when VOCAB var in program
```
  e05540f7
25 Nov, 2021 · 2 commits
- Add InternalStorage and add ShardingOptimizerStage2 (#37489) · 5af64631
  Committed by Baibaifan on Nov 25, 2021
  
  5af64631
- Export task node to python (#37509) · 3f815e76
  Committed by LiYuRio on Nov 25, 2021
  
  3f815e76
15 Nov, 2021 · 1 commit

Add distributed pass framework: including PassBase/PassTest/PassUtils (#36643) · 12339fa0

Committed by Zeng Jinle on Nov 15, 2021

* add split_program

* make ut faster

* increase ut timeout

* make result deterministic

* add fuse_all_reduce pass

* add ut framework, update

* fix ut framework

* remove useless code

* add coverage support

* update

* fix CI

* fix some bugs and fix ci coverage

* fix conflict

12339fa0

12 Nov, 2021 · 3 commits
- [fix]fix the bug of fused_attention and fused_feedforward (#36972) · 6486e242
  Committed by zhangkaihuo on Nov 12, 2021
```
* fix bug:
1. atten: set the default value of attn_dropout_rate to None
2. ffn: add activation parameter
```
  6486e242
- [fleet_executor] handle empty addr for single card train (#37150) · 2c7870e0
  Committed by Yuang Liu on Nov 12, 2021
  
  2c7870e0
- [AutoParallel] Add AutoConvert (#36958) · 1773afd7
  Committed by zhaoyingli on Nov 12, 2021
```
* add AutoConvert

* add unitest

* amend merge&slice

* amend default dist_attr

* update doc&improve coverage

* add interface dist_context

* tiny modify
```
  1773afd7
05 Nov, 2021 · 1 commit
- Optimized the solve op code:renamed var and removed template func (#36981) · bea0c9f5
  Committed by Weilong Wu on Nov 05, 2021
  
  bea0c9f5
03 Nov, 2021 · 1 commit
- executor framework (#36892) · 10b039b7
  Committed by LiYuRio on Nov 03, 2021
  
  10b039b7
02 Nov, 2021 · 1 commit

[AutoParallel] Save&Load Module (#36558) · b9defb4f

Committed by zhaoyingli on Nov 02, 2021

* AutoParallel Save&Load

* tiny modi

* update func name

* tiny fix

* add NotImplementedError

* fix doc

* update func name

* update func param

* update interface

* add unitest & modi make_data_unshard

* update unittest

* update unittest

* fix unittest

* fix cmakelist

* update unittest

b9defb4f

28 Oct, 2021 · 1 commit
- Add lazy distributed launch with rank mapping (#36570) · 7de3f81c
  Committed by Bo Liu on Oct 28, 2021
  
  7de3f81c
26 Oct, 2021 · 3 commits

Add fused attention op backward and python layer. (#36498) · 5119428e

Committed by Li Min on Oct 26, 2021

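Purpose: this PR aims to improve the computational performance of the attention module.
To reduce the framework's op-scheduling overhead, this PR implements the attention module by hand in the C++ layer and exposes it as a single large attention op.
To reduce memory-access overhead, this PR applies two optimizations:
(1) When computing q, k, and v, the input X is shared, so the gemm, transpose, and bias add at that point are each invoked once instead of three times.
(2) Kernel fusion is applied, so that data moves between different CUDA kernels through registers.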

5119428e

Move fused_attention and fused_feedforward functional api path to incubate (#36704) · 9aeca2f1
Committed by Li Min on Oct 26, 2021
```
Move the Python API interfaces added in PRs #35905 and #35843 into the incubate directory.
```
9aeca2f1

Support various length support for SelectedRows in GLOO::AllGather (#36637) · eca78a9f

Committed by xiongkun on Oct 26, 2021

* In cpu parallel using gloo, add various length support for SelectedRows

* fix bug

* fix bugs

* fix by code review

* remove timeout

eca78a9f

25 Oct, 2021 · 1 commit

add op: fused_feedforward(forward) (#35843) · b18cbfb2

Committed by zhangkaihuo on Oct 25, 2021

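This PR contains only the forward code for fused_feedforward.

Related kernel implementations: fused_dropout_act_bias, fused_residual_dropout_bias, fused_layernorm_residual_dropout_bias

fused_feedforward is a fused operator. It fuses and wraps the operators in the feed-forward layer of a transformer model so that the front end exposes a single interface. The fusion cuts some memory-access and kernel-launch time, which improves performance.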
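
For orientation, the unfused sequence of steps that a transformer feed-forward layer typically performs can be sketched in NumPy as below. This is only an illustrative reference, not the code from this PR: the function names are hypothetical, and the ReLU activation and the post-layer-norm ordering are assumptions made for the example. Each commented step would be a separate op (with its own memory round trip) when left unfused.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the last (feature) dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

def dropout(x, rate, rng):
    # Inverted dropout; rate == 0 is the identity (deterministic path).
    if rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def feed_forward_reference(x, w1, b1, w2, b2, gamma, beta, rate, rng):
    # Hypothetical unfused reference; ReLU and post-layer-norm are assumptions.
    hidden = x @ w1 + b1                     # linear 1 + bias
    hidden = np.maximum(hidden, 0.0)         # activation
    hidden = dropout(hidden, rate, rng)      # dropout 1
    out = hidden @ w2 + b2                   # linear 2 + bias
    out = dropout(out, rate, rng)            # dropout 2
    return layer_norm(x + out, gamma, beta)  # residual add + layer norm

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x = rng.standard_normal((2, 4, d_model))
w1 = rng.standard_normal((d_model, d_ff))
b1 = rng.standard_normal(d_ff)
w2 = rng.standard_normal((d_ff, d_model))
b2 = rng.standard_normal(d_model)
gamma, beta = np.ones(d_model), np.zeros(d_model)

y = feed_forward_reference(x, w1, b1, w2, b2, gamma, beta, rate=0.0, rng=rng)
```

Read against the kernel names listed above, the bias, activation, and dropout steps and the residual, dropout, and layer-norm steps are the natural groups to merge, though how the PR actually partitions them is not stated here.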

b18cbfb2

22 Oct, 2021 · 1 commit

Fused attention op forward (#35905) · d4906214

Committed by Li Min on Oct 22, 2021

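Purpose: this PR aims to improve the computational performance of the attention module.
To reduce the framework's op-scheduling overhead, this PR implements the attention module by hand in the C++ layer and exposes it as a single large attention op.
To reduce memory-access overhead, this PR applies two optimizations:
(1) When computing q, k, and v, the input X is shared, so the gemm, transpose, and bias add at that point are each invoked once instead of three times.
(2) Kernel fusion is applied, so that data moves between different CUDA kernels through registers.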
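
Optimization (1) can be illustrated with a small NumPy sketch. It is not the PR's C++/CUDA implementation. The function names, shapes, and weight layout (Q, K, V weights concatenated along the output dimension) are assumptions chosen for the example. It only shows that one matrix multiply, one bias add, and one transpose over the shared input X give the same Q, K, V as three separate projections.

```python
import numpy as np

def split_heads(t, num_heads):
    # [batch, seq, embed] -> [batch, heads, seq, head_dim]
    b, s, e = t.shape
    return t.reshape(b, s, num_heads, e // num_heads).transpose(0, 2, 1, 3)

def qkv_separate(x, wq, wk, wv, bq, bk, bv, num_heads):
    # Baseline: three GEMMs, three bias adds, three transposes.
    q, k, v = x @ wq + bq, x @ wk + bk, x @ wv + bv
    return tuple(split_heads(t, num_heads) for t in (q, k, v))

def qkv_fused(x, w_qkv, b_qkv, num_heads):
    # Shared input X: one GEMM, one bias add, one transpose.
    batch, seq_len, embed_dim = x.shape
    head_dim = embed_dim // num_heads
    out = x @ w_qkv + b_qkv                      # [batch, seq, 3 * embed]
    out = out.reshape(batch, seq_len, 3, num_heads, head_dim)
    out = out.transpose(2, 0, 3, 1, 4)           # [3, batch, heads, seq, head_dim]
    return out[0], out[1], out[2]

rng = np.random.default_rng(0)
batch, seq_len, embed_dim, num_heads = 2, 4, 8, 2
x = rng.standard_normal((batch, seq_len, embed_dim))
wq, wk, wv = (rng.standard_normal((embed_dim, embed_dim)) for _ in range(3))
bq, bk, bv = (rng.standard_normal(embed_dim) for _ in range(3))

# Assumed layout: concatenate the three projections along the output axis.
w_qkv = np.concatenate([wq, wk, wv], axis=1)     # [embed, 3 * embed]
b_qkv = np.concatenate([bq, bk, bv])             # [3 * embed]

q_ref, k_ref, v_ref = qkv_separate(x, wq, wk, wv, bq, bk, bv, num_heads)
q, k, v = qkv_fused(x, w_qkv, b_qkv, num_heads)
```

Because the results are equal, the gain comes from how the work is issued: X is read once and only one launch of each operation is needed.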

d4906214

Crayon鑫 / Paddle is in sync with the upstream project it was forked from
