提交 · ceb6b3f1fb3865711c03d64be5497197f1257258 · PaddlePaddle / Paddle

14 6月, 2022 1 次提交
- Z
  [MLU]: add elementwise_max mlu kernel (#43365) · ceb6b3f1
  由 zhaoying9105 提交于 6月 14, 2022
```
* [MLU]: add elementwise_max mlu kernel

* [MLU]: add int32 support for elementwise maxk MLU kernel
```
  ceb6b3f1
24 3月, 2022 1 次提交

由 Roc 提交于 3月 24, 2022

* # This is a combination of 10 commits.
# The first commit's message is:
add expert count op

add ut for expert_count

# This is the 2nd commit message:

update UT only for cuda

# This is the 3rd commit message:

fix for rocm

# This is the 4th commit message:

update ut

# This is the 5th commit message:

add moe module

# This is the 6th commit message:

add expert count op

add ut for expert_count

# This is the 7th commit message:

update UT only for cuda

# This is the 8th commit message:

update ut

# This is the 9th commit message:

add moe module

# This is the 10th commit message:

make expert count private

* add assign pos op

* fix upper num name

* add api _assign pos

* add ut for assign pos op

* update date

* fix for win

* update for test (timeout)

* fix ut

* update

* fix ut for number count
Co-authored-by: Nhlygit66666 <2570058140@qq.com>

305f32d1

13 11月, 2020 1 次提交
- L
  add send and recv ops (#28590) · ed9dd7c9
  由 lilong12 提交于 11月 13, 2020
```
* update, test=develop
```
  ed9dd7c9
30 9月, 2020 1 次提交

fix distributed error info (#27206) · 20fb01fb

由 MRXLT 提交于 9月 30, 2020

* fix distributed error info

* bug fix; notest

* error info refine

* update error info

* update error info

* update error info

* bug fix

* bug fix

* bug fix

* bug fix

20fb01fb

02 7月, 2019 1 次提交

supports collective training with programs (#18392) · a873fa84

由 Yi Liu 提交于 7月 02, 2019

1. Since allreduce op has 4 reduce types, We split these four reduce types into four ops
2. We also refined the collective op code, e.g. we separated the collective op kernel into CPUKernel and CUDAKernel, and remove the device specified DeviceContext parameter in template as we already knew the target DeviceContext
3. We remove the newly added Collective op role to reduce the complexity of program and graph analysis

a873fa84

27 6月, 2019 1 次提交

supports collective communicated training (#18175) · b7128bac

由 HaoRen 提交于 6月 27, 2019

* fix prepare context redundant code problem, optimize executor by caching create_varaiables
test=develop

* supports collective training in executor

* make fetch_list runable with variables, add more unittest for use_program_cache
test=develop

* fix comment
test=develop

* use unique name for nccl_id

* supports output to stream in program_to_code

* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code

* set op role in collective training

* add collective op role

* remove orig file

* add build optimizer by strategy

* add collective strategy

* refine collective strategy

* add multi-process role maker

* refine strategy building factory so that we can easily plugin more strategy

* scale loss grad in collective sgd transpiler

* add support for distributed fc

* code format

* revert some features for dist fc

* add support for distributed fc training

* fix prepare context redundant code problem, optimize executor by caching create_varaiables
test=develop

* supports collective training in executor

* make fetch_list runable with variables, add more unittest for use_program_cache
test=develop

* use unique name for nccl_id

* supports output to stream in program_to_code

* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code

* set op role in collective training

* add collective op role

* fix comment
test=develop

* remove orig file

* add build optimizer by strategy

* add collective strategy

* refine collective strategy

* add multi-process role maker

* refine strategy building factory so that we can easily plugin more strategy

* scale loss grad in collective sgd transpiler

* add support for distributed fc

* code format

* revert some features for dist fc

* add support for distributed fc training

* test=develop
add collective op unittest standard

* test=develop
remove the test_collective directory

* test=develop
remove the test_collective directory

* remove slicegather test

* code format for reducescatter

* update attr of shard_index_op

* Modify macro nccl_helper

* remove test without distribute

* macro collective_helper

* marcro update

* test=develop
update support python3.5

* test=develop change gpu memory use to 0.1 when test

* test=develop
update ut equal func

* test=develop
set flags to 1.5

* test=develop fix pickle dumple  py35

* test=develop
fix divide in slice and add sync_comm_stream
update atol and rtol to 1e-05
rm shard_index op and test
modify read input from file to read from memory
remove origin_program in framework and add i/o in c_sync_calc_stream

* test=develop update unittest sync operator I/O

b7128bac

17 5月, 2019 1 次提交
- Y
  polish parallel dygraph code (#17164) · 02175555
  由 Yan Xu 提交于 5月 17, 2019
```
* add var grad hook test=develop
```
  02175555
25 4月, 2019 1 次提交
- Y
  ParallelDyGraph with GPU collective mode (#16827) · 0b07eef1
  由 Yan Xu 提交于 4月 25, 2019
```
implement dygraph.parallel.DataParallel to hook reduce op.
```
  0b07eef1

PaddlePaddle / Paddle 大约 1 年 前同步成功

PaddlePaddle / Paddle
大约 1 年前同步成功