Commits · fab92824b5994cbcec5d7282fb9369bba5596419 · 机器未来 / Paddle

21 Sep, 2021 · 1 commit

Reuse OneDNN handler for SGD and SUM for SelectedRows input tensors. (#35510) · 799f3861

Committed by Adam Osewski on Sep 20, 2021

* Create stateful OneDNNAXPYHandler object.

This makes it possible to call it multiple times without recreating the
oneDNN primitives every time.

* Prepare SGDOpKernel to reuse its implementation from OneDNN kernel.

* OneDNN SGD kernel.

* Update call to use new OneDNNAXPYHandler object api.

* Setup seed in proper place.

* Enable OneDNN kernel only for single case.

* For dense param and sparse grad.

* Small refactor.

* Enable oneDNN by op attr or by cmd line flag.

* Use int64_t type for number of elements.

* Support dense param and grad from OneDNN kernel.

* Enable SGD OneDNN kernel when using MP BF16 optimizer.

* Force non-copyable/movable OneDNNAXPYHandler.

* Reuse OneDNNAXPYHandler for sparse tensors in SUM op.

* Fix SFINAE rules.

* Remove recording event inside AXPY.

* Get rid of internal primitive caching.

* Stop using the PP cache mechanism to store memory and primitive objects.
* The handler object stores and reuses the needed descriptors and primitives.

* Do not derive from MKLDNNHandlerT

799f3861
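
The idea behind this commit (build the AXPY handler once, then call it for every row of a sparse gradient) can be illustrated with a minimal sketch. It is not the Paddle or oneDNN implementation: the class `AxpyHandler`, the function `sgd_sparse_update`, and the list-of-lists tensor layout are hypothetical stand-ins, and plain Python arithmetic replaces the oneDNN primitives. It only assumes that AXPY computes `y <- alpha * x + y` and that an SGD step on a dense parameter with a sparse gradient touches only the rows listed in the gradient.

```python
class AxpyHandler:
    """Hypothetical stand-in for a stateful AXPY handler: y <- alpha * x + y."""

    def __init__(self, n, alpha):
        # State that is set up once and reused on every call.
        self.n = n
        self.alpha = alpha
        self.calls = 0

    def __call__(self, x, y):
        if len(x) != self.n or len(y) != self.n:
            raise ValueError("row length does not match handler configuration")
        for i in range(self.n):
            y[i] += self.alpha * x[i]  # in-place update of y
        self.calls += 1


def sgd_sparse_update(param, grad_rows, grad_values, lr):
    """Dense param, sparse grad: only the rows listed in grad_rows change."""
    row_width = len(param[0])
    handler = AxpyHandler(row_width, -lr)  # constructed once, outside the loop
    for row, values in zip(grad_rows, grad_values):
        handler(values, param[row])  # param[row] <- param[row] - lr * values
    return handler.calls


param = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
calls = sgd_sparse_update(param, [0, 2], [[0.5, 0.5], [1.0, 2.0]], lr=0.5)
```

Here rows 0 and 2 are updated and row 1 is left untouched, while the handler is created a single time and invoked once per gradient row.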

21 Jun, 2021 · 1 commit

Add AXPY oneDNN handler (#33632) · 773aabc7

Committed by lidanqing on Jun 21, 2021

* Add oneDNN AXPY handler.

* Add fallback for small tensors.

* Fix ifdefs

* Remove unnecessary namespace prefixes and add missing headers.

* Guard handler_axpy with proper ifdefs.

* Compilation of this function is possible only when Paddle is built
with neither CUDA nor HIP.

* Move AXPY handler code to separate files.

* Use oneDNN AXPY handler in SGD op.

* Use axpy handler only when Paddle is built with oneDNN.

* Add test for SUM BF16 with big rows.

* Fix SFINAE rules for elementwise_add_to.

* Add test case for SGD with big rows.

* update

* update

Co-authored-by: Adam Osewski <adam.osewski@intel.com>

773aabc7
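
As a rough illustration of "AXPY handler with a fallback for small tensors", the sketch below dispatches between two code paths by element count. Everything in it is an assumption made for illustration: the threshold value, the function names, and the blocked loop standing in for the accelerated path are not taken from the commit. The only fixed point is the AXPY definition `y <- alpha * x + y`.

```python
SMALL_TENSOR_THRESHOLD = 100  # hypothetical cutoff, not taken from the commit


def axpy_reference(alpha, x, y):
    """Plain element-wise loop, used here as the small-tensor fallback."""
    for i in range(len(x)):
        y[i] += alpha * x[i]


def axpy_blocked(alpha, x, y, block=64):
    """Stand-in for the accelerated path: processes the data block by block."""
    for start in range(0, len(x), block):
        stop = min(start + block, len(x))
        y[start:stop] = [
            yi + alpha * xi for xi, yi in zip(x[start:stop], y[start:stop])
        ]


def axpy(alpha, x, y):
    """Update y in place and report which path was taken."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(x) < SMALL_TENSOR_THRESHOLD:
        axpy_reference(alpha, x, y)
        return "fallback"
    axpy_blocked(alpha, x, y)
    return "handler"


small_y = [2.0] * 4
small_path = axpy(3.0, [1.0] * 4, small_y)

large_x = [float(i) for i in range(256)]
large_y = [0.0] * 256
large_path = axpy(2.0, large_x, large_y)
```

Both paths produce the same numerical result; the dispatch only decides whether the setup cost of the heavier path is worth paying for a given size.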

23 Apr, 2021 · 1 commit

add c_concat and c_split ops (#32486) · 2b108a04

Committed by lilong12 on Apr 23, 2021

* add c_concat op

2b108a04

13 Nov, 2020 · 1 commit

add send and recv ops (#28590) · ed9dd7c9

Committed by lilong12 on Nov 13, 2020

* update, test=develop

ed9dd7c9

30 Sep, 2020 · 1 commit

fix distributed error info (#27206) · 20fb01fb

Committed by MRXLT on Sep 30, 2020

* fix distributed error info

* bug fix; notest

* error info refine

* update error info

* update error info

* update error info

* bug fix

* bug fix

* bug fix

* bug fix

20fb01fb

02 Jul, 2019 · 1 commit

supports collective training with programs (#18392) · a873fa84

Committed by Yi Liu on Jul 02, 2019

1. Since the allreduce op has four reduce types, we split them into four separate ops.
2. We also refined the collective op code: we separated the collective op kernel into CPUKernel and CUDAKernel, and removed the device-specific DeviceContext template parameter, since the target DeviceContext is already known.
3. We removed the newly added Collective op role to reduce the complexity of program and graph analysis.

a873fa84
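
The split described in point 1 can be pictured with a small single-process simulation. This is only a sketch: it assumes the four reduce types are sum, max, min and product, the function names are hypothetical, and each "rank" is just a Python list rather than a device buffer. An allreduce combines the values from all ranks element-wise and leaves every rank holding the same result.

```python
from functools import reduce
import operator


def _allreduce(rank_values, op):
    """Combine equally sized per-rank lists element-wise with a binary op."""
    reduced = [reduce(op, column) for column in zip(*rank_values)]
    # Every rank receives its own copy of the reduced result.
    return [list(reduced) for _ in rank_values]


# One thin entry point per reduce type instead of one op with a type switch.
def allreduce_sum(rank_values):
    return _allreduce(rank_values, operator.add)


def allreduce_max(rank_values):
    return _allreduce(rank_values, max)


def allreduce_min(rank_values):
    return _allreduce(rank_values, min)


def allreduce_prod(rank_values):
    return _allreduce(rank_values, operator.mul)


ranks = [[1, 5], [2, 3], [4, 2]]  # three simulated ranks, two elements each
sum_result = allreduce_sum(ranks)
max_result = allreduce_max(ranks)
min_result = allreduce_min(ranks)
prod_result = allreduce_prod(ranks)
```

With one function per reduce type, the reduce type is fixed by which op is called rather than carried as a runtime attribute.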

27 Jun, 2019 · 1 commit

supports collective communicated training (#18175) · b7128bac

Committed by HaoRen on Jun 27, 2019

* fix prepare context redundant code problem, optimize executor by caching create_variables
test=develop

* supports collective training in executor

* make fetch_list runnable with variables, add more unittest for use_program_cache
test=develop

* fix comment
test=develop

* use unique name for nccl_id

* supports output to stream in program_to_code

* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code

* set op role in collective training

* add collective op role

* remove orig file

* add build optimizer by strategy

* add collective strategy

* refine collective strategy

* add multi-process role maker

* refine strategy building factory so that we can easily plug in more strategies

* scale loss grad in collective sgd transpiler

* add support for distributed fc

* code format

* revert some features for dist fc

* add support for distributed fc training

* test=develop
add collective op unittest standard

* test=develop
remove the test_collective directory

* test=develop
remove the test_collective directory

* remove slicegather test

* code format for reducescatter

* update attr of shard_index_op

* Modify macro nccl_helper

* remove test without distribute

* macro collective_helper

* macro update

* test=develop
update support python3.5

* test=develop change gpu memory use to 0.1 when test

* test=develop
update ut equal func

* test=develop
set flags to 1.5

* test=develop fix pickle dump py35

* test=develop
fix divide in slice and add sync_comm_stream
update atol and rtol to 1e-05
rm shard_index op and test
modify read input from file to read from memory
remove origin_program in framework and add i/o in c_sync_calc_stream

* test=develop update unittest sync operator I/O

b7128bac

17 May, 2019 · 1 commit

polish parallel dygraph code (#17164) · 02175555

Committed by Yan Xu on May 17, 2019

* add var grad hook test=develop

02175555

25 Apr, 2019 · 1 commit

ParallelDyGraph with GPU collective mode (#16827) · 0b07eef1

Committed by Yan Xu on Apr 25, 2019

implement dygraph.parallel.DataParallel to hook reduce op.

0b07eef1
