Commits · 55c2329a5356628a5ba06ecad3750e0296ea1c16 · PaddlePaddle / Paddle

17 Oct, 2019 2 commits

fix fetch handler error with pslib (#20681) · 8c1e1ded
Committed by tangwei12 on Oct 17, 2019
```
* fix fetch handler error with pslib

* fix distributed lookup table op with 1 pserver
```

[cherry-pick]Fix communicator slow bug & fix communicator stop bug (#20366) (#20646) · eeaf04da

Committed by Chengmo on Oct 17, 2019

* Fix communicator slow bug & fix communicator stop bug (#20366)

* test=develop,Fix communicator slow bug

* test=develop, delete if() in stop_worker()

* test=develop

* fix UT, test=develop

* fix bug in fetch handler, test=develop

* fix bug in fetch handler, test=develop

* test=develop, fix fetch barrier bug

* test=develop, bug fix

* test=develop, bug fix

* test=develop, fix bug

* test=develop,test=release/1.6


16 Oct, 2019 1 commit
- bug fix: invalid learning rate decay in pserver async mode (#20325) (#20635) · c33312f7
  Committed by 123malin on Oct 16, 2019
```
* bug fix: invalid learning rate decay in pserver async mode
```
11 Oct, 2019 2 commits
- [cherry-pick][release-1.6]Fix transpiler en doc (#20149) (#20371) · 42909238
  Committed by Chengmo on Oct 11, 2019
```
* Fix transpiler en doc (#20149)

* test=develop,test=document_fix,fix transpiler doc,add API.spec

* test=develop,test=document_fix,fix transpiler doc,add API.spec
```
- doc merge, test=document_fix (#20464) · c339e2dc
  Committed by tangwei12 on Oct 11, 2019
08 Oct, 2019 1 commit
- Trainer heartbeat for async mode (#19600) (#20183) · 8589c719
  Committed by tangwei12 on Oct 08, 2019
```
Heartbeat for distributed async training.
```
02 Oct, 2019 1 commit
- Add GEO-SGD distribute training algorithm (#20018) (#20133) · 2467c137
  Committed by Chengmo on Oct 02, 2019
```
* refector geo sgd & communicator
```
26 Sep, 2019 1 commit
- fix APIs, test=document_preview (#19954) · 6c74e738
  Committed by 123malin on Sep 26, 2019
```
* fix DistributeTranspilerConfig document, test=develop
```
16 Sep, 2019 1 commit
- fix sync_with_distributed_lookup_table, test=develop (#19737) · 6a1db204
  Committed by tangwei12 on Sep 16, 2019
```
fix wrong place with distributed_lookup_table
```
06 Sep, 2019 1 commit
- Optimize fleet API: add input check for some interfaces (#18971) · a25a716e
  Committed by 123malin on Sep 06, 2019
```
* fleet api add input check, test=develop
```
28 Aug, 2019 2 commits

adapte fleet api for localsgd and support nccl comm configuration in executor (#19443) · 4ef6b845
Committed by Yi Liu on Aug 28, 2019
```
test=develop
```

Fix the correctness of async mode at distributed training (#18863) · 65c73684

Committed by tangwei12 on Aug 28, 2019

* fix correctness of the communicator

* fix a bug in send thread when sending var context is empty, test=develop

* add lookup_table_prefetch_op and prefetch optimize, test=develop

* remove remote prefetch GPU supported

* word2vec force with CPU, test=develop

* test dist remote lookup table force with CPU, test=develop


26 Aug, 2019 1 commit
- fix distribute transpiler GRPC error code 4, RPC Deadline (#18984) · 19dac67e
  Committed by tangwei12 on Aug 26, 2019
```
* fix sync mode hang in transpiler
* remove sync mode in send/recv
* replace PADDLE_ENFORCE with PADDLE_ENFORCE_NE
```
12 Aug, 2019 1 commit
- Polish fleet API to support cuda collective mode and nccl2 mode. (#18966) · 29d87812
  Committed by gongweibao on Aug 12, 2019
```
Polish fleet API to support cuda collective mode and nccl2 mode
```
11 Jul, 2019 1 commit
- Polish backwards optimizer dependency codes and use more default values. (#18255) · c0a82748
  Committed by gongweibao on Jul 11, 2019
27 Jun, 2019 1 commit

supports collective communicated training (#18175) · b7128bac

Committed by HaoRen on Jun 27, 2019

* fix prepare context redundant code problem, optimize executor by caching create_varaiables
test=develop

* supports collective training in executor

* make fetch_list runable with variables, add more unittest for use_program_cache
test=develop

* fix comment
test=develop

* use unique name for nccl_id

* supports output to stream in program_to_code

* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code

* set op role in collective training

* add collective op role

* remove orig file

* add build optimizer by strategy

* add collective strategy

* refine collective strategy

* add multi-process role maker

* refine strategy building factory so that we can easily plugin more strategy

* scale loss grad in collective sgd transpiler

* add support for distributed fc

* code format

* revert some features for dist fc

* add support for distributed fc training

* fix prepare context redundant code problem, optimize executor by caching create_varaiables
test=develop

* supports collective training in executor

* make fetch_list runable with variables, add more unittest for use_program_cache
test=develop

* use unique name for nccl_id

* supports output to stream in program_to_code

* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code

* set op role in collective training

* add collective op role

* fix comment
test=develop

* remove orig file

* add build optimizer by strategy

* add collective strategy

* refine collective strategy

* add multi-process role maker

* refine strategy building factory so that we can easily plugin more strategy

* scale loss grad in collective sgd transpiler

* add support for distributed fc

* code format

* revert some features for dist fc

* add support for distributed fc training

* test=develop
add collective op unittest standard

* test=develop
remove the test_collective directory

* test=develop
remove the test_collective directory

* remove slicegather test

* code format for reducescatter

* update attr of shard_index_op

* Modify macro nccl_helper

* remove test without distribute

* macro collective_helper

* marcro update

* test=develop
update support python3.5

* test=develop change gpu memory use to 0.1 when test

* test=develop
update ut equal func

* test=develop
set flags to 1.5

* test=develop fix pickle dumple  py35

* test=develop
fix divide in slice and add sync_comm_stream
update atol and rtol to 1e-05
rm shard_index op and test
modify read input from file to read from memory
remove origin_program in framework and add i/o in c_sync_calc_stream

* test=develop update unittest sync operator I/O


31 May, 2019 1 commit
- fix document of python api get_startup_program() (#17764) · 659b72a9
  Committed by tangwei12 on May 31, 2019
```
* add example to get_startup_program()
* fix example to get_startup_program()
```
30 May, 2019 1 commit
- fix distributed_transpiler.py api test=develop (#17668) · ac92e4c0
  Committed by yaoxuefeng on May 30, 2019
29 May, 2019 1 commit
- fix 2dconn test=develop (#17681) · 0d561ef4
  Committed by gongweibao on May 29, 2019
27 May, 2019 1 commit
- Add multi-ncclcomm and 2D ncclallreduce support. (#17263) · 65bbf950
  Committed by gongweibao on May 27, 2019
23 May, 2019 2 commits
- fix distribute doc test=develop (#17318) · 92e7d5d7
  Committed by Qiao Longfei on May 23, 2019
```
* fix distribute doc
```
- Async exe support communicator (#17386) · 58f7695a
  Committed by Qiao Longfei on May 23, 2019
```
Async exe support communicator
```
26 Apr, 2019 1 commit
- truncated_gaussian_random supported in distributed training, test=develop (#17091) · 7330cd63
  Committed by tangwei12 on Apr 26, 2019
25 Apr, 2019 1 commit
- Fleet unify distributed training (#16791) · 1a4a51db
  Committed by tangwei12 on Apr 25, 2019
```
* implement distributed transpiler with fleet
```
27 Mar, 2019 1 commit
- fix pylint · d640c6cf
  Committed by Qiao Longfei on Mar 27, 2019
25 Mar, 2019 1 commit
- fix trainer_id · 542b52fa
  Committed by Qiao Longfei on Mar 25, 2019
23 Mar, 2019 1 commit
- update transpiler and listen and serv op · de65398c
  Committed by Qiao Longfei on Mar 23, 2019
20 Feb, 2019 1 commit
- fix params with only 1 dim (#15828) · 971f3bc9
  Committed by tangwei12 on Feb 20, 2019
```
* fix params with only 1 dim
* test=develop
```
08 Feb, 2019 2 commits
- parameter recv can run · 8bda4ab2
  Committed by Qiao Longfei on Feb 08, 2019
- complete recv op · fbd186bd
  Committed by Qiao Longfei on Feb 08, 2019
06 Feb, 2019 1 commit
- complete parameter_send · 4356f186
  Committed by Qiao Longfei on Feb 06, 2019
30 Jan, 2019 1 commit

transpiler.py code clean (#15555) · 90df7ff3

Committed by tangwei12 on Jan 30, 2019

* move var strusted to vars_distributed.py, add optimizer's block name, test=develop

* rename optimzier's seems complex, revert it, test=develop

* replace * with details, test=develop


24 Jan, 2019 1 commit
- fix tangwei merge issue test=develop (#15506) · 22db82c0
  Committed by Wu Yi on Jan 24, 2019
23 Jan, 2019 1 commit
- checkpoint at distributed training (#14854) · 8b50ad80
  Committed by tangwei12 on Jan 23, 2019
```
checkpoint for distributed training.
```
08 Jan, 2019 1 commit
- fix style test=develop · 810439a9
  Committed by Qiao Longfei on Jan 08, 2019
28 Dec, 2018 1 commit
- fix dist sparse l2 decay · 49cce3fd
  Committed by Qiao Longfei on Dec 28, 2018
```
test=develop
```
27 Dec, 2018 1 commit
- en api improve format Dec 27 · 66ea7184
  Committed by haowang101779990 on Dec 26, 2018
```
test=develop
```
18 Dec, 2018 1 commit
- add test transpiler dist test, test=develop · b2f789c6
  Committed by JiabinYang on Dec 18, 2018
07 Dec, 2018 2 commits
- Add reduce sparse tensor feature. (#14757) · f1fb64b1
  Committed by gongweibao on Dec 07, 2018
- add prefetch and remvoe selectedrows of bias · b653ed05
  Committed by tangwei12 on Dec 07, 2018

PaddlePaddle / Paddle: last synced successfully more than 1 year ago
