Commits · 2e40cfb5c36df9de5ced0b82856de7ba32ec16fa · PaddlePaddle / Paddle

28 Oct, 2021 3 commits
- W
  save/load in ps runtime(the_one_ps) (#36097) · e7842ba6
  Committed by wangguanqun on Oct 28, 2021
```
* add trainer desc config to distributed strategy

* code style modified

* data_feed set lod

* fix bug

* code style

* fix bug

* save load

* save load

* save unittest

* add unittest of the_one_ps

* unittest

* add todo in communicator sendsparse
```
  e7842ba6
- S
  
  fix MultiSlotDataGenerator error (#36773) · dc0178ef
  Committed by seemingwang on Oct 28, 2021
  
  dc0178ef
- B
  
  Add lazy distributed launch with rank mapping (#36570) · 7de3f81c
  Committed by Bo Liu on Oct 28, 2021
  
  7de3f81c
27 Oct, 2021 2 commits
- J
  [Auto Parallel] Completion Dist Attribute for Backward & Update stage (#36744) · 5e9845b8
  Committed by JZ-LIANG on Oct 27, 2021
```
* revise completion for backward

* revise completion for update

* revise completion for update

* update unitest
```
  5e9845b8
- X
  bugfix: only check backend when mode == Collecive (#36758) · e6253152
  Committed by xiongkun on Oct 27, 2021
```
* bugfix: only check backend when mode == Collecive

* fix bug
```
  e6253152
25 Oct, 2021 1 commit
- H
  [HybridParallel]fix bug of check_inf in fleet_base.py (#36651) · 59d8b8cb
  Committed by Haohongxiang on Oct 25, 2021
```
* fix bug of check_inf

* fix allreduce
```
  59d8b8cb
21 Oct, 2021 2 commits
- D
  
  fix hdfs download_dir (#36590) · 66f4b292
  Committed by danleifeng on Oct 21, 2021
  
  66f4b292
- X
  
  User specified backend (#35745) · b6e7f8e9
  Committed by xiongkun on Oct 21, 2021
  
  b6e7f8e9
20 Oct, 2021 3 commits

H
fix bugs of ClipGradByGlobalNorm in HybridParallel (#36555) · 6a3941e3
Committed by Haohongxiang on Oct 20, 2021
```
* fix bugs of ClipGradByGlobalNorm

* add unittests

* add unittests
```
6a3941e3
李
Fix global gather and global scatter operators (#36517) · 17b4dd70
Committed by 李季 on Oct 20, 2021
```
* fix global gather and global scatter operators
```
17b4dd70

[Auto Parallel] Generalization for Partition and Completion (#35735) · 797bd40d

Committed by JZ-LIANG on Oct 20, 2021

* default dist op

* add dist_attr for dist op

* add unitest

* update inputname

* update function name

* add unitest

* update CMakeLists.txt for CI

* fix dis_matmul

* fix compile error

* update matmul to matmul_v2

* unify api

* unify api

* todo

* update distop forward func

* update distop forward func

* auto parallel backward

* update dist op

* autoparallel backward

* add backward for embedding

* temp1

* temp2

* temp3

* temp4

* backward done1

* backward done2

* backward done3

* dist embedding remove mp mode

* dist matmul remove mp mode

* update dist embedding

* dist op init1

* dist op init 2

* update unitest

* context remove parallel mode

* partitioner remove parallel mode

* update unitest

* a more general method to support varying mesh in pipeline parallel

* support varying mesh in pipeline parallel

* embedding support varying mesh in pipeline parallel

* matmul support varying mesh in pipeline parallel

* default dist op support varying mesh in pipeline parallel

* dist attribute for startup program

* default dist op support varying mesh in pipeline parallel 2

* partitioner support varying mesh in pipeline parallel

* revise logic for auto completion

* revise framework.py

* revise reshard unitest

* revise unitest for parallelize

* chmod

* fixed bug for dist embedding name mapping
Co-authored-by: zhaoyingli <zhaoyingli@baidu.com>

797bd40d

19 Oct, 2021 3 commits
- D
  
  [heterps]edit shrink and unseenday logit for pslib (#36194) · 9e494472
  Committed by danleifeng on Oct 19, 2021
  
  9e494472
- W
  
  [hybrid] static model parallel dropout support deterministic RandomSeedGenerator (#36228) · 8cc8e411
  Committed by WangXi on Oct 19, 2021
  
  8cc8e411
- Y
  Add auto parallel cost model and unittests (#36363) · a573a7ed
  Committed by YipZLF on Oct 19, 2021
```
* Add auto parallel cost model and unittests

* Fixed code styles.

* Fixed bugs and codes style

* fixed typo

* Improved code style: object encapsulation.

* Fixed codes.

* Refactored estimate_cost

* Fixed typo
```
  a573a7ed
18 Oct, 2021 1 commit

[HybridParallel]Support fp16 in dygraph hybrid parallel (#36420) · 10f0a0f6

Committed by Haohongxiang on Oct 18, 2021

* [HybridParallel]Support fp16 in dygraph hybrid parallel

* update

* update

* update for recompute

* add unittest of pp+fp16

* add unittest of recompute+fp16

* update

* modify ut

10f0a0f6

15 Oct, 2021 1 commit
- D
  
  fix opt-offload save bug (#36433) · e703a2ed
  Committed by duanboqiang on Oct 15, 2021
  
  e703a2ed
14 Oct, 2021 3 commits
- D
  
  optimize-offload support adamw op type (#36432) · 66c58fa3
  Committed by duanboqiang on Oct 14, 2021
  
  66c58fa3
- S
  [HybridParallel]Rebuild code for pipeline (#36396) · 8ffcc7c8
  Committed by ShenLiang on Oct 14, 2021
```
* add no_sync for parameters sync

* add pipeline for moe
```
  8ffcc7c8
- Y
  
  [hybrid enhance] add flag to control the avg position for grad merge under pipeline mode (#36384) · 03d8304f
  Committed by Yuang Liu on Oct 14, 2021
  
  03d8304f
13 Oct, 2021 3 commits
- G
  
  support auto parallel data shard (#36055) · 85bb1a85
  Committed by Guoxia Wang on Oct 13, 2021
  
  85bb1a85
- C
  
  fix pp comm init bug (#36377) · 817f9ef0
  Committed by caozhou on Oct 13, 2021
  
  817f9ef0
- L
  [Amp] refine code of amp level (#36362) · 59e425cd
  Committed by Leo Chen on Oct 13, 2021
```
* refine amp level

* fix typo

* update tracer._amp_level
```
  59e425cd
12 Oct, 2021 1 commit

fix bugs in mp_layers, pp_layers and HybridParallelClipGrad (#36144) · d247cf17

Committed by Haohongxiang on Oct 12, 2021

* fix calling bug of HybridParallelClipGrad

* fix bugs of HybridParallelClipGrad

* add unittest of pp with HybridParallelClipGrad

* fix bugs in mp_layers.py

* update

* fix bugs in pp_layers.py

* update

d247cf17

11 Oct, 2021 3 commits
- D
  [heterps] add fuse_allreduce (#35131) · e5b4dd73
  Committed by danleifeng on Oct 11, 2021
```
* heterps:add fuse_allreduce op; test=develop
* add program_mode in minimize for pslib mode;test=develop
```
  e5b4dd73
- C
  add reshard module (#35779) · c38b0488
  Committed by caozhou on Oct 11, 2021
```
* add reshard module

* fix conflict

* update reshard module

* update and add unitest

* update reshard module and unitest

* add more unitests
```
  c38b0488
- 李
  
  fix the hidden method in paddle.distributed.utils file (#36210) · ea76457c
  Committed by 李季 on Oct 11, 2021
  
  ea76457c
09 Oct, 2021 1 commit
- Z
  support ClipGradByGlobalNorm in sharding (#36012) · 623df429
  Committed by zhaoyingli on Oct 09, 2021
```
* support ClipGradByGlobalNorm in sharding

* support ClipGradByGlobalNorm in sharding

* test=allcase
```
  623df429
08 Oct, 2021 1 commit
- Y
  
  add fs list_files_info (#36224) · ca16e8fd
  Committed by yaoxuefeng on Oct 08, 2021
  
  ca16e8fd
07 Oct, 2021 1 commit
- H
  fix bugs in HybridParallelClipGrad of hybrid_parallel_optimizer (#36237) · 730dcaf4
  Committed by Haohongxiang on Oct 07, 2021
```
* fix bugs in HybridParallelClipGrad of hybrid_parallel_optimizer

* update

* update
```
  730dcaf4
30 Sep, 2021 1 commit

李

Fix raw optim (#36176) · 5e0f199a

Committed by 李季 on Sep 30, 2021

* fix raw optim

* pre-commit test file
Co-authored-by: sneaxiy <sneaxiy@126.com>

5e0f199a

29 Sep, 2021 1 commit
- W
  
  [hybrid] Fix model parallel non-distributed param broadcast (#36186) · bec9fc9a
  Committed by WangXi on Sep 29, 2021
  
  bec9fc9a
28 Sep, 2021 1 commit
- W
  
  [hybrid] optimizer sharding support optimize cast (#35878) · eef0a943
  Committed by WangXi on Sep 28, 2021
  
  eef0a943
24 Sep, 2021 2 commits

S

add update (#36017) · 1691dc7a
Committed by ShenLiang on Sep 24, 2021

1691dc7a

fix distributed ops combining problems (#35942) · 4c35f515

Committed by seemingwang on Sep 24, 2021

* graph engine demo

* upload unsaved changes

* fix dependency error

* fix shard_num problem

* py client

* remove lock and graph-type

* add load direct graph

* add load direct graph

* add load direct graph

* batch random_sample

* batch_sample_k

* fix num_nodes size

* batch brpc

* batch brpc

* add test

* add test

* add load_nodes; change add_node function

* change sample return type to pair

* resolve conflict

* resolved conflict

* resolved conflict

* separate server and client

* merge pair type

* fix

* resolved conflict

* fixed segment fault; high-level VLOG for load edges and load nodes

* random_sample return 0

* rm useless loop

* test:load edge

* fix ret -1

* test: rm sample

* rm sample

* random_sample return future

* random_sample return int

* test fake node

* fixed here

* memory leak

* remove test code

* fix return problem

* add common_graph_table

* random sample node &test & change data-structure from linkedList to vector

* add common_graph_table

* sample with srand

* add node_types

* optimize nodes sample

* recover test

* random sample

* destruct weighted sampler

* GraphEdgeBlob

* WeightedGraphEdgeBlob to GraphEdgeBlob

* WeightedGraphEdgeBlob to GraphEdgeBlob

* pybind sample nodes api

* pull nodes with step

* fixed pull_graph_list bug; add test for pull_graph_list by step

* add graph table;name

* add graph table;name

* add pybind

* add pybind

* add FeatureNode

* add FeatureNode

* add FeatureNode Serialize

* add FeatureNode Serialize

* get_feat_node

* avoid local rpc

* fix get_node_feat

* fix get_node_feat

* remove log

* get_node_feat return  py:bytes

* merge develop with graph_engine

* fix threadpool.h head

* fix

* fix typo

* resolve conflict

* fix conflict

* recover lost content

* fix pybind of FeatureNode

* recover cmake

* recover tools

* resolve conflict

* resolve linking problem

* code style

* change test_server port

* fix code problems

* remove shard_num config

* remove redundant threads

* optimize start server

* remove logs

* fix code problems by reviewers' suggestions

* move graph files into a folder

* code style change

* remove graph operations from base table

* optimize get_feat function of graph engine

* fix long long count problem

* remove redundant graph files

* remove unused shell

* recover dropout_op_pass.h

* fix potential stack overflow when request number is too large & node add & node clear & node remove

* when sample k is larger than neighbor num, return directly

* using random seed generator of paddle to speed up

* fix bug of random sample k

* fix code style

* fix code style

* add remove graph to fleet_py.cc

* fix blocking_queue problem

* fix style

* fix

* recover capacity check

* add remove graph node; add set_feature

* add remove graph node; add set_feature

* add remove graph node; add set_feature

* add remove graph node; add set_feature

* fix distributed op combining problems

* optimize

* remove logs
Co-authored-by: Huang Zhengjie <270018958@qq.com>
Co-authored-by: Weiyue Su <weiyue.su@gmail.com>
Co-authored-by: suweiyue <suweiyue@baidu.com>
Co-authored-by: luobin06 <luobin06@baidu.com>
Co-authored-by: liweibin02 <liweibin02@baidu.com>
Co-authored-by: tangwei12 <tangwei12@baidu.com>

4c35f515

18 Sep, 2021 2 commits
- W
  
  [hybird] fix pipeline section program Parameter (#35847) · 67c63639
  Committed by WangXi on Sep 18, 2021
  
  67c63639
- G
  fix bug of module 'paddle' has no attribute 'distributed' for python3.6 (#35848) · d4cd2590
  Committed by Guoxia Wang on Sep 18, 2021
```
* fix bug
```
  d4cd2590
17 Sep, 2021 3 commits

[AMP] Support pure fp16 training mode for dygraph (#35521) · adaeee4d

Committed by zhangbo9674 on Sep 17, 2021

* add pure fp16 major function in auto_cast & tracer

* support master weight in dygraph for pure fp16

* check mix dtype of fp16&fp32 for check_finite_and_unscale op

* change pure fp16 function name

* refine some bug in auto_cast

* refine auto_cast interface logic

* add param _casted_by_pure_fp16 for class Layer

* support state_dict hook for save model by user appointed dtype in pure_fp16_decorator

* refine pure_fp16_decorator as decorator

* add unittest

* add comment

* add comment

* support recompute

* add comment for auto_cast and decorator

* support to_static_state_dict for paddle.jit.save

* unlimite models num and optimizers num

* add lookup_table in black_list

* fix momentum and layer state_dict

* fix bug in layer state_dict

* fix bug in layer state_dict_helper

* refine unittest

* refine test_momentun_op

* refine interface and some code

* refine amp_decorator interface

* refine pure fp16 interface

* refine master weight interface

adaeee4d

G

test=document_fix (#35824) · 177bf52f
Committed by Guoxia Wang on Sep 17, 2021

177bf52f
G
add launch doc (#35634) · 5548061b
Committed by Guoxia Wang on Sep 17, 2021
```
* add launch doc
```
5548061b

16 Sep, 2021 1 commit
- Y
  
  [hybrid] Fix mp multi gradient clip prob (#35713) · a4eadd15
  Committed by Yuang Liu on Sep 16, 2021
  
  a4eadd15

PaddlePaddle / Paddle: last synced successfully about 1 year ago
