Commits · 5d902954d35b6910561bd696486ffd5f1d43369c · PaddlePaddle / Paddle

26 Dec, 2021 3 commits
- improve forward performance (#38279) · acef85b2
  Committed by Zhang Ting on Dec 26, 2021
  
  acef85b2
- Fix renorm op include error and format error (#38451) · e6c3f64f
  Committed by Chen Weihang on Dec 25, 2021
```
* remove needless header

* remove needless header

* adjust header order
```
  e6c3f64f
- [Unify Tensors PR #2] Replaced pten::LoD with paddle::framework::LoD (#38275) · bbe879fc
  Committed by Zhanlue Yang on Dec 26, 2021
```
* Replaced pten::LoD with paddle::framework::LoD

* Overrode CPUVector with CUDAVector

* Refactored paddle::framework::Vector
```
  bbe879fc
24 Dec, 2021 8 commits

Committed by seemingwang on Dec 24, 2021

* graph engine demo

* upload unsaved changes

* fix dependency error

* fix shard_num problem

* py client

* remove lock and graph-type

* add load direct graph

* add load direct graph

* add load direct graph

* batch random_sample

* batch_sample_k

* fix num_nodes size

* batch brpc

* batch brpc

* add test

* add test

* add load_nodes; change add_node function

* change sample return type to pair

* resolve conflict

* resolved conflict

* resolved conflict

* separate server and client

* merge pair type

* fix

* resolved conflict

* fixed segment fault; high-level VLOG for load edges and load nodes

* random_sample return 0

* rm useless loop

* test:load edge

* fix ret -1

* test: rm sample

* rm sample

* random_sample return future

* random_sample return int

* test fake node

* fixed here

* memory leak

* remove test code

* fix return problem

* add common_graph_table

* random sample node &test & change data-structure from linkedList to vector

* add common_graph_table

* sample with srand

* add node_types

* optimize nodes sample

* recover test

* random sample

* destruct weighted sampler

* GraphEdgeBlob

* WeightedGraphEdgeBlob to GraphEdgeBlob

* WeightedGraphEdgeBlob to GraphEdgeBlob

* pybind sample nodes api

* pull nodes with step

* fixed pull_graph_list bug; add test for pull_graph_list by step

* add graph table;name

* add graph table;name

* add pybind

* add pybind

* add FeatureNode

* add FeatureNode

* add FeatureNode Serialize

* add FeatureNode Serialize

* get_feat_node

* avoid local rpc

* fix get_node_feat

* fix get_node_feat

* remove log

* get_node_feat return  py:bytes

* merge develop with graph_engine

* fix threadpool.h head

* fix

* fix typo

* resolve conflict

* fix conflict

* recover lost content

* fix pybind of FeatureNode

* recover cmake

* recover tools

* resolve conflict

* resolve linking problem

* code style

* change test_server port

* fix code problems

* remove shard_num config

* remove redundant threads

* optimize start server

* remove logs

* fix code problems by reviewers' suggestions

* move graph files into a folder

* code style change

* remove graph operations from base table

* optimize get_feat function of graph engine

* fix long long count problem

* remove redundant graph files

* remove unused shell

* recover dropout_op_pass.h

* fix potential stack overflow when request number is too large & node add & node clear & node remove

* when sample k is larger than neighbor num, return directly

* using random seed generator of paddle to speed up

* fix bug of random sample k

* fix code style

* fix code style

* add remove graph to fleet_py.cc

* fix blocking_queue problem

* fix style

* fix

* recover capacity check

* add remove graph node; add set_feature

* add remove graph node; add set_feature

* add remove graph node; add set_feature

* add remove graph node; add set_feature

* fix distributed op combining problems

* optimize

* remove logs

* fix MultiSlotDataGenerator error

* cache for graph engine

* fix type compare error

* more test&fix thread terminating problem

* remove header

* change time interval of shrink

* use cache when sample nodes

* remove unused function

* change unique_ptr to shared_ptr

* simplify cache template

* cache api on client

* fix

* reduce sample threads when cache is not used

* reduce cache memory

* cache optimization

* remove test function

* remove extra fetch function

* graph-engine data transfer optimization

* support graph_split load&query

* remove logs

* change shards to pointer vector

* use inference

* remove test code

* renorm op

* simplify renorm op

* recover local changes

* recover renorm op kernel

* fix init

* add blanklines in renorm doc

* fix import

* fix import
Co-authored-by: Huang Zhengjie <270018958@qq.com>
Co-authored-by: Weiyue Su <weiyue.su@gmail.com>
Co-authored-by: suweiyue <suweiyue@baidu.com>
Co-authored-by: luobin06 <luobin06@baidu.com>
Co-authored-by: liweibin02 <liweibin02@baidu.com>
Co-authored-by: tangwei12 <tangwei12@baidu.com>

6982871d
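
The note "when sample k is larger than neighbor num, return directly" in the commit above describes an early exit in neighbor sampling. A minimal sketch of that idea, assuming a plain list of neighbor ids; `sample_neighbors` is a hypothetical helper written for illustration, not the graph engine's actual API:

```python
import random

def sample_neighbors(neighbors, k, rng=random):
    """Return up to k distinct neighbors of a node (hypothetical helper)."""
    # Early exit: if k covers the whole neighbor list, skip the random
    # sampling step and return every neighbor directly.
    if k >= len(neighbors):
        return list(neighbors)
    # Otherwise draw k distinct neighbors without replacement.
    return rng.sample(neighbors, k)

neighbors = [3, 7, 11, 42]
print(sample_neighbors(neighbors, 10))                 # [3, 7, 11, 42]
print(len(sample_neighbors(neighbors, 2, random.Random(0))))  # 2
```

Passing a seeded `random.Random` instance keeps the sketch reproducible in tests.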

[AMP] Add multi_precision for sgd (#38231) · a4d07bb9
Committed by zhangbo9674 on Dec 24, 2021

a4d07bb9
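
Multi-precision updates are commonly understood as keeping a higher-precision master copy of each FP16 parameter and applying the optimizer step to that copy. The sketch below shows why this matters for SGD; it is an illustration under assumptions, not Paddle's kernel. The helper names are hypothetical, and Python floats (FP64) stand in for the FP32 master weights.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def sgd_fp16_only(param, grad, lr, steps):
    # The parameter is stored and updated in FP16 only.
    for _ in range(steps):
        param = to_fp16(param - to_fp16(lr * grad))
    return param

def sgd_multi_precision(param, grad, lr, steps):
    # A higher-precision master copy receives the updates; the FP16
    # parameter is just a rounded view of the master copy.
    master = float(param)
    for _ in range(steps):
        master -= lr * grad
    return to_fp16(master)

print(sgd_fp16_only(1.0, 1.0, 1e-4, 10))        # 1.0
print(sgd_multi_precision(1.0, 1.0, 1e-4, 10))  # 0.9990234375
```

Near 1.0 the FP16 spacing is about 4.9e-4, so a step of 1e-4 rounds away entirely in the FP16-only loop, while the master copy accumulates all ten steps.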

[pten] combine reduce_cuda codes (#38328) · 08941eda

Committed by chentianyu03 on Dec 24, 2021

* combine reduce_cuda codes

* support float16 in pten reduce_mean

* replace ReduceCudaKernel impl with pten reduce impl

* mv reduce funcs into reduce_cuda_impl

* rm unused codes and headers

* mv GetReduceDim into reduce_cuda_impl

* recover GetReduceDim in reduce_op.h

* add new dispatch macro

* fix pool op output not inited and cause transform to pten::denseTensor error

* fix output tensor not initialized error

* rename new dispatch macro and format code style

* rm reduce_functor_op.h file

08941eda

add new API/OP:paddle.Tensor.exponential_ (#38256) · 33185000
Committed by zhouweiwei2014 on Dec 24, 2021
```
* add new API/OP:paddle.Tensor.exponential_

* fix CI
```
33185000
[MLU]add mlu op interface (#38241) · c396ee65
Committed by 努力努力在努力丶 on Dec 24, 2021
```
* [MLU]add mlu op interface

* [MLU]fix alpha of activation op
```
c396ee65
add pull gpups sparse op (#37124) · 572b3e90
Committed by yaoxuefeng on Dec 24, 2021
```
 add pull gpups sparse op
```
572b3e90
Add new API cholesky_solve (#38167) · 39f7c41f
Committed by zhiboniu on Dec 24, 2021

39f7c41f
add new API/OP: paddle.poisson (#38117) · bcf86e5c
Committed by zhouweiwei2014 on Dec 24, 2021
```
* add new API/OP:paddle.poisson

* fix comment
```
bcf86e5c

23 Dec, 2021 5 commits
- move conj kernel impl (#38365) · 8da9eff4
  Committed by Chen Weihang on Dec 23, 2021
  
  8da9eff4
- Make GetBlob assume elements are cached (#38336) · 7da5368d
  Committed by Jacek Czaja on Dec 23, 2021
```
* First set of fixes

* - Make GetBlob more likely to find a blob

* - Lint
```
  7da5368d
- Add erfinv API (#38295) · 6b59b58c
  Committed by wuhuanzhou on Dec 23, 2021
```
* add erfinv API, test=develop

* fix gradient accuracy error, test=develop

* fix cuda compilation error on Windows, test=develop

* fix M_2_SQRTPI undeclared identifier on Windows, test=develop
```
  6b59b58c
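
For context on the `M_2_SQRTPI` note above: that C math constant is 2/sqrt(pi), which appears in the derivative of the inverse error function, d/dy erfinv(y) = (sqrt(pi)/2) * exp(erfinv(y)^2). The sketch below checks that identity with the Python standard library. It is an illustration only, not the Paddle kernel; `erfinv` and `erfinv_grad` are hypothetical helper names, and whether the kernel uses the constant in exactly this way is an assumption.

```python
import math
from statistics import NormalDist

def erfinv(y):
    """Inverse error function for -1 < y < 1, via the normal quantile."""
    # erf(x) = 2 * Phi(x * sqrt(2)) - 1, so x = Phi^-1((y + 1) / 2) / sqrt(2).
    return NormalDist().inv_cdf((y + 1.0) / 2.0) / math.sqrt(2.0)

def erfinv_grad(y):
    # d/dy erfinv(y) = sqrt(pi) / 2 * exp(erfinv(y)^2).
    # M_2_SQRTPI is 2 / sqrt(pi), so this equals exp(...) / M_2_SQRTPI.
    m_2_sqrtpi = 2.0 / math.sqrt(math.pi)
    return math.exp(erfinv(y) ** 2) / m_2_sqrtpi

y = 0.3
h = 1e-6
numeric = (erfinv(y + h) - erfinv(y - h)) / (2 * h)  # central difference
print(abs(math.erf(erfinv(y)) - y) < 1e-9)           # round trip holds
print(abs(numeric - erfinv_grad(y)) < 1e-6)          # gradient matches
```
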
- 【PTen】Add empty and empty_like kernel in pten (#38334) · 4221cd33
  Committed by zyfncg on Dec 23, 2021
```
* add empty and empty_like kernel in pten

* add empty dev_api
```
  4221cd33
- move sign kernel impl (#38363) · bb38b6aa
  Committed by Chen Weihang on Dec 22, 2021
  
  bb38b6aa
22 Dec, 2021 3 commits
- use elementwise to optimize gelu backward implementation on GPU (#38263) · 858e4358
  Committed by crystal on Dec 22, 2021
```
* optimize gelu backward

* optimize gelu backward

* optimize code

* Number to expression

* Replacement number
```
  858e4358
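
GELU is a per-element function, which is what makes an elementwise kernel applicable to both its forward and backward passes. The sketch below writes out the exact (erf-based) forward formula and its derivative as scalar functions. It assumes the non-approximate form of GELU and is not the GPU functor from the commit; the function names are illustrative.

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x, dout=1.0):
    # d/dx [x * Phi(x)] = Phi(x) + x * phi(x), phi = standard normal PDF.
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return dout * (cdf + x * pdf)

# Applying the scalar functions independently to each element is the
# "elementwise" pattern: no element depends on any other.
xs = [-2.0, -0.5, 0.0, 1.0, 3.0]
ys = [gelu(x) for x in xs]
grads = [gelu_grad(x) for x in xs]
print(ys[2], grads[2])  # 0.0 0.5
```
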
- [PTen]Move flatten kernel to new directory (#38255) · 4d1ce184
  Committed by YuanRisheng on Dec 22, 2021
```
* move flatten

* fix bugs of test

* modify header file

* add copy declare

* fix compile bugs
```
  4d1ce184
- Add nearest_interp/v2 int8 and uint8 support (#37985) · 56e2a6a6
  Committed by joanna.wozna.intel on Dec 22, 2021
  
  56e2a6a6
21 Dec, 2021 4 commits
- [PTen] Rename cuda dir and context to gpu (#38296) · dc7597e3
  Committed by Chen Weihang on Dec 21, 2021
```
* rename cuda to gpu

* revert CMake change

* resolve conflict

* rename other cuda to gpu

* polish details
```
  dc7597e3
- use elementwise to optimize gelu forward implementation on GPU (#38188) · aff43684
  Committed by crystal on Dec 21, 2021
```
* relu forward opt

* add gelu functor

* optimize code
```
  aff43684
- Fix for wrong conditions between forward and backward in elementwise_add_grad op (#38176) · d9780a22
  Committed by arlesniak on Dec 21, 2021
  
  d9780a22
- Support FP16 mean (#38289) · 643a268e
  Committed by sneaxiy on Dec 21, 2021
```
* mean first version

* fix scalar mean

* add fp16 dtype for api
```
  643a268e
20 Dec, 2021 9 commits
- [pten]add pten conj kernel (#38247) · a2793e5e
  Committed by chentianyu03 on Dec 20, 2021
```
* add pten conj kernel

* modify conj_kernel file path

* add defined cuda macro to cuda/conj_kernel.h
```
  a2793e5e
- add gelu pbtxt for conv+gelu mkldnn fuse pass (#38162) · 1b7f6ae9
  Committed by baoachun on Dec 20, 2021
  
  1b7f6ae9
- [MLU]add mlu backend (#38207) · 76514a1f
  Committed by fwenguang on Dec 20, 2021
  
  76514a1f
- Support FP16 for more ops (#38123) · 1f445bf3
  Committed by sneaxiy on Dec 20, 2021
```
* support FP16 for more ops

* add amp list tests

* refine reduce_mean_grad

* fix OP benchmark ci

* fix fp16 reduce_mean

* update ut, but still have some problems

* remove mean/reduce_mean fp16 kernel
```
  1f445bf3
- optimize softmax with cross entropy soft label (#32387) · f8955602
  Committed by Feng Xing on Dec 20, 2021
```
softmax_with_cross_entropy optimization with soft label. This PR includes optimization of
    "SoftmaxWithCrossEntropySoftLabel" : compute log_softmax and then compute loss.
    "CrossEntropySoftLabel" : compute loss with softmax as input.
These optimizations include the following techniques:
    read data to buffer with vectorization
    compute max and sum in warp
    fixed loop size with macro
Performance (computation time):
    softmax_with_cross_entropy_0 (forward) : -40.1%
    softmax_with_cross_entropy_0 (backward): -41%
```
  f8955602
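
The two computations named in the commit description above can be written out in scalar form. This is a reference sketch of the math only (log_softmax followed by the soft-label loss, and the loss taken from precomputed softmax probabilities); it does not reproduce the vectorized reads or warp-level reductions of the CUDA kernels. The function names are illustrative.

```python
import math

def log_softmax(logits):
    # Subtract the max first so exp() cannot overflow.
    m = max(logits)
    log_sum = math.log(sum(math.exp(v - m) for v in logits))
    return [v - m - log_sum for v in logits]

def softmax_with_cross_entropy_soft_label(logits, soft_label):
    # loss = -sum_i label_i * log_softmax(logits)_i
    lsm = log_softmax(logits)
    return -sum(p * l for p, l in zip(soft_label, lsm))

def cross_entropy_soft_label(softmax, soft_label):
    # Same loss when the softmax probabilities are already available.
    return -sum(p * math.log(q) for p, q in zip(soft_label, softmax))

logits = [1.0, 2.0, 3.0]
label = [0.1, 0.2, 0.7]  # soft label: a distribution over classes
probs = [math.exp(v) for v in log_softmax(logits)]
loss_a = softmax_with_cross_entropy_soft_label(logits, label)
loss_b = cross_entropy_soft_label(probs, label)
print(abs(loss_a - loss_b) < 1e-12)  # both paths agree
```
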
- changes the call AllocShared to Alloc, test=develop (#38258) · bb0713b2
  Committed by 石晓伟 on Dec 20, 2021
  
  bb0713b2
- fix typos in header inclusion in complex_op.cc (#38272) · 2635cc86
  Committed by Feiyu Chan on Dec 20, 2021
  
  2635cc86
- fix use of implicitly deleted constructor (#38225) · 23d9e947
  Committed by Sylwester Fraczek on Dec 20, 2021
  
  23d9e947
- Fix bug where a copy occurs when tensor "in" and tensor "out" are the same in reshape kernel (#38249) · a615002a
  Committed by YuanRisheng on Dec 20, 2021
```
* fix bugs when run reshape

* fix ci bug
```
  a615002a
18 Dec, 2021 3 commits
- [pnorm] fix bug in pnorm (#38215) · 9e42fe9a
  Committed by Noel on Dec 18, 2021
  
  9e42fe9a
- fix seed for class_center_sample using paddle.seed (#38248) · 59be8e0e
  Committed by Guoxia Wang on Dec 18, 2021
  
  59be8e0e
- add complex op (#37918) · 31e874b1
  Committed by Feiyu Chan on Dec 18, 2021
```
* add complex op and `paddle.complex`.
```
  31e874b1
17 Dec, 2021 5 commits

Refine some AMP operators for BERT (#37923) · d80fe268

Committed by sneaxiy on Dec 17, 2021

* support multi precision update for LAMB

* hide some api

* fix ci uts

* fix lamb output of dygraph

* remove some changes to some PR

* try to fix Py3 CI compile error

* fix test_imperative_optimizer, add lars ut, add layer_norm ut

* fix ut, fix format

* fix ut

* fix windows ci

d80fe268

[pten] modify reduce_sum reduce_mean args (#38216) · eaa2363e

Committed by chentianyu03 on Dec 17, 2021

* modify sum mean args

* add GetExpectedPtenKernelArgs for reduce_op

* modify kernel args number

* modify kernel args number

eaa2363e

add op/api repeat/interleave (#37981) · a7de0e66
Committed by kuizhiqing on Dec 17, 2021

a7de0e66

add launch bounds to limit register usage for Volta architecture (#38113) · 18a59822

Committed by zlsh80826 on Dec 17, 2021

According to --ptxas-options=-v, SegmentOpsKernel uses 66 registers per thread.
There are two ways to resolve this problem:

* Reduce the threads-per-block launch configuration.

* Add __launch_bounds__ so the nvcc compiler has the information it needs to reduce register usage.

This PR chooses the __launch_bounds__ solution because changing gpu_launch_config may affect other ops.

18a59822
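
The register arithmetic behind the commit above can be worked through as follows. The figures other than 66 are assumptions not stated in the commit message: a launch configuration of 1024 threads per block, a limit of 65536 32-bit registers per block, and the 66 registers being a per-thread count. The sketch also ignores register allocation granularity.

```python
# Assumptions (not from the commit message): 1024 threads per block,
# 65536 registers available per block, and 66 registers per thread.
REGISTER_LIMIT_PER_BLOCK = 65536
REGS_PER_THREAD = 66

def registers_needed(regs_per_thread, threads_per_block):
    # Total registers one block requires at launch.
    return regs_per_thread * threads_per_block

def block_fits(regs_per_thread, threads_per_block,
               limit=REGISTER_LIMIT_PER_BLOCK):
    return registers_needed(regs_per_thread, threads_per_block) <= limit

def max_regs_per_thread(threads_per_block, limit=REGISTER_LIMIT_PER_BLOCK):
    # Per-thread register budget the compiler must stay within.
    return limit // threads_per_block

print(registers_needed(REGS_PER_THREAD, 1024))  # 67584
print(block_fits(REGS_PER_THREAD, 1024))        # False
print(block_fits(REGS_PER_THREAD, 512))         # True
print(max_regs_per_thread(1024))                # 64
```

Under these assumptions, halving the block size makes the launch fit (the first option), while telling the compiler the maximum block size lets it cap the kernel at 64 registers per thread (the second option, which the PR chose).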

Delete cub_reduce.h and change TensorReduce to TensorReduceFunctorImpl (#38197) · 9a8a4c77
Committed by niuliling123 on Dec 17, 2021

9a8a4c77
