1. 11 1月, 2021 10 次提交
    • P
      [Cherry-pick PR 29913], add View(reuse allocation) strategy on squeeze,... · 7c943a65
      pangyoki 提交于
      [Cherry-pick PR 29913], add View(reuse allocation) strategy on squeeze, unsqueeze, reshape, flatten op (#29913) (#30258)
      
      * add view strategy on squeeze,unsqueeze,reshape,flatten
      
      * add squeeze unittest
      
      * add unittests
      
      * use View strategy as name rather than Reuse Allacation
      
      * fix view api doc
      
      * fix format
      
      * use core.ops when input of reshape2 is Tensor
      
      * fix test_cross_entropy_loss error because of reshape2
      
      * delete selected_rows
      
      * change op_function
      
      * little change
      
      * solve HandleViewBetweenInputAndOutput
      7c943a65
    • Z
      [cherry-pick]add cast cuda kernel (#29352) #30263 · afbc6367
      Zhang Ting 提交于
       add cast cuda kernel
      
      cherry-pick #29352
      afbc6367
    • W
      [cherry-pick]add support for place string representation #30264 · fb66355e
      wangchaochaohu 提交于
      cherry-pick #28769, add support for place string representation 
      fb66355e
    • W
      [cherry-pick]Elementwise add grad GPU kernel optimization (#30276) · e59524f8
      wangchaochaohu 提交于
      * elementwise_add_grad Op optimization  (#29575)
      
      * optimize for long width for elementwise (#29602)
      
      * refine (#29622)
      
      * delete the code for fp16 optimization because it is not faster than common template code (#29715)
      
      * fix the shape choose of vectorize for cuda
      
      * optimization for fp16 elementwise add (#29744)
      
      * Fix the compiler error for half type (#29799)
      
      * refine the compiler error for half2 operation (#29816)
      
      * fix the compiler error when gcc4 cuda9.0 (#29997)
      e59524f8
    • Z
      [Cherry-Pick] Support pure fp16 training for AMP API. (#29544) (#30241) · d8dfef54
      Zhen Wang 提交于
      * Support pure fp16 training for AMP API. (#29544)
      
      * add cast ops before and after unsupported fp16 ops.
      
      * Keep partial net in FP32 pattern.
      
      * Support check_finite_and_unscale and update_loss_scaling for FP16 calculation mode.
      
      * Add fp16 support for adam op.
      
      * add multi precision attr for adam.
      
      * Fix the bug of test_multi_precision_fp16_train UT.
      
      * Code format for CI.
      
      * Fix the redefine error about MPTypeTrait on windows.
      
      * fix bugs of the _create_accumulators func in Momentum.
      
      * fix bug when inserting post cast op.
      
      * Add the update_loss_scaling op in allow_set of UnusedVarCheck.
      
      * Update for ci coverage.
      
      * Add some doc for OptimizerWithMixedPrecision.
      
      * Fix the code style.
      
      * Imporve the doc of `amp_init`.
      
      * Change for fp16 testing if users have the infer program defined in separate way.
      
      * Remove tensor copy in the update_loss_scaling op. (#29426)
      
      * remove tensor copy in the update_loss_scaling op
      
      * not use thrust.
      
      * fix some cuda memory access error.
      d8dfef54
    • Z
      [Cherry pick] improve dropout (#30260) · b4931ab1
      Zhang Ting 提交于
      * improve dropout (#29465)
      
      * improve drop out
      
      * add VectorizedRandomGeneratorWithGenerator
      
      * fix bug
      
      * modify according to comments
      
      * improve dropout grad (#29605)
      
      * improve grad perf
      
      * fix the bug of dropout_grad (#29813)
      b4931ab1
    • G
      [cherry-pick] softmax optimize (#30279) · b80beb16
      GaoWei8 提交于
      * Softmax vectorization (#29404)
      
      * vec softmax fw
      
      * vec softmax bw
      
      * add a message argument for compiler compatibility
      
      * optimize softmax forward (#30217)
      
      * optimize softmax forward
      Co-authored-by: Nzlsh80826 <zlsh80826@gmail.com>
      b80beb16
    • W
      Cherry-pick 30194 30164 30201(#30202) · 36de178a
      Wilber 提交于
      36de178a
    • W
      [cherry-pick 2.0] optimize gradient merge (#30185) · e283dc6f
      WangXi 提交于
      * Optimization grad merge performance (#29784)
      
      * [fleet] combine amp and gradient merge, test=develop (#30086)
      
      * fix assign_op_xpu concat_op_xpu warining (#30120)
      Co-authored-by: Nliuyuhui <liuyuhui@baidu.com>
      e283dc6f
    • Q
      add aarch64 and sunway kunlun lib (#30027) (#30237) · eacbd488
      QingshuChen 提交于
      * add aarch64 and sunway kunlun lib
      
      * minor
      
      * optimize elementwise_add for kunlun
      
      * update kunlun dependence
      
      * minor
      
      * minor
      eacbd488
  2. 08 1月, 2021 4 次提交
  3. 07 1月, 2021 5 次提交
  4. 06 1月, 2021 2 次提交
    • H
      support dygraph in xpu place (#30051) (#30112) · 285f33e5
      hong 提交于
      * support dygraph in xpu place; test=develop
      
      * fix cpu/gpu compile error; test=develop
      
      * fix compile error; test=develop
      
      * fix xpu compile error; testd=develop
      285f33e5
    • L
      [Cherry-Pick 2.0][Dynamic Inplace] Support ShareInplaceVersionCounterWith for... · 743649b5
      liym27 提交于
      [Cherry-Pick 2.0][Dynamic Inplace] Support ShareInplaceVersionCounterWith for C++ Tensor (#29842) (#30105)
      
      Before this PR, SharePlaceHolderWith share Tensor between different C++ Variable, which meas sharing the data, shape, and inplace_version_counter_ of Tensor.
      But in some cases, Sharing data and inplace_version_counter_ but not sharing shape is needed. For example, inplace op reshape, can't share shape.
      
      This PR, discard SharePlaceHolderWith, and expose ShareInplaceVersionCounterWith for C++ Tensor.
      This reverts commit b10ecd9d.
      
      * Support ShareInplaceVersionCounterWith to share the same inplace version counter for VarBase
      743649b5
  5. 05 1月, 2021 3 次提交
  6. 04 1月, 2021 3 次提交
  7. 31 12月, 2020 1 次提交
  8. 29 12月, 2020 8 次提交
    • L
      [Kunlun] 2.0 cherry-pick:Support for Baidu Kunlun XPU multi card training (#29713) · 847aa172
      liuyuhui 提交于
      * [Kunlun] PR1:Support one Kunlun card training in parallel executor (#29337)
      
      * [Kunlun] PR2: Support MultiDevicePass and BKCL in parallel executor (#29574)
      
      * [Kunlun] bug fix of PR2: Support MultiDevicePass and BKCL in parallel executor  (#29926)
      
      * add bkcl.so in whl for kunlun (#29947)
      
      * [Kunlun] bug fix of PR2: Support MultiDevicePass and BKCL in parallel executor  (#29961)
      Co-authored-by: NQingshuChen <qingshu.chen714@gmail.com>
      847aa172
    • C
      [Cherry-pick] Complex network execute support (#29905) · 91ebc460
      Chen Weihang 提交于
      * [Complex] Add support for complex grad accumulated (#29889)
      
      * add support for complex grad accumulated
      
      * add unittest for coverage
      
      * update test dtype
      
      * remove useless blank line
      
      * [Complex] Handle complex to real after type promotion (#29855)
      
      * try to add fwd op input dtypes
      
      * refactor base impl
      
      * return tmp_ins after dygraph prepare data
      
      * fix typo found in debug
      
      * polish comment & add complex net test
      
      * revert detail change
      
      * fix unittest failed
      
      * add complex kernel condition control
      
      * fix xpu test failed & polish comment
      
      * polish details by review comments
      
      * Complex op test (#29753)
      
      * delete no need to calculate inputs in dygraph op_test
      
      * delete no need to calculate inputs in dygraph op_test
      
      * change grad elementwise_mul for complex types (#29757)
      
      * add conj op for complex types
      
      * add conj for complex types
      
      * add more test case
      
      * add conj_op test
      
      * modify conj api and impl
      
      * add complex type for fill_constant_op xpu
      
      * add setConstant for complex type
      
      * remove complex conj test file
      
      * user define grad for test_conj_op
      
      * add test case for static mode of conj api
      
      * modify conj doc
      
      * change input args name to x
      
      * remove useless codes
      
      * conj support real types
      
      * add conj test case for real number
      
      * delete no need to calculate inputs in dygraph op_test
      
      * delete no need to calculate inputs in dygraph op_test
      
      * modify grad of mul for complex types
      
      * fix the grads of inputs args order not match bug
      
      * change the grad of div when complex types (#29804)
      
      * change the grad of div when complex types
      
      * fix the grads of inputs args order not match bug
      Co-authored-by: Nchentianyu03 <chentianyu03@baidu.com>
      91ebc460
    • [cherry-pick] #26920 , #22924 (#29948) · bea300dd
      石晓伟 提交于
      bea300dd
    • C
    • W
      Support mips (#29943) · 5a8d43bb
      Wilber 提交于
      5a8d43bb
    • T
      cherry pick heter ps (#29955) · a839ddca
      Thunderbrook 提交于
      * cherry pick heter ps
      
      *  CMakeList
      a839ddca
    • W
      [Inference] FLAGS_call_statck is turned on default when ON_INFER=ON (#29800) · fae406ae
      Wilber 提交于
      * [Inference] FLAGS_call_statck is turned on default when ON_INFER=ON
      
      * cherry-pick 29828
      fae406ae
    • W
  9. 28 12月, 2020 2 次提交
    • T
      support some shape for matmul and cast in xpu place (#29900) (#29907) · d84b8e83
      taixiurong 提交于
      * support some shape in matmul and cast
      
      * modify matmul
      d84b8e83
    • H
      [Cherry-pick] Cherry-pick of PR#29579 and PR#29617 (#29904) · 63939597
      Huihuang Zheng 提交于
      * [Dy2stat] Enable jit.save to Save Without Running (#29579)
      
      Enable jit.save to Save Without Running.
      
      * Modify CublasHandleHolder to Fix Random Unittest Failure. test=develop (#29617)
      
      Modify CublasHandleHolder from using PADDLE_ENFORCE_CUDA_SUCCESS to PADDLE_RETRY_CUDA_SUCCESS to fix random unittest failure. We checked that the unittest log showed CUDA allocation error at this file, which may due to GPU not enough. We fixed similar failure in the past, so we applied PADDLE_RETRY_CUDA_SUCCESS here.
      63939597
  10. 25 12月, 2020 2 次提交
    • Q
      feat: support check_nan_inf for kunlun/xpu device (#29694) (#29898) · 41917fb5
      QingshuChen 提交于
      * feat: support check_nan_inf for kunlun device
      
      * support kunlun stack
      
      * minor
      41917fb5
    • T
      2 0 ps core 2 (#29894) · f781ab08
      tangwei12 提交于
      * add ps table (#29463)
      
      * add ps table
      
      Change-Id: I468a04bd071d21ff52654926fcf4d5f3da19e178
      
      * add service (#29560)
      
      * add service, remove ut on mac
      
      * fix heter_profiler & add heter stop method
      
      * fix code style
      
      * merge pscore
      
      Change-Id: Ie7f60d1cdde6755a0c29db26863c6283e9843d57
      
      * fix cmake
      
      Change-Id: I6773509a7b4ca79139ecc40b7bf3eb318ceff8bb
      
      * fix conflit
      
      Change-Id: I35575be0c96a8520f9d756ea7f1ff0b904a165ba
      
      * fix conflit
      
      Change-Id: Ic926ea0b0d67803226d51241397ba3b510226bfa
      f781ab08