1. 23 Oct 2020 · 1 commit
    • Fix test_parallel_executor_test_while_train Random Failure by Decreasing GPU Usage (#28213) · a1e7fd4a
      Huihuang Zheng committed
      Recently, test_parallel_executor_test_while_train has been failing randomly on CI. Every CI log showed that NCCL initialization or cuSOLVER initialization had failed. From what I found online, such failures are usually caused by a shortage of GPU resources. These initializations call CUDA APIs directly, so the allocator should not be the cause. Something elsewhere in PaddlePaddle may be increasing GPU usage.
      
      However, I ran this test 1000 times on both my machine and the CI machine, and neither reproduced the random failure. Perhaps something specific to the test environment is involved.
      
      To verify my assumption that something in PaddlePaddle increases GPU usage, and to fix this CI failure, I decreased the batch_size to see whether the random failure disappears in the test environment.
  2. 22 Oct 2020 · 4 commits
  3. 21 Oct 2020 · 7 commits
  4. 20 Oct 2020 · 8 commits
  5. 19 Oct 2020 · 10 commits
    • xpu adam op (#28031) · 6f0c3d1f
      yinhaofeng committed
      * lookup_table_xpu op reports errors; test=kunlun
      
      * add adam xpu op; test=kunlun
      
      * reset lookup
      
      * fix adam error; test=kunlun
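The commit adds an XPU kernel for Adam but does not restate the update rule. For reference, here is a scalar sketch of the textbook Adam step (Kingma & Ba) with the usual default hyperparameters. The log does not show whether the XPU kernel folds bias correction into the learning rate or places epsilon differently. Treat `adam_step`, a hypothetical helper, as the reference math only, not the kernel.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one textbook Adam update to a scalar parameter.

    `t` is the 1-based step count; returns the updated (param, m, v).
    """
    m = beta1 * m + (1.0 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# On the first step m_hat / sqrt(v_hat) equals sign(grad) (up to eps),
# so the parameter moves by roughly lr whatever the gradient's magnitude.
p, m, v = adam_step(1.0, 0.5, 0.0, 0.0, t=1)
print(p)  # approximately 0.999
```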
    • Add xpu transpose2 op. test=kunlun (#28086) · a5c95cd5
      TeslaZhao committed
    • Fix diag OP bug on Windows Python 3.8 · c8d32c8c
      LutaoChu committed
      Fix diag OP bug on Windows Python 3.8; remove the std::min call
    • fleet: support paddle.optimizer (#28026) · 55098b97
      MRXLT committed
      fleet: support paddle.optimizer
      
      * bug fix
      
      * fix fleet_base
      
      * bug fix
      
      * fix coverage
    • [API 2.0: doc] transfer from paddle.fluid.layers.assign() into creation.py (#27999) · e21b13fb
      liuyuhui committed
      * transfer from paddle.fluid.layers.assign() into creation.py, test=develop
      
      * fix UT failure, add support for paddle.assign, test=develop
      
      * fix, test=develop
      
      * fix UT coverage, test=coverage
      
      * fix UT failure, test=coverage
      
      * fix doc, test=develop
    • Allclose op (#27891) · d4668938
      huangxu96 committed
      * Still has bugs.
      
      * Fixed an allclose_op bug that could not handle some fp64 inputs.
      
      * Improved CUDA kernel performance.
      
      * Changed CUDA code.
      
      * Fixed a bug in the CUDA kernel that could not handle large-dimension inputs, and added a unit test for it.
      
      * Add a test case for float32 input.
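allclose checks whether two tensors are elementwise close. This sketch assumes the op follows the same rule as `numpy.allclose`: elements are close when |x - y| <= atol + rtol * |y|, and NaNs compare equal only when `equal_nan` is set. It operates on flat lists; plain Python floats are fp64, the case this commit fixed. The defaults below match NumPy's, and `allclose` here is a hypothetical helper rather than the op itself.

```python
import math

def allclose(x, y, rtol=1e-05, atol=1e-08, equal_nan=False):
    """Return True if every pair (a, b) satisfies |a - b| <= atol + rtol * |b|."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for a, b in zip(x, y):
        if math.isnan(a) or math.isnan(b):
            # NaN is only "close" to NaN, and only when equal_nan is requested.
            if not (equal_nan and math.isnan(a) and math.isnan(b)):
                return False
            continue
        if math.isinf(a) or math.isinf(b):
            if a != b:  # an infinity is only close to the same infinity
                return False
            continue
        if abs(a - b) > atol + rtol * abs(b):
            return False
    return True
```

Note that the tolerance scales with |y| only, so the check is not symmetric in general.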
    • Fix error message of multinomial op (#27946) · 975bd887
      pangyoki committed
      * fix multinomial doc
      
      * fix multinomial error message
      
      * little doc change
      
      * fix Categorical class doc
      
      * optimize format of error message
      
      * fix CPU Kernel error message format
      
      * fix isinf and isnan error in Windows OPENBLAS CI
      
      * delete inf and nan
      
      * add manual_seed in sample code
      
      * little error message change
      
      * change error message to InvalidArgument
      
      * add a full stop to the error message and add manual_seed in the CPU environment
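The error-message work in this commit centers on rejecting bad probability inputs. The sketch below shows the kind of validation involved, then draws seeded samples from an unnormalized distribution through its running cumulative sum. The specific checks and message wording are illustrative assumptions, not the op's actual messages. `ValueError` stands in for Paddle's InvalidArgument error, and `multinomial` is a hypothetical helper.

```python
import bisect
import math
import random

def multinomial(probs, num_samples=1, replacement=False, seed=None):
    """Draw category indices from unnormalized, non-negative weights `probs`."""
    # Validate inputs first, reporting the offending index and value.
    for i, p in enumerate(probs):
        if math.isnan(p) or math.isinf(p):
            raise ValueError(f"probs[{i}] = {p} must be a finite number.")
        if p < 0:
            raise ValueError(f"probs[{i}] = {p} must be non-negative.")
    nonzero = sum(1 for p in probs if p > 0)
    if nonzero == 0:
        raise ValueError("The sum of probs must be greater than 0.")
    if not replacement and num_samples > nonzero:
        raise ValueError(
            f"num_samples ({num_samples}) must not exceed the number of "
            f"nonzero categories ({nonzero}) when replacement is False.")

    rng = random.Random(seed)            # seeded generator, reproducible draws
    weights = [float(p) for p in probs]
    samples = []
    for _ in range(num_samples):
        cdf, total = [], 0.0
        for w in weights:                # running (unnormalized) CDF
            total += w
            cdf.append(total)
        r = rng.random() * total
        idx = bisect.bisect_right(cdf, r)
        if idx == len(cdf):              # guard against r rounding up to total
            idx = max(i for i, w in enumerate(weights) if w > 0)
        samples.append(idx)
        if not replacement:
            weights[idx] = 0.0           # a drawn category cannot repeat
    return samples
```

Fixing the seed makes the draws repeatable. The commit adds `manual_seed` to the sample code, apparently for the same purpose.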
    • Add truncated_gaussian_random XPU kernel (#27861) · 4c5b779a
      pangyoki committed
      * Add truncated_gaussian_random_op XPU kernel
      
      * Add truncated_gaussian_random_op XPU kernel, test=kunlun
      
      * little change, test=kunlun
      
      * change boost_get to BOOST_GET_CONST
      
      * change boost_get to BOOST_GET_CONST, test=kunlun
      
      * little change, test=kunlun
      
      * use Generator to generate random number and optimize format, test=kunlun
      
      * little change, test=kunlun
      
      * add TODO, test=kunlun
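A truncated Gaussian keeps only values within a fixed number of standard deviations of the mean. This sketch assumes a bound of two standard deviations, which is commonly used for this op but is treated as an assumption here. It uses inverse-CDF sampling: draw a uniform number between the normal CDF values at the two bounds, then map it back through the inverse CDF. A seeded `random.Random` plays the role of the Generator mentioned in the commit. `truncated_gaussian` is a hypothetical name, and the XPU kernel may sample differently.

```python
import random
from statistics import NormalDist

def truncated_gaussian(numel, mean=0.0, std=1.0, seed=0, bound=2.0):
    """Sample `numel` values from N(mean, std^2) truncated to mean +/- bound*std."""
    unit = NormalDist()                 # standard normal, N(0, 1)
    lo, hi = unit.cdf(-bound), unit.cdf(bound)
    rng = random.Random(seed)           # seeded generator for reproducibility
    out = []
    for _ in range(numel):
        u = lo + (hi - lo) * rng.random()   # uniform in [cdf(-bound), cdf(bound))
        out.append(mean + std * unit.inv_cdf(u))
    return out
```

Unlike rejection sampling, this approach never discards a draw, so each output consumes exactly one uniform number.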
    • Add gaussian_random XPU kernels (#27853) · 5b8e5001
      pangyoki committed
      * Add gaussian_random XPU kernels
      
      * commit kunlun, test=kunlun
      
      * new version, test=kunlun
      
      * change boost_get to BOOST_GET_CONST, test=kunlun
      
      * use Generator to generate random number and optimize format, test=kunlun
      
      * add TODO, test=kunlun
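Gaussian random kernels typically turn uniform draws from a seeded generator into normal samples. The sketch below uses the Box-Muller transform. The log does not say which transform the XPU kernel uses, so Box-Muller is a technique swapped in for illustration, and `gaussian_random` here is a hypothetical helper, not Paddle's API.

```python
import math
import random

def gaussian_random(numel, mean=0.0, std=1.0, seed=0):
    """Return `numel` samples from N(mean, std^2) via the Box-Muller transform."""
    rng = random.Random(seed)            # seeded, so the same seed repeats results
    out = []
    while len(out) < numel:
        u1 = 1.0 - rng.random()          # in (0, 1], so log(u1) is finite
        u2 = rng.random()
        r = math.sqrt(-2.0 * math.log(u1))
        # Each pair of uniforms yields two independent standard normals.
        out.append(mean + std * r * math.cos(2.0 * math.pi * u2))
        if len(out) < numel:
            out.append(mean + std * r * math.sin(2.0 * math.pi * u2))
    return out
```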
    • Add uniform_random XPU kernel (#27846) · 74ce0397
      pangyoki committed
      * support uniform_random op on Baidu Kunlun
      
      * change dtype of attr shape from int to int64_t
      
      * kunlun ci, test=kunlun
      
      * new version, test=kunlun
      
      * change boost_get to BOOST_GET_CONST
      
      * change boost_get to BOOST_GET_CONST, test=kunlun
      
      * use Generator to generate random number and optimize format
      
      * run Kunlun CI, test=kunlun
      
      * add TODO, test=kunlun
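Two details from this commit can be illustrated in plain Python: a seeded generator producing values between a lower and an upper bound, and the element count implied by a shape. Python integers do not overflow, so `numel` behaves like an int64_t product for realistic shapes. The `fits_in_int32` check only shows where a 32-bit type runs out and is not the commit's stated rationale for switching the shape attribute to int64_t. All names are hypothetical.

```python
import math
import random

INT32_MAX = 2**31 - 1

def numel(shape):
    """Number of elements implied by `shape` (1 for an empty shape)."""
    return math.prod(shape)

def fits_in_int32(shape):
    """Whether the element count would still fit in a signed 32-bit integer."""
    return numel(shape) <= INT32_MAX

def uniform_random(shape, low=-1.0, high=1.0, seed=0):
    """Flat list of numel(shape) uniform samples between low and high."""
    rng = random.Random(seed)            # seeded generator, reproducible output
    return [rng.uniform(low, high) for _ in range(numel(shape))]
```

For example, a [65536, 65536] shape has 2**32 elements, which already exceeds the 32-bit range.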
  6. 18 Oct 2020 · 1 commit
    • add cast/concat/assign xpu op (#27911) · 3e956865
      liuyuhui committed
      * add
      
      * add cast_op_xpu, test=kunlun
      
      * fix bug for cast_op_xpu,test=kunlun
      
      * add concat_op_xpu, test=kunlun
      
      * resolve conflicts, test=kunlun
      
      * fix bug,test=kunlun
      
      * add assign_op_xpu, test=kunlun
      
      * fix bug,test=kunlun
      
      * test=kunlun;test=develop
      
      * fix concat bug,test=kunlun
      
      * fix check_dygraph set in test_concat_op_xpu.py,test=kunlun
      
      * fix error message,test=kunlun
      Co-authored-by: mapingshuo <mps2012@yeah.net>
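Concatenation joins tensors along one axis, and every other dimension must match. The 2-D sketch below works on nested lists and shows the shape rule for axis 0 and axis 1, with negative axes counted from the end. `concat` is a hypothetical helper illustrating the semantics, not the XPU kernel.

```python
def concat(tensors, axis=0):
    """Concatenate 2-D tensors (lists of equal-length rows) along `axis`."""
    if not tensors:
        raise ValueError("concat needs at least one tensor")
    if axis < 0:
        axis += 2                      # -1 -> 1, -2 -> 0 for 2-D inputs
    if axis == 0:
        # Stacking rows: the column counts must agree.
        width = len(tensors[0][0])
        if any(len(row) != width for t in tensors for row in t):
            raise ValueError("all inputs must have the same number of columns")
        return [list(row) for t in tensors for row in t]
    if axis == 1:
        # Joining rows side by side: the row counts must agree.
        height = len(tensors[0])
        if any(len(t) != height for t in tensors):
            raise ValueError("all inputs must have the same number of rows")
        return [[x for t in tensors for x in t[i]] for i in range(height)]
    raise ValueError(f"axis {axis} is out of range for 2-D inputs")
```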
  7. 17 Oct 2020 · 2 commits
  8. 16 Oct 2020 · 7 commits
    • disable test_lstm, test=document_fix (#28030) · bf5325f3
      YUNSHEN XIE committed
      * disable test_lstm, test=document_fix
      
      * fix some errors, test=document_fix
    • Incorporate cudnn_lstm into LSTM api (#27217) · fa9d3fa5
      Guo Sheng committed
      * Incorporate cudnn_lstm into LSTM api.
      test=develop
      
      * Make coalesce_tensor support alignment optionally.
      test=develop
      
      * Reorganize RNN apis. test=develop
      
      * Fix cudnn rnn layout conversion.
      test=develop
      
      * Add sequence_length support for RNN cudnn implement.
      Add optional init_h and init_c gradient for cudnn_lstm_op.
      test=develop
      
      * Use create_parameter for rnn cudnn impl.
      test=develop
      
      * Move `self._flat_weight = self.create_parameter()` in RNNBase to main_program.
      test=develop
      
      * Update RNN api unittest to use set_device.
      test=develop
      
      * Fix set_place for unit tests of RNN apis.
      test=develop
      
      * Fix use_align in coalesce_tensor_op.
      test=develop
      
      * Adjust RNN apis arguments according to comments.
      test=develop
      
      * Polish documents for SimpleRNN apis.
      test=develop
      
      * Refine random seed in cudnn_lstm_op.
      Expose rnn params from sublayers to RNN.
      test=develop
      
      * Fix RNN saving for jit.save.
      Refine cudnn_lstm dropout behavior.
      test=develop
      
      * Fix doc of GRU. test=develop
      
      * Use ShareDataWith to avoid copying for cudnn_lstm_op test.
      test=develop
      
      * Remove updates on cudnn_lstm temporarily.
      test=develop
      
      * Use ShareDataWith to avoid copying for cudnn_lstm_op test.
      test=develop
      
      * Refine random seed in cudnn_lstm_op.
      test=develop
      
      * Fix test_lstm by adjusting the ConcreteProgram buffer getter.
      test=develop
      
      * Use create_parameter instead of create_var for rnn._flat_weight for static graph usage.
      test=develop
      
      * Remove W input for cudnn_lstm to pass unused_var_check.
      test=develop
      
      * Add test_predict for RNN unit tests coverage.
      test=develop
      
      * Fix code style of rnn.
      test=develop
      
      * Fix F.rnn usage in rnn.py.
      test=develop
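coalesce_tensor packs several tensors into one contiguous buffer, and this commit makes per-tensor alignment optional. The offset arithmetic can be sketched as follows. The 256-byte alignment and the helper names are hypothetical values chosen for illustration, not taken from the op.

```python
def align_up(nbytes, alignment):
    """Round nbytes up to the next multiple of alignment."""
    return (nbytes + alignment - 1) // alignment * alignment

def coalesce_offsets(numels, dtype_size, use_align=True, alignment=256):
    """Byte offset of each tensor inside one fused buffer, plus the total size."""
    offsets = []
    cursor = 0
    for n in numels:
        offsets.append(cursor)
        nbytes = n * dtype_size
        # With alignment, each tensor starts on an `alignment` boundary;
        # without it, tensors are packed back to back.
        cursor += align_up(nbytes, alignment) if use_align else nbytes
    return offsets, cursor
```

With alignment off, the tensors sit back to back. A single flat weight buffer (such as the `_flat_weight` mentioned above) would need this layout if its consumer expects tightly packed parameters. That motivation is a guess; the log does not state it.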
    • fix random failure (#27996) · 78b1026f
      Leo Chen committed
    • [Dy2Stat] Fix Error when generating train_program in eval mode (#27975) · ffcc1175
      Aurelius84 committed
      * Fix save in eval mode
      
      * remove assert statement
      
      * fix test_partial_program failed
      
      * add more test
      
      * change back to _train_program
    • Fix xpu enforce (#27978) · d330cf66
      Jack Zhou committed
      * test=kunlun;
      
      Add elementwise XPU OP kernels for the KUNLUN core, including the following (general broadcasting is not yet supported):
      
          * elementwise_div op
          * elementwise_max op
          * elementwise_mul op (with grad op)
          * elementwise_sub op (with grad op)
      
      * 0.05->0.01
      
      * add xpu error message description;test=kunlun
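These kernels handle only the same-shape case, since general broadcasting is noted as unsupported. In that case the gradients of elementwise_sub and elementwise_mul follow directly from the chain rule. For out = x - y, dx = dout and dy = -dout. For out = x * y, dx = dout * y and dy = dout * x. The flat-list sketch below uses hypothetical helper names and is not the XPU kernel.

```python
def _check_same_shape(*seqs):
    # Mirrors the no-broadcast restriction: all operands must match exactly.
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("broadcasting is not supported; shapes must match")

def elementwise_sub_grad(x, y, dout):
    """out = x - y  =>  dx = dout, dy = -dout (flat lists, same length)."""
    _check_same_shape(x, y, dout)
    return list(dout), [-g for g in dout]

def elementwise_mul_grad(x, y, dout):
    """out = x * y  =>  dx = dout * y, dy = dout * x (flat lists, same length)."""
    _check_same_shape(x, y, dout)
    dx = [g * b for g, b in zip(dout, y)]
    dy = [g * a for g, a in zip(dout, x)]
    return dx, dy
```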