1. 23 Nov, 2021 1 commit
    • W
      cherry pick save/load in the_one_ps (#37461) · 58a51130
      wangguanqun committed
      * save/load in ps runtime(the_one_ps) (#36097)
      
      * add trainer desc config to distributed strategy
      
      * code style modified
      
      * data_feed set lod
      
      * fix bug
      
      * code style
      
      * fix bug
      
      * save load
      
      * save load
      
      * save unittest
      
      * add unittest of the_one_ps
      
      * unittest
      
      * add todo in communicator sendsparse
      
      * fix bug in save_inference_model (#37362)
      58a51130
  2. 16 Nov, 2021 1 commit
  3. 15 Nov, 2021 1 commit
  4. 08 Nov, 2021 1 commit
  5. 26 Oct, 2021 6 commits
    • S
      [Cherry-pick] Add FasterTokenizer Operator (#36716) · edff5b79
      Steffy-zxf committed
      * Add FasterTokenizer Operator (#34491)
      
      Add Tokenizer-related functionality for Transformer models so that tokenization is consistent between training and prediction.
      
      * support a text string as an input Tensor
      * support the "VOCAB" unordered_map<wstring, int> as an input Tensor to look up tokens
      * Tokenizer used for BERT. It applies end-to-end tokenization from a text string to wordpieces.
      * It first applies basic tokenization, followed by wordpiece tokenization.
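
The basic-then-wordpiece pipeline described above can be sketched in pure Python. This is a minimal illustration, not the operator's C++ implementation: the function names, the toy `VOCAB`, the `[UNK]` token, and the `##` continuation prefix (the usual BERT convention) are assumptions, and the basic tokenizer here only lowercases and splits on whitespace and punctuation.

```python
import unicodedata

# Toy vocabulary (assumption); plays the role of the "VOCAB" token-to-id map.
VOCAB = {"[UNK]": 0, "un": 1, "##aff": 2, "##able": 3, "hello": 4, ",": 5}


def basic_tokenize(text):
    """Lowercase, split on whitespace, and isolate punctuation characters."""
    tokens, current = [], []
    for ch in text.lower():
        if ch.isspace():
            if current:
                tokens.append("".join(current))
                current = []
        elif unicodedata.category(ch).startswith("P"):
            if current:
                tokens.append("".join(current))
                current = []
            tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab):
    """Text string -> list of token ids (basic tokenization, then wordpiece)."""
    return [vocab[piece]
            for word in basic_tokenize(text)
            for piece in wordpiece_tokenize(word, vocab)]


print(wordpiece_tokenize("unaffable", VOCAB))
print(tokenize("Hello, unaffable", VOCAB))
```

Running the same function in training and in prediction is what keeps the two paths consistent; the commit moves that logic into an operator so it travels with the model.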
      
      * optimize fast tokenizer
      
      * remove const_cast
      Co-authored-by: zhoushunjie <zhoushunjie@baidu.com>
      Co-authored-by: wawltor <fangzeyang0904@hotmail.com>
      edff5b79
    • Z
      [cherry-pick]add op: fused_feedforward(forward) (#36729) · 77034fc3
      zhangkaihuo committed
      This is a fused operator that computes the feed-forward layer of the transformer model architecture.
      77034fc3
    • S
      Add bincount op (#36317) (#36709) · 610a810c
      smallv0221 committed
      * Add bincount op
      
      * upload cpu version
      
      * fix unittest
      
      * fix unittest
      
      * fix unittest
      
      * fix en doc
      
      * add more test
      
      * fix en doc
      
      * add more test case
      
      * fix test
      
      * fix input validation
      
      * fix input check
      
      * fix unittest
      
      * fix test
      
      * fix en doc
      
      cherry-pick
      610a810c
    • L
      [Amp] refine code of amp level (#36362) (#36726) · 1ee4fc32
      Leo Chen committed
      * refine amp level
      
      * fix typo
      
      * update tracer._amp_level
      1ee4fc32
    • L
      [cherry-pick-2.2] Fused attention op forward (#35905) (#36708) · d2be870a
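      Li Min committed
      Purpose: this PR aims to improve the computational performance of the attention module.
      To reduce the framework's op scheduling overhead, this PR implements the attention module by hand in C++ and exposes it as a single large attention op;
      To reduce memory access overhead, this PR uses two optimizations:
      (1) when computing q, k, and v, the shared input X is used to reduce the gemm, transpose, and bias add at that point from three calls to one;
      (2) kernel fusion is used so that data is passed between different CUDA kernels through registers;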
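
Optimization (1) can be illustrated with a small pure-Python sketch: concatenating the three projection weights along the output dimension lets one gemm and one bias add produce q, k, and v together. The helper names and the list-of-lists matrix layout are assumptions for illustration; the transpose step and the actual CUDA kernels are not modeled.

```python
def matmul(a, b):
    """Naive matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add_bias(m, bias):
    """Add a bias vector to every row of m."""
    return [[v + b for v, b in zip(row, bias)] for row in m]


def qkv_separate(x, wq, wk, wv, bq, bk, bv):
    """Baseline: three gemm calls and three bias adds."""
    return (add_bias(matmul(x, wq), bq),
            add_bias(matmul(x, wk), bk),
            add_bias(matmul(x, wv), bv))


def qkv_fused(x, wq, wk, wv, bq, bk, bv):
    """One gemm and one bias add on concatenated weights, then split."""
    w = [rq + rk + rv for rq, rk, rv in zip(wq, wk, wv)]  # concat columns
    b = bq + bk + bv
    out = add_bias(matmul(x, w), b)
    d = len(bq)  # width of each projection (assumed equal for q, k, v)
    q = [row[:d] for row in out]
    k = [row[d:2 * d] for row in out]
    v = [row[2 * d:] for row in out]
    return q, k, v


x = [[1, 2], [3, 4]]
wq, wk, wv = [[1, 0], [0, 1]], [[2, 1], [1, 2]], [[0, 1], [1, 0]]
bq, bk, bv = [1, 1], [0, 2], [3, 0]
print(qkv_fused(x, wq, wk, wv, bq, bk, bv) == qkv_separate(x, wq, wk, wv, bq, bk, bv))
```

Both paths compute the same values; the fused path reads X once instead of three times, which is where the memory-access saving comes from.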
      d2be870a
    • Y
      add slot record dataset (#36200) (#36710) · 3fbb6644
      yaoxuefeng committed
      3fbb6644
  6. 25 Oct, 2021 1 commit
    • W
      [cherry-pick 2.2] static model parallel dropout support deterministic RandomSeedGenerator (#36682) · 59615fff
      WangXi committed
      * Revert "Add fused_dropout wrapper to ease use. (#36185) (#36640)"
      
      This reverts commit 05d7e2fd.
      
      * [hybrid] seed and dropout op support force-cpu (#35820)
      
      * [HIP] fix op not support AMD GPU bug, the flag PADDLE_WITH_ROCM is invalid
      
      * [HIP] fix op not support AMD GPU bug, the flag PADDLE_WITH_ROCM is invalid
      
      * [HIP] fix op not support AMD GPU bug
      
      * [hybrid] seed and dropout op support force-cpu
      
      * [hybrid] seed and dropout op support force-cpu
      
      * [hybrid] seed and dropout op support force-cpu
      
      * [hybrid] seed and dropout op support force-cpu
      
      * [hybrid] seed and dropout op support force-cpu
      
      * [hybrid] fix seed ci failed issue
      
      * add AsExtra for force_cpu of seed op
      
      * Add fused_dropout wrapper to ease use. (#36185)
      
      * [hybrid] static model parallel dropout support deterministic RandomSeedGenerator (#36228)
      Co-authored-by: xiayanming <41795079@qq.com>
      Co-authored-by: Li Min <11663212+limin2021@users.noreply.github.com>
      59615fff
  7. 20 Oct, 2021 1 commit
  8. 19 Oct, 2021 1 commit
    • S
      Add operators for async read & async write (#36333) (#36501) · d65f8af8
      Siming Dai committed
      * fix async_read bug
      
      * change index place to cpu
      
      * add tensor size judge
      
      * add async_read & async_write test
      
      * fix bug in async_write
      
      * fix mac py3 ci
      
      * fix bug for cpu version paddle
      
      * fix windows ci bug
      
      * change input argument error type
      
      * change const_cast to mutable_data
      
      * add async_write out-of-bound check and improve error hint
      
      * fix a small bug for dst_tensor
      
      * add docs and refine codes
      
      * refine docs
      
      * notest,test=windows_ci
      
      * fix windows ci
      
      * fix require
      
      * fix code-block
      
      * add core.is_compiled_with_cuda()
      d65f8af8
  9. 11 Oct, 2021 1 commit
  10. 27 Sep, 2021 3 commits
  11. 24 Sep, 2021 1 commit
  12. 22 Sep, 2021 1 commit
  13. 18 Sep, 2021 2 commits
  14. 17 Sep, 2021 5 commits
    • Z
      [AMP] Support pure fp16 training mode for dygraph (#35521) · adaeee4d
      zhangbo9674 committed
      * add pure fp16 major function in auto_cast & tracer
      
      * support master weight in dygraph for pure fp16
      
      * check mix dtype of fp16&fp32 for check_finite_and_unscale op
      
      * change pure fp16 function name
      
      * refine some bug in auto_cast
      
      * refine auto_cast interface logic
      
      * add param _casted_by_pure_fp16 for class Layer
      
      * support state_dict hook for save model by user appointed dtype in pure_fp16_decorator
      
      * refine pure_fp16_decorator as decorator
      
      * add unittest
      
      * add comment
      
      * add comment
      
      * support recompute
      
      * add comment for auto_cast and decorator
      
      * support to_static_state_dict for paddle.jit.save
      
      * remove the limit on the number of models and optimizers
      
      * add lookup_table in black_list
      
      * fix momentum and layer state_dict
      
      * fix bug in layer state_dict
      
      * fix bug in layer state_dict_helper
      
      * refine unittest
      
      * refine test_momentun_op
      
      * refine interface and some code
      
      * refine amp_decorator interface
      
      * refine pure fp16 interface
      
      * refine master weight interface
      adaeee4d
    • Z
      change to PADDLE_DEFINE_EXPORTED (#35841) · d22914fd
      Zeng Jinle committed
      d22914fd
    • Z
      Make flag adding easier (#35823) · 2c781455
      Zeng Jinle committed
      * make flag setter easier
      
      * update
      
      * rename macro name
      
      * fix bug of public/writable
      
      * update to pass CI
      
      * polish
      
      * fix CPU link error
      2c781455
    • L
      expose cuda stream to users (#35813) · 40cfa512
      Leo Chen committed
      * expose cuda stream to users
      
      * add ut
      40cfa512
    • W
      GeneratePass for Python Pass (#35708) · f6db9806
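      wuhuanzhou committed
      #### Background
      
      #35602 provides a way to develop subgraph-replacement passes on the Python side:
      
      - Use the Paddle Python API or helper types to define the subgraph programs used to match and replace the graph;
      - When a pass is registered on the Python side, the registration function is ultimately converted into the protobuf-defined PassDesc data form, which the C++ side parses to complete pass instance registration.
      
      This PR generates pass instances by parsing the rules described in PassDesc.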
      
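      #### Design
      
      ##### Pass rule verification
      
      In earlier pass development, operator iteration could cause a pattern to stop matching or to match incorrectly. This can be detected by scanning the parameter settings and parameter types an operator supports, either to decide whether the pass should be used or to report that the pass code needs to be modified.
      
      Pass development currently provides the operator compatibility class OpCompatSensiblePass to address this. It still has a shortcoming: because earlier passes could only obtain pattern information at runtime, the check can only be made while the pass is executing.
      
      A pass expressed as a PassDesc can be checked for these problems before the pass is executed; this is done in VerifyDesc.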
      
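      ##### Constructing the pattern from the matching subgraph
      
      GeneratePass uses GraphPatternDetector for graph matching and replacement. Constructing the matching pattern amounts to adding PDNodes and edge relations to the PDPattern member of the corresponding object. This is done in the function `InitGeneratePattern`, which is not a member method of GeneratePass, mainly because new detectors may be developed later; the operations of GeneratePass and of the detector are independent of each other.
      
      The pattern is initialized mainly by iterating over all operators of the matching subgraph program:
      
      1. Add the PDNode for the current operator together with its constraints (operator type, attribute constraints, and so on);
      2. Iterate over the inputs of the current operator and try to get the PDNode from the pattern:
         - If the PDNode is found in the pattern and is an output node: it is an intermediate node of the matching subgraph, so mark the PDNode as an intermediate node;
         - If the PDNode is not found in the pattern: add the input PDNode and mark it as an input node;
         - Set the edge from the input to the operator;
      3. Iterate over the outputs of the current operator:
         - If the PDNode is found in the pattern and is an input node: it is an intermediate node of the matching subgraph, so mark the PDNode as an intermediate node;
         - If the PDNode is not found in the pattern: add the output PDNode and mark it as an output node;
         - Set the edge from the operator to the output;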
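
The pattern-construction steps above can be sketched in Python. This is only an illustration of the node-role bookkeeping, not the C++ `InitGeneratePattern` code: the dict-based operator description, the `build_pattern` name, and the role strings are assumptions.

```python
def build_pattern(ops):
    """Walk the ops of a matching subgraph and classify variables.

    ops: list of {"type": str, "inputs": [var names], "outputs": [var names]}
    Returns (op_nodes, var_roles, edges).
    """
    op_nodes, var_roles, edges = [], {}, []
    for index, op in enumerate(ops):
        # Step 1: add a node for the operator with its constraints.
        op_name = f"{op['type']}_{index}"
        op_nodes.append({"name": op_name, "type": op["type"],
                         "attrs": dict(op.get("attrs", {}))})
        # Step 2: inputs. A variable already seen as an output is intermediate.
        for var in op["inputs"]:
            role = var_roles.get(var)
            if role is None:
                var_roles[var] = "input"
            elif role == "output":
                var_roles[var] = "intermediate"
            edges.append((var, op_name))
        # Step 3: outputs. A variable already seen as an input is intermediate.
        for var in op["outputs"]:
            role = var_roles.get(var)
            if role is None:
                var_roles[var] = "output"
            elif role == "input":
                var_roles[var] = "intermediate"
            edges.append((op_name, var))
    return op_nodes, var_roles, edges


# Hypothetical matching subgraph: matmul followed by elementwise_add.
MATCH_OPS = [
    {"type": "matmul", "inputs": ["x", "w"], "outputs": ["tmp"]},
    {"type": "elementwise_add", "inputs": ["tmp", "b"], "outputs": ["out"]},
]
op_nodes, var_roles, edges = build_pattern(MATCH_OPS)
print(var_roles)
```

In this example `tmp` is produced by one operator and consumed by the next, so it ends up as an intermediate node, while `x`, `w`, and `b` stay inputs and `out` stays an output.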
      
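      ##### Operating on the graph according to the replacement subgraph
      
      The replacement is carried out in the function `GetGenerateRewrite`, which, like `InitGeneratePattern`, is not a member method of GeneratePass.
      
      The replacement proceeds as follows:
      
      1. Check for a redundant replacement subgraph;
      2. Iterate over all operators of the replacement subgraph program and add the replacement subgraph Nodes:
         1. Add the Node for the current operator and set its attributes;
         2. Iterate over the inputs of the current operator and add intermediate variable nodes;
         3. Iterate over the outputs of the current operator and add intermediate variable nodes;
         4. Add the edges between the input/output nodes and the operator node;
      3. Delete the Nodes in the matched graph that are intermediate nodes;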
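
Steps 2 and 3 of the replacement can be sketched as follows. This is a hedged illustration, not the C++ `GetGenerateRewrite` code: the graph representation (sets of node names and edges), the `apply_rewrite` name, the `_new` naming of added nodes, and the assumption that matched operators are deleted along with matched intermediate variables are all hypothetical, and the redundancy check of step 1 is not modeled.

```python
def apply_rewrite(graph, match, replace_ops):
    """Insert the replacement subgraph and drop matched intermediate nodes.

    graph: {"nodes": set of names, "edges": set of (src, dst)}
    match: {"vars": {pattern var -> graph var node},
            "intermediate": set of graph nodes to delete}
    replace_ops: list of {"type", "inputs", "outputs"} using pattern var names
    """
    nodes, edges = set(graph["nodes"]), set(graph["edges"])
    var_map = dict(match["vars"])
    for index, op in enumerate(replace_ops):
        # Step 2.1: add the operator node.
        op_node = f"{op['type']}_new{index}"
        nodes.add(op_node)
        # Steps 2.2 and 2.3: variables not bound by the match are new
        # intermediate variables of the replacement subgraph.
        for var in op["inputs"] + op["outputs"]:
            if var not in var_map:
                var_map[var] = f"{var}_new"
                nodes.add(var_map[var])
        # Step 2.4: connect inputs and outputs to the operator node.
        for var in op["inputs"]:
            edges.add((var_map[var], op_node))
        for var in op["outputs"]:
            edges.add((op_node, var_map[var]))
    # Step 3: delete the matched intermediate nodes and their edges.
    removed = set(match["intermediate"])
    nodes -= removed
    edges = {(s, d) for s, d in edges if s not in removed and d not in removed}
    return {"nodes": nodes, "edges": edges}


# Hypothetical example: replace matmul + elementwise_add with a single fc op.
graph = {
    "nodes": {"x", "w", "b", "tmp", "out", "matmul_0", "elementwise_add_1"},
    "edges": {("x", "matmul_0"), ("w", "matmul_0"), ("matmul_0", "tmp"),
              ("tmp", "elementwise_add_1"), ("b", "elementwise_add_1"),
              ("elementwise_add_1", "out")},
}
match = {"vars": {"x": "x", "w": "w", "b": "b", "out": "out"},
         "intermediate": {"matmul_0", "tmp", "elementwise_add_1"}}
new_graph = apply_rewrite(
    graph, match, [{"type": "fc", "inputs": ["x", "w", "b"], "outputs": ["out"]}])
print(sorted(new_graph["nodes"]))
```

The sketch returns a new graph rather than mutating the input; as the commit notes below, the real pass modifies the computation graph in place, which is why a failed verification is hard to undo.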
      
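      ##### Verifying the optimized subgraph
      
      Whether the replacement subgraph, or the computation graph after replacement, can run correctly can be verified while the pass is executed, which prevents errors when the computation graph is run later.
      
      Pass execution currently modifies the computation graph directly, so the graph cannot be cleanly restored when verification fails. For now, subgraph verification succeeds by default; this is left for later improvement.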
      f6db9806
  15. 16 Sep, 2021 2 commits
  16. 15 Sep, 2021 4 commits
  17. 14 Sep, 2021 2 commits
  18. 11 Sep, 2021 1 commit
  19. 10 Sep, 2021 1 commit
  20. 09 Sep, 2021 1 commit
    • 0
      Add matrix_rank Op and its GPU and CPU kernel (#34823) · eb1fbf12
      0x45f committed
      * init matrix_rank op, add matrix_rank CPU code and test
      
      * add GPU kernel, remove svd_eigen.h
      
      * add CPU kernel when tol is tensor
      
      * add cpu and gpu code when tol is tensor
      
      * fix CI-ROCM error
      
      * add matrix_rank API describe, fix PR-CI-Py3 error
      
      * fix PR-CI-Windows error, add matrix_rank API test
      
      * delete useless comments
      
      * fix review
      
      * add my code in svd_helper.h
      
      * update doc comments
      
      * remove spaces
      eb1fbf12
  21. 08 Sep, 2021 3 commits
    • X
      Integrate GLOOParallelContext to support Multi-CPU Core for Dygraph DataParallel (#35154) · 51cc73f0
      xiongkun committed
      * can pass the fake test
      
      * add files
      
      * modify cmake to pass windows-ci
      
      * for ci pass
      
      * WITH_GLOO=ON
      
      * for pass coverage test
      
      * add cpuonly testcase
      
      * add
      
      * disable nccl when compile with cuda
      
      * change python version in cpuonly
      
      * add backend argument
      
      * add required gpu
      
      * add required:gpu
      51cc73f0
    • Z
      Enable program passes on Fleet APIs (#34955) · 5f369881
      Zeng Jinle committed
      * add fleet api for program pass
      
      * turn on apply pass for CI test
      
      * fix disable fuse_all_optimizer bug
      
      * try to test ci
      
      * fix CI
      
      * fill unspecified op role
      
      * fix fuse_allreduce
      
      * add ut to improve coverage
      
      * remove useless change
      
      * improve c++ coverage
      
      * follow some comments
      
      * test ir pass pipeline
      
      * update doc
      
      * reduce ut time again
      5f369881
    • L
      [NPU] release gil before op run (#35370) · db6242e9
      Leo Chen committed
      * release gil before op run
      
      * support npu grad test
      
      * fix op_test
      db6242e9