- 28 10月, 2021 1 次提交
-
-
由 wangguanqun 提交于
* add trainer desc config to distributed strategy * code style modified * data_feed set lod * fix bug * code style * fix bug * save load * save load * save unittest * add unittest of the_one_ps * unittest * add todo in communicator sendsparse
-
- 27 10月, 2021 1 次提交
-
-
由 taixiurong 提交于
-
- 25 10月, 2021 2 次提交
-
-
由 smallv0221 提交于
* Add bincount op * upload cpu version * fix unitest * fix unittest * fix unittest * fix en doc * add more test * fix en doc * add more test case * fix test * fix input vailidation * fix input check * fix unittest * fix test * fix en doc
-
由 zhangkaihuo 提交于
这个PR只包含fused_feedforward前向的代码。 相关kernel实现:fused_dropout_act_bias, fused_residual_dropout_bias, fused_layernorm_residual_dropout_bias fused_feedforward是一个融合算子,该算子对transformer模型的feed forward层的算子进行融合和封装,使得前端只呈现一个接口,通过融合减少部分访存和kernel launch的时间,以此提升性能。
-
- 22 10月, 2021 2 次提交
-
-
由 Li Min 提交于
功能:本PR的目标是提高attention模块的计算性能。 为了减少框架层对op的调度开销,本PR通过在C++层手动实现attention模块,对外提供attention 大op; 为了减少防存开销,本PR采取了两种优化方法: (1)在q,k,v计算时通过共享输入X,将该处的gemm,transpose和bias add从三次调用减少为一次; (2)使用kernel融合优化技术,在不同cuda kernel之间通过寄存器传输数据;
-
由 Leo Chen 提交于
* [hapi] support dygrapg amp O2 * fix problem of static pure fp16 in hapi * fix bug * fix format * fix ut * follow comments * update ut * update amp save/load * fix ut * refine code format
-
- 20 10月, 2021 1 次提交
-
-
由 Steffy-zxf 提交于
Add Tokenizer related functionalities for Transformer model in order that the process of training and predicting is consistent. * support the text string as an input Tensor * support the "VOCAB"unordered_map<wstring, int> as an input Tensor to lookup tokens * Tokenizer used for BERT. This tokenizer applies an end-to-end, text string to wordpiece tokenization. * It first applies basic tokenization, followed by wordpiece tokenization.
-
- 19 10月, 2021 3 次提交
-
-
由 danleifeng 提交于
-
由 Wilber 提交于
* update * fix ut error * update ut
-
由 WangXi 提交于
-
- 18 10月, 2021 1 次提交
-
-
由 Siming Dai 提交于
* fix async_read bug * change index place to cpu * add tensor size judge * add async_read & async_write test * fix bug in async_write * fix mac py3 ci * fix bug for cpu version paddle * fix windows ci bug * change input argument error type * change const_cast to mutable_data * add async_write out-of-bound check and consumate error hint * fix a small bug for dst_tensor * add docs and refine codes * refine docs * notest,test=windows_ci * fix windows ci * fix require * fix code-block * add core.is_compiled_with_cuda()
-
- 13 10月, 2021 2 次提交
-
-
由 Leo Chen 提交于
* refine amp level * fix typo * update tracer._amp_level
-
由 Huihuang Zheng 提交于
Remove RunFromCinn method in PE because We Will Call CinnRunner in Compute method of SubgraphOp
-
- 11 10月, 2021 1 次提交
-
-
由 Huihuang Zheng 提交于
Add use_cinn flag and use it to control whether we run PaddlePaddle using CINN. Also add: Replace PaddlePaddle graph with a CINN graph in a pass PE Method to feed data and run the graph by CINN
-
- 08 10月, 2021 2 次提交
-
-
由 Zeng Jinle 提交于
* support CUDA Graph on PE * add ut, fix CI compile * reduce memory consumption * fix CUDA 10 CI * improve coverage * improve python coverage
-
由 huangxu96 提交于
Add python interface of subgraph: 1. all_sub_graphs() 2. get_sub_graph(idx)
-
- 29 9月, 2021 2 次提交
-
-
由 Zeng Jinle 提交于
* add basic support for CUDA Graph * fix ci compile error * fix LOG print, fix windows CI * follow comments and update * small fix for default ctor * fix rocm compile error * fix CPU compile error
-
由 yaoxuefeng 提交于
-
- 28 9月, 2021 2 次提交
-
-
由 Yanxing Shi 提交于
* Initial Commit * add unittest and add error information * modify doc * fix some error * fix some word * fix bug cudaDeviceProp* and modify error explanation * fix cudaDeviceProp* error and unnitest samples * fix hip error and PADDLE_WITH_HIP * update style * fix error is_compiled_with_cuda * fix paddle.device.cuda.get_device_properties * fix error for multi thread safe * update style * merge conflict * modify after mentor review * update style * delete word * fix unittest error for windows * support string input and modify some code * modify doc to support string input * fix error for express information * fix error for express information * fix unnitest for windows * fix device.startswith('gpu:') * format error and doc * fix after review * format code * fix error for doc compile * fix error for doc compile * fix error for doc compile * fix error for doc compile * fix error for doc compile * fix py2 error * fix wrong words and doc * fix _gpuDeviceProperties
-
由 Siming Dai 提交于
-
- 26 9月, 2021 3 次提交
-
-
由 JYChen 提交于
* add func/class API psroi_pool and UT * add UT in static mode * Remove redundant type checks in static mode * More detailed description for test_psroi_pool_op * fix code format of UT * fix en-doc
-
由 Thunderbrook 提交于
* set file_num in one shard * format
-
由 zhangbo9674 提交于
* adam to adamw in AdamW * add lr_ratio in adamw * refine logic bug in cpu adamw * delete fix bug for cpu adamw * delete fix bug for cpu adamw
-
- 22 9月, 2021 2 次提交
-
-
由 Tomasz Socha 提交于
* Fix copy elision warning * Remove redundand code
-
由 JingZhuangzhuang 提交于
-
- 18 9月, 2021 3 次提交
-
-
由 Huihuang Zheng 提交于
Add basic Cost Model, it uses executor to run program and profile it to get op time. This is an early basic version, we will add more functions in the future.
-
由 Aurelius84 提交于
* split cuda_profiler into .h and .cc * fix cmake * remove inline
-
由 Aurelius84 提交于
* Clean ParaseMemInfo and fix unittest with multi-thread * fix declare
-
- 17 9月, 2021 5 次提交
-
-
由 zhangbo9674 提交于
* add pure fp16 major function in auto_cast & tracer * support master weight in dygraph for pure fp16 * check mix dtype of fp16&fp32 for check_finite_and_unscale op * change pure fp16 funtion name * refine some bug in auto_cast * refine auto_cast interface logic * add param _casted_by_pure_fp16 for class Layer * support state_dict hook for save model by user appointed dtype in pure_fp16_decorator * refine pure_fp16_decorator as decorator * add unittest * add comment * add comment * support recompute * add comment for auto_cast and decorator * support to_static_state_dict for paddle.jit.save * unlimite models num and optimizers num * add lookup_table in black_list * fix momentum and layer state_dict * fix bug in layer state_dict * fix bug in layer state_dict_helper * refine unittest * refine test_momentun_op * refine interface and some code * refine amp_decorator interface * refine pure fp16 interface * refine master weight interface
-
由 Zeng Jinle 提交于
-
由 Zeng Jinle 提交于
* make flag setter easier * update * rename macro name * fix bug of public/writable * update to pass CI * polish * fix CPU link error
-
由 Leo Chen 提交于
* expose cuda stream to users * add ut
-
由 wuhuanzhou 提交于
#### 背景 #35602 提供Python侧开发子图替换类Pass的方式: - 利用Paddle Python API或者辅助类型定义子图program用来匹配/替换图; - Python侧注册Pass时,将注册函数最终转换为protobuf定义的PassDesc数据形式,供C++侧进行解析完成Pass实例注册。 本PR即为根据PassDesc规则描述解析生成Pass实例。 #### 方案设计 ##### Pass规则验证 在以往的Pass开发中,会存在随着算子迭代引发的匹配失效或者错误匹配的问题,该问题可以通过扫描算子支持的参数设置及参数类型等来判断是否应该使用该Pass或者给出提示需要修改Pass代码。 当前Pass开发中提供了算子兼容性OpCompatSensiblePass用于解决上述问题。但同时还存在不足:由于以往Pass开发在运行时才能获取到pattern信息,所以需要在执行Pass时才可以判断。 使用PassDesc表示的Pass可以在执行Pass前验证上述问题,这个过程在VerifyDesc中完成。 ##### 根据匹配子图构造pattern GeneratePass对于图匹配和替换使用GraphPatternDecetor完成,构造匹配pattern实际上就是将对应对象成员PDPattern中添加PDNode和边关系。该过程在函数`InitGeneratePattern`中完成,该函数没有作为GeneratePass的成员方法,主要出于后续可能开发新的Decetor考虑,GeneratePass与Decetor的操作是没有关联的。 初始化pattern主要通过遍历匹配子图program的全部算子实现: 1. 添加当前算子对应PDNode及限制条件(算子类型、属性限制等); 2. 遍历当前算子对应输入并从pattern中尝试获取PDNode: - 在pattern中获取到PDNode且为输出节点:表示属于匹配子图的中间节点,将该PDNode设置为中间节点; - 在pattern中没有获取到PDNode:添加该输入PDNode并设置作为输入节点; - 设置输入到算子的边关系; 3. 遍历当前算子对应输出: - 在pattern中获取到PDNode且为输入节点:表示属于匹配子图的中间节点,将该PDNode设置为中间节点; - 在pattern中没有获取到PDNode:添加该输入PDNode并设置作为输出节点; - 设置算子到输出的边关系; ##### 根据替换子图操作graph 替换子图操作的过程在`GetGenerateRewrite`函数中完成,与`InitGeneratePattern`类似没有作为GeneratePass的成员方法。 生成替换子图操作过程如下: 1. 判断冗余替换子图; 2. 遍历替换子图program的全部算子添加替换子图Node: 1. 添加当前算子的Node及属性设置; 2. 遍历当前算子对应输入,添加中间variable节点; 3. 遍历当前算子对应输出,添加中间variable节点; 4. 添加输入/输出节点与算子节点的边关系; 3. 删除匹配图中属于中间节点的Node; ##### 优化子图验证 对于替换子图或者替换后的计算图是否可以正确运行等,可以在执行Pass时验证,从而防止在后续执行计算图时出现异常。 当前Pass执行直接修改计算图,验证失败时无法很好的完成还原操作,目前子图验证暂时默认成功,留到后续改进。
-
- 16 9月, 2021 2 次提交
-
-
由 0x45f 提交于
* fix no_grad context error in dy2stat * remove useless comments * fix error by drop_kids in python * add test and fix review
-
由 wanghuancoder 提交于
-
- 15 9月, 2021 4 次提交
-
-
由 王明冬 提交于
* clip op extra information when export model,test=ocr * rename clip_extra parameter to kwargs in save_inference_model, test=ocr
-
由 zyfncg 提交于
* Change the invoking method of settiem from numpy to set_value op when value is not tensor * fix the check logic for inplace in setitem * fix the unittest problem caused by setitem doesn't support fp16 * modify some code format in setitem
-
由 houj04 提交于
-
由 Siming Dai 提交于
Add paddle.cuda.device.stream_guard API
-
- 14 9月, 2021 1 次提交
-
-
由 chenenquan 提交于
* Add empty_cache api to release idle gpu memory hold by allocator,test=develop * Add empty_cache api to release idle gpu memory hold by allocator,test=develop * Add empty_cache api to release idle gpu memory hold by allocator,test=develop * Fix test coverage problem for empty_cache * delete redundant check for empty_cache * fix the problem of empty_cache's doc * delete the nvidia-smi comment in doc of empty_cache, test=document_fix
-