- 22 Oct, 2021 5 commits
-
-
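Committed by zhangbo9674
-
Committed by Li Min
Purpose: this PR aims to improve the computational performance of the attention module. To reduce the framework's op scheduling overhead, it implements the attention module by hand at the C++ layer and exposes it as a single large attention op. To reduce memory access overhead, it applies two optimizations: (1) when computing q, k, and v, the input X is shared, so the gemm, transpose, and bias add at that point go from three calls to one; (2) kernel fusion is used so that data passes between different CUDA kernels through registers.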
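Optimization (1) can be illustrated with a minimal pure-Python sketch. It is not the PR's C++/CUDA code: the helper names (`matmul`, `add_bias`, `qkv_separate`, `qkv_fused`) are hypothetical, the transpose step is left out, and all three projections are assumed to have the same output width. It only shows that concatenating the three weight matrices lets one gemm plus one bias add reproduce the three separate results.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_bias(m, bias):
    """Add a bias vector to every row of a matrix."""
    return [[v + b for v, b in zip(row, bias)] for row in m]


def qkv_separate(x, weights, biases):
    """Baseline: three independent gemm + bias-add calls that all read X."""
    return [add_bias(matmul(x, w), b) for w, b in zip(weights, biases)]


def qkv_fused(x, weights, biases):
    """Fused variant: concatenate the weights along the output dimension,
    run one gemm + bias add, then slice the result back into q, k, v."""
    w_q, w_k, w_v = weights
    w_cat = [rq + rk + rv for rq, rk, rv in zip(w_q, w_k, w_v)]
    b_cat = biases[0] + biases[1] + biases[2]
    out = add_bias(matmul(x, w_cat), b_cat)
    d = len(biases[0])  # assumed: q, k, v share the same output width
    return [[row[i * d:(i + 1) * d] for row in out] for i in range(3)]


# Illustrative values only (hypothetical, not taken from the PR).
x = [[1.0, 2.0], [3.0, 4.0]]
weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # W_q
    [[2.0, 1.0], [0.0, 1.0]],  # W_k
    [[0.0, 1.0], [1.0, 0.0]],  # W_v
]
biases = [[0.0, 0.0], [1.0, 1.0], [0.5, -0.5]]
assert qkv_fused(x, weights, biases) == qkv_separate(x, weights, biases)
```

The fused path reads X once instead of three times, which is where the memory access saving described above would come from.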
-
Committed by Leo Chen
* [hapi] support dygraph amp O2 * fix problem of static pure fp16 in hapi * fix bug * fix format * fix ut * follow comments * update ut * update amp save/load * fix ut * refine code format
-
Committed by Weilong Wu
* Support elementwise_add triple grad Kernel * Change code-format to follow CI std * Removed unreasonable code, and fixed an uninitialized input issue
-
Committed by Wilber
-
- 21 Oct, 2021 12 commits
-
-
Committed by zhulei
-
Committed by ronnywang
-
Committed by jakpiase
* added base changes for matmul_v2+trans+resh fuse pass * added full matmul_v2+transpose+reshape pass * removed a file added by mistake * added reviewers suggestions * Changed ops type in checking compatibility version * Deleted one statement
-
Committed by furnace
* add sync_batch_norm (support train, infer, and fp32, fp16, and NCHW, NHWC) * [NPU] Delete debug codes * [NPU] Remove FP16
-
Committed by Jack Zhou
* add viterbi decode cpu kernel * add viterbi decoder api in paddle.text * add a data buffer once to avoid creating many small data buffers frequently * fix viterbi max_seq_length bug * fix seq_len=1 bug * fix device context * move split out of for loop * remove INVERSE_SUB * remove 2 GET_CAST_MASK * remove 1 loop * remove Functor * add to_static deploy code * use MAX_FUNC instead of ELE_MAX * add MaxFunctor * impl max_func * remove MaxFunctor * remove cast op * use REGISTER_OP_WITHOUT_GRADIENT * add viterbi cuda kernel * add FIX_BLOCKDIM_CASE macro * add MKL add, mul; add get data mask * add arange mkl impl * add CPU Argmax * add cpu gather * use EXECUTE_MKL_ELEMENT_BINARY_OP instead of some ADD, MUL * use SameDimsBinaryOP instead of EXECUTE_MKL_ELEMENT_BINARY_OP * use SAME_DIMS_ELEMENT_BINARY_OP * add SimpleBroadcastBinaryOP * use int instead of int64_t to accelerate * optimize SimpleBroadcastBinaryOP * optimize SimpleBroadcastBinaryOP * optimize performance in both single thread and multithread situation * remove useless line * remove useless code * add CREATE_TENSOR_BUFFER macro * add INIT_REQUIRED_TENSOR macro * add comment * fix windows ci * add viterbi unittest * remove cuda add functor * remove cuda equal * remove a template function * fix windows ci * fix windows dtype * remove some template instance * remove useless header file * remove some blockdim * remove transpose impl * accelerate cpu performance on single thread situation * viterbi_decode->crf_decode * rename crf params name * add viterbi api test * remove useless import * add enable_static * use viterbi decoder * fix viterbi len=1 * fix viterbi unittest * remove useless comments * reconstruct viterbi decode * remove ADD,SUB,MUL structure * fix coverage * remove CREATE_TENSOR * add name args * crf.py->ops.py; with_start_stop_tag->include_start_end_tag * update crf_decode en docs * fix viterbi decode en docs * fix some review comments * add FIXED_BLOCK_DIM_CASE in cuda
* push_back->emplace_back * crf_decode->viterbi_decode; include_start_end_tag->include_bos_eos_tag * paddle.text.ops.viterbi_decode->paddle.text.viterbi_decode * fix viterbi_decode en docs
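For readers unfamiliar with the algorithm behind this commit, here is a minimal pure-Python sketch of Viterbi decoding for a single sequence. It is an illustration under stated assumptions, not the CPU/CUDA kernel added here: the function name, argument layout, and example scores are hypothetical, and the BOS/EOS tag handling controlled by `include_bos_eos_tag` is not modeled.

```python
def viterbi_decode(emissions, transitions):
    """Return (best_score, best_path) for one sequence.

    emissions[t][j]   : score of tag j at time step t
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    scores = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []            # best previous tag for each step and tag
    for step in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_tags):
            best_prev = max(range(num_tags),
                            key=lambda i: scores[i] + transitions[i][j])
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + step[j])
        scores = new_scores
        backpointers.append(pointers)
    # Pick the best final tag, then follow the backpointers to the start.
    last = max(range(num_tags), key=lambda i: scores[i])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], path


# Hypothetical scores for a 3-step sequence with 2 tags.
emissions = [[2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
transitions = [[0.0, 1.0], [-1.0, 0.0]]
score, path = viterbi_decode(emissions, transitions)
print(score, path)  # 7.0 [0, 1, 1]
```

A sequence of length 1 skips the loop entirely and returns the highest-scoring tag of the only step, which is the edge case several of the fixes above mention.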
-
Committed by TTerror
* add some ops to train ssd on kunlun * update test_fill_any_like_op_xpu.py
-
Committed by xiongkun
-
Committed by niuliling123
* Update the implementation of reduceAnyKernel according to the kernel primitive API * Fix a bug in ReadData, ReadDataBc and ReadDataReduce when NX != 1
-
Committed by seemingwang
-
Committed by zhaocaibei123
* add ctr table depends * code style * fix * fix * fix naming * rename * rename
-
Committed by liutiexing
* add align for WorkQueue * add spinlock * merge develop * merge * Add EventsWaiter * Revert "Add EventsWaiter" This reverts commit e206173aa9be7401b83a53581627bfaf557c8fb2. * adjust multithread using, fix flame graph * update
-
Committed by Aurelius84
* Add kQueueSync.synchronize_run_ logic * Support No DataTransform From GetKernelTypeForVar
-
- 20 Oct, 2021 13 commits
-
-
Committed by danleifeng
* split into PreBuildTask and BuildPull; solve endpass bug;test=develop * change buildcpu into prebuild and buildcpu into build;test=develop
-
Committed by 李季
* fix global gather and global scatter operators
-
Committed by ronnywang
-
Committed by Wilber
-
Committed by Steffy-zxf
Add Tokenizer-related functionality for the Transformer model so that preprocessing is consistent between training and prediction. * support a text string as an input Tensor * support the "VOCAB" unordered_map<wstring, int> as an input Tensor for token lookup * Tokenizer used for BERT. This tokenizer applies end-to-end tokenization, from text string to wordpiece. * It first applies basic tokenization, followed by wordpiece tokenization.
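The two-stage flow described above (basic tokenization, then wordpiece tokenization against a vocabulary map) can be sketched in plain Python. This is a simplified illustration, not the op added by this commit: the function names and the tiny vocabulary are hypothetical, and a real BERT basic tokenizer also performs Unicode normalization and CJK handling, which are omitted here.

```python
def basic_tokenize(text):
    """Lowercase, split on whitespace, and emit punctuation as separate tokens."""
    tokens = []
    for word in text.lower().split():
        current = ""
        for ch in word:
            if ch.isalnum():
                current += ch
            else:
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)
        if current:
            tokens.append(current)
    return tokens


def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab):
    """Text string in, list of token ids out."""
    ids = []
    for word in basic_tokenize(text):
        ids.extend(vocab[piece] for piece in wordpiece_tokenize(word, vocab))
    return ids


# Hypothetical toy vocabulary, standing in for the "VOCAB" map.
vocab = {"[UNK]": 0, "un": 1, "##aff": 2, "##able": 3, "hello": 4, "!": 5}
print(tokenize("Hello unaffable!", vocab))  # [4, 1, 2, 3, 5]
```

Running the same function at training and inference time is what keeps the two paths consistent, which is the motivation stated in the commit message.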
-
Committed by wuhuachaocoding
-
Committed by Zeng Jinle
-
Committed by Wilber
-
Committed by Wilber
-
Committed by zmx
* bug fix for DeserializeSelectedRows. test=develop * fix bug for SerializeSelectedRows. test=develop * update. test=develop
-
Committed by Huihuang Zheng
Add CINN compile option in CMake. You can now use CINN in Paddle by passing `-DWITH_CINN=ON` to `cmake`. To test it, run `make cinn_lib_test -j` and `ctest -R cinn_lib_test`.
Note:
1. When running the test, set the following, with `${CINN_SOURCE_DIR}` pointing to your CINN directory:
```
export runtime_include_dir=${CINN_SOURCE_DIR}/cinn/runtime/cuda
```
2. CINN is still under development, so you may have to change `CINN_GIT_TAG` to the git commit you need.
-
Committed by wenbin
* fix * remove const
-
Committed by Aurelius84
-
- 19 Oct, 2021 10 commits
-
-
Committed by Weilong Wu
* Support elementwise_add triple grad Kernel * Change code-format to follow CI std
-
Committed by zhulei
* [NPU] Add iou_similarity op
-
Committed by Qi Li
* [NPU] update inference cmake, test=develop * address review comments, test=develop * fix compile error when WITH_ASCEND_CXX11 ON, test=develop
-
Committed by danleifeng
-
Committed by Wilber
* update * fix ut error * update ut
-
Committed by jiangcheng
* add feed op and new var for the generated subgraph * improve the test script of build_cinn_pass * remove useless clear and improve some annotations
-
Committed by wangxinxin08
* add nearest_interp_v2 trt plugin
-
Committed by WangXi
-
Committed by littletomatodonkey
* fix replicate pad when input size is 0 * add unit test
-
Committed by Yulong Ao
* Add QR decomposition op * Change codes to adapt to new svd_helper * Update linalg.py Restore the deleted comma * Restore the deleted line * Update linalg.py * Update linalg.py * Improve the qr code by reviews * Update QR based on CI results * Update qr doc, test=document_fix * Change unsafe and ill-formed codes
-