paddle/fluid/framework/operator.cc · edff5b7975dcbf5ed6996f952de79be8ca15b49a · Crayon鑫 / Paddle

[Cherry-pick] Add FasterTokenizer Operator (#36716) · edff5b79

Committed by Steffy-zxf on Oct 26, 2021

* Add FasterTokenizer Operator (#34491)

Add tokenizer-related functionality for Transformer models so that text processing is consistent between training and prediction.

* Support a text string as an input Tensor.
* Support the "VOCAB" unordered_map<wstring, int> as an input Tensor for token lookup.
* Add a tokenizer for BERT that performs end-to-end tokenization from a text string to wordpieces.
* It first applies basic tokenization, followed by wordpiece tokenization.
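
The two-stage flow described above (basic tokenization, then wordpiece lookup against an unordered_map<wstring, int> vocabulary) can be sketched roughly as follows. This is an illustrative sketch only, not the code added to Paddle by this commit: the names Vocab, BasicTokenize, WordPieceTokenize and Tokenize are hypothetical, the basic step is simplified to lowercasing plus whitespace and punctuation splitting (it omits steps such as Unicode normalization and CJK handling), and the "##" continuation prefix and "[UNK]" token follow the usual BERT vocabulary convention.

```cpp
#include <cassert>
#include <cwctype>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical alias mirroring the "VOCAB" map from the commit message.
using Vocab = std::unordered_map<std::wstring, int>;

// Simplified basic tokenization: lowercase, split on whitespace, and emit
// each punctuation character as its own token. Character classification
// uses the current C locale, so only ASCII is handled reliably here.
std::vector<std::wstring> BasicTokenize(const std::wstring& text) {
  std::vector<std::wstring> tokens;
  std::wstring current;
  auto flush = [&]() {
    if (!current.empty()) {
      tokens.push_back(current);
      current.clear();
    }
  };
  for (wchar_t ch : text) {
    if (std::iswspace(ch)) {
      flush();
    } else if (std::iswpunct(ch)) {
      flush();
      tokens.push_back(std::wstring(1, ch));
    } else {
      current.push_back(static_cast<wchar_t>(std::towlower(ch)));
    }
  }
  flush();
  return tokens;
}

// Wordpiece step: greedy longest-match-first. Pieces after the first are
// looked up with a "##" prefix. If any position has no match, the whole
// word maps to the unknown token.
std::vector<int> WordPieceTokenize(const std::wstring& word,
                                   const Vocab& vocab,
                                   const std::wstring& unk = L"[UNK]") {
  assert(vocab.count(unk) == 1 && "vocab must contain the unknown token");
  std::vector<int> ids;
  std::size_t start = 0;
  while (start < word.size()) {
    std::size_t end = word.size();
    bool found = false;
    while (end > start) {
      std::wstring piece = word.substr(start, end - start);
      if (start > 0) piece = L"##" + piece;
      auto it = vocab.find(piece);
      if (it != vocab.end()) {
        ids.push_back(it->second);
        found = true;
        break;
      }
      --end;  // shrink the candidate and try a shorter piece
    }
    if (!found) return {vocab.at(unk)};
    start = end;
  }
  return ids;
}

// End-to-end: text string -> basic tokens -> wordpiece ids.
std::vector<int> Tokenize(const std::wstring& text, const Vocab& vocab) {
  std::vector<int> ids;
  for (const std::wstring& word : BasicTokenize(text)) {
    std::vector<int> piece_ids = WordPieceTokenize(word, vocab);
    ids.insert(ids.end(), piece_ids.begin(), piece_ids.end());
  }
  return ids;
}
```

With a toy vocabulary containing "[UNK]", "hello", ",", "un", "##aff" and "##able", calling Tokenize on L"Hello, unaffable" would yield the ids for hello, the comma, un, ##aff and ##able in that order, while a word with no matching pieces collapses to the single "[UNK]" id.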

* optimize fast tokenizer

* remove const_cast

Co-authored-by: zhoushunjie <zhoushunjie@baidu.com>
Co-authored-by: wawltor <fangzeyang0904@hotmail.com>


operator.cc 61.4 KB

Crayon鑫 / Paddle (in sync with the fork's source project)
