[Cherry-pick] Add FasterTokenizer Operator (#36716)
* Add FasterTokenizer Operator (#34491)

Add tokenizer-related functionality for Transformer models so that text processing is consistent between training and prediction.

* support a text string as an input Tensor
* support the "VOCAB" unordered_map<wstring, int> as an input Tensor used to look up tokens
* Tokenizer used for BERT. This tokenizer applies end-to-end tokenization, from text string to wordpieces.
* It first applies basic tokenization, followed by wordpiece tokenization.

* optimize fast tokenizer

* remove const_cast

Co-authored-by: Nzhoushunjie <zhoushunjie@baidu.com>
Co-authored-by: Nwawltor <fangzeyang0904@hotmail.com>
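
The two-stage flow described in the commit message (basic tokenization, then wordpiece tokenization against a token-to-id vocabulary) can be illustrated with a minimal Python sketch. This is not the operator's C++ code: the function names, the tiny vocabulary, the `##` continuation prefix, and the `[UNK]` fallback are assumptions taken from common BERT conventions, and the punctuation handling is deliberately simplified.

```python
import unicodedata


def basic_tokenize(text, do_lower_case=True):
    """Split on whitespace and emit each punctuation character as its own token."""
    if do_lower_case:
        text = text.lower()
    tokens, current = [], []
    for ch in text:
        if ch.isspace():
            if current:
                tokens.append("".join(current))
                current = []
        elif unicodedata.category(ch).startswith("P"):
            # Simplification: only Unicode punctuation categories are split out.
            if current:
                tokens.append("".join(current))
                current = []
            tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def wordpiece_tokenize(word, vocab, unk_token="[UNK]", max_chars=100):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    if len(word) > max_chars:
        return [unk_token]
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # assumed continuation marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No prefix of the remainder is in the vocabulary.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


def tokenize_to_ids(text, vocab):
    """Run basic tokenization, then wordpiece, then look up each piece's id."""
    ids = []
    for word in basic_tokenize(text):
        for piece in wordpiece_tokenize(word, vocab):
            ids.append(vocab[piece])
    return ids


# Hypothetical toy vocabulary, standing in for the "VOCAB" map.
vocab = {"[UNK]": 0, "un": 1, "##aff": 2, "##able": 3, "hello": 4, ",": 5}
print(tokenize_to_ids("Unaffable, hello", vocab))
```

Running tokenization inside the model graph, as the commit describes, means the same splitting and lookup logic is used for both training and prediction instead of being reimplemented in separate preprocessing scripts.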
Showing
cmake/external/utf8proc.cmake
0 → 100644
This diff is collapsed.