    [Cherry-pick] Add FasterTokenizer Operator (#36716) · edff5b79
    Committed by Steffy-zxf
    * Add FasterTokenizer Operator (#34491)
    
    Add tokenizer-related functionality for Transformer models so that text is processed consistently during training and prediction.
    
    * Support text strings as an input Tensor.
    * Support the "VOCAB" unordered_map<wstring, int> as an input Tensor used to look up tokens.
    * Tokenizer used for BERT. This tokenizer applies end-to-end tokenization, from text string to wordpiece.
    * It first applies basic tokenization, followed by wordpiece tokenization.
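
The two-stage flow described above (basic tokenization, then wordpiece) can be sketched as follows. This is a minimal illustration, not the operator added by this commit: the names `Vocab`, `BasicTokenize`, `WordPiece`, and `Tokenize` are hypothetical, the `##` continuation prefix and the unknown-token fallback follow the usual BERT convention, and only ASCII input is handled (no Unicode normalization, CJK splitting, special tokens, or maximum word length).

```cpp
#include <cassert>  // only needed if you add checks around the sketch
#include <cwctype>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical alias mirroring the "VOCAB" map mentioned in the commit message.
using Vocab = std::unordered_map<std::wstring, int>;

// Stage 1: basic tokenization. Lowercases the text, splits on whitespace,
// and emits each punctuation character as its own token.
std::vector<std::wstring> BasicTokenize(const std::wstring& text) {
  std::vector<std::wstring> tokens;
  std::wstring current;
  auto flush = [&]() {
    if (!current.empty()) {
      tokens.push_back(current);
      current.clear();
    }
  };
  for (wchar_t ch : text) {
    if (std::iswspace(ch)) {
      flush();
    } else if (std::iswpunct(ch)) {
      flush();
      tokens.push_back(std::wstring(1, ch));
    } else {
      current.push_back(static_cast<wchar_t>(std::towlower(ch)));
    }
  }
  flush();
  return tokens;
}

// Stage 2: greedy longest-match-first wordpiece lookup for a single word.
// Pieces after the first carry the "##" prefix. If any position has no
// matching piece, the whole word maps to unk_id.
std::vector<int> WordPiece(const std::wstring& word, const Vocab& vocab,
                           int unk_id) {
  std::vector<int> ids;
  std::size_t start = 0;
  while (start < word.size()) {
    std::size_t end = word.size();
    int match = -1;
    while (end > start) {
      std::wstring piece = word.substr(start, end - start);
      if (start > 0) piece = L"##" + piece;
      auto it = vocab.find(piece);
      if (it != vocab.end()) {
        match = it->second;
        break;
      }
      --end;
    }
    if (match < 0) return {unk_id};
    ids.push_back(match);
    start = end;
  }
  return ids;
}

// End to end: text string -> basic tokens -> wordpiece ids.
std::vector<int> Tokenize(const std::wstring& text, const Vocab& vocab,
                          int unk_id) {
  std::vector<int> ids;
  for (const auto& word : BasicTokenize(text)) {
    std::vector<int> pieces = WordPiece(word, vocab, unk_id);
    ids.insert(ids.end(), pieces.begin(), pieces.end());
  }
  return ids;
}
```

With a vocabulary that contains `un`, `##aff`, and `##able`, the word `Unaffable` is lowercased and split into those three pieces, while a word with no matching pieces collapses to the single unknown id.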
    
    * optimize fast tokenizer
    
    * remove const_cast
    Co-authored-by: zhoushunjie <zhoushunjie@baidu.com>
    Co-authored-by: wawltor <fangzeyang0904@hotmail.com>
operator.cc 61.4 KB