diff --git a/doc/tutorials/embedding_model/index_cn.md b/doc/tutorials/embedding_model/index_cn.md
index fe800308d8d7a03619ec8e13fd8dc4aa7a8ed8be..2b4a79fbbfc0c4af74aa73c540919f5d9cf2635b 100644
--- a/doc/tutorials/embedding_model/index_cn.md
+++ b/doc/tutorials/embedding_model/index_cn.md
@@ -6,9 +6,10 @@
 
 ## 介绍 ###
 ### 中文字典 ###
-我们的字典使用内部的分词工具对百度知道和百度百科的语料进行分词后产生。分词风格如下: "《红楼梦》"将被分为 "《","红楼梦","》",和 "《红楼梦》"。字典采用UTF8编码,输出有2列:词本身和词频。字典共包含 3206325个词和3个特殊标记:
+我们的字典使用内部的分词工具对百度知道和百度百科的语料进行分词后产生。分词风格如下: "《红楼梦》"将被分为 "《","红楼梦","》",和 "《红楼梦》"。字典采用UTF8编码,输出有2列:词本身和词频。字典共包含 3206326个词和4个特殊标记:
  - ``: 分词序列的开始
  - ``: 分词序列的结束
+ - `PALCEHOLDER_JUST_IGNORE_THE_EMBEDDING`: 占位符,没有实际意义
  - ``: 未知词
 
 ### 中文词向量的预训练模型 ###
diff --git a/doc/tutorials/embedding_model/index_en.md b/doc/tutorials/embedding_model/index_en.md
index d793a50f488e464bcd90a2fb506a8dcc3c760433..9525f64f9b5384c8e44690fb0887fb2293108e0a 100644
--- a/doc/tutorials/embedding_model/index_en.md
+++ b/doc/tutorials/embedding_model/index_en.md
@@ -6,9 +6,10 @@ We thank @lipeng for the pull request that defined the model schemas and pretrai
 
 ## Introduction ###
 ### Chinese Word Dictionary ###
-Our Chinese-word dictionary is created on Baidu ZhiDao and Baidu Baike by using in-house word segmentor. For example, the participle of "《红楼梦》" is "《","红楼梦","》",and "《红楼梦》". Our dictionary (using UTF-8 format) has has two columns: word and its frequency. The total word count is 3206325, including 3 special token:
+Our Chinese-word dictionary is created on Baidu ZhiDao and Baidu Baike by using in-house word segmentor. For example, the participle of "《红楼梦》" is "《","红楼梦","》",and "《红楼梦》". Our dictionary (using UTF-8 format) has has two columns: word and its frequency. The total word count is 3206326, including 4 special token:
  - ``: the start of a sequence
  - ``: the end of a sequence
+ - `PALCEHOLDER_JUST_IGNORE_THE_EMBEDDING`: a placeholder, just ignore it and its embedding
  - ``: a word not included in dictionary
 
 ### Pretrained Chinese Word Embedding Model ###