From bf1a4afba97b2d0ddf106b0b524410bfcc7964cb Mon Sep 17 00:00:00 2001
From: Peng Li
Date: Mon, 17 Apr 2017 19:31:58 +0800
Subject: [PATCH] The description for vocabulary file is not consistent with the latest file

---
 doc/tutorials/embedding_model/index_cn.md | 3 ++-
 doc/tutorials/embedding_model/index_en.md | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/doc/tutorials/embedding_model/index_cn.md b/doc/tutorials/embedding_model/index_cn.md
index fe800308d8..2b4a79fbbf 100644
--- a/doc/tutorials/embedding_model/index_cn.md
+++ b/doc/tutorials/embedding_model/index_cn.md
@@ -6,9 +6,10 @@
 ## 介绍 ###
 ### 中文字典 ###
-我们的字典使用内部的分词工具对百度知道和百度百科的语料进行分词后产生。分词风格如下: "《红楼梦》"将被分为 "《","红楼梦","》",和 "《红楼梦》"。字典采用UTF8编码,输出有2列:词本身和词频。字典共包含 3206325个词和3个特殊标记:
+我们的字典使用内部的分词工具对百度知道和百度百科的语料进行分词后产生。分词风格如下: "《红楼梦》"将被分为 "《","红楼梦","》",和 "《红楼梦》"。字典采用UTF8编码,输出有2列:词本身和词频。字典共包含 3206326个词和4个特殊标记:
 - ``: 分词序列的开始
 - ``: 分词序列的结束
+ - `PALCEHOLDER_JUST_IGNORE_THE_EMBEDDING`: 占位符,没有实际意义
 - ``: 未知词
 ### 中文词向量的预训练模型 ###
diff --git a/doc/tutorials/embedding_model/index_en.md b/doc/tutorials/embedding_model/index_en.md
index d793a50f48..9525f64f9b 100644
--- a/doc/tutorials/embedding_model/index_en.md
+++ b/doc/tutorials/embedding_model/index_en.md
@@ -6,9 +6,10 @@ We thank @lipeng for the pull request that defined the model schemas and pretrai
 ## Introduction ###
 ### Chinese Word Dictionary ###
-Our Chinese-word dictionary is created on Baidu ZhiDao and Baidu Baike by using in-house word segmentor. For example, the participle of "《红楼梦》" is "《","红楼梦","》",and "《红楼梦》". Our dictionary (using UTF-8 format) has has two columns: word and its frequency. The total word count is 3206325, including 3 special token:
+Our Chinese-word dictionary is created on Baidu ZhiDao and Baidu Baike by using in-house word segmentor. For example, the participle of "《红楼梦》" is "《","红楼梦","》",and "《红楼梦》". Our dictionary (using UTF-8 format) has has two columns: word and its frequency. The total word count is 3206326, including 4 special token:
 - ``: the start of a sequence
 - ``: the end of a sequence
+ - `PALCEHOLDER_JUST_IGNORE_THE_EMBEDDING`: a placeholder, just ignore it and its embedding
 - ``: a word not included in dictionary
 ### Pretrained Chinese Word Embedding Model ###
-- 
GitLab
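
Review note: the patched text describes the dictionary as a UTF-8 file with two columns (word and frequency) plus a placeholder entry that should be ignored. A minimal sketch of how a reader might load such a file follows. It assumes the two columns are separated by whitespace, which the docs do not state; the function name `load_vocab` and the sample frequencies are hypothetical and only illustrate the format. The placeholder spelling is copied verbatim from the patch.

```python
import io

# Spelling copied verbatim from the patched docs.
PLACEHOLDER_TOKEN = "PALCEHOLDER_JUST_IGNORE_THE_EMBEDDING"


def load_vocab(lines):
    """Parse 'word<whitespace>frequency' lines into a dict of word -> int.

    Assumption: columns are whitespace-separated. Lines that do not have
    exactly two fields with an integer frequency are skipped.
    """
    vocab = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Split on the last whitespace run so the word column stays whole.
        fields = line.rsplit(None, 1)
        if len(fields) != 2 or not fields[1].isdigit():
            continue
        word, freq = fields
        if word == PLACEHOLDER_TOKEN:
            # The docs say to ignore the placeholder and its embedding.
            continue
        vocab[word] = int(freq)
    return vocab


# Made-up sample data; the real file has about 3.2 million entries.
sample = io.StringIO(
    "红楼梦\t120\n"
    "《\t300\n"
    "PALCEHOLDER_JUST_IGNORE_THE_EMBEDDING\t0\n"
)
vocab = load_vocab(sample)
print(vocab)
```

For a file on disk, the same function can be fed `open(path, encoding="utf-8")`, where `path` is wherever the dictionary was downloaded.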