@@ -6,9 +6,10 @@ We thank @lipeng for the pull request that defined the model schemas and pretrai
## Introduction ###
### Chinese Word Dictionary ###
Our Chinese-word dictionary is built from Baidu ZhiDao and Baidu Baike using an in-house word segmenter. For example, segmenting "《红楼梦》" yields "《", "红楼梦", "》", and "《红楼梦》". The dictionary (UTF-8 encoded) has two columns: the word and its frequency. The total word count is 3206325, including 3 special tokens:
Our Chinese-word dictionary is built from Baidu ZhiDao and Baidu Baike using an in-house word segmenter. For example, segmenting "《红楼梦》" yields "《", "红楼梦", "》", and "《红楼梦》". The dictionary (UTF-8 encoded) has two columns: the word and its frequency. The total word count is 3206326, including 4 special tokens:
- `<s>`: the start of a sequence
- `<e>`: the end of a sequence
- `PALCEHOLDER_JUST_IGNORE_THE_EMBEDDING`: a placeholder, just ignore it and its embedding
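
As a rough illustration of the two-column layout described above, the following sketch parses word/frequency lines into a Python dict and checks which special tokens are present. The function name `load_dictionary`, the whitespace column separator, and the sample frequencies are all assumptions made for this example; the actual dictionary file may use a different delimiter.

```python
import io

# Special tokens named in this document; the placeholder token's spelling
# is copied verbatim from the dictionary.
SPECIAL_TOKENS = ("<s>", "<e>", "PALCEHOLDER_JUST_IGNORE_THE_EMBEDDING", "<unk>")


def load_dictionary(stream):
    """Parse lines of 'word<whitespace>frequency' into a {word: int} dict.

    Hypothetical helper: the column separator is assumed to be whitespace.
    """
    freqs = {}
    for line_no, line in enumerate(stream, start=1):
        if not line.strip():
            continue  # skip blank lines
        # Split on the last run of whitespace so the word column stays intact.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected 'word frequency'")
        word, count = parts
        freqs[word] = int(count)
    return freqs


# Made-up sample data; for a real file, pass open(path, encoding="utf-8").
sample = io.StringIO("<s>\t100\n<e>\t100\n红楼梦\t42\n")
vocab = load_dictionary(sample)
missing = [tok for tok in SPECIAL_TOKENS if tok not in vocab]
```

Here `missing` lists any special tokens absent from the loaded data, which can serve as a quick sanity check after loading.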
<span id="introduction"></span><h2>Introduction<a class="headerlink" href="#introduction" title="Permalink to this headline">¶</a></h2>
<div class="section" id="chinese-word-dictionary">
<span id="chinese-word-dictionary"></span><h3>Chinese Word Dictionary<a class="headerlink" href="#chinese-word-dictionary" title="Permalink to this headline">¶</a></h3>
<p>Our Chinese-word dictionary is built from Baidu ZhiDao and Baidu Baike using an in-house word segmenter. For example, segmenting “《红楼梦》” yields “《”, “红楼梦”, “》”, and “《红楼梦》”. The dictionary (UTF-8 encoded) has two columns: the word and its frequency. The total word count is 3206325, including 3 special tokens:</p>
<p>Our Chinese-word dictionary is built from Baidu ZhiDao and Baidu Baike using an in-house word segmenter. For example, segmenting “《红楼梦》” yields “《”, “红楼梦”, “》”, and “《红楼梦》”. The dictionary (UTF-8 encoded) has two columns: the word and its frequency. The total word count is 3206326, including 4 special tokens:</p>
<ul class="simple">
<li><code class="docutils literal"><span class="pre">&lt;s&gt;</span></code>: the start of a sequence</li>
<li><code class="docutils literal"><span class="pre">&lt;e&gt;</span></code>: the end of a sequence</li>
<li><code class="docutils literal"><span class="pre">PALCEHOLDER_JUST_IGNORE_THE_EMBEDDING</span></code>: a placeholder, just ignore it and its embedding</li>
<li><code class="docutils literal"><span class="pre">&lt;unk&gt;</span></code>: a word not included in dictionary</li>