From 9b7fc7e903ff408095c17d893cde9e9c9bd9b08d Mon Sep 17 00:00:00 2001 From: yangyaming Date: Thu, 12 Oct 2017 11:38:22 +0800 Subject: [PATCH] Add doc for Chinese LM. --- README.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index fca2528a..efd45efb 100644 --- a/README.md +++ b/README.md @@ -232,13 +232,13 @@ Now the preprocessing is done and we get a clean corpus to train the language mo #### Mandarin LM -Different from the English language model, Mandarin language model is character-based where each token is a Chinese character. We use an internal corpus to train the released Mandarin language model. This corpus contains billions of tokens. The preprocessing has tiny difference from English language model and main steps include: +Different from the English language model, Mandarin language model is character-based where each token is a Chinese character. We use internal corpus to train the released Mandarin language models. The corpus contain billions of tokens. The preprocessing has tiny difference from English language model and main steps include: * The beginning and trailing whitespace characters are removed. * English punctuations and Chinese punctuations are removed. * A whitespace character between two tokens is inserted. -Please notice that the released language model only contains Chinese simplified characters. After preprocessing done we can begin to train the language model. The key training arguments are '-o 5 --prune 0 1 2 4 4'. Please refer above section for the meaning of each argument. We also convert the arpa file to binary file using default settings. +Please notice that the released language models only contain Chinese simplified characters. After preprocessing done we can begin to train the language model. The key training arguments for small LM is '-o 5 --prune 0 1 2 4 4' and '-o 5' for large LM. Please refer above section for the meaning of each argument. We also convert the arpa file to binary file using default settings. ### Speech-to-text Inference @@ -459,10 +459,11 @@ Mandarin | [Internal Mandarin Model](to-be-added) | Baidu Mandarin Dataset | 291 #### Language Model Released -Language Model | Training Data | Token-based | Size | Filter Configuraiton -:-------------:| :------------:| :-----: | -----: | -----------------: -[English LM](http://paddlepaddle.bj.bcebos.com/model_zoo/speech/common_crawl_00.prune01111.trie.klm) | To Be Added | Word-based | 8.3 GB | To Be Added -[Mandarin LM](http://cloud.dlnel.org/filepub/?uuid=d21861e4-4ed6-45bb-ad8e-ae417a43195e) | To Be Added | Character-based | 2.8 GB | To Be Added +Language Model | Training Data | Token-based | Size | Description +:-------------:| :------------:| :-----: | -----: | :-----------------: +[English LM](http://paddlepaddle.bj.bcebos.com/model_zoo/speech/common_crawl_00.prune01111.trie.klm) | [en.00.deduped.xz](http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/en/deduped/en.00.deduped.xz) | Word-based | 8.3 GB | pruned with 0 1 1 1 1
about 1.85 billion n-grams
'trie' binary with '-a 22 -q 8 -b 8' +[Mandarin LM Small](http://cloud.dlnel.org/filepub/?uuid=d21861e4-4ed6-45bb-ad8e-ae417a43195e) | Small internel data | Character-based | 2.8 GB | pruned with 0 1 2 4 4
about 0.13 billion n-grams
'probing' binary with default settings +Mandarin LM Large | Large internel data | Character-based | 70.4 GB | no pruning
about 3.7 billion n-grams
'probing' binary with default settings ## Experiments and Benchmarks -- GitLab