Commit e99d6f4f authored by Yang yaming, committed by GitHub

Merge pull request #295 from pkuyym/fix-294

Add doc for English LM.
...
@@ -217,7 +217,18 @@
cd models/lm
sh download_lm_en.sh
sh download_lm_ch.sh
```
If you wish to train your own better language model, please refer to [KenLM](https://github.com/kpu/kenlm) for tutorials. Here we provide some tips on how we prepared our English and Mandarin language models; you can take them as a reference when training your own.
#### English LM
The English corpus is from the [Common Crawl Repository](http://commoncrawl.org) and can be downloaded from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our English language model. Several preprocessing steps are applied before training (see the sketch after the list below):
* Characters not in \[A-Za-z0-9\s'\] (\s represents whitespace characters) are removed, and Arabic numerals are converted to English words, e.g. 1000 to one thousand.
* Repeated whitespace characters are squeezed to one and leading whitespace characters are removed. Note that all transcriptions are lowercase, so all characters are converted to lowercase.
* The 400,000 most frequent words are selected to build the vocabulary and the rest are replaced with 'UNKNOWNWORD'.
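A minimal shell sketch of the download and the character-level cleaning steps; the file name on the statmt server and `corpus.txt` are assumptions, and the numeral-to-word conversion and 'UNKNOWNWORD' replacement are omitted since they are easier to do with a dedicated script:

```bash
# Download part en.00 (the exact file name is assumed here; check
# http://data.statmt.org/ngrams/deduped_en for the actual one).
wget http://data.statmt.org/ngrams/deduped_en/en.00.deduped.xz
xz -d en.00.deduped.xz

# Lowercase, keep only [a-z0-9'] plus whitespace, squeeze repeated
# spaces, and strip leading spaces.
tr '\t' ' ' < en.00.deduped \
  | tr 'A-Z' 'a-z' \
  | tr -cd "a-z0-9' \n" \
  | sed -e 's/  */ /g' -e 's/^ //' > corpus.txt
```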
Now the preprocessing is done and we get a clean corpus to train the language model. Our released language model is trained with the arguments '-o 5 --prune 0 1 1 1 1'. '-o 5' means the maximum order of the language model is 5. '--prune 0 1 1 1 1' gives the count thresholds for each order; more specifically, it prunes singletons for orders two and higher. To save disk storage we convert the ARPA file to a 'trie' binary file with the arguments '-a 22 -q 8 -b 8'. '-a' is the maximum number of leading bits of pointers in the 'trie' to chop. '-q' and '-b' are quantization parameters for probability and backoff.
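As a reference, here is a sketch of the corresponding [KenLM](https://github.com/kpu/kenlm) commands, assuming `lmplz` and `build_binary` have been built from KenLM and `corpus.txt` is the cleaned corpus from above:

```bash
# Train a 5-gram model, pruning singletons for orders two and higher.
lmplz -o 5 --prune 0 1 1 1 1 < corpus.txt > lm.arpa

# Convert the ARPA file to the compact 'trie' binary format:
# -a 22 chops up to 22 leading bits of pointers, and -q 8 / -b 8
# quantize probabilities and backoff weights to 8 bits each.
build_binary -a 22 -q 8 -b 8 trie lm.arpa lm.binary
```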
TODO: any other requirements or tips to add?
...
@@ -296,7 +307,7 @@ The hyper-parameters $\alpha$ (language model weight) and $\beta$ (word insertio
```bash
python tools/tune.py --use_gpu False
```
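The range and granularity of the grid can also be specified; the flag names below are an assumption based on common usage of this script, so check `python tools/tune.py --help` for the actual arguments:

```bash
# Assumed flag names; verify with `python tools/tune.py --help`.
python tools/tune.py \
    --use_gpu False \
    --alpha_from 1.0 --alpha_to 3.2 --num_alphas 45 \
    --beta_from 0.1 --beta_to 0.45 --num_betas 8
```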
The grid search will print the WER (word error rate) or CER (character error rate) at each point in the hyper-parameter space, and optionally draw the error surface. A proper hyper-parameter range should include the global minimum of the error surface for WER/CER, as illustrated in the following figure.
<p align="center">
<img src="docs/images/tuning_error_surface.png" width=550>
...