How to build Mandarin LM ? (#315) · Issue · PaddlePaddle / DeepSpeech

How to build Mandarin LM ?

Created by: yflin1

I trained my own language model, and it tests fine with the kenlm module in Python.

But when I use it in the DeepSpeech-1 project for speech recognition, every recognition result is an empty string, whereas the pre-trained language model provided with the project works fine. I don't know which step I got wrong.

I preprocessed the input corpus as follows:

remove whitespace and non-Chinese characters
insert a space between every pair of adjacent characters

The processed file (train.txt, 2.54 million lines in total) looks like this:

在 苏 黎 世 主 宰 的 瑞 士 金 融 界 来 自 巴 塞 尔 的 奥 斯 佩 尔 迄 今 仍 然 是 局 外 人
位 于 前 五 名 的 另 外 三 个 城 市 为 哥 本 哈 根 苏 黎 世 和 东 京 纽 约 名 列 全 球 第 七
但 是 在 苏 黎 世 的 公 寓 里 他 很 快 就 恢 复 了 过 去 的 秩 序 和 旧 礼 节
此 外 巴 赫 家 族 的 音 乐 将 贯 穿 整 个 艺 术 节 尤 见 于 苏 黎 世 芭 蕾 舞 团 以 至 安 洁 拉 休 伊 特 与 欧 洲 嘉 兰 乐 团 的 节 目
...
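
For reference, the two preprocessing steps above could be sketched in Python roughly as follows. This is only an illustration, not the script actually used: the function names and the input file name raw.txt are hypothetical, and "Chinese character" is assumed to mean the CJK Unified Ideographs range U+4E00 to U+9FFF.

```python
import re

# Assumption: "Chinese character" means the CJK Unified Ideographs block.
# Everything else (whitespace, Latin letters, digits, punctuation) is dropped.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def to_char_tokens(line: str) -> str:
    """Remove non-Chinese characters, then separate the rest with single spaces."""
    kept = NON_CHINESE.sub("", line)
    return " ".join(kept)


def preprocess(src_path: str, dst_path: str) -> None:
    """Write one space-separated line per input line, skipping lines left empty."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            tokens = to_char_tokens(line)
            if tokens:
                dst.write(tokens + "\n")


if __name__ == "__main__":
    # Hypothetical file names; train.txt is the file passed to lmplz.
    import os
    if os.path.exists("raw.txt"):
        preprocess("raw.txt", "train.txt")
```

With this sketch, to_char_tokens("这是一个 test 测试。") gives "这 是 一 个 测 试", which matches the format of the sample lines above.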

I trained the model with the following commands:

./kenlm_asr/build/bin/lmplz -o 5 --prune 0 1 2 4 4 -T . < train.txt > train.arpa
./kenlm_asr/build/bin/build_binary train.arpa train.klm

The generated .arpa file looks like this:

[yflin@p100 4_3ngram]$ head train.arpa -n 15
\data\
ngram 1=4834
ngram 2=1108956
ngram 3=2311307
ngram 4=962852
ngram 5=467075

\1-grams:
-6.0460653      <unk>   0
0       <s>     -2.6255522
-2.5958326      </s>    0
-2.791055       而      -1.0586245
-2.831564       对      -1.2137116
-3.3869455      楼      -0.5979735
-3.1091366      市      -0.7209683
...
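
One sanity check on a character-level model like this is to confirm that every unigram in the .arpa file is a single character, apart from the special tokens <unk>, <s> and </s> that KenLM writes. The sketch below does that with plain Python; the function names are hypothetical, and it only verifies the preprocessing. It is not claimed to explain the empty transcriptions.

```python
# Special tokens that KenLM writes into the unigram section.
SPECIAL = {"<unk>", "<s>", "</s>"}


def unigram_tokens(lines):
    """Return the token column of the \\1-grams: section of an ARPA file."""
    tokens = []
    in_unigrams = False
    for raw in lines:
        line = raw.strip()
        if line == "\\1-grams:":
            in_unigrams = True
        elif in_unigrams and line.startswith("\\"):
            break  # reached \2-grams: or \end\
        elif in_unigrams and line:
            # Each entry is: log10 probability, token, optional backoff weight.
            fields = line.split()
            if len(fields) >= 2:
                tokens.append(fields[1])
    return tokens


def multi_char_tokens(tokens):
    """Return tokens that are neither special nor exactly one character long."""
    return [t for t in tokens if t not in SPECIAL and len(t) != 1]


if __name__ == "__main__":
    import os
    if os.path.exists("train.arpa"):
        with open("train.arpa", encoding="utf-8") as f:
            print(multi_char_tokens(unigram_tokens(f)))
```

An empty list means every unigram is a single character, as the preprocessing intends.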

I tested the language model in Python:

>>> import kenlm
>>> model=kenlm.Model("train.klm")
>>> model.score('这 是 一 个 测 试',bos = True,eos = True)
-8.183526992797852

With my language model, speech recognition produces the following output:

[INFO 2019-03-27 14:18:25,014 server.py:122] start inference ...

Output Transcription:
[INFO 2019-03-27 14:18:25,150 server.py:146] finish inference

If I switch to the pre-trained language model provided with the project, the recognition result is as follows (recognized correctly):

[INFO 2019-03-27 14:19:47,514 server.py:122] start inference ...

Output Transcription: 这是一个测试
[INFO 2019-03-27 14:19:47,799 server.py:146] finish inference
