Size of custom language model is very small compared to common_crawl_00.prune01111.trie.klm (#173) · Issue · PaddlePaddle / DeepSpeech

Created by: dalonlobo

Hi,

I've created a language model using KenLM on Common Crawl data. The resulting ARPA file is only 51 MB, whereas the trained language model you provide (common_crawl_00.prune01111.trie.klm), which was also trained on the Common Crawl dataset, is close to 8 GB. Could you explain why this might happen?

Command used to create the language model: bin/lmplz -o 5 --prune 01111 -T /datadrive/ -S 50% --text en.00.deduped --arpa en.00.deduped.arpa

Output:

=== 1/5 Counting and sorting n-grams ===
Reading /datadrive/speechexperiments/paddle/datasets/en.00.deduped.sadguru.processed
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 9630074265 types 109380831
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:1312569972 2:6413600768 3:12025502720 4:19240802304 5:28059504640
Statistics:
1 116081/109380831 D1=0.832246 D2=1.08381 D3+=1.25936
2 641408/699725052 D1=0.817398 D2=1.02413 D3+=1.20664
3 525936/2424418023 D1=0.843848 D2=1.08604 D3+=1.26471
4 241171/4531878860 D1=0.894573 D2=1.15259 D3+=1.2933
5 152819/5933461247 D1=0.898007 D2=1.2717 D3+=1.32749
Memory estimate for binary LM:
type       kB
probing 38645 assuming -p 1.5
probing 47352 assuming -r models -p 1.5
trie    20622 without quantization
trie    12119 assuming -q 8 -b 8 quantization 
trie    18450 assuming -a 22 array pointer compression
trie     9947 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:1392972 2:10262528 3:10518720 4:5788104 5:4278932
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:1392972 2:10262528 3:10518720 4:5788104 5:4278932
=== 5/5 Writing ARPA model ===
Name:lmplz	VmPeak:67678176 kB	VmRSS:85464 kB	RSSMax:67560500 kB	user:12830.9	sys:2520.67	CPU:15351.5	real:15451.3
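
To make the size gap easier to reason about, here is a minimal Python sketch that parses the "Statistics" lines from the output above and reports how many n-grams survive per order. It assumes each line has the layout "order kept/total D1=... D2=... D3+=...", i.e. that the first number of the pair is the count after pruning and the second is the count before pruning. The helper names (parse_stats, retention_report) are hypothetical and not part of KenLM.

```python
import re

# Statistics lines copied verbatim from the lmplz output above.
STATS = """\
1 116081/109380831 D1=0.832246 D2=1.08381 D3+=1.25936
2 641408/699725052 D1=0.817398 D2=1.02413 D3+=1.20664
3 525936/2424418023 D1=0.843848 D2=1.08604 D3+=1.26471
4 241171/4531878860 D1=0.894573 D2=1.15259 D3+=1.2933
5 152819/5933461247 D1=0.898007 D2=1.2717 D3+=1.32749
"""

# Assumed layout: "<order> <kept>/<total> <discounts...>".
LINE_RE = re.compile(r"^(\d+)\s+(\d+)/(\d+)\s")


def parse_stats(text):
    """Return a list of (order, kept, total) tuples from the Statistics block."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rows.append(tuple(int(group) for group in match.groups()))
    return rows


def retention_report(rows):
    """Return (order, kept, total, kept/total) for each parsed row."""
    return [(order, kept, total, kept / total) for order, kept, total in rows]


if __name__ == "__main__":
    rows = parse_stats(STATS)
    for order, kept, total, fraction in retention_report(rows):
        print(f"{order}-grams: kept {kept:,} of {total:,} ({fraction:.4%})")
    total_kept = sum(kept for _, kept, _ in rows)
    print(f"total n-grams kept: {total_kept:,}")
```

Under that reading, fewer than 0.2% of the n-grams of each order are kept, about 1.7 million in total, which is consistent with a small ARPA file. One possible explanation, stated here as an unconfirmed assumption, is that --prune expects one threshold per order separated by spaces (for example 0 1 1 1 1), so 01111 may have been read as a single large threshold applied to every order.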
