Size of custom language model is very small compared to common_crawl_00.prune01111.trie.klm
Created by: dalonlobo
Hi,
I've created a language model using KenLM on Common Crawl data. The resulting ARPA file is only 51 MB. The trained language model that you provide (common_crawl_00.prune01111.trie.klm), which was also trained on the Common Crawl dataset, is close to 8 GB. Could you tell me why this might happen?
Command used to create the language model: bin/lmplz -o 5 --prune 01111 -T /datadrive/ -S 50% --text en.00.deduped --arpa en.00.deduped.arpa
Output:
=== 1/5 Counting and sorting n-grams ===
Reading /datadrive/speechexperiments/paddle/datasets/en.00.deduped.sadguru.processed
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 9630074265 types 109380831
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:1312569972 2:6413600768 3:12025502720 4:19240802304 5:28059504640
Statistics:
1 116081/109380831 D1=0.832246 D2=1.08381 D3+=1.25936
2 641408/699725052 D1=0.817398 D2=1.02413 D3+=1.20664
3 525936/2424418023 D1=0.843848 D2=1.08604 D3+=1.26471
4 241171/4531878860 D1=0.894573 D2=1.15259 D3+=1.2933
5 152819/5933461247 D1=0.898007 D2=1.2717 D3+=1.32749
Memory estimate for binary LM:
type kB
probing 38645 assuming -p 1.5
probing 47352 assuming -r models -p 1.5
trie 20622 without quantization
trie 12119 assuming -q 8 -b 8 quantization
trie 18450 assuming -a 22 array pointer compression
trie 9947 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:1392972 2:10262528 3:10518720 4:5788104 5:4278932
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:1392972 2:10262528 3:10518720 4:5788104 5:4278932
=== 5/5 Writing ARPA model ===
Name:lmplz VmPeak:67678176 kB VmRSS:85464 kB RSSMax:67560500 kB user:12830.9 sys:2520.67 CPU:15351.5 real:15451.3
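To make the size gap easier to see, here is a minimal sketch that computes what share of n-grams survived in each order, using the Statistics lines from the output above. It assumes each line reads as `order retained/total`, which is my reading of the lmplz output and not something I have confirmed. The names `STATS` and `retained_fractions` are just illustrative helpers.

```python
import re

# Statistics lines copied verbatim from the lmplz output above.
STATS = """\
1 116081/109380831 D1=0.832246 D2=1.08381 D3+=1.25936
2 641408/699725052 D1=0.817398 D2=1.02413 D3+=1.20664
3 525936/2424418023 D1=0.843848 D2=1.08604 D3+=1.26471
4 241171/4531878860 D1=0.894573 D2=1.15259 D3+=1.2933
5 152819/5933461247 D1=0.898007 D2=1.2717 D3+=1.32749
"""

# Assumed line layout: "<order> <retained>/<total> D1=... D2=... D3+=..."
LINE_RE = re.compile(r"^(\d+) (\d+)/(\d+) ")


def retained_fractions(stats_text):
    """Return {order: (retained, total, retained / total)} for each line."""
    result = {}
    for line in stats_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that does not look like a statistics line
        order, retained, total = (int(g) for g in match.groups())
        result[order] = (retained, total, retained / total)
    return result


fractions = retained_fractions(STATS)
for order, (retained, total, frac) in sorted(fractions.items()):
    print(f"{order}-grams: {retained:,} of {total:,} ({frac:.4%})")

total_retained = sum(retained for retained, _, _ in fractions.values())
print(f"n-grams in the model, all orders: {total_retained:,}")
```

If that reading is right, well under 1% of the n-grams in every order ended up in the ARPA file, which would be consistent with the small file size.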