Created by: pkuyym
Mandarin LM
Different from the word-based language model, the Mandarin language model is character-based, i.e. each token is a Chinese character. We use an internal corpus to train the released Mandarin language model. This corpus contains billions of tokens. The preprocessing differs slightly from that of the English language model; the steps are:
- Leading and trailing whitespace characters are removed.
- English punctuation and Chinese punctuation are removed.
- A whitespace character is inserted between every two adjacent tokens.
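The three steps above can be sketched as follows. This is a minimal illustration, not the exact internal pipeline; in particular, the set of Chinese punctuation marks removed here is an assumption, since the precise list used for the released model is not documented.

```python
import string

# Assumed set of Chinese punctuation marks; the actual list used
# internally may differ.
CHINESE_PUNCTUATION = "。，、；：？！（）《》【】「」『』…·"

def preprocess_line(line):
    # Step 1: remove leading and trailing whitespace.
    line = line.strip()
    # Step 2: remove English and Chinese punctuation.
    remove = string.punctuation + CHINESE_PUNCTUATION
    line = line.translate(str.maketrans("", "", remove))
    # Step 3: insert a whitespace between every two adjacent tokens,
    # so that each Chinese character becomes a separate token.
    return " ".join(line.replace(" ", ""))

print(preprocess_line("  你好，世界！  "))  # -> 你 好 世 界
```

After this step, each line of the corpus is a whitespace-separated sequence of single characters, which is the token format the language model trainer expects.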
Please note that the released language model contains only Simplified Chinese characters. Once preprocessing is done, we can begin training the language model. The key training parameters are '-o 5 --prune 0 1 2 4 4'. Please refer to the section above for the meaning of each parameter. We also convert the ARPA file to a binary file using the default settings.
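Assuming KenLM is the trainer (as in the English pipeline described above), the two commands might look like the following sketch; the file names are placeholders, not the actual paths used for the released model:

```shell
# Train a 5-gram model, pruning singleton 2-grams, 3-grams with
# count < 2, and 4/5-grams with count < 4.
lmplz -o 5 --prune 0 1 2 4 4 --text corpus.txt --arpa zh.arpa

# Convert the ARPA file to KenLM's binary format with default settings.
build_binary zh.arpa zh.binary
```

The binary format loads much faster than the text ARPA file, which matters when the model is queried during decoding.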