Merge pull request #128 from lcy-seso/clean_rnn_lm_codes

refactor codes of the language model example.

c075ae21 · Cao Ying · GitHub · 08ab956f · 7d3a8cdf · c075ae21
23 changed file
--- a/generate_sequence_by_rnn_lm/.gitignore
+++ b/generate_sequence_by_rnn_lm/.gitignore
+*.pyc
+*.tar.gz
+models
--- a/generate_sequence_by_rnn_lm/README.md
+++ b/generate_sequence_by_rnn_lm/README.md
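+# Generating Text with a Recurrent Neural Network Language Model
+
+A language model is a probability distribution over word sequences: put simply, it computes the probability of a sentence. It can be used to decide which of several word sequences is more likely, or to predict the most likely next word given the preceding words. Language models are a fundamental building block in natural language processing.
+
+## Applications
+**Language models are used in many areas**, for example:
+
+* **Automatic writing**: a language model generates the next word from the preceding text; applied recursively, it can produce whole sentences, paragraphs, and documents.
+* **QA**: a language model can generate an answer from a question.
+* **Machine translation**: most current mainstream machine translation models follow the Encoder-Decoder pattern, in which the decoder is a conditional language model that generates the target language.
+* **Spell checking**: a language model computes the probability of a word sequence; the probability usually drops sharply at a misspelling, which can be used to detect spelling errors and propose corrections.
+* **Part-of-speech tagging, syntactic parsing, speech recognition, ...**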
+
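+## About This Example
+This example implements an RNN-based language model and uses it to generate text. The directory layout is:
+
+```text
+.
+├── data
+│   └── train_data_examples.txt        # sample data; follow its format to provide your own data
+├── config.py    # configuration for data, training, and inference
+├── generate.py  # inference script, i.e. text generation
+├── beam_search.py    # beam search implementation
+├── network_conf.py   # all network structures used in this example; edit this file to change the model
+├── reader.py    # data reading interface
+├── README.md
+├── train.py    # training script
+└── utils.py    # common helpers, e.g. building and loading the dictionary
+```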
+
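+## RNN Language Model
+### Introduction
+
+An RNN is a sequence model. At time step $t$, the hidden-layer output from step $t-1$ and the word vector of step $t$ are fed into the hidden layer together to produce the feature representation for step $t$; this representation is then used to compute the prediction at step $t$, and the process recurs along the time axis. An RNN therefore makes good use of preceding context and history, giving it a form of "memory". In theory an RNN can capture long-range dependencies (that is, use information from far back in the sequence), but in practice the results are unsatisfactory. Variants such as LSTM and GRU were proposed to address this: they add gating mechanisms to the memory unit of the plain RNN, easing the difficulty plain RNNs have in learning long sequences. The model in this example uses LSTM or GRU, selectable through configuration. The figure below illustrates the "recurrent" idea behind an RNN language model (RNN here includes LSTM, GRU, and similar variants):
+
+<p align=center><img src='images/rnn.png' width='500px'/></p>
+
+### Model Implementation
+
+The RNN language model in this example is implemented as follows:
+
+- **Model parameters**: `config.py` defines the model's parameter variables.
+- **Model structure**: the `rnn_lm` **function** in `network_conf.py` defines the model **structure**:
+    - Input layer: maps the input word (or character) sequence to vectors, i.e. the word embedding layer `embedding`.
+    - Middle layer: the RNN layers built according to the configuration, taking the `embedding` vector sequence from the previous step as input.
+    - Output layer: uses `softmax` normalization to compute word probabilities.
+    - Loss: multi-class cross-entropy is the model's loss function.
+- **Training**: the `main` function in `train.py` trains the model in these steps:
+    - Prepare the input data: build and save the dictionary, and construct readers for the train and test data.
+    - Initialize the model: both its structure and its parameters.
+    - Build the trainer: the demo uses the Adam optimizer.
+    - Define the callback: an `event_handler` tracks the loss during training and saves the model parameters at the end of each pass.
+    - Train: use the trainer to train the model.
+
+- **Text generation**: `generate.py` generates text in these steps:
+    - Load the trained model and the dictionary file.
+    - Read the `gen_file` file, in which each line is a sentence prefix, and generate text from each prefix with [beam search](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.cn.md#柱搜索算法).
+    - Save the generated text together with its prefix to the file `gen_result`.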
+
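+## Usage
+
+To run this example:
+
+1. Run `python train.py` to start training the model (an LSTM by default) and wait for training to finish.
+2. Run `python generate.py` to generate text. (By default the input is read from `data/train_data_examples.txt` and the output is saved to `data/gen_result.txt`.)
+
+
+**To use your own corpus or customize the model, edit the settings in `config.py`. The details are described below.**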
+
+
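+### Adapting the Corpus
+
+* Clean the corpus: strip stray spaces, tabs, and garbled characters from the raw text; remove digits, punctuation, and special symbols as needed.
+* Format: one sentence per line, with the words on each line separated by a single space.
+* Set the following parameters in `config.py` as needed:
+
+    ```python
+    train_file = "data/train_data_examples.txt"
+    test_file = ""
+
+    vocab_file = "data/word_vocab.txt"
+    model_save_dir = "models"
+    ```
+    1. `train_file`: path of the training data, which **must be word-segmented in advance**.
+    2. `test_file`: path of the test data. If the test data is not empty, the model is evaluated on it at the end of each training `pass`.
+    3. `vocab_file`: path of the dictionary. If the file does not exist, word frequencies are counted over the training corpus and the dictionary is built from them.
+    4. `model_save_dir`: directory in which models are saved. If it does not exist, it is created automatically.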
+
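+### Dictionary Building Strategy
+- When the specified dictionary file does not exist, word frequencies are counted over the training data and the dictionary is built automatically. Two parameters in `config.py` control this:
+
+    ```python
+    max_word_num = 51200 - 2
+    cutoff_word_fre = 0
+    ```
+    1. `max_word_num`: the maximum number of words in the dictionary.
+    2. `cutoff_word_fre`: the minimum frequency a word must have in the training corpus to be included in the dictionary.
+- Suppose `max_word_num = 5000` and `cutoff_word_fre = 10`, and the frequency count finds only 3000 words that occur at least 10 times in the training corpus; the dictionary will then contain those 3000 words.
+- Two special tokens are added automatically when the dictionary is built:
+    1. `<unk>`: stands for any word not in the dictionary
+    2. `<e>`: end-of-sentence marker
+
+    *Note: a larger dictionary gives richer generated text but makes training slower. After word segmentation, a Chinese corpus typically contains tens or even hundreds of thousands of distinct words. If `max_word_num` is too small, the proportion of `<unk>` becomes too high; if it is large, training slows down severely (and accuracy is affected as well). An alternative is to train "by character", treating each Chinese character as a word: there are only a few thousand common characters, so the dictionary stays small and little information is lost. However, the same character can mean very different things in different words, which sometimes hurts model quality. It is worth experimenting and choosing between word-level and character-level training according to your situation.*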
+
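+### Adapting and Training the Model
+
+* Adjust the following settings in `config.py` as needed to change the network structure of the RNN language model:
+
+    ```python
+    rnn_type = "lstm"  # "gru" or "lstm"
+    emb_dim = 256
+    hidden_size = 256
+    stacked_rnn_num = 2
+    ```
+    1. `rnn_type`: which RNN unit to use; either "gru" or "lstm".
+    2. `emb_dim`: dimension of the word vectors.
+    3. `hidden_size`: hidden layer size of the RNN unit.
+    4. `stacked_rnn_num`: number of stacked RNN units, which makes the model deeper.
+
+* Run `python train.py` to train the model. The model is saved to the directory given by `model_save_dir`.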
+
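+### Generating Text
+
+* Adjust the following variables in `config.py` as needed:
+
+    ```python
+    gen_file = "data/train_data_examples.txt"
+    gen_result = "data/gen_result.txt"
+    max_gen_len = 25  # the max number of words to generate
+    beam_size = 5
+    model_path = "models/rnn_lm_pass_00000.tar.gz"
+    ```
+    1. `gen_file`: input file; each line is a sentence prefix and **must be word-segmented in advance**.
+    2. `gen_result`: output file path; the generated results are written to this file.
+    3. `max_gen_len`: maximum length of each generated sentence. If the model does not produce `<e>`, generation stops automatically after `max_gen_len` words.
+    4. `beam_size`: expansion width of beam search at each step.
+    5. `model_path`: path of the trained model.
+
+    `gen_file` holds the prefixes to generate from, one per line, for example:
+
+    ```text
+    若隐若现 地像 幽灵 , 像 死神
+    ```
+    Store the prefixes you want to generate from in a file in this format.
+
+* Run `python generate.py` to generate text for the input prefixes with beam search. Below is a sample of the model's output:
+
+    ```text
+    81    若隐若现 地像 幽灵 , 像 死神
+    -12.2542    一样 。 他 是 个 怪物 <e>
+    -12.6889    一样 。 他 是 个 英雄 <e>
+    -13.9877    一样 。 他 是 我 的 敌人 <e>
+    -14.2741    一样 。 他 是 我 的 <e>
+    -14.6250    一样 。 他 是 我 的 朋友 <e>
+    ```
+    Here:
+    1. The first line, `81    若隐若现 地像 幽灵 , 像 死神`, has two columns separated by `\t`:
+        - The first column is the index of the input prefix in the input file (by default the training sample set).
+        - The second column is the input prefix.
+    2. Lines 2 through `beam_size + 1` are the generated results, again two columns separated by `\t`:
+        - The first column is the log probability of the generated sequence.
+        - The second column is the generated text. A normal result ends with `<e>`; a result that does not end with `<e>` reached the maximum sequence length, so generation was forcibly terminated.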
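The output layout described at the end of the README can be read back with a few lines of Python. This is a minimal sketch, not part of the example: `parse_gen_result` is a hypothetical helper, and it assumes the columns are separated by a TAB, as `generate.py` writes them (the README sample renders the TAB as spaces).

```python
def parse_gen_result(text):
    """Parse the content of gen_result into (index, prefix, candidates).

    Each record is one "<index>\t<prefix>" line followed by the generated
    "<log_prob>\t<sequence>" lines; records are separated by a blank line.
    """
    records = []
    for block in text.strip("\n").split("\n\n"):
        lines = [l for l in block.split("\n") if l.strip()]
        if not lines:
            continue
        idx, prefix = lines[0].split("\t", 1)
        candidates = []
        for line in lines[1:]:
            log_prob, seq = line.split("\t", 1)
            words = seq.split()
            # A sequence that does not end with <e> hit max_gen_len.
            finished = bool(words) and words[-1] == "<e>"
            candidates.append((float(log_prob), words, finished))
        records.append((int(idx), prefix.split(), candidates))
    return records


sample = (
    "81\t若隐若现 地像 幽灵 , 像 死神\n"
    "-12.2542\t一样 。 他 是 个 怪物 <e>\n"
    "-14.2741\t一样 。 他 是 我 的 <e>\n"
    "\n"
)
records = parse_gen_result(sample)
```

Each candidate carries a `finished` flag so that truncated generations can be filtered out without re-inspecting the text.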
--- a/generate_sequence_by_rnn_lm/beam_search.py
+++ b/generate_sequence_by_rnn_lm/beam_search.py
+#!/usr/bin/env python
+# coding=utf-8
+import os
+import math
+import numpy as np
+
+import paddle.v2 as paddle
+
+from utils import logger, load_reverse_dict
+
+__all__ = ["BeamSearch"]
+
+
+class BeamSearch(object):
+    """
+    Generating sequence by beam search
+    NOTE: this class only implements generating one sentence at a time.
+    """
+
+    def __init__(self, inferer, word_dict_file, beam_size=1, max_gen_len=100):
+        """
+        constructor method.
+
+        :param inferer: object of paddle.Inference that represents the entire
+            network to forward compute the test batch
+        :type inferer: paddle.Inference
+        :param word_dict_file: path of word dictionary file
+        :type word_dict_file: str
+        :param beam_size: expansion width in each iteration
+        :type beam_size: int
+        :param max_gen_len: the maximum number of iterations
+        :type max_gen_len: int
+        """
+        self.inferer = inferer
+        self.beam_size = beam_size
+        self.max_gen_len = max_gen_len
+        self.ids_2_word = load_reverse_dict(word_dict_file)
+        logger.info("dictionary len = %d" % (len(self.ids_2_word)))
+
+        try:
+            self.eos_id = next(x[0] for x in self.ids_2_word.iteritems()
+                               if x[1] == "<e>")
+            self.unk_id = next(x[0] for x in self.ids_2_word.iteritems()
+                               if x[1] == "<unk>")
+        except StopIteration:
+            logger.fatal(("the word dictionary must contain both <e> and "
+                          "<unk> in the text generation task."))
+            # without these ids the search cannot run, so do not continue
+            raise
+
+        self.candidate_paths = []
+        self.final_paths = []
+
+    def _top_k(self, softmax_out, k):
+        """
+        get indices of the words with k highest probabilities.
+        NOTE: <unk> will be excluded if it is among the top k words, then word
+        with (k + 1)th highest probability will be returned.
+
+        :param softmax_out: probability over the dictionary
+        :type softmax_out: ndarray
+        :param k: number of word indices to return
+        :type k: int
+        :return: indices of k words with highest probabilities.
+        :rtype: ndarray
+        """
+        ids = softmax_out.argsort()[::-1]
+        return ids[ids != self.unk_id][:k]
+
+    def _forward_batch(self, batch):
+        """
+        forward a test batch.
+
+        :param batch: the input data batch
+        :type batch: list
+        :return: probabilities of the predicted word
+        :rtype: ndarray
+        """
+        return self.inferer.infer(input=batch, field=["value"])
+
+    def _beam_expand(self, next_word_prob):
+        """
+        In every iteration step, the model predicts the possible next words.
+        For each input sentence, each of the top k words is appended to the
+        end of the original sentence to form a new generated sentence.
+
+        :param next_word_prob: probabilities of the next words
+        :type next_word_prob: ndarray
+        :return: the expanded new sentences.
+        :rtype: list
+        """
+        assert len(next_word_prob) == len(self.candidate_paths), (
+            "Wrong forward computing results!")
+        top_beam_words = np.apply_along_axis(self._top_k, 1, next_word_prob,
+                                             self.beam_size)
+        new_paths = []
+        for i, words in enumerate(top_beam_words):
+            old_path = self.candidate_paths[i]
+            for w in words:
+                log_prob = old_path["log_prob"] + math.log(next_word_prob[i][w])
+                gen_ids = old_path["ids"] + [w]
+                if w == self.eos_id:
+                    self.final_paths.append({
+                        "log_prob": log_prob,
+                        "ids": gen_ids
+                    })
+                else:
+                    new_paths.append({"log_prob": log_prob, "ids": gen_ids})
+        return new_paths
+
+    def _beam_shrink(self, new_paths):
+        """
+        keep the top beam_size generated sequences with the highest
+        probabilities at the end of every generation iteration.
+
+        :param new_paths: all possible generated sentences
+        :type new_paths: list
+        :return: a state flag to indicate whether to stop beam search
+        :rtype: bool
+        """
+
+        if len(self.final_paths) >= self.beam_size:
+            max_candidate_log_prob = max(
+                new_paths, key=lambda x: x["log_prob"])["log_prob"]
+            min_complete_path_log_prob = min(
+                self.final_paths, key=lambda x: x["log_prob"])["log_prob"]
+            if min_complete_path_log_prob >= max_candidate_log_prob:
+                return True
+
+        new_paths.sort(key=lambda x: x["log_prob"], reverse=True)
+        self.candidate_paths = new_paths[:self.beam_size]
+        return False
+
+    def gen_a_sentence(self, input_sentence):
+        """
+        generating sequence for a given input
+
+        :param input_sentence: word ids of one input sentence
+        :type input_sentence: list
+        :return: the generated word sequences, each formatted as the log
+            probability and the generated words separated by a TAB
+        :rtype: list
+        """
+        self.candidate_paths = [{"log_prob": 0., "ids": input_sentence}]
+        input_len = len(input_sentence)
+
+        for i in range(self.max_gen_len):
+            next_word_prob = self._forward_batch(
+                [[x["ids"]] for x in self.candidate_paths])
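+            new_paths = self._beam_expand(next_word_prob)
+            if not new_paths:
+                # every expansion ended with <e> (possible when
+                # beam_size is 1), so there is nothing left to extend
+                self.candidate_paths = []
+                break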
+
+            min_candidate_log_prob = min(
+                new_paths, key=lambda x: x["log_prob"])["log_prob"]
+
+            path_to_remove = [
+                path for path in self.final_paths
+                if path["log_prob"] < min_candidate_log_prob
+            ]
+            for p in path_to_remove:
+                self.final_paths.remove(p)
+
+            if self._beam_shrink(new_paths):
+                self.candidate_paths = []
+                break
+
+        gen_ids = sorted(
+            self.final_paths + self.candidate_paths,
+            key=lambda x: x["log_prob"],
+            reverse=True)[:self.beam_size]
+        self.final_paths = []
+
+        def _to_str(x):
+            text = " ".join(self.ids_2_word[idx]
+                            for idx in x["ids"][input_len:])
+            return "%.4f\t%s" % (x["log_prob"], text)
+
+        return [_to_str(x) for x in gen_ids]
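The expand-and-shrink loop of `BeamSearch` can be tried without PaddlePaddle on a toy model. The sketch below is a simplified variant, not the class above: it omits the `<unk>` filtering and the early-stopping pruning of finished paths, and `toy_next_word_probs` is a hypothetical stand-in for the trained network.

```python
import math


def toy_next_word_probs(ids):
    """Hypothetical model: returns P(next word | ids) over 3 words.

    Vocabulary: 0 = <e>, 1 = "a", 2 = "b". The distribution depends only
    on the sequence length, so the search is easy to follow by hand.
    """
    if len(ids) >= 3:
        return [0.8, 0.1, 0.1]
    return [0.1, 0.6, 0.3]


def beam_search(prefix, next_word_probs, eos_id, beam_size, max_gen_len):
    candidates = [(0.0, list(prefix))]  # (log_prob, ids), unfinished paths
    finished = []
    for _ in range(max_gen_len):
        expanded = []
        for log_prob, ids in candidates:
            probs = next_word_probs(ids)
            # Expand: the beam_size most probable next words of this path.
            top = sorted(range(len(probs)), key=lambda w: probs[w],
                         reverse=True)[:beam_size]
            for w in top:
                path = (log_prob + math.log(probs[w]), ids + [w])
                if w == eos_id:
                    finished.append(path)
                else:
                    expanded.append(path)
        if not expanded:
            candidates = []
            break
        # Shrink: keep only the beam_size best unfinished paths.
        expanded.sort(key=lambda p: p[0], reverse=True)
        candidates = expanded[:beam_size]
    results = sorted(finished + candidates, key=lambda p: p[0], reverse=True)
    return results[:beam_size]


best = beam_search([1], toy_next_word_probs, eos_id=0,
                   beam_size=2, max_gen_len=5)
```

With this toy distribution the best path is `[1, 1, 1, 0]`, whose probability is 0.6 × 0.6 × 0.8 = 0.288.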
--- a/generate_sequence_by_rnn_lm/config.py
+++ b/generate_sequence_by_rnn_lm/config.py
+#!/usr/bin/env python
+# coding=utf-8
+import os
+
+################## for building word dictionary  ##################
+
+max_word_num = 51200 - 2
+cutoff_word_fre = 0
+
+################## for training task  #########################
+# path of training data
+train_file = "data/train_data_examples.txt"
+# path of testing data, if testing file does not exist,
+# testing will not be performed at the end of each training pass
+test_file = ""
+# path of word dictionary, if this file does not exist,
+# word dictionary will be built from training data.
+vocab_file = "data/word_vocab.txt"
+# directory to save the trained model
+# create a new directory if the directory does not exist
+model_save_dir = "models"
+
+batch_size = 32  # the number of training examples in one forward/backward pass
+num_passes = 20  # how many passes to train the model
+
+log_period = 50
+save_period_by_batches = 50
+
+use_gpu = True  # to use gpu or not
+trainer_count = 1  # number of trainer
+
+##################  for model configuration  ##################
+rnn_type = "lstm"  # "gru" or "lstm"
+emb_dim = 256
+hidden_size = 256
+stacked_rnn_num = 2
+
+##################  for text generation  ##################
+gen_file = "data/train_data_examples.txt"
+gen_result = "data/gen_result.txt"
+max_gen_len = 25  # the max number of words to generate
+beam_size = 5
+model_path = "models/rnn_lm_pass_00000.tar.gz"
+
+if not os.path.exists(model_save_dir):
+    os.mkdir(model_save_dir)
--- a/generate_sequence_by_rnn_lm/data/train_data_examples.txt
+++ b/generate_sequence_by_rnn_lm/data/train_data_examples.txt
+我们 不会 伤害 你 的 。 他们 也 这么 说 。
+你 拥有 你 父亲 皇室 的 血统 。 是 合法 的 继承人 。
+叫 什么 你 可以 告诉 我 。
+你 并 没有 留言 说 要 去 哪里 。 是 的 , 因为 我 必须 要 去 完成 这件 事 。
+你 查出 是 谁 住 在 隔壁 房间 吗 ?
--- a/generate_sequence_by_rnn_lm/generate.py
+++ b/generate_sequence_by_rnn_lm/generate.py
+# coding=utf-8
+import os
+import sys
+import gzip
+import numpy as np
+
+import paddle.v2 as paddle
+
+from utils import logger, load_dict
+from beam_search import BeamSearch
+import config as conf
+from network_conf import rnn_lm
+
+
+def rnn_generate(gen_input_file, model_path, max_gen_len, beam_size,
+                 word_dict_file):
+    """
+    use RNN model to generate sequences.
+
+    :param gen_input_file: path of the file holding the input prefixes,
+        one per line.
+    :type gen_input_file: str
+    :param model_path: path of the trained model.
+    :type model_path: str
+    :param max_gen_len: the maximum number of words to generate.
+    :type max_gen_len: int
+    :param beam_size: beam width.
+    :type beam_size: int
+    :param word_dict_file: path of the word dictionary file.
+    :type word_dict_file: str
+    :return: None; generation results are written to conf.gen_result
+    """
+
+    assert os.path.exists(gen_input_file), "test file does not exist!"
+    assert os.path.exists(model_path), "trained model does not exist!"
+    assert os.path.exists(
+        word_dict_file), "word dictionary file does not exist!"
+
+    # load word dictionary
+    word_2_ids = load_dict(word_dict_file)
+    try:
+        UNK_ID = word_2_ids["<unk>"]
+    except KeyError:
+        logger.fatal("the word dictionary must contain a <unk> token!")
+        sys.exit(-1)
+
+    # initialize paddle
+    paddle.init(use_gpu=conf.use_gpu, trainer_count=conf.trainer_count)
+
+    # load the trained model
+    pred_words = rnn_lm(
+        len(word_2_ids),
+        conf.emb_dim,
+        conf.hidden_size,
+        conf.stacked_rnn_num,
+        conf.rnn_type,
+        is_infer=True)
+
+    parameters = paddle.parameters.Parameters.from_tar(
+        gzip.open(model_path, "r"))
+
+    inferer = paddle.inference.Inference(
+        output_layer=pred_words, parameters=parameters)
+
+    generator = BeamSearch(inferer, word_dict_file, beam_size, max_gen_len)
+    # generate text
+    with open(gen_input_file, "r") as fin, open(conf.gen_result, "w") as fout:
+        for idx, line in enumerate(fin):
+            fout.write("%d\t%s" % (idx, line))
+            for gen_res in generator.gen_a_sentence([
+                    word_2_ids.get(w, UNK_ID)
+                    for w in line.lower().strip().split()
+            ]):
+                fout.write("%s\n" % gen_res)
+            fout.write("\n")
+
+
+if __name__ == "__main__":
+    rnn_generate(conf.gen_file, conf.model_path, conf.max_gen_len,
+                 conf.beam_size, conf.vocab_file)
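The score that `generate.py` writes in front of each generated sequence is the sum that `_beam_expand` accumulates one word at a time. Written out, with $c$ for the input prefix and $w_1, \dots, w_T$ for the generated words (symbols introduced here for illustration only):

$$
\log P(w_1, \dots, w_T \mid c) = \sum_{t=1}^{T} \log P(w_t \mid c, w_1, \dots, w_{t-1})
$$

For the first README sample, the score $-12.2542$ covers $T = 7$ generated tokens (including `<e>`), so the geometric mean of the per-token probabilities is $\exp(-12.2542 / 7) \approx 0.17$. Because every extra word adds a negative term, longer sequences tend to receive lower scores under this unnormalized sum.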
--- a/language_model/images/ngram.png
+++ b/language_model/images/ngram.png
--- a/language_model/images/rnn.png
+++ b/language_model/images/rnn.png
--- a/generate_sequence_by_rnn_lm/index.html
+++ b/generate_sequence_by_rnn_lm/index.html
+
+<html>
+<head>
+  <script type="text/x-mathjax-config">
+  MathJax.Hub.Config({
+    extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
+    jax: ["input/TeX", "output/HTML-CSS"],
+    tex2jax: {
+      inlineMath: [ ['$','$'] ],
+      displayMath: [ ['$$','$$'] ],
+      processEscapes: true
+    },
+    "HTML-CSS": { availableFonts: ["TeX"] }
+  });
+  </script>
+  <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js" async></script>
+  <script type="text/javascript" src="../.tools/theme/marked.js">
+  </script>
+  <link href="http://cdn.bootcss.com/highlight.js/9.9.0/styles/darcula.min.css" rel="stylesheet">
+  <script src="http://cdn.bootcss.com/highlight.js/9.9.0/highlight.min.js"></script>
+  <link href="http://cdn.bootcss.com/bootstrap/4.0.0-alpha.6/css/bootstrap.min.css" rel="stylesheet">
+  <link href="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/css/perfect-scrollbar.min.css" rel="stylesheet">
+  <link href="../.tools/theme/github-markdown.css" rel='stylesheet'>
+</head>
+<style type="text/css" >
+.markdown-body {
+    box-sizing: border-box;
+    min-width: 200px;
+    max-width: 980px;
+    margin: 0 auto;
+    padding: 45px;
+}
+</style>
+
+
+<body>
+
+<div id="context" class="container-fluid markdown-body">
+</div>
+
+<!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
+<div id="markdown" style='display:none'>
+
+</div>
+<!-- You can change the lines below now. -->
+
+<script type="text/javascript">
+marked.setOptions({
+  renderer: new marked.Renderer(),
+  gfm: true,
+  breaks: false,
+  smartypants: true,
+  highlight: function(code, lang) {
+    code = code.replace(/&amp;/g, "&")
+    code = code.replace(/&gt;/g, ">")
+    code = code.replace(/&lt;/g, "<")
+    code = code.replace(/&nbsp;/g, " ")
+    return hljs.highlightAuto(code, [lang]).value;
+  }
+});
+document.getElementById("context").innerHTML = marked(
+        document.getElementById("markdown").innerHTML)
+</script>
+</body>
--- a/generate_sequence_by_rnn_lm/network_conf.py
+++ b/generate_sequence_by_rnn_lm/network_conf.py
+# coding=utf-8
+
+import paddle.v2 as paddle
+
+
+def rnn_lm(vocab_dim,
+           emb_dim,
+           hidden_size,
+           stacked_rnn_num,
+           rnn_type="lstm",
+           is_infer=False):
+    """
+    RNN language model definition.
+
+    :param vocab_dim: size of vocabulary.
+    :type vocab_dim: int
+    :param emb_dim: dimension of the embedding vector
+    :type emb_dim: int
+    :param hidden_size: number of hidden units.
+    :type hidden_size: int
+    :param stacked_rnn_num: number of stacked rnn cells.
+    :type stacked_rnn_num: int
+    :param rnn_type: the type of RNN cell, "lstm" or "gru".
+    :type rnn_type: str
+    :param is_infer: whether to build the network for inference.
+    :type is_infer: bool
+    :return: (cost, output layer) for training; the prediction for the
+        last word of the sequence when is_infer is True.
+    :rtype: LayerOutput
+    """
+
+    # input layers
+    input = paddle.layer.data(
+        name="input", type=paddle.data_type.integer_value_sequence(vocab_dim))
+    if not is_infer:
+        target = paddle.layer.data(
+            name="target",
+            type=paddle.data_type.integer_value_sequence(vocab_dim))
+
+    # embedding layer
+    input_emb = paddle.layer.embedding(input=input, size=emb_dim)
+
+    # rnn layer
+    if rnn_type == "lstm":
+        for i in range(stacked_rnn_num):
+            rnn_cell = paddle.networks.simple_lstm(
+                input=rnn_cell if i else input_emb, size=hidden_size)
+    elif rnn_type == "gru":
+        for i in range(stacked_rnn_num):
+            rnn_cell = paddle.networks.simple_gru(
+                input=rnn_cell if i else input_emb, size=hidden_size)
+    else:
+        raise ValueError('rnn_type must be "lstm" or "gru".')
+
+    # fc(full connected) and output layer
+    output = paddle.layer.fc(
+        input=[rnn_cell], size=vocab_dim, act=paddle.activation.Softmax())
+
+    if is_infer:
+        last_word = paddle.layer.last_seq(input=output)
+        return last_word
+    else:
+        cost = paddle.layer.classification_cost(input=output, label=target)
+
+        return cost, output
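The output layer and the training cost of `rnn_lm` can be illustrated for a single position in plain Python. This is a sketch of the underlying arithmetic only, independent of PaddlePaddle; the scores and the 3-word vocabulary are made up for illustration.

```python
import math


def softmax(scores):
    """Normalise raw scores into a probability distribution."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_id):
    """Multi-class cross-entropy for one position: -log P(target)."""
    return -math.log(probs[target_id])


scores = [2.0, 1.0, 0.1]  # made-up scores over a 3-word vocabulary
probs = softmax(scores)
loss = cross_entropy(probs, target_id=0)
```

The loss shrinks toward zero as the probability assigned to the target word approaches one, which is what training pushes the network toward at every position.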
--- a/generate_sequence_by_rnn_lm/reader.py
+++ b/generate_sequence_by_rnn_lm/reader.py
+# coding=utf-8
+import collections
+import os
+
+MIN_LEN = 3
+MAX_LEN = 100
+
+
+def rnn_reader(file_name, word_dict):
+    """
+    create reader for RNN, each line is a sample. Lines with fewer than
+    MIN_LEN or more than MAX_LEN words are skipped.
+
+    :param file_name: file name.
+    :type file_name: str
+    :param word_dict: vocab with content of '{word, id}',
+                      'word' is string type , 'id' is int type.
+    :type word_dict: dict
+    :return: data reader.
+    """
+
+    def reader():
+        UNK_ID = word_dict['<unk>']
+        with open(file_name) as file:
+            for line in file:
+                words = line.strip().lower().split()
+                if len(words) < MIN_LEN or len(words) > MAX_LEN:
+                    continue
+                ids = [word_dict.get(w, UNK_ID)
+                       for w in words] + [word_dict['<e>']]
+                yield ids[:-1], ids[1:]
+
+    return reader
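The reader yields each sentence as an input sequence and a target sequence shifted by one word, so that the model learns to predict word $t+1$ from the words up to $t$. A minimal sketch of that mapping, assuming a toy vocabulary (`make_sample` and the ids below are hypothetical; real ids come from `load_dict`):

```python
def make_sample(words, word_dict):
    """Map a tokenised sentence to (input_ids, target_ids), shifted by one."""
    unk_id = word_dict["<unk>"]
    # Unknown words fall back to <unk>; <e> marks the end of the sentence.
    ids = [word_dict.get(w, unk_id) for w in words] + [word_dict["<e>"]]
    return ids[:-1], ids[1:]


# Hypothetical toy vocabulary for illustration.
word_dict = {"<unk>": 0, "<e>": 1, "you": 2, "can": 3, "tell": 4, "me": 5}
inputs, targets = make_sample(["you", "can", "tell", "her"], word_dict)
```

Here "her" is not in the vocabulary, so it is mapped to the id of `<unk>`, and the final target is the id of `<e>`.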
--- a/generate_sequence_by_rnn_lm/train.py
+++ b/generate_sequence_by_rnn_lm/train.py
+#!/usr/bin/env python
+# coding=utf-8
+import os
+import sys
+import gzip
+
+import paddle.v2 as paddle
+import config as conf
+import reader
+from network_conf import rnn_lm
+from utils import logger, build_dict, load_dict
+
+
+def train(topology,
+          train_reader,
+          test_reader,
+          model_save_dir="models",
+          num_passes=10):
+    """
+    train model.
+
+    :param topology: cost layer of the model to train.
+    :type topology: LayerOutput
+    :param train_reader: train data reader.
+    :type train_reader: collections.Iterable
+    :param test_reader: test data reader.
+    :type test_reader: collections.Iterable
+    :param model_save_dir: path to save the trained model
+    :type model_save_dir: str
+    :param num_passes: number of epoch
+    :type num_passes: int
+    """
+    if not os.path.exists(model_save_dir):
+        os.mkdir(model_save_dir)
+
+    # initialize PaddlePaddle
+    paddle.init(use_gpu=conf.use_gpu, trainer_count=conf.trainer_count)
+
+    # create optimizer
+    adam_optimizer = paddle.optimizer.Adam(
+        learning_rate=1e-3,
+        regularization=paddle.optimizer.L2Regularization(rate=1e-3),
+        model_average=paddle.optimizer.ModelAverage(
+            average_window=0.5, max_average_window=10000))
+
+    # create parameters
+    parameters = paddle.parameters.create(topology)
+    # create trainer
+    trainer = paddle.trainer.SGD(
+        cost=topology, parameters=parameters, update_equation=adam_optimizer)
+
+    # define the event_handler callback
+    def event_handler(event):
+        if isinstance(event, paddle.event.EndIteration):
+            if not event.batch_id % conf.log_period:
+                logger.info("Pass %d, Batch %d, Cost %f, %s" % (
+                    event.pass_id, event.batch_id, event.cost, event.metrics))
+
+            if (not event.batch_id %
+                    conf.save_period_by_batches) and event.batch_id:
+                save_name = os.path.join(model_save_dir,
+                                         "rnn_lm_pass_%05d_batch_%03d.tar.gz" %
+                                         (event.pass_id, event.batch_id))
+                with gzip.open(save_name, "w") as f:
+                    parameters.to_tar(f)
+
+        if isinstance(event, paddle.event.EndPass):
+            if test_reader is not None:
+                result = trainer.test(reader=test_reader)
+                logger.info("Test with Pass %d, %s" %
+                            (event.pass_id, result.metrics))
+            save_name = os.path.join(model_save_dir, "rnn_lm_pass_%05d.tar.gz" %
+                                     (event.pass_id))
+            with gzip.open(save_name, "w") as f:
+                parameters.to_tar(f)
+
+    logger.info("start training...")
+    trainer.train(
+        reader=train_reader, event_handler=event_handler, num_passes=num_passes)
+
+    logger.info("Training is finished.")
+
+
+def main():
+    # prepare vocab
+    if not (os.path.exists(conf.vocab_file) and
+            os.path.getsize(conf.vocab_file)):
+        logger.info(("word dictionary does not exist, "
+                     "build it from the training data"))
+        build_dict(conf.train_file, conf.vocab_file, conf.max_word_num,
+                   conf.cutoff_word_fre)
+    logger.info("load word dictionary.")
+    word_dict = load_dict(conf.vocab_file)
+    logger.info("dictionary size = %d" % (len(word_dict)))
+
+    cost = rnn_lm(
+        len(word_dict), conf.emb_dim, conf.hidden_size, conf.stacked_rnn_num,
+        conf.rnn_type)
+
+    # define reader
+    reader_args = {
+        "file_name": conf.train_file,
+        "word_dict": word_dict,
+    }
+    train_reader = paddle.batch(
+        paddle.reader.shuffle(
+            reader.rnn_reader(**reader_args), buf_size=102400),
+        batch_size=conf.batch_size)
+    test_reader = None
+    if os.path.exists(conf.test_file) and os.path.getsize(conf.test_file):
+        # read the test file (not the training file) for evaluation
+        test_reader = paddle.batch(
+            paddle.reader.shuffle(
+                reader.rnn_reader(conf.test_file, word_dict), buf_size=65536),
+            batch_size=conf.batch_size)
+
+    train(
+        topology=cost,
+        train_reader=train_reader,
+        test_reader=test_reader,
+        model_save_dir=conf.model_save_dir,
+        num_passes=conf.num_passes)
+
+
+if __name__ == "__main__":
+    main()
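The event handler in `train.py` saves a checkpoint every `save_period_by_batches` batches (skipping batch 0) and once more at the end of each pass. The schedule and the file names can be checked in isolation; `checkpoint_name` and `should_save` below are hypothetical helpers that mirror the expressions in the handler rather than functions from the example.

```python
def checkpoint_name(pass_id, batch_id=None):
    """File name patterns used for end-of-pass and mid-pass checkpoints."""
    if batch_id is None:
        return "rnn_lm_pass_%05d.tar.gz" % pass_id
    return "rnn_lm_pass_%05d_batch_%03d.tar.gz" % (pass_id, batch_id)


def should_save(batch_id, save_period):
    """True on every save_period-th batch, except batch 0."""
    return batch_id != 0 and batch_id % save_period == 0


saved = [b for b in range(151) if should_save(b, 50)]
first_pass_model = checkpoint_name(0)
mid_pass_model = checkpoint_name(0, 50)
```

The end-of-pass name for pass 0 is `rnn_lm_pass_00000.tar.gz`, which is the file the default `model_path` in `config.py` points to.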
--- a/generate_sequence_by_rnn_lm/utils.py
+++ b/generate_sequence_by_rnn_lm/utils.py
+#!/usr/bin/env python
+# coding=utf-8
+import os
+import logging
+from collections import defaultdict
+
+__all__ = ["logger", "build_dict", "load_dict", "load_reverse_dict"]
+
+logger = logging.getLogger("paddle")
+logger.setLevel(logging.DEBUG)
+
+
+def build_dict(data_file,
+               save_path,
+               max_word_num,
+               cutoff_word_fre=5,
+               insert_extra_words=["<unk>", "<e>"]):
+    """
+    Build the word dictionary from a whitespace-tokenized text file.
+
+    :param data_file: path of data file
+    :type data_file: str
+    :param save_path: path to save the word dictionary
+    :type save_path: str
+    :param max_word_num: at most max_word_num of the most frequent words
+        will be added into the word dictionary (extra words are not counted)
+    :type max_word_num: int
+    :param cutoff_word_fre: words whose frequencies are less than
+        cutoff_word_fre will not be added into the word dictionary.
+        NOTE: both limits are applied; the stricter one decides the size
+    :type cutoff_word_fre: int
+    :param insert_extra_words: extra words defined by users that are written
+        at the top of the word dictionary, usually <unk> and the start and
+        ending marks
+    :type insert_extra_words: list
+    """
+    word_count = defaultdict(int)
+    with open(data_file, "r") as f:
+        for idx, line in enumerate(f):
+            if not (idx + 1) % 100000:
+                logger.debug("processing %d lines ... " % (idx + 1))
+            words = line.strip().lower().split()
+            for w in words:
+                word_count[w] += 1
+
+    sorted_words = sorted(
+        word_count.iteritems(), key=lambda x: x[1], reverse=True)
+
+    # index of the first word rarer than the cutoff; keep every word if none
+    stop_pos = next((idx for idx, v in enumerate(sorted_words)
+                     if v[1] < cutoff_word_fre), len(sorted_words))
+
+    stop_pos = min(max_word_num, stop_pos)
+    with open(save_path, "w") as fdict:
+        for w in insert_extra_words:
+            fdict.write("%s\t-1\n" % (w))
+        for idx, info in enumerate(sorted_words):
+            if idx == stop_pos: break
+            fdict.write("%s\t%d\n" % (info[0], info[-1]))
+
+
+def load_dict(dict_path):
+    """
+    load word dictionary from the given file. Each line of the given file is
+    a word in the word dictionary. The first column of the line, separated by
+    TAB, is the key, while the line index is the value.
+
+    :param dict_path: path of word dictionary
+    :type dict_path: str
+    :return: the dictionary
+    :rtype: dict
+    """
+    return dict((line.strip().split("\t")[0], idx)
+                for idx, line in enumerate(open(dict_path, "r").readlines()))
+
+
+def load_reverse_dict(dict_path):
+    """
+    load word dictionary from the given file. Each line of the given file is
+    a word in the word dictionary. The line index is the key, while the first
+    column of the line, separated by TAB, is the value.
+
+    :param dict_path: path of word dictionary
+    :type dict_path: str
+    :return: the dictionary
+    :rtype: dict
+    """
+    return dict((idx, line.strip().split("\t")[0])
+                for idx, line in enumerate(open(dict_path, "r").readlines()))
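
The dictionary logic in `utils.py` above can be summarized as: count words, rank them by frequency, drop words below the cutoff, cap the list at `max_word_num`, and put the extra marks first so that the line index becomes the word id. The block below is a minimal Python 3 sketch of that logic on in-memory lines instead of files. The names `build_vocab`, `to_index`, and `to_reverse_index` are hypothetical and do not appear in the PR, and the alphabetical tie-break is an assumption added to make the order deterministic.

```python
from collections import Counter


def build_vocab(lines, max_word_num, cutoff_word_fre=5,
                extra_words=("<unk>", "<e>")):
    """Return the vocabulary as an ordered list of words."""
    counts = Counter(
        w for line in lines for w in line.strip().lower().split())
    # most frequent first; ties broken alphabetically (an assumption)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # drop rare words, then cap the size; extra words are not counted
    kept = [w for w, c in ranked if c >= cutoff_word_fre][:max_word_num]
    return list(extra_words) + kept


def to_index(vocab):
    """Map word -> position, mirroring what load_dict derives from line order."""
    return {w: i for i, w in enumerate(vocab)}


def to_reverse_index(vocab):
    """Map position -> word, mirroring load_reverse_dict."""
    return dict(enumerate(vocab))
```

A word that is missing from the index would be looked up as `"<unk>"` by the caller, for example `index.get(word, index["<unk>"])`.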
--- a/language_model/README.md
+++ b/language_model/README.md
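-# Language Model
-
-## Introduction
-A language model (LM) is a probability distribution over word sequences; put simply, it computes the probability of a sentence. It can be used to decide which of several word sequences is more likely, or to predict the most likely next word given the preceding words. Language models are an important building block in natural language processing.
-
-## Applications
-**Language models are used in many areas**, for example:
-
-* **Automatic writing**: a language model generates the next word from the preceding text; applied recursively, it can produce whole sentences, paragraphs, and documents.
-* **QA**: a language model can generate an Answer from a Question.
-* **Machine translation**: most current mainstream machine translation models follow the Encoder-Decoder pattern, in which the Decoder is a language model that generates the target language.
-* **Spell checking**: a language model computes the probability of a word sequence; the probability usually drops sharply at a misspelling, which can be used to detect spelling errors and propose correction candidates.
-* **Part-of-speech tagging, syntactic parsing, speech recognition, ...**
-
-## About this example
-Common ways to implement a language model include N-Gram, RNN, and seq2seq. This example implements N-Gram and RNN language models. **The file layout of this example is as follows** (the `images` folder is not needed to run the example and can be ignored):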
-
-
-```text
-.
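-├── data                    # toy/demo data; format your own data the same way
-│   ├── chinese.test.txt    # demo data for test
-│   ├── chinese.train.txt   # demo data for train
-│   └── input.txt           # demo input data for infer
-├── config.py               # configuration file: data, train, and infer settings
-├── infer.py                # prediction script, i.e. text generation
-├── network_conf.py         # all network structures used in this example are defined here; edit this file to change the model structure
-├── reader.py               # data reading interface
-├── README.md               # documentation
-├── train.py                # training script
-└── utils.py                # common helper functions, e.g. building and loading the dictionary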
-```
-
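-**Note: N-Gram language models generally perform worse than RNN language models, so the RNN language model is recommended in practice. This example also focuses on the RNN model and only briefly covers the N-Gram model.**
-
-## RNN language model
-### Introduction
-
-An RNN is a sequence model. The basic idea: at time step t, the hidden-layer output of the previous step t-1 and the word vector of step t are fed into the hidden layer together to obtain the feature representation at step t; this representation is then used to produce the prediction at step t, and the process recurs along the time dimension. RNNs are therefore good at using preceding context and history, and have a "memory". In theory an RNN can capture "long-term dependencies" (i.e. use knowledge from far back), but in practice the results are not ideal, so many RNN variants have appeared, such as the widely used LSTM and GRU, which improve the cell of the traditional RNN and make up for its shortcomings; this example uses LSTM and GRU. The figure below illustrates the "recurrent" idea of an RNN language model (RNN in the broad sense, including LSTM, GRU, etc.):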
-
-<p align=center><img src='images/rnn.png' width='500px'/></p>
-
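-### Model implementation
-
-The RNN language model in this example is implemented as follows:
-
-* **Model parameters**: the `Config_rnn` **class** in `config.py` defines the model's parameter variables.
-* **Model structure**: the `rnn_lm` **function** in `network_conf.py` defines the model **structure**, as follows:
-    * Input layer: maps the input word (or character) sequence to vectors, i.e. embedding.
-    * Middle layer: builds the RNN layers according to the configuration, taking the embedding vector sequence from the previous step as input.
-    * Output layer: uses softmax normalization to compute word probabilities and returns the output.
-    * loss: the model cost is the multi-class cross-entropy loss.
-* **Training**: the `main` method in `train.py` trains the model, with the following steps:
-    * Prepare input data: build and save the dictionary, and construct readers for the train and test data.
-    * Initialize the model: its structure and parameters.
-    * Build the trainer: the demo uses the Adam optimization algorithm.
-    * Define the callback: build `event_handler` to track the loss during training and save the model parameters at the end of each pass.
-    * Train: train the model with the trainer.
-
-* **Text generation**: the `main` method in `infer.py` generates text, with the following steps:
-    * Choose the generation method according to the configuration: RNN model or N-Gram model.
-    * Load the trained model and the dictionary file.
-    * Read `input_file` (each line is a sentence prefix) and use the heuristic graph search algorithm `beam_search` to generate text from each prefix.
-    * Save the generated text together with its prefix to `output_file`.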
-
-
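-## N-Gram language model
-
-### Introduction
-An N-Gram model is also called a Markov model of order N-1. It makes a limited-history assumption: the probability of the current word depends only on the preceding N-1 words. Its parameters are usually estimated by Maximum Likelihood Estimation (MLE). For N equal to 1, 2, and 3, the N-Gram model is called a unigram, bigram, and trigram language model respectively. In general, the larger N and the larger the training corpus, the more reliable the parameter estimates; however, because the model is simple, has limited expressive power, and suffers from data sparsity, N-Gram language models usually perform worse than RNN or seq2seq models. The figure below shows the structure of a neural-network-based N-Gram language model: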
-
-<p align=center><img src='images/ngram.png' width='500px'/></p>
-
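-### Model implementation
-
-The N-Gram language model in this example is implemented as follows:
-
-* **Model parameters**: the `Config_ngram` **class** in `config.py` defines the model's parameter variables.
-* **Model structure**: the `ngram_lm` **function** in `network_conf.py` defines the model **structure**, as follows:
-    * Input layer: N is 5 in this example; the four preceding words are embedded separately and then concatenated as the input.
-    * Middle layer: builds the DNN layers according to the configuration, taking the embedding vectors from the previous step as input.
-    * Output layer: uses softmax normalization to compute word probabilities and returns the output.
-    * loss: the model cost is the multi-class cross-entropy loss.
-* **Training**: the `main` method in `train.py` trains the model; the procedure is essentially the same as for the RNN language model above.
-* **Text generation**: the `main` method in `infer.py` generates text; the procedure is essentially the same as for the RNN language model above, except that when building the input this example takes the last 4 (N-1) words of each prefix.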
-
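-## Usage
-
-To run this example:
-
-* 1. Run `python train.py` to start training the model (RNN by default) and wait for training to finish.
-* 2. Run `python infer.py` to do prediction. (The input text defaults to `data/input.txt`, and the generated text is saved to `data/output.txt` by default.)
-
-
-**To use your own corpus or customize the model, the main things to change are the corpus and the settings in `config.py`; the details and required adaptation work are as follows:**
-
-
-### Adapting the corpus
-
-* Clean the corpus: remove spaces, tabs, and garbled characters from the original text; remove digits, punctuation, and special symbols as needed.
-* Encoding: utf-8; this example already handles Chinese.
-* Content format: one sentence per line; words within a line are separated by a single space.
-* Adjust the data settings in `config.py` as needed: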
-
-    ```python
-    # -- config : data --
-
-    train_file = 'data/chinese.train.txt'
-    test_file = 'data/chinese.test.txt'
-    vocab_file = 'data/vocab_cn.txt'  # the file to save vocab
-
-    build_vocab_method = 'fixed_size'  # 'frequency' or 'fixed_size'
-    vocab_max_size = 3000  # when build_vocab_method = 'fixed_size'
-    unk_threshold = 1  # # when build_vocab_method = 'frequency'
-
-    min_sentence_length = 3
-    max_sentence_length = 60
-    ```
-
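-    Here `build_vocab_method` selects how the dictionary is built: **1, by frequency**, where words that occur fewer than `unk_threshold` times are treated as `<UNK>`; **2, by dictionary size**, where `vocab_max_size` sets the maximum dictionary size, and if the corpus contains more distinct words than this, words are sorted by descending frequency and the `top(vocab_max_size)` words are kept.
-
-    `min_sentence_length` and `max_sentence_length` set the minimum and maximum sentence length; sentences shorter than the minimum or longer than the maximum are filtered out and not used for training.
-
-    *Note: a larger dictionary gives richer generated text but makes training slower. After Chinese word segmentation, a corpus typically contains tens or even hundreds of thousands of distinct words. If vocab\_max\_size is too small, the share of \<UNK\> becomes too high; if it is large, training speed suffers severely (accuracy is affected as well). For this reason, models are sometimes trained "by character", treating each Chinese character as a word: there are only a few thousand common characters, so the dictionary stays small without losing much information. However, the same character can mean very different things in different words, which sometimes hurts model quality. We suggest trying both and choosing word-level or character-level training based on your situation.*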
-
-### Adapting and training the model
-
-* Adjust the model settings in `config.py` as needed; details below:
-
-    ```python
-    # -- config : train --
-
-    use_which_model = 'rnn'  # must be: 'rnn' or 'ngram'
-    use_gpu = False  # whether to use gpu
-    trainer_count = 1  # number of trainer
-
-
-    class Config_rnn(object):
-        """
-        config for RNN language model
-        """
-        rnn_type = 'gru'  # or 'lstm'
-        emb_dim = 200
-        hidden_size = 200
-        num_layer = 2
-        num_passs = 2
-        batch_size = 32
-        model_file_name_prefix = 'lm_' + rnn_type + '_params_pass_'
-
-
-    class Config_ngram(object):
-        """
-        config for N-Gram language model
-        """
-        emb_dim = 200
-        hidden_size = 200
-        num_layer = 2
-        N = 5
-        num_passs = 2
-        batch_size = 32
-        model_file_name_prefix = 'lm_ngram_pass_'
-    ```
-
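-    Here `use_which_model` selects the model to train: set it to 'rnn' for the RNN language model or 'ngram' for the N-Gram language model; `use_gpu` sets whether to train on GPU; `trainer_count` sets the degree of parallelism, i.e. how many trainers train the model; `rnn_type` sets the RNN cell type and can be 'lstm' or 'gru'; `hidden_size` sets the number of units; `num_layer` sets the number of RNN layers; `num_passs` sets the number of training passes; `emb_dim` sets the embedding dimension; `batch_size` sets the size of each batch during training; `model_file_name_prefix` sets the name prefix of the saved model files.
-
-* Run `python train.py` to train the model; the model is saved to the current directory.
-
-### Generating text
-
-* Adjust the infer settings in `config.py` as needed; details below: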
-
-    ```python
-    # -- config : infer --
-
-    input_file = 'data/input.txt'  # input file contains sentence prefix each line
-    output_file = 'data/output.txt'  # the file to save results
-    num_words = 10  # the max number of words need to generate
-    beam_size = 5  # beam_width, the number of the prediction sentence for each prefix
-    ```
-
-    Here `input_file` holds the text prefixes to complete, utf-8 encoded, one prefix per line, for example:
-
-    ```text
-    我
-    我 是
-    ```
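-    Store the prefixes you want completed in a file in this format;
-    `num_words` sets how many words to generate (generation stops when the end marker is produced, so the actual number of generated words may be smaller); `beam_size` sets the width of the beam search, i.e. how many candidate word sequences are generated per prefix; `output_file` sets where the results are saved.
-
-* Run `python infer.py` to generate text; the results have the following format: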
-
-    ```text
-    我
-        我 <EOS>    0.107702672482
-        我 爱 。我 中国 中国 <EOS>    0.000177299271939
-        我 爱 中国 。我 是 中国 <EOS>    4.51695544709e-05
-        我 爱 中国 中国 <EOS>    0.000910127729821
-        我 爱 中国 。我 是 <EOS>    0.00015957862922
-    ```
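-    Here '我' is the prefix, the five sentences below it are the completions, and the floating-point number at the end of each sentence is the probability of generating that sentence.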
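
The README above describes generation as a beam search: each prefix is extended word by word, only the `beam_size` most probable candidates are kept after each step, and a candidate stops growing once it ends with the end mark. The block below is a minimal pure-Python sketch of that idea. It is neither the removed `infer.py` nor the new `beam_search.py`; the function `beam_search` and its `next_word_probs` callback (a function from a word list to a `{word: probability}` dict) are assumptions made for illustration.

```python
def beam_search(prefix, next_word_probs, beam_size, max_words, eos="<EOS>"):
    """Return up to beam_size (words, probability) pairs, best first.

    next_word_probs is a hypothetical callback that takes the words so far
    and returns a dict mapping each candidate next word to its probability.
    """
    beams = [(list(prefix), 1.0)]
    for _ in range(max_words):
        candidates = []
        for words, prob in beams:
            if words[-1:] == [eos]:
                # finished sentences are carried over unchanged
                candidates.append((words, prob))
                continue
            for word, p in next_word_probs(words).items():
                candidates.append((words + [word], prob * p))
        # keep only the beam_size most probable candidates
        beams = sorted(
            candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(words[-1:] == [eos] for words, _ in beams):
            break
    return beams
```

Multiplying raw probabilities, as here and in the sample output above, underflows for long sentences; summing log-probabilities is the usual alternative.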
--- a/language_model/config.py
+++ b/language_model/config.py
-# coding=utf-8
-
-# -- config : data --
-
-train_file = 'data/chinese.train.txt'
-test_file = 'data/chinese.test.txt'
-vocab_file = 'data/vocab_cn.txt'  # the file to save vocab
-
-build_vocab_method = 'fixed_size'  # 'frequency' or 'fixed_size'
-vocab_max_size = 3000  # when build_vocab_method = 'fixed_size'
-unk_threshold = 1  # # when build_vocab_method = 'frequency'
-
-min_sentence_length = 3
-max_sentence_length = 60
-
-# -- config : train --
-
-use_which_model = 'ngram'  # must be: 'rnn' or 'ngram'
-use_gpu = False  # whether to use gpu
-trainer_count = 1  # number of trainer
-
-
-class Config_rnn(object):
-    """
-    config for RNN language model
-    """
-    rnn_type = 'gru'  # or 'lstm'
-    emb_dim = 200
-    hidden_size = 200
-    num_layer = 2
-    num_passs = 2
-    batch_size = 32
-    model_file_name_prefix = 'lm_' + rnn_type + '_params_pass_'
-
-
-class Config_ngram(object):
-    """
-    config for N-Gram language model
-    """
-    emb_dim = 200
-    hidden_size = 200
-    num_layer = 2
-    N = 5
-    num_passs = 2
-    batch_size = 32
-    model_file_name_prefix = 'lm_ngram_pass_'
-
-
-# -- config : infer --
-
-input_file = 'data/input.txt'  # input file contains sentence prefix each line
-output_file = 'data/output.txt'  # the file to save results
-num_words = 10  # the max number of words need to generate
-beam_size = 5  # beam_width, the number of the prediction sentence for each prefix
--- a/language_model/data/chinese.test.txt
+++ b/language_model/data/chinese.test.txt
-我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。
\ No newline at end of file
--- a/language_model/data/chinese.train.txt
+++ b/language_model/data/chinese.train.txt
-我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。我 是 中国 人 。
-我 爱 中国 。
\ No newline at end of file
--- a/language_model/data/input.txt
+++ b/language_model/data/input.txt
-我
-我 是
-我 是 中国
-我 爱
-我 是 中国 人。
-我 爱 中国
-我 爱 中国 。我
-我 爱 中国 。我 爱
-我 爱 中国 。我 是
-我 爱 中国 。我 是 中国
\ No newline at end of file
--- a/language_model/infer.py
+++ b/language_model/infer.py
-# coding=utf-8
-import paddle.v2 as paddle
-import gzip
-import numpy as np
-from utils import *
-import network_conf
-from config import *
-
-
-def generate_using_rnn(word_id_dict, num_words, beam_size):
-    """
-    Demo: use RNN model to do prediction.
-
-    :param word_id_dict: vocab.
-    :type word_id_dict: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
-    :param num_words: the number of the words to generate.
-    :type num_words: int
-    :param beam_size: beam width.
-    :type beam_size: int
-    :return: save prediction results to output_file
-    """
-
-    # prepare and cache model
-    config = Config_rnn()
-    _, output_layer = network_conf.rnn_lm(
-        vocab_size=len(word_id_dict),
-        emb_dim=config.emb_dim,
-        rnn_type=config.rnn_type,
-        hidden_size=config.hidden_size,
-        num_layer=config.num_layer)  # network config
-    model_file_name = config.model_file_name_prefix + str(config.num_passs -
-                                                          1) + '.tar.gz'
-    parameters = paddle.parameters.Parameters.from_tar(
-        gzip.open(model_file_name))  # load parameters
-    inferer = paddle.inference.Inference(
-        output_layer=output_layer, parameters=parameters)
-
-    # tools, different from generate_using_ngram's tools
-    id_word_dict = dict(
-        [(v, k) for k, v in word_id_dict.items()])  # {id : word}
-
-    def str2ids(str):
-        return [[[
-            word_id_dict.get(w, word_id_dict['<UNK>']) for w in str.split()
-        ]]]
-
-    def ids2str(ids):
-        return [[[id_word_dict.get(id, ' ') for id in ids]]]
-
-    # generate text
-    with open(input_file) as file:
-        output_f = open(output_file, 'w')
-        for line in file:
-            line = line.decode('utf-8').strip()
-            # generate
-            texts = {}  # type: {text : probability}
-            texts[line] = 1
-            for _ in range(num_words):
-                texts_new = {}
-                for (text, prob) in texts.items():
-                    if '<EOS>' in text:  # stop prediction when <EOS> appear
-                        texts_new[text] = prob
-                        continue
-                    # next word's probability distribution
-                    predictions = inferer.infer(input=str2ids(text))
-                    predictions[-1][word_id_dict['<UNK>']] = -1  # filter <UNK>
-                    # find next beam_size words
-                    for _ in range(beam_size):
-                        cur_maxProb_index = np.argmax(
-                            predictions[-1])  # next word's id
-                        text_new = text + ' ' + id_word_dict[
-                            cur_maxProb_index]  # text append next word
-                        texts_new[text_new] = texts[text] * predictions[-1][
-                            cur_maxProb_index]
-                        predictions[-1][cur_maxProb_index] = -1
-                texts.clear()
-                if len(texts_new) <= beam_size:
-                    texts = texts_new
-                else:  # cutting
-                    texts = dict(
-                        sorted(
-                            texts_new.items(), key=lambda d: d[1], reverse=True)
-                        [:beam_size])
-
-            # save results to output file
-            output_f.write(line.encode('utf-8') + '\n')
-            for (sentence, prob) in texts.items():
-                output_f.write('\t' + sentence.encode('utf-8', 'replace') + '\t'
-                               + str(prob) + '\n')
-            output_f.write('\n')
-
-        output_f.close()
-    print('already saved results to ' + output_file)
-
-
-def generate_using_ngram(word_id_dict, num_words, beam_size):
-    """
-    Demo: use N-Gram model to do prediction.
-
-    :param word_id_dict: vocab.
-    :type word_id_dict: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
-    :param num_words: the number of the words to generate.
-    :type num_words: int
-    :param beam_size: beam width.
-    :type beam_size: int
-    :return: save prediction results to output_file
-    """
-
-    # prepare and cache model
-    config = Config_ngram()
-    _, output_layer = network_conf.ngram_lm(
-        vocab_size=len(word_id_dict),
-        emb_dim=config.emb_dim,
-        hidden_size=config.hidden_size,
-        num_layer=config.num_layer)  # network config
-    model_file_name = config.model_file_name_prefix + str(config.num_passs -
-                                                          1) + '.tar.gz'
-    parameters = paddle.parameters.Parameters.from_tar(
-        gzip.open(model_file_name))  # load parameters
-    inferer = paddle.inference.Inference(
-        output_layer=output_layer, parameters=parameters)
-
-    # tools, different from generate_using_rnn's tools
-    id_word_dict = dict(
-        [(v, k) for k, v in word_id_dict.items()])  # {id : word}
-
-    def str2ids(str):
-        return [[
-            word_id_dict.get(w, word_id_dict['<UNK>']) for w in str.split()
-        ]]
-
-    def ids2str(ids):
-        return [[id_word_dict.get(id, ' ') for id in ids]]
-
-    # generate text
-    with open(input_file) as file:
-        output_f = open(output_file, 'w')
-        for line in file:
-            line = line.decode('utf-8').strip()
-            words = line.split()
-            if len(words) < config.N:
-                output_f.write(line.encode('utf-8') + "\n\tnone\n")
-                continue
-            # generate
-            texts = {}  # type: {text : probability}
-            texts[line] = 1
-            for _ in range(num_words):
-                texts_new = {}
-                for (text, prob) in texts.items():
-                    if '<EOS>' in text:  # stop prediction when <EOS> appear
-                        texts_new[text] = prob
-                        continue
-                    # next word's probability distribution
-                    predictions = inferer.infer(
-                        input=str2ids(' '.join(text.split()[-config.N:])))
-                    predictions[-1][word_id_dict['<UNK>']] = -1  # filter <UNK>
-                    # find next beam_size words
-                    for _ in range(beam_size):
-                        cur_maxProb_index = np.argmax(
-                            predictions[-1])  # next word's id
-                        text_new = text + ' ' + id_word_dict[
-                            cur_maxProb_index]  # text append nextWord
-                        texts_new[text_new] = texts[text] * predictions[-1][
-                            cur_maxProb_index]
-                        predictions[-1][cur_maxProb_index] = -1
-                texts.clear()
-                if len(texts_new) <= beam_size:
-                    texts = texts_new
-                else:  # cutting
-                    texts = dict(
-                        sorted(
-                            texts_new.items(), key=lambda d: d[1], reverse=True)
-                        [:beam_size])
-
-            # save results to output file
-            output_f.write(line.encode('utf-8') + '\n')
-            for (sentence, prob) in texts.items():
-                output_f.write('\t' + sentence.encode('utf-8', 'replace') + '\t'
-                               + str(prob) + '\n')
-            output_f.write('\n')
-
-        output_f.close()
-    print('already saved results to ' + output_file)
-
-
-def main():
-    # init paddle
-    paddle.init(use_gpu=use_gpu, trainer_count=trainer_count)
-
-    # prepare and cache vocab
-    if os.path.isfile(vocab_file):
-        word_id_dict = load_vocab(vocab_file)  # load word dictionary
-    else:
-        if build_vocab_method == 'fixed_size':
-            word_id_dict = build_vocab_with_fixed_size(
-                train_file, vocab_max_size)  # build vocab
-        else:
-            word_id_dict = build_vocab_using_threshhold(
-                train_file, unk_threshold)  # build vocab
-        save_vocab(word_id_dict, vocab_file)  # save vocab
-
-    # generate
-    if use_which_model == 'rnn':
-        generate_using_rnn(
-            word_id_dict=word_id_dict, num_words=num_words, beam_size=beam_size)
-    elif use_which_model == 'ngram':
-        generate_using_ngram(
-            word_id_dict=word_id_dict, num_words=num_words, beam_size=beam_size)
-    else:
-        raise Exception('use_which_model must be rnn or ngram!')
-
-
-if __name__ == "__main__":
-    main()
--- a/language_model/network_conf.py
+++ b/language_model/network_conf.py
-# coding=utf-8
-
-import paddle.v2 as paddle
-
-
-def rnn_lm(vocab_size, emb_dim, rnn_type, hidden_size, num_layer):
-    """
-    RNN language model definition.
-
-    :param vocab_size: size of vocab.
-    :param emb_dim: embedding vector's dimension.
-    :param rnn_type: the type of RNN cell.
-    :param hidden_size: number of unit.
-    :param num_layer: layer number.
-    :return: cost and output layer of model.
-    """
-
-    assert emb_dim > 0 and hidden_size > 0 and vocab_size > 0 and num_layer > 0
-
-    # input layers
-    input = paddle.layer.data(
-        name="input", type=paddle.data_type.integer_value_sequence(vocab_size))
-    target = paddle.layer.data(
-        name="target", type=paddle.data_type.integer_value_sequence(vocab_size))
-
-    # embedding layer
-    input_emb = paddle.layer.embedding(input=input, size=emb_dim)
-
-    # rnn layer
-    if rnn_type == 'lstm':
-        rnn_cell = paddle.networks.simple_lstm(
-            input=input_emb, size=hidden_size)
-        for _ in range(num_layer - 1):
-            rnn_cell = paddle.networks.simple_lstm(
-                input=rnn_cell, size=hidden_size)
-    elif rnn_type == 'gru':
-        rnn_cell = paddle.networks.simple_gru(input=input_emb, size=hidden_size)
-        for _ in range(num_layer - 1):
-            rnn_cell = paddle.networks.simple_gru(
-                input=rnn_cell, size=hidden_size)
-    else:
-        raise Exception('rnn_type error!')
-
-    # fc(full connected) and output layer
-    output = paddle.layer.fc(
-        input=[rnn_cell], size=vocab_size, act=paddle.activation.Softmax())
-
-    # loss
-    cost = paddle.layer.classification_cost(input=output, label=target)
-
-    return cost, output
-
-
-def ngram_lm(vocab_size, emb_dim, hidden_size, num_layer, gram_num=4):
-    """
-    N-Gram language model definition.
-
-    :param vocab_size: size of vocab.
-    :param emb_dim: embedding vector's dimension.
-    :param hidden_size: size of unit.
-    :param num_layer: number of hidden layers.
-    :param gram_size: gram number in n-gram method
-    :return: cost and output layer of model.
-    """
-
-    assert emb_dim > 0 and hidden_size > 0 and vocab_size > 0 and num_layer > 0
-
-    # input layers
-    emb_layers = []
-    for i in range(gram_num):
-        word = paddle.layer.data(
-            name="__word%02d__" % (i + 1),
-            type=paddle.data_type.integer_value(vocab_size))
-        emb = paddle.layer.embedding(
-            input=word,
-            size=emb_dim,
-            param_attr=paddle.attr.Param(name="_proj", initial_std=1e-3))
-        emb_layers.append(emb)
-    next_word = paddle.layer.data(
-        name="__next_word__", type=paddle.data_type.integer_value(vocab_size))
-
-    # hidden layer
-    for i in range(num_layer):
-        hidden = paddle.layer.fc(
-            input=hidden if i else paddle.layer.concat(input=emb_layers),
-            size=hidden_size,
-            act=paddle.activation.Relu())
-
-    predict_word = paddle.layer.fc(
-        input=[hidden], size=vocab_size, act=paddle.activation.Softmax())
-
-    # loss
-    cost = paddle.layer.classification_cost(input=predict_word, label=next_word)
-
-    return cost, predict_word
--- a/language_model/reader.py
+++ b/language_model/reader.py
-# coding=utf-8
-import collections
-import os
-
-
-def rnn_reader(file_name, min_sentence_length, max_sentence_length,
-               word_id_dict):
-    """
-    create reader for RNN, each line is a sample.
-
-    :param file_name: file name.
-    :param min_sentence_length: sentence's min length.
-    :param max_sentence_length: sentence's max length.
-    :param word_id_dict: vocab with content of '{word, id}', 'word' is string type , 'id' is int type.
-    :return: data reader.
-    """
-
-    def reader():
-        UNK = word_id_dict['<UNK>']
-        with open(file_name) as file:
-            for line in file:
-                words = line.decode('utf-8', 'ignore').strip().split()
-                if len(words) < min_sentence_length or len(
-                        words) > max_sentence_length:
-                    continue
-                ids = [word_id_dict.get(w, UNK) for w in words]
-                ids.append(word_id_dict['<EOS>'])
-                target = ids[1:]
-                target.append(word_id_dict['<EOS>'])
-                yield ids[:], target[:]
-
-    return reader
-
-
-def ngram_reader(file_name, N, word_id_dict):
-    """
-    create reader for N-Gram.
-
-    :param file_name: file name.
-    :param N: N-Gram's N.
-    :param word_id_dict: vocab with content of '{word, id}', 'word' is string type , 'id' is int type.
-    :return: data reader.
-    """
-    assert N >= 2
-
-    def reader():
-        ids = []
-        UNK_ID = word_id_dict['<UNK>']
-        cache_size = 10000000
-        with open(file_name) as file:
-            for line in file:
-                words = line.decode('utf-8', 'ignore').strip().split()
-                ids += [word_id_dict.get(w, UNK_ID) for w in words]
-                ids_len = len(ids)
-                if ids_len > cache_size:  # output
-                    for i in range(ids_len - N - 1):
-                        yield tuple(ids[i:i + N])
-                    ids = []
-        ids_len = len(ids)
-        for i in range(ids_len - N - 1):
-            yield tuple(ids[i:i + N])
-
-    return reader
--- a/language_model/train.py
+++ b/language_model/train.py
-# coding=utf-8
-import sys
-import paddle.v2 as paddle
-import reader
-from utils import *
-import network_conf
-import gzip
-from config import *
-
-
-def train(model_cost, train_reader, test_reader, model_file_name_prefix,
-          num_passes):
-    """
-    train model.
-
-    :param model_cost: cost layer of the model to train.
-    :param train_reader: train data reader.
-    :param test_reader: test data reader.
-    :param model_file_name_prefix: model's prefix name.
-    :param num_passes: epoch.
-    :return:
-    """
-
-    # init paddle
-    paddle.init(use_gpu=use_gpu, trainer_count=trainer_count)
-
-    # create parameters
-    parameters = paddle.parameters.create(model_cost)
-
-    # create optimizer
-    adam_optimizer = paddle.optimizer.Adam(
-        learning_rate=1e-3,
-        regularization=paddle.optimizer.L2Regularization(rate=1e-3),
-        model_average=paddle.optimizer.ModelAverage(
-            average_window=0.5, max_average_window=10000))
-
-    # create trainer
-    trainer = paddle.trainer.SGD(
-        cost=model_cost, parameters=parameters, update_equation=adam_optimizer)
-
-    # define event_handler callback
-    def event_handler(event):
-        if isinstance(event, paddle.event.EndIteration):
-            if event.batch_id % 100 == 0:
-                print("\nPass %d, Batch %d, Cost %f, %s" % (
-                    event.pass_id, event.batch_id, event.cost, event.metrics))
-            else:
-                sys.stdout.write('.')
-                sys.stdout.flush()
-
-        # save model each pass
-        if isinstance(event, paddle.event.EndPass):
-            result = trainer.test(reader=test_reader)
-            print("\nTest with Pass %d, %s" % (event.pass_id, result.metrics))
-            with gzip.open(
-                    model_file_name_prefix + str(event.pass_id) + '.tar.gz',
-                    'w') as f:
-                parameters.to_tar(f)
-
-    # start to train
-    print('start training...')
-    trainer.train(
-        reader=train_reader, event_handler=event_handler, num_passes=num_passes)
-
-    print("Training finished.")
-
-
-def main():
-    # prepare vocab
-    print('prepare vocab...')
-    if build_vocab_method == 'fixed_size':
-        word_id_dict = build_vocab_with_fixed_size(
-            train_file, vocab_max_size)  # build vocab
-    else:
-        word_id_dict = build_vocab_using_threshhold(
-            train_file, unk_threshold)  # build vocab
-    save_vocab(word_id_dict, vocab_file)  # save vocab
-
-    # init model and data reader
-    if use_which_model == 'rnn':
-        # init RNN model
-        print('prepare rnn model...')
-        config = Config_rnn()
-        cost, _ = network_conf.rnn_lm(
-            len(word_id_dict), config.emb_dim, config.rnn_type,
-            config.hidden_size, config.num_layer)
-
-        # init RNN data reader
-        train_reader = paddle.batch(
-            paddle.reader.shuffle(
-                reader.rnn_reader(train_file, min_sentence_length,
-                                  max_sentence_length, word_id_dict),
-                buf_size=65536),
-            batch_size=config.batch_size)
-
-        test_reader = paddle.batch(
-            paddle.reader.shuffle(
-                reader.rnn_reader(test_file, min_sentence_length,
-                                  max_sentence_length, word_id_dict),
-                buf_size=65536),
-            batch_size=config.batch_size)
-
-    elif use_which_model == 'ngram':
-        # init N-Gram model
-        print('prepare ngram model...')
-        config = Config_ngram()
-        assert config.N == 5
-        cost, _ = network_conf.ngram_lm(
-            vocab_size=len(word_id_dict),
-            emb_dim=config.emb_dim,
-            hidden_size=config.hidden_size,
-            num_layer=config.num_layer)
-
-        # init N-Gram data reader
-        train_reader = paddle.batch(
-            paddle.reader.shuffle(
-                reader.ngram_reader(train_file, config.N, word_id_dict),
-                buf_size=65536),
-            batch_size=config.batch_size)
-
-        test_reader = paddle.batch(
-            paddle.reader.shuffle(
-                reader.ngram_reader(test_file, config.N, word_id_dict),
-                buf_size=65536),
-            batch_size=config.batch_size)
-    else:
-        raise Exception('use_which_model must be rnn or ngram!')
-
-    # train model
-    train(
-        model_cost=cost,
-        train_reader=train_reader,
-        test_reader=test_reader,
-        model_file_name_prefix=config.model_file_name_prefix,
-        num_passes=config.num_passs)
-
-
-if __name__ == "__main__":
-    main()
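
The removed training script builds each reader by nesting `paddle.batch(paddle.reader.shuffle(reader, buf_size), batch_size)`. As a rough illustration of what that composition does, here is a dependency-free sketch of the two decorators. It is not PaddlePaddle's implementation: the `shuffle` and `batch` functions below, the `seed` argument, the `toy_reader` helper, and the choice to keep the final partial batch are all assumptions made for the example.

```python
import random


def shuffle(reader, buf_size, seed=None):
    """Wrap a reader so that samples are shuffled within buffers of buf_size."""

    def shuffled_reader():
        rng = random.Random(seed)
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                # Buffer is full: shuffle it and emit its contents.
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Emit whatever is left in the last, possibly smaller, buffer.
        rng.shuffle(buf)
        for item in buf:
            yield item

    return shuffled_reader


def batch(reader, batch_size):
    """Wrap a reader so that it yields lists of up to batch_size samples."""

    def batched_reader():
        current = []
        for sample in reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        if current:
            # Assumption of this sketch: keep the final partial batch.
            yield current

    return batched_reader


def toy_reader():
    """Hypothetical stand-in for reader.rnn_reader(...): yields 10 samples."""
    for i in range(10):
        yield i


train_reader = batch(shuffle(toy_reader, buf_size=4, seed=0), batch_size=3)
batches = list(train_reader())
```

Because shuffling happens only inside each buffer, a small `buf_size` keeps samples close to their original position. This is presumably why the script uses a large buffer (`buf_size=65536`).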
--- a/language_model/utils.py
+++ b/language_model/utils.py
-# coding=utf-8
-import os
-import collections
-
-
-def save_vocab(word_id_dict, vocab_file_name):
-    """
-    save vocab.
-
-    :param word_id_dict: dictionary mapping each word (string) to its id (int).
-    :param vocab_file_name: vocab file name.
-    """
-    f = open(vocab_file_name, 'w')
-    for (k, v) in word_id_dict.items():
-        f.write(k.encode('utf-8') + '\t' + str(v) + '\n')
-    print('save vocab to ' + vocab_file_name)
-    f.close()
-
-
-def load_vocab(vocab_file_name):
-    """
-    load vocab from file.
-    :param vocab_file_name: vocab file name.
-    :return: dictionary mapping each word (string) to its id (int).
-    """
-    assert os.path.isfile(vocab_file_name)
-    dict = {}
-    with open(vocab_file_name) as file:
-        for line in file:
-            if len(line) < 2:
-                continue
-            kv = line.decode('utf-8').strip().split('\t')
-            dict[kv[0]] = int(kv[1])
-    return dict
-
-
-def build_vocab_using_threshhold(file_name, unk_threshold):
-    """
-    build vocab using the <UNK> threshold.
-
-    :param file_name: path of the text file to build the vocab from.
-    :param unk_threshold: <UNK> threshold; words occurring fewer times than this are left out of the vocab.
-    :type unk_threshold: int.
-    :return: dictionary mapping each word (string) to its id (int).
-    """
-    counter = {}
-    with open(file_name) as file:
-        for line in file:
-            words = line.decode('utf-8', 'ignore').strip().split()
-            for word in words:
-                if word in counter:
-                    counter[word] += 1
-                else:
-                    counter[word] = 1
-    counter_new = {}
-    for (word, frequency) in counter.items():
-        if frequency >= unk_threshold:
-            counter_new[word] = frequency
-    counter.clear()
-    counter_new = sorted(counter_new.items(), key=lambda d: -d[1])
-    words = [word_frequency[0] for word_frequency in counter_new]
-    word_id_dict = dict(zip(words, range(2, len(words) + 2)))
-    word_id_dict['<UNK>'] = 0
-    word_id_dict['<EOS>'] = 1
-    return word_id_dict
-
-
-def build_vocab_with_fixed_size(file_name, vocab_max_size):
-    """
-    build vocab with a given max size.
-
-    :param vocab_max_size: vocab's max size.
-    :return: dictionary mapping each word (string) to its id (int).
-    """
-    words = []
-    for line in open(file_name):
-        words += line.decode('utf-8', 'ignore').strip().split()
-
-    counter = collections.Counter(words)
-    counter = sorted(counter.items(), key=lambda x: -x[1])
-    if len(counter) > vocab_max_size:
-        counter = counter[:vocab_max_size]
-    words, counts = zip(*counter)
-    word_id_dict = dict(zip(words, range(2, len(words) + 2)))
-    word_id_dict['<UNK>'] = 0
-    word_id_dict['<EOS>'] = 1
-    return word_id_dict
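
Both removed helpers share one id scheme: `<UNK>` is 0, `<EOS>` is 1, and the remaining words get ids from 2 upward in order of decreasing frequency. The following Python 3 sketch shows the threshold variant on an in-memory corpus. The removed code targets Python 2 and reads from a file, so the `build_vocab` and `encode` functions here are hypothetical stand-ins. In particular, appending `<EOS>` in `encode` is an assumption about how a reader might use the dictionary, not something shown in this diff.

```python
import collections


def build_vocab(lines, unk_threshold):
    """Map words seen at least unk_threshold times to ids starting at 2."""
    counter = collections.Counter(
        word for line in lines for word in line.strip().split())
    # most_common() sorts by decreasing frequency.
    kept = [w for w, freq in counter.most_common() if freq >= unk_threshold]
    word_id_dict = {w: i for i, w in enumerate(kept, start=2)}
    word_id_dict['<UNK>'] = 0  # reserved id for out-of-vocabulary words
    word_id_dict['<EOS>'] = 1  # reserved id for end of sentence
    return word_id_dict


def encode(sentence, word_id_dict):
    """Hypothetical helper: turn a sentence into ids, ending with <EOS>."""
    unk = word_id_dict['<UNK>']
    ids = [word_id_dict.get(w, unk) for w in sentence.split()]
    return ids + [word_id_dict['<EOS>']]


corpus = ["the cat sat", "the cat ran", "the dog sat"]
vocab = build_vocab(corpus, unk_threshold=2)
ids = encode("the dog sat", vocab)
```

With this toy corpus, "ran" and "dog" each occur once, so they fall below the threshold and are encoded as `<UNK>`.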