Commit bf0cfe33 Authored by: Z zhaopu

rebuid code

Parent: f495dda9
# Language Model
## Introduction
A language model (LM) is a probability distribution model; simply put, it is a model that computes the probability of a sentence. Given a sentence (a sequence of words):
<div align=center><img src='images/s.png'/></div>
its probability can be written as:
<div align=center><img src='images/ps.png'/> &nbsp;&nbsp;&nbsp;&nbsp;(Eq. 1)</div>
A language model computes P(S) in (Eq. 1) as well as its intermediate results. **With it we can determine which word sequence is more likely, or, given some preceding words, predict the most likely next word.**
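In case the formula images above do not render, S is presumably the word sequence w<sub>1</sub> w<sub>2</sub> ... w<sub>m</sub>, and (Eq. 1) is the standard chain-rule factorization of the sentence probability:

$$P(S) = P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1)\cdots P(w_m \mid w_1, \ldots, w_{m-1})$$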
## Applications
**Language models are used in many areas**, for example:
* **Automatic writing**: a language model generates the next word from the preceding text; applied recursively, it can generate a whole sentence, paragraph, or article.
* **QA**: a language model can generate an answer to a question.
* **Machine translation**: most mainstream machine translation models follow the encoder-decoder paradigm, in which the decoder is essentially a language model that generates the target-language text.
* **Spell checking**: a language model scores a word sequence; the probability usually drops sharply at a spelling error, which can be used to detect errors and suggest correction candidates.
* **Part-of-speech tagging, syntactic parsing, speech recognition, ...**
## About this example
Common implementations of language models include N-Gram, RNN, and seq2seq. This example implements N-Gram-based and RNN-based language models. **The files in this example are organized as follows:**
* data_util.py: reads the corpus and builds, saves, and loads the vocabulary.
* lm_rnn.py: defines, trains, and runs prediction with the RNN-based language model.
* lm_ngram.py: defines, trains, and runs prediction with the N-Gram-based language model.
***Note:** an N-Gram language model is usually weaker than an RNN language model, so the RNN model is recommended in practice; this example therefore focuses on the RNN model and only briefly covers the N-Gram model.*
## RNN language model
### Introduction
An RNN is a sequence model. The basic idea: at time step t, the previous hidden state h<sub>t-1</sub> and the current word vector x<sub>t</sub> are fed into the hidden layer to produce the feature representation h<sub>t</sub>, which is then used to make the prediction ŷ<sub>t</sub> for step t; this recursion continues along the time dimension, as shown below (and in the equations that follow):
<div align=center><img src='images/rnn_str.png' width='500px'/></div>
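Written out for a vanilla RNN cell (LSTM and GRU replace the update function f with gated variants), the recursion described above is:

$$h_t = f(W_x x_t + W_h h_{t-1} + b), \qquad \hat{y}_t = \mathrm{softmax}(W_y h_t + b_y)$$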
This shows that an RNN is good at exploiting preceding context and history, i.e. it has a kind of "memory". In theory an RNN can model long-range dependencies (using information from far back in the sequence), but in practice this often does not work well, which led to many RNN variants such as the widely used LSTM and GRU; they redesign the cell of the vanilla RNN to make up for this weakness. The figure below illustrates an LSTM cell:
<div align=center><img src='images/lstm.png' width='500px'/></div>
This example uses LSTM and GRU cells.
### Model structure
The lm() function in lm\_rnn.py defines the model structure. A walkthrough follows:
* 1. First, the model hyperparameters are defined in \_\_main\_\_:
```python
# -- config : model --
rnn_type = 'gru' # or 'lstm'
emb_dim = 200
hidden_size = 200
num_passs = 2
num_layer = 2
```
Here rnn\_type selects the RNN cell type, either 'lstm' or 'gru'; hidden\_size sets the number of hidden units; num\_layer sets the number of RNN layers; num\_passs sets the number of training passes; and emb_dim sets the embedding dimension.
* 2. Map the input word (or character) sequence to vectors, i.e. the embedding:
```python
data = paddle.layer.data(name="word", type=paddle.data_type.integer_value_sequence(vocab_size))
target = paddle.layer.data("label", paddle.data_type.integer_value_sequence(vocab_size))
emb = paddle.layer.embedding(input=data, size=emb_dim)
```
* 3. Build the RNN layers according to the configuration, taking the embedding sequence from the previous step as input:
```python
if rnn_type == 'lstm':
rnn_cell = paddle.networks.simple_lstm(
input=emb, size=hidden_size)
for _ in range(num_layer - 1):
rnn_cell = paddle.networks.simple_lstm(
input=rnn_cell, size=hidden_size)
elif rnn_type == 'gru':
rnn_cell = paddle.networks.simple_gru(
input=emb, size=hidden_size)
for _ in range(num_layer - 1):
rnn_cell = paddle.networks.simple_gru(
input=rnn_cell, size=hidden_size)
```
* 4. Add the output layer, a fully connected layer with softmax that normalizes the scores into a probability distribution over the vocabulary (this layer is returned as the output), and define the model cost, the multi-class cross-entropy loss:
```python
# fc and output layer
output = paddle.layer.fc(input=[rnn_cell], size=vocab_size, act=paddle.activation.Softmax())
# loss
cost = paddle.layer.classification_cost(input=output, label=target)
```
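The classification cost above is the standard per-word cross-entropy, i.e. the negative log-likelihood of the target words:

$$\mathrm{cost} = -\sum_{t} \log P(w_t \mid w_1, \ldots, w_{t-1})$$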
### Training the model
The train() function in lm\_rnn.py implements model training. The workflow is:
* 1. Prepare the input data: this example uses the standard PTB data. Call build\_vocab() in data\_util.py to build the vocabulary and save\_vocab() to persist it for reuse (building the vocabulary from a large corpus is slow, so the vocabulary built on the first run is saved and reused). Then use train\_data() and test\_data() in data\_util.py to create train\_reader and test\_reader for reading the training and test data.
* 2. Initialize the model: the network structure, the parameters, the optimizer (Adam in this demo), and the trainer, as follows:
```python
# network config
cost, _ = lm(len(word_id_dict), emb_dim, rnn_type, hidden_size, num_layer)
# create parameters
parameters = paddle.parameters.create(cost)
# create optimizer
adam_optimizer = paddle.optimizer.Adam(
learning_rate=1e-3,
regularization=paddle.optimizer.L2Regularization(rate=1e-3),
model_average=paddle.optimizer.ModelAverage(average_window=0.5))
# create trainer
trainer = paddle.trainer.SGD(
cost=cost, parameters=parameters, update_equation=adam_optimizer)
```
* 3. Define the event_handler callback to track the training loss and save the model parameters at the end of each pass:
```python
# define event_handler callback
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print("\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost,
event.metrics))
else:
sys.stdout.write('.')
sys.stdout.flush()
# save model each pass
if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=ptb_reader)
print("\nTest with Pass %d, %s" % (event.pass_id, result.metrics))
with gzip.open(model_file_name_prefix + str(event.pass_id) + '.tar.gz', 'w') as f:
parameters.to_tar(f)
```
* 4. Start training:
```python
trainer.train(
reader=ptb_reader, event_handler=event_handler, num_passes=num_passs)
```
### Generating text
The predict() function in lm\_rnn.py implements prediction, i.e. text generation. The workflow is:
* 1. First load and cache the vocabulary and the model; the trained parameters are loaded as follows:
```python
parameters = paddle.parameters.Parameters.from_tar(gzip.open(model_file_name))
```
* 2. Generate text. This example generates text with beam search, a heuristic graph-search algorithm, implemented in the \_generate\_with\_beamSearch() method of lm\_rnn.py; a minimal sketch of the idea is shown below.
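The idea of beam search is to keep only the `beam_size` most probable partial sentences at each step. The following is a minimal self-contained sketch, not the repo's \_generate\_with\_beamSearch(); `next_word_probs` is a hypothetical stand-in for the model's inference call, and the toy distribution at the bottom is only for illustration:
```python
import heapq
import math

def beam_search(prefix, next_word_probs, beam_size=5, max_len=10, eos='<EOS>'):
    """Expand `prefix` (a list of words), keeping the `beam_size` best hypotheses."""
    beams = [(0.0, list(prefix))]                 # (log-probability, word list)
    for _ in range(max_len):
        candidates = []
        for logp, words in beams:
            if words[-1] == eos:                  # finished hypotheses are kept as-is
                candidates.append((logp, words))
                continue
            for word, p in next_word_probs(words).items():
                candidates.append((logp + math.log(p), words + [word]))
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return beams

# toy usage with a fake next-word distribution (a bigram lookup on the last word)
toy = {'我': {'爱': 0.6, '是': 0.4}, '爱': {'中国': 0.9, '<EOS>': 0.1},
       '是': {'中国': 0.8, '<EOS>': 0.2}, '中国': {'<EOS>': 1.0}}
print(beam_search(['我'], lambda ws: toy[ws[-1]], beam_size=2, max_len=3))
```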
### <font color='red'>Using this demo</font>
This example uses the standard PTB data. To train a model on your own data, only the following adaptations are needed:
#### Adapting the corpus
* Clean the corpus: remove spaces, tabs, and garbled characters, and remove digits, punctuation, and special symbols as needed.
* Encoding: utf-8; this example already handles Chinese text.
* Content format: one sentence per line, with the words in each line separated by a single space (e.g. `我 爱 中国 。`).
* Adjust the data settings in the \_\_main\_\_ section of lm\_rnn.py as needed:
```python
# -- config : data --
train_file = 'data/ptb.train.txt'
test_file = 'data/ptb.test.txt'
vocab_file = 'data/vocab_cn.txt' # the file to save vocab
vocab_max_size = 3000
min_sentence_length = 3
max_sentence_length = 60
```
Here vocab\_max\_size sets the maximum size of the vocabulary: if the corpus contains more distinct words than this value, the words are sorted by frequency in descending order and only the top vocab\_max\_size words are kept.
*Note: a larger vocabulary gives richer generated text but slower training. After Chinese word segmentation a corpus easily contains tens or even hundreds of thousands of distinct words; if vocab\_max\_size is too small, the proportion of \<UNK\> tokens becomes too high, while a very large value severely slows training (and also hurts accuracy). An alternative is character-level training, i.e. treating each Chinese character as a word: there are only a few thousand common characters, so the vocabulary stays small and little information is lost, but because the same character can have very different meanings in different words, the model sometimes performs worse. Try both and choose between word-level and character-level training based on your own data.*
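A sketch of the fixed-size vocabulary construction described above (the same idea as build_vocab_with_fixed_size() in utils.py later in this commit; the helper name and the two-line toy corpus here are only for illustration):
```python
import collections

def build_vocab(lines, vocab_max_size):
    # count word frequencies and keep the vocab_max_size most frequent words
    counter = collections.Counter(w for line in lines for w in line.split())
    words = [w for w, _ in counter.most_common(vocab_max_size)]
    # ids 0 and 1 are reserved for <UNK> and <EOS>; the rest follow in frequency order
    word_id_dict = {w: i + 2 for i, w in enumerate(words)}
    word_id_dict['<UNK>'] = 0
    word_id_dict['<EOS>'] = 1
    return word_id_dict

print(build_vocab(['我 爱 中国 。', '我 是 中国 人 。'], vocab_max_size=5))
```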
#### Adapting the model
Adjust the parameters defined in \_\_main\_\_ according to the size of your corpus.
Then run `python lm_rnn.py` to train the model and run prediction.
## n-gram language model
An n-gram model is also called an (n-1)-th order Markov model. It makes a limited-history assumption: the probability of the current word depends only on the preceding n-1 words, so (Eq. 1) can be approximated as:
<div align=center><img src='images/ps2.png'/></div>
The model parameters are usually estimated by maximum likelihood estimation (MLE). For n = 1, 2, 3 the model is called a unigram, bigram, or trigram language model respectively. In general, the larger n is and the larger the training corpus, the more reliable the parameter estimates; but because the model is simple, has limited expressive power, and suffers from data sparsity, an n-gram language model usually performs worse than RNN or seq2seq models.
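Spelled out (standard forms, shown here in case the formula image does not render), the approximation above and its MLE estimate from corpus counts are:

$$P(S) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}), \qquad \hat{P}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_{i-n+1}, \ldots, w_i)}{\mathrm{count}(w_{i-n+1}, \ldots, w_{i-1})}$$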
### Model structure
The lm() function in lm\_ngram.py defines the model structure, roughly as follows (a condensed code sketch follows the figure):
* 1. The demo uses n = 5: the previous four words are embedded separately and the embeddings are concatenated into a feature vector.
* 2. The feature vector is fed into the hidden layers of a DNN.
* 3. The DNN output goes through a softmax layer to produce a probability distribution over the vocabulary for the next word.
* 4. The loss is cross-entropy, optimized with the Adam optimizer.
The structure is illustrated below:
<div align=center><img src='images/ngram.png' width='400px'/></div>
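A condensed sketch of this structure in the PaddlePaddle v2 API (the full version is ngram_lm() in network_conf.py, included later in this commit; this sketch uses paddle.layer.embedding rather than the shared table projection used there, and the layer names are arbitrary):
```python
import paddle.v2 as paddle

def ngram_lm_sketch(vocab_size, emb_dim=200, hidden_size=200):
    # four context words and the target word, each a single integer id
    context_words = [
        paddle.layer.data(name='word_%d' % i,
                          type=paddle.data_type.integer_value(vocab_size))
        for i in range(4)
    ]
    next_word = paddle.layer.data(
        name='next_word', type=paddle.data_type.integer_value(vocab_size))
    # embed each context word and concatenate the embeddings into one feature vector
    embeddings = [paddle.layer.embedding(input=w, size=emb_dim) for w in context_words]
    context = paddle.layer.concat(input=embeddings)
    # DNN hidden layer, then a softmax over the vocabulary for the next word
    hidden = paddle.layer.fc(input=context, size=hidden_size,
                             act=paddle.activation.Relu())
    predict = paddle.layer.fc(input=hidden, size=vocab_size,
                              act=paddle.activation.Softmax())
    # cross-entropy loss against the true next word
    cost = paddle.layer.classification_cost(input=predict, label=next_word)
    return cost, predict
```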
### Model training
The train() function in lm\_ngram.py implements model training; the procedure is similar to the RNN LM and is summarized below:
* 1. Prepare the input data: the standard PTB data is used. Call build\_vocab() in data\_util.py to build the vocabulary and save\_vocab() to persist it, then use train\_data() and test\_data() in data\_util.py to create train\_reader and test\_reader for reading the training and test data.
* 2. Initialize the model: the network structure, the parameters, the optimizer (Adam in this demo), and the trainer.
* 3. Define the event_handler callback to track the training loss and save the model parameters at the end of each pass.
* 4. Start training with the trainer.
### Generating text
The \_\_main\_\_ section of lm\_ngram.py contains a simple implementation of prediction (text generation). The workflow is:
* 1. First load the vocabulary and the model:
```python
# prepare model
word_id_dict = reader.load_vocab(vocab_file) # load word dictionary
_, output_layer = lm(len(word_id_dict), emb_dim, hidden_size, num_layer) # network config
model_file_name = model_file_name_prefix + str(num_passs - 1) + '.tar.gz'
parameters = paddle.parameters.Parameters.from_tar(gzip.open(model_file_name)) # load parameters
```
* 2. Predict the next word from the preceding 4 (i.e. n-1) words and print it:
```python
# generate
text = 'the end of the' # use 4 words to predict the 5th word
input = [[word_id_dict.get(w, word_id_dict['<UNK>']) for w in text.split()]]
predictions = paddle.infer(
output_layer=output_layer,
parameters=parameters,
input=input,
field=['value'])
id_word_dict = dict([(v, k) for k, v in word_id_dict.items()]) # dictionary with type {id : word}
predictions[-1][word_id_dict['<UNK>']] = -1 # filter <UNK>
next_word = id_word_dict[np.argmax(predictions[-1])]
print(next_word.encode('utf-8'))
```
*Note: this shows another way to run prediction, the paddle.infer() call; the RNN example uses the paddle.inference.Inference interface instead.*
# coding=utf-8
# -- config : data --
train_file = 'data/chinese.train.txt'
test_file = 'data/chinese.test.txt'
vocab_file = 'data/vocab_cn.txt' # the file to save vocab
build_vocab_method = 'fixed_size' # 'frequency' or 'fixed_size'
vocab_max_size = 3000 # when build_vocab_method = 'fixed_size'
unk_threshold = 1 # when build_vocab_method = 'frequency'
min_sentence_length = 3
max_sentence_length = 60
# -- config : train --
use_which_model = 'ngram' # must be: 'rnn' or 'ngram'
use_gpu = False # whether to use gpu
trainer_count = 1 # number of trainer
class Config_rnn(object):
"""
config for RNN language model
"""
rnn_type = 'gru' # or 'lstm'
emb_dim = 200
hidden_size = 200
num_layer = 2
num_passs = 2
batch_size = 32
model_file_name_prefix = 'lm_' + rnn_type + '_params_pass_'
class Config_ngram(object):
"""
config for N-Gram language model
"""
emb_dim = 200
hidden_size = 200
num_layer = 2
N = 5
num_passs = 2
batch_size = 32
model_file_name_prefix = 'lm_ngram_pass_'
# -- config : infer --
input_file = 'data/input.txt' # input file contains sentence prefix each line
output_file = 'data/output.txt' # the file to save results
num_words = 10 # the maximum number of words to generate
beam_size = 5 # beam width: the number of candidate sentences kept for each prefix
我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。
\ No newline at end of file
我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。
\ No newline at end of file
我 是
我 是 中国
我 爱
我 是 中国 人。
我 爱 中国
我 爱 中国 。我
我 爱 中国 。我 爱
我 爱 中国 。我 是
我 爱 中国 。我 是 中国
\ No newline at end of file
limited 791
consolidated 2482
four 347
facilities 1200
asian 2798
controversial 2177
whose 623
votes 2089
founder 2229
paris 1721
adviser 1759
edward 2090
voted 1935
under 125
worth 977
placed 1677
merchant 2565
pact 2130
risk 647
rise 498
sellers 2851
handling 2476
every 539
jack 1722
reforms 2309
affect 1968
bringing 2469
lehman 1238
believed 1542
school 722
calif 2386
companies 102
wednesday 910
van 2897
announced 412
pilson 2915
expanded 2427
force 534
leaders 818
miller 2247
guidelines 1795
estimates 784
japanese 174
elections 1720
second 335
street 323
estimated 453
machines 753
even 114
established 1755
disk 2826
pace 1888
panama 1852
contributed 1263
nec 2551
asia 2310
spokesman 301
above 626
dr. 982
new 36
net 136
increasing 987
ever 823
seeks 2454
told 549
specialist 2519
never 575
here 348
hundreds 1689
reported 221
protection 955
china 638
brooks 2353
active 1027
balance 1687
auction 968
items 1470
employees 457
climbed 1323
reports 658
credit 355
analysts 166
chrysler 1969
military 756
poverty 2838
changes 515
criticism 2288
golden 1750
campaign 879
reagan 1195
peabody 2432
highly 1329
brought 1130
opportunities 2661
total 344
unit 168
swings 2036
would 43
army 2135
hospital 1833
m. 1664
negative 1330
noting 2958
call 787
asset 1302
strike 1106
type 2136
until 315
b.a.t 1873
hahn 2992
supporters 2678
composite 431
hurt 1005
phone 1690
berlin 2737
hold 838
must 405
me 812
word 2057
room 1505
rights 535
pursue 2248
work 222
plunged 1604
movies 1970
henry 2960
already 294
merely 2444
revenues 2521
my 406
example 469
wang 1958
estate 460
give 438
cited 1376
india 2595
involve 2801
currency 820
foods 2389
woman 1128
caution 2981
ual 409
want 358
drive 1524
times 421
attract 2236
totaled 1297
guarantee 1982
end 237
recovery 1654
turn 837
provide 577
travel 1487
damage 516
machine 1042
how 243
hot 2766
interview 1220
widespread 2410
resignation 2178
badly 2455
regional 1486
minority 1398
lufkin 2979
after 79
damaged 1608
modest 1441
president 72
mesa 2754
law 279
types 2203
las 2785
purchase 496
attempt 1036
third 277
amid 1646
headquarters 1203
maintain 1346
green 1997
suggest 1894
democratic 961
order 529
ec 2125
wine 2831
operations 223
senators 1994
office 284
over 95
expects 407
london 410
japan 203
mayor 1819
before 158
fit 2767
personal 629
expectations 1418
better 502
production 360
weeks 422
easier 2329
damages 2063
then 224
dec. 932
affected 1709
combination 2802
lambert 1987
weakness 1827
safe 1998
break 1784
effects 1670
they 39
schools 1632
silver 1988
bank 105
structural 1989
represents 1651
30-year 1456
detroit 2961
affiliate 2357
victory 2820
reasonable 2852
each 216
went 789
side 1078
bond 272
financial 142
suspended 1944
fairly 1953
series 442
carolina 2099
carry 1536
currencies 2403
trading 77
impossible 2487
substantially 1777
temporary 1907
saturday 2070
burnham 1880
t. 2213
network 673
crucial 2935
tomorrow 1588
semiconductor 2204
encourage 2422
daniel 2483
got 596
newly 2193
millions 2159
sluggish 2609
gop 2456
foundation 2583
sept. 628
turning 2442
written 2032
veto 1052
u.s. 54
threatened 2522
little 327
free 813
standard 738
estimate 884
wanted 1030
enormous 2291
created 1053
days 172
pence 1347
oppose 2993
1970s 2205
uses 1851
r. 1170
industrial 439
suspension 2903
economists 1119
primary 1723
hearing 1803
adopted 2224
another 206
electronic 1284
<UNK> 0
rated 1609
service 337
top 525
approximately 2994
needed 911
rates 173
too 307
percentage 761
john 401
ranging 2708
urban 1922
ceiling 2109
collapse 1633
serve 2110
took 454
rejected 1392
direct 1201
western 650
somewhat 2739
shortly 2626
toronto 2091
renewed 2853
target 1313
showed 988
likely 372
nations 1606
project 802
matter 1196
greenspan 2542
feeling 2982
acquisition 470
bridge 1481
fashion 2411
sees 2498
ran 2457
boston 652
modern 2390
mind 1714
mine 2120
talking 1923
seen 920
seem 1058
seek 983
relatively 1303
forced 1172
abroad 1909
strength 1531
concrete 2911
responsible 1845
sound 2100
recommended 2898
client 1786
luxury 1999
forces 1252
unsecured 2433
shipments 2292
blue 2753
nobody 2148
philadelphia 2253
though 329
wells 1940
involving 1571
germany 705
letter 1107
competing 2791
germans 2995
consumers 1134
antitrust 1990
medical 765
flow 1647
competitors 1414
points 395
principle 2860
after-tax 2451
voting 2259
consumer 582
dow 448
came 545
reserve 717
d. 674
saying 715
meetings 1926
ending 1555
showing 1659
radio 1665
poison 2936
hungary 2330
judges 1853
finally 1383
proposed 499
representing 2098
delays 2567
unemployment 2413
sugar 1971
rico 2293
bush 257
rich 1991
announce 2624
resulting 2712
do 88
exports 1002
de 990
stop 1147
preferred 1009
coast 1828
lenders 2331
despite 486
report 238
du 2723
volatility 929
hall 2517
runs 1678
jaguar 574
countries 676
fields 2186
high-yield 2138
bay 759
twice 1983
bad 890
release 1700
prudential-bache 2691
mergers 2963
secretary 718
headed 1658
disaster 1352
fair 2161
w. 1820
testing 2938
decided 1104
result 411
discussions 1963
resigned 1234
taiwan 2793
best 598
subject 966
brazil 2803
said 16
capacity 1137
away 601
irs 1049
compensation 2619
machinists 2106
pressures 2768
future 508
cooperation 2340
approach 1402
co. 96
profitable 1629
we 65
men 944
terms 520
extend 2369
nature 2162
wo 418
ask 1906
handful 2916
weak 1520
however 229
retirement 1427
extent 2206
news 311
convertible 956
debt 208
improve 1153
suggested 1943
received 517
protect 1865
met 1317
country 359
over-the-counter 1241
against 175
players 1578
else 1394
supplies 1802
games 2111
planned 950
faces 1772
studio 2318
argue 1992
asked 613
prospect 2424
tough 1401
appeared 2008
royal 2725
offerings 2272
represented 2623
tons 1068
initiative 2804
trust 458
telecommunications 1821
conference 881
puts 1847
basis 854
union 319
anc 2756
three 133
been 59
quickly 914
commission 475
beer 2358
much 122
interest 132
basic 2273
expected 171
entered 2139
containers 2625
life 391
families 1881
mci 2741
eastern 734
drugs 1387
republicans 1666
worker 2523
mca 1972
enterprises 2282
child 2250
ogilvy 2261
worked 1370
slowdown 1653
applied 2484
commerce 1204
has 31
publicly 1377
air 452
ventures 1715
near 958
appeals 1920
aid 870
property 810
study 936
launched 1610
seven 913
changed 973
metropolitan 2742
mexico 1433
is 14
it 15
expenses 1167
ii 1917
player 2524
experts 1403
world-wide 1393
in 8
victims 1468
confident 2883
turner 2588
if 67
grown 2391
hong 794
patent 2069
things 607
make 139
linked 2895
complex 1724
split 1568
several 249
couple 1547
european 568
independent 829
pick 2319
hand 1054
ownership 1443
constitution 2696
opportunity 1473
kept 1785
scenario 2918
programs 561
settled 1621
savings 746
materials 1318
rey 2879
mother 2320
claims 595
the 2
corporate 353
investments 744
left 697
quoted 1182
yen 255
mills 2718
expanding 2640
ideas 2966
identify 2861
human 1224
campbell 2627
yet 397
previous 558
adding 1344
buyers 846
hills 2375
phillips 1895
ease 1751
had 51
intends 2227
spread 1437
board 147
easy 1540
prison 2584
east 519
gave 1227
municipal 1488
possible 492
possibly 2370
buy-out 589
judge 403
replace 1986
advanced 992
desire 2649
county 1223
exxon 1691
hunt 2459
securities 126
offices 1154
officer 290
night 1069
security 648
delmed 2689
attorney 752
right 382
old 394
deal 497
people 109
dead 2610
consultants 1954
donald 2460
election 1725
short-term 967
specific 1228
for 12
bottom 2525
comments 2302
p.m. 2163
when 68
continue 388
denied 2027
steps 1652
christmas 2321
core 1866
marketing 610
corn 1579
conventional 1883
discount 1179
restructuring 654
plc 957
packages 2664
losing 1810
brokerage 857
post 1096
manufacturing 737
properties 1298
georgia-pacific 1660
chapter 1457
dollars 604
months 127
costs 271
magazine 781
plus 1726
afternoon 1736
efforts 702
slightly 743
nixon 2526
raised 748
managers 585
publishing 1584
formerly 2392
facility 1541
civil 1229
maxwell 2471
marshall 2901
son 2423
down 119
explain 2680
magazines 1444
dean 2805
reducing 2294
defendants 1811
crowd 2983
support 415
initial 1074
legislation 803
cosmetics 2659
per-share 1891
why 763
joseph 1955
editor 1716
way 228
resulted 2527
music 2289
was 25
war 978
interest-rate 2629
head 642
economics 2800
form 945
manufacturers 931
becoming 2341
differences 2322
ford 480
failure 1638
heat 2239
hear 2568
syndicate 2939
sustained 2964
stand 1419
true 1550
analyst 365
nov. 408
counsel 2628
inside 2207
bids 1404
maximum 2538
devices 1672
tell 1345
jan. 2081
<unk> 3
stronger 1934
one-third 2681
evidence 1043
promised 2597
accounting 1410
ship 2351
program-trading 2844
check 2323
negotiations 1391
regime 2822
floor 1014
phelan 2180
stake 320
generally 925
credibility 2924
successful 1698
interested 1378
role 795
holding 477
digital 1044
test 895
developers 2587
bailout 2663
roll 2332
picture 2151
's 10
brothers 1001
delivered 2930
models 1901
surprise 2608
felt 1918
utilities 1717
'd 1131
invested 2295
authorities 1892
'm 827
aware 2662
weekend 1854
died 2228
jones 459
reorganization 1411
longer 921
glass 2647
assume 2312
italy 1977
connecticut 2959
together 1190
liquidity 1841
premiums 2940
time 104
push 1812
serious 1175
profits 740
concept 2786
managed 1855
chain 1273
global 1551
alternatives 2461
focus 1247
manager 611
battle 1118
creative 2919
s.a. 2146
certainly 1532
everything 1498
father 2393
environment 1692
charge 527
asking 2026
e. 1514
marks 754
suffered 2187
circumstances 2290
division 540
supported 2316
mixte 1733
keeping 1929
choice 2164
liability 1756
drexel 875
lynch 834
10-year 2216
join 2149
trouble 1513
corp. 66
governments 2352
level 462
did 144
turns 2434
proposals 1737
democrat 2862
standards 1405
leave 1235
settle 1927
team 769
quick 2414
speculation 1311
round 2863
lloyd 1509
prevent 1371
says 45
trend 1458
gasoline 2439
telerate 2073
sign 1215
mich. 2823
cost 363
aggressive 2210
adds 872
appear 1299
hewlett-packard 2082
assistance 1965
shares 71
current 265
goes 1321
international 198
falling 1752
principal 1510
boost 962
filled 2443
paribas 1557
transportation 847
genes 2262
french 897
agreement 332
water 1092
baseball 1804
groups 723
address 2834
alone 1325
along 703
earthquake 427
change 429
wait 2555
canadian 741
institute 760
shift 1738
guilty 1623
trial 1162
usually 1026
corp 587
bob 2612
navigation 1667
retired 2598
defensive 2769
extra 2092
lending 1710
mobil 2920
crisis 2071
market 48
everybody 2613
indicated 1296
working 670
prove 2000
positive 1778
psyllium 2721
visit 2275
third-quarter 349
france 1003
live 1636
opposed 1779
stearns 2281
memory 2513
francs 1012
australian 1773
household 2249
today 326
club 2058
apparent 2770
fuel 2225
cautious 2577
downturn 2549
cases 685
effort 677
behalf 2771
fly 2683
organizations 2599
valued 1216
ibm 690
tokyo 606
car 602
abortion 886
believes 1095
districts 2884
ms. 487
values 1673
can 90
growing 776
making 424
interstate 1936
newspapers 2376
claim 1535
citizens 2371
figure 1187
predict 2499
december 924
chip 2093
agent 2276
1980s 2123
heard 2378
dropped 557
council 1334
allowed 1141
requirements 1562
winter 2650
secured 2965
bonds 129
chemical 720
beat 2816
sunday 2500
s. 2050
fourth 915
ensure 2283
subsidiaries 2569
economy 354
product 521
huge 828
may 94
southern 1277
applications 2529
membership 2140
produce 979
mae 1596
designed 1098
date 1326
such 89
data 451
grow 1556
man 898
natural 1007
johnson 1244
maybe 1688
futures 288
borrowing 2864
gap 2518
so 106
deposit 1856
increase 244
pulled 2473
talk 1197
typical 2600
exclusive 2755
no. 1656
acts 2869
seeing 2865
sell-off 2772
indeed 1010
mainly 1871
consulting 1611
years 73
ended 328
experiments 2835
cuts 1287
argued 2133
statements 2585
cold 2263
still 148
stock-index 974
group 97
monitor 2941
procedures 2734
presence 2238
troubles 2474
forms 2726
offers 1338
policy 374
mail 1617
main 1112
decades 2985
texas 484
happened 2152
finance 573
views 2141
introduce 2539
nation 507
records 2023
half 389
not 64
now 100
provision 1207
discuss 2124
nor 1412
term 1080
attorneys 1857
name 819
january 1135
drop 471
rock 2866
quarter 110
el 2727
square 2083
significantly 1775
latin 2827
revised 1582
s&p 835
begun 2360
year 41
happen 2048
worried 2084
tried 1590
canada 730
living 2033
shown 2028
inventories 2418
opened 1239
space 971
profit 178
factory 2042
looking 735
investigation 1339
indicating 2991
shows 1133
exactly 2675
earlier 134
theory 2570
cars 701
million 22
incentives 2799
possibility 2251
quite 1583
california 296
besides 2709
obligation 2264
marine 2787
card 2303
care 858
advance 1639
training 1993
language 2142
ministry 1245
discussing 2885
wrong 1780
british 304
thing 899
place 666
massive 2441
promotion 2644
think 316
first 75
merrill 814
revenue 187
one 55
opec 2552
americans 1018
one-time 2049
directly 1085
vote 839
corporations 1640
message 2601
fight 1447
open 671
george 1031
size 1070
city 286
given 732
sheet 2998
district 806
caught 2436
trillion 1744
plastic 2571
anyone 1453
indicate 2165
returns 984
white 445
friend 2788
gives 1499
hud 2043
acquisitions 1491
mining 2037
mostly 1425
that 11
pittsburgh 2343
season 1580
moscow 1495
alan 1745
released 1625
specialists 2166
surged 1730
than 56
population 2507
wide 1834
television 758
effective 1140
rival 2117
require 1214
spokeswoman 1046
officials 161
venture 816
were 47
published 2398
and 9
mountain 2355
san 305
investors 117
remained 1849
turned 951
argument 2774
say 118
plunge 1232
allen 2836
sells 1626
saw 1463
any 107
accounted 2722
offering 357
regular 2035
efficient 2743
offer 209
aside 2530
note 952
equipment 550
mr. 24
potential 649
take 210
performance 766
wonder 2744
registered 2783
channel 2556
begin 852
sure 1138
normal 1796
track 2274
price 116
enter 2882
paid 490
icahn 2736
nomura 2997
america 541
pages 1957
honecker 2531
manville 1637
operate 1890
especially 860
surprising 2492
payable 2694
considered 994
average 197
later 467
steady 2472
sale 227
federal 101
professional 1757
senior 446
mass. 1701
typically 1449
filing 963
laws 1489
shop 2682
rating 1384
shot 2684
surplus 2775
show 465
german 770
delta 2806
allegations 2508
commitments 2018
discovered 2059
rep. 824
soviets 1396
fifth 2728
ground 1840
slow 1188
ratio 2304
gulf 1482
title 2126
daily 859
enough 543
crime 1563
only 87
going 325
black 755
treasury 282
thompson 2528
watching 2974
congressional 912
dispute 1702
get 188
contracts 646
assistant 1634
employers 2445
nearly 441
secondary 2395
prime 953
regarding 2630
yield 236
morning 1211
miles 1357
predicted 1693
scott 2906
where 252
husband 2666
salomon 1278
declared 2009
corry 2710
committed 2266
seat 2685
elected 1240
j. 891
college 1500
stanley 1348
concern 289
mortgage 583
farmers 1202
ways 1139
jumped 1008
review 1282
representatives 1858
forecast 1474
weapons 2886
outside 849
bureau 1464
between 179
import 2179
reading 2557
across 1177
jobs 1142
august 398
parent 731
blame 2667
article 1055
cities 1806
come 435
reaction 2112
acquiring 2240
many 98
trader 1281
trades 1246
according 215
contract 317
prompted 2486
buy-back 2668
senator 2967
holders 675
traded 926
comes 1097
among 212
cancer 1019
color 2652
roman 2639
period 341
insist 2631
confirmed 1964
learning 2789
moreover 1208
poll 2824
two-year 2324
considering 1434
save 1973
unusual 1739
west 380
airlines 699
mark 1061
hutton 1340
combined 1545
hardly 2825
mary 2942
disclosed 871
wants 821
direction 1984
shopping 1974
offered 494
formed 2252
observers 2437
wake 2010
minister 1047
former 338
those 151
pilot 2509
case 308
developing 1589
these 145
consultant 1373
cash 268
n't 33
warning 2366
policies 1312
newspaper 1148
situation 942
shops 2632
margin 1797
region 1648
eventually 1415
metric 2265
health-care 2218
engaged 2837
telephone 733
quiet 2757
middle 2686
someone 1362
attributed 1400
technology 503
worry 2001
par 1164
develop 1285
pay 256
same 313
dealer 2586
speech 2396
grain 2200
insurers 1668
events 1483
week 124
buy-outs 2926
oil 269
singapore 2085
boosted 1813
drives 2072
producers 855
running 1122
harris 2711
intended 1951
changing 2510
anticipated 2344
complained 2540
costa 2211
theater 2669
largely 892
charges 593
no 103
constitutional 2305
roughly 1597
mortgages 1537
severe 2541
without 350
relief 1746
model 1807
researchers 1248
charged 1657
summer 976
asset-backed 2687
being 214
money 162
rest 1115
kill 2633
speed 1787
weekly 2052
announcement 1040
death 1237
rose 120
seems 900
except 1627
improvement 1467
westinghouse 2968
setting 2150
bloc 2634
treatment 1355
plenty 2962
tuesday 474
ross 2196
scheduled 832
negotiating 2017
around 385
read 1331
papers 2695
virginia 2698
early 267
inflation 620
traffic 1703
using 665
accepted 1747
ruled 2379
intel 2167
nissan 1271
rivals 2616
've 578
annually 1758
chamber 2615
benefit 1279
either 707
retailers 2314
fully 1145
output 1450
tower 2416
reduced 901
nikkei 2635
competition 874
loyalty 2970
bigger 1869
thinks 2086
provided 1294
earth 2821
operators 2638
recorded 2489
legal 656
conservative 1280
critical 2101
deficit 822
provides 1304
newport 2943
moderate 2446
football 2359
assembly 2479
scientific 2118
power 339
airways 2807
equivalent 2428
broker 1862
broken 2808
leadership 1902
aide 2738
manufacturer 2188
on 17
central 877
package 1184
of 5
industry 154
thousands 1567
fell 204
airline 767
sachs 1679
act 757
mixed 1764
mean 1465
or 37
confidence 1818
tape 2658
barrels 1850
outlook 2219
coupon 1661
instruments 1867
image 1341
accounts 907
determine 2636
parties 1546
operator 2127
your 483
pharmaceutical 1945
fast 2512
her 200
area 449
there 84
alleged 1601
start 903
appears 1314
low 615
lot 580
valley 1272
billion 49
complete 1490
saatchi 2087
delayed 2782
sophisticated 2501
brain 2975
succeeded 2913
two-thirds 2603
technologies 2212
trying 547
with 23
buying 361
faster 2362
volume 396
october 522
circulation 2923
sears 1719
default 2380
wholesale 2699
agree 1842
strongly 2759
gone 1843
vehicles 1576
ad 695
ag 1674
certain 426
totaling 2637
moved 1266
sales 82
deep 2945
an 32
cbs 1023
britain 774
at 19
file 2012
aids 2137
politics 2113
moves 1103
film 1475
fill 2946
again 556
consensus 2713
personnel 2103
storage 2490
event 1938
field 1105
you 111
poor 888
a$ 2589
congress 287
separate 1127
students 1358
a. 943
n.j. 1848
important 590
massachusetts 2537
coverage 2002
planners 2653
brands 1335
stocks 150
building 482
assets 297
calls 724
wife 1985
invest 2114
having 716
directors 751
mass 2654
overseas 1060
starting 1801
original 1525
represent 2013
all 74
sci 1781
consider 930
chinese 1564
caused 989
lack 1439
dollar 333
month 169
mccaw 1947
talks 711
follow 2019
settlement 917
decisions 1868
children 1028
causes 2904
reluctant 2947
tv 659
thursday 939
shall 2832
to 6
program 157
spain 2480
health 468
lawmakers 1388
activities 1230
calif. 850
premium 1257
returned 2053
divisions 2954
very 253
resistance 2285
worst 2325
decide 1882
fall 619
sony 1155
difference 1574
condition 2119
cable 1264
louis 2128
list 1407
joined 2024
large 381
circuit 2199
small 300
webster 2984
past 225
rate 159
arizona 1599
design 1521
lawyer 1359
pass 2168
nbc 1808
further 393
investment 146
what 115
abc 2189
richard 867
investing 2181
sun 1581
section 1236
resume 2497
brief 2700
<EOS> 1
noriega 1006
version 1349
scientists 1466
certificates 1822
learned 2604
public 275
contrast 1928
movement 2038
turmoil 2730
full 632
editorial 2488
answers 2854
hours 887
citicorp 1995
operating 299
excess 2094
november 1445
strong 404
thrift 1077
publisher 2194
prosecutors 1598
ahead 970
extraordinary 2147
losses 392
experience 1526
prior 1874
amount 542
advertising 627
social 1342
action 588
narrow 2688
options 657
via 645
followed 1159
family 617
requiring 2855
africa 1946
thatcher 1533
put 336
aimed 1516
establish 2559
donaldson 2809
shareholders 591
eye 2190
takes 1254
petroleum 1422
two 76
generate 2760
taken 640
markets 191
minor 1896
more 46
flat 1231
israel 2241
door 2054
knows 2326
fast-food 2543
jr. 1209
company 38
broke 2856
particular 1476
known 709
producing 1884
town 1741
jim 2773
none 1919
lilly 2874
hour 1477
science 2905
des 2746
remain 706
sudden 2558
nine 511
sent 998
morgan 905
strategies 2230
history 964
purchases 1336
processing 2306
brown 1448
pont 2888
share 63
accept 1527
states 599
pushed 2221
minimum 1079
numbers 1618
purchased 1662
sense 1116
sharp 1123
f. 2095
information 532
needs 1198
answer 2440
court 213
advantage 1924
rather 644
hugo 1065
conducted 2810
earnings 137
portfolios 2548
plant 402
plans 232
advice 2747
different 772
reflect 1897
fe 2003
coming 762
response 1124
a 7
short 694
brady 2400
departure 2889
coal 2354
broadcasting 1528
responsibility 2068
media 1034
banks 248
egg 2602
playing 2447
turnover 2701
played 1839
help 334
september 312
developed 1260
soon 641
trade 220
held 417
paper 399
through 149
committee 373
signs 1694
suffer 2948
its 28
developer 2969
style 2074
rapidly 2214
actually 1071
late 292
systems 419
conn. 2051
stephen 1898
inquiry 2999
might 280
tentatively 2907
good 264
return 504
seeking 817
food 622
reflected 1886
association 447
easily 2145
holiday 2763
always 851
stopped 1885
eager 2990
found 552
heavy 771
sterling 1641
everyone 1385
england 1492
generation 2198
house 165
energy 773
hard 712
reduce 713
idea 1165
police 1399
extended 2313
expect 524
advertisers 1704
operation 1143
beyond 1300
insurance 240
really 840
deals 1306
funding 1173
carriers 2339
blacks 1875
robert 513
since 156
douglas 2381
research 318
participants 2315
safety 1267
hill 2075
fujitsu 2707
issue 186
highway 1844
reporting 2192
risen 2908
lawrence 2401
friday 283
houses 1332
reason 878
base 980
members 450
backed 1319
beginning 937
guy 2481
director 276
owners 1275
benefits 991
launch 2222
just 152
computers 565
excluding 2317
american 141
threat 1705
pilots 1045
fallen 2154
lawsuits 2160
copper 1478
major 138
slipped 1903
feel 1113
number 295
feet 1350
done 927
fees 792
miss 2925
causing 2470
stage 2258
story 1363
heads 2532
leading 815
st. 1612
kidder 1086
least 298
station 1887
expand 1708
statement 682
dealing 2554
compromise 1975
store 1365
listed 1605
selling 400
passed 1364
relationship 1904
behind 1120
hotel 1727
park 1518
immediate 1930
blue-chip 2729
profitability 2566
part 202
favorable 2870
believe 624
hollywood 2007
king 2242
kind 948
grew 1572
rebound 2900
double 1870
pennsylvania 2361
determined 1959
risks 1484
elaborate 2520
messrs. 2402
toward 811
aug. 2039
outstanding 548
imports 949
substantial 1315
orders 456
option 1389
sell 183
ratings 1573
built 1099
trip 2887
gorbachev 1166
officers 2670
targets 2611
majority 908
internal 1327
chairman 143
finding 1899
frequently 2387
play 1117
added 285
electric 940
goldman 1602
eggs 2811
measures 1322
reach 1366
freddie 2605
most 91
hired 2493
shareholder 1032
plan 176
significant 893
services 324
extremely 1976
approved 680
soared 1910
compaq 2690
dealers 669
clear 726
sometimes 1276
cover 1506
rockefeller 2731
traditional 1522
three-month 2578
clean 2790
usual 1771
institutions 826
painewebber 1628
sector 1075
thomas 1269
particularly 783
gold 660
commissions 2277
nasdaq 1343
session 1082
businesses 434
jury 1372
fine 1829
find 678
impact 883
gen. 2812
giant 1100
regulations 2060
nevertheless 2955
northern 1431
justice 1011
heavily 1440
distributed 2871
failed 778
flights 2544
pretty 1760
equity 621
giants 2169
begins 2679
his 50
hit 777
gains 554
meanwhile 672
express 1020
financing 605
collection 2878
b 2327
actions 1622
closely 1102
reporters 2170
during 199
him 302
merchandise 2450
appeal 1682
doubled 2813
six-month 2014
banking 569
common 247
activity 807
switzerland 2096
coors 2909
river 1996
wrote 1565
restaurants 2748
set 386
art 1459
achieved 2761
declines 1256
sex 2986
culture 2345
see 379
defense 536
sec 1132
are 27
sea 1876
tender 1765
close 321
arm 2890
declined 413
filings 2932
# 687
spirits 2776
movie 1603
century 1742
currently 488
won 1072
various 1186
probably 681
conditions 1087
supposed 2449
available 679
korea 1592
recently 376
creating 2363
initially 2115
dividends 1435
sold 239
attention 1420
aircraft 1496
succeed 2284
coffee 2143
opposition 1305
franchise 2575
dividend 704
both 180
prospects 2171
last 70
appropriations 1367
annual 367
foreign 245
sensitive 2732
connection 2591
became 985
long-term 688
let 1025
whole 1180
baltimore 2749
point 375
reasons 1748
loan 501
community 922
simply 946
church 1960
throughout 1766
expensive 1619
decline 461
described 2182
raise 630
monthly 1504
create 1288
political 390
due 260
strategy 750
convicted 2830
whom 1713
reduction 1501
maintenance 2545
meeting 476
walter 2438
firm 192
partly 1110
fire 1782
gas 538
convert 2794
N 4
fund 293
whatever 2671
lives 2129
brokers 960
bidding 1494
demand 437
prices 113
plants 865
georgia 2076
look 714
solid 2950
judicial 2987
bill 261
budget 570
governor 2672
technical 1586
while 121
mainframe 2927
ought 2546
fleet 2928
mitchell 2346
guide 2792
engineers 2762
real 309
pound 1066
costly 2183
voters 1683
cents 108
motors 1328
stations 1740
disappointing 2462
itself 683
ready 1788
fannie 1967
coca-cola 2910
chase 2088
underwriters 1718
suggests 2045
rules 906
virtually 1753
widely 1283
grand 1426
survey 1108
dozen 1671
higher 207
development 444
used 263
lawyers 691
d.c. 2988
affairs 1699
comprehensive 2655
yesterday 123
moment 1859
levels 788
moving 1408
purpose 2617
tobacco 2477
recent 182
lower 231
task 2015
older 1908
studies 1956
poland 1221
spent 1149
person 1442
machinery 2511
ltd. 555
swiss 1416
organization 1178
spend 1270
coup 2226
one-year 2560
junk-bond 1767
networks 2464
u.k. 1168
competitive 1650
quarters 2311
questions 1093
world 219
alternative 1978
wage 1158
cut 378
helping 2116
$ 13
also 60
advisers 2044
workers 432
deputy 1809
guaranteed 2268
attractive 2467
source 1076
stock-market 2382
parents 1877
location 2777
violations 2576
guarantees 2004
administrative 2233
remaining 1428
surprised 2478
build 848
customers 526
australia 1249
march 618
emergency 1171
demands 2648
big 130
bid 258
matters 2104
game 1088
aerospace 1931
bit 1893
projects 868
moody 995
breeden 2364
success 1395
follows 1860
signal 2383
toyota 2929
separately 1261
communications 779
arthur 2891
individuals 1324
yields 923
popular 1429
healthy 1805
privately 2297
often 518
senate 463
spring 1205
b. 1814
some 58
back 193
trends 2673
economic 234
pricing 1861
apply 2465
nicaragua 2503
facing 2397
scale 2750
decision 531
transactions 1083
audience 2144
per 1038
eliminate 2561
be 26
run 612
lose 1517
continuing 1021
fed 566
refused 2077
step 1210
santa 1250
served 2066
at&t 1789
by 18
pipeline 2365
goods 804
anything 997
truck 1792
mrs. 662
range 882
ounce 1921
duties 2917
block 1035
pollution 2951
repair 2839
steinhardt 2692
into 92
within 530
retailer 2751
nothing 1033
primarily 1548
sports 1259
pentagon 1472
bankruptcy 1029
statistics 1939
spending 509
question 801
long 352
ordered 1823
amr 2989
suit 633
himself 1056
elsewhere 1731
collapsed 2347
vehicle 2061
specialty 2269
hoped 2872
atlantic 2254
pacific 689
filed 528
hopes 1101
subsidiary 663
line 464
considerable 2714
raising 1421
posted 634
up 53
us 505
maturity 1529
're 278
exploration 2105
viacom 2562
similar 710
called 342
bell 1310
associated 1669
metal 2485
influence 1905
metals 1790
engineering 1293
associates 1121
rally 975
amounts 1356
peace 2892
fears 1711
teams 2921
yeargin 2494
afford 2902
politicians 2131
reputation 2693
income 185
department 235
manhattan 1430
users 2367
gross 2011
problems 356
prepared 1575
william 782
allowing 2579
formal 2592
sides 1783
structure 1761
ago 196
urged 2223
land 1217
vice 233
age 1067
required 972
bankers 1022
responded 2873
far 291
fresh 2593
requires 2384
leveraged 1146
once 500
code 2372
issued 902
results 343
existing 1251
oct. 314
ge 1889
broader 2778
go 387
gm 844
contributions 2463
centers 1585
issues 251
seemed 1815
concerned 1493
young 841
send 2779
suits 2399
citing 1961
stable 2745
quarterly 1090
include 537
friendly 2385
resources 1059
garden 2255
automotive 1620
continues 1050
wave 2857
putting 1663
cellular 2448
telling 2931
continued 609
entire 1436
eased 2894
sen. 969
real-estate 1446
positions 1380
notes 377
michael 798
fewer 1732
try 876
race 2475
noted 667
guber 1183
concluded 2244
smaller 1048
cds 1774
crop 2296
jump 2132
video 2533
expense 2458
makers 768
index 217
edison 2814
business 86
chicago 485
giving 1454
expressed 2173
practices 1798
access 1455
paying 1125
waiting 1630
indian 2840
volatile 2030
five-year 2651
capital 181
firms 366
exercise 1743
body 2875
led 661
lee 1816
exchange 112
pushing 2514
commercial 478
jointly 2040
following 572
northeast 2553
them 128
others 495
great 592
credits 3001
receive 934
involved 831
larger 1307
leaving 1835
engine 2758
merger 1037
products 190
opinion 1817
residents 1754
gene 1614
makes 559
maker 340
fourth-quarter 2016
named 523
writer 2971
apple 1538
heart 1655
win 1255
manage 2933
private 551
fraud 1552
names 1863
motor 996
scandal 1948
standing 2715
use 270
from 21
p&g 2217
consumption 2618
& 83
remains 780
illegal 2121
cray 1553
next 131
few 262
doubt 1878
year-ago 1519
themselves 873
consecutive 2243
reflects 1712
usx 1502
sort 1566
parliament 2580
started 889
becomes 2934
factor 2245
benchmark 1824
occurred 2841
carrying 2590
sharply 1051
allianz 2867
mitsubishi 1615
appointed 2495
women 941
customer 1353
account 636
us$ 1539
effectively 2876
this 40
challenge 1879
clients 764
recession 965
thin 2429
island 2868
meet 896
closing 1039
n.y. 2208
control 351
beijing 2062
slid 2780
weaker 2716
engelken 2703
process 796
a.m. 2859
daiwa 2733
tax 218
purposes 2842
high 241
professor 2022
reserves 825
something 785
sought 1432
stories 2466
voice 1981
rape 2972
sir 1836
educational 2534
united 562
usair 2029
democracy 2404
recalls 2572
six 364
hampshire 2279
arrangement 2795
traders 281
forest 2877
instead 597
stock 61
buildings 1642
farm 2097
watch 2237
tied 2256
ties 2215
boeing 1962
light 1063
lines 805
commodities 2373
chief 153
road 2102
allow 894
executives 493
martin 2417
houston 1933
holds 1199
hanover 2547
producer 1587
institutional 1258
move 330
produced 1109
alliance 2056
including 211
looks 2155
quake 1064
year-earlier 797
industries 553
delay 1932
la 2041
labor 700
whites 2952
willing 1289
orange 2515
covered 2156
criminal 1479
spot 2209
pending 1503
crash 1041
greater 1225
auto 616
practice 1360
earn 2937
cutting 1949
h. 2184
hands 1316
front 1911
bar 2740
republican 1409
investor 506
day 273
capital-gains 1386
successor 2704
february 1799
l. 1577
warned 2280
university 594
covering 2724
identified 2504
morris 1979
rising 1126
bills 614
warner 576
doing 842
strip 2944
related 866
society 1423
books 1800
measure 853
our 230
margins 1242
agriculture 2717
special 655
out 85
merc 2849
' 135
entertainment 1016
defend 2815
critics 1825
electronics 1301
cause 1213
integrated 2286
red 1013
thrifts 1837
disclose 2021
shut 2912
frank 1643
ban 2065
regulators 1084
york 93
regulatory 1374
indicates 2596
philip 1696
navy 2034
hostile 1791
could 80
florida 1864
mac 2496
keep 586
ltd 1480
davis 2781
retain 2452
retail 909
south 436
respond 2833
plastics 2953
succeeds 2031
powerful 1616
owned 1015
strategic 1768
owner 1507
reached 698
awarded 2409
quality 1192
nyse 2335
legislative 2657
management 266
stands 2702
los 639
system 274
relations 1560
recapitalization 2880
priority 2581
their 52
attack 2348
intelligence 1649
final 1111
interests 775
enforcement 2535
shell 2973
completed 790
acquire 869
environmental 1091
chemicals 1591
reflecting 1308
branches 2415
july 567
institution 2706
steel 600
colleagues 2174
hearings 2287
commodity 1406
patients 1952
individual 736
providing 2064
creditors 1004
projections 2996
unchanged 1073
partnership 1218
lin 1613
unlikely 2356
have 35
need 479
apparently 1081
clearly 1508
rjr 2220
documents 1675
dallas 1830
agency 303
able 584
purchasing 1941
instance 1222
concerns 830
which 42
campeau 1680
coke 2956
unless 1150
who 57
eight 959
preliminary 1980
device 2843
segment 1569
payment 1212
so-called 1354
request 1681
face 664
looked 2828
proceedings 2195
lowered 2176
pictures 2267
normally 2191
fact 668
goals 2764
agreed 346
charles 1379
bring 1144
planning 1351
democrats 1156
portfolio 684
fear 1262
economist 1189
debate 1561
decade 1219
staff 861
litigation 1485
partners 918
based 306
earned 1017
controls 1558
should 205
unable 2260
candidates 1925
employee 1624
communist 1769
local 579
hope 1160
meant 2845
dinkins 1185
handle 2435
means 863
fellow 2505
familiar 1253
overall 1268
bear 1644
reinsurance 2656
joint 799
ones 1460
words 1749
exchanges 1469
buyer 2005
kong 845
chips 1600
areas 880
trucks 1684
course 904
numerous 2574
taxes 954
calling 2405
she 164
ohio 1570
fixed 1424
conduct 2819
view 808
europe 546
temporarily 2122
downward 3000
acquired 747
national 163
accord 1770
operates 1912
edition 2846
computer 226
subcommittee 2430
closer 2516
nationwide 2594
reform 1471
nuclear 1676
tend 2134
favor 1697
state 140
closed 250
crude 1900
progress 2453
neither 1397
bought 686
comparable 2006
brewing 2425
ability 1157
opening 1461
deliver 2234
agencies 1169
job 793
takeover 423
key 864
approval 727
precious 2858
lawsuit 2536
distribution 1793
declining 1631
david 693
restrictions 1530
limits 2307
career 2349
goal 1966
taking 836
equal 1950
drug 510
pulp 2563
april 856
figures 653
jersey 1762
otherwise 2406
comment 472
adjusted 1706
english 2765
co 571
lang 2333
agents 2172
wall 331
ca 563
cd 2055
packaging 2976
qintex 1243
table 2573
oakland 1937
industrials 2607
addition 489
genetic 2336
permanent 2893
agreements 2078
proposal 491
waste 2847
faced 2412
controlled 1763
c. 1707
league 2308
am 1635
sufficient 2977
otc 1413
essentially 2620
c$ 1511
bulk 2752
finished 1831
graphics 2257
improved 1309
atlanta 2431
general 189
present 1838
homes 1515
troubled 1543
abandoned 2896
unlike 1728
sotheby 2848
restaurant 2278
harder 2550
as 20
value 310
will 34
owns 833
wild 2978
uncertainty 1686
almost 512
blood 2502
thus 1114
site 2491
helped 745
claimed 2606
partner 1062
shearson 938
halt 2665
tumbled 2407
perhaps 986
began 544
administration 345
cross 2079
member 1094
retailing 2419
parts 935
largest 414
units 603
party 728
gets 1523
difficult 1024
material 1695
columbia 809
nekoosa 1916
upon 2025
effect 692
forecasts 2299
student 2246
rumors 1914
kkr 2614
single 1136
transaction 564
off 170
center 721
i 69
approve 2426
well 201
fighting 2641
thought 947
banker 2377
sets 2235
position 533
soviet 384
inc. 81
latest 466
stores 643
less 246
increasingly 1554
executive 155
domestic 708
obtain 2157
sources 1534
underlying 2107
rooms 2374
seats 1832
paul 1549
rapid 2642
ads 1417
supply 933
smith 1129
deposits 1734
realize 2298
simple 2231
add 1226
other 62
subordinated 1544
match 2408
boom 2582
tests 1645
increased 369
provisions 1292
government 99
chancellor 2300
increases 696
five 259
know 481
press 916
immediately 1451
loss 254
lincoln 1735
necessary 1381
like 167
lost 635
miami 2829
taxpayers 2201
lawson 1390
payments 729
james 560
become 428
works 1320
soft 2674
amendment 1846
exceed 2643
because 78
arbitrage 981
authority 1286
growth 242
export 1462
cleveland 2796
home 322
peter 1163
employment 1872
line-item 2350
lead 786
broad 1593
avoid 1265
hurricane 1151
slide 2270
does 177
york-based 2645
chains 2621
leader 862
schedule 2719
journal 919
monetary 1452
expansion 1594
beach 2394
pressure 843
expire 2817
although 368
offset 1193
includes 800
loans 362
vs. 2660
panel 1559
gained 725
about 44
actual 1174
carried 2420
debentures 1295
freedom 2676
shipping 2158
surge 2080
angeles 749
holdings 885
carries 2468
carrier 1368
introduced 1191
software 993
own 195
letters 2175
previously 719
warrants 2046
washington 514
commitment 2337
billions 2622
getting 637
malcolm 2922
included 651
guard 2881
promise 2047
managing 1274
banco 2338
utility 1913
accused 1794
additional 581
krenz 1729
transfer 2506
housing 999
secret 2949
peters 1206
continental 1776
biggest 742
pretax 1176
fiscal 416
buy 184
north 739
stadium 2899
triggered 2368
insurer 2564
funds 194
brand 1089
akzo 2957
but 30
delivery 1000
insured 2108
construction 608
gain 430
courts 1942
highest 1595
ltv 2705
he 29
made 160
places 2735
whether 370
cells 2388
official 455
signed 1194
record 440
below 631
limit 1057
ruling 1233
problem 433
piece 2185
minutes 1497
supreme 1152
deaths 2818
wcrs 2980
slowing 1607
flight 2020
education 1382
proceeds 1337
worse 2197
inc 443
aetna 2720
mutual 1369
compared 371
'll 928
variety 2784
corporation 2271
illinois 2646
book 1291
compares 2914
details 1512
branch 2202
compete 2850
gonzalez 2797
junk 473
francisco 420
star 1826
monday 383
class 1361
june 625
ultimately 2153
contends 2328
stay 1333
chance 1375
bellsouth 2697
priced 425
friends 2421
exposure 2301
resolution 2067
baker 1290
factors 1438
rule 1161
ortega 2677
portion 1685
write 2342
status 2334
pension 1181
understand 2232
frankfurt 1915
# coding=utf-8
import paddle.v2 as paddle
import gzip
import numpy as np
from utils import *
import network_conf
from config import *
def generate_using_rnn(word_id_dict, num_words, beam_size):
"""
Demo: use RNN model to do prediction.
:param word_id_dict: vocab.
:type word_id_dict: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
:param num_words: the number of the words to generate.
:type num_words: int
:param beam_size: beam width.
:type beam_size: int
:return: save prediction results to output_file
"""
# prepare and cache model
config = Config_rnn()
_, output_layer = network_conf.rnn_lm(
vocab_size=len(word_id_dict),
emb_dim=config.emb_dim,
rnn_type=config.rnn_type,
hidden_size=config.hidden_size,
num_layer=config.num_layer) # network config
model_file_name = config.model_file_name_prefix + str(config.num_passs -
1) + '.tar.gz'
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(model_file_name)) # load parameters
inferer = paddle.inference.Inference(
output_layer=output_layer, parameters=parameters)
# tools, different from generate_using_ngram's tools
id_word_dict = dict(
[(v, k) for k, v in word_id_dict.items()]) # {id : word}
def str2ids(str):
return [[[
word_id_dict.get(w, word_id_dict['<UNK>']) for w in str.split()
]]]
def ids2str(ids):
return [[[id_word_dict.get(id, ' ') for id in ids]]]
# generate text
with open(input_file) as file:
output_f = open(output_file, 'w')
for line in file:
line = line.decode('utf-8').strip()
# generate
texts = {} # type: {text : probability}
texts[line] = 1
for _ in range(num_words):
texts_new = {}
for (text, prob) in texts.items():
if '<EOS>' in text: # stop prediction when <EOS> appear
texts_new[text] = prob
continue
# next word's probability distribution
predictions = inferer.infer(input=str2ids(text))
predictions[-1][word_id_dict['<UNK>']] = -1 # filter <UNK>
# find next beam_size words
for _ in range(beam_size):
cur_maxProb_index = np.argmax(
predictions[-1]) # next word's id
text_new = text + ' ' + id_word_dict[
cur_maxProb_index] # text append next word
texts_new[text_new] = texts[text] * predictions[-1][
cur_maxProb_index]
predictions[-1][cur_maxProb_index] = -1
texts.clear()
if len(texts_new) <= beam_size:
texts = texts_new
else: # cutting
texts = dict(
sorted(
texts_new.items(), key=lambda d: d[1], reverse=True)
[:beam_size])
# save results to output file
output_f.write(line.encode('utf-8') + '\n')
for (sentence, prob) in texts.items():
output_f.write('\t' + sentence.encode('utf-8', 'replace') + '\t'
+ str(prob) + '\n')
output_f.write('\n')
output_f.close()
print('already saved results to ' + output_file)
def generate_using_ngram(word_id_dict, num_words, beam_size):
"""
Demo: use N-Gram model to do prediction.
:param word_id_dict: vocab.
:type word_id_dict: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
:param num_words: the number of the words to generate.
:type num_words: int
:param beam_size: beam width.
:type beam_size: int
:return: save prediction results to output_file
"""
# prepare and cache model
config = Config_ngram()
_, output_layer = network_conf.ngram_lm(
vocab_size=len(word_id_dict),
emb_dim=config.emb_dim,
hidden_size=config.hidden_size,
num_layer=config.num_layer) # network config
model_file_name = config.model_file_name_prefix + str(config.num_passs -
1) + '.tar.gz'
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(model_file_name)) # load parameters
inferer = paddle.inference.Inference(
output_layer=output_layer, parameters=parameters)
# tools, different from generate_using_rnn's tools
id_word_dict = dict(
[(v, k) for k, v in word_id_dict.items()]) # {id : word}
def str2ids(str):
return [[
word_id_dict.get(w, word_id_dict['<UNK>']) for w in str.split()
]]
def ids2str(ids):
return [[id_word_dict.get(id, ' ') for id in ids]]
# generate text
with open(input_file) as file:
output_f = open(output_file, 'w')
for line in file:
line = line.decode('utf-8').strip()
words = line.split()
if len(words) < config.N:
output_f.write(line.encode('utf-8') + "\n\tnone\n")
continue
# generate
texts = {} # type: {text : probability}
texts[line] = 1
for _ in range(num_words):
texts_new = {}
for (text, prob) in texts.items():
if '<EOS>' in text: # stop prediction when <EOS> appear
texts_new[text] = prob
continue
# next word's probability distribution
predictions = inferer.infer(
input=str2ids(' '.join(text.split()[-config.N:])))
predictions[-1][word_id_dict['<UNK>']] = -1 # filter <UNK>
# find next beam_size words
for _ in range(beam_size):
cur_maxProb_index = np.argmax(
predictions[-1]) # next word's id
text_new = text + ' ' + id_word_dict[
cur_maxProb_index] # text append nextWord
texts_new[text_new] = texts[text] * predictions[-1][
cur_maxProb_index]
predictions[-1][cur_maxProb_index] = -1
texts.clear()
if len(texts_new) <= beam_size:
texts = texts_new
else: # cutting
texts = dict(
sorted(
texts_new.items(), key=lambda d: d[1], reverse=True)
[:beam_size])
# save results to output file
output_f.write(line.encode('utf-8') + '\n')
for (sentence, prob) in texts.items():
output_f.write('\t' + sentence.encode('utf-8', 'replace') + '\t'
+ str(prob) + '\n')
output_f.write('\n')
output_f.close()
print('already saved results to ' + output_file)
def main():
# init paddle
paddle.init(use_gpu=use_gpu, trainer_count=trainer_count)
# prepare and cache vocab
if os.path.isfile(vocab_file):
word_id_dict = load_vocab(vocab_file) # load word dictionary
else:
if build_vocab_method == 'fixed_size':
word_id_dict = build_vocab_with_fixed_size(
train_file, vocab_max_size) # build vocab
else:
word_id_dict = build_vocab_using_threshhold(
train_file, unk_threshold) # build vocab
save_vocab(word_id_dict, vocab_file) # save vocab
# generate
if use_which_model == 'rnn':
generate_using_rnn(
word_id_dict=word_id_dict, num_words=num_words, beam_size=beam_size)
elif use_which_model == 'ngram':
generate_using_ngram(
word_id_dict=word_id_dict, num_words=num_words, beam_size=beam_size)
else:
raise Exception('use_which_model must be rnn or ngram!')
if __name__ == "__main__":
main()
# coding=utf-8
import paddle.v2 as paddle
def rnn_lm(vocab_size, emb_dim, rnn_type, hidden_size, num_layer):
"""
RNN language model definition.
:param vocab_size: size of vocab.
:param emb_dim: embedding vector's dimension.
:param rnn_type: the type of RNN cell.
:param hidden_size: number of unit.
:param num_layer: layer number.
:return: cost and output layer of model.
"""
assert emb_dim > 0 and hidden_size > 0 and vocab_size > 0 and num_layer > 0
# input layers
input = paddle.layer.data(
name="input", type=paddle.data_type.integer_value_sequence(vocab_size))
target = paddle.layer.data(
name="target", type=paddle.data_type.integer_value_sequence(vocab_size))
# embedding layer
input_emb = paddle.layer.embedding(input=input, size=emb_dim)
# rnn layer
if rnn_type == 'lstm':
rnn_cell = paddle.networks.simple_lstm(
input=input_emb, size=hidden_size)
for _ in range(num_layer - 1):
rnn_cell = paddle.networks.simple_lstm(
input=rnn_cell, size=hidden_size)
elif rnn_type == 'gru':
rnn_cell = paddle.networks.simple_gru(input=input_emb, size=hidden_size)
for _ in range(num_layer - 1):
rnn_cell = paddle.networks.simple_gru(
input=rnn_cell, size=hidden_size)
else:
raise Exception('rnn_type error!')
# fc(full connected) and output layer
output = paddle.layer.fc(
input=[rnn_cell], size=vocab_size, act=paddle.activation.Softmax())
# loss
cost = paddle.layer.classification_cost(input=output, label=target)
return cost, output
def ngram_lm(vocab_size, emb_dim, hidden_size, num_layer):
"""
N-Gram language model definition.
:param vocab_size: size of vocab.
:param emb_dim: embedding vector's dimension.
:param hidden_size: size of unit.
:param num_layer: layer number.
:return: cost and output layer of model.
"""
assert emb_dim > 0 and hidden_size > 0 and vocab_size > 0 and num_layer > 0
def wordemb(inlayer):
wordemb = paddle.layer.table_projection(
input=inlayer,
size=emb_dim,
param_attr=paddle.attr.Param(
name="_proj", initial_std=0.001, learning_rate=1, l2_rate=0))
return wordemb
# input layers
first_word = paddle.layer.data(
name="first_word", type=paddle.data_type.integer_value(vocab_size))
second_word = paddle.layer.data(
name="second_word", type=paddle.data_type.integer_value(vocab_size))
third_word = paddle.layer.data(
name="third_word", type=paddle.data_type.integer_value(vocab_size))
fourth_word = paddle.layer.data(
name="fourth_word", type=paddle.data_type.integer_value(vocab_size))
next_word = paddle.layer.data(
name="next_word", type=paddle.data_type.integer_value(vocab_size))
# embedding layer
first_emb = wordemb(first_word)
second_emb = wordemb(second_word)
third_emb = wordemb(third_word)
fourth_emb = wordemb(fourth_word)
context_emb = paddle.layer.concat(
input=[first_emb, second_emb, third_emb, fourth_emb])
# hidden layer
hidden = paddle.layer.fc(
input=context_emb, size=hidden_size, act=paddle.activation.Relu())
for _ in range(num_layer - 1):
hidden = paddle.layer.fc(
input=hidden, size=hidden_size, act=paddle.activation.Relu())
# fc(full connected) and output layer
predict_word = paddle.layer.fc(
input=[hidden], size=vocab_size, act=paddle.activation.Softmax())
# loss
cost = paddle.layer.classification_cost(input=predict_word, label=next_word)
return cost, predict_word
# coding=utf-8
import collections
import os
def rnn_reader(file_name, min_sentence_length, max_sentence_length,
word_id_dict):
"""
create reader for RNN, each line is a sample.
:param file_name: file name.
:param min_sentence_length: sentence's min length.
:param max_sentence_length: sentence's max length.
:param word_id_dict: vocab with content of '{word, id}', 'word' is string type , 'id' is int type.
:return: data reader.
"""
def reader():
UNK = word_id_dict['<UNK>']
with open(file_name) as file:
for line in file:
words = line.decode('utf-8', 'ignore').strip().split()
if len(words) < min_sentence_length or len(
words) > max_sentence_length:
continue
ids = [word_id_dict.get(w, UNK) for w in words]
ids.append(word_id_dict['<EOS>'])
target = ids[1:]
target.append(word_id_dict['<EOS>'])
yield ids[:], target[:]
return reader
def ngram_reader(file_name, N, word_id_dict):
"""
create reader for N-Gram.
:param file_name: file name.
:param N: N-Gram's N.
:param word_id_dict: vocab with content of '{word, id}', 'word' is string type , 'id' is int type.
:return: data reader.
"""
assert N >= 2
def reader():
ids = []
UNK_ID = word_id_dict['<UNK>']
cache_size = 10000000
with open(file_name) as file:
for line in file:
words = line.decode('utf-8', 'ignore').strip().split()
ids += [word_id_dict.get(w, UNK_ID) for w in words]
ids_len = len(ids)
if ids_len > cache_size: # output
for i in range(ids_len - N - 1):
yield tuple(ids[i:i + N])
ids = []
ids_len = len(ids)
for i in range(ids_len - N - 1):
yield tuple(ids[i:i + N])
return reader
# coding=utf-8
import sys
import paddle.v2 as paddle
import reader
from utils import *
import network_conf
import gzip
from config import *
def train(model_cost, train_reader, test_reader, model_file_name_prefix,
num_passes):
"""
train model.
:param model_cost: cost layer of the model to train.
:param train_reader: train data reader.
:param test_reader: test data reader.
:param model_file_name_prefix: model's prefix name.
:param num_passes: epoch.
:return:
"""
# init paddle
paddle.init(use_gpu=use_gpu, trainer_count=trainer_count)
# create parameters
parameters = paddle.parameters.create(model_cost)
# create optimizer
adam_optimizer = paddle.optimizer.Adam(
learning_rate=1e-3,
regularization=paddle.optimizer.L2Regularization(rate=1e-3),
model_average=paddle.optimizer.ModelAverage(
average_window=0.5, max_average_window=10000))
# create trainer
trainer = paddle.trainer.SGD(
cost=model_cost, parameters=parameters, update_equation=adam_optimizer)
# define event_handler callback
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print("\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics))
else:
sys.stdout.write('.')
sys.stdout.flush()
# save model each pass
if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=test_reader)
print("\nTest with Pass %d, %s" % (event.pass_id, result.metrics))
with gzip.open(
model_file_name_prefix + str(event.pass_id) + '.tar.gz',
'w') as f:
parameters.to_tar(f)
# start to train
print('start training...')
trainer.train(
reader=train_reader, event_handler=event_handler, num_passes=num_passes)
print("Training finished.")
def main():
# prepare vocab
print('prepare vocab...')
if build_vocab_method == 'fixed_size':
word_id_dict = build_vocab_with_fixed_size(
train_file, vocab_max_size) # build vocab
else:
word_id_dict = build_vocab_using_threshhold(
train_file, unk_threshold) # build vocab
save_vocab(word_id_dict, vocab_file) # save vocab
# init model and data reader
if use_which_model == 'rnn':
# init RNN model
print('prepare rnn model...')
config = Config_rnn()
cost, _ = network_conf.rnn_lm(
len(word_id_dict), config.emb_dim, config.rnn_type,
config.hidden_size, config.num_layer)
# init RNN data reader
train_reader = paddle.batch(
paddle.reader.shuffle(
reader.rnn_reader(train_file, min_sentence_length,
max_sentence_length, word_id_dict),
buf_size=65536),
batch_size=config.batch_size)
test_reader = paddle.batch(
paddle.reader.shuffle(
reader.rnn_reader(test_file, min_sentence_length,
max_sentence_length, word_id_dict),
buf_size=65536),
batch_size=config.batch_size)
elif use_which_model == 'ngram':
# init N-Gram model
print('prepare ngram model...')
config = Config_ngram()
assert config.N == 5
cost, _ = network_conf.ngram_lm(
vocab_size=len(word_id_dict),
emb_dim=config.emb_dim,
hidden_size=config.hidden_size,
num_layer=config.num_layer)
# init N-Gram data reader
train_reader = paddle.batch(
paddle.reader.shuffle(
reader.ngram_reader(train_file, config.N, word_id_dict),
buf_size=65536),
batch_size=config.batch_size)
test_reader = paddle.batch(
paddle.reader.shuffle(
reader.ngram_reader(test_file, config.N, word_id_dict),
buf_size=65536),
batch_size=config.batch_size)
else:
raise Exception('use_which_model must be rnn or ngram!')
# train model
train(
model_cost=cost,
train_reader=train_reader,
test_reader=test_reader,
model_file_name_prefix=config.model_file_name_prefix,
num_passes=config.num_passs)
if __name__ == "__main__":
main()
# coding=utf-8
import os
import collections
def save_vocab(word_id_dict, vocab_file_name):
"""
save vocab.
:param word_id_dict: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
:param vocab_file_name: vocab file name.
"""
f = open(vocab_file_name, 'w')
for (k, v) in word_id_dict.items():
f.write(k.encode('utf-8') + '\t' + str(v) + '\n')
print('save vocab to ' + vocab_file_name)
f.close()
def load_vocab(vocab_file_name):
"""
load vocab from file.
:param vocab_file_name: vocab file name.
:return: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
"""
assert os.path.isfile(vocab_file_name)
dict = {}
with open(vocab_file_name) as file:
for line in file:
if len(line) < 2:
continue
kv = line.decode('utf-8').strip().split('\t')
dict[kv[0]] = int(kv[1])
return dict
def build_vocab_using_threshhold(file_name, unk_threshold):
"""
build vacab using_<UNK> threshhold.
:param file_name:
:param unk_threshold: <UNK> threshhold.
:type unk_threshold: int.
:return: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
"""
counter = {}
with open(file_name) as file:
for line in file:
words = line.decode('utf-8', 'ignore').strip().split()
for word in words:
if word in counter:
counter[word] += 1
else:
counter[word] = 1
counter_new = {}
for (word, frequency) in counter.items():
if frequency >= unk_threshold:
counter_new[word] = frequency
counter.clear()
counter_new = sorted(counter_new.items(), key=lambda d: -d[1])
words = [word_frequency[0] for word_frequency in counter_new]
word_id_dict = dict(zip(words, range(2, len(words) + 2)))
word_id_dict['<UNK>'] = 0
word_id_dict['<EOS>'] = 1
return word_id_dict
def build_vocab_with_fixed_size(file_name, vocab_max_size):
"""
build vacab with assigned max size.
:param vocab_max_size: vocab's max size.
:return: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
"""
words = []
for line in open(file_name):
words += line.decode('utf-8', 'ignore').strip().split()
counter = collections.Counter(words)
counter = sorted(counter.items(), key=lambda x: -x[1])
if len(counter) > vocab_max_size:
counter = counter[:vocab_max_size]
words, counts = zip(*counter)
word_id_dict = dict(zip(words, range(2, len(words) + 2)))
word_id_dict['<UNK>'] = 0
word_id_dict['<EOS>'] = 1
return word_id_dict