Commit bf0cfe33 Authored by: Z zhaopu

rebuid code

Parent: f495dda9
# Language Model
## Introduction
A language model (LM) is a probability distribution model; simply put, it is a model that computes the probability of a sentence. Given a sentence (a sequence of words):
<div align=center><img src='images/s.png'/></div>
its probability can be written as:
<div align=center><img src='images/ps.png'/> &nbsp;&nbsp;&nbsp;&nbsp;(Eq. 1)</div>
A language model computes P(S) in (Eq. 1) as well as its intermediate results. **With it we can determine which word sequence is more likely, or, given some preceding words, predict the most likely next word.**
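In case the formula images above do not render, S is presumably the word sequence w<sub>1</sub> w<sub>2</sub> ... w<sub>m</sub>, and (Eq. 1) is the standard chain-rule factorization of the sentence probability:

$$P(S) = P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1)\cdots P(w_m \mid w_1, \ldots, w_{m-1})$$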
## Applications
**Language models are used in many areas**, for example:
* **Automatic writing**: a language model generates the next word from the preceding text; applied recursively, it can generate a whole sentence, paragraph, or article.
* **QA**: a language model can generate an answer to a question.
* **Machine translation**: most mainstream machine translation models follow the encoder-decoder paradigm, in which the decoder is essentially a language model that generates the target-language text.
* **Spell checking**: a language model scores a word sequence; the probability usually drops sharply at a spelling error, which can be used to detect errors and suggest correction candidates.
* **Part-of-speech tagging, syntactic parsing, speech recognition, ...**
## About this example
Common implementations of language models include N-Gram, RNN, and seq2seq. This example implements N-Gram-based and RNN-based language models. **The files in this example are organized as follows:**
* data_util.py: reads the corpus and builds, saves, and loads the vocabulary.
* lm_rnn.py: defines, trains, and runs prediction with the RNN-based language model.
* lm_ngram.py: defines, trains, and runs prediction with the N-Gram-based language model.
***Note:** an N-Gram language model is usually weaker than an RNN language model, so the RNN model is recommended in practice; this example therefore focuses on the RNN model and only briefly covers the N-Gram model.*
## RNN language model
### Introduction
An RNN is a sequence model. The basic idea: at time step t, the previous hidden state h<sub>t-1</sub> and the current word vector x<sub>t</sub> are fed into the hidden layer to produce the feature representation h<sub>t</sub>, which is then used to make the prediction ŷ<sub>t</sub> for step t; this recursion continues along the time dimension, as shown below (and in the equations that follow):
<div align=center><img src='images/rnn_str.png' width='500px'/></div>
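Written out for a vanilla RNN cell (LSTM and GRU replace the update function f with gated variants), the recursion described above is:

$$h_t = f(W_x x_t + W_h h_{t-1} + b), \qquad \hat{y}_t = \mathrm{softmax}(W_y h_t + b_y)$$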
This shows that an RNN is good at exploiting preceding context and history, i.e. it has a kind of "memory". In theory an RNN can model long-range dependencies (using information from far back in the sequence), but in practice this often does not work well, which led to many RNN variants such as the widely used LSTM and GRU; they redesign the cell of the vanilla RNN to make up for this weakness. The figure below illustrates an LSTM cell:
<div align=center><img src='images/lstm.png' width='500px'/></div>
This example uses LSTM and GRU cells.
### Model structure
The lm() function in lm\_rnn.py defines the model structure. A walkthrough follows:
* 1. First, the model hyperparameters are defined in \_\_main\_\_:
```python
# -- config : model --
rnn_type = 'gru' # or 'lstm'
emb_dim = 200
hidden_size = 200
num_passs = 2
num_layer = 2
```
Here rnn\_type selects the RNN cell type, either 'lstm' or 'gru'; hidden\_size sets the number of hidden units; num\_layer sets the number of RNN layers; num\_passs sets the number of training passes; and emb_dim sets the embedding dimension.
* 2. Map the input word (or character) sequence to vectors, i.e. the embedding:
```python
data = paddle.layer.data(name="word", type=paddle.data_type.integer_value_sequence(vocab_size))
target = paddle.layer.data("label", paddle.data_type.integer_value_sequence(vocab_size))
emb = paddle.layer.embedding(input=data, size=emb_dim)
```
* 3. Build the RNN layers according to the configuration, taking the embedding sequence from the previous step as input:
```python
if rnn_type == 'lstm':
rnn_cell = paddle.networks.simple_lstm(
input=emb, size=hidden_size)
for _ in range(num_layer - 1):
rnn_cell = paddle.networks.simple_lstm(
input=rnn_cell, size=hidden_size)
elif rnn_type == 'gru':
rnn_cell = paddle.networks.simple_gru(
input=emb, size=hidden_size)
for _ in range(num_layer - 1):
rnn_cell = paddle.networks.simple_gru(
input=rnn_cell, size=hidden_size)
```
* 4. Add the output layer, a fully connected layer with softmax that normalizes the scores into a probability distribution over the vocabulary (this layer is returned as the output), and define the model cost, the multi-class cross-entropy loss:
```python
# fc and output layer
output = paddle.layer.fc(input=[rnn_cell], size=vocab_size, act=paddle.activation.Softmax())
# loss
cost = paddle.layer.classification_cost(input=output, label=target)
```
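The classification cost above is the standard per-word cross-entropy, i.e. the negative log-likelihood of the target words:

$$\mathrm{cost} = -\sum_{t} \log P(w_t \mid w_1, \ldots, w_{t-1})$$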
### Training the model
The train() function in lm\_rnn.py implements model training. The workflow is:
* 1. Prepare the input data: this example uses the standard PTB data. Call build\_vocab() in data\_util.py to build the vocabulary and save\_vocab() to persist it for reuse (building the vocabulary from a large corpus is slow, so the vocabulary built on the first run is saved and reused). Then use train\_data() and test\_data() in data\_util.py to create train\_reader and test\_reader for reading the training and test data.
* 2. Initialize the model: the network structure, the parameters, the optimizer (Adam in this demo), and the trainer, as follows:
```python
# network config
cost, _ = lm(len(word_id_dict), emb_dim, rnn_type, hidden_size, num_layer)
# create parameters
parameters = paddle.parameters.create(cost)
# create optimizer
adam_optimizer = paddle.optimizer.Adam(
learning_rate=1e-3,
regularization=paddle.optimizer.L2Regularization(rate=1e-3),
model_average=paddle.optimizer.ModelAverage(average_window=0.5))
# create trainer
trainer = paddle.trainer.SGD(
cost=cost, parameters=parameters, update_equation=adam_optimizer)
```
* 3. Define the event_handler callback to track the training loss and save the model parameters at the end of each pass:
```python
# define event_handler callback
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print("\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost,
event.metrics))
else:
sys.stdout.write('.')
sys.stdout.flush()
# save model each pass
if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=ptb_reader)
print("\nTest with Pass %d, %s" % (event.pass_id, result.metrics))
with gzip.open(model_file_name_prefix + str(event.pass_id) + '.tar.gz', 'w') as f:
parameters.to_tar(f)
```
* 4. Start training:
```python
trainer.train(
reader=ptb_reader, event_handler=event_handler, num_passes=num_passs)
```
### Generating text
The predict() function in lm\_rnn.py implements prediction, i.e. text generation. The workflow is:
* 1. First load and cache the vocabulary and the model; the trained parameters are loaded as follows:
```python
parameters = paddle.parameters.Parameters.from_tar(gzip.open(model_file_name))
```
* 2. Generate text. This example generates text with beam search, a heuristic graph-search algorithm, implemented in the \_generate\_with\_beamSearch() method of lm\_rnn.py; a minimal sketch of the idea is shown below.
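The idea of beam search is to keep only the `beam_size` most probable partial sentences at each step. The following is a minimal self-contained sketch, not the repo's \_generate\_with\_beamSearch(); `next_word_probs` is a hypothetical stand-in for the model's inference call, and the toy distribution at the bottom is only for illustration:
```python
import heapq
import math

def beam_search(prefix, next_word_probs, beam_size=5, max_len=10, eos='<EOS>'):
    """Expand `prefix` (a list of words), keeping the `beam_size` best hypotheses."""
    beams = [(0.0, list(prefix))]                 # (log-probability, word list)
    for _ in range(max_len):
        candidates = []
        for logp, words in beams:
            if words[-1] == eos:                  # finished hypotheses are kept as-is
                candidates.append((logp, words))
                continue
            for word, p in next_word_probs(words).items():
                candidates.append((logp + math.log(p), words + [word]))
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return beams

# toy usage with a fake next-word distribution (a bigram lookup on the last word)
toy = {'我': {'爱': 0.6, '是': 0.4}, '爱': {'中国': 0.9, '<EOS>': 0.1},
       '是': {'中国': 0.8, '<EOS>': 0.2}, '中国': {'<EOS>': 1.0}}
print(beam_search(['我'], lambda ws: toy[ws[-1]], beam_size=2, max_len=3))
```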
### <font color='red'>Using this demo</font>
This example uses the standard PTB data. To train a model on your own data, only the following adaptations are needed:
#### Adapting the corpus
* Clean the corpus: remove spaces, tabs, and garbled characters, and remove digits, punctuation, and special symbols as needed.
* Encoding: utf-8; this example already handles Chinese text.
* Content format: one sentence per line, with the words in each line separated by a single space (e.g. `我 爱 中国 。`).
* Adjust the data settings in the \_\_main\_\_ section of lm\_rnn.py as needed:
```python
# -- config : data --
train_file = 'data/ptb.train.txt'
test_file = 'data/ptb.test.txt'
vocab_file = 'data/vocab_cn.txt' # the file to save vocab
vocab_max_size = 3000
min_sentence_length = 3
max_sentence_length = 60
```
Here vocab\_max\_size sets the maximum size of the vocabulary: if the corpus contains more distinct words than this value, the words are sorted by frequency in descending order and only the top vocab\_max\_size words are kept.
*Note: a larger vocabulary gives richer generated text but slower training. After Chinese word segmentation a corpus easily contains tens or even hundreds of thousands of distinct words; if vocab\_max\_size is too small, the proportion of \<UNK\> tokens becomes too high, while a very large value severely slows training (and also hurts accuracy). An alternative is character-level training, i.e. treating each Chinese character as a word: there are only a few thousand common characters, so the vocabulary stays small and little information is lost, but because the same character can have very different meanings in different words, the model sometimes performs worse. Try both and choose between word-level and character-level training based on your own data.*
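A sketch of the fixed-size vocabulary construction described above (the same idea as build_vocab_with_fixed_size() in utils.py later in this commit; the helper name and the two-line toy corpus here are only for illustration):
```python
import collections

def build_vocab(lines, vocab_max_size):
    # count word frequencies and keep the vocab_max_size most frequent words
    counter = collections.Counter(w for line in lines for w in line.split())
    words = [w for w, _ in counter.most_common(vocab_max_size)]
    # ids 0 and 1 are reserved for <UNK> and <EOS>; the rest follow in frequency order
    word_id_dict = {w: i + 2 for i, w in enumerate(words)}
    word_id_dict['<UNK>'] = 0
    word_id_dict['<EOS>'] = 1
    return word_id_dict

print(build_vocab(['我 爱 中国 。', '我 是 中国 人 。'], vocab_max_size=5))
```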
#### Adapting the model
Adjust the parameters defined in \_\_main\_\_ according to the size of your corpus.
Then run `python lm_rnn.py` to train the model and run prediction.
## n-gram language model
An n-gram model is also called an (n-1)-th order Markov model. It makes a limited-history assumption: the probability of the current word depends only on the preceding n-1 words, so (Eq. 1) can be approximated as:
<div align=center><img src='images/ps2.png'/></div>
The model parameters are usually estimated by maximum likelihood estimation (MLE). For n = 1, 2, 3 the model is called a unigram, bigram, or trigram language model respectively. In general, the larger n is and the larger the training corpus, the more reliable the parameter estimates; but because the model is simple, has limited expressive power, and suffers from data sparsity, an n-gram language model usually performs worse than RNN or seq2seq models.
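Spelled out (standard forms, shown here in case the formula image does not render), the approximation above and its MLE estimate from corpus counts are:

$$P(S) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}), \qquad \hat{P}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_{i-n+1}, \ldots, w_i)}{\mathrm{count}(w_{i-n+1}, \ldots, w_{i-1})}$$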
### Model structure
The lm() function in lm\_ngram.py defines the model structure, roughly as follows (a condensed code sketch follows the figure):
* 1. The demo uses n = 5: the previous four words are embedded separately and the embeddings are concatenated into a feature vector.
* 2. The feature vector is fed into the hidden layers of a DNN.
* 3. The DNN output goes through a softmax layer to produce a probability distribution over the vocabulary for the next word.
* 4. The loss is cross-entropy, optimized with the Adam optimizer.
The structure is illustrated below:
<div align=center><img src='images/ngram.png' width='400px'/></div>
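A condensed sketch of this structure in the PaddlePaddle v2 API (the full version is ngram_lm() in network_conf.py, included later in this commit; this sketch uses paddle.layer.embedding rather than the shared table projection used there, and the layer names are arbitrary):
```python
import paddle.v2 as paddle

def ngram_lm_sketch(vocab_size, emb_dim=200, hidden_size=200):
    # four context words and the target word, each a single integer id
    context_words = [
        paddle.layer.data(name='word_%d' % i,
                          type=paddle.data_type.integer_value(vocab_size))
        for i in range(4)
    ]
    next_word = paddle.layer.data(
        name='next_word', type=paddle.data_type.integer_value(vocab_size))
    # embed each context word and concatenate the embeddings into one feature vector
    embeddings = [paddle.layer.embedding(input=w, size=emb_dim) for w in context_words]
    context = paddle.layer.concat(input=embeddings)
    # DNN hidden layer, then a softmax over the vocabulary for the next word
    hidden = paddle.layer.fc(input=context, size=hidden_size,
                             act=paddle.activation.Relu())
    predict = paddle.layer.fc(input=hidden, size=vocab_size,
                              act=paddle.activation.Softmax())
    # cross-entropy loss against the true next word
    cost = paddle.layer.classification_cost(input=predict, label=next_word)
    return cost, predict
```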
### Model training
The train() function in lm\_ngram.py implements model training; the procedure is similar to the RNN LM and is summarized below:
* 1. Prepare the input data: the standard PTB data is used. Call build\_vocab() in data\_util.py to build the vocabulary and save\_vocab() to persist it, then use train\_data() and test\_data() in data\_util.py to create train\_reader and test\_reader for reading the training and test data.
* 2. Initialize the model: the network structure, the parameters, the optimizer (Adam in this demo), and the trainer.
* 3. Define the event_handler callback to track the training loss and save the model parameters at the end of each pass.
* 4. Start training with the trainer.
### Generating text
The \_\_main\_\_ section of lm\_ngram.py contains a simple implementation of prediction (text generation). The workflow is:
* 1. First load the vocabulary and the model:
```python
# prepare model
word_id_dict = reader.load_vocab(vocab_file) # load word dictionary
_, output_layer = lm(len(word_id_dict), emb_dim, hidden_size, num_layer) # network config
model_file_name = model_file_name_prefix + str(num_passs - 1) + '.tar.gz'
parameters = paddle.parameters.Parameters.from_tar(gzip.open(model_file_name)) # load parameters
```
* 2. Predict the next word from the preceding 4 (i.e. n-1) words and print it:
```python
# generate
text = 'the end of the' # use 4 words to predict the 5th word
input = [[word_id_dict.get(w, word_id_dict['<UNK>']) for w in text.split()]]
predictions = paddle.infer(
output_layer=output_layer,
parameters=parameters,
input=input,
field=['value'])
id_word_dict = dict([(v, k) for k, v in word_id_dict.items()]) # dictionary with type {id : word}
predictions[-1][word_id_dict['<UNK>']] = -1 # filter <UNK>
next_word = id_word_dict[np.argmax(predictions[-1])]
print(next_word.encode('utf-8'))
```
*Note: this shows another way to run prediction, the paddle.infer() call; the RNN example uses the paddle.inference.Inference interface instead.*
# coding=utf-8
# -- config : data --
train_file = 'data/chinese.train.txt'
test_file = 'data/chinese.test.txt'
vocab_file = 'data/vocab_cn.txt' # the file to save vocab
build_vocab_method = 'fixed_size' # 'frequency' or 'fixed_size'
vocab_max_size = 3000 # when build_vocab_method = 'fixed_size'
unk_threshold = 1 # when build_vocab_method = 'frequency'
min_sentence_length = 3
max_sentence_length = 60
# -- config : train --
use_which_model = 'ngram' # must be: 'rnn' or 'ngram'
use_gpu = False # whether to use gpu
trainer_count = 1 # number of trainer
class Config_rnn(object):
"""
config for RNN language model
"""
rnn_type = 'gru' # or 'lstm'
emb_dim = 200
hidden_size = 200
num_layer = 2
num_passs = 2
batch_size = 32
model_file_name_prefix = 'lm_' + rnn_type + '_params_pass_'
class Config_ngram(object):
"""
config for N-Gram language model
"""
emb_dim = 200
hidden_size = 200
num_layer = 2
N = 5
num_passs = 2
batch_size = 32
model_file_name_prefix = 'lm_ngram_pass_'
# -- config : infer --
input_file = 'data/input.txt' # input file contains sentence prefix each line
output_file = 'data/output.txt' # the file to save results
num_words = 10 # the maximum number of words to generate
beam_size = 5 # beam width: the number of candidate sentences kept for each prefix
我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。
\ No newline at end of file
我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。
\ No newline at end of file
我 是
我 是 中国
我 爱
我 是 中国 人。
我 爱 中国
我 爱 中国 。我
我 爱 中国 。我 爱
我 爱 中国 。我 是
我 爱 中国 。我 是 中国
\ No newline at end of file
limited 791
consolidated 2482
four 347
facilities 1200
asian 2798
controversial 2177
whose 623
votes 2089
founder 2229
paris 1721
adviser 1759
edward 2090
voted 1935
under 125
worth 977
placed 1677
merchant 2565
pact 2130
risk 647
rise 498
sellers 2851
handling 2476
every 539
jack 1722
reforms 2309
affect 1968
bringing 2469
lehman 1238
believed 1542
school 722
calif 2386
companies 102
wednesday 910
van 2897
announced 412
pilson 2915
expanded 2427
force 534
leaders 818
miller 2247
guidelines 1795
estimates 784
japanese 174
elections 1720
second 335
street 323
estimated 453
machines 753
even 114
established 1755
disk 2826
pace 1888
panama 1852
contributed 1263
nec 2551
asia 2310
spokesman 301
above 626
dr. 982
new 36
net 136
increasing 987
ever 823
seeks 2454
told 549
specialist 2519
never 575
here 348
hundreds 1689
reported 221
protection 955
china 638
brooks 2353
active 1027
balance 1687
auction 968
items 1470
employees 457
climbed 1323
reports 658
credit 355
analysts 166
chrysler 1969
military 756
poverty 2838
changes 515
criticism 2288
golden 1750
campaign 879
reagan 1195
peabody 2432
highly 1329
brought 1130
opportunities 2661
total 344
unit 168
swings 2036
would 43
army 2135
hospital 1833
m. 1664
negative 1330
noting 2958
call 787
asset 1302
strike 1106
type 2136
until 315
b.a.t 1873
hahn 2992
supporters 2678
composite 431
hurt 1005
phone 1690
berlin 2737
hold 838
must 405
me 812
word 2057
room 1505
rights 535
pursue 2248
work 222
plunged 1604
movies 1970
henry 2960
already 294
merely 2444
revenues 2521
my 406
example 469
wang 1958
estate 460
give 438
cited 1376
india 2595
involve 2801
currency 820
foods 2389
woman 1128
caution 2981
ual 409
want 358
drive 1524
times 421
attract 2236
totaled 1297
guarantee 1982
end 237
recovery 1654
turn 837
provide 577
travel 1487
damage 516
machine 1042
how 243
hot 2766
interview 1220
widespread 2410
resignation 2178
badly 2455
regional 1486
minority 1398
lufkin 2979
after 79
damaged 1608
modest 1441
president 72
mesa 2754
law 279
types 2203
las 2785
purchase 496
attempt 1036
third 277
amid 1646
headquarters 1203
maintain 1346
green 1997
suggest 1894
democratic 961
order 529
ec 2125
wine 2831
operations 223
senators 1994
office 284
over 95
expects 407
london 410
japan 203
mayor 1819
before 158
fit 2767
personal 629
expectations 1418
better 502
production 360
weeks 422
easier 2329
damages 2063
then 224
dec. 932
affected 1709
combination 2802
lambert 1987
weakness 1827
safe 1998
break 1784
effects 1670
they 39
schools 1632
silver 1988
bank 105
structural 1989
represents 1651
30-year 1456
detroit 2961
affiliate 2357
victory 2820
reasonable 2852
each 216
went 789
side 1078
bond 272
financial 142
suspended 1944
fairly 1953
series 442
carolina 2099
carry 1536
currencies 2403
trading 77
impossible 2487
substantially 1777
temporary 1907
saturday 2070
burnham 1880
t. 2213
network 673
crucial 2935
tomorrow 1588
semiconductor 2204
encourage 2422
daniel 2483
got 596
newly 2193
millions 2159
sluggish 2609
gop 2456
foundation 2583
sept. 628
turning 2442
written 2032
veto 1052
u.s. 54
threatened 2522
little 327
free 813
standard 738
estimate 884
wanted 1030
enormous 2291
created 1053
days 172
pence 1347
oppose 2993
1970s 2205
uses 1851
r. 1170
industrial 439
suspension 2903
economists 1119
primary 1723
hearing 1803
adopted 2224
another 206
electronic 1284
<UNK> 0
rated 1609
service 337
top 525
approximately 2994
needed 911
rates 173
too 307
percentage 761
john 401
ranging 2708
urban 1922
ceiling 2109
collapse 1633
serve 2110
took 454
rejected 1392
direct 1201
western 650
somewhat 2739
shortly 2626
toronto 2091
renewed 2853
target 1313
showed 988
likely 372
nations 1606
project 802
matter 1196
greenspan 2542
feeling 2982
acquisition 470
bridge 1481
fashion 2411
sees 2498
ran 2457
boston 652
modern 2390
mind 1714
mine 2120
talking 1923
seen 920
seem 1058
seek 983
relatively 1303
forced 1172
abroad 1909
strength 1531
concrete 2911
responsible 1845
sound 2100
recommended 2898
client 1786
luxury 1999
forces 1252
unsecured 2433
shipments 2292
blue 2753
nobody 2148
philadelphia 2253
though 329
wells 1940
involving 1571
germany 705
letter 1107
competing 2791
germans 2995
consumers 1134
antitrust 1990
medical 765
flow 1647
competitors 1414
points 395
principle 2860
after-tax 2451
voting 2259
consumer 582
dow 448
came 545
reserve 717
d. 674
saying 715
meetings 1926
ending 1555
showing 1659
radio 1665
poison 2936
hungary 2330
judges 1853
finally 1383
proposed 499
representing 2098
delays 2567
unemployment 2413
sugar 1971
rico 2293
bush 257
rich 1991
announce 2624
resulting 2712
do 88
exports 1002
de 990
stop 1147
preferred 1009
coast 1828
lenders 2331
despite 486
report 238
du 2723
volatility 929
hall 2517
runs 1678
jaguar 574
countries 676
fields 2186
high-yield 2138
bay 759
twice 1983
bad 890
release 1700
prudential-bache 2691
mergers 2963
secretary 718
headed 1658
disaster 1352
fair 2161
w. 1820
testing 2938
decided 1104
result 411
discussions 1963
resigned 1234
taiwan 2793
best 598
subject 966
brazil 2803
said 16
capacity 1137
away 601
irs 1049
compensation 2619
machinists 2106
pressures 2768
future 508
cooperation 2340
approach 1402
co. 96
profitable 1629
we 65
men 944
terms 520
extend 2369
nature 2162
wo 418
ask 1906
handful 2916
weak 1520
however 229
retirement 1427
extent 2206
news 311
convertible 956
debt 208
improve 1153
suggested 1943
received 517
protect 1865
met 1317
country 359
over-the-counter 1241
against 175
players 1578
else 1394
supplies 1802
games 2111
planned 950
faces 1772
studio 2318
argue 1992
asked 613
prospect 2424
tough 1401
appeared 2008
royal 2725
offerings 2272
represented 2623
tons 1068
initiative 2804
trust 458
telecommunications 1821
conference 881
puts 1847
basis 854
union 319
anc 2756
three 133
been 59
quickly 914
commission 475
beer 2358
much 122
interest 132
basic 2273
expected 171
entered 2139
containers 2625
life 391
families 1881
mci 2741
eastern 734
drugs 1387
republicans 1666
worker 2523
mca 1972
enterprises 2282
child 2250
ogilvy 2261
worked 1370
slowdown 1653
applied 2484
commerce 1204
has 31
publicly 1377
air 452
ventures 1715
near 958
appeals 1920
aid 870
property 810
study 936
launched 1610
seven 913
changed 973
metropolitan 2742
mexico 1433
is 14
it 15
expenses 1167
ii 1917
player 2524
experts 1403
world-wide 1393
in 8
victims 1468
confident 2883
turner 2588
if 67
grown 2391
hong 794
patent 2069
things 607
make 139
linked 2895
complex 1724
split 1568
several 249
couple 1547
european 568
independent 829
pick 2319
hand 1054
ownership 1443
constitution 2696
opportunity 1473
kept 1785
scenario 2918
programs 561
settled 1621
savings 746
materials 1318
rey 2879
mother 2320
claims 595
the 2
corporate 353
investments 744
left 697
quoted 1182
yen 255
mills 2718
expanding 2640
ideas 2966
identify 2861
human 1224
campbell 2627
yet 397
previous 558
adding 1344
buyers 846
hills 2375
phillips 1895
ease 1751
had 51
intends 2227
spread 1437
board 147
easy 1540
prison 2584
east 519
gave 1227
municipal 1488
possible 492
possibly 2370
buy-out 589
judge 403
replace 1986
advanced 992
desire 2649
county 1223
exxon 1691
hunt 2459
securities 126
offices 1154
officer 290
night 1069
security 648
delmed 2689
attorney 752
right 382
old 394
deal 497
people 109
dead 2610
consultants 1954
donald 2460
election 1725
short-term 967
specific 1228
for 12
bottom 2525
comments 2302
p.m. 2163
when 68
continue 388
denied 2027
steps 1652
christmas 2321
core 1866
marketing 610
corn 1579
conventional 1883
discount 1179
restructuring 654
plc 957
packages 2664
losing 1810
brokerage 857
post 1096
manufacturing 737
properties 1298
georgia-pacific 1660
chapter 1457
dollars 604
months 127
costs 271
magazine 781
plus 1726
afternoon 1736
efforts 702
slightly 743
nixon 2526
raised 748
managers 585
publishing 1584
formerly 2392
facility 1541
civil 1229
maxwell 2471
marshall 2901
son 2423
down 119
explain 2680
magazines 1444
dean 2805
reducing 2294
defendants 1811
crowd 2983
support 415
initial 1074
legislation 803
cosmetics 2659
per-share 1891
why 763
joseph 1955
editor 1716
way 228
resulted 2527
music 2289
was 25
war 978
interest-rate 2629
head 642
economics 2800
form 945
manufacturers 931
becoming 2341
differences 2322
ford 480
failure 1638
heat 2239
hear 2568
syndicate 2939
sustained 2964
stand 1419
true 1550
analyst 365
nov. 408
counsel 2628
inside 2207
bids 1404
maximum 2538
devices 1672
tell 1345
jan. 2081
<unk> 3
stronger 1934
one-third 2681
evidence 1043
promised 2597
accounting 1410
ship 2351
program-trading 2844
check 2323
negotiations 1391
regime 2822
floor 1014
phelan 2180
stake 320
generally 925
credibility 2924
successful 1698
interested 1378
role 795
holding 477
digital 1044
test 895
developers 2587
bailout 2663
roll 2332
picture 2151
's 10
brothers 1001
delivered 2930
models 1901
surprise 2608
felt 1918
utilities 1717
'd 1131
invested 2295
authorities 1892
'm 827
aware 2662
weekend 1854
died 2228
jones 459
reorganization 1411
longer 921
glass 2647
assume 2312
italy 1977
connecticut 2959
together 1190
liquidity 1841
premiums 2940
time 104
push 1812
serious 1175
profits 740
concept 2786
managed 1855
chain 1273
global 1551
alternatives 2461
focus 1247
manager 611
battle 1118
creative 2919
s.a. 2146
certainly 1532
everything 1498
father 2393
environment 1692
charge 527
asking 2026
e. 1514
marks 754
suffered 2187
circumstances 2290
division 540
supported 2316
mixte 1733
keeping 1929
choice 2164
liability 1756
drexel 875
lynch 834
10-year 2216
join 2149
trouble 1513
corp. 66
governments 2352
level 462
did 144
turns 2434
proposals 1737
democrat 2862
standards 1405
leave 1235
settle 1927
team 769
quick 2414
speculation 1311
round 2863
lloyd 1509
prevent 1371
says 45
trend 1458
gasoline 2439
telerate 2073
sign 1215
mich. 2823
cost 363
aggressive 2210
adds 872
appear 1299
hewlett-packard 2082
assistance 1965
shares 71
current 265
goes 1321
international 198
falling 1752
principal 1510
boost 962
filled 2443
paribas 1557
transportation 847
genes 2262
french 897
agreement 332
water 1092
baseball 1804
groups 723
address 2834
alone 1325
along 703
earthquake 427
change 429
wait 2555
canadian 741
institute 760
shift 1738
guilty 1623
trial 1162
usually 1026
corp 587
bob 2612
navigation 1667
retired 2598
defensive 2769
extra 2092
lending 1710
mobil 2920
crisis 2071
market 48
everybody 2613
indicated 1296
working 670
prove 2000
positive 1778
psyllium 2721
visit 2275
third-quarter 349
france 1003
live 1636
opposed 1779
stearns 2281
memory 2513
francs 1012
australian 1773
household 2249
today 326
club 2058
apparent 2770
fuel 2225
cautious 2577
downturn 2549
cases 685
effort 677
behalf 2771
fly 2683
organizations 2599
valued 1216
ibm 690
tokyo 606
car 602
abortion 886
believes 1095
districts 2884
ms. 487
values 1673
can 90
growing 776
making 424
interstate 1936
newspapers 2376
claim 1535
citizens 2371
figure 1187
predict 2499
december 924
chip 2093
agent 2276
1980s 2123
heard 2378
dropped 557
council 1334
allowed 1141
requirements 1562
winter 2650
secured 2965
bonds 129
chemical 720
beat 2816
sunday 2500
s. 2050
fourth 915
ensure 2283
subsidiaries 2569
economy 354
product 521
huge 828
may 94
southern 1277
applications 2529
membership 2140
produce 979
mae 1596
designed 1098
date 1326
such 89
data 451
grow 1556
man 898
natural 1007
johnson 1244
maybe 1688
futures 288
borrowing 2864
gap 2518
so 106
deposit 1856
increase 244
pulled 2473
talk 1197
typical 2600
exclusive 2755
no. 1656
acts 2869
seeing 2865
sell-off 2772
indeed 1010
mainly 1871
consulting 1611
years 73
ended 328
experiments 2835
cuts 1287
argued 2133
statements 2585
cold 2263
still 148
stock-index 974
group 97
monitor 2941
procedures 2734
presence 2238
troubles 2474
forms 2726
offers 1338
policy 374
mail 1617
main 1112
decades 2985
texas 484
happened 2152
finance 573
views 2141
introduce 2539
nation 507
records 2023
half 389
not 64
now 100
provision 1207
discuss 2124
nor 1412
term 1080
attorneys 1857
name 819
january 1135
drop 471
rock 2866
quarter 110
el 2727
square 2083
significantly 1775
latin 2827
revised 1582
s&p 835
begun 2360
year 41
happen 2048
worried 2084
tried 1590
canada 730
living 2033
shown 2028
inventories 2418
opened 1239
space 971
profit 178
factory 2042
looking 735
investigation 1339
indicating 2991
shows 1133
exactly 2675
earlier 134
theory 2570
cars 701
million 22
incentives 2799
possibility 2251
quite 1583
california 296
besides 2709
obligation 2264
marine 2787
card 2303
care 858
advance 1639
training 1993
language 2142
ministry 1245
discussing 2885
wrong 1780
british 304
thing 899
place 666
massive 2441
promotion 2644
think 316
first 75
merrill 814
revenue 187
one 55
opec 2552
americans 1018
one-time 2049
directly 1085
vote 839
corporations 1640
message 2601
fight 1447
open 671
george 1031
size 1070
city 286
given 732
sheet 2998
district 806
caught 2436
trillion 1744
plastic 2571
anyone 1453
indicate 2165
returns 984
white 445
friend 2788
gives 1499
hud 2043
acquisitions 1491
mining 2037
mostly 1425
that 11
pittsburgh 2343
season 1580
moscow 1495
alan 1745
released 1625
specialists 2166
surged 1730
than 56
population 2507
wide 1834
television 758
effective 1140
rival 2117
require 1214
spokeswoman 1046
officials 161
venture 816
were 47
published 2398
and 9
mountain 2355
san 305
investors 117
remained 1849
turned 951
argument 2774
say 118
plunge 1232
allen 2836
sells 1626
saw 1463
any 107
accounted 2722
offering 357
regular 2035
efficient 2743
offer 209
aside 2530
note 952
equipment 550
mr. 24
potential 649
take 210
performance 766
wonder 2744
registered 2783
channel 2556
begin 852
sure 1138
normal 1796
track 2274
price 116
enter 2882
paid 490
icahn 2736
nomura 2997
america 541
pages 1957
honecker 2531
manville 1637
operate 1890
especially 860
surprising 2492
payable 2694
considered 994
average 197
later 467
steady 2472
sale 227
federal 101
professional 1757
senior 446
mass. 1701
typically 1449
filing 963
laws 1489
shop 2682
rating 1384
shot 2684
surplus 2775
show 465
german 770
delta 2806
allegations 2508
commitments 2018
discovered 2059
rep. 824
soviets 1396
fifth 2728
ground 1840
slow 1188
ratio 2304
gulf 1482
title 2126
daily 859
enough 543
crime 1563
only 87
going 325
black 755
treasury 282
thompson 2528
watching 2974
congressional 912
dispute 1702
get 188
contracts 646
assistant 1634
employers 2445
nearly 441
secondary 2395
prime 953
regarding 2630
yield 236
morning 1211
miles 1357
predicted 1693
scott 2906
where 252
husband 2666
salomon 1278
declared 2009
corry 2710
committed 2266
seat 2685
elected 1240
j. 891
college 1500
stanley 1348
concern 289
mortgage 583
farmers 1202
ways 1139
jumped 1008
review 1282
representatives 1858
forecast 1474
weapons 2886
outside 849
bureau 1464
between 179
import 2179
reading 2557
across 1177
jobs 1142
august 398
parent 731
blame 2667
article 1055
cities 1806
come 435
reaction 2112
acquiring 2240
many 98
trader 1281
trades 1246
according 215
contract 317
prompted 2486
buy-back 2668
senator 2967
holders 675
traded 926
comes 1097
among 212
cancer 1019
color 2652
roman 2639
period 341
insist 2631
confirmed 1964
learning 2789
moreover 1208
poll 2824
two-year 2324
considering 1434
save 1973
unusual 1739
west 380
airlines 699
mark 1061
hutton 1340
combined 1545
hardly 2825
mary 2942
disclosed 871
wants 821
direction 1984
shopping 1974
offered 494
formed 2252
observers 2437
wake 2010
minister 1047
former 338
those 151
pilot 2509
case 308
developing 1589
these 145
consultant 1373
cash 268
n't 33
warning 2366
policies 1312
newspaper 1148
situation 942
shops 2632
margin 1797
region 1648
eventually 1415
metric 2265
health-care 2218
engaged 2837
telephone 733
quiet 2757
middle 2686
someone 1362
attributed 1400
technology 503
worry 2001
par 1164
develop 1285
pay 256
same 313
dealer 2586
speech 2396
grain 2200
insurers 1668
events 1483
week 124
buy-outs 2926
oil 269
singapore 2085
boosted 1813
drives 2072
producers 855
running 1122
harris 2711
intended 1951
changing 2510
anticipated 2344
complained 2540
costa 2211
theater 2669
largely 892
charges 593
no 103
constitutional 2305
roughly 1597
mortgages 1537
severe 2541
without 350
relief 1746
model 1807
researchers 1248
charged 1657
summer 976
asset-backed 2687
being 214
money 162
rest 1115
kill 2633
speed 1787
weekly 2052
announcement 1040
death 1237
rose 120
seems 900
except 1627
improvement 1467
westinghouse 2968
setting 2150
bloc 2634
treatment 1355
plenty 2962
tuesday 474
ross 2196
scheduled 832
negotiating 2017
around 385
read 1331
papers 2695
virginia 2698
early 267
inflation 620
traffic 1703
using 665
accepted 1747
ruled 2379
intel 2167
nissan 1271
rivals 2616
've 578
annually 1758
chamber 2615
benefit 1279
either 707
retailers 2314
fully 1145
output 1450
tower 2416
reduced 901
nikkei 2635
competition 874
loyalty 2970
bigger 1869
thinks 2086
provided 1294
earth 2821
operators 2638
recorded 2489
legal 656
conservative 1280
critical 2101
deficit 822
provides 1304
newport 2943
moderate 2446
football 2359
assembly 2479
scientific 2118
power 339
airways 2807
equivalent 2428
broker 1862
broken 2808
leadership 1902
aide 2738
manufacturer 2188
on 17
central 877
package 1184
of 5
industry 154
thousands 1567
fell 204
airline 767
sachs 1679
act 757
mixed 1764
mean 1465
or 37
confidence 1818
tape 2658
barrels 1850
outlook 2219
coupon 1661
instruments 1867
image 1341
accounts 907
determine 2636
parties 1546
operator 2127
your 483
pharmaceutical 1945
fast 2512
her 200
area 449
there 84
alleged 1601
start 903
appears 1314
low 615
lot 580
valley 1272
billion 49
complete 1490
saatchi 2087
delayed 2782
sophisticated 2501
brain 2975
succeeded 2913
two-thirds 2603
technologies 2212
trying 547
with 23
buying 361
faster 2362
volume 396
october 522
circulation 2923
sears 1719
default 2380
wholesale 2699
agree 1842
strongly 2759
gone 1843
vehicles 1576
ad 695
ag 1674
certain 426
totaling 2637
moved 1266
sales 82
deep 2945
an 32
cbs 1023
britain 774
at 19
file 2012
aids 2137
politics 2113
moves 1103
film 1475
fill 2946
again 556
consensus 2713
personnel 2103
storage 2490
event 1938
field 1105
you 111
poor 888
a$ 2589
congress 287
separate 1127
students 1358
a. 943
n.j. 1848
important 590
massachusetts 2537
coverage 2002
planners 2653
brands 1335
stocks 150
building 482
assets 297
calls 724
wife 1985
invest 2114
having 716
directors 751
mass 2654
overseas 1060
starting 1801
original 1525
represent 2013
all 74
sci 1781
consider 930
chinese 1564
caused 989
lack 1439
dollar 333
month 169
mccaw 1947
talks 711
follow 2019
settlement 917
decisions 1868
children 1028
causes 2904
reluctant 2947
tv 659
thursday 939
shall 2832
to 6
program 157
spain 2480
health 468
lawmakers 1388
activities 1230
calif. 850
premium 1257
returned 2053
divisions 2954
very 253
resistance 2285
worst 2325
decide 1882
fall 619
sony 1155
difference 1574
condition 2119
cable 1264
louis 2128
list 1407
joined 2024
large 381
circuit 2199
small 300
webster 2984
past 225
rate 159
arizona 1599
design 1521
lawyer 1359
pass 2168
nbc 1808
further 393
investment 146
what 115
abc 2189
richard 867
investing 2181
sun 1581
section 1236
resume 2497
brief 2700
<EOS> 1
noriega 1006
version 1349
scientists 1466
certificates 1822
learned 2604
public 275
contrast 1928
movement 2038
turmoil 2730
full 632
editorial 2488
answers 2854
hours 887
citicorp 1995
operating 299
excess 2094
november 1445
strong 404
thrift 1077
publisher 2194
prosecutors 1598
ahead 970
extraordinary 2147
losses 392
experience 1526
prior 1874
amount 542
advertising 627
social 1342
action 588
narrow 2688
options 657
via 645
followed 1159
family 617
requiring 2855
africa 1946
thatcher 1533
put 336
aimed 1516
establish 2559
donaldson 2809
shareholders 591
eye 2190
takes 1254
petroleum 1422
two 76
generate 2760
taken 640
markets 191
minor 1896
more 46
flat 1231
israel 2241
door 2054
knows 2326
fast-food 2543
jr. 1209
company 38
broke 2856
particular 1476
known 709
producing 1884
town 1741
jim 2773
none 1919
lilly 2874
hour 1477
science 2905
des 2746
remain 706
sudden 2558
nine 511
sent 998
morgan 905
strategies 2230
history 964
purchases 1336
processing 2306
brown 1448
pont 2888
share 63
accept 1527
states 599
pushed 2221
minimum 1079
numbers 1618
purchased 1662
sense 1116
sharp 1123
f. 2095
information 532
needs 1198
answer 2440
court 213
advantage 1924
rather 644
hugo 1065
conducted 2810
earnings 137
portfolios 2548
plant 402
plans 232
advice 2747
different 772
reflect 1897
fe 2003
coming 762
response 1124
a 7
short 694
brady 2400
departure 2889
coal 2354
broadcasting 1528
responsibility 2068
media 1034
banks 248
egg 2602
playing 2447
turnover 2701
played 1839
help 334
september 312
developed 1260
soon 641
trade 220
held 417
paper 399
through 149
committee 373
signs 1694
suffer 2948
its 28
developer 2969
style 2074
rapidly 2214
actually 1071
late 292
systems 419
conn. 2051
stephen 1898
inquiry 2999
might 280
tentatively 2907
good 264
return 504
seeking 817
food 622
reflected 1886
association 447
easily 2145
holiday 2763
always 851
stopped 1885
eager 2990
found 552
heavy 771
sterling 1641
everyone 1385
england 1492
generation 2198
house 165
energy 773
hard 712
reduce 713
idea 1165
police 1399
extended 2313
expect 524
advertisers 1704
operation 1143
beyond 1300
insurance 240
really 840
deals 1306
funding 1173
carriers 2339
blacks 1875
robert 513
since 156
douglas 2381
research 318
participants 2315
safety 1267
hill 2075
fujitsu 2707
issue 186
highway 1844
reporting 2192
risen 2908
lawrence 2401
friday 283
houses 1332
reason 878
base 980
members 450
backed 1319
beginning 937
guy 2481
director 276
owners 1275
benefits 991
launch 2222
just 152
computers 565
excluding 2317
american 141
threat 1705
pilots 1045
fallen 2154
lawsuits 2160
copper 1478
major 138
slipped 1903
feel 1113
number 295
feet 1350
done 927
fees 792
miss 2925
causing 2470
stage 2258
story 1363
heads 2532
leading 815
st. 1612
kidder 1086
least 298
station 1887
expand 1708
statement 682
dealing 2554
compromise 1975
store 1365
listed 1605
selling 400
passed 1364
relationship 1904
behind 1120
hotel 1727
park 1518
immediate 1930
blue-chip 2729
profitability 2566
part 202
favorable 2870
believe 624
hollywood 2007
king 2242
kind 948
grew 1572
rebound 2900
double 1870
pennsylvania 2361
determined 1959
risks 1484
elaborate 2520
messrs. 2402
toward 811
aug. 2039
outstanding 548
imports 949
substantial 1315
orders 456
option 1389
sell 183
ratings 1573
built 1099
trip 2887
gorbachev 1166
officers 2670
targets 2611
majority 908
internal 1327
chairman 143
finding 1899
frequently 2387
play 1117
added 285
electric 940
goldman 1602
eggs 2811
measures 1322
reach 1366
freddie 2605
most 91
hired 2493
shareholder 1032
plan 176
significant 893
services 324
extremely 1976
approved 680
soared 1910
compaq 2690
dealers 669
clear 726
sometimes 1276
cover 1506
rockefeller 2731
traditional 1522
three-month 2578
clean 2790
usual 1771
institutions 826
painewebber 1628
sector 1075
thomas 1269
particularly 783
gold 660
commissions 2277
nasdaq 1343
session 1082
businesses 434
jury 1372
fine 1829
find 678
impact 883
gen. 2812
giant 1100
regulations 2060
nevertheless 2955
northern 1431
justice 1011
heavily 1440
distributed 2871
failed 778
flights 2544
pretty 1760
equity 621
giants 2169
begins 2679
his 50
hit 777
gains 554
meanwhile 672
express 1020
financing 605
collection 2878
b 2327
actions 1622
closely 1102
reporters 2170
during 199
him 302
merchandise 2450
appeal 1682
doubled 2813
six-month 2014
banking 569
common 247
activity 807
switzerland 2096
coors 2909
river 1996
wrote 1565
restaurants 2748
set 386
art 1459
achieved 2761
declines 1256
sex 2986
culture 2345
see 379
defense 536
sec 1132
are 27
sea 1876
tender 1765
close 321
arm 2890
declined 413
filings 2932
# 687
spirits 2776
movie 1603
century 1742
currently 488
won 1072
various 1186
probably 681
conditions 1087
supposed 2449
available 679
korea 1592
recently 376
creating 2363
initially 2115
dividends 1435
sold 239
attention 1420
aircraft 1496
succeed 2284
coffee 2143
opposition 1305
franchise 2575
dividend 704
both 180
prospects 2171
last 70
appropriations 1367
annual 367
foreign 245
sensitive 2732
connection 2591
became 985
long-term 688
let 1025
whole 1180
baltimore 2749
point 375
reasons 1748
loan 501
community 922
simply 946
church 1960
throughout 1766
expensive 1619
decline 461
described 2182
raise 630
monthly 1504
create 1288
political 390
due 260
strategy 750
convicted 2830
whom 1713
reduction 1501
maintenance 2545
meeting 476
walter 2438
firm 192
partly 1110
fire 1782
gas 538
convert 2794
N 4
fund 293
whatever 2671
lives 2129
brokers 960
bidding 1494
demand 437
prices 113
plants 865
georgia 2076
look 714
solid 2950
judicial 2987
bill 261
budget 570
governor 2672
technical 1586
while 121
mainframe 2927
ought 2546
fleet 2928
mitchell 2346
guide 2792
engineers 2762
real 309
pound 1066
costly 2183
voters 1683
cents 108
motors 1328
stations 1740
disappointing 2462
itself 683
ready 1788
fannie 1967
coca-cola 2910
chase 2088
underwriters 1718
suggests 2045
rules 906
virtually 1753
widely 1283
grand 1426
survey 1108
dozen 1671
higher 207
development 444
used 263
lawyers 691
d.c. 2988
affairs 1699
comprehensive 2655
yesterday 123
moment 1859
levels 788
moving 1408
purpose 2617
tobacco 2477
recent 182
lower 231
task 2015
older 1908
studies 1956
poland 1221
spent 1149
person 1442
machinery 2511
ltd. 555
swiss 1416
organization 1178
spend 1270
coup 2226
one-year 2560
junk-bond 1767
networks 2464
u.k. 1168
competitive 1650
quarters 2311
questions 1093
world 219
alternative 1978
wage 1158
cut 378
helping 2116
$ 13
also 60
advisers 2044
workers 432
deputy 1809
guaranteed 2268
attractive 2467
source 1076
stock-market 2382
parents 1877
location 2777
violations 2576
guarantees 2004
administrative 2233
remaining 1428
surprised 2478
build 848
customers 526
australia 1249
march 618
emergency 1171
demands 2648
big 130
bid 258
matters 2104
game 1088
aerospace 1931
bit 1893
projects 868
moody 995
breeden 2364
success 1395
follows 1860
signal 2383
toyota 2929
separately 1261
communications 779
arthur 2891
individuals 1324
yields 923
popular 1429
healthy 1805
privately 2297
often 518
senate 463
spring 1205
b. 1814
some 58
back 193
trends 2673
economic 234
pricing 1861
apply 2465
nicaragua 2503
facing 2397
scale 2750
decision 531
transactions 1083
audience 2144
per 1038
eliminate 2561
be 26
run 612
lose 1517
continuing 1021
fed 566
refused 2077
step 1210
santa 1250
served 2066
at&t 1789
by 18
pipeline 2365
goods 804
anything 997
truck 1792
mrs. 662
range 882
ounce 1921
duties 2917
block 1035
pollution 2951
repair 2839
steinhardt 2692
into 92
within 530
retailer 2751
nothing 1033
primarily 1548
sports 1259
pentagon 1472
bankruptcy 1029
statistics 1939
spending 509
question 801
long 352
ordered 1823
amr 2989
suit 633
himself 1056
elsewhere 1731
collapsed 2347
vehicle 2061
specialty 2269
hoped 2872
atlantic 2254
pacific 689
filed 528
hopes 1101
subsidiary 663
line 464
considerable 2714
raising 1421
posted 634
up 53
us 505
maturity 1529
're 278
exploration 2105
viacom 2562
similar 710
called 342
bell 1310
associated 1669
metal 2485
influence 1905
metals 1790
engineering 1293
associates 1121
rally 975
amounts 1356
peace 2892
fears 1711
teams 2921
yeargin 2494
afford 2902
politicians 2131
reputation 2693
income 185
department 235
manhattan 1430
users 2367
gross 2011
problems 356
prepared 1575
william 782
allowing 2579
formal 2592
sides 1783
structure 1761
ago 196
urged 2223
land 1217
vice 233
age 1067
required 972
bankers 1022
responded 2873
far 291
fresh 2593
requires 2384
leveraged 1146
once 500
code 2372
issued 902
results 343
existing 1251
oct. 314
ge 1889
broader 2778
go 387
gm 844
contributions 2463
centers 1585
issues 251
seemed 1815
concerned 1493
young 841
send 2779
suits 2399
citing 1961
stable 2745
quarterly 1090
include 537
friendly 2385
resources 1059
garden 2255
automotive 1620
continues 1050
wave 2857
putting 1663
cellular 2448
telling 2931
continued 609
entire 1436
eased 2894
sen. 969
real-estate 1446
positions 1380
notes 377
michael 798
fewer 1732
try 876
race 2475
noted 667
guber 1183
concluded 2244
smaller 1048
cds 1774
crop 2296
jump 2132
video 2533
expense 2458
makers 768
index 217
edison 2814
business 86
chicago 485
giving 1454
expressed 2173
practices 1798
access 1455
paying 1125
waiting 1630
indian 2840
volatile 2030
five-year 2651
capital 181
firms 366
exercise 1743
body 2875
led 661
lee 1816
exchange 112
pushing 2514
commercial 478
jointly 2040
following 572
northeast 2553
them 128
others 495
great 592
credits 3001
receive 934
involved 831
larger 1307
leaving 1835
engine 2758
merger 1037
products 190
opinion 1817
residents 1754
gene 1614
makes 559
maker 340
fourth-quarter 2016
named 523
writer 2971
apple 1538
heart 1655
win 1255
manage 2933
private 551
fraud 1552
names 1863
motor 996
scandal 1948
standing 2715
use 270
from 21
p&g 2217
consumption 2618
& 83
remains 780
illegal 2121
cray 1553
next 131
few 262
doubt 1878
year-ago 1519
themselves 873
consecutive 2243
reflects 1712
usx 1502
sort 1566
parliament 2580
started 889
becomes 2934
factor 2245
benchmark 1824
occurred 2841
carrying 2590
sharply 1051
allianz 2867
mitsubishi 1615
appointed 2495
women 941
customer 1353
account 636
us$ 1539
effectively 2876
this 40
challenge 1879
clients 764
recession 965
thin 2429
island 2868
meet 896
closing 1039
n.y. 2208
control 351
beijing 2062
slid 2780
weaker 2716
engelken 2703
process 796
a.m. 2859
daiwa 2733
tax 218
purposes 2842
high 241
professor 2022
reserves 825
something 785
sought 1432
stories 2466
voice 1981
rape 2972
sir 1836
educational 2534
united 562
usair 2029
democracy 2404
recalls 2572
six 364
hampshire 2279
arrangement 2795
traders 281
forest 2877
instead 597
stock 61
buildings 1642
farm 2097
watch 2237
tied 2256
ties 2215
boeing 1962
light 1063
lines 805
commodities 2373
chief 153
road 2102
allow 894
executives 493
martin 2417
houston 1933
holds 1199
hanover 2547
producer 1587
institutional 1258
move 330
produced 1109
alliance 2056
including 211
looks 2155
quake 1064
year-earlier 797
industries 553
delay 1932
la 2041
labor 700
whites 2952
willing 1289
orange 2515
covered 2156
criminal 1479
spot 2209
pending 1503
crash 1041
greater 1225
auto 616
practice 1360
earn 2937
cutting 1949
h. 2184
hands 1316
front 1911
bar 2740
republican 1409
investor 506
day 273
capital-gains 1386
successor 2704
february 1799
l. 1577
warned 2280
university 594
covering 2724
identified 2504
morris 1979
rising 1126
bills 614
warner 576
doing 842
strip 2944
related 866
society 1423
books 1800
measure 853
our 230
margins 1242
agriculture 2717
special 655
out 85
merc 2849
' 135
entertainment 1016
defend 2815
critics 1825
electronics 1301
cause 1213
integrated 2286
red 1013
thrifts 1837
disclose 2021
shut 2912
frank 1643
ban 2065
regulators 1084
york 93
regulatory 1374
indicates 2596
philip 1696
navy 2034
hostile 1791
could 80
florida 1864
mac 2496
keep 586
ltd 1480
davis 2781
retain 2452
retail 909
south 436
respond 2833
plastics 2953
succeeds 2031
powerful 1616
owned 1015
strategic 1768
owner 1507
reached 698
awarded 2409
quality 1192
nyse 2335
legislative 2657
management 266
stands 2702
los 639
system 274
relations 1560
recapitalization 2880
priority 2581
their 52
attack 2348
intelligence 1649
final 1111
interests 775
enforcement 2535
shell 2973
completed 790
acquire 869
environmental 1091
chemicals 1591
reflecting 1308
branches 2415
july 567
institution 2706
steel 600
colleagues 2174
hearings 2287
commodity 1406
patients 1952
individual 736
providing 2064
creditors 1004
projections 2996
unchanged 1073
partnership 1218
lin 1613
unlikely 2356
have 35
need 479
apparently 1081
clearly 1508
rjr 2220
documents 1675
dallas 1830
agency 303
able 584
purchasing 1941
instance 1222
concerns 830
which 42
campeau 1680
coke 2956
unless 1150
who 57
eight 959
preliminary 1980
device 2843
segment 1569
payment 1212
so-called 1354
request 1681
face 664
looked 2828
proceedings 2195
lowered 2176
pictures 2267
normally 2191
fact 668
goals 2764
agreed 346
charles 1379
bring 1144
planning 1351
democrats 1156
portfolio 684
fear 1262
economist 1189
debate 1561
decade 1219
staff 861
litigation 1485
partners 918
based 306
earned 1017
controls 1558
should 205
unable 2260
candidates 1925
employee 1624
communist 1769
local 579
hope 1160
meant 2845
dinkins 1185
handle 2435
means 863
fellow 2505
familiar 1253
overall 1268
bear 1644
reinsurance 2656
joint 799
ones 1460
words 1749
exchanges 1469
buyer 2005
kong 845
chips 1600
areas 880
trucks 1684
course 904
numerous 2574
taxes 954
calling 2405
she 164
ohio 1570
fixed 1424
conduct 2819
view 808
europe 546
temporarily 2122
downward 3000
acquired 747
national 163
accord 1770
operates 1912
edition 2846
computer 226
subcommittee 2430
closer 2516
nationwide 2594
reform 1471
nuclear 1676
tend 2134
favor 1697
state 140
closed 250
crude 1900
progress 2453
neither 1397
bought 686
comparable 2006
brewing 2425
ability 1157
opening 1461
deliver 2234
agencies 1169
job 793
takeover 423
key 864
approval 727
precious 2858
lawsuit 2536
distribution 1793
declining 1631
david 693
restrictions 1530
limits 2307
career 2349
goal 1966
taking 836
equal 1950
drug 510
pulp 2563
april 856
figures 653
jersey 1762
otherwise 2406
comment 472
adjusted 1706
english 2765
co 571
lang 2333
agents 2172
wall 331
ca 563
cd 2055
packaging 2976
qintex 1243
table 2573
oakland 1937
industrials 2607
addition 489
genetic 2336
permanent 2893
agreements 2078
proposal 491
waste 2847
faced 2412
controlled 1763
c. 1707
league 2308
am 1635
sufficient 2977
otc 1413
essentially 2620
c$ 1511
bulk 2752
finished 1831
graphics 2257
improved 1309
atlanta 2431
general 189
present 1838
homes 1515
troubled 1543
abandoned 2896
unlike 1728
sotheby 2848
restaurant 2278
harder 2550
as 20
value 310
will 34
owns 833
wild 2978
uncertainty 1686
almost 512
blood 2502
thus 1114
site 2491
helped 745
claimed 2606
partner 1062
shearson 938
halt 2665
tumbled 2407
perhaps 986
began 544
administration 345
cross 2079
member 1094
retailing 2419
parts 935
largest 414
units 603
party 728
gets 1523
difficult 1024
material 1695
columbia 809
nekoosa 1916
upon 2025
effect 692
forecasts 2299
student 2246
rumors 1914
kkr 2614
single 1136
transaction 564
off 170
center 721
i 69
approve 2426
well 201
fighting 2641
thought 947
banker 2377
sets 2235
position 533
soviet 384
inc. 81
latest 466
stores 643
less 246
increasingly 1554
executive 155
domestic 708
obtain 2157
sources 1534
underlying 2107
rooms 2374
seats 1832
paul 1549
rapid 2642
ads 1417
supply 933
smith 1129
deposits 1734
realize 2298
simple 2231
add 1226
other 62
subordinated 1544
match 2408
boom 2582
tests 1645
increased 369
provisions 1292
government 99
chancellor 2300
increases 696
five 259
know 481
press 916
immediately 1451
loss 254
lincoln 1735
necessary 1381
like 167
lost 635
miami 2829
taxpayers 2201
lawson 1390
payments 729
james 560
become 428
works 1320
soft 2674
amendment 1846
exceed 2643
because 78
arbitrage 981
authority 1286
growth 242
export 1462
cleveland 2796
home 322
peter 1163
employment 1872
line-item 2350
lead 786
broad 1593
avoid 1265
hurricane 1151
slide 2270
does 177
york-based 2645
chains 2621
leader 862
schedule 2719
journal 919
monetary 1452
expansion 1594
beach 2394
pressure 843
expire 2817
although 368
offset 1193
includes 800
loans 362
vs. 2660
panel 1559
gained 725
about 44
actual 1174
carried 2420
debentures 1295
freedom 2676
shipping 2158
surge 2080
angeles 749
holdings 885
carries 2468
carrier 1368
introduced 1191
software 993
own 195
letters 2175
previously 719
warrants 2046
washington 514
commitment 2337
billions 2622
getting 637
malcolm 2922
included 651
guard 2881
promise 2047
managing 1274
banco 2338
utility 1913
accused 1794
additional 581
krenz 1729
transfer 2506
housing 999
secret 2949
peters 1206
continental 1776
biggest 742
pretax 1176
fiscal 416
buy 184
north 739
stadium 2899
triggered 2368
insurer 2564
funds 194
brand 1089
akzo 2957
but 30
delivery 1000
insured 2108
construction 608
gain 430
courts 1942
highest 1595
ltv 2705
he 29
made 160
places 2735
whether 370
cells 2388
official 455
signed 1194
record 440
below 631
limit 1057
ruling 1233
problem 433
piece 2185
minutes 1497
supreme 1152
deaths 2818
wcrs 2980
slowing 1607
flight 2020
education 1382
proceeds 1337
worse 2197
inc 443
aetna 2720
mutual 1369
compared 371
'll 928
variety 2784
corporation 2271
illinois 2646
book 1291
compares 2914
details 1512
branch 2202
compete 2850
gonzalez 2797
junk 473
francisco 420
star 1826
monday 383
class 1361
june 625
ultimately 2153
contends 2328
stay 1333
chance 1375
bellsouth 2697
priced 425
friends 2421
exposure 2301
resolution 2067
baker 1290
factors 1438
rule 1161
ortega 2677
portion 1685
write 2342
status 2334
pension 1181
understand 2232
frankfurt 1915
# coding=utf-8
import paddle.v2 as paddle
import gzip
import numpy as np
from utils import *
import network_conf
from config import *
def generate_using_rnn(word_id_dict, num_words, beam_size):
"""
Demo: use RNN model to do prediction.
:param word_id_dict: vocab.
:type word_id_dict: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
:param num_words: the number of the words to generate.
:type num_words: int
:param beam_size: beam width.
:type beam_size: int
:return: save prediction results to output_file
"""
# prepare and cache model
config = Config_rnn()
_, output_layer = network_conf.rnn_lm(
vocab_size=len(word_id_dict),
emb_dim=config.emb_dim,
rnn_type=config.rnn_type,
hidden_size=config.hidden_size,
num_layer=config.num_layer) # network config
model_file_name = config.model_file_name_prefix + str(config.num_passs -
1) + '.tar.gz'
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(model_file_name)) # load parameters
inferer = paddle.inference.Inference(
output_layer=output_layer, parameters=parameters)
# tools, different from generate_using_ngram's tools
id_word_dict = dict(
[(v, k) for k, v in word_id_dict.items()]) # {id : word}
def str2ids(str):
return [[[
word_id_dict.get(w, word_id_dict['<UNK>']) for w in str.split()
]]]
def ids2str(ids):
return [[[id_word_dict.get(id, ' ') for id in ids]]]
# generate text
with open(input_file) as file:
output_f = open(output_file, 'w')
for line in file:
line = line.decode('utf-8').strip()
# generate
texts = {} # type: {text : probability}
texts[line] = 1
for _ in range(num_words):
texts_new = {}
for (text, prob) in texts.items():
if '<EOS>' in text: # stop prediction when <EOS> appear
texts_new[text] = prob
continue
# next word's probability distribution
predictions = inferer.infer(input=str2ids(text))
predictions[-1][word_id_dict['<UNK>']] = -1 # filter <UNK>
# find next beam_size words
for _ in range(beam_size):
cur_maxProb_index = np.argmax(
predictions[-1]) # next word's id
text_new = text + ' ' + id_word_dict[
cur_maxProb_index] # text append next word
texts_new[text_new] = texts[text] * predictions[-1][
cur_maxProb_index]
predictions[-1][cur_maxProb_index] = -1
texts.clear()
if len(texts_new) <= beam_size:
texts = texts_new
else: # cutting
texts = dict(
sorted(
texts_new.items(), key=lambda d: d[1], reverse=True)
[:beam_size])
# save results to output file
output_f.write(line.encode('utf-8') + '\n')
for (sentence, prob) in texts.items():
output_f.write('\t' + sentence.encode('utf-8', 'replace') + '\t'
+ str(prob) + '\n')
output_f.write('\n')
output_f.close()
print('already saved results to ' + output_file)
def generate_using_ngram(word_id_dict, num_words, beam_size):
"""
Demo: use N-Gram model to do prediction.
:param word_id_dict: vocab.
:type word_id_dict: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
:param num_words: the number of the words to generate.
:type num_words: int
:param beam_size: beam width.
:type beam_size: int
:return: save prediction results to output_file
"""
# prepare and cache model
config = Config_ngram()
_, output_layer = network_conf.ngram_lm(
vocab_size=len(word_id_dict),
emb_dim=config.emb_dim,
hidden_size=config.hidden_size,
num_layer=config.num_layer) # network config
model_file_name = config.model_file_name_prefix + str(config.num_passs -
1) + '.tar.gz'
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(model_file_name)) # load parameters
inferer = paddle.inference.Inference(
output_layer=output_layer, parameters=parameters)
# tools, different from generate_using_rnn's tools
id_word_dict = dict(
[(v, k) for k, v in word_id_dict.items()]) # {id : word}
def str2ids(str):
return [[
word_id_dict.get(w, word_id_dict['<UNK>']) for w in str.split()
]]
def ids2str(ids):
return [[id_word_dict.get(id, ' ') for id in ids]]
# generate text
with open(input_file) as file:
output_f = open(output_file, 'w')
for line in file:
line = line.decode('utf-8').strip()
words = line.split()
if len(words) < config.N:
output_f.write(line.encode('utf-8') + "\n\tnone\n")
continue
# generate
texts = {} # type: {text : probability}
texts[line] = 1
for _ in range(num_words):
texts_new = {}
for (text, prob) in texts.items():
if '<EOS>' in text: # stop prediction when <EOS> appear
texts_new[text] = prob
continue
# next word's probability distribution
predictions = inferer.infer(
input=str2ids(' '.join(text.split()[-config.N:])))
predictions[-1][word_id_dict['<UNK>']] = -1 # filter <UNK>
# find next beam_size words
for _ in range(beam_size):
cur_maxProb_index = np.argmax(
predictions[-1]) # next word's id
text_new = text + ' ' + id_word_dict[
cur_maxProb_index] # text append nextWord
texts_new[text_new] = texts[text] * predictions[-1][
cur_maxProb_index]
predictions[-1][cur_maxProb_index] = -1
texts.clear()
if len(texts_new) <= beam_size:
texts = texts_new
else: # cutting
texts = dict(
sorted(
texts_new.items(), key=lambda d: d[1], reverse=True)
[:beam_size])
# save results to output file
output_f.write(line.encode('utf-8') + '\n')
for (sentence, prob) in texts.items():
output_f.write('\t' + sentence.encode('utf-8', 'replace') + '\t'
+ str(prob) + '\n')
output_f.write('\n')
output_f.close()
print('already saved results to ' + output_file)
def main():
# init paddle
paddle.init(use_gpu=use_gpu, trainer_count=trainer_count)
# prepare and cache vocab
if os.path.isfile(vocab_file):
word_id_dict = load_vocab(vocab_file) # load word dictionary
else:
if build_vocab_method == 'fixed_size':
word_id_dict = build_vocab_with_fixed_size(
train_file, vocab_max_size) # build vocab
else:
word_id_dict = build_vocab_using_threshhold(
train_file, unk_threshold) # build vocab
save_vocab(word_id_dict, vocab_file) # save vocab
# generate
if use_which_model == 'rnn':
generate_using_rnn(
word_id_dict=word_id_dict, num_words=num_words, beam_size=beam_size)
elif use_which_model == 'ngram':
generate_using_ngram(
word_id_dict=word_id_dict, num_words=num_words, beam_size=beam_size)
else:
raise Exception('use_which_model must be rnn or ngram!')
if __name__ == "__main__":
main()
# coding=utf-8
import paddle.v2 as paddle
def rnn_lm(vocab_size, emb_dim, rnn_type, hidden_size, num_layer):
"""
RNN language model definition.
:param vocab_size: size of vocab.
:param emb_dim: embedding vector's dimension.
:param rnn_type: the type of RNN cell.
:param hidden_size: number of unit.
:param num_layer: layer number.
:return: cost and output layer of model.
"""
assert emb_dim > 0 and hidden_size > 0 and vocab_size > 0 and num_layer > 0
# input layers
input = paddle.layer.data(
name="input", type=paddle.data_type.integer_value_sequence(vocab_size))
target = paddle.layer.data(
name="target", type=paddle.data_type.integer_value_sequence(vocab_size))
# embedding layer
input_emb = paddle.layer.embedding(input=input, size=emb_dim)
# rnn layer
if rnn_type == 'lstm':
rnn_cell = paddle.networks.simple_lstm(
input=input_emb, size=hidden_size)
for _ in range(num_layer - 1):
rnn_cell = paddle.networks.simple_lstm(
input=rnn_cell, size=hidden_size)
elif rnn_type == 'gru':
rnn_cell = paddle.networks.simple_gru(input=input_emb, size=hidden_size)
for _ in range(num_layer - 1):
rnn_cell = paddle.networks.simple_gru(
input=rnn_cell, size=hidden_size)
else:
raise Exception('rnn_type error!')
# fc(full connected) and output layer
output = paddle.layer.fc(
input=[rnn_cell], size=vocab_size, act=paddle.activation.Softmax())
# loss
cost = paddle.layer.classification_cost(input=output, label=target)
return cost, output
def ngram_lm(vocab_size, emb_dim, hidden_size, num_layer):
"""
N-Gram language model definition.
:param vocab_size: size of vocab.
:param emb_dim: embedding vector's dimension.
:param hidden_size: size of unit.
:param num_layer: layer number.
:return: cost and output layer of model.
"""
assert emb_dim > 0 and hidden_size > 0 and vocab_size > 0 and num_layer > 0
def wordemb(inlayer):
wordemb = paddle.layer.table_projection(
input=inlayer,
size=emb_dim,
param_attr=paddle.attr.Param(
name="_proj", initial_std=0.001, learning_rate=1, l2_rate=0))
return wordemb
# input layers
first_word = paddle.layer.data(
name="first_word", type=paddle.data_type.integer_value(vocab_size))
second_word = paddle.layer.data(
name="second_word", type=paddle.data_type.integer_value(vocab_size))
third_word = paddle.layer.data(
name="third_word", type=paddle.data_type.integer_value(vocab_size))
fourth_word = paddle.layer.data(
name="fourth_word", type=paddle.data_type.integer_value(vocab_size))
next_word = paddle.layer.data(
name="next_word", type=paddle.data_type.integer_value(vocab_size))
# embedding layer
first_emb = wordemb(first_word)
second_emb = wordemb(second_word)
third_emb = wordemb(third_word)
fourth_emb = wordemb(fourth_word)
context_emb = paddle.layer.concat(
input=[first_emb, second_emb, third_emb, fourth_emb])
# hidden layer
hidden = paddle.layer.fc(
input=context_emb, size=hidden_size, act=paddle.activation.Relu())
for _ in range(num_layer - 1):
hidden = paddle.layer.fc(
input=hidden, size=hidden_size, act=paddle.activation.Relu())
# fc(full connected) and output layer
predict_word = paddle.layer.fc(
input=[hidden], size=vocab_size, act=paddle.activation.Softmax())
# loss
cost = paddle.layer.classification_cost(input=predict_word, label=next_word)
return cost, predict_word
# coding=utf-8
import collections
import os
def rnn_reader(file_name, min_sentence_length, max_sentence_length,
word_id_dict):
"""
create reader for RNN, each line is a sample.
:param file_name: file name.
:param min_sentence_length: sentence's min length.
:param max_sentence_length: sentence's max length.
:param word_id_dict: vocab with content of '{word, id}', 'word' is string type , 'id' is int type.
:return: data reader.
"""
def reader():
UNK = word_id_dict['<UNK>']
with open(file_name) as file:
for line in file:
words = line.decode('utf-8', 'ignore').strip().split()
if len(words) < min_sentence_length or len(
words) > max_sentence_length:
continue
ids = [word_id_dict.get(w, UNK) for w in words]
ids.append(word_id_dict['<EOS>'])
target = ids[1:]
target.append(word_id_dict['<EOS>'])
yield ids[:], target[:]
return reader
def ngram_reader(file_name, N, word_id_dict):
"""
create reader for N-Gram.
:param file_name: file name.
:param N: N-Gram's N.
:param word_id_dict: vocab with content of '{word, id}', 'word' is string type , 'id' is int type.
:return: data reader.
"""
assert N >= 2
def reader():
ids = []
UNK_ID = word_id_dict['<UNK>']
cache_size = 10000000
with open(file_name) as file:
for line in file:
words = line.decode('utf-8', 'ignore').strip().split()
ids += [word_id_dict.get(w, UNK_ID) for w in words]
ids_len = len(ids)
if ids_len > cache_size: # output
for i in range(ids_len - N - 1):
yield tuple(ids[i:i + N])
ids = []
ids_len = len(ids)
for i in range(ids_len - N - 1):
yield tuple(ids[i:i + N])
return reader
# coding=utf-8
import sys
import paddle.v2 as paddle
import reader
from utils import *
import network_conf
import gzip
from config import *
def train(model_cost, train_reader, test_reader, model_file_name_prefix,
num_passes):
"""
train model.
:param model_cost: cost layer of the model to train.
:param train_reader: train data reader.
:param test_reader: test data reader.
:param model_file_name_prefix: model's prefix name.
:param num_passes: epoch.
:return:
"""
# init paddle
paddle.init(use_gpu=use_gpu, trainer_count=trainer_count)
# create parameters
parameters = paddle.parameters.create(model_cost)
# create optimizer
adam_optimizer = paddle.optimizer.Adam(
learning_rate=1e-3,
regularization=paddle.optimizer.L2Regularization(rate=1e-3),
model_average=paddle.optimizer.ModelAverage(
average_window=0.5, max_average_window=10000))
# create trainer
trainer = paddle.trainer.SGD(
cost=model_cost, parameters=parameters, update_equation=adam_optimizer)
# define event_handler callback
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print("\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics))
else:
sys.stdout.write('.')
sys.stdout.flush()
# save model each pass
if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=test_reader)
print("\nTest with Pass %d, %s" % (event.pass_id, result.metrics))
with gzip.open(
model_file_name_prefix + str(event.pass_id) + '.tar.gz',
'w') as f:
parameters.to_tar(f)
# start to train
print('start training...')
trainer.train(
reader=train_reader, event_handler=event_handler, num_passes=num_passes)
print("Training finished.")
def main():
# prepare vocab
print('prepare vocab...')
if build_vocab_method == 'fixed_size':
word_id_dict = build_vocab_with_fixed_size(
train_file, vocab_max_size) # build vocab
else:
word_id_dict = build_vocab_using_threshhold(
train_file, unk_threshold) # build vocab
save_vocab(word_id_dict, vocab_file) # save vocab
# init model and data reader
if use_which_model == 'rnn':
# init RNN model
print('prepare rnn model...')
config = Config_rnn()
cost, _ = network_conf.rnn_lm(
len(word_id_dict), config.emb_dim, config.rnn_type,
config.hidden_size, config.num_layer)
# init RNN data reader
train_reader = paddle.batch(
paddle.reader.shuffle(
reader.rnn_reader(train_file, min_sentence_length,
max_sentence_length, word_id_dict),
buf_size=65536),
batch_size=config.batch_size)
test_reader = paddle.batch(
paddle.reader.shuffle(
reader.rnn_reader(test_file, min_sentence_length,
max_sentence_length, word_id_dict),
buf_size=65536),
batch_size=config.batch_size)
elif use_which_model == 'ngram':
# init N-Gram model
print('prepare ngram model...')
config = Config_ngram()
assert config.N == 5
cost, _ = network_conf.ngram_lm(
vocab_size=len(word_id_dict),
emb_dim=config.emb_dim,
hidden_size=config.hidden_size,
num_layer=config.num_layer)
# init N-Gram data reader
train_reader = paddle.batch(
paddle.reader.shuffle(
reader.ngram_reader(train_file, config.N, word_id_dict),
buf_size=65536),
batch_size=config.batch_size)
test_reader = paddle.batch(
paddle.reader.shuffle(
reader.ngram_reader(test_file, config.N, word_id_dict),
buf_size=65536),
batch_size=config.batch_size)
else:
raise Exception('use_which_model must be rnn or ngram!')
# train model
train(
model_cost=cost,
train_reader=train_reader,
test_reader=test_reader,
model_file_name_prefix=config.model_file_name_prefix,
num_passes=config.num_passs)
if __name__ == "__main__":
main()
# coding=utf-8
import os
import collections
def save_vocab(word_id_dict, vocab_file_name):
"""
save vocab.
:param word_id_dict: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
:param vocab_file_name: vocab file name.
"""
f = open(vocab_file_name, 'w')
for (k, v) in word_id_dict.items():
f.write(k.encode('utf-8') + '\t' + str(v) + '\n')
print('save vocab to ' + vocab_file_name)
f.close()
def load_vocab(vocab_file_name):
"""
load vocab from file.
:param vocab_file_name: vocab file name.
:return: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
"""
assert os.path.isfile(vocab_file_name)
dict = {}
with open(vocab_file_name) as file:
for line in file:
if len(line) < 2:
continue
kv = line.decode('utf-8').strip().split('\t')
dict[kv[0]] = int(kv[1])
return dict
def build_vocab_using_threshhold(file_name, unk_threshold):
"""
build vacab using_<UNK> threshhold.
:param file_name:
:param unk_threshold: <UNK> threshhold.
:type unk_threshold: int.
:return: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
"""
counter = {}
with open(file_name) as file:
for line in file:
words = line.decode('utf-8', 'ignore').strip().split()
for word in words:
if word in counter:
counter[word] += 1
else:
counter[word] = 1
counter_new = {}
for (word, frequency) in counter.items():
if frequency >= unk_threshold:
counter_new[word] = frequency
counter.clear()
counter_new = sorted(counter_new.items(), key=lambda d: -d[1])
words = [word_frequency[0] for word_frequency in counter_new]
word_id_dict = dict(zip(words, range(2, len(words) + 2)))
word_id_dict['<UNK>'] = 0
word_id_dict['<EOS>'] = 1
return word_id_dict
def build_vocab_with_fixed_size(file_name, vocab_max_size):
"""
build vacab with assigned max size.
:param vocab_max_size: vocab's max size.
:return: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
"""
words = []
for line in open(file_name):
words += line.decode('utf-8', 'ignore').strip().split()
counter = collections.Counter(words)
counter = sorted(counter.items(), key=lambda x: -x[1])
if len(counter) > vocab_max_size:
counter = counter[:vocab_max_size]
words, counts = zip(*counter)
word_id_dict = dict(zip(words, range(2, len(words) + 2)))
word_id_dict['<UNK>'] = 0
word_id_dict['<EOS>'] = 1
return word_id_dict