diff --git a/language_model/README.md b/language_model/README.md
deleted file mode 100644
index 35350f10adea819ee2a26084ff7931ff82c87880..0000000000000000000000000000000000000000
--- a/language_model/README.md
+++ /dev/null
@@ -1,248 +0,0 @@
-# Language Model
-## Introduction
-A language model (LM) is a probability distribution over word sequences; put simply, it is a model that computes the probability of a sentence. Given a sentence (a sequence of words):
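-$$S = w_1, w_2, \ldots, w_T$$
-its probability can be expressed as:
-
-$$P(S) = P(w_1, w_2, \ldots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_1, \ldots, w_{i-1}) \qquad \text{(Eq. 1)}$$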
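As a minimal sketch of how a sentence probability of this kind can be computed, the snippet below multiplies one conditional probability per word (the chain rule), working in log space to avoid numerical underflow. The toy sentence and every probability value are made-up assumptions for illustration; they do not come from the models in this example.

```python
import math

# Hypothetical conditional probabilities P(w_i | w_1 .. w_{i-1}) for a toy
# sentence. The numbers are invented for illustration only.
cond_probs = [
    ("the", 0.20),  # P(the)
    ("cat", 0.05),  # P(cat | the)
    ("sat", 0.10),  # P(sat | the cat)
]


def sentence_log_prob(word_probs):
    """Sum the log-probabilities of each word: the chain rule in log space."""
    return sum(math.log(p) for _, p in word_probs)


log_p = sentence_log_prob(cond_probs)
sentence_prob = math.exp(log_p)
print("log P(S) = %.4f, P(S) = %.6f" % (log_p, sentence_prob))
```

Summing logarithms rather than multiplying raw probabilities is a common design choice here, because the product of many small probabilities quickly underflows for long sentences.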
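-
-A language model can compute P(S) in (Eq. 1) as well as its intermediate results. **It can therefore determine which of several word sequences is more likely or, given a few words, predict the most likely next word.**
-
-## Applications
-**Language models are used in many areas**, for example:
-
-* **Automatic writing**: a language model can generate the next word from the preceding text; applying this recursively produces whole sentences, paragraphs, and documents.
-* **QA**: a language model can generate an answer from a question.
-* **Machine translation**: most mainstream machine translation models today are based on the Encoder-Decoder pattern, in which the decoder is a language model that generates the target language.
-* **Spell checking**: a language model computes the probability of a word sequence; the probability usually drops sharply at a misspelling, which can be used to detect spelling errors and propose correction candidates.
-* **Part-of-speech tagging, syntactic parsing, speech recognition, ...**
-
-## About this example
-Common ways to implement a language model include N-Gram, RNN, and seq2seq. This example implements language models based on N-Gram and RNN. **The files in this example are organized as follows**:
-
-* data_util.py: reads the corpus and builds, saves, and loads the vocabulary.
-* lm_rnn.py: defines, trains, and runs prediction with the RNN-based language model.
-* lm_ngram.py: defines, trains, and runs prediction with the n-gram-based language model.
-
-**Note:** *N-Gram language models generally perform worse than RNN language models, so the RNN-based model is recommended in practice. This example therefore focuses on the RNN-based model and covers the N-Gram model only briefly.*
-
-## RNN language model
-### Introduction
-
-An RNN is a sequence model. The basic idea is: at time step t, the hidden-layer output $h_{t-1}$ of the previous step t-1 and the word vector $x_t$ of step t are fed into the hidden layer together to obtain the feature representation $h_t$ of step t; this representation is then used to produce the predicted output ŷ at step t. The process recurses along the time dimension, as shown in the figure below:
-
-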
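-
-As the figure shows, an RNN is good at using preceding context and historical information, so it has a kind of "memory". In theory an RNN can capture long-range dependencies (that is, use information from far back in the sequence), but in practice the results are unsatisfactory. Many RNN variants were developed in response, such as the widely used LSTM and GRU, which improve on the cell of the traditional RNN to compensate for its shortcomings. The figure below illustrates an LSTM:
-
-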
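The plain RNN recurrence described in this introduction can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the code in lm_rnn.py: the layer sizes, the tanh activation, the random untrained weights, and the function name `rnn_forward` are all assumptions, and it shows the simple RNN cell rather than LSTM or GRU.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
vocab_size, emb_dim, hidden_size = 10, 4, 6

# Randomly initialised (untrained) parameters.
embedding = rng.normal(scale=0.1, size=(vocab_size, emb_dim))
w_xh = rng.normal(scale=0.1, size=(emb_dim, hidden_size))      # input -> hidden
w_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
w_hy = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # hidden -> output


def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()


def rnn_forward(word_ids):
    """Run a simple RNN over a sequence of word ids.

    At each step t the previous hidden state h_{t-1} and the word vector x_t
    produce h_t, from which a distribution over the next word is predicted.
    """
    h = np.zeros(hidden_size)
    outputs = []
    for word_id in word_ids:
        x_t = embedding[word_id]
        h = np.tanh(x_t @ w_xh + h @ w_hh)
        outputs.append(softmax(h @ w_hy))
    return np.array(outputs), h


y_hat, h_last = rnn_forward([3, 1, 4])
print(y_hat.shape)  # one next-word distribution per input position
```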
-
-This example uses LSTM and GRU.
-
-### Model structure
-
-The lm() function in lm_rnn.py defines the model structure, as explained below:
-
-* 1. First, the model parameters are defined in \_\_main\_\_.
-
-    ```python
-    # -- config : model --
-    rnn_type = 'gru'  # or 'lstm'
-    emb_dim = 200
-    hidden_size = 200
-    num_passs = 2
-    num_layer = 2
-
-    ```
-    Here rnn\_type selects the RNN cell type and can be 'lstm' or 'gru'; hidden\_size sets the number of units; num\_layer sets the number of RNN layers; num\_passs sets the number of training passes; emb\_dim sets the embedding dimension.
-
-* 2. Map the input word (or character) sequence to vectors, i.e. embedding.
-
-    ```python
-    data = paddle.layer.data(name="word", type=paddle.data_type.integer_value_sequence(vocab_size))
-    target = paddle.layer.data("label", paddle.data_type.integer_value_sequence(vocab_size))
-    emb = paddle.layer.embedding(input=data, size=emb_dim)
-
-    ```
-* 3. Build the RNN layers according to the configuration, taking the embedding vector sequence from the previous step as input.
-
-    ```python
-    if rnn_type == 'lstm':
-        rnn_cell = paddle.networks.simple_lstm(
-            input=emb, size=hidden_size)
-        for _ in range(num_layer - 1):
-            rnn_cell = paddle.networks.simple_lstm(
-                input=rnn_cell, size=hidden_size)
-    elif rnn_type == 'gru':
-        rnn_cell = paddle.networks.simple_gru(
-            input=emb, size=hidden_size)
-        for _ in range(num_layer - 1):
-            rnn_cell = paddle.networks.simple_gru(
-                input=rnn_cell, size=hidden_size)
-    ```
-* 4. Build the output layer (softmax normalization gives the probability of each word, and the output is returned) and define the model cost (multi-class cross-entropy loss).
-
-    ```python
-    # fc and output layer
-    output = paddle.layer.fc(input=[rnn_cell], size=vocab_size, act=paddle.activation.Softmax())
-
-    # loss
-    cost = paddle.layer.classification_cost(input=output, label=target)
-    ```
-
-### Training the model
-
-The train() method in lm\_rnn.py implements model training. The steps are:
-
-* 1. Prepare the input data: this example uses the standard PTB data. build\_vocab() in data\_util.py builds the vocabulary, and save\_vocab() persists it for reuse (building the vocabulary is slow on a large corpus, so the vocabulary generated the first time is saved and reused). train\_data() and test\_data() in data\_util.py then create train\_reader and test\_reader, which read the training and test data.
-
-* 2. Initialize the model: the model structure, parameters, optimizer (the demo uses Adam), and trainer, as follows:
-
-    ```python
-    # network config
-    cost, _ = lm(len(word_id_dict), emb_dim, rnn_type, hidden_size, num_layer)
-
-    # create parameters
-    parameters = paddle.parameters.create(cost)
-
-    # create optimizer
-    adam_optimizer = paddle.optimizer.Adam(
-        learning_rate=1e-3,
-        regularization=paddle.optimizer.L2Regularization(rate=1e-3),
-        model_average=paddle.optimizer.ModelAverage(average_window=0.5))
-
-    # create trainer
-    trainer = paddle.trainer.SGD(
-        cost=cost, parameters=parameters, update_equation=adam_optimizer)
-
-    ```
-
-* 3. Define the event_handler callback to track the loss during training and save the model parameters at the end of each pass:
-
-    ```python
-    # define event_handler callback
-    def event_handler(event):
-        if isinstance(event, paddle.event.EndIteration):
-            if event.batch_id % 100 == 0:
-                print("\nPass %d, Batch %d, Cost %f, %s" % (
-                    event.pass_id, event.batch_id, event.cost,
-                    event.metrics))
-            else:
-                sys.stdout.write('.')
-                sys.stdout.flush()
-
-        # save model each pass
-        if isinstance(event, paddle.event.EndPass):
-            result = trainer.test(reader=ptb_reader)
-            print("\nTest with Pass %d, %s" % (event.pass_id, result.metrics))
-            with gzip.open(model_file_name_prefix + str(event.pass_id) + '.tar.gz', 'w') as f:
-                parameters.to_tar(f)
-    ```
-
-* 4. Start training the model:
-
-    ```python
-    trainer.train(
-        reader=ptb_reader, event_handler=event_handler, num_passes=num_passs)
-    ```
-
-### Generating text
-The predict() method in lm\_rnn.py runs prediction and generates text. The steps are:
-
-* 1. First load and cache the vocabulary and the model. The trained model parameters are loaded as follows:
-
-    ```python
-    parameters = paddle.parameters.Parameters.from_tar(gzip.open(model_file_name))
-    ```
-
-* 2. Generate text. This example generates text with beam search, a heuristic graph search algorithm, implemented in \_generate\_with\_beamSearch() in lm\_rnn.py.
-
-### Using this demo
-
-This example uses the standard PTB data. To train a model on your own data, only the following adaptations are needed:
-
-#### Adapting the corpus
-* Clean the corpus: remove extra spaces, tabs, and garbled characters; remove digits, punctuation, and special symbols as needed.
-* Encoding: utf-8. This example has already been adapted for Chinese.
-* Content format: one sentence per line, with the words on each line separated by a single space.
-* Adjust the data configuration in the \_\_main\_\_ function of lm\_rnn.py as needed:
-
-    ```python
-    # -- config : data --
-    train_file = 'data/ptb.train.txt'
-    test_file = 'data/ptb.test.txt'
-    vocab_file = 'data/vocab_cn.txt'  # the file to save vocab
-    vocab_max_size = 3000
-    min_sentence_length = 3
-    max_sentence_length = 60
-
-    ```
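-    Here vocab\_max\_size defines the maximum size of the vocabulary. If the corpus contains more distinct words than this value, the words are sorted by frequency in descending order and the top vocab\_max\_size words are kept in the vocabulary.
-
-    *Note: a larger vocabulary yields richer generated text but takes longer to train. After Chinese word segmentation, a corpus typically contains tens of thousands to hundreds of thousands of distinct words. If vocab\_max\_size is too small, the proportion of unknown (out-of-vocabulary) tokens becomes too high; if it is large, training slows down severely (and accuracy is affected as well). For this reason models are sometimes trained "by character": each Chinese character is treated as a word. There are only a few thousand common Chinese characters, so the vocabulary stays small without losing much information. However, the same character can carry very different meanings in different words, which sometimes hurts model quality. Try both and choose between word-level and character-level training according to your situation.*
-
-#### Adapting the model
-
-Adjust the parameters defined in the model's \_\_main\_\_ according to the size of the corpus.
-
-Then run python lm_rnn.py to train the model and run prediction.
-
-## n-gram language model
-
-
-
-An n-gram model is also called an (n-1)-th order Markov model. It makes a limited-history assumption: the probability of the current word depends only on the preceding n-1 words. For a sentence of T words $w_1, \ldots, w_T$, (Eq. 1) can therefore be approximated as:
-$$P(S) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$$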
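-The model parameters are usually estimated by maximum likelihood estimation (MLE). For n = 1, 2, and 3, the n-gram model is called a unigram, bigram, and trigram language model, respectively. In general, with a larger n and a larger training corpus, the parameter estimates become more reliable. However, because the model is simple, has limited expressive power, and suffers from data sparsity, a language model built with n-grams generally performs worse than RNN or seq2seq models.
-
-### Model structure
-
-lm() in lm\_ngram.py defines the model structure, which is roughly as follows:
-
-* 1. The demo uses n = 5: each of the preceding four words is embedded, and the embeddings are concatenated into a feature vector.
-* 2. This is followed by the hidden layers of a DNN.
-* 3. The DNN output is passed through a softmax layer for classification, giving the probability distribution of the next word over the vocabulary.
-* 4. The model loss is cross-entropy, optimized with the Adam optimizer.
-
-The structure is illustrated below:
-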
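The four structural steps listed above can be sketched as a single forward pass in NumPy. This is a rough illustration under stated assumptions, not the lm() function from lm\_ngram.py: the layer sizes, the single tanh hidden layer, the omission of bias terms, the random untrained weights, and the names `next_word_distribution` and `cross_entropy` are all hypothetical. The Adam update of step 4 is left out; only the loss is computed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; n = 5 means 4 context words predict the 5th.
vocab_size, emb_dim, hidden_size, context = 10, 4, 8, 4

# Randomly initialised (untrained) parameters; biases omitted for brevity.
embedding = rng.normal(size=(vocab_size, emb_dim))
w_hidden = rng.normal(size=(context * emb_dim, hidden_size))
w_out = rng.normal(size=(hidden_size, vocab_size))


def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()


def next_word_distribution(context_ids):
    # Step 1: embed each context word and concatenate into one feature vector.
    features = embedding[context_ids].reshape(-1)
    # Step 2: hidden layer (a single tanh layer is assumed here).
    hidden = np.tanh(features @ w_hidden)
    # Step 3: softmax over the vocabulary gives the next-word distribution.
    return softmax(hidden @ w_out)


def cross_entropy(probs, target_id):
    # Step 4 (loss only): negative log-probability of the true next word.
    return -np.log(probs[target_id])


probs = next_word_distribution([1, 2, 3, 4])
loss = cross_entropy(probs, target_id=5)
print(probs.shape, float(loss))
```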
-
-### Training the model
-
-The train() method in lm\_ngram.py implements model training. The process is similar to that of the RNN LM and is summarized below:
-
-* 1. Prepare the input data: the standard PTB data is used. build\_vocab() in data\_util.py builds the vocabulary and save\_vocab() persists it; train\_data() and test\_data() in data\_util.py create train\_reader and test\_reader, which read the training and test data.
-* 2. Initialize the model: the model structure, parameters, optimizer (the demo uses Adam), and trainer.
-* 3. Define the event_handler callback to track the loss during training and save the model parameters at the end of each pass.
-* 4. Start training the model with the trainer.
-
-### Generating text
-The \_\_main\_\_ section of lm\_ngram.py contains a simple implementation of prediction (text generation). The steps are:
-
-* 1. First load the vocabulary and the model:
-
-    ```python
-    # prepare model
-    word_id_dict = reader.load_vocab(vocab_file)  # load word dictionary
-    _, output_layer = lm(len(word_id_dict), emb_dim, hidden_size, num_layer)  # network config
-    model_file_name = model_file_name_prefix + str(num_passs - 1) + '.tar.gz'
-    parameters = paddle.parameters.Parameters.from_tar(gzip.open(model_file_name))  # load parameters
-    ```
-
-* 2. Predict the next word from the preceding 4 (n-1) words and print it:
-
-    ```python
-    # generate
-    text = 'the end of the'  # use 4 words to predict the 5th word
-    input = [[word_id_dict.get(w, word_id_dict['']) for w in text.split()]]
-    predictions = paddle.infer(
-        output_layer=output_layer,
-        parameters=parameters,
-        input=input,
-        field=['value'])
-    id_word_dict = dict([(v, k) for k, v in word_id_dict.items()])  # dictionary with type {id : word}
-    predictions[-1][word_id_dict['']] = -1  # filter
-    next_word = id_word_dict[np.argmax(predictions[-1])]
-    print(next_word.encode('utf-8'))
-    ```
-
-    *Note: this shows another way to run prediction, namely the paddle.infer method. The RNN example uses the paddle.inference.Inference interface.*
\ No newline at end of file
diff --git a/language_model/data/vocab_ptb.txt b/language_model/data/vocab_ptb.txt
new file mode 100644
index 0000000000000000000000000000000000000000..7b03f36e5f1fd31c97612e3e7e0f2c544ef2bfaa
--- /dev/null
+++ b/language_model/data/vocab_ptb.txt
@@ -0,0 +1,3002 @@
+limited 791 +consolidated 2482 +four 347 +facilities 1200 +asian 2798 +controversial 2177 +whose 623 +votes 2089 +founder 2229 +paris 1721 +adviser 1759 +edward 2090 +voted 1935 +under 125 +worth 977 +placed 1677 
+merchant 2565 +pact 2130 +risk 647 +rise 498 +sellers 2851 +handling 2476 +every 539 +jack 1722 +reforms 2309 +affect 1968 +bringing 2469 +lehman 1238 +believed 1542 +school 722 +calif 2386 +companies 102 +wednesday 910 +van 2897 +announced 412 +pilson 2915 +expanded 2427 +force 534 +leaders 818 +miller 2247 +guidelines 1795 +estimates 784 +japanese 174 +elections 1720 +second 335 +street 323 +estimated 453 +machines 753 +even 114 +established 1755 +disk 2826 +pace 1888 +panama 1852 +contributed 1263 +nec 2551 +asia 2310 +spokesman 301 +above 626 +dr. 982 +new 36 +net 136 +increasing 987 +ever 823 +seeks 2454 +told 549 +specialist 2519 +never 575 +here 348 +hundreds 1689 +reported 221 +protection 955 +china 638 +brooks 2353 +active 1027 +balance 1687 +auction 968 +items 1470 +employees 457 +climbed 1323 +reports 658 +credit 355 +analysts 166 +chrysler 1969 +military 756 +poverty 2838 +changes 515 +criticism 2288 +golden 1750 +campaign 879 +reagan 1195 +peabody 2432 +highly 1329 +brought 1130 +opportunities 2661 +total 344 +unit 168 +swings 2036 +would 43 +army 2135 +hospital 1833 +m. 
1664 +negative 1330 +noting 2958 +call 787 +asset 1302 +strike 1106 +type 2136 +until 315 +b.a.t 1873 +hahn 2992 +supporters 2678 +composite 431 +hurt 1005 +phone 1690 +berlin 2737 +hold 838 +must 405 +me 812 +word 2057 +room 1505 +rights 535 +pursue 2248 +work 222 +plunged 1604 +movies 1970 +henry 2960 +already 294 +merely 2444 +revenues 2521 +my 406 +example 469 +wang 1958 +estate 460 +give 438 +cited 1376 +india 2595 +involve 2801 +currency 820 +foods 2389 +woman 1128 +caution 2981 +ual 409 +want 358 +drive 1524 +times 421 +attract 2236 +totaled 1297 +guarantee 1982 +end 237 +recovery 1654 +turn 837 +provide 577 +travel 1487 +damage 516 +machine 1042 +how 243 +hot 2766 +interview 1220 +widespread 2410 +resignation 2178 +badly 2455 +regional 1486 +minority 1398 +lufkin 2979 +after 79 +damaged 1608 +modest 1441 +president 72 +mesa 2754 +law 279 +types 2203 +las 2785 +purchase 496 +attempt 1036 +third 277 +amid 1646 +headquarters 1203 +maintain 1346 +green 1997 +suggest 1894 +democratic 961 +order 529 +ec 2125 +wine 2831 +operations 223 +senators 1994 +office 284 +over 95 +expects 407 +london 410 +japan 203 +mayor 1819 +before 158 +fit 2767 +personal 629 +expectations 1418 +better 502 +production 360 +weeks 422 +easier 2329 +damages 2063 +then 224 +dec. 932 +affected 1709 +combination 2802 +lambert 1987 +weakness 1827 +safe 1998 +break 1784 +effects 1670 +they 39 +schools 1632 +silver 1988 +bank 105 +structural 1989 +represents 1651 +30-year 1456 +detroit 2961 +affiliate 2357 +victory 2820 +reasonable 2852 +each 216 +went 789 +side 1078 +bond 272 +financial 142 +suspended 1944 +fairly 1953 +series 442 +carolina 2099 +carry 1536 +currencies 2403 +trading 77 +impossible 2487 +substantially 1777 +temporary 1907 +saturday 2070 +burnham 1880 +t. 2213 +network 673 +crucial 2935 +tomorrow 1588 +semiconductor 2204 +encourage 2422 +daniel 2483 +got 596 +newly 2193 +millions 2159 +sluggish 2609 +gop 2456 +foundation 2583 +sept. 
628 +turning 2442 +written 2032 +veto 1052 +u.s. 54 +threatened 2522 +little 327 +free 813 +standard 738 +estimate 884 +wanted 1030 +enormous 2291 +created 1053 +days 172 +pence 1347 +oppose 2993 +1970s 2205 +uses 1851 +r. 1170 +industrial 439 +suspension 2903 +economists 1119 +primary 1723 +hearing 1803 +adopted 2224 +another 206 +electronic 1284 + 0 +rated 1609 +service 337 +top 525 +approximately 2994 +needed 911 +rates 173 +too 307 +percentage 761 +john 401 +ranging 2708 +urban 1922 +ceiling 2109 +collapse 1633 +serve 2110 +took 454 +rejected 1392 +direct 1201 +western 650 +somewhat 2739 +shortly 2626 +toronto 2091 +renewed 2853 +target 1313 +showed 988 +likely 372 +nations 1606 +project 802 +matter 1196 +greenspan 2542 +feeling 2982 +acquisition 470 +bridge 1481 +fashion 2411 +sees 2498 +ran 2457 +boston 652 +modern 2390 +mind 1714 +mine 2120 +talking 1923 +seen 920 +seem 1058 +seek 983 +relatively 1303 +forced 1172 +abroad 1909 +strength 1531 +concrete 2911 +responsible 1845 +sound 2100 +recommended 2898 +client 1786 +luxury 1999 +forces 1252 +unsecured 2433 +shipments 2292 +blue 2753 +nobody 2148 +philadelphia 2253 +though 329 +wells 1940 +involving 1571 +germany 705 +letter 1107 +competing 2791 +germans 2995 +consumers 1134 +antitrust 1990 +medical 765 +flow 1647 +competitors 1414 +points 395 +principle 2860 +after-tax 2451 +voting 2259 +consumer 582 +dow 448 +came 545 +reserve 717 +d. 
674 +saying 715 +meetings 1926 +ending 1555 +showing 1659 +radio 1665 +poison 2936 +hungary 2330 +judges 1853 +finally 1383 +proposed 499 +representing 2098 +delays 2567 +unemployment 2413 +sugar 1971 +rico 2293 +bush 257 +rich 1991 +announce 2624 +resulting 2712 +do 88 +exports 1002 +de 990 +stop 1147 +preferred 1009 +coast 1828 +lenders 2331 +despite 486 +report 238 +du 2723 +volatility 929 +hall 2517 +runs 1678 +jaguar 574 +countries 676 +fields 2186 +high-yield 2138 +bay 759 +twice 1983 +bad 890 +release 1700 +prudential-bache 2691 +mergers 2963 +secretary 718 +headed 1658 +disaster 1352 +fair 2161 +w. 1820 +testing 2938 +decided 1104 +result 411 +discussions 1963 +resigned 1234 +taiwan 2793 +best 598 +subject 966 +brazil 2803 +said 16 +capacity 1137 +away 601 +irs 1049 +compensation 2619 +machinists 2106 +pressures 2768 +future 508 +cooperation 2340 +approach 1402 +co. 96 +profitable 1629 +we 65 +men 944 +terms 520 +extend 2369 +nature 2162 +wo 418 +ask 1906 +handful 2916 +weak 1520 +however 229 +retirement 1427 +extent 2206 +news 311 +convertible 956 +debt 208 +improve 1153 +suggested 1943 +received 517 +protect 1865 +met 1317 +country 359 +over-the-counter 1241 +against 175 +players 1578 +else 1394 +supplies 1802 +games 2111 +planned 950 +faces 1772 +studio 2318 +argue 1992 +asked 613 +prospect 2424 +tough 1401 +appeared 2008 +royal 2725 +offerings 2272 +represented 2623 +tons 1068 +initiative 2804 +trust 458 +telecommunications 1821 +conference 881 +puts 1847 +basis 854 +union 319 +anc 2756 +three 133 +been 59 +quickly 914 +commission 475 +beer 2358 +much 122 +interest 132 +basic 2273 +expected 171 +entered 2139 +containers 2625 +life 391 +families 1881 +mci 2741 +eastern 734 +drugs 1387 +republicans 1666 +worker 2523 +mca 1972 +enterprises 2282 +child 2250 +ogilvy 2261 +worked 1370 +slowdown 1653 +applied 2484 +commerce 1204 +has 31 +publicly 1377 +air 452 +ventures 1715 +near 958 +appeals 1920 +aid 870 +property 810 +study 936 +launched 1610 +seven 913 
+changed 973 +metropolitan 2742 +mexico 1433 +is 14 +it 15 +expenses 1167 +ii 1917 +player 2524 +experts 1403 +world-wide 1393 +in 8 +victims 1468 +confident 2883 +turner 2588 +if 67 +grown 2391 +hong 794 +patent 2069 +things 607 +make 139 +linked 2895 +complex 1724 +split 1568 +several 249 +couple 1547 +european 568 +independent 829 +pick 2319 +hand 1054 +ownership 1443 +constitution 2696 +opportunity 1473 +kept 1785 +scenario 2918 +programs 561 +settled 1621 +savings 746 +materials 1318 +rey 2879 +mother 2320 +claims 595 +the 2 +corporate 353 +investments 744 +left 697 +quoted 1182 +yen 255 +mills 2718 +expanding 2640 +ideas 2966 +identify 2861 +human 1224 +campbell 2627 +yet 397 +previous 558 +adding 1344 +buyers 846 +hills 2375 +phillips 1895 +ease 1751 +had 51 +intends 2227 +spread 1437 +board 147 +easy 1540 +prison 2584 +east 519 +gave 1227 +municipal 1488 +possible 492 +possibly 2370 +buy-out 589 +judge 403 +replace 1986 +advanced 992 +desire 2649 +county 1223 +exxon 1691 +hunt 2459 +securities 126 +offices 1154 +officer 290 +night 1069 +security 648 +delmed 2689 +attorney 752 +right 382 +old 394 +deal 497 +people 109 +dead 2610 +consultants 1954 +donald 2460 +election 1725 +short-term 967 +specific 1228 +for 12 +bottom 2525 +comments 2302 +p.m. 
2163 +when 68 +continue 388 +denied 2027 +steps 1652 +christmas 2321 +core 1866 +marketing 610 +corn 1579 +conventional 1883 +discount 1179 +restructuring 654 +plc 957 +packages 2664 +losing 1810 +brokerage 857 +post 1096 +manufacturing 737 +properties 1298 +georgia-pacific 1660 +chapter 1457 +dollars 604 +months 127 +costs 271 +magazine 781 +plus 1726 +afternoon 1736 +efforts 702 +slightly 743 +nixon 2526 +raised 748 +managers 585 +publishing 1584 +formerly 2392 +facility 1541 +civil 1229 +maxwell 2471 +marshall 2901 +son 2423 +down 119 +explain 2680 +magazines 1444 +dean 2805 +reducing 2294 +defendants 1811 +crowd 2983 +support 415 +initial 1074 +legislation 803 +cosmetics 2659 +per-share 1891 +why 763 +joseph 1955 +editor 1716 +way 228 +resulted 2527 +music 2289 +was 25 +war 978 +interest-rate 2629 +head 642 +economics 2800 +form 945 +manufacturers 931 +becoming 2341 +differences 2322 +ford 480 +failure 1638 +heat 2239 +hear 2568 +syndicate 2939 +sustained 2964 +stand 1419 +true 1550 +analyst 365 +nov. 408 +counsel 2628 +inside 2207 +bids 1404 +maximum 2538 +devices 1672 +tell 1345 +jan. 2081 + 3 +stronger 1934 +one-third 2681 +evidence 1043 +promised 2597 +accounting 1410 +ship 2351 +program-trading 2844 +check 2323 +negotiations 1391 +regime 2822 +floor 1014 +phelan 2180 +stake 320 +generally 925 +credibility 2924 +successful 1698 +interested 1378 +role 795 +holding 477 +digital 1044 +test 895 +developers 2587 +bailout 2663 +roll 2332 +picture 2151 +'s 10 +brothers 1001 +delivered 2930 +models 1901 +surprise 2608 +felt 1918 +utilities 1717 +'d 1131 +invested 2295 +authorities 1892 +'m 827 +aware 2662 +weekend 1854 +died 2228 +jones 459 +reorganization 1411 +longer 921 +glass 2647 +assume 2312 +italy 1977 +connecticut 2959 +together 1190 +liquidity 1841 +premiums 2940 +time 104 +push 1812 +serious 1175 +profits 740 +concept 2786 +managed 1855 +chain 1273 +global 1551 +alternatives 2461 +focus 1247 +manager 611 +battle 1118 +creative 2919 +s.a. 
2146 +certainly 1532 +everything 1498 +father 2393 +environment 1692 +charge 527 +asking 2026 +e. 1514 +marks 754 +suffered 2187 +circumstances 2290 +division 540 +supported 2316 +mixte 1733 +keeping 1929 +choice 2164 +liability 1756 +drexel 875 +lynch 834 +10-year 2216 +join 2149 +trouble 1513 +corp. 66 +governments 2352 +level 462 +did 144 +turns 2434 +proposals 1737 +democrat 2862 +standards 1405 +leave 1235 +settle 1927 +team 769 +quick 2414 +speculation 1311 +round 2863 +lloyd 1509 +prevent 1371 +says 45 +trend 1458 +gasoline 2439 +telerate 2073 +sign 1215 +mich. 2823 +cost 363 +aggressive 2210 +adds 872 +appear 1299 +hewlett-packard 2082 +assistance 1965 +shares 71 +current 265 +goes 1321 +international 198 +falling 1752 +principal 1510 +boost 962 +filled 2443 +paribas 1557 +transportation 847 +genes 2262 +french 897 +agreement 332 +water 1092 +baseball 1804 +groups 723 +address 2834 +alone 1325 +along 703 +earthquake 427 +change 429 +wait 2555 +canadian 741 +institute 760 +shift 1738 +guilty 1623 +trial 1162 +usually 1026 +corp 587 +bob 2612 +navigation 1667 +retired 2598 +defensive 2769 +extra 2092 +lending 1710 +mobil 2920 +crisis 2071 +market 48 +everybody 2613 +indicated 1296 +working 670 +prove 2000 +positive 1778 +psyllium 2721 +visit 2275 +third-quarter 349 +france 1003 +live 1636 +opposed 1779 +stearns 2281 +memory 2513 +francs 1012 +australian 1773 +household 2249 +today 326 +club 2058 +apparent 2770 +fuel 2225 +cautious 2577 +downturn 2549 +cases 685 +effort 677 +behalf 2771 +fly 2683 +organizations 2599 +valued 1216 +ibm 690 +tokyo 606 +car 602 +abortion 886 +believes 1095 +districts 2884 +ms. 487 +values 1673 +can 90 +growing 776 +making 424 +interstate 1936 +newspapers 2376 +claim 1535 +citizens 2371 +figure 1187 +predict 2499 +december 924 +chip 2093 +agent 2276 +1980s 2123 +heard 2378 +dropped 557 +council 1334 +allowed 1141 +requirements 1562 +winter 2650 +secured 2965 +bonds 129 +chemical 720 +beat 2816 +sunday 2500 +s. 
2050 +fourth 915 +ensure 2283 +subsidiaries 2569 +economy 354 +product 521 +huge 828 +may 94 +southern 1277 +applications 2529 +membership 2140 +produce 979 +mae 1596 +designed 1098 +date 1326 +such 89 +data 451 +grow 1556 +man 898 +natural 1007 +johnson 1244 +maybe 1688 +futures 288 +borrowing 2864 +gap 2518 +so 106 +deposit 1856 +increase 244 +pulled 2473 +talk 1197 +typical 2600 +exclusive 2755 +no. 1656 +acts 2869 +seeing 2865 +sell-off 2772 +indeed 1010 +mainly 1871 +consulting 1611 +years 73 +ended 328 +experiments 2835 +cuts 1287 +argued 2133 +statements 2585 +cold 2263 +still 148 +stock-index 974 +group 97 +monitor 2941 +procedures 2734 +presence 2238 +troubles 2474 +forms 2726 +offers 1338 +policy 374 +mail 1617 +main 1112 +decades 2985 +texas 484 +happened 2152 +finance 573 +views 2141 +introduce 2539 +nation 507 +records 2023 +half 389 +not 64 +now 100 +provision 1207 +discuss 2124 +nor 1412 +term 1080 +attorneys 1857 +name 819 +january 1135 +drop 471 +rock 2866 +quarter 110 +el 2727 +square 2083 +significantly 1775 +latin 2827 +revised 1582 +s&p 835 +begun 2360 +year 41 +happen 2048 +worried 2084 +tried 1590 +canada 730 +living 2033 +shown 2028 +inventories 2418 +opened 1239 +space 971 +profit 178 +factory 2042 +looking 735 +investigation 1339 +indicating 2991 +shows 1133 +exactly 2675 +earlier 134 +theory 2570 +cars 701 +million 22 +incentives 2799 +possibility 2251 +quite 1583 +california 296 +besides 2709 +obligation 2264 +marine 2787 +card 2303 +care 858 +advance 1639 +training 1993 +language 2142 +ministry 1245 +discussing 2885 +wrong 1780 +british 304 +thing 899 +place 666 +massive 2441 +promotion 2644 +think 316 +first 75 +merrill 814 +revenue 187 +one 55 +opec 2552 +americans 1018 +one-time 2049 +directly 1085 +vote 839 +corporations 1640 +message 2601 +fight 1447 +open 671 +george 1031 +size 1070 +city 286 +given 732 +sheet 2998 +district 806 +caught 2436 +trillion 1744 +plastic 2571 +anyone 1453 +indicate 2165 +returns 984 +white 445 +friend 
2788 +gives 1499 +hud 2043 +acquisitions 1491 +mining 2037 +mostly 1425 +that 11 +pittsburgh 2343 +season 1580 +moscow 1495 +alan 1745 +released 1625 +specialists 2166 +surged 1730 +than 56 +population 2507 +wide 1834 +television 758 +effective 1140 +rival 2117 +require 1214 +spokeswoman 1046 +officials 161 +venture 816 +were 47 +published 2398 +and 9 +mountain 2355 +san 305 +investors 117 +remained 1849 +turned 951 +argument 2774 +say 118 +plunge 1232 +allen 2836 +sells 1626 +saw 1463 +any 107 +accounted 2722 +offering 357 +regular 2035 +efficient 2743 +offer 209 +aside 2530 +note 952 +equipment 550 +mr. 24 +potential 649 +take 210 +performance 766 +wonder 2744 +registered 2783 +channel 2556 +begin 852 +sure 1138 +normal 1796 +track 2274 +price 116 +enter 2882 +paid 490 +icahn 2736 +nomura 2997 +america 541 +pages 1957 +honecker 2531 +manville 1637 +operate 1890 +especially 860 +surprising 2492 +payable 2694 +considered 994 +average 197 +later 467 +steady 2472 +sale 227 +federal 101 +professional 1757 +senior 446 +mass. 1701 +typically 1449 +filing 963 +laws 1489 +shop 2682 +rating 1384 +shot 2684 +surplus 2775 +show 465 +german 770 +delta 2806 +allegations 2508 +commitments 2018 +discovered 2059 +rep. 824 +soviets 1396 +fifth 2728 +ground 1840 +slow 1188 +ratio 2304 +gulf 1482 +title 2126 +daily 859 +enough 543 +crime 1563 +only 87 +going 325 +black 755 +treasury 282 +thompson 2528 +watching 2974 +congressional 912 +dispute 1702 +get 188 +contracts 646 +assistant 1634 +employers 2445 +nearly 441 +secondary 2395 +prime 953 +regarding 2630 +yield 236 +morning 1211 +miles 1357 +predicted 1693 +scott 2906 +where 252 +husband 2666 +salomon 1278 +declared 2009 +corry 2710 +committed 2266 +seat 2685 +elected 1240 +j. 
891 +college 1500 +stanley 1348 +concern 289 +mortgage 583 +farmers 1202 +ways 1139 +jumped 1008 +review 1282 +representatives 1858 +forecast 1474 +weapons 2886 +outside 849 +bureau 1464 +between 179 +import 2179 +reading 2557 +across 1177 +jobs 1142 +august 398 +parent 731 +blame 2667 +article 1055 +cities 1806 +come 435 +reaction 2112 +acquiring 2240 +many 98 +trader 1281 +trades 1246 +according 215 +contract 317 +prompted 2486 +buy-back 2668 +senator 2967 +holders 675 +traded 926 +comes 1097 +among 212 +cancer 1019 +color 2652 +roman 2639 +period 341 +insist 2631 +confirmed 1964 +learning 2789 +moreover 1208 +poll 2824 +two-year 2324 +considering 1434 +save 1973 +unusual 1739 +west 380 +airlines 699 +mark 1061 +hutton 1340 +combined 1545 +hardly 2825 +mary 2942 +disclosed 871 +wants 821 +direction 1984 +shopping 1974 +offered 494 +formed 2252 +observers 2437 +wake 2010 +minister 1047 +former 338 +those 151 +pilot 2509 +case 308 +developing 1589 +these 145 +consultant 1373 +cash 268 +n't 33 +warning 2366 +policies 1312 +newspaper 1148 +situation 942 +shops 2632 +margin 1797 +region 1648 +eventually 1415 +metric 2265 +health-care 2218 +engaged 2837 +telephone 733 +quiet 2757 +middle 2686 +someone 1362 +attributed 1400 +technology 503 +worry 2001 +par 1164 +develop 1285 +pay 256 +same 313 +dealer 2586 +speech 2396 +grain 2200 +insurers 1668 +events 1483 +week 124 +buy-outs 2926 +oil 269 +singapore 2085 +boosted 1813 +drives 2072 +producers 855 +running 1122 +harris 2711 +intended 1951 +changing 2510 +anticipated 2344 +complained 2540 +costa 2211 +theater 2669 +largely 892 +charges 593 +no 103 +constitutional 2305 +roughly 1597 +mortgages 1537 +severe 2541 +without 350 +relief 1746 +model 1807 +researchers 1248 +charged 1657 +summer 976 +asset-backed 2687 +being 214 +money 162 +rest 1115 +kill 2633 +speed 1787 +weekly 2052 +announcement 1040 +death 1237 +rose 120 +seems 900 +except 1627 +improvement 1467 +westinghouse 2968 +setting 2150 +bloc 2634 +treatment 1355 
+plenty 2962 +tuesday 474 +ross 2196 +scheduled 832 +negotiating 2017 +around 385 +read 1331 +papers 2695 +virginia 2698 +early 267 +inflation 620 +traffic 1703 +using 665 +accepted 1747 +ruled 2379 +intel 2167 +nissan 1271 +rivals 2616 +'ve 578 +annually 1758 +chamber 2615 +benefit 1279 +either 707 +retailers 2314 +fully 1145 +output 1450 +tower 2416 +reduced 901 +nikkei 2635 +competition 874 +loyalty 2970 +bigger 1869 +thinks 2086 +provided 1294 +earth 2821 +operators 2638 +recorded 2489 +legal 656 +conservative 1280 +critical 2101 +deficit 822 +provides 1304 +newport 2943 +moderate 2446 +football 2359 +assembly 2479 +scientific 2118 +power 339 +airways 2807 +equivalent 2428 +broker 1862 +broken 2808 +leadership 1902 +aide 2738 +manufacturer 2188 +on 17 +central 877 +package 1184 +of 5 +industry 154 +thousands 1567 +fell 204 +airline 767 +sachs 1679 +act 757 +mixed 1764 +mean 1465 +or 37 +confidence 1818 +tape 2658 +barrels 1850 +outlook 2219 +coupon 1661 +instruments 1867 +image 1341 +accounts 907 +determine 2636 +parties 1546 +operator 2127 +your 483 +pharmaceutical 1945 +fast 2512 +her 200 +area 449 +there 84 +alleged 1601 +start 903 +appears 1314 +low 615 +lot 580 +valley 1272 +billion 49 +complete 1490 +saatchi 2087 +delayed 2782 +sophisticated 2501 +brain 2975 +succeeded 2913 +two-thirds 2603 +technologies 2212 +trying 547 +with 23 +buying 361 +faster 2362 +volume 396 +october 522 +circulation 2923 +sears 1719 +default 2380 +wholesale 2699 +agree 1842 +strongly 2759 +gone 1843 +vehicles 1576 +ad 695 +ag 1674 +certain 426 +totaling 2637 +moved 1266 +sales 82 +deep 2945 +an 32 +cbs 1023 +britain 774 +at 19 +file 2012 +aids 2137 +politics 2113 +moves 1103 +film 1475 +fill 2946 +again 556 +consensus 2713 +personnel 2103 +storage 2490 +event 1938 +field 1105 +you 111 +poor 888 +a$ 2589 +congress 287 +separate 1127 +students 1358 +a. 943 +n.j. 
1848 +important 590 +massachusetts 2537 +coverage 2002 +planners 2653 +brands 1335 +stocks 150 +building 482 +assets 297 +calls 724 +wife 1985 +invest 2114 +having 716 +directors 751 +mass 2654 +overseas 1060 +starting 1801 +original 1525 +represent 2013 +all 74 +sci 1781 +consider 930 +chinese 1564 +caused 989 +lack 1439 +dollar 333 +month 169 +mccaw 1947 +talks 711 +follow 2019 +settlement 917 +decisions 1868 +children 1028 +causes 2904 +reluctant 2947 +tv 659 +thursday 939 +shall 2832 +to 6 +program 157 +spain 2480 +health 468 +lawmakers 1388 +activities 1230 +calif. 850 +premium 1257 +returned 2053 +divisions 2954 +very 253 +resistance 2285 +worst 2325 +decide 1882 +fall 619 +sony 1155 +difference 1574 +condition 2119 +cable 1264 +louis 2128 +list 1407 +joined 2024 +large 381 +circuit 2199 +small 300 +webster 2984 +past 225 +rate 159 +arizona 1599 +design 1521 +lawyer 1359 +pass 2168 +nbc 1808 +further 393 +investment 146 +what 115 +abc 2189 +richard 867 +investing 2181 +sun 1581 +section 1236 +resume 2497 +brief 2700 + 1 +noriega 1006 +version 1349 +scientists 1466 +certificates 1822 +learned 2604 +public 275 +contrast 1928 +movement 2038 +turmoil 2730 +full 632 +editorial 2488 +answers 2854 +hours 887 +citicorp 1995 +operating 299 +excess 2094 +november 1445 +strong 404 +thrift 1077 +publisher 2194 +prosecutors 1598 +ahead 970 +extraordinary 2147 +losses 392 +experience 1526 +prior 1874 +amount 542 +advertising 627 +social 1342 +action 588 +narrow 2688 +options 657 +via 645 +followed 1159 +family 617 +requiring 2855 +africa 1946 +thatcher 1533 +put 336 +aimed 1516 +establish 2559 +donaldson 2809 +shareholders 591 +eye 2190 +takes 1254 +petroleum 1422 +two 76 +generate 2760 +taken 640 +markets 191 +minor 1896 +more 46 +flat 1231 +israel 2241 +door 2054 +knows 2326 +fast-food 2543 +jr. 
1209 +company 38 +broke 2856 +particular 1476 +known 709 +producing 1884 +town 1741 +jim 2773 +none 1919 +lilly 2874 +hour 1477 +science 2905 +des 2746 +remain 706 +sudden 2558 +nine 511 +sent 998 +morgan 905 +strategies 2230 +history 964 +purchases 1336 +processing 2306 +brown 1448 +pont 2888 +share 63 +accept 1527 +states 599 +pushed 2221 +minimum 1079 +numbers 1618 +purchased 1662 +sense 1116 +sharp 1123 +f. 2095 +information 532 +needs 1198 +answer 2440 +court 213 +advantage 1924 +rather 644 +hugo 1065 +conducted 2810 +earnings 137 +portfolios 2548 +plant 402 +plans 232 +advice 2747 +different 772 +reflect 1897 +fe 2003 +coming 762 +response 1124 +a 7 +short 694 +brady 2400 +departure 2889 +coal 2354 +broadcasting 1528 +responsibility 2068 +media 1034 +banks 248 +egg 2602 +playing 2447 +turnover 2701 +played 1839 +help 334 +september 312 +developed 1260 +soon 641 +trade 220 +held 417 +paper 399 +through 149 +committee 373 +signs 1694 +suffer 2948 +its 28 +developer 2969 +style 2074 +rapidly 2214 +actually 1071 +late 292 +systems 419 +conn. 
2051 +stephen 1898 +inquiry 2999 +might 280 +tentatively 2907 +good 264 +return 504 +seeking 817 +food 622 +reflected 1886 +association 447 +easily 2145 +holiday 2763 +always 851 +stopped 1885 +eager 2990 +found 552 +heavy 771 +sterling 1641 +everyone 1385 +england 1492 +generation 2198 +house 165 +energy 773 +hard 712 +reduce 713 +idea 1165 +police 1399 +extended 2313 +expect 524 +advertisers 1704 +operation 1143 +beyond 1300 +insurance 240 +really 840 +deals 1306 +funding 1173 +carriers 2339 +blacks 1875 +robert 513 +since 156 +douglas 2381 +research 318 +participants 2315 +safety 1267 +hill 2075 +fujitsu 2707 +issue 186 +highway 1844 +reporting 2192 +risen 2908 +lawrence 2401 +friday 283 +houses 1332 +reason 878 +base 980 +members 450 +backed 1319 +beginning 937 +guy 2481 +director 276 +owners 1275 +benefits 991 +launch 2222 +just 152 +computers 565 +excluding 2317 +american 141 +threat 1705 +pilots 1045 +fallen 2154 +lawsuits 2160 +copper 1478 +major 138 +slipped 1903 +feel 1113 +number 295 +feet 1350 +done 927 +fees 792 +miss 2925 +causing 2470 +stage 2258 +story 1363 +heads 2532 +leading 815 +st. 1612 +kidder 1086 +least 298 +station 1887 +expand 1708 +statement 682 +dealing 2554 +compromise 1975 +store 1365 +listed 1605 +selling 400 +passed 1364 +relationship 1904 +behind 1120 +hotel 1727 +park 1518 +immediate 1930 +blue-chip 2729 +profitability 2566 +part 202 +favorable 2870 +believe 624 +hollywood 2007 +king 2242 +kind 948 +grew 1572 +rebound 2900 +double 1870 +pennsylvania 2361 +determined 1959 +risks 1484 +elaborate 2520 +messrs. 2402 +toward 811 +aug. 
2039 +outstanding 548 +imports 949 +substantial 1315 +orders 456 +option 1389 +sell 183 +ratings 1573 +built 1099 +trip 2887 +gorbachev 1166 +officers 2670 +targets 2611 +majority 908 +internal 1327 +chairman 143 +finding 1899 +frequently 2387 +play 1117 +added 285 +electric 940 +goldman 1602 +eggs 2811 +measures 1322 +reach 1366 +freddie 2605 +most 91 +hired 2493 +shareholder 1032 +plan 176 +significant 893 +services 324 +extremely 1976 +approved 680 +soared 1910 +compaq 2690 +dealers 669 +clear 726 +sometimes 1276 +cover 1506 +rockefeller 2731 +traditional 1522 +three-month 2578 +clean 2790 +usual 1771 +institutions 826 +painewebber 1628 +sector 1075 +thomas 1269 +particularly 783 +gold 660 +commissions 2277 +nasdaq 1343 +session 1082 +businesses 434 +jury 1372 +fine 1829 +find 678 +impact 883 +gen. 2812 +giant 1100 +regulations 2060 +nevertheless 2955 +northern 1431 +justice 1011 +heavily 1440 +distributed 2871 +failed 778 +flights 2544 +pretty 1760 +equity 621 +giants 2169 +begins 2679 +his 50 +hit 777 +gains 554 +meanwhile 672 +express 1020 +financing 605 +collection 2878 +b 2327 +actions 1622 +closely 1102 +reporters 2170 +during 199 +him 302 +merchandise 2450 +appeal 1682 +doubled 2813 +six-month 2014 +banking 569 +common 247 +activity 807 +switzerland 2096 +coors 2909 +river 1996 +wrote 1565 +restaurants 2748 +set 386 +art 1459 +achieved 2761 +declines 1256 +sex 2986 +culture 2345 +see 379 +defense 536 +sec 1132 +are 27 +sea 1876 +tender 1765 +close 321 +arm 2890 +declined 413 +filings 2932 +# 687 +spirits 2776 +movie 1603 +century 1742 +currently 488 +won 1072 +various 1186 +probably 681 +conditions 1087 +supposed 2449 +available 679 +korea 1592 +recently 376 +creating 2363 +initially 2115 +dividends 1435 +sold 239 +attention 1420 +aircraft 1496 +succeed 2284 +coffee 2143 +opposition 1305 +franchise 2575 +dividend 704 +both 180 +prospects 2171 +last 70 +appropriations 1367 +annual 367 +foreign 245 +sensitive 2732 +connection 2591 +became 985 +long-term 688 
+let 1025 +whole 1180 +baltimore 2749 +point 375 +reasons 1748 +loan 501 +community 922 +simply 946 +church 1960 +throughout 1766 +expensive 1619 +decline 461 +described 2182 +raise 630 +monthly 1504 +create 1288 +political 390 +due 260 +strategy 750 +convicted 2830 +whom 1713 +reduction 1501 +maintenance 2545 +meeting 476 +walter 2438 +firm 192 +partly 1110 +fire 1782 +gas 538 +convert 2794 +N 4 +fund 293 +whatever 2671 +lives 2129 +brokers 960 +bidding 1494 +demand 437 +prices 113 +plants 865 +georgia 2076 +look 714 +solid 2950 +judicial 2987 +bill 261 +budget 570 +governor 2672 +technical 1586 +while 121 +mainframe 2927 +ought 2546 +fleet 2928 +mitchell 2346 +guide 2792 +engineers 2762 +real 309 +pound 1066 +costly 2183 +voters 1683 +cents 108 +motors 1328 +stations 1740 +disappointing 2462 +itself 683 +ready 1788 +fannie 1967 +coca-cola 2910 +chase 2088 +underwriters 1718 +suggests 2045 +rules 906 +virtually 1753 +widely 1283 +grand 1426 +survey 1108 +dozen 1671 +higher 207 +development 444 +used 263 +lawyers 691 +d.c. 2988 +affairs 1699 +comprehensive 2655 +yesterday 123 +moment 1859 +levels 788 +moving 1408 +purpose 2617 +tobacco 2477 +recent 182 +lower 231 +task 2015 +older 1908 +studies 1956 +poland 1221 +spent 1149 +person 1442 +machinery 2511 +ltd. 555 +swiss 1416 +organization 1178 +spend 1270 +coup 2226 +one-year 2560 +junk-bond 1767 +networks 2464 +u.k. 
1168 +competitive 1650 +quarters 2311 +questions 1093 +world 219 +alternative 1978 +wage 1158 +cut 378 +helping 2116 +$ 13 +also 60 +advisers 2044 +workers 432 +deputy 1809 +guaranteed 2268 +attractive 2467 +source 1076 +stock-market 2382 +parents 1877 +location 2777 +violations 2576 +guarantees 2004 +administrative 2233 +remaining 1428 +surprised 2478 +build 848 +customers 526 +australia 1249 +march 618 +emergency 1171 +demands 2648 +big 130 +bid 258 +matters 2104 +game 1088 +aerospace 1931 +bit 1893 +projects 868 +moody 995 +breeden 2364 +success 1395 +follows 1860 +signal 2383 +toyota 2929 +separately 1261 +communications 779 +arthur 2891 +individuals 1324 +yields 923 +popular 1429 +healthy 1805 +privately 2297 +often 518 +senate 463 +spring 1205 +b. 1814 +some 58 +back 193 +trends 2673 +economic 234 +pricing 1861 +apply 2465 +nicaragua 2503 +facing 2397 +scale 2750 +decision 531 +transactions 1083 +audience 2144 +per 1038 +eliminate 2561 +be 26 +run 612 +lose 1517 +continuing 1021 +fed 566 +refused 2077 +step 1210 +santa 1250 +served 2066 +at&t 1789 +by 18 +pipeline 2365 +goods 804 +anything 997 +truck 1792 +mrs. 
662 +range 882 +ounce 1921 +duties 2917 +block 1035 +pollution 2951 +repair 2839 +steinhardt 2692 +into 92 +within 530 +retailer 2751 +nothing 1033 +primarily 1548 +sports 1259 +pentagon 1472 +bankruptcy 1029 +statistics 1939 +spending 509 +question 801 +long 352 +ordered 1823 +amr 2989 +suit 633 +himself 1056 +elsewhere 1731 +collapsed 2347 +vehicle 2061 +specialty 2269 +hoped 2872 +atlantic 2254 +pacific 689 +filed 528 +hopes 1101 +subsidiary 663 +line 464 +considerable 2714 +raising 1421 +posted 634 +up 53 +us 505 +maturity 1529 +'re 278 +exploration 2105 +viacom 2562 +similar 710 +called 342 +bell 1310 +associated 1669 +metal 2485 +influence 1905 +metals 1790 +engineering 1293 +associates 1121 +rally 975 +amounts 1356 +peace 2892 +fears 1711 +teams 2921 +yeargin 2494 +afford 2902 +politicians 2131 +reputation 2693 +income 185 +department 235 +manhattan 1430 +users 2367 +gross 2011 +problems 356 +prepared 1575 +william 782 +allowing 2579 +formal 2592 +sides 1783 +structure 1761 +ago 196 +urged 2223 +land 1217 +vice 233 +age 1067 +required 972 +bankers 1022 +responded 2873 +far 291 +fresh 2593 +requires 2384 +leveraged 1146 +once 500 +code 2372 +issued 902 +results 343 +existing 1251 +oct. 314 +ge 1889 +broader 2778 +go 387 +gm 844 +contributions 2463 +centers 1585 +issues 251 +seemed 1815 +concerned 1493 +young 841 +send 2779 +suits 2399 +citing 1961 +stable 2745 +quarterly 1090 +include 537 +friendly 2385 +resources 1059 +garden 2255 +automotive 1620 +continues 1050 +wave 2857 +putting 1663 +cellular 2448 +telling 2931 +continued 609 +entire 1436 +eased 2894 +sen. 
969 +real-estate 1446 +positions 1380 +notes 377 +michael 798 +fewer 1732 +try 876 +race 2475 +noted 667 +guber 1183 +concluded 2244 +smaller 1048 +cds 1774 +crop 2296 +jump 2132 +video 2533 +expense 2458 +makers 768 +index 217 +edison 2814 +business 86 +chicago 485 +giving 1454 +expressed 2173 +practices 1798 +access 1455 +paying 1125 +waiting 1630 +indian 2840 +volatile 2030 +five-year 2651 +capital 181 +firms 366 +exercise 1743 +body 2875 +led 661 +lee 1816 +exchange 112 +pushing 2514 +commercial 478 +jointly 2040 +following 572 +northeast 2553 +them 128 +others 495 +great 592 +credits 3001 +receive 934 +involved 831 +larger 1307 +leaving 1835 +engine 2758 +merger 1037 +products 190 +opinion 1817 +residents 1754 +gene 1614 +makes 559 +maker 340 +fourth-quarter 2016 +named 523 +writer 2971 +apple 1538 +heart 1655 +win 1255 +manage 2933 +private 551 +fraud 1552 +names 1863 +motor 996 +scandal 1948 +standing 2715 +use 270 +from 21 +p&g 2217 +consumption 2618 +& 83 +remains 780 +illegal 2121 +cray 1553 +next 131 +few 262 +doubt 1878 +year-ago 1519 +themselves 873 +consecutive 2243 +reflects 1712 +usx 1502 +sort 1566 +parliament 2580 +started 889 +becomes 2934 +factor 2245 +benchmark 1824 +occurred 2841 +carrying 2590 +sharply 1051 +allianz 2867 +mitsubishi 1615 +appointed 2495 +women 941 +customer 1353 +account 636 +us$ 1539 +effectively 2876 +this 40 +challenge 1879 +clients 764 +recession 965 +thin 2429 +island 2868 +meet 896 +closing 1039 +n.y. 2208 +control 351 +beijing 2062 +slid 2780 +weaker 2716 +engelken 2703 +process 796 +a.m. 
2859 +daiwa 2733 +tax 218 +purposes 2842 +high 241 +professor 2022 +reserves 825 +something 785 +sought 1432 +stories 2466 +voice 1981 +rape 2972 +sir 1836 +educational 2534 +united 562 +usair 2029 +democracy 2404 +recalls 2572 +six 364 +hampshire 2279 +arrangement 2795 +traders 281 +forest 2877 +instead 597 +stock 61 +buildings 1642 +farm 2097 +watch 2237 +tied 2256 +ties 2215 +boeing 1962 +light 1063 +lines 805 +commodities 2373 +chief 153 +road 2102 +allow 894 +executives 493 +martin 2417 +houston 1933 +holds 1199 +hanover 2547 +producer 1587 +institutional 1258 +move 330 +produced 1109 +alliance 2056 +including 211 +looks 2155 +quake 1064 +year-earlier 797 +industries 553 +delay 1932 +la 2041 +labor 700 +whites 2952 +willing 1289 +orange 2515 +covered 2156 +criminal 1479 +spot 2209 +pending 1503 +crash 1041 +greater 1225 +auto 616 +practice 1360 +earn 2937 +cutting 1949 +h. 2184 +hands 1316 +front 1911 +bar 2740 +republican 1409 +investor 506 +day 273 +capital-gains 1386 +successor 2704 +february 1799 +l. 
1577 +warned 2280 +university 594 +covering 2724 +identified 2504 +morris 1979 +rising 1126 +bills 614 +warner 576 +doing 842 +strip 2944 +related 866 +society 1423 +books 1800 +measure 853 +our 230 +margins 1242 +agriculture 2717 +special 655 +out 85 +merc 2849 +' 135 +entertainment 1016 +defend 2815 +critics 1825 +electronics 1301 +cause 1213 +integrated 2286 +red 1013 +thrifts 1837 +disclose 2021 +shut 2912 +frank 1643 +ban 2065 +regulators 1084 +york 93 +regulatory 1374 +indicates 2596 +philip 1696 +navy 2034 +hostile 1791 +could 80 +florida 1864 +mac 2496 +keep 586 +ltd 1480 +davis 2781 +retain 2452 +retail 909 +south 436 +respond 2833 +plastics 2953 +succeeds 2031 +powerful 1616 +owned 1015 +strategic 1768 +owner 1507 +reached 698 +awarded 2409 +quality 1192 +nyse 2335 +legislative 2657 +management 266 +stands 2702 +los 639 +system 274 +relations 1560 +recapitalization 2880 +priority 2581 +their 52 +attack 2348 +intelligence 1649 +final 1111 +interests 775 +enforcement 2535 +shell 2973 +completed 790 +acquire 869 +environmental 1091 +chemicals 1591 +reflecting 1308 +branches 2415 +july 567 +institution 2706 +steel 600 +colleagues 2174 +hearings 2287 +commodity 1406 +patients 1952 +individual 736 +providing 2064 +creditors 1004 +projections 2996 +unchanged 1073 +partnership 1218 +lin 1613 +unlikely 2356 +have 35 +need 479 +apparently 1081 +clearly 1508 +rjr 2220 +documents 1675 +dallas 1830 +agency 303 +able 584 +purchasing 1941 +instance 1222 +concerns 830 +which 42 +campeau 1680 +coke 2956 +unless 1150 +who 57 +eight 959 +preliminary 1980 +device 2843 +segment 1569 +payment 1212 +so-called 1354 +request 1681 +face 664 +looked 2828 +proceedings 2195 +lowered 2176 +pictures 2267 +normally 2191 +fact 668 +goals 2764 +agreed 346 +charles 1379 +bring 1144 +planning 1351 +democrats 1156 +portfolio 684 +fear 1262 +economist 1189 +debate 1561 +decade 1219 +staff 861 +litigation 1485 +partners 918 +based 306 +earned 1017 +controls 1558 +should 205 +unable 2260 
+candidates 1925 +employee 1624 +communist 1769 +local 579 +hope 1160 +meant 2845 +dinkins 1185 +handle 2435 +means 863 +fellow 2505 +familiar 1253 +overall 1268 +bear 1644 +reinsurance 2656 +joint 799 +ones 1460 +words 1749 +exchanges 1469 +buyer 2005 +kong 845 +chips 1600 +areas 880 +trucks 1684 +course 904 +numerous 2574 +taxes 954 +calling 2405 +she 164 +ohio 1570 +fixed 1424 +conduct 2819 +view 808 +europe 546 +temporarily 2122 +downward 3000 +acquired 747 +national 163 +accord 1770 +operates 1912 +edition 2846 +computer 226 +subcommittee 2430 +closer 2516 +nationwide 2594 +reform 1471 +nuclear 1676 +tend 2134 +favor 1697 +state 140 +closed 250 +crude 1900 +progress 2453 +neither 1397 +bought 686 +comparable 2006 +brewing 2425 +ability 1157 +opening 1461 +deliver 2234 +agencies 1169 +job 793 +takeover 423 +key 864 +approval 727 +precious 2858 +lawsuit 2536 +distribution 1793 +declining 1631 +david 693 +restrictions 1530 +limits 2307 +career 2349 +goal 1966 +taking 836 +equal 1950 +drug 510 +pulp 2563 +april 856 +figures 653 +jersey 1762 +otherwise 2406 +comment 472 +adjusted 1706 +english 2765 +co 571 +lang 2333 +agents 2172 +wall 331 +ca 563 +cd 2055 +packaging 2976 +qintex 1243 +table 2573 +oakland 1937 +industrials 2607 +addition 489 +genetic 2336 +permanent 2893 +agreements 2078 +proposal 491 +waste 2847 +faced 2412 +controlled 1763 +c. 
1707 +league 2308 +am 1635 +sufficient 2977 +otc 1413 +essentially 2620 +c$ 1511 +bulk 2752 +finished 1831 +graphics 2257 +improved 1309 +atlanta 2431 +general 189 +present 1838 +homes 1515 +troubled 1543 +abandoned 2896 +unlike 1728 +sotheby 2848 +restaurant 2278 +harder 2550 +as 20 +value 310 +will 34 +owns 833 +wild 2978 +uncertainty 1686 +almost 512 +blood 2502 +thus 1114 +site 2491 +helped 745 +claimed 2606 +partner 1062 +shearson 938 +halt 2665 +tumbled 2407 +perhaps 986 +began 544 +administration 345 +cross 2079 +member 1094 +retailing 2419 +parts 935 +largest 414 +units 603 +party 728 +gets 1523 +difficult 1024 +material 1695 +columbia 809 +nekoosa 1916 +upon 2025 +effect 692 +forecasts 2299 +student 2246 +rumors 1914 +kkr 2614 +single 1136 +transaction 564 +off 170 +center 721 +i 69 +approve 2426 +well 201 +fighting 2641 +thought 947 +banker 2377 +sets 2235 +position 533 +soviet 384 +inc. 81 +latest 466 +stores 643 +less 246 +increasingly 1554 +executive 155 +domestic 708 +obtain 2157 +sources 1534 +underlying 2107 +rooms 2374 +seats 1832 +paul 1549 +rapid 2642 +ads 1417 +supply 933 +smith 1129 +deposits 1734 +realize 2298 +simple 2231 +add 1226 +other 62 +subordinated 1544 +match 2408 +boom 2582 +tests 1645 +increased 369 +provisions 1292 +government 99 +chancellor 2300 +increases 696 +five 259 +know 481 +press 916 +immediately 1451 +loss 254 +lincoln 1735 +necessary 1381 +like 167 +lost 635 +miami 2829 +taxpayers 2201 +lawson 1390 +payments 729 +james 560 +become 428 +works 1320 +soft 2674 +amendment 1846 +exceed 2643 +because 78 +arbitrage 981 +authority 1286 +growth 242 +export 1462 +cleveland 2796 +home 322 +peter 1163 +employment 1872 +line-item 2350 +lead 786 +broad 1593 +avoid 1265 +hurricane 1151 +slide 2270 +does 177 +york-based 2645 +chains 2621 +leader 862 +schedule 2719 +journal 919 +monetary 1452 +expansion 1594 +beach 2394 +pressure 843 +expire 2817 +although 368 +offset 1193 +includes 800 +loans 362 +vs. 
2660 +panel 1559 +gained 725 +about 44 +actual 1174 +carried 2420 +debentures 1295 +freedom 2676 +shipping 2158 +surge 2080 +angeles 749 +holdings 885 +carries 2468 +carrier 1368 +introduced 1191 +software 993 +own 195 +letters 2175 +previously 719 +warrants 2046 +washington 514 +commitment 2337 +billions 2622 +getting 637 +malcolm 2922 +included 651 +guard 2881 +promise 2047 +managing 1274 +banco 2338 +utility 1913 +accused 1794 +additional 581 +krenz 1729 +transfer 2506 +housing 999 +secret 2949 +peters 1206 +continental 1776 +biggest 742 +pretax 1176 +fiscal 416 +buy 184 +north 739 +stadium 2899 +triggered 2368 +insurer 2564 +funds 194 +brand 1089 +akzo 2957 +but 30 +delivery 1000 +insured 2108 +construction 608 +gain 430 +courts 1942 +highest 1595 +ltv 2705 +he 29 +made 160 +places 2735 +whether 370 +cells 2388 +official 455 +signed 1194 +record 440 +below 631 +limit 1057 +ruling 1233 +problem 433 +piece 2185 +minutes 1497 +supreme 1152 +deaths 2818 +wcrs 2980 +slowing 1607 +flight 2020 +education 1382 +proceeds 1337 +worse 2197 +inc 443 +aetna 2720 +mutual 1369 +compared 371 +'ll 928 +variety 2784 +corporation 2271 +illinois 2646 +book 1291 +compares 2914 +details 1512 +branch 2202 +compete 2850 +gonzalez 2797 +junk 473 +francisco 420 +star 1826 +monday 383 +class 1361 +june 625 +ultimately 2153 +contends 2328 +stay 1333 +chance 1375 +bellsouth 2697 +priced 425 +friends 2421 +exposure 2301 +resolution 2067 +baker 1290 +factors 1438 +rule 1161 +ortega 2677 +portion 1685 +write 2342 +status 2334 +pension 1181 +understand 2232 +frankfurt 1915 diff --git a/language_model/img/lstm.png b/language_model/images/lstm.png similarity index 100% rename from language_model/img/lstm.png rename to language_model/images/lstm.png diff --git a/language_model/img/ngram.png b/language_model/images/ngram.png similarity index 100% rename from language_model/img/ngram.png rename to language_model/images/ngram.png diff --git a/language_model/img/ps.png b/language_model/images/ps.png 
similarity index 100%
rename from language_model/img/ps.png
rename to language_model/images/ps.png
diff --git a/language_model/img/ps2.png b/language_model/images/ps2.png
similarity index 100%
rename from language_model/img/ps2.png
rename to language_model/images/ps2.png
diff --git a/language_model/img/rnn.png b/language_model/images/rnn.png
similarity index 100%
rename from language_model/img/rnn.png
rename to language_model/images/rnn.png
diff --git a/language_model/img/rnn_str.png b/language_model/images/rnn_str.png
similarity index 100%
rename from language_model/img/rnn_str.png
rename to language_model/images/rnn_str.png
diff --git a/language_model/img/s.png b/language_model/images/s.png
similarity index 100%
rename from language_model/img/s.png
rename to language_model/images/s.png
diff --git a/language_model/lm_ngram.py b/language_model/lm_ngram.py
index 4607da3c6f02a7ae1a85f06b7dd370983092c9b6..23cca1c828576608e414d95b84dca062082410d6 100644
--- a/language_model/lm_ngram.py
+++ b/language_model/lm_ngram.py
@@ -5,6 +5,7 @@ import data_util as reader
 import gzip
 import numpy as np
 
+
 def lm(vocab_size, emb_dim, hidden_size, num_layer):
     """
     ngram language model definition.
@@ -135,7 +136,6 @@ def train():
 
 
 if __name__ == '__main__':
-
     # -- config : model --
     emb_dim = 200
     hidden_size = 200
@@ -145,9 +145,9 @@ if __name__ == '__main__':
     model_file_name_prefix = 'lm_ngram_pass_'
 
     # -- config : data --
-    train_file = 'data/chinese.txt'
-    test_file = 'data/chinese.txt'
-    vocab_file = 'data/vocab_cn.txt' # the file to save vocab
+    train_file = 'data/ptb.train.txt'
+    test_file = 'data/ptb.test.txt'
+    vocab_file = 'data/vocab_ptb.txt' # the file to save vocab
     vocab_max_size = 3000
     min_sentence_length = 3
     max_sentence_length = 60
@@ -163,7 +163,7 @@ if __name__ == '__main__':
     # prepare model
     word_id_dict = reader.load_vocab(vocab_file) # load word dictionary
     _, output_layer = lm(len(word_id_dict), emb_dim, hidden_size, num_layer) # network config
-    model_file_name = model_file_name_prefix + str(num_passs - 1) + '.tar.gz'
+    model_file_name = model_file_name_prefix + str(num_passs - 1) + '.tar.gz'
     parameters = paddle.parameters.Parameters.from_tar(gzip.open(model_file_name)) # load parameters
     # generate
     input = [[word_id_dict.get(w, word_id_dict['']) for w in text.split()]]
@@ -176,4 +176,3 @@ if __name__ == '__main__':
     predictions[-1][word_id_dict['']] = -1 # filter
     next_word = id_word_dict[np.argmax(predictions[-1])]
     print(next_word.encode('utf-8'))
-
diff --git a/language_model/lm_rnn.py b/language_model/lm_rnn.py
index 6072d599cabd273cb48e26f2ea17f5f1d75ee707..5a9721bbca009a3f6ef572a7d44ee860689e74c5 100644
--- a/language_model/lm_rnn.py
+++ b/language_model/lm_rnn.py
@@ -6,6 +6,7 @@ import gzip
 import os
 import numpy as np
 
+
 def lm(vocab_size, emb_dim, rnn_type, hidden_size, num_layer):
     """
     rnn language model definition.
@@ -63,8 +64,8 @@ def train():
 
     # prepare word dictionary
     print('prepare vocab...')
-    word_id_dict = reader.build_vocab(train_file, vocab_max_size) # build vocab
-    reader.save_vocab(word_id_dict, vocab_file) # save vocab
+    word_id_dict = reader.build_vocab(train_file, vocab_max_size) # build vocab
+    reader.save_vocab(word_id_dict, vocab_file) # save vocab
 
     # define data reader
     train_reader = paddle.batch(
@@ -188,7 +189,7 @@ def predict():
     if os.path.isfile(vocab_file):
         word_id_dict = reader.load_vocab(vocab_file) # load word dictionary
     else:
-        word_id_dict = reader.build_vocab(train_file, vocab_max_size) # build vocab
+        word_id_dict = reader.build_vocab(train_file, vocab_max_size) # build vocab
         reader.save_vocab(word_id_dict, vocab_file) # save vocab
 
     # prepare and cache model
@@ -209,10 +210,10 @@ def predict():
     print('prob: ', prob)
     print('-------')
 
 
-if __name__ == '__main__':
+if __name__ == '__main__':
     # -- config : model --
-    rnn_type = 'gru' # or 'lstm'
+    rnn_type = 'gru' # or 'lstm'
     emb_dim = 200
     hidden_size = 200
     num_passs = 2
@@ -232,4 +233,4 @@ if __name__ == '__main__':
     train()
 
     # -- predict --
-    predict()
\ No newline at end of file
+    predict()