{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 使用注意力机制的LSTM的机器翻译\n", "\n", "本示例教程介绍如何使用飞桨完成一个机器翻译任务。我们将会使用飞桨提供的LSTM的API,组建一个`sequence to sequence with attention`的机器翻译的模型,并在示例的数据集上完成从英文翻译成中文的机器翻译。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 环境设置\n", "\n", "本示例教程基于飞桨2.0-beta版本。" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0.0\n", "89af2088b6e74bdfeef2d4d78e08461ed2aafee5\n" ] } ], "source": [ "import paddle\n", "import paddle.nn.functional as F\n", "import re\n", "import numpy as np\n", "\n", "paddle.disable_static()\n", "print(paddle.__version__)\n", "print(paddle.__git_commit__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 下载数据集\n", "\n", "我们将使用 [http://www.manythings.org/anki/](http://www.manythings.org/anki/) 提供的中英文的英汉句对作为数据集,来完成本任务。该数据集含有23610个中英文双语的句对。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2020-09-04 16:13:35-- https://www.manythings.org/anki/cmn-eng.zip\n", "Resolving www.manythings.org (www.manythings.org)... 104.24.109.196, 172.67.173.198, 2606:4700:3037::6818:6cc4, ...\n", "Connecting to www.manythings.org (www.manythings.org)|104.24.109.196|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 1030722 (1007K) [application/zip]\n", "Saving to: ‘cmn-eng.zip’\n", "\n", "cmn-eng.zip 100%[===================>] 1007K 520KB/s in 1.9s \n", "\n", "2020-09-04 16:13:38 (520 KB/s) - ‘cmn-eng.zip’ saved [1030722/1030722]\n", "\n", "Archive: cmn-eng.zip\n", " inflating: cmn.txt \n", " inflating: _about.txt \n" ] } ], "source": [ "!wget -c https://www.manythings.org/anki/cmn-eng.zip && unzip cmn-eng.zip" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 23610 cmn.txt\r\n" ] } ], "source": [ "!wc -l cmn.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 构建双语句对的数据结构\n", "\n", "接下来我们通过处理下载下来的双语句对的文本文件,将双语句对读入到python的数据结构中。这里做了如下的处理。\n", "\n", "- 对于英文,会把全部英文都变成小写,并只保留英文的单词。\n", "- 对于中文,为了简便起见,未做分词,按照字做了切分。\n", "- 为了后续的程序运行的更快,我们通过限制句子长度,和只保留部分英文单词开头的句子的方式,得到了一个较小的数据集。这样得到了一个有5508个句对的数据集。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "MAX_LEN = 10" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5508\n", "(['i', 'won'], ['我', '赢', '了', '。'])\n", "(['he', 'ran'], ['他', '跑', '了', '。'])\n", "(['i', 'quit'], ['我', '退', '出', '。'])\n", "(['i', 'm', 'ok'], ['我', '沒', '事', '。'])\n", "(['i', 'm', 'up'], ['我', '已', '经', '起', '来', '了', '。'])\n", "(['we', 'try'], ['我', '们', '来', '试', '试', '。'])\n", "(['he', 'came'], ['他', '来', '了', '。'])\n", "(['he', 'runs'], ['他', '跑', '。'])\n", "(['i', 'agree'], ['我', '同', '意', '。'])\n", "(['i', 'm', 'ill'], ['我', '生', '病', '了', '。'])\n" ] } ], "source": [ "lines = open('cmn.txt', encoding='utf-8').read().strip().split('\\n')\n", "words_re = re.compile(r'\\w+')\n", "\n", "pairs = []\n", "for l in lines:\n", " en_sent, cn_sent, _ = l.split('\\t')\n", " pairs.append((words_re.findall(en_sent.lower()), list(cn_sent)))\n", "\n", "# create a smaller dataset to make the demo process faster\n", "filtered_pairs = []\n", "\n", "for x in pairs:\n", " if len(x[0]) < MAX_LEN and len(x[1]) < MAX_LEN and \\\n", " x[0][0] in ('i', 'you', 'he', 'she', 'we', 'they'):\n", " filtered_pairs.append(x)\n", " \n", "print(len(filtered_pairs))\n", "for x in filtered_pairs[:10]: print(x) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 创建词表\n", "\n", "接下来我们分别创建中英文的词表,这两份词表会用来将英文和中文的句子转换为词的ID构成的序列。词表中还加入了如下三个特殊的词:\n", "- ``: 用来对较短的句子进行填充。\n", "- ``: \"begin of sentence\", 表示句子的开始的特殊词。\n", "- ``: \"end of sentence\", 表示句子的结束的特殊词。\n", "\n", "Note: 在实际的任务中,可能还需要通过``(或者``)特殊词来表示未在词表中出现的词。" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2539\n", "2039\n" ] } ], "source": [ "en_vocab = {}\n", "cn_vocab = {}\n", "\n", "# create special token for pad, begin of sentence, end of sentence\n", "en_vocab[''], en_vocab[''], en_vocab[''] = 0, 1, 2\n", "cn_vocab[''], cn_vocab[''], cn_vocab[''] = 0, 1, 2\n", "\n", "en_idx, cn_idx = 3, 3\n", "for en, cn in filtered_pairs:\n", " for w in en: \n", " if w not in en_vocab: \n", " en_vocab[w] = en_idx\n", " en_idx += 1\n", " for w in cn: \n", " if w not in cn_vocab: \n", " cn_vocab[w] = cn_idx\n", " cn_idx += 1\n", "\n", "print(len(list(en_vocab)))\n", "print(len(list(cn_vocab)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 创建padding过的数据集\n", "\n", "接下来根据词表,我们将会创建一份实际的用于训练的用numpy array组织起来的数据集。\n", "- 所有的句子都通过``补充成为了长度相同的句子。\n", "- 对于英文句子(源语言),我们将其反转了过来,这会带来更好的翻译的效果。\n", "- 所创建的`padded_cn_label_sents`是训练过程中的预测的目标,即,每个中文的当前词去预测下一个词是什么词。\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(5508, 11)\n", "(5508, 12)\n", "(5508, 12)\n" ] } ], "source": [ "padded_en_sents = []\n", "padded_cn_sents = []\n", "padded_cn_label_sents = []\n", "for en, cn in filtered_pairs:\n", " # reverse source sentence\n", " padded_en_sent = en + [''] + [''] * (MAX_LEN - len(en))\n", " padded_en_sent.reverse()\n", " padded_cn_sent = [''] + cn + [''] + [''] * (MAX_LEN - len(cn))\n", " padded_cn_label_sent = cn + [''] + [''] * (MAX_LEN - len(cn) + 1) \n", "\n", " padded_en_sents.append([en_vocab[w] for w in padded_en_sent])\n", " padded_cn_sents.append([cn_vocab[w] for w in padded_cn_sent])\n", " padded_cn_label_sents.append([cn_vocab[w] for w in padded_cn_label_sent])\n", "\n", "train_en_sents = np.array(padded_en_sents)\n", "train_cn_sents = np.array(padded_cn_sents)\n", "train_cn_label_sents = np.array(padded_cn_label_sents)\n", "\n", "print(train_en_sents.shape)\n", "print(train_cn_sents.shape)\n", "print(train_cn_label_sents.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 创建网络\n", "\n", "我们将会创建一个Encoder-AttentionDecoder架构的模型结构用来完成机器翻译任务。\n", "首先我们将设置一些必要的网络结构中用到的参数。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "embedding_size = 128\n", "hidden_size = 256\n", "num_encoder_lstm_layers = 1\n", "en_vocab_size = len(list(en_vocab))\n", "cn_vocab_size = len(list(cn_vocab))\n", "epochs = 20\n", "batch_size = 16" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Encoder部分\n", "\n", "在编码器的部分,我们通过查找完Embedding之后接一个LSTM的方式构建一个对源语言编码的网络。飞桨的RNN系列的API,除了LSTM之外,还提供了SimleRNN, GRU供使用,同时,还可以使用反向RNN,双向RNN,多层RNN等形式。也可以通过`dropout`参数设置是否对多层RNN的中间层进行`dropout`处理,来防止过拟合。\n", "\n", "除了使用序列到序列的RNN操作之外,也可以通过SimpleRNN, GRUCell, LSTMCell等API更灵活的创建单步的RNN计算,甚至通过继承RNNCellBase来实现自己的RNN计算单元。" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# encoder: simply learn representation of source sentence\n", "class Encoder(paddle.nn.Layer):\n", " def __init__(self):\n", " super(Encoder, self).__init__()\n", " self.emb = paddle.nn.Embedding(en_vocab_size, embedding_size,)\n", " self.lstm = paddle.nn.LSTM(input_size=embedding_size, \n", " hidden_size=hidden_size, \n", " num_layers=num_encoder_lstm_layers)\n", "\n", " def forward(self, x):\n", " x = self.emb(x)\n", " x, (_, _) = self.lstm(x)\n", " return x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# AttentionDecoder部分\n", "\n", "在解码器部分,我们通过一个带有注意力机制的LSTM来完成解码。\n", "\n", "- 单步的LSTM:在解码器的实现的部分,我们同样使用LSTM,与Encoder部分不同的是,下面的代码,每次只让LSTM往前计算一次。整体的recurrent部分,是在训练循环内完成的。\n", "- 注意力机制:这里使用了一个由两个Linear组成的网络来完成注意力机制的计算,它用来计算出目标语言在每次翻译一个词的时候,需要对源语言当中的每个词需要赋予多少的权重。\n", "- 对于第一次接触这样的网络结构来说,下面的代码在理解起来可能稍微有些复杂,你可以通过插入打印每个tensor在不同步骤时的形状的方式来更好的理解。" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# only move one step of LSTM, \n", "# the recurrent loop is implemented inside training loop\n", "class AttentionDecoder(paddle.nn.Layer):\n", " def __init__(self):\n", " super(AttentionDecoder, self).__init__()\n", " self.emb = paddle.nn.Embedding(cn_vocab_size, embedding_size)\n", " self.lstm = paddle.nn.LSTM(input_size=embedding_size + hidden_size, \n", " hidden_size=hidden_size)\n", "\n", " # for computing attention weights\n", " self.attention_linear1 = paddle.nn.Linear(hidden_size * 2, hidden_size)\n", " self.attention_linear2 = paddle.nn.Linear(hidden_size, 1)\n", " \n", " # for computing output logits\n", " self.outlinear =paddle.nn.Linear(hidden_size, cn_vocab_size)\n", "\n", " def forward(self, x, previous_hidden, previous_cell, encoder_outputs):\n", " x = self.emb(x)\n", " \n", " attention_inputs = paddle.concat((encoder_outputs, \n", " paddle.tile(previous_hidden, repeat_times=[1, MAX_LEN+1, 1])),\n", " axis=-1\n", " )\n", "\n", " attention_hidden = self.attention_linear1(attention_inputs)\n", " attention_hidden = F.tanh(attention_hidden)\n", " attention_logits = self.attention_linear2(attention_hidden)\n", " attention_logits = paddle.squeeze(attention_logits)\n", "\n", " attention_weights = F.softmax(attention_logits) \n", " attention_weights = paddle.expand_as(paddle.unsqueeze(attention_weights, -1), \n", " encoder_outputs)\n", "\n", " context_vector = paddle.multiply(encoder_outputs, attention_weights) \n", " context_vector = paddle.reduce_sum(context_vector, 1)\n", " context_vector = paddle.unsqueeze(context_vector, 1)\n", " \n", " lstm_input = paddle.concat((x, context_vector), axis=-1)\n", "\n", " # LSTM requirement to previous hidden/state: \n", " # (number_of_layers * direction, batch, hidden)\n", " previous_hidden = paddle.transpose(previous_hidden, [1, 0, 2])\n", " previous_cell = paddle.transpose(previous_cell, [1, 0, 2])\n", " \n", " x, (hidden, cell) = self.lstm(lstm_input, (previous_hidden, previous_cell))\n", " \n", " # change the return to (batch, number_of_layers * direction, hidden)\n", " hidden = paddle.transpose(hidden, [1, 0, 2])\n", " cell = paddle.transpose(cell, [1, 0, 2])\n", "\n", " output = self.outlinear(hidden)\n", " output = paddle.squeeze(output)\n", " return output, (hidden, cell)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 训练模型\n", "\n", "接下来我们开始训练模型。\n", "\n", "- 在每个epoch开始之前,我们对训练数据进行了随机打乱。\n", "- 我们通过多次调用`atten_decoder`,在这里实现了解码时的recurrent循环。\n", "- `teacher forcing`策略: 在每次解码下一个词时,我们给定了训练数据当中的真实词作为了预测下一个词时的输入。相应的,你也可以尝试用模型预测的结果作为下一个词的输入。(或者混合使用)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "epoch:0\n", "iter 0, loss:[7.6194725]\n", "iter 200, loss:[3.4147663]\n", "epoch:1\n", "iter 0, loss:[3.0931656]\n", "iter 200, loss:[2.7543137]\n", "epoch:2\n", "iter 0, loss:[2.8413522]\n", "iter 200, loss:[2.340513]\n", "epoch:3\n", "iter 0, loss:[2.597812]\n", "iter 200, loss:[2.5552855]\n", "epoch:4\n", "iter 0, loss:[2.0783448]\n", "iter 200, loss:[2.4544785]\n", "epoch:5\n", "iter 0, loss:[1.8709135]\n", "iter 200, loss:[1.8736631]\n", "epoch:6\n", "iter 0, loss:[1.9589291]\n", "iter 200, loss:[2.119414]\n", "epoch:7\n", "iter 0, loss:[1.5829577]\n", "iter 200, loss:[1.6002902]\n", "epoch:8\n", "iter 0, loss:[1.6022769]\n", "iter 200, loss:[1.52694]\n", "epoch:9\n", "iter 0, loss:[1.3616685]\n", "iter 200, loss:[1.5420443]\n", "epoch:10\n", "iter 0, loss:[1.0397792]\n", "iter 200, loss:[1.2458231]\n", "epoch:11\n", "iter 0, loss:[1.2107158]\n", "iter 200, loss:[1.426417]\n", "epoch:12\n", "iter 0, loss:[1.1840894]\n", "iter 200, loss:[1.0999664]\n", "epoch:13\n", "iter 0, loss:[1.0968472]\n", "iter 200, loss:[0.8149167]\n", "epoch:14\n", "iter 0, loss:[0.95585203]\n", "iter 200, loss:[1.0070628]\n", "epoch:15\n", "iter 0, loss:[0.89463925]\n", "iter 200, loss:[0.8288595]\n", "epoch:16\n", "iter 0, loss:[0.5672495]\n", "iter 200, loss:[0.7317069]\n", "epoch:17\n", "iter 0, loss:[0.76785177]\n", "iter 200, loss:[0.5319323]\n", "epoch:18\n", "iter 0, loss:[0.5250005]\n", "iter 200, loss:[0.4182841]\n", "epoch:19\n", "iter 0, loss:[0.52320284]\n", "iter 200, loss:[0.47618982]\n" ] } ], "source": [ "encoder = Encoder()\n", "atten_decoder = AttentionDecoder()\n", "\n", "opt = paddle.optimizer.Adam(learning_rate=0.001, \n", " parameters=encoder.parameters()+atten_decoder.parameters())\n", "\n", "for epoch in range(epochs):\n", " print(\"epoch:{}\".format(epoch))\n", "\n", " # shuffle training data\n", " perm = np.random.permutation(len(train_en_sents))\n", " train_en_sents_shuffled = train_en_sents[perm]\n", " train_cn_sents_shuffled = train_cn_sents[perm]\n", " train_cn_label_sents_shuffled = train_cn_label_sents[perm]\n", "\n", " for iteration in range(train_en_sents_shuffled.shape[0] // batch_size):\n", " x_data = train_en_sents_shuffled[(batch_size*iteration):(batch_size*(iteration+1))]\n", " sent = paddle.to_tensor(x_data)\n", " en_repr = encoder(sent)\n", "\n", " x_cn_data = train_cn_sents_shuffled[(batch_size*iteration):(batch_size*(iteration+1))]\n", " x_cn_label_data = train_cn_label_sents_shuffled[(batch_size*iteration):(batch_size*(iteration+1))]\n", "\n", " # shape: (batch, num_layer(=1 here) * num_of_direction(=1 here), hidden_size)\n", " hidden = paddle.zeros([batch_size, 1, hidden_size])\n", " cell = paddle.zeros([batch_size, 1, hidden_size])\n", "\n", " loss = paddle.zeros([1])\n", " # the decoder recurrent loop mentioned above\n", " for i in range(MAX_LEN + 2):\n", " cn_word = paddle.to_tensor(x_cn_data[:,i:i+1])\n", " cn_word_label = paddle.to_tensor(x_cn_label_data[:,i:i+1])\n", "\n", " logits, (hidden, cell) = atten_decoder(cn_word, hidden, cell, en_repr)\n", " step_loss = F.softmax_with_cross_entropy(logits, cn_word_label)\n", " avg_step_loss = paddle.mean(step_loss)\n", " loss += avg_step_loss\n", "\n", " loss = loss / (MAX_LEN + 2)\n", " if(iteration % 200 == 0):\n", " print(\"iter {}, loss:{}\".format(iteration, loss.numpy()))\n", "\n", " loss.backward()\n", " opt.minimize(loss)\n", " encoder.clear_gradients()\n", " atten_decoder.clear_gradients()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 使用模型进行机器翻译\n", "\n", "根据你所使用的计算设备的不同,上面的训练过程可能需要不等的时间。(在一台Mac笔记本上,大约耗时15~20分钟)\n", "完成上面的模型训练之后,我们可以得到一个能够从英文翻译成中文的机器翻译模型。接下来我们通过一个greedy search来实现使用该模型完成实际的机器翻译。(实际的任务中,你可能需要用beam search算法来提升效果)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "i agree with him\n", "true: 我同意他。\n", "pred: 我同意他。\n", "i think i ll take a bath tonight\n", "true: 我想我今晚會洗澡。\n", "pred: 我想我今晚會洗澡。\n", "he asked for a drink of water\n", "true: 他要了水喝。\n", "pred: 他喝了一杯水。\n", "i began running\n", "true: 我開始跑。\n", "pred: 我開始跑。\n", "i m sick\n", "true: 我生病了。\n", "pred: 我生病了。\n", "you had better go to the dentist s\n", "true: 你最好去看牙醫。\n", "pred: 你最好去看牙醫。\n", "we went for a walk in the forest\n", "true: 我们去了林中散步。\n", "pred: 我們去公园散步。\n", "you ve arrived very early\n", "true: 你來得很早。\n", "pred: 你去早个。\n", "he pretended not to be listening\n", "true: 他裝作沒在聽。\n", "pred: 他假装聽到它。\n", "he always wanted to study japanese\n", "true: 他一直想學日語。\n", "pred: 他一直想學日語。\n" ] } ], "source": [ "encoder.eval()\n", "atten_decoder.eval()\n", "\n", "num_of_exampels_to_evaluate = 10\n", "\n", "indices = np.random.choice(len(train_en_sents), num_of_exampels_to_evaluate, replace=False)\n", "x_data = train_en_sents[indices]\n", "sent = paddle.to_tensor(x_data)\n", "en_repr = encoder(sent)\n", "\n", "word = np.array(\n", " [[cn_vocab['']]] * num_of_exampels_to_evaluate\n", ")\n", "word = paddle.to_tensor(word)\n", "\n", "hidden = paddle.zeros([num_of_exampels_to_evaluate, 1, hidden_size])\n", "cell = paddle.zeros([num_of_exampels_to_evaluate, 1, hidden_size])\n", "\n", "decoded_sent = []\n", "for i in range(MAX_LEN + 2):\n", " logits, (hidden, cell) = atten_decoder(word, hidden, cell, en_repr)\n", " word = paddle.argmax(logits, axis=1)\n", " decoded_sent.append(word.numpy())\n", " word = paddle.unsqueeze(word, axis=-1)\n", " \n", "results = np.stack(decoded_sent, axis=1)\n", "for i in range(num_of_exampels_to_evaluate):\n", " en_input = \" \".join(filtered_pairs[indices[i]][0])\n", " ground_truth_translate = \"\".join(filtered_pairs[indices[i]][1])\n", " model_translate = \"\"\n", " for k in results[i]:\n", " w = list(cn_vocab)[k]\n", " if w != '' and w != '':\n", " model_translate += w\n", " print(en_input)\n", " print(\"true: {}\".format(ground_truth_translate))\n", " print(\"pred: {}\".format(model_translate))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The End\n", "\n", "你还可以通过变换网络结构,调整数据集,尝试不同的参数的方式来进一步提升本示例当中的机器翻译的效果。同时,也可以尝试在其他的类似的任务中用飞桨来完成实际的实践。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }