提交 6cbf7fd8 编写于 作者: C chenlong

add three docs for paddle2.0

上级 9fb40dac
此差异已折叠。
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 基于GRU的Text Generation\n",
"文本生成是NLP领域中的重要组成部分,基于GRU,我们可以快速构建文本生成模型。"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'2.0.0-alpha0'"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import paddle\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"paddle.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 复现过程\n",
"## 1.下载数据\n",
"文件路径:https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt\n",
"保存为txt格式即可"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.读取数据"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Length of text: 1115394 characters\n"
]
}
],
"source": [
"# 文件路径\n",
"path_to_file = './shakespeare.txt'\n",
"text = open(path_to_file, 'rb').read().decode(encoding='utf-8')\n",
"\n",
"# 文本长度是指文本中的字符个数\n",
"print ('Length of text: {} characters'.format(len(text)))"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"First Citizen:\n",
"Before we proceed any further, hear me speak.\n",
"\n",
"All:\n",
"Speak, speak.\n",
"\n",
"First Citizen:\n",
"You are all resolved rather to die than to famish?\n",
"\n",
"All:\n",
"Resolved. resolved.\n",
"\n",
"First Citizen:\n",
"First, you know Caius Marcius is chief enemy to the people.\n",
"\n"
]
}
],
"source": [
"# 看一看文本中的前 250 个字符\n",
"print(text[:250])"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"65 unique characters\n"
]
}
],
"source": [
"# 文本中的非重复字符\n",
"vocab = sorted(set(text))\n",
"print ('{} unique characters'.format(len(vocab)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.向量化文本\n",
"在训练之前,我们需要将字符串映射到数字表示值。创建两个查找表格:一个将字符映射到数字,另一个将数字映射到字符。"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"# 创建从非重复字符到索引的映射\n",
"char2idx = {u:i for i, u in enumerate(vocab)}\n",
"idx2char = np.array(vocab)\n",
"# 用index表示文本\n",
"text_as_int = np.array([char2idx[c] for c in text])\n"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'\\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, \"'\": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}\n"
]
}
],
"source": [
"print(char2idx)"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['\\n' ' ' '!' '$' '&' \"'\" ',' '-' '.' '3' ':' ';' '?' 'A' 'B' 'C' 'D' 'E'\n",
" 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W'\n",
" 'X' 'Y' 'Z' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o'\n",
" 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z']\n"
]
}
],
"source": [
"print(idx2char)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"现在,每个字符都有一个整数表示值。请注意,我们将字符映射至索引 0 至 len(vocab)."
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[18 47 56 ... 45 8 0]\n",
"1115394\n"
]
}
],
"source": [
"print(text_as_int)\n",
"print(len(text_as_int))"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"'First Citizen' ---- characters mapped to int ---- > [18 47 56 57 58 1 15 47 58 47 64 43 52]\n"
]
}
],
"source": [
"# 显示文本首 13 个字符的整数映射\n",
"print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 预测任务\n",
"给定一个字符或者一个字符序列,下一个最可能出现的字符是什么?这就是我们训练模型要执行的任务。输入进模型的是一个字符序列,我们训练这个模型来预测输出 -- 每个时间步(time step)预测下一个字符是什么。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 创建训练样本和目标\n",
"接下来,将文本划分为样本序列。每个输入序列包含文本中的 seq_length 个字符。\n",
"\n",
"对于每个输入序列,其对应的目标包含相同长度的文本,但是向右顺移一个字符。\n",
"\n",
"将文本拆分为长度为 seq_length 的文本块。例如,假设 seq_length 为 4 而且文本为 “Hello”, 那么输入序列将为 “Hell”,目标序列将为 “ello”。"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [],
"source": [
"seq_length = 100\n",
"def load_data(data, seq_length):\n",
" train_data = []\n",
" train_label = []\n",
" for i in range(len(data)//seq_length):\n",
" train_data.append(data[i*seq_length:(i+1)*seq_length])\n",
" train_label.append(data[i*seq_length + 1:(i+1)*seq_length+1])\n",
" return train_data, train_label\n",
"train_data, train_label = load_data(text_as_int, seq_length)"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"training data is :\n",
"First Citizen:\n",
"Before we proceed any further, hear me speak.\n",
"\n",
"All:\n",
"Speak, speak.\n",
"\n",
"First Citizen:\n",
"You\n",
"------------\n",
"training_label is:\n",
"irst Citizen:\n",
"Before we proceed any further, hear me speak.\n",
"\n",
"All:\n",
"Speak, speak.\n",
"\n",
"First Citizen:\n",
"You \n"
]
}
],
"source": [
"char_list = []\n",
"label_list = []\n",
"for char_id, label_id in zip(train_data[0], train_label[0]):\n",
" char_list.append(idx2char[char_id])\n",
" label_list.append(idx2char[label_id])\n",
"\n",
"print('training data is :')\n",
"print(''.join(char_list))\n",
"print(\"------------\")\n",
"print('training_label is:')\n",
"print(''.join(label_list))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 用`paddle.batch`完成数据的加载"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"batch_size = 64\n",
"def train_reader():\n",
" for i in range(len(train_data)):\n",
" yield train_data[i], train_label[i]\n",
"batch_reader = paddle.batch(train_reader, batch_size=batch_size) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 基于GRU构建文本生成模型"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"import numpy as np\n",
"\n",
"vocab_size = len(vocab)\n",
"embedding_dim = 256\n",
"hidden_size = 1024\n",
"class GRUModel(paddle.nn.Layer):\n",
" def __init__(self):\n",
" super(GRUModel, self).__init__()\n",
" self.embedding = paddle.nn.Embedding(size=[vocab_size, embedding_dim])\n",
" self.gru = paddle.incubate.hapi.text.GRU(input_size=embedding_dim, hidden_size=hidden_size)\n",
" self.linear1 = paddle.nn.Linear(hidden_size, hidden_size//2)\n",
" self.linear2 = paddle.nn.Linear(hidden_size//2, vocab_size)\n",
" def forward(self, x):\n",
" x = self.embedding(x)\n",
" x = paddle.reshape(x, [-1, 1, embedding_dim])\n",
" x, _ = self.gru(x)\n",
" x = paddle.reshape(x, [-1, hidden_size])\n",
" x = self.linear1(x)\n",
" x = paddle.nn.functional.relu(x)\n",
" x = self.linear2(x)\n",
" x = paddle.nn.functional.softmax(x)\n",
" return x"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch: 0, batch: 50, loss is: [3.7835407]\n",
"epoch: 0, batch: 100, loss is: [3.2774005]\n",
"epoch: 0, batch: 150, loss is: [3.2576294]\n",
"epoch: 1, batch: 50, loss is: [3.3434656]\n",
"epoch: 1, batch: 100, loss is: [2.9948606]\n",
"epoch: 1, batch: 150, loss is: [3.0285468]\n",
"epoch: 2, batch: 50, loss is: [3.133882]\n",
"epoch: 2, batch: 100, loss is: [2.7811327]\n",
"epoch: 2, batch: 150, loss is: [2.8133557]\n",
"epoch: 3, batch: 50, loss is: [3.000814]\n",
"epoch: 3, batch: 100, loss is: [2.6404488]\n",
"epoch: 3, batch: 150, loss is: [2.7050896]\n",
"epoch: 4, batch: 50, loss is: [2.9289591]\n",
"epoch: 4, batch: 100, loss is: [2.5629177]\n",
"epoch: 4, batch: 150, loss is: [2.6438713]\n",
"epoch: 5, batch: 50, loss is: [2.8832304]\n",
"epoch: 5, batch: 100, loss is: [2.5137548]\n",
"epoch: 5, batch: 150, loss is: [2.5926144]\n",
"epoch: 6, batch: 50, loss is: [2.8562953]\n",
"epoch: 6, batch: 100, loss is: [2.4752126]\n",
"epoch: 6, batch: 150, loss is: [2.5510798]\n",
"epoch: 7, batch: 50, loss is: [2.8426895]\n",
"epoch: 7, batch: 100, loss is: [2.4442513]\n",
"epoch: 7, batch: 150, loss is: [2.5187433]\n",
"epoch: 8, batch: 50, loss is: [2.8353484]\n",
"epoch: 8, batch: 100, loss is: [2.4200597]\n",
"epoch: 8, batch: 150, loss is: [2.4956212]\n",
"epoch: 9, batch: 50, loss is: [2.8308532]\n",
"epoch: 9, batch: 100, loss is: [2.4011066]\n",
"epoch: 9, batch: 150, loss is: [2.4787998]\n"
]
}
],
"source": [
"paddle.enable_imperative()\n",
"losses = []\n",
"def train(model):\n",
" model.train()\n",
" optim = paddle.optimizer.SGD(learning_rate=0.001, parameter_list=model.parameters())\n",
" for epoch in range(10):\n",
" batch_id = 0\n",
" for batch_data in batch_reader():\n",
" batch_id += 1\n",
" data = np.array(batch_data)\n",
" x_data = data[:, 0]\n",
" y_data = data[:, 1]\n",
" for i in range(len(x_data[0])):\n",
" x_char = x_data[:, i]\n",
" y_char = y_data[:, i]\n",
" x_char = paddle.imperative.to_variable(x_char)\n",
" y_char = paddle.imperative.to_variable(y_char)\n",
" predicts = model(x_char)\n",
" loss = paddle.nn.functional.cross_entropy(predicts, y_char)\n",
" avg_loss = paddle.mean(loss)\n",
" avg_loss.backward()\n",
" optim.minimize(avg_loss)\n",
" model.clear_gradients()\n",
" if batch_id % 50 == 0:\n",
" print(\"epoch: {}, batch: {}, loss is: {}\".format(epoch, batch_id, avg_loss.numpy()))\n",
" losses.append(loss.numpy())\n",
"model = GRUModel()\n",
"train(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 模型预测\n",
"利用训练好的模型,输出初始化文本'ROMEO: ',自动生成后续的num_generate个字符。"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ROMEO:I the the the the the the the the the the the the the the the the the the the the the the the the th\n"
]
}
],
"source": [
"def generate_text(model, start_string):\n",
" \n",
" model.eval()\n",
" num_generate = 100\n",
"\n",
" # Converting our start string to numbers (vectorizing)\n",
" input_eval = [char2idx[s] for s in start_string]\n",
" input_data = paddle.imperative.to_variable(np.array(input_eval))\n",
" input_data = paddle.reshape(input_data, [-1, 1])\n",
" text_generated = []\n",
"\n",
" for i in range(num_generate):\n",
" predicts = model(input_data)\n",
" predicts = predicts.numpy().tolist()[0]\n",
" # print(predicts)\n",
" predicts_id = predicts.index(max(predicts))\n",
" # print(predicts_id)\n",
" # using a categorical distribution to predict the character returned by the model\n",
" input_data = paddle.imperative.to_variable(np.array([predicts_id]))\n",
" input_data = paddle.reshape(input_data, [-1, 1])\n",
" text_generated.append(idx2char[predicts_id])\n",
" return (start_string + ''.join(text_generated))\n",
"print(generate_text(model, start_string=u\"ROMEO:\"))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册