Commit 3e91a34d authored by Chen Long, committed by GitHub

add_tutorial test=develop (#2582)

* add_tutorial test=develop

* fix quick start test=develop

Parent: 7b69a251
Image Classification with a Convolutional Neural Network
=========================================================

This tutorial demonstrates how to perform image classification with a convolutional neural network in PaddlePaddle. It is a fairly simple example: a network made of three convolutional layers classifies the `cifar10 <https://www.cs.toronto.edu/~kriz/cifar.html>`__ dataset.

Environment Setup
-----------------

We will use PaddlePaddle 2.0-beta.
.. code:: ipython3

    import paddle
    import paddle.nn.functional as F
    from paddle.vision.transforms import Normalize
    import numpy as np
    import matplotlib.pyplot as plt

    paddle.disable_static()
    print(paddle.__version__)
    print(paddle.__git_commit__)

.. parsed-literal::

    0.0.0
    264e76cae6861ad9b1d4bcd8c3212f7a78c01e4d
Loading and Exploring the Dataset
---------------------------------

We will use PaddlePaddle's built-in API to download the dataset and prepare the data iterators for the training task. The cifar10 dataset consists of 60,000 color images of size 32 x 32; 50,000 of them form the training set and the other 10,000 form the test set. The images fall into 10 categories, and our task is to train a model that classifies them correctly.
.. code:: ipython3

    cifar10_train = paddle.vision.datasets.cifar.Cifar10(mode='train', transform=None)
    train_images = np.zeros((50000, 32, 32, 3), dtype='float32')
    train_labels = np.zeros((50000, 1), dtype='int32')
    for i, data in enumerate(cifar10_train):
        train_image, train_label = data

        # reshape the flat array to (channel, height, width), scale to [0, 1],
        # then transpose to (height, width, channel) for matplotlib
        train_image = train_image.reshape((3, 32, 32)).astype('float32') / 255.
        train_image = train_image.transpose(1, 2, 0)
        train_images[i, :, :, :] = train_image
        train_labels[i, 0] = train_label
Exploring the Dataset
---------------------

Next we randomly pick a few images from the dataset and display them, to get an intuitive feel for the data.
.. code:: ipython3

    class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
    plt.figure(figsize=(10, 10))
    sample_idxs = np.random.choice(50000, size=25, replace=False)
    for i in range(25):
        plt.subplot(5, 5, i + 1)
        plt.xticks([])
        plt.yticks([])
        plt.imshow(train_images[sample_idxs[i]], cmap=plt.cm.binary)
        plt.xlabel(class_names[train_labels[sample_idxs[i]][0]])
    plt.show()
.. image:: convnet_image_classification_files/convnet_image_classification_6_0.png
Building the Network
--------------------

Next we define a classification network composed of three 2D convolutions (``Conv2d``), each followed by a ``relu`` activation, two 2D max-pooling layers (``MaxPool2d``), and two linear layers. It maps an image of shape ``(32, 32, 3)`` to 10 outputs, corresponding to the 10 classes.
.. code:: ipython3

    class MyNet(paddle.nn.Layer):
        def __init__(self, num_classes=1):
            super(MyNet, self).__init__()
            self.conv1 = paddle.nn.Conv2d(in_channels=3, out_channels=32, kernel_size=(3, 3))
            self.pool1 = paddle.nn.MaxPool2d(kernel_size=2, stride=2)

            self.conv2 = paddle.nn.Conv2d(in_channels=32, out_channels=64, kernel_size=(3, 3))
            self.pool2 = paddle.nn.MaxPool2d(kernel_size=2, stride=2)

            self.conv3 = paddle.nn.Conv2d(in_channels=64, out_channels=64, kernel_size=(3, 3))

            self.flatten = paddle.nn.Flatten()

            self.linear1 = paddle.nn.Linear(in_features=1024, out_features=64)
            self.linear2 = paddle.nn.Linear(in_features=64, out_features=num_classes)

        def forward(self, x):
            x = self.conv1(x)
            x = F.relu(x)
            x = self.pool1(x)

            x = self.conv2(x)
            x = F.relu(x)
            x = self.pool2(x)

            x = self.conv3(x)
            x = F.relu(x)

            x = self.flatten(x)
            x = self.linear1(x)
            x = F.relu(x)
            x = self.linear2(x)
            return x
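
As a quick sanity check of where ``in_features=1024`` comes from: each 3x3 convolution (no padding) shrinks the spatial size by 2, and each 2x2 max-pool halves it (with flooring), so 32 -> 30 -> 15 -> 13 -> 6 -> 4, and the final feature map of shape ``(64, 4, 4)`` flattens to 64 * 4 * 4 = 1024 elements. A minimal check, reusing the imports and the class defined above:

.. code:: ipython3

    # pass a dummy batch through the network and confirm the output shape
    dummy = paddle.to_tensor(np.zeros((1, 3, 32, 32), dtype='float32'))
    print(MyNet(num_classes=10)(dummy).shape)  # expected: [1, 10]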
Model Training
--------------

Next we train the model in a loop, in which we will:

- use the ``paddle.optimizer.Adam`` optimizer,
- use ``F.softmax_with_cross_entropy`` to compute the loss,
- use ``paddle.io.DataLoader`` to load the data and assemble the batches.
.. code:: ipython3

    epoch_num = 10
    batch_size = 32
    learning_rate = 0.001
.. code:: ipython3

    val_acc_history = []
    val_loss_history = []

    def train(model):
        print('start training ... ')
        # turn into training mode
        model.train()

        opt = paddle.optimizer.Adam(learning_rate=learning_rate,
                                    parameters=model.parameters())

        train_loader = paddle.io.DataLoader(cifar10_train,
                                            places=paddle.CPUPlace(),
                                            shuffle=True,
                                            batch_size=batch_size)

        cifar10_test = paddle.vision.datasets.cifar.Cifar10(mode='test', transform=None)
        valid_loader = paddle.io.DataLoader(cifar10_test, places=paddle.CPUPlace(), batch_size=batch_size)

        for epoch in range(epoch_num):
            for batch_id, data in enumerate(train_loader()):
                x_data = paddle.cast(data[0], 'float32')
                x_data = paddle.reshape(x_data, (-1, 3, 32, 32)) / 255.0
                y_data = paddle.cast(data[1], 'int64')
                y_data = paddle.reshape(y_data, (-1, 1))

                logits = model(x_data)
                loss = F.softmax_with_cross_entropy(logits, y_data)
                avg_loss = paddle.mean(loss)

                if batch_id % 1000 == 0:
                    print("epoch: {}, batch_id: {}, loss is: {}".format(epoch, batch_id, avg_loss.numpy()))
                avg_loss.backward()
                opt.minimize(avg_loss)
                model.clear_gradients()

            # evaluate model after one epoch
            model.eval()
            accuracies = []
            losses = []
            for batch_id, data in enumerate(valid_loader()):
                x_data = paddle.cast(data[0], 'float32')
                x_data = paddle.reshape(x_data, (-1, 3, 32, 32)) / 255.0
                y_data = paddle.cast(data[1], 'int64')
                y_data = paddle.reshape(y_data, (-1, 1))

                logits = model(x_data)
                loss = F.softmax_with_cross_entropy(logits, y_data)
                acc = paddle.metric.accuracy(logits, y_data)
                accuracies.append(np.mean(acc.numpy()))
                losses.append(np.mean(loss.numpy()))

            avg_acc, avg_loss = np.mean(accuracies), np.mean(losses)
            print("[validation] accuracy/loss: {}/{}".format(avg_acc, avg_loss))
            val_acc_history.append(avg_acc)
            val_loss_history.append(avg_loss)
            model.train()

    model = MyNet(num_classes=10)
    train(model)
.. parsed-literal::

    start training ...
    epoch: 0, batch_id: 0, loss is: [2.3024805]
    epoch: 0, batch_id: 1000, loss is: [1.1422595]
    [validation] accuracy/loss: 0.5575079917907715/1.2516425848007202
    epoch: 1, batch_id: 0, loss is: [0.9350736]
    epoch: 1, batch_id: 1000, loss is: [1.3825703]
    [validation] accuracy/loss: 0.5959464907646179/1.1320706605911255
    epoch: 2, batch_id: 0, loss is: [0.979844]
    epoch: 2, batch_id: 1000, loss is: [0.87730503]
    [validation] accuracy/loss: 0.6607428193092346/0.9754576086997986
    epoch: 3, batch_id: 0, loss is: [0.7345351]
    epoch: 3, batch_id: 1000, loss is: [1.0982555]
    [validation] accuracy/loss: 0.6671326160430908/0.9667007327079773
    epoch: 4, batch_id: 0, loss is: [0.9291839]
    epoch: 4, batch_id: 1000, loss is: [1.1812104]
    [validation] accuracy/loss: 0.6895966529846191/0.9075900316238403
    epoch: 5, batch_id: 0, loss is: [0.5072213]
    epoch: 5, batch_id: 1000, loss is: [0.60360587]
    [validation] accuracy/loss: 0.6944888234138489/0.8740479350090027
    epoch: 6, batch_id: 0, loss is: [0.5917944]
    epoch: 6, batch_id: 1000, loss is: [0.7963876]
    [validation] accuracy/loss: 0.7072683572769165/0.8597638607025146
    epoch: 7, batch_id: 0, loss is: [0.50116754]
    epoch: 7, batch_id: 1000, loss is: [0.95844793]
    [validation] accuracy/loss: 0.700579047203064/0.876727819442749
    epoch: 8, batch_id: 0, loss is: [0.87496114]
    epoch: 8, batch_id: 1000, loss is: [0.68749857]
    [validation] accuracy/loss: 0.7198482155799866/0.8403064608573914
    epoch: 9, batch_id: 0, loss is: [0.8548105]
    epoch: 9, batch_id: 1000, loss is: [0.6488569]
    [validation] accuracy/loss: 0.7106629610061646/0.874437153339386
.. code:: ipython3

    plt.plot(val_acc_history, label='validation accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.ylim([0.5, 0.8])
    plt.legend(loc='lower right')

.. parsed-literal::

    <matplotlib.legend.Legend at 0x163d6ec50>
.. image:: convnet_image_classification_files/convnet_image_classification_12_1.png
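
The training loop above also collects the validation loss in ``val_loss_history``; if you want to inspect it as well, a symmetric plot (not part of the original notebook, but using only variables defined above) would be:

.. code:: ipython3

    # optional: plot the per-epoch validation loss collected during training
    plt.plot(val_loss_history, label='validation loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend(loc='upper right')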
The End
-------

As the example above shows, a simple convolutional neural network built with PaddlePaddle reaches an accuracy of over 71% on the cifar10 dataset.
Image Search Based on Image Similarity
======================================

Introduction
------------

Image search is a widely used application of deep learning: whether for retrieving engineering drawings or for finding similar images on the internet, deep learning algorithms can retrieve, for a given query image, the images most similar to it.

This example briefly shows how to implement image search with the PaddlePaddle open-source framework. The basic idea is to first map each image to a vector in a high-dimensional space with a convolutional neural network, and then measure how similar two images are by comparing their vector representations (in this example we use cosine similarity). During training, the objective is to make the representations of images of the same category as similar as possible, and those of images of different categories as dissimilar as possible. At prediction time, given an image uploaded by a user, we compute its similarity to every image in the gallery and return the gallery images sorted from most to least similar.
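
As a refresher, the cosine similarity of two vectors is their dot product divided by the product of their norms. The tiny standalone sketch below (plain numpy, with illustrative values) computes exactly what the model later obtains via ``l2_normalize`` followed by ``matmul``:

.. code:: ipython3

    import numpy as np

    # cosine similarity of two illustrative embedding vectors
    u = np.array([1.0, 2.0, 3.0])
    v = np.array([2.0, 4.0, 6.0])
    cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    print(cos_sim)  # 1.0: parallel vectors are maximally similar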
Environment Setup
-----------------

This example is based on version 2.0 of the PaddlePaddle open-source framework.

.. code:: ipython3

    import paddle
    import paddle.nn.functional as F
    import numpy as np
    import random
    import matplotlib.pyplot as plt
    from PIL import Image
    from collections import defaultdict

    paddle.disable_static()
    print(paddle.__version__)
    print(paddle.__git_commit__)

.. parsed-literal::

    0.0.0
    89af2088b6e74bdfeef2d4d78e08461ed2aafee5
Dataset
-------

This example uses the `CIFAR-10 <https://www.cs.toronto.edu/~kriz/cifar.html>`__ dataset, a classic dataset of 50,000 training images and 10,000 test images, each an RGB image of height and width 32. ``paddle.dataset.cifar`` conveniently handles the download, normalizes the data into the ``(0, 1.0)`` range, and provides an iterator for sequential access. We store the training and test data in two ``numpy`` arrays for later use in training and evaluation.
.. code:: ipython3

    cifar10_train = paddle.vision.datasets.cifar.Cifar10(mode='train', transform=None)
    x_train = np.zeros((50000, 3, 32, 32))
    y_train = np.zeros((50000, 1), dtype='int32')

    for i in range(len(cifar10_train)):
        train_image, train_label = cifar10_train[i]
        train_image = train_image.reshape((3, 32, 32))

        # normalize the data
        x_train[i, :, :, :] = train_image / 255.
        y_train[i, 0] = train_label

    y_train = np.squeeze(y_train)
    print(x_train.shape)
    print(y_train.shape)

.. parsed-literal::

    (50000, 3, 32, 32)
    (50000,)
.. code:: ipython3

    cifar10_test = paddle.vision.datasets.cifar.Cifar10(mode='test', transform=None)
    x_test = np.zeros((10000, 3, 32, 32), dtype='float32')
    y_test = np.zeros((10000, 1), dtype='int64')

    for i in range(len(cifar10_test)):
        test_image, test_label = cifar10_test[i]
        test_image = test_image.reshape((3, 32, 32))

        # normalize the data
        x_test[i, :, :, :] = test_image / 255.
        y_test[i, 0] = test_label

    y_test = np.squeeze(y_test)
    print(x_test.shape)
    print(y_test.shape)

.. parsed-literal::

    (10000, 3, 32, 32)
    (10000,)
Exploring the Data
------------------

Next we randomly pick some images from the training data and take a look at them.
.. code:: ipython3

    height_width = 32

    def show_collage(examples):
        box_size = height_width + 2
        num_rows, num_cols = examples.shape[:2]

        collage = Image.new(
            mode="RGB",
            size=(num_cols * box_size, num_rows * box_size),
            color=(255, 255, 255),
        )
        for row_idx in range(num_rows):
            for col_idx in range(num_cols):
                array = (np.array(examples[row_idx, col_idx]) * 255).astype(np.uint8)
                array = array.transpose(1, 2, 0)
                collage.paste(
                    Image.fromarray(array), (col_idx * box_size, row_idx * box_size)
                )

        collage = collage.resize((2 * num_cols * box_size, 2 * num_rows * box_size))
        return collage

    sample_idxs = np.random.randint(0, 50000, size=(5, 5))
    examples = x_train[sample_idxs]
    show_collage(examples)
.. image:: image_search_files/image_search_8_0.png
Building the Training Data
--------------------------

Training samples for an image-retrieval model differ from the ``(image, class)`` samples of a typical classification task: each sample has the form ``(image0, image1, similar_or_not)``, i.e. it consists of two images, and its ``label`` is a flag (0 or 1) indicating whether the two images are similar.

Naturally, two images from the same category are similar, and two images from different categories should be dissimilar.

To conveniently sample similar (and dissimilar) image pairs, we first build an index that maps each category to all images of that category.
.. code:: ipython3

    class_idx_to_train_idxs = defaultdict(list)
    for y_train_idx, y in enumerate(y_train):
        class_idx_to_train_idxs[y].append(y_train_idx)

    class_idx_to_test_idxs = defaultdict(list)
    for y_test_idx, y in enumerate(y_test):
        class_idx_to_test_idxs[y].append(y_test_idx)
With this index we can prepare a data-reading iterator for PaddlePaddle. Each call yields ``2 * number of classes`` images, which for CIFAR-10 is 20. The first 10 images and the last 10 images are each one randomly drawn image per category. Thus, during training we effectively get 10 similar pairs and 90 dissimilar pairs (each of the first 10 images is similar to the image at the corresponding position among the last 10, and dissimilar to the other 9).
.. code:: ipython3

    num_classes = 10

    def reader_creator(num_batchs):
        def reader():
            iter_step = 0
            while True:
                if iter_step >= num_batchs:
                    break
                iter_step += 1

                x = np.empty((2, num_classes, 3, height_width, height_width), dtype=np.float32)
                for class_idx in range(num_classes):
                    examples_for_class = class_idx_to_train_idxs[class_idx]
                    anchor_idx = random.choice(examples_for_class)
                    positive_idx = random.choice(examples_for_class)
                    # make sure the positive is a different image than the anchor
                    while positive_idx == anchor_idx:
                        positive_idx = random.choice(examples_for_class)
                    x[0, class_idx] = x_train[anchor_idx]
                    x[1, class_idx] = x_train[positive_idx]
                yield x

        return reader

    # num_batchs: how many batches to generate
    def anchor_positive_pairs(num_batchs=100):
        return reader_creator(num_batchs)
.. code:: ipython3

    pairs_train_reader = anchor_positive_pairs(num_batchs=1000)

Let's take the first batch of images and visualize it, as shown below. (This makes the composition of the training samples easier to understand.)
.. code:: ipython3

    examples = next(pairs_train_reader())
    print(examples.shape)
    show_collage(examples)

.. parsed-literal::

    (2, 10, 3, 32, 32)
.. image:: image_search_files/image_search_15_1.png
A Network That Maps Images to High-Dimensional Vectors
-------------------------------------------------------

Our goal is to first convert each image into a high-dimensional representation, and then compute image similarity on those representations.

The network below converts an image of shape ``(3, 32, 32)`` into a vector of shape ``(8,)``. In some sources this vector is also called an ``Embedding``; note how it differs from the word vectors used in natural language processing.

The model consists of three consecutive convolutions plus a global average pooling, followed by a fully connected linear layer that maps into an 8-dimensional vector space. For convenience when computing the cosine similarity later, we finally normalize the vector with `l2_normalize <https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/layers_cn/l2_normalize_cn.html>`__ (this takes care of the denominator of the cosine similarity).
.. code:: ipython3

    class MyNet(paddle.nn.Layer):
        def __init__(self):
            super(MyNet, self).__init__()

            self.conv1 = paddle.nn.Conv2d(in_channels=3,
                                          out_channels=32,
                                          kernel_size=(3, 3),
                                          stride=2)
            self.conv2 = paddle.nn.Conv2d(in_channels=32,
                                          out_channels=64,
                                          kernel_size=(3, 3),
                                          stride=2)
            self.conv3 = paddle.nn.Conv2d(in_channels=64,
                                          out_channels=128,
                                          kernel_size=(3, 3),
                                          stride=2)
            self.global_pool = paddle.nn.AdaptiveAvgPool2d((1, 1))
            self.fc1 = paddle.nn.Linear(in_features=128, out_features=8)

        def forward(self, x):
            x = self.conv1(x)
            x = F.relu(x)
            x = self.conv2(x)
            x = F.relu(x)
            x = self.conv3(x)
            x = F.relu(x)
            x = self.global_pool(x)
            x = paddle.squeeze(x, axis=[2, 3])
            x = self.fc1(x)
            # l2-normalize so that dot products of embeddings are cosine similarities
            x = F.l2_normalize(x, axis=1)
            return x
The training process is shown in the code below:

- The ``inverse_temperature`` parameter keeps the softmax in a region where its gradients are more pronounced (compare the ``scale`` applied after the dot product in `attention is all you need <https://arxiv.org/abs/1706.03762>`__; see also the sketch after this list).
- Each step first computes the high-dimensional representations of the first 10 images (the anchors) and of the last 10 images with the network above, and then uses `matmul <https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/layers_cn/matmul_cn.html>`__ to compute the similarity of each of the first 10 images to each of the last 10 (so ``similarities`` is a ``(10, 10)`` Tensor).
- When constructing the class labels for `softmax_with_cross_entropy <https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/layers_cn/softmax_with_cross_entropy_cn.html>`__, we can correspondingly use the label values 0 through num_classes - 1, so that the learning objective pushes the similarity of similar images towards 1.0 and the similarity of dissimilar images towards -1.0.
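
Before the training loop itself, a small standalone aside on what the temperature does (plain numpy with illustrative values; not part of the tutorial's own code):

.. code:: ipython3

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    sims = np.array([0.9, 0.8, 0.1])  # cosine similarities lie in [-1, 1]
    print(softmax(sims))        # close to uniform: weak gradient signal
    print(softmax(sims / 0.2))  # scaled by inverse temperature 1/0.2: much sharper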
.. code:: ipython3

    # define the training procedure
    def train(model):
        print('start training ... ')
        model.train()

        inverse_temperature = paddle.to_tensor(np.array([1.0/0.2], dtype='float32'))

        epoch_num = 20

        opt = paddle.optimizer.Adam(learning_rate=0.0001,
                                    parameters=model.parameters())

        for epoch in range(epoch_num):
            for batch_id, data in enumerate(pairs_train_reader()):
                anchors_data, positives_data = data[0], data[1]

                anchors = paddle.to_tensor(anchors_data)
                positives = paddle.to_tensor(positives_data)

                anchor_embeddings = model(anchors)
                positive_embeddings = model(positives)

                # (10, 10) matrix of cosine similarities between anchors and positives
                similarities = paddle.matmul(anchor_embeddings, positive_embeddings, transpose_y=True)
                similarities = paddle.multiply(similarities, inverse_temperature)

                # the i-th anchor should be most similar to the i-th positive
                sparse_labels = paddle.arange(0, num_classes, dtype='int64')
                sparse_labels = paddle.reshape(sparse_labels, (num_classes, 1))

                loss = F.softmax_with_cross_entropy(similarities, sparse_labels)
                avg_loss = paddle.mean(loss)
                if batch_id % 500 == 0:
                    print("epoch: {}, batch_id: {}, loss is: {}".format(epoch, batch_id, avg_loss.numpy()))
                avg_loss.backward()
                opt.minimize(avg_loss)
                model.clear_gradients()

    model = MyNet()
    train(model)
.. parsed-literal::

    start training ...
    epoch: 0, batch_id: 0, loss is: [2.3080945]
    epoch: 0, batch_id: 500, loss is: [2.326215]
    epoch: 1, batch_id: 0, loss is: [2.0898924]
    epoch: 1, batch_id: 500, loss is: [1.8754089]
    epoch: 2, batch_id: 0, loss is: [2.2416227]
    epoch: 2, batch_id: 500, loss is: [1.9024051]
    epoch: 3, batch_id: 0, loss is: [1.841417]
    epoch: 3, batch_id: 500, loss is: [2.1239076]
    epoch: 4, batch_id: 0, loss is: [1.9291763]
    epoch: 4, batch_id: 500, loss is: [2.2363486]
    epoch: 5, batch_id: 0, loss is: [2.0078473]
    epoch: 5, batch_id: 500, loss is: [2.0765374]
    epoch: 6, batch_id: 0, loss is: [2.080376]
    epoch: 6, batch_id: 500, loss is: [2.1759136]
    epoch: 7, batch_id: 0, loss is: [1.908263]
    epoch: 7, batch_id: 500, loss is: [1.7774136]
    epoch: 8, batch_id: 0, loss is: [1.6335764]
    epoch: 8, batch_id: 500, loss is: [1.5713912]
    epoch: 9, batch_id: 0, loss is: [2.287479]
    epoch: 9, batch_id: 500, loss is: [1.7719988]
    epoch: 10, batch_id: 0, loss is: [1.2894523]
    epoch: 10, batch_id: 500, loss is: [1.599735]
    epoch: 11, batch_id: 0, loss is: [1.78816]
    epoch: 11, batch_id: 500, loss is: [1.4773489]
    epoch: 12, batch_id: 0, loss is: [1.6737808]
    epoch: 12, batch_id: 500, loss is: [1.8889393]
    epoch: 13, batch_id: 0, loss is: [1.6156021]
    epoch: 13, batch_id: 500, loss is: [1.3851049]
    epoch: 14, batch_id: 0, loss is: [1.3854092]
    epoch: 14, batch_id: 500, loss is: [2.0325592]
    epoch: 15, batch_id: 0, loss is: [1.9734558]
    epoch: 15, batch_id: 500, loss is: [1.8050598]
    epoch: 16, batch_id: 0, loss is: [1.7084911]
    epoch: 16, batch_id: 500, loss is: [1.8919995]
    epoch: 17, batch_id: 0, loss is: [1.3137552]
    epoch: 17, batch_id: 500, loss is: [1.8817297]
    epoch: 18, batch_id: 0, loss is: [1.9453808]
    epoch: 18, batch_id: 500, loss is: [2.1317677]
    epoch: 19, batch_id: 0, loss is: [1.6051079]
    epoch: 19, batch_id: 500, loss is: [1.779858]
Model Prediction
----------------

After training finishes, we can use the network to compute the high-dimensional vector representation (embedding) of any image. By computing the similarity between that embedding and the embeddings of the images in the gallery, we can sort the gallery by similarity; the higher an image ranks, the more similar it is.

Below we compute the pairwise similarities of all images in the test set, and then display some of the similar images.
.. code:: ipython3

    near_neighbours_per_example = 10

    x_test_t = paddle.to_tensor(x_test)
    test_images_embeddings = model(x_test_t)
    similarities_matrix = paddle.matmul(test_images_embeddings, test_images_embeddings, transpose_y=True)

    indices = paddle.argsort(similarities_matrix, descending=True)
    indices = indices.numpy()
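
For a single query image, as described in the introduction, the same machinery works one row at a time. A minimal sketch reusing the variables above (the query index 0 is arbitrary):

.. code:: ipython3

    # rank the whole gallery for one query image
    query_embedding = test_images_embeddings[0:1]  # shape (1, 8)
    sims = paddle.matmul(query_embedding, test_images_embeddings, transpose_y=True)
    ranked = paddle.argsort(sims, descending=True).numpy()[0]
    print(ranked[:5])  # the five most similar gallery images (the first is the query itself)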
.. code:: ipython3

    num_collage_examples = 10

    examples = np.empty(
        (
            num_collage_examples,
            near_neighbours_per_example + 1,
            3,
            height_width,
            height_width,
        ),
        dtype=np.float32,
    )

    for row_idx in range(num_collage_examples):
        # the first column of each row is the query image itself
        examples[row_idx, 0] = x_test[row_idx]
        anchor_near_neighbours = indices[row_idx][1:near_neighbours_per_example + 1]
        for col_idx, nn_idx in enumerate(anchor_near_neighbours):
            examples[row_idx, col_idx + 1] = x_test[nn_idx]

    show_collage(examples)
.. image:: image_search_files/image_search_22_0.png
The end
-------

In the results shown above, the remaining images in each row are the gallery images ranked most similar to the first image of that row. You can also tune the network architecture and the hyperparameters to get better results.
@@ -6,10 +6,16 @@

Here PaddlePaddle provides a set of computer-vision tutorials for you to learn from:

- `Image Classification <./mnist_lenet_classification/mnist_lenet_classification.html>`_ : classify images on the MNIST dataset with Paddle.
- `Image Classification <./convnet_image_classification/convnet_image_classification.html>`_ : classify images on the CIFAR-10 dataset with Paddle.
- `Image Search <./image_search/image_search.html>`_ : implement image search with Paddle.
- `Image Segmentation <./image_segmentation/pets_image_segmentation_U_Net_like.html>`_ : implement a U-Net model for image segmentation with Paddle.

.. toctree::
    :hidden:
    :titlesonly:

    mnist_lenet_classification/mnist_lenet_classification.rst
    convnet_image_classification/convnet_image_classification.rst
    image_search/image_search.rst
    image_segmentation/pets_image_segmentation_U_Net_like.rst
@@ -11,9 +11,13 @@

**Overview**

- `Quick Start <./simple_case/index_cn.html>`_ : a quick look at the features of Paddle 2.
- `Computer Vision <./cv_case/index_cn.html>`_ : case studies of solving computer-vision problems with Paddle.
- `Natural Language Processing <./nlp_case/index_cn.html>`_ : case studies of solving natural-language-processing problems with Paddle.

.. toctree::
    :hidden:

    quick_start/index_cn.rst
    cv_case/index_cn.rst
    nlp_case/index_cn.rst
Text Classification on the IMDB Dataset with a BOW Network
===========================================================

This tutorial demonstrates text classification on the IMDB dataset with a simple bag-of-words (BOW) network.

The IMDB dataset contains movie reviews labeled as positive or negative, with 25,000 texts for training and 25,000 texts for testing.
The official home of the dataset is: http://ai.stanford.edu/~amaas/data/sentiment/

- Warning: ``paddle.dataset.imdb`` is currently a very rough implementation and will need a replacement.
Environment Setup
-----------------

This example is based on version 2.0 of the PaddlePaddle open-source framework.

.. code::

    import paddle
    import numpy as np

    paddle.disable_static()
    print(paddle.__version__)
    print(paddle.__git_commit__)

.. parsed-literal::

    0.0.0
    264e76cae6861ad9b1d4bcd8c3212f7a78c01e4d
Loading the Data
----------------

We use ``paddle.dataset`` to download the data, build the vocabulary, and prepare the data readers. In PaddlePaddle 2.0, padding is the recommended way to align sequences of different lengths within a batch, so we also add a special ``<pad>`` token to the vocabulary; it is used later to fill up the shorter sentences in a batch.

.. code::

    print("Loading IMDB word dict....")
    word_dict = paddle.dataset.imdb.word_dict()

    train_reader = paddle.dataset.imdb.train(word_dict)
    test_reader = paddle.dataset.imdb.test(word_dict)

.. parsed-literal::

    Loading IMDB word dict....
.. code::

    # add a pad token to the dict for later padding the sequence
    word_dict['<pad>'] = len(word_dict)

    for k in list(word_dict)[:5]:
        print("{}:{}".format(k.decode('ASCII'), word_dict[k]))

    print("...")

    for k in list(word_dict)[-5:]:
        print("{}:{}".format(k if isinstance(k, str) else k.decode('ASCII'), word_dict[k]))

    print("totally {} words".format(len(word_dict)))

.. parsed-literal::

    the:0
    and:1
    a:2
    of:3
    to:4
    ...
    virtual:5143
    warriors:5144
    widely:5145
    <unk>:5146
    <pad>:5147
    totally 5148 words
Parameter Setup
---------------

Here we set the vocabulary size, the ``embedding`` size, the batch_size, and so on.

.. code::

    vocab_size = len(word_dict)
    emb_size = 256
    seq_len = 200
    batch_size = 32
    epoch_num = 2
    pad_id = word_dict['<pad>']

    classes = ['negative', 'positive']

    def ids_to_str(ids):
        # print(ids)
        words = []
        for k in ids:
            w = list(word_dict)[k]
            words.append(w if isinstance(w, str) else w.decode('ASCII'))
        return " ".join(words)
Here we take one sample out of the data and print it, to get a first intuitive impression of the data.

.. code::

    # take out the first example and see what it looks like
    sent, label = next(train_reader())
    print(sent, label)

    print(ids_to_str(sent))
    print(classes[label])
.. parsed-literal::

    [5146, 43, 71, 6, 1092, 14, 0, 878, 130, 151, 5146, 18, 281, 747, 0, 5146, 3, 5146, 2165, 37, 5146, 46, 5, 71, 4089, 377, 162, 46, 5, 32, 1287, 300, 35, 203, 2136, 565, 14, 2, 253, 26, 146, 61, 372, 1, 615, 5146, 5, 30, 0, 50, 3290, 6, 2148, 14, 0, 5146, 11, 17, 451, 24, 4, 127, 10, 0, 878, 130, 43, 2, 50, 5146, 751, 5146, 5, 2, 221, 3727, 6, 9, 1167, 373, 9, 5, 5146, 7, 5, 1343, 13, 2, 5146, 1, 250, 7, 98, 4270, 56, 2316, 0, 928, 11, 11, 9, 16, 5, 5146, 5146, 6, 50, 69, 27, 280, 27, 108, 1045, 0, 2633, 4177, 3180, 17, 1675, 1, 2571] 0
    <unk> has much in common with the third man another <unk> film set among the <unk> of <unk> europe like <unk> there is much inventive camera work there is an innocent american who gets emotionally involved with a woman he doesnt really understand and whose <unk> is all the more striking in contrast with the <unk> br but id have to say that the third man has a more <unk> storyline <unk> is a bit disjointed in this respect perhaps this is <unk> it is presented as a <unk> and making it too coherent would spoil the effect br br this movie is <unk> <unk> in more than one sense one never sees the sun shine grim but intriguing and frightening
    negative
Aligning the Data with Padding
------------------------------

In text data, every sentence has a different length. To simplify the neural-network computation that follows, a common approach is to bring all samples in the dataset to the same length: longer samples are truncated, and shorter samples are filled with the special ``<pad>`` token. The code below applies this processing to the dataset.
.. code::

    def create_padded_dataset(reader):
        padded_sents = []
        labels = []
        for batch_id, data in enumerate(reader):
            sent, label = data
            # truncate to seq_len, then pad the remainder with the <pad> id
            padded_sent = sent[:seq_len] + [pad_id] * (seq_len - len(sent))
            padded_sents.append(padded_sent)
            labels.append(label)
        return np.array(padded_sents), np.expand_dims(np.array(labels), axis=1)

    train_sents, train_labels = create_padded_dataset(train_reader())
    test_sents, test_labels = create_padded_dataset(test_reader())

    print(train_sents.shape)
    print(train_labels.shape)
    print(test_sents.shape)
    print(test_labels.shape)

    for sent in train_sents[:3]:
        print(ids_to_str(sent))
.. parsed-literal::

    (25000, 200)
    (25000, 1)
    (25000, 200)
    (25000, 1)
    <unk> has much in common with the third man another <unk> film set among the <unk> of <unk> europe like <unk> there is much inventive camera work there is an innocent american who gets emotionally involved with a woman he doesnt really understand and whose <unk> is all the more striking in contrast with the <unk> br but id have to say that the third man has a more <unk> storyline <unk> is a bit disjointed in this respect perhaps this is <unk> it is presented as a <unk> and making it too coherent would spoil the effect br br this movie is <unk> <unk> in more than one sense one never sees the sun shine grim but intriguing and frightening <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
    <unk> is the most original movie ive seen in years if you like unique thrillers that are influenced by film noir then this is just the right cure for all of those hollywood summer <unk> <unk> the theaters these days von <unk> <unk> like breaking the waves have gotten more <unk> but this is really his best work it is <unk> without being distracting and offers the perfect combination of suspense and dark humor its too bad he decided <unk> cameras were the wave of the future its hard to say who talked him away from the style he <unk> here but its everyones loss that he went into his heavily <unk> <unk> direction instead <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
    <unk> von <unk> is never <unk> in trying out new techniques some of them are very original while others are best <unk> br he depicts <unk> germany as a <unk> train journey with so many cities lying in ruins <unk> <unk> a young american of german descent feels <unk> to help in their <unk> it is not a simple task as he quickly finds outbr br his uncle finds him a job as a night <unk> on the <unk> <unk> line his job is to <unk> to the needs of the passengers when the shoes are <unk> a <unk> mark is made on the <unk> a terrible argument <unk> when a passengers shoes are not <unk> despite the fact they have been <unk> there are many <unk> to the german <unk> of <unk> to such stupid <unk> br the <unk> journey is like an <unk> <unk> mans <unk> through life with all its <unk> and <unk> in one sequence <unk> <unk> through the back <unk> to discover them filled with <unk> bodies appearing to have just escaped from <unk> these images horrible as they are are <unk> as in a dream each with its own terrible impact yet <unk> br
Building the Network
--------------------

In this example we use a BOW network that ignores word order: after looking up the embedding of each word, we simply average the embeddings to obtain a sentence representation, and then apply a ``Linear`` transformation. To prevent overfitting, we also use ``Dropout``.
.. code::

    class MyNet(paddle.nn.Layer):
        def __init__(self):
            super(MyNet, self).__init__()
            self.emb = paddle.nn.Embedding(vocab_size, emb_size)
            self.fc = paddle.nn.Linear(in_features=emb_size, out_features=2)
            self.dropout = paddle.nn.Dropout(0.5)

        def forward(self, x):
            x = self.emb(x)
            # average the word embeddings over the sequence dimension (bag of words)
            x = paddle.reduce_mean(x, dim=1)
            x = self.dropout(x)
            x = self.fc(x)
            return x
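
As a quick check of the shapes (a minimal sketch reusing the variables defined above, not part of the original tutorial): a batch of token ids of shape ``(batch_size, seq_len)`` becomes ``(batch_size, seq_len, emb_size)`` after the embedding lookup, ``(batch_size, emb_size)`` after averaging over the sequence dimension, and ``(batch_size, 2)`` after the linear layer.

.. code::

    # trace a small batch through the BOW network and confirm the output shape
    sample = paddle.to_tensor(train_sents[:4])  # (4, 200) token ids
    net = MyNet()
    print(net(sample).shape)  # expected: [4, 2]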
Training the Model
------------------

.. code::

    def train(model):
        model.train()

        opt = paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters())

        for epoch in range(epoch_num):
            # shuffle data
            perm = np.random.permutation(len(train_sents))
            train_sents_shuffled = train_sents[perm]
            train_labels_shuffled = train_labels[perm]

            for batch_id in range(len(train_sents_shuffled) // batch_size):
                x_data = train_sents_shuffled[(batch_id * batch_size):((batch_id+1)*batch_size)]
                y_data = train_labels_shuffled[(batch_id * batch_size):((batch_id+1)*batch_size)]

                sent = paddle.to_tensor(x_data)
                label = paddle.to_tensor(y_data)

                logits = model(sent)
                loss = paddle.nn.functional.softmax_with_cross_entropy(logits, label)

                avg_loss = paddle.mean(loss)
                if batch_id % 500 == 0:
                    print("epoch: {}, batch_id: {}, loss is: {}".format(epoch, batch_id, avg_loss.numpy()))
                avg_loss.backward()
                opt.minimize(avg_loss)
                model.clear_gradients()

            # evaluate model after one epoch
            model.eval()
            accuracies = []
            losses = []
            for batch_id in range(len(test_sents) // batch_size):
                x_data = test_sents[(batch_id * batch_size):((batch_id+1)*batch_size)]
                y_data = test_labels[(batch_id * batch_size):((batch_id+1)*batch_size)]

                sent = paddle.to_tensor(x_data)
                label = paddle.to_tensor(y_data)

                logits = model(sent)
                loss = paddle.nn.functional.softmax_with_cross_entropy(logits, label)
                acc = paddle.metric.accuracy(logits, label)

                accuracies.append(acc.numpy())
                losses.append(loss.numpy())

            avg_acc, avg_loss = np.mean(accuracies), np.mean(losses)
            print("[validation] accuracy/loss: {}/{}".format(avg_acc, avg_loss))

            model.train()

    model = MyNet()
    train(model)
.. parsed-literal::

    epoch: 0, batch_id: 0, loss is: [0.6926701]
    epoch: 0, batch_id: 500, loss is: [0.41248566]
    [validation] accuracy/loss: 0.8505121469497681/0.3615057170391083
    epoch: 1, batch_id: 0, loss is: [0.29521096]
    epoch: 1, batch_id: 500, loss is: [0.2916747]
    [validation] accuracy/loss: 0.86475670337677/0.3259459137916565
The End
--------

As you can see, two epochs of training on this dataset reach an accuracy of roughly 86%. You can also tune the network architecture and the hyperparameters to get better results.
###########################
Natural Language Processing
###########################

Here PaddlePaddle provides a set of NLP tutorials for you to learn from:

- `N-Gram <./n_gram_model/n_gram_model.html>`_ : implement an N-Gram model with Paddle.
- `Text Classification <./imdb_bow_classification/imdb_bow_classification.html>`_ : classify text on the IMDB dataset with Paddle.
- `Text Translation <./seq2seq_with_attention/seq2seq_with_attention.html>`_ : implement text translation with Paddle.

.. toctree::
    :hidden:
    :titlesonly:

    n_gram_model/n_gram_model.rst
    imdb_bow_classification/imdb_bow_classification.rst
    seq2seq_with_attention/seq2seq_with_attention.rst
Training Word Embeddings with an N-Gram Model on the Works of Shakespeare
==========================================================================

N-gram is a concept from computational linguistics and probability theory: a sequence of N items from a given piece of text.
An N-gram with N=1 is called a unigram, with N=2 a bigram, with N=3 a trigram, and so on. In practice, bigrams and trigrams are the most commonly used.
This example implements a trigram model on the works of Shakespeare.
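
For instance, the worked snippet below (plain Python, with a made-up sample sentence) enumerates the bigrams and trigrams of a short text:

.. code:: ipython3

    # illustrative only: enumerate the N-grams of a tiny sample sentence
    words = "to be or not to be".split()
    bigrams = [tuple(words[i:i + 2]) for i in range(len(words) - 1)]
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    print(bigrams)   # [('to', 'be'), ('be', 'or'), ('or', 'not'), ...]
    print(trigrams)  # [('to', 'be', 'or'), ('be', 'or', 'not'), ...]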
Environment
-----------

This tutorial was written against paddle-develop; if your environment is a different version, please install paddle-develop first.

.. code:: ipython3

    import paddle
    paddle.__version__

.. parsed-literal::

    '0.0.0'
Dataset and Parameters
----------------------

The training corpus is the complete works of Shakespeare (`download <https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt>`__); just save it in txt format.
context_size is set to 2, which makes this a trigram model; embedding_dim is set to 256.
.. code:: ipython3

    !wget https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt

.. parsed-literal::

    --2020-09-09 14:58:26--  https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
    Resolving ocw.mit.edu (ocw.mit.edu)... 151.101.110.133
    Connecting to ocw.mit.edu (ocw.mit.edu)|151.101.110.133|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 5458199 (5.2M) [text/plain]
    Saving to: 't8.shakespeare.txt'

    t8.shakespeare.txt  100%[===================>]   5.21M  94.1KB/s    in 70s

    2020-09-09 14:59:38 (75.7 KB/s) - 't8.shakespeare.txt' saved [5458199/5458199]
.. code:: ipython3

    embedding_dim = 256
    context_size = 2

.. code:: ipython3

    # path to the corpus file
    path_to_file = './t8.shakespeare.txt'
    test_sentence = open(path_to_file, 'rb').read().decode(encoding='utf-8')

    # the length of the text is the number of characters in it
    print('Length of text: {} characters'.format(len(test_sentence)))

.. parsed-literal::

    Length of text: 5458199 characters
Removing Punctuation
--------------------

Since punctuation carries no real meaning here, we use ``punctuation`` from the ``string`` library to replace all English punctuation marks.

.. code:: ipython3

    from string import punctuation

    process_dicts = {i: '' for i in punctuation}
    print(process_dicts)

.. parsed-literal::

    {'!': '', '"': '', '#': '', '$': '', '%': '', '&': '', "'": '', '(': '', ')': '', '*': '', '+': '', ',': '', '-': '', '.': '', '/': '', ':': '', ';': '', '<': '', '=': '', '>': '', '?': '', '@': '', '[': '', '\\': '', ']': '', '^': '', '_': '', '`': '', '{': '', '|': '', '}': '', '~': ''}
.. code:: ipython3

    punc_table = str.maketrans(process_dicts)
    test_sentence = test_sentence.translate(punc_table)
    test_sentence = test_sentence.lower().split()

    vocab = set(test_sentence)
    print(len(vocab))

.. parsed-literal::

    28343
Data Preprocessing
------------------

The text is split into tuples of the form (('first word', 'second word'), 'third word'), where the third word is the prediction target.
.. code:: ipython3

    trigram = [[[test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2]]
               for i in range(len(test_sentence) - 2)]

    word_to_idx = {word: i for i, word in enumerate(vocab)}
    idx_to_word = {word_to_idx[word]: word for word in word_to_idx}

    # take a look at the dataset
    print(trigram[:3])

.. parsed-literal::

    [[['this', 'is'], 'the'], [['is', 'the'], '100th'], [['the', '100th'], 'etext']]
Building a ``Dataset`` to Load the Data
---------------------------------------

We build the dataset with ``paddle.io.Dataset`` and then pass it to ``paddle.io.DataLoader`` to complete the data loading.
.. code:: ipython3

    import paddle
    import numpy as np

    batch_size = 256
    paddle.disable_static()

    class TrainDataset(paddle.io.Dataset):
        def __init__(self, tuple_data):
            self.tuple_data = tuple_data

        def __getitem__(self, idx):
            data = self.tuple_data[idx][0]
            label = self.tuple_data[idx][1]
            # map the words to their vocabulary indices
            data = np.array(list(map(lambda w: word_to_idx[w], data)))
            label = np.array(word_to_idx[label])
            return data, label

        def __len__(self):
            return len(self.tuple_data)

    train_dataset = TrainDataset(trigram)

    train_loader = paddle.io.DataLoader(train_dataset, places=paddle.CPUPlace(), return_list=True,
                                        shuffle=True, batch_size=batch_size, drop_last=True)
Building and Training the Network
---------------------------------

Here we build the network with paddle's dynamic graph. The trigram model consists of one ``Embedding`` layer and two ``Linear`` layers: the ``Embedding`` layer embeds the two input words, which are then fed into the two ``Linear`` layers to extract features.
.. code:: ipython3

    import paddle
    import numpy as np

    hidden_size = 1024

    class NGramModel(paddle.nn.Layer):
        def __init__(self, vocab_size, embedding_dim, context_size):
            super(NGramModel, self).__init__()
            self.embedding = paddle.nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
            self.linear1 = paddle.nn.Linear(context_size * embedding_dim, hidden_size)
            self.linear2 = paddle.nn.Linear(hidden_size, len(vocab))

        def forward(self, x):
            x = self.embedding(x)
            # concatenate the two context-word embeddings into a single vector
            x = paddle.reshape(x, [-1, context_size * embedding_dim])
            x = self.linear1(x)
            x = paddle.nn.functional.relu(x)
            x = self.linear2(x)
            return x
Defining ``train()`` to Train the Model
---------------------------------------
.. code:: ipython3

    vocab_size = len(vocab)
    epochs = 2
    losses = []

    def train(model):
        model.train()
        optim = paddle.optimizer.Adam(learning_rate=0.01, parameters=model.parameters())
        for epoch in range(epochs):
            for batch_id, data in enumerate(train_loader()):
                x_data = data[0]
                y_data = data[1]
                predicts = model(x_data)
                y_data = paddle.reshape(y_data, ([-1, 1]))
                loss = paddle.nn.functional.softmax_with_cross_entropy(predicts, y_data)
                avg_loss = paddle.mean(loss)
                avg_loss.backward()
                if batch_id % 500 == 0:
                    losses.append(avg_loss.numpy())
                    print("epoch: {}, batch_id: {}, loss is: {}".format(epoch, batch_id, avg_loss.numpy()))
                optim.minimize(avg_loss)
                model.clear_gradients()

    model = NGramModel(vocab_size, embedding_dim, context_size)
    train(model)
.. parsed-literal::

    epoch: 0, batch_id: 0, loss is: [10.252193]
    epoch: 0, batch_id: 500, loss is: [6.894636]
    epoch: 0, batch_id: 1000, loss is: [6.849346]
    epoch: 0, batch_id: 1500, loss is: [6.931605]
    epoch: 0, batch_id: 2000, loss is: [6.6860313]
    epoch: 0, batch_id: 2500, loss is: [6.2472367]
    epoch: 0, batch_id: 3000, loss is: [6.8818874]
    epoch: 0, batch_id: 3500, loss is: [6.941615]
    epoch: 1, batch_id: 0, loss is: [6.3628616]
    epoch: 1, batch_id: 500, loss is: [6.2065206]
    epoch: 1, batch_id: 1000, loss is: [6.5334334]
    epoch: 1, batch_id: 1500, loss is: [6.5788]
    epoch: 1, batch_id: 2000, loss is: [6.352103]
    epoch: 1, batch_id: 2500, loss is: [6.6272373]
    epoch: 1, batch_id: 3000, loss is: [6.801074]
    epoch: 1, batch_id: 3500, loss is: [6.2274427]
Plotting the Loss Curve
-----------------------

Visualizing the loss curve shows how well the model trains.

.. code:: ipython3

    import matplotlib.pyplot as plt
    import matplotlib.ticker as ticker
    %matplotlib inline

    plt.figure()
    plt.plot(losses)

.. parsed-literal::

    [<matplotlib.lines.Line2D at 0x14e27b3c8>]
.. image:: n_gram_model_files/n_gram_model_19_1.png
Prediction
----------

We use the trained model to make a prediction.

.. code:: ipython3

    import random

    def test(model):
        model.eval()

        # randomly pick one of the last 10 samples
        idx = random.randint(len(trigram) - 10, len(trigram) - 1)
        print('the input words is: ' + trigram[idx][0][0] + ', ' + trigram[idx][0][1])

        x_data = list(map(lambda w: word_to_idx[w], trigram[idx][0]))
        x_data = paddle.to_tensor(np.array(x_data))

        predicts = model(x_data)
        predicts = predicts.numpy().tolist()[0]
        # take the index of the highest-scoring word as the prediction
        predicts = predicts.index(max(predicts))
        print('the predict words is: ' + idx_to_word[predicts])

        y_data = trigram[idx][1]
        print('the true words is: ' + y_data)

    test(model)

.. parsed-literal::

    the input words is: of, william
    the predict words is: shakespeare
    the true words is: shakespeare
Dynamic Graph
=============

Starting from the 2.0-beta release, PaddlePaddle enables dynamic-graph mode by default. In this mode, every operation executes immediately and returns its result, instead of requiring you to define the network structure up front and execute it afterwards.

In dynamic-graph mode you can organize your code more conveniently and debug programs more easily. This tutorial introduces the use of PaddlePaddle's dynamic graph.

Environment Setup
-----------------

We will use PaddlePaddle 2.0-beta and confirm that dynamic-graph mode is enabled.
.. code:: ipython3

    import paddle
    import paddle.nn.functional as F
    import numpy as np

    paddle.disable_static()
    print(paddle.__version__)
    print(paddle.__git_commit__)

.. parsed-literal::

    0.0.0
    89af2088b6e74bdfeef2d4d78e08461ed2aafee5
Basic Usage
-----------

In dynamic-graph mode you can run any PaddlePaddle API directly and it immediately returns the result to Python. There is no need to first build a computation graph and then feed it data.
.. code:: ipython3

    a = paddle.randn([4, 2])
    b = paddle.arange(1, 3, dtype='float32')

    print(a.numpy())
    print(b.numpy())

    c = a + b
    print(c.numpy())

    d = paddle.matmul(a, b)
    print(d.numpy())

.. parsed-literal::

    [[-0.49341336 -0.8112665 ]
     [ 0.8929015   0.24661176]
     [-0.64440054 -0.7945008 ]
     [-0.07345356  1.3641853 ]]
    [1. 2.]
    [[0.5065867  1.1887336 ]
     [1.8929014  2.2466118 ]
     [0.35559946 1.2054992 ]
     [0.92654645 3.3641853 ]]
    [-2.1159463  1.386125  -2.2334023  2.654917 ]
Using Python Control Flow
-------------------------

In dynamic-graph mode you can use Python conditionals and loops to drive the computation of the neural network (no more need for OPs like ``cond`` or ``loop``).
.. code:: ipython3

    a = paddle.to_tensor(np.array([1, 2, 3]))
    b = paddle.to_tensor(np.array([4, 5, 6]))

    for i in range(10):
        r = paddle.rand([1,])
        if r > 0.5:
            c = paddle.pow(a, i) + b
            print("{} +> {}".format(i, c.numpy()))
        else:
            c = paddle.pow(a, i) - b
            print("{} -> {}".format(i, c.numpy()))

.. parsed-literal::

    0 +> [5 6 7]
    1 +> [5 7 9]
    2 +> [ 5  9 15]
    3 -> [-3  3 21]
    4 -> [-3 11 75]
    5 +> [  5  37 249]
    6 +> [  5  69 735]
    7 -> [  -3  123 2181]
    8 +> [   5  261 6567]
    9 +> [    5   517 19689]
Building More Flexible Networks: Control Flow
---------------------------------------------

- Dynamic graphs can be used to build more flexible networks, for example choosing different branch networks via control flow, or conveniently building weight-sharing networks. In the concrete example below, the second linear transformation runs with a probability of only 0.5.
- The sequence-to-sequence-with-attention machine-translation example shows, in a more realistic setting, the flexibility that dynamic graphs bring to building RNN-style networks.
.. code:: ipython3

    class MyModel(paddle.nn.Layer):
        def __init__(self, input_size, hidden_size):
            super(MyModel, self).__init__()
            self.linear1 = paddle.nn.Linear(input_size, hidden_size)
            self.linear2 = paddle.nn.Linear(hidden_size, hidden_size)
            self.linear3 = paddle.nn.Linear(hidden_size, 1)

        def forward(self, inputs):
            x = self.linear1(inputs)
            x = F.relu(x)

            # the second linear layer runs with probability 0.5
            if paddle.rand([1,]) > 0.5:
                x = self.linear2(x)
                x = F.relu(x)

            x = self.linear3(x)
            return x
.. code:: ipython3

    total_data, batch_size, input_size, hidden_size = 1000, 64, 128, 256

    x_data = np.random.randn(total_data, input_size).astype(np.float32)
    y_data = np.random.randn(total_data, 1).astype(np.float32)

    model = MyModel(input_size, hidden_size)

    loss_fn = paddle.nn.MSELoss(reduction='mean')
    optimizer = paddle.optimizer.SGD(learning_rate=0.01,
                                     parameters=model.parameters())

    for t in range(200 * (total_data // batch_size)):
        idx = np.random.choice(total_data, batch_size, replace=False)
        x = paddle.to_tensor(x_data[idx, :])
        y = paddle.to_tensor(y_data[idx, :])
        y_pred = model(x)

        loss = loss_fn(y_pred, y)
        if t % 200 == 0:
            print(t, loss.numpy())

        loss.backward()
        optimizer.minimize(loss)
        model.clear_gradients()

.. parsed-literal::

    0 [2.0915627]
    200 [0.67530334]
    400 [0.52042854]
    600 [0.28010666]
    800 [0.09739777]
    1000 [0.09307177]
    1200 [0.04252927]
    1400 [0.03095707]
    1600 [0.03022156]
    1800 [0.01616007]
    2000 [0.01069116]
    2200 [0.0055158]
    2400 [0.00195092]
    2600 [0.00101116]
    2800 [0.00192219]
Building More Flexible Networks: Weight Sharing
-----------------------------------------------

- Dynamic graphs also make it easier to build networks with shared weights; the example below shows a simple AutoEncoder whose weights are shared.
- You can also look at the image-search example for a more realistic use of shared parameter weights.
.. code:: ipython3

    inputs = paddle.rand((256, 64))

    linear = paddle.nn.Linear(64, 8, bias_attr=False)
    loss_fn = paddle.nn.MSELoss()
    optimizer = paddle.optimizer.Adam(0.01, parameters=linear.parameters())

    for i in range(10):
        hidden = linear(inputs)
        # weight from input to hidden is shared with the linear mapping from hidden to output
        outputs = paddle.matmul(hidden, linear.weight, transpose_y=True)
        loss = loss_fn(outputs, inputs)
        loss.backward()
        print("step: {}, loss: {}".format(i, loss.numpy()))
        optimizer.minimize(loss)
        linear.clear_gradients()

.. parsed-literal::

    step: 0, loss: [0.37666085]
    step: 1, loss: [0.3063845]
    step: 2, loss: [0.2647248]
    step: 3, loss: [0.23831272]
    step: 4, loss: [0.21714918]
    step: 5, loss: [0.1955545]
    step: 6, loss: [0.17261818]
    step: 7, loss: [0.15009595]
    step: 8, loss: [0.13051331]
    step: 9, loss: [0.11537809]
The end
--------

As you can see, the dynamic graph provides a more flexible and easier-to-use way to build and train networks.
###########
Quick Start
###########

Here PaddlePaddle provides some simple examples to get you started with paddle 2.0 quickly:

- `hello paddle <./hello_paddle/hello_paddle.html>`_ : a brief introduction to Paddle; complete your first Paddle project.
- `Paddle dynamic graph <./dynamic_graph/dynamic_graph.html>`_ : an introduction to using Paddle's dynamic graph.
- `Getting started with the high-level API <./getting_started/getting_started.html>`_ : an introduction to Paddle's high-level API; build a model quickly.
- `The high-level API in detail <./high_level_api/high_level_api.html>`_ : a detailed introduction to Paddle's high-level API.
- `Saving and loading models <./save_model/save_model.html>`_ : how to save and load Paddle models.
- `Linear regression <./linear_regression/linear_regression.html>`_ : implement a linear-regression task with Paddle.

.. toctree::
    :hidden:
    :titlesonly:

    hello_paddle/hello_paddle.rst
    dynamic_graph/dynamic_graph.rst
    getting_started/getting_started.rst
    high_level_api/high_level_api.rst
    save_model/save_model.rst
    linear_regression/linear_regression.rst