提交 e15c71cd 编写于 作者: L liuyang11

Merge branch 'develop' of https://github.com/PaddlePaddle/models into caffe2fluid

support lenet and resnet conertion
......@@ -24,11 +24,22 @@ unittest(){
trap 'abort' 0
set -e
for proj in */ ; do
for proj in * ; do
if [ -d $proj ]; then
unittest $proj
if [ $? != 0 ]; then
exit 1
if [ "$proj" = "fluid" ]; then
for proj in fluid/* ; do
if [ -d $proj ]; then
unittest $proj
if [ $? != 0 ]; then
exit 1
fi
fi
done
else
unittest $proj
if [ $? != 0 ]; then
exit 1
fi
fi
fi
done
......
The minimum PaddlePaddle version needed for the code sample in this directory is v0.11.0. If you are on a version of PaddlePaddle earlier than v0.11.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
---
# Convolutional Sequence to Sequence Learning
This model implements the work in the following paper:
......
运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
---
# 点击率预估
以下是本例目录包含的文件以及对应说明:
......
The minimum PaddlePaddle version needed for the code sample in this directory is v0.10.0. If you are on a version of PaddlePaddle earlier than v0.10.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
---
# Click-Through Rate Prediction
## Introduction
......
The minimum PaddlePaddle version needed for the code sample in this directory is v0.11.0. If you are on a version of PaddlePaddle earlier than v0.11.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
---
# Deep Factorization Machine for Click-Through Rate prediction
## Introduction
......
运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此版本要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
---
# 深度结构化语义模型 (Deep Structured Semantic Models, DSSM)
DSSM使用DNN模型在一个连续的语义空间中学习文本低纬的表示向量,并且建模两个句子间的语义相似度。本例演示如何使用PaddlePaddle实现一个通用的DSSM 模型,用于建模两个字符串间的语义相似度,模型实现支持通用的数据格式,用户替换数据便可以在真实场景中使用该模型。
......
The minimum PaddlePaddle version needed for the code sample in this directory is v0.10.0. If you are on a version of PaddlePaddle earlier than v0.10.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
---
# Deep Structured Semantic Models (DSSM)
Deep Structured Semantic Models (DSSM) is simple but powerful DNN based model for matching web search queries and the URL based documents. This example demonstrates how to use PaddlePaddle to implement a generic DSSM model for modeling the semantic similarity between two strings.
......
Deep ASR Kickoff
The minimum PaddlePaddle version needed for the code sample in this directory is the lastest develop branch. If you are on a version of PaddlePaddle earlier than this, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
---
### TODO
This project is still under active development.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import data_utils.augmentor.trans_mean_variance_norm as trans_mean_variance_norm
import data_utils.augmentor.trans_add_delta as trans_add_delta
import data_utils.augmentor.trans_splice as trans_splice
16.2845556399 11.6891798673
17.21509949 12.3788567902
18.1143704548 14.9912618017
19.2335963752 18.5419556172
19.9266772451 21.2768220522
19.8245737202 21.2347210705
19.5432940972 20.2784036567
19.4631271754 20.2934452329
19.3929919324 20.457971868
19.2924788362 20.3626439234
18.9207244502 19.9196569759
18.7202605641 19.5920276899
18.4844279398 19.2068349019
18.2670948624 18.8716893824
18.0929628855 18.5439666541
17.8428896026 18.0255891747
17.6646850635 17.473764296
17.4955705896 16.8966859471
17.3706720293 16.4294027467
17.2530867792 16.0514717623
17.1304341172 15.7234699057
17.0038353287 15.4344471514
16.902550309 15.1603287337
16.8375590047 14.9304337826
16.816287853 14.9119310513
16.828838265 15.0930023024
16.8602209498 15.3771992423
16.9101763812 15.6897991789
16.9466065143 15.9364556489
16.9486061956 16.0699417826
16.9041374104 16.0796970272
16.8410093699 16.0111444599
16.7045718836 15.7991985601
16.51128489 15.5208920129
16.3253910608 15.2603181921
16.1297317333 14.9499965958
15.903428372 14.5958280409
15.6131718105 14.2709618
15.1395035533 13.9993939893
14.4298229999 13.3841189151
0.0034970565424 0.246184766149
0.00501284154705 0.238484972472
0.00605942680019 0.269064381708
0.00687266156243 0.319479238011
0.00734065019253 0.371947383205
0.00718807218417 0.384426479694
0.00652195540212 0.384676838281
0.00660416525951 0.395543910317
0.00680202057642 0.400803979681
0.00659144183007 0.393228973031
0.00605294530423 0.385021118038
0.00590452969394 0.361763039625
0.00612315374687 0.346777773373
0.00582354093973 0.335802403976
0.00574556002554 0.320733728218
0.00612254485891 0.310153103033
0.00626733043219 0.299854747445
0.00567398408041 0.293353685493
0.00519236700706 0.287668810947
0.00529581474367 0.281479660772
0.00479019484082 0.27451415777
0.00486381039428 0.266294391154
0.00491126372868 0.258105116126
0.00452105305011 0.252926328298
0.00531483334271 0.250910887373
0.00546572110469 0.253302256977
0.00479544857908 0.258484183394
0.00422106426297 0.264582900173
0.00401824135188 0.268467945623
0.0041705465252 0.269699480291
0.00405239564143 0.270406162975
0.0040059737566 0.270407601782
0.00406426729317 0.267951582656
0.00416613791013 0.264543833042
0.00427847607653 0.26247798891
0.00428050903034 0.259635263243
0.00454842971786 0.255829377617
0.00393747552387 0.253802307025
0.00374143688909 0.251011478787
0.00335475310258 0.236543650856
0.000373194755312 0.0419494800709
0.000230909648678 0.0394102370205
0.000150840015851 0.0414956922398
8.44401840771e-05 0.0460502231327
-6.24759314572e-06 0.0528049937739
-8.82957758148e-05 0.055711244886
1.16795791952e-05 0.0563188428833
-1.68716267856e-05 0.0575232763711
-0.000112625308645 0.057979929947
-0.000122619090002 0.0564126233493
1.73569637319e-05 0.05522573909
6.49872782342e-05 0.0507353361334
4.17746389178e-05 0.0479568131253
5.13884475653e-05 0.0461253238047
1.8860115143e-05 0.0436860476919
-5.64317701105e-05 0.042516381059
-0.000136859948115 0.0413574820205
-7.00847019726e-05 0.0409516370727
-5.39392223336e-05 0.040441504085
-9.24897162815e-05 0.0397800398173
4.7104970622e-05 0.039046286243
6.24805896165e-06 0.0380185986602
-2.35272813418e-05 0.036851063786
5.88344154127e-05 0.0361640489242
-8.39162076993e-05 0.0357639427311
-0.000108702805776 0.0358774639538
3.22013961834e-06 0.0363644530435
9.43501518394e-05 0.0370309934774
0.000134406229423 0.0374972993343
3.84007008533e-05 0.037676222515
3.05989328157e-05 0.0379111939182
9.52201629091e-05 0.0380927209106
0.000102126083729 0.0379925358499
6.98628072264e-05 0.0377276252241
4.55782256339e-05 0.0375165468654
4.76370987786e-05 0.0371482526345
-2.24128832709e-05 0.0366810742947
0.000125621306953 0.036628355271
0.000134568666093 0.0364860461759
0.000159858844464 0.0345583593149
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import unittest
import numpy as np
import data_utils.augmentor.trans_mean_variance_norm as trans_mean_variance_norm
import data_utils.augmentor.trans_add_delta as trans_add_delta
import data_utils.augmentor.trans_splice as trans_splice
class TestTransMeanVarianceNorm(unittest.TestCase):
"""unit test for TransMeanVarianceNorm
"""
def setUp(self):
self._file_path = "./data_utils/augmentor/tests/data/" \
"global_mean_var_search26kHr"
def test(self):
feature = np.zeros((2, 120), dtype="float32")
feature.fill(1)
trans = trans_mean_variance_norm.TransMeanVarianceNorm(self._file_path)
(feature1, label1) = trans.perform_trans((feature, None))
(mean, var) = trans.get_mean_var()
feature_flat1 = feature1.flatten()
feature_flat = feature.flatten()
one = np.ones((1), dtype="float32")
for idx, val in enumerate(feature_flat1):
cur_idx = idx % 120
self.assertAlmostEqual(val, (one[0] - mean[cur_idx]) * var[cur_idx])
class TestTransAddDelta(unittest.TestCase):
"""unit test TestTransAddDelta
"""
def test_regress(self):
"""test regress
"""
feature = np.zeros((14, 120), dtype="float32")
feature[0:5, 0:40].fill(1)
feature[0 + 5, 0:40].fill(1)
feature[1 + 5, 0:40].fill(2)
feature[2 + 5, 0:40].fill(3)
feature[3 + 5, 0:40].fill(4)
feature[8:14, 0:40].fill(4)
trans = trans_add_delta.TransAddDelta()
feature = feature.reshape((14 * 120))
trans._regress(feature, 5 * 120, feature, 5 * 120 + 40, 40, 4, 120)
trans._regress(feature, 5 * 120 + 40, feature, 5 * 120 + 80, 40, 4, 120)
feature = feature.reshape((14, 120))
tmp_feature = feature[5:5 + 4, :]
self.assertAlmostEqual(1.0, tmp_feature[0][0])
self.assertAlmostEqual(0.24, tmp_feature[0][119])
self.assertAlmostEqual(2.0, tmp_feature[1][0])
self.assertAlmostEqual(0.13, tmp_feature[1][119])
self.assertAlmostEqual(3.0, tmp_feature[2][0])
self.assertAlmostEqual(-0.13, tmp_feature[2][119])
self.assertAlmostEqual(4.0, tmp_feature[3][0])
self.assertAlmostEqual(-0.24, tmp_feature[3][119])
def test_perform(self):
"""test perform
"""
feature = np.zeros((4, 40), dtype="float32")
feature[0, 0:40].fill(1)
feature[1, 0:40].fill(2)
feature[2, 0:40].fill(3)
feature[3, 0:40].fill(4)
trans = trans_add_delta.TransAddDelta()
(feature, label) = trans.perform_trans((feature, None))
self.assertAlmostEqual(feature.shape[0], 4)
self.assertAlmostEqual(feature.shape[1], 120)
self.assertAlmostEqual(1.0, feature[0][0])
self.assertAlmostEqual(0.24, feature[0][119])
self.assertAlmostEqual(2.0, feature[1][0])
self.assertAlmostEqual(0.13, feature[1][119])
self.assertAlmostEqual(3.0, feature[2][0])
self.assertAlmostEqual(-0.13, feature[2][119])
self.assertAlmostEqual(4.0, feature[3][0])
self.assertAlmostEqual(-0.24, feature[3][119])
class TestTransSplict(unittest.TestCase):
"""unit test Test TransSplict
"""
def test_perfrom(self):
feature = np.zeros((8, 10), dtype="float32")
for i in xrange(feature.shape[0]):
feature[i, :].fill(i)
trans = trans_splice.TransSplice()
(feature, label) = trans.perform_trans((feature, None))
self.assertEqual(feature.shape[1], 110)
for i in xrange(8):
nzero_num = 5 - i
cur_val = 0.0
if nzero_num < 0:
cur_val = i - 5 - 1
for j in xrange(11):
if j <= nzero_num:
for k in xrange(10):
self.assertAlmostEqual(feature[i][j * 10 + k], cur_val)
else:
if cur_val < 7:
cur_val += 1.0
for k in xrange(10):
self.assertAlmostEqual(feature[i][j * 10 + k], cur_val)
if __name__ == '__main__':
unittest.main()
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import math
import copy
class TransAddDelta(object):
""" add delta of feature data
trans feature for shape(a, b) to shape(a, b * 3)
Attributes:
_norder(int):
_window(int):
"""
def __init__(self, norder=2, nwindow=2):
""" init construction
Args:
norder: default 2
nwindow: default 2
"""
self._norder = norder
self._nwindow = nwindow
def perform_trans(self, sample):
""" add delta for feature
trans feature shape from (a,b) to (a, b * 3)
Args:
sample(object,tuple): contain feature numpy and label numpy
Returns:
(feature, label)
"""
(feature, label) = sample
frame_dim = feature.shape[1]
d_frame_dim = frame_dim * 3
head_filled = 5
tail_filled = 5
mat = np.zeros(
(feature.shape[0] + head_filled + tail_filled, d_frame_dim),
dtype="float32")
#copy first frame
for i in xrange(head_filled):
np.copyto(mat[i, 0:frame_dim], feature[0, :])
np.copyto(mat[head_filled:head_filled + feature.shape[0], 0:frame_dim],
feature[:, :])
# copy last frame
for i in xrange(head_filled + feature.shape[0], mat.shape[0], 1):
np.copyto(mat[i, 0:frame_dim], feature[feature.shape[0] - 1, :])
nframe = feature.shape[0]
start = head_filled
tmp_shape = mat.shape
mat = mat.reshape((tmp_shape[0] * tmp_shape[1]))
self._regress(mat, start * d_frame_dim, mat,
start * d_frame_dim + frame_dim, frame_dim, nframe,
d_frame_dim)
self._regress(mat, start * d_frame_dim + frame_dim, mat,
start * d_frame_dim + 2 * frame_dim, frame_dim, nframe,
d_frame_dim)
mat.shape = tmp_shape
return (mat[head_filled:mat.shape[0] - tail_filled, :], label)
def _regress(self, data_in, start_in, data_out, start_out, size, n, step):
""" regress
Args:
data_in: in data
start_in: start index of data_in
data_out: out data
start_out: start index of data_out
size: frame dimentional
n: frame num
step: 3 * (frame num)
Returns:
None
"""
sigma_t2 = 0.0
delta_window = self._nwindow
for t in xrange(1, delta_window + 1):
sigma_t2 += t * t
sigma_t2 *= 2.0
for i in xrange(n):
fp1 = start_in
fp2 = start_out
for j in xrange(size):
back = fp1
forw = fp1
sum = 0.0
for t in xrange(1, delta_window + 1):
back -= step
forw += step
sum += t * (data_in[forw] - data_in[back])
data_out[fp2] = sum / sigma_t2
fp1 += 1
fp2 += 1
start_in += step
start_out += step
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import math
class TransMeanVarianceNorm(object):
""" normalization of mean variance for feature data
Attributes:
_mean(numpy.array): the feature mean vector
_var(numpy.array): the feature variance
"""
def __init__(self, snorm_path):
"""init construction
Args:
snorm_path: the path of mean and variance
"""
self._mean = None
self._var = None
self._load_norm(snorm_path)
def _load_norm(self, snorm_path):
""" load mean var file
Args:
snorm_path(str):the file path
"""
lLines = open(snorm_path).readlines()
nLen = len(lLines)
self._mean = np.zeros((nLen), dtype="float32")
self._var = np.zeros((nLen), dtype="float32")
self._nLen = nLen
for nidx, l in enumerate(lLines):
s = l.split()
assert len(s) == 2
self._mean[nidx] = float(s[0])
self._var[nidx] = 1.0 / math.sqrt(float(s[1]))
if self._var[nidx] > 100000.0:
self._var[nidx] = 100000.0
def get_mean_var(self):
""" get mean and var
Args:
Returns:
(mean, var)
"""
return (self._mean, self._var)
def perform_trans(self, sample):
""" feature = (feature - mean) * var
Args:
sample(object):input sample, contain feature numpy and label numpy
Returns:
(feature, label)
"""
(feature, label) = sample
shape = feature.shape
assert len(shape) == 2
nfeature_len = shape[0] * shape[1]
assert nfeature_len % self._nLen == 0
ncur_idx = 0
feature = feature.reshape((nfeature_len))
while ncur_idx < nfeature_len:
block = feature[ncur_idx:ncur_idx + self._nLen]
block = (block - self._mean) * self._var
feature[ncur_idx:ncur_idx + self._nLen] = block
ncur_idx += self._nLen
feature = feature.reshape(shape)
return (feature, label)
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import math
class TransSplice(object):
""" copy feature context to construct new feature
expand feature data from shape (frame_num, frame_dim)
to shape (frame_num, frame_dim * 11)
Attributes:
_nleft_context(int): copy left context number
_nright_context(int): copy right context number
"""
def __init__(self, nleft_context=5, nright_context=5):
""" init construction
Args:
nleft_context(int):
nright_context(int):
"""
self._nleft_context = nleft_context
self._nright_context = nright_context
def perform_trans(self, sample):
""" copy feature context
Args:
sample(object): input sample(feature, label)
Return:
(feature, label)
"""
(feature, label) = sample
nframe_num = feature.shape[0]
nframe_dim = feature.shape[1]
nnew_frame_dim = nframe_dim * (
self._nleft_context + self._nright_context + 1)
mat = np.zeros(
(nframe_num + self._nleft_context + self._nright_context,
nframe_dim),
dtype="float32")
ret = np.zeros((nframe_num, nnew_frame_dim), dtype="float32")
#copy left
for i in xrange(self._nleft_context):
mat[i, :] = feature[0, :]
#copy middle
mat[self._nleft_context:self._nleft_context +
nframe_num, :] = feature[:, :]
#copy right
for i in xrange(self._nright_context):
mat[i + self._nleft_context + nframe_num, :] = feature[-1, :]
mat = mat.reshape(mat.shape[0] * mat.shape[1])
ret = ret.reshape(ret.shape[0] * ret.shape[1])
for i in xrange(nframe_num):
np.copyto(ret[i * nnew_frame_dim:(i + 1) * nnew_frame_dim],
mat[i * nframe_dim:i * nframe_dim + nnew_frame_dim])
ret = ret.reshape((nframe_num, nnew_frame_dim))
return (ret, label)
"""This module contains data processing related logic.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import random
import struct
import Queue
import time
import numpy as np
from threading import Thread
import signal
from multiprocessing import Manager, Process
import data_utils.augmentor.trans_mean_variance_norm as trans_mean_variance_norm
import data_utils.augmentor.trans_add_delta as trans_add_delta
from data_utils.util import suppress_complaints, suppress_signal
from data_utils.util import CriticalException, ForceExitWrapper
class SampleInfo(object):
"""SampleInfo holds the necessary information to load a sample from disk.
Args:
feature_bin_path (str): File containing the feature data.
feature_start (int): Start position of the sample's feature data.
feature_size (int): Byte count of the sample's feature data.
feature_frame_num (int): Time length of the sample.
feature_dim (int): Feature dimension of one frame.
label_bin_path (str): File containing the label data.
label_size (int): Byte count of the sample's label data.
label_frame_num (int): Label number of the sample.
"""
def __init__(self, feature_bin_path, feature_start, feature_size,
feature_frame_num, feature_dim, label_bin_path, label_start,
label_size, label_frame_num):
self.feature_bin_path = feature_bin_path
self.feature_start = feature_start
self.feature_size = feature_size
self.feature_frame_num = feature_frame_num
self.feature_dim = feature_dim
self.label_bin_path = label_bin_path
self.label_start = label_start
self.label_size = label_size
self.label_frame_num = label_frame_num
class SampleInfoBucket(object):
"""SampleInfoBucket contains paths of several description files. Feature
description file contains necessary information (including path of binary
data, sample start position, sample byte number etc.) to access samples'
feature data and the same with the label description file. SampleInfoBucket
is the minimum unit to do shuffle.
Args:
feature_bin_paths (list|tuple): Files containing the binary feature
data.
feature_desc_paths (list|tuple): Files containing the description of
samples' feature data.
label_bin_paths (list|tuple): Files containing the binary label data.
label_desc_paths (list|tuple): Files containing the description of
samples' label data.
split_perturb(int): Maximum perturbation value for length of
sub-sentence when splitting long sentence.
split_sentence_threshold(int): Sentence whose length larger than
the value will trigger split operation.
split_sub_sentence_len(int): sub-sentence length is equal to
(split_sub_sentence_len + rand() % split_perturb).
"""
def __init__(self,
feature_bin_paths,
feature_desc_paths,
label_bin_paths,
label_desc_paths,
split_perturb=50,
split_sentence_threshold=512,
split_sub_sentence_len=256):
block_num = len(label_bin_paths)
assert len(label_desc_paths) == block_num
assert len(feature_bin_paths) == block_num
assert len(feature_desc_paths) == block_num
self._block_num = block_num
self._feature_bin_paths = feature_bin_paths
self._feature_desc_paths = feature_desc_paths
self._label_bin_paths = label_bin_paths
self._label_desc_paths = label_desc_paths
self._split_perturb = split_perturb
self._split_sentence_threshold = split_sentence_threshold
self._split_sub_sentence_len = split_sub_sentence_len
self._rng = random.Random(0)
def generate_sample_info_list(self):
sample_info_list = []
for block_idx in xrange(self._block_num):
label_bin_path = self._label_bin_paths[block_idx]
label_desc_path = self._label_desc_paths[block_idx]
feature_bin_path = self._feature_bin_paths[block_idx]
feature_desc_path = self._feature_desc_paths[block_idx]
label_desc_lines = open(label_desc_path).readlines()
feature_desc_lines = open(feature_desc_path).readlines()
sample_num = int(label_desc_lines[0].split()[1])
assert sample_num == int(feature_desc_lines[0].split()[1])
for i in xrange(sample_num):
feature_desc_split = feature_desc_lines[i + 1].split()
feature_start = int(feature_desc_split[2])
feature_size = int(feature_desc_split[3])
feature_frame_num = int(feature_desc_split[4])
feature_dim = int(feature_desc_split[5])
label_desc_split = label_desc_lines[i + 1].split()
label_start = int(label_desc_split[2])
label_size = int(label_desc_split[3])
label_frame_num = int(label_desc_split[4])
assert feature_frame_num == label_frame_num
if self._split_sentence_threshold == -1 or \
self._split_perturb == -1 or \
self._split_sub_sentence_len == -1 \
or self._split_sentence_threshold >= feature_frame_num:
sample_info_list.append(
SampleInfo(feature_bin_path, feature_start,
feature_size, feature_frame_num, feature_dim,
label_bin_path, label_start, label_size,
label_frame_num))
#split sentence
else:
cur_frame_pos = 0
cur_frame_len = 0
remain_frame_num = feature_frame_num
while True:
if remain_frame_num > self._split_sentence_threshold:
cur_frame_len = self._split_sub_sentence_len + \
self._rng.randint(0, self._split_perturb)
if cur_frame_len > remain_frame_num:
cur_frame_len = remain_frame_num
else:
cur_frame_len = remain_frame_num
sample_info_list.append(
SampleInfo(
feature_bin_path, feature_start + cur_frame_pos
* feature_dim * 4, cur_frame_len * feature_dim *
4, cur_frame_len, feature_dim, label_bin_path,
label_start + cur_frame_pos * 4, cur_frame_len *
4, cur_frame_len))
remain_frame_num -= cur_frame_len
cur_frame_pos += cur_frame_len
if remain_frame_num <= 0:
break
return sample_info_list
class EpochEndSignal():
pass
class DataReader(object):
"""DataReader provides basic audio sample preprocessing pipeline including
data loading and data augmentation.
Args:
feature_file_list (str): File containing paths of feature data file and
corresponding description file.
label_file_list (str): File containing paths of label data file and
corresponding description file.
drop_frame_len (int): Samples whose label length above the value will be
dropped.(Using '-1' to disable the policy)
process_num (int): Number of processes for processing data.
sample_buffer_size (int): Buffer size to indicate the maximum samples
cached.
sample_info_buffer_size (int): Buffer size to indicate the maximum
sample information cached.
batch_buffer_size (int): Buffer size to indicate the maximum batch
cached.
shuffle_block_num (int): Block number indicating the minimum unit to do
shuffle.
random_seed (int): Random seed.
verbose (int): If set to 0, complaints including exceptions and signal
traceback from sub-process will be suppressed. If set
to 1, all complaints will be printed.
"""
def __init__(self,
feature_file_list,
label_file_list,
drop_frame_len=512,
process_num=10,
sample_buffer_size=1024,
sample_info_buffer_size=1024,
batch_buffer_size=1024,
shuffle_block_num=10,
random_seed=0,
verbose=0):
self._feature_file_list = feature_file_list
self._label_file_list = label_file_list
self._drop_frame_len = drop_frame_len
self._shuffle_block_num = shuffle_block_num
self._block_info_list = None
self._rng = random.Random(random_seed)
self._bucket_list = None
self.generate_bucket_list(True)
self._order_id = 0
self._manager = Manager()
self._sample_buffer_size = sample_buffer_size
self._sample_info_buffer_size = sample_info_buffer_size
self._batch_buffer_size = batch_buffer_size
self._process_num = process_num
self._verbose = verbose
self._force_exit = ForceExitWrapper(self._manager.Value('b', False))
def generate_bucket_list(self, is_shuffle):
if self._block_info_list is None:
block_feature_info_lines = open(self._feature_file_list).readlines()
block_label_info_lines = open(self._label_file_list).readlines()
assert len(block_feature_info_lines) == len(block_label_info_lines)
self._block_info_list = []
for i in xrange(0, len(block_feature_info_lines), 2):
block_info = (block_feature_info_lines[i],
block_feature_info_lines[i + 1],
block_label_info_lines[i],
block_label_info_lines[i + 1])
self._block_info_list.append(
map(lambda line: line.strip(), block_info))
if is_shuffle:
self._rng.shuffle(self._block_info_list)
self._bucket_list = []
for i in xrange(0, len(self._block_info_list), self._shuffle_block_num):
bucket_block_info = self._block_info_list[i:i +
self._shuffle_block_num]
self._bucket_list.append(
SampleInfoBucket(
map(lambda info: info[0], bucket_block_info),
map(lambda info: info[1], bucket_block_info),
map(lambda info: info[2], bucket_block_info),
map(lambda info: info[3], bucket_block_info)))
# @TODO make this configurable
def set_transformers(self, transformers):
self._transformers = transformers
def _sample_generator(self):
sample_info_queue = self._manager.Queue(self._sample_info_buffer_size)
sample_queue = self._manager.Queue(self._sample_buffer_size)
self._order_id = 0
@suppress_complaints(verbose=self._verbose, notify=self._force_exit)
def ordered_feeding_task(sample_info_queue):
for sample_info_bucket in self._bucket_list:
try:
sample_info_list = \
sample_info_bucket.generate_sample_info_list()
except Exception as e:
raise CriticalException(e)
else:
self._rng.shuffle(sample_info_list) # do shuffle here
for sample_info in sample_info_list:
sample_info_queue.put((sample_info, self._order_id))
self._order_id += 1
for i in xrange(self._process_num):
sample_info_queue.put(EpochEndSignal())
feeding_thread = Thread(
target=ordered_feeding_task, args=(sample_info_queue, ))
feeding_thread.daemon = True
feeding_thread.start()
@suppress_complaints(verbose=self._verbose, notify=self._force_exit)
def ordered_processing_task(sample_info_queue, sample_queue, out_order):
if self._verbose == 0:
signal.signal(signal.SIGTERM, suppress_signal)
signal.signal(signal.SIGINT, suppress_signal)
def read_bytes(fpath, start, size):
try:
f = open(fpath, 'r')
f.seek(start, 0)
binary_bytes = f.read(size)
f.close()
return binary_bytes
except Exception as e:
raise CriticalException(e)
ins = sample_info_queue.get()
while not isinstance(ins, EpochEndSignal):
sample_info, order_id = ins
feature_bytes = read_bytes(sample_info.feature_bin_path,
sample_info.feature_start,
sample_info.feature_size)
assert sample_info.feature_frame_num * sample_info.feature_dim * 4 \
== len(feature_bytes), \
(sample_info.feature_bin_path,
sample_info.feature_frame_num,
sample_info.feature_dim,
len(feature_bytes))
label_bytes = read_bytes(sample_info.label_bin_path,
sample_info.label_start,
sample_info.label_size)
assert sample_info.label_frame_num * 4 == len(label_bytes), (
sample_info.label_bin_path, sample_info.label_array,
len(label_bytes))
label_array = struct.unpack('I' * sample_info.label_frame_num,
label_bytes)
label_data = np.array(
label_array, dtype='int64').reshape(
(sample_info.label_frame_num, 1))
feature_frame_num = sample_info.feature_frame_num
feature_dim = sample_info.feature_dim
assert feature_frame_num * feature_dim * 4 == len(feature_bytes)
feature_array = struct.unpack('f' * feature_frame_num *
feature_dim, feature_bytes)
feature_data = np.array(
feature_array, dtype='float32').reshape((
sample_info.feature_frame_num, sample_info.feature_dim))
sample_data = (feature_data, label_data)
for transformer in self._transformers:
# @TODO(pkuyym) to make transfomer only accept feature_data
sample_data = transformer.perform_trans(sample_data)
while order_id != out_order[0]:
time.sleep(0.001)
# drop long sentence
if self._drop_frame_len == -1 or \
self._drop_frame_len >= sample_data[0].shape[0]:
sample_queue.put(sample_data)
out_order[0] += 1
ins = sample_info_queue.get()
sample_queue.put(EpochEndSignal())
out_order = self._manager.list([0])
args = (sample_info_queue, sample_queue, out_order)
workers = [
Process(
target=ordered_processing_task, args=args)
for _ in xrange(self._process_num)
]
for w in workers:
w.daemon = True
w.start()
finished_process_num = 0
while self._force_exit == False:
try:
sample = sample_queue.get_nowait()
except Queue.Empty:
time.sleep(0.001)
else:
if isinstance(sample, EpochEndSignal):
finished_process_num += 1
if finished_process_num >= self._process_num:
break
else:
continue
yield sample
def batch_iterator(self, batch_size, minimum_batch_size):
def batch_to_ndarray(batch_samples, lod):
assert len(batch_samples)
frame_dim = batch_samples[0][0].shape[1]
batch_feature = np.zeros((lod[-1], frame_dim), dtype="float32")
batch_label = np.zeros((lod[-1], 1), dtype="int64")
start = 0
for sample in batch_samples:
frame_num = sample[0].shape[0]
batch_feature[start:start + frame_num, :] = sample[0]
batch_label[start:start + frame_num, :] = sample[1]
start += frame_num
return (batch_feature, batch_label)
@suppress_complaints(verbose=self._verbose, notify=self._force_exit)
def batch_assembling_task(sample_generator, batch_queue):
batch_samples = []
lod = [0]
for sample in sample_generator():
batch_samples.append(sample)
lod.append(lod[-1] + sample[0].shape[0])
if len(batch_samples) == batch_size:
(batch_feature, batch_label) = batch_to_ndarray(
batch_samples, lod)
batch_queue.put((batch_feature, batch_label, lod))
batch_samples = []
lod = [0]
if len(batch_samples) >= minimum_batch_size:
(batch_feature, batch_label) = batch_to_ndarray(batch_samples,
lod)
batch_queue.put((batch_feature, batch_label, lod))
batch_queue.put(EpochEndSignal())
batch_queue = Queue.Queue(self._batch_buffer_size)
assembling_thread = Thread(
target=batch_assembling_task,
args=(self._sample_generator, batch_queue))
assembling_thread.daemon = True
assembling_thread.start()
while self._force_exit == False:
try:
batch_data = batch_queue.get_nowait()
except Queue.Empty:
time.sleep(0.001)
else:
if isinstance(batch_data, EpochEndSignal):
break
yield batch_data
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
from six import reraise
from tblib import Traceback
import numpy as np
def to_lodtensor(data, place):
"""convert tensor to lodtensor
"""
seq_lens = [len(seq) for seq in data]
cur_len = 0
lod = [cur_len]
for l in seq_lens:
cur_len += l
lod.append(cur_len)
flattened_data = numpy.concatenate(data, axis=0).astype("int64")
flattened_data = flattened_data.reshape([len(flattened_data), 1])
res = fluid.LoDTensor()
res.set(flattened_data, place)
res.set_lod([lod])
return res
def lodtensor_to_ndarray(lod_tensor):
"""conver lodtensor to ndarray
"""
dims = lod_tensor.get_dims()
ret = np.zeros(shape=dims).astype('float32')
for i in xrange(np.product(dims)):
ret.ravel()[i] = lod_tensor.get_float_element(i)
return ret, lod_tensor.lod()
class CriticalException(Exception):
pass
def suppress_signal(signo, stack_frame):
pass
def suppress_complaints(verbose, notify=None):
def decorator_maker(func):
def suppress_warpper(*args, **kwargs):
try:
func(*args, **kwargs)
except:
et, ev, tb = sys.exc_info()
if notify is not None:
notify(except_type=et, except_value=ev, traceback=tb)
if verbose == 1 or isinstance(ev, CriticalException):
reraise(et, ev, Traceback(tb).as_traceback())
return suppress_warpper
return decorator_maker
class ForceExitWrapper(object):
def __init__(self, exit_flag):
self._exit_flag = exit_flag
@suppress_complaints(verbose=0)
def __call__(self, *args, **kwargs):
self._exit_flag.value = True
def __eq__(self, flag):
return self._exit_flag.value == flag
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import argparse
import paddle.fluid as fluid
import data_utils.augmentor.trans_mean_variance_norm as trans_mean_variance_norm
import data_utils.augmentor.trans_add_delta as trans_add_delta
import data_utils.augmentor.trans_splice as trans_splice
import data_utils.data_reader as reader
from data_utils.util import lodtensor_to_ndarray
def parse_args():
parser = argparse.ArgumentParser("Inference for stacked LSTMP model.")
parser.add_argument(
'--batch_size',
type=int,
default=32,
help='The sequence number of a batch data. (default: %(default)d)')
parser.add_argument(
'--device',
type=str,
default='GPU',
choices=['CPU', 'GPU'],
help='The device type. (default: %(default)s)')
parser.add_argument(
'--mean_var',
type=str,
default='data/global_mean_var_search26kHr',
help="The path for feature's global mean and variance. "
"(default: %(default)s)")
parser.add_argument(
'--infer_feature_lst',
type=str,
default='data/infer_feature.lst',
help='The feature list path for inference. (default: %(default)s)')
parser.add_argument(
'--infer_label_lst',
type=str,
default='data/infer_label.lst',
help='The label list path for inference. (default: %(default)s)')
parser.add_argument(
'--infer_model_path',
type=str,
default='./infer_models/deep_asr.pass_0.infer.model/',
help='The directory for loading inference model. '
'(default: %(default)s)')
args = parser.parse_args()
return args
def print_arguments(args):
print('----------- Configuration Arguments -----------')
for arg, value in sorted(vars(args).iteritems()):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
def split_infer_result(infer_seq, lod):
infer_batch = []
for i in xrange(0, len(lod[0]) - 1):
infer_batch.append(infer_seq[lod[0][i]:lod[0][i + 1]])
return infer_batch
def infer(args):
""" Gets one batch of feature data and predicts labels for each sample.
"""
if not os.path.exists(args.infer_model_path):
raise IOError("Invalid inference model path!")
place = fluid.CUDAPlace(0) if args.device == 'GPU' else fluid.CPUPlace()
exe = fluid.Executor(place)
# load model
[infer_program, feed_dict,
fetch_targets] = fluid.io.load_inference_model(args.infer_model_path, exe)
ltrans = [
trans_add_delta.TransAddDelta(2, 2),
trans_mean_variance_norm.TransMeanVarianceNorm(args.mean_var),
trans_splice.TransSplice()
]
infer_data_reader = reader.DataReader(args.infer_feature_lst,
args.infer_label_lst)
infer_data_reader.set_transformers(ltrans)
feature_t = fluid.LoDTensor()
one_batch = infer_data_reader.batch_iterator(args.batch_size, 1).next()
(features, labels, lod) = one_batch
feature_t.set(features, place)
feature_t.set_lod([lod])
results = exe.run(infer_program,
feed={feed_dict[0]: feature_t},
fetch_list=fetch_targets,
return_numpy=False)
probs, lod = lodtensor_to_ndarray(results[0])
preds = probs.argmax(axis=1)
infer_batch = split_infer_result(preds, lod)
for index, sample in enumerate(infer_batch):
print("result %d: " % index, sample, '\n')
if __name__ == '__main__':
args = parse_args()
print_arguments(args)
infer(args)
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import paddle.v2 as paddle
import paddle.fluid as fluid
def stacked_lstmp_model(hidden_dim,
proj_dim,
stacked_num,
class_num,
parallel=False,
is_train=True):
""" The model for DeepASR. The main structure is composed of stacked
identical LSTMP (LSTM with recurrent projection) layers.
When running in training and validation phase, the feeding dictionary
is {'feature', 'label'}, fed by the LodTensor for feature data and
label data respectively. And in inference, only `feature` is needed.
Args:
hidden_dim(int): The hidden state's dimension of the LSTMP layer.
proj_dim(int): The projection size of the LSTMP layer.
stacked_num(int): The number of stacked LSTMP layers.
parallel(bool): Run in parallel or not, default `False`.
is_train(bool): Run in training phase or not, default `True`.
class_dim(int): The number of output classes.
"""
# network configuration
def _net_conf(feature, label):
seq_conv1 = fluid.layers.sequence_conv(
input=feature,
num_filters=1024,
filter_size=3,
filter_stride=1,
bias_attr=True)
bn1 = fluid.layers.batch_norm(
input=seq_conv1,
act="sigmoid",
is_test=not is_train,
momentum=0.9,
epsilon=1e-05,
data_layout='NCHW')
stack_input = bn1
for i in range(stacked_num):
fc = fluid.layers.fc(input=stack_input,
size=hidden_dim * 4,
bias_attr=True)
proj, cell = fluid.layers.dynamic_lstmp(
input=fc,
size=hidden_dim * 4,
proj_size=proj_dim,
bias_attr=True,
use_peepholes=True,
is_reverse=False,
cell_activation="tanh",
proj_activation="tanh")
bn = fluid.layers.batch_norm(
input=proj,
act="sigmoid",
is_test=not is_train,
momentum=0.9,
epsilon=1e-05,
data_layout='NCHW')
stack_input = bn
prediction = fluid.layers.fc(input=stack_input,
size=class_num,
act='softmax')
cost = fluid.layers.cross_entropy(input=prediction, label=label)
avg_cost = fluid.layers.mean(x=cost)
acc = fluid.layers.accuracy(input=prediction, label=label)
return prediction, avg_cost, acc
# data feeder
feature = fluid.layers.data(
name="feature", shape=[-1, 120 * 11], dtype="float32", lod_level=1)
label = fluid.layers.data(
name="label", shape=[-1, 1], dtype="int64", lod_level=1)
if parallel:
# When the execution place is specified to CUDAPlace, the program will
# run on all $CUDA_VISIBLE_DEVICES GPUs. Otherwise the program will
# run on all CPU devices.
places = fluid.layers.get_places()
pd = fluid.layers.ParallelDo(places)
with pd.do():
feat_ = pd.read_input(feature)
label_ = pd.read_input(label)
prediction, avg_cost, acc = _net_conf(feat_, label_)
for out in [avg_cost, acc]:
pd.write_output(out)
# get mean loss and acc through every devices.
avg_cost, acc = pd()
avg_cost = fluid.layers.mean(x=avg_cost)
acc = fluid.layers.mean(x=acc)
else:
prediction, avg_cost, acc = _net_conf(feature, label)
return prediction, avg_cost, acc
"""Add the parent directory to $PYTHONPATH"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os.path
import sys
def add_path(path):
if path not in sys.path:
sys.path.insert(0, path)
this_dir = os.path.dirname(__file__)
# Add project path to PYTHONPATH
proj_path = os.path.join(this_dir, '..')
add_path(proj_path)
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import numpy as np
import argparse
import time
import paddle.fluid as fluid
import paddle.fluid.profiler as profiler
import _init_paths
import data_utils.augmentor.trans_mean_variance_norm as trans_mean_variance_norm
import data_utils.augmentor.trans_add_delta as trans_add_delta
import data_utils.augmentor.trans_splice as trans_splice
import data_utils.data_reader as reader
from model_utils.model import stacked_lstmp_model
from data_utils.util import lodtensor_to_ndarray
def parse_args():
parser = argparse.ArgumentParser("Profiling for the stacked LSTMP model.")
parser.add_argument(
'--batch_size',
type=int,
default=32,
help='The sequence number of a batch data. (default: %(default)d)')
parser.add_argument(
'--minimum_batch_size',
type=int,
default=1,
help='The minimum sequence number of a batch data. '
'(default: %(default)d)')
parser.add_argument(
'--stacked_num',
type=int,
default=5,
help='Number of lstmp layers to stack. (default: %(default)d)')
parser.add_argument(
'--proj_dim',
type=int,
default=512,
help='Project size of lstmp unit. (default: %(default)d)')
parser.add_argument(
'--hidden_dim',
type=int,
default=1024,
help='Hidden size of lstmp unit. (default: %(default)d)')
parser.add_argument(
'--learning_rate',
type=float,
default=0.002,
help='Learning rate used to train. (default: %(default)f)')
parser.add_argument(
'--device',
type=str,
default='GPU',
choices=['CPU', 'GPU'],
help='The device type. (default: %(default)s)')
parser.add_argument(
'--parallel', action='store_true', help='If set, run in parallel.')
parser.add_argument(
'--mean_var',
type=str,
default='data/global_mean_var_search26kHr',
help='mean var path')
parser.add_argument(
'--feature_lst',
type=str,
default='data/feature.lst',
help='feature list path.')
parser.add_argument(
'--label_lst',
type=str,
default='data/label.lst',
help='label list path.')
parser.add_argument(
'--max_batch_num',
type=int,
default=10,
help='Maximum number of batches for profiling. (default: %(default)d)')
parser.add_argument(
'--first_batches_to_skip',
type=int,
default=1,
help='Number of first batches to skip for profiling. '
'(default: %(default)d)')
parser.add_argument(
'--print_train_acc',
action='store_true',
help='If set, output training accuray.')
parser.add_argument(
'--sorted_key',
type=str,
default='total',
choices=['None', 'total', 'calls', 'min', 'max', 'ave'],
help='Different types of time to sort the profiling report. '
'(default: %(default)s)')
args = parser.parse_args()
return args
def print_arguments(args):
print('----------- Configuration Arguments -----------')
for arg, value in sorted(vars(args).iteritems()):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
def profile(args):
"""profile the training process.
"""
if not args.first_batches_to_skip < args.max_batch_num:
raise ValueError("arg 'first_batches_to_skip' must be smaller than "
"'max_batch_num'.")
if not args.first_batches_to_skip >= 0:
raise ValueError(
"arg 'first_batches_to_skip' must not be smaller than 0.")
_, avg_cost, accuracy = stacked_lstmp_model(
hidden_dim=args.hidden_dim,
proj_dim=args.proj_dim,
stacked_num=args.stacked_num,
class_num=1749,
parallel=args.parallel)
optimizer = fluid.optimizer.Momentum(
learning_rate=args.learning_rate, momentum=0.9)
optimizer.minimize(avg_cost)
place = fluid.CPUPlace() if args.device == 'CPU' else fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
ltrans = [
trans_add_delta.TransAddDelta(2, 2),
trans_mean_variance_norm.TransMeanVarianceNorm(args.mean_var),
trans_splice.TransSplice()
]
data_reader = reader.DataReader(args.feature_lst, args.label_lst)
data_reader.set_transformers(ltrans)
feature_t = fluid.LoDTensor()
label_t = fluid.LoDTensor()
sorted_key = None if args.sorted_key is 'None' else args.sorted_key
with profiler.profiler(args.device, sorted_key) as prof:
frames_seen, start_time = 0, 0.0
for batch_id, batch_data in enumerate(
data_reader.batch_iterator(args.batch_size,
args.minimum_batch_size)):
if batch_id >= args.max_batch_num:
break
if args.first_batches_to_skip == batch_id:
profiler.reset_profiler()
start_time = time.time()
frames_seen = 0
# load_data
(features, labels, lod) = batch_data
feature_t.set(features, place)
feature_t.set_lod([lod])
label_t.set(labels, place)
label_t.set_lod([lod])
frames_seen += lod[-1]
outs = exe.run(fluid.default_main_program(),
feed={"feature": feature_t,
"label": label_t},
fetch_list=[avg_cost, accuracy],
return_numpy=False)
if args.print_train_acc:
print("Batch %d acc: %f" %
(batch_id, lodtensor_to_ndarray(outs[1])[0]))
else:
sys.stdout.write('.')
sys.stdout.flush()
time_consumed = time.time() - start_time
frames_per_sec = frames_seen / time_consumed
print("\nTime consumed: %f s, performance: %f frames/s." %
(time_consumed, frames_per_sec))
if __name__ == '__main__':
args = parse_args()
print_arguments(args)
profile(args)
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import os
import numpy as np
import argparse
import time
import paddle.fluid as fluid
import data_utils.augmentor.trans_mean_variance_norm as trans_mean_variance_norm
import data_utils.augmentor.trans_add_delta as trans_add_delta
import data_utils.augmentor.trans_splice as trans_splice
import data_utils.data_reader as reader
from data_utils.util import lodtensor_to_ndarray
from model_utils.model import stacked_lstmp_model
def parse_args():
parser = argparse.ArgumentParser("Training for stacked LSTMP model.")
parser.add_argument(
'--batch_size',
type=int,
default=32,
help='The sequence number of a batch data. (default: %(default)d)')
parser.add_argument(
'--minimum_batch_size',
type=int,
default=1,
help='The minimum sequence number of a batch data. '
'(default: %(default)d)')
parser.add_argument(
'--stacked_num',
type=int,
default=5,
help='Number of lstmp layers to stack. (default: %(default)d)')
parser.add_argument(
'--proj_dim',
type=int,
default=512,
help='Project size of lstmp unit. (default: %(default)d)')
parser.add_argument(
'--hidden_dim',
type=int,
default=1024,
help='Hidden size of lstmp unit. (default: %(default)d)')
parser.add_argument(
'--pass_num',
type=int,
default=100,
help='Epoch number to train. (default: %(default)d)')
parser.add_argument(
'--print_per_batches',
type=int,
default=100,
help='Interval to print training accuracy. (default: %(default)d)')
parser.add_argument(
'--learning_rate',
type=float,
default=0.002,
help='Learning rate used to train. (default: %(default)f)')
parser.add_argument(
'--device',
type=str,
default='GPU',
choices=['CPU', 'GPU'],
help='The device type. (default: %(default)s)')
parser.add_argument(
'--parallel', action='store_true', help='If set, run in parallel.')
parser.add_argument(
'--mean_var',
type=str,
default='data/global_mean_var_search26kHr',
help="The path for feature's global mean and variance. "
"(default: %(default)s)")
parser.add_argument(
'--train_feature_lst',
type=str,
default='data/feature.lst',
help='The feature list path for training. (default: %(default)s)')
parser.add_argument(
'--train_label_lst',
type=str,
default='data/label.lst',
help='The label list path for training. (default: %(default)s)')
parser.add_argument(
'--val_feature_lst',
type=str,
default='data/val_feature.lst',
help='The feature list path for validation. (default: %(default)s)')
parser.add_argument(
'--val_label_lst',
type=str,
default='data/val_label.lst',
help='The label list path for validation. (default: %(default)s)')
parser.add_argument(
'--init_model_path',
type=str,
default=None,
help="The model (checkpoint) path which the training resumes from. "
"If None, train the model from scratch. (default: %(default)s)")
parser.add_argument(
'--checkpoints',
type=str,
default='./checkpoints',
help="The directory for saving checkpoints. Do not save checkpoints "
"if set to ''. (default: %(default)s)")
parser.add_argument(
'--infer_models',
type=str,
default='./infer_models',
help="The directory for saving inference models. Do not save inference "
"models if set to ''. (default: %(default)s)")
args = parser.parse_args()
return args
def print_arguments(args):
print('----------- Configuration Arguments -----------')
for arg, value in sorted(vars(args).iteritems()):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
def train(args):
"""train in loop.
"""
# paths check
if args.init_model_path is not None and \
not os.path.exists(args.init_model_path):
raise IOError("Invalid initial model path!")
if args.checkpoints != '' and not os.path.exists(args.checkpoints):
os.mkdir(args.checkpoints)
if args.infer_models != '' and not os.path.exists(args.infer_models):
os.mkdir(args.infer_models)
prediction, avg_cost, accuracy = stacked_lstmp_model(
hidden_dim=args.hidden_dim,
proj_dim=args.proj_dim,
stacked_num=args.stacked_num,
class_num=1749,
parallel=args.parallel)
optimizer = fluid.optimizer.Momentum(
learning_rate=args.learning_rate, momentum=0.9)
optimizer.minimize(avg_cost)
# program for test
test_program = fluid.default_main_program().clone()
with fluid.program_guard(test_program):
test_program = fluid.io.get_inference_program([avg_cost, accuracy])
place = fluid.CPUPlace() if args.device == 'CPU' else fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
# resume training if initial model provided.
if args.init_model_path is not None:
fluid.io.load_persistables(exe, args.init_model_path)
ltrans = [
trans_add_delta.TransAddDelta(2, 2),
trans_mean_variance_norm.TransMeanVarianceNorm(args.mean_var),
trans_splice.TransSplice()
]
feature_t = fluid.LoDTensor()
label_t = fluid.LoDTensor()
# validation
def test(exe):
# If test data not found, return invalid cost and accuracy
if not (os.path.exists(args.val_feature_lst) and
os.path.exists(args.val_label_lst)):
return -1.0, -1.0
# test data reader
test_data_reader = reader.DataReader(args.val_feature_lst,
args.val_label_lst)
test_data_reader.set_transformers(ltrans)
test_costs, test_accs = [], []
for batch_id, batch_data in enumerate(
test_data_reader.batch_iterator(args.batch_size,
args.minimum_batch_size)):
# load_data
(features, labels, lod) = batch_data
feature_t.set(features, place)
feature_t.set_lod([lod])
label_t.set(labels, place)
label_t.set_lod([lod])
cost, acc = exe.run(test_program,
feed={"feature": feature_t,
"label": label_t},
fetch_list=[avg_cost, accuracy],
return_numpy=False)
test_costs.append(lodtensor_to_ndarray(cost)[0])
test_accs.append(lodtensor_to_ndarray(acc)[0])
return np.mean(test_costs), np.mean(test_accs)
# train data reader
train_data_reader = reader.DataReader(args.train_feature_lst,
args.train_label_lst, -1)
train_data_reader.set_transformers(ltrans)
# train
for pass_id in xrange(args.pass_num):
pass_start_time = time.time()
for batch_id, batch_data in enumerate(
train_data_reader.batch_iterator(args.batch_size,
args.minimum_batch_size)):
# load_data
(features, labels, lod) = batch_data
feature_t.set(features, place)
feature_t.set_lod([lod])
label_t.set(labels, place)
label_t.set_lod([lod])
cost, acc = exe.run(fluid.default_main_program(),
feed={"feature": feature_t,
"label": label_t},
fetch_list=[avg_cost, accuracy],
return_numpy=False)
if batch_id > 0 and (batch_id % args.print_per_batches == 0):
print("\nBatch %d, train cost: %f, train acc: %f" %
(batch_id, lodtensor_to_ndarray(cost)[0],
lodtensor_to_ndarray(acc)[0]))
# save the latest checkpoint
if args.checkpoints != '':
model_path = os.path.join(args.checkpoints,
"deep_asr.latest.checkpoint")
fluid.io.save_persistables(exe, model_path)
else:
sys.stdout.write('.')
sys.stdout.flush()
# run test
val_cost, val_acc = test(exe)
# save checkpoint per pass
if args.checkpoints != '':
model_path = os.path.join(
args.checkpoints,
"deep_asr.pass_" + str(pass_id) + ".checkpoint")
fluid.io.save_persistables(exe, model_path)
# save inference model
if args.infer_models != '':
model_path = os.path.join(
args.infer_models,
"deep_asr.pass_" + str(pass_id) + ".infer.model")
fluid.io.save_inference_model(model_path, ["feature"],
[prediction], exe)
# cal pass time
pass_end_time = time.time()
time_consumed = pass_end_time - pass_start_time
# print info at pass end
print("\nPass %d, time consumed: %f s, val cost: %f, val acc: %f\n" %
(pass_id, time_consumed, val_cost, val_acc))
if __name__ == '__main__':
args = parse_args()
print_arguments(args)
train(args)
# Paddle Fluid Models
---
The Paddle Fluid models are a collection of example models that use Paddle Fluid APIs. Currently, example codes in this directory are still under active development.
The minimum PaddlePaddle version needed for the code sample in this directory is the lastest develop branch. If you are on a version of PaddlePaddle earlier than this, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
---
# Advbox
Advbox is a Python toolbox to create adversarial examples that fool neural networks. It requires Python and paddle.
......
"""
A set of tools for generating adversarial example on paddle platform
"""
from . import attacks
from . import models
from .adversary import Adversary
......@@ -18,13 +18,15 @@ class Adversary(object):
"""
assert original is not None
self.original_label = original_label
self.target_label = None
self.adversarial_label = None
self.__original = original
self.__original_label = original_label
self.__target_label = None
self.__target = None
self.__is_targeted_attack = False
self.__adversarial_example = None
self.__adversarial_label = None
self.__bad_adversarial_example = None
def set_target(self, is_targeted_attack, target=None, target_label=None):
"""
......@@ -38,10 +40,10 @@ class Adversary(object):
"""
assert (target_label is None) or is_targeted_attack
self.__is_targeted_attack = is_targeted_attack
self.__target_label = target_label
self.target_label = target_label
self.__target = target
if not is_targeted_attack:
self.__target_label = None
self.target_label = None
self.__target = None
def set_original(self, original, original_label=None):
......@@ -53,10 +55,11 @@ class Adversary(object):
"""
if original != self.__original:
self.__original = original
self.__original_label = original_label
self.original_label = original_label
self.__adversarial_example = None
self.__bad_adversarial_example = None
if original is None:
self.__original_label = None
self.original_label = None
def _is_successful(self, adversarial_label):
"""
......@@ -65,11 +68,11 @@ class Adversary(object):
:param adversarial_label: adversarial label.
:return: bool
"""
if self.__target_label is not None:
return adversarial_label == self.__target_label
if self.target_label is not None:
return adversarial_label == self.target_label
else:
return (adversarial_label is not None) and \
(adversarial_label != self.__original_label)
(adversarial_label != self.original_label)
def is_successful(self):
"""
......@@ -77,7 +80,7 @@ class Adversary(object):
:return: bool
"""
return self._is_successful(self.__adversarial_label)
return self._is_successful(self.adversarial_label)
def try_accept_the_example(self, adversarial_example, adversarial_label):
"""
......@@ -93,7 +96,9 @@ class Adversary(object):
ok = self._is_successful(adversarial_label)
if ok:
self.__adversarial_example = adversarial_example
self.__adversarial_label = adversarial_label
self.adversarial_label = adversarial_label
else:
self.__bad_adversarial_example = adversarial_example
return ok
def perturbation(self, multiplying_factor=1.0):
......@@ -104,9 +109,14 @@ class Adversary(object):
:return: The perturbation that is multiplied by multiplying_factor.
"""
assert self.__original is not None
assert self.__adversarial_example is not None
return multiplying_factor * (
self.__adversarial_example - self.__original)
assert (self.__adversarial_example is not None) or \
(self.__bad_adversarial_example is not None)
if self.__adversarial_example is not None:
return multiplying_factor * (
self.__adversarial_example - self.__original)
else:
return multiplying_factor * (
self.__bad_adversarial_example - self.__original)
@property
def is_targeted_attack(self):
......@@ -115,20 +125,6 @@ class Adversary(object):
"""
return self.__is_targeted_attack
@property
def target_label(self):
"""
:property: target_label
"""
return self.__target_label
@target_label.setter
def target_label(self, label):
"""
:property: target_label
"""
self.__target_label = label
@property
def target(self):
"""
......@@ -143,20 +139,6 @@ class Adversary(object):
"""
return self.__original
@property
def original_label(self):
"""
:property: original
"""
return self.__original_label
@original_label.setter
def original_label(self, label):
"""
original_label setter
"""
self.__original_label = label
@property
def adversarial_example(self):
"""
......@@ -164,23 +146,9 @@ class Adversary(object):
"""
return self.__adversarial_example
@adversarial_example.setter
def adversarial_example(self, example):
"""
adversarial_example setter
"""
self.__adversarial_example = example
@property
def adversarial_label(self):
"""
:property: adversarial_label
"""
return self.__adversarial_label
@adversarial_label.setter
def adversarial_label(self, label):
def bad_adversarial_example(self):
"""
adversarial_label setter
:property: bad_adversarial_example
"""
self.__adversarial_label = label
return self.__bad_adversarial_example
"""
Attack methods
Attack methods __init__.py
"""
from .base import Attack
from .deepfool import DeepFoolAttack
from .gradientsign import FGSM
from .gradientsign import GradientSignAttack
from .iterator_gradientsign import IFGSM
from .iterator_gradientsign import IteratorGradientSignAttack
......@@ -52,21 +52,23 @@ class Attack(object):
:param adversary: adversary
:return: None
"""
assert self.model.channel_axis() == adversary.original.ndim
if adversary.original_label is None:
adversary.original_label = np.argmax(
self.model.predict(adversary.original))
if adversary.is_targeted_attack and adversary.target_label is None:
if adversary.target is None:
raise ValueError(
'When adversary.is_targeted_attack is True, '
'When adversary.is_targeted_attack is true, '
'adversary.target_label or adversary.target must be set.')
else:
adversary.target_label_label = np.argmax(
self.model.predict(
self.model.scale_input(adversary.target)))
adversary.target_label = np.argmax(
self.model.predict(adversary.target))
logging.info('adversary:\noriginal_label: {}'
'\n target_lable: {}'
'\n is_targeted_attack: {}'
logging.info('adversary:'
'\n original_label: {}'
'\n target_label: {}'
'\n is_targeted_attack: {}'
''.format(adversary.original_label, adversary.target_label,
adversary.is_targeted_attack))
......@@ -10,6 +10,8 @@ import numpy as np
from .base import Attack
__all__ = ['DeepFoolAttack']
class DeepFoolAttack(Attack):
"""
......@@ -56,7 +58,7 @@ class DeepFoolAttack(Attack):
gradient_k = self.model.gradient(x, k)
w_k = gradient_k - gradient
f_k = f[k] - f[pre_label]
w_k_norm = np.linalg.norm(w_k) + 1e-8
w_k_norm = np.linalg.norm(w_k.flatten()) + 1e-8
pert_k = (np.abs(f_k) + 1e-8) / w_k_norm
if pert_k < pert:
pert = pert_k
......@@ -70,9 +72,12 @@ class DeepFoolAttack(Attack):
f = self.model.predict(x)
gradient = self.model.gradient(x, pre_label)
adv_label = np.argmax(f)
logging.info('iteration = {}, f = {}, pre_label = {}'
', adv_label={}'.format(iteration, f[pre_label],
pre_label, adv_label))
logging.info('iteration={}, f[pre_label]={}, f[target_label]={}'
', f[adv_label]={}, pre_label={}, adv_label={}'
''.format(iteration, f[pre_label], (
f[adversary.target_label]
if adversary.is_targeted_attack else 'NaN'), f[
adv_label], pre_label, adv_label))
if adversary.try_accept_the_example(x, adv_label):
return adversary
......
"""
This module provide the attack method for Iterator FGSM's implement.
"""
from __future__ import division
import logging
from collections import Iterable
import numpy as np
from .base import Attack
__all__ = [
'GradientMethodAttack', 'FastGradientSignMethodAttack', 'FGSM',
'FastGradientSignMethodTargetedAttack', 'FGSMT',
'BasicIterativeMethodAttack', 'BIM',
'IterativeLeastLikelyClassMethodAttack', 'ILCM'
]
class GradientMethodAttack(Attack):
"""
This class implements gradient attack method, and is the base of FGSM, BIM,
ILCM, etc.
"""
def __init__(self, model, support_targeted=True):
"""
:param model(model): The model to be attacked.
:param support_targeted(bool): Does this attack method support targeted.
"""
super(GradientMethodAttack, self).__init__(model)
self.support_targeted = support_targeted
def _apply(self, adversary, norm_ord=np.inf, epsilons=0.01, steps=100):
"""
Apply the gradient attack method.
:param adversary(Adversary):
The Adversary object.
:param norm_ord(int):
Order of the norm, such as np.inf, 1, 2, etc. It can't be 0.
:param epsilons(list|tuple|int):
Attack step size (input variation).
:param steps:
The number of iterator steps.
:return:
adversary(Adversary): The Adversary object.
"""
if norm_ord == 0:
raise ValueError("L0 norm is not supported!")
if not self.support_targeted:
if adversary.is_targeted_attack:
raise ValueError(
"This attack method doesn't support targeted attack!")
if not isinstance(epsilons, Iterable):
epsilons = np.linspace(epsilons, epsilons + 1e-10, num=steps)
pre_label = adversary.original_label
min_, max_ = self.model.bounds()
assert self.model.channel_axis() == adversary.original.ndim
assert (self.model.channel_axis() == 1 or
self.model.channel_axis() == adversary.original.shape[0] or
self.model.channel_axis() == adversary.original.shape[-1])
step = 1
adv_img = adversary.original
for epsilon in epsilons[:steps]:
if epsilon == 0.0:
continue
if adversary.is_targeted_attack:
gradient = -self.model.gradient(adv_img, adversary.target_label)
else:
gradient = self.model.gradient(adv_img,
adversary.original_label)
if norm_ord == np.inf:
gradient_norm = np.sign(gradient)
else:
gradient_norm = gradient / self._norm(gradient, ord=norm_ord)
adv_img = adv_img + epsilon * gradient_norm * (max_ - min_)
adv_img = np.clip(adv_img, min_, max_)
adv_label = np.argmax(self.model.predict(adv_img))
logging.info('step={}, epsilon = {:.5f}, pre_label = {}, '
'adv_label={}'.format(step, epsilon, pre_label,
adv_label))
if adversary.try_accept_the_example(adv_img, adv_label):
return adversary
step += 1
return adversary
@staticmethod
def _norm(a, ord):
if a.ndim == 1:
return np.linalg.norm(a, ord=ord)
if a.ndim == a.shape[0]:
norm_shape = (a.ndim, reduce(np.dot, a.shape[1:]))
norm_axis = 1
else:
norm_shape = (reduce(np.dot, a.shape[:-1]), a.ndim)
norm_axis = 0
return np.linalg.norm(a.reshape(norm_shape), ord=ord, axis=norm_axis)
class FastGradientSignMethodTargetedAttack(GradientMethodAttack):
"""
"Fast Gradient Sign Method" is extended to support targeted attack.
"Fast Gradient Sign Method" was originally implemented by Goodfellow et
al. (2015) with the infinity norm.
Paper link: https://arxiv.org/abs/1412.6572
"""
def _apply(self, adversary, epsilons=0.03):
return GradientMethodAttack._apply(
self,
adversary=adversary,
norm_ord=np.inf,
epsilons=epsilons,
steps=1)
class FastGradientSignMethodAttack(FastGradientSignMethodTargetedAttack):
"""
This attack was originally implemented by Goodfellow et al. (2015) with the
infinity norm, and is known as the "Fast Gradient Sign Method".
Paper link: https://arxiv.org/abs/1412.6572
"""
def __init__(self, model):
super(FastGradientSignMethodAttack, self).__init__(model, False)
class IterativeLeastLikelyClassMethodAttack(GradientMethodAttack):
"""
"Iterative Least-likely Class Method (ILCM)" extends "BIM" to support
targeted attack.
"The Basic Iterative Method (BIM)" is to extend "FSGM". "BIM" iteratively
take multiple small steps while adjusting the direction after each step.
Paper link: https://arxiv.org/abs/1607.02533
"""
def _apply(self, adversary, epsilons=0.001, steps=1000):
return GradientMethodAttack._apply(
self,
adversary=adversary,
norm_ord=np.inf,
epsilons=epsilons,
steps=steps)
class BasicIterativeMethodAttack(IterativeLeastLikelyClassMethodAttack):
"""
FGSM is a one-step method. "The Basic Iterative Method (BIM)" iteratively
take multiple small steps while adjusting the direction after each step.
Paper link: https://arxiv.org/abs/1607.02533
"""
def __init__(self, model):
super(BasicIterativeMethodAttack, self).__init__(model, False)
FGSM = FastGradientSignMethodAttack
FGSMT = FastGradientSignMethodTargetedAttack
BIM = BasicIterativeMethodAttack
ILCM = IterativeLeastLikelyClassMethodAttack
"""
This module provide the attack method for FGSM's implement.
"""
from __future__ import division
import logging
from collections import Iterable
import numpy as np
from .base import Attack
class GradientSignAttack(Attack):
"""
This attack was originally implemented by Goodfellow et al. (2015) with the
infinity norm (and is known as the "Fast Gradient Sign Method").
This is therefore called the Fast Gradient Method.
Paper link: https://arxiv.org/abs/1412.6572
"""
def _apply(self, adversary, epsilons=1000):
"""
Apply the gradient sign attack.
Args:
adversary(Adversary): The Adversary object.
epsilons(list|tuple|int): The epsilon (input variation parameter).
Return:
adversary: The Adversary object.
"""
assert adversary is not None
if not isinstance(epsilons, Iterable):
epsilons = np.linspace(0, 1, num=epsilons + 1)[1:]
pre_label = adversary.original_label
min_, max_ = self.model.bounds()
if adversary.is_targeted_attack:
gradient = self.model.gradient(adversary.original,
adversary.target_label)
gradient_sign = -np.sign(gradient) * (max_ - min_)
else:
gradient = self.model.gradient(adversary.original,
adversary.original_label)
gradient_sign = np.sign(gradient) * (max_ - min_)
for epsilon in epsilons:
adv_img = adversary.original + epsilon * gradient_sign
adv_img = np.clip(adv_img, min_, max_)
adv_label = np.argmax(self.model.predict(adv_img))
logging.info('epsilon = {:.3f}, pre_label = {}, adv_label={}'.
format(epsilon, pre_label, adv_label))
if adversary.try_accept_the_example(adv_img, adv_label):
return adversary
return adversary
FGSM = GradientSignAttack
"""
This module provide the attack method for Iterator FGSM's implement.
"""
from __future__ import division
import logging
from collections import Iterable
import numpy as np
from .base import Attack
class IteratorGradientSignAttack(Attack):
"""
This attack was originally implemented by Alexey Kurakin(Google Brain).
Paper link: https://arxiv.org/pdf/1607.02533.pdf
"""
def _apply(self, adversary, epsilons=100, steps=10):
"""
Apply the iterative gradient sign attack.
Args:
adversary(Adversary): The Adversary object.
epsilons(list|tuple|int): The epsilon (input variation parameter).
steps(int): The number of iterator steps.
Return:
adversary(Adversary): The Adversary object.
"""
if not isinstance(epsilons, Iterable):
epsilons = np.linspace(0, 1 / steps, num=epsilons + 1)[1:]
pre_label = adversary.original_label
min_, max_ = self.model.bounds()
for epsilon in epsilons:
adv_img = adversary.original
for _ in range(steps):
if adversary.is_targeted_attack:
gradient = self.model.gradient(adversary.original,
adversary.target_label)
gradient_sign = -np.sign(gradient) * (max_ - min_)
else:
gradient = self.model.gradient(adversary.original,
adversary.original_label)
gradient_sign = np.sign(gradient) * (max_ - min_)
adv_img = adv_img + gradient_sign * epsilon
adv_img = np.clip(adv_img, min_, max_)
adv_label = np.argmax(self.model.predict(adv_img))
logging.info('epsilon = {:.3f}, pre_label = {}, adv_label={}'.
format(epsilon, pre_label, adv_label))
if adversary.try_accept_the_example(adv_img, adv_label):
return adversary
return adversary
IFGSM = IteratorGradientSignAttack
"""
This module provide the attack method of "LBFGS".
"""
from __future__ import division
import logging
import numpy as np
from scipy.optimize import fmin_l_bfgs_b
from .base import Attack
__all__ = ['LBFGSAttack', 'LBFGS']
class LBFGSAttack(Attack):
"""
Uses L-BFGS-B to minimize the cross-entropy and the distance between the
original and the adversary.
Paper link: https://arxiv.org/abs/1510.05328
"""
def __init__(self, model):
super(LBFGSAttack, self).__init__(model)
self._predicts_normalized = None
self._adversary = None # type: Adversary
def _apply(self, adversary, epsilon=0.001, steps=10):
self._adversary = adversary
if not adversary.is_targeted_attack:
raise ValueError("This attack method only support targeted attack!")
# finding initial c
logging.info('finding initial c...')
c = epsilon
x0 = adversary.original.flatten()
for i in range(30):
c = 2 * c
logging.info('c={}'.format(c))
is_adversary = self._lbfgsb(x0, c, steps)
if is_adversary:
break
if not is_adversary:
logging.info('Failed!')
return adversary
# binary search c
logging.info('binary search c...')
c_low = 0
c_high = c
while c_high - c_low >= epsilon:
logging.info('c_high={}, c_low={}, diff={}, epsilon={}'
.format(c_high, c_low, c_high - c_low, epsilon))
c_half = (c_low + c_high) / 2
is_adversary = self._lbfgsb(x0, c_half, steps)
if is_adversary:
c_high = c_half
else:
c_low = c_half
return adversary
def _is_predicts_normalized(self, predicts):
"""
To determine the predicts is normalized.
:param predicts(np.array): the output of the model.
:return: bool
"""
if self._predicts_normalized is None:
if self.model.predict_name().lower() in [
'softmax', 'probabilities', 'probs'
]:
self._predicts_normalized = True
else:
if np.any(predicts < 0.0):
self._predicts_normalized = False
else:
s = np.sum(predicts.flatten())
if 0.999 <= s <= 1.001:
self._predicts_normalized = True
else:
self._predicts_normalized = False
assert self._predicts_normalized is not None
return self._predicts_normalized
def _loss(self, adv_x, c):
"""
To get the loss and gradient.
:param adv_x: the candidate adversarial example
:param c: parameter 'C' in the paper
:return: (loss, gradient)
"""
x = adv_x.reshape(self._adversary.original.shape)
# cross_entropy
logits = self.model.predict(x)
if not self._is_predicts_normalized(logits): # to softmax
e = np.exp(logits)
logits = e / np.sum(e)
e = np.exp(logits)
s = np.sum(e)
ce = np.log(s) - logits[self._adversary.target_label]
# L2 distance
min_, max_ = self.model.bounds()
d = np.sum((x - self._adversary.original).flatten() ** 2) \
/ ((max_ - min_) ** 2) / len(adv_x)
# gradient
gradient = self.model.gradient(x, self._adversary.target_label)
result = (c * ce + d).astype(float), gradient.flatten().astype(float)
return result
def _lbfgsb(self, x0, c, maxiter):
min_, max_ = self.model.bounds()
bounds = [(min_, max_)] * len(x0)
approx_grad_eps = (max_ - min_) / 100.0
x, f, d = fmin_l_bfgs_b(
self._loss,
x0,
args=(c, ),
bounds=bounds,
maxiter=maxiter,
epsilon=approx_grad_eps)
if np.amax(x) > max_ or np.amin(x) < min_:
x = np.clip(x, min_, max_)
shape = self._adversary.original.shape
adv_label = np.argmax(self.model.predict(x.reshape(shape)))
logging.info('pre_label = {}, adv_label={}'.format(
self._adversary.target_label, adv_label))
return self._adversary.try_accept_the_example(
x.reshape(shape), adv_label)
LBFGS = LBFGSAttack
"""
This module provide the attack method for JSMA's implement.
"""
from __future__ import division
import logging
import random
import numpy as np
from .base import Attack
class SaliencyMapAttack(Attack):
"""
Implements the Saliency Map Attack.
The Jacobian-based Saliency Map Approach (Papernot et al. 2016).
Paper link: https://arxiv.org/pdf/1511.07528.pdf
"""
def _apply(self,
adversary,
max_iter=2000,
fast=True,
theta=0.1,
max_perturbations_per_pixel=7):
"""
Apply the JSMA attack.
Args:
adversary(Adversary): The Adversary object.
max_iter(int): The max iterations.
fast(bool): Whether evaluate the pixel influence on sum of residual classes.
theta(float): Perturbation per pixel relative to [min, max] range.
max_perturbations_per_pixel(int): The max count of perturbation per pixel.
Return:
adversary: The Adversary object.
"""
assert adversary is not None
if not adversary.is_targeted_attack or (adversary.target_label is None):
target_labels = self._generate_random_target(
adversary.original_label)
else:
target_labels = [adversary.target_label]
for target in target_labels:
original_image = adversary.original
# the mask defines the search domain
# each modified pixel with border value is set to zero in mask
mask = np.ones_like(original_image)
# count tracks how often each pixel was changed
counts = np.zeros_like(original_image)
labels = range(self.model.num_classes())
adv_img = original_image.copy()
min_, max_ = self.model.bounds()
for step in range(max_iter):
adv_img = np.clip(adv_img, min_, max_)
adv_label = np.argmax(self.model.predict(adv_img))
if adversary.try_accept_the_example(adv_img, adv_label):
return adversary
# stop if mask is all zero
if not any(mask.flatten()):
return adversary
logging.info('step = {}, original_label = {}, adv_label={}'.
format(step, adversary.original_label, adv_label))
# get pixel location with highest influence on class
idx, p_sign = self._saliency_map(
adv_img, target, labels, mask, fast=fast)
# apply perturbation
adv_img[idx] += -p_sign * theta * (max_ - min_)
# tracks number of updates for each pixel
counts[idx] += 1
# remove pixel from search domain if it hits the bound
if adv_img[idx] <= min_ or adv_img[idx] >= max_:
mask[idx] = 0
# remove pixel if it was changed too often
if counts[idx] >= max_perturbations_per_pixel:
mask[idx] = 0
adv_img = np.clip(adv_img, min_, max_)
def _generate_random_target(self, original_label):
"""
Draw random target labels all of which are different and not the original label.
Args:
original_label(int): Original label.
Return:
target_labels(list): random target labels
"""
num_random_target = 1
num_classes = self.model.num_classes()
assert num_random_target <= num_classes - 1
target_labels = random.sample(range(num_classes), num_random_target + 1)
target_labels = [t for t in target_labels if t != original_label]
target_labels = target_labels[:num_random_target]
return target_labels
def _saliency_map(self, image, target, labels, mask, fast=False):
"""
Get pixel location with highest influence on class.
Args:
image(numpy.ndarray): Image with shape (height, width, channels).
target(int): The target label.
labels(int): The number of classes of the output label.
mask(list): Each modified pixel with border value is set to zero in mask.
fast(bool): Whether evaluate the pixel influence on sum of residual classes.
Return:
idx: The index of optimal pixel.
pix_sign: The direction of perturbation
"""
# pixel influence on target class
alphas = self.model.gradient(image, target) * mask
# pixel influence on sum of residual classes(don't evaluate if fast == True)
if fast:
betas = -np.ones_like(alphas)
else:
betas = np.sum([
self.model.gradient(image, label) * mask - alphas
for label in labels
], 0)
# compute saliency map (take into account both pos. & neg. perturbations)
sal_map = np.abs(alphas) * np.abs(betas) * np.sign(alphas * betas)
# find optimal pixel & direction of perturbation
idx = np.argmin(sal_map)
idx = np.unravel_index(idx, mask.shape)
pix_sign = np.sign(alphas)[idx]
return idx, pix_sign
JSMA = SaliencyMapAttack
"""
Paddle model for target of attack
"""
from .base import Model
from .paddle import PaddleModel
Models __init__.py
"""
\ No newline at end of file
......@@ -24,11 +24,21 @@ class Model(object):
assert len(bounds) == 2
assert channel_axis in [0, 1, 2, 3]
if preprocess is None:
preprocess = (0, 1)
self._bounds = bounds
self._channel_axis = channel_axis
self._preprocess = preprocess
# Make self._preprocess to be (0,1) if possible, so that don't need
# to do substract or divide.
if preprocess is not None:
sub, div = np.array(preprocess)
if not np.any(sub):
sub = 0
if np.all(div == 1):
div = 1
assert (div is None) or np.all(div)
self._preprocess = (sub, div)
else:
self._preprocess = (0, 1)
def bounds(self):
"""
......@@ -47,8 +57,7 @@ class Model(object):
sub, div = self._preprocess
if np.any(sub != 0):
res = input_ - sub
assert np.any(div != 0)
if np.any(div != 1):
if not np.all(sub == 1):
if res is None: # "res = input_ - sub" is not executed!
res = input_ / div
else:
......@@ -97,3 +106,11 @@ class Model(object):
with the shape (height, width, channel).
"""
raise NotImplementedError
@abstractmethod
def predict_name(self):
"""
Get the predict name, such as "softmax",etc.
:return: string
"""
raise NotImplementedError
......@@ -4,7 +4,7 @@ Paddle model
from __future__ import absolute_import
import numpy as np
import paddle.v2.fluid as fluid
import paddle.fluid as fluid
from .base import Model
......@@ -16,7 +16,7 @@ class PaddleModel(Model):
instance of PaddleModel.
Args:
program(paddle.v2.fluid.framework.Program): The program of the model
program(paddle.fluid.framework.Program): The program of the model
which generate the adversarial sample.
input_name(string): The name of the input.
logits_name(string): The name of the logits.
......@@ -114,3 +114,10 @@ class PaddleModel(Model):
feed=feeder.feed([(scaled_data, label)]),
fetch_list=[self._gradient])
return grad.reshape(data.shape)
def predict_name(self):
"""
Get the predict name, such as "softmax",etc.
:return: string
"""
return self._program.block(0).var(self._predict_name).op.type
......@@ -2,7 +2,7 @@
CNN on mnist data using fluid api of paddlepaddle
"""
import paddle.v2 as paddle
import paddle.v2.fluid as fluid
import paddle.fluid as fluid
def mnist_cnn_model(img):
......
......@@ -3,10 +3,10 @@ FGSM demos on mnist using advbox tool.
"""
import matplotlib.pyplot as plt
import paddle.v2 as paddle
import paddle.v2.fluid as fluid
import paddle.fluid as fluid
from advbox import Adversary
from advbox.attacks.gradientsign import GradientSignAttack
from advbox.adversary import Adversary
from advbox.attacks.gradient_method import FGSM
from advbox.models.paddle import PaddleModel
......@@ -73,7 +73,7 @@ def main():
# advbox demo
m = PaddleModel(fluid.default_main_program(), IMG_NAME, LABEL_NAME,
logits.name, avg_cost.name, (-1, 1))
att = GradientSignAttack(m)
att = FGSM(m)
for data in train_reader():
# fgsm attack
adversary = att(Adversary(data[0][0], data[0][1]))
......
"""
FGSM demos on mnist using advbox tool.
"""
import matplotlib.pyplot as plt
import paddle.v2 as paddle
import paddle.fluid as fluid
import numpy as np
from advbox import Adversary
from advbox.attacks.saliency import SaliencyMapAttack
from advbox.models.paddle import PaddleModel
def cnn_model(img):
"""
Mnist cnn model
Args:
img(Varaible): the input image to be recognized
Returns:
Variable: the label prediction
"""
# conv1 = fluid.nets.conv2d()
conv_pool_1 = fluid.nets.simple_img_conv_pool(
input=img,
num_filters=20,
filter_size=5,
pool_size=2,
pool_stride=2,
act='relu')
conv_pool_2 = fluid.nets.simple_img_conv_pool(
input=conv_pool_1,
num_filters=50,
filter_size=5,
pool_size=2,
pool_stride=2,
act='relu')
logits = fluid.layers.fc(input=conv_pool_2, size=10, act='softmax')
return logits
def main():
"""
Advbox demo which demonstrate how to use advbox.
"""
IMG_NAME = 'img'
LABEL_NAME = 'label'
img = fluid.layers.data(name=IMG_NAME, shape=[1, 28, 28], dtype='float32')
# gradient should flow
img.stop_gradient = False
label = fluid.layers.data(name=LABEL_NAME, shape=[1], dtype='int64')
logits = cnn_model(img)
cost = fluid.layers.cross_entropy(input=logits, label=label)
avg_cost = fluid.layers.mean(x=cost)
place = fluid.CPUPlace()
exe = fluid.Executor(place)
BATCH_SIZE = 1
train_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.mnist.train(), buf_size=500),
batch_size=BATCH_SIZE)
feeder = fluid.DataFeeder(
feed_list=[IMG_NAME, LABEL_NAME],
place=place,
program=fluid.default_main_program())
fluid.io.load_params(
exe, "./mnist/", main_program=fluid.default_main_program())
# advbox demo
m = PaddleModel(fluid.default_main_program(), IMG_NAME, LABEL_NAME,
logits.name, avg_cost.name, (-1, 1))
attack = SaliencyMapAttack(m)
total_num = 0
success_num = 0
for data in train_reader():
total_num += 1
# adversary.set_target(True, target_label=target_label)
jsma_attack = attack(Adversary(data[0][0], data[0][1]))
if jsma_attack is not None and jsma_attack.is_successful():
# plt.imshow(jsma_attack.target, cmap='Greys_r')
# plt.show()
success_num += 1
print('original_label=%d, adversary examples label =%d' %
(data[0][1], jsma_attack.adversarial_label))
# np.save('adv_img', jsma_attack.adversarial_example)
print('total num = %d, success num = %d ' % (total_num, success_num))
if total_num == 100:
break
if __name__ == '__main__':
main()
The minimum PaddlePaddle version needed for the code sample in this directory is the lastest develop branch. If you are on a version of PaddlePaddle earlier than this, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
---
# SE-ResNeXt for image classification
This model built with paddle fluid is still under active development and is not
......
import os
import paddle.v2 as paddle
import paddle.fluid as fluid
from paddle.fluid.initializer import MSRA
from paddle.fluid.param_attr import ParamAttr
parameter_attr = ParamAttr(initializer=MSRA())
def conv_bn_layer(input,
filter_size,
num_filters,
stride,
padding,
channels=None,
num_groups=1,
act='relu',
use_cudnn=True):
conv = fluid.layers.conv2d(
input=input,
num_filters=num_filters,
filter_size=filter_size,
stride=stride,
padding=padding,
groups=num_groups,
act=None,
use_cudnn=use_cudnn,
param_attr=parameter_attr,
bias_attr=False)
return fluid.layers.batch_norm(input=conv, act=act)
def depthwise_separable(input, num_filters1, num_filters2, num_groups, stride,
scale):
"""
"""
depthwise_conv = conv_bn_layer(
input=input,
filter_size=3,
num_filters=int(num_filters1 * scale),
stride=stride,
padding=1,
num_groups=int(num_groups * scale),
use_cudnn=False)
pointwise_conv = conv_bn_layer(
input=depthwise_conv,
filter_size=1,
num_filters=int(num_filters2 * scale),
stride=1,
padding=0)
return pointwise_conv
def mobile_net(img, class_dim, scale=1.0):
# conv1: 112x112
tmp = conv_bn_layer(
img,
filter_size=3,
channels=3,
num_filters=int(32 * scale),
stride=2,
padding=1)
# 56x56
tmp = depthwise_separable(
tmp,
num_filters1=32,
num_filters2=64,
num_groups=32,
stride=1,
scale=scale)
tmp = depthwise_separable(
tmp,
num_filters1=64,
num_filters2=128,
num_groups=64,
stride=2,
scale=scale)
# 28x28
tmp = depthwise_separable(
tmp,
num_filters1=128,
num_filters2=128,
num_groups=128,
stride=1,
scale=scale)
tmp = depthwise_separable(
tmp,
num_filters1=128,
num_filters2=256,
num_groups=128,
stride=2,
scale=scale)
# 14x14
tmp = depthwise_separable(
tmp,
num_filters1=256,
num_filters2=256,
num_groups=256,
stride=1,
scale=scale)
tmp = depthwise_separable(
tmp,
num_filters1=256,
num_filters2=512,
num_groups=256,
stride=2,
scale=scale)
# 14x14
for i in range(5):
tmp = depthwise_separable(
tmp,
num_filters1=512,
num_filters2=512,
num_groups=512,
stride=1,
scale=scale)
# 7x7
tmp = depthwise_separable(
tmp,
num_filters1=512,
num_filters2=1024,
num_groups=512,
stride=2,
scale=scale)
tmp = depthwise_separable(
tmp,
num_filters1=1024,
num_filters2=1024,
num_groups=1024,
stride=1,
scale=scale)
tmp = fluid.layers.pool2d(
input=tmp,
pool_size=0,
pool_stride=1,
pool_type='avg',
global_pooling=True)
tmp = fluid.layers.fc(input=tmp,
size=class_dim,
act='softmax',
param_attr=parameter_attr)
return tmp
def train(learning_rate, batch_size, num_passes, model_save_dir='model'):
class_dim = 102
image_shape = [3, 224, 224]
image = fluid.layers.data(name='image', shape=image_shape, dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
out = mobile_net(image, class_dim=class_dim)
cost = fluid.layers.cross_entropy(input=out, label=label)
avg_cost = fluid.layers.mean(x=cost)
optimizer = fluid.optimizer.Momentum(
learning_rate=learning_rate,
momentum=0.9,
regularization=fluid.regularizer.L2Decay(5 * 1e-5))
opts = optimizer.minimize(avg_cost)
accuracy = fluid.evaluator.Accuracy(input=out, label=label)
inference_program = fluid.default_main_program().clone()
with fluid.program_guard(inference_program):
test_accuracy = fluid.evaluator.Accuracy(input=out, label=label)
test_target = [avg_cost] + test_accuracy.metrics + test_accuracy.states
inference_program = fluid.io.get_inference_program(test_target)
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
train_reader = paddle.batch(
paddle.dataset.flowers.train(), batch_size=batch_size)
test_reader = paddle.batch(
paddle.dataset.flowers.test(), batch_size=batch_size)
feeder = fluid.DataFeeder(place=place, feed_list=[image, label])
for pass_id in range(num_passes):
accuracy.reset(exe)
for batch_id, data in enumerate(train_reader()):
loss, acc = exe.run(fluid.default_main_program(),
feed=feeder.feed(data),
fetch_list=[avg_cost] + accuracy.metrics)
print("Pass {0}, batch {1}, loss {2}, acc {3}".format(
pass_id, batch_id, loss[0], acc[0]))
pass_acc = accuracy.eval(exe)
test_accuracy.reset(exe)
for data in test_reader():
loss, acc = exe.run(inference_program,
feed=feeder.feed(data),
fetch_list=[avg_cost] + test_accuracy.metrics)
test_pass_acc = test_accuracy.eval(exe)
print("End pass {0}, train_acc {1}, test_acc {2}".format(
pass_id, pass_acc, test_pass_acc))
if pass_id % 10 == 0:
model_path = os.path.join(model_save_dir, str(pass_id))
print 'save models to %s' % (model_path)
fluid.io.save_inference_model(model_path, ['image'], [out], exe)
if __name__ == '__main__':
train(learning_rate=0.005, batch_size=40, num_passes=300)
import os
import paddle.v2 as paddle
import paddle.v2.fluid as fluid
import paddle.fluid as fluid
import reader
......@@ -103,66 +103,87 @@ def train(learning_rate,
batch_size,
num_passes,
init_model=None,
model_save_dir='model'):
model_save_dir='model',
parallel=True):
class_dim = 1000
image_shape = [3, 224, 224]
image = fluid.layers.data(name='image', shape=image_shape, dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
out = SE_ResNeXt(input=image, class_dim=class_dim)
cost = fluid.layers.cross_entropy(input=out, label=label)
avg_cost = fluid.layers.mean(x=cost)
if parallel:
places = fluid.layers.get_places()
pd = fluid.layers.ParallelDo(places)
with pd.do():
image_ = pd.read_input(image)
label_ = pd.read_input(label)
out = SE_ResNeXt(input=image_, class_dim=class_dim)
cost = fluid.layers.cross_entropy(input=out, label=label_)
avg_cost = fluid.layers.mean(x=cost)
accuracy = fluid.layers.accuracy(input=out, label=label_)
pd.write_output(avg_cost)
pd.write_output(accuracy)
avg_cost, accuracy = pd()
avg_cost = fluid.layers.mean(x=avg_cost)
accuracy = fluid.layers.mean(x=accuracy)
else:
out = SE_ResNeXt(input=image, class_dim=class_dim)
cost = fluid.layers.cross_entropy(input=out, label=label)
avg_cost = fluid.layers.mean(x=cost)
accuracy = fluid.layers.accuracy(input=out, label=label)
optimizer = fluid.optimizer.Momentum(
learning_rate=learning_rate,
momentum=0.9,
regularization=fluid.regularizer.L2Decay(1e-4))
opts = optimizer.minimize(avg_cost)
accuracy = fluid.evaluator.Accuracy(input=out, label=label)
inference_program = fluid.default_main_program().clone()
with fluid.program_guard(inference_program):
test_accuracy = fluid.evaluator.Accuracy(input=out, label=label)
test_target = [avg_cost] + test_accuracy.metrics + test_accuracy.states
inference_program = fluid.io.get_inference_program(test_target)
inference_program = fluid.io.get_inference_program([avg_cost, accuracy])
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
if init_model is not None:
fluid.io.load_persistables_if_exist(exe, init_model)
fluid.io.load_persistables(exe, init_model)
train_reader = paddle.batch(reader.train(), batch_size=batch_size)
test_reader = paddle.batch(reader.test(), batch_size=batch_size)
feeder = fluid.DataFeeder(place=place, feed_list=[image, label])
for pass_id in range(num_passes):
accuracy.reset(exe)
for batch_id, data in enumerate(train_reader()):
loss, acc = exe.run(fluid.default_main_program(),
feed=feeder.feed(data),
fetch_list=[avg_cost] + accuracy.metrics)
print("Pass {0}, batch {1}, loss {2}, acc {3}".format(
pass_id, batch_id, loss[0], acc[0]))
pass_acc = accuracy.eval(exe)
test_accuracy.reset(exe)
loss = exe.run(fluid.default_main_program(),
feed=feeder.feed(data),
fetch_list=[avg_cost])
print("Pass {0}, batch {1}, loss {2}".format(pass_id, batch_id,
float(loss[0])))
total_loss = 0.0
total_acc = 0.0
total_batch = 0
for data in test_reader():
loss, acc = exe.run(inference_program,
feed=feeder.feed(data),
fetch_list=[avg_cost] + test_accuracy.metrics)
test_pass_acc = test_accuracy.eval(exe)
print("End pass {0}, train_acc {1}, test_acc {2}".format(
pass_id, pass_acc, test_pass_acc))
fetch_list=[avg_cost, accuracy])
total_loss += float(loss)
total_acc += float(acc)
total_batch += 1
print("End pass {0}, test_loss {1}, test_acc {2}".format(
pass_id, total_loss / total_batch, total_acc / total_batch))
model_path = os.path.join(model_save_dir, str(pass_id))
if not os.path.isdir(model_path):
os.makedirs(model_path)
fluid.io.save_persistables(exe, model_path)
fluid.io.save_inference_model(model_path, ['image'], [out], exe)
if __name__ == '__main__':
train(learning_rate=0.1, batch_size=8, num_passes=100, init_model=None)
train(
learning_rate=0.1,
batch_size=8,
num_passes=100,
init_model=None,
parallel=False)
import os
import cv2
import numpy as np
from PIL import Image
from paddle.v2.image import load_image
class DataGenerator(object):
def __init__(self):
pass
def train_reader(self, img_root_dir, img_label_list, batchsize):
'''
Reader interface for training.
:param img_root_dir: The root path of the image for training.
:type file_list: str
:param img_label_list: The path of the <image_name, label> file for training.
:type file_list: str
'''
img_label_lines = []
if batchsize == 1:
to_file = "tmp.txt"
cmd = "cat " + img_label_list + " | awk '{print $1,$2,$3,$4;}' | shuf > " + to_file
print "cmd: " + cmd
os.system(cmd)
print "finish batch shuffle"
img_label_lines = open(to_file, 'r').readlines()
else:
to_file = "tmp.txt"
#cmd1: partial shuffle
cmd = "cat " + img_label_list + " | awk '{printf(\"%04d%.4f %s\\n\", $1, rand(), $0)}' | sort | sed 1,$((1 + RANDOM % 100))d | "
#cmd2: batch merge and shuffle
cmd += "awk '{printf $2\" \"$3\" \"$4\" \"$5\" \"; if(NR % " + str(
batchsize) + " == 0) print \"\";}' | shuf | "
#cmd3: batch split
cmd += "awk '{if(NF == " + str(
batchsize
) + " * 4) {for(i = 0; i < " + str(
batchsize
) + "; i++) print $(4*i+1)\" \"$(4*i+2)\" \"$(4*i+3)\" \"$(4*i+4);}}' > " + to_file
print "cmd: " + cmd
os.system(cmd)
print "finish batch shuffle"
img_label_lines = open(to_file, 'r').readlines()
def reader():
sizes = len(img_label_lines) / batchsize
for i in range(sizes):
result = []
sz = [0, 0]
for j in range(batchsize):
line = img_label_lines[i * batchsize + j]
# h, w, img_name, labels
items = line.split(' ')
label = [int(c) for c in items[-1].split(',')]
img = Image.open(os.path.join(img_root_dir, items[
2])).convert('L') #zhuanhuidu
if j == 0:
sz = img.size
img = img.resize((sz[0], sz[1]))
img = np.array(img) - 127.5
img = img[np.newaxis, ...]
result.append([img, label])
yield result
return reader
def test_reader(self, img_root_dir, img_label_list):
'''
Reader interface for inference.
:param img_root_dir: The root path of the images for training.
:type file_list: str
:param img_label_list: The path of the <image_name, label> file for testing.
:type file_list: list
'''
def reader():
for line in open(img_label_list):
# h, w, img_name, labels
items = line.split(' ')
label = [int(c) for c in items[-1].split(',')]
img = Image.open(os.path.join(img_root_dir, items[2])).convert(
'L')
img = np.array(img) - 127.5
img = img[np.newaxis, ...]
yield img, label
return reader
# Policy Gradient RL by PaddlePaddle
运行本目录下的程序示例需要使用PaddlePaddle的最新develop分枝。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
---
# Policy Gradient RL by PaddlePaddle
本文介绍了如何使用PaddlePaddle通过policy-based的强化学习方法来训练一个player(actor model), 我们希望这个player可以完成简单的走阶梯任务。
内容分为:
......
import numpy as np
import paddle.v2 as paddle
import paddle.v2.fluid as fluid
import paddle.fluid as fluid
# reproducible
np.random.seed(1)
......
The minimum PaddlePaddle version needed for the code sample in this directory is the lastest develop branch. If you are on a version of PaddlePaddle earlier than this, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
---
# Text Classification
## Data Preparation
......
......@@ -5,7 +5,7 @@ import argparse
import time
import paddle.v2 as paddle
import paddle.v2.fluid as fluid
import paddle.fluid as fluid
from config import TrainConfig as conf
......
The minimum PaddlePaddle version needed for the code sample in this directory is the lastest develop branch. If you are on a version of PaddlePaddle earlier than this, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
---
# Attention is All You Need: A Paddle Fluid implementation
This is a Paddle Fluid implementation of the Transformer model in [Attention is All You Need]() (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, arxiv, 2017).
If you use the dataset/code in your research, please cite the paper:
```text
@inproceedings{vaswani2017attention,
title={Attention is all you need},
author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia},
booktitle={Advances in Neural Information Processing Systems},
pages={6000--6010},
year={2017}
}
```
### TODO
This project is still under active development.
class TrainTaskConfig(object):
use_gpu = False
# the epoch number to train.
pass_num = 2
# number of sequences contained in a mini-batch.
batch_size = 64
# the hyper params for Adam optimizer.
learning_rate = 0.001
beta1 = 0.9
beta2 = 0.98
eps = 1e-9
# the params for learning rate scheduling
warmup_steps = 4000
class ModelHyperParams(object):
# Dictionary size for source and target language. This model directly uses
# paddle.dataset.wmt16 in which <bos>, <eos> and <unk> token has
# alreay been added, but the <pad> token is not added. Transformer requires
# sequences in a mini-batch are padded to have the same length. A <pad> token is
# added into the original dictionary in paddle.dateset.wmt16.
# size of source word dictionary.
src_vocab_size = 10000
# index for <pad> token in source language.
src_pad_idx = src_vocab_size
# size of target word dictionay
trg_vocab_size = 10000
# index for <pad> token in target language.
trg_pad_idx = trg_vocab_size
# position value corresponding to the <pad> token.
pos_pad_idx = 0
# max length of sequences. It should plus 1 to include position
# padding token for position encoding.
max_length = 50
# the dimension for word embeddings, which is also the last dimension of
# the input and output of multi-head attention, position-wise feed-forward
# networks, encoder and decoder.
d_model = 512
# size of the hidden layer in position-wise feed-forward networks.
d_inner_hid = 1024
# the dimension that keys are projected to for dot-product attention.
d_key = 64
# the dimension that values are projected to for dot-product attention.
d_value = 64
# number of head used in multi-head attention.
n_head = 8
# number of sub-layers to be stacked in the encoder and decoder.
n_layer = 6
# dropout rate used by all dropout layers.
dropout = 0.1
# Names of position encoding table which will be initialized externally.
pos_enc_param_names = (
"src_pos_enc_table",
"trg_pos_enc_table", )
# Names of all data layers listed in order.
input_data_names = (
"src_word",
"src_pos",
"trg_word",
"trg_pos",
"src_slf_attn_bias",
"trg_slf_attn_bias",
"trg_src_attn_bias",
"lbl_word",
"lbl_weight", )
from functools import partial
import numpy as np
import paddle.v2 as paddle
import paddle.fluid as fluid
import paddle.fluid.layers as layers
from config import TrainTaskConfig, input_data_names, pos_enc_param_names
# FIXME(guosheng): Remove out the batch_size from the model.
batch_size = TrainTaskConfig.batch_size
def position_encoding_init(n_position, d_pos_vec):
"""
Generate the initial values for the sinusoid position encoding table.
"""
position_enc = np.array([[
pos / np.power(10000, 2 * (j // 2) / d_pos_vec)
for j in range(d_pos_vec)
] if pos != 0 else np.zeros(d_pos_vec) for pos in range(n_position)])
position_enc[1:, 0::2] = np.sin(position_enc[1:, 0::2]) # dim 2i
position_enc[1:, 1::2] = np.cos(position_enc[1:, 1::2]) # dim 2i+1
return position_enc.astype("float32")
def multi_head_attention(queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
num_heads=1,
dropout_rate=0.):
"""
Multi-Head Attention. Note that attn_bias is added to the logit before
computing softmax activiation to mask certain selected positions so that
they will not considered in attention weights.
"""
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError(
"Inputs: quries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, num_heads, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(input=queries,
size=d_key * num_heads,
bias_attr=False,
num_flatten_dims=2)
k = layers.fc(input=keys,
size=d_key * num_heads,
bias_attr=False,
num_flatten_dims=2)
v = layers.fc(input=values,
size=d_value * num_heads,
bias_attr=False,
num_flatten_dims=2)
return q, k, v
def __split_heads(x, num_heads):
"""
Reshape the last dimension of inpunt tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, num_heads * hidden_dim] then output a tensor
with shape [bs, num_heads, max_sequence_length, hidden_dim].
"""
if num_heads == 1:
return x
hidden_size = x.shape[-1]
# FIXME(guosheng): Decouple the program desc with batch_size.
reshaped = layers.reshape(
x=x, shape=[batch_size, -1, num_heads, hidden_size // num_heads])
# permuate the dimensions into:
# [batch_size, num_heads, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of inpunt tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3: return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# FIXME(guosheng): Decouple the program desc with batch_size.
return layers.reshape(
x=trans_x,
shape=map(int,
[batch_size, -1, trans_x.shape[2] * trans_x.shape[3]]))
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
# FIXME(guosheng): Optimize the shape in reshape_op or softmax_op.
# The current implementation of softmax_op only supports 2D tensor,
# consequently it cannot be directly used here.
# If to use the reshape_op, Besides, the shape of product inferred in
# compile-time is not the actual shape in run-time. It cann't be used
# to set the attribute of reshape_op.
# So, here define the softmax for temporary solution.
def __softmax(x, eps=1e-9):
exp_out = layers.exp(x=x)
sum_out = layers.reduce_sum(exp_out, dim=-1, keep_dim=False)
return layers.elementwise_div(x=exp_out, y=sum_out, axis=0)
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
weights = __softmax(layers.elementwise_add(x=product, y=attn_bias))
if dropout_rate:
weights = layers.dropout(
weights, dropout_prob=dropout_rate, is_test=False)
out = layers.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, num_heads, d_key, d_value)
q = __split_heads(q, num_heads)
k = __split_heads(k, num_heads)
v = __split_heads(v, num_heads)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(input=out,
size=d_model,
bias_attr=False,
num_flatten_dims=2)
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(input=x,
size=d_inner_hid,
num_flatten_dims=2,
act="relu")
out = layers.fc(input=hidden, size=d_hid, num_flatten_dims=2)
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout=0.):
"""
Add residual connection, layer normalization and droput to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out = layers.layer_norm(out, begin_norm_axis=len(out.shape) - 1)
elif cmd == "d": # add dropout
if dropout:
out = layers.dropout(out, dropout_prob=dropout, is_test=False)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
def prepare_encoder(src_word,
src_pos,
src_vocab_size,
src_emb_dim,
src_pad_idx,
src_max_len,
dropout=0.,
pos_pad_idx=0,
pos_enc_param_name=None):
"""Add word embeddings and position encodings.
The output tensor has a shape of:
[batch_size, max_src_length_in_batch, d_model].
This module is used at the bottom of the encoder stacks.
"""
src_word_emb = layers.embedding(
src_word, size=[src_vocab_size, src_emb_dim], padding_idx=src_pad_idx)
src_pos_enc = layers.embedding(
src_pos,
size=[src_max_len, src_emb_dim],
padding_idx=pos_pad_idx,
param_attr=fluid.ParamAttr(
name=pos_enc_param_name, trainable=False))
enc_input = src_word_emb + src_pos_enc
# FIXME(guosheng): Decouple the program desc with batch_size.
enc_input = layers.reshape(x=enc_input, shape=[batch_size, -1, src_emb_dim])
return layers.dropout(
enc_input, dropout_prob=dropout,
is_test=False) if dropout else enc_input
prepare_encoder = partial(
prepare_encoder, pos_enc_param_name=pos_enc_param_names[0])
prepare_decoder = partial(
prepare_encoder, pos_enc_param_name=pos_enc_param_names[1])
def encoder_layer(enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
dropout_rate=0.):
"""The encoder layers that can be stacked to form a deep encoder.
This module consits of a multi-head (self) attention followed by
position-wise feed-forward networks and both the two components companied
with the post_process_layer to add residual connection, layer normalization
and droput.
"""
attn_output = multi_head_attention(enc_input, enc_input, enc_input,
attn_bias, d_key, d_value, d_model,
n_head, dropout_rate)
attn_output = post_process_layer(enc_input, attn_output, "dan",
dropout_rate)
ffd_output = positionwise_feed_forward(attn_output, d_inner_hid, d_model)
return post_process_layer(attn_output, ffd_output, "dan", dropout_rate)
def encoder(enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
dropout_rate=0.):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
for i in range(n_layer):
enc_output = encoder_layer(enc_input, attn_bias, n_head, d_key, d_value,
d_model, d_inner_hid, dropout_rate)
enc_input = enc_output
return enc_output
def decoder_layer(dec_input,
enc_output,
slf_attn_bias,
dec_enc_attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
dropout_rate=0.):
""" The layer to be stacked in decoder part.
The structure of this module is similar to that in the encoder part except
a multi-head attention is added to implement encoder-decoder attention.
"""
slf_attn_output = multi_head_attention(
dec_input,
dec_input,
dec_input,
slf_attn_bias,
d_key,
d_value,
d_model,
n_head,
dropout_rate, )
slf_attn_output = post_process_layer(
dec_input,
slf_attn_output,
"dan", # residual connection + dropout + layer normalization
dropout_rate, )
enc_attn_output = multi_head_attention(
slf_attn_output,
enc_output,
enc_output,
dec_enc_attn_bias,
d_key,
d_value,
d_model,
n_head,
dropout_rate, )
enc_attn_output = post_process_layer(
slf_attn_output,
enc_attn_output,
"dan", # residual connection + dropout + layer normalization
dropout_rate, )
ffd_output = positionwise_feed_forward(
enc_attn_output,
d_inner_hid,
d_model, )
dec_output = post_process_layer(
enc_attn_output,
ffd_output,
"dan", # residual connection + dropout + layer normalization
dropout_rate, )
return dec_output
def decoder(dec_input,
enc_output,
dec_slf_attn_bias,
dec_enc_attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
dropout_rate=0.):
"""
The decoder is composed of a stack of identical decoder_layer layers.
"""
for i in range(n_layer):
dec_output = decoder_layer(
dec_input,
enc_output,
dec_slf_attn_bias,
dec_enc_attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
dropout_rate, )
dec_input = dec_output
return dec_output
def transformer(
src_vocab_size,
trg_vocab_size,
max_length,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
dropout_rate,
src_pad_idx,
trg_pad_idx,
pos_pad_idx, ):
# The shapes here act as placeholder.
# The shapes set here is to pass the infer-shape in compile time. The actual
# shape of src_word in run time is:
# [batch_size * max_src_length_in_a_batch, 1].
src_word = layers.data(
name=input_data_names[0],
shape=[batch_size * max_length, 1],
dtype="int64",
append_batch_size=False)
# The actual shape of src_pos in runtime is:
# [batch_size * max_src_length_in_a_batch, 1].
src_pos = layers.data(
name=input_data_names[1],
shape=[batch_size * max_length, 1],
dtype="int64",
append_batch_size=False)
# The actual shape of trg_word is in runtime is:
# [batch_size * max_trg_length_in_a_batch, 1].
trg_word = layers.data(
name=input_data_names[2],
shape=[batch_size * max_length, 1],
dtype="int64",
append_batch_size=False)
# The actual shape of trg_pos in runtime is:
# [batch_size * max_trg_length_in_a_batch, 1].
trg_pos = layers.data(
name=input_data_names[3],
shape=[batch_size * max_length, 1],
dtype="int64",
append_batch_size=False)
# The actual shape of src_slf_attn_bias in runtime is:
# [batch_size, n_head, max_src_length_in_a_batch, max_src_length_in_a_batch].
# This input is used to remove attention weights on paddings.
src_slf_attn_bias = layers.data(
name=input_data_names[4],
shape=[batch_size, n_head, max_length, max_length],
dtype="float32",
append_batch_size=False)
# The actual shape of trg_slf_attn_bias in runtime is:
# [batch_size, n_head, max_trg_length_in_batch, max_trg_length_in_batch].
# This is used to remove attention weights on paddings and subsequent words.
trg_slf_attn_bias = layers.data(
name=input_data_names[5],
shape=[batch_size, n_head, max_length, max_length],
dtype="float32",
append_batch_size=False)
# The actual shape of trg_src_attn_bias in runtime is:
# [batch_size, n_head, max_trg_length_in_batch, max_src_length_in_batch].
# This is used to remove attention weights on paddings.
trg_src_attn_bias = layers.data(
name=input_data_names[6],
shape=[batch_size, n_head, max_length, max_length],
dtype="float32",
append_batch_size=False)
enc_input = prepare_encoder(
src_word,
src_pos,
src_vocab_size,
d_model,
src_pad_idx,
max_length,
dropout_rate, )
enc_output = encoder(
enc_input,
src_slf_attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
dropout_rate, )
dec_input = prepare_decoder(
trg_word,
trg_pos,
trg_vocab_size,
d_model,
trg_pad_idx,
max_length,
dropout_rate, )
dec_output = decoder(
dec_input,
enc_output,
trg_slf_attn_bias,
trg_src_attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
dropout_rate, )
# TODO(guosheng): Share the weight matrix between the embedding layers and
# the pre-softmax linear transformation.
predict = layers.reshape(
x=layers.fc(input=dec_output,
size=trg_vocab_size,
bias_attr=False,
num_flatten_dims=2),
shape=[-1, trg_vocab_size],
act="softmax")
# The actual shape of gold in runtime is:
# [batch_size * max_trg_length_in_a_batch, 1].
gold = layers.data(
name=input_data_names[7],
shape=[batch_size * max_length, 1],
dtype="int64",
append_batch_size=False)
cost = layers.cross_entropy(input=predict, label=gold)
# The actual shape of weights in runtime is:
# [batch_size * max_trg_length_in_a_batch, 1].
# Padding index do not contribute to the total loss. This Weight is used to
# cancel padding index in calculating the loss.
weights = layers.data(
name=input_data_names[8],
shape=[batch_size * max_length, 1],
dtype="float32",
append_batch_size=False)
weighted_cost = cost * weights
return layers.reduce_sum(weighted_cost)
import numpy as np
import paddle.fluid as fluid
import paddle.fluid.layers as layers
class LearningRateScheduler(object):
"""
Wrapper for learning rate scheduling as described in the Transformer paper.
LearningRateScheduler adapts the learning rate externally and the adapted
learning rate will be feeded into the main_program as input data.
"""
def __init__(self,
d_model,
warmup_steps,
place,
learning_rate=0.001,
current_steps=0,
name="learning_rate"):
self.current_steps = current_steps
self.warmup_steps = warmup_steps
self.d_model = d_model
self.learning_rate = layers.create_global_var(
name=name,
shape=[1],
value=float(learning_rate),
dtype="float32",
persistable=True)
self.place = place
def update_learning_rate(self, data_input):
self.current_steps += 1
lr_value = np.power(self.d_model, -0.5) * np.min([
np.power(self.current_steps, -0.5),
np.power(self.warmup_steps, -1.5) * self.current_steps
])
lr_tensor = fluid.LoDTensor()
lr_tensor.set(np.array([lr_value], dtype="float32"), self.place)
data_input[self.learning_rate.name] = lr_tensor
import numpy as np
import paddle.v2 as paddle
import paddle.fluid as fluid
from model import transformer, position_encoding_init
from optim import LearningRateScheduler
from config import TrainTaskConfig, ModelHyperParams, \
pos_enc_param_names, input_data_names
def prepare_batch_input(insts, input_data_names, src_pad_idx, trg_pad_idx,
max_length, n_head, place):
"""
Pad the instances to the max sequence length in batch, and generate the
corresponding position data and attention bias. Then, convert the numpy
data to tensors and return a dict mapping names to tensors.
"""
input_dict = {}
def __pad_batch_data(insts,
pad_idx,
is_target=False,
return_pos=True,
return_attn_bias=True,
return_max_len=True):
"""
Pad the instances to the max sequence length in batch, and generate the
corresponding position data and attention bias.
"""
return_list = []
max_len = max(len(inst) for inst in insts)
inst_data = np.array(
[inst + [pad_idx] * (max_len - len(inst)) for inst in insts])
return_list += [inst_data.astype("int64").reshape([-1, 1])]
if return_pos:
inst_pos = np.array([[
pos_i + 1 if w_i != pad_idx else 0
for pos_i, w_i in enumerate(inst)
] for inst in inst_data])
return_list += [inst_pos.astype("int64").reshape([-1, 1])]
if return_attn_bias:
if is_target:
# This is used to avoid attention on paddings and subsequent
# words.
slf_attn_bias_data = np.ones((inst_data.shape[0], max_len,
max_len))
slf_attn_bias_data = np.triu(slf_attn_bias_data, 1).reshape(
[-1, 1, max_len, max_len])
slf_attn_bias_data = np.tile(slf_attn_bias_data,
[1, n_head, 1, 1]) * [-1e9]
else:
# This is used to avoid attention on paddings.
slf_attn_bias_data = np.array([[0] * len(inst) + [-1e9] *
(max_len - len(inst))
for inst in insts])
slf_attn_bias_data = np.tile(
slf_attn_bias_data.reshape([-1, 1, 1, max_len]),
[1, n_head, max_len, 1])
return_list += [slf_attn_bias_data.astype("float32")]
if return_max_len:
return_list += [max_len]
return return_list if len(return_list) > 1 else return_list[0]
def data_to_tensor(data_list, name_list, input_dict, place):
assert len(data_list) == len(name_list)
for i in range(len(name_list)):
tensor = fluid.LoDTensor()
tensor.set(data_list[i], place)
input_dict[name_list[i]] = tensor
src_word, src_pos, src_slf_attn_bias, src_max_len = __pad_batch_data(
[inst[0] for inst in insts], src_pad_idx, is_target=False)
trg_word, trg_pos, trg_slf_attn_bias, trg_max_len = __pad_batch_data(
[inst[1] for inst in insts], trg_pad_idx, is_target=True)
trg_src_attn_bias = np.tile(src_slf_attn_bias[:, :, ::src_max_len, :],
[1, 1, trg_max_len, 1]).astype("float32")
lbl_word = __pad_batch_data([inst[2] for inst in insts], trg_pad_idx, False,
False, False, False)
lbl_weight = (lbl_word != trg_pad_idx).astype("float32").reshape([-1, 1])
data_to_tensor([
src_word, src_pos, trg_word, trg_pos, src_slf_attn_bias,
trg_slf_attn_bias, trg_src_attn_bias, lbl_word, lbl_weight
], input_data_names, input_dict, place)
return input_dict
def main():
place = fluid.CUDAPlace(0) if TrainTaskConfig.use_gpu else fluid.CPUPlace()
exe = fluid.Executor(place)
cost = transformer(
ModelHyperParams.src_vocab_size + 1,
ModelHyperParams.trg_vocab_size + 1, ModelHyperParams.max_length + 1,
ModelHyperParams.n_layer, ModelHyperParams.n_head,
ModelHyperParams.d_key, ModelHyperParams.d_value,
ModelHyperParams.d_model, ModelHyperParams.d_inner_hid,
ModelHyperParams.dropout, ModelHyperParams.src_pad_idx,
ModelHyperParams.trg_pad_idx, ModelHyperParams.pos_pad_idx)
lr_scheduler = LearningRateScheduler(ModelHyperParams.d_model,
TrainTaskConfig.warmup_steps, place,
TrainTaskConfig.learning_rate)
optimizer = fluid.optimizer.Adam(
learning_rate=lr_scheduler.learning_rate,
beta1=TrainTaskConfig.beta1,
beta2=TrainTaskConfig.beta2,
epsilon=TrainTaskConfig.eps)
optimizer.minimize(cost)
train_data = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt16.train(ModelHyperParams.src_vocab_size,
ModelHyperParams.trg_vocab_size),
buf_size=51200),
batch_size=TrainTaskConfig.batch_size)
# Initialize the parameters.
exe.run(fluid.framework.default_startup_program())
for pos_enc_param_name in pos_enc_param_names:
pos_enc_param = fluid.global_scope().find_var(
pos_enc_param_name).get_tensor()
pos_enc_param.set(
position_encoding_init(ModelHyperParams.max_length + 1,
ModelHyperParams.d_model), place)
for pass_id in xrange(TrainTaskConfig.pass_num):
for batch_id, data in enumerate(train_data()):
# The current program desc is coupled with batch_size, thus all
# mini-batches must have the same number of instances currently.
if len(data) != TrainTaskConfig.batch_size:
continue
data_input = prepare_batch_input(
data, input_data_names, ModelHyperParams.src_pad_idx,
ModelHyperParams.trg_pad_idx, ModelHyperParams.max_length,
ModelHyperParams.n_head, place)
lr_scheduler.update_learning_rate(data_input)
outs = exe.run(fluid.framework.default_main_program(),
feed=data_input,
fetch_list=[cost])
cost_val = np.array(outs[0])
print("pass_id = " + str(pass_id) + " batch = " + str(batch_id) +
" avg_cost = " + str(cost_val))
if __name__ == "__main__":
main()
运行本目录下的程序示例需要使用PaddlePaddle v0.10.0版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
---
# 中国古诗生成
## 简介
......
运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
---
# 使用循环神经网语言模型生成文本
语言模型(Language Model)是一个概率分布模型,简单来说,就是用来计算一个句子的概率的模型。利用它可以确定哪个词序列的可能性更大,或者给定若干个词,可以预测下一个最可能出现的词。语言模型是自然语言处理领域里一个重要的基础模型。
......
The minimum PaddlePaddle version needed for the code sample in this directory is v0.11.0. If you are on a version of PaddlePaddle earlier than v0.11.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
---
# Globally Normalized Reader
This model implements the work in the following paper:
......
运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
---
# Hsigmoid加速词向量训练
## 背景介绍
在自然语言处理领域中,传统做法通常使用one-hot向量来表示词,比如词典为['我', '你', '喜欢'],可以用[1,0,0]、[0,1,0]和[0,0,1]这三个向量分别表示'我'、'你'和'喜欢'。这种表示方式比较简洁,但是当词表很大时,容易产生维度爆炸问题;而且任意两个词的向量是正交的,向量包含的信息有限。为了避免或减轻one-hot表示的缺点,目前通常使用词向量来取代one-hot表示,词向量也就是word embedding,即使用一个低维稠密的实向量取代高维稀疏的one-hot向量。训练词向量的方法有很多种,神经网络模型是其中之一,包括CBOW、Skip-gram等,这些模型本质上都是一个分类模型,当词表较大即类别较多时,传统的softmax将非常消耗时间。PaddlePaddle提供了Hsigmoid Layer、NCE Layer,来加速模型的训练过程。本文主要介绍如何使用Hsigmoid Layer来加速训练,词向量相关内容请查阅PaddlePaddle Book中的[词向量章节](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec)
......
运行本目录下的程序示例需要使用PaddlePaddle v0.11.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
---
图像分类
=======================
......
运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
---
# 排序学习(Learning To Rank)
排序学习技术\[[1](#参考文献1)\]是构建排序模型的机器学习方法,在信息检索、自然语言处理,数据挖掘等机器学场景中具有重要作用。排序学习的主要目的是对给定一组文档,对任意查询请求给出反映相关性的文档排序。在本例子中,利用标注过的语料库训练两种经典排序模型RankNet[[4](#参考文献4)\]和LamdaRank[[6](#参考文献6)\],分别可以生成对应的排序模型,能够对任意查询请求,给出相关性文档排序。
......
运行本目录下的程序示例需要使用PaddlePaddle v0.11.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
---
# 带外部记忆机制的神经机器翻译
**外部记忆**(External Memory)机制的神经机器翻译模型(Neural Machine Translation, NMT),是神经机器翻译模型的一个重要扩展。它引入可微分的记忆网络作为额外的记忆单元,拓展神经翻译模型内部工作记忆(Working Memory)的容量或带宽,辅助完成翻译等任务中信息的临时存取,改善模型表现。
......
运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
---
# 使用噪声对比估计加速语言模型训练
## 为什么需要噪声对比估计
......@@ -101,11 +105,11 @@ return paddle.layer.nce(
NCE 层的一些重要参数解释如下:
| 参数名 | 参数作用 | 介绍 |
|:------ |:-------| :--------|
| param\_attr / bias\_attr | 用来设置参数名字 |方便预测阶段加载参数,具体在预测一节中介绍。|
| num\_neg\_samples | 负样本采样个数|可以控制正负样本比例,这个值取值区间为 [1, 字典大小-1],负样本个数越多则整个模型的训练速度越慢,模型精度也会越高 |
| neg\_distribution | 生成负样例标签的分布,默认是一个均匀分布| 可以自行控制负样本采样时各个类别的采样权重。例如:希望正样例为“晴天”时,负样例“洪水”在训练时更被着重区分,则可以将“洪水”这个类别的采样权重增加|
| 参数名 | 参数作用 | 介绍 |
| :----------------------- | :--------------------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------- |
| param\_attr / bias\_attr | 用来设置参数名字 | 方便预测阶段加载参数,具体在预测一节中介绍。 |
| num\_neg\_samples | 负样本采样个数 | 可以控制正负样本比例,这个值取值区间为 [1, 字典大小-1],负样本个数越多则整个模型的训练速度越慢,模型精度也会越高 |
| neg\_distribution | 生成负样例标签的分布,默认是一个均匀分布 | 可以自行控制负样本采样时各个类别的采样权重。例如:希望正样例为“晴天”时,负样例“洪水”在训练时更被着重区分,则可以将“洪水”这个类别的采样权重增加 |
## 预测
1. 在命令行运行 :
......
运行本目录下的程序示例需要使用PaddlePaddle v0.11.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
---
# 基于双层序列的文本分类
## 简介
......
The minimum PaddlePaddle version needed for the code sample in this directory is v0.10.0. If you are on a version of PaddlePaddle earlier than v0.10.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
---
# Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering
This model implements the work in the following paper:
......
The minimum PaddlePaddle version needed for the code sample in this directory is v0.10.0. If you are on a version of PaddlePaddle earlier than v0.10.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
---
# Neural Machine Translation Model
## Background Introduction
......
运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
---
# 场景文字识别 (STR, Scene Text Recognition)
## STR任务简介
......
运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
---
# Scheduled Sampling
## 概述
......
运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
---
# 命名实体识别
以下是本例的简要目录结构及说明:
......@@ -88,14 +92,14 @@ Baghdad NNP I-NP I-LOC
预处理完成后,一条训练样本包含3个部分作为神经网络的输入信息用于训练:(1)句子序列;(2)首字母大写标记序列;(3)标注序列,下表是一条训练样本的示例:
| 句子序列 | 大写标记序列 | 标注序列 |
|---|---|---|
| u.n. | 1 | B-ORG |
| official | 0 | O |
| ekeus | 1 | B-PER |
| heads | 0 | O |
| for | 0 | O |
| baghdad | 1 | B-LOC |
| . | 0 | O |
| -------- | ------------ | -------- |
| u.n. | 1 | B-ORG |
| official | 0 | O |
| ekeus | 1 | B-PER |
| heads | 0 | O |
| for | 0 | O |
| baghdad | 1 | B-LOC |
| . | 0 | O |
## 运行
### 编写数据读取接口
......
运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
---
# SSD目标检测
## 概述
SSD全称:Single Shot MultiBox Detector,是目标检测领域较新且效果较好的检测算法之一\[[1](#引用)\],有着检测速度快且检测精度高的有的。PaddlePaddle已集成SSD算法,本示例旨在介绍如何使用PaddlePaddle中的SSD模型进行目标检测。下文首先简要介绍SSD原理,然后介绍示例包含文件及如何使用,接着介绍如何在PASCAL VOC数据集上训练、评估及检测,最后简要介绍如何在自有数据集上使用SSD。
......
The minimum PaddlePaddle version needed for the code sample in this directory is v0.10.0. If you are on a version of PaddlePaddle earlier than v0.10.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
---
# Single Shot MultiBox Detector (SSD) Object Detection
## Introduction
......
运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
---
# 文本分类
以下是本例目录包含的文件以及对应说明:
......@@ -129,70 +133,70 @@ negative 0.0300 0.9700 i love scifi and am willing to put up with a lot
1. 数据组织
假设有如下格式的训练数据:每一行为一条样本,以 `\t` 分隔,第一列是类别标签,第二列是输入文本的内容,文本内容中的词语以空格分隔。以下是两条示例数据:
假设有如下格式的训练数据:每一行为一条样本,以 `\t` 分隔,第一列是类别标签,第二列是输入文本的内容,文本内容中的词语以空格分隔。以下是两条示例数据:
```
positive PaddlePaddle is good
negative What a terrible weather
```
```
positive PaddlePaddle is good
negative What a terrible weather
```
2. 编写数据读取接口
自定义数据读取接口只需编写一个 Python 生成器实现**从原始输入文本中解析一条训练样本**的逻辑。以下代码片段实现了读取原始数据返回类型为: `paddle.data_type.integer_value_sequence`(词语在字典的序号)和 `paddle.data_type.integer_value`(类别标签)的 2 个输入给网络中定义的 2 个 `data_layer` 的功能。
```python
def train_reader(data_dir, word_dict, label_dict):
def reader():
UNK_ID = word_dict["<UNK>"]
word_col = 0
lbl_col = 1
for file_name in os.listdir(data_dir):
with open(os.path.join(data_dir, file_name), "r") as f:
for line in f:
line_split = line.strip().split("\t")
word_ids = [
word_dict.get(w, UNK_ID)
for w in line_split[word_col].split()
]
yield word_ids, label_dict[line_split[lbl_col]]
return reader
```
- 关于 PaddlePaddle 中 `data_layer` 接受输入数据的类型,以及数据读取接口对应该返回数据的格式,请参考 [input-types](http://www.paddlepaddle.org/release_doc/0.9.0/doc_cn/ui/data_provider/pydataprovider2.html#input-types) 一节。
- 以上代码片段详见本例目录下的 `reader.py` 脚本,`reader.py` 同时提供了读取测试数据的全部代码。
接下来,只需要将数据读取函数 `train_reader` 作为参数传递给 `train.py` 脚本中的 `paddle.batch` 接口即可使用自定义数据接口读取数据,调用方式如下:
```python
train_reader = paddle.batch(
paddle.reader.shuffle(
reader.train_reader(train_data_dir, word_dict, lbl_dict),
buf_size=1000),
batch_size=batch_size)
```
自定义数据读取接口只需编写一个 Python 生成器实现**从原始输入文本中解析一条训练样本**的逻辑。以下代码片段实现了读取原始数据返回类型为: `paddle.data_type.integer_value_sequence`(词语在字典的序号)和 `paddle.data_type.integer_value`(类别标签)的 2 个输入给网络中定义的 2 个 `data_layer` 的功能。
```python
def train_reader(data_dir, word_dict, label_dict):
def reader():
UNK_ID = word_dict["<UNK>"]
word_col = 0
lbl_col = 1
for file_name in os.listdir(data_dir):
with open(os.path.join(data_dir, file_name), "r") as f:
for line in f:
line_split = line.strip().split("\t")
word_ids = [
word_dict.get(w, UNK_ID)
for w in line_split[word_col].split()
]
yield word_ids, label_dict[line_split[lbl_col]]
return reader
```
- 关于 PaddlePaddle 中 `data_layer` 接受输入数据的类型,以及数据读取接口对应该返回数据的格式,请参考 [input-types](http://www.paddlepaddle.org/release_doc/0.9.0/doc_cn/ui/data_provider/pydataprovider2.html#input-types) 一节。
- 以上代码片段详见本例目录下的 `reader.py` 脚本,`reader.py` 同时提供了读取测试数据的全部代码。
接下来,只需要将数据读取函数 `train_reader` 作为参数传递给 `train.py` 脚本中的 `paddle.batch` 接口即可使用自定义数据接口读取数据,调用方式如下:
```python
train_reader = paddle.batch(
paddle.reader.shuffle(
reader.train_reader(train_data_dir, word_dict, lbl_dict),
buf_size=1000),
batch_size=batch_size)
```
3. 修改命令行参数
- 如果将数据组织成示例数据的同样的格式,只需在 `run.sh` 脚本中修改 `train.py` 启动参数,指定 `train_data_dir` 参数,可以直接运行本例,无需修改数据读取接口 `reader.py`。
- 执行 `python train.py --help` 可以获取`train.py` 脚本各项启动参数的详细说明,主要参数如下:
- `nn_type`:选择要使用的模型,目前支持两种:“dnn” 或者 “cnn”。
- `train_data_dir`:指定训练数据所在的文件夹,使用自定义数据训练,必须指定此参数,否则使用`paddle.dataset.imdb`训练,同时忽略`test_data_dir`,`word_dict`,和 `label_dict` 参数。
- `test_data_dir`:指定测试数据所在的文件夹,若不指定将不进行测试。
- `word_dict`:字典文件所在的路径,若不指定,将从训练数据根据词频统计,自动建立字典。
- `label_dict`:类别标签字典,用于将字符串类型的类别标签,映射为整数类型的序号。
- `batch_size`:指定多少条样本后进行一次神经网络的前向运行及反向更新。
- `num_passes`:指定训练多少个轮次。
- 如果将数据组织成示例数据的同样的格式,只需在 `run.sh` 脚本中修改 `train.py` 启动参数,指定 `train_data_dir` 参数,可以直接运行本例,无需修改数据读取接口 `reader.py`。
- 执行 `python train.py --help` 可以获取`train.py` 脚本各项启动参数的详细说明,主要参数如下:
- `nn_type`:选择要使用的模型,目前支持两种:“dnn” 或者 “cnn”。
- `train_data_dir`:指定训练数据所在的文件夹,使用自定义数据训练,必须指定此参数,否则使用`paddle.dataset.imdb`训练,同时忽略`test_data_dir`,`word_dict`,和 `label_dict` 参数。
- `test_data_dir`:指定测试数据所在的文件夹,若不指定将不进行测试。
- `word_dict`:字典文件所在的路径,若不指定,将从训练数据根据词频统计,自动建立字典。
- `label_dict`:类别标签字典,用于将字符串类型的类别标签,映射为整数类型的序号。
- `batch_size`:指定多少条样本后进行一次神经网络的前向运行及反向更新。
- `num_passes`:指定训练多少个轮次。
### 如何预测
1. 修改 `infer.py` 中以下变量,指定使用的模型、指定测试数据。
```python
model_path = "dnn_params_pass_00000.tar.gz" # 指定模型所在的路径
nn_type = "dnn" # 指定测试使用的模型
test_dir = "./data/test" # 指定测试文件所在的目录
word_dict = "./data/dict/word_dict.txt" # 指定字典所在的路径
label_dict = "./data/dict/label_dict.txt" # 指定类别标签字典的路径
```
```python
model_path = "dnn_params_pass_00000.tar.gz" # 指定模型所在的路径
nn_type = "dnn" # 指定测试使用的模型
test_dir = "./data/test" # 指定测试文件所在的目录
word_dict = "./data/dict/word_dict.txt" # 指定字典所在的路径
label_dict = "./data/dict/label_dict.txt" # 指定类别标签字典的路径
```
2. 在终端中执行 `python infer.py`
运行本目录下的程序示例需要使用PaddlePaddle v0.10.0 版本。如果您的PaddlePaddle安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新PaddlePaddle安装版本。
---
# Youtube DNN推荐模型
以下是本例目录包含的文件以及对应说明:
```
├── README.md # 文档
├── README.cn.md # 中文文档
├── data # 示例数据
│   ├── data.tar # 示例数据
├── infer.py # 预测脚本
├── network_conf.py # 模型网络配置
├── reader.py # data reader
├── train.py # 训练脚本
└── utils.py # 工具
└── data_processer.py # 数据预处理脚本
└── user_vector.py # 获取用户向量脚本
└── item_vector.py # 获取视频向量脚本
├── infer_user.py # 获取用户个性化脚本
```
## 背景介绍\[[1](#参考文献)\]
Youtube是世界最大的视频网站之一,其推荐系统帮助10亿以上的用户,从海量视频中,发现个性化的内容。该推荐系统主要面临以下三个挑战:
- 规模: 许多现有的推荐算法证明在小数据量下运行良好,但不能满足YouTube这样庞大的用户群和内容库的场景,因此需要高度专业化的分布式学习算法和高效的线上服务。
- 新鲜度: YouTube内容库更新频率极高,每秒上传大量视频。系统应及时追踪新上传的视频和用户的实时行为,并且模型在推荐新/旧视频上有良好平衡能力。
- 噪音: 噪音来自于两方面,其一,用户历史行为稀疏,且有各种不可观测的外部因素,以及用户满意度不明确。其二,内容本身的数据是非结构化的。因此算法应更具有鲁棒性。
下图展示了整个推荐系统框图:
<p align="center">
<img src="images/recommendation_system.png" width="500" height="300" hspace='10'/> <br/>
Figure 1. 推荐系统框图(出自论文[1])
</p>
整个推荐系统有两部分组成: 召回(candidate generation/recall)和排序(ranking)。
- 召回模型: 输入用户的历史行为,从大规模的内容库中获得一个小集合(百级别)。召回出的视频与用户高度相关。一个用户是用其历史点击过的视频,搜索过的关键词,和人口统计相关的特征来表征。
- 排序模型: 采用更精细的特征计算得到排序分,对召回得到的候选集合中的视频进行排序。
本文主要详细介绍了召回模型的原理与使用。
## 召回模型简介
该推荐问题可以被建模成一个"超大规模多分类"问题。即在时刻![](https://www.zhihu.com/equation?tex=t),为用户![](https://www.zhihu.com/equation?tex=U)(已知上下文信息![](https://www.zhihu.com/equation?tex=C))在视频库![](https://www.zhihu.com/equation?tex=V)中预测出观看视频![](https://www.zhihu.com/equation?tex=i)的类别,
![](https://www.zhihu.com/equation?tex=%24P(%5Comega_t%3Di%7CU%2CC)%3D%5Cfrac%7Be%5E%7B%5Cmathbf%7Bv_i%7D%5Cmathbf%7Bu%7D%7D%7D%7B%5Csum_%7Bj%5Cin%20V%7D%5E%7B%20%7De%5E%7B%5Cmathbf%7Bv_j%7D%5Cmathbf%7Bu%7D%7D%7D)
其中![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D%5Cin%20%5Cmathbb%7BR%7D%5EN),是<用户上下文信息>的高维向量表示。![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bv_j%7D%5Cin%20%5Cmathbb%7BR%7D%5EN)是视频![](https://www.zhihu.com/equation?tex=j)的高维向量表示。DNN模型的目标是以用户信息和上下文信息为输入条件下,学习用户的高维向量表示,以此输入softmax分类器,来预测视频库中各个视频(类别)的观看概率。
下图展示了召回模型的网络结构:
<p align="center">
<img src="images/model_network.png" width="600" height="500" hspace='10'/> <br/>
Figure 2. 召回模型网络结构(出自论文[1])
</p>
- 输入层:用户的浏览序列、搜索序列、人口统计学特征、和其他上下文信息等
- embedding层:将用户浏览视频序列接embedding层,再做时间序列上的平均。对于搜索序列同样处理。
- 隐层:包含三个隐层,用RELU激活函数,最后一层隐层的输出即为高维向量表示![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D)
- 输出层: softmax层,输出视频库中各个视频(类别)的观看概率。在线上预测时,提取模型训练得到的softmax层内部的参数,作为视频![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bv%7D)的高维向量表示。可利用类似局部敏感哈希(Locality Sensitive Hashing)用![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D)查询最相关的N个视频。
## 数据预处理
本例模拟了用户的视频点击日志,作为样本数据。格式如下:
```
用户Id \t 所在省份 \t 所在城市 \t 历史点击的视频序列信息 \t 手机型号
历史点击的视频序列信息的格式为 视频信息1;视频信息2;...;视频信息K
视频信息的格式为 视频id:视频类目:视频标签1_视频标签2_视频标签3_...视频标签M
例如:
USER_ID_15 上海市 上海市 VIDEO_42:CATEGORY_9:TAG115;VIDEO_43:CATEGORY_9:TAG116_TAG115;VIDEO_44:CATEGORY_2:TAG117_TAG71 GO T5
```
在youtube_recall目录下运行以下命令(下同),可以解压样本数据。
```
cd data
tar -zxvf data.tar
```
然后,脚本`data_preprocess.py`将对训练数据做预处理。具体使用方法参考如下说明:
```
usage: data_processor.py [-h] --train_set_path TRAIN_SET_PATH --output_dir
OUTPUT_DIR [--feat_appear_limit FEAT_APPEAR_LIMIT]
PaddlePaddle Youtube Recall Model Example
optional arguments:
-h, --help show this help message and exit
--train_set_path TRAIN_SET_PATH
path of the train set
--output_dir OUTPUT_DIR
directory to output
--feat_appear_limit FEAT_APPEAR_LIMIT
the minimum number of feature values appears (default:
20)
```
该脚本的作用如下:
- 借鉴\[[2](#参考文献)\]中对特征的处理,过滤低频特征(样本中出现次数低于`feat_appear_limit`)。
- 对特征进行编码,生成字典`feature_dict.pkl`
- 统计每个视频出现的概率,保存至`item_freq.pkl`,提供给nce层使用。
例如可执行下列命令,完成数据预处理:
```shell
mkdir output
python data_processor.py --train_set_path=./data/train.txt \
--output_dir=./output \
--feat_appear_limit=20
```
## 模型实现
下面是网络中各个部分的具体实现,相关代码均包含在 `./network_conf.py` 中。
### 输入层
```python
def _build_input_layer(self):
"""
build input layer
"""
self._history_clicked_items = paddle.layer.data(
name="history_clicked_items", type=paddle.data_type.integer_value_sequence(
len(self._feature_dict['history_clicked_items'])))
self._history_clicked_categories = paddle.layer.data(
name="history_clicked_categories", type=paddle.data_type.integer_value_sequence(
len(self._feature_dict['history_clicked_categories'])))
self._history_clicked_tags = paddle.layer.data(
name="history_clicked_tags", type=paddle.data_type.integer_value_sequence(
len(self._feature_dict['history_clicked_tags'])))
self._user_id = paddle.layer.data(
name="user_id", type=paddle.data_type.integer_value(
len(self._feature_dict['user_id'])))
self._province = paddle.layer.data(
name="province", type=paddle.data_type.integer_value(
len(self._feature_dict['province'])))
self._city = paddle.layer.data(
name="city", type=paddle.data_type.integer_value(len(self._feature_dict['city'])))
self._phone = paddle.layer.data(
name="phone", type=paddle.data_type.integer_value(len(self._feature_dict['phone'])))
self._target_item = paddle.layer.data(
name="target_item", type=paddle.data_type.integer_value(
len(self._feature_dict['history_clicked_items'])))
```
### Embedding层
每个输入特征通过embedding到固定维度的向量中。
```python
def _create_emb_attr(self, name):
"""
create embedding parameter
"""
return paddle.attr.Param(
name=name, initial_std=0.001, learning_rate=1, l2_rate=0, sparse_update=False)
def _build_embedding_layer(self):
"""
build embedding layer
"""
self._user_id_emb = paddle.layer.embedding(input=self._user_id,
size=64,
param_attr=self._create_emb_attr(
'_proj_user_id'))
self._province_emb = paddle.layer.embedding(input=self._province,
size=8,
param_attr=self._create_emb_attr(
'_proj_province'))
self._city_emb = paddle.layer.embedding(input=self._city,
size=16,
param_attr=self._create_emb_attr('_proj_city'))
self._phone_emb = paddle.layer.embedding(input=self._phone,
size=16,
param_attr=self._create_emb_attr('_proj_phone'))
self._history_clicked_items_emb = paddle.layer.embedding(
input=self._history_clicked_items,
size=64,
param_attr=self._create_emb_attr('_proj_history_clicked_items'))
self._history_clicked_categories_emb = paddle.layer.embedding(
input=self._history_clicked_categories,
size=8,
param_attr=self._create_emb_attr('_proj_history_clicked_categories'))
self._history_clicked_tags_emb = paddle.layer.embedding(
input=self._history_clicked_tags,
size=64,
param_attr=self._create_emb_attr('_proj_history_clicked_tags'))
```
### 隐层
本文对\[[原论文](#参考文献)\](Covington, Paul, Jay Adams, and Emre Sargin. "Deep neural networks for youtube recommendations." Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016.)中的模型做了如下改进:
- 历史用户点击的视频序列,经过embedding之后,不再使用加权求平均,而是使用lstm序列模型。本文将用户点击的先后次序纳入模型中,然后在时间序列上做最大池化,得到定长向量表示,从而使模型学习到与点击时序相关的隐藏信息。
- 考虑到数据规模与训练性能,本文只用了两个Relu层,也有很不错的效果。
```python
self._rnn_cell = paddle.networks.simple_lstm(
input=self._history_clicked_items_emb, size=64)
self._lstm_last = paddle.layer.pooling(
input=self._rnn_cell, pooling_type=paddle.pooling.Max())
self._avg_emb_cats = paddle.layer.pooling(
input=self._history_clicked_categories_emb,
pooling_type=paddle.pooling.Avg())
self._avg_emb_tags = paddle.layer.pooling(
input=self._history_clicked_tags_emb,
pooling_type=paddle.pooling.Avg())
self._fc_0 = paddle.layer.fc(
name="Relu1",
input=[
self._lstm_last, self._user_id_emb, self._province_emb,
self._city_emb, self._avg_emb_cats, self._avg_emb_tags,
self._phone_emb
],
size=self._dnn_layer_dims[0],
act=paddle.activation.Relu())
self._fc_1 = paddle.layer.fc(
name="Relu2",
input=self._fc_0,
size=self._dnn_layer_dims[1],
act=paddle.activation.Relu())
```
### 输出层
为了提高模型训练速度,使用噪声对比估计(Noise-contrastive estimation, NCE)\[[3](#参考文献)\]。将[数据预处理](#数据预处理)中产出的item_freq.pkl,也就是负样例的分布,作为nce层的参数。
```python
return paddle.layer.nce(
input=self._fc_1,
label=self._target_item,
num_classes=len(self._feature_dict['history_clicked_items']),
param_attr=paddle.attr.Param(name="nce_w"),
bias_attr=paddle.attr.Param(name="nce_b"),
act=paddle.activation.Sigmoid(),
num_neg_samples=5,
neg_distribution=self._item_freq)
```
## 训练
首先,准备`reader.py`,负责将输入原始数据中的特征,转为编码后的特征id。对一条训练数据,根据`window_size`产出多条训练样本给trainer,例如:
```
window_size=2
原始数据:
用户Id \t 所在省份 \t 所在城市 \t 视频信息1;视频信息2;...;视频信息K \t 手机型号
多条训练样本:
用户Id,所在省份,所在城市,[<unk>,历史点击视频1],[<unk>,历史点击视频类目1],[<unk>,历史点击视频标签1],手机型号,历史点击视频2
用户Id,所在省份,所在城市,[历史点击视频1,历史点击视频2],[历史点击视频类目1,历史点击视频类目2],[历史点击视频标签1,历史点击视频标签2],手机型号,历史点击视频3
用户Id,所在省份,所在城市,[历史点击视频2,历史点击视频3],[历史点击视频类目2,历史点击视频类目3],[历史点击视频标签2,历史点击视频标签3],手机型号,历史点击视频4
......
```
相关代码如下:
```python
for i in range(1, len(history_clicked_items_all)):
start = max(0, i - self._window_size)
history_clicked_items = history_clicked_items_all[start:i]
history_clicked_categories = history_clicked_categories_all[start:i]
history_clicked_tags_str = history_clicked_tags_all[start:i]
history_clicked_tags = []
for tags_a in history_clicked_tags_str:
for tag in tags_a.split("_"):
history_clicked_tags.append(int(tag))
target_item = history_clicked_items_all[i]
yield user_id, province, city, \
history_clicked_items, history_clicked_categories, \
history_clicked_tags, phone, target_item
```
```python
reader = Reader(feature_dict, args.window_size)
trainer.train(
paddle.batch(
paddle.reader.shuffle(
lambda: reader.train(args.train_set_path),
buf_size=7000), args.batch_size),
num_passes=args.num_passes,
feeding=feeding,
event_handler=event_handler)
```
接下去就可以开始训练了,可执行以下命令:
```shell
mkdir output/model
python train.py --train_set_path='./data/train.txt' \
--test_set_path='./data/test.txt' \
--model_output_dir='./output/model/' \
--feature_dict='./output/feature_dict.pkl' \
--item_freq='./output/item_freq.pkl'
```
## 离线预测
输入用户相关的特征,输出topN个最可能观看的视频,可执行以下命令:
```shell
python infer.py --infer_set_path='./data/infer.txt' \
--model_path='./output/model/model_pass_00000.tar.gz' \
--feature_dict='./output/feature_dict.pkl' \
--batch_size=50
```
## 在线预测
在线预测的时候,采用近似最近邻(approximate nearest neighbor-ANN)算法直接用用户向量查询最相关的topN个视频向量,将对应的视频内容推荐给用户。下面介绍如何获得用户向量和视频向量。
### 用户向量
用最后一个RELU层的输出,前拼一个常数项1,作为用户向量。这边最后一个RELU层的大小是31维,拼接后的用户向量就是32维,即
![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D%3D%5B1%2Cu_1%2Cu_2%2C...%2Cu_%7B31%7D%5D)
### 视频向量
视频向量从模型训练得到的softmax层的参数中提取。假设共有M个不同的视频,那么softmax层输出的是这M个视频各自用户点击的概率,即
![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bo%7D%3D%5Bs_1%2Cs_2%2C...%2Cs_%7BM%7D%5D)
从最后一个RELU层输出的用户向量![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D),到softmax层输出的M个视频的概率![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bo%7D),中间则是通过乘以了softmax层的参数w,b构成的一个![](https://www.zhihu.com/equation?tex=32%5Ctimes%20M)矩阵,其中的每一列为一个32维的视频向量,按照字典顺序一一对应。
![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D%5Ccdot%20%5Cbegin%7Bbmatrix%7D%0A%20b_1%20%20%26%20b_2%20%26%20%20%5Ccdots%20%26%20b_M%20%5C%5C%20%0A%20w_%7B11%7D%20%26%20w_%7B21%7D%20%26%20%20%5Ccdots%20%20%26%20w_%7BM1%7D%20%5C%5C%20%0A%20w_%7B12%7D%20%26%20w_%7B22%7D%20%26%20%20%20%5Ccdots%20%26%20w_%7BM2%7D%20%20%5C%5C%20%0A%5Cvdots%20%26%20%5Cvdots%20%26%20%20%5Cvdots%20%26%20%5Cvdots%20%5C%5C%20%0Aw_%7B131%7D%20%26%20%20w_%7B231%7D%20%26%20%20%5Ccdots%20%20%26%20w_%7BM31%7D%20%20%0A%5Cend%7Bbmatrix%7D_%7B32%5Ctimes%20M%7D%20%3D%20%5Cmathbf%7Bu%7D%20%5Ccdot%20%20%5Cbegin%7Bbmatrix%7D%20%0A%5Cmathbf%7Bv_1%7D%2C%20%5Cmathbf%7Bv_2%7D%2C%20%5Ccdots%2C%20%5Cmathbf%7Bv_M%7D%20%0A%5Cend%7Bbmatrix%7D_%7B1%5Ctimes%20M%7D%3D%5Cmathbf%7Bo%7D)
### SIMPLE-LSH变换
很多ann算法只支持cosine距离,而模型是根据内积排序的,两者效果差异较大。为此,这边的解决方案是,对前面得到的用户和视频向量,作SIMPLE-LSH变换\[[4](#参考文献)\],使内积排序与cosin排序等价。
具体如下:
- 对于视频向量![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bv%7D%5Cin%20%5Cmathbb%7BR%7D%5EN),有![](https://www.zhihu.com/equation?tex=%5Cleft%20%5C%7C%20%5Cmathbf%7Bv%7D%20%5Cright%20%5C%7C%5Cleqslant%20m),变换后的![](https://www.zhihu.com/equation?tex=%5Ctilde%7B%5Cmathbf%7Bv%7D%7D%5Cin%20%5Cmathbb%7BR%7D%5E%7BN%2B1%7D),![](https://www.zhihu.com/equation?tex=%5Ctilde%7B%5Cmathbf%7Bv%7D%7D%20%3D%20%5B%5Cfrac%7B%5Cmathbf%7Bv%7D%7D%7Bm%7D%3B%20%5Csqrt%7B1%20-%5Cleft%20%5C%7C%20%5Cmathbf%7B%5Cfrac%7B%5Cmathbf%7Bv%7D%7D%7Bm%7D%7B%7D%7D%20%5Cright%20%5C%7C%5E2%7D%5D)
- 对于用户向量![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D%5Cin%20%5Cmathbb%7BR%7D%5EN),变换后的![](https://www.zhihu.com/equation?tex=%5Ctilde%7B%5Cmathbf%7Bu%7D%7D%5Cin%20%5Cmathbb%7BR%7D%5E%7BN%2B1%7D),![](https://www.zhihu.com/equation?tex=%5Ctilde%7B%5Cmathbf%7Bu%7D%7D%20%3D%20%5B%5Cmathbf%7Bu%7D_%7Bnorm%7D%3B%200%5D),其中![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D_%7Bnorm%7D)是模长归一化后的![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D)
线上对于一个![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D)用内积召回![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bv%7D),作上述变换![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D%5Crightarrow%20%5Ctilde%7B%5Cmathbf%7Bu%7D%7D%2C%20%5Cmathbf%7Bv%7D%5Crightarrow%20%5Ctilde%7B%5Cmathbf%7Bv%7D%7D)后,不改变内积排序的顺序。又因为![](https://www.zhihu.com/equation?tex=%5Cleft%20%5C%7C%20%5Ctilde%7B%5Cmathbf%7Bu%7D%7D%20%5Cright%20%5C%7C) 和![](https://www.zhihu.com/equation?tex=%5Cleft%20%5C%7C%20%5Ctilde%7B%5Cmathbf%7Bv%7D%7D%20%5Cright%20%5C%7C)都为1,因此![](https://www.zhihu.com/equation?tex=cos(%5Ctilde%7B%5Cmathbf%7Bu%7D%7D%20%2C%5Ctilde%7B%5Cmathbf%7Bv%7D%7D)%20%3D%20%5Ctilde%7B%5Cmathbf%7Bu%7D%7D%5Ccdot%20%5Ctilde%7B%5Cmathbf%7Bv%7D%7D),就可以兼容ANN用cosin的方式召回了,结果等价。
线上使用时,为保留精度,可以不除以![](https://www.zhihu.com/equation?tex=m),也就变成![](https://www.zhihu.com/equation?tex=%5Ctilde%7B%5Cmathbf%7Bv%7D%7D%3D%5B%5Cmathbf%7Bv%7D%3B%5Csqrt%7Bm%5E2-%5Cleft%5C%7C%20%5Cmathbf%7B%5Cmathbf%7Bv%7D%7D%5Cright%5C%7C%5E2%7D%5D),排序依然等价。
### 实现
可使用`user_vector.py`获取用户向量, 输入用户特征经过网络预测,probs[1]中存储的是最后一个RELU层的输出,先前拼接一个1,再做SIMPLE-LSH变换(后接一个0,归一化):
```python
probs = inferer.infer(
input=test_batch,
feeding=feeding,
field=["value"],
flatten_result=False)
for i, res in enumerate(zip(probs[1])):
# do simple lsh conversion
user_vector = [1.000]
for i in res[0]:
user_vector.append(i)
user_vector.append(0.000)
norm = np.linalg.norm(user_vector)
user_vector_norm = [str(_ / norm) for _ in user_vector]
print ",".join(user_vector_norm)
```
可使用`item_vector.py`分别获视频向量。加载模型,提取参数nce_w和nce_b,拼接M个视频向量,第i个视频向量的第一维是对应的nce_b[0][i],后面是nce_w[i][1:31]。再做SIMPLE-LSH变换,找到所有向量最大的模,按照![](https://www.zhihu.com/equation?tex=%5Ctilde%7B%5Cmathbf%7Bv%7D%7D%3D%5B%5Cmathbf%7Bv%7D%3B%5Csqrt%7Bm%5E2-%5Cleft%5C%7C%20%5Cmathbf%7B%5Cmathbf%7Bv%7D%7D%5Cright%5C%7C%5E2%7D%5D)处理。
```python
# load the trained model.
with gzip.open(args.model_path) as f:
parameters = paddle.parameters.Parameters.from_tar(f)
nce_w = parameters.get("nce_w")
nce_b = parameters.get("nce_b")
item_vector = convt_simple_lsh(get_item_vec_from_softmax(nce_w, nce_b))
def get_item_vec_from_softmax(nce_w, nce_b):
"""
get item vectors from softmax parameter
"""
if nce_w is None or nce_b is None:
return None
vector = []
total_items_num = nce_w.shape[0]
if total_items_num != nce_b.shape[1]:
return None
dim_vector = nce_w.shape[1] + 1
for i in range(0, total_items_num):
vector.append([])
vector[i].append(nce_b[0][i])
for j in range(1, dim_vector):
vector[i].append(nce_w[i][j - 1])
return vector
def convt_simple_lsh(vector):
"""
do simple lsh conversion
"""
max_norm = 0
num_of_vec = len(vector)
for i in range(0, num_of_vec):
norm = np.linalg.norm(vector[i])
if norm > max_norm:
max_norm = norm
for i in range(0, num_of_vec):
vector[i].append(
math.sqrt(
math.pow(max_norm, 2) - math.pow(np.linalg.norm(vector[i]), 2)))
return vector
```
可执行下列命令运行脚本:
```shell
python user_vector.py --infer_set_path='./data/infer.txt' \
--model_path='./output/model/model_pass_00000.tar.gz' \
--feature_dict='./output/feature_dict.pkl' \
--batch_size=50
python item_vector.py --model_path='./output/model/model_pass_00000.tar.gz' \
--feature_dict='./output/feature_dict.pkl'
```
## 离线挖掘
因为实时召回需要大量机器资源,这边也可以离线挖掘产出数据,线上召回使用挖掘好的数据。可以产出最热,用户个性化,视频相关等数据。下面的示例产出了用户个性化数据。
```
python infer_user.py --model_path='./output/model/model_pass_00000.tar.gz' \
--feature_dict='./output/feature_dict.pkl'
```
## 参考文献
1. Covington, Paul, Jay Adams, and Emre Sargin. "Deep neural networks for youtube recommendations." Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016.
2. https://code.google.com/archive/p/word2vec/
3. http://paddlepaddle.org/docs/develop/models/nce_cost/README.html
4. Neyshabur, Behnam, and Nathan Srebro. "On symmetric and asymmetric LSHs for inner product search." arXiv preprint arXiv:1410.5518 (2014).
The minimum PaddlePaddle version needed for the code sample in this directory is v0.10.0. If you are on a version of PaddlePaddle earlier than v0.10.0, [please update your installation](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html).
---
# Deep Neural Networks for YouTube Recommendations
## Introduction\[[1](#References)\]
YouTube is the world's largest platform for creating, sharing and discovering video content. Youtube recommendations are responsible for helping more than a billion users discover personalized content from an ever-growing corpus of videos.
- Scale: Many existing recommendation algorithm proven to work well on small problems fail to operate on massive scale. Highly specialized distributed learning algorithms and efficient serving systems are essential.
- Freshness: YouTube has a very dynamic corpus with many hours of video are uploaded per second. The recommendation system should model newly uploaded content as well as the latest actions taken by user.
- Noise: Historical user behavior on YouTube is inherently difficult to predict due to sparsity and a variety of unobservable external factors. Furthermore, the noisy implicit feedback signals instead of the ground truth of user satisfaction is observed, and metadata associated with content is poorly structured, which forces the algorithms to be robust.
The overall structure of the recommendation system is illustrated in Figure 1.
<p align="center">
<img src="images/recommendation_system.png" width="500" height="300" hspace='10'/> <br/>
Figure 1. Recommendation system architecture[1]
</p>
The system is comprised of two neural networks: one for candidate generation and one for ranking.
- The candidate generation network: It takes events from the user's YouTube activity history as input and retrieves a small subset(hundreds) of videos, highly relevant to the user, from a large corpus. The similarity between users is expressed in terms of coarse features such as IDs of video watches, search query tokens and demographics.
- The ranking network: It accomplishes this task by assigning a score to each video according to a desired objective function using a rich set of features describing the video and user.
This markdown describes the principle and use of the candidate generation network in detail.
## Candidate Generation
Here, candidate generation is modeled as extreme multiclass classification where the prediction problem becomes accurately classifying a specific video watch ![](https://www.zhihu.com/equation?tex=%5Comega_t) at time ![](https://www.zhihu.com/equation?tex=t) among millions of video ![](https://www.zhihu.com/equation?tex=i) (classes) from a corpus ![](https://www.zhihu.com/equation?tex=V) based on user ![](https://www.zhihu.com/equation?tex=U) and context ![](https://www.zhihu.com/equation?tex=C),
![](https://www.zhihu.com/equation?tex=%24P(%5Comega_t%3Di%7CU%2CC)%3D%5Cfrac%7Be%5E%7B%5Cmathbf%7Bv_i%7D%5Cmathbf%7Bu%7D%7D%7D%7B%5Csum_%7Bj%5Cin%20V%7D%5E%7B%20%7De%5E%7B%5Cmathbf%7Bv_j%7D%5Cmathbf%7Bu%7D%7D%7D)
where ![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D%5Cin%20%5Cmathbb%7BR%7D%5EN) represents a high-dimensional "embedding" of the user, context pair and the ![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bv_j%7D%5Cin%20%5Cmathbb%7BR%7D%5EN) represent embeddings of each candidate video. The task of the deep neural network is to learn user embeddings ![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D) as a function of the user's history and context that are useful for discriminating among videos with a softmax classifier.
Figure 2 shows the general network architecture of candidate generation model:
<p align="center">
<img src="images/model_network.png" width="600" height="500" hspace='10'/> <br/>
Figure 2. Candidate generation model architecture[1]
</p>
- Input layer: A user's watch history is represented by a variable-length sequence of sparse video IDs, and search history is similarly represented by a variable-length sequence of search tokens.
- Embedding layer: The input features each is mapped to a fixed-sized dense vector representation via the embeddings, and then simply averaging the embeddings. The embeddings are learned jointly with all other model parameters through normal gradient descent back-propagation updates.
- Hidden layer: Features are concatenated into a wide first layer, followed by several layers of fully connected Rectified Linear Units (ReLU). The output of the last ReLU layer is the previous mentioned high-dimensional "embedding" of the user ![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D), so called user vector.
- Output layer: A softmax classifier is connected to do discriminating millions of classes (videos). To speed up training process, a technique is applied that samples negative classes from background distribution with importance weighting. The previous mentioned high-dimensional "embedding" of the candidate video ![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bv%7D) is obtained by weight and bias of the softmax layer. At serving time, the most likely N classes (videos) is computed for presenting to the user. To Score millions of items under a strict serving laterncy, the scoring problem reduces to a nearest neighbor search in the dot product space, and Locality Sensitive Hashing is relied on.
## Data Pre-processing
In this example, here moke the click log of users as sample data, and its format is as follows:
```
user-id \t province \t city \t history-clicked-video-info-sequence \t phone
history-clicked-video-info-sequence is formated as
video-info1;video-info2;...;video-infoK
video-info is formated as
video-id:category:tag1_tag2_tag3_...tagM
For example:
USER_ID_15 Shanghai Shanghai VIDEO_42:CATEGORY_9:TAG115;VIDEO_43:CATEGORY_9:TAG116_TAG115;VIDEO_44:CATEGORY_2:TAG117_TAG71 GO T5
```
Run this code in `youtube_recall` directory (the same below) to prepare the sample data.
```
cd data
tar -zxvf data.tar
```
Then, run `data_preprocess.py` for data pre-processiong. Refer to the following instructions:
```
usage: data_processor.py [-h] --train_set_path TRAIN_SET_PATH --output_dir
OUTPUT_DIR [--feat_appear_limit FEAT_APPEAR_LIMIT]
PaddlePaddle Deep Candidate Generation Example
optional arguments:
-h, --help show this help message and exit
--train_set_path TRAIN_SET_PATH
path of the train set
--output_dir OUTPUT_DIR
directory to output
--feat_appear_limit FEAT_APPEAR_LIMIT
the minimum number of feature values appears (default:
20)
```
The fucntion of this script is as follows:
- Filter low-frequency features\[[2](#References)\], which appears less than `feat_appear_limit` times.
- Encode features, and generate dictionary `feature_dict.pkl`.
- Count the probability of each video appears and write into `item_freq.pkl`, and provide it to NCE layer.
For example, run the following command to accomplish data pre-processing:
```
mkdir output
python data_processor.py --train_set_path=./data/train.txt \
--output_dir=./output \
--feat_appear_limit=20
```
## Model Implementaion
The details of model implementation is illustrated as follows. The code is in `./network_conf.py`.
### Input layer
```python
def _build_input_layer(self):
"""
build input layer
"""
self._history_clicked_items = paddle.layer.data(
name="history_clicked_items", type=paddle.data_type.integer_value_sequence(
len(self._feature_dict['history_clicked_items'])))
self._history_clicked_categories = paddle.layer.data(
name="history_clicked_categories", type=paddle.data_type.integer_value_sequence(
len(self._feature_dict['history_clicked_categories'])))
self._history_clicked_tags = paddle.layer.data(
name="history_clicked_tags", type=paddle.data_type.integer_value_sequence(
len(self._feature_dict['history_clicked_tags'])))
self._user_id = paddle.layer.data(
name="user_id", type=paddle.data_type.integer_value(
len(self._feature_dict['user_id'])))
self._province = paddle.layer.data(
name="province", type=paddle.data_type.integer_value(
len(self._feature_dict['province'])))
self._city = paddle.layer.data(
name="city", type=paddle.data_type.integer_value(len(self._feature_dict['city'])))
self._phone = paddle.layer.data(
name="phone", type=paddle.data_type.integer_value(len(self._feature_dict['phone'])))
self._target_item = paddle.layer.data(
name="target_item", type=paddle.data_type.integer_value(
len(self._feature_dict['history_clicked_items'])))
```
### Embedding layer
The each of input features is mapped to a fixed-sized dense vector representation
```python
def _create_emb_attr(self, name):
"""
create embedding parameter
"""
return paddle.attr.Param(
name=name, initial_std=0.001, learning_rate=1, l2_rate=0, sparse_update=False)
def _build_embedding_layer(self):
"""
build embedding layer
"""
self._user_id_emb = paddle.layer.embedding(input=self._user_id,
size=64,
param_attr=self._create_emb_attr(
'_proj_user_id'))
self._province_emb = paddle.layer.embedding(input=self._province,
size=8,
param_attr=self._create_emb_attr(
'_proj_province'))
self._city_emb = paddle.layer.embedding(input=self._city,
size=16,
param_attr=self._create_emb_attr('_proj_city'))
self._phone_emb = paddle.layer.embedding(input=self._phone,
size=16,
param_attr=self._create_emb_attr('_proj_phone'))
self._history_clicked_items_emb = paddle.layer.embedding(
input=self._history_clicked_items,
size=64,
param_attr=self._create_emb_attr('_proj_history_clicked_items'))
self._history_clicked_categories_emb = paddle.layer.embedding(
input=self._history_clicked_categories,
size=8,
param_attr=self._create_emb_attr('_proj_history_clicked_categories'))
self._history_clicked_tags_emb = paddle.layer.embedding(
input=self._history_clicked_tags,
size=64,
param_attr=self._create_emb_attr('_proj_history_clicked_tags'))
```
### Hiddern layer
Here improves the original networks in \[[Original Paper](#References)\](Covington, Paul, Jay Adams, and Emre Sargin. "Deep neural networks for youtube recommendations." Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016.)
- By modifying that the embeddings of video watches are not simply averaged but are connected to a LSTM layer with max temporal pooling instead, so that the deep sequential information related to user interests can be learned well.
- Considering data scale and efficiency of training, only two ReLU layers are applied, which also leads to good performance.
```python
self._rnn_cell = paddle.networks.simple_lstm(input=self._history_clicked_items_emb, size=64)
self._lstm_last = paddle.layer.pooling(
input=self._rnn_cell, pooling_type=paddle.pooling.Max())
self._avg_emb_cats = paddle.layer.pooling(input=self._history_clicked_categories_emb,
pooling_type=paddle.pooling.Avg())
self._avg_emb_tags = paddle.layer.pooling(input=self._history_clicked_tags_emb,
pooling_type=paddle.pooling.Avg())
self._fc_0 = paddle.layer.fc(
name="Relu1",
input=[self._lstm_last, self._user_id_emb,
self._city_emb, self._phone_emb],
size=self._dnn_layer_dims[0],
act=paddle.activation.Relu())
self._fc_1 = paddle.layer.fc(
name="Relu2",
input=self._fc_0,
size=self._dnn_layer_dims[1],
act=paddle.activation.Relu())
```
### Output layer
To speed up training process, Noise-contrastive estimation, NCE\[[3](#references)\] is applied to sample negative classes from background distribution with importance weighting. The previous mentioned `item_freq.pkl`[data pre-processing](#data pre-processing) is used as neg_distribution.
```python
return paddle.layer.nce(
input=self._fc_1,
label=self._target_item,
num_classes=len(self._feature_dict['history_clicked_items']),
param_attr=paddle.attr.Param(name="nce_w"),
bias_attr=paddle.attr.Param(name="nce_b"),
num_neg_samples=5,
neg_distribution=self._item_freq)
```
## Train
First of all, prepare `reader.py`, the function of which is to convert raw features into encoding id. One piece of train data generates several data instances according to `window_size`, and then is fed into trainer.
```
window_size=2
train data:
user-id \t province \t city \t video-info1;video-info2;...;video-infoK \t phone
several data instances:
user-id,province,city,[<unk>,video-id1],[<unk>,category1],[<unk>,tags1],phone,video-id2
user-id,province,city,[video-id1,video-id2],[category1,category2],[tags1,tags2],phone,video-id3
user-id,province,city,[video-id2,video-id3],[category2,category3],[tags2,tags3],phone,video-id4
......
```
The relevant code is as follows:
```python
for i in range(1, len(history_clicked_items_all)):
start = max(0, i - self._window_size)
history_clicked_items = history_clicked_items_all[start:i]
history_clicked_categories = history_clicked_categories_all[start:i]
history_clicked_tags_str = history_clicked_tags_all[start:i]
history_clicked_tags = []
for tags_a in history_clicked_tags_str:
for tag in tags_a.split("_"):
history_clicked_tags.append(int(tag))
target_item = history_clicked_items_all[i]
yield user_id, province, city, \
history_clicked_items, history_clicked_categories, \
history_clicked_tags, phone, target_item
```
```python
reader = Reader(feature_dict, args.window_size)
trainer.train(
paddle.batch(
paddle.reader.shuffle(
lambda: reader.train(args.train_set_path),
buf_size=7000), args.batch_size),
num_passes=args.num_passes,
feeding=feeding,
event_handler=event_handler)
```
Then start training.
```shell
mkdir output/model
python train.py --train_set_path='./data/train.txt' \
--test_set_path='./data/test.txt' \
--model_output_dir='./output/model/' \
--feature_dict='./output/feature_dict.pkl' \
--item_freq='./output/item_freq.pkl'
```
## Offline prediction
Input user related features, and then get the most likely N videos for user.
```shell
python infer.py --infer_set_path='./data/infer.txt' \
--model_path='./output/model/model_pass_00000.tar.gz' \
--feature_dict='./output/feature_dict.pkl' \
--batch_size=50
```
## Online prediction
For online prediction, Approximate Nearest Neighbor(ANN) is adopted to directly recall top N most likely watch video. Here shows how to get user vector and video vector.
### User Vector
User vector is the output of the last RELU layer with cascading a constant term 1 in the front. Here the dimension of the last RELU layer is 31, and thus the dimension of user vector is 32.
![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D%3D%5B1%2Cu_1%2Cu_2%2C...%2Cu_%7B31%7D%5D)
### Video Vector
Video vector is extracted from the parameters of softmax layer. If there are M different videos, the output of softmax layer will be the probability of click of these M videos.
![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bo%7D%3D%5Bs_1%2Cs_2%2C...%2Cs_%7BM%7D%5D)
To get ![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bo%7D) from user vector ![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D), a ![](https://www.zhihu.com/equation?tex=32%5Ctimes%20M) matrix which consists of the parameters w, b of softmax layer is multiplied. Each column of this matrix is a 32-dim video vector, according to the dictionary order one by one.
![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D%5Ccdot%20%5Cbegin%7Bbmatrix%7D%0A%20b_1%20%20%26%20b_2%20%26%20%20%5Ccdots%20%26%20b_M%20%5C%5C%20%0A%20w_%7B11%7D%20%26%20w_%7B21%7D%20%26%20%20%5Ccdots%20%20%26%20w_%7BM1%7D%20%5C%5C%20%0A%20w_%7B12%7D%20%26%20w_%7B22%7D%20%26%20%20%20%5Ccdots%20%26%20w_%7BM2%7D%20%20%5C%5C%20%0A%5Cvdots%20%26%20%5Cvdots%20%26%20%20%5Cvdots%20%26%20%5Cvdots%20%5C%5C%20%0Aw_%7B131%7D%20%26%20%20w_%7B231%7D%20%26%20%20%5Ccdots%20%20%26%20w_%7BM31%7D%20%20%0A%5Cend%7Bbmatrix%7D_%7B32%5Ctimes%20M%7D%20%3D%20%5Cmathbf%7Bu%7D%20%5Ccdot%20%20%5Cbegin%7Bbmatrix%7D%20%0A%5Cmathbf%7Bv_1%7D%2C%20%5Cmathbf%7Bv_2%7D%2C%20%5Ccdots%2C%20%5Cmathbf%7Bv_M%7D%20%0A%5Cend%7Bbmatrix%7D_%7B1%5Ctimes%20M%7D%3D%5Cmathbf%7Bo%7D)
### SIMPLE-LSH conversion
However, most of ANN systems currently only support cosin sorting, not by inner product sorting, which leads to big effect difference.
To solve it, user and video vectors are sliently modified by a SIMPLE-LSH conversion\[[4](#References)\], so that inner sorting is equivalent to cosin sorting after conversion.
Details are as follows:
- For video vector ![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bv%7D%5Cin%20%5Cmathbb%7BR%7D%5EN), ![](https://www.zhihu.com/equation?tex=%5Cleft%20%5C%7C%20%5Cmathbf%7Bv%7D%20%5Cright%20%5C%7C%5Cleqslant%20m). The modified video vector ![](https://www.zhihu.com/equation?tex=%5Ctilde%7B%5Cmathbf%7Bv%7D%7D%5Cin%20%5Cmathbb%7BR%7D%5E%7BN%2B1%7D), and let ![](https://www.zhihu.com/equation?tex=%5Ctilde%7B%5Cmathbf%7Bv%7D%7D%20%3D%20%5B%5Cfrac%7B%5Cmathbf%7Bv%7D%7D%7Bm%7D%3B%20%5Csqrt%7B1%20-%5Cleft%20%5C%7C%20%5Cmathbf%7B%5Cfrac%7B%5Cmathbf%7Bv%7D%7D%7Bm%7D%7B%7D%7D%20%5Cright%20%5C%7C%5E2%7D%5D).
- For user vector ![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D%5Cin%20%5Cmathbb%7BR%7D%5EN), and the modified user vector ![](https://www.zhihu.com/equation?tex=%5Ctilde%7B%5Cmathbf%7Bu%7D%7D%5Cin%20%5Cmathbb%7BR%7D%5E%7BN%2B1%7D), and let ![](https://www.zhihu.com/equation?tex=%5Ctilde%7B%5Cmathbf%7Bu%7D%7D%20%3D%20%5B%5Cmathbf%7Bu%7D_%7Bnorm%7D%3B%200%5D), where ![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D_%7Bnorm%7D) is normalized ![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D).
When online predicting, for a coming ![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D), it should recall ![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bv%7D) by inner product sorting. After ![](https://www.zhihu.com/equation?tex=%5Cmathbf%7Bu%7D%5Crightarrow%20%5Ctilde%7B%5Cmathbf%7Bu%7D%7D%2C%20%5Cmathbf%7Bv%7D%5Crightarrow%20%5Ctilde%7B%5Cmathbf%7Bv%7D%7D) conversion, the order of inner prodct sorting is unchanged. Since ![](https://www.zhihu.com/equation?tex=%5Cleft%20%5C%7C%20%5Ctilde%7B%5Cmathbf%7Bu%7D%7D%20%5Cright%20%5C%7C) and ![](https://www.zhihu.com/equation?tex=%5Cleft%20%5C%7C%20%5Ctilde%7B%5Cmathbf%7Bv%7D%7D%20%5Cright%20%5C%7C) are both equal to 1, ![](https://www.zhihu.com/equation?tex=cos(%5Ctilde%7B%5Cmathbf%7Bu%7D%7D%20%2C%5Ctilde%7B%5Cmathbf%7Bv%7D%7D)%20%3D%20%5Ctilde%7B%5Cmathbf%7Bu%7D%7D%5Ccdot%20%5Ctilde%7B%5Cmathbf%7Bv%7D%7D), which makes cosin-supported-only ANN system works.
And in order to retain precision, use ![](https://www.zhihu.com/equation?tex=%5Ctilde%7B%5Cmathbf%7Bv%7D%7D%3D%5B%5Cmathbf%7Bv%7D%3B%5Csqrt%7Bm%5E2-%5Cleft%5C%7C%20%5Cmathbf%7B%5Cmathbf%7Bv%7D%7D%5Cright%5C%7C%5E2%7D%5D) is also equivalent.
### Implemention
Run `user_vector.py` to generate user vector. First input the features into network and then infer. The output of the last RELU layer is saved in variable probs[1]. By cascading a contant term 1 in the front and making SIMPLE-LSH conversion, user vector is generated.
```python
probs = inferer.infer(
input=test_batch,
feeding=feeding,
field=["value"],
flatten_result=False)
for i, res in enumerate(zip(probs[1])):
# do simple lsh conversion
user_vector = [1.000]
for i in res[0]:
user_vector.append(i)
user_vector.append(0.000)
norm = np.linalg.norm(user_vector)
user_vector_norm = [str(_ / norm) for _ in user_vector]
print ",".join(user_vector_norm)
```
Run `item_vector.py` to generate video vector. First load the model and extract the parameters nce_w and nce_b. And then generate ith video vector by putting nce_b[0][i] in the first dimension and nce_b[0][i] in the next. Finally make SIMPLE-LSH conversion, finding the maximum norm and processing according to ![](https://www.zhihu.com/equation?tex=%5Ctilde%7B%5Cmathbf%7Bv%7D%7D%3D%5B%5Cmathbf%7Bv%7D%3B%5Csqrt%7Bm%5E2-%5Cleft%5C%7C%20%5Cmathbf%7B%5Cmathbf%7Bv%7D%7D%5Cright%5C%7C%5E2%7D%5D).
```python
# load the trained model.
with gzip.open(args.model_path) as f:
parameters = paddle.parameters.Parameters.from_tar(f)
nce_w = parameters.get("nce_w")
nce_b = parameters.get("nce_b")
item_vector = convt_simple_lsh(get_item_vec_from_softmax(nce_w, nce_b))
def get_item_vec_from_softmax(nce_w, nce_b):
"""
get item vectors from softmax parameter
"""
if nce_w is None or nce_b is None:
return None
vector = []
total_items_num = nce_w.shape[0]
if total_items_num != nce_b.shape[1]:
return None
dim_vector = nce_w.shape[1] + 1
for i in range(0, total_items_num):
vector.append([])
vector[i].append(nce_b[0][i])
for j in range(1, dim_vector):
vector[i].append(nce_w[i][j - 1])
return vector
def convt_simple_lsh(vector):
"""
do simple lsh conversion
"""
max_norm = 0
num_of_vec = len(vector)
for i in range(0, num_of_vec):
norm = np.linalg.norm(vector[i])
if norm > max_norm:
max_norm = norm
for i in range(0, num_of_vec):
vector[i].append(
math.sqrt(
math.pow(max_norm, 2) - math.pow(np.linalg.norm(vector[i]), 2)))
return vector
```
Use `user_vector.py` and `item_vector.py` to calculate user and item vectors. For example, run the following commands:
```shell
python user_vector.py --infer_set_path='./data/infer.txt' \
--model_path='./output/model/model_pass_00000.tar.gz' \
--feature_dict='./output/feature_dict.pkl' \
--batch_size=50
python item_vector.py --model_path='./output/model/model_pass_00000.tar.gz' \
--feature_dict='./output/feature_dict.pkl'
```
## Offline data mining
Since it is inevitable to consume large amount of machine resources for online predicting, an alternative is offline data mining, e.g. hottest videos, user personalized recommendation, item-based recommendation, and online systems directly access it. Here shows an example to get user personalized recommendation.
```
python infer_user.py --model_path='./output/model/model_pass_00000.tar.gz' \
--feature_dict='./output/feature_dict.pkl'
```
## References
1. Covington, Paul, Jay Adams, and Emre Sargin. "Deep neural networks for youtube recommendations." Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016.
2. https://code.google.com/archive/p/word2vec/
3. http://paddlepaddle.org/docs/develop/models/nce_cost/README.html
4. Neyshabur, Behnam, and Nathan Srebro. "On symmetric and asymmetric LSHs for inner product search." arXiv preprint arXiv:1410.5518 (2014).
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import argparse
import os
import cPickle
from utils import logger
"""
This script will output 2 files:
1. feature_dict.pkl
2. item_freq.pkl
"""
class FeatureGenerator(object):
"""
Encode feature values with low-frequency filtering.
"""
def __init__(self, feat_appear_limit=20):
"""
@feat_appear_limit: int
"""
self._dic = None # feature value --> id
self._count = None # numbers of appearances of feature values
self._feat_appear_limit = feat_appear_limit
def add_feat_val(self, feat_val):
"""
Add feature values and count numbers of its appearance.
"""
if self._count is None:
self._count = {'<unk>': 0}
if feat_val == "NULL":
feat_val = '<unk>'
if feat_val not in self._count:
self._count[feat_val] = 1
else:
self._count[feat_val] += 1
self._count['<unk>'] += 1
def _filter_feat(self):
"""
Filter low-frequency feature values.
"""
self._items = filter(lambda x: x[1] > self._feat_appear_limit,
self._count.items())
self._items.sort(key=lambda x: x[1], reverse=True)
def _build_dict(self):
"""
Build feature values --> ids dict.
"""
self._dic = {}
self._filter_feat()
for i in xrange(len(self._items)):
self._dic[self._items[i][0]] = i
self.dim = len(self._dic)
def get_feat_id(self, feat_val):
"""
Get id of feature value after encoding.
"""
# build dict
if self._dic is None:
self._build_dict()
# find id
if feat_val in self._dic:
return self._dic[feat_val]
else:
return self._dic['<unk>']
def get_dim(self):
"""
Get dim.
"""
# build dict
if self._dic is None:
self._build_dict()
return len(self._dic)
def get_dict(self):
"""
Get dict.
"""
# build dict
if self._dic is None:
self._build_dict()
return self._dic
def get_total_count(self):
"""
Compute total num of count.
"""
total_count = 0
for i in xrange(len(self._items)):
feat_val = self._items[i][0]
c = self._items[i][1]
total_count += c
return total_count
def count_iterator(self):
"""
Iterate feature values and its num of appearance.
"""
for i in xrange(len(self._items)):
yield self._items[i][0], self._items[i][1]
def __repr__(self):
"""
"""
return '<FeatureGenerator %d>' % self._dim
def scan_build_dict(data_path, features_dict):
"""
Scan the raw data and add all feature values.
"""
logger.info('scan data set')
with open(data_path, 'r') as f:
for (line_id, line) in enumerate(f):
fields = line.strip('\n').split('\t')
user_id = fields[0]
province = fields[1]
features_dict['province'].add_feat_val(province)
city = fields[2]
features_dict['city'].add_feat_val(city)
item_infos = fields[3]
phone = fields[4]
features_dict['phone'].add_feat_val(phone)
for item_info in item_infos.split(";"):
item_info_array = item_info.split(":")
item = item_info_array[0]
features_dict['history_clicked_items'].add_feat_val(item)
features_dict['user_id'].add_feat_val(user_id)
category = item_info_array[1]
features_dict['history_clicked_categories'].add_feat_val(
category)
tags = item_info_array[2]
for tag in tags.split("_"):
features_dict['history_clicked_tags'].add_feat_val(tag)
def parse_args():
"""
parse arguments
"""
parser = argparse.ArgumentParser(
description="PaddlePaddle Youtube Recall Model Example")
parser.add_argument(
'--train_set_path',
type=str,
required=True,
help="path of the train set")
parser.add_argument(
'--output_dir', type=str, required=True, help="directory to output")
parser.add_argument(
'--feat_appear_limit',
type=int,
default=20,
help="the minimum number of feature values appears (default: 20)")
return parser.parse_args()
if __name__ == '__main__':
args = parse_args()
# check argument
assert os.path.exists(
args.train_set_path), 'The train set path does not exist.'
# features used
features = [
'user_id', 'province', 'city', 'phone', 'history_clicked_items',
'history_clicked_tags', 'history_clicked_categories'
]
# init feature generators
features_dict = {}
for feature in features:
features_dict[feature] = FeatureGenerator(
feat_appear_limit=args.feat_appear_limit)
# scan data for building dict
scan_build_dict(args.train_set_path, features_dict)
# generate feature_dict.pkl
feature_encoding_dict = {}
for feature in features:
d = features_dict[feature].get_dict()
feature_encoding_dict[feature] = d
logger.info('Feature:%s, dimension is %d' % (feature, len(d)))
output_dict_path = os.path.join(args.output_dir, 'feature_dict.pkl')
with open(output_dict_path, "w") as f:
cPickle.dump(feature_encoding_dict, f, -1)
# generate item_freq.pkl
item_freq_list = []
g = features_dict['history_clicked_items']
total_count = g.get_total_count()
for feat_val, feat_count in g.count_iterator():
item_freq_list.append(float(feat_count) / total_count)
logger.info('item_freq, dimension is %d' % (len(item_freq_list)))
output_item_freq_path = os.path.join(args.output_dir, 'item_freq.pkl')
with open(output_item_freq_path, "w") as f:
cPickle.dump(item_freq_list, f, -1)
logger.info('Complete!')
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import gzip
import paddle.v2 as paddle
import argparse
import cPickle
from reader import Reader
from network_conf import DNNmodel
from utils import logger
def parse_args():
"""
parse arguments
:return:
"""
parser = argparse.ArgumentParser(
description="PaddlePaddle Youtube Recall Model Example")
parser.add_argument(
'--infer_set_path',
type=str,
required=True,
help="path of the infer set")
parser.add_argument(
'--model_path', type=str, required=True, help="path of the model")
parser.add_argument(
'--feature_dict',
type=str,
required=True,
help="path of feature_dict.pkl")
parser.add_argument(
'--batch_size',
type=int,
default=50,
help="size of mini-batch (default:50)")
return parser.parse_args()
def infer():
"""
infer
"""
args = parse_args()
# check argument
assert os.path.exists(
args.infer_set_path), 'The infer_set_path path does not exist.'
assert os.path.exists(
args.model_path), 'The model_path path does not exist.'
assert os.path.exists(
args.feature_dict), 'The feature_dict path does not exist.'
paddle.init(use_gpu=False, trainer_count=1)
with open(args.feature_dict) as f:
feature_dict = cPickle.load(f)
nid_dict = feature_dict['history_clicked_items']
nid_to_word = dict((v, k) for k, v in nid_dict.items())
# load the trained model.
with gzip.open(args.model_path) as f:
parameters = paddle.parameters.Parameters.from_tar(f)
# build model
prediction_layer, fc = DNNmodel(
dnn_layer_dims=[256, 31], feature_dict=feature_dict,
is_infer=True).model_cost
inferer = paddle.inference.Inference(
output_layer=[prediction_layer, fc], parameters=parameters)
reader = Reader(feature_dict)
test_batch = []
for idx, item in enumerate(reader.infer(args.infer_set_path)):
test_batch.append(item)
if len(test_batch) == args.batch_size:
infer_a_batch(inferer, test_batch, nid_to_word)
test_batch = []
if len(test_batch):
infer_a_batch(inferer, test_batch, nid_to_word)
def infer_a_batch(inferer, test_batch, nid_to_word):
"""
input a batch of data and infer
"""
feeding = {
'user_id': 0,
'province': 1,
'city': 2,
'history_clicked_items': 3,
'history_clicked_categories': 4,
'history_clicked_tags': 5,
'phone': 6
}
probs = inferer.infer(
input=test_batch,
feeding=feeding,
field=["value"],
flatten_result=False)
for i, res in enumerate(zip(test_batch, probs[0], probs[1])):
softmax_output = res[1]
sort_nid = res[1].argsort()
# print top 30 recommended item
ret = ""
for j in range(1, 30):
item_id = sort_nid[-1 * j]
item_id_to_word = nid_to_word[item_id]
ret += "%s:%.6f," \
% (item_id_to_word, softmax_output[item_id])
print ret.rstrip(",")
if __name__ == "__main__":
infer()
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import gzip
import paddle.v2 as paddle
import argparse
import cPickle
from reader import Reader
from network_conf import DNNmodel
from utils import logger
import numpy as np
def parse_args():
"""
parse arguments
:return:
"""
parser = argparse.ArgumentParser(
description="PaddlePaddle Youtube Recall Model Example")
parser.add_argument(
'--model_path', type=str, required=True, help="path of the model")
parser.add_argument(
'--feature_dict',
type=str,
required=True,
help="path of feature_dict.pkl")
return parser.parse_args()
def infer_user():
"""
infer_user
"""
args = parse_args()
# check argument
assert os.path.exists(
args.model_path), 'The model_path path does not exist.'
assert os.path.exists(
args.feature_dict), 'The feature_dict path does not exist.'
paddle.init(use_gpu=False, trainer_count=1)
with open(args.feature_dict) as f:
feature_dict = cPickle.load(f)
nid_dict = feature_dict['history_clicked_items']
nid_to_word = dict((v, k) for k, v in nid_dict.items())
# load the trained model.
with gzip.open(args.model_path) as f:
parameters = paddle.parameters.Parameters.from_tar(f)
parameters.set('_proj_province', \
np.zeros(shape=parameters.get('_proj_province').shape))
parameters.set('_proj_city', \
np.zeros(shape=parameters.get('_proj_city').shape))
parameters.set('_proj_phone', \
np.zeros(shape=parameters.get('_proj_phone').shape))
parameters.set('_proj_history_clicked_items', \
np.zeros(shape= parameters.get('_proj_history_clicked_items').shape))
parameters.set('_proj_history_clicked_categories', \
np.zeros(shape= parameters.get('_proj_history_clicked_categories').shape))
parameters.set('_proj_history_clicked_tags', \
np.zeros(shape= parameters.get('_proj_history_clicked_tags').shape))
# build model
prediction_layer, fc = DNNmodel(
dnn_layer_dims=[256, 31], feature_dict=feature_dict,
is_infer=True).model_cost
inferer = paddle.inference.Inference(
output_layer=[prediction_layer, fc], parameters=parameters)
reader = Reader(feature_dict)
test_batch = []
for idx, item in enumerate(
reader.infer_user(['USER_ID_0', 'USER_ID_981', 'USER_ID_310806'])):
test_batch.append(item)
infer_a_batch(inferer, test_batch, nid_to_word)
def infer_a_batch(inferer, test_batch, nid_to_word):
"""
input a batch of data and infer
"""
feeding = {
'user_id': 0,
'province': 1,
'city': 2,
'history_clicked_items': 3,
'history_clicked_categories': 4,
'history_clicked_tags': 5,
'phone': 6
}
probs = inferer.infer(
input=test_batch,
feeding=feeding,
field=["value"],
flatten_result=False)
for i, res in enumerate(zip(test_batch, probs[0], probs[1])):
softmax_output = res[1]
sort_nid = res[1].argsort()
# print top 30 recommended item
ret = ""
for j in range(1, 30):
item_id = sort_nid[-1 * j]
item_id_to_word = nid_to_word[item_id]
ret += "%s:%.6f," \
% (item_id_to_word, softmax_output[item_id])
print ret.rstrip(",")
if __name__ == "__main__":
infer_user()
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import gzip
import paddle.v2 as paddle
import argparse
import cPickle
from reader import Reader
from network_conf import DNNmodel
from utils import logger
import numpy as np
import math
def parse_args():
"""
parse arguments
:return:
"""
parser = argparse.ArgumentParser(
description="PaddlePaddle Youtube Recall Model Example")
parser.add_argument(
'--model_path', type=str, required=True, help="path of the model")
parser.add_argument(
'--feature_dict',
type=str,
required=True,
help="path of feature_dict.pkl")
return parser.parse_args()
def get_item_vec_from_softmax(nce_w, nce_b):
"""
get item vectors from softmax parameter
"""
if nce_w is None or nce_b is None:
return None
vector = []
total_items_num = nce_w.shape[0]
if total_items_num != nce_b.shape[1]:
return None
dim_vector = nce_w.shape[1] + 1
for i in range(0, total_items_num):
vector.append([])
vector[i].append(nce_b[0][i])
for j in range(1, dim_vector):
vector[i].append(nce_w[i][j - 1])
return vector
def convt_simple_lsh(vector):
"""
do simple lsh conversion
"""
max_norm = 0
num_of_vec = len(vector)
for i in range(0, num_of_vec):
norm = np.linalg.norm(vector[i])
if norm > max_norm:
max_norm = norm
for i in range(0, num_of_vec):
vector[i].append(
math.sqrt(
math.pow(max_norm, 2) - math.pow(np.linalg.norm(vector[i]), 2)))
return vector
def item_vector():
"""
get item vectors
"""
args = parse_args()
# check argument
assert os.path.exists(
args.model_path), 'The model_path path does not exist.'
assert os.path.exists(
args.feature_dict), 'The feature_dict path does not exist.'
paddle.init(use_gpu=False, trainer_count=1)
with open(args.feature_dict) as f:
feature_dict = cPickle.load(f)
# load the trained model.
with gzip.open(args.model_path) as f:
parameters = paddle.parameters.Parameters.from_tar(f)
nid_dict = feature_dict['history_clicked_items']
nid_to_word = dict((v, k) for k, v in nid_dict.items())
nce_w = parameters.get("nce_w")
nce_b = parameters.get("nce_b")
item_vector = convt_simple_lsh(get_item_vec_from_softmax(nce_w, nce_b))
for i in range(0, len(item_vector)):
itemid = nid_to_word[i]
print itemid + "\t" + ",".join(map(str, item_vector[i]))
if __name__ == "__main__":
item_vector()
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import paddle.v2 as paddle
import cPickle
class DNNmodel(object):
"""
Deep Neural Networks for YouTube candidate generation
"""
def __init__(self,
dnn_layer_dims=None,
feature_dict=None,
item_freq=None,
is_infer=False):
"""
initialize model
@dnn_layer_dims: dimension of each hidden layer
@feature_dict: dictionary of encoded feature
@item_freq: dictionary of feature values and its frequency
@is_infer: if infer mode
"""
self._dnn_layer_dims = dnn_layer_dims
self._feature_dict = feature_dict
self._item_freq = item_freq
self._is_infer = is_infer
# build model
self._build_input_layer()
self._build_embedding_layer()
self.model_cost = self._build_dnn_model()
def _build_input_layer(self):
"""
build input layer
"""
self._history_clicked_items = paddle.layer.data(
name="history_clicked_items",
type=paddle.data_type.integer_value_sequence(
len(self._feature_dict['history_clicked_items'])))
self._history_clicked_categories = paddle.layer.data(
name="history_clicked_categories",
type=paddle.data_type.integer_value_sequence(
len(self._feature_dict['history_clicked_categories'])))
self._history_clicked_tags = paddle.layer.data(
name="history_clicked_tags",
type=paddle.data_type.integer_value_sequence(
len(self._feature_dict['history_clicked_tags'])))
self._user_id = paddle.layer.data(
name="user_id",
type=paddle.data_type.integer_value(
len(self._feature_dict['user_id'])))
self._province = paddle.layer.data(
name="province",
type=paddle.data_type.integer_value(
len(self._feature_dict['province'])))
self._city = paddle.layer.data(
name="city",
type=paddle.data_type.integer_value(
len(self._feature_dict['city'])))
self._phone = paddle.layer.data(
name="phone",
type=paddle.data_type.integer_value(
len(self._feature_dict['phone'])))
self._target_item = paddle.layer.data(
name="target_item",
type=paddle.data_type.integer_value(
len(self._feature_dict['history_clicked_items'])))
def _create_emb_attr(self, name):
"""
create embedding parameter
"""
return paddle.attr.Param(
name=name,
initial_std=0.001,
learning_rate=1,
l2_rate=0,
sparse_update=False)
def _build_embedding_layer(self):
"""
build embedding layer
"""
self._user_id_emb = paddle.layer.embedding(
input=self._user_id,
size=64,
param_attr=self._create_emb_attr('_proj_user_id'))
self._province_emb = paddle.layer.embedding(
input=self._province,
size=8,
param_attr=self._create_emb_attr('_proj_province'))
self._city_emb = paddle.layer.embedding(
input=self._city,
size=16,
param_attr=self._create_emb_attr('_proj_city'))
self._phone_emb = paddle.layer.embedding(
input=self._phone,
size=16,
param_attr=self._create_emb_attr('_proj_phone'))
self._history_clicked_items_emb = paddle.layer.embedding(
input=self._history_clicked_items,
size=64,
param_attr=self._create_emb_attr('_proj_history_clicked_items'))
self._history_clicked_categories_emb = paddle.layer.embedding(
input=self._history_clicked_categories,
size=8,
param_attr=self._create_emb_attr(
'_proj_history_clicked_categories'))
self._history_clicked_tags_emb = paddle.layer.embedding(
input=self._history_clicked_tags,
size=64,
param_attr=self._create_emb_attr('_proj_history_clicked_tags'))
def _build_dnn_model(self):
"""
build dnn model
"""
self._rnn_cell = paddle.networks.simple_lstm(
input=self._history_clicked_items_emb, size=64)
self._lstm_last = paddle.layer.pooling(
input=self._rnn_cell, pooling_type=paddle.pooling.Max())
self._avg_emb_cats = paddle.layer.pooling(
input=self._history_clicked_categories_emb,
pooling_type=paddle.pooling.Avg())
self._avg_emb_tags = paddle.layer.pooling(
input=self._history_clicked_tags_emb,
pooling_type=paddle.pooling.Avg())
self._fc_0 = paddle.layer.fc(
name="Relu1",
input=[
self._lstm_last, self._user_id_emb, self._province_emb,
self._city_emb, self._avg_emb_cats, self._avg_emb_tags,
self._phone_emb
],
size=self._dnn_layer_dims[0],
act=paddle.activation.Relu())
self._fc_1 = paddle.layer.fc(name="Relu2",
input=self._fc_0,
size=self._dnn_layer_dims[1],
act=paddle.activation.Relu())
if not self._is_infer:
return paddle.layer.nce(
input=self._fc_1,
label=self._target_item,
num_classes=len(self._feature_dict['history_clicked_items']),
param_attr=paddle.attr.Param(name="nce_w"),
bias_attr=paddle.attr.Param(name="nce_b"),
num_neg_samples=5,
neg_distribution=self._item_freq)
else:
self.prediction_layer = paddle.layer.mixed(
size=len(self._feature_dict['history_clicked_items']),
input=paddle.layer.trans_full_matrix_projection(
self._fc_1, param_attr=paddle.attr.Param(name="nce_w")),
act=paddle.activation.Softmax(),
bias_attr=paddle.attr.Param(name="nce_b"))
return self.prediction_layer, self._fc_1
if __name__ == "__main__":
# this is to test and debug the network topology defination.
# please set the hyper-parameters as needed.
item_freq_path = "./output/item_freq.pkl"
with open(item_freq_path) as f:
item_freq = cPickle.load(f)
feature_dict_path = "./output/feature_dict.pkl"
with open(feature_dict_path) as f:
feature_dict = cPickle.load(f)
a = DNNmodel(
dnn_layer_dims=[256, 31],
feature_dict=feature_dict,
item_freq=item_freq,
is_infer=False)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
from utils import logger
from utils import TaskMode
class Reader(object):
"""
Reader
"""
def __init__(self, feature_dict=None, window_size=20):
"""
init
@window_size: window_size
"""
self._feature_dict = feature_dict
self._window_size = window_size
def train(self, path):
"""
load train set
@path: train set path
"""
logger.info("start train reader from %s" % path)
mode = TaskMode.create_train()
return self._reader(path, mode)
def test(self, path):
"""
load test set
@path: test set path
"""
logger.info("start test reader from %s" % path)
mode = TaskMode.create_test()
return self._reader(path, mode)
def infer(self, path):
"""
load infer set
@path: infer set path
"""
logger.info("start infer reader from %s" % path)
mode = TaskMode.create_infer()
return self._reader(path, mode)
def infer_user(self, user_list):
"""
load user set to infer
@user_list: user list
"""
return self._reader_user(user_list)
def _reader(self, path, mode):
"""
parse data set
"""
USER_ID_UNK = self._feature_dict['user_id'].get('<unk>')
PROVINCE_UNK = self._feature_dict['province'].get('<unk>')
CITY_UNK = self._feature_dict['city'].get('<unk>')
ITEM_UNK = self._feature_dict['history_clicked_items'].get('<unk>')
CATEGORY_UNK = self._feature_dict['history_clicked_categories'].get(
'<unk>')
TAG_UNK = self._feature_dict['history_clicked_tags'].get('<unk>')
PHONE_UNK = self._feature_dict['phone'].get('<unk>')
with open(path) as f:
for line in f:
fields = line.strip('\n').split('\t')
user_id = self._feature_dict['user_id'].get(fields[0],
USER_ID_UNK)
province = self._feature_dict['province'].get(fields[1],
PROVINCE_UNK)
city = self._feature_dict['city'].get(fields[2], CITY_UNK)
item_infos = fields[3]
phone = self._feature_dict['phone'].get(fields[4], PHONE_UNK)
history_clicked_items_all = []
history_clicked_tags_all = []
history_clicked_categories_all = []
for item_info in item_infos.split(';'):
item_info_array = item_info.split(':')
item = item_info_array[0]
item_encoded_id = self._feature_dict['history_clicked_items'].get(\
item, ITEM_UNK)
if item_encoded_id != ITEM_UNK:
history_clicked_items_all.append(item_encoded_id)
category = item_info_array[1]
history_clicked_categories_all.append(
self._feature_dict['history_clicked_categories'].get(\
category, CATEGORY_UNK))
tags = item_info_array[2]
tag_split = map(str, [self._feature_dict['history_clicked_tags'].get(\
tag, TAG_UNK) \
for tag in tags.strip().split("_")])
history_clicked_tags_all.append("_".join(tag_split))
if not mode.is_infer():
history_clicked_items_all.insert(0, 0)
history_clicked_tags_all.insert(0, "0")
history_clicked_categories_all.insert(0, 0)
for i in range(1, len(history_clicked_items_all)):
start = max(0, i - self._window_size)
history_clicked_items = history_clicked_items_all[start:
i]
history_clicked_categories = history_clicked_categories_all[
start:i]
history_clicked_tags_str = history_clicked_tags_all[
start:i]
history_clicked_tags = []
for tags_a in history_clicked_tags_str:
for tag in tags_a.split("_"):
history_clicked_tags.append(int(tag))
target_item = history_clicked_items_all[i]
yield user_id, province, city, \
history_clicked_items, history_clicked_categories, \
history_clicked_tags, phone, target_item
else:
history_clicked_items = history_clicked_items_all
history_clicked_categories = history_clicked_categories_all
history_clicked_tags_str = history_clicked_tags_all
history_clicked_tags = []
for tags_a in history_clicked_tags_str:
for tag in tags_a.split("_"):
history_clicked_tags.append(int(tag))
yield user_id, province, city, \
history_clicked_items, history_clicked_categories, \
history_clicked_tags, phone
def _reader_user(self, user_list):
"""
parse user list
"""
USER_ID_UNK = self._feature_dict['user_id'].get('<unk>')
for user in user_list:
user_id = self._feature_dict['user_id'].get(user, USER_ID_UNK)
yield user_id, 0, 0, [0], [0], [0], 0
if __name__ == "__main__":
# this is to test and debug reader function
train_data = sys.argv[1]
feature_dict = sys.argv[2]
window_size = int(sys.argv[3])
import cPickle
with open(feature_dict) as f:
feature_dict = cPickle.load(f)
r = Reader(feature_dict, window_size)
for dat in r.train(train_data):
print dat
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import gzip
import paddle.v2 as paddle
import argparse
import cPickle
from reader import Reader
from network_conf import DNNmodel
from utils import logger
def parse_args():
"""
parse arguments
"""
parser = argparse.ArgumentParser(
description="PaddlePaddle Youtube Recall Model Example")
parser.add_argument(
'--train_set_path',
type=str,
required=True,
help="path of the train set")
parser.add_argument(
'--test_set_path', type=str, required=True, help="path of the test set")
parser.add_argument(
'--model_output_dir',
type=str,
required=True,
help="directory to output")
parser.add_argument(
'--feature_dict',
type=str,
required=True,
help="path of feature_dict.pkl")
parser.add_argument(
'--item_freq', type=str, required=True, help="path of item_freq.pkl ")
parser.add_argument(
'--window_size', type=int, default=20, help="window size(default: 20)")
parser.add_argument(
'--num_passes', type=int, default=1, help="number of passes to train")
parser.add_argument(
'--batch_size',
type=int,
default=50,
help="size of mini-batch (default:50)")
return parser.parse_args()
def train():
"""
train
"""
args = parse_args()
# check argument
assert os.path.exists(
args.train_set_path), 'The train_set_path path does not exist.'
assert os.path.exists(
args.test_set_path), 'The test_set_path path does not exist.'
assert os.path.exists(
args.feature_dict), 'The feature_dict path does not exist.'
assert os.path.exists(args.item_freq), 'The item_freq path does not exist.'
assert os.path.exists(
args.model_output_dir), 'The model_output_dir path does not exist.'
paddle.init(use_gpu=False, trainer_count=1)
with open(args.feature_dict) as f:
feature_dict = cPickle.load(f)
with open(args.item_freq) as f:
item_freq = cPickle.load(f)
feeding = {
'user_id': 0,
'province': 1,
'city': 2,
'history_clicked_items': 3,
'history_clicked_categories': 4,
'history_clicked_tags': 5,
'phone': 6,
'target_item': 7
}
optimizer = paddle.optimizer.AdaGrad(
learning_rate=1e-1,
regularization=paddle.optimizer.L2Regularization(rate=1e-3))
cost = DNNmodel(
dnn_layer_dims=[256, 31],
feature_dict=feature_dict,
item_freq=item_freq,
is_infer=False).model_cost
parameters = paddle.parameters.create(cost)
trainer = paddle.trainer.SGD(cost, parameters, optimizer)
def event_handler(event):
"""
event handler
"""
if isinstance(event, paddle.event.EndIteration):
if event.batch_id and not event.batch_id % 10:
logger.info("Pass %d, Batch %d, Cost %f" %
(event.pass_id, event.batch_id, event.cost))
elif isinstance(event, paddle.event.EndPass):
save_path = os.path.join(args.model_output_dir,
"model_pass_%05d.tar.gz" % event.pass_id)
logger.info("Save model into %s ..." % save_path)
with gzip.open(save_path, "w") as f:
trainer.save_parameter_to_tar(f)
reader = Reader(feature_dict, args.window_size)
trainer.train(
paddle.batch(
paddle.reader.shuffle(
lambda: reader.train(args.train_set_path), buf_size=7000),
args.batch_size),
num_passes=args.num_passes,
feeding=feeding,
event_handler=event_handler)
if __name__ == "__main__":
train()
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import gzip
import paddle.v2 as paddle
import argparse
import cPickle
from reader import Reader
from network_conf import DNNmodel
from utils import logger
import numpy as np
def parse_args():
"""
parse arguments
"""
parser = argparse.ArgumentParser(
description="PaddlePaddle Youtube Recall Model Example")
parser.add_argument(
'--infer_set_path',
type=str,
required=True,
help="path of the infer set")
parser.add_argument(
'--model_path', type=str, required=True, help="path of the model")
parser.add_argument(
'--feature_dict',
type=str,
required=True,
help="path of feature_dict.pkl")
parser.add_argument(
'--batch_size',
type=int,
default=50,
help="size of mini-batch (default:50)")
return parser.parse_args()
def user_vector():
"""
get user vectors
"""
args = parse_args()
# check argument
assert os.path.exists(
args.infer_set_path), 'The infer_set_path path does not exist.'
assert os.path.exists(
args.model_path), 'The model_path path does not exist.'
assert os.path.exists(
args.feature_dict), 'The feature_dict path does not exist.'
paddle.init(use_gpu=False, trainer_count=1)
with open(args.feature_dict) as f:
feature_dict = cPickle.load(f)
# load the trained model.
with gzip.open(args.model_path) as f:
parameters = paddle.parameters.Parameters.from_tar(f)
# build model
prediction_layer, fc = DNNmodel(
dnn_layer_dims=[256, 31], feature_dict=feature_dict,
is_infer=True).model_cost
inferer = paddle.inference.Inference(
output_layer=[prediction_layer, fc], parameters=parameters)
reader = Reader(feature_dict)
test_batch = []
for idx, item in enumerate(reader.infer(args.infer_set_path)):
test_batch.append(item)
if len(test_batch) == args.batch_size:
get_a_batch_user_vector(inferer, test_batch)
test_batch = []
if len(test_batch):
get_a_batch_user_vector(inferer, test_batch)
def get_a_batch_user_vector(inferer, test_batch):
"""
input a batch of data and get user vectors
"""
feeding = {
'user_id': 0,
'province': 1,
'city': 2,
'history_clicked_items': 3,
'history_clicked_categories': 4,
'history_clicked_tags': 5,
'phone': 6
}
probs = inferer.infer(
input=test_batch,
feeding=feeding,
field=["value"],
flatten_result=False)
for i, res in enumerate(zip(probs[1])):
# do simple lsh conversion
user_vector = [1.000]
for i in res[0]:
user_vector.append(i)
user_vector.append(0.000)
norm = np.linalg.norm(user_vector)
user_vector_norm = [str(_ / norm) for _ in user_vector]
print ",".join(user_vector_norm)
if __name__ == "__main__":
user_vector()
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
logging.basicConfig()
logger = logging.getLogger("paddle")
logger.setLevel(logging.INFO)
class TaskMode(object):
"""
TaskMode
"""
TRAIN_MODE = 0
TEST_MODE = 1
INFER_MODE = 2
def __init__(self, mode):
"""
:param mode:
"""
self.mode = mode
def is_train(self):
"""
:return:
"""
return self.mode == self.TRAIN_MODE
def is_test(self):
"""
:return:
"""
return self.mode == self.TEST_MODE
def is_infer(self):
"""
:return:
"""
return self.mode == self.INFER_MODE
@staticmethod
def create_train():
"""
:return:
"""
return TaskMode(TaskMode.TRAIN_MODE)
@staticmethod
def create_test():
"""
:return:
"""
return TaskMode(TaskMode.TEST_MODE)
@staticmethod
def create_infer():
"""
:return:
"""
return TaskMode(TaskMode.INFER_MODE)
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册