提交 c8b0ae8f 编写于 作者: Y Yibing Liu

adapt to the new data provider

...@@ -2,6 +2,8 @@ language: cpp ...@@ -2,6 +2,8 @@ language: cpp
cache: ccache cache: ccache
sudo: required sudo: required
dist: trusty dist: trusty
services:
- docker
os: os:
- linux - linux
env: env:
...@@ -16,8 +18,12 @@ addons: ...@@ -16,8 +18,12 @@ addons:
- python2.7-dev - python2.7-dev
before_install: before_install:
- pip install -U virtualenv pre-commit pip - pip install -U virtualenv pre-commit pip
- docker pull paddlepaddle/paddle:latest
script: script:
- .travis/precommit.sh - .travis/precommit.sh
- docker run -i --rm -v "$PWD:/py_unittest" paddlepaddle/paddle:latest /bin/bash -c
'cd /py_unittest; sh .travis/unittest.sh'
notifications: notifications:
email: email:
on_success: change on_success: change
......
#!/bin/bash
abort(){
echo "Run unittest failed" 1>&2
echo "Please check your code" 1>&2
exit 1
}
unittest(){
cd $1 > /dev/null
if [ -f "requirements.txt" ]; then
pip install -r requirements.txt
fi
if [ $? != 0 ]; then
exit 1
fi
find . -name 'tests' -type d -print0 | \
xargs -0 -I{} -n1 bash -c \
'python -m unittest discover -v -s {}'
cd - > /dev/null
}
trap 'abort' 0
set -e
for proj in */ ; do
if [ -d $proj ]; then
unittest $proj
if [ $? != 0 ]; then
exit 1
fi
fi
done
trap : 0
...@@ -14,46 +14,57 @@ PaddlePaddle提供了丰富的运算单元,帮助大家以模块化的方式 ...@@ -14,46 +14,57 @@ PaddlePaddle提供了丰富的运算单元,帮助大家以模块化的方式
在词向量的例子中,我们向大家展示如何使用Hierarchical-Sigmoid 和噪声对比估计(Noise Contrastive Estimation,NCE)来加速词向量的学习。 在词向量的例子中,我们向大家展示如何使用Hierarchical-Sigmoid 和噪声对比估计(Noise Contrastive Estimation,NCE)来加速词向量的学习。
- 1.1 [Hsigmoid加速词向量训练](https://github.com/PaddlePaddle/models/tree/develop/word_embedding) - 1.1 [Hsigmoid加速词向量训练](https://github.com/PaddlePaddle/models/tree/develop/word_embedding)
- 1.2 [噪声对比估计加速词向量训练](https://github.com/PaddlePaddle/models/tree/develop/nce_cost)
## 2. 点击率预估
## 2. 语言模型
语言模型是自然语言处理领域里一个重要的基础模型,它是一个概率分布模型,利用它可以确定哪个词序列的可能性更大,或者给定若干个词,可以预测下一个最可能出现的词。语言模型被应用在很多领域,如:自动写作、QA、机器翻译、拼写检查、语音识别、词性标注等。
在语言模型的例子中,我们以文本生成为例,提供了RNN LM(包括LSTM、GRU)和N-Gram LM,供大家学习和使用。用户可以通过文档中的 “使用说明” 快速上手:适配训练语料,以训练 “自动写诗”、“自动写散文” 等有趣的模型。
- 2.1 [基于LSTM、GRU、N-Gram的文本生成模型](https://github.com/PaddlePaddle/models/tree/develop/language_model)
## 3. 点击率预估
点击率预估模型预判用户对一条广告点击的概率,对每次广告的点击情况做出预测,是广告技术的核心算法之一。逻谛斯克回归对大规模稀疏特征有着很好的学习能力,在点击率预估任务发展的早期一统天下。近年来,DNN 模型由于其强大的学习能力逐渐接过点击率预估任务的大旗。 点击率预估模型预判用户对一条广告点击的概率,对每次广告的点击情况做出预测,是广告技术的核心算法之一。逻谛斯克回归对大规模稀疏特征有着很好的学习能力,在点击率预估任务发展的早期一统天下。近年来,DNN 模型由于其强大的学习能力逐渐接过点击率预估任务的大旗。
在点击率预估的例子中,我们给出谷歌提出的 Wide & Deep 模型。这一模型融合了适用于学习抽象特征的 DNN 和适用于大规模稀疏特征的逻谛斯克回归两者模型的优点,可以作为一种相对成熟的模型框架使用, 在工业界也有一定的应用。 在点击率预估的例子中,我们给出谷歌提出的 Wide & Deep 模型。这一模型融合了适用于学习抽象特征的 DNN 和适用于大规模稀疏特征的逻谛斯克回归两者模型的优点,可以作为一种相对成熟的模型框架使用, 在工业界也有一定的应用。
- 2.1 [Wide & deep 点击率预估模型](https://github.com/PaddlePaddle/models/tree/develop/ctr) - 3.1 [Wide & deep 点击率预估模型](https://github.com/PaddlePaddle/models/tree/develop/ctr)
## 3. 文本分类 ## 4. 文本分类
文本分类是自然语言处理领域最基础的任务之一,深度学习方法能够免除复杂的特征工程,直接使用原始文本作为输入,数据驱动地最优化分类准确率。 文本分类是自然语言处理领域最基础的任务之一,深度学习方法能够免除复杂的特征工程,直接使用原始文本作为输入,数据驱动地最优化分类准确率。
在文本分类的例子中,我们以情感分类任务为例,提供了基于DNN的非序列文本分类模型,以及基于CNN的序列模型供大家学习和使用(基于LSTM的模型见PaddleBook中[情感分类](https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/README.cn.md)一课)。 在文本分类的例子中,我们以情感分类任务为例,提供了基于DNN的非序列文本分类模型,以及基于CNN的序列模型供大家学习和使用(基于LSTM的模型见PaddleBook中[情感分类](https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/README.cn.md)一课)。
- 3.1 [基于 DNN / CNN 的情感分类](https://github.com/PaddlePaddle/models/tree/develop/text_classification) - 4.1 [基于 DNN / CNN 的情感分类](https://github.com/PaddlePaddle/models/tree/develop/text_classification)
## 4. 排序学习 ## 5. 排序学习
排序学习(Learning to Rank, LTR)是信息检索和搜索引擎研究的核心问题之一,通过机器学习方法学习一个分值函数对待排序的候选进行打分,再根据分值的高低确定序关系。深度神经网络可以用来建模分值函数,构成各类基于深度学习的LTR模型。 排序学习(Learning to Rank, LTR)是信息检索和搜索引擎研究的核心问题之一,通过机器学习方法学习一个分值函数对待排序的候选进行打分,再根据分值的高低确定序关系。深度神经网络可以用来建模分值函数,构成各类基于深度学习的LTR模型。
在排序学习的例子中,我们介绍基于 RankLoss 损失函数的 Pairwise 排序模型和基于LambdaRank损失函数的Listwise排序模型(Pointwise学习策略见PaddleBook中[推荐系统](https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/README.cn.md)一课)。 在排序学习的例子中,我们介绍基于 RankLoss 损失函数的 Pairwise 排序模型和基于LambdaRank损失函数的Listwise排序模型(Pointwise学习策略见PaddleBook中[推荐系统](https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/README.cn.md)一课)。
- 4.1 [基于 Pairwise 和 Listwise 的排序学习](https://github.com/PaddlePaddle/models/tree/develop/ltr) - 5.1 [基于 Pairwise 和 Listwise 的排序学习](https://github.com/PaddlePaddle/models/tree/develop/ltr)
## 5. 序列标注 ## 6. 序列标注
给定输入序列,序列标注模型为序列中每一个元素贴上一个类别标签,是自然语言处理领域最基础的任务之一。随着深度学习的不断探索和发展,利用循环神经网络学习输入序列的特征表示,条件随机场(Conditional Random Field, CRF)在特征基础上完成序列标注任务,逐渐成为解决序列标注问题的标配解决方案。 给定输入序列,序列标注模型为序列中每一个元素贴上一个类别标签,是自然语言处理领域最基础的任务之一。随着深度学习的不断探索和发展,利用循环神经网络学习输入序列的特征表示,条件随机场(Conditional Random Field, CRF)在特征基础上完成序列标注任务,逐渐成为解决序列标注问题的标配解决方案。
在序列标注的例子中,我们以命名实体识别(Named Entity Recognition,NER)任务为例,介绍如何训练一个端到端的序列标注模型。 在序列标注的例子中,我们以命名实体识别(Named Entity Recognition,NER)任务为例,介绍如何训练一个端到端的序列标注模型。
- 5.1 [命名实体识别](https://github.com/PaddlePaddle/models/tree/develop/sequence_tagging_for_ner) - 6.1 [命名实体识别](https://github.com/PaddlePaddle/models/tree/develop/sequence_tagging_for_ner)
## 6. 序列到序列学习 ## 7. 序列到序列学习
序列到序列学习实现两个甚至是多个不定长模型之间的映射,有着广泛的应用,包括:机器翻译、智能对话与问答、广告创意语料生成、自动编码(如金融画像编码)、判断多个文本串之间的语义相关性等。 序列到序列学习实现两个甚至是多个不定长模型之间的映射,有着广泛的应用,包括:机器翻译、智能对话与问答、广告创意语料生成、自动编码(如金融画像编码)、判断多个文本串之间的语义相关性等。
在序列到序列学习的例子中,我们以机器翻译任务为例,提供了多种改进模型,供大家学习和使用。包括:不带注意力机制的序列到序列映射模型,这一模型是所有序列到序列学习模型的基础;使用 scheduled sampling 改善 RNN 模型在生成任务中的错误累积问题;带外部记忆机制的神经机器翻译,通过增强神经网络的记忆能力,来完成复杂的序列到序列学习任务。 在序列到序列学习的例子中,我们以机器翻译任务为例,提供了多种改进模型,供大家学习和使用。包括:不带注意力机制的序列到序列映射模型,这一模型是所有序列到序列学习模型的基础;使用 scheduled sampling 改善 RNN 模型在生成任务中的错误累积问题;带外部记忆机制的神经机器翻译,通过增强神经网络的记忆能力,来完成复杂的序列到序列学习任务。
- 6.1 [无注意力机制的编码器解码器模型](https://github.com/PaddlePaddle/models/tree/develop/nmt_without_attention) - 7.1 [无注意力机制的编码器解码器模型](https://github.com/PaddlePaddle/models/tree/develop/nmt_without_attention)
## Copyright and License ## Copyright and License
PaddlePaddle is provided under the [Apache-2.0 license](LICENSE). PaddlePaddle is provided under the [Apache-2.0 license](LICENSE).
...@@ -16,34 +16,48 @@ For some machines, we also need to install libsndfile1. Details to be added. ...@@ -16,34 +16,48 @@ For some machines, we also need to install libsndfile1. Details to be added.
### Preparing Data ### Preparing Data
``` ```
cd data cd datasets
python librispeech.py sh run_all.sh
cat manifest.libri.train-* > manifest.libri.train-all
cd .. cd ..
``` ```
After running librispeech.py, we have several "manifest" json files named with a prefix `manifest.libri.`. A manifest file summarizes a speech data set, with each line containing the meta data (i.e. audio filepath, transcription text, audio duration) of each audio file within the data set, in json format. `sh run_all.sh` prepares all ASR datasets (currently, only LibriSpeech available). After running, we have several summarization manifest files in json-format.
By `cat manifest.libri.train-* > manifest.libri.train-all`, we simply merge the three seperate sample sets of LibriSpeech (train-clean-100, train-clean-360, train-other-500) into one training set. This is a simple way for merging different data sets. A manifest file summarizes a speech data set, with each line containing the meta data (i.e. audio filepath, transcript text, audio duration) of each audio file within the data set, in json format. Manifest file serves as an interface informing our system of where and what to read the speech samples.
More help for arguments:
```
python datasets/librispeech/librispeech.py --help
```
### Preparing for Training
```
python compute_mean_std.py
```
`python compute_mean_std.py` computes mean and stdandard deviation for audio features, and save them to a file with a default name `./mean_std.npz`. This file will be used in both training and inferencing.
More help for arguments: More help for arguments:
``` ```
python librispeech.py --help python compute_mean_std.py --help
``` ```
### Traininig ### Training
For GPU Training: For GPU Training:
``` ```
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --trainer_count 4 --train_manifest_path ./data/manifest.libri.train-all CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --trainer_count 4
``` ```
For CPU Training: For CPU Training:
``` ```
python train.py --trainer_count 8 --use_gpu False -- train_manifest_path ./data/manifest.libri.train-all python train.py --trainer_count 8 --use_gpu False
``` ```
More help for arguments: More help for arguments:
...@@ -55,7 +69,7 @@ python train.py --help ...@@ -55,7 +69,7 @@ python train.py --help
### Inferencing ### Inferencing
``` ```
python infer.py CUDA_VISIBLE_DEVICES=0 python infer.py
``` ```
More help for arguments: More help for arguments:
......
"""
Providing basic audio data preprocessing pipeline, and offering
both instance-level and batch-level data reader interfaces.
"""
import paddle.v2 as paddle
import logging
import json
import random
import soundfile
import numpy as np
import os
RANDOM_SEED = 0
logger = logging.getLogger(__name__)
class DataGenerator(object):
"""
DataGenerator provides basic audio data preprocessing pipeline, and offers
both instance-level and batch-level data reader interfaces.
Normalized FFT are used as audio features here.
:param vocab_filepath: Vocabulary file path for indexing tokenized
transcriptions.
:type vocab_filepath: basestring
:param normalizer_manifest_path: Manifest filepath for collecting feature
normalization statistics, e.g. mean, std.
:type normalizer_manifest_path: basestring
:param normalizer_num_samples: Number of instances sampled for collecting
feature normalization statistics.
Default is 100.
:type normalizer_num_samples: int
:param max_duration: Audio clips with duration (in seconds) greater than
this will be discarded. Default is 20.0.
:type max_duration: float
:param min_duration: Audio clips with duration (in seconds) smaller than
this will be discarded. Default is 0.0.
:type min_duration: float
:param stride_ms: Striding size (in milliseconds) for generating frames.
Default is 10.0.
:type stride_ms: float
:param window_ms: Window size (in milliseconds) for frames. Default is 20.0.
:type window_ms: float
:param max_frequency: Maximun frequency for FFT features. FFT features of
frequency larger than this will be discarded.
If set None, all features will be kept.
Default is None.
:type max_frequency: float
"""
def __init__(self,
vocab_filepath,
normalizer_manifest_path,
normalizer_num_samples=100,
max_duration=20.0,
min_duration=0.0,
stride_ms=10.0,
window_ms=20.0,
max_frequency=None):
self.__max_duration__ = max_duration
self.__min_duration__ = min_duration
self.__stride_ms__ = stride_ms
self.__window_ms__ = window_ms
self.__max_frequency__ = max_frequency
self.__random__ = random.Random(RANDOM_SEED)
# load vocabulary (dictionary)
self.__vocab_dict__, self.__vocab_list__ = \
self.__load_vocabulary_from_file__(vocab_filepath)
# collect normalizer statistics
self.__mean__, self.__std__ = self.__collect_normalizer_statistics__(
manifest_path=normalizer_manifest_path,
num_samples=normalizer_num_samples)
def __audio_featurize__(self, audio_filename):
"""
Preprocess audio data, including feature extraction, normalization etc..
"""
features = self.__audio_basic_featurize__(audio_filename)
return self.__normalize__(features)
def __text_featurize__(self, text):
"""
Preprocess text data, including tokenizing and token indexing etc..
"""
return self.__convert_text_to_char_index__(
text=text, vocabulary=self.__vocab_dict__)
def __audio_basic_featurize__(self, audio_filename):
"""
Compute basic (without normalization etc.) features for audio data.
"""
return self.__spectrogram_from_file__(
filename=audio_filename,
stride_ms=self.__stride_ms__,
window_ms=self.__window_ms__,
max_freq=self.__max_frequency__)
def __collect_normalizer_statistics__(self, manifest_path, num_samples=100):
"""
Compute feature normalization statistics, i.e. mean and stddev.
"""
# read manifest
manifest = self.__read_manifest__(
manifest_path=manifest_path,
max_duration=self.__max_duration__,
min_duration=self.__min_duration__)
# sample for statistics
sampled_manifest = self.__random__.sample(manifest, num_samples)
# extract spectrogram feature
features = []
for instance in sampled_manifest:
spectrogram = self.__audio_basic_featurize__(
instance["audio_filepath"])
features.append(spectrogram)
features = np.hstack(features)
mean = np.mean(features, axis=1).reshape([-1, 1])
std = np.std(features, axis=1).reshape([-1, 1])
return mean, std
def __normalize__(self, features, eps=1e-14):
"""
Normalize features to be of zero mean and unit stddev.
"""
return (features - self.__mean__) / (self.__std__ + eps)
def __spectrogram_from_file__(self,
filename,
stride_ms=10.0,
window_ms=20.0,
max_freq=None,
eps=1e-14):
"""
Laod audio data and calculate the log of spectrogram by FFT.
Refer to utils.py in https://github.com/baidu-research/ba-dls-deepspeech
"""
audio, sample_rate = soundfile.read(filename)
if audio.ndim >= 2:
audio = np.mean(audio, 1)
if max_freq is None:
max_freq = sample_rate / 2
if max_freq > sample_rate / 2:
raise ValueError("max_freq must be greater than half of "
"sample rate.")
if stride_ms > window_ms:
raise ValueError("Stride size must not be greater than "
"window size.")
stride_size = int(0.001 * sample_rate * stride_ms)
window_size = int(0.001 * sample_rate * window_ms)
spectrogram, freqs = self.__extract_spectrogram__(
audio,
window_size=window_size,
stride_size=stride_size,
sample_rate=sample_rate)
ind = np.where(freqs <= max_freq)[0][-1] + 1
return np.log(spectrogram[:ind, :] + eps)
def __extract_spectrogram__(self, samples, window_size, stride_size,
sample_rate):
"""
Compute the spectrogram by FFT for a discrete real signal.
Refer to utils.py in https://github.com/baidu-research/ba-dls-deepspeech
"""
# extract strided windows
truncate_size = (len(samples) - window_size) % stride_size
samples = samples[:len(samples) - truncate_size]
nshape = (window_size, (len(samples) - window_size) // stride_size + 1)
nstrides = (samples.strides[0], samples.strides[0] * stride_size)
windows = np.lib.stride_tricks.as_strided(
samples, shape=nshape, strides=nstrides)
assert np.all(
windows[:, 1] == samples[stride_size:(stride_size + window_size)])
# window weighting, squared Fast Fourier Transform (fft), scaling
weighting = np.hanning(window_size)[:, None]
fft = np.fft.rfft(windows * weighting, axis=0)
fft = np.absolute(fft)**2
scale = np.sum(weighting**2) * sample_rate
fft[1:-1, :] *= (2.0 / scale)
fft[(0, -1), :] /= scale
# prepare fft frequency list
freqs = float(sample_rate) / window_size * np.arange(fft.shape[0])
return fft, freqs
def __load_vocabulary_from_file__(self, vocabulary_path):
"""
Load vocabulary from file.
"""
if not os.path.exists(vocabulary_path):
raise ValueError("Vocabulary file %s not found.", vocabulary_path)
vocab_lines = []
with open(vocabulary_path, 'r') as file:
vocab_lines.extend(file.readlines())
vocab_list = [line[:-1] for line in vocab_lines]
vocab_dict = dict(
[(token, id) for (id, token) in enumerate(vocab_list)])
return vocab_dict, vocab_list
def __convert_text_to_char_index__(self, text, vocabulary):
"""
Convert text string to a list of character index integers.
"""
return [vocabulary[w] for w in text]
def __read_manifest__(self, manifest_path, max_duration, min_duration):
"""
Load and parse manifest file.
"""
manifest = []
for json_line in open(manifest_path):
try:
json_data = json.loads(json_line)
except Exception as e:
raise ValueError("Error reading manifest: %s" % str(e))
if (json_data["duration"] <= max_duration and
json_data["duration"] >= min_duration):
manifest.append(json_data)
return manifest
def __padding_batch__(self, batch, padding_to=-1, flatten=False):
"""
Padding audio part of features (only in the time axis -- column axis)
with zeros, to make each instance in the batch share the same
audio feature shape.
If `padding_to` is set -1, the maximun column numbers in the batch will
be used as the target size. Otherwise, `padding_to` will be the target
size. Default is -1.
If `flatten` is set True, audio data will be flatten to be a 1-dim
ndarray. Default is False.
"""
new_batch = []
# get target shape
max_length = max([audio.shape[1] for audio, text in batch])
if padding_to != -1:
if padding_to < max_length:
raise ValueError("If padding_to is not -1, it should be greater"
" or equal to the original instance length.")
max_length = padding_to
# padding
for audio, text in batch:
padded_audio = np.zeros([audio.shape[0], max_length])
padded_audio[:, :audio.shape[1]] = audio
if flatten:
padded_audio = padded_audio.flatten()
new_batch.append((padded_audio, text))
return new_batch
def instance_reader_creator(self,
manifest_path,
sort_by_duration=True,
shuffle=False):
"""
Instance reader creator for audio data. Creat a callable function to
produce instances of data.
Instance: a tuple of a numpy ndarray of audio spectrogram and a list of
tokenized and indexed transcription text.
:param manifest_path: Filepath of manifest for audio clip files.
:type manifest_path: basestring
:param sort_by_duration: Sort the audio clips by duration if set True
(for SortaGrad).
:type sort_by_duration: bool
:param shuffle: Shuffle the audio clips if set True.
:type shuffle: bool
:return: Data reader function.
:rtype: callable
"""
if sort_by_duration and shuffle:
sort_by_duration = False
logger.warn("When shuffle set to true, "
"sort_by_duration is forced to set False.")
def reader():
# read manifest
manifest = self.__read_manifest__(
manifest_path=manifest_path,
max_duration=self.__max_duration__,
min_duration=self.__min_duration__)
# sort (by duration) or shuffle manifest
if sort_by_duration:
manifest.sort(key=lambda x: x["duration"])
if shuffle:
self.__random__.shuffle(manifest)
# extract spectrogram feature
for instance in manifest:
spectrogram = self.__audio_featurize__(
instance["audio_filepath"])
transcript = self.__text_featurize__(instance["text"])
yield (spectrogram, transcript)
return reader
def batch_reader_creator(self,
manifest_path,
batch_size,
padding_to=-1,
flatten=False,
sort_by_duration=True,
shuffle=False):
"""
Batch data reader creator for audio data. Creat a callable function to
produce batches of data.
Audio features will be padded with zeros to make each instance in the
batch to share the same audio feature shape.
:param manifest_path: Filepath of manifest for audio clip files.
:type manifest_path: basestring
:param batch_size: Instance number in a batch.
:type batch_size: int
:param padding_to: If set -1, the maximun column numbers in the batch
will be used as the target size for padding.
Otherwise, `padding_to` will be the target size.
Default is -1.
:type padding_to: int
:param flatten: If set True, audio data will be flatten to be a 1-dim
ndarray. Otherwise, 2-dim ndarray. Default is False.
:type flatten: bool
:param sort_by_duration: Sort the audio clips by duration if set True
(for SortaGrad).
:type sort_by_duration: bool
:param shuffle: Shuffle the audio clips if set True.
:type shuffle: bool
:return: Batch reader function, producing batches of data when called.
:rtype: callable
"""
def batch_reader():
instance_reader = self.instance_reader_creator(
manifest_path=manifest_path,
sort_by_duration=sort_by_duration,
shuffle=shuffle)
batch = []
for instance in instance_reader():
batch.append(instance)
if len(batch) == batch_size:
yield self.__padding_batch__(batch, padding_to, flatten)
batch = []
if len(batch) > 0:
yield self.__padding_batch__(batch, padding_to, flatten)
return batch_reader
def vocabulary_size(self):
"""
Get vocabulary size.
:return: Vocabulary size.
:rtype: int
"""
return len(self.__vocab_list__)
def vocabulary_dict(self):
"""
Get vocabulary in dict.
:return: Vocabulary in dict.
:rtype: dict
"""
return self.__vocab_dict__
def vocabulary_list(self):
"""
Get vocabulary in list.
:return: Vocabulary in list
:rtype: list
"""
return self.__vocab_list__
def data_name_feeding(self):
"""
Get feeddings (data field name and corresponding field id).
:return: Feeding dict.
:rtype: dict
"""
feeding = {
"audio_spectrogram": 0,
"transcript_text": 1,
}
return feeding
"""Compute mean and std for feature normalizer, and save to file."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
from data_utils.normalizer import FeatureNormalizer
from data_utils.augmentor.augmentation import AugmentationPipeline
from data_utils.featurizer.audio_featurizer import AudioFeaturizer
parser = argparse.ArgumentParser(
description='Computing mean and stddev for feature normalizer.')
parser.add_argument(
"--manifest_path",
default='datasets/manifest.train',
type=str,
help="Manifest path for computing normalizer's mean and stddev."
"(default: %(default)s)")
parser.add_argument(
"--num_samples",
default=2000,
type=int,
help="Number of samples for computing mean and stddev. "
"(default: %(default)s)")
parser.add_argument(
"--augmentation_config",
default='{}',
type=str,
help="Augmentation configuration in json-format. "
"(default: %(default)s)")
parser.add_argument(
"--output_file",
default='mean_std.npz',
type=str,
help="Filepath to write mean and std to (.npz)."
"(default: %(default)s)")
args = parser.parse_args()
def main():
augmentation_pipeline = AugmentationPipeline(args.augmentation_config)
audio_featurizer = AudioFeaturizer()
def augment_and_featurize(audio_segment):
augmentation_pipeline.transform_audio(audio_segment)
return audio_featurizer.featurize(audio_segment)
normalizer = FeatureNormalizer(
mean_std_filepath=None,
manifest_path=args.manifest_path,
featurize_func=augment_and_featurize,
num_samples=args.num_samples)
normalizer.write_to_file(args.output_file)
if __name__ == '__main__':
main()
"""Contains the audio segment class."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import io
import soundfile
class AudioSegment(object):
"""Monaural audio segment abstraction.
:param samples: Audio samples [num_samples x num_channels].
:type samples: ndarray.float32
:param sample_rate: Audio sample rate.
:type sample_rate: int
:raises TypeError: If the sample data type is not float or int.
"""
def __init__(self, samples, sample_rate):
"""Create audio segment from samples.
Samples are convert float32 internally, with int scaled to [-1, 1].
"""
self._samples = self._convert_samples_to_float32(samples)
self._sample_rate = sample_rate
if self._samples.ndim >= 2:
self._samples = np.mean(self._samples, 1)
def __eq__(self, other):
"""Return whether two objects are equal."""
if type(other) is not type(self):
return False
if self._sample_rate != other._sample_rate:
return False
if self._samples.shape != other._samples.shape:
return False
if np.any(self.samples != other._samples):
return False
return True
def __ne__(self, other):
"""Return whether two objects are unequal."""
return not self.__eq__(other)
def __str__(self):
"""Return human-readable representation of segment."""
return ("%s: num_samples=%d, sample_rate=%d, duration=%.2fsec, "
"rms=%.2fdB" % (type(self), self.num_samples, self.sample_rate,
self.duration, self.rms_db))
@classmethod
def from_file(cls, file):
"""Create audio segment from audio file.
:param filepath: Filepath or file object to audio file.
:type filepath: basestring|file
:return: Audio segment instance.
:rtype: AudioSegment
"""
samples, sample_rate = soundfile.read(file, dtype='float32')
return cls(samples, sample_rate)
@classmethod
def from_bytes(cls, bytes):
"""Create audio segment from a byte string containing audio samples.
:param bytes: Byte string containing audio samples.
:type bytes: str
:return: Audio segment instance.
:rtype: AudioSegment
"""
samples, sample_rate = soundfile.read(
io.BytesIO(bytes), dtype='float32')
return cls(samples, sample_rate)
def to_wav_file(self, filepath, dtype='float32'):
"""Save audio segment to disk as wav file.
:param filepath: WAV filepath or file object to save the
audio segment.
:type filepath: basestring|file
:param dtype: Subtype for audio file. Options: 'int16', 'int32',
'float32', 'float64'. Default is 'float32'.
:type dtype: str
:raises TypeError: If dtype is not supported.
"""
samples = self._convert_samples_from_float32(self._samples, dtype)
subtype_map = {
'int16': 'PCM_16',
'int32': 'PCM_32',
'float32': 'FLOAT',
'float64': 'DOUBLE'
}
soundfile.write(
filepath,
samples,
self._sample_rate,
format='WAV',
subtype=subtype_map[dtype])
def to_bytes(self, dtype='float32'):
"""Create a byte string containing the audio content.
:param dtype: Data type for export samples. Options: 'int16', 'int32',
'float32', 'float64'. Default is 'float32'.
:type dtype: str
:return: Byte string containing audio content.
:rtype: str
"""
samples = self._convert_samples_from_float32(self._samples, dtype)
return samples.tostring()
def apply_gain(self, gain):
"""Apply gain in decibels to samples.
Note that this is an in-place transformation.
:param gain: Gain in decibels to apply to samples.
:type gain: float
"""
self._samples *= 10.**(gain / 20.)
def change_speed(self, speed_rate):
"""Change the audio speed by linear interpolation.
Note that this is an in-place transformation.
:param speed_rate: Rate of speed change:
speed_rate > 1.0, speed up the audio;
speed_rate = 1.0, unchanged;
speed_rate < 1.0, slow down the audio;
speed_rate <= 0.0, not allowed, raise ValueError.
:type speed_rate: float
:raises ValueError: If speed_rate <= 0.0.
"""
if speed_rate <= 0:
raise ValueError("speed_rate should be greater than zero.")
old_length = self._samples.shape[0]
new_length = int(old_length / speed_rate)
old_indices = np.arange(old_length)
new_indices = np.linspace(start=0, stop=old_length, num=new_length)
self._samples = np.interp(new_indices, old_indices, self._samples)
def normalize(self, target_sample_rate):
raise NotImplementedError()
def resample(self, target_sample_rate):
raise NotImplementedError()
def pad_silence(self, duration, sides='both'):
raise NotImplementedError()
def subsegment(self, start_sec=None, end_sec=None):
raise NotImplementedError()
def convolve(self, filter, allow_resample=False):
raise NotImplementedError()
def convolve_and_normalize(self, filter, allow_resample=False):
raise NotImplementedError()
@property
def samples(self):
"""Return audio samples.
:return: Audio samples.
:rtype: ndarray
"""
return self._samples.copy()
@property
def sample_rate(self):
"""Return audio sample rate.
:return: Audio sample rate.
:rtype: int
"""
return self._sample_rate
@property
def num_samples(self):
"""Return number of samples.
:return: Number of samples.
:rtype: int
"""
return self._samples.shape(0)
@property
def duration(self):
"""Return audio duration.
:return: Audio duration in seconds.
:rtype: float
"""
return self._samples.shape[0] / float(self._sample_rate)
@property
def rms_db(self):
"""Return root mean square energy of the audio in decibels.
:return: Root mean square energy in decibels.
:rtype: float
"""
# square root => multiply by 10 instead of 20 for dBs
mean_square = np.mean(self._samples**2)
return 10 * np.log10(mean_square)
def _convert_samples_to_float32(self, samples):
"""Convert sample type to float32.
Audio sample type is usually integer or float-point.
Integers will be scaled to [-1, 1] in float32.
"""
float32_samples = samples.astype('float32')
if samples.dtype in np.sctypes['int']:
bits = np.iinfo(samples.dtype).bits
float32_samples *= (1. / 2**(bits - 1))
elif samples.dtype in np.sctypes['float']:
pass
else:
raise TypeError("Unsupported sample type: %s." % samples.dtype)
return float32_samples
def _convert_samples_from_float32(self, samples, dtype):
"""Convert sample type from float32 to dtype.
Audio sample type is usually integer or float-point. For integer
type, float32 will be rescaled from [-1, 1] to the maximum range
supported by the integer type.
This is for writing a audio file.
"""
dtype = np.dtype(dtype)
output_samples = samples.copy()
if dtype in np.sctypes['int']:
bits = np.iinfo(dtype).bits
output_samples *= (2**(bits - 1) / 1.)
min_val = np.iinfo(dtype).min
max_val = np.iinfo(dtype).max
output_samples[output_samples > max_val] = max_val
output_samples[output_samples < min_val] = min_val
elif samples.dtype in np.sctypes['float']:
min_val = np.finfo(dtype).min
max_val = np.finfo(dtype).max
output_samples[output_samples > max_val] = max_val
output_samples[output_samples < min_val] = min_val
else:
raise TypeError("Unsupported sample type: %s." % samples.dtype)
return output_samples.astype(dtype)
"""Contains the data augmentation pipeline."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import json
import random
from data_utils.augmentor.volume_perturb import VolumePerturbAugmentor
class AugmentationPipeline(object):
"""Build a pre-processing pipeline with various augmentation models.Such a
data augmentation pipeline is oftern leveraged to augment the training
samples to make the model invariant to certain types of perturbations in the
real world, improving model's generalization ability.
The pipeline is built according the the augmentation configuration in json
string, e.g.
.. code-block::
'[{"type": "volume",
"params": {"min_gain_dBFS": -15,
"max_gain_dBFS": 15},
"prob": 0.5},
{"type": "speed",
"params": {"min_speed_rate": 0.8,
"max_speed_rate": 1.2},
"prob": 0.5}
]'
This augmentation configuration inserts two augmentation models
into the pipeline, with one is VolumePerturbAugmentor and the other
SpeedPerturbAugmentor. "prob" indicates the probability of the current
augmentor to take effect.
:param augmentation_config: Augmentation configuration in json string.
:type augmentation_config: str
:param random_seed: Random seed.
:type random_seed: int
:raises ValueError: If the augmentation json config is in incorrect format".
"""
def __init__(self, augmentation_config, random_seed=0):
self._rng = random.Random(random_seed)
self._augmentors, self._rates = self._parse_pipeline_from(
augmentation_config)
def transform_audio(self, audio_segment):
"""Run the pre-processing pipeline for data augmentation.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to process.
:type audio_segment: AudioSegmenet|SpeechSegment
"""
for augmentor, rate in zip(self._augmentors, self._rates):
if self._rng.uniform(0., 1.) <= rate:
augmentor.transform_audio(audio_segment)
def _parse_pipeline_from(self, config_json):
"""Parse the config json to build a augmentation pipelien."""
try:
configs = json.loads(config_json)
augmentors = [
self._get_augmentor(config["type"], config["params"])
for config in configs
]
rates = [config["prob"] for config in configs]
except Exception as e:
raise ValueError("Failed to parse the augmentation config json: "
"%s" % str(e))
return augmentors, rates
def _get_augmentor(self, augmentor_type, params):
"""Return an augmentation model by the type name, and pass in params."""
if augmentor_type == "volume":
return VolumePerturbAugmentor(self._rng, **params)
else:
raise ValueError("Unknown augmentor type [%s]." % augmentor_type)
"""Contains the abstract base class for augmentation models."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from abc import ABCMeta, abstractmethod
class AugmentorBase(object):
"""Abstract base class for augmentation model (augmentor) class.
All augmentor classes should inherit from this class, and implement the
following abstract methods.
"""
__metaclass__ = ABCMeta
@abstractmethod
def __init__(self):
pass
@abstractmethod
def transform_audio(self, audio_segment):
"""Adds various effects to the input audio segment. Such effects
will augment the training data to make the model invariant to certain
types of perturbations in the real world, improving model's
generalization ability.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to add effects to.
:type audio_segment: AudioSegmenet|SpeechSegment
"""
pass
"""Contains the volume perturb augmentation model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from data_utils.augmentor.base import AugmentorBase
class VolumePerturbAugmentor(AugmentorBase):
"""Augmentation model for adding random volume perturbation.
This is used for multi-loudness training of PCEN. See
https://arxiv.org/pdf/1607.05666v1.pdf
for more details.
:param rng: Random generator object.
:type rng: random.Random
:param min_gain_dBFS: Minimal gain in dBFS.
:type min_gain_dBFS: float
:param max_gain_dBFS: Maximal gain in dBFS.
:type max_gain_dBFS: float
"""
def __init__(self, rng, min_gain_dBFS, max_gain_dBFS):
self._min_gain_dBFS = min_gain_dBFS
self._max_gain_dBFS = max_gain_dBFS
self._rng = rng
def transform_audio(self, audio_segment):
"""Change audio loadness.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to add effects to.
:type audio_segment: AudioSegmenet|SpeechSegment
"""
gain = self._rng.uniform(min_gain_dBFS, max_gain_dBFS)
audio_segment.apply_gain(gain)
"""Contains data generator for orgnaizing various audio data preprocessing
pipeline and offering data reader interface of PaddlePaddle requirements.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import random
import numpy as np
import paddle.v2 as paddle
from data_utils import utils
from data_utils.augmentor.augmentation import AugmentationPipeline
from data_utils.featurizer.speech_featurizer import SpeechFeaturizer
from data_utils.speech import SpeechSegment
from data_utils.normalizer import FeatureNormalizer
class DataGenerator(object):
"""
DataGenerator provides basic audio data preprocessing pipeline, and offers
data reader interfaces of PaddlePaddle requirements.
:param vocab_filepath: Vocabulary filepath for indexing tokenized
transcripts.
:type vocab_filepath: basestring
:param mean_std_filepath: File containing the pre-computed mean and stddev.
:type mean_std_filepath: None|basestring
:param augmentation_config: Augmentation configuration in json string.
Details see AugmentationPipeline.__doc__.
:type augmentation_config: str
:param max_duration: Audio with duration (in seconds) greater than
this will be discarded.
:type max_duration: float
:param min_duration: Audio with duration (in seconds) smaller than
this will be discarded.
:type min_duration: float
:param stride_ms: Striding size (in milliseconds) for generating frames.
:type stride_ms: float
:param window_ms: Window size (in milliseconds) for generating frames.
:type window_ms: float
:param max_freq: Used when specgram_type is 'linear', only FFT bins
corresponding to frequencies between [0, max_freq] are
returned.
:types max_freq: None|float
:param specgram_type: Specgram feature type. Options: 'linear'.
:type specgram_type: str
:param random_seed: Random seed.
:type random_seed: int
"""
def __init__(self,
vocab_filepath,
mean_std_filepath,
augmentation_config='{}',
max_duration=float('inf'),
min_duration=0.0,
stride_ms=10.0,
window_ms=20.0,
max_freq=None,
specgram_type='linear',
random_seed=0):
self._max_duration = max_duration
self._min_duration = min_duration
self._normalizer = FeatureNormalizer(mean_std_filepath)
self._augmentation_pipeline = AugmentationPipeline(
augmentation_config=augmentation_config, random_seed=random_seed)
self._speech_featurizer = SpeechFeaturizer(
vocab_filepath=vocab_filepath,
specgram_type=specgram_type,
stride_ms=stride_ms,
window_ms=window_ms,
max_freq=max_freq)
self._rng = random.Random(random_seed)
self._epoch = 0
def batch_reader_creator(self,
manifest_path,
batch_size,
min_batch_size=1,
padding_to=-1,
flatten=False,
sortagrad=False,
shuffle_method="batch_shuffle"):
"""
Batch data reader creator for audio data. Return a callable generator
function to produce batches of data.
Audio features within one batch will be padded with zeros to have the
same shape, or a user-defined shape.
:param manifest_path: Filepath of manifest for audio files.
:type manifest_path: basestring
:param batch_size: Number of instances in a batch.
:type batch_size: int
:param min_batch_size: Any batch with batch size smaller than this will
be discarded. (To be deprecated in the future.)
:type min_batch_size: int
:param padding_to: If set -1, the maximun shape in the batch
will be used as the target shape for padding.
Otherwise, `padding_to` will be the target shape.
:type padding_to: int
:param flatten: If set True, audio features will be flatten to 1darray.
:type flatten: bool
:param sortagrad: If set True, sort the instances by audio duration
in the first epoch for speed up training.
:type sortagrad: bool
:param shuffle_method: Shuffle method. Options:
'' or None: no shuffle.
'instance_shuffle': instance-wise shuffle.
'batch_shuffle': similarly-sized instances are
put into batches, and then
batch-wise shuffle the batches.
For more details, please see
``_batch_shuffle.__doc__``.
'batch_shuffle_clipped': 'batch_shuffle' with
head shift and tail
clipping. For more
details, please see
``_batch_shuffle``.
If sortagrad is True, shuffle is disabled
for the first epoch.
:type shuffle_method: None|str
:return: Batch reader function, producing batches of data when called.
:rtype: callable
"""
def batch_reader():
# read manifest
manifest = utils.read_manifest(
manifest_path=manifest_path,
max_duration=self._max_duration,
min_duration=self._min_duration)
# sort (by duration) or batch-wise shuffle the manifest
if self._epoch == 0 and sortagrad:
manifest.sort(key=lambda x: x["duration"])
else:
if shuffle_method == "batch_shuffle":
manifest = self._batch_shuffle(
manifest, batch_size, clipped=False)
elif shuffle_method == "batch_shuffle_clipped":
manifest = self._batch_shuffle(
manifest, batch_size, clipped=True)
elif shuffle_method == "instance_shuffle":
self._rng.shuffle(manifest)
elif not shuffle_method:
pass
else:
raise ValueError("Unknown shuffle method %s." %
shuffle_method)
# prepare batches
instance_reader = self._instance_reader_creator(manifest)
batch = []
for instance in instance_reader():
batch.append(instance)
if len(batch) == batch_size:
yield self._padding_batch(batch, padding_to, flatten)
batch = []
if len(batch) >= min_batch_size:
yield self._padding_batch(batch, padding_to, flatten)
self._epoch += 1
return batch_reader
@property
def feeding(self):
"""Returns data reader's feeding dict.
:return: Data feeding dict.
:rtype: dict
"""
return {"audio_spectrogram": 0, "transcript_text": 1}
@property
def vocab_size(self):
"""Return the vocabulary size.
:return: Vocabulary size.
:rtype: int
"""
return self._speech_featurizer.vocab_size
@property
def vocab_list(self):
"""Return the vocabulary in list.
:return: Vocabulary in list.
:rtype: list
"""
return self._speech_featurizer.vocab_list
def _process_utterance(self, filename, transcript):
"""Load, augment, featurize and normalize for speech data."""
speech_segment = SpeechSegment.from_file(filename, transcript)
self._augmentation_pipeline.transform_audio(speech_segment)
specgram, text_ids = self._speech_featurizer.featurize(speech_segment)
specgram = self._normalizer.apply(specgram)
return specgram, text_ids
def _instance_reader_creator(self, manifest):
"""
Instance reader creator. Create a callable function to produce
instances of data.
Instance: a tuple of ndarray of audio spectrogram and a list of
token indices for transcript.
"""
def reader():
for instance in manifest:
yield self._process_utterance(instance["audio_filepath"],
instance["text"])
return reader
def _padding_batch(self, batch, padding_to=-1, flatten=False):
"""
Padding audio features with zeros to make them have the same shape (or
a user-defined shape) within one bach.
If ``padding_to`` is -1, the maximun shape in the batch will be used
as the target shape for padding. Otherwise, `padding_to` will be the
target shape (only refers to the second axis).
If `flatten` is True, features will be flatten to 1darray.
"""
new_batch = []
# get target shape
max_length = max([audio.shape[1] for audio, text in batch])
if padding_to != -1:
if padding_to < max_length:
raise ValueError("If padding_to is not -1, it should be larger "
"than any instance's shape in the batch")
max_length = padding_to
# padding
for audio, text in batch:
padded_audio = np.zeros([audio.shape[0], max_length])
padded_audio[:, :audio.shape[1]] = audio
if flatten:
padded_audio = padded_audio.flatten()
new_batch.append((padded_audio, text))
return new_batch
def _batch_shuffle(self, manifest, batch_size, clipped=False):
"""Put similarly-sized instances into minibatches for better efficiency
and make a batch-wise shuffle.
1. Sort the audio clips by duration.
2. Generate a random number `k`, k in [0, batch_size).
3. Randomly shift `k` instances in order to create different batches
for different epochs. Create minibatches.
4. Shuffle the minibatches.
:param manifest: Manifest contents. List of dict.
:type manifest: list
:param batch_size: Batch size. This size is also used for generate
a random number for batch shuffle.
:type batch_size: int
:param clipped: Whether to clip the heading (small shift) and trailing
(incomplete batch) instances.
:type clipped: bool
:return: Batch shuffled mainifest.
:rtype: list
"""
manifest.sort(key=lambda x: x["duration"])
shift_len = self._rng.randint(0, batch_size - 1)
batch_manifest = zip(*[iter(manifest[shift_len:])] * batch_size)
self._rng.shuffle(batch_manifest)
batch_manifest = list(sum(batch_manifest, ()))
if not clipped:
res_len = len(manifest) - shift_len - len(batch_manifest)
batch_manifest.extend(manifest[-res_len:])
batch_manifest.extend(manifest[0:shift_len])
return batch_manifest
"""Contains the audio featurizer class."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
from data_utils import utils
from data_utils.audio import AudioSegment
class AudioFeaturizer(object):
"""Audio featurizer, for extracting features from audio contents of
AudioSegment or SpeechSegment.
Currently, it only supports feature type of linear spectrogram.
:param specgram_type: Specgram feature type. Options: 'linear'.
:type specgram_type: str
:param stride_ms: Striding size (in milliseconds) for generating frames.
:type stride_ms: float
:param window_ms: Window size (in milliseconds) for generating frames.
:type window_ms: float
:param max_freq: Used when specgram_type is 'linear', only FFT bins
corresponding to frequencies between [0, max_freq] are
returned.
:types max_freq: None|float
"""
def __init__(self,
specgram_type='linear',
stride_ms=10.0,
window_ms=20.0,
max_freq=None):
self._specgram_type = specgram_type
self._stride_ms = stride_ms
self._window_ms = window_ms
self._max_freq = max_freq
def featurize(self, audio_segment):
"""Extract audio features from AudioSegment or SpeechSegment.
:param audio_segment: Audio/speech segment to extract features from.
:type audio_segment: AudioSegment|SpeechSegment
:return: Spectrogram audio feature in 2darray.
:rtype: ndarray
"""
return self._compute_specgram(audio_segment.samples,
audio_segment.sample_rate)
def _compute_specgram(self, samples, sample_rate):
"""Extract various audio features."""
if self._specgram_type == 'linear':
return self._compute_linear_specgram(
samples, sample_rate, self._stride_ms, self._window_ms,
self._max_freq)
else:
raise ValueError("Unknown specgram_type %s. "
"Supported values: linear." % self._specgram_type)
def _compute_linear_specgram(self,
samples,
sample_rate,
stride_ms=10.0,
window_ms=20.0,
max_freq=None,
eps=1e-14):
"""Compute the linear spectrogram from FFT energy."""
if max_freq is None:
max_freq = sample_rate / 2
if max_freq > sample_rate / 2:
raise ValueError("max_freq must be greater than half of "
"sample rate.")
if stride_ms > window_ms:
raise ValueError("Stride size must not be greater than "
"window size.")
stride_size = int(0.001 * sample_rate * stride_ms)
window_size = int(0.001 * sample_rate * window_ms)
specgram, freqs = self._specgram_real(
samples,
window_size=window_size,
stride_size=stride_size,
sample_rate=sample_rate)
ind = np.where(freqs <= max_freq)[0][-1] + 1
return np.log(specgram[:ind, :] + eps)
def _specgram_real(self, samples, window_size, stride_size, sample_rate):
"""Compute the spectrogram for samples from a real signal."""
# extract strided windows
truncate_size = (len(samples) - window_size) % stride_size
samples = samples[:len(samples) - truncate_size]
nshape = (window_size, (len(samples) - window_size) // stride_size + 1)
nstrides = (samples.strides[0], samples.strides[0] * stride_size)
windows = np.lib.stride_tricks.as_strided(
samples, shape=nshape, strides=nstrides)
assert np.all(
windows[:, 1] == samples[stride_size:(stride_size + window_size)])
# window weighting, squared Fast Fourier Transform (fft), scaling
weighting = np.hanning(window_size)[:, None]
fft = np.fft.rfft(windows * weighting, axis=0)
fft = np.absolute(fft)**2
scale = np.sum(weighting**2) * sample_rate
fft[1:-1, :] *= (2.0 / scale)
fft[(0, -1), :] /= scale
# prepare fft frequency list
freqs = float(sample_rate) / window_size * np.arange(fft.shape[0])
return fft, freqs
"""Contains the speech featurizer class."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from data_utils.featurizer.audio_featurizer import AudioFeaturizer
from data_utils.featurizer.text_featurizer import TextFeaturizer
class SpeechFeaturizer(object):
"""Speech featurizer, for extracting features from both audio and transcript
contents of SpeechSegment.
Currently, for audio parts, it only supports feature type of linear
spectrogram; for transcript parts, it only supports char-level tokenizing
and conversion into a list of token indices. Note that the token indexing
order follows the given vocabulary file.
:param vocab_filepath: Filepath to load vocabulary for token indices
conversion.
:type specgram_type: basestring
:param specgram_type: Specgram feature type. Options: 'linear'.
:type specgram_type: str
:param stride_ms: Striding size (in milliseconds) for generating frames.
:type stride_ms: float
:param window_ms: Window size (in milliseconds) for generating frames.
:type window_ms: float
:param max_freq: Used when specgram_type is 'linear', only FFT bins
corresponding to frequencies between [0, max_freq] are
returned.
:types max_freq: None|float
"""
def __init__(self,
vocab_filepath,
specgram_type='linear',
stride_ms=10.0,
window_ms=20.0,
max_freq=None):
self._audio_featurizer = AudioFeaturizer(specgram_type, stride_ms,
window_ms, max_freq)
self._text_featurizer = TextFeaturizer(vocab_filepath)
def featurize(self, speech_segment):
"""Extract features for speech segment.
1. For audio parts, extract the audio features.
2. For transcript parts, convert text string to a list of token indices
in char-level.
:param audio_segment: Speech segment to extract features from.
:type audio_segment: SpeechSegment
:return: A tuple of 1) spectrogram audio feature in 2darray, 2) list of
char-level token indices.
:rtype: tuple
"""
audio_feature = self._audio_featurizer.featurize(speech_segment)
text_ids = self._text_featurizer.featurize(speech_segment.transcript)
return audio_feature, text_ids
@property
def vocab_size(self):
"""Return the vocabulary size.
:return: Vocabulary size.
:rtype: int
"""
return self._text_featurizer.vocab_size
@property
def vocab_list(self):
"""Return the vocabulary in list.
:return: Vocabulary in list.
:rtype: list
"""
return self._text_featurizer.vocab_list
"""Contains the text featurizer class."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
class TextFeaturizer(object):
"""Text featurizer, for processing or extracting features from text.
Currently, it only supports char-level tokenizing and conversion into
a list of token indices. Note that the token indexing order follows the
given vocabulary file.
:param vocab_filepath: Filepath to load vocabulary for token indices
conversion.
:type specgram_type: basestring
"""
def __init__(self, vocab_filepath):
self._vocab_dict, self._vocab_list = self._load_vocabulary_from_file(
vocab_filepath)
def featurize(self, text):
"""Convert text string to a list of token indices in char-level.Note
that the token indexing order follows the given vocabulary file.
:param text: Text to process.
:type text: basestring
:return: List of char-level token indices.
:rtype: list
"""
tokens = self._char_tokenize(text)
return [self._vocab_dict[token] for token in tokens]
@property
def vocab_size(self):
"""Return the vocabulary size.
:return: Vocabulary size.
:rtype: int
"""
return len(self._vocab_list)
@property
def vocab_list(self):
"""Return the vocabulary in list.
:return: Vocabulary in list.
:rtype: list
"""
return self._vocab_list
def _char_tokenize(self, text):
"""Character tokenizer."""
return list(text.strip())
def _load_vocabulary_from_file(self, vocab_filepath):
"""Load vocabulary from file."""
vocab_lines = []
with open(vocab_filepath, 'r') as file:
vocab_lines.extend(file.readlines())
vocab_list = [line[:-1] for line in vocab_lines]
vocab_dict = dict(
[(token, id) for (id, token) in enumerate(vocab_list)])
return vocab_dict, vocab_list
"""Contains feature normalizers."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import random
import data_utils.utils as utils
from data_utils.audio import AudioSegment
class FeatureNormalizer(object):
"""Feature normalizer. Normalize features to be of zero mean and unit
stddev.
if mean_std_filepath is provided (not None), the normalizer will directly
initilize from the file. Otherwise, both manifest_path and featurize_func
should be given for on-the-fly mean and stddev computing.
:param mean_std_filepath: File containing the pre-computed mean and stddev.
:type mean_std_filepath: None|basestring
:param manifest_path: Manifest of instances for computing mean and stddev.
:type meanifest_path: None|basestring
:param featurize_func: Function to extract features. It should be callable
with ``featurize_func(audio_segment)``.
:type featurize_func: None|callable
:param num_samples: Number of random samples for computing mean and stddev.
:type num_samples: int
:param random_seed: Random seed for sampling instances.
:type random_seed: int
:raises ValueError: If both mean_std_filepath and manifest_path
(or both mean_std_filepath and featurize_func) are None.
"""
def __init__(self,
mean_std_filepath,
manifest_path=None,
featurize_func=None,
num_samples=500,
random_seed=0):
if not mean_std_filepath:
if not (manifest_path and featurize_func):
raise ValueError("If mean_std_filepath is None, meanifest_path "
"and featurize_func should not be None.")
self._rng = random.Random(random_seed)
self._compute_mean_std(manifest_path, featurize_func, num_samples)
else:
self._read_mean_std_from_file(mean_std_filepath)
def apply(self, features, eps=1e-14):
"""Normalize features to be of zero mean and unit stddev.
:param features: Input features to be normalized.
:type features: ndarray
:param eps: added to stddev to provide numerical stablibity.
:type eps: float
:return: Normalized features.
:rtype: ndarray
"""
return (features - self._mean) / (self._std + eps)
def write_to_file(self, filepath):
"""Write the mean and stddev to the file.
:param filepath: File to write mean and stddev.
:type filepath: basestring
"""
np.savez(filepath, mean=self._mean, std=self._std)
def _read_mean_std_from_file(self, filepath):
"""Load mean and std from file."""
npzfile = np.load(filepath)
self._mean = npzfile["mean"]
self._std = npzfile["std"]
def _compute_mean_std(self, manifest_path, featurize_func, num_samples):
"""Compute mean and std from randomly sampled instances."""
manifest = utils.read_manifest(manifest_path)
sampled_manifest = self._rng.sample(manifest, num_samples)
features = []
for instance in sampled_manifest:
features.append(
featurize_func(
AudioSegment.from_file(instance["audio_filepath"])))
features = np.hstack(features)
self._mean = np.mean(features, axis=1).reshape([-1, 1])
self._std = np.std(features, axis=1).reshape([-1, 1])
"""Contains the speech segment class."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from data_utils.audio import AudioSegment
class SpeechSegment(AudioSegment):
"""Speech segment abstraction, a subclass of AudioSegment,
with an additional transcript.
:param samples: Audio samples [num_samples x num_channels].
:type samples: ndarray.float32
:param sample_rate: Audio sample rate.
:type sample_rate: int
:param transcript: Transcript text for the speech.
:type transript: basestring
:raises TypeError: If the sample data type is not float or int.
"""
def __init__(self, samples, sample_rate, transcript):
AudioSegment.__init__(self, samples, sample_rate)
self._transcript = transcript
def __eq__(self, other):
"""Return whether two objects are equal.
"""
if not AudioSegment.__eq__(self, other):
return False
if self._transcript != other._transcript:
return False
return True
def __ne__(self, other):
"""Return whether two objects are unequal."""
return not self.__eq__(other)
@classmethod
def from_file(cls, filepath, transcript):
"""Create speech segment from audio file and corresponding transcript.
:param filepath: Filepath or file object to audio file.
:type filepath: basestring|file
:param transcript: Transcript text for the speech.
:type transript: basestring
:return: Audio segment instance.
:rtype: AudioSegment
"""
audio = AudioSegment.from_file(filepath)
return cls(audio.samples, audio.sample_rate, transcript)
@classmethod
def from_bytes(cls, bytes, transcript):
"""Create speech segment from a byte string and corresponding
transcript.
:param bytes: Byte string containing audio samples.
:type bytes: str
:param transcript: Transcript text for the speech.
:type transript: basestring
:return: Audio segment instance.
:rtype: AudioSegment
"""
audio = AudioSegment.from_bytes(bytes)
return cls(audio.samples, audio.sample_rate, transcript)
@property
def transcript(self):
"""Return the transcript text.
:return: Transcript text for the speech.
:rtype: basestring
"""
return self._transcript
"""Contains data helper functions."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import json
def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0):
"""Load and parse manifest file.
Instances with durations outside [min_duration, max_duration] will be
filtered out.
:param manifest_path: Manifest file to load and parse.
:type manifest_path: basestring
:param max_duration: Maximal duration in seconds for instance filter.
:type max_duration: float
:param min_duration: Minimal duration in seconds for instance filter.
:type min_duration: float
:return: Manifest parsing results. List of dict.
:rtype: list
:raises IOError: If failed to parse the manifest.
"""
manifest = []
for json_line in open(manifest_path):
try:
json_data = json.loads(json_line)
except Exception as e:
raise IOError("Error reading manifest: %s" % str(e))
if (json_data["duration"] <= max_duration and
json_data["duration"] >= min_duration):
manifest.append(json_data)
return manifest
""" """Prepare Librispeech ASR datasets.
Download, unpack and create manifest json files for the Librespeech dataset.
A manifest is a json file summarizing filelist in a data set, with each line Download, unpack and create manifest files.
containing the meta data (i.e. audio filepath, transcription text, audio Manifest file is a json-format file with each line containing the
duration) of each audio file in the data set. meta data (i.e. audio filepath, transcript and audio duration)
of each audio file in the data set.
""" """
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import paddle.v2 as paddle
from paddle.v2.dataset.common import md5file
import distutils.util import distutils.util
import os import os
import wget import wget
...@@ -15,6 +16,7 @@ import tarfile ...@@ -15,6 +16,7 @@ import tarfile
import argparse import argparse
import soundfile import soundfile
import json import json
from paddle.v2.dataset.common import md5file
DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech') DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
...@@ -35,8 +37,7 @@ MD5_TRAIN_CLEAN_100 = "2a93770f6d5c6c964bc36631d331a522" ...@@ -35,8 +37,7 @@ MD5_TRAIN_CLEAN_100 = "2a93770f6d5c6c964bc36631d331a522"
MD5_TRAIN_CLEAN_360 = "c0e676e450a7ff2f54aeade5171606fa" MD5_TRAIN_CLEAN_360 = "c0e676e450a7ff2f54aeade5171606fa"
MD5_TRAIN_OTHER_500 = "d1a0fd59409feb2c614ce4d30c387708" MD5_TRAIN_OTHER_500 = "d1a0fd59409feb2c614ce4d30c387708"
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(description=__doc__)
description='Downloads and prepare LibriSpeech dataset.')
parser.add_argument( parser.add_argument(
"--target_dir", "--target_dir",
default=DATA_HOME + "/Libri", default=DATA_HOME + "/Libri",
...@@ -44,7 +45,7 @@ parser.add_argument( ...@@ -44,7 +45,7 @@ parser.add_argument(
help="Directory to save the dataset. (default: %(default)s)") help="Directory to save the dataset. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--manifest_prefix", "--manifest_prefix",
default="manifest.libri", default="manifest",
type=str, type=str,
help="Filepath prefix for output manifests. (default: %(default)s)") help="Filepath prefix for output manifests. (default: %(default)s)")
parser.add_argument( parser.add_argument(
......
cd librispeech
python librispeech.py
if [ $? -ne 0 ]; then
echo "Prepare LibriSpeech failed. Terminated."
exit 1
fi
cd -
cat librispeech/manifest.train* | shuf > manifest.train
cat librispeech/manifest.dev-clean > manifest.dev
cat librispeech/manifest.test-clean > manifest.test
echo "All done."
""" """Contains various CTC decoder."""
CTC-like decoder utilitis. from __future__ import absolute_import
""" from __future__ import division
from __future__ import print_function
import os import os
from itertools import groupby from itertools import groupby
...@@ -10,8 +11,7 @@ import multiprocessing ...@@ -10,8 +11,7 @@ import multiprocessing
def ctc_best_path_decode(probs_seq, vocabulary): def ctc_best_path_decode(probs_seq, vocabulary):
""" """Best path decoding, also called argmax decoding or greedy decoding.
Best path decoding, also called argmax decoding or greedy decoding.
Path consisting of the most probable tokens are further post-processed to Path consisting of the most probable tokens are further post-processed to
remove consecutive repetitions and all blanks. remove consecutive repetitions and all blanks.
...@@ -40,8 +40,7 @@ def ctc_best_path_decode(probs_seq, vocabulary): ...@@ -40,8 +40,7 @@ def ctc_best_path_decode(probs_seq, vocabulary):
class Scorer(object): class Scorer(object):
""" """External defined scorer to evaluate a sentence in beam search
External defined scorer to evaluate a sentence in beam search
decoding, consisting of language model and word count. decoding, consisting of language model and word count.
:param alpha: Parameter associated with language model. :param alpha: Parameter associated with language model.
...@@ -73,8 +72,7 @@ class Scorer(object): ...@@ -73,8 +72,7 @@ class Scorer(object):
# execute evaluation # execute evaluation
def __call__(self, sentence, log=False): def __call__(self, sentence, log=False):
""" """Evaluation function, gathering all the scores.
Evaluation function
:param sentence: The input sentence for evalutation :param sentence: The input sentence for evalutation
:type sentence: basestring :type sentence: basestring
...@@ -101,8 +99,7 @@ def ctc_beam_search_decoder(probs_seq, ...@@ -101,8 +99,7 @@ def ctc_beam_search_decoder(probs_seq,
cutoff_prob=1.0, cutoff_prob=1.0,
ext_scoring_func=None, ext_scoring_func=None,
nproc=False): nproc=False):
''' '''Beam search decoder for CTC-trained network, using beam search with width
Beam search decoder for CTC-trained network, using beam search with width
beam_size to find many paths to one label, return beam_size labels in beam_size to find many paths to one label, return beam_size labels in
the descending order of probabilities. The implementation is based on Prefix the descending order of probabilities. The implementation is based on Prefix
Beam Search(https://arxiv.org/abs/1408.2873), and the unclear part is Beam Search(https://arxiv.org/abs/1408.2873), and the unclear part is
...@@ -129,7 +126,6 @@ def ctc_beam_search_decoder(probs_seq, ...@@ -129,7 +126,6 @@ def ctc_beam_search_decoder(probs_seq,
:type nproc: bool :type nproc: bool
:return: Decoding log probabilities and result sentences in descending order. :return: Decoding log probabilities and result sentences in descending order.
:rtype: list :rtype: list
''' '''
# dimension check # dimension check
for prob_list in probs_seq: for prob_list in probs_seq:
...@@ -242,8 +238,7 @@ def ctc_beam_search_decoder_nproc(probs_split, ...@@ -242,8 +238,7 @@ def ctc_beam_search_decoder_nproc(probs_split,
cutoff_prob=1.0, cutoff_prob=1.0,
ext_scoring_func=None, ext_scoring_func=None,
num_processes=None): num_processes=None):
''' '''Beam search decoder using multiple processes.
Beam search decoder using multiple processes.
:param probs_seq: 3-D list with length batch_size, each element :param probs_seq: 3-D list with length batch_size, each element
is a 2-D list of probabilities can be used by is a 2-D list of probabilities can be used by
......
# -*- coding: utf-8 -*-
"""This module provides functions to calculate error rate in different level.
e.g. wer for word-level, cer for char-level.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
def _levenshtein_distance(ref, hyp):
"""Levenshtein distance is a string metric for measuring the difference between
two sequences. Informally, the levenshtein disctance is defined as the minimum
number of single-character edits (substitutions, insertions or deletions)
required to change one word into the other. We can naturally extend the edits to
word level when calculate levenshtein disctance for two sentences.
"""
ref_len = len(ref)
hyp_len = len(hyp)
# special case
if ref == hyp:
return 0
if ref_len == 0:
return hyp_len
if hyp_len == 0:
return ref_len
distance = np.zeros((ref_len + 1, hyp_len + 1), dtype=np.int32)
# initialize distance matrix
for j in xrange(hyp_len + 1):
distance[0][j] = j
for i in xrange(ref_len + 1):
distance[i][0] = i
# calculate levenshtein distance
for i in xrange(1, ref_len + 1):
for j in xrange(1, hyp_len + 1):
if ref[i - 1] == hyp[j - 1]:
distance[i][j] = distance[i - 1][j - 1]
else:
s_num = distance[i - 1][j - 1] + 1
i_num = distance[i][j - 1] + 1
d_num = distance[i - 1][j] + 1
distance[i][j] = min(s_num, i_num, d_num)
return distance[ref_len][hyp_len]
def wer(reference, hypothesis, ignore_case=False, delimiter=' '):
"""Calculate word error rate (WER). WER compares reference text and
hypothesis text in word-level. WER is defined as:
.. math::
WER = (Sw + Dw + Iw) / Nw
where
.. code-block:: text
Sw is the number of words subsituted,
Dw is the number of words deleted,
Iw is the number of words inserted,
Nw is the number of words in the reference
We can use levenshtein distance to calculate WER. Please draw an attention that
empty items will be removed when splitting sentences by delimiter.
:param reference: The reference sentence.
:type reference: basestring
:param hypothesis: The hypothesis sentence.
:type hypothesis: basestring
:param ignore_case: Whether case-sensitive or not.
:type ignore_case: bool
:param delimiter: Delimiter of input sentences.
:type delimiter: char
:return: Word error rate.
:rtype: float
:raises ValueError: If the reference length is zero.
"""
if ignore_case == True:
reference = reference.lower()
hypothesis = hypothesis.lower()
ref_words = filter(None, reference.split(delimiter))
hyp_words = filter(None, hypothesis.split(delimiter))
if len(ref_words) == 0:
raise ValueError("Reference's word number should be greater than 0.")
edit_distance = _levenshtein_distance(ref_words, hyp_words)
wer = float(edit_distance) / len(ref_words)
return wer
def cer(reference, hypothesis, ignore_case=False):
"""Calculate charactor error rate (CER). CER compares reference text and
hypothesis text in char-level. CER is defined as:
.. math::
CER = (Sc + Dc + Ic) / Nc
where
.. code-block:: text
Sc is the number of characters substituted,
Dc is the number of characters deleted,
Ic is the number of characters inserted
Nc is the number of characters in the reference
We can use levenshtein distance to calculate CER. Chinese input should be
encoded to unicode. Please draw an attention that the leading and tailing
white space characters will be truncated and multiple consecutive white
space characters in a sentence will be replaced by one white space character.
:param reference: The reference sentence.
:type reference: basestring
:param hypothesis: The hypothesis sentence.
:type hypothesis: basestring
:param ignore_case: Whether case-sensitive or not.
:type ignore_case: bool
:return: Character error rate.
:rtype: float
:raises ValueError: If the reference length is zero.
"""
if ignore_case == True:
reference = reference.lower()
hypothesis = hypothesis.lower()
reference = ' '.join(filter(None, reference.split(' ')))
hypothesis = ' '.join(filter(None, hypothesis.split(' ')))
if len(reference) == 0:
raise ValueError("Length of reference should be greater than 0.")
edit_distance = _levenshtein_distance(reference, hypothesis)
cer = float(edit_distance) / len(reference)
return cer
""" """Evaluation for DeepSpeech2 model."""
Evaluation for a simplifed version of Baidu DeepSpeech2 model. from __future__ import absolute_import
""" from __future__ import division
from __future__ import print_function
import paddle.v2 as paddle import paddle.v2 as paddle
import distutils.util import distutils.util
import argparse import argparse
import gzip import gzip
from audio_data_utils import DataGenerator from data_utils.data import DataGenerator
from model import deep_speech2 from model import deep_speech2
from decoder import * from decoder import *
from error_rate import wer from error_rate import wer
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(description=__doc__)
description='Simplified version of DeepSpeech2 evaluation.')
parser.add_argument( parser.add_argument(
"--num_samples", "--num_samples",
default=100, default=100,
...@@ -38,6 +38,11 @@ parser.add_argument( ...@@ -38,6 +38,11 @@ parser.add_argument(
default=True, default=True,
type=distutils.util.strtobool, type=distutils.util.strtobool,
help="Use gpu or not. (default: %(default)s)") help="Use gpu or not. (default: %(default)s)")
parser.add_argument(
"--mean_std_filepath",
default='mean_std.npz',
type=str,
help="Manifest path for normalizer. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--decode_method", "--decode_method",
default='beam_search_nproc', default='beam_search_nproc',
...@@ -46,7 +51,7 @@ parser.add_argument( ...@@ -46,7 +51,7 @@ parser.add_argument(
"beam_search or beam_search_nproc. (default: %(default)s)") "beam_search or beam_search_nproc. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--language_model_path", "--language_model_path",
default="./data/1Billion.klm", default="data/1Billion.klm",
type=str, type=str,
help="Path for language model. (default: %(default)s)") help="Path for language model. (default: %(default)s)")
parser.add_argument( parser.add_argument(
...@@ -87,41 +92,34 @@ parser.add_argument( ...@@ -87,41 +92,34 @@ parser.add_argument(
help="Model filepath. (default: %(default)s)") help="Model filepath. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--vocab_filepath", "--vocab_filepath",
default='data/eng_vocab.txt', default='datasets/vocab/eng_vocab.txt',
type=str, type=str,
help="Vocabulary filepath. (default: %(default)s)") help="Vocabulary filepath. (default: %(default)s)")
args = parser.parse_args() args = parser.parse_args()
def evaluate(): def evaluate():
""" """Evaluate on whole test data for DeepSpeech2."""
Evaluate on whole test data for DeepSpeech2.
"""
# initialize data generator # initialize data generator
data_generator = DataGenerator( data_generator = DataGenerator(
vocab_filepath=args.vocab_filepath, vocab_filepath=args.vocab_filepath,
normalizer_manifest_path=args.normalizer_manifest_path, mean_std_filepath=args.mean_std_filepath,
normalizer_num_samples=200, augmentation_config='{}')
max_duration=20.0,
min_duration=0.0,
stride_ms=10,
window_ms=20)
# create network config # create network config
dict_size = data_generator.vocabulary_size() # paddle.data_type.dense_array is used for variable batch input.
vocab_list = data_generator.vocabulary_list() # The size 161 * 161 is only an placeholder value and the real shape
# of input batch data will be induced during training.
audio_data = paddle.layer.data( audio_data = paddle.layer.data(
name="audio_spectrogram", name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161))
height=161,
width=2000,
type=paddle.data_type.dense_vector(322000))
text_data = paddle.layer.data( text_data = paddle.layer.data(
name="transcript_text", name="transcript_text",
type=paddle.data_type.integer_value_sequence(dict_size)) type=paddle.data_type.integer_value_sequence(data_generator.vocab_size))
output_probs = deep_speech2( output_probs = deep_speech2(
audio_data=audio_data, audio_data=audio_data,
text_data=text_data, text_data=text_data,
dict_size=dict_size, dict_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers, num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers, num_rnn_layers=args.num_rnn_layers,
rnn_size=args.rnn_layer_size, rnn_size=args.rnn_layer_size,
...@@ -132,14 +130,11 @@ def evaluate(): ...@@ -132,14 +130,11 @@ def evaluate():
gzip.open(args.model_filepath)) gzip.open(args.model_filepath))
# prepare infer data # prepare infer data
feeding = data_generator.data_name_feeding() batch_reader = data_generator.batch_reader_creator(
test_batch_reader = data_generator.batch_reader_creator(
manifest_path=args.decode_manifest_path, manifest_path=args.decode_manifest_path,
batch_size=args.num_samples, batch_size=args.num_samples,
padding_to=2000, sortagrad=False,
flatten=True, shuffle_method=None)
sort_by_duration=False,
shuffle=False)
# define inferer # define inferer
inferer = paddle.inference.Inference( inferer = paddle.inference.Inference(
...@@ -151,10 +146,10 @@ def evaluate(): ...@@ -151,10 +146,10 @@ def evaluate():
ext_scorer = Scorer(args.alpha, args.beta, args.language_model_path) ext_scorer = Scorer(args.alpha, args.beta, args.language_model_path)
wer_counter, wer_sum = 0, 0.0 wer_counter, wer_sum = 0, 0.0
for infer_data in test_batch_reader(): for infer_data in batch_reader():
# run inference # run inference
infer_results = inferer.infer(input=infer_data) infer_results = inferer.infer(input=infer_data)
num_steps = len(infer_results) / len(infer_data) num_steps = len(infer_results) // len(infer_data)
probs_split = [ probs_split = [
infer_results[i * num_steps:(i + 1) * num_steps] infer_results[i * num_steps:(i + 1) * num_steps]
for i in xrange(0, len(infer_data)) for i in xrange(0, len(infer_data))
...@@ -164,22 +159,26 @@ def evaluate(): ...@@ -164,22 +159,26 @@ def evaluate():
# best path decode # best path decode
if args.decode_method == "best_path": if args.decode_method == "best_path":
for i, probs in enumerate(probs_split): for i, probs in enumerate(probs_split):
output_transcription = ctc_decode( output_transcription = ctc_best_path_decode(
probs_seq=probs, vocabulary=vocab_list, method="best_path") probs_seq=probs, vocabulary=data_generator.vocab_list)
target_transcription = ''.join( target_transcription = ''.join([
[vocab_list[index] for index in infer_data[i][1]]) data_generator.vocab_list[index]
for index in infer_data[i][1]
])
wer_sum += wer(target_transcription, output_transcription) wer_sum += wer(target_transcription, output_transcription)
wer_counter += 1 wer_counter += 1
# beam search decode in single process # beam search decode in single process
elif args.decode_method == "beam_search": elif args.decode_method == "beam_search":
for i, probs in enumerate(probs_split): for i, probs in enumerate(probs_split):
target_transcription = ''.join( target_transcription = ''.join([
[vocab_list[index] for index in infer_data[i][1]]) data_generator.vocab_list[index]
for index in infer_data[i][1]
])
beam_search_result = ctc_beam_search_decoder( beam_search_result = ctc_beam_search_decoder(
probs_seq=probs, probs_seq=probs,
vocabulary=vocab_list, vocabulary=data_generator.vocab_list,
beam_size=args.beam_size, beam_size=args.beam_size,
blank_id=len(vocab_list), blank_id=len(data_generator.vocab_list),
ext_scoring_func=ext_scorer, ext_scoring_func=ext_scorer,
cutoff_prob=args.cutoff_prob, ) cutoff_prob=args.cutoff_prob, )
wer_sum += wer(target_transcription, beam_search_result[0][1]) wer_sum += wer(target_transcription, beam_search_result[0][1])
...@@ -188,18 +187,21 @@ def evaluate(): ...@@ -188,18 +187,21 @@ def evaluate():
elif args.decode_method == "beam_search_nproc": elif args.decode_method == "beam_search_nproc":
beam_search_nproc_results = ctc_beam_search_decoder_nproc( beam_search_nproc_results = ctc_beam_search_decoder_nproc(
probs_split=probs_split, probs_split=probs_split,
vocabulary=vocab_list, vocabulary=data_generator.vocab_list,
beam_size=args.beam_size, beam_size=args.beam_size,
blank_id=len(vocab_list), blank_id=len(data_generator.vocab_list),
ext_scoring_func=ext_scorer, ext_scoring_func=ext_scorer,
cutoff_prob=args.cutoff_prob, ) cutoff_prob=args.cutoff_prob, )
for i, beam_search_result in enumerate(beam_search_nproc_results): for i, beam_search_result in enumerate(beam_search_nproc_results):
target_transcription = ''.join( target_transcription = ''.join([
[vocab_list[index] for index in infer_data[i][1]]) data_generator.vocab_list[index]
for index in infer_data[i][1]
])
wer_sum += wer(target_transcription, beam_search_result[0][1]) wer_sum += wer(target_transcription, beam_search_result[0][1])
wer_counter += 1 wer_counter += 1
else: else:
raise ValueError("Decoding method [%s] is not supported." % method) raise ValueError("Decoding method [%s] is not supported." %
decode_method)
print("Cur WER = %f" % (wer_sum / wer_counter)) print("Cur WER = %f" % (wer_sum / wer_counter))
print("Final WER = %f" % (wer_sum / wer_counter)) print("Final WER = %f" % (wer_sum / wer_counter))
......
""" """Inferer for DeepSpeech2 model."""
Inference for a simplifed version of Baidu DeepSpeech2 model. from __future__ import absolute_import
""" from __future__ import division
from __future__ import print_function
import paddle.v2 as paddle
import distutils.util
import argparse import argparse
import gzip import gzip
from audio_data_utils import DataGenerator import distutils.util
import paddle.v2 as paddle
from data_utils.data import DataGenerator
from model import deep_speech2 from model import deep_speech2
from decoder import * from decoder import *
from error_rate import wer from error_rate import wer
import time import utils
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(description=__doc__)
description='Simplified version of DeepSpeech2 inference.')
parser.add_argument( parser.add_argument(
"--num_samples", "--num_samples",
default=100, default=100,
...@@ -40,8 +40,8 @@ parser.add_argument( ...@@ -40,8 +40,8 @@ parser.add_argument(
type=distutils.util.strtobool, type=distutils.util.strtobool,
help="Use gpu or not. (default: %(default)s)") help="Use gpu or not. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--normalizer_manifest_path", "--mean_std_filepath",
default='data/manifest.libri.train-clean-100', default='mean_std.npz',
type=str, type=str,
help="Manifest path for normalizer. (default: %(default)s)") help="Manifest path for normalizer. (default: %(default)s)")
parser.add_argument( parser.add_argument(
...@@ -56,12 +56,12 @@ parser.add_argument( ...@@ -56,12 +56,12 @@ parser.add_argument(
help="Model filepath. (default: %(default)s)") help="Model filepath. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--vocab_filepath", "--vocab_filepath",
default='data/eng_vocab.txt', default='datasets/vocab/eng_vocab.txt',
type=str, type=str,
help="Vocabulary filepath. (default: %(default)s)") help="Vocabulary filepath. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--decode_method", "--decode_method",
default='beam_search_nproc', default='best_path',
type=str, type=str,
help="Method for ctc decoding:" help="Method for ctc decoding:"
" best_path," " best_path,"
...@@ -79,7 +79,7 @@ parser.add_argument( ...@@ -79,7 +79,7 @@ parser.add_argument(
help="Number of output per sample in beam search. (default: %(default)d)") help="Number of output per sample in beam search. (default: %(default)d)")
parser.add_argument( parser.add_argument(
"--language_model_path", "--language_model_path",
default="./data/1Billion.klm", default="data/1Billion.klm",
type=str, type=str,
help="Path for language model. (default: %(default)s)") help="Path for language model. (default: %(default)s)")
parser.add_argument( parser.add_argument(
...@@ -102,34 +102,26 @@ args = parser.parse_args() ...@@ -102,34 +102,26 @@ args = parser.parse_args()
def infer(): def infer():
""" """Inference for DeepSpeech2."""
Inference for DeepSpeech2.
"""
# initialize data generator # initialize data generator
data_generator = DataGenerator( data_generator = DataGenerator(
vocab_filepath=args.vocab_filepath, vocab_filepath=args.vocab_filepath,
normalizer_manifest_path=args.normalizer_manifest_path, mean_std_filepath=args.mean_std_filepath,
normalizer_num_samples=200, augmentation_config='{}')
max_duration=20.0,
min_duration=0.0,
stride_ms=10,
window_ms=20)
# create network config # create network config
dict_size = data_generator.vocabulary_size() # paddle.data_type.dense_array is used for variable batch input.
vocab_list = data_generator.vocabulary_list() # The size 161 * 161 is only an placeholder value and the real shape
# of input batch data will be induced during training.
audio_data = paddle.layer.data( audio_data = paddle.layer.data(
name="audio_spectrogram", name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161))
height=161,
width=2000,
type=paddle.data_type.dense_vector(322000))
text_data = paddle.layer.data( text_data = paddle.layer.data(
name="transcript_text", name="transcript_text",
type=paddle.data_type.integer_value_sequence(dict_size)) type=paddle.data_type.integer_value_sequence(data_generator.vocab_size))
output_probs = deep_speech2( output_probs = deep_speech2(
audio_data=audio_data, audio_data=audio_data,
text_data=text_data, text_data=text_data,
dict_size=dict_size, dict_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers, num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers, num_rnn_layers=args.num_rnn_layers,
rnn_size=args.rnn_layer_size, rnn_size=args.rnn_layer_size,
...@@ -140,23 +132,20 @@ def infer(): ...@@ -140,23 +132,20 @@ def infer():
gzip.open(args.model_filepath)) gzip.open(args.model_filepath))
# prepare infer data # prepare infer data
feeding = data_generator.data_name_feeding() batch_reader = data_generator.batch_reader_creator(
test_batch_reader = data_generator.batch_reader_creator(
manifest_path=args.decode_manifest_path, manifest_path=args.decode_manifest_path,
batch_size=args.num_samples, batch_size=args.num_samples,
padding_to=2000, sortagrad=False,
flatten=True, shuffle_method=None)
sort_by_duration=False, infer_data = batch_reader().next()
shuffle=False)
infer_data = test_batch_reader().next()
# run inference # run inference
infer_results = paddle.infer( infer_results = paddle.infer(
output_layer=output_probs, parameters=parameters, input=infer_data) output_layer=output_probs, parameters=parameters, input=infer_data)
num_steps = len(infer_results) / len(infer_data) num_steps = len(infer_results) // len(infer_data)
probs_split = [ probs_split = [
infer_results[i * num_steps:(i + 1) * num_steps] infer_results[i * num_steps:(i + 1) * num_steps]
for i in xrange(0, len(infer_data)) for i in xrange(len(infer_data))
] ]
## decode and print ## decode and print
...@@ -165,10 +154,11 @@ def infer(): ...@@ -165,10 +154,11 @@ def infer():
total_time = 0.0 total_time = 0.0
if args.decode_method == "best_path": if args.decode_method == "best_path":
for i, probs in enumerate(probs_split): for i, probs in enumerate(probs_split):
target_transcription = ''.join( target_transcription = ''.join([
[vocab_list[index] for index in infer_data[i][1]]) data_generator.vocab_list[index] for index in infer_data[i][1]
])
best_path_transcription = ctc_best_path_decode( best_path_transcription = ctc_best_path_decode(
probs_seq=probs, vocabulary=vocab_list) probs_seq=probs, vocabulary=data_generator.vocab_list)
print("\nTarget Transcription: %s\nOutput Transcription: %s" % print("\nTarget Transcription: %s\nOutput Transcription: %s" %
(target_transcription, best_path_transcription)) (target_transcription, best_path_transcription))
wer_cur = wer(target_transcription, best_path_transcription) wer_cur = wer(target_transcription, best_path_transcription)
...@@ -180,13 +170,14 @@ def infer(): ...@@ -180,13 +170,14 @@ def infer():
elif args.decode_method == "beam_search": elif args.decode_method == "beam_search":
ext_scorer = Scorer(args.alpha, args.beta, args.language_model_path) ext_scorer = Scorer(args.alpha, args.beta, args.language_model_path)
for i, probs in enumerate(probs_split): for i, probs in enumerate(probs_split):
target_transcription = ''.join( target_transcription = ''.join([
[vocab_list[index] for index in infer_data[i][1]]) data_generator.vocab_list[index] for index in infer_data[i][1]
])
beam_search_result = ctc_beam_search_decoder( beam_search_result = ctc_beam_search_decoder(
probs_seq=probs, probs_seq=probs,
vocabulary=vocab_list, vocabulary=data_generator.vocab_list,
beam_size=args.beam_size, beam_size=args.beam_size,
blank_id=len(vocab_list), blank_id=len(data_generator.vocab_list),
cutoff_prob=args.cutoff_prob, cutoff_prob=args.cutoff_prob,
ext_scoring_func=ext_scorer, ) ext_scoring_func=ext_scorer, )
print("\nTarget Transcription:\t%s" % target_transcription) print("\nTarget Transcription:\t%s" % target_transcription)
...@@ -204,14 +195,15 @@ def infer(): ...@@ -204,14 +195,15 @@ def infer():
ext_scorer = Scorer(args.alpha, args.beta, args.language_model_path) ext_scorer = Scorer(args.alpha, args.beta, args.language_model_path)
beam_search_nproc_results = ctc_beam_search_decoder_nproc( beam_search_nproc_results = ctc_beam_search_decoder_nproc(
probs_split=probs_split, probs_split=probs_split,
vocabulary=vocab_list, vocabulary=data_generator.vocab_list,
beam_size=args.beam_size, beam_size=args.beam_size,
blank_id=len(vocab_list), blank_id=len(data_generator.vocab_list),
cutoff_prob=args.cutoff_prob, cutoff_prob=args.cutoff_prob,
ext_scoring_func=ext_scorer, ) ext_scoring_func=ext_scorer, )
for i, beam_search_result in enumerate(beam_search_nproc_results): for i, beam_search_result in enumerate(beam_search_nproc_results):
target_transcription = ''.join( target_transcription = ''.join([
[vocab_list[index] for index in infer_data[i][1]]) data_generator.vocab_list[index] for index in infer_data[i][1]
])
print("\nTarget Transcription:\t%s" % target_transcription) print("\nTarget Transcription:\t%s" % target_transcription)
for index in xrange(args.num_results_per_sample): for index in xrange(args.num_results_per_sample):
...@@ -224,10 +216,12 @@ def infer(): ...@@ -224,10 +216,12 @@ def infer():
print("cur wer = %f , average wer = %f" % print("cur wer = %f , average wer = %f" %
(wer_cur, wer_sum / wer_counter)) (wer_cur, wer_sum / wer_counter))
else: else:
raise ValueError("Decoding method [%s] is not supported." % method) raise ValueError("Decoding method [%s] is not supported." %
decode_method)
def main(): def main():
utils.print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=1) paddle.init(use_gpu=args.use_gpu, trainer_count=1)
infer() infer()
......
""" """Contains DeepSpeech2 model."""
A simplifed version of Baidu DeepSpeech2 model. from __future__ import absolute_import
""" from __future__ import division
from __future__ import print_function
import paddle.v2 as paddle import paddle.v2 as paddle
#TODO: add bidirectional rnn.
def conv_bn_layer(input, filter_size, num_channels_in, num_channels_out, stride, def conv_bn_layer(input, filter_size, num_channels_in, num_channels_out, stride,
padding, act): padding, act):
......
# -*- coding: utf-8 -*-
"""Test error rate."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import unittest
import error_rate
class TestParse(unittest.TestCase):
def test_wer_1(self):
ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night'
hyp = 'i GOT IT TO the FULLEST i LOVE TO portable FROM OF STORES last night'
word_error_rate = error_rate.wer(ref, hyp)
self.assertTrue(abs(word_error_rate - 0.769230769231) < 1e-6)
def test_wer_2(self):
ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night'
word_error_rate = error_rate.wer(ref, ref)
self.assertEqual(word_error_rate, 0.0)
def test_wer_3(self):
ref = ' '
hyp = 'Hypothesis sentence'
with self.assertRaises(ValueError):
word_error_rate = error_rate.wer(ref, hyp)
def test_cer_1(self):
ref = 'werewolf'
hyp = 'weae wolf'
char_error_rate = error_rate.cer(ref, hyp)
self.assertTrue(abs(char_error_rate - 0.25) < 1e-6)
def test_cer_2(self):
ref = 'werewolf'
char_error_rate = error_rate.cer(ref, ref)
self.assertEqual(char_error_rate, 0.0)
def test_cer_3(self):
ref = u'我是中国人'
hyp = u'我是 美洲人'
char_error_rate = error_rate.cer(ref, hyp)
self.assertTrue(abs(char_error_rate - 0.6) < 1e-6)
def test_cer_4(self):
ref = u'我是中国人'
char_error_rate = error_rate.cer(ref, ref)
self.assertFalse(char_error_rate, 0.0)
def test_cer_5(self):
ref = ''
hyp = 'Hypothesis'
with self.assertRaises(ValueError):
char_error_rate = error_rate.cer(ref, hyp)
if __name__ == '__main__':
unittest.main()
""" """Trainer for DeepSpeech2 model."""
Trainer for a simplifed version of Baidu DeepSpeech2 model. from __future__ import absolute_import
""" from __future__ import division
from __future__ import print_function
import paddle.v2 as paddle import sys
import distutils.util import os
import argparse import argparse
import gzip import gzip
import time import time
import sys import distutils.util
import paddle.v2 as paddle
from model import deep_speech2 from model import deep_speech2
from audio_data_utils import DataGenerator from data_utils.data import DataGenerator
import numpy as np import utils
import os
#TODO: add WER metric
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(description=__doc__)
description='Simplified version of DeepSpeech2 trainer.')
parser.add_argument( parser.add_argument(
"--batch_size", default=32, type=int, help="Minibatch size.") "--batch_size", default=32, type=int, help="Minibatch size.")
parser.add_argument( parser.add_argument(
...@@ -51,32 +49,38 @@ parser.add_argument( ...@@ -51,32 +49,38 @@ parser.add_argument(
help="Use gpu or not. (default: %(default)s)") help="Use gpu or not. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--use_sortagrad", "--use_sortagrad",
default=False, default=True,
type=distutils.util.strtobool, type=distutils.util.strtobool,
help="Use sortagrad or not. (default: %(default)s)") help="Use sortagrad or not. (default: %(default)s)")
parser.add_argument(
"--shuffle_method",
default='instance_shuffle',
type=str,
help="Shuffle method: 'instance_shuffle', 'batch_shuffle', "
"'batch_shuffle_batch'. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--trainer_count", "--trainer_count",
default=4, default=4,
type=int, type=int,
help="Trainer number. (default: %(default)s)") help="Trainer number. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--normalizer_manifest_path", "--mean_std_filepath",
default='data/manifest.libri.train-clean-100', default='mean_std.npz',
type=str, type=str,
help="Manifest path for normalizer. (default: %(default)s)") help="Manifest path for normalizer. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--train_manifest_path", "--train_manifest_path",
default='data/manifest.libri.train-clean-100', default='datasets/manifest.train',
type=str, type=str,
help="Manifest path for training. (default: %(default)s)") help="Manifest path for training. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--dev_manifest_path", "--dev_manifest_path",
default='data/manifest.libri.dev-clean', default='datasets/manifest.dev',
type=str, type=str,
help="Manifest path for validation. (default: %(default)s)") help="Manifest path for validation. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--vocab_filepath", "--vocab_filepath",
default='data/eng_vocab.txt', default='datasets/vocab/eng_vocab.txt',
type=str, type=str,
help="Vocabulary filepath. (default: %(default)s)") help="Vocabulary filepath. (default: %(default)s)")
parser.add_argument( parser.add_argument(
...@@ -86,37 +90,42 @@ parser.add_argument( ...@@ -86,37 +90,42 @@ parser.add_argument(
help="If set None, the training will start from scratch. " help="If set None, the training will start from scratch. "
"Otherwise, the training will resume from " "Otherwise, the training will resume from "
"the existing model of this path. (default: %(default)s)") "the existing model of this path. (default: %(default)s)")
parser.add_argument(
"--augmentation_config",
default='{}',
type=str,
help="Augmentation configuration in json-format. "
"(default: %(default)s)")
args = parser.parse_args() args = parser.parse_args()
def train(): def train():
""" """DeepSpeech2 training."""
DeepSpeech2 training.
"""
# initialize data generator # initialize data generator
data_generator = DataGenerator( def data_generator():
vocab_filepath=args.vocab_filepath, return DataGenerator(
normalizer_manifest_path=args.normalizer_manifest_path, vocab_filepath=args.vocab_filepath,
normalizer_num_samples=200, mean_std_filepath=args.mean_std_filepath,
max_duration=20.0, augmentation_config=args.augmentation_config)
min_duration=0.0,
stride_ms=10, train_generator = data_generator()
window_ms=20) test_generator = data_generator()
# create network config # create network config
dict_size = data_generator.vocabulary_size() # paddle.data_type.dense_array is used for variable batch input.
# The size 161 * 161 is only an placeholder value and the real shape
# of input batch data will be induced during training.
audio_data = paddle.layer.data( audio_data = paddle.layer.data(
name="audio_spectrogram", name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161))
height=161,
width=2000,
type=paddle.data_type.dense_vector(322000))
text_data = paddle.layer.data( text_data = paddle.layer.data(
name="transcript_text", name="transcript_text",
type=paddle.data_type.integer_value_sequence(dict_size)) type=paddle.data_type.integer_value_sequence(
train_generator.vocab_size))
cost = deep_speech2( cost = deep_speech2(
audio_data=audio_data, audio_data=audio_data,
text_data=text_data, text_data=text_data,
dict_size=dict_size, dict_size=train_generator.vocab_size,
num_conv_layers=args.num_conv_layers, num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers, num_rnn_layers=args.num_rnn_layers,
rnn_size=args.rnn_layer_size, rnn_size=args.rnn_layer_size,
...@@ -136,28 +145,18 @@ def train(): ...@@ -136,28 +145,18 @@ def train():
cost=cost, parameters=parameters, update_equation=optimizer) cost=cost, parameters=parameters, update_equation=optimizer)
# prepare data reader # prepare data reader
train_batch_reader_sortagrad = data_generator.batch_reader_creator( train_batch_reader = train_generator.batch_reader_creator(
manifest_path=args.train_manifest_path,
batch_size=args.batch_size,
padding_to=2000,
flatten=True,
sort_by_duration=True,
shuffle=False)
train_batch_reader_nosortagrad = data_generator.batch_reader_creator(
manifest_path=args.train_manifest_path, manifest_path=args.train_manifest_path,
batch_size=args.batch_size, batch_size=args.batch_size,
padding_to=2000, min_batch_size=args.trainer_count,
flatten=True, sortagrad=args.use_sortagrad if args.init_model_path is None else False,
sort_by_duration=False, shuffle_method=args.shuffle_method)
shuffle=True) test_batch_reader = test_generator.batch_reader_creator(
test_batch_reader = data_generator.batch_reader_creator(
manifest_path=args.dev_manifest_path, manifest_path=args.dev_manifest_path,
batch_size=args.batch_size, batch_size=args.batch_size,
padding_to=2000, min_batch_size=1, # must be 1, but will have errors.
flatten=True, sortagrad=False,
sort_by_duration=False, shuffle_method=None)
shuffle=False)
feeding = data_generator.data_name_feeding()
# create event handler # create event handler
def event_handler(event): def event_handler(event):
...@@ -165,9 +164,9 @@ def train(): ...@@ -165,9 +164,9 @@ def train():
if isinstance(event, paddle.event.EndIteration): if isinstance(event, paddle.event.EndIteration):
cost_sum += event.cost cost_sum += event.cost
cost_counter += 1 cost_counter += 1
if event.batch_id % 50 == 0: if (event.batch_id + 1) % 100 == 0:
print "\nPass: %d, Batch: %d, TrainCost: %f" % ( print("\nPass: %d, Batch: %d, TrainCost: %f" % (
event.pass_id, event.batch_id, cost_sum / cost_counter) event.pass_id, event.batch_id + 1, cost_sum / cost_counter))
cost_sum, cost_counter = 0.0, 0 cost_sum, cost_counter = 0.0, 0
with gzip.open("params.tar.gz", 'w') as f: with gzip.open("params.tar.gz", 'w') as f:
parameters.to_tar(f) parameters.to_tar(f)
...@@ -178,28 +177,21 @@ def train(): ...@@ -178,28 +177,21 @@ def train():
start_time = time.time() start_time = time.time()
cost_sum, cost_counter = 0.0, 0 cost_sum, cost_counter = 0.0, 0
if isinstance(event, paddle.event.EndPass): if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=test_batch_reader, feeding=feeding) result = trainer.test(
print "\n------- Time: %d sec, Pass: %d, ValidationCost: %s" % ( reader=test_batch_reader, feeding=test_generator.feeding)
time.time() - start_time, event.pass_id, result.cost) print("\n------- Time: %d sec, Pass: %d, ValidationCost: %s" %
(time.time() - start_time, event.pass_id, result.cost))
# run train # run train
# first pass with sortagrad
if args.use_sortagrad:
trainer.train(
reader=train_batch_reader_sortagrad,
event_handler=event_handler,
num_passes=1,
feeding=feeding)
args.num_passes -= 1
# other passes without sortagrad
trainer.train( trainer.train(
reader=train_batch_reader_nosortagrad, reader=train_batch_reader,
event_handler=event_handler, event_handler=event_handler,
num_passes=args.num_passes, num_passes=args.num_passes,
feeding=feeding) feeding=train_generator.feeding)
def main(): def main():
utils.print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count) paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)
train() train()
......
""" """Parameters tuning for DeepSpeech2 model."""
Parameters tuning for beam search decoder in Deep Speech 2. from __future__ import absolute_import
""" from __future__ import division
from __future__ import print_function
import paddle.v2 as paddle import paddle.v2 as paddle
import distutils.util import distutils.util
import argparse import argparse
import gzip import gzip
from audio_data_utils import DataGenerator from data_utils.data import DataGenerator
from model import deep_speech2 from model import deep_speech2
from decoder import * from decoder import *
from error_rate import wer from error_rate import wer
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(description=__doc__)
description='Parameters tuning for ctc beam search decoder in Deep Speech 2.'
)
parser.add_argument( parser.add_argument(
"--num_samples", "--num_samples",
default=100, default=100,
...@@ -39,6 +38,11 @@ parser.add_argument( ...@@ -39,6 +38,11 @@ parser.add_argument(
default=True, default=True,
type=distutils.util.strtobool, type=distutils.util.strtobool,
help="Use gpu or not. (default: %(default)s)") help="Use gpu or not. (default: %(default)s)")
parser.add_argument(
"--mean_std_filepath",
default='mean_std.npz',
type=str,
help="Manifest path for normalizer. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--normalizer_manifest_path", "--normalizer_manifest_path",
default='data/manifest.libri.train-clean-100', default='data/manifest.libri.train-clean-100',
...@@ -56,7 +60,7 @@ parser.add_argument( ...@@ -56,7 +60,7 @@ parser.add_argument(
help="Model filepath. (default: %(default)s)") help="Model filepath. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--vocab_filepath", "--vocab_filepath",
default='data/eng_vocab.txt', default='datasets/vocab/eng_vocab.txt',
type=str, type=str,
help="Vocabulary filepath. (default: %(default)s)") help="Vocabulary filepath. (default: %(default)s)")
parser.add_argument( parser.add_argument(
...@@ -77,7 +81,7 @@ parser.add_argument( ...@@ -77,7 +81,7 @@ parser.add_argument(
help="Number of outputs per sample in beam search. (default: %(default)d)") help="Number of outputs per sample in beam search. (default: %(default)d)")
parser.add_argument( parser.add_argument(
"--language_model_path", "--language_model_path",
default="./data/1Billion.klm", default="data/1Billion.klm",
type=str, type=str,
help="Path for language model. (default: %(default)s)") help="Path for language model. (default: %(default)s)")
parser.add_argument( parser.add_argument(
...@@ -120,9 +124,7 @@ args = parser.parse_args() ...@@ -120,9 +124,7 @@ args = parser.parse_args()
def tune(): def tune():
""" """Tune parameters alpha and beta on one minibatch."""
Tune parameters alpha and beta on one minibatch.
"""
if not args.num_alphas >= 0: if not args.num_alphas >= 0:
raise ValueError("num_alphas must be non-negative!") raise ValueError("num_alphas must be non-negative!")
...@@ -133,28 +135,22 @@ def tune(): ...@@ -133,28 +135,22 @@ def tune():
# initialize data generator # initialize data generator
data_generator = DataGenerator( data_generator = DataGenerator(
vocab_filepath=args.vocab_filepath, vocab_filepath=args.vocab_filepath,
normalizer_manifest_path=args.normalizer_manifest_path, mean_std_filepath=args.mean_std_filepath,
normalizer_num_samples=200, augmentation_config='{}')
max_duration=20.0,
min_duration=0.0,
stride_ms=10,
window_ms=20)
# create network config # create network config
dict_size = data_generator.vocabulary_size() # paddle.data_type.dense_array is used for variable batch input.
vocab_list = data_generator.vocabulary_list() # The size 161 * 161 is only an placeholder value and the real shape
# of input batch data will be induced during training.
audio_data = paddle.layer.data( audio_data = paddle.layer.data(
name="audio_spectrogram", name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161))
height=161,
width=2000,
type=paddle.data_type.dense_vector(322000))
text_data = paddle.layer.data( text_data = paddle.layer.data(
name="transcript_text", name="transcript_text",
type=paddle.data_type.integer_value_sequence(dict_size)) type=paddle.data_type.integer_value_sequence(data_generator.vocab_size))
output_probs = deep_speech2( output_probs = deep_speech2(
audio_data=audio_data, audio_data=audio_data,
text_data=text_data, text_data=text_data,
dict_size=dict_size, dict_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers, num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers, num_rnn_layers=args.num_rnn_layers,
rnn_size=args.rnn_layer_size, rnn_size=args.rnn_layer_size,
...@@ -165,21 +161,18 @@ def tune(): ...@@ -165,21 +161,18 @@ def tune():
gzip.open(args.model_filepath)) gzip.open(args.model_filepath))
# prepare infer data # prepare infer data
feeding = data_generator.data_name_feeding() batch_reader = data_generator.batch_reader_creator(
test_batch_reader = data_generator.batch_reader_creator(
manifest_path=args.decode_manifest_path, manifest_path=args.decode_manifest_path,
batch_size=args.num_samples, batch_size=args.num_samples,
padding_to=2000, sortagrad=False,
flatten=True, shuffle_method=None)
sort_by_duration=False,
shuffle=False)
# get one batch data for tuning # get one batch data for tuning
infer_data = test_batch_reader().next() infer_data = batch_reader().next()
# run inference # run inference
infer_results = paddle.infer( infer_results = paddle.infer(
output_layer=output_probs, parameters=parameters, input=infer_data) output_layer=output_probs, parameters=parameters, input=infer_data)
num_steps = len(infer_results) / len(infer_data) num_steps = len(infer_results) // len(infer_data)
probs_split = [ probs_split = [
infer_results[i * num_steps:(i + 1) * num_steps] infer_results[i * num_steps:(i + 1) * num_steps]
for i in xrange(0, len(infer_data)) for i in xrange(0, len(infer_data))
...@@ -198,13 +191,15 @@ def tune(): ...@@ -198,13 +191,15 @@ def tune():
# beam search decode # beam search decode
if args.decode_method == "beam_search": if args.decode_method == "beam_search":
for i, probs in enumerate(probs_split): for i, probs in enumerate(probs_split):
target_transcription = ''.join( target_transcription = ''.join([
[vocab_list[index] for index in infer_data[i][1]]) data_generator.vocab_list[index]
for index in infer_data[i][1]
])
beam_search_result = ctc_beam_search_decoder( beam_search_result = ctc_beam_search_decoder(
probs_seq=probs, probs_seq=probs,
vocabulary=vocab_list, vocabulary=data_generator.vocab_list,
beam_size=args.beam_size, beam_size=args.beam_size,
blank_id=len(vocab_list), blank_id=len(data_generator.vocab_list),
cutoff_prob=args.cutoff_prob, cutoff_prob=args.cutoff_prob,
ext_scoring_func=ext_scorer, ) ext_scoring_func=ext_scorer, )
wer_sum += wer(target_transcription, beam_search_result[0][1]) wer_sum += wer(target_transcription, beam_search_result[0][1])
...@@ -213,18 +208,21 @@ def tune(): ...@@ -213,18 +208,21 @@ def tune():
elif args.decode_method == "beam_search_nproc": elif args.decode_method == "beam_search_nproc":
beam_search_nproc_results = ctc_beam_search_decoder_nproc( beam_search_nproc_results = ctc_beam_search_decoder_nproc(
probs_split=probs_split, probs_split=probs_split,
vocabulary=vocab_list, vocabulary=data_generator.vocab_list,
beam_size=args.beam_size, beam_size=args.beam_size,
cutoff_prob=args.cutoff_prob, cutoff_prob=args.cutoff_prob,
blank_id=len(vocab_list), blank_id=len(data_generator.vocab_list),
ext_scoring_func=ext_scorer, ) ext_scoring_func=ext_scorer, )
for i, beam_search_result in enumerate(beam_search_nproc_results): for i, beam_search_result in enumerate(beam_search_nproc_results):
target_transcription = ''.join( target_transcription = ''.join([
[vocab_list[index] for index in infer_data[i][1]]) data_generator.vocab_list[index]
for index in infer_data[i][1]
])
wer_sum += wer(target_transcription, beam_search_result[0][1]) wer_sum += wer(target_transcription, beam_search_result[0][1])
wer_counter += 1 wer_counter += 1
else: else:
raise ValueError("Decoding method [%s] is not supported." % method) raise ValueError("Decoding method [%s] is not supported." %
decode_method)
print("alpha = %f\tbeta = %f\tWER = %f" % print("alpha = %f\tbeta = %f\tWER = %f" %
(alpha, beta, wer_sum / wer_counter)) (alpha, beta, wer_sum / wer_counter))
......
"""Contains common utility functions."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
def print_arguments(args):
"""Print argparse's arguments.
Usage:
.. code-block:: python
parser = argparse.ArgumentParser()
parser.add_argument("name", default="Jonh", type=str, help="User name.")
args = parser.parse_args()
print_arguments(args)
:param args: Input argparse.Namespace for printing.
:type args: argparse.Namespace
"""
print("----- Configuration Arguments -----")
for arg, value in vars(args).iteritems():
print("%s: %s" % (arg, value))
print("------------------------------------")
TBD 图像分类
=======================
这里将介绍如何在PaddlePaddle下使用AlexNet、VGG、GoogLeNet和ResNet模型进行图像分类。图像分类问题的描述和这四种模型的介绍可以参考[PaddlePaddle book](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification)
## 训练模型
### 初始化
在初始化阶段需要导入所用的包,并对PaddlePaddle进行初始化。
```python
import gzip
import paddle.v2.dataset.flowers as flowers
import paddle.v2 as paddle
import reader
import vgg
import resnet
import alexnet
import googlenet
# PaddlePaddle init
paddle.init(use_gpu=False, trainer_count=1)
```
### 定义参数和输入
设置算法参数(如数据维度、类别数目和batch size等参数),定义数据输入层`image`和类别标签`lbl`
```python
DATA_DIM = 3 * 224 * 224
CLASS_DIM = 102
BATCH_SIZE = 128
image = paddle.layer.data(
name="image", type=paddle.data_type.dense_vector(DATA_DIM))
lbl = paddle.layer.data(
name="label", type=paddle.data_type.integer_value(CLASS_DIM))
```
### 获得所用模型
这里可以选择使用AlexNet、VGG、GoogLeNet和ResNet模型中的一个模型进行图像分类。通过调用相应的方法可以获得网络最后的Softmax层。
1. 使用AlexNet模型
指定输入层`image`和类别数目`CLASS_DIM`后,可以通过下面的代码得到AlexNet的Softmax层。
```python
out = alexnet.alexnet(image, class_dim=CLASS_DIM)
```
2. 使用VGG模型
根据层数的不同,VGG分为VGG13、VGG16和VGG19。使用VGG16模型的代码如下:
```python
out = vgg.vgg16(image, class_dim=CLASS_DIM)
```
类似地,VGG13和VGG19可以分别通过`vgg.vgg13``vgg.vgg19`方法获得。
3. 使用GoogLeNet模型
GoogLeNet在训练阶段使用两个辅助的分类器强化梯度信息并进行额外的正则化。因此`googlenet.googlenet`共返回三个Softmax层,如下面的代码所示:
```python
out, out1, out2 = googlenet.googlenet(image, class_dim=CLASS_DIM)
loss1 = paddle.layer.cross_entropy_cost(
input=out1, label=lbl, coeff=0.3)
paddle.evaluator.classification_error(input=out1, label=lbl)
loss2 = paddle.layer.cross_entropy_cost(
input=out2, label=lbl, coeff=0.3)
paddle.evaluator.classification_error(input=out2, label=lbl)
extra_layers = [loss1, loss2]
```
对于两个辅助的输出,这里分别对其计算损失函数并评价错误率,然后将损失作为后文SGD的extra_layers。
4. 使用ResNet模型
ResNet模型可以通过下面的代码获取:
```python
out = resnet.resnet_imagenet(image, class_dim=CLASS_DIM)
```
### 定义损失函数
```python
cost = paddle.layer.classification_cost(input=out, label=lbl)
```
### 创建参数和优化方法
```python
# Create parameters
parameters = paddle.parameters.create(cost)
# Create optimizer
optimizer = paddle.optimizer.Momentum(
momentum=0.9,
regularization=paddle.optimizer.L2Regularization(rate=0.0005 *
BATCH_SIZE),
learning_rate=0.001 / BATCH_SIZE,
learning_rate_decay_a=0.1,
learning_rate_decay_b=128000 * 35,
learning_rate_schedule="discexp", )
```
通过 `learning_rate_decay_a` (简写$a$) 、`learning_rate_decay_b` (简写$b$) 和 `learning_rate_schedule` 指定学习率调整策略,这里采用离散指数的方式调节学习率,计算公式如下, $n$ 代表已经处理过的累计总样本数,$lr_{0}$ 即为参数里设置的 `learning_rate`
$$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
### 定义数据读取
首先以[花卉数据](http://www.robots.ox.ac.uk/~vgg/data/flowers/102/index.html)为例说明如何定义输入。下面的代码定义了花卉数据训练集和验证集的输入:
```python
train_reader = paddle.batch(
paddle.reader.shuffle(
flowers.train(),
buf_size=1000),
batch_size=BATCH_SIZE)
test_reader = paddle.batch(
flowers.valid(),
batch_size=BATCH_SIZE)
```
若需要使用其他数据,则需要先建立图像列表文件。`reader.py`定义了这种文件的读取方式,它从图像列表文件中解析出图像路径和类别标签。
图像列表文件是一个文本文件,其中每一行由一个图像路径和类别标签构成,二者以跳格符(Tab)隔开。类别标签用整数表示,其最小值为0。下面给出一个图像列表文件的片段示例:
```
dataset_100/train_images/n03982430_23191.jpeg 1
dataset_100/train_images/n04461696_23653.jpeg 7
dataset_100/train_images/n02441942_3170.jpeg 8
dataset_100/train_images/n03733281_31716.jpeg 2
dataset_100/train_images/n03424325_240.jpeg 0
dataset_100/train_images/n02643566_75.jpeg 8
```
训练时需要分别指定训练集和验证集的图像列表文件。这里假设这两个文件分别为`train.list``val.list`,数据读取方式如下:
```python
train_reader = paddle.batch(
paddle.reader.shuffle(
reader.train_reader('train.list'),
buf_size=1000),
batch_size=BATCH_SIZE)
test_reader = paddle.batch(
reader.test_reader('val.list'),
batch_size=BATCH_SIZE)
```
### 定义事件处理程序
```python
# End batch and end pass event handler
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 1 == 0:
print "\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
if isinstance(event, paddle.event.EndPass):
with gzip.open('params_pass_%d.tar.gz' % event.pass_id, 'w') as f:
parameters.to_tar(f)
result = trainer.test(reader=test_reader)
print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
```
### 定义训练方法
对于AlexNet、VGG和ResNet,可以按下面的代码定义训练方法:
```python
# Create trainer
trainer = paddle.trainer.SGD(
cost=cost,
parameters=parameters,
update_equation=optimizer)
```
GoogLeNet有两个额外的输出层,因此需要指定`extra_layers`,如下所示:
```python
# Create trainer
trainer = paddle.trainer.SGD(
cost=cost,
parameters=parameters,
update_equation=optimizer,
extra_layers=extra_layers)
```
### 开始训练
```python
trainer.train(
reader=train_reader, num_passes=200, event_handler=event_handler)
```
## 应用模型
模型训练好后,可以使用下面的代码预测给定图片的类别。
```python
# load parameters
with gzip.open('params_pass_10.tar.gz', 'r') as f:
parameters = paddle.parameters.Parameters.from_tar(f)
file_list = [line.strip() for line in open(image_list_file)]
test_data = [(paddle.image.load_and_transform(image_file, 256, 224, False)
.flatten().astype('float32'), )
for image_file in file_list]
probs = paddle.infer(
output_layer=out, parameters=parameters, input=test_data)
lab = np.argsort(-probs)
for file_name, result in zip(file_list, lab):
print "Label of %s is: %d" % (file_name, result[0])
```
首先从文件中加载训练好的模型(代码里以第10轮迭代的结果为例),然后读取`image_list_file`中的图像。`image_list_file`是一个文本文件,每一行为一个图像路径。代码使用`paddle.infer`判断`image_list_file`中每个图像的类别,并进行输出。
import paddle.v2 as paddle
__all__ = ['alexnet']
def alexnet(input, class_dim):
conv1 = paddle.layer.img_conv(
input=input,
filter_size=11,
num_channels=3,
num_filters=96,
stride=4,
padding=1)
cmrnorm1 = paddle.layer.img_cmrnorm(
input=conv1, size=5, scale=0.0001, power=0.75)
pool1 = paddle.layer.img_pool(input=cmrnorm1, pool_size=3, stride=2)
conv2 = paddle.layer.img_conv(
input=pool1,
filter_size=5,
num_filters=256,
stride=1,
padding=2,
groups=1)
cmrnorm2 = paddle.layer.img_cmrnorm(
input=conv2, size=5, scale=0.0001, power=0.75)
pool2 = paddle.layer.img_pool(input=cmrnorm2, pool_size=3, stride=2)
pool3 = paddle.networks.img_conv_group(
input=pool2,
pool_size=3,
pool_stride=2,
conv_num_filter=[384, 384, 256],
conv_filter_size=3,
pool_type=paddle.pooling.Max())
fc1 = paddle.layer.fc(
input=pool3,
size=4096,
act=paddle.activation.Relu(),
layer_attr=paddle.attr.Extra(drop_rate=0.5))
fc2 = paddle.layer.fc(
input=fc1,
size=4096,
act=paddle.activation.Relu(),
layer_attr=paddle.attr.Extra(drop_rate=0.5))
out = paddle.layer.fc(
input=fc2, size=class_dim, act=paddle.activation.Softmax())
return out
import paddle.v2 as paddle
__all__ = ['googlenet']
def inception(name, input, channels, filter1, filter3R, filter3, filter5R,
filter5, proj):
cov1 = paddle.layer.img_conv(
name=name + '_1',
input=input,
filter_size=1,
num_channels=channels,
num_filters=filter1,
stride=1,
padding=0)
cov3r = paddle.layer.img_conv(
name=name + '_3r',
input=input,
filter_size=1,
num_channels=channels,
num_filters=filter3R,
stride=1,
padding=0)
cov3 = paddle.layer.img_conv(
name=name + '_3',
input=cov3r,
filter_size=3,
num_filters=filter3,
stride=1,
padding=1)
cov5r = paddle.layer.img_conv(
name=name + '_5r',
input=input,
filter_size=1,
num_channels=channels,
num_filters=filter5R,
stride=1,
padding=0)
cov5 = paddle.layer.img_conv(
name=name + '_5',
input=cov5r,
filter_size=5,
num_filters=filter5,
stride=1,
padding=2)
pool1 = paddle.layer.img_pool(
name=name + '_max',
input=input,
pool_size=3,
num_channels=channels,
stride=1,
padding=1)
covprj = paddle.layer.img_conv(
name=name + '_proj',
input=pool1,
filter_size=1,
num_filters=proj,
stride=1,
padding=0)
cat = paddle.layer.concat(name=name, input=[cov1, cov3, cov5, covprj])
return cat
def googlenet(input, class_dim):
# stage 1
conv1 = paddle.layer.img_conv(
name="conv1",
input=input,
filter_size=7,
num_channels=3,
num_filters=64,
stride=2,
padding=3)
pool1 = paddle.layer.img_pool(
name="pool1", input=conv1, pool_size=3, num_channels=64, stride=2)
# stage 2
conv2_1 = paddle.layer.img_conv(
name="conv2_1",
input=pool1,
filter_size=1,
num_filters=64,
stride=1,
padding=0)
conv2_2 = paddle.layer.img_conv(
name="conv2_2",
input=conv2_1,
filter_size=3,
num_filters=192,
stride=1,
padding=1)
pool2 = paddle.layer.img_pool(
name="pool2", input=conv2_2, pool_size=3, num_channels=192, stride=2)
# stage 3
ince3a = inception("ince3a", pool2, 192, 64, 96, 128, 16, 32, 32)
ince3b = inception("ince3b", ince3a, 256, 128, 128, 192, 32, 96, 64)
pool3 = paddle.layer.img_pool(
name="pool3", input=ince3b, num_channels=480, pool_size=3, stride=2)
# stage 4
ince4a = inception("ince4a", pool3, 480, 192, 96, 208, 16, 48, 64)
ince4b = inception("ince4b", ince4a, 512, 160, 112, 224, 24, 64, 64)
ince4c = inception("ince4c", ince4b, 512, 128, 128, 256, 24, 64, 64)
ince4d = inception("ince4d", ince4c, 512, 112, 144, 288, 32, 64, 64)
ince4e = inception("ince4e", ince4d, 528, 256, 160, 320, 32, 128, 128)
pool4 = paddle.layer.img_pool(
name="pool4", input=ince4e, num_channels=832, pool_size=3, stride=2)
# stage 5
ince5a = inception("ince5a", pool4, 832, 256, 160, 320, 32, 128, 128)
ince5b = inception("ince5b", ince5a, 832, 384, 192, 384, 48, 128, 128)
pool5 = paddle.layer.img_pool(
name="pool5",
input=ince5b,
num_channels=1024,
pool_size=7,
stride=7,
pool_type=paddle.pooling.Avg())
dropout = paddle.layer.addto(
input=pool5,
layer_attr=paddle.attr.Extra(drop_rate=0.4),
act=paddle.activation.Linear())
out = paddle.layer.fc(
input=dropout, size=class_dim, act=paddle.activation.Softmax())
# fc for output 1
pool_o1 = paddle.layer.img_pool(
name="pool_o1",
input=ince4a,
num_channels=512,
pool_size=5,
stride=3,
pool_type=paddle.pooling.Avg())
conv_o1 = paddle.layer.img_conv(
name="conv_o1",
input=pool_o1,
filter_size=1,
num_filters=128,
stride=1,
padding=0)
fc_o1 = paddle.layer.fc(
name="fc_o1",
input=conv_o1,
size=1024,
layer_attr=paddle.attr.Extra(drop_rate=0.7),
act=paddle.activation.Relu())
out1 = paddle.layer.fc(
input=fc_o1, size=class_dim, act=paddle.activation.Softmax())
# fc for output 2
pool_o2 = paddle.layer.img_pool(
name="pool_o2",
input=ince4d,
num_channels=528,
pool_size=5,
stride=3,
pool_type=paddle.pooling.Avg())
conv_o2 = paddle.layer.img_conv(
name="conv_o2",
input=pool_o2,
filter_size=1,
num_filters=128,
stride=1,
padding=0)
fc_o2 = paddle.layer.fc(
name="fc_o2",
input=conv_o2,
size=1024,
layer_attr=paddle.attr.Extra(drop_rate=0.7),
act=paddle.activation.Relu())
out2 = paddle.layer.fc(
input=fc_o2, size=class_dim, act=paddle.activation.Softmax())
return out, out1, out2
import gzip
import paddle.v2 as paddle
import reader
import vgg
import resnet
import alexnet
import googlenet
import argparse
import os
from PIL import Image
import numpy as np
WIDTH = 224
HEIGHT = 224
DATA_DIM = 3 * WIDTH * HEIGHT
CLASS_DIM = 102
def main():
# parse the argument
parser = argparse.ArgumentParser()
parser.add_argument(
'data_list',
help='The path of data list file, which consists of one image path per line'
)
parser.add_argument(
'model',
help='The model for image classification',
choices=['alexnet', 'vgg13', 'vgg16', 'vgg19', 'resnet', 'googlenet'])
parser.add_argument(
'params_path', help='The file which stores the parameters')
args = parser.parse_args()
# PaddlePaddle init
paddle.init(use_gpu=True, trainer_count=1)
image = paddle.layer.data(
name="image", type=paddle.data_type.dense_vector(DATA_DIM))
if args.model == 'alexnet':
out = alexnet.alexnet(image, class_dim=CLASS_DIM)
elif args.model == 'vgg13':
out = vgg.vgg13(image, class_dim=CLASS_DIM)
elif args.model == 'vgg16':
out = vgg.vgg16(image, class_dim=CLASS_DIM)
elif args.model == 'vgg19':
out = vgg.vgg19(image, class_dim=CLASS_DIM)
elif args.model == 'resnet':
out = resnet.resnet_imagenet(image, class_dim=CLASS_DIM)
elif args.model == 'googlenet':
out, _, _ = googlenet.googlenet(image, class_dim=CLASS_DIM)
# load parameters
with gzip.open(args.params_path, 'r') as f:
parameters = paddle.parameters.Parameters.from_tar(f)
file_list = [line.strip() for line in open(args.data_list)]
test_data = [(paddle.image.load_and_transform(image_file, 256, 224, False)
.flatten().astype('float32'), ) for image_file in file_list]
probs = paddle.infer(
output_layer=out, parameters=parameters, input=test_data)
lab = np.argsort(-probs)
for file_name, result in zip(file_list, lab):
print "Label of %s is: %d" % (file_name, result[0])
if __name__ == '__main__':
main()
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License
import random import random
from paddle.v2.image import load_and_transform from paddle.v2.image import load_and_transform
import paddle.v2 as paddle
from multiprocessing import cpu_count
def train_mapper(sample):
'''
map image path to type needed by model input layer for the training set
'''
img, label = sample
img = paddle.image.load_image(img)
img = paddle.image.simple_transform(img, 256, 224, True)
return img.flatten().astype('float32'), label
def test_mapper(sample):
'''
map image path to type needed by model input layer for the test set
'''
img, label = sample
img = paddle.image.load_image(img)
img = paddle.image.simple_transform(img, 256, 224, True)
return img.flatten().astype('float32'), label
def train_reader(train_list): def train_reader(train_list, buffered_size=1024):
def reader(): def reader():
with open(train_list, 'r') as f: with open(train_list, 'r') as f:
lines = [line.strip() for line in f] lines = [line.strip() for line in f]
random.shuffle(lines)
for line in lines: for line in lines:
img_path, lab = line.strip().split('\t') img_path, lab = line.strip().split('\t')
im = load_and_transform(img_path, 256, 224, True) yield img_path, int(lab)
yield im.flatten().astype('float32'), int(lab)
return reader return paddle.reader.xmap_readers(train_mapper, reader,
cpu_count(), buffered_size)
def test_reader(test_list): def test_reader(test_list, buffered_size=1024):
def reader(): def reader():
with open(test_list, 'r') as f: with open(test_list, 'r') as f:
lines = [line.strip() for line in f] lines = [line.strip() for line in f]
for line in lines: for line in lines:
img_path, lab = line.strip().split('\t') img_path, lab = line.strip().split('\t')
im = load_and_transform(img_path, 256, 224, False) yield img_path, int(lab)
yield im.flatten().astype('float32'), int(lab)
return reader return paddle.reader.xmap_readers(test_mapper, reader,
cpu_count(), buffered_size)
if __name__ == '__main__': if __name__ == '__main__':
......
import paddle.v2 as paddle
__all__ = ['resnet_imagenet', 'resnet_cifar10']
def conv_bn_layer(input,
ch_out,
filter_size,
stride,
padding,
active_type=paddle.activation.Relu(),
ch_in=None):
tmp = paddle.layer.img_conv(
input=input,
filter_size=filter_size,
num_channels=ch_in,
num_filters=ch_out,
stride=stride,
padding=padding,
act=paddle.activation.Linear(),
bias_attr=False)
return paddle.layer.batch_norm(input=tmp, act=active_type)
def shortcut(input, ch_in, ch_out, stride):
if ch_in != ch_out:
return conv_bn_layer(input, ch_out, 1, stride, 0,
paddle.activation.Linear())
else:
return input
def basicblock(input, ch_in, ch_out, stride):
short = shortcut(input, ch_in, ch_out, stride)
conv1 = conv_bn_layer(input, ch_out, 3, stride, 1)
conv2 = conv_bn_layer(conv1, ch_out, 3, 1, 1, paddle.activation.Linear())
return paddle.layer.addto(
input=[short, conv2], act=paddle.activation.Relu())
def bottleneck(input, ch_in, ch_out, stride):
short = shortcut(input, ch_in, ch_out * 4, stride)
conv1 = conv_bn_layer(input, ch_out, 1, stride, 0)
conv2 = conv_bn_layer(conv1, ch_out, 3, 1, 1)
conv3 = conv_bn_layer(conv2, ch_out * 4, 1, 1, 0,
paddle.activation.Linear())
return paddle.layer.addto(
input=[short, conv3], act=paddle.activation.Relu())
def layer_warp(block_func, input, ch_in, ch_out, count, stride):
conv = block_func(input, ch_in, ch_out, stride)
for i in range(1, count):
conv = block_func(conv, ch_out, ch_out, 1)
return conv
def resnet_imagenet(input, class_dim, depth=50):
cfg = {
18: ([2, 2, 2, 1], basicblock),
34: ([3, 4, 6, 3], basicblock),
50: ([3, 4, 6, 3], bottleneck),
101: ([3, 4, 23, 3], bottleneck),
152: ([3, 8, 36, 3], bottleneck)
}
stages, block_func = cfg[depth]
conv1 = conv_bn_layer(
input, ch_in=3, ch_out=64, filter_size=7, stride=2, padding=3)
pool1 = paddle.layer.img_pool(input=conv1, pool_size=3, stride=2)
res1 = layer_warp(block_func, pool1, 64, 64, stages[0], 1)
res2 = layer_warp(block_func, res1, 64, 128, stages[1], 2)
res3 = layer_warp(block_func, res2, 128, 256, stages[2], 2)
res4 = layer_warp(block_func, res3, 256, 512, stages[3], 2)
pool2 = paddle.layer.img_pool(
input=res4, pool_size=7, stride=1, pool_type=paddle.pooling.Avg())
out = paddle.layer.fc(
input=pool2, size=class_dim, act=paddle.activation.Softmax())
return out
def resnet_cifar10(input, class_dim, depth=32):
# depth should be one of 20, 32, 44, 56, 110, 1202
assert (depth - 2) % 6 == 0
n = (depth - 2) / 6
nStages = {16, 64, 128}
conv1 = conv_bn_layer(
input, ch_in=3, ch_out=16, filter_size=3, stride=1, padding=1)
res1 = layer_warp(basicblock, conv1, 16, 16, n, 1)
res2 = layer_warp(basicblock, res1, 16, 32, n, 2)
res3 = layer_warp(basicblock, res2, 32, 64, n, 2)
pool = paddle.layer.img_pool(
input=res3, pool_size=8, stride=1, pool_type=paddle.pooling.Avg())
out = paddle.layer.fc(
input=pool, size=class_dim, act=paddle.activation.Softmax())
return out
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License
import gzip import gzip
import paddle.v2.dataset.flowers as flowers
import paddle.v2 as paddle import paddle.v2 as paddle
import reader import reader
import vgg import vgg
import resnet
import alexnet
import googlenet
import argparse
DATA_DIM = 3 * 224 * 224 DATA_DIM = 3 * 224 * 224
CLASS_DIM = 1000 CLASS_DIM = 102
BATCH_SIZE = 128 BATCH_SIZE = 128
def main(): def main():
# parse the argument
parser = argparse.ArgumentParser()
parser.add_argument(
'model',
help='The model for image classification',
choices=['alexnet', 'vgg13', 'vgg16', 'vgg19', 'resnet', 'googlenet'])
args = parser.parse_args()
# PaddlePaddle init # PaddlePaddle init
paddle.init(use_gpu=True, trainer_count=4) paddle.init(use_gpu=True, trainer_count=1)
image = paddle.layer.data( image = paddle.layer.data(
name="image", type=paddle.data_type.dense_vector(DATA_DIM)) name="image", type=paddle.data_type.dense_vector(DATA_DIM))
lbl = paddle.layer.data( lbl = paddle.layer.data(
name="label", type=paddle.data_type.integer_value(CLASS_DIM)) name="label", type=paddle.data_type.integer_value(CLASS_DIM))
net = vgg.vgg13(image)
out = paddle.layer.fc( extra_layers = None
input=net, size=CLASS_DIM, act=paddle.activation.Softmax()) learning_rate = 0.01
if args.model == 'alexnet':
out = alexnet.alexnet(image, class_dim=CLASS_DIM)
elif args.model == 'vgg13':
out = vgg.vgg13(image, class_dim=CLASS_DIM)
elif args.model == 'vgg16':
out = vgg.vgg16(image, class_dim=CLASS_DIM)
elif args.model == 'vgg19':
out = vgg.vgg19(image, class_dim=CLASS_DIM)
elif args.model == 'resnet':
out = resnet.resnet_imagenet(image, class_dim=CLASS_DIM)
learning_rate = 0.1
elif args.model == 'googlenet':
out, out1, out2 = googlenet.googlenet(image, class_dim=CLASS_DIM)
loss1 = paddle.layer.cross_entropy_cost(
input=out1, label=lbl, coeff=0.3)
paddle.evaluator.classification_error(input=out1, label=lbl)
loss2 = paddle.layer.cross_entropy_cost(
input=out2, label=lbl, coeff=0.3)
paddle.evaluator.classification_error(input=out2, label=lbl)
extra_layers = [loss1, loss2]
cost = paddle.layer.classification_cost(input=out, label=lbl) cost = paddle.layer.classification_cost(input=out, label=lbl)
# Create parameters # Create parameters
...@@ -45,16 +63,23 @@ def main(): ...@@ -45,16 +63,23 @@ def main():
momentum=0.9, momentum=0.9,
regularization=paddle.optimizer.L2Regularization(rate=0.0005 * regularization=paddle.optimizer.L2Regularization(rate=0.0005 *
BATCH_SIZE), BATCH_SIZE),
learning_rate=0.01 / BATCH_SIZE, learning_rate=learning_rate / BATCH_SIZE,
learning_rate_decay_a=0.1, learning_rate_decay_a=0.1,
learning_rate_decay_b=128000 * 35, learning_rate_decay_b=128000 * 35,
learning_rate_schedule="discexp", ) learning_rate_schedule="discexp", )
train_reader = paddle.batch( train_reader = paddle.batch(
paddle.reader.shuffle(reader.test_reader("train.list"), buf_size=1000), paddle.reader.shuffle(
flowers.train(),
# To use other data, replace the above line with:
# reader.train_reader('train.list'),
buf_size=1000),
batch_size=BATCH_SIZE) batch_size=BATCH_SIZE)
test_reader = paddle.batch( test_reader = paddle.batch(
reader.train_reader("test.list"), batch_size=BATCH_SIZE) flowers.valid(),
# To use other data, replace the above line with:
# reader.test_reader('val.list'),
batch_size=BATCH_SIZE)
# End batch and end pass event handler # End batch and end pass event handler
def event_handler(event): def event_handler(event):
...@@ -71,11 +96,14 @@ def main(): ...@@ -71,11 +96,14 @@ def main():
# Create trainer # Create trainer
trainer = paddle.trainer.SGD( trainer = paddle.trainer.SGD(
cost=cost, parameters=parameters, update_equation=optimizer) cost=cost,
parameters=parameters,
update_equation=optimizer,
extra_layers=extra_layers)
trainer.train( trainer.train(
reader=train_reader, num_passes=200, event_handler=event_handler) reader=train_reader, num_passes=200, event_handler=event_handler)
if __name__ == '__main__': if __name__ == '__main__':
main() main()
\ No newline at end of file
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.v2 as paddle import paddle.v2 as paddle
__all__ = ['vgg13', 'vgg16', 'vgg19'] __all__ = ['vgg13', 'vgg16', 'vgg19']
def vgg(input, nums): def vgg(input, nums, class_dim):
def conv_block(input, num_filter, groups, num_channels=None): def conv_block(input, num_filter, groups, num_channels=None):
return paddle.networks.img_conv_group( return paddle.networks.img_conv_group(
input=input, input=input,
...@@ -48,19 +34,21 @@ def vgg(input, nums): ...@@ -48,19 +34,21 @@ def vgg(input, nums):
size=fc_dim, size=fc_dim,
act=paddle.activation.Relu(), act=paddle.activation.Relu(),
layer_attr=paddle.attr.Extra(drop_rate=0.5)) layer_attr=paddle.attr.Extra(drop_rate=0.5))
return fc2 out = paddle.layer.fc(
input=fc2, size=class_dim, act=paddle.activation.Softmax())
return out
def vgg13(input): def vgg13(input, class_dim):
nums = [2, 2, 2, 2, 2] nums = [2, 2, 2, 2, 2]
return vgg(input, nums) return vgg(input, nums, class_dim)
def vgg16(input): def vgg16(input, class_dim):
nums = [2, 2, 3, 3, 3] nums = [2, 2, 3, 3, 3]
return vgg(input, nums) return vgg(input, nums, class_dim)
def vgg19(input): def vgg19(input, class_dim):
nums = [2, 2, 4, 4, 4] nums = [2, 2, 4, 4, 4]
return vgg(input, nums) return vgg(input, nums, class_dim)
TBD # 语言模型
## 简介
语言模型即 Language Model,简称LM。它是一个概率分布模型,简单来说,就是用来计算一个句子的概率的模型。利用它可以确定哪个词序列的可能性更大,或者给定若干个词,可以预测下一个最可能出现的词。语言模型是自然语言处理领域里一个重要的基础模型。
## 应用场景
**语言模型被应用在很多领域**,如:
* **自动写作**:语言模型可以根据上文生成下一个词,递归下去可以生成整个句子、段落、篇章。
* **QA**:语言模型可以根据Question生成Answer。
* **机器翻译**:当前主流的机器翻译模型大多基于Encoder-Decoder模式,其中Decoder就是一个语言模型,用来生成目标语言。
* **拼写检查**:语言模型可以计算出词序列的概率,一般在拼写错误处序列的概率会骤减,可以用来识别拼写错误并提供改正候选集。
* **词性标注、句法分析、语音识别......**
## 关于本例
Language Model 常见的实现方式有 N-Gram、RNN、seq2seq。本例中实现了基于N-Gram、RNN的语言模型。**本例的文件结构如下**`images` 文件夹与使用无关可不关心):
```text
.
├── data # toy、demo数据,用户可据此格式化自己的数据
│ ├── chinese.test.txt # test用的数据demo
| ├── chinese.train.txt # train用的数据demo
│ └── input.txt # infer用的输入数据demo
├── config.py # 配置文件,包括data、train、infer相关配置
├── infer.py # 预测任务脚本,即生成文本
├── network_conf.py # 本例中涉及的各种网络结构均定义在此文件中,希望进一步修改模型结构,请修改此文件
├── reader.py # 读取数据接口
├── README.md # 文档
├── train.py # 训练任务脚本
└── utils.py # 定义通用的函数,例如:构建字典、加载字典等
```
**注:一般情况下基于N-Gram的语言模型不如基于RNN的语言模型效果好,所以实际使用时建议使用基于RNN的语言模型,本例中也将着重介绍基于RNN的模型,简略介绍基于N-Gram的模型。**
## RNN 语言模型
### 简介
RNN是一个序列模型,基本思路是:在时刻t,将前一时刻t-1的隐藏层输出和t时刻的词向量一起输入到隐藏层从而得到时刻t的特征表示,然后用这个特征表示得到t时刻的预测输出,如此在时间维上递归下去。可以看出RNN善于使用上文信息、历史知识,具有“记忆”功能。理论上RNN能实现“长依赖”(即利用很久之前的知识),但在实际应用中发现效果并不理想,于是出现了很多RNN的变种,如常用的LSTM和GRU,它们对传统RNN的cell进行了改进,弥补了传统RNN的不足,本例中即使用了LSTM、GRU。下图是RNN(广义上包含了LSTM、GRU等)语言模型“循环”思想的示意图:
<p align=center><img src='images/rnn.png' width='500px'/></p>
### 模型实现
本例中RNN语言模型的实现简介如下:
* **定义模型参数**`config.py`中的`Config_rnn`**类**中定义了模型的参数变量。
* **定义模型结构**`network_conf.py`中的`rnn_lm`**函数**中定义了模型的**结构**,如下:
* 输入层:将输入的词(或字)序列映射成向量,即embedding。
* 中间层:根据配置实现RNN层,将上一步得到的embedding向量序列作为输入。
* 输出层:使用softmax归一化计算单词的概率,将output结果返回
* loss:定义模型的cost为多类交叉熵损失函数。
* **训练模型**`train.py`中的`main`方法实现了模型的训练,实现流程如下:
* 准备输入数据:建立并保存词典、构建train和test数据的reader。
* 初始化模型:包括模型的结构、参数。
* 构建训练器:demo中使用的是Adam优化算法。
* 定义回调函数:构建`event_handler`来跟踪训练过程中loss的变化,并在每轮训练结束时保存模型的参数。
* 训练:使用trainer训练模型。
* **生成文本**`infer.py`中的`main`方法实现了文本的生成,实现流程如下:
* 根据配置选择生成方法:RNN模型 or N-Gram模型。
* 加载train好的模型和词典文件。
* 读取`input_file`文件(每行为一个sentence的前缀),用启发式图搜索算法`beam_search`根据各sentence的前缀生成文本。
* 将生成的文本及其前缀保存到文件`output_file`
## N-Gram 语言模型
### 简介
N-Gram模型也称为N-1阶马尔科夫模型,它有一个有限历史假设:当前词的出现概率仅仅与前面N-1个词相关。一般采用最大似然估计(Maximum Likelihood Estimation,MLE)方法对模型的参数进行估计。当N取1、2、3时,N-Gram模型分别称为unigram、bigram和trigram语言模型。一般情况下,N越大、训练语料的规模越大,参数估计的结果越可靠,但由于模型较简单、表达能力不强以及数据稀疏等问题。一般情况下用N-Gram实现的语言模型不如RNN、seq2seq效果好。下图是基于神经网络的N-Gram语言模型结构示意图:
<p align=center><img src='images/ngram.png' width='500px'/></p>
### 模型实现
本例中N-Gram语言模型的实现简介如下:
* **定义模型参数**`config.py`中的`Config_ngram`**类**中定义了模型的参数变量。
* **定义模型结构**`network_conf.py`中的`ngram_lm`**函数**中定义了模型的**结构**,如下:
* 输入层:本例中N取5,将前四个词分别做embedding,然后连接起来作为输入。
* 中间层:根据配置实现DNN层,将上一步得到的embedding向量序列作为输入。
* 输出层:使用softmax归一化计算单词的概率,将output结果返回
* loss:定义模型的cost为多类交叉熵损失函数。
* **训练模型**`train.py`中的`main`方法实现了模型的训练,实现流程与上文中RNN语言模型基本一致。
* **生成文本**`infer.py`中的`main`方法实现了文本的生成,实现流程与上文中RNN语言模型基本一致,区别在于构建input时本例会取每个前缀的最后4(N-1)个词作为输入。
## 使用说明
运行本例的方法如下:
* 1,运行`python train.py`命令,开始train模型(默认使用RNN),待训练结束。
* 2,运行`python infer.py`命令做prediction。(输入的文本默认为`data/input.txt`,生成的文本默认保存到`data/output.txt`中。)
**如果用户需要使用自己的语料、定制模型,需要修改的地方主要是`语料`和`config.py`中的配置,需要注意的细节和适配工作详情如下:**
### 语料适配
* 清洗语料:去除原文中空格、tab、乱码,按需去除数字、标点符号、特殊符号等。
* 编码格式:utf-8,本例中已经对中文做了适配。
* 内容格式:每个句子占一行;每行中的各词之间使用一个空格符分开。
* 按需要配置`config.py`中对于data的配置:
```python
# -- config : data --
train_file = 'data/chinese.train.txt'
test_file = 'data/chinese.test.txt'
vocab_file = 'data/vocab_cn.txt' # the file to save vocab
build_vocab_method = 'fixed_size' # 'frequency' or 'fixed_size'
vocab_max_size = 3000 # when build_vocab_method = 'fixed_size'
unk_threshold = 1 # # when build_vocab_method = 'frequency'
min_sentence_length = 3
max_sentence_length = 60
```
其中,`build_vocab_method `指定了构建词典的方法:**1,按词频**,即将出现次数小于`unk_threshold `的词视为`<UNK>`;**2,按词典长度**,`vocab_max_size`定义了词典的最大长度,如果语料中出现的不同词的个数大于这个值,则根据各词的词频倒序排,取`top(vocab_max_size)`个词纳入词典。
其中`min_sentence_length`和`max_sentence_length `分别指定了句子的最小和最大长度,小于最小长度的和大于最大长度的句子将被过滤掉、不参与训练。
*注:需要注意的是词典越大生成的内容越丰富但训练耗时越久,一般中文分词之后,语料中不同的词能有几万乃至几十万,如果vocab\_max\_size取值过小则导致\<UNK\>占比过高,如果vocab\_max\_size取值较大则严重影响训练速度(对精度也有影响),所以也有“按字”训练模型的方式,即:把每个汉字当做一个词,常用汉字也就几千个,使得字典的大小不会太大、不会丢失太多信息,但汉语中同一个字在不同词中语义相差很大,有时导致模型效果不理想。建议用户多试试、根据实际情况选择是“按词训练”还是“按字训练”。*
### 模型适配、训练
* 按需调整`config.py`中对于模型的配置,详解如下:
```python
# -- config : train --
use_which_model = 'rnn' # must be: 'rnn' or 'ngram'
use_gpu = False # whether to use gpu
trainer_count = 1 # number of trainer
class Config_rnn(object):
"""
config for RNN language model
"""
rnn_type = 'gru' # or 'lstm'
emb_dim = 200
hidden_size = 200
num_layer = 2
num_passs = 2
batch_size = 32
model_file_name_prefix = 'lm_' + rnn_type + '_params_pass_'
class Config_ngram(object):
"""
config for N-Gram language model
"""
emb_dim = 200
hidden_size = 200
num_layer = 2
N = 5
num_passs = 2
batch_size = 32
model_file_name_prefix = 'lm_ngram_pass_'
```
其中,`use_which_model`指定了要train的模型,如果使用RNN语言模型则设置为'rnn',如果使用N-Gram语言模型则设置为'ngram';`use_gpu`指定了train的时候是否使用gpu;`trainer_count`指定了并行度、用几个trainer去train模型;`rnn_type` 用于配置rnn cell类型,可以取‘lstm’或‘gru’;`hidden_size`配置unit个数;`num_layer`配置RNN的层数;`num_passs`配置训练的轮数;`emb_dim`配置embedding的dimension;`batch_size `配置了train model时每个batch的大小;`model_file_name_prefix `配置了要保存的模型的名字前缀。
* 运行`python train.py`命令训练模型,模型将被保存到当前目录。
### 按需生成文本
* 按需调整`config.py`中对于infer的配置,详解如下:
```python
# -- config : infer --
input_file = 'data/input.txt' # input file contains sentence prefix each line
output_file = 'data/output.txt' # the file to save results
num_words = 10 # the max number of words need to generate
beam_size = 5 # beam_width, the number of the prediction sentence for each prefix
```
其中,`input_file`中保存的是待生成的文本前缀,utf-8编码,每个前缀占一行,形如:
```text
我 是
```
用户将需要生成的文本前缀按此格式存入文件即可;
`num_words`指定了要生成多少个单词(实际生成过程中遇到结束符会停止生成,所以实际生成的词个数可能会比此值小);`beam_size`指定了beam search方法的width,即每个前缀生成多少个候选词序列;`output_file`指定了生成结果的存放位置。
* 运行`python infer.py`命令生成文本,生成的结果格式如下:
```text
我 <EOS> 0.107702672482
我 爱 。我 中国 中国 <EOS> 0.000177299271939
我 爱 中国 。我 是 中国 <EOS> 4.51695544709e-05
我 爱 中国 中国 <EOS> 0.000910127729821
我 爱 中国 。我 是 <EOS> 0.00015957862922
```
其中,‘我’是前缀,其下方的五个句子时补全的结果,每个句子末尾的浮点数表示此句子的生成概率。
# coding=utf-8
# -- config : data --
train_file = 'data/chinese.train.txt'
test_file = 'data/chinese.test.txt'
vocab_file = 'data/vocab_cn.txt' # the file to save vocab
build_vocab_method = 'fixed_size' # 'frequency' or 'fixed_size'
vocab_max_size = 3000 # when build_vocab_method = 'fixed_size'
unk_threshold = 1 # # when build_vocab_method = 'frequency'
min_sentence_length = 3
max_sentence_length = 60
# -- config : train --
use_which_model = 'ngram' # must be: 'rnn' or 'ngram'
use_gpu = False # whether to use gpu
trainer_count = 1 # number of trainer
class Config_rnn(object):
"""
config for RNN language model
"""
rnn_type = 'gru' # or 'lstm'
emb_dim = 200
hidden_size = 200
num_layer = 2
num_passs = 2
batch_size = 32
model_file_name_prefix = 'lm_' + rnn_type + '_params_pass_'
class Config_ngram(object):
"""
config for N-Gram language model
"""
emb_dim = 200
hidden_size = 200
num_layer = 2
N = 5
num_passs = 2
batch_size = 32
model_file_name_prefix = 'lm_ngram_pass_'
# -- config : infer --
input_file = 'data/input.txt' # input file contains sentence prefix each line
output_file = 'data/output.txt' # the file to save results
num_words = 10 # the max number of words need to generate
beam_size = 5 # beam_width, the number of the prediction sentence for each prefix
我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。
\ No newline at end of file
我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。我 是 中国 人 。
我 爱 中国 。
\ No newline at end of file
我 是
我 是 中国
我 爱
我 是 中国 人。
我 爱 中国
我 爱 中国 。我
我 爱 中国 。我 爱
我 爱 中国 。我 是
我 爱 中国 。我 是 中国
\ No newline at end of file
# coding=utf-8
import paddle.v2 as paddle
import gzip
import numpy as np
from utils import *
import network_conf
from config import *
def generate_using_rnn(word_id_dict, num_words, beam_size):
"""
Demo: use RNN model to do prediction.
:param word_id_dict: vocab.
:type word_id_dict: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
:param num_words: the number of the words to generate.
:type num_words: int
:param beam_size: beam width.
:type beam_size: int
:return: save prediction results to output_file
"""
# prepare and cache model
config = Config_rnn()
_, output_layer = network_conf.rnn_lm(
vocab_size=len(word_id_dict),
emb_dim=config.emb_dim,
rnn_type=config.rnn_type,
hidden_size=config.hidden_size,
num_layer=config.num_layer) # network config
model_file_name = config.model_file_name_prefix + str(config.num_passs -
1) + '.tar.gz'
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(model_file_name)) # load parameters
inferer = paddle.inference.Inference(
output_layer=output_layer, parameters=parameters)
# tools, different from generate_using_ngram's tools
id_word_dict = dict(
[(v, k) for k, v in word_id_dict.items()]) # {id : word}
def str2ids(str):
return [[[
word_id_dict.get(w, word_id_dict['<UNK>']) for w in str.split()
]]]
def ids2str(ids):
return [[[id_word_dict.get(id, ' ') for id in ids]]]
# generate text
with open(input_file) as file:
output_f = open(output_file, 'w')
for line in file:
line = line.decode('utf-8').strip()
# generate
texts = {} # type: {text : probability}
texts[line] = 1
for _ in range(num_words):
texts_new = {}
for (text, prob) in texts.items():
if '<EOS>' in text: # stop prediction when <EOS> appear
texts_new[text] = prob
continue
# next word's probability distribution
predictions = inferer.infer(input=str2ids(text))
predictions[-1][word_id_dict['<UNK>']] = -1 # filter <UNK>
# find next beam_size words
for _ in range(beam_size):
cur_maxProb_index = np.argmax(
predictions[-1]) # next word's id
text_new = text + ' ' + id_word_dict[
cur_maxProb_index] # text append next word
texts_new[text_new] = texts[text] * predictions[-1][
cur_maxProb_index]
predictions[-1][cur_maxProb_index] = -1
texts.clear()
if len(texts_new) <= beam_size:
texts = texts_new
else: # cutting
texts = dict(
sorted(
texts_new.items(), key=lambda d: d[1], reverse=True)
[:beam_size])
# save results to output file
output_f.write(line.encode('utf-8') + '\n')
for (sentence, prob) in texts.items():
output_f.write('\t' + sentence.encode('utf-8', 'replace') + '\t'
+ str(prob) + '\n')
output_f.write('\n')
output_f.close()
print('already saved results to ' + output_file)
def generate_using_ngram(word_id_dict, num_words, beam_size):
"""
Demo: use N-Gram model to do prediction.
:param word_id_dict: vocab.
:type word_id_dict: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
:param num_words: the number of the words to generate.
:type num_words: int
:param beam_size: beam width.
:type beam_size: int
:return: save prediction results to output_file
"""
# prepare and cache model
config = Config_ngram()
_, output_layer = network_conf.ngram_lm(
vocab_size=len(word_id_dict),
emb_dim=config.emb_dim,
hidden_size=config.hidden_size,
num_layer=config.num_layer) # network config
model_file_name = config.model_file_name_prefix + str(config.num_passs -
1) + '.tar.gz'
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(model_file_name)) # load parameters
inferer = paddle.inference.Inference(
output_layer=output_layer, parameters=parameters)
# tools, different from generate_using_rnn's tools
id_word_dict = dict(
[(v, k) for k, v in word_id_dict.items()]) # {id : word}
def str2ids(str):
return [[
word_id_dict.get(w, word_id_dict['<UNK>']) for w in str.split()
]]
def ids2str(ids):
return [[id_word_dict.get(id, ' ') for id in ids]]
# generate text
with open(input_file) as file:
output_f = open(output_file, 'w')
for line in file:
line = line.decode('utf-8').strip()
words = line.split()
if len(words) < config.N:
output_f.write(line.encode('utf-8') + "\n\tnone\n")
continue
# generate
texts = {} # type: {text : probability}
texts[line] = 1
for _ in range(num_words):
texts_new = {}
for (text, prob) in texts.items():
if '<EOS>' in text: # stop prediction when <EOS> appear
texts_new[text] = prob
continue
# next word's probability distribution
predictions = inferer.infer(
input=str2ids(' '.join(text.split()[-config.N:])))
predictions[-1][word_id_dict['<UNK>']] = -1 # filter <UNK>
# find next beam_size words
for _ in range(beam_size):
cur_maxProb_index = np.argmax(
predictions[-1]) # next word's id
text_new = text + ' ' + id_word_dict[
cur_maxProb_index] # text append nextWord
texts_new[text_new] = texts[text] * predictions[-1][
cur_maxProb_index]
predictions[-1][cur_maxProb_index] = -1
texts.clear()
if len(texts_new) <= beam_size:
texts = texts_new
else: # cutting
texts = dict(
sorted(
texts_new.items(), key=lambda d: d[1], reverse=True)
[:beam_size])
# save results to output file
output_f.write(line.encode('utf-8') + '\n')
for (sentence, prob) in texts.items():
output_f.write('\t' + sentence.encode('utf-8', 'replace') + '\t'
+ str(prob) + '\n')
output_f.write('\n')
output_f.close()
print('already saved results to ' + output_file)
def main():
# init paddle
paddle.init(use_gpu=use_gpu, trainer_count=trainer_count)
# prepare and cache vocab
if os.path.isfile(vocab_file):
word_id_dict = load_vocab(vocab_file) # load word dictionary
else:
if build_vocab_method == 'fixed_size':
word_id_dict = build_vocab_with_fixed_size(
train_file, vocab_max_size) # build vocab
else:
word_id_dict = build_vocab_using_threshhold(
train_file, unk_threshold) # build vocab
save_vocab(word_id_dict, vocab_file) # save vocab
# generate
if use_which_model == 'rnn':
generate_using_rnn(
word_id_dict=word_id_dict, num_words=num_words, beam_size=beam_size)
elif use_which_model == 'ngram':
generate_using_ngram(
word_id_dict=word_id_dict, num_words=num_words, beam_size=beam_size)
else:
raise Exception('use_which_model must be rnn or ngram!')
if __name__ == "__main__":
main()
# coding=utf-8
import paddle.v2 as paddle
def rnn_lm(vocab_size, emb_dim, rnn_type, hidden_size, num_layer):
"""
RNN language model definition.
:param vocab_size: size of vocab.
:param emb_dim: embedding vector's dimension.
:param rnn_type: the type of RNN cell.
:param hidden_size: number of unit.
:param num_layer: layer number.
:return: cost and output layer of model.
"""
assert emb_dim > 0 and hidden_size > 0 and vocab_size > 0 and num_layer > 0
# input layers
input = paddle.layer.data(
name="input", type=paddle.data_type.integer_value_sequence(vocab_size))
target = paddle.layer.data(
name="target", type=paddle.data_type.integer_value_sequence(vocab_size))
# embedding layer
input_emb = paddle.layer.embedding(input=input, size=emb_dim)
# rnn layer
if rnn_type == 'lstm':
rnn_cell = paddle.networks.simple_lstm(
input=input_emb, size=hidden_size)
for _ in range(num_layer - 1):
rnn_cell = paddle.networks.simple_lstm(
input=rnn_cell, size=hidden_size)
elif rnn_type == 'gru':
rnn_cell = paddle.networks.simple_gru(input=input_emb, size=hidden_size)
for _ in range(num_layer - 1):
rnn_cell = paddle.networks.simple_gru(
input=rnn_cell, size=hidden_size)
else:
raise Exception('rnn_type error!')
# fc(full connected) and output layer
output = paddle.layer.fc(
input=[rnn_cell], size=vocab_size, act=paddle.activation.Softmax())
# loss
cost = paddle.layer.classification_cost(input=output, label=target)
return cost, output
def ngram_lm(vocab_size, emb_dim, hidden_size, num_layer):
"""
N-Gram language model definition.
:param vocab_size: size of vocab.
:param emb_dim: embedding vector's dimension.
:param hidden_size: size of unit.
:param num_layer: layer number.
:return: cost and output layer of model.
"""
assert emb_dim > 0 and hidden_size > 0 and vocab_size > 0 and num_layer > 0
def wordemb(inlayer):
wordemb = paddle.layer.table_projection(
input=inlayer,
size=emb_dim,
param_attr=paddle.attr.Param(
name="_proj", initial_std=0.001, learning_rate=1, l2_rate=0))
return wordemb
# input layers
first_word = paddle.layer.data(
name="first_word", type=paddle.data_type.integer_value(vocab_size))
second_word = paddle.layer.data(
name="second_word", type=paddle.data_type.integer_value(vocab_size))
third_word = paddle.layer.data(
name="third_word", type=paddle.data_type.integer_value(vocab_size))
fourth_word = paddle.layer.data(
name="fourth_word", type=paddle.data_type.integer_value(vocab_size))
next_word = paddle.layer.data(
name="next_word", type=paddle.data_type.integer_value(vocab_size))
# embedding layer
first_emb = wordemb(first_word)
second_emb = wordemb(second_word)
third_emb = wordemb(third_word)
fourth_emb = wordemb(fourth_word)
context_emb = paddle.layer.concat(
input=[first_emb, second_emb, third_emb, fourth_emb])
# hidden layer
hidden = paddle.layer.fc(
input=context_emb, size=hidden_size, act=paddle.activation.Relu())
for _ in range(num_layer - 1):
hidden = paddle.layer.fc(
input=hidden, size=hidden_size, act=paddle.activation.Relu())
# fc(full connected) and output layer
predict_word = paddle.layer.fc(
input=[hidden], size=vocab_size, act=paddle.activation.Softmax())
# loss
cost = paddle.layer.classification_cost(input=predict_word, label=next_word)
return cost, predict_word
# coding=utf-8
import collections
import os
def rnn_reader(file_name, min_sentence_length, max_sentence_length,
word_id_dict):
"""
create reader for RNN, each line is a sample.
:param file_name: file name.
:param min_sentence_length: sentence's min length.
:param max_sentence_length: sentence's max length.
:param word_id_dict: vocab with content of '{word, id}', 'word' is string type , 'id' is int type.
:return: data reader.
"""
def reader():
UNK = word_id_dict['<UNK>']
with open(file_name) as file:
for line in file:
words = line.decode('utf-8', 'ignore').strip().split()
if len(words) < min_sentence_length or len(
words) > max_sentence_length:
continue
ids = [word_id_dict.get(w, UNK) for w in words]
ids.append(word_id_dict['<EOS>'])
target = ids[1:]
target.append(word_id_dict['<EOS>'])
yield ids[:], target[:]
return reader
def ngram_reader(file_name, N, word_id_dict):
"""
create reader for N-Gram.
:param file_name: file name.
:param N: N-Gram's N.
:param word_id_dict: vocab with content of '{word, id}', 'word' is string type , 'id' is int type.
:return: data reader.
"""
assert N >= 2
def reader():
ids = []
UNK_ID = word_id_dict['<UNK>']
cache_size = 10000000
with open(file_name) as file:
for line in file:
words = line.decode('utf-8', 'ignore').strip().split()
ids += [word_id_dict.get(w, UNK_ID) for w in words]
ids_len = len(ids)
if ids_len > cache_size: # output
for i in range(ids_len - N - 1):
yield tuple(ids[i:i + N])
ids = []
ids_len = len(ids)
for i in range(ids_len - N - 1):
yield tuple(ids[i:i + N])
return reader
# coding=utf-8
import sys
import paddle.v2 as paddle
import reader
from utils import *
import network_conf
import gzip
from config import *
def train(model_cost, train_reader, test_reader, model_file_name_prefix,
num_passes):
"""
train model.
:param model_cost: cost layer of the model to train.
:param train_reader: train data reader.
:param test_reader: test data reader.
:param model_file_name_prefix: model's prefix name.
:param num_passes: epoch.
:return:
"""
# init paddle
paddle.init(use_gpu=use_gpu, trainer_count=trainer_count)
# create parameters
parameters = paddle.parameters.create(model_cost)
# create optimizer
adam_optimizer = paddle.optimizer.Adam(
learning_rate=1e-3,
regularization=paddle.optimizer.L2Regularization(rate=1e-3),
model_average=paddle.optimizer.ModelAverage(
average_window=0.5, max_average_window=10000))
# create trainer
trainer = paddle.trainer.SGD(
cost=model_cost, parameters=parameters, update_equation=adam_optimizer)
# define event_handler callback
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print("\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics))
else:
sys.stdout.write('.')
sys.stdout.flush()
# save model each pass
if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=test_reader)
print("\nTest with Pass %d, %s" % (event.pass_id, result.metrics))
with gzip.open(
model_file_name_prefix + str(event.pass_id) + '.tar.gz',
'w') as f:
parameters.to_tar(f)
# start to train
print('start training...')
trainer.train(
reader=train_reader, event_handler=event_handler, num_passes=num_passes)
print("Training finished.")
def main():
# prepare vocab
print('prepare vocab...')
if build_vocab_method == 'fixed_size':
word_id_dict = build_vocab_with_fixed_size(
train_file, vocab_max_size) # build vocab
else:
word_id_dict = build_vocab_using_threshhold(
train_file, unk_threshold) # build vocab
save_vocab(word_id_dict, vocab_file) # save vocab
# init model and data reader
if use_which_model == 'rnn':
# init RNN model
print('prepare rnn model...')
config = Config_rnn()
cost, _ = network_conf.rnn_lm(
len(word_id_dict), config.emb_dim, config.rnn_type,
config.hidden_size, config.num_layer)
# init RNN data reader
train_reader = paddle.batch(
paddle.reader.shuffle(
reader.rnn_reader(train_file, min_sentence_length,
max_sentence_length, word_id_dict),
buf_size=65536),
batch_size=config.batch_size)
test_reader = paddle.batch(
paddle.reader.shuffle(
reader.rnn_reader(test_file, min_sentence_length,
max_sentence_length, word_id_dict),
buf_size=65536),
batch_size=config.batch_size)
elif use_which_model == 'ngram':
# init N-Gram model
print('prepare ngram model...')
config = Config_ngram()
assert config.N == 5
cost, _ = network_conf.ngram_lm(
vocab_size=len(word_id_dict),
emb_dim=config.emb_dim,
hidden_size=config.hidden_size,
num_layer=config.num_layer)
# init N-Gram data reader
train_reader = paddle.batch(
paddle.reader.shuffle(
reader.ngram_reader(train_file, config.N, word_id_dict),
buf_size=65536),
batch_size=config.batch_size)
test_reader = paddle.batch(
paddle.reader.shuffle(
reader.ngram_reader(test_file, config.N, word_id_dict),
buf_size=65536),
batch_size=config.batch_size)
else:
raise Exception('use_which_model must be rnn or ngram!')
# train model
train(
model_cost=cost,
train_reader=train_reader,
test_reader=test_reader,
model_file_name_prefix=config.model_file_name_prefix,
num_passes=config.num_passs)
if __name__ == "__main__":
main()
# coding=utf-8
import os
import collections
def save_vocab(word_id_dict, vocab_file_name):
"""
save vocab.
:param word_id_dict: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
:param vocab_file_name: vocab file name.
"""
f = open(vocab_file_name, 'w')
for (k, v) in word_id_dict.items():
f.write(k.encode('utf-8') + '\t' + str(v) + '\n')
print('save vocab to ' + vocab_file_name)
f.close()
def load_vocab(vocab_file_name):
"""
load vocab from file.
:param vocab_file_name: vocab file name.
:return: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
"""
assert os.path.isfile(vocab_file_name)
dict = {}
with open(vocab_file_name) as file:
for line in file:
if len(line) < 2:
continue
kv = line.decode('utf-8').strip().split('\t')
dict[kv[0]] = int(kv[1])
return dict
def build_vocab_using_threshhold(file_name, unk_threshold):
"""
build vacab using_<UNK> threshhold.
:param file_name:
:param unk_threshold: <UNK> threshhold.
:type unk_threshold: int.
:return: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
"""
counter = {}
with open(file_name) as file:
for line in file:
words = line.decode('utf-8', 'ignore').strip().split()
for word in words:
if word in counter:
counter[word] += 1
else:
counter[word] = 1
counter_new = {}
for (word, frequency) in counter.items():
if frequency >= unk_threshold:
counter_new[word] = frequency
counter.clear()
counter_new = sorted(counter_new.items(), key=lambda d: -d[1])
words = [word_frequency[0] for word_frequency in counter_new]
word_id_dict = dict(zip(words, range(2, len(words) + 2)))
word_id_dict['<UNK>'] = 0
word_id_dict['<EOS>'] = 1
return word_id_dict
def build_vocab_with_fixed_size(file_name, vocab_max_size):
"""
build vacab with assigned max size.
:param vocab_max_size: vocab's max size.
:return: dictionary with content of '{word, id}', 'word' is string type , 'id' is int type.
"""
words = []
for line in open(file_name):
words += line.decode('utf-8', 'ignore').strip().split()
counter = collections.Counter(words)
counter = sorted(counter.items(), key=lambda x: -x[1])
if len(counter) > vocab_max_size:
counter = counter[:vocab_max_size]
words, counts = zip(*counter)
word_id_dict = dict(zip(words, range(2, len(words) + 2)))
word_id_dict['<UNK>'] = 0
word_id_dict['<EOS>'] = 1
return word_id_dict
TBD # Scheduled Sampling
## 概述
序列生成任务的生成目标是在给定源输入的条件下,最大化目标序列的概率。训练时该模型将目标序列中的真实元素作为解码器每一步的输入,然后最大化下一个元素的概率。生成时上一步解码得到的元素被用作当前的输入,然后生成下一个元素。可见这种情况下训练阶段和生成阶段的解码器输入数据的概率分布并不一致。
Scheduled Sampling\[[1](#参考文献)\]是一种解决训练和生成时输入数据分布不一致的方法。在训练早期该方法主要使用目标序列中的真实元素作为解码器输入,可以将模型从随机初始化的状态快速引导至一个合理的状态。随着训练的进行,该方法会逐渐更多地使用生成的元素作为解码器输入,以解决数据分布不一致的问题。
标准的序列到序列模型中,如果序列前面生成了错误的元素,后面的输入状态将会收到影响,而该误差会随着生成过程不断向后累积。Scheduled Sampling以一定概率将生成的元素作为解码器输入,这样即使前面生成错误,其训练目标仍然是最大化真实目标序列的概率,模型会朝着正确的方向进行训练。因此这种方式增加了模型的容错能力。
## 算法简介
Scheduled Sampling主要应用在序列到序列模型的训练阶段,而生成阶段则不需要使用。
训练阶段解码器在最大化第$t$个元素概率时,标准序列到序列模型使用上一时刻的真实元素$y_{t-1}$作为输入。设上一时刻生成的元素为$g_{t-1}$,Scheduled Sampling算法会以一定概率使用$g_{t-1}$作为解码器输入。
设当前已经训练到了第$i$个mini-batch,Scheduled Sampling定义了一个概率$\epsilon_i$控制解码器的输入。$\epsilon_i$是一个随着$i$增大而衰减的变量,常见的定义方式有:
- 线性衰减:$\epsilon_i=max(\epsilon,k-c*i)$,其中$\epsilon$限制$\epsilon_i$的最小值,$k$和$c$控制线性衰减的幅度。
- 指数衰减:$\epsilon_i=k^i$,其中$0<k<1$,$k$控制着指数衰减的幅度。
- 反向Sigmoid衰减:$\epsilon_i=k/(k+exp(i/k))$,其中$k>1$,$k$同样控制衰减的幅度。
图1给出了这三种方式的衰减曲线,
<p align="center">
<img src="img/decay.jpg" width="50%" align="center"><br>
图1. 线性衰减、指数衰减和反向Sigmoid衰减的衰减曲线
</p>
如图2所示,在解码器的$t$时刻Scheduled Sampling以概率$\epsilon_i$使用上一时刻的真实元素$y_{t-1}$作为解码器输入,以概率$1-\epsilon_i$使用上一时刻生成的元素$g_{t-1}$作为解码器输入。从图1可知随着$i$的增大$\epsilon_i$会不断减小,解码器将不断倾向于使用生成的元素作为输入,训练阶段和生成阶段的数据分布将变得越来越一致。
<p align="center">
<img src="img/Scheduled_Sampling.jpg" width="50%" align="center"><br>
图2. Scheduled Sampling选择不同元素作为解码器输入示意图
</p>
## 模型实现
由于Scheduled Sampling是对序列到序列模型的改进,其整体实现框架与序列到序列模型较为相似。为突出本文重点,这里仅介绍与Scheduled Sampling相关的部分,完整的代码见`scheduled_sampling.py`
首先导入需要的包,并定义控制衰减概率的类`RandomScheduleGenerator`,如下:
```python
import numpy as np
import math
class RandomScheduleGenerator:
"""
The random sampling rate for scheduled sampling algoithm, which uses devcayed
sampling rate.
"""
...
```
下面将分别定义类`RandomScheduleGenerator``__init__``getScheduleRate``processBatch`三个方法。
`__init__`方法对类进行初始化,其`schedule_type`参数指定了使用哪种衰减方式,可选的方式有`constant``linear``exponential``inverse_sigmoid``constant`指对所有的mini-batch使用固定的$\epsilon_i$,`linear`指线性衰减方式,`exponential`表示指数衰减方式,`inverse_sigmoid`表示反向Sigmoid衰减。`__init__`方法的参数`a``b`表示衰减方法的参数,需要在验证集上调优。`self.schedule_computers`将衰减方式映射为计算$\epsilon_i$的函数。最后一行根据`schedule_type`将选择的衰减函数赋给`self.schedule_computer`变量。
```python
def __init__(self, schedule_type, a, b):
"""
schduled_type: is the type of the decay. It supports constant, linear,
exponential, and inverse_sigmoid right now.
a: parameter of the decay (MUST BE DOUBLE)
b: parameter of the decay (MUST BE DOUBLE)
"""
self.schedule_type = schedule_type
self.a = a
self.b = b
self.data_processed_ = 0
self.schedule_computers = {
"constant": lambda a, b, d: a,
"linear": lambda a, b, d: max(a, 1 - d / b),
"exponential": lambda a, b, d: pow(a, d / b),
"inverse_sigmoid": lambda a, b, d: b / (b + math.exp(d * a / b)),
}
assert (self.schedule_type in self.schedule_computers)
self.schedule_computer = self.schedule_computers[self.schedule_type]
```
`getScheduleRate`根据衰减函数和已经处理的数据量计算$\epsilon_i$。
```python
def getScheduleRate(self):
"""
Get the schedule sampling rate. Usually not needed to be called by the users
"""
return self.schedule_computer(self.a, self.b, self.data_processed_)
```
`processBatch`方法根据概率值$\epsilon_i$进行采样,得到`indexes``indexes`中每个元素取值为`0`的概率为$\epsilon_i$,取值为`1`的概率为$1-\epsilon_i$。`indexes`决定了解码器的输入是真实元素还是生成的元素,取值为`0`表示使用真实元素,取值为`1`表示使用生成的元素。
```python
def processBatch(self, batch_size):
"""
Get a batch_size of sampled indexes. These indexes can be passed to a
MultiplexLayer to select from the grouth truth and generated samples
from the last time step.
"""
rate = self.getScheduleRate()
numbers = np.random.rand(batch_size)
indexes = (numbers >= rate).astype('int32').tolist()
self.data_processed_ += batch_size
return indexes
```
Scheduled Sampling需要在序列到序列模型的基础上增加一个输入`true_token_flag`,以控制解码器输入。
```python
true_token_flags = paddle.layer.data(
name='true_token_flag',
type=paddle.data_type.integer_value_sequence(2))
```
这里还需要对原始reader进行封装,增加`true_token_flag`的数据生成器。下面以线性衰减为例说明如何调用上面定义的`RandomScheduleGenerator`产生`true_token_flag`的输入数据。
```python
schedule_generator = RandomScheduleGenerator("linear", 0.75, 1000000)
def gen_schedule_data(reader):
"""
Creates a data reader for scheduled sampling.
Output from the iterator that created by original reader will be
appended with "true_token_flag" to indicate whether to use true token.
:param reader: the original reader.
:type reader: callable
:return: the new reader with the field "true_token_flag".
:rtype: callable
"""
def data_reader():
for src_ids, trg_ids, trg_ids_next in reader():
yield src_ids, trg_ids, trg_ids_next, \
[0] + schedule_generator.processBatch(len(trg_ids) - 1)
return data_reader
```
这段代码在原始输入数据(即源序列元素`src_ids`、目标序列元素`trg_ids`和目标序列下一个元素`trg_ids_next`)后追加了控制解码器输入的数据。由于解码器第一个元素是序列开始符,因此将追加的数据第一个元素设置为`0`,表示解码器第一步始终使用真实目标序列的第一个元素(即序列开始符)。
训练时`recurrent_group`每一步调用的解码器函数如下:
```python
def gru_decoder_with_attention_train(enc_vec, enc_proj, true_word,
true_token_flag):
"""
The decoder step for training.
:param enc_vec: the encoder vector for attention
:type enc_vec: LayerOutput
:param enc_proj: the encoder projection for attention
:type enc_proj: LayerOutput
:param true_word: the ground-truth target word
:type true_word: LayerOutput
:param true_token_flag: the flag of using the ground-truth target word
:type true_token_flag: LayerOutput
:return: the softmax output layer
:rtype: LayerOutput
"""
decoder_mem = paddle.layer.memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
context = paddle.networks.simple_attention(
encoded_sequence=enc_vec,
encoded_proj=enc_proj,
decoder_state=decoder_mem)
gru_out_memory = paddle.layer.memory(
name='gru_out', size=target_dict_dim)
generated_word = paddle.layer.max_id(input=gru_out_memory)
generated_word_emb = paddle.layer.embedding(
input=generated_word,
size=word_vector_dim,
param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
current_word = paddle.layer.multiplex(
input=[true_token_flag, true_word, generated_word_emb])
with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += paddle.layer.full_matrix_projection(input=context)
decoder_inputs += paddle.layer.full_matrix_projection(
input=current_word)
gru_step = paddle.layer.gru_step(
name='gru_decoder',
input=decoder_inputs,
output_mem=decoder_mem,
size=decoder_size)
with paddle.layer.mixed(
name='gru_out',
size=target_dict_dim,
bias_attr=True,
act=paddle.activation.Softmax()) as out:
out += paddle.layer.full_matrix_projection(input=gru_step)
return out
```
该函数使用`memory``gru_out_memory`记忆上一时刻生成的元素,根据`gru_out_memory`选择概率最大的词语`generated_word`作为生成的词语。`multiplex`层会在真实元素`true_word`和生成的元素`generated_word`之间做出选择,并将选择的结果作为解码器输入。`multiplex`层使用了三个输入,分别为`true_token_flag``true_word``generated_word_emb`。对于这三个输入中每个元素,若`true_token_flag`中的值为`0`,则`multiplex`层输出`true_word`中的相应元素;若`true_token_flag`中的值为`1`,则`multiplex`层输出`generated_word_emb`中的相应元素。
## 参考文献
[1] Bengio S, Vinyals O, Jaitly N, et al. [Scheduled sampling for sequence prediction with recurrent neural networks](http://papers.nips.cc/paper/5956-scheduled-sampling-for-sequence-prediction-with-recurrent-neural-networks)//Advances in Neural Information Processing Systems. 2015: 1171-1179.
import numpy as np
import math
class RandomScheduleGenerator:
"""
The random sampling rate for scheduled sampling algoithm, which uses devcayed
sampling rate.
"""
def __init__(self, schedule_type, a, b):
"""
schduled_type: is the type of the decay. It supports constant, linear,
exponential, and inverse_sigmoid right now.
a: parameter of the decay (MUST BE DOUBLE)
b: parameter of the decay (MUST BE DOUBLE)
"""
self.schedule_type = schedule_type
self.a = a
self.b = b
self.data_processed_ = 0
self.schedule_computers = {
"constant": lambda a, b, d: a,
"linear": lambda a, b, d: max(a, 1 - d / b),
"exponential": lambda a, b, d: pow(a, d / b),
"inverse_sigmoid": lambda a, b, d: b / (b + math.exp(d * a / b)),
}
assert (self.schedule_type in self.schedule_computers)
self.schedule_computer = self.schedule_computers[self.schedule_type]
def getScheduleRate(self):
"""
Get the schedule sampling rate. Usually not needed to be called by the users
"""
return self.schedule_computer(self.a, self.b, self.data_processed_)
def processBatch(self, batch_size):
"""
Get a batch_size of sampled indexes. These indexes can be passed to a
MultiplexLayer to select from the grouth truth and generated samples
from the last time step.
"""
rate = self.getScheduleRate()
numbers = np.random.rand(batch_size)
indexes = (numbers >= rate).astype('int32').tolist()
self.data_processed_ += batch_size
return indexes
import sys
import paddle.v2 as paddle
from random_schedule_generator import RandomScheduleGenerator
schedule_generator = RandomScheduleGenerator("linear", 0.75, 1000000)
def gen_schedule_data(reader):
"""
Creates a data reader for scheduled sampling.
Output from the iterator that created by original reader will be
appended with "true_token_flag" to indicate whether to use true token.
:param reader: the original reader.
:type reader: callable
:return: the new reader with the field "true_token_flag".
:rtype: callable
"""
def data_reader():
for src_ids, trg_ids, trg_ids_next in reader():
yield src_ids, trg_ids, trg_ids_next, \
[0] + schedule_generator.processBatch(len(trg_ids) - 1)
return data_reader
def seqToseq_net(source_dict_dim, target_dict_dim, is_generating=False):
"""
The definition of the sequence to sequence model
:param source_dict_dim: the dictionary size of the source language
:type source_dict_dim: int
:param target_dict_dim: the dictionary size of the target language
:type target_dict_dim: int
:param is_generating: whether in generating mode
:type is_generating: Bool
:return: the last layer of the network
:rtype: LayerOutput
"""
### Network Architecture
word_vector_dim = 512 # dimension of word vector
decoder_size = 512 # dimension of hidden unit in GRU Decoder network
encoder_size = 512 # dimension of hidden unit in GRU Encoder network
beam_size = 3
max_length = 250
#### Encoder
src_word_id = paddle.layer.data(
name='source_language_word',
type=paddle.data_type.integer_value_sequence(source_dict_dim))
src_embedding = paddle.layer.embedding(
input=src_word_id, size=word_vector_dim)
src_forward = paddle.networks.simple_gru(
input=src_embedding, size=encoder_size)
src_backward = paddle.networks.simple_gru(
input=src_embedding, size=encoder_size, reverse=True)
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
#### Decoder
with paddle.layer.mixed(size=decoder_size) as encoded_proj:
encoded_proj += paddle.layer.full_matrix_projection(
input=encoded_vector)
backward_first = paddle.layer.first_seq(input=src_backward)
with paddle.layer.mixed(
size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
decoder_boot += paddle.layer.full_matrix_projection(
input=backward_first)
def gru_decoder_with_attention_train(enc_vec, enc_proj, true_word,
true_token_flag):
"""
The decoder step for training.
:param enc_vec: the encoder vector for attention
:type enc_vec: LayerOutput
:param enc_proj: the encoder projection for attention
:type enc_proj: LayerOutput
:param true_word: the ground-truth target word
:type true_word: LayerOutput
:param true_token_flag: the flag of using the ground-truth target word
:type true_token_flag: LayerOutput
:return: the softmax output layer
:rtype: LayerOutput
"""
decoder_mem = paddle.layer.memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
context = paddle.networks.simple_attention(
encoded_sequence=enc_vec,
encoded_proj=enc_proj,
decoder_state=decoder_mem)
gru_out_memory = paddle.layer.memory(
name='gru_out', size=target_dict_dim)
generated_word = paddle.layer.max_id(input=gru_out_memory)
generated_word_emb = paddle.layer.embedding(
input=generated_word,
size=word_vector_dim,
param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
current_word = paddle.layer.multiplex(
input=[true_token_flag, true_word, generated_word_emb])
with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += paddle.layer.full_matrix_projection(input=context)
decoder_inputs += paddle.layer.full_matrix_projection(
input=current_word)
gru_step = paddle.layer.gru_step(
name='gru_decoder',
input=decoder_inputs,
output_mem=decoder_mem,
size=decoder_size)
with paddle.layer.mixed(
name='gru_out',
size=target_dict_dim,
bias_attr=True,
act=paddle.activation.Softmax()) as out:
out += paddle.layer.full_matrix_projection(input=gru_step)
return out
def gru_decoder_with_attention_test(enc_vec, enc_proj, current_word):
"""
The decoder step for generating.
:param enc_vec: the encoder vector for attention
:type enc_vec: LayerOutput
:param enc_proj: the encoder projection for attention
:type enc_proj: LayerOutput
:param current_word: the previously generated word
:type current_word: LayerOutput
:return: the softmax output layer
:rtype: LayerOutput
"""
decoder_mem = paddle.layer.memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
context = paddle.networks.simple_attention(
encoded_sequence=enc_vec,
encoded_proj=enc_proj,
decoder_state=decoder_mem)
with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += paddle.layer.full_matrix_projection(input=context)
decoder_inputs += paddle.layer.full_matrix_projection(
input=current_word)
gru_step = paddle.layer.gru_step(
name='gru_decoder',
input=decoder_inputs,
output_mem=decoder_mem,
size=decoder_size)
with paddle.layer.mixed(
size=target_dict_dim,
bias_attr=True,
act=paddle.activation.Softmax()) as out:
out += paddle.layer.full_matrix_projection(input=gru_step)
return out
decoder_group_name = "decoder_group"
group_input1 = paddle.layer.StaticInput(input=encoded_vector, is_seq=True)
group_input2 = paddle.layer.StaticInput(input=encoded_proj, is_seq=True)
group_inputs = [group_input1, group_input2]
if not is_generating:
trg_embedding = paddle.layer.embedding(
input=paddle.layer.data(
name='target_language_word',
type=paddle.data_type.integer_value_sequence(target_dict_dim)),
size=word_vector_dim,
param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
group_inputs.append(trg_embedding)
true_token_flags = paddle.layer.data(
name='true_token_flag',
type=paddle.data_type.integer_value_sequence(2))
group_inputs.append(true_token_flags)
decoder = paddle.layer.recurrent_group(
name=decoder_group_name,
step=gru_decoder_with_attention_train,
input=group_inputs)
lbl = paddle.layer.data(
name='target_language_next_word',
type=paddle.data_type.integer_value_sequence(target_dict_dim))
cost = paddle.layer.classification_cost(input=decoder, label=lbl)
return cost
else:
trg_embedding = paddle.layer.GeneratedInput(
size=target_dict_dim,
embedding_name='_target_language_embedding',
embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)
beam_gen = paddle.layer.beam_search(
name=decoder_group_name,
step=gru_decoder_with_attention_test,
input=group_inputs,
bos_id=0,
eos_id=1,
beam_size=beam_size,
max_length=max_length)
return beam_gen
def main():
paddle.init(use_gpu=False, trainer_count=1)
is_generating = False
model_path_for_generating = 'params_pass_1.tar.gz'
# source and target dict dim.
dict_size = 30000
source_dict_dim = target_dict_dim = dict_size
# train the network
if not is_generating:
cost = seqToseq_net(source_dict_dim, target_dict_dim)
parameters = paddle.parameters.create(cost)
# define optimize method and trainer
optimizer = paddle.optimizer.Adam(
learning_rate=5e-5,
regularization=paddle.optimizer.L2Regularization(rate=8e-4))
trainer = paddle.trainer.SGD(
cost=cost, parameters=parameters, update_equation=optimizer)
# define data reader
wmt14_reader = paddle.batch(
gen_schedule_data(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size), buf_size=8192)),
batch_size=5)
feeding = {
'source_language_word': 0,
'target_language_word': 1,
'target_language_next_word': 2,
'true_token_flag': 3
}
# define event_handler callback
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 10 == 0:
print "\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost,
event.metrics)
else:
sys.stdout.write('.')
sys.stdout.flush()
if isinstance(event, paddle.event.EndPass):
# save parameters
with gzip.open('params_pass_%d.tar.gz' % event.pass_id,
'w') as f:
parameters.to_tar(f)
# start to train
trainer.train(
reader=wmt14_reader,
event_handler=event_handler,
feeding=feeding,
num_passes=2)
# generate a english sequence to french
else:
# use the first 3 samples for generation
gen_creator = paddle.dataset.wmt14.gen(dict_size)
gen_data = []
gen_num = 3
for item in gen_creator():
gen_data.append((item[0], ))
if len(gen_data) == gen_num:
break
beam_gen = seqToseq_net(source_dict_dim, target_dict_dim, is_generating)
# get the trained model
with gzip.open(model_path_for_generating, 'r') as f:
parameters = Parameters.from_tar(f)
# prob is the prediction probabilities, and id is the prediction word.
beam_result = paddle.infer(
output_layer=beam_gen,
parameters=parameters,
input=gen_data,
field=['prob', 'id'])
# get the dictionary
src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)
# the delimited element of generated sequences is -1,
# the first element of each generated sequence is the sequence length
seq_list = []
seq = []
for w in beam_result[1]:
if w != -1:
seq.append(w)
else:
seq_list.append(' '.join([trg_dict.get(w) for w in seq[1:]]))
seq = []
prob = beam_result[0]
beam_size = 3
for i in xrange(gen_num):
print "\n*******************************************************\n"
print "src:", ' '.join(
[src_dict.get(w) for w in gen_data[i][0]]), "\n"
for j in xrange(beam_size):
print "prob = %f:" % (prob[i][j]), seq_list[i * beam_size + j]
if __name__ == '__main__':
main()
data
*.tar.gz
*.log
*.pyc
# 文本分类 # 文本分类
文本分类是机器学习中的一项常见任务,主要目的是根据一条文本的内容,判断该文本所属的类别。在本例子中,我们利用有标注的语料库训练二分类DNN和CNN模型,完成对输入文本的分类任务。
DNN与CNN模型之间最大的区别在于: 以下是本例目录包含的文件以及对应说明:
```text
.
├── images # 文档中的图片
│   ├── cnn_net.png
│   └── dnn_net.png
├── index.html # 文档
├── infer.py # 预测脚本
├── network_conf.py # 本例中涉及的各种网络结构均定义在此文件中,若进一步修改模型结构,请查看此文件
├── reader.py # 读取数据接口,若使用自定义格式的数据,请查看此文件
├── README.md # 文档
├── run.sh # 训练任务运行脚本,直接运行此脚本,将以默认参数开始训练任务
├── train.py # 训练脚本
└── utils.py # 定义通用的函数,例如:打印日志、解析命令行参数、构建字典、加载字典等
```
- DNN不属于序列模型,大多使用基本的全连接结构,只能接受固定维度的特征向量作为输入。 ## 简介
文本分类任务根据给定一条文本的内容,判断该文本所属的类别,是自然语言处理领域的一项重要的基础任务。[PaddleBook](https://github.com/PaddlePaddle/book) 中的[情感分类](https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/README.cn.md)一课,正是一个典型的文本分类任务,任务流程如下:
- CNN属于序列模型,能够提取一个局部区域之内的特征,能够处理变长的序列输入。 1. 收集电影评论网站的用户评论数据。
2. 清洗,标记。
3. 模型设计。
4. 模型学习效果评估。
举例来说,情感分类是一项常见的文本分类任务,在情感分类中,我们希望训练一个模型来判断句子中表现出的情感是正向还是负向。例如,"The apple is not bad",其中的"not bad"是决定这个句子情感的关键 训练好的分类器能够**自动判断**新出现的用户评论的情感是正面还是负面,在舆情监控、营销策划、产品品牌价值评估等任务中,能够起到重要作用。以上过程也是我们去完成一个新的文本分类任务需要遵循的常规流程。可以看到,深度学习方法的巨大优势体现在:**免除复杂的特征的设计,只需要对原始文本进行基础的清理、标注即可**
- 对于DNN模型来说,只能知道句子中有一个"not"和一个"bad",但两者之间的顺序关系在输入时已经丢失,网络不再有机会学习序列之间的顺序信息。 [PaddleBook](https://github.com/PaddlePaddle/book) 中的[情感分类](https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/README.cn.md)介绍了一个较为复杂的栈式双向 LSTM 模型,循环神经网络在一些需要理解语言语义的复杂任务中有着明显的优势,但计算量大,通常对调参技巧也有着更高的要求。在对计算时间有一定限制的任务中,也会考虑其它模型。除了计算时间的考量,更重要的一点:**模型选择往往是机器学习任务成功的基础**。机器学习任务的目标始终是提高泛化能力,也就是对未知的新的样本预测的能力:
- CNN模型接受文本序列作为输入,保留了"not bad"之间的顺序信息。因此,在大多数文本分类任务上,CNN模型的表现要好于DNN。 1. 简单模型拟合能力不足,无法精确拟合训练样本,更加无法期待模型能够准确地预测没有出现在训练样本集中的未知样本,这就是**欠拟合**问题。
2. 然而,过于复杂的模型轻松“记忆”了训练样本集中的每一个样本,但对于没有出现在训练样本集中的未知样本却毫无识别能力,这就是**过拟合**问题。
## 实验数据 "No Free Lunch (NFL)" 是机器学习任务基本原则之一:没有任何一种模型是天生优于其他模型的。模型的设计和选择建立在了解不同模型特性的基础之上,但同时也是一个多次实验评估的过程。在本例中,我们继续向大家介绍几种最常用的文本分类模型,它们的能力和复杂程度不同,帮助大家对比学习这些模型学习效果之间的差异,针对不同的场景选择使用。
本例子的实验在[IMDB数据集](http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz)上进行。IMDB数据集包含了来自IMDB(互联网电影数据库)网站的5万条电影影评,并被标注为正面/负面两种评价。数据集被划分为train和test两部分,各2.5万条数据,正负样本的比例基本为1:1。样本直接以英文原文的形式表示。
## DNN模型 ## 模型详解
**DNN的模型结构入下图所示:** `network_conf.py` 中包括以下模型:
<p align="center"> 1. `fc_net`: DNN 模型,是一个非序列模型。使用基本的全连接结构。
<img src="images/dnn_net.png" width = "90%" align="center"/><br/> 2. `convolution_net`:浅层 CNN 模型,是一个基础的序列模型,能够处理变长的序列输入,提取一个局部区域之内的特征。
图1. DNN文本分类模型
</p>
**可以看到,模型主要分为如下几个部分:** 我们以情感分类任务为例,简单说明序列模型和非序列模型之间的差异。情感分类是一项常见的文本分类任务,模型自动判断文本中表现出的情感是正向还是负向。以句子 "The apple is not bad" 为例,"not bad" 是决定这个句子情感的关键:
- **词向量层**:IMDB的样本由原始的英文单词组成,为了更好地表示不同词之间语义上的关系,首先将英文单词转化为固定维度的向量。训练完成后,词与词语义上的相似程度可以用它们的词向量之间的距离来表示,语义上越相似,距离越近。关于词向量的更多信息请参考PaddleBook中的[词向量](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec)一节。 - 对于 DNN 模型来说,只知道句子中有一个 "not" 和一个 "bad",两者之间的顺序关系在输入网络时已丢失,网络不再有机会学习序列之间的顺序信息。
- CNN 模型接受文本序列作为输入,保留了 "not bad" 之间的顺序信息。
- **最大池化层**:最大池化在时间序列上进行,池化过程消除了不同语料样本在单词数量多少上的差异,并提炼出词向量中每一下标位置上的最大值。经过池化后,词向量层输出的向量序列被转化为一条固定维度的向量。例如,假设最大池化前向量的序列为`[[2,3,5],[7,3,6],[1,4,0]]`,则最大池化的结果为:`[7,4,6]` 两者各自的一些特点简单总结如下:
- **全连接隐层**:经过最大池化后的向量被送入两个连续的隐层,隐层之间为全连接结构。
1. DNN 的计算量可以远低于 CNN / RNN 模型,在对响应时间有要求的任务中具有优势。
2. DNN 刻画的往往是频繁词特征,潜在会受到分词错误的影响,但对一些依赖关键词特征也能做的不错的任务:如 Spam 短信检测,依然是一个有效的模型。
3. 在大多数需要一定语义理解(例如,借助上下文消除语义中的歧义)的文本分类任务上,以 CNN / RNN 为代表的序列模型的效果往往好于 DNN 模型。
- **输出层**:输出层的神经元数量和样本的类别数一致,例如在二分类问题中,输出层会有2个神经元。通过Softmax激活函数,输出结果是一个归一化的概率分布,和为1,因此第$i$个神经元的输出就可以认为是样本属于第$i$类的预测概率。 ### 1. DNN 模型
**通过PaddlePaddle实现该DNN结构的代码如下:** **DNN 模型结构入下图所示:**
```python
import paddle.v2 as paddle
def fc_net(dict_dim, class_dim=2, emb_dim=28):
"""
dnn network definition
:param dict_dim: size of word dictionary
:type input_dim: int
:params class_dim: number of instance class
:type class_dim: int
:params emb_dim: embedding vector dimension
:type emb_dim: int
"""
# input layers
data = paddle.layer.data("word",
paddle.data_type.integer_value_sequence(dict_dim))
lbl = paddle.layer.data("label", paddle.data_type.integer_value(class_dim))
# embedding layer
emb = paddle.layer.embedding(input=data, size=emb_dim)
# max pooling
seq_pool = paddle.layer.pooling(
input=emb, pooling_type=paddle.pooling.Max())
# two hidden layers
hd_layer_size = [28, 8]
hd_layer_init_std = [1.0 / math.sqrt(s) for s in hd_layer_size]
hd1 = paddle.layer.fc(
input=seq_pool,
size=hd_layer_size[0],
act=paddle.activation.Tanh(),
param_attr=paddle.attr.Param(initial_std=hd_layer_init_std[0]))
hd2 = paddle.layer.fc(
input=hd1,
size=hd_layer_size[1],
act=paddle.activation.Tanh(),
param_attr=paddle.attr.Param(initial_std=hd_layer_init_std[1]))
# output layer
output = paddle.layer.fc(
input=hd2,
size=class_dim,
act=paddle.activation.Softmax(),
param_attr=paddle.attr.Param(initial_std=1.0 / math.sqrt(class_dim)))
cost = paddle.layer.classification_cost(input=output, label=lbl)
return cost, output, lbl
```
该DNN模型默认对输入的语料进行二分类(`class_dim=2`),embedding的词向量维度默认为28(`emd_dim=28`),两个隐层均使用Tanh激活函数(`act=paddle.activation.Tanh()`)。
需要注意的是,该模型的输入数据为整数序列,而不是原始的英文单词序列。事实上,为了处理方便我们一般会事先将单词根据词频顺序进行id化,即将单词用整数替代, 也就是单词在字典中的序号。这一步一般在DNN模型之外完成。
## CNN模型
**CNN的模型结构如下图所示:**
<p align="center"> <p align="center">
<img src="images/cnn_net.png" width = "90%" align="center"/><br/> <img src="images/dnn_net.png" width = "90%" align="center"/><br/>
2. CNN文本分类模型 1. 本例中的 DNN 文本分类模型
</p> </p>
**可以看到,模型主要分为如下几个部分:** 在 PaddlePaddle 实现该 DNN 结构的代码见 `network_conf.py` 中的 `fc_net` 函数,模型主要分为如下几个部分:
- **词向量层**与DNN中词向量层的作用一样,将英文单词转化为固定维度的向量,利用向量之间的距离来表示词之间的语义相关程度。如图2中所示,将得到的词向量定义为行向量,再将语料中所有的单词产生的行向量拼接在一起组成矩阵。假设词向量维度为5,语料“The cat sat on the read mat”包含7个单词,那么得到的矩阵维度为7*5。关于词向量的更多信息请参考PaddleBook中的[词向量](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec)一节。 - **词向量层**为了更好地表示不同词之间语义上的关系,首先将词语转化为固定维度的向量。训练完成后,词与词语义上的相似程度可以用它们的词向量之间的距离来表示,语义上越相似,距离越近。关于词向量的更多信息请参考PaddleBook中的[词向量](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec)一节。
- **卷积层**: 文本分类中的卷积在时间序列上进行,即卷积核的宽度和词向量层产出的矩阵一致,卷积沿着矩阵的高度方向进行。卷积后得到的结果被称为“特征图”(feature map)。假设卷积核的高度为$h$,矩阵的高度为$N$,卷积的步长为1,则得到的特征图为一个高度为$N+1-h$的向量。可以同时使用多个不同高度的卷积核,得到多个特征图。 - **最大池化层**:最大池化在时间序列上进行,池化过程消除了不同语料样本在单词数量多少上的差异,并提炼出词向量中每一下标位置上的最大值。经过池化后,词向量层输出的向量序列被转化为一条固定维度的向量。例如,假设最大池化前向量的序列为`[[2,3,5],[7,3,6],[1,4,0]]`,则最大池化的结果为:`[7,4,6]`
- **最大池化层**: 对卷积得到的各个特征图分别进行最大池化操作。由于特征图本身已经是向量,因此这里的最大池化实际上就是简单地选出各个向量中的最大元素。各个最大元素又被拼接在一起,组成新的向量,显然,该向量的维度等于特征图的数量,也就是卷积核的数量。举例来说,假设我们使用了四个不同的卷积核,卷积产生的特征图分别为:`[2,3,5]``[8,2,1]``[5,7,7,6]``[4,5,1,8]`,由于卷积核的高度不同,因此产生的特征图尺寸也有所差异。分别在这四个特征图上进行最大池化,结果为:`[5]``[8]``[7]``[8]`,最后将池化结果拼接在一起,得到`[5,8,7,8]`
- **全连接与输出层**:将最大池化的结果通过全连接层输出,与DNN模型一样,最后输出层的神经元个数与样本的类别数量一致,且输出之和为1 - **全连接隐层**:经过最大池化后的向量被送入两个连续的隐层,隐层之间为全连接结构
**通过PaddlePaddle实现该CNN结构的代码如下:** - **输出层**:输出层的神经元数量和样本的类别数一致,例如在二分类问题中,输出层会有2个神经元。通过Softmax激活函数,输出结果是一个归一化的概率分布,和为1,因此第$i$个神经元的输出就可以认为是样本属于第$i$类的预测概率。
```python 该 DNN 模型默认对输入的语料进行二分类(`class_dim=2`),embedding(词向量)维度默认为28(`emd_dim=28`),两个隐层均使用Tanh激活函数(`act=paddle.activation.Tanh()`)。需要注意的是,该模型的输入数据为整数序列,而不是原始的单词序列。事实上,为了处理方便,我们一般会事先将单词根据词频顺序进行 id 化,即将词语转化成在字典中的序号。
import paddle.v2 as paddle
def convolution_net(dict_dim, class_dim=2, emb_dim=28, hid_dim=128): ### 2. CNN 模型
"""
cnn network definition
:param dict_dim: size of word dictionary **CNN 模型结构如下图所示:**
:type input_dim: int
:params class_dim: number of instance class
:type class_dim: int
:params emb_dim: embedding vector dimension
:type emb_dim: int
:params hid_dim: number of same size convolution kernels
:type hid_dim: int
"""
# input layers <p align="center">
data = paddle.layer.data("word", <img src="images/cnn_net.png" width = "90%" align="center"/><br/>
paddle.data_type.integer_value_sequence(dict_dim)) 图2. 本例中的 CNN 文本分类模型
lbl = paddle.layer.data("label", paddle.data_type.integer_value(2)) </p>
#embedding layer 通过 PaddlePaddle 实现该 CNN 结构的代码见 `network_conf.py` 中的 `convolution_net` 函数,模型主要分为如下几个部分:
emb = paddle.layer.embedding(input=data, size=emb_dim)
# convolution layers with max pooling - **词向量层**:与 DNN 中词向量层的作用一样,将词语转化为固定维度的向量,利用向量之间的距离来表示词之间的语义相关程度。如图2所示,将得到的词向量定义为行向量,再将语料中所有的单词产生的行向量拼接在一起组成矩阵。假设词向量维度为5,句子 “The cat sat on the read mat” 含 7 个词语,那么得到的矩阵维度为 7*5。关于词向量的更多信息请参考 PaddleBook 中的[词向量](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec)一节。
conv_3 = paddle.networks.sequence_conv_pool(
input=emb, context_len=3, hidden_size=hid_dim)
conv_4 = paddle.networks.sequence_conv_pool(
input=emb, context_len=4, hidden_size=hid_dim)
# fc and output layer - **卷积层**: 文本分类中的卷积在时间序列上进行,即卷积核的宽度和词向量层产出的矩阵一致,卷积沿着矩阵的高度方向进行。卷积后得到的结果被称为“特征图”(feature map)。假设卷积核的高度为 $h$,矩阵的高度为 $N$,卷积的步长为 1,则得到的特征图为一个高度为 $N+1-h$ 的向量。可以同时使用多个不同高度的卷积核,得到多个特征图。
output = paddle.layer.fc(
input=[conv_3, conv_4], size=class_dim, act=paddle.activation.Softmax())
cost = paddle.layer.classification_cost(input=output, label=lbl) - **最大池化层**: 对卷积得到的各个特征图分别进行最大池化操作。由于特征图本身已经是向量,因此这里的最大池化实际上就是简单地选出各个向量中的最大元素。各个最大元素又被拼接在一起,组成新的向量,显然,该向量的维度等于特征图的数量,也就是卷积核的数量。举例来说,假设我们使用了四个不同的卷积核,卷积产生的特征图分别为:`[2,3,5]``[8,2,1]``[5,7,7,6]``[4,5,1,8]`,由于卷积核的高度不同,因此产生的特征图尺寸也有所差异。分别在这四个特征图上进行最大池化,结果为:`[5]``[8]``[7]``[8]`,最后将池化结果拼接在一起,得到`[5,8,7,8]`
return cost, output, lbl - **全连接与输出层**:将最大池化的结果通过全连接层输出,与 DNN 模型一样,最后输出层的神经元个数与样本的类别数量一致,且输出之和为 1。
```
该CNN网络的输入数据类型和前面介绍过的DNN一致。`paddle.networks.sequence_conv_pool`为PaddlePaddle中已经封装好的带有池化的文本序列卷积模块,该模块的`context_len`参数用于指定卷积核在同一时间覆盖的文本长度,即图2中的卷积核的高度;`hidden_size`用于指定该类型的卷积核的数量。可以看到,上述代码定义的结构中使用了128个大小为3的卷积核和128个大小为4的卷积核,这些卷积的结果经过最大池化和结果拼接后产生一个256维的向量,向量经过一个全连接层输出最终预测结果。 CNN 网络的输入数据类型和 DNN 一致。PaddlePaddle 中已经封装好的带有池化的文本序列卷积模块:`paddle.networks.sequence_conv_pool`,可直接调用。该模块的 `context_len` 参数用于指定卷积核在同一时间覆盖的文本长度,即图 2 中的卷积核的高度。`hidden_size` 用于指定该类型的卷积核的数量。本例代码默认使用了 128 个大小为 3 的卷积核和 128 个大小为 4 的卷积核,这些卷积的结果经过最大池化和结果拼接后产生一个 256 维的向量,向量经过一个全连接层输出最终的预测结果。
## 自定义数据 ## 使用 PaddlePaddle 内置数据运行
本样例中的代码通过`Paddle.dataset.imdb.train`接口使用了PaddlePaddle自带的样例数据,在第一次运行代码时,PaddlePaddle会自动下载并缓存所需的数据。如果希望使用自己的数据进行训练,需要自行编写数据读取接口。
编写数据读取接口的关键在于实现一个Python生成器,生成器负责从原始输入文本中解析出一条训练样本,并组合成适当的数据形式传送给网络中的data layer。例如在本样例中,data layer需要的数据类型为`paddle.data_type.integer_value_sequence`,本质上是一个Python list。因此我们的生成器需要完成:从文件中读取数据, 以及转换成适当形式的Python list,这两件事情。 ### 如何训练
假设原始数据的格式为 在终端中执行 `sh run.sh` 以下命令, 将以 PaddlePaddle 内置的情感分类数据集:`paddle.dataset.imdb` 直接运行本例,会看到如下输入
```text
Pass 0, Batch 0, Cost 0.696031, {'__auc_evaluator_0__': 0.47360000014305115, 'classification_error_evaluator': 0.5}
Pass 0, Batch 100, Cost 0.544438, {'__auc_evaluator_0__': 0.839249312877655, 'classification_error_evaluator': 0.30000001192092896}
Pass 0, Batch 200, Cost 0.406581, {'__auc_evaluator_0__': 0.9030032753944397, 'classification_error_evaluator': 0.2199999988079071}
Test at Pass 0, {'__auc_evaluator_0__': 0.9289745092391968, 'classification_error_evaluator': 0.14927999675273895}
``` ```
PaddlePaddle is good 1 日志每隔 100 个 batch 输出一次,输出信息包括:(1)Pass 序号;(2)Batch 序号;(3)依次输出当前 Batch 上评估指标的评估结果。评估指标在配置网络拓扑结构时指定,在上面的输出中,输出了训练样本集之的 AUC 以及错误率指标。
What a terrible weather 0
```
每一行为一条样本,样本包括了原始语料和标签,语料内部单词以空格分隔,语料和标签之间用`\t`分隔。对以上格式的数据,可以使用如下自定义的数据读取接口为PaddlePaddle返回训练数据:
```python
def encode_word(word, word_dict):
"""
map word to id
:param word: the word to be mapped
:type word: str
:param word_dict: word dictionary
:type word_dict: Python dict
"""
if word_dict.has_key(word):
return word_dict[word]
else:
return word_dict['<unk>']
def data_reader(file_name, word_dict):
"""
Reader interface for training data
:param file_name: data file name
:type file_name: str
:param word_dict: word dictionary
:type word_dict: Python dict
"""
def reader():
with open(file_name, "r") as f:
for line in f:
ins, label = line.strip('\n').split('\t')
ins_data = [int(encode_word(w, word_dict)) for w in ins.split(' ')]
yield ins_data, int(label)
return reader
```
`word_dict`是字典,用来将原始的单词字符串转化为在字典中的序号。可以用`data_reader`替换原先代码中的`Paddle.dataset.imdb.train`接口用以提供自定义的训练数据。 ### 如何预测
## 运行与输出 训练结束后模型默认存储在当前工作目录下,在终端中执行 `python infer.py` ,预测脚本会加载训练好的模型进行预测。
本部分以上文介绍的DNN网络为例,介绍如何利用样例中的`text_classification_dnn.py`脚本进行DNN网络的训练和对新样本的预测。 - 默认加载使用 `paddle.data.imdb.train` 训练一个 Pass 产出的 DNN 模型对 `paddle.dataset.imdb.test` 进行测试
`text_classification_dnn.py`中的代码分为四部分: 会看到如下输出:
- **fc_net函数**:定义dnn网络结构,上文已经有说明。
- **train\_dnn\_model函数**:模型训练函数。定义优化方式、训练输出等内容,并组织训练流程。每完成一个pass的训练,程序都会将当前的模型参数保存在硬盘上,文件名为:`dnn_params_pass***.tar.gz`,其中`***`表示pass的id,从0开始计数。本函数接受一个整数类型的参数,表示训练pass的总轮数。
- **dnn_infer函数**:载入已有模型并对新样本进行预测。函数开始运行后会从当前路径下寻找并读取指定名称的参数文件,加载其中的模型参数,并对test数据集中的样本进行预测。
- **main函数**:主函数
要运行本样例,直接在`text_classification_dnn.py`所在路径下执行`python text_classification_dnn.py`即可,样例会自动依次执行数据集下载、数据读取、模型训练和保存、模型读取、新样本预测等步骤。
预测的输出形式为:
```text
positive 0.9275 0.0725 previous reviewer <unk> <unk> gave a much better <unk> of the films plot details than i could what i recall mostly is that it was just so beautiful in every sense emotionally visually <unk> just <unk> br if you like movies that are wonderful to look at and also have emotional content to which that beauty is relevant i think you will be glad to have seen this extraordinary and unusual work of <unk> br on a scale of 1 to 10 id give it about an <unk> the only reason i shy away from 9 is that it is a mood piece if you are in the mood for a really artistic very romantic film then its a 10 i definitely think its a mustsee but none of us can be in that mood all the time so overall <unk>
negative 0.0300 0.9700 i love scifi and am willing to put up with a lot scifi <unk> are usually <unk> <unk> and <unk> i tried to like this i really did but it is to good tv scifi as <unk> 5 is to star trek the original silly <unk> cheap cardboard sets stilted dialogues cg that doesnt match the background and painfully onedimensional characters cannot be overcome with a scifi setting im sure there are those of you out there who think <unk> 5 is good scifi tv its not its clichéd and <unk> while us viewers might like emotion and character development scifi is a genre that does not take itself seriously <unk> star trek it may treat important issues yet not as a serious philosophy its really difficult to care about the characters here as they are not simply <unk> just missing a <unk> of life their actions and reactions are wooden and predictable often painful to watch the makers of earth know its rubbish as they have to always say gene <unk> earth otherwise people would not continue watching <unk> <unk> must be turning in their <unk> as this dull cheap poorly edited watching it without <unk> breaks really brings this home <unk> <unk> of a show <unk> into space spoiler so kill off a main character and then bring him back as another actor <unk> <unk> all over again
``` ```
[ 0.99892634 0.00107362] 0
[ 0.00107638 0.9989236 ] 1
[ 0.98185927 0.01814074] 0
[ 0.31667888 0.68332112] 1
[ 0.98853314 0.01146684] 0
```
每一行表示一条样本的预测结果。前两列表示该样本属于0、1这两个类别的预测概率,最后一列表示样本的实际label。
在运行CNN模型的`text_classification_cnn.py`脚本中,网络模型定义在`convolution_net`函数中,模型训练函数名为`train_cnn_model`,预测函数名为`cnn_infer`。其他用法和`text_classification_dnn.py`是一致的。 输出日志每一行是对一条样本预测的结果,以 `\t` 分隔,共 3 列,分别是:(1)预测类别标签;(2)样本分别属于每一类的概率,内部以空格分隔;(3)输入文本。
## 使用自定义数据训练和预测
### 如何训练
1. 数据组织
假设有如下格式的训练数据:每一行为一条样本,以 `\t` 分隔,第一列是类别标签,第二列是输入文本的内容,文本内容中的词语以空格分隔。以下是两条示例数据:
```
positive PaddlePaddle is good
negative What a terrible weather
```
2. 编写数据读取接口
自定义数据读取接口只需编写一个 Python 生成器实现**从原始输入文本中解析一条训练样本**的逻辑。以下代码片段实现了读取原始数据返回类型为: `paddle.data_type.integer_value_sequence`(词语在字典的序号)和 `paddle.data_type.integer_value`(类别标签)的 2 个输入给网络中定义的 2 个 `data_layer` 的功能。
```python
def train_reader(data_dir, word_dict, label_dict):
def reader():
UNK_ID = word_dict["<UNK>"]
word_col = 0
lbl_col = 1
for file_name in os.listdir(data_dir):
with open(os.path.join(data_dir, file_name), "r") as f:
for line in f:
line_split = line.strip().split("\t")
word_ids = [
word_dict.get(w, UNK_ID)
for w in line_split[word_col].split()
]
yield word_ids, label_dict[line_split[lbl_col]]
return reader
```
- 关于 PaddlePaddle 中 `data_layer` 接受输入数据的类型,以及数据读取接口对应该返回数据的格式,请参考 [input-types](http://www.paddlepaddle.org/release_doc/0.9.0/doc_cn/ui/data_provider/pydataprovider2.html#input-types) 一节。
- 以上代码片段详见本例目录下的 `reader.py` 脚本,`reader.py` 同时提供了读取测试数据的全部代码。
接下来,只需要将数据读取函数 `train_reader` 作为参数传递给 `train.py` 脚本中的 `paddle.batch` 接口即可使用自定义数据接口读取数据,调用方式如下:
```python
train_reader = paddle.batch(
paddle.reader.shuffle(
reader.train_reader(train_data_dir, word_dict, lbl_dict),
buf_size=1000),
batch_size=batch_size)
```
3. 修改命令行参数
- 如果将数据组织成示例数据的同样的格式,只需在 `run.sh` 脚本中修改 `train.py` 启动参数,指定 `train_data_dir` 参数,可以直接运行本例,无需修改数据读取接口 `reader.py`。
- 执行 `python train.py --help` 可以获取`train.py` 脚本各项启动参数的详细说明,主要参数如下:
- `nn_type`:选择要使用的模型,目前支持两种:“dnn” 或者 “cnn”。
- `train_data_dir`:指定训练数据所在的文件夹,使用自定义数据训练,必须指定此参数,否则使用`paddle.dataset.imdb`训练,同时忽略`test_data_dir`,`word_dict`,和 `label_dict` 参数。
- `test_data_dir`:指定测试数据所在的文件夹,若不指定将不进行测试。
- `word_dict`:字典文件所在的路径,若不指定,将从训练数据根据词频统计,自动建立字典。
- `label_dict`:类别标签字典,用于将字符串类型的类别标签,映射为整数类型的序号。
- `batch_size`:指定多少条样本后进行一次神经网络的前向运行及反向更新。
- `num_passes`:指定训练多少个轮次。
### 如何预测
1. 修改 `infer.py` 中以下变量,指定使用的模型、指定测试数据。
```python
model_path = "dnn_params_pass_00000.tar.gz" # 指定模型所在的路径
nn_type = "dnn" # 指定测试使用的模型
test_dir = "./data/test" # 指定测试文件所在的目录
word_dict = "./data/dict/word_dict.txt" # 指定字典所在的路径
label_dict = "./data/dict/label_dict.txt" # 指定类别标签字典的路径
```
2. 在终端中执行 `python infer.py`
...@@ -41,243 +41,203 @@ ...@@ -41,243 +41,203 @@
<!-- This block will be replaced by each markdown file content. Please do not change lines below.--> <!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'> <div id="markdown" style='display:none'>
# 文本分类 # 文本分类
文本分类是机器学习中的一项常见任务,主要目的是根据一条文本的内容,判断该文本所属的类别。在本例子中,我们利用有标注的语料库训练二分类DNN和CNN模型,完成对输入文本的分类任务。
DNN与CNN模型之间最大的区别在于: 以下是本例目录包含的文件以及对应说明:
```text
.
├── images # 文档中的图片
│   ├── cnn_net.png
│   └── dnn_net.png
├── index.html # 文档
├── infer.py # 预测脚本
├── network_conf.py # 本例中涉及的各种网络结构均定义在此文件中,若进一步修改模型结构,请查看此文件
├── reader.py # 读取数据接口,若使用自定义格式的数据,请查看此文件
├── README.md # 文档
├── run.sh # 训练任务运行脚本,直接运行此脚本,将以默认参数开始训练任务
├── train.py # 训练脚本
└── utils.py # 定义通用的函数,例如:打印日志、解析命令行参数、构建字典、加载字典等
```
- DNN不属于序列模型,大多使用基本的全连接结构,只能接受固定维度的特征向量作为输入。 ## 简介
文本分类任务根据给定一条文本的内容,判断该文本所属的类别,是自然语言处理领域的一项重要的基础任务。[PaddleBook](https://github.com/PaddlePaddle/book) 中的[情感分类](https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/README.cn.md)一课,正是一个典型的文本分类任务,任务流程如下:
- CNN属于序列模型,能够提取一个局部区域之内的特征,能够处理变长的序列输入。 1. 收集电影评论网站的用户评论数据。
2. 清洗,标记。
3. 模型设计。
4. 模型学习效果评估。
举例来说,情感分类是一项常见的文本分类任务,在情感分类中,我们希望训练一个模型来判断句子中表现出的情感是正向还是负向。例如,"The apple is not bad",其中的"not bad"是决定这个句子情感的关键 训练好的分类器能够**自动判断**新出现的用户评论的情感是正面还是负面,在舆情监控、营销策划、产品品牌价值评估等任务中,能够起到重要作用。以上过程也是我们去完成一个新的文本分类任务需要遵循的常规流程。可以看到,深度学习方法的巨大优势体现在:**免除复杂的特征的设计,只需要对原始文本进行基础的清理、标注即可**
- 对于DNN模型来说,只能知道句子中有一个"not"和一个"bad",但两者之间的顺序关系在输入时已经丢失,网络不再有机会学习序列之间的顺序信息。 [PaddleBook](https://github.com/PaddlePaddle/book) 中的[情感分类](https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/README.cn.md)介绍了一个较为复杂的栈式双向 LSTM 模型,循环神经网络在一些需要理解语言语义的复杂任务中有着明显的优势,但计算量大,通常对调参技巧也有着更高的要求。在对计算时间有一定限制的任务中,也会考虑其它模型。除了计算时间的考量,更重要的一点:**模型选择往往是机器学习任务成功的基础**。机器学习任务的目标始终是提高泛化能力,也就是对未知的新的样本预测的能力:
- CNN模型接受文本序列作为输入,保留了"not bad"之间的顺序信息。因此,在大多数文本分类任务上,CNN模型的表现要好于DNN。 1. 简单模型拟合能力不足,无法精确拟合训练样本,更加无法期待模型能够准确地预测没有出现在训练样本集中的未知样本,这就是**欠拟合**问题。
2. 然而,过于复杂的模型轻松“记忆”了训练样本集中的每一个样本,但对于没有出现在训练样本集中的未知样本却毫无识别能力,这就是**过拟合**问题。
## 实验数据 "No Free Lunch (NFL)" 是机器学习任务基本原则之一:没有任何一种模型是天生优于其他模型的。模型的设计和选择建立在了解不同模型特性的基础之上,但同时也是一个多次实验评估的过程。在本例中,我们继续向大家介绍几种最常用的文本分类模型,它们的能力和复杂程度不同,帮助大家对比学习这些模型学习效果之间的差异,针对不同的场景选择使用。
本例子的实验在[IMDB数据集](http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz)上进行。IMDB数据集包含了来自IMDB(互联网电影数据库)网站的5万条电影影评,并被标注为正面/负面两种评价。数据集被划分为train和test两部分,各2.5万条数据,正负样本的比例基本为1:1。样本直接以英文原文的形式表示。
## DNN模型 ## 模型详解
**DNN的模型结构入下图所示:** `network_conf.py` 中包括以下模型:
<p align="center"> 1. `fc_net`: DNN 模型,是一个非序列模型。使用基本的全连接结构。
<img src="images/dnn_net.png" width = "90%" align="center"/><br/> 2. `convolution_net`:浅层 CNN 模型,是一个基础的序列模型,能够处理变长的序列输入,提取一个局部区域之内的特征。
图1. DNN文本分类模型
</p>
**可以看到,模型主要分为如下几个部分:** 我们以情感分类任务为例,简单说明序列模型和非序列模型之间的差异。情感分类是一项常见的文本分类任务,模型自动判断文本中表现出的情感是正向还是负向。以句子 "The apple is not bad" 为例,"not bad" 是决定这个句子情感的关键:
- **词向量层**:IMDB的样本由原始的英文单词组成,为了更好地表示不同词之间语义上的关系,首先将英文单词转化为固定维度的向量。训练完成后,词与词语义上的相似程度可以用它们的词向量之间的距离来表示,语义上越相似,距离越近。关于词向量的更多信息请参考PaddleBook中的[词向量](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec)一节。 - 对于 DNN 模型来说,只知道句子中有一个 "not" 和一个 "bad",两者之间的顺序关系在输入网络时已丢失,网络不再有机会学习序列之间的顺序信息。
- CNN 模型接受文本序列作为输入,保留了 "not bad" 之间的顺序信息。
- **最大池化层**:最大池化在时间序列上进行,池化过程消除了不同语料样本在单词数量多少上的差异,并提炼出词向量中每一下标位置上的最大值。经过池化后,词向量层输出的向量序列被转化为一条固定维度的向量。例如,假设最大池化前向量的序列为`[[2,3,5],[7,3,6],[1,4,0]]`,则最大池化的结果为:`[7,4,6]`。 两者各自的一些特点简单总结如下:
- **全连接隐层**:经过最大池化后的向量被送入两个连续的隐层,隐层之间为全连接结构。
1. DNN 的计算量可以远低于 CNN / RNN 模型,在对响应时间有要求的任务中具有优势。
2. DNN 刻画的往往是频繁词特征,潜在会受到分词错误的影响,但对一些依赖关键词特征也能做的不错的任务:如 Spam 短信检测,依然是一个有效的模型。
3. 在大多数需要一定语义理解(例如,借助上下文消除语义中的歧义)的文本分类任务上,以 CNN / RNN 为代表的序列模型的效果往往好于 DNN 模型。
- **输出层**:输出层的神经元数量和样本的类别数一致,例如在二分类问题中,输出层会有2个神经元。通过Softmax激活函数,输出结果是一个归一化的概率分布,和为1,因此第$i$个神经元的输出就可以认为是样本属于第$i$类的预测概率。 ### 1. DNN 模型
**通过PaddlePaddle实现该DNN结构的代码如下:** **DNN 模型结构入下图所示:**
```python
import paddle.v2 as paddle
def fc_net(dict_dim, class_dim=2, emb_dim=28):
"""
dnn network definition
:param dict_dim: size of word dictionary
:type input_dim: int
:params class_dim: number of instance class
:type class_dim: int
:params emb_dim: embedding vector dimension
:type emb_dim: int
"""
# input layers
data = paddle.layer.data("word",
paddle.data_type.integer_value_sequence(dict_dim))
lbl = paddle.layer.data("label", paddle.data_type.integer_value(class_dim))
# embedding layer
emb = paddle.layer.embedding(input=data, size=emb_dim)
# max pooling
seq_pool = paddle.layer.pooling(
input=emb, pooling_type=paddle.pooling.Max())
# two hidden layers
hd_layer_size = [28, 8]
hd_layer_init_std = [1.0 / math.sqrt(s) for s in hd_layer_size]
hd1 = paddle.layer.fc(
input=seq_pool,
size=hd_layer_size[0],
act=paddle.activation.Tanh(),
param_attr=paddle.attr.Param(initial_std=hd_layer_init_std[0]))
hd2 = paddle.layer.fc(
input=hd1,
size=hd_layer_size[1],
act=paddle.activation.Tanh(),
param_attr=paddle.attr.Param(initial_std=hd_layer_init_std[1]))
# output layer
output = paddle.layer.fc(
input=hd2,
size=class_dim,
act=paddle.activation.Softmax(),
param_attr=paddle.attr.Param(initial_std=1.0 / math.sqrt(class_dim)))
cost = paddle.layer.classification_cost(input=output, label=lbl)
return cost, output, lbl
```
该DNN模型默认对输入的语料进行二分类(`class_dim=2`),embedding的词向量维度默认为28(`emd_dim=28`),两个隐层均使用Tanh激活函数(`act=paddle.activation.Tanh()`)。
需要注意的是,该模型的输入数据为整数序列,而不是原始的英文单词序列。事实上,为了处理方便我们一般会事先将单词根据词频顺序进行id化,即将单词用整数替代, 也就是单词在字典中的序号。这一步一般在DNN模型之外完成。
## CNN模型
**CNN的模型结构如下图所示:**
<p align="center"> <p align="center">
<img src="images/cnn_net.png" width = "90%" align="center"/><br/> <img src="images/dnn_net.png" width = "90%" align="center"/><br/>
2. CNN文本分类模型 1. 本例中的 DNN 文本分类模型
</p> </p>
**可以看到,模型主要分为如下几个部分:** 在 PaddlePaddle 实现该 DNN 结构的代码见 `network_conf.py` 中的 `fc_net` 函数,模型主要分为如下几个部分:
- **词向量层**:与DNN中词向量层的作用一样,将英文单词转化为固定维度的向量,利用向量之间的距离来表示词之间的语义相关程度。如图2中所示,将得到的词向量定义为行向量,再将语料中所有的单词产生的行向量拼接在一起组成矩阵。假设词向量维度为5,语料“The cat sat on the read mat”包含7个单词,那么得到的矩阵维度为7*5。关于词向量的更多信息请参考PaddleBook中的[词向量](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec)一节。 - **词向量层**:为了更好地表示不同词之间语义上的关系,首先将词语转化为固定维度的向量。训练完成后,词与词语义上的相似程度可以用它们的词向量之间的距离来表示,语义上越相似,距离越近。关于词向量的更多信息请参考PaddleBook中的[词向量](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec)一节。
- **卷积层**: 文本分类中的卷积在时间序列上进行,即卷积核的宽度和词向量层产出的矩阵一致,卷积沿着矩阵的高度方向进行。卷积后得到的结果被称为“特征图”(feature map)。假设卷积核的高度为$h$,矩阵的高度为$N$,卷积的步长为1,则得到的特征图为一个高度为$N+1-h$的向量。可以同时使用多个不同高度的卷积核,得到多个特征图。 - **最大池化层**:最大池化在时间序列上进行,池化过程消除了不同语料样本在单词数量多少上的差异,并提炼出词向量中每一下标位置上的最大值。经过池化后,词向量层输出的向量序列被转化为一条固定维度的向量。例如,假设最大池化前向量的序列为`[[2,3,5],[7,3,6],[1,4,0]]`,则最大池化的结果为:`[7,4,6]`。
- **最大池化层**: 对卷积得到的各个特征图分别进行最大池化操作。由于特征图本身已经是向量,因此这里的最大池化实际上就是简单地选出各个向量中的最大元素。各个最大元素又被拼接在一起,组成新的向量,显然,该向量的维度等于特征图的数量,也就是卷积核的数量。举例来说,假设我们使用了四个不同的卷积核,卷积产生的特征图分别为:`[2,3,5]`、`[8,2,1]`、`[5,7,7,6]`和`[4,5,1,8]`,由于卷积核的高度不同,因此产生的特征图尺寸也有所差异。分别在这四个特征图上进行最大池化,结果为:`[5]`、`[8]`、`[7]`和`[8]`,最后将池化结果拼接在一起,得到`[5,8,7,8]`。
- **全连接与输出层**:将最大池化的结果通过全连接层输出,与DNN模型一样,最后输出层的神经元个数与样本的类别数量一致,且输出之和为1 - **全连接隐层**:经过最大池化后的向量被送入两个连续的隐层,隐层之间为全连接结构
**通过PaddlePaddle实现该CNN结构的代码如下:** - **输出层**:输出层的神经元数量和样本的类别数一致,例如在二分类问题中,输出层会有2个神经元。通过Softmax激活函数,输出结果是一个归一化的概率分布,和为1,因此第$i$个神经元的输出就可以认为是样本属于第$i$类的预测概率。
```python 该 DNN 模型默认对输入的语料进行二分类(`class_dim=2`),embedding(词向量)维度默认为28(`emd_dim=28`),两个隐层均使用Tanh激活函数(`act=paddle.activation.Tanh()`)。需要注意的是,该模型的输入数据为整数序列,而不是原始的单词序列。事实上,为了处理方便,我们一般会事先将单词根据词频顺序进行 id 化,即将词语转化成在字典中的序号。
import paddle.v2 as paddle
def convolution_net(dict_dim, class_dim=2, emb_dim=28, hid_dim=128): ### 2. CNN 模型
"""
cnn network definition
:param dict_dim: size of word dictionary **CNN 模型结构如下图所示:**
:type input_dim: int
:params class_dim: number of instance class
:type class_dim: int
:params emb_dim: embedding vector dimension
:type emb_dim: int
:params hid_dim: number of same size convolution kernels
:type hid_dim: int
"""
# input layers <p align="center">
data = paddle.layer.data("word", <img src="images/cnn_net.png" width = "90%" align="center"/><br/>
paddle.data_type.integer_value_sequence(dict_dim)) 图2. 本例中的 CNN 文本分类模型
lbl = paddle.layer.data("label", paddle.data_type.integer_value(2)) </p>
#embedding layer 通过 PaddlePaddle 实现该 CNN 结构的代码见 `network_conf.py` 中的 `convolution_net` 函数,模型主要分为如下几个部分:
emb = paddle.layer.embedding(input=data, size=emb_dim)
# convolution layers with max pooling - **词向量层**:与 DNN 中词向量层的作用一样,将词语转化为固定维度的向量,利用向量之间的距离来表示词之间的语义相关程度。如图2所示,将得到的词向量定义为行向量,再将语料中所有的单词产生的行向量拼接在一起组成矩阵。假设词向量维度为5,句子 “The cat sat on the read mat” 含 7 个词语,那么得到的矩阵维度为 7*5。关于词向量的更多信息请参考 PaddleBook 中的[词向量](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec)一节。
conv_3 = paddle.networks.sequence_conv_pool(
input=emb, context_len=3, hidden_size=hid_dim)
conv_4 = paddle.networks.sequence_conv_pool(
input=emb, context_len=4, hidden_size=hid_dim)
# fc and output layer - **卷积层**: 文本分类中的卷积在时间序列上进行,即卷积核的宽度和词向量层产出的矩阵一致,卷积沿着矩阵的高度方向进行。卷积后得到的结果被称为“特征图”(feature map)。假设卷积核的高度为 $h$,矩阵的高度为 $N$,卷积的步长为 1,则得到的特征图为一个高度为 $N+1-h$ 的向量。可以同时使用多个不同高度的卷积核,得到多个特征图。
output = paddle.layer.fc(
input=[conv_3, conv_4], size=class_dim, act=paddle.activation.Softmax())
cost = paddle.layer.classification_cost(input=output, label=lbl) - **最大池化层**: 对卷积得到的各个特征图分别进行最大池化操作。由于特征图本身已经是向量,因此这里的最大池化实际上就是简单地选出各个向量中的最大元素。各个最大元素又被拼接在一起,组成新的向量,显然,该向量的维度等于特征图的数量,也就是卷积核的数量。举例来说,假设我们使用了四个不同的卷积核,卷积产生的特征图分别为:`[2,3,5]`、`[8,2,1]`、`[5,7,7,6]` 和 `[4,5,1,8]`,由于卷积核的高度不同,因此产生的特征图尺寸也有所差异。分别在这四个特征图上进行最大池化,结果为:`[5]`、`[8]`、`[7]`和`[8]`,最后将池化结果拼接在一起,得到`[5,8,7,8]`。
return cost, output, lbl - **全连接与输出层**:将最大池化的结果通过全连接层输出,与 DNN 模型一样,最后输出层的神经元个数与样本的类别数量一致,且输出之和为 1。
```
该CNN网络的输入数据类型和前面介绍过的DNN一致。`paddle.networks.sequence_conv_pool`为PaddlePaddle中已经封装好的带有池化的文本序列卷积模块,该模块的`context_len`参数用于指定卷积核在同一时间覆盖的文本长度,即图2中的卷积核的高度;`hidden_size`用于指定该类型的卷积核的数量。可以看到,上述代码定义的结构中使用了128个大小为3的卷积核和128个大小为4的卷积核,这些卷积的结果经过最大池化和结果拼接后产生一个256维的向量,向量经过一个全连接层输出最终预测结果。 CNN 网络的输入数据类型和 DNN 一致。PaddlePaddle 中已经封装好的带有池化的文本序列卷积模块:`paddle.networks.sequence_conv_pool`,可直接调用。该模块的 `context_len` 参数用于指定卷积核在同一时间覆盖的文本长度,即图 2 中的卷积核的高度。`hidden_size` 用于指定该类型的卷积核的数量。本例代码默认使用了 128 个大小为 3 的卷积核和 128 个大小为 4 的卷积核,这些卷积的结果经过最大池化和结果拼接后产生一个 256 维的向量,向量经过一个全连接层输出最终的预测结果。
## 自定义数据 ## 使用 PaddlePaddle 内置数据运行
本样例中的代码通过`Paddle.dataset.imdb.train`接口使用了PaddlePaddle自带的样例数据,在第一次运行代码时,PaddlePaddle会自动下载并缓存所需的数据。如果希望使用自己的数据进行训练,需要自行编写数据读取接口。
编写数据读取接口的关键在于实现一个Python生成器,生成器负责从原始输入文本中解析出一条训练样本,并组合成适当的数据形式传送给网络中的data layer。例如在本样例中,data layer需要的数据类型为`paddle.data_type.integer_value_sequence`,本质上是一个Python list。因此我们的生成器需要完成:从文件中读取数据, 以及转换成适当形式的Python list,这两件事情。 ### 如何训练
假设原始数据的格式为 在终端中执行 `sh run.sh` 以下命令, 将以 PaddlePaddle 内置的情感分类数据集:`paddle.dataset.imdb` 直接运行本例,会看到如下输入
```text
Pass 0, Batch 0, Cost 0.696031, {'__auc_evaluator_0__': 0.47360000014305115, 'classification_error_evaluator': 0.5}
Pass 0, Batch 100, Cost 0.544438, {'__auc_evaluator_0__': 0.839249312877655, 'classification_error_evaluator': 0.30000001192092896}
Pass 0, Batch 200, Cost 0.406581, {'__auc_evaluator_0__': 0.9030032753944397, 'classification_error_evaluator': 0.2199999988079071}
Test at Pass 0, {'__auc_evaluator_0__': 0.9289745092391968, 'classification_error_evaluator': 0.14927999675273895}
``` ```
PaddlePaddle is good 1 日志每隔 100 个 batch 输出一次,输出信息包括:(1)Pass 序号;(2)Batch 序号;(3)依次输出当前 Batch 上评估指标的评估结果。评估指标在配置网络拓扑结构时指定,在上面的输出中,输出了训练样本集之的 AUC 以及错误率指标。
What a terrible weather 0
```
每一行为一条样本,样本包括了原始语料和标签,语料内部单词以空格分隔,语料和标签之间用`\t`分隔。对以上格式的数据,可以使用如下自定义的数据读取接口为PaddlePaddle返回训练数据:
```python
def encode_word(word, word_dict):
"""
map word to id
:param word: the word to be mapped
:type word: str
:param word_dict: word dictionary
:type word_dict: Python dict
"""
if word_dict.has_key(word):
return word_dict[word]
else:
return word_dict['<unk>']
def data_reader(file_name, word_dict):
"""
Reader interface for training data
:param file_name: data file name
:type file_name: str
:param word_dict: word dictionary
:type word_dict: Python dict
"""
def reader():
with open(file_name, "r") as f:
for line in f:
ins, label = line.strip('\n').split('\t')
ins_data = [int(encode_word(w, word_dict)) for w in ins.split(' ')]
yield ins_data, int(label)
return reader
```
`word_dict`是字典,用来将原始的单词字符串转化为在字典中的序号。可以用`data_reader`替换原先代码中的`Paddle.dataset.imdb.train`接口用以提供自定义的训练数据。 ### 如何预测
## 运行与输出 训练结束后模型默认存储在当前工作目录下,在终端中执行 `python infer.py` ,预测脚本会加载训练好的模型进行预测。
本部分以上文介绍的DNN网络为例,介绍如何利用样例中的`text_classification_dnn.py`脚本进行DNN网络的训练和对新样本的预测。 - 默认加载使用 `paddle.data.imdb.train` 训练一个 Pass 产出的 DNN 模型对 `paddle.dataset.imdb.test` 进行测试
`text_classification_dnn.py`中的代码分为四部分: 会看到如下输出:
- **fc_net函数**:定义dnn网络结构,上文已经有说明。
- **train\_dnn\_model函数**:模型训练函数。定义优化方式、训练输出等内容,并组织训练流程。每完成一个pass的训练,程序都会将当前的模型参数保存在硬盘上,文件名为:`dnn_params_pass***.tar.gz`,其中`***`表示pass的id,从0开始计数。本函数接受一个整数类型的参数,表示训练pass的总轮数。
- **dnn_infer函数**:载入已有模型并对新样本进行预测。函数开始运行后会从当前路径下寻找并读取指定名称的参数文件,加载其中的模型参数,并对test数据集中的样本进行预测。
- **main函数**:主函数
要运行本样例,直接在`text_classification_dnn.py`所在路径下执行`python text_classification_dnn.py`即可,样例会自动依次执行数据集下载、数据读取、模型训练和保存、模型读取、新样本预测等步骤。
预测的输出形式为:
```text
positive 0.9275 0.0725 previous reviewer <unk> <unk> gave a much better <unk> of the films plot details than i could what i recall mostly is that it was just so beautiful in every sense emotionally visually <unk> just <unk> br if you like movies that are wonderful to look at and also have emotional content to which that beauty is relevant i think you will be glad to have seen this extraordinary and unusual work of <unk> br on a scale of 1 to 10 id give it about an <unk> the only reason i shy away from 9 is that it is a mood piece if you are in the mood for a really artistic very romantic film then its a 10 i definitely think its a mustsee but none of us can be in that mood all the time so overall <unk>
negative 0.0300 0.9700 i love scifi and am willing to put up with a lot scifi <unk> are usually <unk> <unk> and <unk> i tried to like this i really did but it is to good tv scifi as <unk> 5 is to star trek the original silly <unk> cheap cardboard sets stilted dialogues cg that doesnt match the background and painfully onedimensional characters cannot be overcome with a scifi setting im sure there are those of you out there who think <unk> 5 is good scifi tv its not its clichéd and <unk> while us viewers might like emotion and character development scifi is a genre that does not take itself seriously <unk> star trek it may treat important issues yet not as a serious philosophy its really difficult to care about the characters here as they are not simply <unk> just missing a <unk> of life their actions and reactions are wooden and predictable often painful to watch the makers of earth know its rubbish as they have to always say gene <unk> earth otherwise people would not continue watching <unk> <unk> must be turning in their <unk> as this dull cheap poorly edited watching it without <unk> breaks really brings this home <unk> <unk> of a show <unk> into space spoiler so kill off a main character and then bring him back as another actor <unk> <unk> all over again
``` ```
[ 0.99892634 0.00107362] 0
[ 0.00107638 0.9989236 ] 1
[ 0.98185927 0.01814074] 0
[ 0.31667888 0.68332112] 1
[ 0.98853314 0.01146684] 0
```
每一行表示一条样本的预测结果。前两列表示该样本属于0、1这两个类别的预测概率,最后一列表示样本的实际label。
在运行CNN模型的`text_classification_cnn.py`脚本中,网络模型定义在`convolution_net`函数中,模型训练函数名为`train_cnn_model`,预测函数名为`cnn_infer`。其他用法和`text_classification_dnn.py`是一致的。 输出日志每一行是对一条样本预测的结果,以 `\t` 分隔,共 3 列,分别是:(1)预测类别标签;(2)样本分别属于每一类的概率,内部以空格分隔;(3)输入文本。
## 使用自定义数据训练和预测
### 如何训练
1. 数据组织
假设有如下格式的训练数据:每一行为一条样本,以 `\t` 分隔,第一列是类别标签,第二列是输入文本的内容,文本内容中的词语以空格分隔。以下是两条示例数据:
```
positive PaddlePaddle is good
negative What a terrible weather
```
2. 编写数据读取接口
自定义数据读取接口只需编写一个 Python 生成器实现**从原始输入文本中解析一条训练样本**的逻辑。以下代码片段实现了读取原始数据返回类型为: `paddle.data_type.integer_value_sequence`(词语在字典的序号)和 `paddle.data_type.integer_value`(类别标签)的 2 个输入给网络中定义的 2 个 `data_layer` 的功能。
```python
def train_reader(data_dir, word_dict, label_dict):
def reader():
UNK_ID = word_dict["<UNK>"]
word_col = 0
lbl_col = 1
for file_name in os.listdir(data_dir):
with open(os.path.join(data_dir, file_name), "r") as f:
for line in f:
line_split = line.strip().split("\t")
word_ids = [
word_dict.get(w, UNK_ID)
for w in line_split[word_col].split()
]
yield word_ids, label_dict[line_split[lbl_col]]
return reader
```
- 关于 PaddlePaddle 中 `data_layer` 接受输入数据的类型,以及数据读取接口对应该返回数据的格式,请参考 [input-types](http://www.paddlepaddle.org/release_doc/0.9.0/doc_cn/ui/data_provider/pydataprovider2.html#input-types) 一节。
- 以上代码片段详见本例目录下的 `reader.py` 脚本,`reader.py` 同时提供了读取测试数据的全部代码。
接下来,只需要将数据读取函数 `train_reader` 作为参数传递给 `train.py` 脚本中的 `paddle.batch` 接口即可使用自定义数据接口读取数据,调用方式如下:
```python
train_reader = paddle.batch(
paddle.reader.shuffle(
reader.train_reader(train_data_dir, word_dict, lbl_dict),
buf_size=1000),
batch_size=batch_size)
```
3. 修改命令行参数
- 如果将数据组织成示例数据的同样的格式,只需在 `run.sh` 脚本中修改 `train.py` 启动参数,指定 `train_data_dir` 参数,可以直接运行本例,无需修改数据读取接口 `reader.py`。
- 执行 `python train.py --help` 可以获取`train.py` 脚本各项启动参数的详细说明,主要参数如下:
- `nn_type`:选择要使用的模型,目前支持两种:“dnn” 或者 “cnn”。
- `train_data_dir`:指定训练数据所在的文件夹,使用自定义数据训练,必须指定此参数,否则使用`paddle.dataset.imdb`训练,同时忽略`test_data_dir`,`word_dict`,和 `label_dict` 参数。
- `test_data_dir`:指定测试数据所在的文件夹,若不指定将不进行测试。
- `word_dict`:字典文件所在的路径,若不指定,将从训练数据根据词频统计,自动建立字典。
- `label_dict`:类别标签字典,用于将字符串类型的类别标签,映射为整数类型的序号。
- `batch_size`:指定多少条样本后进行一次神经网络的前向运行及反向更新。
- `num_passes`:指定训练多少个轮次。
### 如何预测
1. 修改 `infer.py` 中以下变量,指定使用的模型、指定测试数据。
```python
model_path = "dnn_params_pass_00000.tar.gz" # 指定模型所在的路径
nn_type = "dnn" # 指定测试使用的模型
test_dir = "./data/test" # 指定测试文件所在的目录
word_dict = "./data/dict/word_dict.txt" # 指定字典所在的路径
label_dict = "./data/dict/label_dict.txt" # 指定类别标签字典的路径
```
2. 在终端中执行 `python infer.py`。
</div> </div>
<!-- You can change the lines below now. --> <!-- You can change the lines below now. -->
......
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import os
import gzip
import paddle.v2 as paddle
import network_conf
import reader
from utils import *
def infer(topology, data_dir, model_path, word_dict_path, label_dict_path,
batch_size):
def _infer_a_batch(inferer, test_batch, ids_2_word, ids_2_label):
probs = inferer.infer(input=test_batch, field=['value'])
assert len(probs) == len(test_batch)
for word_ids, prob in zip(test_batch, probs):
word_text = " ".join([ids_2_word[id] for id in word_ids[0]])
print("%s\t%s\t%s" % (ids_2_label[prob.argmax()],
" ".join(["{:0.4f}".format(p)
for p in prob]), word_text))
logger.info('begin to predict...')
use_default_data = (data_dir is None)
if use_default_data:
word_dict = paddle.dataset.imdb.word_dict()
word_reverse_dict = dict((value, key)
for key, value in word_dict.iteritems())
label_reverse_dict = {0: "positive", 1: "negative"}
test_reader = paddle.dataset.imdb.test(word_dict)
else:
assert os.path.exists(
word_dict_path), 'the word dictionary file does not exist'
assert os.path.exists(
label_dict_path), 'the label dictionary file does not exist'
word_dict = load_dict(word_dict_path)
word_reverse_dict = load_reverse_dict(word_dict_path)
label_reverse_dict = load_reverse_dict(label_dict_path)
test_reader = reader.test_reader(data_dir, word_dict)()
dict_dim = len(word_dict)
class_num = len(label_reverse_dict)
prob_layer = topology(dict_dim, class_num, is_infer=True)
# initialize PaddlePaddle
paddle.init(use_gpu=False, trainer_count=1)
# load the trained models
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(model_path, 'r'))
inferer = paddle.inference.Inference(
output_layer=prob_layer, parameters=parameters)
test_batch = []
for idx, item in enumerate(test_reader):
test_batch.append([item[0]])
if len(test_batch) == batch_size:
_infer_a_batch(inferer, test_batch, word_reverse_dict,
label_reverse_dict)
test_batch = []
if len(test_batch):
_infer_a_batch(inferer, test_batch, word_reverse_dict,
label_reverse_dict)
test_batch = []
if __name__ == '__main__':
model_path = 'dnn_params_pass_00000.tar.gz'
assert os.path.exists(model_path), "the trained model does not exist."
nn_type = 'dnn'
test_dir = None
word_dict = None
label_dict = None
if nn_type == 'dnn':
topology = network_conf.fc_net
elif nn_type == 'cnn':
topology = network_conf.convolution_net
infer(
topology=topology,
data_dir=test_dir,
word_dict_path=word_dict,
label_dict_path=label_dict,
model_path=model_path,
batch_size=10)
import sys
import math
import gzip
from paddle.v2.layer import parse_network
import paddle.v2 as paddle
__all__ = ["fc_net", "convolution_net"]
def fc_net(dict_dim,
class_num,
emb_dim=28,
hidden_layer_sizes=[28, 8],
is_infer=False):
"""
define the topology of the dnn network
:param dict_dim: size of word dictionary
:type input_dim: int
:params class_num: number of instance class
:type class_num: int
:params emb_dim: embedding vector dimension
:type emb_dim: int
"""
# define the input layers
data = paddle.layer.data("word",
paddle.data_type.integer_value_sequence(dict_dim))
if not is_infer:
lbl = paddle.layer.data("label",
paddle.data_type.integer_value(class_num))
# define the embedding layer
emb = paddle.layer.embedding(input=data, size=emb_dim)
# max pooling to reduce the input sequence into a vector (non-sequence)
seq_pool = paddle.layer.pooling(
input=emb, pooling_type=paddle.pooling.Max())
for idx, hidden_size in enumerate(hidden_layer_sizes):
hidden_init_std = 1.0 / math.sqrt(hidden_size)
hidden = paddle.layer.fc(
input=hidden if idx else seq_pool,
size=hidden_size,
act=paddle.activation.Tanh(),
param_attr=paddle.attr.Param(initial_std=hidden_init_std))
prob = paddle.layer.fc(
input=hidden,
size=class_num,
act=paddle.activation.Softmax(),
param_attr=paddle.attr.Param(initial_std=1.0 / math.sqrt(class_num)))
if is_infer:
return prob
else:
return paddle.layer.classification_cost(
input=prob, label=lbl), prob, lbl
def convolution_net(dict_dim,
class_dim=2,
emb_dim=28,
hid_dim=128,
is_infer=False):
"""
cnn network definition
:param dict_dim: size of word dictionary
:type input_dim: int
:params class_dim: number of instance class
:type class_dim: int
:params emb_dim: embedding vector dimension
:type emb_dim: int
:params hid_dim: number of same size convolution kernels
:type hid_dim: int
"""
# input layers
data = paddle.layer.data("word",
paddle.data_type.integer_value_sequence(dict_dim))
lbl = paddle.layer.data("label", paddle.data_type.integer_value(class_dim))
# embedding layer
emb = paddle.layer.embedding(input=data, size=emb_dim)
# convolution layers with max pooling
conv_3 = paddle.networks.sequence_conv_pool(
input=emb, context_len=3, hidden_size=hid_dim)
conv_4 = paddle.networks.sequence_conv_pool(
input=emb, context_len=4, hidden_size=hid_dim)
# fc and output layer
prob = paddle.layer.fc(
input=[conv_3, conv_4], size=class_dim, act=paddle.activation.Softmax())
if is_infer:
return prob
else:
cost = paddle.layer.classification_cost(input=prob, label=lbl)
return cost, prob, lbl
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
def train_reader(data_dir, word_dict, label_dict):
"""
Reader interface for training data
:param data_dir: data directory
:type data_dir: str
:param word_dict: path of word dictionary,
the dictionary must has a "UNK" in it.
:type word_dict: Python dict
:param label_dict: path of label dictionary
:type label_dict: Python dict
"""
def reader():
UNK_ID = word_dict["<UNK>"]
word_col = 1
lbl_col = 0
for file_name in os.listdir(data_dir):
with open(os.path.join(data_dir, file_name), "r") as f:
for line in f:
line_split = line.strip().split("\t")
word_ids = [
word_dict.get(w, UNK_ID)
for w in line_split[word_col].split()
]
yield word_ids, label_dict[line_split[lbl_col]]
return reader
def test_reader(data_dir, word_dict):
"""
Reader interface for testing data
:param data_dir: data directory.
:type data_dir: str
:param word_dict: path of word dictionary,
the dictionary must has a "UNK" in it.
:type word_dict: Python dict
"""
def reader():
UNK_ID = word_dict["<UNK>"]
word_col = 1
for file_name in os.listdir(data_dir):
with open(os.path.join(data_dir, file_name), "r") as f:
for line in f:
line_split = line.strip().split("\t")
if len(line_split) < word_col: continue
word_ids = [
word_dict.get(w, UNK_ID)
for w in line_split[word_col].split()
]
yield word_ids, line_split[word_col]
return reader
#!/bin/sh
python train.py \
--nn_type="dnn" \
--batch_size=64 \
--num_passes=10 \
2>&1 | tee train.log
import sys
import paddle.v2 as paddle
import gzip
def convolution_net(dict_dim, class_dim=2, emb_dim=28, hid_dim=128):
"""
cnn network definition
:param dict_dim: size of word dictionary
:type input_dim: int
:params class_dim: number of instance class
:type class_dim: int
:params emb_dim: embedding vector dimension
:type emb_dim: int
:params hid_dim: number of same size convolution kernels
:type hid_dim: int
"""
# input layers
data = paddle.layer.data("word",
paddle.data_type.integer_value_sequence(dict_dim))
lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
#embedding layer
emb = paddle.layer.embedding(input=data, size=emb_dim)
# convolution layers with max pooling
conv_3 = paddle.networks.sequence_conv_pool(
input=emb, context_len=3, hidden_size=hid_dim)
conv_4 = paddle.networks.sequence_conv_pool(
input=emb, context_len=4, hidden_size=hid_dim)
# fc and output layer
output = paddle.layer.fc(
input=[conv_3, conv_4], size=class_dim, act=paddle.activation.Softmax())
cost = paddle.layer.classification_cost(input=output, label=lbl)
return cost, output, lbl
def train_cnn_model(num_pass):
"""
train cnn model
:params num_pass: train pass number
:type num_pass: int
"""
# load word dictionary
print 'load dictionary...'
word_dict = paddle.dataset.imdb.word_dict()
dict_dim = len(word_dict)
class_dim = 2
# define data reader
train_reader = paddle.batch(
paddle.reader.shuffle(
lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000),
batch_size=100)
test_reader = paddle.batch(
lambda: paddle.dataset.imdb.test(word_dict), batch_size=100)
# network config
[cost, output, label] = convolution_net(dict_dim, class_dim=class_dim)
# create parameters
parameters = paddle.parameters.create(cost)
# create optimizer
adam_optimizer = paddle.optimizer.Adam(
learning_rate=1e-3,
regularization=paddle.optimizer.L2Regularization(rate=1e-3),
model_average=paddle.optimizer.ModelAverage(average_window=0.5))
# add auc evaluator
paddle.evaluator.auc(input=output, label=label)
# create trainer
trainer = paddle.trainer.SGD(
cost=cost, parameters=parameters, update_equation=adam_optimizer)
# Define end batch and end pass event handler
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print "\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
else:
sys.stdout.write('.')
sys.stdout.flush()
if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=test_reader, feeding=feeding)
print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
with gzip.open("cnn_params_pass" + str(event.pass_id) + ".tar.gz",
'w') as f:
parameters.to_tar(f)
# begin training network
feeding = {'word': 0, 'label': 1}
trainer.train(
reader=train_reader,
event_handler=event_handler,
feeding=feeding,
num_passes=num_pass)
print("Training finished.")
def cnn_infer(file_name):
"""
predict instance labels by cnn network
:params file_name: network parameter file
:type file_name: str
"""
print("Begin to predict...")
word_dict = paddle.dataset.imdb.word_dict()
dict_dim = len(word_dict)
class_dim = 2
[_, output, _] = convolution_net(dict_dim, class_dim=class_dim)
parameters = paddle.parameters.Parameters.from_tar(gzip.open(file_name))
infer_data = []
infer_data_label = []
for item in paddle.dataset.imdb.test(word_dict):
infer_data.append([item[0]])
infer_data_label.append(item[1])
predictions = paddle.infer(
output_layer=output,
parameters=parameters,
input=infer_data,
field=['value'])
for i, prob in enumerate(predictions):
print prob, infer_data_label[i]
if __name__ == "__main__":
paddle.init(use_gpu=False, trainer_count=1)
num_pass = 5
train_cnn_model(num_pass=num_pass)
param_file_name = "cnn_params_pass" + str(num_pass - 1) + ".tar.gz"
cnn_infer(file_name=param_file_name)
import sys
import math
import paddle.v2 as paddle
import gzip
def fc_net(dict_dim, class_dim=2, emb_dim=28):
"""
dnn network definition
:param dict_dim: size of word dictionary
:type input_dim: int
:params class_dim: number of instance class
:type class_dim: int
:params emb_dim: embedding vector dimension
:type emb_dim: int
"""
# input layers
data = paddle.layer.data("word",
paddle.data_type.integer_value_sequence(dict_dim))
lbl = paddle.layer.data("label", paddle.data_type.integer_value(class_dim))
# embedding layer
emb = paddle.layer.embedding(input=data, size=emb_dim)
# max pooling
seq_pool = paddle.layer.pooling(
input=emb, pooling_type=paddle.pooling.Max())
# two hidden layers
hd_layer_size = [28, 8]
hd_layer_init_std = [1.0 / math.sqrt(s) for s in hd_layer_size]
hd1 = paddle.layer.fc(
input=seq_pool,
size=hd_layer_size[0],
act=paddle.activation.Tanh(),
param_attr=paddle.attr.Param(initial_std=hd_layer_init_std[0]))
hd2 = paddle.layer.fc(
input=hd1,
size=hd_layer_size[1],
act=paddle.activation.Tanh(),
param_attr=paddle.attr.Param(initial_std=hd_layer_init_std[1]))
# output layer
output = paddle.layer.fc(
input=hd2,
size=class_dim,
act=paddle.activation.Softmax(),
param_attr=paddle.attr.Param(initial_std=1.0 / math.sqrt(class_dim)))
cost = paddle.layer.classification_cost(input=output, label=lbl)
return cost, output, lbl
def train_dnn_model(num_pass):
"""
train dnn model
:params num_pass: train pass number
:type num_pass: int
"""
# load word dictionary
print 'load dictionary...'
word_dict = paddle.dataset.imdb.word_dict()
dict_dim = len(word_dict)
class_dim = 2
# define data reader
train_reader = paddle.batch(
paddle.reader.shuffle(
lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000),
batch_size=100)
test_reader = paddle.batch(
lambda: paddle.dataset.imdb.test(word_dict), batch_size=100)
# network config
[cost, output, label] = fc_net(dict_dim, class_dim=class_dim)
# create parameters
parameters = paddle.parameters.create(cost)
# create optimizer
adam_optimizer = paddle.optimizer.Adam(
learning_rate=1e-3,
regularization=paddle.optimizer.L2Regularization(rate=1e-3),
model_average=paddle.optimizer.ModelAverage(average_window=0.5))
# add auc evaluator
paddle.evaluator.auc(input=output, label=label)
# create trainer
trainer = paddle.trainer.SGD(
cost=cost, parameters=parameters, update_equation=adam_optimizer)
# Define end batch and end pass event handler
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print "\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
else:
sys.stdout.write('.')
sys.stdout.flush()
if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=test_reader, feeding=feeding)
print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
with gzip.open("dnn_params_pass" + str(event.pass_id) + ".tar.gz",
'w') as f:
parameters.to_tar(f)
# begin training network
feeding = {'word': 0, 'label': 1}
trainer.train(
reader=train_reader,
event_handler=event_handler,
feeding=feeding,
num_passes=num_pass)
print("Training finished.")
def dnn_infer(file_name):
"""
predict instance labels by dnn network
:params file_name: network parameter file
:type file_name: str
"""
print("Begin to predict...")
word_dict = paddle.dataset.imdb.word_dict()
dict_dim = len(word_dict)
class_dim = 2
[_, output, _] = fc_net(dict_dim, class_dim=class_dim)
parameters = paddle.parameters.Parameters.from_tar(gzip.open(file_name))
infer_data = []
infer_data_label = []
for item in paddle.dataset.imdb.test(word_dict):
infer_data.append([item[0]])
infer_data_label.append(item[1])
predictions = paddle.infer(
output_layer=output,
parameters=parameters,
input=infer_data,
field=['value'])
for i, prob in enumerate(predictions):
print prob, infer_data_label[i]
if __name__ == "__main__":
paddle.init(use_gpu=False, trainer_count=1)
num_pass = 5
train_dnn_model(num_pass=num_pass)
param_file_name = "dnn_params_pass" + str(num_pass - 1) + ".tar.gz"
dnn_infer(file_name=param_file_name)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import gzip
import paddle.v2 as paddle
import network_conf
import reader
from utils import *
def train(topology,
train_data_dir=None,
test_data_dir=None,
word_dict_path=None,
label_dict_path=None,
batch_size=32,
num_passes=10):
"""
train dnn model
:params train_data_path: path of training data, if this parameter
is not specified, paddle.dataset.imdb will be used to run this example
:type train_data_path: str
:params test_data_path: path of testing data, if this parameter
is not specified, paddle.dataset.imdb will be used to run this example
:type test_data_path: str
:params word_dict_path: path of training data, if this parameter
is not specified, paddle.dataset.imdb will be used to run this example
:type word_dict_path: str
:params num_pass: train pass number
:type num_pass: int
"""
use_default_data = (train_data_dir is None)
if use_default_data:
logger.info(("No training data are porivided, "
"use paddle.dataset.imdb to train the model."))
logger.info("please wait to build the word dictionary ...")
word_dict = paddle.dataset.imdb.word_dict()
train_reader = paddle.batch(
paddle.reader.shuffle(
lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000),
batch_size=100)
test_reader = paddle.batch(
lambda: paddle.dataset.imdb.test(word_dict), batch_size=100)
class_num = 2
else:
if word_dict_path is None or not os.path.exists(word_dict_path):
logger.info(("word dictionary is not given, the dictionary "
"is automatically built from the training data."))
# build the word dictionary to map the original string-typed
# words into integer-typed index
build_dict(
data_dir=train_data_dir,
save_path=word_dict_path,
use_col=1,
cutoff_fre=5,
insert_extra_words=["<UNK>"])
if not os.path.exists(label_dict_path):
logger.info(("label dictionary is not given, the dictionary "
"is automatically built from the training data."))
# build the label dictionary to map the original string-typed
# label into integer-typed index
build_dict(
data_dir=train_data_dir, save_path=label_dict_path, use_col=0)
word_dict = load_dict(word_dict_path)
lbl_dict = load_dict(label_dict_path)
class_num = len(lbl_dict)
logger.info("class number is : %d." % (len(lbl_dict)))
train_reader = paddle.batch(
paddle.reader.shuffle(
reader.train_reader(train_data_dir, word_dict, lbl_dict),
buf_size=1000),
batch_size=batch_size)
if test_data_dir is not None:
# here, because training and testing data share a same format,
# we still use the reader.train_reader to read the testing data.
test_reader = paddle.batch(
paddle.reader.shuffle(
reader.train_reader(test_data_dir, word_dict, lbl_dict),
buf_size=1000),
batch_size=batch_size)
else:
test_reader = None
dict_dim = len(word_dict)
logger.info("length of word dictionary is : %d." % (dict_dim))
paddle.init(use_gpu=False, trainer_count=1)
# network config
cost, prob, label = topology(dict_dim, class_num)
# create parameters
parameters = paddle.parameters.create(cost)
# create optimizer
adam_optimizer = paddle.optimizer.Adam(
learning_rate=1e-3,
regularization=paddle.optimizer.L2Regularization(rate=1e-3),
model_average=paddle.optimizer.ModelAverage(average_window=0.5))
# create trainer
trainer = paddle.trainer.SGD(
cost=cost,
extra_layers=paddle.evaluator.auc(input=prob, label=label),
parameters=parameters,
update_equation=adam_optimizer)
# begin training network
feeding = {"word": 0, "label": 1}
def _event_handler(event):
"""
Define end batch and end pass event handler
"""
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
logger.info("Pass %d, Batch %d, Cost %f, %s\n" % (
event.pass_id, event.batch_id, event.cost, event.metrics))
if isinstance(event, paddle.event.EndPass):
if test_reader is not None:
result = trainer.test(reader=test_reader, feeding=feeding)
logger.info("Test at Pass %d, %s \n" % (event.pass_id,
result.metrics))
with gzip.open("dnn_params_pass_%05d.tar.gz" % event.pass_id,
"w") as f:
parameters.to_tar(f)
trainer.train(
reader=train_reader,
event_handler=_event_handler,
feeding=feeding,
num_passes=num_passes)
logger.info("Training has finished.")
def main(args):
if args.nn_type == "dnn":
topology = network_conf.fc_net
elif args.nn_type == "cnn":
topology = network_conf.convolution_net
train(
topology=topology,
train_data_dir=args.train_data_dir,
test_data_dir=args.test_data_dir,
word_dict_path=args.word_dict,
label_dict_path=args.label_dict,
batch_size=args.batch_size,
num_passes=args.num_passes)
if __name__ == "__main__":
args = parse_train_cmd()
if args.train_data_dir is not None:
assert args.word_dict and args.label_dict, (
"the parameter train_data_dir, word_dict_path, and label_dict_path "
"should be set at the same time.")
main(args)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
import os
import argparse
from collections import defaultdict
logger = logging.getLogger("logger")
logger.setLevel(logging.INFO)
def parse_train_cmd():
parser = argparse.ArgumentParser(
description="PaddlePaddle text classification demo")
parser.add_argument(
"--nn_type",
type=str,
help="define which type of network to use, available: [dnn, cnn]",
default="dnn")
parser.add_argument(
"--train_data_dir",
type=str,
required=False,
help=("path of training dataset (default: None). "
"if this parameter is not set, "
"paddle.dataset.imdb will be used."),
default=None)
parser.add_argument(
"--test_data_dir",
type=str,
required=False,
help=("path of testing dataset (default: None). "
"if this parameter is not set, "
"paddle.dataset.imdb will be used."),
default=None)
parser.add_argument(
"--word_dict",
type=str,
required=False,
help=("path of word dictionary (default: None)."
"if this parameter is not set, paddle.dataset.imdb will be used."
"if this parameter is set, but the file does not exist, "
"word dictionay will be built from "
"the training data automatically."),
default=None)
parser.add_argument(
"--label_dict",
type=str,
required=False,
help=("path of label dictionay (default: None)."
"if this parameter is not set, paddle.dataset.imdb will be used."
"if this parameter is set, but the file does not exist, "
"word dictionay will be built from "
"the training data automatically."),
default=None)
parser.add_argument(
"--batch_size",
type=int,
default=32,
help="the number of training examples in one forward/backward pass")
parser.add_argument(
"--num_passes", type=int, default=10, help="number of passes to train")
return parser.parse_args()
def build_dict(data_dir,
save_path,
use_col=0,
cutoff_fre=0,
insert_extra_words=[]):
values = defaultdict(int)
for file_name in os.listdir(data_dir):
file_path = os.path.join(data_dir, file_name)
if not os.path.isfile(file_path):
continue
with open(file_path, "r") as fdata:
for line in fdata:
line_splits = line.strip().split("\t")
if len(line_splits) < use_col: continue
for w in line_splits[use_col].split():
values[w] += 1
with open(save_path, "w") as f:
for w in insert_extra_words:
f.write("%s\t-1\n" % (w))
for v, count in sorted(
values.iteritems(), key=lambda x: x[1], reverse=True):
if count < cutoff_fre:
break
f.write("%s\t%d\n" % (v, count))
def load_dict(dict_path):
return dict((line.strip().split("\t")[0], idx)
for idx, line in enumerate(open(dict_path, "r").readlines()))
def load_reverse_dict(dict_path):
return dict((idx, line.strip().split("\t")[0])
for idx, line in enumerate(open(dict_path, "r").readlines()))
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册