提交 72a72150 编写于 作者: C caoying03

update README.

......@@ -22,8 +22,7 @@ before_install:
script:
- .travis/precommit.sh
- docker run -i --rm -v "$PWD:/py_unittest" paddlepaddle/paddle:latest /bin/bash -c
"cd /py_unittest && find . -name 'tests' -type d -print0 | xargs -0 -I{} -n1 bash -c 'cd {};
python -m unittest discover -v'"
'cd /py_unittest; sh .travis/unittest.sh'
notifications:
email:
......
#!/bin/bash
abort(){
echo "Run unittest failed" 1>&2
echo "Please check your code" 1>&2
exit 1
}
unittest(){
cd $1 > /dev/null
if [ -f "requirements.txt" ]; then
pip install -r requirements.txt
fi
if [ $? != 0 ]; then
exit 1
fi
find . -name 'tests' -type d -print0 | \
xargs -0 -I{} -n1 bash -c \
'python -m unittest discover -v -s {}'
cd - > /dev/null
}
trap 'abort' 0
set -e
for proj in */ ; do
if [ -d $proj ]; then
unittest $proj
if [ $? != 0 ]; then
exit 1
fi
fi
done
trap : 0
......@@ -14,46 +14,57 @@ PaddlePaddle提供了丰富的运算单元,帮助大家以模块化的方式
在词向量的例子中,我们向大家展示如何使用Hierarchical-Sigmoid 和噪声对比估计(Noise Contrastive Estimation,NCE)来加速词向量的学习。
- 1.1 [Hsigmoid加速词向量训练](https://github.com/PaddlePaddle/models/tree/develop/word_embedding)
- 1.2 [噪声对比估计加速词向量训练](https://github.com/PaddlePaddle/models/tree/develop/nce_cost)
## 2. 点击率预估
## 2. 语言模型
语言模型是自然语言处理领域里一个重要的基础模型,它是一个概率分布模型,利用它可以确定哪个词序列的可能性更大,或者给定若干个词,可以预测下一个最可能出现的词。语言模型被应用在很多领域,如:自动写作、QA、机器翻译、拼写检查、语音识别、词性标注等。
在语言模型的例子中,我们以文本生成为例,提供了RNN LM(包括LSTM、GRU)和N-Gram LM,供大家学习和使用。用户可以通过文档中的 “使用说明” 快速上手:适配训练语料,以训练 “自动写诗”、“自动写散文” 等有趣的模型。
- 2.1 [基于LSTM、GRU、N-Gram的文本生成模型](https://github.com/PaddlePaddle/models/tree/develop/language_model)
## 3. 点击率预估
点击率预估模型预判用户对一条广告点击的概率,对每次广告的点击情况做出预测,是广告技术的核心算法之一。逻谛斯克回归对大规模稀疏特征有着很好的学习能力,在点击率预估任务发展的早期一统天下。近年来,DNN 模型由于其强大的学习能力逐渐接过点击率预估任务的大旗。
在点击率预估的例子中,我们给出谷歌提出的 Wide & Deep 模型。这一模型融合了适用于学习抽象特征的 DNN 和适用于大规模稀疏特征的逻谛斯克回归两者模型的优点,可以作为一种相对成熟的模型框架使用, 在工业界也有一定的应用。
- 2.1 [Wide & deep 点击率预估模型](https://github.com/PaddlePaddle/models/tree/develop/ctr)
- 3.1 [Wide & deep 点击率预估模型](https://github.com/PaddlePaddle/models/tree/develop/ctr)
## 3. 文本分类
## 4. 文本分类
文本分类是自然语言处理领域最基础的任务之一,深度学习方法能够免除复杂的特征工程,直接使用原始文本作为输入,数据驱动地最优化分类准确率。
在文本分类的例子中,我们以情感分类任务为例,提供了基于DNN的非序列文本分类模型,以及基于CNN的序列模型供大家学习和使用(基于LSTM的模型见PaddleBook中[情感分类](https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/README.cn.md)一课)。
- 3.1 [基于 DNN / CNN 的情感分类](https://github.com/PaddlePaddle/models/tree/develop/text_classification)
- 4.1 [基于 DNN / CNN 的情感分类](https://github.com/PaddlePaddle/models/tree/develop/text_classification)
## 4. 排序学习
## 5. 排序学习
排序学习(Learning to Rank, LTR)是信息检索和搜索引擎研究的核心问题之一,通过机器学习方法学习一个分值函数对待排序的候选进行打分,再根据分值的高低确定序关系。深度神经网络可以用来建模分值函数,构成各类基于深度学习的LTR模型。
在排序学习的例子中,我们介绍基于 RankLoss 损失函数的 Pairwise 排序模型和基于LambdaRank损失函数的Listwise排序模型(Pointwise学习策略见PaddleBook中[推荐系统](https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/README.cn.md)一课)。
- 4.1 [基于 Pairwise 和 Listwise 的排序学习](https://github.com/PaddlePaddle/models/tree/develop/ltr)
- 5.1 [基于 Pairwise 和 Listwise 的排序学习](https://github.com/PaddlePaddle/models/tree/develop/ltr)
## 5. 序列标注
## 6. 序列标注
给定输入序列,序列标注模型为序列中每一个元素贴上一个类别标签,是自然语言处理领域最基础的任务之一。随着深度学习的不断探索和发展,利用循环神经网络学习输入序列的特征表示,条件随机场(Conditional Random Field, CRF)在特征基础上完成序列标注任务,逐渐成为解决序列标注问题的标配解决方案。
在序列标注的例子中,我们以命名实体识别(Named Entity Recognition,NER)任务为例,介绍如何训练一个端到端的序列标注模型。
- 5.1 [命名实体识别](https://github.com/PaddlePaddle/models/tree/develop/sequence_tagging_for_ner)
- 6.1 [命名实体识别](https://github.com/PaddlePaddle/models/tree/develop/sequence_tagging_for_ner)
## 6. 序列到序列学习
## 7. 序列到序列学习
序列到序列学习实现两个甚至是多个不定长模型之间的映射,有着广泛的应用,包括:机器翻译、智能对话与问答、广告创意语料生成、自动编码(如金融画像编码)、判断多个文本串之间的语义相关性等。
在序列到序列学习的例子中,我们以机器翻译任务为例,提供了多种改进模型,供大家学习和使用。包括:不带注意力机制的序列到序列映射模型,这一模型是所有序列到序列学习模型的基础;使用 scheduled sampling 改善 RNN 模型在生成任务中的错误累积问题;带外部记忆机制的神经机器翻译,通过增强神经网络的记忆能力,来完成复杂的序列到序列学习任务。
- 6.1 [无注意力机制的编码器解码器模型](https://github.com/PaddlePaddle/models/tree/develop/nmt_without_attention)
- 7.1 [无注意力机制的编码器解码器模型](https://github.com/PaddlePaddle/models/tree/develop/nmt_without_attention)
## Copyright and License
PaddlePaddle is provided under the [Apache-2.0 license](LICENSE).
文件模式从 100755 更改为 100644
文件模式从 100755 更改为 100644
......@@ -6,6 +6,10 @@ from __future__ import print_function
import numpy as np
import io
import soundfile
import scikits.samplerate
from scipy import signal
import random
import copy
class AudioSegment(object):
......@@ -75,6 +79,32 @@ class AudioSegment(object):
io.BytesIO(bytes), dtype='float32')
return cls(samples, sample_rate)
@classmethod
def concatenate(cls, *segments):
"""Concatenate an arbitrary number of audio segments together.
:param *segments: Input audio segments to be concatenated.
:type *segments: tuple of AudioSegment
:return: Audio segment instance as concatenating results.
:rtype: AudioSegment
:raises ValueError: If the number of segments is zero, or if the
sample_rate of any segments does not match.
:raises TypeError: If any segment is not AudioSegment instance.
"""
# Perform basic sanity-checks.
if len(segments) == 0:
raise ValueError("No audio segments are given to concatenate.")
sample_rate = segments[0]._sample_rate
for seg in segments:
if sample_rate != seg._sample_rate:
raise ValueError("Can't concatenate segments with "
"different sample rates")
if type(seg) is not cls:
raise TypeError("Only audio segments of the same type "
"can be concatenated.")
samples = np.concatenate([seg.samples for seg in segments])
return cls(samples, sample_rate)
def to_wav_file(self, filepath, dtype='float32'):
"""Save audio segment to disk as wav file.
......@@ -100,6 +130,89 @@ class AudioSegment(object):
format='WAV',
subtype=subtype_map[dtype])
@classmethod
def slice_from_file(cls, file, start=None, end=None):
"""Loads a small section of an audio without having to load
the entire file into the memory which can be incredibly wasteful.
:param file: Input audio filepath or file object.
:type file: basestring|file
:param start: Start time in seconds. If start is negative, it wraps
around from the end. If not provided, this function
reads from the very beginning.
:type start: float
:param end: End time in seconds. If end is negative, it wraps around
from the end. If not provided, the default behvaior is
to read to the end of the file.
:type end: float
:return: AudioSegment instance of the specified slice of the input
audio file.
:rtype: AudioSegment
:raise ValueError: If start or end is incorrectly set, e.g. out of
bounds in time.
"""
sndfile = soundfile.SoundFile(file)
sample_rate = sndfile.samplerate
duration = float(len(sndfile)) / sample_rate
start = 0. if start is None else start
end = 0. if end is None else end
if start < 0.0:
start += duration
if end < 0.0:
end += duration
if start < 0.0:
raise ValueError("The slice start position (%f s) is out of "
"bounds." % start)
if end < 0.0:
raise ValueError("The slice end position (%f s) is out of bounds." %
end)
if start > end:
raise ValueError("The slice start position (%f s) is later than "
"the slice end position (%f s)." % (start, end))
if end > duration:
raise ValueError("The slice end position (%f s) is out of bounds "
"(> %f s)" % (end, duration))
start_frame = int(start * sample_rate)
end_frame = int(end * sample_rate)
sndfile.seek(start_frame)
data = sndfile.read(frames=end_frame - start_frame, dtype='float32')
return cls(data, sample_rate)
@classmethod
def make_silence(cls, duration, sample_rate):
"""Creates a silent audio segment of the given duration and sample rate.
:param duration: Length of silence in seconds.
:type duration: float
:param sample_rate: Sample rate.
:type sample_rate: float
:return: Silent AudioSegment instance of the given duration.
:rtype: AudioSegment
"""
samples = np.zeros(int(duration * sample_rate))
return cls(samples, sample_rate)
def superimpose(self, other):
"""Add samples from another segment to those of this segment
(sample-wise addition, not segment concatenation).
Note that this is an in-place transformation.
:param other: Segment containing samples to be added in.
:type other: AudioSegments
:raise TypeError: If type of two segments don't match.
:raise ValueError: If the sample rates of the two segments are not
equal, or if the lengths of segments don't match.
"""
if type(self) != type(other):
raise TypeError("Cannot add segments of different types: %s "
"and %s." % (type(self), type(other)))
if self._sample_rate != other._sample_rate:
raise ValueError("Sample rates must match to add segments.")
if len(self._samples) != len(other._samples):
raise ValueError("Segment lengths must match to add segments.")
self._samples += other._samples
def to_bytes(self, dtype='float32'):
"""Create a byte string containing the audio content.
......@@ -143,23 +256,257 @@ class AudioSegment(object):
new_indices = np.linspace(start=0, stop=old_length, num=new_length)
self._samples = np.interp(new_indices, old_indices, self._samples)
def normalize(self, target_sample_rate):
raise NotImplementedError()
def normalize(self, target_db=-20, max_gain_db=300.0):
"""Normalize audio to be of the desired RMS value in decibels.
Note that this is an in-place transformation.
:param target_db: Target RMS value in decibels. This value should be
less than 0.0 as 0.0 is full-scale audio.
:type target_db: float
:param max_gain_db: Max amount of gain in dB that can be applied for
normalization. This is to prevent nans when
attempting to normalize a signal consisting of
all zeros.
:type max_gain_db: float
:raises ValueError: If the required gain to normalize the segment to
the target_db value exceeds max_gain_db.
"""
gain = target_db - self.rms_db
if gain > max_gain_db:
raise ValueError(
"Unable to normalize segment to %f dB because the "
"the probable gain have exceeds max_gain_db (%f dB)" %
(target_db, max_gain_db))
self.apply_gain(min(max_gain_db, target_db - self.rms_db))
def normalize_online_bayesian(self,
target_db,
prior_db,
prior_samples,
startup_delay=0.0):
"""Normalize audio using a production-compatible online/causal
algorithm. This uses an exponential likelihood and gamma prior to
make online estimates of the RMS even when there are very few samples.
Note that this is an in-place transformation.
:param target_db: Target RMS value in decibels.
:type target_bd: float
:param prior_db: Prior RMS estimate in decibels.
:type prior_db: float
:param prior_samples: Prior strength in number of samples.
:type prior_samples: float
:param startup_delay: Default 0.0s. If provided, this function will
accrue statistics for the first startup_delay
seconds before applying online normalization.
:type startup_delay: float
"""
# Estimate total RMS online.
startup_sample_idx = min(self.num_samples - 1,
int(self.sample_rate * startup_delay))
prior_mean_squared = 10.**(prior_db / 10.)
prior_sum_of_squares = prior_mean_squared * prior_samples
cumsum_of_squares = np.cumsum(self.samples**2)
sample_count = np.arange(len(self.num_samples)) + 1
if startup_sample_idx > 0:
cumsum_of_squares[:startup_sample_idx] = \
cumsum_of_squares[startup_sample_idx]
sample_count[:startup_sample_idx] = \
sample_count[startup_sample_idx]
mean_squared_estimate = ((cumsum_of_squares + prior_sum_of_squares) /
(sample_count + prior_samples))
rms_estimate_db = 10 * np.log10(mean_squared_estimate)
# Compute required time-varying gain.
gain_db = target_db - rms_estimate_db
self.apply_gain(gain_db)
def resample(self, target_sample_rate, quality='sinc_medium'):
"""Resample the audio to a target sample rate.
Note that this is an in-place transformation.
def resample(self, target_sample_rate):
raise NotImplementedError()
:param target_sample_rate: Target sample rate.
:type target_sample_rate: int
:param quality: One of {'sinc_fastest', 'sinc_medium', 'sinc_best'}.
Sets resampling speed/quality tradeoff.
See http://www.mega-nerd.com/SRC/api_misc.html#Converters
:type quality: str
"""
resample_ratio = target_sample_rate / self._sample_rate
self._samples = scikits.samplerate.resample(
self._samples, r=resample_ratio, type=quality)
self._sample_rate = target_sample_rate
def pad_silence(self, duration, sides='both'):
raise NotImplementedError()
"""Pad this audio sample with a period of silence.
Note that this is an in-place transformation.
:param duration: Length of silence in seconds to pad.
:type duration: float
:param sides: Position for padding:
'beginning' - adds silence in the beginning;
'end' - adds silence in the end;
'both' - adds silence in both the beginning and the end.
:type sides: str
:raises ValueError: If sides is not supported.
"""
if duration == 0.0:
return self
cls = type(self)
silence = self.make_silence(duration, self._sample_rate)
if sides == "beginning":
padded = cls.concatenate(silence, self)
elif sides == "end":
padded = cls.concatenate(self, silence)
elif sides == "both":
padded = cls.concatenate(silence, self, silence)
else:
raise ValueError("Unknown value for the sides %s" % sides)
self._samples = padded._samples
def subsegment(self, start_sec=None, end_sec=None):
raise NotImplementedError()
"""Cut the AudioSegment between given boundaries.
Note that this is an in-place transformation.
:param start_sec: Beginning of subsegment in seconds.
:type start_sec: float
:param end_sec: End of subsegment in seconds.
:type end_sec: float
:raise ValueError: If start_sec or end_sec is incorrectly set, e.g. out
of bounds in time.
"""
start_sec = 0.0 if start_sec is None else start_sec
end_sec = self.duration if end_sec is None else end_sec
if start_sec < 0.0:
start_sec = self.duration + start_sec
if end_sec < 0.0:
end_sec = self.duration + end_sec
if start_sec < 0.0:
raise ValueError("The slice start position (%f s) is out of "
"bounds." % start_sec)
if end_sec < 0.0:
raise ValueError("The slice end position (%f s) is out of bounds." %
end_sec)
if start_sec > end_sec:
raise ValueError("The slice start position (%f s) is later than "
"the end position (%f s)." % (start_sec, end_sec))
if end_sec > self.duration:
raise ValueError("The slice end position (%f s) is out of bounds "
"(> %f s)" % (end_sec, self.duration))
start_sample = int(round(start_sec * self._sample_rate))
end_sample = int(round(end_sec * self._sample_rate))
self._samples = self._samples[start_sample:end_sample]
def random_subsegment(self, subsegment_length, rng=None):
"""Cut the specified length of the audiosegment randomly.
Note that this is an in-place transformation.
:param subsegment_length: Subsegment length in seconds.
:type subsegment_length: float
:param rng: Random number generator state.
:type rng: random.Random
:raises ValueError: If the length of subsegment is greater than
the origineal segemnt.
"""
rng = random.Random() if rng is None else rng
if subsegment_length > self.duration:
raise ValueError("Length of subsegment must not be greater "
"than original segment.")
start_time = rng.uniform(0.0, self.duration - subsegment_length)
self.subsegment(start_time, start_time + subsegment_length)
def convolve(self, impulse_segment, allow_resample=False):
"""Convolve this audio segment with the given impulse segment.
Note that this is an in-place transformation.
:param impulse_segment: Impulse response segments.
:type impulse_segment: AudioSegment
:param allow_resample: Indicates whether resampling is allowed when
the impulse_segment has a different sample
rate from this signal.
:type allow_resample: bool
:raises ValueError: If the sample rate is not match between two
audio segments when resample is not allowed.
"""
if allow_resample and self.sample_rate != impulse_segment.sample_rate:
impulse_segment = impulse_segment.resample(self.sample_rate)
if self.sample_rate != impulse_segment.sample_rate:
raise ValueError("Impulse segment's sample rate (%d Hz) is not"
"equal to base signal sample rate (%d Hz)." %
(impulse_segment.sample_rate, self.sample_rate))
samples = signal.fftconvolve(self.samples, impulse_segment.samples,
"full")
self._samples = samples
def convolve_and_normalize(self, impulse_segment, allow_resample=False):
"""Convolve and normalize the resulting audio segment so that it
has the same average power as the input signal.
Note that this is an in-place transformation.
def convolve(self, filter, allow_resample=False):
raise NotImplementedError()
:param impulse_segment: Impulse response segments.
:type impulse_segment: AudioSegment
:param allow_resample: Indicates whether resampling is allowed when
the impulse_segment has a different sample
rate from this signal.
:type allow_resample: bool
"""
target_db = self.rms_db
self.convolve(impulse_segment, allow_resample=allow_resample)
self.normalize(target_db)
def add_noise(self,
noise,
snr_dB,
allow_downsampling=False,
max_gain_db=300.0,
rng=None):
"""Add the given noise segment at a specific signal-to-noise ratio.
If the noise segment is longer than this segment, a random subsegment
of matching length is sampled from it and used instead.
def convolve_and_normalize(self, filter, allow_resample=False):
raise NotImplementedError()
Note that this is an in-place transformation.
:param noise: Noise signal to add.
:type noise: AudioSegment
:param snr_dB: Signal-to-Noise Ratio, in decibels.
:type snr_dB: float
:param allow_downsampling: Whether to allow the noise signal to be
downsampled to match the base signal sample
rate.
:type allow_downsampling: bool
:param max_gain_db: Maximum amount of gain to apply to noise signal
before adding it in. This is to prevent attempting
to apply infinite gain to a zero signal.
:type max_gain_db: float
:param rng: Random number generator state.
:type rng: None|random.Random
:raises ValueError: If the sample rate does not match between the two
audio segments when downsampling is not allowed, or
if the duration of noise segments is shorter than
original audio segments.
"""
rng = random.Random() if rng is None else rng
if allow_downsampling and noise.sample_rate > self.sample_rate:
noise = noise.resample(self.sample_rate)
if noise.sample_rate != self.sample_rate:
raise ValueError("Noise sample rate (%d Hz) is not equal to base "
"signal sample rate (%d Hz)." % (noise.sample_rate,
self.sample_rate))
if noise.duration < self.duration:
raise ValueError("Noise signal (%f sec) must be at least as long as"
" base signal (%f sec)." %
(noise.duration, self.duration))
noise_gain_db = min(self.rms_db - noise.rms_db - snr_dB, max_gain_db)
noise_new = copy.deepcopy(noise)
noise_new.random_subsegment(self.duration, rng=rng)
noise_new.apply_gain(noise_gain_db)
self.superimpose(noise_new)
@property
def samples(self):
......@@ -186,7 +533,7 @@ class AudioSegment(object):
:return: Number of samples.
:rtype: int
"""
return self._samples.shape(0)
return self._samples.shape[0]
@property
def duration(self):
......
文件模式从 100755 更改为 100644
文件模式从 100755 更改为 100644
文件模式从 100755 更改为 100644
文件模式从 100755 更改为 100644
......@@ -80,7 +80,7 @@ class DataGenerator(object):
padding_to=-1,
flatten=False,
sortagrad=False,
batch_shuffle=False):
shuffle_method="batch_shuffle"):
"""
Batch data reader creator for audio data. Return a callable generator
function to produce batches of data.
......@@ -104,12 +104,22 @@ class DataGenerator(object):
:param sortagrad: If set True, sort the instances by audio duration
in the first epoch for speed up training.
:type sortagrad: bool
:param batch_shuffle: If set True, instances are batch-wise shuffled.
:param shuffle_method: Shuffle method. Options:
'' or None: no shuffle.
'instance_shuffle': instance-wise shuffle.
'batch_shuffle': similarly-sized instances are
put into batches, and then
batch-wise shuffle the batches.
For more details, please see
``_batch_shuffle.__doc__``.
If sortagrad is True, batch_shuffle is disabled
'batch_shuffle_clipped': 'batch_shuffle' with
head shift and tail
clipping. For more
details, please see
``_batch_shuffle``.
If sortagrad is True, shuffle is disabled
for the first epoch.
:type batch_shuffle: bool
:type shuffle_method: None|str
:return: Batch reader function, producing batches of data when called.
:rtype: callable
"""
......@@ -123,8 +133,20 @@ class DataGenerator(object):
# sort (by duration) or batch-wise shuffle the manifest
if self._epoch == 0 and sortagrad:
manifest.sort(key=lambda x: x["duration"])
elif batch_shuffle:
manifest = self._batch_shuffle(manifest, batch_size)
else:
if shuffle_method == "batch_shuffle":
manifest = self._batch_shuffle(
manifest, batch_size, clipped=False)
elif shuffle_method == "batch_shuffle_clipped":
manifest = self._batch_shuffle(
manifest, batch_size, clipped=True)
elif shuffle_method == "instance_shuffle":
self._rng.shuffle(manifest)
elif not shuffle_method:
pass
else:
raise ValueError("Unknown shuffle method %s." %
shuffle_method)
# prepare batches
instance_reader = self._instance_reader_creator(manifest)
batch = []
......@@ -218,7 +240,7 @@ class DataGenerator(object):
new_batch.append((padded_audio, text))
return new_batch
def _batch_shuffle(self, manifest, batch_size):
def _batch_shuffle(self, manifest, batch_size, clipped=False):
"""Put similarly-sized instances into minibatches for better efficiency
and make a batch-wise shuffle.
......@@ -233,6 +255,9 @@ class DataGenerator(object):
:param batch_size: Batch size. This size is also used for generate
a random number for batch shuffle.
:type batch_size: int
:param clipped: Whether to clip the heading (small shift) and trailing
(incomplete batch) instances.
:type clipped: bool
:return: Batch shuffled mainifest.
:rtype: list
"""
......@@ -241,6 +266,7 @@ class DataGenerator(object):
batch_manifest = zip(*[iter(manifest[shift_len:])] * batch_size)
self._rng.shuffle(batch_manifest)
batch_manifest = list(sum(batch_manifest, ()))
if not clipped:
res_len = len(manifest) - shift_len - len(batch_manifest)
batch_manifest.extend(manifest[-res_len:])
batch_manifest.extend(manifest[0:shift_len])
......
文件模式从 100755 更改为 100644
文件模式从 100755 更改为 100644
文件模式从 100755 更改为 100644
文件模式从 100755 更改为 100644
文件模式从 100755 更改为 100644
......@@ -65,6 +65,74 @@ class SpeechSegment(AudioSegment):
audio = AudioSegment.from_bytes(bytes)
return cls(audio.samples, audio.sample_rate, transcript)
@classmethod
def concatenate(cls, *segments):
"""Concatenate an arbitrary number of speech segments together, both
audio and transcript will be concatenated.
:param *segments: Input speech segments to be concatenated.
:type *segments: tuple of SpeechSegment
:return: Speech segment instance.
:rtype: SpeechSegment
:raises ValueError: If the number of segments is zero, or if the
sample_rate of any two segments does not match.
:raises TypeError: If any segment is not SpeechSegment instance.
"""
if len(segments) == 0:
raise ValueError("No speech segments are given to concatenate.")
sample_rate = segments[0]._sample_rate
transcripts = ""
for seg in segments:
if sample_rate != seg._sample_rate:
raise ValueError("Can't concatenate segments with "
"different sample rates")
if type(seg) is not cls:
raise TypeError("Only speech segments of the same type "
"instance can be concatenated.")
transcripts += seg._transcript
samples = np.concatenate([seg.samples for seg in segments])
return cls(samples, sample_rate, transcripts)
@classmethod
def slice_from_file(cls, filepath, start=None, end=None, transcript):
"""Loads a small section of an speech without having to load
the entire file into the memory which can be incredibly wasteful.
:param filepath: Filepath or file object to audio file.
:type filepath: basestring|file
:param start: Start time in seconds. If start is negative, it wraps
around from the end. If not provided, this function
reads from the very beginning.
:type start: float
:param end: End time in seconds. If end is negative, it wraps around
from the end. If not provided, the default behvaior is
to read to the end of the file.
:type end: float
:param transcript: Transcript text for the speech. if not provided,
the defaults is an empty string.
:type transript: basestring
:return: SpeechSegment instance of the specified slice of the input
speech file.
:rtype: SpeechSegment
"""
audio = Audiosegment.slice_from_file(filepath, start, end)
return cls(audio.samples, audio.sample_rate, transcript)
@classmethod
def make_silence(cls, duration, sample_rate):
"""Creates a silent speech segment of the given duration and
sample rate, transcript will be an empty string.
:param duration: Length of silence in seconds.
:type duration: float
:param sample_rate: Sample rate.
:type sample_rate: float
:return: Silence of the given duration.
:rtype: SpeechSegment
"""
audio = AudioSegment.make_silence(duration, sample_rate)
return cls(audio.samples, audio.sample_rate, "")
@property
def transcript(self):
"""Return the transcript text.
......
文件模式从 100755 更改为 100644
......@@ -37,8 +37,7 @@ MD5_TRAIN_CLEAN_100 = "2a93770f6d5c6c964bc36631d331a522"
MD5_TRAIN_CLEAN_360 = "c0e676e450a7ff2f54aeade5171606fa"
MD5_TRAIN_OTHER_500 = "d1a0fd59409feb2c614ce4d30c387708"
parser = argparse.ArgumentParser(
description='Downloads and prepare LibriSpeech dataset.')
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--target_dir",
default=DATA_HOME + "/Libri",
......
文件模式从 100755 更改为 100644
......@@ -8,8 +8,7 @@ from itertools import groupby
def ctc_best_path_decode(probs_seq, vocabulary):
"""
Best path decoding, also called argmax decoding or greedy decoding.
"""Best path decoding, also called argmax decoding or greedy decoding.
Path consisting of the most probable tokens are further post-processed to
remove consecutive repetitions and all blanks.
......@@ -38,8 +37,7 @@ def ctc_best_path_decode(probs_seq, vocabulary):
def ctc_decode(probs_seq, vocabulary, method):
"""
CTC-like sequence decoding from a sequence of likelihood probablilites.
"""CTC-like sequence decoding from a sequence of likelihood probablilites.
:param probs_seq: 2-D list of probabilities over the vocabulary for each
character. Each element is a list of float probabilities
......
# -*- coding: utf-8 -*-
"""This module provides functions to calculate error rate in different level.
e.g. wer for word-level, cer for char-level.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
def _levenshtein_distance(ref, hyp):
"""Levenshtein distance is a string metric for measuring the difference between
two sequences. Informally, the levenshtein disctance is defined as the minimum
number of single-character edits (substitutions, insertions or deletions)
required to change one word into the other. We can naturally extend the edits to
word level when calculate levenshtein disctance for two sentences.
"""
ref_len = len(ref)
hyp_len = len(hyp)
# special case
if ref == hyp:
return 0
if ref_len == 0:
return hyp_len
if hyp_len == 0:
return ref_len
distance = np.zeros((ref_len + 1, hyp_len + 1), dtype=np.int32)
# initialize distance matrix
for j in xrange(hyp_len + 1):
distance[0][j] = j
for i in xrange(ref_len + 1):
distance[i][0] = i
# calculate levenshtein distance
for i in xrange(1, ref_len + 1):
for j in xrange(1, hyp_len + 1):
if ref[i - 1] == hyp[j - 1]:
distance[i][j] = distance[i - 1][j - 1]
else:
s_num = distance[i - 1][j - 1] + 1
i_num = distance[i][j - 1] + 1
d_num = distance[i - 1][j] + 1
distance[i][j] = min(s_num, i_num, d_num)
return distance[ref_len][hyp_len]
def wer(reference, hypothesis, ignore_case=False, delimiter=' '):
"""Calculate word error rate (WER). WER compares reference text and
hypothesis text in word-level. WER is defined as:
.. math::
WER = (Sw + Dw + Iw) / Nw
where
.. code-block:: text
Sw is the number of words subsituted,
Dw is the number of words deleted,
Iw is the number of words inserted,
Nw is the number of words in the reference
We can use levenshtein distance to calculate WER. Please draw an attention that
empty items will be removed when splitting sentences by delimiter.
:param reference: The reference sentence.
:type reference: basestring
:param hypothesis: The hypothesis sentence.
:type hypothesis: basestring
:param ignore_case: Whether case-sensitive or not.
:type ignore_case: bool
:param delimiter: Delimiter of input sentences.
:type delimiter: char
:return: Word error rate.
:rtype: float
:raises ValueError: If the reference length is zero.
"""
if ignore_case == True:
reference = reference.lower()
hypothesis = hypothesis.lower()
ref_words = filter(None, reference.split(delimiter))
hyp_words = filter(None, hypothesis.split(delimiter))
if len(ref_words) == 0:
raise ValueError("Reference's word number should be greater than 0.")
edit_distance = _levenshtein_distance(ref_words, hyp_words)
wer = float(edit_distance) / len(ref_words)
return wer
def cer(reference, hypothesis, ignore_case=False):
"""Calculate charactor error rate (CER). CER compares reference text and
hypothesis text in char-level. CER is defined as:
.. math::
CER = (Sc + Dc + Ic) / Nc
where
.. code-block:: text
Sc is the number of characters substituted,
Dc is the number of characters deleted,
Ic is the number of characters inserted
Nc is the number of characters in the reference
We can use levenshtein distance to calculate CER. Chinese input should be
encoded to unicode. Please draw an attention that the leading and tailing
white space characters will be truncated and multiple consecutive white
space characters in a sentence will be replaced by one white space character.
:param reference: The reference sentence.
:type reference: basestring
:param hypothesis: The hypothesis sentence.
:type hypothesis: basestring
:param ignore_case: Whether case-sensitive or not.
:type ignore_case: bool
:return: Character error rate.
:rtype: float
:raises ValueError: If the reference length is zero.
"""
if ignore_case == True:
reference = reference.lower()
hypothesis = hypothesis.lower()
reference = ' '.join(filter(None, reference.split(' ')))
hypothesis = ' '.join(filter(None, hypothesis.split(' ')))
if len(reference) == 0:
raise ValueError("Length of reference should be greater than 0.")
edit_distance = _levenshtein_distance(reference, hypothesis)
cer = float(edit_distance) / len(reference)
return cer
......@@ -10,9 +10,9 @@ import paddle.v2 as paddle
from data_utils.data import DataGenerator
from model import deep_speech2
from decoder import ctc_decode
import utils
parser = argparse.ArgumentParser(
description='Simplified version of DeepSpeech2 inference.')
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--num_samples",
default=10,
......@@ -62,9 +62,7 @@ args = parser.parse_args()
def infer():
"""
Max-ctc-decoding for DeepSpeech2.
"""
"""Max-ctc-decoding for DeepSpeech2."""
# initialize data generator
data_generator = DataGenerator(
vocab_filepath=args.vocab_filepath,
......@@ -98,7 +96,7 @@ def infer():
manifest_path=args.decode_manifest_path,
batch_size=args.num_samples,
sortagrad=False,
batch_shuffle=False)
shuffle_method=None)
infer_data = batch_reader().next()
# run inference
......@@ -123,6 +121,7 @@ def infer():
def main():
utils.print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=1)
infer()
......
SoundFile==0.9.0.post1
wget==3.2
scikits.samplerate==0.3.3
scipy==0.13.0b1
# -*- coding: utf-8 -*-
"""Test error rate."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import unittest
import error_rate
class TestParse(unittest.TestCase):
def test_wer_1(self):
ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night'
hyp = 'i GOT IT TO the FULLEST i LOVE TO portable FROM OF STORES last night'
word_error_rate = error_rate.wer(ref, hyp)
self.assertTrue(abs(word_error_rate - 0.769230769231) < 1e-6)
def test_wer_2(self):
ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night'
word_error_rate = error_rate.wer(ref, ref)
self.assertEqual(word_error_rate, 0.0)
def test_wer_3(self):
ref = ' '
hyp = 'Hypothesis sentence'
with self.assertRaises(ValueError):
word_error_rate = error_rate.wer(ref, hyp)
def test_cer_1(self):
ref = 'werewolf'
hyp = 'weae wolf'
char_error_rate = error_rate.cer(ref, hyp)
self.assertTrue(abs(char_error_rate - 0.25) < 1e-6)
def test_cer_2(self):
ref = 'werewolf'
char_error_rate = error_rate.cer(ref, ref)
self.assertEqual(char_error_rate, 0.0)
def test_cer_3(self):
ref = u'我是中国人'
hyp = u'我是 美洲人'
char_error_rate = error_rate.cer(ref, hyp)
self.assertTrue(abs(char_error_rate - 0.6) < 1e-6)
def test_cer_4(self):
ref = u'我是中国人'
char_error_rate = error_rate.cer(ref, ref)
self.assertFalse(char_error_rate, 0.0)
def test_cer_5(self):
ref = ''
hyp = 'Hypothesis'
with self.assertRaises(ValueError):
char_error_rate = error_rate.cer(ref, hyp)
if __name__ == '__main__':
unittest.main()
......@@ -12,6 +12,7 @@ import distutils.util
import paddle.v2 as paddle
from model import deep_speech2
from data_utils.data import DataGenerator
import utils
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
......@@ -51,6 +52,12 @@ parser.add_argument(
default=True,
type=distutils.util.strtobool,
help="Use sortagrad or not. (default: %(default)s)")
parser.add_argument(
"--shuffle_method",
default='instance_shuffle',
type=str,
help="Shuffle method: 'instance_shuffle', 'batch_shuffle', "
"'batch_shuffle_batch'. (default: %(default)s)")
parser.add_argument(
"--trainer_count",
default=4,
......@@ -93,9 +100,7 @@ args = parser.parse_args()
def train():
"""
DeepSpeech2 training.
"""
"""DeepSpeech2 training."""
# initialize data generator
def data_generator():
......@@ -143,13 +148,15 @@ def train():
train_batch_reader = train_generator.batch_reader_creator(
manifest_path=args.train_manifest_path,
batch_size=args.batch_size,
min_batch_size=args.trainer_count,
sortagrad=args.use_sortagrad if args.init_model_path is None else False,
batch_shuffle=True)
shuffle_method=args.shuffle_method)
test_batch_reader = test_generator.batch_reader_creator(
manifest_path=args.dev_manifest_path,
batch_size=args.batch_size,
min_batch_size=1, # must be 1, but will have errors.
sortagrad=False,
batch_shuffle=False)
shuffle_method=None)
# create event handler
def event_handler(event):
......@@ -157,11 +164,11 @@ def train():
if isinstance(event, paddle.event.EndIteration):
cost_sum += event.cost
cost_counter += 1
if event.batch_id % 50 == 0:
print("\nPass: %d, Batch: %d, TrainCost: %f" %
(event.pass_id, event.batch_id, cost_sum / cost_counter))
if (event.batch_id + 1) % 100 == 0:
print("\nPass: %d, Batch: %d, TrainCost: %f" % (
event.pass_id, event.batch_id + 1, cost_sum / cost_counter))
cost_sum, cost_counter = 0.0, 0
with gzip.open("params_tmp.tar.gz", 'w') as f:
with gzip.open("params.tar.gz", 'w') as f:
parameters.to_tar(f)
else:
sys.stdout.write('.')
......@@ -184,6 +191,7 @@ def train():
def main():
utils.print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)
train()
......
"""Contains common utility functions."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
def print_arguments(args):
"""Print argparse's arguments.
Usage:
.. code-block:: python
parser = argparse.ArgumentParser()
parser.add_argument("name", default="Jonh", type=str, help="User name.")
args = parser.parse_args()
print_arguments(args)
:param args: Input argparse.Namespace for printing.
:type args: argparse.Namespace
"""
print("----- Configuration Arguments -----")
for arg, value in vars(args).iteritems():
print("%s: %s" % (arg, value))
print("------------------------------------")
TBD
图像分类
=======================
这里将介绍如何在PaddlePaddle下使用AlexNet、VGG、GoogLeNet和ResNet模型进行图像分类。图像分类问题的描述和这四种模型的介绍可以参考[PaddlePaddle book](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification)
## 训练模型
### 初始化
在初始化阶段需要导入所用的包,并对PaddlePaddle进行初始化。
```python
import gzip
import paddle.v2.dataset.flowers as flowers
import paddle.v2 as paddle
import reader
import vgg
import resnet
import alexnet
import googlenet
# PaddlePaddle init
paddle.init(use_gpu=False, trainer_count=1)
```
### 定义参数和输入
设置算法参数(如数据维度、类别数目和batch size等参数),定义数据输入层`image`和类别标签`lbl`
```python
DATA_DIM = 3 * 224 * 224
CLASS_DIM = 102
BATCH_SIZE = 128
image = paddle.layer.data(
name="image", type=paddle.data_type.dense_vector(DATA_DIM))
lbl = paddle.layer.data(
name="label", type=paddle.data_type.integer_value(CLASS_DIM))
```
### 获得所用模型
这里可以选择使用AlexNet、VGG、GoogLeNet和ResNet模型中的一个模型进行图像分类。通过调用相应的方法可以获得网络最后的Softmax层。
1. 使用AlexNet模型
指定输入层`image`和类别数目`CLASS_DIM`后,可以通过下面的代码得到AlexNet的Softmax层。
```python
out = alexnet.alexnet(image, class_dim=CLASS_DIM)
```
2. 使用VGG模型
根据层数的不同,VGG分为VGG13、VGG16和VGG19。使用VGG16模型的代码如下:
```python
out = vgg.vgg16(image, class_dim=CLASS_DIM)
```
类似地,VGG13和VGG19可以分别通过`vgg.vgg13``vgg.vgg19`方法获得。
3. 使用GoogLeNet模型
GoogLeNet在训练阶段使用两个辅助的分类器强化梯度信息并进行额外的正则化。因此`googlenet.googlenet`共返回三个Softmax层,如下面的代码所示:
```python
out, out1, out2 = googlenet.googlenet(image, class_dim=CLASS_DIM)
loss1 = paddle.layer.cross_entropy_cost(
input=out1, label=lbl, coeff=0.3)
paddle.evaluator.classification_error(input=out1, label=lbl)
loss2 = paddle.layer.cross_entropy_cost(
input=out2, label=lbl, coeff=0.3)
paddle.evaluator.classification_error(input=out2, label=lbl)
extra_layers = [loss1, loss2]
```
对于两个辅助的输出,这里分别对其计算损失函数并评价错误率,然后将损失作为后文SGD的extra_layers。
4. 使用ResNet模型
ResNet模型可以通过下面的代码获取:
```python
out = resnet.resnet_imagenet(image, class_dim=CLASS_DIM)
```
### 定义损失函数
```python
cost = paddle.layer.classification_cost(input=out, label=lbl)
```
### 创建参数和优化方法
```python
# Create parameters
parameters = paddle.parameters.create(cost)
# Create optimizer
optimizer = paddle.optimizer.Momentum(
momentum=0.9,
regularization=paddle.optimizer.L2Regularization(rate=0.0005 *
BATCH_SIZE),
learning_rate=0.001 / BATCH_SIZE,
learning_rate_decay_a=0.1,
learning_rate_decay_b=128000 * 35,
learning_rate_schedule="discexp", )
```
通过 `learning_rate_decay_a` (简写$a$) 、`learning_rate_decay_b` (简写$b$) 和 `learning_rate_schedule` 指定学习率调整策略,这里采用离散指数的方式调节学习率,计算公式如下, $n$ 代表已经处理过的累计总样本数,$lr_{0}$ 即为参数里设置的 `learning_rate`
$$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
### 定义数据读取
首先以[花卉数据](http://www.robots.ox.ac.uk/~vgg/data/flowers/102/index.html)为例说明如何定义输入。下面的代码定义了花卉数据训练集和验证集的输入:
```python
train_reader = paddle.batch(
paddle.reader.shuffle(
flowers.train(),
buf_size=1000),
batch_size=BATCH_SIZE)
test_reader = paddle.batch(
flowers.valid(),
batch_size=BATCH_SIZE)
```
若需要使用其他数据,则需要先建立图像列表文件。`reader.py`定义了这种文件的读取方式,它从图像列表文件中解析出图像路径和类别标签。
图像列表文件是一个文本文件,其中每一行由一个图像路径和类别标签构成,二者以跳格符(Tab)隔开。类别标签用整数表示,其最小值为0。下面给出一个图像列表文件的片段示例:
```
dataset_100/train_images/n03982430_23191.jpeg 1
dataset_100/train_images/n04461696_23653.jpeg 7
dataset_100/train_images/n02441942_3170.jpeg 8
dataset_100/train_images/n03733281_31716.jpeg 2
dataset_100/train_images/n03424325_240.jpeg 0
dataset_100/train_images/n02643566_75.jpeg 8
```
训练时需要分别指定训练集和验证集的图像列表文件。这里假设这两个文件分别为`train.list``val.list`,数据读取方式如下:
```python
train_reader = paddle.batch(
paddle.reader.shuffle(
reader.train_reader('train.list'),
buf_size=1000),
batch_size=BATCH_SIZE)
test_reader = paddle.batch(
reader.test_reader('val.list'),
batch_size=BATCH_SIZE)
```
### 定义事件处理程序
```python
# End batch and end pass event handler
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 1 == 0:
print "\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost, event.metrics)
if isinstance(event, paddle.event.EndPass):
with gzip.open('params_pass_%d.tar.gz' % event.pass_id, 'w') as f:
parameters.to_tar(f)
result = trainer.test(reader=test_reader)
print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
```
### 定义训练方法
对于AlexNet、VGG和ResNet,可以按下面的代码定义训练方法:
```python
# Create trainer
trainer = paddle.trainer.SGD(
cost=cost,
parameters=parameters,
update_equation=optimizer)
```
GoogLeNet有两个额外的输出层,因此需要指定`extra_layers`,如下所示:
```python
# Create trainer
trainer = paddle.trainer.SGD(
cost=cost,
parameters=parameters,
update_equation=optimizer,
extra_layers=extra_layers)
```
### 开始训练
```python
trainer.train(
reader=train_reader, num_passes=200, event_handler=event_handler)
```
## 应用模型
模型训练好后,可以使用下面的代码预测给定图片的类别。
```python
# load parameters
with gzip.open('params_pass_10.tar.gz', 'r') as f:
parameters = paddle.parameters.Parameters.from_tar(f)
file_list = [line.strip() for line in open(image_list_file)]
test_data = [(paddle.image.load_and_transform(image_file, 256, 224, False)
.flatten().astype('float32'), )
for image_file in file_list]
probs = paddle.infer(
output_layer=out, parameters=parameters, input=test_data)
lab = np.argsort(-probs)
for file_name, result in zip(file_list, lab):
print "Label of %s is: %d" % (file_name, result[0])
```
首先从文件中加载训练好的模型(代码里以第10轮迭代的结果为例),然后读取`image_list_file`中的图像。`image_list_file`是一个文本文件,每一行为一个图像路径。代码使用`paddle.infer`判断`image_list_file`中每个图像的类别,并进行输出。
import paddle.v2 as paddle
__all__ = ['alexnet']
def alexnet(input, class_dim):
conv1 = paddle.layer.img_conv(
input=input,
filter_size=11,
num_channels=3,
num_filters=96,
stride=4,
padding=1)
cmrnorm1 = paddle.layer.img_cmrnorm(
input=conv1, size=5, scale=0.0001, power=0.75)
pool1 = paddle.layer.img_pool(input=cmrnorm1, pool_size=3, stride=2)
conv2 = paddle.layer.img_conv(
input=pool1,
filter_size=5,
num_filters=256,
stride=1,
padding=2,
groups=1)
cmrnorm2 = paddle.layer.img_cmrnorm(
input=conv2, size=5, scale=0.0001, power=0.75)
pool2 = paddle.layer.img_pool(input=cmrnorm2, pool_size=3, stride=2)
pool3 = paddle.networks.img_conv_group(
input=pool2,
pool_size=3,
pool_stride=2,
conv_num_filter=[384, 384, 256],
conv_filter_size=3,
pool_type=paddle.pooling.Max())
fc1 = paddle.layer.fc(
input=pool3,
size=4096,
act=paddle.activation.Relu(),
layer_attr=paddle.attr.Extra(drop_rate=0.5))
fc2 = paddle.layer.fc(
input=fc1,
size=4096,
act=paddle.activation.Relu(),
layer_attr=paddle.attr.Extra(drop_rate=0.5))
out = paddle.layer.fc(
input=fc2, size=class_dim, act=paddle.activation.Softmax())
return out
import paddle.v2 as paddle
__all__ = ['googlenet']
def inception(name, input, channels, filter1, filter3R, filter3, filter5R,
filter5, proj):
cov1 = paddle.layer.img_conv(
name=name + '_1',
input=input,
filter_size=1,
num_channels=channels,
num_filters=filter1,
stride=1,
padding=0)
cov3r = paddle.layer.img_conv(
name=name + '_3r',
input=input,
filter_size=1,
num_channels=channels,
num_filters=filter3R,
stride=1,
padding=0)
cov3 = paddle.layer.img_conv(
name=name + '_3',
input=cov3r,
filter_size=3,
num_filters=filter3,
stride=1,
padding=1)
cov5r = paddle.layer.img_conv(
name=name + '_5r',
input=input,
filter_size=1,
num_channels=channels,
num_filters=filter5R,
stride=1,
padding=0)
cov5 = paddle.layer.img_conv(
name=name + '_5',
input=cov5r,
filter_size=5,
num_filters=filter5,
stride=1,
padding=2)
pool1 = paddle.layer.img_pool(
name=name + '_max',
input=input,
pool_size=3,
num_channels=channels,
stride=1,
padding=1)
covprj = paddle.layer.img_conv(
name=name + '_proj',
input=pool1,
filter_size=1,
num_filters=proj,
stride=1,
padding=0)
cat = paddle.layer.concat(name=name, input=[cov1, cov3, cov5, covprj])
return cat
def googlenet(input, class_dim):
# stage 1
conv1 = paddle.layer.img_conv(
name="conv1",
input=input,
filter_size=7,
num_channels=3,
num_filters=64,
stride=2,
padding=3)
pool1 = paddle.layer.img_pool(
name="pool1", input=conv1, pool_size=3, num_channels=64, stride=2)
# stage 2
conv2_1 = paddle.layer.img_conv(
name="conv2_1",
input=pool1,
filter_size=1,
num_filters=64,
stride=1,
padding=0)
conv2_2 = paddle.layer.img_conv(
name="conv2_2",
input=conv2_1,
filter_size=3,
num_filters=192,
stride=1,
padding=1)
pool2 = paddle.layer.img_pool(
name="pool2", input=conv2_2, pool_size=3, num_channels=192, stride=2)
# stage 3
ince3a = inception("ince3a", pool2, 192, 64, 96, 128, 16, 32, 32)
ince3b = inception("ince3b", ince3a, 256, 128, 128, 192, 32, 96, 64)
pool3 = paddle.layer.img_pool(
name="pool3", input=ince3b, num_channels=480, pool_size=3, stride=2)
# stage 4
ince4a = inception("ince4a", pool3, 480, 192, 96, 208, 16, 48, 64)
ince4b = inception("ince4b", ince4a, 512, 160, 112, 224, 24, 64, 64)
ince4c = inception("ince4c", ince4b, 512, 128, 128, 256, 24, 64, 64)
ince4d = inception("ince4d", ince4c, 512, 112, 144, 288, 32, 64, 64)
ince4e = inception("ince4e", ince4d, 528, 256, 160, 320, 32, 128, 128)
pool4 = paddle.layer.img_pool(
name="pool4", input=ince4e, num_channels=832, pool_size=3, stride=2)
# stage 5
ince5a = inception("ince5a", pool4, 832, 256, 160, 320, 32, 128, 128)
ince5b = inception("ince5b", ince5a, 832, 384, 192, 384, 48, 128, 128)
pool5 = paddle.layer.img_pool(
name="pool5",
input=ince5b,
num_channels=1024,
pool_size=7,
stride=7,
pool_type=paddle.pooling.Avg())
dropout = paddle.layer.addto(
input=pool5,
layer_attr=paddle.attr.Extra(drop_rate=0.4),
act=paddle.activation.Linear())
out = paddle.layer.fc(
input=dropout, size=class_dim, act=paddle.activation.Softmax())
# fc for output 1
pool_o1 = paddle.layer.img_pool(
name="pool_o1",
input=ince4a,
num_channels=512,
pool_size=5,
stride=3,
pool_type=paddle.pooling.Avg())
conv_o1 = paddle.layer.img_conv(
name="conv_o1",
input=pool_o1,
filter_size=1,
num_filters=128,
stride=1,
padding=0)
fc_o1 = paddle.layer.fc(
name="fc_o1",
input=conv_o1,
size=1024,
layer_attr=paddle.attr.Extra(drop_rate=0.7),
act=paddle.activation.Relu())
out1 = paddle.layer.fc(
input=fc_o1, size=class_dim, act=paddle.activation.Softmax())
# fc for output 2
pool_o2 = paddle.layer.img_pool(
name="pool_o2",
input=ince4d,
num_channels=528,
pool_size=5,
stride=3,
pool_type=paddle.pooling.Avg())
conv_o2 = paddle.layer.img_conv(
name="conv_o2",
input=pool_o2,
filter_size=1,
num_filters=128,
stride=1,
padding=0)
fc_o2 = paddle.layer.fc(
name="fc_o2",
input=conv_o2,
size=1024,
layer_attr=paddle.attr.Extra(drop_rate=0.7),
act=paddle.activation.Relu())
out2 = paddle.layer.fc(
input=fc_o2, size=class_dim, act=paddle.activation.Softmax())
return out, out1, out2
import gzip
import paddle.v2 as paddle
import reader
import vgg
import resnet
import alexnet
import googlenet
import argparse
import os
from PIL import Image
import numpy as np
WIDTH = 224
HEIGHT = 224
DATA_DIM = 3 * WIDTH * HEIGHT
CLASS_DIM = 102
def main():
# parse the argument
parser = argparse.ArgumentParser()
parser.add_argument(
'data_list',
help='The path of data list file, which consists of one image path per line'
)
parser.add_argument(
'model',
help='The model for image classification',
choices=['alexnet', 'vgg13', 'vgg16', 'vgg19', 'resnet', 'googlenet'])
parser.add_argument(
'params_path', help='The file which stores the parameters')
args = parser.parse_args()
# PaddlePaddle init
paddle.init(use_gpu=True, trainer_count=1)
image = paddle.layer.data(
name="image", type=paddle.data_type.dense_vector(DATA_DIM))
if args.model == 'alexnet':
out = alexnet.alexnet(image, class_dim=CLASS_DIM)
elif args.model == 'vgg13':
out = vgg.vgg13(image, class_dim=CLASS_DIM)
elif args.model == 'vgg16':
out = vgg.vgg16(image, class_dim=CLASS_DIM)
elif args.model == 'vgg19':
out = vgg.vgg19(image, class_dim=CLASS_DIM)
elif args.model == 'resnet':
out = resnet.resnet_imagenet(image, class_dim=CLASS_DIM)
elif args.model == 'googlenet':
out, _, _ = googlenet.googlenet(image, class_dim=CLASS_DIM)
# load parameters
with gzip.open(args.params_path, 'r') as f:
parameters = paddle.parameters.Parameters.from_tar(f)
file_list = [line.strip() for line in open(args.data_list)]
test_data = [(paddle.image.load_and_transform(image_file, 256, 224, False)
.flatten().astype('float32'), ) for image_file in file_list]
probs = paddle.infer(
output_layer=out, parameters=parameters, input=test_data)
lab = np.argsort(-probs)
for file_name, result in zip(file_list, lab):
print "Label of %s is: %d" % (file_name, result[0])
if __name__ == '__main__':
main()
import random
from paddle.v2.image import load_and_transform
import paddle.v2 as paddle
from multiprocessing import cpu_count
def train_reader(train_list):
def train_mapper(sample):
'''
map image path to type needed by model input layer for the training set
'''
img, label = sample
img = paddle.image.load_image(img)
img = paddle.image.simple_transform(img, 256, 224, True)
return img.flatten().astype('float32'), label
def test_mapper(sample):
'''
map image path to type needed by model input layer for the test set
'''
img, label = sample
img = paddle.image.load_image(img)
img = paddle.image.simple_transform(img, 256, 224, True)
return img.flatten().astype('float32'), label
def train_reader(train_list, buffered_size=1024):
def reader():
with open(train_list, 'r') as f:
lines = [line.strip() for line in f]
random.shuffle(lines)
for line in lines:
img_path, lab = line.strip().split('\t')
im = load_and_transform(img_path, 256, 224, True)
yield im.flatten().astype('float32'), int(lab)
yield img_path, int(lab)
return reader
return paddle.reader.xmap_readers(train_mapper, reader,
cpu_count(), buffered_size)
def test_reader(test_list):
def test_reader(test_list, buffered_size=1024):
def reader():
with open(test_list, 'r') as f:
lines = [line.strip() for line in f]
for line in lines:
img_path, lab = line.strip().split('\t')
im = load_and_transform(img_path, 256, 224, False)
yield im.flatten().astype('float32'), int(lab)
yield img_path, int(lab)
return reader
return paddle.reader.xmap_readers(test_mapper, reader,
cpu_count(), buffered_size)
if __name__ == '__main__':
......
import paddle.v2 as paddle
__all__ = ['resnet_imagenet', 'resnet_cifar10']
def conv_bn_layer(input,
ch_out,
filter_size,
stride,
padding,
active_type=paddle.activation.Relu(),
ch_in=None):
tmp = paddle.layer.img_conv(
input=input,
filter_size=filter_size,
num_channels=ch_in,
num_filters=ch_out,
stride=stride,
padding=padding,
act=paddle.activation.Linear(),
bias_attr=False)
return paddle.layer.batch_norm(input=tmp, act=active_type)
def shortcut(input, ch_in, ch_out, stride):
if ch_in != ch_out:
return conv_bn_layer(input, ch_out, 1, stride, 0,
paddle.activation.Linear())
else:
return input
def basicblock(input, ch_in, ch_out, stride):
short = shortcut(input, ch_in, ch_out, stride)
conv1 = conv_bn_layer(input, ch_out, 3, stride, 1)
conv2 = conv_bn_layer(conv1, ch_out, 3, 1, 1, paddle.activation.Linear())
return paddle.layer.addto(
input=[short, conv2], act=paddle.activation.Relu())
def bottleneck(input, ch_in, ch_out, stride):
short = shortcut(input, ch_in, ch_out * 4, stride)
conv1 = conv_bn_layer(input, ch_out, 1, stride, 0)
conv2 = conv_bn_layer(conv1, ch_out, 3, 1, 1)
conv3 = conv_bn_layer(conv2, ch_out * 4, 1, 1, 0,
paddle.activation.Linear())
return paddle.layer.addto(
input=[short, conv3], act=paddle.activation.Relu())
def layer_warp(block_func, input, ch_in, ch_out, count, stride):
conv = block_func(input, ch_in, ch_out, stride)
for i in range(1, count):
conv = block_func(conv, ch_out, ch_out, 1)
return conv
def resnet_imagenet(input, class_dim, depth=50):
cfg = {
18: ([2, 2, 2, 1], basicblock),
34: ([3, 4, 6, 3], basicblock),
50: ([3, 4, 6, 3], bottleneck),
101: ([3, 4, 23, 3], bottleneck),
152: ([3, 8, 36, 3], bottleneck)
}
stages, block_func = cfg[depth]
conv1 = conv_bn_layer(
input, ch_in=3, ch_out=64, filter_size=7, stride=2, padding=3)
pool1 = paddle.layer.img_pool(input=conv1, pool_size=3, stride=2)
res1 = layer_warp(block_func, pool1, 64, 64, stages[0], 1)
res2 = layer_warp(block_func, res1, 64, 128, stages[1], 2)
res3 = layer_warp(block_func, res2, 128, 256, stages[2], 2)
res4 = layer_warp(block_func, res3, 256, 512, stages[3], 2)
pool2 = paddle.layer.img_pool(
input=res4, pool_size=7, stride=1, pool_type=paddle.pooling.Avg())
out = paddle.layer.fc(
input=pool2, size=class_dim, act=paddle.activation.Softmax())
return out
def resnet_cifar10(input, class_dim, depth=32):
# depth should be one of 20, 32, 44, 56, 110, 1202
assert (depth - 2) % 6 == 0
n = (depth - 2) / 6
nStages = {16, 64, 128}
conv1 = conv_bn_layer(
input, ch_in=3, ch_out=16, filter_size=3, stride=1, padding=1)
res1 = layer_warp(basicblock, conv1, 16, 16, n, 1)
res2 = layer_warp(basicblock, res1, 16, 32, n, 2)
res3 = layer_warp(basicblock, res2, 32, 64, n, 2)
pool = paddle.layer.img_pool(
input=res3, pool_size=8, stride=1, pool_type=paddle.pooling.Avg())
out = paddle.layer.fc(
input=pool, size=class_dim, act=paddle.activation.Softmax())
return out
import gzip
import paddle.v2.dataset.flowers as flowers
import paddle.v2 as paddle
import reader
import vgg
import resnet
import alexnet
import googlenet
import argparse
DATA_DIM = 3 * 224 * 224
CLASS_DIM = 1000
CLASS_DIM = 102
BATCH_SIZE = 128
def main():
# parse the argument
parser = argparse.ArgumentParser()
parser.add_argument(
'model',
help='The model for image classification',
choices=['alexnet', 'vgg13', 'vgg16', 'vgg19', 'resnet', 'googlenet'])
args = parser.parse_args()
# PaddlePaddle init
paddle.init(use_gpu=True, trainer_count=4)
paddle.init(use_gpu=True, trainer_count=1)
image = paddle.layer.data(
name="image", type=paddle.data_type.dense_vector(DATA_DIM))
lbl = paddle.layer.data(
name="label", type=paddle.data_type.integer_value(CLASS_DIM))
net = vgg.vgg13(image)
out = paddle.layer.fc(
input=net, size=CLASS_DIM, act=paddle.activation.Softmax())
extra_layers = None
learning_rate = 0.01
if args.model == 'alexnet':
out = alexnet.alexnet(image, class_dim=CLASS_DIM)
elif args.model == 'vgg13':
out = vgg.vgg13(image, class_dim=CLASS_DIM)
elif args.model == 'vgg16':
out = vgg.vgg16(image, class_dim=CLASS_DIM)
elif args.model == 'vgg19':
out = vgg.vgg19(image, class_dim=CLASS_DIM)
elif args.model == 'resnet':
out = resnet.resnet_imagenet(image, class_dim=CLASS_DIM)
learning_rate = 0.1
elif args.model == 'googlenet':
out, out1, out2 = googlenet.googlenet(image, class_dim=CLASS_DIM)
loss1 = paddle.layer.cross_entropy_cost(
input=out1, label=lbl, coeff=0.3)
paddle.evaluator.classification_error(input=out1, label=lbl)
loss2 = paddle.layer.cross_entropy_cost(
input=out2, label=lbl, coeff=0.3)
paddle.evaluator.classification_error(input=out2, label=lbl)
extra_layers = [loss1, loss2]
cost = paddle.layer.classification_cost(input=out, label=lbl)
# Create parameters
......@@ -31,16 +63,23 @@ def main():
momentum=0.9,
regularization=paddle.optimizer.L2Regularization(rate=0.0005 *
BATCH_SIZE),
learning_rate=0.01 / BATCH_SIZE,
learning_rate=learning_rate / BATCH_SIZE,
learning_rate_decay_a=0.1,
learning_rate_decay_b=128000 * 35,
learning_rate_schedule="discexp", )
train_reader = paddle.batch(
paddle.reader.shuffle(reader.train_reader("train.list"), buf_size=1000),
paddle.reader.shuffle(
flowers.train(),
# To use other data, replace the above line with:
# reader.train_reader('train.list'),
buf_size=1000),
batch_size=BATCH_SIZE)
test_reader = paddle.batch(
reader.test_reader("test.list"), batch_size=BATCH_SIZE)
flowers.valid(),
# To use other data, replace the above line with:
# reader.test_reader('val.list'),
batch_size=BATCH_SIZE)
# End batch and end pass event handler
def event_handler(event):
......@@ -57,7 +96,10 @@ def main():
# Create trainer
trainer = paddle.trainer.SGD(
cost=cost, parameters=parameters, update_equation=optimizer)
cost=cost,
parameters=parameters,
update_equation=optimizer,
extra_layers=extra_layers)
trainer.train(
reader=train_reader, num_passes=200, event_handler=event_handler)
......
......@@ -3,7 +3,7 @@ import paddle.v2 as paddle
__all__ = ['vgg13', 'vgg16', 'vgg19']
def vgg(input, nums):
def vgg(input, nums, class_dim):
def conv_block(input, num_filter, groups, num_channels=None):
return paddle.networks.img_conv_group(
input=input,
......@@ -34,19 +34,21 @@ def vgg(input, nums):
size=fc_dim,
act=paddle.activation.Relu(),
layer_attr=paddle.attr.Extra(drop_rate=0.5))
return fc2
out = paddle.layer.fc(
input=fc2, size=class_dim, act=paddle.activation.Softmax())
return out
def vgg13(input):
def vgg13(input, class_dim):
nums = [2, 2, 2, 2, 2]
return vgg(input, nums)
return vgg(input, nums, class_dim)
def vgg16(input):
def vgg16(input, class_dim):
nums = [2, 2, 3, 3, 3]
return vgg(input, nums)
return vgg(input, nums, class_dim)
def vgg19(input):
def vgg19(input, class_dim):
nums = [2, 2, 4, 4, 4]
return vgg(input, nums)
return vgg(input, nums, class_dim)
TBD
# Scheduled Sampling
## 概述
序列生成任务的生成目标是在给定源输入的条件下,最大化目标序列的概率。训练时该模型将目标序列中的真实元素作为解码器每一步的输入,然后最大化下一个元素的概率。生成时上一步解码得到的元素被用作当前的输入,然后生成下一个元素。可见这种情况下训练阶段和生成阶段的解码器输入数据的概率分布并不一致。
Scheduled Sampling\[[1](#参考文献)\]是一种解决训练和生成时输入数据分布不一致的方法。在训练早期该方法主要使用目标序列中的真实元素作为解码器输入,可以将模型从随机初始化的状态快速引导至一个合理的状态。随着训练的进行,该方法会逐渐更多地使用生成的元素作为解码器输入,以解决数据分布不一致的问题。
标准的序列到序列模型中,如果序列前面生成了错误的元素,后面的输入状态将会收到影响,而该误差会随着生成过程不断向后累积。Scheduled Sampling以一定概率将生成的元素作为解码器输入,这样即使前面生成错误,其训练目标仍然是最大化真实目标序列的概率,模型会朝着正确的方向进行训练。因此这种方式增加了模型的容错能力。
## 算法简介
Scheduled Sampling主要应用在序列到序列模型的训练阶段,而生成阶段则不需要使用。
训练阶段解码器在最大化第$t$个元素概率时,标准序列到序列模型使用上一时刻的真实元素$y_{t-1}$作为输入。设上一时刻生成的元素为$g_{t-1}$,Scheduled Sampling算法会以一定概率使用$g_{t-1}$作为解码器输入。
设当前已经训练到了第$i$个mini-batch,Scheduled Sampling定义了一个概率$\epsilon_i$控制解码器的输入。$\epsilon_i$是一个随着$i$增大而衰减的变量,常见的定义方式有:
- 线性衰减:$\epsilon_i=max(\epsilon,k-c*i)$,其中$\epsilon$限制$\epsilon_i$的最小值,$k$和$c$控制线性衰减的幅度。
- 指数衰减:$\epsilon_i=k^i$,其中$0<k<1$,$k$控制着指数衰减的幅度。
- 反向Sigmoid衰减:$\epsilon_i=k/(k+exp(i/k))$,其中$k>1$,$k$同样控制衰减的幅度。
图1给出了这三种方式的衰减曲线,
<p align="center">
<img src="img/decay.jpg" width="50%" align="center"><br>
图1. 线性衰减、指数衰减和反向Sigmoid衰减的衰减曲线
</p>
如图2所示,在解码器的$t$时刻Scheduled Sampling以概率$\epsilon_i$使用上一时刻的真实元素$y_{t-1}$作为解码器输入,以概率$1-\epsilon_i$使用上一时刻生成的元素$g_{t-1}$作为解码器输入。从图1可知随着$i$的增大$\epsilon_i$会不断减小,解码器将不断倾向于使用生成的元素作为输入,训练阶段和生成阶段的数据分布将变得越来越一致。
<p align="center">
<img src="img/Scheduled_Sampling.jpg" width="50%" align="center"><br>
图2. Scheduled Sampling选择不同元素作为解码器输入示意图
</p>
## 模型实现
由于Scheduled Sampling是对序列到序列模型的改进,其整体实现框架与序列到序列模型较为相似。为突出本文重点,这里仅介绍与Scheduled Sampling相关的部分,完整的代码见`scheduled_sampling.py`
首先导入需要的包,并定义控制衰减概率的类`RandomScheduleGenerator`,如下:
```python
import numpy as np
import math
class RandomScheduleGenerator:
"""
The random sampling rate for scheduled sampling algoithm, which uses devcayed
sampling rate.
"""
...
```
下面将分别定义类`RandomScheduleGenerator``__init__``getScheduleRate``processBatch`三个方法。
`__init__`方法对类进行初始化,其`schedule_type`参数指定了使用哪种衰减方式,可选的方式有`constant``linear``exponential``inverse_sigmoid``constant`指对所有的mini-batch使用固定的$\epsilon_i$,`linear`指线性衰减方式,`exponential`表示指数衰减方式,`inverse_sigmoid`表示反向Sigmoid衰减。`__init__`方法的参数`a``b`表示衰减方法的参数,需要在验证集上调优。`self.schedule_computers`将衰减方式映射为计算$\epsilon_i$的函数。最后一行根据`schedule_type`将选择的衰减函数赋给`self.schedule_computer`变量。
```python
def __init__(self, schedule_type, a, b):
"""
schduled_type: is the type of the decay. It supports constant, linear,
exponential, and inverse_sigmoid right now.
a: parameter of the decay (MUST BE DOUBLE)
b: parameter of the decay (MUST BE DOUBLE)
"""
self.schedule_type = schedule_type
self.a = a
self.b = b
self.data_processed_ = 0
self.schedule_computers = {
"constant": lambda a, b, d: a,
"linear": lambda a, b, d: max(a, 1 - d / b),
"exponential": lambda a, b, d: pow(a, d / b),
"inverse_sigmoid": lambda a, b, d: b / (b + math.exp(d * a / b)),
}
assert (self.schedule_type in self.schedule_computers)
self.schedule_computer = self.schedule_computers[self.schedule_type]
```
`getScheduleRate`根据衰减函数和已经处理的数据量计算$\epsilon_i$。
```python
def getScheduleRate(self):
"""
Get the schedule sampling rate. Usually not needed to be called by the users
"""
return self.schedule_computer(self.a, self.b, self.data_processed_)
```
`processBatch`方法根据概率值$\epsilon_i$进行采样,得到`indexes``indexes`中每个元素取值为`0`的概率为$\epsilon_i$,取值为`1`的概率为$1-\epsilon_i$。`indexes`决定了解码器的输入是真实元素还是生成的元素,取值为`0`表示使用真实元素,取值为`1`表示使用生成的元素。
```python
def processBatch(self, batch_size):
"""
Get a batch_size of sampled indexes. These indexes can be passed to a
MultiplexLayer to select from the grouth truth and generated samples
from the last time step.
"""
rate = self.getScheduleRate()
numbers = np.random.rand(batch_size)
indexes = (numbers >= rate).astype('int32').tolist()
self.data_processed_ += batch_size
return indexes
```
Scheduled Sampling需要在序列到序列模型的基础上增加一个输入`true_token_flag`,以控制解码器输入。
```python
true_token_flags = paddle.layer.data(
name='true_token_flag',
type=paddle.data_type.integer_value_sequence(2))
```
这里还需要对原始reader进行封装,增加`true_token_flag`的数据生成器。下面以线性衰减为例说明如何调用上面定义的`RandomScheduleGenerator`产生`true_token_flag`的输入数据。
```python
schedule_generator = RandomScheduleGenerator("linear", 0.75, 1000000)
def gen_schedule_data(reader):
"""
Creates a data reader for scheduled sampling.
Output from the iterator that created by original reader will be
appended with "true_token_flag" to indicate whether to use true token.
:param reader: the original reader.
:type reader: callable
:return: the new reader with the field "true_token_flag".
:rtype: callable
"""
def data_reader():
for src_ids, trg_ids, trg_ids_next in reader():
yield src_ids, trg_ids, trg_ids_next, \
[0] + schedule_generator.processBatch(len(trg_ids) - 1)
return data_reader
```
这段代码在原始输入数据(即源序列元素`src_ids`、目标序列元素`trg_ids`和目标序列下一个元素`trg_ids_next`)后追加了控制解码器输入的数据。由于解码器第一个元素是序列开始符,因此将追加的数据第一个元素设置为`0`,表示解码器第一步始终使用真实目标序列的第一个元素(即序列开始符)。
训练时`recurrent_group`每一步调用的解码器函数如下:
```python
def gru_decoder_with_attention_train(enc_vec, enc_proj, true_word,
true_token_flag):
"""
The decoder step for training.
:param enc_vec: the encoder vector for attention
:type enc_vec: LayerOutput
:param enc_proj: the encoder projection for attention
:type enc_proj: LayerOutput
:param true_word: the ground-truth target word
:type true_word: LayerOutput
:param true_token_flag: the flag of using the ground-truth target word
:type true_token_flag: LayerOutput
:return: the softmax output layer
:rtype: LayerOutput
"""
decoder_mem = paddle.layer.memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
context = paddle.networks.simple_attention(
encoded_sequence=enc_vec,
encoded_proj=enc_proj,
decoder_state=decoder_mem)
gru_out_memory = paddle.layer.memory(
name='gru_out', size=target_dict_dim)
generated_word = paddle.layer.max_id(input=gru_out_memory)
generated_word_emb = paddle.layer.embedding(
input=generated_word,
size=word_vector_dim,
param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
current_word = paddle.layer.multiplex(
input=[true_token_flag, true_word, generated_word_emb])
with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += paddle.layer.full_matrix_projection(input=context)
decoder_inputs += paddle.layer.full_matrix_projection(
input=current_word)
gru_step = paddle.layer.gru_step(
name='gru_decoder',
input=decoder_inputs,
output_mem=decoder_mem,
size=decoder_size)
with paddle.layer.mixed(
name='gru_out',
size=target_dict_dim,
bias_attr=True,
act=paddle.activation.Softmax()) as out:
out += paddle.layer.full_matrix_projection(input=gru_step)
return out
```
该函数使用`memory``gru_out_memory`记忆上一时刻生成的元素,根据`gru_out_memory`选择概率最大的词语`generated_word`作为生成的词语。`multiplex`层会在真实元素`true_word`和生成的元素`generated_word`之间做出选择,并将选择的结果作为解码器输入。`multiplex`层使用了三个输入,分别为`true_token_flag``true_word``generated_word_emb`。对于这三个输入中每个元素,若`true_token_flag`中的值为`0`,则`multiplex`层输出`true_word`中的相应元素;若`true_token_flag`中的值为`1`,则`multiplex`层输出`generated_word_emb`中的相应元素。
## 参考文献
[1] Bengio S, Vinyals O, Jaitly N, et al. [Scheduled sampling for sequence prediction with recurrent neural networks](http://papers.nips.cc/paper/5956-scheduled-sampling-for-sequence-prediction-with-recurrent-neural-networks)//Advances in Neural Information Processing Systems. 2015: 1171-1179.
import numpy as np
import math
class RandomScheduleGenerator:
"""
The random sampling rate for scheduled sampling algoithm, which uses devcayed
sampling rate.
"""
def __init__(self, schedule_type, a, b):
"""
schduled_type: is the type of the decay. It supports constant, linear,
exponential, and inverse_sigmoid right now.
a: parameter of the decay (MUST BE DOUBLE)
b: parameter of the decay (MUST BE DOUBLE)
"""
self.schedule_type = schedule_type
self.a = a
self.b = b
self.data_processed_ = 0
self.schedule_computers = {
"constant": lambda a, b, d: a,
"linear": lambda a, b, d: max(a, 1 - d / b),
"exponential": lambda a, b, d: pow(a, d / b),
"inverse_sigmoid": lambda a, b, d: b / (b + math.exp(d * a / b)),
}
assert (self.schedule_type in self.schedule_computers)
self.schedule_computer = self.schedule_computers[self.schedule_type]
def getScheduleRate(self):
"""
Get the schedule sampling rate. Usually not needed to be called by the users
"""
return self.schedule_computer(self.a, self.b, self.data_processed_)
def processBatch(self, batch_size):
"""
Get a batch_size of sampled indexes. These indexes can be passed to a
MultiplexLayer to select from the grouth truth and generated samples
from the last time step.
"""
rate = self.getScheduleRate()
numbers = np.random.rand(batch_size)
indexes = (numbers >= rate).astype('int32').tolist()
self.data_processed_ += batch_size
return indexes
import sys
import paddle.v2 as paddle
from random_schedule_generator import RandomScheduleGenerator
schedule_generator = RandomScheduleGenerator("linear", 0.75, 1000000)
def gen_schedule_data(reader):
"""
Creates a data reader for scheduled sampling.
Output from the iterator that created by original reader will be
appended with "true_token_flag" to indicate whether to use true token.
:param reader: the original reader.
:type reader: callable
:return: the new reader with the field "true_token_flag".
:rtype: callable
"""
def data_reader():
for src_ids, trg_ids, trg_ids_next in reader():
yield src_ids, trg_ids, trg_ids_next, \
[0] + schedule_generator.processBatch(len(trg_ids) - 1)
return data_reader
def seqToseq_net(source_dict_dim, target_dict_dim, is_generating=False):
"""
The definition of the sequence to sequence model
:param source_dict_dim: the dictionary size of the source language
:type source_dict_dim: int
:param target_dict_dim: the dictionary size of the target language
:type target_dict_dim: int
:param is_generating: whether in generating mode
:type is_generating: Bool
:return: the last layer of the network
:rtype: LayerOutput
"""
### Network Architecture
word_vector_dim = 512 # dimension of word vector
decoder_size = 512 # dimension of hidden unit in GRU Decoder network
encoder_size = 512 # dimension of hidden unit in GRU Encoder network
beam_size = 3
max_length = 250
#### Encoder
src_word_id = paddle.layer.data(
name='source_language_word',
type=paddle.data_type.integer_value_sequence(source_dict_dim))
src_embedding = paddle.layer.embedding(
input=src_word_id, size=word_vector_dim)
src_forward = paddle.networks.simple_gru(
input=src_embedding, size=encoder_size)
src_backward = paddle.networks.simple_gru(
input=src_embedding, size=encoder_size, reverse=True)
encoded_vector = paddle.layer.concat(input=[src_forward, src_backward])
#### Decoder
with paddle.layer.mixed(size=decoder_size) as encoded_proj:
encoded_proj += paddle.layer.full_matrix_projection(
input=encoded_vector)
backward_first = paddle.layer.first_seq(input=src_backward)
with paddle.layer.mixed(
size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot:
decoder_boot += paddle.layer.full_matrix_projection(
input=backward_first)
def gru_decoder_with_attention_train(enc_vec, enc_proj, true_word,
true_token_flag):
"""
The decoder step for training.
:param enc_vec: the encoder vector for attention
:type enc_vec: LayerOutput
:param enc_proj: the encoder projection for attention
:type enc_proj: LayerOutput
:param true_word: the ground-truth target word
:type true_word: LayerOutput
:param true_token_flag: the flag of using the ground-truth target word
:type true_token_flag: LayerOutput
:return: the softmax output layer
:rtype: LayerOutput
"""
decoder_mem = paddle.layer.memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
context = paddle.networks.simple_attention(
encoded_sequence=enc_vec,
encoded_proj=enc_proj,
decoder_state=decoder_mem)
gru_out_memory = paddle.layer.memory(
name='gru_out', size=target_dict_dim)
generated_word = paddle.layer.max_id(input=gru_out_memory)
generated_word_emb = paddle.layer.embedding(
input=generated_word,
size=word_vector_dim,
param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
current_word = paddle.layer.multiplex(
input=[true_token_flag, true_word, generated_word_emb])
with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += paddle.layer.full_matrix_projection(input=context)
decoder_inputs += paddle.layer.full_matrix_projection(
input=current_word)
gru_step = paddle.layer.gru_step(
name='gru_decoder',
input=decoder_inputs,
output_mem=decoder_mem,
size=decoder_size)
with paddle.layer.mixed(
name='gru_out',
size=target_dict_dim,
bias_attr=True,
act=paddle.activation.Softmax()) as out:
out += paddle.layer.full_matrix_projection(input=gru_step)
return out
def gru_decoder_with_attention_test(enc_vec, enc_proj, current_word):
"""
The decoder step for generating.
:param enc_vec: the encoder vector for attention
:type enc_vec: LayerOutput
:param enc_proj: the encoder projection for attention
:type enc_proj: LayerOutput
:param current_word: the previously generated word
:type current_word: LayerOutput
:return: the softmax output layer
:rtype: LayerOutput
"""
decoder_mem = paddle.layer.memory(
name='gru_decoder', size=decoder_size, boot_layer=decoder_boot)
context = paddle.networks.simple_attention(
encoded_sequence=enc_vec,
encoded_proj=enc_proj,
decoder_state=decoder_mem)
with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs:
decoder_inputs += paddle.layer.full_matrix_projection(input=context)
decoder_inputs += paddle.layer.full_matrix_projection(
input=current_word)
gru_step = paddle.layer.gru_step(
name='gru_decoder',
input=decoder_inputs,
output_mem=decoder_mem,
size=decoder_size)
with paddle.layer.mixed(
size=target_dict_dim,
bias_attr=True,
act=paddle.activation.Softmax()) as out:
out += paddle.layer.full_matrix_projection(input=gru_step)
return out
decoder_group_name = "decoder_group"
group_input1 = paddle.layer.StaticInput(input=encoded_vector, is_seq=True)
group_input2 = paddle.layer.StaticInput(input=encoded_proj, is_seq=True)
group_inputs = [group_input1, group_input2]
if not is_generating:
trg_embedding = paddle.layer.embedding(
input=paddle.layer.data(
name='target_language_word',
type=paddle.data_type.integer_value_sequence(target_dict_dim)),
size=word_vector_dim,
param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))
group_inputs.append(trg_embedding)
true_token_flags = paddle.layer.data(
name='true_token_flag',
type=paddle.data_type.integer_value_sequence(2))
group_inputs.append(true_token_flags)
decoder = paddle.layer.recurrent_group(
name=decoder_group_name,
step=gru_decoder_with_attention_train,
input=group_inputs)
lbl = paddle.layer.data(
name='target_language_next_word',
type=paddle.data_type.integer_value_sequence(target_dict_dim))
cost = paddle.layer.classification_cost(input=decoder, label=lbl)
return cost
else:
trg_embedding = paddle.layer.GeneratedInput(
size=target_dict_dim,
embedding_name='_target_language_embedding',
embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)
beam_gen = paddle.layer.beam_search(
name=decoder_group_name,
step=gru_decoder_with_attention_test,
input=group_inputs,
bos_id=0,
eos_id=1,
beam_size=beam_size,
max_length=max_length)
return beam_gen
def main():
paddle.init(use_gpu=False, trainer_count=1)
is_generating = False
model_path_for_generating = 'params_pass_1.tar.gz'
# source and target dict dim.
dict_size = 30000
source_dict_dim = target_dict_dim = dict_size
# train the network
if not is_generating:
cost = seqToseq_net(source_dict_dim, target_dict_dim)
parameters = paddle.parameters.create(cost)
# define optimize method and trainer
optimizer = paddle.optimizer.Adam(
learning_rate=5e-5,
regularization=paddle.optimizer.L2Regularization(rate=8e-4))
trainer = paddle.trainer.SGD(
cost=cost, parameters=parameters, update_equation=optimizer)
# define data reader
wmt14_reader = paddle.batch(
gen_schedule_data(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(dict_size), buf_size=8192)),
batch_size=5)
feeding = {
'source_language_word': 0,
'target_language_word': 1,
'target_language_next_word': 2,
'true_token_flag': 3
}
# define event_handler callback
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 10 == 0:
print "\nPass %d, Batch %d, Cost %f, %s" % (
event.pass_id, event.batch_id, event.cost,
event.metrics)
else:
sys.stdout.write('.')
sys.stdout.flush()
if isinstance(event, paddle.event.EndPass):
# save parameters
with gzip.open('params_pass_%d.tar.gz' % event.pass_id,
'w') as f:
parameters.to_tar(f)
# start to train
trainer.train(
reader=wmt14_reader,
event_handler=event_handler,
feeding=feeding,
num_passes=2)
# generate a english sequence to french
else:
# use the first 3 samples for generation
gen_creator = paddle.dataset.wmt14.gen(dict_size)
gen_data = []
gen_num = 3
for item in gen_creator():
gen_data.append((item[0], ))
if len(gen_data) == gen_num:
break
beam_gen = seqToseq_net(source_dict_dim, target_dict_dim, is_generating)
# get the trained model
with gzip.open(model_path_for_generating, 'r') as f:
parameters = Parameters.from_tar(f)
# prob is the prediction probabilities, and id is the prediction word.
beam_result = paddle.infer(
output_layer=beam_gen,
parameters=parameters,
input=gen_data,
field=['prob', 'id'])
# get the dictionary
src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size)
# the delimited element of generated sequences is -1,
# the first element of each generated sequence is the sequence length
seq_list = []
seq = []
for w in beam_result[1]:
if w != -1:
seq.append(w)
else:
seq_list.append(' '.join([trg_dict.get(w) for w in seq[1:]]))
seq = []
prob = beam_result[0]
beam_size = 3
for i in xrange(gen_num):
print "\n*******************************************************\n"
print "src:", ' '.join(
[src_dict.get(w) for w in gen_data[i][0]]), "\n"
for j in xrange(beam_size):
print "prob = %f:" % (prob[i][j]), seq_list[i * beam_size + j]
if __name__ == '__main__':
main()
此差异已折叠。
wget http://cs224d.stanford.edu/assignment2/assignment2.zip
unzip assignment2.zip
cp assignment2_release/data/ner/wordVectors.txt ./
cp assignment2_release/data/ner/vocab.txt ./
rm -rf assignment2.zip assignment2_release
if [ $? -eq 0 ];then
unzip assignment2.zip
cp assignment2_release/data/ner/wordVectors.txt ./data
cp assignment2_release/data/ner/vocab.txt ./data
rm -rf assignment2.zip assignment2_release
else
echo "download data error!" >> /dev/stderr
exit 1
fi
此差异已折叠。
......@@ -12,11 +12,10 @@ def infer(model_path, batch_size, test_data_file, vocab_file, target_file):
for idx, test_sample in enumerate(test_data):
start_id = 0
pred_str = ""
for w, tag in zip(test_sample[0],
probs[start_id:start_id + len(test_sample[0])]):
pred_str += "%s[%s] " % (id_2_word[w], id_2_label[tag])
print(pred_str.strip())
print("%s\t%s" % (id_2_word[w], id_2_label[tag]))
print("\n")
start_id += len(test_sample[0])
word_dict = load_dict(vocab_file)
......
......@@ -26,8 +26,6 @@ def canonicalize_word(word, wordset=None, digits=True):
def data_reader(data_file, word_dict, label_dict):
"""
Conll03 train set creator.
The dataset can be obtained according to http://www.clips.uantwerpen.be/conll2003/ner/.
It returns a reader creator, each sample in the reader includes:
word id sequence, label id sequence and raw sentence.
......
......@@ -5,8 +5,6 @@ import reader
from utils import *
from network_conf import *
from paddle.v2.layer import parse_network
def main(train_data_file,
test_data_file,
......
......@@ -19,10 +19,29 @@ def get_embedding(emb_file='data/wordVectors.txt'):
def load_dict(dict_path):
"""
Load the word dictionary from the given file.
Each line of the given file is a word, which can include multiple columns
seperated by tab.
This function takes the first column (columns in a line are seperated by
tab) as key and takes line number of a line as the key (index of the word
in the dictionary).
"""
return dict((line.strip().split("\t")[0], idx)
for idx, line in enumerate(open(dict_path, "r").readlines()))
def load_reverse_dict(dict_path):
"""
Load the word dictionary from the given file.
Each line of the given file is a word, which can include multiple columns
seperated by tab.
This function takes line number of a line as the key (index of the word in
the dictionary) and the first column (columns in a line are seperated by
tab) as the value.
"""
return dict((idx, line.strip().split("\t")[0])
for idx, line in enumerate(open(dict_path, "r").readlines()))
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册