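diff --git a/README.md b/README.md
index 8fd9edfec30dfde19a8c79a979dc29abfbe2ce47..fb0e20bf42560748f1c9633f19eb0d77090d1b32 100644
--- a/README.md
+++ b/README.md
@@ -14,46 +14,57 @@ PaddlePaddle provides a rich set of computational units that help you, in a modular way
 In the word embedding examples, we show how to use Hierarchical-Sigmoid and Noise Contrastive Estimation (NCE) to speed up word embedding training.
 
 - 1.1 [Accelerating word embedding training with Hsigmoid](https://github.com/PaddlePaddle/models/tree/develop/word_embedding)
+- 1.2 [Accelerating word embedding training with Noise Contrastive Estimation](https://github.com/PaddlePaddle/models/tree/develop/nce_cost)
 
-## 2. Click-Through Rate Prediction
+
+## 2. Language Model
+
+The language model is an important foundational model in natural language processing. It is a probability distribution model that can determine which word sequence is more likely or, given several words, predict the most likely next word. Language models are used in many areas, such as automatic writing, QA, machine translation, spell checking, speech recognition, and part-of-speech tagging.
+
+In the language model examples, we take text generation as the task and provide an RNN LM (including LSTM and GRU) and an N-Gram LM for you to study and use. The "Usage" section of the documentation helps you get started quickly: adapt your own training corpus to train fun models such as "automatic poetry writing" and "automatic prose writing".
+
+- 2.1 [Text generation models based on LSTM, GRU, and N-Gram](https://github.com/PaddlePaddle/models/tree/develop/language_model)
+
+## 3. Click-Through Rate Prediction
 
 A click-through rate (CTR) prediction model estimates the probability that a user clicks on an ad, predicting the outcome of each ad impression; it is one of the core algorithms of advertising technology. Logistic regression learns well from large-scale sparse features and dominated the early development of CTR prediction. In recent years, DNN models have gradually taken over the CTR prediction task thanks to their strong learning capacity.
 
 In the CTR prediction example, we present the Wide & Deep model proposed by Google. This model combines the strengths of a DNN, which is suited to learning abstract features, and logistic regression, which is suited to large-scale sparse features. It can serve as a relatively mature model framework and has seen some use in industry.
 
-- 2.1 [Wide & deep CTR prediction model](https://github.com/PaddlePaddle/models/tree/develop/ctr)
+- 3.1 [Wide & deep CTR prediction model](https://github.com/PaddlePaddle/models/tree/develop/ctr)
 
-## 3. Text Classification
+## 4. Text Classification
 
 Text classification is one of the most basic tasks in natural language processing. Deep learning methods remove the need for complex feature engineering: they take raw text directly as input and optimize classification accuracy in a data-driven way.
 
 In the text classification examples, we take sentiment classification as the task and provide a DNN-based non-sequential text classification model and a CNN-based sequential model for you to study and use (for an LSTM-based model, see the [Sentiment Classification](https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/README.cn.md) chapter in PaddleBook).
 
-- 3.1 [Sentiment classification based on DNN / CNN](https://github.com/PaddlePaddle/models/tree/develop/text_classification)
+- 4.1 [Sentiment classification based on DNN / CNN](https://github.com/PaddlePaddle/models/tree/develop/text_classification)
 
-## 4. Learning to Rank
+## 5. Learning to Rank
 
 Learning to Rank (LTR) is one of the core problems in information retrieval and search engine research. It uses machine learning to learn a scoring function that scores the candidates to be ranked, and then determines their order from the scores. Deep neural networks can be used to model the scoring function, giving rise to various deep-learning-based LTR models.
 
 In the learning-to-rank examples, we introduce a Pairwise ranking model based on the RankLoss loss function and a Listwise ranking model based on the LambdaRank loss function (for the Pointwise learning strategy, see the [Recommender System](https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/README.cn.md) chapter in PaddleBook).
 
-- 4.1 [Learning to rank with Pairwise and Listwise approaches](https://github.com/PaddlePaddle/models/tree/develop/ltr)
+- 5.1 [Learning to rank with Pairwise and Listwise approaches](https://github.com/PaddlePaddle/models/tree/develop/ltr)
 
-## 5. Sequence Tagging
+## 6. Sequence Tagging
 
 Given an input sequence, a sequence tagging model assigns a category label to every element of the sequence; it is one of the most basic tasks in natural language processing. As deep learning has continued to develop, using a recurrent neural network to learn feature representations of the input sequence, with a Conditional Random Field (CRF) completing the tagging task on top of those features, has gradually become the standard solution to sequence tagging problems.
 
 In the sequence tagging example, we take Named Entity Recognition (NER) as the task and show how to train an end-to-end sequence tagging model.
 
-- 5.1 [Named entity recognition](https://github.com/PaddlePaddle/models/tree/develop/sequence_tagging_for_ner)
+- 6.1 [Named entity recognition](https://github.com/PaddlePaddle/models/tree/develop/sequence_tagging_for_ner)
 
-## 6. Sequence to Sequence Learning
+## 7. Sequence to Sequence Learning
 
 Sequence-to-sequence learning maps between two or even more variable-length sequences and has a wide range of applications, including machine translation, intelligent dialogue and question answering, generation of advertising creative text, automatic encoding (such as financial profile encoding), and judging the semantic relevance between multiple text strings.
 
 In the sequence-to-sequence learning examples, we take machine translation as the task and provide several improved models for you to study and use, including: a sequence-to-sequence mapping model without an attention mechanism, which is the basis of all sequence-to-sequence learning models; scheduled sampling, which alleviates the error accumulation problem of RNN models in generation tasks; and neural machine translation with an external memory mechanism, which strengthens the memory capacity of the neural network to handle complex sequence-to-sequence learning tasks.
 
-- 6.1 [Encoder-decoder model without attention mechanism](https://github.com/PaddlePaddle/models/tree/develop/nmt_without_attention)
+- 7.1 [Encoder-decoder model without attention mechanism](https://github.com/PaddlePaddle/models/tree/develop/nmt_without_attention)
+
 ## Copyright and License
 PaddlePaddle is provided under the [Apache-2.0 license](LICENSE).
diff --git a/deep_speech_2/README.md b/deep_speech_2/README.md
index 7a372e9bed262d2ee5bc8640a0f480b9ce34cd34..23e0b412b59da4ccfea7a4ce4303faec479ff234 100644
--- a/deep_speech_2/README.md
+++ b/deep_speech_2/README.md
@@ -16,34 +16,48 @@ For some machines, we also need to install libsndfile1. Details to be added.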
 ### Preparing Data
 
 ```
-cd data
-python librispeech.py
-cat manifest.libri.train-* > manifest.libri.train-all
+cd datasets
+sh run_all.sh
 cd ..
 ```
 
-After running librispeech.py, we have several "manifest" json files named with a prefix `manifest.libri.`. A manifest file summarizes a speech data set, with each line containing the meta data (i.e. audio filepath, transcription text, audio duration) of each audio file within the data set, in json format.
+`sh run_all.sh` prepares all ASR datasets (currently, only LibriSpeech is available). After it runs, we have several summary manifest files in JSON format.
 
-By `cat manifest.libri.train-* > manifest.libri.train-all`, we simply merge the three seperate sample sets of LibriSpeech (train-clean-100, train-clean-360, train-other-500) into one training set. This is a simple way for merging different data sets.
+A manifest file summarizes a speech data set, with each line containing the metadata (i.e. audio filepath, transcript text, audio duration) of one audio file within the data set, in JSON format. The manifest file serves as an interface that tells our system where the speech samples are and what to read from them.
+
+
+More help for arguments:
+
+```
+python datasets/librispeech/librispeech.py --help
+```
+
+### Preparing for Training
+
+```
+python compute_mean_std.py
+```
+
+`python compute_mean_std.py` computes the mean and standard deviation of the audio features and saves them to a file, by default `./mean_std.npz`. This file is used in both training and inference.
 
 More help for arguments:
 
 ```
-python librispeech.py --help
+python compute_mean_std.py --help
 ```
 
-### Traininig
+### Training
 
 For GPU Training:
 
 ```
-CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --trainer_count 4 --train_manifest_path ./data/manifest.libri.train-all
+CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --trainer_count 4
 ```
 
 For CPU Training:
 
 ```
-python train.py --trainer_count 8 --use_gpu False -- train_manifest_path ./data/manifest.libri.train-all
+python train.py --trainer_count 8 --use_gpu False
 ```
 
 More help for arguments:
@@ -55,7 +69,7 @@ python train.py --help
 ### Inferencing
 
 ```
-python infer.py
+CUDA_VISIBLE_DEVICES=0 python infer.py
 ```
 
 More help for arguments:
diff --git a/deep_speech_2/audio_data_utils.py b/deep_speech_2/audio_data_utils.py
deleted file mode 100644
index 1cd29be114a416636db8c2d7e888d0d8d6c2a8a8..0000000000000000000000000000000000000000
--- a/deep_speech_2/audio_data_utils.py
+++ /dev/null
@@ -1,411 +0,0 @@
-"""
-    Providing basic audio data preprocessing pipeline, and offering
-    both instance-level and batch-level data reader interfaces.
-"""
-import paddle.v2 as paddle
-import logging
-import json
-import random
-import soundfile
-import numpy as np
-import itertools
-import os
-
-RANDOM_SEED = 0
-logger = logging.getLogger(__name__)
-
-
-class DataGenerator(object):
-    """
-    DataGenerator provides basic audio data preprocessing pipeline, and offers
-    both instance-level and batch-level data reader interfaces.
-    Normalized FFT are used as audio features here.
-
-    :param vocab_filepath: Vocabulary file path for indexing tokenized
-                           transcriptions.
-    :type vocab_filepath: basestring
-    :param normalizer_manifest_path: Manifest filepath for collecting feature
-                                     normalization statistics, e.g. mean, std.
-    :type normalizer_manifest_path: basestring
-    :param normalizer_num_samples: Number of instances sampled for collecting
-                                   feature normalization statistics.
-                                   Default is 100.
-    :type normalizer_num_samples: int
-    :param max_duration: Audio clips with duration (in seconds) greater than
-                         this will be discarded. Default is 20.0.
-    :type max_duration: float
-    :param min_duration: Audio clips with duration (in seconds) smaller than
-                         this will be discarded. Default is 0.0.
-    :type min_duration: float
-    :param stride_ms: Striding size (in milliseconds) for generating frames.
-                      Default is 10.0.
-    :type stride_ms: float
-    :param window_ms: Window size (in milliseconds) for frames. Default is 20.0.
-    :type window_ms: float
-    :param max_frequency: Maximun frequency for FFT features. FFT features of
-                          frequency larger than this will be discarded.
-                          If set None, all features will be kept.
-                          Default is None.
-    :type max_frequency: float
-    """
-
-    def __init__(self,
-                 vocab_filepath,
-                 normalizer_manifest_path,
-                 normalizer_num_samples=100,
-                 max_duration=20.0,
-                 min_duration=0.0,
-                 stride_ms=10.0,
-                 window_ms=20.0,
-                 max_frequency=None):
-        self.__max_duration__ = max_duration
-        self.__min_duration__ = min_duration
-        self.__stride_ms__ = stride_ms
-        self.__window_ms__ = window_ms
-        self.__max_frequency__ = max_frequency
-        self.__epoc__ = 0
-        self.__random__ = random.Random(RANDOM_SEED)
-        # load vocabulary (dictionary)
-        self.__vocab_dict__, self.__vocab_list__ = \
-            self.__load_vocabulary_from_file__(vocab_filepath)
-        # collect normalizer statistics
-        self.__mean__, self.__std__ = self.__collect_normalizer_statistics__(
-            manifest_path=normalizer_manifest_path,
-            num_samples=normalizer_num_samples)
-
-    def __audio_featurize__(self, audio_filename):
-        """
-        Preprocess audio data, including feature extraction, normalization etc..
-        """
-        features = self.__audio_basic_featurize__(audio_filename)
-        return self.__normalize__(features)
-
-    def __text_featurize__(self, text):
-        """
-        Preprocess text data, including tokenizing and token indexing etc..
-        """
-        return self.__convert_text_to_char_index__(
-            text=text, vocabulary=self.__vocab_dict__)
-
-    def __audio_basic_featurize__(self, audio_filename):
-        """
-        Compute basic (without normalization etc.) features for audio data.
-        """
-        return self.__spectrogram_from_file__(
-            filename=audio_filename,
-            stride_ms=self.__stride_ms__,
-            window_ms=self.__window_ms__,
-            max_freq=self.__max_frequency__)
-
-    def __collect_normalizer_statistics__(self, manifest_path, num_samples=100):
-        """
-        Compute feature normalization statistics, i.e. mean and stddev.
-        """
-        # read manifest
-        manifest = self.__read_manifest__(
-            manifest_path=manifest_path,
-            max_duration=self.__max_duration__,
-            min_duration=self.__min_duration__)
-        # sample for statistics
-        sampled_manifest = self.__random__.sample(manifest, num_samples)
-        # extract spectrogram feature
-        features = []
-        for instance in sampled_manifest:
-            spectrogram = self.__audio_basic_featurize__(
-                instance["audio_filepath"])
-            features.append(spectrogram)
-        features = np.hstack(features)
-        mean = np.mean(features, axis=1).reshape([-1, 1])
-        std = np.std(features, axis=1).reshape([-1, 1])
-        return mean, std
-
-    def __normalize__(self, features, eps=1e-14):
-        """
-        Normalize features to be of zero mean and unit stddev.
-        """
-        return (features - self.__mean__) / (self.__std__ + eps)
-
-    def __spectrogram_from_file__(self,
-                                  filename,
-                                  stride_ms=10.0,
-                                  window_ms=20.0,
-                                  max_freq=None,
-                                  eps=1e-14):
-        """
-        Laod audio data and calculate the log of spectrogram by FFT.
-        Refer to utils.py in https://github.com/baidu-research/ba-dls-deepspeech
-        """
-        audio, sample_rate = soundfile.read(filename)
-        if audio.ndim >= 2:
-            audio = np.mean(audio, 1)
-        if max_freq is None:
-            max_freq = sample_rate / 2
-        if max_freq > sample_rate / 2:
-            raise ValueError("max_freq must be greater than half of "
-                             "sample rate.")
-        if stride_ms > window_ms:
-            raise ValueError("Stride size must not be greater than "
-                             "window size.")
-        stride_size = int(0.001 * sample_rate * stride_ms)
-        window_size = int(0.001 * sample_rate * window_ms)
-        spectrogram, freqs = self.__extract_spectrogram__(
-            audio,
-            window_size=window_size,
-            stride_size=stride_size,
-            sample_rate=sample_rate)
-        ind = np.where(freqs <= max_freq)[0][-1] + 1
-        return np.log(spectrogram[:ind, :] + eps)
-
-    def __extract_spectrogram__(self, samples, window_size, stride_size,
-                                sample_rate):
-        """
-        Compute the spectrogram by FFT for a discrete real signal.
-        Refer to utils.py in https://github.com/baidu-research/ba-dls-deepspeech
-        """
-        # extract strided windows
-        truncate_size = (len(samples) - window_size) % stride_size
-        samples = samples[:len(samples) - truncate_size]
-        nshape = (window_size, (len(samples) - window_size) // stride_size + 1)
-        nstrides = (samples.strides[0], samples.strides[0] * stride_size)
-        windows = np.lib.stride_tricks.as_strided(
-            samples, shape=nshape, strides=nstrides)
-        assert np.all(
-            windows[:, 1] == samples[stride_size:(stride_size + window_size)])
-        # window weighting, squared Fast Fourier Transform (fft), scaling
-        weighting = np.hanning(window_size)[:, None]
-        fft = np.fft.rfft(windows * weighting, axis=0)
-        fft = np.absolute(fft)**2
-        scale = np.sum(weighting**2) * sample_rate
-        fft[1:-1, :] *= (2.0 / scale)
-        fft[(0, -1), :] /= scale
-        # prepare fft frequency list
-        freqs = float(sample_rate) / window_size * np.arange(fft.shape[0])
-        return fft, freqs
-
-    def __load_vocabulary_from_file__(self, vocabulary_path):
-        """
-        Load vocabulary from file.
-        """
-        if not os.path.exists(vocabulary_path):
-            raise ValueError("Vocabulary file %s not found.", vocabulary_path)
-        vocab_lines = []
-        with open(vocabulary_path, 'r') as file:
-            vocab_lines.extend(file.readlines())
-        vocab_list = [line[:-1] for line in vocab_lines]
-        vocab_dict = dict(
-            [(token, id) for (id, token) in enumerate(vocab_list)])
-        return vocab_dict, vocab_list
-
-    def __convert_text_to_char_index__(self, text, vocabulary):
-        """
-        Convert text string to a list of character index integers.
-        """
-        return [vocabulary[w] for w in text]
-
-    def __read_manifest__(self, manifest_path, max_duration, min_duration):
-        """
-        Load and parse manifest file.
-        """
-        manifest = []
-        for json_line in open(manifest_path):
-            try:
-                json_data = json.loads(json_line)
-            except Exception as e:
-                raise ValueError("Error reading manifest: %s" % str(e))
-            if (json_data["duration"] <= max_duration and
-                    json_data["duration"] >= min_duration):
-                manifest.append(json_data)
-        return manifest
-
-    def __padding_batch__(self, batch, padding_to=-1, flatten=False):
-        """
-        Padding audio part of features (only in the time axis -- column axis)
-        with zeros, to make each instance in the batch share the same
-        audio feature shape.
-
-        If `padding_to` is set -1, the maximun column numbers in the batch will
-        be used as the target size. Otherwise, `padding_to` will be the target
-        size. Default is -1.
-
-        If `flatten` is set True, audio data will be flatten to be a 1-dim
-        ndarray. Default is False.
-        """
-        new_batch = []
-        # get target shape
-        max_length = max([audio.shape[1] for audio, text in batch])
-        if padding_to != -1:
-            if padding_to < max_length:
-                raise ValueError("If padding_to is not -1, it should be greater"
-                                 " or equal to the original instance length.")
-            max_length = padding_to
-        # padding
-        for audio, text in batch:
-            padded_audio = np.zeros([audio.shape[0], max_length])
-            padded_audio[:, :audio.shape[1]] = audio
-            if flatten:
-                padded_audio = padded_audio.flatten()
-            new_batch.append((padded_audio, text))
-        return new_batch
-
-    def __batch_shuffle__(self, manifest, batch_size):
-        """
-        The instances have different lengths and they cannot be
-        combined into a single matrix multiplication. It usually
-        sorts the training examples by length and combines only
-        similarly-sized instances into minibatches, pads with
-        silence when necessary so that all instances in a batch
-        have the same length. This batch shuffle fuction is used
-        to make similarly-sized instances into minibatches and
-        make a batch-wise shuffle.
-
-        1. Sort the audio clips by duration.
-        2. Generate a random number `k`, k in [0, batch_size).
-        3. Randomly remove `k` instances in order to make different mini-batches,
-           then make minibatches and each minibatch size is batch_size.
-        4. Shuffle the minibatches.
-
-        :param manifest: manifest file.
-        :type manifest: list
-        :param batch_size: Batch size. This size is also used for generate
-                           a random number for batch shuffle.
-        :type batch_size: int
-        :return: batch shuffled mainifest.
-        :rtype: list
-        """
-        manifest.sort(key=lambda x: x["duration"])
-        shift_len = self.__random__.randint(0, batch_size - 1)
-        batch_manifest = zip(*[iter(manifest[shift_len:])] * batch_size)
-        self.__random__.shuffle(batch_manifest)
-        batch_manifest = list(sum(batch_manifest, ()))
-        res_len = len(manifest) - shift_len - len(batch_manifest)
-        batch_manifest.extend(manifest[-res_len:])
-        batch_manifest.extend(manifest[0:shift_len])
-        return batch_manifest
-
-    def instance_reader_creator(self, manifest):
-        """
-        Instance reader creator for audio data. Creat a callable function to
-        produce instances of data.
-
-        Instance: a tuple of a numpy ndarray of audio spectrogram and a list of
-        tokenized and indexed transcription text.
-
-        :param manifest: Filepath of manifest for audio clip files.
-        :type manifest: basestring
-        :return: Data reader function.
-        :rtype: callable
-        """
-
-        def reader():
-            # extract spectrogram feature
-            for instance in manifest:
-                spectrogram = self.__audio_featurize__(
-                    instance["audio_filepath"])
-                transcript = self.__text_featurize__(instance["text"])
-                yield (spectrogram, transcript)
-
-        return reader
-
-    def batch_reader_creator(self,
-                             manifest_path,
-                             batch_size,
-                             padding_to=-1,
-                             flatten=False,
-                             sortagrad=False,
-                             batch_shuffle=False):
-        """
-        Batch data reader creator for audio data. Creat a callable function to
-        produce batches of data.
-
-        Audio features will be padded with zeros to make each instance in the
-        batch to share the same audio feature shape.
-
-        :param manifest_path: Filepath of manifest for audio clip files.
-        :type manifest_path: basestring
-        :param batch_size: Instance number in a batch.
-        :type batch_size: int
-        :param padding_to: If set -1, the maximun column numbers in the batch
-                           will be used as the target size for padding.
-                           Otherwise, `padding_to` will be the target size.
-                           Default is -1.
-        :type padding_to: int
-        :param flatten: If set True, audio data will be flatten to be a 1-dim
-                        ndarray. Otherwise, 2-dim ndarray. Default is False.
-        :type flatten: bool
-        :param sortagrad: Sort the audio clips by duration in the first epoc
-                          if set True.
-        :type sortagrad: bool
-        :param batch_shuffle: Shuffle the audio clips if set True. It is
-                              not a thorough instance-wise shuffle, but a
-                              specific batch-wise shuffle. For more details,
-                              please see `__batch_shuffle__` function.
-        :type batch_shuffle: bool
-        :return: Batch reader function, producing batches of data when called.
-        :rtype: callable
-        """
-
-        def batch_reader():
-            # read manifest
-            manifest = self.__read_manifest__(
-                manifest_path=manifest_path,
-                max_duration=self.__max_duration__,
-                min_duration=self.__min_duration__)
-
-            # sort (by duration) or shuffle manifest
-            if self.__epoc__ == 0 and sortagrad:
-                manifest.sort(key=lambda x: x["duration"])
-            elif batch_shuffle:
-                manifest = self.__batch_shuffle__(manifest, batch_size)
-
-            instance_reader = self.instance_reader_creator(manifest)
-            batch = []
-            for instance in instance_reader():
-                batch.append(instance)
-                if len(batch) == batch_size:
-                    yield self.__padding_batch__(batch, padding_to, flatten)
-                    batch = []
-            if len(batch) > 0:
-                yield self.__padding_batch__(batch, padding_to, flatten)
-            self.__epoc__ += 1
-
-        return batch_reader
-
-    def vocabulary_size(self):
-        """
-        Get vocabulary size.
-
-        :return: Vocabulary size.
-        :rtype: int
-        """
-        return len(self.__vocab_list__)
-
-    def vocabulary_dict(self):
-        """
-        Get vocabulary in dict.
-
-        :return: Vocabulary in dict.
-        :rtype: dict
-        """
-        return self.__vocab_dict__
-
-    def vocabulary_list(self):
-        """
-        Get vocabulary in list.
-
-        :return: Vocabulary in list
-        :rtype: list
-        """
-        return self.__vocab_list__
-
-    def data_name_feeding(self):
-        """
-        Get feeddings (data field name and corresponding field id).
-
-        :return: Feeding dict.
-        :rtype: dict
-        """
-        feeding = {
-            "audio_spectrogram": 0,
-            "transcript_text": 1,
-        }
-        return feeding
diff --git a/deep_speech_2/compute_mean_std.py b/deep_speech_2/compute_mean_std.py
new file mode 100644
index 0000000000000000000000000000000000000000..9c301c93f6d2ce3ae099caa96830912f76ce6c58
--- /dev/null
+++ b/deep_speech_2/compute_mean_std.py
@@ -0,0 +1,57 @@
+"""Compute mean and std for feature normalizer, and save to file."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import argparse
+from data_utils.normalizer import FeatureNormalizer
+from data_utils.augmentor.augmentation import AugmentationPipeline
+from data_utils.featurizer.audio_featurizer import AudioFeaturizer
+
+parser = argparse.ArgumentParser(
+    description='Computing mean and stddev for feature normalizer.')
+parser.add_argument(
+    "--manifest_path",
+    default='datasets/manifest.train',
+    type=str,
+    help="Manifest path for computing normalizer's mean and stddev. "
+    "(default: %(default)s)")
+parser.add_argument(
+    "--num_samples",
+    default=2000,
+    type=int,
+    help="Number of samples for computing mean and stddev. "
+    "(default: %(default)s)")
+parser.add_argument(
+    "--augmentation_config",
+    default='{}',
+    type=str,
+    help="Augmentation configuration in json-format. "
+    "(default: %(default)s)")
+parser.add_argument(
+    "--output_file",
+    default='mean_std.npz',
+    type=str,
+    help="Filepath to write mean and std to (.npz). "
+    "(default: %(default)s)")
+args = parser.parse_args()
+
+
+def main():
+    augmentation_pipeline = AugmentationPipeline(args.augmentation_config)
+    audio_featurizer = AudioFeaturizer()
+
+    def augment_and_featurize(audio_segment):
+        augmentation_pipeline.transform_audio(audio_segment)
+        return audio_featurizer.featurize(audio_segment)
+
+    normalizer = FeatureNormalizer(
+        mean_std_filepath=None,
+        manifest_path=args.manifest_path,
+        featurize_func=augment_and_featurize,
+        num_samples=args.num_samples)
+    normalizer.write_to_file(args.output_file)
+
+
+if __name__ == '__main__':
+    main()
diff --git a/deep_speech_2/data_utils/__init__.py b/deep_speech_2/data_utils/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/deep_speech_2/data_utils/audio.py b/deep_speech_2/data_utils/audio.py
new file mode 100644
index 0000000000000000000000000000000000000000..916c8ac1ae781bcb0ec6e1ed2ad1b3574dc6fe65
--- /dev/null
+++ b/deep_speech_2/data_utils/audio.py
@@ -0,0 +1,252 @@
+"""Contains the audio segment class."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+import io
+import soundfile
+
+
+class AudioSegment(object):
+    """Monaural audio segment abstraction.
+
+    :param samples: Audio samples [num_samples x num_channels].
+    :type samples: ndarray.float32
+    :param sample_rate: Audio sample rate.
+    :type sample_rate: int
+    :raises TypeError: If the sample data type is not float or int.
+    """
+
+    def __init__(self, samples, sample_rate):
+        """Create audio segment from samples.
+
+        Samples are converted to float32 internally, with int scaled to [-1, 1].
+        """
+        self._samples = self._convert_samples_to_float32(samples)
+        self._sample_rate = sample_rate
+        if self._samples.ndim >= 2:
+            self._samples = np.mean(self._samples, 1)
+
+    def __eq__(self, other):
+        """Return whether two objects are equal."""
+        if type(other) is not type(self):
+            return False
+        if self._sample_rate != other._sample_rate:
+            return False
+        if self._samples.shape != other._samples.shape:
+            return False
+        if np.any(self.samples != other._samples):
+            return False
+        return True
+
+    def __ne__(self, other):
+        """Return whether two objects are unequal."""
+        return not self.__eq__(other)
+
+    def __str__(self):
+        """Return human-readable representation of segment."""
+        return ("%s: num_samples=%d, sample_rate=%d, duration=%.2fsec, "
+                "rms=%.2fdB" % (type(self), self.num_samples, self.sample_rate,
+                                self.duration, self.rms_db))
+
+    @classmethod
+    def from_file(cls, file):
+        """Create audio segment from audio file.
+
+        :param file: Filepath or file object to audio file.
+        :type file: basestring|file
+        :return: Audio segment instance.
+        :rtype: AudioSegment
+        """
+        samples, sample_rate = soundfile.read(file, dtype='float32')
+        return cls(samples, sample_rate)
+
+    @classmethod
+    def from_bytes(cls, bytes):
+        """Create audio segment from a byte string containing audio samples.
+
+        :param bytes: Byte string containing audio samples.
+        :type bytes: str
+        :return: Audio segment instance.
+        :rtype: AudioSegment
+        """
+        samples, sample_rate = soundfile.read(
+            io.BytesIO(bytes), dtype='float32')
+        return cls(samples, sample_rate)
+
+    def to_wav_file(self, filepath, dtype='float32'):
+        """Save audio segment to disk as wav file.
+
+        :param filepath: WAV filepath or file object to save the
+                         audio segment.
+        :type filepath: basestring|file
+        :param dtype: Subtype for audio file. Options: 'int16', 'int32',
+                      'float32', 'float64'. Default is 'float32'.
+        :type dtype: str
+        :raises TypeError: If dtype is not supported.
+        """
+        samples = self._convert_samples_from_float32(self._samples, dtype)
+        subtype_map = {
+            'int16': 'PCM_16',
+            'int32': 'PCM_32',
+            'float32': 'FLOAT',
+            'float64': 'DOUBLE'
+        }
+        soundfile.write(
+            filepath,
+            samples,
+            self._sample_rate,
+            format='WAV',
+            subtype=subtype_map[dtype])
+
+    def to_bytes(self, dtype='float32'):
+        """Create a byte string containing the audio content.
+
+        :param dtype: Data type for export samples. Options: 'int16', 'int32',
+                      'float32', 'float64'. Default is 'float32'.
+        :type dtype: str
+        :return: Byte string containing audio content.
+        :rtype: str
+        """
+        samples = self._convert_samples_from_float32(self._samples, dtype)
+        return samples.tostring()
+
+    def apply_gain(self, gain):
+        """Apply gain in decibels to samples.
+
+        Note that this is an in-place transformation.
+
+        :param gain: Gain in decibels to apply to samples.
+        :type gain: float
+        """
+        self._samples *= 10.**(gain / 20.)
+
+    def change_speed(self, speed_rate):
+        """Change the audio speed by linear interpolation.
+
+        Note that this is an in-place transformation.
+
+        :param speed_rate: Rate of speed change:
+                           speed_rate > 1.0, speed up the audio;
+                           speed_rate = 1.0, unchanged;
+                           speed_rate < 1.0, slow down the audio;
+                           speed_rate <= 0.0, not allowed, raise ValueError.
+        :type speed_rate: float
+        :raises ValueError: If speed_rate <= 0.0.
+        """
+        if speed_rate <= 0:
+            raise ValueError("speed_rate should be greater than zero.")
+        old_length = self._samples.shape[0]
+        new_length = int(old_length / speed_rate)
+        old_indices = np.arange(old_length)
+        new_indices = np.linspace(start=0, stop=old_length, num=new_length)
+        self._samples = np.interp(new_indices, old_indices, self._samples)
+
+    def normalize(self, target_sample_rate):
+        raise NotImplementedError()
+
+    def resample(self, target_sample_rate):
+        raise NotImplementedError()
+
+    def pad_silence(self, duration, sides='both'):
+        raise NotImplementedError()
+
+    def subsegment(self, start_sec=None, end_sec=None):
+        raise NotImplementedError()
+
+    def convolve(self, filter, allow_resample=False):
+        raise NotImplementedError()
+
+    def convolve_and_normalize(self, filter, allow_resample=False):
+        raise NotImplementedError()
+
+    @property
+    def samples(self):
+        """Return audio samples.
+
+        :return: Audio samples.
+        :rtype: ndarray
+        """
+        return self._samples.copy()
+
+    @property
+    def sample_rate(self):
+        """Return audio sample rate.
+
+        :return: Audio sample rate.
+        :rtype: int
+        """
+        return self._sample_rate
+
+    @property
+    def num_samples(self):
+        """Return number of samples.
+
+        :return: Number of samples.
+        :rtype: int
+        """
+        return self._samples.shape[0]
+
+    @property
+    def duration(self):
+        """Return audio duration.
+
+        :return: Audio duration in seconds.
+        :rtype: float
+        """
+        return self._samples.shape[0] / float(self._sample_rate)
+
+    @property
+    def rms_db(self):
+        """Return root mean square energy of the audio in decibels.
+
+        :return: Root mean square energy in decibels.
+        :rtype: float
+        """
+        # square root => multiply by 10 instead of 20 for dBs
+        mean_square = np.mean(self._samples**2)
+        return 10 * np.log10(mean_square)
+
+    def _convert_samples_to_float32(self, samples):
+        """Convert sample type to float32.
+
+        Audio sample type is usually integer or floating-point.
+        Integers will be scaled to [-1, 1] in float32.
+        """
+        float32_samples = samples.astype('float32')
+        if samples.dtype in np.sctypes['int']:
+            bits = np.iinfo(samples.dtype).bits
+            float32_samples *= (1. / 2**(bits - 1))
+        elif samples.dtype in np.sctypes['float']:
+            pass
+        else:
+            raise TypeError("Unsupported sample type: %s." % samples.dtype)
+        return float32_samples
+
+    def _convert_samples_from_float32(self, samples, dtype):
+        """Convert sample type from float32 to dtype.
+
+        Audio sample type is usually integer or floating-point. For integer
+        type, float32 will be rescaled from [-1, 1] to the maximum range
+        supported by the integer type.
+
+        This is for writing an audio file.
+        """
+        dtype = np.dtype(dtype)
+        output_samples = samples.copy()
+        if dtype in np.sctypes['int']:
+            bits = np.iinfo(dtype).bits
+            output_samples *= (2**(bits - 1) / 1.)
+            min_val = np.iinfo(dtype).min
+            max_val = np.iinfo(dtype).max
+            output_samples[output_samples > max_val] = max_val
+            output_samples[output_samples < min_val] = min_val
+        elif dtype in np.sctypes['float']:
+            min_val = np.finfo(dtype).min
+            max_val = np.finfo(dtype).max
+            output_samples[output_samples > max_val] = max_val
+            output_samples[output_samples < min_val] = min_val
+        else:
+            raise TypeError("Unsupported sample type: %s." % dtype)
+        return output_samples.astype(dtype)
diff --git a/deep_speech_2/data_utils/augmentor/__init__.py b/deep_speech_2/data_utils/augmentor/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/deep_speech_2/data_utils/augmentor/augmentation.py b/deep_speech_2/data_utils/augmentor/augmentation.py
new file mode 100644
index 0000000000000000000000000000000000000000..abe1a0ec89c5d6fc6f8ac1822df184cf5db4d7e1
--- /dev/null
+++ b/deep_speech_2/data_utils/augmentor/augmentation.py
@@ -0,0 +1,80 @@
+"""Contains the data augmentation pipeline."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import json
+import random
+from data_utils.augmentor.volume_perturb import VolumePerturbAugmentor
+
+
+class AugmentationPipeline(object):
+    """Build a pre-processing pipeline with various augmentation models. Such a
+    data augmentation pipeline is often leveraged to augment the training
+    samples to make the model invariant to certain types of perturbations in the
+    real world, improving the model's generalization ability.
+
+    The pipeline is built according to the augmentation configuration in json
+    string, e.g.
+
+    .. code-block::
+
+        '[{"type": "volume",
+           "params": {"min_gain_dBFS": -15,
+                      "max_gain_dBFS": 15},
+           "prob": 0.5},
+          {"type": "speed",
+           "params": {"min_speed_rate": 0.8,
+                      "max_speed_rate": 1.2},
+           "prob": 0.5}
+         ]'
+
+    This augmentation configuration inserts two augmentation models
+    into the pipeline, one being VolumePerturbAugmentor and the other
+    SpeedPerturbAugmentor. "prob" indicates the probability of the current
+    augmentor to take effect.
+
+    :param augmentation_config: Augmentation configuration in json string.
+    :type augmentation_config: str
+    :param random_seed: Random seed.
+    :type random_seed: int
+    :raises ValueError: If the augmentation json config is in incorrect format.
+    """
+
+    def __init__(self, augmentation_config, random_seed=0):
+        self._rng = random.Random(random_seed)
+        self._augmentors, self._rates = self._parse_pipeline_from(
+            augmentation_config)
+
+    def transform_audio(self, audio_segment):
+        """Run the pre-processing pipeline for data augmentation.
+
+        Note that this is an in-place transformation.
+
+        :param audio_segment: Audio segment to process.
+        :type audio_segment: AudioSegment|SpeechSegment
+        """
+        for augmentor, rate in zip(self._augmentors, self._rates):
+            if self._rng.uniform(0., 1.) <= rate:
+                augmentor.transform_audio(audio_segment)
+
+    def _parse_pipeline_from(self, config_json):
+        """Parse the config json to build an augmentation pipeline."""
+        try:
+            configs = json.loads(config_json)
+            augmentors = [
+                self._get_augmentor(config["type"], config["params"])
+                for config in configs
+            ]
+            rates = [config["prob"] for config in configs]
+        except Exception as e:
+            raise ValueError("Failed to parse the augmentation config json: "
+                             "%s" % str(e))
+        return augmentors, rates
+
+    def _get_augmentor(self, augmentor_type, params):
+        """Return an augmentation model by the type name, and pass in params."""
+        if augmentor_type == "volume":
+            return VolumePerturbAugmentor(self._rng, **params)
+        else:
+            raise ValueError("Unknown augmentor type [%s]." % augmentor_type)
diff --git a/deep_speech_2/data_utils/augmentor/base.py b/deep_speech_2/data_utils/augmentor/base.py
new file mode 100644
index 0000000000000000000000000000000000000000..a323165aaeefb8135e7189a47a388a565afd8c8a
--- /dev/null
+++ b/deep_speech_2/data_utils/augmentor/base.py
@@ -0,0 +1,33 @@
+"""Contains the abstract base class for augmentation models."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from abc import ABCMeta, abstractmethod
+
+
+class AugmentorBase(object):
+    """Abstract base class for augmentation model (augmentor) class.
+ All augmentor classes should inherit from this class, and implement the + following abstract methods. + """ + + __metaclass__ = ABCMeta + + @abstractmethod + def __init__(self): + pass + + @abstractmethod + def transform_audio(self, audio_segment): + """Adds various effects to the input audio segment. Such effects + will augment the training data to make the model invariant to certain + types of perturbations in the real world, improving the model's + generalization ability. + + Note that this is an in-place transformation. + + :param audio_segment: Audio segment to add effects to. + :type audio_segment: AudioSegment|SpeechSegment + """ + pass diff --git a/deep_speech_2/data_utils/augmentor/volume_perturb.py b/deep_speech_2/data_utils/augmentor/volume_perturb.py new file mode 100644 index 0000000000000000000000000000000000000000..a5a9f6cadac13e651dd6902d68d0efdaa9a61dc4 --- /dev/null +++ b/deep_speech_2/data_utils/augmentor/volume_perturb.py @@ -0,0 +1,40 @@ +"""Contains the volume perturb augmentation model.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from data_utils.augmentor.base import AugmentorBase + + +class VolumePerturbAugmentor(AugmentorBase): + """Augmentation model for adding random volume perturbation. + + This is used for multi-loudness training of PCEN. See + + https://arxiv.org/pdf/1607.05666v1.pdf + + for more details. + + :param rng: Random generator object. + :type rng: random.Random + :param min_gain_dBFS: Minimal gain in dBFS. + :type min_gain_dBFS: float + :param max_gain_dBFS: Maximal gain in dBFS. + :type max_gain_dBFS: float + """ + + def __init__(self, rng, min_gain_dBFS, max_gain_dBFS): + self._min_gain_dBFS = min_gain_dBFS + self._max_gain_dBFS = max_gain_dBFS + self._rng = rng + + def transform_audio(self, audio_segment): + """Change audio loudness. + + Note that this is an in-place transformation. + + :param audio_segment: Audio segment to add effects to. 
+ :type audio_segment: AudioSegment|SpeechSegment + """ + gain = self._rng.uniform(self._min_gain_dBFS, self._max_gain_dBFS) + audio_segment.apply_gain(gain) diff --git a/deep_speech_2/data_utils/data.py b/deep_speech_2/data_utils/data.py new file mode 100644 index 0000000000000000000000000000000000000000..424343a48ffa579a8ab465794987f957de36abdb --- /dev/null +++ b/deep_speech_2/data_utils/data.py @@ -0,0 +1,273 @@ +"""Contains data generator for organizing various audio data preprocessing +pipelines and offering data reader interface of PaddlePaddle requirements. +""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import random +import numpy as np +import paddle.v2 as paddle +from data_utils import utils +from data_utils.augmentor.augmentation import AugmentationPipeline +from data_utils.featurizer.speech_featurizer import SpeechFeaturizer +from data_utils.speech import SpeechSegment +from data_utils.normalizer import FeatureNormalizer + + +class DataGenerator(object): + """ + DataGenerator provides basic audio data preprocessing pipeline, and offers + data reader interfaces of PaddlePaddle requirements. + + :param vocab_filepath: Vocabulary filepath for indexing tokenized + transcripts. + :type vocab_filepath: basestring + :param mean_std_filepath: File containing the pre-computed mean and stddev. + :type mean_std_filepath: None|basestring + :param augmentation_config: Augmentation configuration in json string. + For details, see AugmentationPipeline.__doc__. + :type augmentation_config: str + :param max_duration: Audio with duration (in seconds) greater than + this will be discarded. + :type max_duration: float + :param min_duration: Audio with duration (in seconds) smaller than + this will be discarded. + :type min_duration: float + :param stride_ms: Striding size (in milliseconds) for generating frames. + :type stride_ms: float + :param window_ms: Window size (in milliseconds) for generating frames. 
+ :type window_ms: float + :param max_freq: Used when specgram_type is 'linear', only FFT bins + corresponding to frequencies between [0, max_freq] are + returned. + :type max_freq: None|float + :param specgram_type: Specgram feature type. Options: 'linear'. + :type specgram_type: str + :param random_seed: Random seed. + :type random_seed: int + """ + + def __init__(self, + vocab_filepath, + mean_std_filepath, + augmentation_config='{}', + max_duration=float('inf'), + min_duration=0.0, + stride_ms=10.0, + window_ms=20.0, + max_freq=None, + specgram_type='linear', + random_seed=0): + self._max_duration = max_duration + self._min_duration = min_duration + self._normalizer = FeatureNormalizer(mean_std_filepath) + self._augmentation_pipeline = AugmentationPipeline( + augmentation_config=augmentation_config, random_seed=random_seed) + self._speech_featurizer = SpeechFeaturizer( + vocab_filepath=vocab_filepath, + specgram_type=specgram_type, + stride_ms=stride_ms, + window_ms=window_ms, + max_freq=max_freq) + self._rng = random.Random(random_seed) + self._epoch = 0 + + def batch_reader_creator(self, + manifest_path, + batch_size, + min_batch_size=1, + padding_to=-1, + flatten=False, + sortagrad=False, + shuffle_method="batch_shuffle"): + """ + Batch data reader creator for audio data. Return a callable generator + function to produce batches of data. + + Audio features within one batch will be padded with zeros to have the + same shape, or a user-defined shape. + + :param manifest_path: Filepath of manifest for audio files. + :type manifest_path: basestring + :param batch_size: Number of instances in a batch. + :type batch_size: int + :param min_batch_size: Any batch with batch size smaller than this will + be discarded. (To be deprecated in the future.) + :type min_batch_size: int + :param padding_to: If set -1, the maximum shape in the batch + will be used as the target shape for padding. + Otherwise, `padding_to` will be the target shape. 
+ :type padding_to: int + :param flatten: If set True, audio features will be flattened to 1darray. + :type flatten: bool + :param sortagrad: If set True, sort the instances by audio duration + in the first epoch to speed up training. + :type sortagrad: bool + :param shuffle_method: Shuffle method. Options: + '' or None: no shuffle. + 'instance_shuffle': instance-wise shuffle. + 'batch_shuffle': similarly-sized instances are + put into batches, and then + batch-wise shuffle the batches. + For more details, please see + ``_batch_shuffle.__doc__``. + 'batch_shuffle_clipped': 'batch_shuffle' with + head shift and tail + clipping. For more + details, please see + ``_batch_shuffle``. + If sortagrad is True, shuffle is disabled + for the first epoch. + :type shuffle_method: None|str + :return: Batch reader function, producing batches of data when called. + :rtype: callable + """ + + def batch_reader(): + # read manifest + manifest = utils.read_manifest( + manifest_path=manifest_path, + max_duration=self._max_duration, + min_duration=self._min_duration) + # sort (by duration) or batch-wise shuffle the manifest + if self._epoch == 0 and sortagrad: + manifest.sort(key=lambda x: x["duration"]) + else: + if shuffle_method == "batch_shuffle": + manifest = self._batch_shuffle( + manifest, batch_size, clipped=False) + elif shuffle_method == "batch_shuffle_clipped": + manifest = self._batch_shuffle( + manifest, batch_size, clipped=True) + elif shuffle_method == "instance_shuffle": + self._rng.shuffle(manifest) + elif not shuffle_method: + pass + else: + raise ValueError("Unknown shuffle method %s." 
% + shuffle_method) + # prepare batches + instance_reader = self._instance_reader_creator(manifest) + batch = [] + for instance in instance_reader(): + batch.append(instance) + if len(batch) == batch_size: + yield self._padding_batch(batch, padding_to, flatten) + batch = [] + if len(batch) >= min_batch_size: + yield self._padding_batch(batch, padding_to, flatten) + self._epoch += 1 + + return batch_reader + + @property + def feeding(self): + """Returns data reader's feeding dict. + + :return: Data feeding dict. + :rtype: dict + """ + return {"audio_spectrogram": 0, "transcript_text": 1} + + @property + def vocab_size(self): + """Return the vocabulary size. + + :return: Vocabulary size. + :rtype: int + """ + return self._speech_featurizer.vocab_size + + @property + def vocab_list(self): + """Return the vocabulary in list. + + :return: Vocabulary in list. + :rtype: list + """ + return self._speech_featurizer.vocab_list + + def _process_utterance(self, filename, transcript): + """Load, augment, featurize and normalize for speech data.""" + speech_segment = SpeechSegment.from_file(filename, transcript) + self._augmentation_pipeline.transform_audio(speech_segment) + specgram, text_ids = self._speech_featurizer.featurize(speech_segment) + specgram = self._normalizer.apply(specgram) + return specgram, text_ids + + def _instance_reader_creator(self, manifest): + """ + Instance reader creator. Create a callable function to produce + instances of data. + + Instance: a tuple of ndarray of audio spectrogram and a list of + token indices for transcript. + """ + + def reader(): + for instance in manifest: + yield self._process_utterance(instance["audio_filepath"], + instance["text"]) + + return reader + + def _padding_batch(self, batch, padding_to=-1, flatten=False): + """ + Padding audio features with zeros to make them have the same shape (or + a user-defined shape) within one batch. 
+ + If ``padding_to`` is -1, the maximum shape in the batch will be used + as the target shape for padding. Otherwise, `padding_to` will be the + target shape (only refers to the second axis). + + If `flatten` is True, features will be flattened to 1darray. + """ + new_batch = [] + # get target shape + max_length = max([audio.shape[1] for audio, text in batch]) + if padding_to != -1: + if padding_to < max_length: + raise ValueError("If padding_to is not -1, it should not be " + "smaller than any instance's shape in the batch") + max_length = padding_to + # padding + for audio, text in batch: + padded_audio = np.zeros([audio.shape[0], max_length]) + padded_audio[:, :audio.shape[1]] = audio + if flatten: + padded_audio = padded_audio.flatten() + new_batch.append((padded_audio, text)) + return new_batch + + def _batch_shuffle(self, manifest, batch_size, clipped=False): + """Put similarly-sized instances into minibatches for better efficiency + and make a batch-wise shuffle. + + 1. Sort the audio clips by duration. + 2. Generate a random number `k`, k in [0, batch_size). + 3. Randomly shift `k` instances in order to create different batches + for different epochs. Create minibatches. + 4. Shuffle the minibatches. + + :param manifest: Manifest contents. List of dict. + :type manifest: list + :param batch_size: Batch size. This size is also used to generate + a random number for batch shuffle. + :type batch_size: int + :param clipped: Whether to clip the heading (small shift) and trailing + (incomplete batch) instances. + :type clipped: bool + :return: Batch shuffled manifest. 
+ :rtype: list + """ + manifest.sort(key=lambda x: x["duration"]) + shift_len = self._rng.randint(0, batch_size - 1) + batch_manifest = zip(*[iter(manifest[shift_len:])] * batch_size) + self._rng.shuffle(batch_manifest) + batch_manifest = list(sum(batch_manifest, ())) + if not clipped: + res_len = len(manifest) - shift_len - len(batch_manifest) + batch_manifest.extend(manifest[len(manifest) - res_len:]) + batch_manifest.extend(manifest[0:shift_len]) + return batch_manifest diff --git a/deep_speech_2/data_utils/featurizer/__init__.py b/deep_speech_2/data_utils/featurizer/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/deep_speech_2/data_utils/featurizer/audio_featurizer.py b/deep_speech_2/data_utils/featurizer/audio_featurizer.py new file mode 100644 index 0000000000000000000000000000000000000000..9f9d4e505d13b4fcaf1c1411821163caa4b73bc8 --- /dev/null +++ b/deep_speech_2/data_utils/featurizer/audio_featurizer.py @@ -0,0 +1,106 @@ +"""Contains the audio featurizer class.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np +from data_utils import utils +from data_utils.audio import AudioSegment + + +class AudioFeaturizer(object): + """Audio featurizer, for extracting features from audio contents of + AudioSegment or SpeechSegment. + + Currently, it only supports feature type of linear spectrogram. + + :param specgram_type: Specgram feature type. Options: 'linear'. + :type specgram_type: str + :param stride_ms: Striding size (in milliseconds) for generating frames. + :type stride_ms: float + :param window_ms: Window size (in milliseconds) for generating frames. + :type window_ms: float + :param max_freq: Used when specgram_type is 'linear', only FFT bins + corresponding to frequencies between [0, max_freq] are + returned. 
+ :type max_freq: None|float + """ + + def __init__(self, + specgram_type='linear', + stride_ms=10.0, + window_ms=20.0, + max_freq=None): + self._specgram_type = specgram_type + self._stride_ms = stride_ms + self._window_ms = window_ms + self._max_freq = max_freq + + def featurize(self, audio_segment): + """Extract audio features from AudioSegment or SpeechSegment. + + :param audio_segment: Audio/speech segment to extract features from. + :type audio_segment: AudioSegment|SpeechSegment + :return: Spectrogram audio feature in 2darray. + :rtype: ndarray + """ + return self._compute_specgram(audio_segment.samples, + audio_segment.sample_rate) + + def _compute_specgram(self, samples, sample_rate): + """Extract various audio features.""" + if self._specgram_type == 'linear': + return self._compute_linear_specgram( + samples, sample_rate, self._stride_ms, self._window_ms, + self._max_freq) + else: + raise ValueError("Unknown specgram_type %s. " + "Supported values: linear." % self._specgram_type) + + def _compute_linear_specgram(self, + samples, + sample_rate, + stride_ms=10.0, + window_ms=20.0, + max_freq=None, + eps=1e-14): + """Compute the linear spectrogram from FFT energy.""" + if max_freq is None: + max_freq = sample_rate / 2 + if max_freq > sample_rate / 2: + raise ValueError("max_freq must not be greater than half of " + "sample rate.") + if stride_ms > window_ms: + raise ValueError("Stride size must not be greater than " + "window size.") + stride_size = int(0.001 * sample_rate * stride_ms) + window_size = int(0.001 * sample_rate * window_ms) + specgram, freqs = self._specgram_real( + samples, + window_size=window_size, + stride_size=stride_size, + sample_rate=sample_rate) + ind = np.where(freqs <= max_freq)[0][-1] + 1 + return np.log(specgram[:ind, :] + eps) + + def _specgram_real(self, samples, window_size, stride_size, sample_rate): + """Compute the spectrogram for samples from a real signal.""" + # extract strided windows + truncate_size = (len(samples) - 
window_size) % stride_size + samples = samples[:len(samples) - truncate_size] + nshape = (window_size, (len(samples) - window_size) // stride_size + 1) + nstrides = (samples.strides[0], samples.strides[0] * stride_size) + windows = np.lib.stride_tricks.as_strided( + samples, shape=nshape, strides=nstrides) + assert np.all( + windows[:, 1] == samples[stride_size:(stride_size + window_size)]) + # window weighting, squared Fast Fourier Transform (fft), scaling + weighting = np.hanning(window_size)[:, None] + fft = np.fft.rfft(windows * weighting, axis=0) + fft = np.absolute(fft)**2 + scale = np.sum(weighting**2) * sample_rate + fft[1:-1, :] *= (2.0 / scale) + fft[(0, -1), :] /= scale + # prepare fft frequency list + freqs = float(sample_rate) / window_size * np.arange(fft.shape[0]) + return fft, freqs diff --git a/deep_speech_2/data_utils/featurizer/speech_featurizer.py b/deep_speech_2/data_utils/featurizer/speech_featurizer.py new file mode 100644 index 0000000000000000000000000000000000000000..7702045597fb8379bffee2c31029ace4b2453f92 --- /dev/null +++ b/deep_speech_2/data_utils/featurizer/speech_featurizer.py @@ -0,0 +1,77 @@ +"""Contains the speech featurizer class.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from data_utils.featurizer.audio_featurizer import AudioFeaturizer +from data_utils.featurizer.text_featurizer import TextFeaturizer + + +class SpeechFeaturizer(object): + """Speech featurizer, for extracting features from both audio and transcript + contents of SpeechSegment. + + Currently, for audio parts, it only supports feature type of linear + spectrogram; for transcript parts, it only supports char-level tokenizing + and conversion into a list of token indices. Note that the token indexing + order follows the given vocabulary file. + + :param vocab_filepath: Filepath to load vocabulary for token indices + conversion. 
+ :type vocab_filepath: basestring + :param specgram_type: Specgram feature type. Options: 'linear'. + :type specgram_type: str + :param stride_ms: Striding size (in milliseconds) for generating frames. + :type stride_ms: float + :param window_ms: Window size (in milliseconds) for generating frames. + :type window_ms: float + :param max_freq: Used when specgram_type is 'linear', only FFT bins + corresponding to frequencies between [0, max_freq] are + returned. + :type max_freq: None|float + """ + + def __init__(self, + vocab_filepath, + specgram_type='linear', + stride_ms=10.0, + window_ms=20.0, + max_freq=None): + self._audio_featurizer = AudioFeaturizer(specgram_type, stride_ms, + window_ms, max_freq) + self._text_featurizer = TextFeaturizer(vocab_filepath) + + def featurize(self, speech_segment): + """Extract features for speech segment. + + 1. For audio parts, extract the audio features. + 2. For transcript parts, convert text string to a list of token indices + in char-level. + + :param speech_segment: Speech segment to extract features from. + :type speech_segment: SpeechSegment + :return: A tuple of 1) spectrogram audio feature in 2darray, 2) list of + char-level token indices. + :rtype: tuple + """ + audio_feature = self._audio_featurizer.featurize(speech_segment) + text_ids = self._text_featurizer.featurize(speech_segment.transcript) + return audio_feature, text_ids + + @property + def vocab_size(self): + """Return the vocabulary size. + + :return: Vocabulary size. + :rtype: int + """ + return self._text_featurizer.vocab_size + + @property + def vocab_list(self): + """Return the vocabulary in list. + + :return: Vocabulary in list. 
+ :rtype: list + """ + return self._text_featurizer.vocab_list diff --git a/deep_speech_2/data_utils/featurizer/text_featurizer.py b/deep_speech_2/data_utils/featurizer/text_featurizer.py new file mode 100644 index 0000000000000000000000000000000000000000..4f9a49b594010f91a64797b9a4b7e9054d4749d5 --- /dev/null +++ b/deep_speech_2/data_utils/featurizer/text_featurizer.py @@ -0,0 +1,67 @@ +"""Contains the text featurizer class.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os + + +class TextFeaturizer(object): + """Text featurizer, for processing or extracting features from text. + + Currently, it only supports char-level tokenizing and conversion into + a list of token indices. Note that the token indexing order follows the + given vocabulary file. + + :param vocab_filepath: Filepath to load vocabulary for token indices + conversion. + :type vocab_filepath: basestring + """ + + def __init__(self, vocab_filepath): + self._vocab_dict, self._vocab_list = self._load_vocabulary_from_file( + vocab_filepath) + + def featurize(self, text): + """Convert text string to a list of token indices in char-level. Note + that the token indexing order follows the given vocabulary file. + + :param text: Text to process. + :type text: basestring + :return: List of char-level token indices. + :rtype: list + """ + tokens = self._char_tokenize(text) + return [self._vocab_dict[token] for token in tokens] + + @property + def vocab_size(self): + """Return the vocabulary size. + + :return: Vocabulary size. + :rtype: int + """ + return len(self._vocab_list) + + @property + def vocab_list(self): + """Return the vocabulary in list. + + :return: Vocabulary in list. 
+ :rtype: list + """ + return self._vocab_list + + def _char_tokenize(self, text): + """Character tokenizer.""" + return list(text.strip()) + + def _load_vocabulary_from_file(self, vocab_filepath): + """Load vocabulary from file.""" + vocab_lines = [] + with open(vocab_filepath, 'r') as file: + vocab_lines.extend(file.readlines()) + vocab_list = [line[:-1] for line in vocab_lines] + vocab_dict = dict( + [(token, id) for (id, token) in enumerate(vocab_list)]) + return vocab_dict, vocab_list diff --git a/deep_speech_2/data_utils/normalizer.py b/deep_speech_2/data_utils/normalizer.py new file mode 100644 index 0000000000000000000000000000000000000000..c123d25d20600140b47da1e93655b15c0053dfea --- /dev/null +++ b/deep_speech_2/data_utils/normalizer.py @@ -0,0 +1,87 @@ +"""Contains feature normalizers.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np +import random +import data_utils.utils as utils +from data_utils.audio import AudioSegment + + +class FeatureNormalizer(object): + """Feature normalizer. Normalize features to be of zero mean and unit + stddev. + + If mean_std_filepath is provided (not None), the normalizer will directly + initialize from the file. Otherwise, both manifest_path and featurize_func + should be given for on-the-fly mean and stddev computing. + + :param mean_std_filepath: File containing the pre-computed mean and stddev. + :type mean_std_filepath: None|basestring + :param manifest_path: Manifest of instances for computing mean and stddev. + :type manifest_path: None|basestring + :param featurize_func: Function to extract features. It should be callable + with ``featurize_func(audio_segment)``. + :type featurize_func: None|callable + :param num_samples: Number of random samples for computing mean and stddev. + :type num_samples: int + :param random_seed: Random seed for sampling instances. 
+ :type random_seed: int + :raises ValueError: If both mean_std_filepath and manifest_path + (or both mean_std_filepath and featurize_func) are None. + """ + + def __init__(self, + mean_std_filepath, + manifest_path=None, + featurize_func=None, + num_samples=500, + random_seed=0): + if not mean_std_filepath: + if not (manifest_path and featurize_func): + raise ValueError("If mean_std_filepath is None, manifest_path " + "and featurize_func should not be None.") + self._rng = random.Random(random_seed) + self._compute_mean_std(manifest_path, featurize_func, num_samples) + else: + self._read_mean_std_from_file(mean_std_filepath) + + def apply(self, features, eps=1e-14): + """Normalize features to be of zero mean and unit stddev. + + :param features: Input features to be normalized. + :type features: ndarray + :param eps: Added to stddev to provide numerical stability. + :type eps: float + :return: Normalized features. + :rtype: ndarray + """ + return (features - self._mean) / (self._std + eps) + + def write_to_file(self, filepath): + """Write the mean and stddev to the file. + + :param filepath: File to write mean and stddev. 
+ :type filepath: basestring + """ + np.savez(filepath, mean=self._mean, std=self._std) + + def _read_mean_std_from_file(self, filepath): + """Load mean and std from file.""" + npzfile = np.load(filepath) + self._mean = npzfile["mean"] + self._std = npzfile["std"] + + def _compute_mean_std(self, manifest_path, featurize_func, num_samples): + """Compute mean and std from randomly sampled instances.""" + manifest = utils.read_manifest(manifest_path) + sampled_manifest = self._rng.sample(manifest, num_samples) + features = [] + for instance in sampled_manifest: + features.append( + featurize_func( + AudioSegment.from_file(instance["audio_filepath"]))) + features = np.hstack(features) + self._mean = np.mean(features, axis=1).reshape([-1, 1]) + self._std = np.std(features, axis=1).reshape([-1, 1]) diff --git a/deep_speech_2/data_utils/speech.py b/deep_speech_2/data_utils/speech.py new file mode 100644 index 0000000000000000000000000000000000000000..48db595b41b82933f9b5c16cab7d2ee24f9a2ecc --- /dev/null +++ b/deep_speech_2/data_utils/speech.py @@ -0,0 +1,75 @@ +"""Contains the speech segment class.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from data_utils.audio import AudioSegment + + +class SpeechSegment(AudioSegment): + """Speech segment abstraction, a subclass of AudioSegment, + with an additional transcript. + + :param samples: Audio samples [num_samples x num_channels]. + :type samples: ndarray.float32 + :param sample_rate: Audio sample rate. + :type sample_rate: int + :param transcript: Transcript text for the speech. + :type transcript: basestring + :raises TypeError: If the sample data type is not float or int. + """ + + def __init__(self, samples, sample_rate, transcript): + AudioSegment.__init__(self, samples, sample_rate) + self._transcript = transcript + + def __eq__(self, other): + """Return whether two objects are equal. 
+ """ + if not AudioSegment.__eq__(self, other): + return False + if self._transcript != other._transcript: + return False + return True + + def __ne__(self, other): + """Return whether two objects are unequal.""" + return not self.__eq__(other) + + @classmethod + def from_file(cls, filepath, transcript): + """Create speech segment from audio file and corresponding transcript. + + :param filepath: Filepath or file object to audio file. + :type filepath: basestring|file + :param transcript: Transcript text for the speech. + :type transcript: basestring + :return: Speech segment instance. + :rtype: SpeechSegment + """ + audio = AudioSegment.from_file(filepath) + return cls(audio.samples, audio.sample_rate, transcript) + + @classmethod + def from_bytes(cls, bytes, transcript): + """Create speech segment from a byte string and corresponding + transcript. + + :param bytes: Byte string containing audio samples. + :type bytes: str + :param transcript: Transcript text for the speech. + :type transcript: basestring + :return: Speech segment instance. + :rtype: SpeechSegment + """ + audio = AudioSegment.from_bytes(bytes) + return cls(audio.samples, audio.sample_rate, transcript) + + @property + def transcript(self): + """Return the transcript text. + + :return: Transcript text for the speech. + :rtype: basestring + """ + return self._transcript diff --git a/deep_speech_2/data_utils/utils.py b/deep_speech_2/data_utils/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..3f1165718aa0e2a0bf0687b8a613a6447b964ee8 --- /dev/null +++ b/deep_speech_2/data_utils/utils.py @@ -0,0 +1,34 @@ +"""Contains data helper functions.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import json + + +def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0): + """Load and parse manifest file. + + Instances with durations outside [min_duration, max_duration] will be + filtered out. 
+ + :param manifest_path: Manifest file to load and parse. + :type manifest_path: basestring + :param max_duration: Maximal duration in seconds for instance filter. + :type max_duration: float + :param min_duration: Minimal duration in seconds for instance filter. + :type min_duration: float + :return: Manifest parsing results. List of dict. + :rtype: list + :raises IOError: If failed to parse the manifest. + """ + manifest = [] + for json_line in open(manifest_path): + try: + json_data = json.loads(json_line) + except Exception as e: + raise IOError("Error reading manifest: %s" % str(e)) + if (json_data["duration"] <= max_duration and + json_data["duration"] >= min_duration): + manifest.append(json_data) + return manifest diff --git a/deep_speech_2/data/librispeech.py b/deep_speech_2/datasets/librispeech/librispeech.py similarity index 93% rename from deep_speech_2/data/librispeech.py rename to deep_speech_2/datasets/librispeech/librispeech.py index 653caa9267b62aa8415a26be2143de874bb15e88..87e52ae4aa286503d79f1326065831acfe6bf985 100644 --- a/deep_speech_2/data/librispeech.py +++ b/deep_speech_2/datasets/librispeech/librispeech.py @@ -1,13 +1,14 @@ -""" - Download, unpack and create manifest json files for the Librespeech dataset. +"""Prepare Librispeech ASR datasets. - A manifest is a json file summarizing filelist in a data set, with each line - containing the meta data (i.e. audio filepath, transcription text, audio - duration) of each audio file in the data set. +Download, unpack and create manifest files. +Manifest file is a json-format file with each line containing the +meta data (i.e. audio filepath, transcript and audio duration) +of each audio file in the data set. 
""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function -import paddle.v2 as paddle -from paddle.v2.dataset.common import md5file import distutils.util import os import wget @@ -15,6 +16,7 @@ import tarfile import argparse import soundfile import json +from paddle.v2.dataset.common import md5file DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech') @@ -35,8 +37,7 @@ MD5_TRAIN_CLEAN_100 = "2a93770f6d5c6c964bc36631d331a522" MD5_TRAIN_CLEAN_360 = "c0e676e450a7ff2f54aeade5171606fa" MD5_TRAIN_OTHER_500 = "d1a0fd59409feb2c614ce4d30c387708" -parser = argparse.ArgumentParser( - description='Downloads and prepare LibriSpeech dataset.') +parser = argparse.ArgumentParser(description=__doc__) parser.add_argument( "--target_dir", default=DATA_HOME + "/Libri", @@ -44,7 +45,7 @@ parser.add_argument( help="Directory to save the dataset. (default: %(default)s)") parser.add_argument( "--manifest_prefix", - default="manifest.libri", + default="manifest", type=str, help="Filepath prefix for output manifests. (default: %(default)s)") parser.add_argument( diff --git a/deep_speech_2/datasets/run_all.sh b/deep_speech_2/datasets/run_all.sh new file mode 100644 index 0000000000000000000000000000000000000000..ef2b721fbdc2a18fcbc208730189604e88d7ef2c --- /dev/null +++ b/deep_speech_2/datasets/run_all.sh @@ -0,0 +1,13 @@ +cd librispeech +python librispeech.py +if [ $? -ne 0 ]; then + echo "Prepare LibriSpeech failed. Terminated." + exit 1 +fi +cd - + +cat librispeech/manifest.train* | shuf > manifest.train +cat librispeech/manifest.dev-clean > manifest.dev +cat librispeech/manifest.test-clean > manifest.test + +echo "All done." 
diff --git a/deep_speech_2/data/eng_vocab.txt b/deep_speech_2/datasets/vocab/eng_vocab.txt similarity index 100% rename from deep_speech_2/data/eng_vocab.txt rename to deep_speech_2/datasets/vocab/eng_vocab.txt diff --git a/deep_speech_2/decoder.py b/deep_speech_2/decoder.py old mode 100755 new mode 100644 index 7c4b952636f3e94167bbd00880673a8dc5635803..77d950b8db072d539788fd1b2bc7ac0525ffa0f9 --- a/deep_speech_2/decoder.py +++ b/deep_speech_2/decoder.py @@ -1,14 +1,14 @@ -""" - CTC-like decoder utilitis. -""" +"""Contains various CTC decoders.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function -from itertools import groupby import numpy as np +from itertools import groupby def ctc_best_path_decode(probs_seq, vocabulary): - """ - Best path decoding, also called argmax decoding or greedy decoding. + """Best path decoding, also called argmax decoding or greedy decoding. Path consisting of the most probable tokens are further post-processed to remove consecutive repetitions and all blanks. @@ -37,8 +37,7 @@ def ctc_best_path_decode(probs_seq, vocabulary): def ctc_decode(probs_seq, vocabulary, method): - """ - CTC-like sequence decoding from a sequence of likelihood probablilites. + """CTC-like sequence decoding from a sequence of likelihood probabilities. :param probs_seq: 2-D list of probabilities over the vocabulary for each character. Each element is a list of float probabilities diff --git a/deep_speech_2/infer.py b/deep_speech_2/infer.py index 598c348b063c6b5fb98bd6f3b287f95d64ef121e..06449ab05c7960ec78acc9ce5bb664cf1058a845 100644 --- a/deep_speech_2/infer.py +++ b/deep_speech_2/infer.py @@ -1,17 +1,18 @@ -""" - Inference for a simplifed version of Baidu DeepSpeech2 model. 
-""" +"""Inferer for DeepSpeech2 model.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function -import paddle.v2 as paddle -import distutils.util import argparse import gzip -from audio_data_utils import DataGenerator +import distutils.util +import paddle.v2 as paddle +from data_utils.data import DataGenerator from model import deep_speech2 from decoder import ctc_decode +import utils -parser = argparse.ArgumentParser( - description='Simplified version of DeepSpeech2 inference.') +parser = argparse.ArgumentParser(description=__doc__) parser.add_argument( "--num_samples", default=10, @@ -38,13 +39,13 @@ parser.add_argument( type=distutils.util.strtobool, help="Use gpu or not. (default: %(default)s)") parser.add_argument( - "--normalizer_manifest_path", - default='data/manifest.libri.train-clean-100', + "--mean_std_filepath", + default='mean_std.npz', type=str, help="Manifest path for normalizer. (default: %(default)s)") parser.add_argument( "--decode_manifest_path", - default='data/manifest.libri.test-clean', + default='datasets/manifest.test', type=str, help="Manifest path for decoding. (default: %(default)s)") parser.add_argument( @@ -54,41 +55,33 @@ parser.add_argument( help="Model filepath. (default: %(default)s)") parser.add_argument( "--vocab_filepath", - default='data/eng_vocab.txt', + default='datasets/vocab/eng_vocab.txt', type=str, help="Vocabulary filepath. (default: %(default)s)") args = parser.parse_args() def infer(): - """ - Max-ctc-decoding for DeepSpeech2. 
- """ + """Max-ctc-decoding for DeepSpeech2.""" # initialize data generator data_generator = DataGenerator( vocab_filepath=args.vocab_filepath, - normalizer_manifest_path=args.normalizer_manifest_path, - normalizer_num_samples=200, - max_duration=20.0, - min_duration=0.0, - stride_ms=10, - window_ms=20) + mean_std_filepath=args.mean_std_filepath, + augmentation_config='{}') # create network config - dict_size = data_generator.vocabulary_size() - vocab_list = data_generator.vocabulary_list() + # paddle.data_type.dense_array is used for variable batch input. + # The size 161 * 161 is only a placeholder value and the real shape + # of input batch data will be inferred at inference time. audio_data = paddle.layer.data( - name="audio_spectrogram", - height=161, - width=2000, - type=paddle.data_type.dense_vector(322000)) + name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161)) text_data = paddle.layer.data( name="transcript_text", - type=paddle.data_type.integer_value_sequence(dict_size)) + type=paddle.data_type.integer_value_sequence(data_generator.vocab_size)) output_probs = deep_speech2( audio_data=audio_data, text_data=text_data, - dict_size=dict_size, + dict_size=data_generator.vocab_size, num_conv_layers=args.num_conv_layers, num_rnn_layers=args.num_rnn_layers, rnn_size=args.rnn_layer_size, @@ -99,36 +92,36 @@ def infer(): gzip.open(args.model_filepath)) # prepare infer data - feeding = data_generator.data_name_feeding() - test_batch_reader = data_generator.batch_reader_creator( + batch_reader = data_generator.batch_reader_creator( manifest_path=args.decode_manifest_path, batch_size=args.num_samples, - padding_to=2000, - flatten=True, - sort_by_duration=False, - shuffle=False) - infer_data = test_batch_reader().next() # run inference infer_results = paddle.infer( output_layer=output_probs, parameters=parameters, input=infer_data) - num_steps = len(infer_results) / 
len(infer_data) + num_steps = len(infer_results) // len(infer_data) probs_split = [ infer_results[i * num_steps:(i + 1) * num_steps] - for i in xrange(0, len(infer_data)) + for i in xrange(len(infer_data)) ] # decode and print for i, probs in enumerate(probs_split): output_transcription = ctc_decode( - probs_seq=probs, vocabulary=vocab_list, method="best_path") + probs_seq=probs, + vocabulary=data_generator.vocab_list, + method="best_path") target_transcription = ''.join( - [vocab_list[index] for index in infer_data[i][1]]) + [data_generator.vocab_list[index] for index in infer_data[i][1]]) print("Target Transcription: %s \nOutput Transcription: %s \n" % (target_transcription, output_transcription)) def main(): + utils.print_arguments(args) paddle.init(use_gpu=args.use_gpu, trainer_count=1) infer() diff --git a/deep_speech_2/model.py b/deep_speech_2/model.py index 13ff829b9a6b947253a40a1d3ea524de141bd9d1..cb0b4ecbba1a3fb435a5f625a54d6e5bebe689e0 100644 --- a/deep_speech_2/model.py +++ b/deep_speech_2/model.py @@ -1,11 +1,10 @@ -""" - A simplifed version of Baidu DeepSpeech2 model. -""" +"""Contains DeepSpeech2 model.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function import paddle.v2 as paddle -#TODO: add bidirectional rnn. - def conv_bn_layer(input, filter_size, num_channels_in, num_channels_out, stride, padding, act): diff --git a/deep_speech_2/train.py b/deep_speech_2/train.py index 957c24267ce24c917ca8437683d03eefec6636d5..c60a039b69d91a89eb20e83ec1e090c8600d47a3 100644 --- a/deep_speech_2/train.py +++ b/deep_speech_2/train.py @@ -1,22 +1,20 @@ -""" - Trainer for a simplifed version of Baidu DeepSpeech2 model. 
-""" +"""Trainer for DeepSpeech2 model.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function -import paddle.v2 as paddle -import distutils.util +import sys +import os import argparse import gzip import time -import sys +import distutils.util +import paddle.v2 as paddle from model import deep_speech2 -from audio_data_utils import DataGenerator -import numpy as np -import os +from data_utils.data import DataGenerator +import utils -#TODO: add WER metric - -parser = argparse.ArgumentParser( - description='Simplified version of DeepSpeech2 trainer.') +parser = argparse.ArgumentParser(description=__doc__) parser.add_argument( "--batch_size", default=32, type=int, help="Minibatch size.") parser.add_argument( @@ -51,32 +49,38 @@ parser.add_argument( help="Use gpu or not. (default: %(default)s)") parser.add_argument( "--use_sortagrad", - default=False, + default=True, type=distutils.util.strtobool, help="Use sortagrad or not. (default: %(default)s)") +parser.add_argument( + "--shuffle_method", + default='instance_shuffle', + type=str, + help="Shuffle method: 'instance_shuffle', 'batch_shuffle', " + "'batch_shuffle_batch'. (default: %(default)s)") parser.add_argument( "--trainer_count", default=4, type=int, help="Trainer number. (default: %(default)s)") parser.add_argument( - "--normalizer_manifest_path", - default='data/manifest.libri.train-clean-100', + "--mean_std_filepath", + default='mean_std.npz', type=str, help="Manifest path for normalizer. (default: %(default)s)") parser.add_argument( "--train_manifest_path", - default='data/manifest.libri.train-clean-100', + default='datasets/manifest.train', type=str, help="Manifest path for training. (default: %(default)s)") parser.add_argument( "--dev_manifest_path", - default='data/manifest.libri.dev-clean', + default='datasets/manifest.dev', type=str, help="Manifest path for validation. 
(default: %(default)s)") parser.add_argument( "--vocab_filepath", - default='data/eng_vocab.txt', + default='datasets/vocab/eng_vocab.txt', type=str, help="Vocabulary filepath. (default: %(default)s)") parser.add_argument( @@ -86,41 +90,42 @@ parser.add_argument( help="If set None, the training will start from scratch. " "Otherwise, the training will resume from " "the existing model of this path. (default: %(default)s)") +parser.add_argument( + "--augmentation_config", + default='{}', + type=str, + help="Augmentation configuration in JSON format. " + "(default: %(default)s)") args = parser.parse_args() def train(): - """ - DeepSpeech2 training. - """ + """DeepSpeech2 training.""" # initialize data generator def data_generator(): return DataGenerator( vocab_filepath=args.vocab_filepath, - normalizer_manifest_path=args.normalizer_manifest_path, - normalizer_num_samples=200, - max_duration=20.0, - min_duration=0.0, - stride_ms=10, - window_ms=20) + mean_std_filepath=args.mean_std_filepath, + augmentation_config=args.augmentation_config) train_generator = data_generator() test_generator = data_generator() + # create network config - dict_size = train_generator.vocabulary_size() # paddle.data_type.dense_array is used for variable batch input. - # the size 161 * 161 is only an placeholder value and the real shape - # of input batch data will be set at each batch. + # The size 161 * 161 is only a placeholder value and the real shape + # of input batch data will be inferred during training. 
audio_data = paddle.layer.data( name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161)) text_data = paddle.layer.data( name="transcript_text", - type=paddle.data_type.integer_value_sequence(dict_size)) + type=paddle.data_type.integer_value_sequence( + train_generator.vocab_size)) cost = deep_speech2( audio_data=audio_data, text_data=text_data, - dict_size=dict_size, + dict_size=train_generator.vocab_size, num_conv_layers=args.num_conv_layers, num_rnn_layers=args.num_rnn_layers, rnn_size=args.rnn_layer_size, @@ -143,13 +148,15 @@ def train(): train_batch_reader = train_generator.batch_reader_creator( manifest_path=args.train_manifest_path, batch_size=args.batch_size, - sortagrad=True if args.init_model_path is None else False, - batch_shuffle=True) + min_batch_size=args.trainer_count, + sortagrad=args.use_sortagrad if args.init_model_path is None else False, + shuffle_method=args.shuffle_method) test_batch_reader = test_generator.batch_reader_creator( manifest_path=args.dev_manifest_path, batch_size=args.batch_size, - batch_shuffle=False) - feeding = train_generator.data_name_feeding() + min_batch_size=1, # must be 1, but will have errors. 
+ sortagrad=False, + shuffle_method=None) # create event handler def event_handler(event): @@ -157,9 +164,9 @@ def train(): if isinstance(event, paddle.event.EndIteration): cost_sum += event.cost cost_counter += 1 - if event.batch_id % 50 == 0: - print "\nPass: %d, Batch: %d, TrainCost: %f" % ( - event.pass_id, event.batch_id, cost_sum / cost_counter) + if (event.batch_id + 1) % 100 == 0: + print("\nPass: %d, Batch: %d, TrainCost: %f" % ( + event.pass_id, event.batch_id + 1, cost_sum / cost_counter)) cost_sum, cost_counter = 0.0, 0 with gzip.open("params.tar.gz", 'w') as f: parameters.to_tar(f) @@ -170,19 +177,21 @@ def train(): start_time = time.time() cost_sum, cost_counter = 0.0, 0 if isinstance(event, paddle.event.EndPass): - result = trainer.test(reader=test_batch_reader, feeding=feeding) - print "\n------- Time: %d sec, Pass: %d, ValidationCost: %s" % ( - time.time() - start_time, event.pass_id, result.cost) + result = trainer.test( + reader=test_batch_reader, feeding=test_generator.feeding) + print("\n------- Time: %d sec, Pass: %d, ValidationCost: %s" % + (time.time() - start_time, event.pass_id, result.cost)) # run train trainer.train( reader=train_batch_reader, event_handler=event_handler, num_passes=args.num_passes, - feeding=feeding) + feeding=train_generator.feeding) def main(): + utils.print_arguments(args) paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count) train() diff --git a/deep_speech_2/utils.py b/deep_speech_2/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..9ca363c8f59c2b1cd2885db4b04605c0025998bf --- /dev/null +++ b/deep_speech_2/utils.py @@ -0,0 +1,25 @@ +"""Contains common utility functions.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + + +def print_arguments(args): + """Print argparse's arguments. + + Usage: + + .. 
code-block:: python + + parser = argparse.ArgumentParser() + parser.add_argument("name", default="John", type=str, help="User name.") + args = parser.parse_args() + print_arguments(args) + + :param args: Input argparse.Namespace for printing. + :type args: argparse.Namespace + """ + print("----- Configuration Arguments -----") + for arg, value in vars(args).iteritems(): + print("%s: %s" % (arg, value)) + print("------------------------------------") diff --git a/image_classification/README.md b/image_classification/README.md index a0990367ef8b03c70c29d285e22ef85907e1d0b7..94a0a1b70e9b31e7247eeb2765dc000cb6c19ea9 100644 --- a/image_classification/README.md +++ b/image_classification/README.md @@ -1 +1,223 @@ -TBD +Image Classification +======================= + +This example shows how to use the AlexNet, VGG, GoogLeNet and ResNet models for image classification in PaddlePaddle. For a description of the image classification problem and an introduction to these four models, see the [PaddlePaddle book](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification). + +## Training the Model + +### Initialization + +During initialization, import the required packages and initialize PaddlePaddle. + +```python +import gzip +import paddle.v2.dataset.flowers as flowers +import paddle.v2 as paddle +import reader +import vgg +import resnet +import alexnet +import googlenet + + +# PaddlePaddle init +paddle.init(use_gpu=False, trainer_count=1) +``` + +### Define Parameters and Inputs + +Set the algorithm parameters (such as the data dimension, the number of classes and the batch size), and define the data input layer `image` and the class label layer `lbl`. 
+ +```python +DATA_DIM = 3 * 224 * 224 +CLASS_DIM = 102 +BATCH_SIZE = 128 + +image = paddle.layer.data( + name="image", type=paddle.data_type.dense_vector(DATA_DIM)) +lbl = paddle.layer.data( + name="label", type=paddle.data_type.integer_value(CLASS_DIM)) +``` + +### Get the Model + +Choose one of AlexNet, VGG, GoogLeNet and ResNet for image classification. Calling the corresponding function returns the final Softmax layer of the network. + +1. Using AlexNet + +Given the input layer `image` and the number of classes `CLASS_DIM`, the following code returns the Softmax layer of AlexNet. + +```python +out = alexnet.alexnet(image, class_dim=CLASS_DIM) +``` + +2. 
Using VGG + +Depending on the number of layers, VGG comes in three variants: VGG13, VGG16 and VGG19. The code for VGG16 is: + +```python +out = vgg.vgg16(image, class_dim=CLASS_DIM) +``` + +Similarly, VGG13 and VGG19 are available through `vgg.vgg13` and `vgg.vgg19`, respectively. + +3. Using GoogLeNet + +During training, GoogLeNet uses two auxiliary classifiers to strengthen the gradient signal and provide additional regularization. `googlenet.googlenet` therefore returns three Softmax layers, as shown in the code below: + +```python +out, out1, out2 = googlenet.googlenet(image, class_dim=CLASS_DIM) +loss1 = paddle.layer.cross_entropy_cost( + input=out1, label=lbl, coeff=0.3) +paddle.evaluator.classification_error(input=out1, label=lbl) +loss2 = paddle.layer.cross_entropy_cost( + input=out2, label=lbl, coeff=0.3) +paddle.evaluator.classification_error(input=out2, label=lbl) +extra_layers = [loss1, loss2] +``` + +For each of the two auxiliary outputs, the code computes a loss and evaluates the classification error, then passes the losses as the `extra_layers` of the SGD trainer defined later. + +4. Using ResNet + +The ResNet model can be obtained with the following code: + +```python +out = resnet.resnet_imagenet(image, class_dim=CLASS_DIM) +``` + +### Define the Loss Function + +```python +cost = paddle.layer.classification_cost(input=out, label=lbl) +``` + +### Create the Parameters and Optimizer + +```python +# Create parameters +parameters = paddle.parameters.create(cost) + +# Create optimizer +optimizer = paddle.optimizer.Momentum( + momentum=0.9, + regularization=paddle.optimizer.L2Regularization(rate=0.0005 * + BATCH_SIZE), + learning_rate=0.001 / BATCH_SIZE, + learning_rate_decay_a=0.1, + learning_rate_decay_b=128000 * 35, + learning_rate_schedule="discexp", ) +``` + +The learning rate schedule is specified by `learning_rate_decay_a` ($a$ for short), `learning_rate_decay_b` ($b$ for short) and `learning_rate_schedule`. 
Here the learning rate is adjusted by discrete exponential decay using the formula below, where $n$ is the cumulative number of samples processed so far and $lr_{0}$ is the `learning_rate` set in the optimizer. + +$$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$ + + +### Define the Data Readers + +We first use the [flowers dataset](http://www.robots.ox.ac.uk/~vgg/data/flowers/102/index.html) to show how to define the input. The following code defines the readers for the flowers training and validation sets: + +```python +train_reader = paddle.batch( + paddle.reader.shuffle( + flowers.train(), + buf_size=1000), + batch_size=BATCH_SIZE) +test_reader = paddle.batch( + flowers.valid(), + batch_size=BATCH_SIZE) +``` + +To use other data, first create an image list file. `reader.py` defines how such a file is read: it parses the image path and class label from each line of the image list file. + 
+An image list file is a text file in which each line consists of an image path and a class label, separated by a tab character. Class labels are integers starting from 0. Below is an excerpt from an image list file: + +``` +dataset_100/train_images/n03982430_23191.jpeg 1 +dataset_100/train_images/n04461696_23653.jpeg 7 +dataset_100/train_images/n02441942_3170.jpeg 8 +dataset_100/train_images/n03733281_31716.jpeg 2 +dataset_100/train_images/n03424325_240.jpeg 0 +dataset_100/train_images/n02643566_75.jpeg 8 +``` + +Training requires separate image list files for the training set and the validation set. Assuming these two files are `train.list` and `val.list`, the data is read as follows: + +```python +train_reader = paddle.batch( + paddle.reader.shuffle( + reader.train_reader('train.list'), + buf_size=1000), + batch_size=BATCH_SIZE) +test_reader = paddle.batch( + reader.test_reader('val.list'), + batch_size=BATCH_SIZE) +``` + +### Define the Event Handler +```python +# End batch and end pass event handler +def event_handler(event): + if isinstance(event, paddle.event.EndIteration): + if event.batch_id % 1 == 0: + print "\nPass %d, Batch %d, Cost %f, %s" % ( + event.pass_id, event.batch_id, event.cost, event.metrics) + if isinstance(event, paddle.event.EndPass): + with gzip.open('params_pass_%d.tar.gz' % event.pass_id, 'w') as f: + parameters.to_tar(f) + + result = trainer.test(reader=test_reader) + print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics) +``` + +### Define the Trainer + +For AlexNet, VGG and ResNet, the trainer can be defined as follows: + +```python +# Create trainer +trainer = paddle.trainer.SGD( + cost=cost, + parameters=parameters, + update_equation=optimizer) +``` + +GoogLeNet has two additional output layers, so `extra_layers` must be specified, as shown below: + +```python +# Create trainer +trainer = paddle.trainer.SGD( + cost=cost, + parameters=parameters, + update_equation=optimizer, + extra_layers=extra_layers) +``` + +### Start 
Training + +```python +trainer.train( + reader=train_reader, num_passes=200, event_handler=event_handler) +``` + +## Applying the Model +Once the model is trained, the following code predicts the classes of given images. + +```python +# load parameters +with gzip.open('params_pass_10.tar.gz', 'r') as f: + parameters = paddle.parameters.Parameters.from_tar(f) + +file_list = [line.strip() for line in open(image_list_file)] 
+test_data = [(paddle.image.load_and_transform(image_file, 256, 224, False) + .flatten().astype('float32'), ) + for image_file in file_list] +probs = paddle.infer( + output_layer=out, parameters=parameters, input=test_data) +lab = np.argsort(-probs) +for file_name, result in zip(file_list, lab): + print "Label of %s is: %d" % (file_name, result[0]) +``` + +The code first loads the trained model from a file (using the parameters saved after pass 10 as an example), then reads the images listed in `image_list_file`. `image_list_file` is a text file with one image path per line. The code uses `paddle.infer` to predict the class of each image in `image_list_file` and prints the result. diff --git a/image_classification/alexnet.py b/image_classification/alexnet.py new file mode 100644 index 0000000000000000000000000000000000000000..5262a97faf4f87fcfe0e861b87dd207a4449782c --- /dev/null +++ b/image_classification/alexnet.py @@ -0,0 +1,50 @@ +import paddle.v2 as paddle + +__all__ = ['alexnet'] + + +def alexnet(input, class_dim): + conv1 = paddle.layer.img_conv( + input=input, + filter_size=11, + num_channels=3, + num_filters=96, + stride=4, + padding=1) + cmrnorm1 = paddle.layer.img_cmrnorm( + input=conv1, size=5, scale=0.0001, power=0.75) + pool1 = paddle.layer.img_pool(input=cmrnorm1, pool_size=3, stride=2) + + conv2 = paddle.layer.img_conv( + input=pool1, + filter_size=5, + num_filters=256, + stride=1, + padding=2, + groups=1) + cmrnorm2 = paddle.layer.img_cmrnorm( + input=conv2, size=5, scale=0.0001, power=0.75) + pool2 = paddle.layer.img_pool(input=cmrnorm2, pool_size=3, stride=2) + + pool3 = paddle.networks.img_conv_group( + input=pool2, + pool_size=3, + pool_stride=2, + conv_num_filter=[384, 384, 256], + conv_filter_size=3, + pool_type=paddle.pooling.Max()) + + fc1 = paddle.layer.fc( + input=pool3, + size=4096, + act=paddle.activation.Relu(), + layer_attr=paddle.attr.Extra(drop_rate=0.5)) + fc2 = paddle.layer.fc( + input=fc1, + size=4096, + 
act=paddle.activation.Relu(), + layer_attr=paddle.attr.Extra(drop_rate=0.5)) + + out = paddle.layer.fc( + input=fc2, size=class_dim, act=paddle.activation.Softmax()) + return out diff --git 
a/image_classification/googlenet.py b/image_classification/googlenet.py new file mode 100644 index 0000000000000000000000000000000000000000..474f948f02262cd6c78136afcab4f30c59ae8199 --- /dev/null +++ b/image_classification/googlenet.py @@ -0,0 +1,180 @@ +import paddle.v2 as paddle + +__all__ = ['googlenet'] + + +def inception(name, input, channels, filter1, filter3R, filter3, filter5R, + filter5, proj): + cov1 = paddle.layer.img_conv( + name=name + '_1', + input=input, + filter_size=1, + num_channels=channels, + num_filters=filter1, + stride=1, + padding=0) + + cov3r = paddle.layer.img_conv( + name=name + '_3r', + input=input, + filter_size=1, + num_channels=channels, + num_filters=filter3R, + stride=1, + padding=0) + cov3 = paddle.layer.img_conv( + name=name + '_3', + input=cov3r, + filter_size=3, + num_filters=filter3, + stride=1, + padding=1) + + cov5r = paddle.layer.img_conv( + name=name + '_5r', + input=input, + filter_size=1, + num_channels=channels, + num_filters=filter5R, + stride=1, + padding=0) + cov5 = paddle.layer.img_conv( + name=name + '_5', + input=cov5r, + filter_size=5, + num_filters=filter5, + stride=1, + padding=2) + + pool1 = paddle.layer.img_pool( + name=name + '_max', + input=input, + pool_size=3, + num_channels=channels, + stride=1, + padding=1) + covprj = paddle.layer.img_conv( + name=name + '_proj', + input=pool1, + filter_size=1, + num_filters=proj, + stride=1, + padding=0) + + cat = paddle.layer.concat(name=name, input=[cov1, cov3, cov5, covprj]) + return cat + + +def googlenet(input, class_dim): + # stage 1 + conv1 = paddle.layer.img_conv( + name="conv1", + input=input, + filter_size=7, + num_channels=3, + num_filters=64, + stride=2, + padding=3) + pool1 = paddle.layer.img_pool( + name="pool1", input=conv1, pool_size=3, num_channels=64, stride=2) + + # stage 2 + conv2_1 = paddle.layer.img_conv( + name="conv2_1", + input=pool1, + filter_size=1, + num_filters=64, + stride=1, + padding=0) + conv2_2 = paddle.layer.img_conv( + name="conv2_2", 
+ input=conv2_1, + filter_size=3, + num_filters=192, + stride=1, + padding=1) + pool2 = paddle.layer.img_pool( + name="pool2", input=conv2_2, pool_size=3, num_channels=192, stride=2) + + # stage 3 + ince3a = inception("ince3a", pool2, 192, 64, 96, 128, 16, 32, 32) + ince3b = inception("ince3b", ince3a, 256, 128, 128, 192, 32, 96, 64) + pool3 = paddle.layer.img_pool( + name="pool3", input=ince3b, num_channels=480, pool_size=3, stride=2) + + # stage 4 + ince4a = inception("ince4a", pool3, 480, 192, 96, 208, 16, 48, 64) + ince4b = inception("ince4b", ince4a, 512, 160, 112, 224, 24, 64, 64) + ince4c = inception("ince4c", ince4b, 512, 128, 128, 256, 24, 64, 64) + ince4d = inception("ince4d", ince4c, 512, 112, 144, 288, 32, 64, 64) + ince4e = inception("ince4e", ince4d, 528, 256, 160, 320, 32, 128, 128) + pool4 = paddle.layer.img_pool( + name="pool4", input=ince4e, num_channels=832, pool_size=3, stride=2) + + # stage 5 + ince5a = inception("ince5a", pool4, 832, 256, 160, 320, 32, 128, 128) + ince5b = inception("ince5b", ince5a, 832, 384, 192, 384, 48, 128, 128) + pool5 = paddle.layer.img_pool( + name="pool5", + input=ince5b, + num_channels=1024, + pool_size=7, + stride=7, + pool_type=paddle.pooling.Avg()) + dropout = paddle.layer.addto( + input=pool5, + layer_attr=paddle.attr.Extra(drop_rate=0.4), + act=paddle.activation.Linear()) + + out = paddle.layer.fc( + input=dropout, size=class_dim, act=paddle.activation.Softmax()) + + # fc for output 1 + pool_o1 = paddle.layer.img_pool( + name="pool_o1", + input=ince4a, + num_channels=512, + pool_size=5, + stride=3, + pool_type=paddle.pooling.Avg()) + conv_o1 = paddle.layer.img_conv( + name="conv_o1", + input=pool_o1, + filter_size=1, + num_filters=128, + stride=1, + padding=0) + fc_o1 = paddle.layer.fc( + name="fc_o1", + input=conv_o1, + size=1024, + layer_attr=paddle.attr.Extra(drop_rate=0.7), + act=paddle.activation.Relu()) + out1 = paddle.layer.fc( + input=fc_o1, size=class_dim, act=paddle.activation.Softmax()) + + # fc for 
output 2 + pool_o2 = paddle.layer.img_pool( + name="pool_o2", + input=ince4d, + num_channels=528, + pool_size=5, + stride=3, + pool_type=paddle.pooling.Avg()) + conv_o2 = paddle.layer.img_conv( + name="conv_o2", + input=pool_o2, + filter_size=1, + num_filters=128, + stride=1, + padding=0) + fc_o2 = paddle.layer.fc( + name="fc_o2", + input=conv_o2, + size=1024, + layer_attr=paddle.attr.Extra(drop_rate=0.7), + act=paddle.activation.Relu()) + out2 = paddle.layer.fc( + input=fc_o2, size=class_dim, act=paddle.activation.Softmax()) + + return out, out1, out2 diff --git a/image_classification/infer.py b/image_classification/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..659c4f2a8e997c9060934cb81c783dd900854d5e --- /dev/null +++ b/image_classification/infer.py @@ -0,0 +1,68 @@ +import gzip +import paddle.v2 as paddle +import reader +import vgg +import resnet +import alexnet +import googlenet +import argparse +import os +from PIL import Image +import numpy as np + +WIDTH = 224 +HEIGHT = 224 +DATA_DIM = 3 * WIDTH * HEIGHT +CLASS_DIM = 102 + + +def main(): + # parse the argument + parser = argparse.ArgumentParser() + parser.add_argument( + 'data_list', + help='The path of data list file, which consists of one image path per line' + ) + parser.add_argument( + 'model', + help='The model for image classification', + choices=['alexnet', 'vgg13', 'vgg16', 'vgg19', 'resnet', 'googlenet']) + parser.add_argument( + 'params_path', help='The file which stores the parameters') + args = parser.parse_args() + + # PaddlePaddle init + paddle.init(use_gpu=True, trainer_count=1) + + image = paddle.layer.data( + name="image", type=paddle.data_type.dense_vector(DATA_DIM)) + + if args.model == 'alexnet': + out = alexnet.alexnet(image, class_dim=CLASS_DIM) + elif args.model == 'vgg13': + out = vgg.vgg13(image, class_dim=CLASS_DIM) + elif args.model == 'vgg16': + out = vgg.vgg16(image, class_dim=CLASS_DIM) + elif args.model == 'vgg19': + out = vgg.vgg19(image, 
class_dim=CLASS_DIM) + elif args.model == 'resnet': + out = resnet.resnet_imagenet(image, class_dim=CLASS_DIM) + elif args.model == 'googlenet': + out, _, _ = googlenet.googlenet(image, class_dim=CLASS_DIM) + + # load parameters + with gzip.open(args.params_path, 'r') as f: + parameters = paddle.parameters.Parameters.from_tar(f) + + file_list = [line.strip() for line in open(args.data_list)] + test_data = [(paddle.image.load_and_transform(image_file, 256, 224, False) + .flatten().astype('float32'), ) for image_file in file_list] + probs = paddle.infer( + output_layer=out, parameters=parameters, input=test_data) + lab = np.argsort(-probs) + for file_name, result in zip(file_list, lab): + print "Label of %s is: %d" % (file_name, result[0]) + + +if __name__ == '__main__': + main() diff --git a/image_classification/reader.py b/image_classification/reader.py index b58807e3a3e5e1449e52f0eb5e8040a17df37b81..b6bad1a24c36d7ae182a4522dcd9d94b35f6ae3c 100644 --- a/image_classification/reader.py +++ b/image_classification/reader.py @@ -1,44 +1,51 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and -# limitations under the License - import random from paddle.v2.image import load_and_transform +import paddle.v2 as paddle +from multiprocessing import cpu_count + + +def train_mapper(sample): + ''' + Map an image path to the flattened float32 array needed by the model input layer (training set, random crop and flip). + ''' + img, label = sample + img = paddle.image.load_image(img) + img = paddle.image.simple_transform(img, 256, 224, True) + return img.flatten().astype('float32'), label + + +def test_mapper(sample): + ''' + Map an image path to the flattened float32 array needed by the model input layer (test set, center crop only). + ''' + img, label = sample + img = paddle.image.load_image(img) + img = paddle.image.simple_transform(img, 256, 224, False) + return img.flatten().astype('float32'), label -def train_reader(train_list): +def train_reader(train_list, buffered_size=1024): def reader(): with open(train_list, 'r') as f: lines = [line.strip() for line in f] - random.shuffle(lines) for line in lines: img_path, lab = line.strip().split('\t') - im = load_and_transform(img_path, 256, 224, True) - yield im.flatten().astype('float32'), int(lab) + yield img_path, int(lab) - return reader + return paddle.reader.xmap_readers(train_mapper, reader, + cpu_count(), buffered_size) -def test_reader(test_list): +def test_reader(test_list, buffered_size=1024): def reader(): with open(test_list, 'r') as f: lines = [line.strip() for line in f] for line in lines: img_path, lab = line.strip().split('\t') - im = load_and_transform(img_path, 256, 224, False) - yield im.flatten().astype('float32'), int(lab) + yield img_path, int(lab) - return reader + return paddle.reader.xmap_readers(test_mapper, reader, + cpu_count(), buffered_size) if __name__ == '__main__': diff --git a/image_classification/resnet.py b/image_classification/resnet.py new file mode 100644 index 
0000000000000000000000000000000000000000..5a9f24322ce9b6a8ab6f3ab955fc47a7457400aa --- /dev/null +++ b/image_classification/resnet.py @@ 
-0,0 +1,95 @@ +import paddle.v2 as paddle + +__all__ = ['resnet_imagenet', 'resnet_cifar10'] + + +def conv_bn_layer(input, + ch_out, + filter_size, + stride, + padding, + active_type=paddle.activation.Relu(), + ch_in=None): + tmp = paddle.layer.img_conv( + input=input, + filter_size=filter_size, + num_channels=ch_in, + num_filters=ch_out, + stride=stride, + padding=padding, + act=paddle.activation.Linear(), + bias_attr=False) + return paddle.layer.batch_norm(input=tmp, act=active_type) + + +def shortcut(input, ch_in, ch_out, stride): + if ch_in != ch_out: + return conv_bn_layer(input, ch_out, 1, stride, 0, + paddle.activation.Linear()) + else: + return input + + +def basicblock(input, ch_in, ch_out, stride): + short = shortcut(input, ch_in, ch_out, stride) + conv1 = conv_bn_layer(input, ch_out, 3, stride, 1) + conv2 = conv_bn_layer(conv1, ch_out, 3, 1, 1, paddle.activation.Linear()) + return paddle.layer.addto( + input=[short, conv2], act=paddle.activation.Relu()) + + +def bottleneck(input, ch_in, ch_out, stride): + short = shortcut(input, ch_in, ch_out * 4, stride) + conv1 = conv_bn_layer(input, ch_out, 1, stride, 0) + conv2 = conv_bn_layer(conv1, ch_out, 3, 1, 1) + conv3 = conv_bn_layer(conv2, ch_out * 4, 1, 1, 0, + paddle.activation.Linear()) + return paddle.layer.addto( + input=[short, conv3], act=paddle.activation.Relu()) + + +def layer_warp(block_func, input, ch_in, ch_out, count, stride): + conv = block_func(input, ch_in, ch_out, stride) + for i in range(1, count): + conv = block_func(conv, ch_out, ch_out, 1) + return conv + + +def resnet_imagenet(input, class_dim, depth=50): + cfg = { + 18: ([2, 2, 2, 2], basicblock), + 34: ([3, 4, 6, 3], basicblock), + 50: ([3, 4, 6, 3], bottleneck), + 101: ([3, 4, 23, 3], bottleneck), + 152: ([3, 8, 36, 3], bottleneck) + } + stages, block_func = cfg[depth] + conv1 = conv_bn_layer( + input, ch_in=3, ch_out=64, filter_size=7, stride=2, padding=3) + pool1 = paddle.layer.img_pool(input=conv1, pool_size=3, stride=2) + 
res1 = 
layer_warp(block_func, pool1, 64, 64, stages[0], 1) + res2 = layer_warp(block_func, res1, 64, 128, stages[1], 2) + res3 = layer_warp(block_func, res2, 128, 256, stages[2], 2) + res4 = layer_warp(block_func, res3, 256, 512, stages[3], 2) + pool2 = paddle.layer.img_pool( + input=res4, pool_size=7, stride=1, pool_type=paddle.pooling.Avg()) + out = paddle.layer.fc( + input=pool2, size=class_dim, act=paddle.activation.Softmax()) + return out + + +def resnet_cifar10(input, class_dim, depth=32): + # depth should be one of 20, 32, 44, 56, 110, 1202 + assert (depth - 2) % 6 == 0 + n = (depth - 2) / 6 + nStages = {16, 64, 128} + conv1 = conv_bn_layer( + input, ch_in=3, ch_out=16, filter_size=3, stride=1, padding=1) + res1 = layer_warp(basicblock, conv1, 16, 16, n, 1) + res2 = layer_warp(basicblock, res1, 16, 32, n, 2) + res3 = layer_warp(basicblock, res2, 32, 64, n, 2) + pool = paddle.layer.img_pool( + input=res3, pool_size=8, stride=1, pool_type=paddle.pooling.Avg()) + out = paddle.layer.fc( + input=pool, size=class_dim, act=paddle.activation.Softmax()) + return out diff --git a/image_classification/train.py b/image_classification/train.py old mode 100644 new mode 100755 index d917bd8019bf31bf31133d7fb03247ad3b495af7..37f5deec28418c3aa523b0969f99d583bd34f161 --- a/image_classification/train.py +++ b/image_classification/train.py @@ -1,40 +1,58 @@ -# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and
-# limitations under the License
-
 import gzip
-
+import paddle.v2.dataset.flowers as flowers
 import paddle.v2 as paddle
 import reader
 import vgg
+import resnet
+import alexnet
+import googlenet
+import argparse

 DATA_DIM = 3 * 224 * 224
-CLASS_DIM = 1000
+CLASS_DIM = 102
 BATCH_SIZE = 128


 def main():
+    # parse the argument
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        'model',
+        help='The model for image classification',
+        choices=['alexnet', 'vgg13', 'vgg16', 'vgg19', 'resnet', 'googlenet'])
+    args = parser.parse_args()
     # PaddlePaddle init
-    paddle.init(use_gpu=True, trainer_count=4)
+    paddle.init(use_gpu=True, trainer_count=1)

     image = paddle.layer.data(
         name="image", type=paddle.data_type.dense_vector(DATA_DIM))
     lbl = paddle.layer.data(
         name="label", type=paddle.data_type.integer_value(CLASS_DIM))
-    net = vgg.vgg13(image)
-    out = paddle.layer.fc(
-        input=net, size=CLASS_DIM, act=paddle.activation.Softmax())
+
+    extra_layers = None
+    learning_rate = 0.01
+    if args.model == 'alexnet':
+        out = alexnet.alexnet(image, class_dim=CLASS_DIM)
+    elif args.model == 'vgg13':
+        out = vgg.vgg13(image, class_dim=CLASS_DIM)
+    elif args.model == 'vgg16':
+        out = vgg.vgg16(image, class_dim=CLASS_DIM)
+    elif args.model == 'vgg19':
+        out = vgg.vgg19(image, class_dim=CLASS_DIM)
+    elif args.model == 'resnet':
+        out = resnet.resnet_imagenet(image, class_dim=CLASS_DIM)
+        learning_rate = 0.1
+    elif args.model == 'googlenet':
+        out, out1, out2 = googlenet.googlenet(image, class_dim=CLASS_DIM)
+        loss1 = paddle.layer.cross_entropy_cost(
+            input=out1, label=lbl, coeff=0.3)
+        paddle.evaluator.classification_error(input=out1, label=lbl)
+        loss2 = paddle.layer.cross_entropy_cost(
+            input=out2, label=lbl, coeff=0.3)
+        paddle.evaluator.classification_error(input=out2, label=lbl)
+        extra_layers = [loss1, loss2]
+
     cost = paddle.layer.classification_cost(input=out, label=lbl)

     # Create parameters
@@ -45,16 +63,23 @@ def main():
         momentum=0.9,
         regularization=paddle.optimizer.L2Regularization(rate=0.0005 *
                                                          BATCH_SIZE),
-        learning_rate=0.01 / BATCH_SIZE,
+        learning_rate=learning_rate / BATCH_SIZE,
         learning_rate_decay_a=0.1,
         learning_rate_decay_b=128000 * 35,
         learning_rate_schedule="discexp", )

     train_reader = paddle.batch(
-        paddle.reader.shuffle(reader.test_reader("train.list"), buf_size=1000),
+        paddle.reader.shuffle(
+            flowers.train(),
+            # To use other data, replace the above line with:
+            # reader.train_reader('train.list'),
+            buf_size=1000),
         batch_size=BATCH_SIZE)
     test_reader = paddle.batch(
-        reader.train_reader("test.list"), batch_size=BATCH_SIZE)
+        flowers.valid(),
+        # To use other data, replace the above line with:
+        # reader.test_reader('val.list'),
+        batch_size=BATCH_SIZE)

     # End batch and end pass event handler
     def event_handler(event):
@@ -71,11 +96,14 @@ def main():

     # Create trainer
     trainer = paddle.trainer.SGD(
-        cost=cost, parameters=parameters, update_equation=optimizer)
+        cost=cost,
+        parameters=parameters,
+        update_equation=optimizer,
+        extra_layers=extra_layers)

     trainer.train(
         reader=train_reader, num_passes=200, event_handler=event_handler)


 if __name__ == '__main__':
-    main()
+    main()
\ No newline at end of file
diff --git a/image_classification/vgg.py b/image_classification/vgg.py
index e21504ab54378b1a5dad35df87503a25ddaf99ea..c6ec79a8d1e4af370bdfa599092ea67b864f8c65 100644
--- a/image_classification/vgg.py
+++ b/image_classification/vgg.py
@@ -1,23 +1,9 @@
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
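-# See the License for the specific language governing permissions and
-# limitations under the License.
-
 import paddle.v2 as paddle

 __all__ = ['vgg13', 'vgg16', 'vgg19']


-def vgg(input, nums):
+def vgg(input, nums, class_dim):
     def conv_block(input, num_filter, groups, num_channels=None):
         return paddle.networks.img_conv_group(
             input=input,
@@ -48,19 +34,21 @@ def vgg(input, nums):
         size=fc_dim,
         act=paddle.activation.Relu(),
         layer_attr=paddle.attr.Extra(drop_rate=0.5))
-    return fc2
+    out = paddle.layer.fc(
+        input=fc2, size=class_dim, act=paddle.activation.Softmax())
+    return out


-def vgg13(input):
+def vgg13(input, class_dim):
     nums = [2, 2, 2, 2, 2]
-    return vgg(input, nums)
+    return vgg(input, nums, class_dim)


-def vgg16(input):
+def vgg16(input, class_dim):
     nums = [2, 2, 3, 3, 3]
-    return vgg(input, nums)
+    return vgg(input, nums, class_dim)


-def vgg19(input):
+def vgg19(input, class_dim):
     nums = [2, 2, 4, 4, 4]
-    return vgg(input, nums)
+    return vgg(input, nums, class_dim)
diff --git a/language_model/README.md b/language_model/README.md
index a0990367ef8b03c70c29d285e22ef85907e1d0b7..75c3417ef2308acb15641f570d1c63f9f1366299 100644
--- a/language_model/README.md
+++ b/language_model/README.md
@@ -1 +1,200 @@
-TBD
+# Language Model
+
+## Introduction
+A language model (LM) is a probability distribution model; put simply, it computes the probability of a sentence. With it, one can determine which word sequence is more likely or, given several words, predict the most likely next word. The language model is an important foundational model in natural language processing.
+
+## Application Scenarios
+**Language models are used in many areas**, for example:
+
+* **Automatic writing**: a language model can generate the next word from the preceding text; applied recursively, it can generate whole sentences, paragraphs, and documents.
+* **QA**: a language model can generate an Answer from a Question.
+* **Machine translation**: most current mainstream machine translation models follow the Encoder-Decoder pattern, in which the Decoder is a language model that generates the target language.
+* **Spell checking**: a language model can compute the probability of a word sequence; the probability usually drops sharply at a misspelling, which can be used to detect spelling errors and offer candidate corrections.
+* **Part-of-speech tagging, syntactic parsing, speech recognition, ...**
+
+## About This Example
+Common ways to implement a language model include N-Gram, RNN, and seq2seq. This example implements N-Gram-based and RNN-based language models. **The file structure of this example is as follows** (the `images` folder is unrelated to usage and can be ignored):
+
+
+```text
+.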
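+├── data                  # toy/demo data; users can format their own data accordingly
+│   ├── chinese.test.txt  # demo data for testing
+│   ├── chinese.train.txt # demo data for training
+│   └── input.txt         # demo input data for inference
+├── config.py             # configuration file, including data, train, and infer settings
+├── infer.py              # inference script, i.e. text generation
+├── network_conf.py       # all network structures used in this example are defined here; edit this file to modify the model structure
+├── reader.py             # data reading interface
+├── README.md             # documentation
+├── train.py              # training script
+└── utils.py              # common utility functions, e.g. building and loading dictionaries
+```
+
+**Note: N-Gram-based language models generally perform worse than RNN-based ones, so RNN-based language models are recommended in practice. This example therefore focuses on the RNN-based model and covers the N-Gram-based model only briefly.**
+
+## RNN Language Model
+### Introduction
+
+An RNN is a sequence model. The basic idea: at time step t, the hidden-layer output from time step t-1 and the word vector at time step t are fed together into the hidden layer to obtain the feature representation at time step t; this representation is then used to produce the prediction at time step t, and the process recurs along the time dimension. RNNs are therefore good at using preceding context and historical knowledge, and have a "memory" capability. In theory an RNN can capture "long-term dependencies" (that is, use knowledge from long ago), but in practice the results are not ideal, which led to many RNN variants such as the commonly used LSTM and GRU. These improve the cell of the traditional RNN to make up for its shortcomings; this example uses LSTM and GRU. The figure below illustrates the "recurrent" idea of an RNN language model (RNN in the broad sense, including LSTM, GRU, and so on):
+
+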
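The recurrence described in the RNN introduction can be sketched in plain Python. This is a minimal illustration only, assuming a vanilla (Elman-style) RNN with a tanh hidden layer; the names `rnn_step` and `run_rnn` and the toy weights are hypothetical and are not taken from this example's `network_conf.py`, which relies on PaddlePaddle's LSTM and GRU layers instead.

```python
import math


def rnn_step(x_t, h_prev, w_xh, w_hh, b_h):
    """Compute h_t = tanh(W_xh * x_t + W_hh * h_prev + b_h) for one time step."""
    h_t = []
    for i in range(len(b_h)):
        total = b_h[i]
        # Contribution of the current word vector.
        total += sum(w * x for w, x in zip(w_xh[i], x_t))
        # Contribution of the previous hidden state (the "memory").
        total += sum(w * h for w, h in zip(w_hh[i], h_prev))
        h_t.append(math.tanh(total))
    return h_t


def run_rnn(inputs, h0, w_xh, w_hh, b_h):
    """Apply rnn_step along the time axis and return every hidden state."""
    states = []
    h = h0
    for x_t in inputs:
        h = rnn_step(x_t, h, w_xh, w_hh, b_h)
        states.append(h)
    return states


if __name__ == "__main__":
    # Hypothetical toy weights: 2-dimensional word vectors, 2 hidden units.
    w_xh = [[0.5, -0.3], [0.1, 0.8]]
    w_hh = [[0.2, 0.0], [0.0, 0.2]]
    b_h = [0.0, 0.0]
    word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for t, h in enumerate(run_rnn(word_vectors, [0.0, 0.0], w_xh, w_hh, b_h)):
        print(t, [round(v, 4) for v in h])
```

Each hidden state depends on the current input and on the previous hidden state, which is how earlier words influence later predictions. LSTM and GRU keep the same loop over time steps but replace the single tanh update with gated cells.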


-
-Figure 1. DNN text classification model
-
-
-Figure 2. CNN text classification model
+
+Figure 1. The DNN text classification model in this example
+
+Figure 2. The CNN text classification model in this example
+