diff --git a/README.md b/README.md
index 3010c0e536da732f1c4f042c82badaae21179f87..9d39903b53844a9143fa83fd3d39315bf50ea126 100644
--- a/README.md
+++ b/README.md
@@ -2,14 +2,19 @@
 ## Installation
 
-Please replace `$PADDLE_INSTALL_DIR` with your own paddle installation directory.
+### Prerequisites
+
+ - Only **Python 2.7** is supported.
+ - **cuDNN >= 6.0** is required to run PaddlePaddle on NVIDIA GPUs, together with a **CUDA toolkit** version compatible with the installed cuDNN. cuDNN versions below 6.0 are known to cause a fatal error in batch normalization when handling long utterances during inference.
+
+### Setup
 
 ```
 sh setup.sh
 export LD_LIBRARY_PATH=$PADDLE_INSTALL_DIR/Paddle/third_party/install/warpctc/lib:$LD_LIBRARY_PATH
 ```
 
-For some machines, we also need to install libsndfile1. Details to be added.
+Please replace `$PADDLE_INSTALL_DIR` with your own PaddlePaddle installation directory.
 
 ## Usage
@@ -35,13 +40,13 @@ python datasets/librispeech/librispeech.py --help
 
 ### Preparing for Training
 
 ```
-python compute_mean_std.py
+python tools/compute_mean_std.py
 ```
 
 It will compute the mean and standard deviation of the audio features, and save them to a file with the default name `./mean_std.npz`. This file will be used in both training and inference. The default audio feature is the power spectrum; the mfcc feature is also supported. To train and infer based on the mfcc feature, please generate this file by
 
 ```
-python compute_mean_std.py --specgram_type mfcc
+python tools/compute_mean_std.py --specgram_type mfcc
 ```
 and specify ```--specgram_type mfcc``` when running train.py, infer.py, evaluate.py or tune.py.
@@ -49,7 +54,7 @@ and specify ```--specgram_type mfcc``` when running train.py, infer.py, evaluato
 More help for arguments:
 
 ```
-python compute_mean_std.py --help
+python tools/compute_mean_std.py --help
 ```
 
 ### Training
@@ -138,3 +143,28 @@ python tune.py --help
 ```
 
 Then reset parameters with the tuning result before inference or evaluating.
+
+### Playing with the ASR Demo
+
+A real-time ASR demo is built for users to try out the ASR model with their own voice. Please install the following dependencies on the machine that will run the demo's client (they are not needed on the machine running the demo's server).
+
+For example, on Mac OS X:
+
+```
+brew install portaudio
+pip install pyaudio
+pip install pynput
+```
+After the model and the language model are prepared, we can first start the demo's server:
+
+```
+CUDA_VISIBLE_DEVICES=0 python demo_server.py
+```
+And then, in another console, start the demo's client:
+
+```
+python demo_client.py
+```
+On the client console, press and hold the space key to start talking; release it when you finish your speech. The decoding result (the inferred transcription) will then be displayed.
+
+It is also possible to run the server and the client on two separate machines, e.g. `demo_client.py` usually runs on a machine with a microphone, while `demo_server.py` usually runs on a remote server with powerful GPUs. Please first make sure the two machines can reach each other over the network, then use `--host_ip` and `--host_port` in both `demo_server.py` and `demo_client.py` to specify the server machine's actual IP address (instead of the default `localhost`) and TCP port.
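For reference, `demo_client.py` and `demo_server.py` (both added later in this change) talk over a bare TCP socket: the client sends a 4-byte big-endian length prefix followed by raw 16 kHz, mono, 32-bit PCM samples, and the server replies with the recognized transcript. The sketch below is a minimal, hypothetical client that replays a pre-recorded wav file instead of capturing audio from a microphone; the file path is an assumption, and the wav is assumed to already be 16 kHz, mono, 32-bit PCM.

```
"""Minimal sketch: send one pre-recorded utterance to the demo server.

Assumes demo_server.py is already running, and that the wav file below
(hypothetical path) is 16 kHz, mono, 32-bit PCM, which is the format the
server expects to receive.
"""
import socket
import struct
import wave

HOST = "localhost"                       # --host_ip default of demo_server.py
PORT = 8086                              # --host_port default
WAV_PATH = "sample_16k_mono_int32.wav"   # hypothetical test file

# Read the raw sample bytes; the server re-wraps them as a wav file.
wav = wave.open(WAV_PATH, 'rb')
payload = wav.readframes(wav.getnframes())
wav.close()

# Length-prefixed send: 4-byte big-endian payload size, then the samples.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((HOST, PORT))
sock.sendall(struct.pack('>i', len(payload)) + payload)

# The server answers with the decoded transcript.
transcript = sock.recv(1024)
sock.close()
print("Transcript: %s" % transcript)
```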
diff --git a/cloud/README.md b/cloud/README.md
index 7c23e0dc01ba5acc972f6e66c9312a742045d651..392088cf96de645e9a667256d6ffe0b08a2f7a9d 100644
--- a/cloud/README.md
+++ b/cloud/README.md
@@ -1,28 +1,81 @@
-#DeepSpeech2 on paddle cloud
+# Run DS2 on PaddleCloud
 
-## Run DS2 by public data
+>Note: Make sure the current directory is `models/deep_speech_2/cloud/`.
 
-**Step1: ** Make sure current dir is `models/deep_speech_2/cloud/`
+## Step1 Configure data set
 
-**Step2:** Submit job by cmd: `sh pcloud_submit.sh`
+You can configure your input data and output paths in pcloud_submit.sh:
+
+- `TRAIN_MANIFEST`: Absolute path of the training data manifest file in the local filesystem. This file has the format shown below:
+
+```
+{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac", "duration": 5.855, "text": "mister quilter is the ..."}
+{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0001.flac", "duration": 4.815, "text": "nor is mister ..."}
+```
+
+- `TEST_MANIFEST`: Absolute path of the test data manifest file in the local filesystem. This file has the same format as TRAIN_MANIFEST.
+- `VOCAB_FILE`: Absolute path of the vocabulary file in the local filesystem.
+- `MEAN_STD_FILE`: Absolute path of the mean and standard deviation file in the local filesystem.
+- `CLOUD_DATA_DIR`: Absolute path in the PaddleCloud filesystem. Local training data will be uploaded to this directory.
+- `CLOUD_MODEL_DIR`: Absolute path in the PaddleCloud filesystem. The PaddleCloud trainer will save models to this directory.
+
+>Note: Uploading will be skipped if the target file already exists in ${CLOUD_DATA_DIR}.
+
+## Step2 Configure computation resource
+
+You can configure the computation resources in pcloud_submit.sh:
+
+```
+# Configure computation resource and submit job to PaddleCloud
+ paddlecloud submit \
+ -image wanghaoshuang/pcloud_ds2:latest \
+ -jobname ${JOB_NAME} \
+ -cpu 4 \
+ -gpu 4 \
+ -memory 10Gi \
+ -parallelism 1 \
+ -pscpu 1 \
+ -pservers 1 \
+ -psmemory 10Gi \
+ -passes 1 \
+ -entry "sh pcloud_train.sh ${CLOUD_DATA_DIR} ${CLOUD_MODEL_DIR}" \
+ ${DS2_PATH}
+```
+
+For more information, please refer to [PaddleCloud](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务).
+
+## Step3 Configure algorithm options
+
+You can configure the algorithm options in pcloud_train.sh:
+
+```
+python train.py \
+--use_gpu=1 \
+--trainer_count=4 \
+--batch_size=256 \
+--mean_std_filepath=$MEAN_STD_FILE \
+--train_manifest_path='./local.train.manifest' \
+--dev_manifest_path='./local.test.manifest' \
+--vocab_filepath=$VOCAB_PATH \
+--output_model_dir=${MODEL_PATH}
+```
+
+You can get more information about the algorithm options with the following command:
+
+```
+cd ..
+python train.py --help
+```
+
+## Step4 Submit job
 
 ```
 $ sh pcloud_submit.sh
-$ uploading: deepspeech.tar.gz...
-$ uploading: pcloud_prepare_data.py...
-$ uploading: pcloud_split_data.py...
-$ uploading: pcloud_submit.sh...
-$ uploading: pcloud_train.sh...
-$ deepspeech20170727130129 submited.
 ```
-The we can get job name 'deepspeech20170727130129' at last line
-**Step3:** Get logs from paddle cloud by cmd: `paddlecloud logs -n 10000 deepspeech20170727130129`.
+## Step5 Get logs ``` $ paddlecloud logs -n 10000 deepspeech20170727130129 ``` -[More options and cmd about paddle cloud](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md) - -## Run DS2 by customize data -TODO +For more information, please refer to [PaddleCloud client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#下载并配置paddlecloud) or get help by follow command: +``` +paddlecloud --help +``` diff --git a/cloud/pcloud_submit.sh b/cloud/pcloud_submit.sh index 9ea5d93102254cd6bd99087db96ac1221cdc9410..179d144f45efe34aa56c32ea4ce9565acebef5cd 100644 --- a/cloud/pcloud_submit.sh +++ b/cloud/pcloud_submit.sh @@ -1,54 +1,28 @@ -# -TRAIN_MANIFEST="/home/work/wanghaoshuang/ds2/pcloud/models/deep_speech_2/datasets/manifest.dev" -TEST_MANIFEST="/home/work/wanghaoshuang/ds2/pcloud/models/deep_speech_2/datasets/manifest.dev" -VOCAB_PATH="/home/work/wanghaoshuang/ds2/pcloud/models/deep_speech_2/datasets/vocab/eng_vocab.txt" -MEAN_STD_PATH="/home/work/wanghaoshuang/ds2/pcloud/models/deep_speech_2/compute_mean_std.py" -CLOUD_DATA_DIR="/pfs/dlnel/home/wanghaoshuang@baidu.com/deepspeech2/data" -CLOUD_MODEL_DIR="/pfs/dlnel/home/wanghaoshuang@baidu.com/deepspeech2/model" +# Configure input data set in local filesystem +TRAIN_MANIFEST="/home/work/demo/ds2/pcloud/models/deep_speech_2/datasets/manifest.dev" +TEST_MANIFEST="/home/work/demo/ds2/pcloud/models/deep_speech_2/datasets/manifest.dev" +VOCAB_FILE="/home/work/demo/ds2/pcloud/models/deep_speech_2/datasets/vocab/eng_vocab.txt" +MEAN_STD_FILE="/home/work/demo/ds2/pcloud/models/deep_speech_2/mean_std.npz" + +# Configure output path in PaddleCloud filesystem +CLOUD_DATA_DIR="/pfs/dlnel/home/demo/deepspeech2/data" +CLOUD_MODEL_DIR="/pfs/dlnel/home/demo/deepspeech2/model" + +# Pack and upload local data to PaddleCloud filesystem +python upload_data.py \ +--train_manifest_path=${TRAIN_MANIFEST} \ +--test_manifest_path=${TEST_MANIFEST} \ +--vocab_file=${VOCAB_FILE} \ +--mean_std_file=${MEAN_STD_FILE} \ +--cloud_data_path=${CLOUD_DATA_DIR} +JOB_NAME=deepspeech`date +%Y%m%d%H%M%S` DS2_PATH=${PWD%/*} +cp -f pcloud_train.sh ${DS2_PATH} -rm -rf ./tmp -mkdir ./tmp - -paddlecloud ls ${CLOUD_DATA_DIR}/mean_std.npz -if [ $? -ne 0 ];then - cp -f ${MEAN_STD_PATH} ./tmp/mean_std.npz - paddlecloud file put ./tmp/mean_std.npz ${CLOUD_DATA_DIR}/ -fi - -paddlecloud ls ${CLOUD_DATA_DIR}/vocab.txt -if [ $? -ne 0 ];then - cp -f ${VOCAB_PATH} ./tmp/vocab.txt - paddlecloud file put ./tmp/vocab.txt ${CLOUD_DATA_DIR}/ -fi - -paddlecloud ls ${CLOUD_DATA_DIR}/cloud.train.manifest -if [ $? -ne 0 ];then -python prepare_data.py \ ---manifest_path=${TRAIN_MANIFEST} \ ---out_tar_path="./tmp/cloud.train.tar" \ ---out_manifest_path="tmp/cloud.train.manifest" -paddlecloud file put ./tmp/cloud.train.tar ${CLOUD_DATA_DIR}/ -paddlecloud file put ./tmp/cloud.train.manifest ${CLOUD_DATA_DIR}/ -fi - -paddlecloud ls ${CLOUD_DATA_DIR}/cloud.test.manifest -if [ $? 
-ne 0 ];then -python prepare_data.py \ ---manifest_path=${TEST_MANIFEST} \ ---out_tar_path="./tmp/cloud.test.tar" \ ---out_manifest_path="tmp/cloud.test.manifest" -paddlecloud file put ./tmp/cloud.test.tar ${CLOUD_DATA_DIR}/ -paddlecloud file put ./tmp/cloud.test.manifest ${CLOUD_DATA_DIR}/ -fi - -rm -rf ./tmp - -JOB_NAME=deepspeech`date +%Y%m%d%H%M%S` -cp pcloud_train.sh ${DS2_PATH} +# Configure computation resource and submit job to PaddleCloud paddlecloud submit \ --image bootstrapper:5000/wanghaoshuang/pcloud_ds2:latest-gpu-cudnn \ +-image bootstrapper:5000/wanghaoshuang/pcloud_ds2:latest \ -jobname ${JOB_NAME} \ -cpu 4 \ -gpu 4 \ @@ -58,5 +32,7 @@ paddlecloud submit \ -pservers 1 \ -psmemory 10Gi \ -passes 1 \ --entry "sh pcloud_train.sh ${CLOUD_DATA_DIR} ${CLOUD_MODEl_DIR}" \ +-entry "sh pcloud_train.sh ${CLOUD_DATA_DIR} ${CLOUD_MODEL_DIR}" \ ${DS2_PATH} + +rm ${DS2_PATH}/pcloud_train.sh diff --git a/cloud/pcloud_train.sh b/cloud/pcloud_train.sh index ebf73bbb77d82536583411d5af12d690f3278d55..64a0fac3bd83988cada7fad39cb8c3fa1bb07511 100644 --- a/cloud/pcloud_train.sh +++ b/cloud/pcloud_train.sh @@ -1,16 +1,10 @@ DATA_PATH=$1 MODEL_PATH=$2 -#setted by user TRAIN_MANI=${DATA_PATH}/cloud.train.manifest -#setted by user DEV_MANI=${DATA_PATH}/cloud.test.manifest -#setted by user TRAIN_TAR=${DATA_PATH}/cloud.train.tar -#setted by user DEV_TAR=${DATA_PATH}/cloud.test.tar -#setted by user -VOCAB_PATH=${DATA_PATH}/eng_vocab.txt -#setted by user +VOCAB_PATH=${DATA_PATH}/vocab.txt MEAN_STD_FILE=${DATA_PATH}/mean_std.npz # split train data for each pcloud node @@ -28,8 +22,10 @@ python ./cloud/split_data.py \ python train.py \ --use_gpu=1 \ --trainer_count=4 \ ---batch_size=256 \ +--batch_size=32 \ +--num_threads_data=4 \ --mean_std_filepath=$MEAN_STD_FILE \ --train_manifest_path='./local.train.manifest' \ --dev_manifest_path='./local.test.manifest' \ --vocab_filepath=$VOCAB_PATH \ +--output_model_dir=${MODEL_PATH} diff --git a/cloud/prepare_data.py b/cloud/prepare_data.py deleted file mode 100644 index dc1e2d27970db1deaef340b7a2f8c9463666970d..0000000000000000000000000000000000000000 --- a/cloud/prepare_data.py +++ /dev/null @@ -1,61 +0,0 @@ -""" -This tool is used for preparing data for DeepSpeech2 trainning on paddle cloud. - -Steps: -1. Read original manifest and get the local path of sound files. -2. Tar all local sound files into one tar file. -3. Modify original manifest to remove the local path information. - -Finally, we will get a tar file and a manifest with sound file name, duration -and text. -""" -import json -import os -import tarfile -import sys -import argparse -sys.path.append('../') -from data_utils.utils import read_manifest - -parser = argparse.ArgumentParser(description=__doc__) -parser.add_argument( - "--manifest_path", - default="../datasets/manifest.train", - type=str, - help="Manifest of target data. (default: %(default)s)") -parser.add_argument( - "--out_tar_path", - default="./tmp/cloud.train.tar", - type=str, - help="Output tar file path. (default: %(default)s)") -parser.add_argument( - "--out_manifest_path", - default="./tmp/cloud.train.manifest", - type=str, - help="Manifest of output data. (default: %(default)s)") -args = parser.parse_args() - - -def gen_pcloud_data(manifest_path, out_tar_path, out_manifest_path): - ''' - 1. According manifest, tar sound files into out_tar_path - 2. 
Generate a new manifest for output tar file - ''' - out_tar = tarfile.open(out_tar_path, 'w') - manifest = read_manifest(manifest_path) - results = [] - for json_data in manifest: - sound_file = json_data['audio_filepath'] - filename = os.path.basename(sound_file) - out_tar.add(sound_file, arcname=filename) - json_data['audio_filepath'] = filename - results.append("%s\n" % json.dumps(json_data)) - with open(out_manifest_path, 'w') as out_manifest: - out_manifest.writelines(results) - out_manifest.close() - out_tar.close() - - -if __name__ == '__main__': - gen_pcloud_data(args.manifest_path, args.out_tar_path, - args.out_manifest_path) diff --git a/cloud/upload_data.py b/cloud/upload_data.py new file mode 100644 index 0000000000000000000000000000000000000000..75dcf010eb6320d38a003ac3b7e4e35ad21daa46 --- /dev/null +++ b/cloud/upload_data.py @@ -0,0 +1,147 @@ +""" +This tool is used for preparing data for DeepSpeech2 trainning on paddle cloud. + +Steps: +1. Read original manifest and get the local path of sound files. +2. Tar all local sound files into one tar file. +3. Modify original manifest to remove the local path information. + +Finally, we will get a tar file and a manifest with sound file name, duration +and text. +""" +import json +import os +import tarfile +import sys +import argparse +import shutil +sys.path.append('../') +from data_utils.utils import read_manifest +from subprocess import call + +TRAIN_TAR = "cloud.train.tar" +TRAIN_MANIFEST = "cloud.train.manifest" +TEST_TAR = "cloud.test.tar" +TEST_MANIFEST = "cloud.test.manifest" +VOCAB_FILE = "vocab.txt" +MEAN_STD_FILE = "mean_std.npz" + +parser = argparse.ArgumentParser(description=__doc__) +parser.add_argument( + "--train_manifest_path", + default="../datasets/manifest.train", + type=str, + help="Manifest file of train data. (default: %(default)s)") +parser.add_argument( + "--test_manifest_path", + default="../datasets/manifest.test", + type=str, + help="Manifest file of test data. (default: %(default)s)") +parser.add_argument( + "--vocab_file", + default="../datasets/vocab/eng_vocab.txt", + type=str, + help="Vocab file to be uploaded to paddlecloud. (default: %(default)s)") +parser.add_argument( + "--mean_std_file", + default="../mean_std.npz", + type=str, + help="mean_std file to be uploaded to paddlecloud. (default: %(default)s)") +parser.add_argument( + "--cloud_data_path", + required=True, + default="", + type=str, + help="Destination path on paddlecloud. (default: %(default)s)") +args = parser.parse_args() + +parser.add_argument( + "--local_tmp_path", + default="./tmp/", + type=str, + help="Local directory for storing temporary data. (default: %(default)s)") +args = parser.parse_args() + + +def pack_data(manifest_path, out_tar_path, out_manifest_path): + ''' + 1. According manifest, tar sound files into out_tar_path + 2. 
Generate a new manifest for output tar file + ''' + out_tar = tarfile.open(out_tar_path, 'w') + manifest = read_manifest(manifest_path) + results = [] + for json_data in manifest: + sound_file = json_data['audio_filepath'] + filename = os.path.basename(sound_file) + out_tar.add(sound_file, arcname=filename) + json_data['audio_filepath'] = filename + results.append("%s\n" % json.dumps(json_data)) + with open(out_manifest_path, 'w') as out_manifest: + out_manifest.writelines(results) + out_manifest.close() + out_tar.close() + + +if __name__ == '__main__': + cloud_train_manifest = "%s/%s" % (args.cloud_data_path, TRAIN_MANIFEST) + cloud_train_tar = "%s/%s" % (args.cloud_data_path, TRAIN_TAR) + cloud_test_manifest = "%s/%s" % (args.cloud_data_path, TEST_MANIFEST) + cloud_test_tar = "%s/%s" % (args.cloud_data_path, TEST_TAR) + cloud_vocab_file = "%s/%s" % (args.cloud_data_path, VOCAB_FILE) + cloud_mean_file = "%s/%s" % (args.cloud_data_path, MEAN_STD_FILE) + + local_train_manifest = "%s/%s" % (args.local_tmp_path, TRAIN_MANIFEST) + local_train_tar = "%s/%s" % (args.local_tmp_path, TRAIN_TAR) + local_test_manifest = "%s/%s" % (args.local_tmp_path, TEST_MANIFEST) + local_test_tar = "%s/%s" % (args.local_tmp_path, TEST_TAR) + + if os.path.exists(args.local_tmp_path): + shutil.rmtree(args.local_tmp_path) + os.makedirs(args.local_tmp_path) + + ret = 1 + # train data + if args.train_manifest_path != "": + ret = call(['paddlecloud', 'ls', cloud_train_manifest]) + if ret != 0: + print "%s does't exist" % cloud_train_manifest + pack_data(args.train_manifest_path, local_train_tar, + local_train_manifest) + call([ + 'paddlecloud', 'cp', local_train_manifest, cloud_train_manifest + ]) + call(['paddlecloud', 'cp', local_train_tar, cloud_train_tar]) + + # test data + if args.test_manifest_path != "": + try: + ret = call(['paddlecloud', 'ls', cloud_test_manifest]) + except Exception: + ret = 1 + if ret != 0: + pack_data(args.test_manifest_path, local_test_tar, + local_test_manifest) + call( + ['paddlecloud', 'cp', local_test_manifest, cloud_test_manifest]) + call(['paddlecloud', 'cp', local_test_tar, cloud_test_tar]) + + # vocab file + if args.vocab_file != "": + try: + ret = call(['paddlecloud', 'ls', cloud_vocab_file]) + except Exception: + ret = 1 + if ret != 0: + call(['paddlecloud', 'cp', args.vocab_file, cloud_vocab_file]) + + # mean_std file + if args.mean_std_file != "": + try: + ret = call(['paddlecloud', 'ls', cloud_mean_file]) + except Exception: + ret = 1 + if ret != 0: + call(['paddlecloud', 'cp', args.mean_std_file, cloud_mean_file]) + + os.removedirs(args.local_tmp_path) diff --git a/conf/augmentation.config b/conf/augmentation.config new file mode 100644 index 0000000000000000000000000000000000000000..6c24da5497460d4bae9c9c4fecdbe96ab8da7532 --- /dev/null +++ b/conf/augmentation.config @@ -0,0 +1,8 @@ +[ + { + "type": "shift", + "params": {"min_shift_ms": -5, + "max_shift_ms": 5}, + "prob": 1.0 + } +] diff --git a/conf/augmentation.config.example b/conf/augmentation.config.example new file mode 100644 index 0000000000000000000000000000000000000000..21ed6ee10375a749f4c072389509db2020d9e9c9 --- /dev/null +++ b/conf/augmentation.config.example @@ -0,0 +1,39 @@ +[ + { + "type": "noise", + "params": {"min_snr_dB": 40, + "max_snr_dB": 50, + "noise_manifest_path": "datasets/manifest.noise"}, + "prob": 0.6 + }, + { + "type": "impulse", + "params": {"impulse_manifest_path": "datasets/manifest.impulse"}, + "prob": 0.5 + }, + { + "type": "speed", + "params": {"min_speed_rate": 0.95, + "max_speed_rate": 
1.05}, + "prob": 0.5 + }, + { + "type": "shift", + "params": {"min_shift_ms": -5, + "max_shift_ms": 5}, + "prob": 1.0 + }, + { + "type": "volume", + "params": {"min_gain_dBFS": -10, + "max_gain_dBFS": 10}, + "prob": 0.0 + }, + { + "type": "bayesian_normal", + "params": {"target_db": -20, + "prior_db": -20, + "prior_samples": 100}, + "prob": 0.0 + } +] diff --git a/data_utils/__init__.pyc b/data_utils/__init__.pyc new file mode 100644 index 0000000000000000000000000000000000000000..62144b80b4a3a1d94314a0974b53531f0c4f0437 Binary files /dev/null and b/data_utils/__init__.pyc differ diff --git a/data_utils/audio.py b/data_utils/audio.py index 3891f5b923f6d73c6b87dcb90bede0183b0e081c..30e25221cd84aa6849061635749188e3bd13d67b 100644 --- a/data_utils/audio.py +++ b/data_utils/audio.py @@ -204,7 +204,7 @@ class AudioSegment(object): :raise ValueError: If the sample rates of the two segments are not equal, or if the lengths of segments don't match. """ - if type(self) != type(other): + if isinstance(other, type(self)): raise TypeError("Cannot add segments of different types: %s " "and %s." % (type(self), type(other))) if self._sample_rate != other._sample_rate: @@ -231,7 +231,7 @@ class AudioSegment(object): Note that this is an in-place transformation. :param gain: Gain in decibels to apply to samples. - :type gain: float + :type gain: float|1darray """ self._samples *= 10.**(gain / 20.) @@ -457,9 +457,9 @@ class AudioSegment(object): audio segments when resample is not allowed. """ if allow_resample and self.sample_rate != impulse_segment.sample_rate: - impulse_segment = impulse_segment.resample(self.sample_rate) + impulse_segment.resample(self.sample_rate) if self.sample_rate != impulse_segment.sample_rate: - raise ValueError("Impulse segment's sample rate (%d Hz) is not" + raise ValueError("Impulse segment's sample rate (%d Hz) is not " "equal to base signal sample rate (%d Hz)." % (impulse_segment.sample_rate, self.sample_rate)) samples = signal.fftconvolve(self.samples, impulse_segment.samples, diff --git a/data_utils/audio.pyc b/data_utils/audio.pyc new file mode 100644 index 0000000000000000000000000000000000000000..af544fd49d75f4bcf6460163967b65b3c69d14f8 Binary files /dev/null and b/data_utils/audio.pyc differ diff --git a/data_utils/augmentor/__init__.pyc b/data_utils/augmentor/__init__.pyc new file mode 100644 index 0000000000000000000000000000000000000000..d3f0f7f059174885c853717d377f85588a5d6bb3 Binary files /dev/null and b/data_utils/augmentor/__init__.pyc differ diff --git a/data_utils/augmentor/augmentation.py b/data_utils/augmentor/augmentation.py index 9dced47314a81f52dc0eafd6e592e240953f291d..5c30b627ef9a23ff41d1f64f270934f149a793a2 100644 --- a/data_utils/augmentor/augmentation.py +++ b/data_utils/augmentor/augmentation.py @@ -8,6 +8,8 @@ import random from data_utils.augmentor.volume_perturb import VolumePerturbAugmentor from data_utils.augmentor.shift_perturb import ShiftPerturbAugmentor from data_utils.augmentor.speed_perturb import SpeedPerturbAugmentor +from data_utils.augmentor.noise_perturb import NoisePerturbAugmentor +from data_utils.augmentor.impulse_response import ImpulseResponseAugmentor from data_utils.augmentor.resample import ResampleAugmentor from data_utils.augmentor.online_bayesian_normalization import \ OnlineBayesianNormalizationAugmentor @@ -23,21 +25,46 @@ class AugmentationPipeline(object): string, e.g. .. 
code-block:: - - '[{"type": "volume", - "params": {"min_gain_dBFS": -15, - "max_gain_dBFS": 15}, - "prob": 0.5}, - {"type": "speed", - "params": {"min_speed_rate": 0.8, - "max_speed_rate": 1.2}, - "prob": 0.5} - ]' + [ { + "type": "noise", + "params": {"min_snr_dB": 10, + "max_snr_dB": 20, + "noise_manifest_path": "datasets/manifest.noise"}, + "prob": 0.0 + }, + { + "type": "speed", + "params": {"min_speed_rate": 0.9, + "max_speed_rate": 1.1}, + "prob": 1.0 + }, + { + "type": "shift", + "params": {"min_shift_ms": -5, + "max_shift_ms": 5}, + "prob": 1.0 + }, + { + "type": "volume", + "params": {"min_gain_dBFS": -10, + "max_gain_dBFS": 10}, + "prob": 0.0 + }, + { + "type": "bayesian_normal", + "params": {"target_db": -20, + "prior_db": -20, + "prior_samples": 100}, + "prob": 0.0 + } + ] + This augmentation configuration inserts two augmentation models into the pipeline, with one is VolumePerturbAugmentor and the other SpeedPerturbAugmentor. "prob" indicates the probability of the current - augmentor to take effect. + augmentor to take effect. If "prob" is zero, the augmentor does not take + effect. :param augmentation_config: Augmentation configuration in json string. :type augmentation_config: str @@ -60,7 +87,7 @@ class AugmentationPipeline(object): :type audio_segment: AudioSegmenet|SpeechSegment """ for augmentor, rate in zip(self._augmentors, self._rates): - if self._rng.uniform(0., 1.) <= rate: + if self._rng.uniform(0., 1.) < rate: augmentor.transform_audio(audio_segment) def _parse_pipeline_from(self, config_json): @@ -89,5 +116,9 @@ class AugmentationPipeline(object): return ResampleAugmentor(self._rng, **params) elif augmentor_type == "bayesian_normal": return OnlineBayesianNormalizationAugmentor(self._rng, **params) + elif augmentor_type == "noise": + return NoisePerturbAugmentor(self._rng, **params) + elif augmentor_type == "impulse": + return ImpulseResponseAugmentor(self._rng, **params) else: raise ValueError("Unknown augmentor type [%s]." % augmentor_type) diff --git a/data_utils/augmentor/augmentation.pyc b/data_utils/augmentor/augmentation.pyc new file mode 100644 index 0000000000000000000000000000000000000000..5f3f22c9c5c57da356010e67bf7fe3ed7088b6be Binary files /dev/null and b/data_utils/augmentor/augmentation.pyc differ diff --git a/data_utils/augmentor/base.pyc b/data_utils/augmentor/base.pyc new file mode 100644 index 0000000000000000000000000000000000000000..b8b568d325d2e979dde0d9d8004ac7ef491fe66e Binary files /dev/null and b/data_utils/augmentor/base.pyc differ diff --git a/data_utils/augmentor/impulse_response.py b/data_utils/augmentor/impulse_response.py new file mode 100644 index 0000000000000000000000000000000000000000..c3de0fdbb2a40150f8cffdef3487ceb4400e52ed --- /dev/null +++ b/data_utils/augmentor/impulse_response.py @@ -0,0 +1,35 @@ +"""Contains the impulse response augmentation model.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from data_utils.augmentor.base import AugmentorBase +from data_utils import utils +from data_utils.audio import AudioSegment + + +class ImpulseResponseAugmentor(AugmentorBase): + """Augmentation model for adding impulse response effect. + + :param rng: Random generator object. + :type rng: random.Random + :param impulse_manifest_path: Manifest path for impulse audio data. 
+ :type impulse_manifest_path: basestring + """ + + def __init__(self, rng, impulse_manifest_path): + self._rng = rng + self._impulse_manifest = utils.read_manifest( + manifest_path=impulse_manifest_path) + + def transform_audio(self, audio_segment): + """Add impulse response effect. + + Note that this is an in-place transformation. + + :param audio_segment: Audio segment to add effects to. + :type audio_segment: AudioSegmenet|SpeechSegment + """ + impulse_json = self._rng.sample(self._impulse_manifest, 1)[0] + impulse_segment = AudioSegment.from_file(impulse_json['audio_filepath']) + audio_segment.convolve(impulse_segment, allow_resample=True) diff --git a/data_utils/augmentor/impulse_response.pyc b/data_utils/augmentor/impulse_response.pyc new file mode 100644 index 0000000000000000000000000000000000000000..03fc786a564343a3778f336cf7d50d809efc19cb Binary files /dev/null and b/data_utils/augmentor/impulse_response.pyc differ diff --git a/data_utils/augmentor/noise_perturb.py b/data_utils/augmentor/noise_perturb.py new file mode 100644 index 0000000000000000000000000000000000000000..281174af42c2f6d673ead94bd532941769c79c25 --- /dev/null +++ b/data_utils/augmentor/noise_perturb.py @@ -0,0 +1,50 @@ +"""Contains the noise perturb augmentation model.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from data_utils.augmentor.base import AugmentorBase +from data_utils import utils +from data_utils.audio import AudioSegment + + +class NoisePerturbAugmentor(AugmentorBase): + """Augmentation model for adding background noise. + + :param rng: Random generator object. + :type rng: random.Random + :param min_snr_dB: Minimal signal noise ratio, in decibels. + :type min_snr_dB: float + :param max_snr_dB: Maximal signal noise ratio, in decibels. + :type max_snr_dB: float + :param noise_manifest_path: Manifest path for noise audio data. + :type noise_manifest_path: basestring + """ + + def __init__(self, rng, min_snr_dB, max_snr_dB, noise_manifest_path): + self._min_snr_dB = min_snr_dB + self._max_snr_dB = max_snr_dB + self._rng = rng + self._noise_manifest = utils.read_manifest( + manifest_path=noise_manifest_path) + + def transform_audio(self, audio_segment): + """Add background noise audio. + + Note that this is an in-place transformation. + + :param audio_segment: Audio segment to add effects to. 
+ :type audio_segment: AudioSegmenet|SpeechSegment + """ + noise_json = self._rng.sample(self._noise_manifest, 1)[0] + if noise_json['duration'] < audio_segment.duration: + raise RuntimeError("The duration of sampled noise audio is smaller " + "than the audio segment to add effects to.") + diff_duration = noise_json['duration'] - audio_segment.duration + start = self._rng.uniform(0, diff_duration) + end = start + audio_segment.duration + noise_segment = AudioSegment.slice_from_file( + noise_json['audio_filepath'], start=start, end=end) + snr_dB = self._rng.uniform(self._min_snr_dB, self._max_snr_dB) + audio_segment.add_noise( + noise_segment, snr_dB, allow_downsampling=True, rng=self._rng) diff --git a/data_utils/augmentor/noise_perturb.pyc b/data_utils/augmentor/noise_perturb.pyc new file mode 100644 index 0000000000000000000000000000000000000000..0b118e7d60c67338b541a1ec97c61b193d0f0710 Binary files /dev/null and b/data_utils/augmentor/noise_perturb.pyc differ diff --git a/data_utils/augmentor/online_bayesian_normalization.py b/data_utils/augmentor/online_bayesian_normalization.py old mode 100755 new mode 100644 diff --git a/data_utils/augmentor/online_bayesian_normalization.pyc b/data_utils/augmentor/online_bayesian_normalization.pyc new file mode 100644 index 0000000000000000000000000000000000000000..54ca1603b15c17d721ed610bbf77a65be55ea6ca Binary files /dev/null and b/data_utils/augmentor/online_bayesian_normalization.pyc differ diff --git a/data_utils/augmentor/resample.py b/data_utils/augmentor/resample.py old mode 100755 new mode 100644 diff --git a/data_utils/augmentor/resample.pyc b/data_utils/augmentor/resample.pyc new file mode 100644 index 0000000000000000000000000000000000000000..a59a9af523637ac4caf60a88cf34f35775a74d19 Binary files /dev/null and b/data_utils/augmentor/resample.pyc differ diff --git a/data_utils/augmentor/shift_perturb.pyc b/data_utils/augmentor/shift_perturb.pyc new file mode 100644 index 0000000000000000000000000000000000000000..e1b013089151fd486c5e916176e4a8d4ea90bc34 Binary files /dev/null and b/data_utils/augmentor/shift_perturb.pyc differ diff --git a/data_utils/augmentor/speed_perturb.pyc b/data_utils/augmentor/speed_perturb.pyc new file mode 100644 index 0000000000000000000000000000000000000000..64f5bf348ac70f2a0ff33b66c9d623d55fabf4f7 Binary files /dev/null and b/data_utils/augmentor/speed_perturb.pyc differ diff --git a/data_utils/augmentor/volume_perturb.pyc b/data_utils/augmentor/volume_perturb.pyc new file mode 100644 index 0000000000000000000000000000000000000000..73577ce64c32df68e2149001ed94065e2b3212e0 Binary files /dev/null and b/data_utils/augmentor/volume_perturb.pyc differ diff --git a/data_utils/data.py b/data_utils/data.py index 5a5fa51b2b91b3fa7bb1a5476d462534978fa973..1beba14319a69c306c53ee6f7c8054841711f106 100644 --- a/data_utils/data.py +++ b/data_utils/data.py @@ -72,7 +72,7 @@ class DataGenerator(object): max_freq=None, specgram_type='linear', use_dB_normalization=True, - num_threads=multiprocessing.cpu_count(), + num_threads=multiprocessing.cpu_count() // 2, random_seed=0): self._max_duration = max_duration self._min_duration = min_duration @@ -94,6 +94,23 @@ class DataGenerator(object): self.tar2info = {} self.tar2object = {} + def process_utterance(self, filename, transcript): + """Load, augment, featurize and normalize for speech data. + + :param filename: Audio filepath + :type filename: basestring | file + :param transcript: Transcription text. 
+ :type transcript: basestring + :return: Tuple of audio feature tensor and list of token ids for + transcription. + :rtype: tuple of (2darray, list) + """ + speech_segment = SpeechSegment.from_file(filename, transcript) + self._augmentation_pipeline.transform_audio(speech_segment) + specgram, text_ids = self._speech_featurizer.featurize(speech_segment) + specgram = self._normalizer.apply(specgram) + return specgram, text_ids + def batch_reader_creator(self, manifest_path, batch_size, @@ -163,7 +180,7 @@ class DataGenerator(object): manifest, batch_size, clipped=True) elif shuffle_method == "instance_shuffle": self._rng.shuffle(manifest) - elif not shuffle_method: + elif shuffle_method == None: pass else: raise ValueError("Unknown shuffle method %s." % @@ -210,8 +227,8 @@ class DataGenerator(object): return self._speech_featurizer.vocab_list def _parse_tar(self, file): - """ - Parse a tar file to get a tarfile object and a map containing tarinfoes + """Parse a tar file to get a tarfile object + and a map containing tarinfoes """ result = {} f = tarfile.open(file) @@ -219,14 +236,14 @@ class DataGenerator(object): result[tarinfo.name] = tarinfo return f, result - def _read_soundbytes(self, filepath): - """ - Read bytes from file. - If filepath startwith tar, we will read bytes from tar file + def _get_file_object(self, file): + """Get file object by file path. + If file startwith tar, it will return a tar file object and cached tar file info for next reading request. + It will return file directly, if the type of file is not str. """ - if filepath.startswith('tar:'): - tarpath, filename = filepath.split(':', 1)[1].split('#', 1) + if file.startswith('tar:'): + tarpath, filename = file.split(':', 1)[1].split('#', 1) if 'tar2info' not in local_data.__dict__: local_data.tar2info = {} if 'tar2object' not in local_data.__dict__: @@ -236,18 +253,9 @@ class DataGenerator(object): local_data.tar2info[tarpath] = infoes local_data.tar2object[tarpath] = object return local_data.tar2object[tarpath].extractfile( - local_data.tar2info[tarpath][filename]).read() + local_data.tar2info[tarpath][filename]) else: - return open(filepath).read() - - def _process_utterance(self, filename, transcript): - """Load, augment, featurize and normalize for speech data.""" - speech_segment = SpeechSegment.from_bytes( - self._read_soundbytes(filename), transcript) - self._augmentation_pipeline.transform_audio(speech_segment) - specgram, text_ids = self._speech_featurizer.featurize(speech_segment) - specgram = self._normalizer.apply(specgram) - return specgram, text_ids + return open(file) def _instance_reader_creator(self, manifest): """ @@ -263,8 +271,9 @@ class DataGenerator(object): yield instance def mapper(instance): - return self._process_utterance(instance["audio_filepath"], - instance["text"]) + return self.process_utterance( + self._get_file_object(instance["audio_filepath"]), + instance["text"]) return paddle.reader.xmap_readers( mapper, reader, self._num_threads, 1024, order=True) diff --git a/data_utils/data.pyc b/data_utils/data.pyc new file mode 100644 index 0000000000000000000000000000000000000000..961d3a9a9c39f35b2f417e3caa67abf06a2dff74 Binary files /dev/null and b/data_utils/data.pyc differ diff --git a/data_utils/featurizer/__init__.pyc b/data_utils/featurizer/__init__.pyc new file mode 100644 index 0000000000000000000000000000000000000000..949e3e05e8ec9f8f7b1c40e5a444748efe29ad8c Binary files /dev/null and b/data_utils/featurizer/__init__.pyc differ diff --git a/data_utils/featurizer/audio_featurizer.py 
b/data_utils/featurizer/audio_featurizer.py index 271e535b6a9f1cded27caf4f63adcc51abf3e835..00f0e8a35bc8e67ab285b7d509a0992c02dc54ca 100644 --- a/data_utils/featurizer/audio_featurizer.py +++ b/data_utils/featurizer/audio_featurizer.py @@ -166,21 +166,18 @@ class AudioFeaturizer(object): "window size.") # compute 13 cepstral coefficients, and the first one is replaced # by log(frame energy) - mfcc_feat = mfcc( - signal=samples, - samplerate=sample_rate, - winlen=0.001 * window_ms, - winstep=0.001 * stride_ms, - highfreq=max_freq) + mfcc_feat = np.transpose( + mfcc( + signal=samples, + samplerate=sample_rate, + winlen=0.001 * window_ms, + winstep=0.001 * stride_ms, + highfreq=max_freq)) # Deltas d_mfcc_feat = delta(mfcc_feat, 2) # Deltas-Deltas dd_mfcc_feat = delta(d_mfcc_feat, 2) # concat above three features - concat_mfcc_feat = [ - np.concatenate((mfcc_feat[i], d_mfcc_feat[i], dd_mfcc_feat[i])) - for i in xrange(len(mfcc_feat)) - ] - # transpose to be consistent with the linear specgram situation - concat_mfcc_feat = np.transpose(concat_mfcc_feat) + concat_mfcc_feat = np.concatenate( + (mfcc_feat, d_mfcc_feat, dd_mfcc_feat)) return concat_mfcc_feat diff --git a/data_utils/featurizer/audio_featurizer.pyc b/data_utils/featurizer/audio_featurizer.pyc new file mode 100644 index 0000000000000000000000000000000000000000..6b855fde2fb668861625b6cd7fd70130c4b336d6 Binary files /dev/null and b/data_utils/featurizer/audio_featurizer.pyc differ diff --git a/data_utils/featurizer/speech_featurizer.pyc b/data_utils/featurizer/speech_featurizer.pyc new file mode 100644 index 0000000000000000000000000000000000000000..a8225d6810ad115b355e54f569c3d967fba99938 Binary files /dev/null and b/data_utils/featurizer/speech_featurizer.pyc differ diff --git a/data_utils/featurizer/text_featurizer.py b/data_utils/featurizer/text_featurizer.py index 4f9a49b594010f91a64797b9a4b7e9054d4749d5..89202163ca8d8b69f59b858db5451882d7e089b3 100644 --- a/data_utils/featurizer/text_featurizer.py +++ b/data_utils/featurizer/text_featurizer.py @@ -4,6 +4,7 @@ from __future__ import division from __future__ import print_function import os +import codecs class TextFeaturizer(object): @@ -59,7 +60,7 @@ class TextFeaturizer(object): def _load_vocabulary_from_file(self, vocab_filepath): """Load vocabulary from file.""" vocab_lines = [] - with open(vocab_filepath, 'r') as file: + with codecs.open(vocab_filepath, 'r', 'utf-8') as file: vocab_lines.extend(file.readlines()) vocab_list = [line[:-1] for line in vocab_lines] vocab_dict = dict( diff --git a/data_utils/featurizer/text_featurizer.pyc b/data_utils/featurizer/text_featurizer.pyc new file mode 100644 index 0000000000000000000000000000000000000000..15bbd050c2c8c616a951c2436484a2bb38c813ba Binary files /dev/null and b/data_utils/featurizer/text_featurizer.pyc differ diff --git a/data_utils/normalizer.pyc b/data_utils/normalizer.pyc new file mode 100644 index 0000000000000000000000000000000000000000..8bb687f9efb1f9ca2b05213d8cd44f89c4a9c38a Binary files /dev/null and b/data_utils/normalizer.pyc differ diff --git a/data_utils/speech.py b/data_utils/speech.py index 568e4443ba557149505dfb4de6f230b4962e332a..17d68f315d04b6cc1aae2346df78cf77982cd7bc 100644 --- a/data_utils/speech.py +++ b/data_utils/speech.py @@ -115,7 +115,7 @@ class SpeechSegment(AudioSegment): speech file. 
:rtype: SpeechSegment """ - audio = Audiosegment.slice_from_file(filepath, start, end) + audio = AudioSegment.slice_from_file(filepath, start, end) return cls(audio.samples, audio.sample_rate, transcript) @classmethod diff --git a/data_utils/speech.pyc b/data_utils/speech.pyc new file mode 100644 index 0000000000000000000000000000000000000000..cef13388b85fc5f50bc6e78a33da1fe332996d47 Binary files /dev/null and b/data_utils/speech.pyc differ diff --git a/data_utils/utils.py b/data_utils/utils.py index 3f1165718aa0e2a0bf0687b8a613a6447b964ee8..f970ff55adeee0e1a4613143db1145e617b3699c 100644 --- a/data_utils/utils.py +++ b/data_utils/utils.py @@ -4,15 +4,16 @@ from __future__ import division from __future__ import print_function import json +import codecs def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0): """Load and parse manifest file. - + Instances with durations outside [min_duration, max_duration] will be filtered out. - :param manifest_path: Manifest file to load and parse. + :param manifest_path: Manifest file to load and parse. :type manifest_path: basestring :param max_duration: Maximal duration in seconds for instance filter. :type max_duration: float @@ -23,7 +24,7 @@ def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0): :raises IOError: If failed to parse the manifest. """ manifest = [] - for json_line in open(manifest_path): + for json_line in codecs.open(manifest_path, 'r', 'utf-8'): try: json_data = json.loads(json_line) except Exception as e: diff --git a/data_utils/utils.pyc b/data_utils/utils.pyc new file mode 100644 index 0000000000000000000000000000000000000000..61f4393e90ce8c10aeb56351bbca4307654f7dcf Binary files /dev/null and b/data_utils/utils.pyc differ diff --git a/datasets/librispeech/librispeech.py b/datasets/librispeech/librispeech.py index 87e52ae4aa286503d79f1326065831acfe6bf985..d963a7d5372d64f3abb1dcbdd16dbdafc1888de0 100644 --- a/datasets/librispeech/librispeech.py +++ b/datasets/librispeech/librispeech.py @@ -11,11 +11,12 @@ from __future__ import print_function import distutils.util import os -import wget +import sys import tarfile import argparse import soundfile import json +import codecs from paddle.v2.dataset.common import md5file DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech') @@ -66,7 +67,7 @@ def download(url, md5sum, target_dir): filepath = os.path.join(target_dir, url.split("/")[-1]) if not (os.path.exists(filepath) and md5file(filepath) == md5sum): print("Downloading %s ..." % url) - wget.download(url, target_dir) + os.system("wget -c " + url + " -P " + target_dir) print("\nMD5 Chesksum %s ..." % filepath) if not md5file(filepath) == md5sum: raise RuntimeError("MD5 checksum failed.") @@ -112,7 +113,7 @@ def create_manifest(data_dir, manifest_path): 'duration': duration, 'text': text })) - with open(manifest_path, 'w') as out_file: + with codecs.open(manifest_path, 'w', 'utf-8') as out_file: for line in json_lines: out_file.write(line + '\n') diff --git a/datasets/noise/chime3_background.py b/datasets/noise/chime3_background.py new file mode 100644 index 0000000000000000000000000000000000000000..f79ca7335bda7aec795bc43c32a51519f3363d85 --- /dev/null +++ b/datasets/noise/chime3_background.py @@ -0,0 +1,128 @@ +"""Prepare CHiME3 background data. + +Download, unpack and create manifest files. +Manifest file is a json-format file with each line containing the +meta data (i.e. audio filepath, transcript and audio duration) +of each audio file in the data set. 
+""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import distutils.util +import os +import wget +import zipfile +import argparse +import soundfile +import json +from paddle.v2.dataset.common import md5file + +DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech') + +URL = "https://d4s.myairbridge.com/packagev2/AG0Y3DNBE5IWRRTV/?dlid=W19XG7T0NNHB027139H0EQ" +MD5 = "c3ff512618d7a67d4f85566ea1bc39ec" + +parser = argparse.ArgumentParser(description=__doc__) +parser.add_argument( + "--target_dir", + default=DATA_HOME + "/chime3_background", + type=str, + help="Directory to save the dataset. (default: %(default)s)") +parser.add_argument( + "--manifest_filepath", + default="manifest.chime3.background", + type=str, + help="Filepath for output manifests. (default: %(default)s)") +args = parser.parse_args() + + +def download(url, md5sum, target_dir, filename=None): + """Download file from url to target_dir, and check md5sum.""" + if filename == None: + filename = url.split("/")[-1] + if not os.path.exists(target_dir): os.makedirs(target_dir) + filepath = os.path.join(target_dir, filename) + if not (os.path.exists(filepath) and md5file(filepath) == md5sum): + print("Downloading %s ..." % url) + wget.download(url, target_dir) + print("\nMD5 Chesksum %s ..." % filepath) + if not md5file(filepath) == md5sum: + raise RuntimeError("MD5 checksum failed.") + else: + print("File exists, skip downloading. (%s)" % filepath) + return filepath + + +def unpack(filepath, target_dir): + """Unpack the file to the target_dir.""" + print("Unpacking %s ..." % filepath) + if filepath.endswith('.zip'): + zip = zipfile.ZipFile(filepath, 'r') + zip.extractall(target_dir) + zip.close() + elif filepath.endswith('.tar') or filepath.endswith('.tar.gz'): + tar = zipfile.open(filepath) + tar.extractall(target_dir) + tar.close() + else: + raise ValueError("File format is not supported for unpacking.") + + +def create_manifest(data_dir, manifest_path): + """Create a manifest json file summarizing the data set, with each line + containing the meta data (i.e. audio filepath, transcription text, audio + duration) of each audio file within the data set. + """ + print("Creating manifest %s ..." % manifest_path) + json_lines = [] + for subfolder, _, filelist in sorted(os.walk(data_dir)): + for filename in filelist: + if filename.endswith('.wav'): + filepath = os.path.join(data_dir, subfolder, filename) + audio_data, samplerate = soundfile.read(filepath) + duration = float(len(audio_data)) / samplerate + json_lines.append( + json.dumps({ + 'audio_filepath': filepath, + 'duration': duration, + 'text': '' + })) + with open(manifest_path, 'w') as out_file: + for line in json_lines: + out_file.write(line + '\n') + + +def prepare_chime3(url, md5sum, target_dir, manifest_path): + """Download, unpack and create summmary manifest file.""" + if not os.path.exists(os.path.join(target_dir, "CHiME3")): + # download + filepath = download(url, md5sum, target_dir, + "myairbridge-AG0Y3DNBE5IWRRTV.zip") + # unpack + unpack(filepath, target_dir) + unpack( + os.path.join(target_dir, 'CHiME3_background_bus.zip'), target_dir) + unpack( + os.path.join(target_dir, 'CHiME3_background_caf.zip'), target_dir) + unpack( + os.path.join(target_dir, 'CHiME3_background_ped.zip'), target_dir) + unpack( + os.path.join(target_dir, 'CHiME3_background_str.zip'), target_dir) + else: + print("Skip downloading and unpacking. Data already exists in %s." 
% + target_dir) + # create manifest json file + create_manifest(target_dir, manifest_path) + + +def main(): + prepare_chime3( + url=URL, + md5sum=MD5, + target_dir=args.target_dir, + manifest_path=args.manifest_filepath) + + +if __name__ == '__main__': + main() diff --git a/datasets/run_noise.sh b/datasets/run_noise.sh new file mode 100644 index 0000000000000000000000000000000000000000..7b27abde47a97b671609f0cd15e81565b3a00d02 --- /dev/null +++ b/datasets/run_noise.sh @@ -0,0 +1,10 @@ +cd noise +python chime3_background.py +if [ $? -ne 0 ]; then + echo "Prepare CHiME3 background noise failed. Terminated." + exit 1 +fi +cd - + +cat noise/manifest.* > manifest.noise +echo "All done." diff --git a/decoder.py b/decoder.py index a1fadc2c81ac5036f5082e1a60b018106ab90277..8f2e0508de79fea30ebc30230e948b15923bdf24 100644 --- a/decoder.py +++ b/decoder.py @@ -205,9 +205,9 @@ def ctc_beam_search_decoder_batch(probs_split, :type num_processes: int :param cutoff_prob: Cutoff probability in pruning, default 1.0, no pruning. + :type cutoff_prob: float :param num_processes: Number of parallel processes. :type num_processes: int - :type cutoff_prob: float :param ext_scoring_func: External scoring function for partially decoded sentence, e.g. word count or language model. diff --git a/demo_client.py b/demo_client.py new file mode 100644 index 0000000000000000000000000000000000000000..ddf4dd1bf3f5ea62661e181e0dd2fb3f3b1379c6 --- /dev/null +++ b/demo_client.py @@ -0,0 +1,94 @@ +"""Client-end for the ASR demo.""" +from pynput import keyboard +import struct +import socket +import sys +import argparse +import pyaudio + +parser = argparse.ArgumentParser(description=__doc__) +parser.add_argument( + "--host_ip", + default="localhost", + type=str, + help="Server IP address. (default: %(default)s)") +parser.add_argument( + "--host_port", + default=8086, + type=int, + help="Server Port. (default: %(default)s)") +args = parser.parse_args() + +is_recording = False +enable_trigger_record = True + + +def on_press(key): + """On-press keyboard callback function.""" + global is_recording, enable_trigger_record + if key == keyboard.Key.space: + if (not is_recording) and enable_trigger_record: + sys.stdout.write("Start Recording ... ") + sys.stdout.flush() + is_recording = True + + +def on_release(key): + """On-release keyboard callback function.""" + global is_recording, enable_trigger_record + if key == keyboard.Key.esc: + return False + elif key == keyboard.Key.space: + if is_recording == True: + is_recording = False + + +data_list = [] + + +def callback(in_data, frame_count, time_info, status): + """Audio recorder's stream callback function.""" + global data_list, is_recording, enable_trigger_record + if is_recording: + data_list.append(in_data) + enable_trigger_record = False + elif len(data_list) > 0: + # Connect to server and send data + sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + sock.connect((args.host_ip, args.host_port)) + sent = ''.join(data_list) + sock.sendall(struct.pack('>i', len(sent)) + sent) + print('Speech[length=%d] Sent.' 
% len(sent)) + # Receive data from the server and shut down + received = sock.recv(1024) + print "Recognition Results: {}".format(received) + sock.close() + data_list = [] + enable_trigger_record = True + return (in_data, pyaudio.paContinue) + + +def main(): + # prepare audio recorder + p = pyaudio.PyAudio() + stream = p.open( + format=pyaudio.paInt32, + channels=1, + rate=16000, + input=True, + stream_callback=callback) + stream.start_stream() + + # prepare keyboard listener + with keyboard.Listener( + on_press=on_press, on_release=on_release) as listener: + listener.join() + + # close up + stream.stop_stream() + stream.close() + p.terminate() + + +if __name__ == "__main__": + main() diff --git a/demo_server.py b/demo_server.py new file mode 100644 index 0000000000000000000000000000000000000000..c7e7e94a450121ea3c5c12fbbf7df4dfa3a48262 --- /dev/null +++ b/demo_server.py @@ -0,0 +1,245 @@ +"""Server-end for the ASR demo.""" +import os +import time +import random +import argparse +import distutils.util +from time import gmtime, strftime +import SocketServer +import struct +import wave +import paddle.v2 as paddle +from utils import print_arguments +from data_utils.data import DataGenerator +from model import DeepSpeech2Model +from data_utils.utils import read_manifest + +parser = argparse.ArgumentParser(description=__doc__) +parser.add_argument( + "--host_ip", + default="localhost", + type=str, + help="Server IP address. (default: %(default)s)") +parser.add_argument( + "--host_port", + default=8086, + type=int, + help="Server Port. (default: %(default)s)") +parser.add_argument( + "--speech_save_dir", + default="demo_cache", + type=str, + help="Directory for saving demo speech. (default: %(default)s)") +parser.add_argument( + "--vocab_filepath", + default='datasets/vocab/eng_vocab.txt', + type=str, + help="Vocabulary filepath. (default: %(default)s)") +parser.add_argument( + "--mean_std_filepath", + default='mean_std.npz', + type=str, + help="Manifest path for normalizer. (default: %(default)s)") +parser.add_argument( + "--warmup_manifest_path", + default='datasets/manifest.test', + type=str, + help="Manifest path for warmup test. (default: %(default)s)") +parser.add_argument( + "--specgram_type", + default='linear', + type=str, + help="Feature type of audio data: 'linear' (power spectrum)" + " or 'mfcc'. (default: %(default)s)") +parser.add_argument( + "--num_conv_layers", + default=2, + type=int, + help="Convolution layer number. (default: %(default)s)") +parser.add_argument( + "--num_rnn_layers", + default=3, + type=int, + help="RNN layer number. (default: %(default)s)") +parser.add_argument( + "--rnn_layer_size", + default=512, + type=int, + help="RNN layer cell number. (default: %(default)s)") +parser.add_argument( + "--use_gpu", + default=True, + type=distutils.util.strtobool, + help="Use gpu or not. (default: %(default)s)") +parser.add_argument( + "--model_filepath", + default='checkpoints/params.latest.tar.gz', + type=str, + help="Model filepath. (default: %(default)s)") +parser.add_argument( + "--decode_method", + default='beam_search', + type=str, + help="Method for ctc decoding: best_path or beam_search. " + "(default: %(default)s)") +parser.add_argument( + "--beam_size", + default=100, + type=int, + help="Width for beam search decoding. (default: %(default)d)") +parser.add_argument( + "--language_model_path", + default="lm/data/common_crawl_00.prune01111.trie.klm", + type=str, + help="Path for language model. 
(default: %(default)s)") +parser.add_argument( + "--alpha", + default=0.36, + type=float, + help="Parameter associated with language model. (default: %(default)f)") +parser.add_argument( + "--beta", + default=0.25, + type=float, + help="Parameter associated with word count. (default: %(default)f)") +parser.add_argument( + "--cutoff_prob", + default=0.99, + type=float, + help="The cutoff probability of pruning" + "in beam search. (default: %(default)f)") +args = parser.parse_args() + + +class AsrTCPServer(SocketServer.TCPServer): + """The ASR TCP Server.""" + + def __init__(self, + server_address, + RequestHandlerClass, + speech_save_dir, + audio_process_handler, + bind_and_activate=True): + self.speech_save_dir = speech_save_dir + self.audio_process_handler = audio_process_handler + SocketServer.TCPServer.__init__( + self, server_address, RequestHandlerClass, bind_and_activate=True) + + +class AsrRequestHandler(SocketServer.BaseRequestHandler): + """The ASR request handler.""" + + def handle(self): + # receive data through TCP socket + chunk = self.request.recv(1024) + target_len = struct.unpack('>i', chunk[:4])[0] + data = chunk[4:] + while len(data) < target_len: + chunk = self.request.recv(1024) + data += chunk + # write to file + filename = self._write_to_file(data) + + print("Received utterance[length=%d] from %s, saved to %s." % + (len(data), self.client_address[0], filename)) + start_time = time.time() + transcript = self.server.audio_process_handler(filename) + finish_time = time.time() + print("Response Time: %f, Transcript: %s" % + (finish_time - start_time, transcript)) + self.request.sendall(transcript) + + def _write_to_file(self, data): + # prepare save dir and filename + if not os.path.exists(self.server.speech_save_dir): + os.mkdir(self.server.speech_save_dir) + timestamp = strftime("%Y%m%d%H%M%S", gmtime()) + out_filename = os.path.join( + self.server.speech_save_dir, + timestamp + "_" + self.client_address[0] + ".wav") + # write to wav file + file = wave.open(out_filename, 'wb') + file.setnchannels(1) + file.setsampwidth(4) + file.setframerate(16000) + file.writeframes(data) + file.close() + return out_filename + + +def warm_up_test(audio_process_handler, + manifest_path, + num_test_cases, + random_seed=0): + """Warming-up test.""" + manifest = read_manifest(manifest_path) + rng = random.Random(random_seed) + samples = rng.sample(manifest, num_test_cases) + for idx, sample in enumerate(samples): + print("Warm-up Test Case %d: %s", idx, sample['audio_filepath']) + start_time = time.time() + transcript = audio_process_handler(sample['audio_filepath']) + finish_time = time.time() + print("Response Time: %f, Transcript: %s" % + (finish_time - start_time, transcript)) + + +def start_server(): + """Start the ASR server""" + # prepare data generator + data_generator = DataGenerator( + vocab_filepath=args.vocab_filepath, + mean_std_filepath=args.mean_std_filepath, + augmentation_config='{}', + specgram_type=args.specgram_type, + num_threads=1) + # prepare ASR model + ds2_model = DeepSpeech2Model( + vocab_size=data_generator.vocab_size, + num_conv_layers=args.num_conv_layers, + num_rnn_layers=args.num_rnn_layers, + rnn_layer_size=args.rnn_layer_size, + pretrained_model_path=args.model_filepath) + + # prepare ASR inference handler + def file_to_transcript(filename): + feature = data_generator.process_utterance(filename, "") + result_transcript = ds2_model.infer_batch( + infer_data=[feature], + decode_method=args.decode_method, + beam_alpha=args.alpha, + beam_beta=args.beta, + 
beam_size=args.beam_size, + cutoff_prob=args.cutoff_prob, + vocab_list=data_generator.vocab_list, + language_model_path=args.language_model_path, + num_processes=1) + return result_transcript[0] + + # warming up with utterrances sampled from Librispeech + print('-----------------------------------------------------------') + print('Warming up ...') + warm_up_test( + audio_process_handler=file_to_transcript, + manifest_path=args.warmup_manifest_path, + num_test_cases=3) + print('-----------------------------------------------------------') + + # start the server + server = AsrTCPServer( + server_address=(args.host_ip, args.host_port), + RequestHandlerClass=AsrRequestHandler, + speech_save_dir=args.speech_save_dir, + audio_process_handler=file_to_transcript) + print("ASR Server Started.") + server.serve_forever() + + +def main(): + print_arguments(args) + paddle.init(use_gpu=args.use_gpu, trainer_count=1) + start_server() + + +if __name__ == "__main__": + main() diff --git a/error_rate.py b/error_rate.py index 0cf17921c0dd3db051648f93570baf900054bb52..ea829f4703a90babd53e5408cfd30a427430de0d 100644 --- a/error_rate.py +++ b/error_rate.py @@ -10,47 +10,54 @@ import numpy as np def _levenshtein_distance(ref, hyp): - """Levenshtein distance is a string metric for measuring the difference between - two sequences. Informally, the levenshtein disctance is defined as the minimum - number of single-character edits (substitutions, insertions or deletions) - required to change one word into the other. We can naturally extend the edits to - word level when calculate levenshtein disctance for two sentences. + """Levenshtein distance is a string metric for measuring the difference + between two sequences. Informally, the levenshtein disctance is defined as + the minimum number of single-character edits (substitutions, insertions or + deletions) required to change one word into the other. We can naturally + extend the edits to word level when calculate levenshtein disctance for + two sentences. """ - ref_len = len(ref) - hyp_len = len(hyp) + m = len(ref) + n = len(hyp) # special case if ref == hyp: return 0 - if ref_len == 0: - return hyp_len - if hyp_len == 0: - return ref_len + if m == 0: + return n + if n == 0: + return m - distance = np.zeros((ref_len + 1, hyp_len + 1), dtype=np.int32) + if m < n: + ref, hyp = hyp, ref + m, n = n, m + + # use O(min(m, n)) space + distance = np.zeros((2, n + 1), dtype=np.int32) # initialize distance matrix - for j in xrange(hyp_len + 1): + for j in xrange(n + 1): distance[0][j] = j - for i in xrange(ref_len + 1): - distance[i][0] = i # calculate levenshtein distance - for i in xrange(1, ref_len + 1): - for j in xrange(1, hyp_len + 1): + for i in xrange(1, m + 1): + prev_row_idx = (i - 1) % 2 + cur_row_idx = i % 2 + distance[cur_row_idx][0] = i + for j in xrange(1, n + 1): if ref[i - 1] == hyp[j - 1]: - distance[i][j] = distance[i - 1][j - 1] + distance[cur_row_idx][j] = distance[prev_row_idx][j - 1] else: - s_num = distance[i - 1][j - 1] + 1 - i_num = distance[i][j - 1] + 1 - d_num = distance[i - 1][j] + 1 - distance[i][j] = min(s_num, i_num, d_num) + s_num = distance[prev_row_idx][j - 1] + 1 + i_num = distance[cur_row_idx][j - 1] + 1 + d_num = distance[prev_row_idx][j] + 1 + distance[cur_row_idx][j] = min(s_num, i_num, d_num) - return distance[ref_len][hyp_len] + return distance[m % 2][n] def wer(reference, hypothesis, ignore_case=False, delimiter=' '): - """Calculate word error rate (WER). WER compares reference text and + """Calculate word error rate (WER). 
WER compares reference text and hypothesis text in word-level. WER is defined as: .. math:: @@ -65,8 +72,8 @@ def wer(reference, hypothesis, ignore_case=False, delimiter=' '): Iw is the number of words inserted, Nw is the number of words in the reference - We can use levenshtein distance to calculate WER. Please draw an attention that - empty items will be removed when splitting sentences by delimiter. + We can use levenshtein distance to calculate WER. Please draw an attention + that empty items will be removed when splitting sentences by delimiter. :param reference: The reference sentence. :type reference: basestring @@ -95,7 +102,7 @@ def wer(reference, hypothesis, ignore_case=False, delimiter=' '): return wer -def cer(reference, hypothesis, ignore_case=False): +def cer(reference, hypothesis, ignore_case=False, remove_space=False): """Calculate charactor error rate (CER). CER compares reference text and hypothesis text in char-level. CER is defined as: @@ -111,10 +118,10 @@ def cer(reference, hypothesis, ignore_case=False): Ic is the number of characters inserted Nc is the number of characters in the reference - We can use levenshtein distance to calculate CER. Chinese input should be - encoded to unicode. Please draw an attention that the leading and tailing - white space characters will be truncated and multiple consecutive white - space characters in a sentence will be replaced by one white space character. + We can use levenshtein distance to calculate CER. Chinese input should be + encoded to unicode. Please draw an attention that the leading and tailing + space characters will be truncated and multiple consecutive space + characters in a sentence will be replaced by one space character. :param reference: The reference sentence. :type reference: basestring @@ -122,6 +129,8 @@ def cer(reference, hypothesis, ignore_case=False): :type hypothesis: basestring :param ignore_case: Whether case-sensitive or not. :type ignore_case: bool + :param remove_space: Whether remove internal space characters + :type remove_space: bool :return: Character error rate. :rtype: float :raises ValueError: If the reference length is zero. @@ -130,8 +139,12 @@ def cer(reference, hypothesis, ignore_case=False): reference = reference.lower() hypothesis = hypothesis.lower() - reference = ' '.join(filter(None, reference.split(' '))) - hypothesis = ' '.join(filter(None, hypothesis.split(' '))) + join_char = ' ' + if remove_space == True: + join_char = '' + + reference = join_char.join(filter(None, reference.split(' '))) + hypothesis = join_char.join(filter(None, hypothesis.split(' '))) if len(reference) == 0: raise ValueError("Length of reference should be greater than 0.") diff --git a/evaluate.py b/evaluate.py index 19eabf4e5aff090ed2f529e3ea3cd7f10ae57cb7..82dcec3c24480d439f8a622964f0a1d90e948cd4 100644 --- a/evaluate.py +++ b/evaluate.py @@ -5,20 +5,24 @@ from __future__ import print_function import distutils.util import argparse -import gzip +import multiprocessing import paddle.v2 as paddle from data_utils.data import DataGenerator -from model import deep_speech2 -from decoder import * -from lm.lm_scorer import LmScorer -from error_rate import wer +from model import DeepSpeech2Model +from error_rate import wer, cer +import utils parser = argparse.ArgumentParser(description=__doc__) parser.add_argument( "--batch_size", - default=100, + default=128, type=int, help="Minibatch size for evaluation. 
(default: %(default)s)") +parser.add_argument( + "--trainer_count", + default=8, + type=int, + help="Trainer number. (default: %(default)s)") parser.add_argument( "--num_conv_layers", default=2, @@ -41,12 +45,12 @@ parser.add_argument( help="Use gpu or not. (default: %(default)s)") parser.add_argument( "--num_threads_data", - default=multiprocessing.cpu_count(), + default=multiprocessing.cpu_count() // 2, type=int, help="Number of cpu threads for preprocessing data. (default: %(default)s)") parser.add_argument( "--num_processes_beam_search", - default=multiprocessing.cpu_count(), + default=multiprocessing.cpu_count() // 2, type=int, help="Number of cpu processes for beam search. (default: %(default)s)") parser.add_argument( @@ -58,8 +62,8 @@ parser.add_argument( "--decode_method", default='beam_search', type=str, - help="Method for ctc decoding, best_path or beam_search. (default: %(default)s)" -) + help="Method for ctc decoding, best_path or beam_search. " + "(default: %(default)s)") parser.add_argument( "--language_model_path", default="lm/data/common_crawl_00.prune01111.trie.klm", @@ -67,12 +71,12 @@ parser.add_argument( help="Path for language model. (default: %(default)s)") parser.add_argument( "--alpha", - default=0.26, + default=0.36, type=float, help="Parameter associated with language model. (default: %(default)f)") parser.add_argument( "--beta", - default=0.1, + default=0.25, type=float, help="Parameter associated with word count. (default: %(default)f)") parser.add_argument( @@ -107,42 +111,25 @@ parser.add_argument( default='datasets/vocab/eng_vocab.txt', type=str, help="Vocabulary filepath. (default: %(default)s)") +parser.add_argument( + "--error_rate_type", + default='wer', + choices=['wer', 'cer'], + type=str, + help="Error rate type for evaluation. 'wer' for word error rate and 'cer' " + "for character error rate. " + "(default: %(default)s)") args = parser.parse_args() def evaluate(): """Evaluate on whole test data for DeepSpeech2.""" - # initialize data generator data_generator = DataGenerator( vocab_filepath=args.vocab_filepath, mean_std_filepath=args.mean_std_filepath, augmentation_config='{}', specgram_type=args.specgram_type, num_threads=args.num_threads_data) - - # create network config - # paddle.data_type.dense_array is used for variable batch input. - # The size 161 * 161 is only an placeholder value and the real shape - # of input batch data will be induced during training. 
- audio_data = paddle.layer.data( - name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161)) - text_data = paddle.layer.data( - name="transcript_text", - type=paddle.data_type.integer_value_sequence(data_generator.vocab_size)) - output_probs = deep_speech2( - audio_data=audio_data, - text_data=text_data, - dict_size=data_generator.vocab_size, - num_conv_layers=args.num_conv_layers, - num_rnn_layers=args.num_rnn_layers, - rnn_size=args.rnn_layer_size, - is_inference=True) - - # load parameters - parameters = paddle.parameters.Parameters.from_tar( - gzip.open(args.model_filepath)) - - # prepare infer data batch_reader = data_generator.batch_reader_creator( manifest_path=args.decode_manifest_path, batch_size=args.batch_size, @@ -150,61 +137,42 @@ def evaluate(): sortagrad=False, shuffle_method=None) - # define inferer - inferer = paddle.inference.Inference( - output_layer=output_probs, parameters=parameters) - - # initialize external scorer for beam search decoding - if args.decode_method == 'beam_search': - ext_scorer = LmScorer(args.alpha, args.beta, args.language_model_path) + ds2_model = DeepSpeech2Model( + vocab_size=data_generator.vocab_size, + num_conv_layers=args.num_conv_layers, + num_rnn_layers=args.num_rnn_layers, + rnn_layer_size=args.rnn_layer_size, + pretrained_model_path=args.model_filepath) - wer_counter, wer_sum = 0, 0.0 + error_rate_func = cer if args.error_rate_type == 'cer' else wer + error_sum, num_ins = 0.0, 0 for infer_data in batch_reader(): - # run inference - infer_results = inferer.infer(input=infer_data) - num_steps = len(infer_results) // len(infer_data) - probs_split = [ - infer_results[i * num_steps:(i + 1) * num_steps] - for i in xrange(0, len(infer_data)) - ] - # target transcription - target_transcription = [ - ''.join([ - data_generator.vocab_list[index] for index in infer_data[i][1] - ]) for i, probs in enumerate(probs_split) + result_transcripts = ds2_model.infer_batch( + infer_data=infer_data, + decode_method=args.decode_method, + beam_alpha=args.alpha, + beam_beta=args.beta, + beam_size=args.beam_size, + cutoff_prob=args.cutoff_prob, + vocab_list=data_generator.vocab_list, + language_model_path=args.language_model_path, + num_processes=args.num_processes_beam_search) + target_transcripts = [ + ''.join([data_generator.vocab_list[token] for token in transcript]) + for _, transcript in infer_data ] - # decode and print - # best path decode - if args.decode_method == "best_path": - for i, probs in enumerate(probs_split): - output_transcription = ctc_best_path_decoder( - probs_seq=probs, vocabulary=data_generator.vocab_list) - wer_sum += wer(target_transcription[i], output_transcription) - wer_counter += 1 - # beam search decode - elif args.decode_method == "beam_search": - # beam search using multiple processes - beam_search_results = ctc_beam_search_decoder_batch( - probs_split=probs_split, - vocabulary=data_generator.vocab_list, - beam_size=args.beam_size, - blank_id=len(data_generator.vocab_list), - num_processes=args.num_processes_beam_search, - ext_scoring_func=ext_scorer, - cutoff_prob=args.cutoff_prob, ) - for i, beam_search_result in enumerate(beam_search_results): - wer_sum += wer(target_transcription[i], - beam_search_result[0][1]) - wer_counter += 1 - else: - raise ValueError("Decoding method [%s] is not supported." 
% - decode_method) - - print("Final WER = %f" % (wer_sum / wer_counter)) + for target, result in zip(target_transcripts, result_transcripts): + error_sum += error_rate_func(target, result) + num_ins += 1 + print("Error rate [%s] (%d/?) = %f" % + (args.error_rate_type, num_ins, error_sum / num_ins)) + print("Final error rate [%s] (%d/%d) = %f" % + (args.error_rate_type, num_ins, num_ins, error_sum / num_ins)) def main(): - paddle.init(use_gpu=args.use_gpu, trainer_count=1) + utils.print_arguments(args) + paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count) evaluate() diff --git a/infer.py b/infer.py index 817526302764b3d6044688da97ad0cc072c14144..43643cde70f3421a9faf92e6177c103e4099c97d 100644 --- a/infer.py +++ b/infer.py @@ -4,15 +4,12 @@ from __future__ import division from __future__ import print_function import argparse -import gzip import distutils.util import multiprocessing import paddle.v2 as paddle from data_utils.data import DataGenerator -from model import deep_speech2 -from decoder import * -from lm.lm_scorer import LmScorer -from error_rate import wer +from model import DeepSpeech2Model +from error_rate import wer, cer import utils parser = argparse.ArgumentParser(description=__doc__) @@ -43,12 +40,12 @@ parser.add_argument( help="Use gpu or not. (default: %(default)s)") parser.add_argument( "--num_threads_data", - default=multiprocessing.cpu_count(), + default=1, type=int, help="Number of cpu threads for preprocessing data. (default: %(default)s)") parser.add_argument( "--num_processes_beam_search", - default=multiprocessing.cpu_count(), + default=multiprocessing.cpu_count() // 2, type=int, help="Number of cpu processes for beam search. (default: %(default)s)") parser.add_argument( @@ -57,6 +54,11 @@ parser.add_argument( type=str, help="Feature type of audio data: 'linear' (power spectrum)" " or 'mfcc'. (default: %(default)s)") +parser.add_argument( + "--trainer_count", + default=8, + type=int, + help="Trainer number. (default: %(default)s)") parser.add_argument( "--mean_std_filepath", default='mean_std.npz', @@ -81,18 +83,13 @@ parser.add_argument( "--decode_method", default='beam_search', type=str, - help="Method for ctc decoding: best_path or beam_search. (default: %(default)s)" -) + help="Method for ctc decoding: best_path or beam_search. " + "(default: %(default)s)") parser.add_argument( "--beam_size", default=500, type=int, help="Width for beam search decoding. (default: %(default)d)") -parser.add_argument( - "--num_results_per_sample", - default=1, - type=int, - help="Number of output per sample in beam search. (default: %(default)d)") parser.add_argument( "--language_model_path", default="lm/data/common_crawl_00.prune01111.trie.klm", @@ -100,12 +97,12 @@ parser.add_argument( help="Path for language model. (default: %(default)s)") parser.add_argument( "--alpha", - default=0.26, + default=0.36, type=float, help="Parameter associated with language model. (default: %(default)f)") parser.add_argument( "--beta", - default=0.1, + default=0.25, type=float, help="Parameter associated with word count. (default: %(default)f)") parser.add_argument( @@ -114,42 +111,25 @@ parser.add_argument( type=float, help="The cutoff probability of pruning" "in beam search. (default: %(default)f)") +parser.add_argument( + "--error_rate_type", + default='wer', + choices=['wer', 'cer'], + type=str, + help="Error rate type for evaluation. 'wer' for word error rate and 'cer' " + "for character error rate. 
" + "(default: %(default)s)") args = parser.parse_args() def infer(): """Inference for DeepSpeech2.""" - # initialize data generator data_generator = DataGenerator( vocab_filepath=args.vocab_filepath, mean_std_filepath=args.mean_std_filepath, augmentation_config='{}', specgram_type=args.specgram_type, num_threads=args.num_threads_data) - - # create network config - # paddle.data_type.dense_array is used for variable batch input. - # The size 161 * 161 is only an placeholder value and the real shape - # of input batch data will be induced during training. - audio_data = paddle.layer.data( - name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161)) - text_data = paddle.layer.data( - name="transcript_text", - type=paddle.data_type.integer_value_sequence(data_generator.vocab_size)) - output_probs = deep_speech2( - audio_data=audio_data, - text_data=text_data, - dict_size=data_generator.vocab_size, - num_conv_layers=args.num_conv_layers, - num_rnn_layers=args.num_rnn_layers, - rnn_size=args.rnn_layer_size, - is_inference=True) - - # load parameters - parameters = paddle.parameters.Parameters.from_tar( - gzip.open(args.model_filepath)) - - # prepare infer data batch_reader = data_generator.batch_reader_creator( manifest_path=args.decode_manifest_path, batch_size=args.num_samples, @@ -158,66 +138,38 @@ def infer(): shuffle_method=None) infer_data = batch_reader().next() - # run inference - infer_results = paddle.infer( - output_layer=output_probs, parameters=parameters, input=infer_data) - num_steps = len(infer_results) // len(infer_data) - probs_split = [ - infer_results[i * num_steps:(i + 1) * num_steps] - for i in xrange(len(infer_data)) - ] + ds2_model = DeepSpeech2Model( + vocab_size=data_generator.vocab_size, + num_conv_layers=args.num_conv_layers, + num_rnn_layers=args.num_rnn_layers, + rnn_layer_size=args.rnn_layer_size, + pretrained_model_path=args.model_filepath) + result_transcripts = ds2_model.infer_batch( + infer_data=infer_data, + decode_method=args.decode_method, + beam_alpha=args.alpha, + beam_beta=args.beta, + beam_size=args.beam_size, + cutoff_prob=args.cutoff_prob, + vocab_list=data_generator.vocab_list, + language_model_path=args.language_model_path, + num_processes=args.num_processes_beam_search) - # targe transcription - target_transcription = [ - ''.join( - [data_generator.vocab_list[index] for index in infer_data[i][1]]) - for i, probs in enumerate(probs_split) + error_rate_func = cer if args.error_rate_type == 'cer' else wer + target_transcripts = [ + ''.join([data_generator.vocab_list[token] for token in transcript]) + for _, transcript in infer_data ] - - ## decode and print - # best path decode - wer_sum, wer_counter = 0, 0 - if args.decode_method == "best_path": - for i, probs in enumerate(probs_split): - best_path_transcription = ctc_best_path_decoder( - probs_seq=probs, vocabulary=data_generator.vocab_list) - print("\nTarget Transcription: %s\nOutput Transcription: %s" % - (target_transcription[i], best_path_transcription)) - wer_cur = wer(target_transcription[i], best_path_transcription) - wer_sum += wer_cur - wer_counter += 1 - print("cur wer = %f, average wer = %f" % - (wer_cur, wer_sum / wer_counter)) - # beam search decode - elif args.decode_method == "beam_search": - ext_scorer = LmScorer(args.alpha, args.beta, args.language_model_path) - beam_search_batch_results = ctc_beam_search_decoder_batch( - probs_split=probs_split, - vocabulary=data_generator.vocab_list, - beam_size=args.beam_size, - blank_id=len(data_generator.vocab_list), - 
num_processes=args.num_processes_beam_search, - cutoff_prob=args.cutoff_prob, - ext_scoring_func=ext_scorer, ) - for i, beam_search_result in enumerate(beam_search_batch_results): - print("\nTarget Transcription:\t%s" % target_transcription[i]) - for index in xrange(args.num_results_per_sample): - result = beam_search_result[index] - #output: index, log prob, beam result - print("Beam %d: %f \t%s" % (index, result[0], result[1])) - wer_cur = wer(target_transcription[i], beam_search_result[0][1]) - wer_sum += wer_cur - wer_counter += 1 - print("cur wer = %f , average wer = %f" % - (wer_cur, wer_sum / wer_counter)) - else: - raise ValueError("Decoding method [%s] is not supported." % - decode_method) + for target, result in zip(target_transcripts, result_transcripts): + print("\nTarget Transcription: %s\nOutput Transcription: %s" % + (target, result)) + print("Current error rate [%s] = %f" % + (args.error_rate_type, error_rate_func(target, result))) def main(): utils.print_arguments(args) - paddle.init(use_gpu=args.use_gpu, trainer_count=1) + paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count) infer() diff --git a/layer.py b/layer.py new file mode 100644 index 0000000000000000000000000000000000000000..3b492645d5a42f3f0c61d2646b7d6a19bb0c3e98 --- /dev/null +++ b/layer.py @@ -0,0 +1,177 @@ +"""Contains DeepSpeech2 layers.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import paddle.v2 as paddle + + +def conv_bn_layer(input, filter_size, num_channels_in, num_channels_out, stride, + padding, act): + """Convolution layer with batch normalization. + + :param input: Input layer. + :type input: LayerOutput + :param filter_size: The x dimension of a filter kernel. Or input a tuple for + two image dimension. + :type filter_size: int|tuple|list + :param num_channels_in: Number of input channels. + :type num_channels_in: int + :type num_channels_out: Number of output channels. + :type num_channels_in: out + :param padding: The x dimension of the padding. Or input a tuple for two + image dimension. + :type padding: int|tuple|list + :param act: Activation type. + :type act: BaseActivation + :return: Batch norm layer after convolution layer. + :rtype: LayerOutput + """ + conv_layer = paddle.layer.img_conv( + input=input, + filter_size=filter_size, + num_channels=num_channels_in, + num_filters=num_channels_out, + stride=stride, + padding=padding, + act=paddle.activation.Linear(), + bias_attr=False) + return paddle.layer.batch_norm(input=conv_layer, act=act) + + +def bidirectional_simple_rnn_bn_layer(name, input, size, act): + """Bidirectonal simple rnn layer with sequence-wise batch normalization. + The batch normalization is only performed on input-state weights. + + :param name: Name of the layer. + :type name: string + :param input: Input layer. + :type input: LayerOutput + :param size: Number of RNN cells. + :type size: int + :param act: Activation type. + :type act: BaseActivation + :return: Bidirectional simple rnn layer. + :rtype: LayerOutput + """ + # input-hidden weights shared across bi-direcitonal rnn. 
+ input_proj = paddle.layer.fc( + input=input, size=size, act=paddle.activation.Linear(), bias_attr=False) + # batch norm is only performed on input-state projection + input_proj_bn = paddle.layer.batch_norm( + input=input_proj, act=paddle.activation.Linear()) + # forward and backward in time + forward_simple_rnn = paddle.layer.recurrent( + input=input_proj_bn, act=act, reverse=False) + backward_simple_rnn = paddle.layer.recurrent( + input=input_proj_bn, act=act, reverse=True) + return paddle.layer.concat(input=[forward_simple_rnn, backward_simple_rnn]) + + +def conv_group(input, num_stacks): + """Convolution group with stacked convolution layers. + + :param input: Input layer. + :type input: LayerOutput + :param num_stacks: Number of stacked convolution layers. + :type num_stacks: int + :return: Output layer of the convolution group. + :rtype: LayerOutput + """ + conv = conv_bn_layer( + input=input, + filter_size=(11, 41), + num_channels_in=1, + num_channels_out=32, + stride=(3, 2), + padding=(5, 20), + act=paddle.activation.BRelu()) + for i in xrange(num_stacks - 1): + conv = conv_bn_layer( + input=conv, + filter_size=(11, 21), + num_channels_in=32, + num_channels_out=32, + stride=(1, 2), + padding=(5, 10), + act=paddle.activation.BRelu()) + output_num_channels = 32 + output_height = 160 // pow(2, num_stacks) + 1 + return conv, output_num_channels, output_height + + +def rnn_group(input, size, num_stacks): + """RNN group with stacked bidirectional simple RNN layers. + + :param input: Input layer. + :type input: LayerOutput + :param size: Number of RNN cells in each layer. + :type size: int + :param num_stacks: Number of stacked rnn layers. + :type num_stacks: int + :return: Output layer of the RNN group. + :rtype: LayerOutput + """ + output = input + for i in xrange(num_stacks): + output = bidirectional_simple_rnn_bn_layer( + name=str(i), input=output, size=size, act=paddle.activation.BRelu()) + return output + + +def deep_speech2(audio_data, + text_data, + dict_size, + num_conv_layers=2, + num_rnn_layers=3, + rnn_size=256): + """ + The whole DeepSpeech2 model structure (a simplified version). + + :param audio_data: Audio spectrogram data layer. + :type audio_data: LayerOutput + :param text_data: Transcription text data layer. + :type text_data: LayerOutput + :param dict_size: Dictionary size for tokenized transcription. + :type dict_size: int + :param num_conv_layers: Number of stacking convolution layers. + :type num_conv_layers: int + :param num_rnn_layers: Number of stacking RNN layers. + :type num_rnn_layers: int + :param rnn_size: RNN layer size (number of RNN cells). + :type rnn_size: int + :return: A tuple of an output unnormalized log probability layer ( + before softmax) and a ctc cost layer. 
+ :rtype: tuple of LayerOutput + """ + # convolution group + conv_group_output, conv_group_num_channels, conv_group_height = conv_group( + input=audio_data, num_stacks=num_conv_layers) + # convert data form convolution feature map to sequence of vectors + conv2seq = paddle.layer.block_expand( + input=conv_group_output, + num_channels=conv_group_num_channels, + stride_x=1, + stride_y=1, + block_x=1, + block_y=conv_group_height) + # rnn group + rnn_group_output = rnn_group( + input=conv2seq, size=rnn_size, num_stacks=num_rnn_layers) + fc = paddle.layer.fc( + input=rnn_group_output, + size=dict_size + 1, + act=paddle.activation.Linear(), + bias_attr=True) + # probability distribution with softmax + log_probs = paddle.layer.mixed( + input=paddle.layer.identity_projection(input=fc), + act=paddle.activation.Softmax()) + # ctc cost + ctc_loss = paddle.layer.warp_ctc( + input=fc, + label=text_data, + size=dict_size + 1, + blank=dict_size, + norm_by_times=True) + return log_probs, ctc_loss diff --git a/model.py b/model.py index cb0b4ecbba1a3fb435a5f625a54d6e5bebe689e0..e2f2903b6ecff653c1dd032308c1cdd7eb4a175d 100644 --- a/model.py +++ b/model.py @@ -3,141 +3,240 @@ from __future__ import absolute_import from __future__ import division from __future__ import print_function +import sys +import os +import time +import gzip +from decoder import * +from lm.lm_scorer import LmScorer import paddle.v2 as paddle +from layer import * -def conv_bn_layer(input, filter_size, num_channels_in, num_channels_out, stride, - padding, act): - """ - Convolution layer with batch normalization. - """ - conv_layer = paddle.layer.img_conv( - input=input, - filter_size=filter_size, - num_channels=num_channels_in, - num_filters=num_channels_out, - stride=stride, - padding=padding, - act=paddle.activation.Linear(), - bias_attr=False) - return paddle.layer.batch_norm(input=conv_layer, act=act) - - -def bidirectional_simple_rnn_bn_layer(name, input, size, act): - """ - Bidirectonal simple rnn layer with sequence-wise batch normalization. - The batch normalization is only performed on input-state weights. - """ - # input-hidden weights shared across bi-direcitonal rnn. - input_proj = paddle.layer.fc( - input=input, size=size, act=paddle.activation.Linear(), bias_attr=False) - # batch norm is only performed on input-state projection - input_proj_bn = paddle.layer.batch_norm( - input=input_proj, act=paddle.activation.Linear()) - # forward and backward in time - forward_simple_rnn = paddle.layer.recurrent( - input=input_proj_bn, act=act, reverse=False) - backward_simple_rnn = paddle.layer.recurrent( - input=input_proj_bn, act=act, reverse=True) - return paddle.layer.concat(input=[forward_simple_rnn, backward_simple_rnn]) - - -def conv_group(input, num_stacks): - """ - Convolution group with several stacking convolution layers. - """ - conv = conv_bn_layer( - input=input, - filter_size=(11, 41), - num_channels_in=1, - num_channels_out=32, - stride=(3, 2), - padding=(5, 20), - act=paddle.activation.BRelu()) - for i in xrange(num_stacks - 1): - conv = conv_bn_layer( - input=conv, - filter_size=(11, 21), - num_channels_in=32, - num_channels_out=32, - stride=(1, 2), - padding=(5, 10), - act=paddle.activation.BRelu()) - output_num_channels = 32 - output_height = 160 // pow(2, num_stacks) + 1 - return conv, output_num_channels, output_height - - -def rnn_group(input, size, num_stacks): - """ - RNN group with several stacking RNN layers. 
- """ - output = input - for i in xrange(num_stacks): - output = bidirectional_simple_rnn_bn_layer( - name=str(i), input=output, size=size, act=paddle.activation.BRelu()) - return output - - -def deep_speech2(audio_data, - text_data, - dict_size, - num_conv_layers=2, - num_rnn_layers=3, - rnn_size=256, - is_inference=False): - """ - The whole DeepSpeech2 model structure (a simplified version). - - :param audio_data: Audio spectrogram data layer. - :type audio_data: LayerOutput - :param text_data: Transcription text data layer. - :type text_data: LayerOutput - :param dict_size: Dictionary size for tokenized transcription. - :type dict_size: int +class DeepSpeech2Model(object): + """DeepSpeech2Model class. + + :param vocab_size: Decoding vocabulary size. + :type vocab_size: int :param num_conv_layers: Number of stacking convolution layers. :type num_conv_layers: int :param num_rnn_layers: Number of stacking RNN layers. :type num_rnn_layers: int - :param rnn_size: RNN layer size (number of RNN cells). - :type rnn_size: int - :param is_inference: False in the training mode, and True in the - inferene mode. - :type is_inference: bool - :return: If is_inference set False, return a ctc cost layer; - if is_inference set True, return a sequence layer of output - probability distribution. - :rtype: tuple of LayerOutput + :param rnn_layer_size: RNN layer size (number of RNN cells). + :type rnn_layer_size: int + :param pretrained_model_path: Pretrained model path. If None, will train + from stratch. + :type pretrained_model_path: basestring|None """ - # convolution group - conv_group_output, conv_group_num_channels, conv_group_height = conv_group( - input=audio_data, num_stacks=num_conv_layers) - # convert data form convolution feature map to sequence of vectors - conv2seq = paddle.layer.block_expand( - input=conv_group_output, - num_channels=conv_group_num_channels, - stride_x=1, - stride_y=1, - block_x=1, - block_y=conv_group_height) - # rnn group - rnn_group_output = rnn_group( - input=conv2seq, size=rnn_size, num_stacks=num_rnn_layers) - fc = paddle.layer.fc( - input=rnn_group_output, - size=dict_size + 1, - act=paddle.activation.Linear(), - bias_attr=True) - if is_inference: - # probability distribution with softmax - return paddle.layer.mixed( - input=paddle.layer.identity_projection(input=fc), - act=paddle.activation.Softmax()) - else: - # ctc cost - return paddle.layer.warp_ctc( - input=fc, - label=text_data, - size=dict_size + 1, - blank=dict_size, - norm_by_times=True) + + def __init__(self, vocab_size, num_conv_layers, num_rnn_layers, + rnn_layer_size, pretrained_model_path): + self._create_network(vocab_size, num_conv_layers, num_rnn_layers, + rnn_layer_size) + self._create_parameters(pretrained_model_path) + self._inferer = None + self._loss_inferer = None + self._ext_scorer = None + + def train(self, + train_batch_reader, + dev_batch_reader, + feeding_dict, + learning_rate, + gradient_clipping, + num_passes, + output_model_dir, + num_iterations_print=100): + """Train the model. + + :param train_batch_reader: Train data reader. + :type train_batch_reader: callable + :param dev_batch_reader: Validation data reader. + :type dev_batch_reader: callable + :param feeding_dict: Feeding is a map of field name and tuple index + of the data that reader returns. + :type feeding_dict: dict|list + :param learning_rate: Learning rate for ADAM optimizer. + :type learning_rate: float + :param gradient_clipping: Gradient clipping threshold. 
+ :type gradient_clipping: float + :param num_passes: Number of training epochs. + :type num_passes: int + :param num_iterations_print: Number of training iterations for printing + a training loss. + :type rnn_iteratons_print: int + :param output_model_dir: Directory for saving the model (every pass). + :type output_model_dir: basestring + """ + # prepare model output directory + if not os.path.exists(output_model_dir): + os.mkdir(output_model_dir) + + # prepare optimizer and trainer + optimizer = paddle.optimizer.Adam( + learning_rate=learning_rate, + gradient_clipping_threshold=gradient_clipping) + trainer = paddle.trainer.SGD( + cost=self._loss, + parameters=self._parameters, + update_equation=optimizer) + + # create event handler + def event_handler(event): + global start_time, cost_sum, cost_counter + if isinstance(event, paddle.event.EndIteration): + cost_sum += event.cost + cost_counter += 1 + if (event.batch_id + 1) % num_iterations_print == 0: + output_model_path = os.path.join(output_model_dir, + "params.latest.tar.gz") + with gzip.open(output_model_path, 'w') as f: + self._parameters.to_tar(f) + print("\nPass: %d, Batch: %d, TrainCost: %f" % + (event.pass_id, event.batch_id + 1, + cost_sum / cost_counter)) + cost_sum, cost_counter = 0.0, 0 + else: + sys.stdout.write('.') + sys.stdout.flush() + if isinstance(event, paddle.event.BeginPass): + start_time = time.time() + cost_sum, cost_counter = 0.0, 0 + if isinstance(event, paddle.event.EndPass): + result = trainer.test( + reader=dev_batch_reader, feeding=feeding_dict) + output_model_path = os.path.join( + output_model_dir, "params.pass-%d.tar.gz" % event.pass_id) + with gzip.open(output_model_path, 'w') as f: + self._parameters.to_tar(f) + print("\n------- Time: %d sec, Pass: %d, ValidationCost: %s" % + (time.time() - start_time, event.pass_id, result.cost)) + + # run train + trainer.train( + reader=train_batch_reader, + event_handler=event_handler, + num_passes=num_passes, + feeding=feeding_dict) + + def infer_loss_batch(self, infer_data): + """Model inference. Infer the ctc loss for a batch of speech + utterances. + + :param infer_data: List of utterances to infer, with each utterance a + tuple of audio features and transcription text (empty + string). + :type infer_data: list + :return: List of ctc loss. + :rtype: List of float + """ + # define inferer + if self._loss_inferer == None: + self._loss_inferer = paddle.inference.Inference( + output_layer=self._loss, parameters=self._parameters) + # run inference + return self._loss_inferer.infer(input=infer_data) + + def infer_batch(self, infer_data, decode_method, beam_alpha, beam_beta, + beam_size, cutoff_prob, vocab_list, language_model_path, + num_processes): + """Model inference. Infer the transcription for a batch of speech + utterances. + + :param infer_data: List of utterances to infer, with each utterance + consisting of a tuple of audio features and + transcription text (empty string). + :type infer_data: list + :param decode_method: Decoding method name, 'best_path' or + 'beam search'. + :param decode_method: string + :param beam_alpha: Parameter associated with language model. + :type beam_alpha: float + :param beam_beta: Parameter associated with word count. + :type beam_beta: float + :param beam_size: Width for Beam search. + :type beam_size: int + :param cutoff_prob: Cutoff probability in pruning, + default 1.0, no pruning. + :type cutoff_prob: float + :param vocab_list: List of tokens in the vocabulary, for decoding. 
+ :type vocab_list: list + :param language_model_path: Filepath for language model. + :type language_model_path: basestring|None + :param num_processes: Number of processes (CPU) for decoder. + :type num_processes: int + :return: List of transcription texts. + :rtype: List of basestring + """ + # define inferer + if self._inferer == None: + self._inferer = paddle.inference.Inference( + output_layer=self._log_probs, parameters=self._parameters) + # run inference + infer_results = self._inferer.infer(input=infer_data) + num_steps = len(infer_results) // len(infer_data) + probs_split = [ + infer_results[i * num_steps:(i + 1) * num_steps] + for i in xrange(0, len(infer_data)) + ] + # run decoder + results = [] + if decode_method == "best_path": + # best path decode + for i, probs in enumerate(probs_split): + output_transcription = ctc_best_path_decoder( + probs_seq=probs, vocabulary=vocab_list) + results.append(output_transcription) + elif decode_method == "beam_search": + # initialize external scorer + if self._ext_scorer == None: + self._ext_scorer = LmScorer(beam_alpha, beam_beta, + language_model_path) + self._loaded_lm_path = language_model_path + else: + self._ext_scorer.reset_params(beam_alpha, beam_beta) + assert self._loaded_lm_path == language_model_path + + # beam search decode + beam_search_results = ctc_beam_search_decoder_batch( + probs_split=probs_split, + vocabulary=vocab_list, + beam_size=beam_size, + blank_id=len(vocab_list), + num_processes=num_processes, + ext_scoring_func=self._ext_scorer, + cutoff_prob=cutoff_prob) + + results = [result[0][1] for result in beam_search_results] + else: + raise ValueError("Decoding method [%s] is not supported." % + decode_method) + return results + + def _create_parameters(self, model_path=None): + """Load or create model parameters.""" + if model_path is None: + self._parameters = paddle.parameters.create(self._loss) + else: + self._parameters = paddle.parameters.Parameters.from_tar( + gzip.open(model_path)) + + def _create_network(self, vocab_size, num_conv_layers, num_rnn_layers, + rnn_layer_size): + """Create data layers and model network.""" + # paddle.data_type.dense_array is used for variable batch input. + # The size 161 * 161 is only an placeholder value and the real shape + # of input batch data will be induced during training. 
+ audio_data = paddle.layer.data( + name="audio_spectrogram", + type=paddle.data_type.dense_array(161 * 161)) + text_data = paddle.layer.data( + name="transcript_text", + type=paddle.data_type.integer_value_sequence(vocab_size)) + self._log_probs, self._loss = deep_speech2( + audio_data=audio_data, + text_data=text_data, + dict_size=vocab_size, + num_conv_layers=num_conv_layers, + num_rnn_layers=num_rnn_layers, + rnn_size=rnn_layer_size) diff --git a/pcloud_train.sh b/pcloud_train.sh deleted file mode 100644 index ebf73bbb77d82536583411d5af12d690f3278d55..0000000000000000000000000000000000000000 --- a/pcloud_train.sh +++ /dev/null @@ -1,35 +0,0 @@ -DATA_PATH=$1 -MODEL_PATH=$2 -#setted by user -TRAIN_MANI=${DATA_PATH}/cloud.train.manifest -#setted by user -DEV_MANI=${DATA_PATH}/cloud.test.manifest -#setted by user -TRAIN_TAR=${DATA_PATH}/cloud.train.tar -#setted by user -DEV_TAR=${DATA_PATH}/cloud.test.tar -#setted by user -VOCAB_PATH=${DATA_PATH}/eng_vocab.txt -#setted by user -MEAN_STD_FILE=${DATA_PATH}/mean_std.npz - -# split train data for each pcloud node -python ./cloud/split_data.py \ ---in_manifest_path=$TRAIN_MANI \ ---data_tar_path=$TRAIN_TAR \ ---out_manifest_path='./local.train.manifest' - -# split dev data for each pcloud node -python ./cloud/split_data.py \ ---in_manifest_path=$DEV_MANI \ ---data_tar_path=$DEV_TAR \ ---out_manifest_path='./local.test.manifest' - -python train.py \ ---use_gpu=1 \ ---trainer_count=4 \ ---batch_size=256 \ ---mean_std_filepath=$MEAN_STD_FILE \ ---train_manifest_path='./local.train.manifest' \ ---dev_manifest_path='./local.test.manifest' \ ---vocab_filepath=$VOCAB_PATH \ diff --git a/requirements.txt b/requirements.txt old mode 100755 new mode 100644 index 721fa2811081e530a9cec3b2e403ad2372b59269..131f75ff47e003f3b44f4a62f1431cf13d4f44a4 --- a/requirements.txt +++ b/requirements.txt @@ -1,5 +1,5 @@ -wget==3.2 scipy==0.13.1 resampy==0.1.5 -https://github.com/kpu/kenlm/archive/master.zip +SoundFile==0.9.0.post1 python_speech_features +https://github.com/luotao1/kenlm/archive/master.zip diff --git a/setup.sh b/setup.sh index 8cba91ecdb68b42125181331471f9ee323062a24..7f4272550c4efb9cebd5483c4911caed02cd9673 100644 --- a/setup.sh +++ b/setup.sh @@ -9,25 +9,21 @@ if [ $? != 0 ]; then exit 1 fi -# install package Soundfile -curl -O "http://www.mega-nerd.com/libsndfile/files/libsndfile-1.0.28.tar.gz" +# install package libsndfile +python -c "import soundfile" if [ $? != 0 ]; then - echo "Download libsndfile-1.0.28.tar.gz failed !!!" - exit 1 + echo "Install package libsndfile into default system path." + curl -O "http://www.mega-nerd.com/libsndfile/files/libsndfile-1.0.28.tar.gz" + if [ $? != 0 ]; then + echo "Download libsndfile-1.0.28.tar.gz failed !!!" + exit 1 + fi + tar -zxvf libsndfile-1.0.28.tar.gz + cd libsndfile-1.0.28 + ./configure && make && make install + cd .. + rm -rf libsndfile-1.0.28 + rm libsndfile-1.0.28.tar.gz fi -tar -zxvf libsndfile-1.0.28.tar.gz -cd libsndfile-1.0.28 -./configure && make && make install -cd - -rm -rf libsndfile-1.0.28 -rm libsndfile-1.0.28.tar.gz -pip install SoundFile==0.9.0.post1 -if [ $? != 0 ]; then - echo "Install SoundFile failed !!!" - exit 1 -fi - -# prepare ./checkpoints -mkdir checkpoints echo "Install all dependencies successfully." 
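Before the updated error-rate tests below, here is a minimal usage sketch of the revised `error_rate` API introduced in this patch, including the new `remove_space` flag of `cer()`. The expected values are quoted from the new test cases; the only assumption is that the snippet runs from the repository root so that `error_rate` is importable, as the tests do.

```python
from __future__ import print_function
import error_rate

# word error rate: reference vs. hypothesis, split on whitespace
ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night'
hyp = 'i GOT IT TO the FULLEST i LOVE TO portable FROM OF STORES last night'
print('WER = %f' % error_rate.wer(ref, hyp))  # ~0.769231, per test_wer_1

# character error rate; remove_space=True drops internal spaces before comparing
print('CER = %f' % error_rate.cer('werewolf', 'weae wolf'))                     # 0.25
print('CER = %f' % error_rate.cer('werewolf', 'weae wolf', remove_space=True))  # 0.125
```

Internally, the Levenshtein routine now keeps only two rows of the dynamic-programming table, so memory usage is O(min(len(ref), len(hyp))) instead of O(len(ref) * len(hyp)), while the returned distances are unchanged.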
diff --git a/tests/test_error_rate.py b/tests/test_error_rate.py index be7313f3570c2633392e35f3bf38a0d02840a196..99e137a9a190cba8f2d99001cbf3c22ce8d53b56 100644 --- a/tests/test_error_rate.py +++ b/tests/test_error_rate.py @@ -11,16 +11,54 @@ import error_rate class TestParse(unittest.TestCase): def test_wer_1(self): ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night' - hyp = 'i GOT IT TO the FULLEST i LOVE TO portable FROM OF STORES last night' + hyp = 'i GOT IT TO the FULLEST i LOVE TO portable FROM OF STORES last '\ + 'night' word_error_rate = error_rate.wer(ref, hyp) self.assertTrue(abs(word_error_rate - 0.769230769231) < 1e-6) def test_wer_2(self): + ref = 'as any in england i would say said gamewell proudly that is '\ + 'in his day' + hyp = 'as any in england i would say said came well proudly that is '\ + 'in his day' + word_error_rate = error_rate.wer(ref, hyp) + self.assertTrue(abs(word_error_rate - 0.1333333) < 1e-6) + + def test_wer_3(self): + ref = 'the lieutenant governor lilburn w boggs afterward governor '\ + 'was a pronounced mormon hater and throughout the period of '\ + 'the troubles he manifested sympathy with the persecutors' + hyp = 'the lieutenant governor little bit how bags afterward '\ + 'governor was a pronounced warman hater and throughout the '\ + 'period of th troubles he manifests sympathy with the '\ + 'persecutors' + word_error_rate = error_rate.wer(ref, hyp) + self.assertTrue(abs(word_error_rate - 0.2692307692) < 1e-6) + + def test_wer_4(self): + ref = 'the wood flamed up splendidly under the large brewing copper '\ + 'and it sighed so deeply' + hyp = 'the wood flame do splendidly under the large brewing copper '\ + 'and its side so deeply' + word_error_rate = error_rate.wer(ref, hyp) + self.assertTrue(abs(word_error_rate - 0.2666666667) < 1e-6) + + def test_wer_5(self): + ref = 'all the morning they trudged up the mountain path and at noon '\ + 'unc and ojo sat on a fallen tree trunk and ate the last of '\ + 'the bread which the old munchkin had placed in his pocket' + hyp = 'all the morning they trudged up the mountain path and at noon '\ + 'unc in ojo sat on a fallen tree trunk and ate the last of '\ + 'the bread which the old munchkin had placed in his pocket' + word_error_rate = error_rate.wer(ref, hyp) + self.assertTrue(abs(word_error_rate - 0.027027027) < 1e-6) + + def test_wer_6(self): ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night' word_error_rate = error_rate.wer(ref, ref) self.assertEqual(word_error_rate, 0.0) - def test_wer_3(self): + def test_wer_7(self): ref = ' ' hyp = 'Hypothesis sentence' with self.assertRaises(ValueError): @@ -33,22 +71,40 @@ class TestParse(unittest.TestCase): self.assertTrue(abs(char_error_rate - 0.25) < 1e-6) def test_cer_2(self): + ref = 'werewolf' + hyp = 'weae wolf' + char_error_rate = error_rate.cer(ref, hyp, remove_space=True) + self.assertTrue(abs(char_error_rate - 0.125) < 1e-6) + + def test_cer_3(self): + ref = 'were wolf' + hyp = 'were wolf' + char_error_rate = error_rate.cer(ref, hyp) + self.assertTrue(abs(char_error_rate - 0.0) < 1e-6) + + def test_cer_4(self): ref = 'werewolf' char_error_rate = error_rate.cer(ref, ref) self.assertEqual(char_error_rate, 0.0) - def test_cer_3(self): + def test_cer_5(self): ref = u'我是中国人' hyp = u'我是 美洲人' char_error_rate = error_rate.cer(ref, hyp) self.assertTrue(abs(char_error_rate - 0.6) < 1e-6) - def test_cer_4(self): + def test_cer_6(self): + ref = u'我 是 中 国 人' + hyp = u'我 是 美 洲 人' + char_error_rate = error_rate.cer(ref, hyp, 
remove_space=True) + self.assertTrue(abs(char_error_rate - 0.4) < 1e-6) + + def test_cer_7(self): ref = u'我是中国人' char_error_rate = error_rate.cer(ref, ref) self.assertFalse(char_error_rate, 0.0) - def test_cer_5(self): + def test_cer_8(self): ref = '' hyp = 'Hypothesis' with self.assertRaises(ValueError): diff --git a/tests/test_setup.py b/tests/test_setup.py new file mode 100644 index 0000000000000000000000000000000000000000..18b9c1a0ce5333f559383b18704edf7270457fcf --- /dev/null +++ b/tests/test_setup.py @@ -0,0 +1,23 @@ +"""Test Setup.""" +import unittest +import numpy as np +import os + + +class TestSetup(unittest.TestCase): + def test_soundfile(self): + import soundfile as sf + # floating point data is typically limited to the interval [-1.0, 1.0], + # but smaller/larger values are supported as well + data = np.array([[1.75, -1.75], [1.0, -1.0], [0.5, -0.5], + [0.25, -0.25]]) + file = 'test.wav' + sf.write(file, data, 44100, format='WAV', subtype='FLOAT') + read, fs = sf.read(file) + self.assertTrue(np.all(read == data)) + self.assertEqual(fs, 44100) + os.remove(file) + + +if __name__ == '__main__': + unittest.main() diff --git a/tools/_init_paths.py b/tools/_init_paths.py new file mode 100644 index 0000000000000000000000000000000000000000..ddabb535be682d95c3c8b73003ea30eed06ca0b0 --- /dev/null +++ b/tools/_init_paths.py @@ -0,0 +1,19 @@ +"""Set up paths for DS2""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os.path +import sys + + +def add_path(path): + if path not in sys.path: + sys.path.insert(0, path) + + +this_dir = os.path.dirname(__file__) + +# Add project path to PYTHONPATH +proj_path = os.path.join(this_dir, '..') +add_path(proj_path) diff --git a/tools/build_vocab.py b/tools/build_vocab.py new file mode 100644 index 0000000000000000000000000000000000000000..618f2498537ba9d085a0ec3a60852f591bb0ff3e --- /dev/null +++ b/tools/build_vocab.py @@ -0,0 +1,59 @@ +"""Build vocabulary from manifest files. + +Each item in vocabulary file is a character. +""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import argparse +import codecs +import json +from collections import Counter +import os.path +import _init_paths +from data_utils import utils + +parser = argparse.ArgumentParser(description=__doc__) +parser.add_argument( + "--manifest_paths", + type=str, + help="Manifest paths for building vocabulary." + "You can provide multiple manifest files.", + nargs='+', + required=True) +parser.add_argument( + "--count_threshold", + default=0, + type=int, + help="Characters whose counts are below the threshold will be truncated. " + "(default: %(default)i)") +parser.add_argument( + "--vocab_path", + default='datasets/vocab/zh_vocab.txt', + type=str, + help="File path to write the vocabulary. 
(default: %(default)s)") +args = parser.parse_args() + + +def count_manifest(counter, manifest_path): + manifest_jsons = utils.read_manifest(manifest_path) + for line_json in manifest_jsons: + for char in line_json['text']: + counter.update(char) + + +def main(): + counter = Counter() + for manifest_path in args.manifest_paths: + count_manifest(counter, manifest_path) + + count_sorted = sorted(counter.items(), key=lambda x: x[1], reverse=True) + with codecs.open(args.vocab_path, 'w', 'utf-8') as fout: + for char, count in count_sorted: + if count < args.count_threshold: break + fout.write(char + '\n') + + +if __name__ == '__main__': + main() diff --git a/compute_mean_std.py b/tools/compute_mean_std.py similarity index 99% rename from compute_mean_std.py rename to tools/compute_mean_std.py index 0cc84e73022ecb1333b805457cace39adcc68ce4..da49eb4c056700e6c4da5e740c2bbcee84fa3154 100644 --- a/compute_mean_std.py +++ b/tools/compute_mean_std.py @@ -4,6 +4,7 @@ from __future__ import division from __future__ import print_function import argparse +import _init_paths from data_utils.normalizer import FeatureNormalizer from data_utils.augmentor.augmentation import AugmentationPipeline from data_utils.featurizer.audio_featurizer import AudioFeaturizer diff --git a/train.py b/train.py index 6481074c6e58f98f57f81c6e42480fa00a261bbe..0d4e2508dddf5cc6834b4f61f0c2cc8deee405af 100644 --- a/train.py +++ b/train.py @@ -3,15 +3,11 @@ from __future__ import absolute_import from __future__ import division from __future__ import print_function -import sys -import os import argparse -import gzip -import time import distutils.util import multiprocessing import paddle.v2 as paddle -from model import deep_speech2 +from model import DeepSpeech2Model from data_utils.data import DataGenerator import utils @@ -23,6 +19,12 @@ parser.add_argument( default=200, type=int, help="Training pass number. (default: %(default)s)") +parser.add_argument( + "--num_iterations_print", + default=100, + type=int, + help="Number of iterations for every train cost printing. " + "(default: %(default)s)") parser.add_argument( "--num_conv_layers", default=2, @@ -84,7 +86,7 @@ parser.add_argument( help="Trainer number. (default: %(default)s)") parser.add_argument( "--num_threads_data", - default=multiprocessing.cpu_count(), + default=multiprocessing.cpu_count() // 2, type=int, help="Number of cpu threads for preprocessing data. (default: %(default)s)") parser.add_argument( @@ -114,11 +116,14 @@ parser.add_argument( help="If set None, the training will start from scratch. " "Otherwise, the training will resume from " "the existing model of this path. (default: %(default)s)") +parser.add_argument( + "--output_model_dir", + default="./checkpoints", + type=str, + help="Directory for saving models. (default: %(default)s)") parser.add_argument( "--augmentation_config", - default='[{"type": "shift", ' - '"params": {"min_shift_ms": -5, "max_shift_ms": 5},' - '"prob": 1.0}]', + default=open('conf/augmentation.config', 'r').read(), type=str, help="Augmentation configuration in json-format. 
" "(default: %(default)s)") @@ -127,100 +132,48 @@ args = parser.parse_args() def train(): """DeepSpeech2 training.""" - - # initialize data generator - def data_generator(): - return DataGenerator( - vocab_filepath=args.vocab_filepath, - mean_std_filepath=args.mean_std_filepath, - augmentation_config=args.augmentation_config, - max_duration=args.max_duration, - min_duration=args.min_duration, - specgram_type=args.specgram_type, - num_threads=args.num_threads_data) - - train_generator = data_generator() - test_generator = data_generator() - - # create network config - # paddle.data_type.dense_array is used for variable batch input. - # The size 161 * 161 is only an placeholder value and the real shape - # of input batch data will be induced during training. - audio_data = paddle.layer.data( - name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161)) - text_data = paddle.layer.data( - name="transcript_text", - type=paddle.data_type.integer_value_sequence( - train_generator.vocab_size)) - cost = deep_speech2( - audio_data=audio_data, - text_data=text_data, - dict_size=train_generator.vocab_size, - num_conv_layers=args.num_conv_layers, - num_rnn_layers=args.num_rnn_layers, - rnn_size=args.rnn_layer_size, - is_inference=False) - - # create/load parameters and optimizer - if args.init_model_path is None: - parameters = paddle.parameters.create(cost) - else: - if not os.path.isfile(args.init_model_path): - raise IOError("Invalid model!") - parameters = paddle.parameters.Parameters.from_tar( - gzip.open(args.init_model_path)) - optimizer = paddle.optimizer.Adam( - learning_rate=args.adam_learning_rate, gradient_clipping_threshold=400) - trainer = paddle.trainer.SGD( - cost=cost, parameters=parameters, update_equation=optimizer) - - # prepare data reader + train_generator = DataGenerator( + vocab_filepath=args.vocab_filepath, + mean_std_filepath=args.mean_std_filepath, + augmentation_config=args.augmentation_config, + max_duration=args.max_duration, + min_duration=args.min_duration, + specgram_type=args.specgram_type, + num_threads=args.num_threads_data) + dev_generator = DataGenerator( + vocab_filepath=args.vocab_filepath, + mean_std_filepath=args.mean_std_filepath, + augmentation_config="{}", + specgram_type=args.specgram_type, + num_threads=args.num_threads_data) train_batch_reader = train_generator.batch_reader_creator( manifest_path=args.train_manifest_path, batch_size=args.batch_size, min_batch_size=args.trainer_count, sortagrad=args.use_sortagrad if args.init_model_path is None else False, shuffle_method=args.shuffle_method) - test_batch_reader = test_generator.batch_reader_creator( + dev_batch_reader = dev_generator.batch_reader_creator( manifest_path=args.dev_manifest_path, batch_size=args.batch_size, min_batch_size=1, # must be 1, but will have errors. 
sortagrad=False, shuffle_method=None) - # create event handler - def event_handler(event): - global start_time, cost_sum, cost_counter - if isinstance(event, paddle.event.EndIteration): - cost_sum += event.cost - cost_counter += 1 - if (event.batch_id + 1) % 100 == 0: - print("\nPass: %d, Batch: %d, TrainCost: %f" % ( - event.pass_id, event.batch_id + 1, cost_sum / cost_counter)) - cost_sum, cost_counter = 0.0, 0 - with gzip.open("checkpoints/params.latest.tar.gz", 'w') as f: - parameters.to_tar(f) - else: - sys.stdout.write('.') - sys.stdout.flush() - if isinstance(event, paddle.event.BeginPass): - start_time = time.time() - cost_sum, cost_counter = 0.0, 0 - if isinstance(event, paddle.event.EndPass): - result = trainer.test( - reader=test_batch_reader, feeding=test_generator.feeding) - print("\n------- Time: %d sec, Pass: %d, ValidationCost: %s" % - (time.time() - start_time, event.pass_id, result.cost)) - with gzip.open("checkpoints/params.pass-%d.tar.gz" % event.pass_id, - 'w') as f: - parameters.to_tar(f) - - # run train - trainer.train( - reader=train_batch_reader, - event_handler=event_handler, + ds2_model = DeepSpeech2Model( + vocab_size=train_generator.vocab_size, + num_conv_layers=args.num_conv_layers, + num_rnn_layers=args.num_rnn_layers, + rnn_layer_size=args.rnn_layer_size, + pretrained_model_path=args.init_model_path) + ds2_model.train( + train_batch_reader=train_batch_reader, + dev_batch_reader=dev_batch_reader, + feeding_dict=train_generator.feeding, + learning_rate=args.adam_learning_rate, + gradient_clipping=400, num_passes=args.num_passes, - feeding=train_generator.feeding) + num_iterations_print=args.num_iterations_print, + output_model_dir=args.output_model_dir) def main(): diff --git a/tune.py b/tune.py index 2fcca48628aa0aba7fd2e09a1d9ba90582492f89..328d67a1197634e5f02ad0689056196a8904fc06 100644 --- a/tune.py +++ b/tune.py @@ -3,14 +3,13 @@ from __future__ import absolute_import from __future__ import division from __future__ import print_function +import numpy as np import distutils.util import argparse -import gzip +import multiprocessing import paddle.v2 as paddle from data_utils.data import DataGenerator -from model import deep_speech2 -from decoder import * -from lm.lm_scorer import LmScorer +from model import DeepSpeech2Model from error_rate import wer import utils @@ -40,14 +39,19 @@ parser.add_argument( default=True, type=distutils.util.strtobool, help="Use gpu or not. (default: %(default)s)") +parser.add_argument( + "--trainer_count", + default=8, + type=int, + help="Trainer number. (default: %(default)s)") parser.add_argument( "--num_threads_data", - default=multiprocessing.cpu_count(), + default=1, type=int, help="Number of cpu threads for preprocessing data. (default: %(default)s)") parser.add_argument( "--num_processes_beam_search", - default=multiprocessing.cpu_count(), + default=multiprocessing.cpu_count() // 2, type=int, help="Number of cpu processes for beam search. (default: %(default)s)") parser.add_argument( @@ -62,10 +66,10 @@ parser.add_argument( type=str, help="Manifest path for normalizer. (default: %(default)s)") parser.add_argument( - "--decode_manifest_path", - default='datasets/manifest.test', + "--tune_manifest_path", + default='datasets/manifest.dev', type=str, - help="Manifest path for decoding. (default: %(default)s)") + help="Manifest path for tuning. 
(default: %(default)s)") parser.add_argument( "--model_filepath", default='checkpoints/params.latest.tar.gz', @@ -127,96 +131,64 @@ args = parser.parse_args() def tune(): """Tune parameters alpha and beta on one minibatch.""" - if not args.num_alphas >= 0: raise ValueError("num_alphas must be non-negative!") - if not args.num_betas >= 0: raise ValueError("num_betas must be non-negative!") - # initialize data generator data_generator = DataGenerator( vocab_filepath=args.vocab_filepath, mean_std_filepath=args.mean_std_filepath, augmentation_config='{}', specgram_type=args.specgram_type, num_threads=args.num_threads_data) - - # create network config - # paddle.data_type.dense_array is used for variable batch input. - # The size 161 * 161 is only an placeholder value and the real shape - # of input batch data will be induced during training. - audio_data = paddle.layer.data( - name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161)) - text_data = paddle.layer.data( - name="transcript_text", - type=paddle.data_type.integer_value_sequence(data_generator.vocab_size)) - output_probs = deep_speech2( - audio_data=audio_data, - text_data=text_data, - dict_size=data_generator.vocab_size, - num_conv_layers=args.num_conv_layers, - num_rnn_layers=args.num_rnn_layers, - rnn_size=args.rnn_layer_size, - is_inference=True) - - # load parameters - parameters = paddle.parameters.Parameters.from_tar( - gzip.open(args.model_filepath)) - - # prepare infer data batch_reader = data_generator.batch_reader_creator( - manifest_path=args.decode_manifest_path, + manifest_path=args.tune_manifest_path, batch_size=args.num_samples, sortagrad=False, shuffle_method=None) - # get one batch data for tuning - infer_data = batch_reader().next() - - # run inference - infer_results = paddle.infer( - output_layer=output_probs, parameters=parameters, input=infer_data) - num_steps = len(infer_results) // len(infer_data) - probs_split = [ - infer_results[i * num_steps:(i + 1) * num_steps] - for i in xrange(0, len(infer_data)) + tune_data = batch_reader().next() + target_transcripts = [ + ''.join([data_generator.vocab_list[token] for token in transcript]) + for _, transcript in tune_data ] + ds2_model = DeepSpeech2Model( + vocab_size=data_generator.vocab_size, + num_conv_layers=args.num_conv_layers, + num_rnn_layers=args.num_rnn_layers, + rnn_layer_size=args.rnn_layer_size, + pretrained_model_path=args.model_filepath) + # create grid for search cand_alphas = np.linspace(args.alpha_from, args.alpha_to, args.num_alphas) cand_betas = np.linspace(args.beta_from, args.beta_to, args.num_betas) params_grid = [(alpha, beta) for alpha in cand_alphas for beta in cand_betas] - ext_scorer = LmScorer(args.alpha_from, args.beta_from, - args.language_model_path) ## tune parameters in loop for alpha, beta in params_grid: - wer_sum, wer_counter = 0, 0 - # reset scorer - ext_scorer.reset_params(alpha, beta) - # beam search using multiple processes - beam_search_results = ctc_beam_search_decoder_batch( - probs_split=probs_split, - vocabulary=data_generator.vocab_list, + result_transcripts = ds2_model.infer_batch( + infer_data=tune_data, + decode_method='beam_search', + beam_alpha=alpha, + beam_beta=beta, beam_size=args.beam_size, cutoff_prob=args.cutoff_prob, - blank_id=len(data_generator.vocab_list), - num_processes=args.num_processes_beam_search, - ext_scoring_func=ext_scorer, ) - for i, beam_search_result in enumerate(beam_search_results): - target_transcription = ''.join([ - data_generator.vocab_list[index] for index in 
infer_data[i][1] - ]) - wer_sum += wer(target_transcription, beam_search_result[0][1]) - wer_counter += 1 - + vocab_list=data_generator.vocab_list, + language_model_path=args.language_model_path, + num_processes=args.num_processes_beam_search) + wer_sum, num_ins = 0.0, 0 + for target, result in zip(target_transcripts, result_transcripts): + wer_sum += wer(target, result) + num_ins += 1 print("alpha = %f\tbeta = %f\tWER = %f" % - (alpha, beta, wer_sum / wer_counter)) + (alpha, beta, wer_sum / num_ins)) def main(): - paddle.init(use_gpu=args.use_gpu, trainer_count=1) + utils.print_arguments(args) + paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count) tune()
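With this refactor, `infer.py`, `evaluate.py`, `tune.py` and `demo_server.py` all share the same inference pattern built around the new `DeepSpeech2Model` class. Below is a minimal end-to-end sketch of that pattern for a single utterance; the file paths are placeholders modeled on the script defaults, the layer sizes are illustrative only, and a trained checkpoint plus prepared `mean_std.npz` and vocabulary files are assumed to exist.

```python
"""Minimal sketch: transcribe one audio file with the refactored API.

All paths below are placeholders; see infer.py for the full argument set.
"""
from __future__ import print_function
import paddle.v2 as paddle
from data_utils.data import DataGenerator
from model import DeepSpeech2Model

paddle.init(use_gpu=False, trainer_count=1)

# featurizer / normalizer, driven by the same files used during training
data_generator = DataGenerator(
    vocab_filepath='datasets/vocab/eng_vocab.txt',
    mean_std_filepath='mean_std.npz',
    augmentation_config='{}',
    specgram_type='linear',
    num_threads=1)

# network topology plus pretrained parameters
ds2_model = DeepSpeech2Model(
    vocab_size=data_generator.vocab_size,
    num_conv_layers=2,
    num_rnn_layers=3,
    rnn_layer_size=512,  # placeholder size
    pretrained_model_path='checkpoints/params.latest.tar.gz')

# featurize one utterance (with an empty transcription) and decode it
feature = data_generator.process_utterance('some_audio.wav', '')
transcripts = ds2_model.infer_batch(
    infer_data=[feature],
    decode_method='beam_search',
    beam_alpha=0.36,
    beam_beta=0.25,
    beam_size=500,
    cutoff_prob=0.99,
    vocab_list=data_generator.vocab_list,
    language_model_path='lm/data/common_crawl_00.prune01111.trie.klm',
    num_processes=1)
print(transcripts[0])
```

Note that the model caches the external LM scorer after the first beam-search call and only resets alpha and beta on subsequent calls, which is what lets `tune.py` sweep its (alpha, beta) grid without reloading the language model each time.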