Commit c00db21e authored by: wanghaoshuang

Implement uploading data to PaddleCloud

1. Refine data_utils/data.py, reuse process_utterance function.
2. Modified README.
3. Implement uploading data in cloud/upload_data.py
4. Merge branch 'develop' of https://github.com/PaddlePaddle/models into ds2_pcloud
...@@ -2,14 +2,19 @@ ...@@ -2,14 +2,19 @@
## Installation ## Installation
Please replace `$PADDLE_INSTALL_DIR` with your own paddle installation directory. ### Prerequisites
- Only **Python 2.7** is supported;
- **cuDNN >= 6.0** is required to utilize NVIDIA GPU platform in the installation of PaddlePaddle, and the **CUDA toolkit** with proper version suitable for cuDNN. The cuDNN library below 6.0 is found to yield a fatal error in batch normalization when handling utterances with long duration in inference.
### Setup
``` ```
sh setup.sh sh setup.sh
export LD_LIBRARY_PATH=$PADDLE_INSTALL_DIR/Paddle/third_party/install/warpctc/lib:$LD_LIBRARY_PATH export LD_LIBRARY_PATH=$PADDLE_INSTALL_DIR/Paddle/third_party/install/warpctc/lib:$LD_LIBRARY_PATH
``` ```
For some machines, we also need to install libsndfile1. Details to be added. Please replace `$PADDLE_INSTALL_DIR` with your own paddle installation directory.
## Usage ## Usage
...@@ -35,13 +40,13 @@ python datasets/librispeech/librispeech.py --help ...@@ -35,13 +40,13 @@ python datasets/librispeech/librispeech.py --help
### Preparing for Training ### Preparing for Training
``` ```
python compute_mean_std.py python tools/compute_mean_std.py
``` ```
It will compute mean and standard deviation for audio features, and save them to a file with a default name `./mean_std.npz`. This file will be used in both training and inferencing. The default feature of audio data is power spectrum, and the mfcc feature is also supported. To train and infer based on mfcc feature, please generate this file by It will compute mean and standard deviation for audio features, and save them to a file with a default name `./mean_std.npz`. This file will be used in both training and inferencing. The default feature of audio data is power spectrum, and the mfcc feature is also supported. To train and infer based on mfcc feature, please generate this file by
``` ```
python compute_mean_std.py --specgram_type mfcc python tools/compute_mean_std.py --specgram_type mfcc
``` ```
and specify ```--specgram_type mfcc``` when running train.py, infer.py, evaluator.py or tune.py. and specify ```--specgram_type mfcc``` when running train.py, infer.py, evaluator.py or tune.py.
...@@ -49,7 +54,7 @@ and specify ```--specgram_type mfcc``` when running train.py, infer.py, evaluato ...@@ -49,7 +54,7 @@ and specify ```--specgram_type mfcc``` when running train.py, infer.py, evaluato
More help for arguments: More help for arguments:
``` ```
python compute_mean_std.py --help python tools/compute_mean_std.py --help
``` ```
### Training ### Training
...@@ -138,3 +143,28 @@ python tune.py --help ...@@ -138,3 +143,28 @@ python tune.py --help
``` ```
Then reset parameters with the tuning result before inference or evaluating. Then reset parameters with the tuning result before inference or evaluating.
### Playing with the ASR Demo
A real-time ASR demo is built for users to try out the ASR model with their own voice. Please install the following on the machine where you'd like to run the demo's client (this is not needed on the machine running the demo's server).
For example, on Mac OS X:
```
brew install portaudio
pip install pyaudio
pip install pynput
```
After a model and a language model are prepared, we can first start the demo's server:
```
CUDA_VISIBLE_DEVICES=0 python demo_server.py
```
And then in another console, start the demo's client:
```
python demo_client.py
```
On the client console, press and hold the "white-space" key on the keyboard to start talking, and release the "white-space" key when you finish your speech. The decoding results (inferred transcription) will be displayed.
It is possible to start the server and the client on two separate machines, e.g. `demo_client.py` is usually started on a machine with a microphone, while `demo_server.py` is usually started on a remote server with powerful GPUs. Please first make sure that these two machines have network access to each other, and then use `--host_ip` and `--host_port` to indicate the server machine's actual IP address (instead of the default `localhost`) and TCP port, in both `demo_server.py` and `demo_client.py`.
#DeepSpeech2 on paddle cloud # Run DS2 on PaddleCloud
## Run DS2 by public data >Note: Make sure current directory is `models/deep_speech_2/cloud/`
**Step1: ** Make sure current dir is `models/deep_speech_2/cloud/` ## Step1 Configure data set
**Step2:** Submit job by cmd: `sh pcloud_submit.sh` You can configure your input data and output path in pcloud_submit.sh:
- `TRAIN_MANIFEST`: Absolute path of the train data manifest file in the local filesystem. This file has the following format:
```
{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac", "duration": 5.855, "text": "mister quilter is the ..."}
{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0001.flac", "duration": 4.815, "text": "nor is mister ..."}
```
- `TEST_MANIFEST`: Absolute path of the test data manifest file in the local filesystem. This file has the same format as `TRAIN_MANIFEST`.
- `VOCAB_FILE`: Absolute path of the vocabulary file in the local filesystem.
- `MEAN_STD_FILE`: Absolute path of the mean and standard deviation file (e.g. `mean_std.npz`) in the local filesystem.
- `CLOUD_DATA_DIR`: Absolute path in the PaddleCloud filesystem. The local train data will be uploaded to this directory.
- `CLOUD_MODEL_DIR`: Absolute path in PaddleCloud filesystem. PaddleCloud trainer will save model to this directory.
>Note: Upload will be skipped if the target file already exists in ${CLOUD_DATA_DIR}.
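The manifest format described in this list is one JSON object per line. A minimal sketch of reading such a file, assuming exactly that layout; `load_manifest` and its `max_duration` parameter are hypothetical names for illustration and are not the repository's `read_manifest`:

```python
import json


def load_manifest(lines, max_duration=float("inf")):
    """Parse JSON-lines manifest entries.

    `lines` is any iterable of strings (an open file object works).
    Blank lines are skipped, and entries longer than `max_duration`
    seconds are dropped. Hypothetical helper, for illustration only.
    """
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry["duration"] <= max_duration:
            entries.append(entry)
    return entries


# Two made-up entries in the same shape as the README example.
sample = [
    '{"audio_filepath": "/data/a.flac", "duration": 5.855, "text": "mister quilter"}',
    '{"audio_filepath": "/data/b.flac", "duration": 4.815, "text": "nor is mister"}',
]
manifest = load_manifest(sample)
```

Each entry is a plain dictionary, so fields such as `audio_filepath` can be rewritten before the manifest is uploaded.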
## Step2 Configure computation resource
You can configure computation resource in pcloud_submit.sh:
```
# Configure computation resource and submit job to PaddleCloud
paddlecloud submit \
-image wanghaoshuang/pcloud_ds2:latest \
-jobname ${JOB_NAME} \
-cpu 4 \
-gpu 4 \
-memory 10Gi \
-parallelism 1 \
-pscpu 1 \
-pservers 1 \
-psmemory 10Gi \
-passes 1 \
-entry "sh pcloud_train.sh ${CLOUD_DATA_DIR} ${CLOUD_MODEL_DIR}" \
${DS2_PATH}
```
For more information, please refer to [PaddleCloud](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务)
## Step3 Configure algorithm options
You can configure algorithm options in pcloud_train.sh:
```
python train.py \
--use_gpu=1 \
--trainer_count=4 \
--batch_size=256 \
--mean_std_filepath=$MEAN_STD_FILE \
--train_manifest_path='./local.train.manifest' \
--dev_manifest_path='./local.test.manifest' \
--vocab_filepath=$VOCAB_PATH \
--output_model_dir=${MODEL_PATH}
```
You can get more information about the algorithm options with the following command:
```
cd ..
python train.py --help
```
## Step4 Submit job
``` ```
$ sh pcloud_submit.sh $ sh pcloud_submit.sh
$ uploading: deepspeech.tar.gz...
$ uploading: pcloud_prepare_data.py...
$ uploading: pcloud_split_data.py...
$ uploading: pcloud_submit.sh...
$ uploading: pcloud_train.sh...
$ deepspeech20170727130129 submited.
``` ```
Then we can get the job name 'deepspeech20170727130129' from the last line.
**Step3:** Get logs from paddle cloud by cmd: `paddlecloud logs -n 10000 deepspeech20170727130129`.
## Step5 Get logs
``` ```
$ paddlecloud logs -n 10000 deepspeech20170727130129 $ paddlecloud logs -n 10000 deepspeech20170727130129
``` ```
[More options and cmd about paddle cloud](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md) For more information, please refer to [PaddleCloud client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#下载并配置paddlecloud) or get help by follow command:
```
## Run DS2 by customize data paddlecloud --help
TODO ```
# # Configure input data set in local filesystem
TRAIN_MANIFEST="/home/work/wanghaoshuang/ds2/pcloud/models/deep_speech_2/datasets/manifest.dev" TRAIN_MANIFEST="/home/work/demo/ds2/pcloud/models/deep_speech_2/datasets/manifest.dev"
TEST_MANIFEST="/home/work/wanghaoshuang/ds2/pcloud/models/deep_speech_2/datasets/manifest.dev" TEST_MANIFEST="/home/work/demo/ds2/pcloud/models/deep_speech_2/datasets/manifest.dev"
VOCAB_PATH="/home/work/wanghaoshuang/ds2/pcloud/models/deep_speech_2/datasets/vocab/eng_vocab.txt" VOCAB_FILE="/home/work/demo/ds2/pcloud/models/deep_speech_2/datasets/vocab/eng_vocab.txt"
MEAN_STD_PATH="/home/work/wanghaoshuang/ds2/pcloud/models/deep_speech_2/compute_mean_std.py" MEAN_STD_FILE="/home/work/demo/ds2/pcloud/models/deep_speech_2/mean_std.npz"
CLOUD_DATA_DIR="/pfs/dlnel/home/wanghaoshuang@baidu.com/deepspeech2/data"
CLOUD_MODEL_DIR="/pfs/dlnel/home/wanghaoshuang@baidu.com/deepspeech2/model" # Configure output path in PaddleCloud filesystem
CLOUD_DATA_DIR="/pfs/dlnel/home/demo/deepspeech2/data"
CLOUD_MODEL_DIR="/pfs/dlnel/home/demo/deepspeech2/model"
# Pack and upload local data to PaddleCloud filesystem
python upload_data.py \
--train_manifest_path=${TRAIN_MANIFEST} \
--test_manifest_path=${TEST_MANIFEST} \
--vocab_file=${VOCAB_FILE} \
--mean_std_file=${MEAN_STD_FILE} \
--cloud_data_path=${CLOUD_DATA_DIR}
JOB_NAME=deepspeech`date +%Y%m%d%H%M%S`
DS2_PATH=${PWD%/*} DS2_PATH=${PWD%/*}
cp -f pcloud_train.sh ${DS2_PATH}
rm -rf ./tmp # Configure computation resource and submit job to PaddleCloud
mkdir ./tmp
paddlecloud ls ${CLOUD_DATA_DIR}/mean_std.npz
if [ $? -ne 0 ];then
cp -f ${MEAN_STD_PATH} ./tmp/mean_std.npz
paddlecloud file put ./tmp/mean_std.npz ${CLOUD_DATA_DIR}/
fi
paddlecloud ls ${CLOUD_DATA_DIR}/vocab.txt
if [ $? -ne 0 ];then
cp -f ${VOCAB_PATH} ./tmp/vocab.txt
paddlecloud file put ./tmp/vocab.txt ${CLOUD_DATA_DIR}/
fi
paddlecloud ls ${CLOUD_DATA_DIR}/cloud.train.manifest
if [ $? -ne 0 ];then
python prepare_data.py \
--manifest_path=${TRAIN_MANIFEST} \
--out_tar_path="./tmp/cloud.train.tar" \
--out_manifest_path="tmp/cloud.train.manifest"
paddlecloud file put ./tmp/cloud.train.tar ${CLOUD_DATA_DIR}/
paddlecloud file put ./tmp/cloud.train.manifest ${CLOUD_DATA_DIR}/
fi
paddlecloud ls ${CLOUD_DATA_DIR}/cloud.test.manifest
if [ $? -ne 0 ];then
python prepare_data.py \
--manifest_path=${TEST_MANIFEST} \
--out_tar_path="./tmp/cloud.test.tar" \
--out_manifest_path="tmp/cloud.test.manifest"
paddlecloud file put ./tmp/cloud.test.tar ${CLOUD_DATA_DIR}/
paddlecloud file put ./tmp/cloud.test.manifest ${CLOUD_DATA_DIR}/
fi
rm -rf ./tmp
JOB_NAME=deepspeech`date +%Y%m%d%H%M%S`
cp pcloud_train.sh ${DS2_PATH}
paddlecloud submit \ paddlecloud submit \
-image bootstrapper:5000/wanghaoshuang/pcloud_ds2:latest-gpu-cudnn \ -image bootstrapper:5000/wanghaoshuang/pcloud_ds2:latest \
-jobname ${JOB_NAME} \ -jobname ${JOB_NAME} \
-cpu 4 \ -cpu 4 \
-gpu 4 \ -gpu 4 \
...@@ -58,5 +32,7 @@ paddlecloud submit \ ...@@ -58,5 +32,7 @@ paddlecloud submit \
-pservers 1 \ -pservers 1 \
-psmemory 10Gi \ -psmemory 10Gi \
-passes 1 \ -passes 1 \
-entry "sh pcloud_train.sh ${CLOUD_DATA_DIR} ${CLOUD_MODEl_DIR}" \ -entry "sh pcloud_train.sh ${CLOUD_DATA_DIR} ${CLOUD_MODEL_DIR}" \
${DS2_PATH} ${DS2_PATH}
rm ${DS2_PATH}/pcloud_train.sh
DATA_PATH=$1 DATA_PATH=$1
MODEL_PATH=$2 MODEL_PATH=$2
#setted by user
TRAIN_MANI=${DATA_PATH}/cloud.train.manifest TRAIN_MANI=${DATA_PATH}/cloud.train.manifest
#setted by user
DEV_MANI=${DATA_PATH}/cloud.test.manifest DEV_MANI=${DATA_PATH}/cloud.test.manifest
#setted by user
TRAIN_TAR=${DATA_PATH}/cloud.train.tar TRAIN_TAR=${DATA_PATH}/cloud.train.tar
#setted by user
DEV_TAR=${DATA_PATH}/cloud.test.tar DEV_TAR=${DATA_PATH}/cloud.test.tar
#setted by user VOCAB_PATH=${DATA_PATH}/vocab.txt
VOCAB_PATH=${DATA_PATH}/eng_vocab.txt
#setted by user
MEAN_STD_FILE=${DATA_PATH}/mean_std.npz MEAN_STD_FILE=${DATA_PATH}/mean_std.npz
# split train data for each pcloud node # split train data for each pcloud node
...@@ -28,8 +22,10 @@ python ./cloud/split_data.py \ ...@@ -28,8 +22,10 @@ python ./cloud/split_data.py \
python train.py \ python train.py \
--use_gpu=1 \ --use_gpu=1 \
--trainer_count=4 \ --trainer_count=4 \
--batch_size=256 \ --batch_size=32 \
--num_threads_data=4 \
--mean_std_filepath=$MEAN_STD_FILE \ --mean_std_filepath=$MEAN_STD_FILE \
--train_manifest_path='./local.train.manifest' \ --train_manifest_path='./local.train.manifest' \
--dev_manifest_path='./local.test.manifest' \ --dev_manifest_path='./local.test.manifest' \
--vocab_filepath=$VOCAB_PATH \ --vocab_filepath=$VOCAB_PATH \
--output_model_dir=${MODEL_PATH}
"""
This tool is used for preparing data for DeepSpeech2 training on PaddleCloud.
Steps:
1. Read original manifest and get the local path of sound files.
2. Tar all local sound files into one tar file.
3. Modify original manifest to remove the local path information.
Finally, we will get a tar file and a manifest with sound file name, duration
and text.
"""
import json
import os
import tarfile
import sys
import argparse
sys.path.append('../')
from data_utils.utils import read_manifest
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--manifest_path",
default="../datasets/manifest.train",
type=str,
help="Manifest of target data. (default: %(default)s)")
parser.add_argument(
"--out_tar_path",
default="./tmp/cloud.train.tar",
type=str,
help="Output tar file path. (default: %(default)s)")
parser.add_argument(
"--out_manifest_path",
default="./tmp/cloud.train.manifest",
type=str,
help="Manifest of output data. (default: %(default)s)")
args = parser.parse_args()
def gen_pcloud_data(manifest_path, out_tar_path, out_manifest_path):
'''
1. According manifest, tar sound files into out_tar_path
2. Generate a new manifest for output tar file
'''
out_tar = tarfile.open(out_tar_path, 'w')
manifest = read_manifest(manifest_path)
results = []
for json_data in manifest:
sound_file = json_data['audio_filepath']
filename = os.path.basename(sound_file)
out_tar.add(sound_file, arcname=filename)
json_data['audio_filepath'] = filename
results.append("%s\n" % json.dumps(json_data))
with open(out_manifest_path, 'w') as out_manifest:
out_manifest.writelines(results)
out_manifest.close()
out_tar.close()
if __name__ == '__main__':
gen_pcloud_data(args.manifest_path, args.out_tar_path,
args.out_manifest_path)
"""
This tool is used for preparing data for DeepSpeech2 training on PaddleCloud.
Steps:
1. Read original manifest and get the local path of sound files.
2. Tar all local sound files into one tar file.
3. Modify original manifest to remove the local path information.
Finally, we will get a tar file and a manifest with sound file name, duration
and text.
"""
import json
import os
import tarfile
import sys
import argparse
import shutil
sys.path.append('../')
from data_utils.utils import read_manifest
from subprocess import call
TRAIN_TAR = "cloud.train.tar"
TRAIN_MANIFEST = "cloud.train.manifest"
TEST_TAR = "cloud.test.tar"
TEST_MANIFEST = "cloud.test.manifest"
VOCAB_FILE = "vocab.txt"
MEAN_STD_FILE = "mean_std.npz"
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--train_manifest_path",
default="../datasets/manifest.train",
type=str,
help="Manifest file of train data. (default: %(default)s)")
parser.add_argument(
"--test_manifest_path",
default="../datasets/manifest.test",
type=str,
help="Manifest file of test data. (default: %(default)s)")
parser.add_argument(
"--vocab_file",
default="../datasets/vocab/eng_vocab.txt",
type=str,
help="Vocab file to be uploaded to paddlecloud. (default: %(default)s)")
parser.add_argument(
"--mean_std_file",
default="../mean_std.npz",
type=str,
help="mean_std file to be uploaded to paddlecloud. (default: %(default)s)")
parser.add_argument(
"--cloud_data_path",
required=True,
default="",
type=str,
help="Destination path on paddlecloud. (default: %(default)s)")
parser.add_argument(
"--local_tmp_path",
default="./tmp/",
type=str,
help="Local directory for storing temporary data. (default: %(default)s)")
args = parser.parse_args()
def pack_data(manifest_path, out_tar_path, out_manifest_path):
    '''
    1. According to the manifest, tar sound files into out_tar_path.
    2. Generate a new manifest for the output tar file.
    '''
    out_tar = tarfile.open(out_tar_path, 'w')
    manifest = read_manifest(manifest_path)
    results = []
    for json_data in manifest:
        sound_file = json_data['audio_filepath']
        filename = os.path.basename(sound_file)
        # store each sound file under its base name only
        out_tar.add(sound_file, arcname=filename)
        json_data['audio_filepath'] = filename
        results.append("%s\n" % json.dumps(json_data))
    # the with-block closes the manifest file on exit
    with open(out_manifest_path, 'w') as out_manifest:
        out_manifest.writelines(results)
    out_tar.close()
if __name__ == '__main__':
    cloud_train_manifest = "%s/%s" % (args.cloud_data_path, TRAIN_MANIFEST)
    cloud_train_tar = "%s/%s" % (args.cloud_data_path, TRAIN_TAR)
    cloud_test_manifest = "%s/%s" % (args.cloud_data_path, TEST_MANIFEST)
    cloud_test_tar = "%s/%s" % (args.cloud_data_path, TEST_TAR)
    cloud_vocab_file = "%s/%s" % (args.cloud_data_path, VOCAB_FILE)
    cloud_mean_file = "%s/%s" % (args.cloud_data_path, MEAN_STD_FILE)
    local_train_manifest = "%s/%s" % (args.local_tmp_path, TRAIN_MANIFEST)
    local_train_tar = "%s/%s" % (args.local_tmp_path, TRAIN_TAR)
    local_test_manifest = "%s/%s" % (args.local_tmp_path, TEST_MANIFEST)
    local_test_tar = "%s/%s" % (args.local_tmp_path, TEST_TAR)
    # start from an empty temporary directory
    if os.path.exists(args.local_tmp_path):
        shutil.rmtree(args.local_tmp_path)
    os.makedirs(args.local_tmp_path)
    ret = 1
    # train data
    if args.train_manifest_path != "":
        try:
            ret = call(['paddlecloud', 'ls', cloud_train_manifest])
        except Exception:
            ret = 1
        if ret != 0:
            print("%s doesn't exist" % cloud_train_manifest)
            pack_data(args.train_manifest_path, local_train_tar,
                      local_train_manifest)
            call([
                'paddlecloud', 'cp', local_train_manifest, cloud_train_manifest
            ])
            call(['paddlecloud', 'cp', local_train_tar, cloud_train_tar])
    # test data
    if args.test_manifest_path != "":
        try:
            ret = call(['paddlecloud', 'ls', cloud_test_manifest])
        except Exception:
            ret = 1
        if ret != 0:
            pack_data(args.test_manifest_path, local_test_tar,
                      local_test_manifest)
            call(
                ['paddlecloud', 'cp', local_test_manifest, cloud_test_manifest])
            call(['paddlecloud', 'cp', local_test_tar, cloud_test_tar])
    # vocab file
    if args.vocab_file != "":
        try:
            ret = call(['paddlecloud', 'ls', cloud_vocab_file])
        except Exception:
            ret = 1
        if ret != 0:
            call(['paddlecloud', 'cp', args.vocab_file, cloud_vocab_file])
    # mean_std file
    if args.mean_std_file != "":
        try:
            ret = call(['paddlecloud', 'ls', cloud_mean_file])
        except Exception:
            ret = 1
        if ret != 0:
            call(['paddlecloud', 'cp', args.mean_std_file, cloud_mean_file])
    # os.removedirs() fails on a non-empty directory, so remove the tree
    shutil.rmtree(args.local_tmp_path)
[
{
"type": "shift",
"params": {"min_shift_ms": -5,
"max_shift_ms": 5},
"prob": 1.0
}
]
[
{
"type": "noise",
"params": {"min_snr_dB": 40,
"max_snr_dB": 50,
"noise_manifest_path": "datasets/manifest.noise"},
"prob": 0.6
},
{
"type": "impulse",
"params": {"impulse_manifest_path": "datasets/manifest.impulse"},
"prob": 0.5
},
{
"type": "speed",
"params": {"min_speed_rate": 0.95,
"max_speed_rate": 1.05},
"prob": 0.5
},
{
"type": "shift",
"params": {"min_shift_ms": -5,
"max_shift_ms": 5},
"prob": 1.0
},
{
"type": "volume",
"params": {"min_gain_dBFS": -10,
"max_gain_dBFS": 10},
"prob": 0.0
},
{
"type": "bayesian_normal",
"params": {"target_db": -20,
"prior_db": -20,
"prior_samples": 100},
"prob": 0.0
}
]
...@@ -204,7 +204,7 @@ class AudioSegment(object): ...@@ -204,7 +204,7 @@ class AudioSegment(object):
:raise ValueError: If the sample rates of the two segments are not :raise ValueError: If the sample rates of the two segments are not
equal, or if the lengths of segments don't match. equal, or if the lengths of segments don't match.
""" """
if type(self) != type(other): if not isinstance(other, type(self)):
raise TypeError("Cannot add segments of different types: %s " raise TypeError("Cannot add segments of different types: %s "
"and %s." % (type(self), type(other))) "and %s." % (type(self), type(other)))
if self._sample_rate != other._sample_rate: if self._sample_rate != other._sample_rate:
...@@ -231,7 +231,7 @@ class AudioSegment(object): ...@@ -231,7 +231,7 @@ class AudioSegment(object):
Note that this is an in-place transformation. Note that this is an in-place transformation.
:param gain: Gain in decibels to apply to samples. :param gain: Gain in decibels to apply to samples.
:type gain: float :type gain: float|1darray
""" """
self._samples *= 10.**(gain / 20.) self._samples *= 10.**(gain / 20.)
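The gain update above converts a value in decibels into a linear amplitude factor. A minimal sketch of that conversion on a plain Python list; `db_to_amplitude` is a hypothetical helper name and is not part of `AudioSegment`:

```python
def db_to_amplitude(gain_db):
    """Convert a gain in decibels to a linear amplitude multiplier."""
    return 10.0 ** (gain_db / 20.0)


samples = [0.1, -0.2, 0.05]
# +20 dB multiplies every sample by 10; 0 dB leaves samples unchanged.
louder = [s * db_to_amplitude(20.0) for s in samples]
unchanged = [s * db_to_amplitude(0.0) for s in samples]
```

A negative gain attenuates instead: -20 dB scales the samples by 0.1.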
...@@ -457,9 +457,9 @@ class AudioSegment(object): ...@@ -457,9 +457,9 @@ class AudioSegment(object):
audio segments when resample is not allowed. audio segments when resample is not allowed.
""" """
if allow_resample and self.sample_rate != impulse_segment.sample_rate: if allow_resample and self.sample_rate != impulse_segment.sample_rate:
impulse_segment = impulse_segment.resample(self.sample_rate) impulse_segment.resample(self.sample_rate)
if self.sample_rate != impulse_segment.sample_rate: if self.sample_rate != impulse_segment.sample_rate:
raise ValueError("Impulse segment's sample rate (%d Hz) is not" raise ValueError("Impulse segment's sample rate (%d Hz) is not "
"equal to base signal sample rate (%d Hz)." % "equal to base signal sample rate (%d Hz)." %
(impulse_segment.sample_rate, self.sample_rate)) (impulse_segment.sample_rate, self.sample_rate))
samples = signal.fftconvolve(self.samples, impulse_segment.samples, samples = signal.fftconvolve(self.samples, impulse_segment.samples,
......
File added
...@@ -8,6 +8,8 @@ import random ...@@ -8,6 +8,8 @@ import random
from data_utils.augmentor.volume_perturb import VolumePerturbAugmentor from data_utils.augmentor.volume_perturb import VolumePerturbAugmentor
from data_utils.augmentor.shift_perturb import ShiftPerturbAugmentor from data_utils.augmentor.shift_perturb import ShiftPerturbAugmentor
from data_utils.augmentor.speed_perturb import SpeedPerturbAugmentor from data_utils.augmentor.speed_perturb import SpeedPerturbAugmentor
from data_utils.augmentor.noise_perturb import NoisePerturbAugmentor
from data_utils.augmentor.impulse_response import ImpulseResponseAugmentor
from data_utils.augmentor.resample import ResampleAugmentor from data_utils.augmentor.resample import ResampleAugmentor
from data_utils.augmentor.online_bayesian_normalization import \ from data_utils.augmentor.online_bayesian_normalization import \
OnlineBayesianNormalizationAugmentor OnlineBayesianNormalizationAugmentor
...@@ -24,20 +26,45 @@ class AugmentationPipeline(object): ...@@ -24,20 +26,45 @@ class AugmentationPipeline(object):
.. code-block:: .. code-block::
'[{"type": "volume", [ {
"params": {"min_gain_dBFS": -15, "type": "noise",
"max_gain_dBFS": 15}, "params": {"min_snr_dB": 10,
"prob": 0.5}, "max_snr_dB": 20,
{"type": "speed", "noise_manifest_path": "datasets/manifest.noise"},
"params": {"min_speed_rate": 0.8, "prob": 0.0
"max_speed_rate": 1.2}, },
"prob": 0.5} {
]' "type": "speed",
"params": {"min_speed_rate": 0.9,
"max_speed_rate": 1.1},
"prob": 1.0
},
{
"type": "shift",
"params": {"min_shift_ms": -5,
"max_shift_ms": 5},
"prob": 1.0
},
{
"type": "volume",
"params": {"min_gain_dBFS": -10,
"max_gain_dBFS": 10},
"prob": 0.0
},
{
"type": "bayesian_normal",
"params": {"target_db": -20,
"prior_db": -20,
"prior_samples": 100},
"prob": 0.0
}
]
This augmentation configuration inserts two augmentation models This augmentation configuration inserts two augmentation models
into the pipeline, with one is VolumePerturbAugmentor and the other into the pipeline, with one is VolumePerturbAugmentor and the other
SpeedPerturbAugmentor. "prob" indicates the probability of the current SpeedPerturbAugmentor. "prob" indicates the probability of the current
augmentor to take effect. augmentor to take effect. If "prob" is zero, the augmentor does not take
effect.
:param augmentation_config: Augmentation configuration in json string. :param augmentation_config: Augmentation configuration in json string.
:type augmentation_config: str :type augmentation_config: str
...@@ -60,7 +87,7 @@ class AugmentationPipeline(object): ...@@ -60,7 +87,7 @@ class AugmentationPipeline(object):
:type audio_segment: AudioSegment|SpeechSegment :type audio_segment: AudioSegment|SpeechSegment
""" """
for augmentor, rate in zip(self._augmentors, self._rates): for augmentor, rate in zip(self._augmentors, self._rates):
if self._rng.uniform(0., 1.) <= rate: if self._rng.uniform(0., 1.) < rate:
augmentor.transform_audio(audio_segment) augmentor.transform_audio(audio_segment)
def _parse_pipeline_from(self, config_json): def _parse_pipeline_from(self, config_json):
...@@ -89,5 +116,9 @@ class AugmentationPipeline(object): ...@@ -89,5 +116,9 @@ class AugmentationPipeline(object):
return ResampleAugmentor(self._rng, **params) return ResampleAugmentor(self._rng, **params)
elif augmentor_type == "bayesian_normal": elif augmentor_type == "bayesian_normal":
return OnlineBayesianNormalizationAugmentor(self._rng, **params) return OnlineBayesianNormalizationAugmentor(self._rng, **params)
elif augmentor_type == "noise":
return NoisePerturbAugmentor(self._rng, **params)
elif augmentor_type == "impulse":
return ImpulseResponseAugmentor(self._rng, **params)
else: else:
raise ValueError("Unknown augmentor type [%s]." % augmentor_type) raise ValueError("Unknown augmentor type [%s]." % augmentor_type)
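The effect of switching the comparison from `<=` to `<` can be sketched outside the pipeline class. The example below is a simplified stand-in, assuming augmentors are plain functions that return a new value (the real augmentors transform the audio segment in place); `apply_pipeline` is a hypothetical name:

```python
import random


def apply_pipeline(value, augmentors, rates, rng):
    """Apply each augmentor with its own probability.

    With the strict `<` comparison a rate of 0.0 never fires,
    because rng.uniform(0., 1.) is never negative.
    """
    for augmentor, rate in zip(augmentors, rates):
        if rng.uniform(0., 1.) < rate:
            value = augmentor(value)
    return value


def double(x):
    return x * 2


def add_one(x):
    return x + 1


rng = random.Random(0)
# `double` is disabled with prob 0.0, `add_one` is enabled with prob 1.0.
result = apply_pipeline(10, [double, add_one], [0.0, 1.0], rng)
```

With `<=`, a draw of exactly 0.0 would still trigger an augmentor whose "prob" is zero; the strict comparison rules that case out.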
"""Contains the impulse response augmentation model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from data_utils.augmentor.base import AugmentorBase
from data_utils import utils
from data_utils.audio import AudioSegment


class ImpulseResponseAugmentor(AugmentorBase):
    """Augmentation model for adding impulse response effect.

    :param rng: Random generator object.
    :type rng: random.Random
    :param impulse_manifest_path: Manifest path for impulse audio data.
    :type impulse_manifest_path: basestring
    """

    def __init__(self, rng, impulse_manifest_path):
        self._rng = rng
        self._impulse_manifest = utils.read_manifest(
            manifest_path=impulse_manifest_path)

    def transform_audio(self, audio_segment):
        """Add impulse response effect.

        Note that this is an in-place transformation.

        :param audio_segment: Audio segment to add effects to.
        :type audio_segment: AudioSegment|SpeechSegment
        """
        impulse_json = self._rng.sample(self._impulse_manifest, 1)[0]
        impulse_segment = AudioSegment.from_file(impulse_json['audio_filepath'])
        audio_segment.convolve(impulse_segment, allow_resample=True)
"""Contains the noise perturb augmentation model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from data_utils.augmentor.base import AugmentorBase
from data_utils import utils
from data_utils.audio import AudioSegment


class NoisePerturbAugmentor(AugmentorBase):
    """Augmentation model for adding background noise.

    :param rng: Random generator object.
    :type rng: random.Random
    :param min_snr_dB: Minimal signal-to-noise ratio, in decibels.
    :type min_snr_dB: float
    :param max_snr_dB: Maximal signal-to-noise ratio, in decibels.
    :type max_snr_dB: float
    :param noise_manifest_path: Manifest path for noise audio data.
    :type noise_manifest_path: basestring
    """

    def __init__(self, rng, min_snr_dB, max_snr_dB, noise_manifest_path):
        self._min_snr_dB = min_snr_dB
        self._max_snr_dB = max_snr_dB
        self._rng = rng
        self._noise_manifest = utils.read_manifest(
            manifest_path=noise_manifest_path)

    def transform_audio(self, audio_segment):
        """Add background noise audio.

        Note that this is an in-place transformation.

        :param audio_segment: Audio segment to add effects to.
        :type audio_segment: AudioSegment|SpeechSegment
        """
        noise_json = self._rng.sample(self._noise_manifest, 1)[0]
        if noise_json['duration'] < audio_segment.duration:
            raise RuntimeError("The duration of sampled noise audio is smaller "
                               "than the audio segment to add effects to.")
        # pick a random window of the noise clip with the same duration
        diff_duration = noise_json['duration'] - audio_segment.duration
        start = self._rng.uniform(0, diff_duration)
        end = start + audio_segment.duration
        noise_segment = AudioSegment.slice_from_file(
            noise_json['audio_filepath'], start=start, end=end)
        snr_dB = self._rng.uniform(self._min_snr_dB, self._max_snr_dB)
        audio_segment.add_noise(
            noise_segment, snr_dB, allow_downsampling=True, rng=self._rng)
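Mixing noise at a chosen signal-to-noise ratio can be sketched with plain Python lists. This is a generic illustration, not the repository's `add_noise` implementation: `rms_db` and `mix_at_snr` are hypothetical helpers, and both inputs are assumed to be non-silent and of equal length.

```python
import math


def rms_db(samples):
    """Root-mean-square energy of a sample list, in decibels."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that signal energy minus noise energy equals
    `snr_db` decibels, then add it to `signal` sample by sample."""
    gain_db = rms_db(signal) - rms_db(noise) - snr_db
    gain = 10.0 ** (gain_db / 20.0)
    return [s + gain * n for s, n in zip(signal, noise)]


signal = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed = mix_at_snr(signal, noise, snr_db=20.0)
# Recover the scaled noise to check the ratio that was actually applied.
scaled_noise = [m - s for m, s in zip(mixed, signal)]
```

A higher `snr_db` yields quieter noise, which is why the configuration files above use ranges such as 40 to 50 dB for light perturbation.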
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
...@@ -72,7 +72,7 @@ class DataGenerator(object): ...@@ -72,7 +72,7 @@ class DataGenerator(object):
max_freq=None, max_freq=None,
specgram_type='linear', specgram_type='linear',
use_dB_normalization=True, use_dB_normalization=True,
num_threads=multiprocessing.cpu_count(), num_threads=multiprocessing.cpu_count() // 2,
random_seed=0): random_seed=0):
self._max_duration = max_duration self._max_duration = max_duration
self._min_duration = min_duration self._min_duration = min_duration
...@@ -94,6 +94,23 @@ class DataGenerator(object): ...@@ -94,6 +94,23 @@ class DataGenerator(object):
self.tar2info = {} self.tar2info = {}
self.tar2object = {} self.tar2object = {}
def process_utterance(self, filename, transcript):
"""Load, augment, featurize and normalize for speech data.
:param filename: Audio filepath
:type filename: basestring | file
:param transcript: Transcription text.
:type transcript: basestring
:return: Tuple of audio feature tensor and list of token ids for
transcription.
:rtype: tuple of (2darray, list)
"""
speech_segment = SpeechSegment.from_file(filename, transcript)
self._augmentation_pipeline.transform_audio(speech_segment)
specgram, text_ids = self._speech_featurizer.featurize(speech_segment)
specgram = self._normalizer.apply(specgram)
return specgram, text_ids
def batch_reader_creator(self, def batch_reader_creator(self,
manifest_path, manifest_path,
batch_size, batch_size,
...@@ -163,7 +180,7 @@ class DataGenerator(object): ...@@ -163,7 +180,7 @@ class DataGenerator(object):
manifest, batch_size, clipped=True) manifest, batch_size, clipped=True)
elif shuffle_method == "instance_shuffle": elif shuffle_method == "instance_shuffle":
self._rng.shuffle(manifest) self._rng.shuffle(manifest)
elif not shuffle_method: elif shuffle_method == None:
pass pass
else: else:
raise ValueError("Unknown shuffle method %s." % raise ValueError("Unknown shuffle method %s." %
...@@ -210,8 +227,8 @@ class DataGenerator(object): ...@@ -210,8 +227,8 @@ class DataGenerator(object):
return self._speech_featurizer.vocab_list return self._speech_featurizer.vocab_list
def _parse_tar(self, file): def _parse_tar(self, file):
""" """Parse a tar file to get a tarfile object
Parse a tar file to get a tarfile object and a map containing tarinfoes and a map containing tarinfoes
""" """
result = {} result = {}
f = tarfile.open(file) f = tarfile.open(file)
...@@ -219,14 +236,14 @@ class DataGenerator(object): ...@@ -219,14 +236,14 @@ class DataGenerator(object):
result[tarinfo.name] = tarinfo result[tarinfo.name] = tarinfo
return f, result return f, result
def _read_soundbytes(self, filepath): def _get_file_object(self, file):
""" """Get file object by file path.
Read bytes from file. If file startwith tar, it will return a tar file object
If filepath startwith tar, we will read bytes from tar file
and cached tar file info for next reading request. and cached tar file info for next reading request.
It will return file directly, if the type of file is not str.
""" """
if filepath.startswith('tar:'): if file.startswith('tar:'):
tarpath, filename = filepath.split(':', 1)[1].split('#', 1) tarpath, filename = file.split(':', 1)[1].split('#', 1)
if 'tar2info' not in local_data.__dict__: if 'tar2info' not in local_data.__dict__:
local_data.tar2info = {} local_data.tar2info = {}
if 'tar2object' not in local_data.__dict__: if 'tar2object' not in local_data.__dict__:
...@@ -236,18 +253,9 @@ class DataGenerator(object): ...@@ -236,18 +253,9 @@ class DataGenerator(object):
local_data.tar2info[tarpath] = infoes local_data.tar2info[tarpath] = infoes
local_data.tar2object[tarpath] = object local_data.tar2object[tarpath] = object
return local_data.tar2object[tarpath].extractfile( return local_data.tar2object[tarpath].extractfile(
local_data.tar2info[tarpath][filename]).read() local_data.tar2info[tarpath][filename])
else: else:
return open(filepath).read() return open(file)
def _process_utterance(self, filename, transcript):
"""Load, augment, featurize and normalize for speech data."""
speech_segment = SpeechSegment.from_bytes(
self._read_soundbytes(filename), transcript)
self._augmentation_pipeline.transform_audio(speech_segment)
specgram, text_ids = self._speech_featurizer.featurize(speech_segment)
specgram = self._normalizer.apply(specgram)
return specgram, text_ids
def _instance_reader_creator(self, manifest): def _instance_reader_creator(self, manifest):
""" """
...@@ -263,7 +271,8 @@ class DataGenerator(object): ...@@ -263,7 +271,8 @@ class DataGenerator(object):
yield instance yield instance
def mapper(instance): def mapper(instance):
return self._process_utterance(instance["audio_filepath"], return self.process_utterance(
self._get_file_object(instance["audio_filepath"]),
instance["text"]) instance["text"])
return paddle.reader.xmap_readers( return paddle.reader.xmap_readers(
......
File added
...@@ -166,21 +166,18 @@ class AudioFeaturizer(object): ...@@ -166,21 +166,18 @@ class AudioFeaturizer(object):
"window size.") "window size.")
# compute 13 cepstral coefficients, and the first one is replaced # compute 13 cepstral coefficients, and the first one is replaced
# by log(frame energy) # by log(frame energy)
mfcc_feat = mfcc( mfcc_feat = np.transpose(
mfcc(
signal=samples, signal=samples,
samplerate=sample_rate, samplerate=sample_rate,
winlen=0.001 * window_ms, winlen=0.001 * window_ms,
winstep=0.001 * stride_ms, winstep=0.001 * stride_ms,
highfreq=max_freq) highfreq=max_freq))
# Deltas # Deltas
d_mfcc_feat = delta(mfcc_feat, 2) d_mfcc_feat = delta(mfcc_feat, 2)
# Deltas-Deltas # Deltas-Deltas
dd_mfcc_feat = delta(d_mfcc_feat, 2) dd_mfcc_feat = delta(d_mfcc_feat, 2)
# concat above three features # concat above three features
concat_mfcc_feat = [ concat_mfcc_feat = np.concatenate(
np.concatenate((mfcc_feat[i], d_mfcc_feat[i], dd_mfcc_feat[i])) (mfcc_feat, d_mfcc_feat, dd_mfcc_feat))
for i in xrange(len(mfcc_feat))
]
# transpose to be consistent with the linear specgram situation
concat_mfcc_feat = np.transpose(concat_mfcc_feat)
return concat_mfcc_feat return concat_mfcc_feat
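The layout change in the concatenation step can be checked with NumPy alone. This is a minimal sketch, assuming three feature matrices of shape `(frames, coefficients)`. Random arrays stand in for real MFCC, delta and delta-delta features, so it illustrates only the array layout, not the feature computation.

```python
import numpy as np

num_frames, num_coeffs = 5, 13
rng = np.random.RandomState(0)
# Stand-ins for the MFCC, delta and delta-delta matrices, each (frames, coeffs).
mfcc_feat = rng.rand(num_frames, num_coeffs)
d_mfcc_feat = rng.rand(num_frames, num_coeffs)
dd_mfcc_feat = rng.rand(num_frames, num_coeffs)

# Per-frame layout: concatenate each frame's three vectors, then transpose
# the result to (3 * coeffs, frames).
per_frame = np.transpose([
    np.concatenate((mfcc_feat[i], d_mfcc_feat[i], dd_mfcc_feat[i]))
    for i in range(num_frames)
])

# Stacked layout: transpose each block first, then concatenate along axis 0.
stacked = np.concatenate((mfcc_feat.T, d_mfcc_feat.T, dd_mfcc_feat.T))
```

Both layouts give a `(39, 5)` matrix with identical entries. The stacked form avoids the Python-level loop over frames.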
...@@ -4,6 +4,7 @@ from __future__ import division ...@@ -4,6 +4,7 @@ from __future__ import division
from __future__ import print_function from __future__ import print_function
import os import os
import codecs
class TextFeaturizer(object): class TextFeaturizer(object):
...@@ -59,7 +60,7 @@ class TextFeaturizer(object): ...@@ -59,7 +60,7 @@ class TextFeaturizer(object):
def _load_vocabulary_from_file(self, vocab_filepath): def _load_vocabulary_from_file(self, vocab_filepath):
"""Load vocabulary from file.""" """Load vocabulary from file."""
vocab_lines = [] vocab_lines = []
with open(vocab_filepath, 'r') as file: with codecs.open(vocab_filepath, 'r', 'utf-8') as file:
vocab_lines.extend(file.readlines()) vocab_lines.extend(file.readlines())
vocab_list = [line[:-1] for line in vocab_lines] vocab_list = [line[:-1] for line in vocab_lines]
vocab_dict = dict( vocab_dict = dict(
......
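Loading a UTF-8 vocabulary can be sketched as follows. This is a minimal sketch, assuming one token per line. The name `load_vocabulary` and the `(dict, list)` return shape are assumptions, and `io.open` is used here as a stand-in for the `codecs.open` call in the diff.

```python
import io


def load_vocabulary(vocab_filepath):
    """Read one token per line (UTF-8) and return (token -> id, token list)."""
    with io.open(vocab_filepath, 'r', encoding='utf-8') as vocab_file:
        # Strip only the newline so that a single-space token survives.
        vocab_list = [line.rstrip('\n') for line in vocab_file]
    vocab_dict = {token: idx for idx, token in enumerate(vocab_list)}
    return vocab_dict, vocab_list
```

Unlike `line[:-1]`, stripping with `rstrip('\n')` does not drop a character from a final line that lacks a trailing newline.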
...@@ -115,7 +115,7 @@ class SpeechSegment(AudioSegment): ...@@ -115,7 +115,7 @@ class SpeechSegment(AudioSegment):
speech file. speech file.
:rtype: SpeechSegment :rtype: SpeechSegment
""" """
audio = Audiosegment.slice_from_file(filepath, start, end) audio = AudioSegment.slice_from_file(filepath, start, end)
return cls(audio.samples, audio.sample_rate, transcript) return cls(audio.samples, audio.sample_rate, transcript)
@classmethod @classmethod
......
...@@ -4,6 +4,7 @@ from __future__ import division ...@@ -4,6 +4,7 @@ from __future__ import division
from __future__ import print_function from __future__ import print_function
import json import json
import codecs
def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0): def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0):
...@@ -23,7 +24,7 @@ def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0): ...@@ -23,7 +24,7 @@ def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0):
:raises IOError: If failed to parse the manifest. :raises IOError: If failed to parse the manifest.
""" """
manifest = [] manifest = []
for json_line in open(manifest_path): for json_line in codecs.open(manifest_path, 'r', 'utf-8'):
try: try:
json_data = json.loads(json_line) json_data = json.loads(json_line)
except Exception as e: except Exception as e:
......
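Reading a manifest with duration bounds can be sketched as follows. This is a minimal sketch, assuming one JSON object per line with a numeric `duration` field. The name `read_manifest_sketch` and the inclusive bounds are assumptions, since the filtering logic falls outside the visible hunk.

```python
import io
import json


def read_manifest_sketch(manifest_path,
                         max_duration=float('inf'),
                         min_duration=0.0):
    """Parse a JSON-lines manifest, keeping entries within the duration bounds."""
    manifest = []
    with io.open(manifest_path, 'r', encoding='utf-8') as manifest_file:
        for line_no, json_line in enumerate(manifest_file, 1):
            try:
                json_data = json.loads(json_line)
            except ValueError as e:
                raise IOError("Error reading manifest line %d: %s" %
                              (line_no, e))
            # Assumed filter: keep utterances inside [min_duration, max_duration].
            if min_duration <= json_data["duration"] <= max_duration:
                manifest.append(json_data)
    return manifest
```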
File added
...@@ -11,11 +11,12 @@ from __future__ import print_function ...@@ -11,11 +11,12 @@ from __future__ import print_function
import distutils.util import distutils.util
import os import os
import wget import sys
import tarfile import tarfile
import argparse import argparse
import soundfile import soundfile
import json import json
import codecs
from paddle.v2.dataset.common import md5file from paddle.v2.dataset.common import md5file
DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech') DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
...@@ -66,7 +67,7 @@ def download(url, md5sum, target_dir): ...@@ -66,7 +67,7 @@ def download(url, md5sum, target_dir):
filepath = os.path.join(target_dir, url.split("/")[-1]) filepath = os.path.join(target_dir, url.split("/")[-1])
if not (os.path.exists(filepath) and md5file(filepath) == md5sum): if not (os.path.exists(filepath) and md5file(filepath) == md5sum):
print("Downloading %s ..." % url) print("Downloading %s ..." % url)
wget.download(url, target_dir) os.system("wget -c " + url + " -P " + target_dir)
print("\nMD5 Checksum %s ..." % filepath) print("\nMD5 Checksum %s ..." % filepath)
if not md5file(filepath) == md5sum: if not md5file(filepath) == md5sum:
raise RuntimeError("MD5 checksum failed.") raise RuntimeError("MD5 checksum failed.")
...@@ -112,7 +113,7 @@ def create_manifest(data_dir, manifest_path): ...@@ -112,7 +113,7 @@ def create_manifest(data_dir, manifest_path):
'duration': duration, 'duration': duration,
'text': text 'text': text
})) }))
with open(manifest_path, 'w') as out_file: with codecs.open(manifest_path, 'w', 'utf-8') as out_file:
for line in json_lines: for line in json_lines:
out_file.write(line + '\n') out_file.write(line + '\n')
......
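The skip-if-verified check in `download` can be reproduced with the standard library. This is a minimal sketch using `hashlib` in place of `paddle.v2.dataset.common.md5file`. The names `md5_of_file` and `needs_download` are hypothetical.

```python
import hashlib
import os


def md5_of_file(filepath, chunk_size=4096):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    md5 = hashlib.md5()
    with open(filepath, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()


def needs_download(filepath, md5sum):
    """True when the file is missing or its checksum does not match."""
    return not (os.path.exists(filepath) and md5_of_file(filepath) == md5sum)
```

Because the existence test short-circuits, the checksum is computed only for files that are already on disk.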
"""Prepare CHiME3 background data.
Download, unpack and create manifest files.
Manifest file is a json-format file with each line containing the
meta data (i.e. audio filepath, transcript and audio duration)
of each audio file in the data set.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import distutils.util
import os
import wget
import zipfile
import argparse
import soundfile
import json
from paddle.v2.dataset.common import md5file
DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
URL = "https://d4s.myairbridge.com/packagev2/AG0Y3DNBE5IWRRTV/?dlid=W19XG7T0NNHB027139H0EQ"
MD5 = "c3ff512618d7a67d4f85566ea1bc39ec"
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--target_dir",
default=DATA_HOME + "/chime3_background",
type=str,
help="Directory to save the dataset. (default: %(default)s)")
parser.add_argument(
"--manifest_filepath",
default="manifest.chime3.background",
type=str,
help="Filepath for output manifests. (default: %(default)s)")
args = parser.parse_args()
def download(url, md5sum, target_dir, filename=None):
"""Download file from url to target_dir, and check md5sum."""
if filename is None:
filename = url.split("/")[-1]
if not os.path.exists(target_dir): os.makedirs(target_dir)
filepath = os.path.join(target_dir, filename)
if not (os.path.exists(filepath) and md5file(filepath) == md5sum):
print("Downloading %s ..." % url)
wget.download(url, target_dir)
print("\nMD5 Checksum %s ..." % filepath)
if not md5file(filepath) == md5sum:
raise RuntimeError("MD5 checksum failed.")
else:
print("File exists, skip downloading. (%s)" % filepath)
return filepath
def unpack(filepath, target_dir):
"""Unpack the file to the target_dir."""
print("Unpacking %s ..." % filepath)
if filepath.endswith('.zip'):
zip = zipfile.ZipFile(filepath, 'r')
zip.extractall(target_dir)
zip.close()
elif filepath.endswith('.tar') or filepath.endswith('.tar.gz'):
import tarfile  # local import: tarfile is not imported at the top of this module
tar = tarfile.open(filepath)
tar.extractall(target_dir)
tar.close()
else:
raise ValueError("File format is not supported for unpacking.")
def create_manifest(data_dir, manifest_path):
"""Create a manifest json file summarizing the data set, with each line
containing the meta data (i.e. audio filepath, transcription text, audio
duration) of each audio file within the data set.
"""
print("Creating manifest %s ..." % manifest_path)
json_lines = []
for subfolder, _, filelist in sorted(os.walk(data_dir)):
for filename in filelist:
if filename.endswith('.wav'):
filepath = os.path.join(subfolder, filename)
audio_data, samplerate = soundfile.read(filepath)
duration = float(len(audio_data)) / samplerate
json_lines.append(
json.dumps({
'audio_filepath': filepath,
'duration': duration,
'text': ''
}))
with open(manifest_path, 'w') as out_file:
for line in json_lines:
out_file.write(line + '\n')
def prepare_chime3(url, md5sum, target_dir, manifest_path):
"""Download, unpack and create summary manifest file."""
if not os.path.exists(os.path.join(target_dir, "CHiME3")):
# download
filepath = download(url, md5sum, target_dir,
"myairbridge-AG0Y3DNBE5IWRRTV.zip")
# unpack
unpack(filepath, target_dir)
unpack(
os.path.join(target_dir, 'CHiME3_background_bus.zip'), target_dir)
unpack(
os.path.join(target_dir, 'CHiME3_background_caf.zip'), target_dir)
unpack(
os.path.join(target_dir, 'CHiME3_background_ped.zip'), target_dir)
unpack(
os.path.join(target_dir, 'CHiME3_background_str.zip'), target_dir)
else:
print("Skip downloading and unpacking. Data already exists in %s." %
target_dir)
# create manifest json file
create_manifest(target_dir, manifest_path)
def main():
prepare_chime3(
url=URL,
md5sum=MD5,
target_dir=args.target_dir,
manifest_path=args.manifest_filepath)
if __name__ == '__main__':
main()
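The manifest-building step above can be sketched with the standard library only. This is a minimal sketch, assuming uncompressed PCM `.wav` files. The `wave` module stands in for `soundfile`, and the names `wav_duration` and `build_manifest_lines` are hypothetical.

```python
import json
import os
import wave


def wav_duration(filepath):
    """Return the duration in seconds of a PCM wav file."""
    wav = wave.open(filepath, 'rb')
    try:
        return wav.getnframes() / float(wav.getframerate())
    finally:
        wav.close()


def build_manifest_lines(data_dir):
    """Return one JSON string per .wav file found under data_dir."""
    json_lines = []
    for subfolder, _, filelist in sorted(os.walk(data_dir)):
        for filename in sorted(filelist):
            if filename.endswith('.wav'):
                # os.walk already yields subfolder prefixed with data_dir.
                filepath = os.path.join(subfolder, filename)
                json_lines.append(
                    json.dumps({
                        'audio_filepath': filepath,
                        'duration': wav_duration(filepath),
                        'text': ''
                    }))
    return json_lines
```

Background noise has no transcript, so `text` is left empty, as in the script above.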
cd noise
python chime3_background.py
if [ $? -ne 0 ]; then
echo "Prepare CHiME3 background noise failed. Terminated."
exit 1
fi
cd -
cat noise/manifest.* > manifest.noise
echo "All done."
...@@ -205,9 +205,9 @@ def ctc_beam_search_decoder_batch(probs_split, ...@@ -205,9 +205,9 @@ def ctc_beam_search_decoder_batch(probs_split,
:type num_processes: int :type num_processes: int
:param cutoff_prob: Cutoff probability in pruning, :param cutoff_prob: Cutoff probability in pruning,
default 1.0, no pruning. default 1.0, no pruning.
:type cutoff_prob: float
:param num_processes: Number of parallel processes. :param num_processes: Number of parallel processes.
:type num_processes: int :type num_processes: int
:type cutoff_prob: float
:param ext_scoring_func: External scoring function for :param ext_scoring_func: External scoring function for
partially decoded sentence, e.g. word count partially decoded sentence, e.g. word count
or language model. or language model.
......
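The docstring only states that `cutoff_prob` controls pruning and that `1.0` disables it. One common reading is cumulative-probability pruning: at each time step, keep the most probable symbols until their total mass reaches the cutoff. The sketch below illustrates that reading. The name `prune_by_cutoff` is hypothetical and the scheme is an assumption, not a confirmed description of this decoder.

```python
def prune_by_cutoff(probs, cutoff_prob=1.0):
    """Return (index, prob) pairs kept after cumulative-probability pruning.

    probs is the probability distribution over symbols at one time step.
    """
    indexed = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    if cutoff_prob >= 1.0:
        # No pruning: every symbol stays a candidate.
        return indexed
    kept, cum_prob = [], 0.0
    for index, prob in indexed:
        kept.append((index, prob))
        cum_prob += prob
        if cum_prob >= cutoff_prob:
            break
    return kept
```

Under this reading, a cutoff slightly below 1.0 drops the long tail of unlikely symbols and reduces the number of beam extensions per step.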
"""Client-end for the ASR demo."""
from pynput import keyboard
import struct
import socket
import sys
import argparse
import pyaudio
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--host_ip",
default="localhost",
type=str,
help="Server IP address. (default: %(default)s)")
parser.add_argument(
"--host_port",
default=8086,
type=int,
help="Server Port. (default: %(default)s)")
args = parser.parse_args()
is_recording = False
enable_trigger_record = True
def on_press(key):
"""On-press keyboard callback function."""
global is_recording, enable_trigger_record
if key == keyboard.Key.space:
if (not is_recording) and enable_trigger_record:
sys.stdout.write("Start Recording ... ")
sys.stdout.flush()
is_recording = True
def on_release(key):
"""On-release keyboard callback function."""
global is_recording, enable_trigger_record
if key == keyboard.Key.esc:
return False
elif key == keyboard.Key.space:
if is_recording:
is_recording = False
data_list = []
def callback(in_data, frame_count, time_info, status):
"""Audio recorder's stream callback function."""
global data_list, is_recording, enable_trigger_record
if is_recording:
data_list.append(in_data)
enable_trigger_record = False
elif len(data_list) > 0:
# Connect to server and send data
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((args.host_ip, args.host_port))
sent = ''.join(data_list)
sock.sendall(struct.pack('>i', len(sent)) + sent)
print('Speech[length=%d] Sent.' % len(sent))
# Receive data from the server and shut down
received = sock.recv(1024)
print("Recognition Results: {}".format(received))
sock.close()
data_list = []
enable_trigger_record = True
return (in_data, pyaudio.paContinue)
def main():
# prepare audio recorder
p = pyaudio.PyAudio()
stream = p.open(
format=pyaudio.paInt32,
channels=1,
rate=16000,
input=True,
stream_callback=callback)
stream.start_stream()
# prepare keyboard listener
with keyboard.Listener(
on_press=on_press, on_release=on_release) as listener:
listener.join()
# close up
stream.stop_stream()
stream.close()
p.terminate()
if __name__ == "__main__":
main()
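The client frames each utterance as a 4-byte big-endian length followed by the raw audio bytes. This is a minimal sketch of that framing on its own. The helper names `pack_message` and `unpack_header` are hypothetical.

```python
import struct


def pack_message(payload):
    """Prefix a bytes payload with its length as a 4-byte big-endian int."""
    return struct.pack('>i', len(payload)) + payload


def unpack_header(message):
    """Read the payload length from the first 4 bytes of a message."""
    return struct.unpack('>i', message[:4])[0]
```

The explicit length tells the receiver how many bytes to wait for, since a TCP stream does not preserve message boundaries.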
"""Server-end for the ASR demo."""
import os
import time
import random
import argparse
import distutils.util
from time import gmtime, strftime
import SocketServer
import struct
import wave
import paddle.v2 as paddle
from utils import print_arguments
from data_utils.data import DataGenerator
from model import DeepSpeech2Model
from data_utils.utils import read_manifest
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--host_ip",
default="localhost",
type=str,
help="Server IP address. (default: %(default)s)")
parser.add_argument(
"--host_port",
default=8086,
type=int,
help="Server Port. (default: %(default)s)")
parser.add_argument(
"--speech_save_dir",
default="demo_cache",
type=str,
help="Directory for saving demo speech. (default: %(default)s)")
parser.add_argument(
"--vocab_filepath",
default='datasets/vocab/eng_vocab.txt',
type=str,
help="Vocabulary filepath. (default: %(default)s)")
parser.add_argument(
"--mean_std_filepath",
default='mean_std.npz',
type=str,
help="Filepath of the mean and std file for the normalizer. (default: %(default)s)")
parser.add_argument(
"--warmup_manifest_path",
default='datasets/manifest.test',
type=str,
help="Manifest path for warmup test. (default: %(default)s)")
parser.add_argument(
"--specgram_type",
default='linear',
type=str,
help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)")
parser.add_argument(
"--num_conv_layers",
default=2,
type=int,
help="Convolution layer number. (default: %(default)s)")
parser.add_argument(
"--num_rnn_layers",
default=3,
type=int,
help="RNN layer number. (default: %(default)s)")
parser.add_argument(
"--rnn_layer_size",
default=512,
type=int,
help="RNN layer cell number. (default: %(default)s)")
parser.add_argument(
"--use_gpu",
default=True,
type=distutils.util.strtobool,
help="Use gpu or not. (default: %(default)s)")
parser.add_argument(
"--model_filepath",
default='checkpoints/params.latest.tar.gz',
type=str,
help="Model filepath. (default: %(default)s)")
parser.add_argument(
"--decode_method",
default='beam_search',
type=str,
help="Method for ctc decoding: best_path or beam_search. "
"(default: %(default)s)")
parser.add_argument(
"--beam_size",
default=100,
type=int,
help="Width for beam search decoding. (default: %(default)d)")
parser.add_argument(
"--language_model_path",
default="lm/data/common_crawl_00.prune01111.trie.klm",
type=str,
help="Path for language model. (default: %(default)s)")
parser.add_argument(
"--alpha",
default=0.36,
type=float,
help="Parameter associated with language model. (default: %(default)f)")
parser.add_argument(
"--beta",
default=0.25,
type=float,
help="Parameter associated with word count. (default: %(default)f)")
parser.add_argument(
"--cutoff_prob",
default=0.99,
type=float,
help="The cutoff probability of pruning "
"in beam search. (default: %(default)f)")
args = parser.parse_args()
class AsrTCPServer(SocketServer.TCPServer):
"""The ASR TCP Server."""
def __init__(self,
server_address,
RequestHandlerClass,
speech_save_dir,
audio_process_handler,
bind_and_activate=True):
self.speech_save_dir = speech_save_dir
self.audio_process_handler = audio_process_handler
SocketServer.TCPServer.__init__(
self, server_address, RequestHandlerClass, bind_and_activate=bind_and_activate)
class AsrRequestHandler(SocketServer.BaseRequestHandler):
"""The ASR request handler."""
def handle(self):
# receive data through TCP socket
chunk = self.request.recv(1024)
target_len = struct.unpack('>i', chunk[:4])[0]
data = chunk[4:]
while len(data) < target_len:
chunk = self.request.recv(1024)
data += chunk
# write to file
filename = self._write_to_file(data)
print("Received utterance[length=%d] from %s, saved to %s." %
(len(data), self.client_address[0], filename))
start_time = time.time()
transcript = self.server.audio_process_handler(filename)
finish_time = time.time()
print("Response Time: %f, Transcript: %s" %
(finish_time - start_time, transcript))
self.request.sendall(transcript)
def _write_to_file(self, data):
# prepare save dir and filename
if not os.path.exists(self.server.speech_save_dir):
os.mkdir(self.server.speech_save_dir)
timestamp = strftime("%Y%m%d%H%M%S", gmtime())
out_filename = os.path.join(
self.server.speech_save_dir,
timestamp + "_" + self.client_address[0] + ".wav")
# write to wav file
file = wave.open(out_filename, 'wb')
file.setnchannels(1)
file.setsampwidth(4)
file.setframerate(16000)
file.writeframes(data)
file.close()
return out_filename
def warm_up_test(audio_process_handler,
manifest_path,
num_test_cases,
random_seed=0):
"""Warming-up test."""
manifest = read_manifest(manifest_path)
rng = random.Random(random_seed)
samples = rng.sample(manifest, num_test_cases)
for idx, sample in enumerate(samples):
print("Warm-up Test Case %d: %s" % (idx, sample['audio_filepath']))
start_time = time.time()
transcript = audio_process_handler(sample['audio_filepath'])
finish_time = time.time()
print("Response Time: %f, Transcript: %s" %
(finish_time - start_time, transcript))
def start_server():
"""Start the ASR server"""
# prepare data generator
data_generator = DataGenerator(
vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=1)
# prepare ASR model
ds2_model = DeepSpeech2Model(
vocab_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
pretrained_model_path=args.model_filepath)
# prepare ASR inference handler
def file_to_transcript(filename):
feature = data_generator.process_utterance(filename, "")
result_transcript = ds2_model.infer_batch(
infer_data=[feature],
decode_method=args.decode_method,
beam_alpha=args.alpha,
beam_beta=args.beta,
beam_size=args.beam_size,
cutoff_prob=args.cutoff_prob,
vocab_list=data_generator.vocab_list,
language_model_path=args.language_model_path,
num_processes=1)
return result_transcript[0]
# warming up with utterances sampled from Librispeech
print('-----------------------------------------------------------')
print('Warming up ...')
warm_up_test(
audio_process_handler=file_to_transcript,
manifest_path=args.warmup_manifest_path,
num_test_cases=3)
print('-----------------------------------------------------------')
# start the server
server = AsrTCPServer(
server_address=(args.host_ip, args.host_port),
RequestHandlerClass=AsrRequestHandler,
speech_save_dir=args.speech_save_dir,
audio_process_handler=file_to_transcript)
print("ASR Server Started.")
server.serve_forever()
def main():
print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=1)
start_server()
if __name__ == "__main__":
main()
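The receive loop in `AsrRequestHandler.handle` assumes the first `recv` returns at least the 4 header bytes. A stricter variant reads an exact byte count for both header and body. This is a minimal sketch of that variant. The helpers `recv_exact` and `recv_message` are hypothetical and not part of the server above.

```python
import socket
import struct


def recv_exact(sock, num_bytes):
    """Read exactly num_bytes from a connected socket."""
    chunks = []
    remaining = num_bytes
    while remaining > 0:
        chunk = sock.recv(min(remaining, 1024))
        if not chunk:
            raise EOFError("Connection closed with %d bytes missing." %
                           remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b''.join(chunks)


def recv_message(sock):
    """Read one length-prefixed message (4-byte big-endian length + payload)."""
    (length,) = struct.unpack('>i', recv_exact(sock, 4))
    return recv_exact(sock, length)
```

Collecting chunks in a list and joining once avoids repeated string concatenation for long utterances.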
...@@ -10,43 +10,50 @@ import numpy as np ...@@ -10,43 +10,50 @@ import numpy as np
def _levenshtein_distance(ref, hyp): def _levenshtein_distance(ref, hyp):
"""Levenshtein distance is a string metric for measuring the difference between """Levenshtein distance is a string metric for measuring the difference
two sequences. Informally, the levenshtein distance is defined as the minimum between two sequences. Informally, the levenshtein distance is defined as
number of single-character edits (substitutions, insertions or deletions) the minimum number of single-character edits (substitutions, insertions or
required to change one word into the other. We can naturally extend the edits to deletions) required to change one word into the other. We can naturally
word level when calculating levenshtein distance for two sentences. extend the edits to word level when calculating levenshtein distance for
two sentences.
""" """
ref_len = len(ref) m = len(ref)
hyp_len = len(hyp) n = len(hyp)
# special case # special case
if ref == hyp: if ref == hyp:
return 0 return 0
if ref_len == 0: if m == 0:
return hyp_len return n
if hyp_len == 0: if n == 0:
return ref_len return m
distance = np.zeros((ref_len + 1, hyp_len + 1), dtype=np.int32) if m < n:
ref, hyp = hyp, ref
m, n = n, m
# use O(min(m, n)) space
distance = np.zeros((2, n + 1), dtype=np.int32)
# initialize distance matrix # initialize distance matrix
for j in xrange(hyp_len + 1): for j in xrange(n + 1):
distance[0][j] = j distance[0][j] = j
for i in xrange(ref_len + 1):
distance[i][0] = i
# calculate levenshtein distance # calculate levenshtein distance
for i in xrange(1, ref_len + 1): for i in xrange(1, m + 1):
for j in xrange(1, hyp_len + 1): prev_row_idx = (i - 1) % 2
cur_row_idx = i % 2
distance[cur_row_idx][0] = i
for j in xrange(1, n + 1):
if ref[i - 1] == hyp[j - 1]: if ref[i - 1] == hyp[j - 1]:
distance[i][j] = distance[i - 1][j - 1] distance[cur_row_idx][j] = distance[prev_row_idx][j - 1]
else: else:
s_num = distance[i - 1][j - 1] + 1 s_num = distance[prev_row_idx][j - 1] + 1
i_num = distance[i][j - 1] + 1 i_num = distance[cur_row_idx][j - 1] + 1
d_num = distance[i - 1][j] + 1 d_num = distance[prev_row_idx][j] + 1
distance[i][j] = min(s_num, i_num, d_num) distance[cur_row_idx][j] = min(s_num, i_num, d_num)
return distance[ref_len][hyp_len] return distance[m % 2][n]
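The two-row idea in this hunk can be written without NumPy. This is a minimal sketch in plain Python, in which two lists stand in for the `(2, n + 1)` array and are swapped after each row instead of being indexed by `i % 2`. The name `levenshtein_distance` is hypothetical.

```python
def levenshtein_distance(ref, hyp):
    """Edit distance between two sequences using O(min(m, n)) extra space."""
    m, n = len(ref), len(hyp)
    if m < n:
        # Keep the shorter sequence along the row dimension.
        ref, hyp, m, n = hyp, ref, n, m
    prev_row = list(range(n + 1))
    for i in range(1, m + 1):
        cur_row = [i] + [0] * n
        for j in range(1, n + 1):
            if ref[i - 1] == hyp[j - 1]:
                cur_row[j] = prev_row[j - 1]
            else:
                cur_row[j] = 1 + min(prev_row[j - 1],  # substitution
                                     cur_row[j - 1],   # insertion
                                     prev_row[j])      # deletion
        prev_row = cur_row
    return prev_row[n]
```

The function accepts any indexable sequences, so passing lists of words gives the word-level distance used for WER.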
def wer(reference, hypothesis, ignore_case=False, delimiter=' '): def wer(reference, hypothesis, ignore_case=False, delimiter=' '):
...@@ -65,8 +72,8 @@ def wer(reference, hypothesis, ignore_case=False, delimiter=' '): ...@@ -65,8 +72,8 @@ def wer(reference, hypothesis, ignore_case=False, delimiter=' '):
Iw is the number of words inserted, Iw is the number of words inserted,
Nw is the number of words in the reference Nw is the number of words in the reference
We can use levenshtein distance to calculate WER. Please note that We can use levenshtein distance to calculate WER. Please note
empty items will be removed when splitting sentences by delimiter. that empty items will be removed when splitting sentences by delimiter.
:param reference: The reference sentence. :param reference: The reference sentence.
:type reference: basestring :type reference: basestring
...@@ -95,7 +102,7 @@ def wer(reference, hypothesis, ignore_case=False, delimiter=' '): ...@@ -95,7 +102,7 @@ def wer(reference, hypothesis, ignore_case=False, delimiter=' '):
return wer return wer
def cer(reference, hypothesis, ignore_case=False): def cer(reference, hypothesis, ignore_case=False, remove_space=False):
"""Calculate character error rate (CER). CER compares reference text and """Calculate character error rate (CER). CER compares reference text and
hypothesis text in char-level. CER is defined as: hypothesis text in char-level. CER is defined as:
...@@ -113,8 +120,8 @@ def cer(reference, hypothesis, ignore_case=False): ...@@ -113,8 +120,8 @@ def cer(reference, hypothesis, ignore_case=False):
We can use levenshtein distance to calculate CER. Chinese input should be We can use levenshtein distance to calculate CER. Chinese input should be
encoded to unicode. Please note that the leading and trailing encoded to unicode. Please note that the leading and trailing
white space characters will be truncated and multiple consecutive white space characters will be truncated and multiple consecutive space
space characters in a sentence will be replaced by one white space character. characters in a sentence will be replaced by one space character.
:param reference: The reference sentence. :param reference: The reference sentence.
:type reference: basestring :type reference: basestring
...@@ -122,6 +129,8 @@ def cer(reference, hypothesis, ignore_case=False): ...@@ -122,6 +129,8 @@ def cer(reference, hypothesis, ignore_case=False):
:type hypothesis: basestring :type hypothesis: basestring
:param ignore_case: Whether case-sensitive or not. :param ignore_case: Whether case-sensitive or not.
:type ignore_case: bool :type ignore_case: bool
:param remove_space: Whether to remove internal space characters
:type remove_space: bool
:return: Character error rate. :return: Character error rate.
:rtype: float :rtype: float
:raises ValueError: If the reference length is zero. :raises ValueError: If the reference length is zero.
...@@ -130,8 +139,12 @@ def cer(reference, hypothesis, ignore_case=False): ...@@ -130,8 +139,12 @@ def cer(reference, hypothesis, ignore_case=False):
reference = reference.lower() reference = reference.lower()
hypothesis = hypothesis.lower() hypothesis = hypothesis.lower()
reference = ' '.join(filter(None, reference.split(' '))) join_char = ' '
hypothesis = ' '.join(filter(None, hypothesis.split(' '))) if remove_space:
join_char = ''
reference = join_char.join(filter(None, reference.split(' ')))
hypothesis = join_char.join(filter(None, hypothesis.split(' ')))
if len(reference) == 0: if len(reference) == 0:
raise ValueError("Length of reference should be greater than 0.") raise ValueError("Length of reference should be greater than 0.")
......
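The effect of `remove_space` on CER can be sketched end to end. This is a minimal sketch, assuming the rate is the character-level edit distance divided by the reference length. The final division is not visible in the hunk, and the names `edit_distance` and `cer_sketch` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Plain dynamic-programming Levenshtein distance (row by row)."""
    prev_row = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, 1):
        cur_row = [i]
        for j, hyp_item in enumerate(hyp, 1):
            cost = 0 if ref_item == hyp_item else 1
            cur_row.append(min(prev_row[j - 1] + cost,  # match/substitution
                               cur_row[j - 1] + 1,      # insertion
                               prev_row[j] + 1))        # deletion
        prev_row = cur_row
    return prev_row[-1]


def cer_sketch(reference, hypothesis, ignore_case=False, remove_space=False):
    """Character error rate with optional removal of space characters."""
    if ignore_case:
        reference, hypothesis = reference.lower(), hypothesis.lower()
    join_char = '' if remove_space else ' '
    # Empty items are dropped, so repeated and edge spaces collapse.
    reference = join_char.join(filter(None, reference.split(' ')))
    hypothesis = join_char.join(filter(None, hypothesis.split(' ')))
    if len(reference) == 0:
        raise ValueError("Length of reference should be greater than 0.")
    # Assumed definition: edits divided by the reference length.
    return float(edit_distance(reference, hypothesis)) / len(reference)
```

With `remove_space=True`, a hypothesis that differs from the reference only in word segmentation scores zero, which suits languages written without spaces between words.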
...@@ -5,20 +5,24 @@ from __future__ import print_function ...@@ -5,20 +5,24 @@ from __future__ import print_function
import distutils.util import distutils.util
import argparse import argparse
import gzip import multiprocessing
import paddle.v2 as paddle import paddle.v2 as paddle
from data_utils.data import DataGenerator from data_utils.data import DataGenerator
from model import deep_speech2 from model import DeepSpeech2Model
from decoder import * from error_rate import wer, cer
from lm.lm_scorer import LmScorer import utils
from error_rate import wer
parser = argparse.ArgumentParser(description=__doc__) parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument( parser.add_argument(
"--batch_size", "--batch_size",
default=100, default=128,
type=int, type=int,
help="Minibatch size for evaluation. (default: %(default)s)") help="Minibatch size for evaluation. (default: %(default)s)")
parser.add_argument(
"--trainer_count",
default=8,
type=int,
help="Trainer number. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--num_conv_layers", "--num_conv_layers",
default=2, default=2,
...@@ -41,12 +45,12 @@ parser.add_argument( ...@@ -41,12 +45,12 @@ parser.add_argument(
help="Use gpu or not. (default: %(default)s)") help="Use gpu or not. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--num_threads_data", "--num_threads_data",
default=multiprocessing.cpu_count(), default=multiprocessing.cpu_count() // 2,
type=int, type=int,
help="Number of cpu threads for preprocessing data. (default: %(default)s)") help="Number of cpu threads for preprocessing data. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--num_processes_beam_search", "--num_processes_beam_search",
default=multiprocessing.cpu_count(), default=multiprocessing.cpu_count() // 2,
type=int, type=int,
help="Number of cpu processes for beam search. (default: %(default)s)") help="Number of cpu processes for beam search. (default: %(default)s)")
parser.add_argument( parser.add_argument(
...@@ -58,8 +62,8 @@ parser.add_argument( ...@@ -58,8 +62,8 @@ parser.add_argument(
"--decode_method", "--decode_method",
default='beam_search', default='beam_search',
type=str, type=str,
help="Method for ctc decoding, best_path or beam_search. (default: %(default)s)" help="Method for ctc decoding, best_path or beam_search. "
) "(default: %(default)s)")
parser.add_argument( parser.add_argument(
"--language_model_path", "--language_model_path",
default="lm/data/common_crawl_00.prune01111.trie.klm", default="lm/data/common_crawl_00.prune01111.trie.klm",
...@@ -67,12 +71,12 @@ parser.add_argument( ...@@ -67,12 +71,12 @@ parser.add_argument(
help="Path for language model. (default: %(default)s)") help="Path for language model. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--alpha", "--alpha",
default=0.26, default=0.36,
type=float, type=float,
help="Parameter associated with language model. (default: %(default)f)") help="Parameter associated with language model. (default: %(default)f)")
parser.add_argument( parser.add_argument(
"--beta", "--beta",
default=0.1, default=0.25,
type=float, type=float,
help="Parameter associated with word count. (default: %(default)f)") help="Parameter associated with word count. (default: %(default)f)")
parser.add_argument( parser.add_argument(
...@@ -107,42 +111,25 @@ parser.add_argument( ...@@ -107,42 +111,25 @@ parser.add_argument(
default='datasets/vocab/eng_vocab.txt', default='datasets/vocab/eng_vocab.txt',
type=str, type=str,
help="Vocabulary filepath. (default: %(default)s)") help="Vocabulary filepath. (default: %(default)s)")
parser.add_argument(
"--error_rate_type",
default='wer',
choices=['wer', 'cer'],
type=str,
help="Error rate type for evaluation. 'wer' for word error rate and 'cer' "
"for character error rate. "
"(default: %(default)s)")
args = parser.parse_args() args = parser.parse_args()
def evaluate(): def evaluate():
"""Evaluate on whole test data for DeepSpeech2.""" """Evaluate on whole test data for DeepSpeech2."""
# initialize data generator
data_generator = DataGenerator( data_generator = DataGenerator(
vocab_filepath=args.vocab_filepath, vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath, mean_std_filepath=args.mean_std_filepath,
augmentation_config='{}', augmentation_config='{}',
specgram_type=args.specgram_type, specgram_type=args.specgram_type,
num_threads=args.num_threads_data) num_threads=args.num_threads_data)
# create network config
# paddle.data_type.dense_array is used for variable batch input.
# The size 161 * 161 is only an placeholder value and the real shape
# of input batch data will be induced during training.
audio_data = paddle.layer.data(
name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161))
text_data = paddle.layer.data(
name="transcript_text",
type=paddle.data_type.integer_value_sequence(data_generator.vocab_size))
output_probs = deep_speech2(
audio_data=audio_data,
text_data=text_data,
dict_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_size=args.rnn_layer_size,
is_inference=True)
# load parameters
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(args.model_filepath))
# prepare infer data
batch_reader = data_generator.batch_reader_creator( batch_reader = data_generator.batch_reader_creator(
manifest_path=args.decode_manifest_path, manifest_path=args.decode_manifest_path,
batch_size=args.batch_size, batch_size=args.batch_size,
...@@ -150,61 +137,42 @@ def evaluate(): ...@@ -150,61 +137,42 @@ def evaluate():
sortagrad=False, sortagrad=False,
shuffle_method=None) shuffle_method=None)
# define inferer ds2_model = DeepSpeech2Model(
inferer = paddle.inference.Inference( vocab_size=data_generator.vocab_size,
output_layer=output_probs, parameters=parameters) num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
# initialize external scorer for beam search decoding rnn_layer_size=args.rnn_layer_size,
if args.decode_method == 'beam_search': pretrained_model_path=args.model_filepath)
ext_scorer = LmScorer(args.alpha, args.beta, args.language_model_path)
wer_counter, wer_sum = 0, 0.0 error_rate_func = cer if args.error_rate_type == 'cer' else wer
error_sum, num_ins = 0.0, 0
for infer_data in batch_reader(): for infer_data in batch_reader():
# run inference result_transcripts = ds2_model.infer_batch(
infer_results = inferer.infer(input=infer_data) infer_data=infer_data,
num_steps = len(infer_results) // len(infer_data) decode_method=args.decode_method,
probs_split = [ beam_alpha=args.alpha,
infer_results[i * num_steps:(i + 1) * num_steps] beam_beta=args.beta,
for i in xrange(0, len(infer_data))
]
# target transcription
target_transcription = [
''.join([
data_generator.vocab_list[index] for index in infer_data[i][1]
]) for i, probs in enumerate(probs_split)
]
# decode and print
# best path decode
if args.decode_method == "best_path":
for i, probs in enumerate(probs_split):
output_transcription = ctc_best_path_decoder(
probs_seq=probs, vocabulary=data_generator.vocab_list)
wer_sum += wer(target_transcription[i], output_transcription)
wer_counter += 1
# beam search decode
elif args.decode_method == "beam_search":
# beam search using multiple processes
beam_search_results = ctc_beam_search_decoder_batch(
probs_split=probs_split,
vocabulary=data_generator.vocab_list,
beam_size=args.beam_size, beam_size=args.beam_size,
blank_id=len(data_generator.vocab_list), cutoff_prob=args.cutoff_prob,
num_processes=args.num_processes_beam_search, vocab_list=data_generator.vocab_list,
ext_scoring_func=ext_scorer, language_model_path=args.language_model_path,
cutoff_prob=args.cutoff_prob, ) num_processes=args.num_processes_beam_search)
for i, beam_search_result in enumerate(beam_search_results): target_transcripts = [
wer_sum += wer(target_transcription[i], ''.join([data_generator.vocab_list[token] for token in transcript])
beam_search_result[0][1]) for _, transcript in infer_data
wer_counter += 1 ]
else: for target, result in zip(target_transcripts, result_transcripts):
raise ValueError("Decoding method [%s] is not supported." % error_sum += error_rate_func(target, result)
decode_method) num_ins += 1
print("Error rate [%s] (%d/?) = %f" %
print("Final WER = %f" % (wer_sum / wer_counter)) (args.error_rate_type, num_ins, error_sum / num_ins))
print("Final error rate [%s] (%d/%d) = %f" %
(args.error_rate_type, num_ins, num_ins, error_sum / num_ins))
def main(): def main():
paddle.init(use_gpu=args.use_gpu, trainer_count=1) utils.print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)
evaluate() evaluate()
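The rewritten evaluation loop keeps a running sum of per-utterance error rates and divides it by the number of utterances seen so far. The sketch below illustrates only that bookkeeping. `word_error_rate` is a hypothetical stand-in written for this example (word-level edit distance divided by the reference length); it is not the repository's `error_rate.wer`, whose exact options are not shown in this diff.

```python
from __future__ import division, print_function


def word_error_rate(reference, hypothesis):
    """Hypothetical stand-in: word-level edit distance / reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("Reference must contain at least one word.")
    # Single-row dynamic programming over the edit-distance table.
    prev_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, 1):
        cur_row = [i]
        for j, hyp_word in enumerate(hyp_words, 1):
            substitution = prev_row[j - 1] + (ref_word != hyp_word)
            deletion = prev_row[j] + 1
            insertion = cur_row[j - 1] + 1
            cur_row.append(min(substitution, deletion, insertion))
        prev_row = cur_row
    return prev_row[-1] / len(ref_words)


def average_error_rate(pairs, error_rate_func):
    """Mean of per-utterance error rates over (target, result) pairs."""
    error_sum, num_ins = 0.0, 0
    for target, result in pairs:
        error_sum += error_rate_func(target, result)
        num_ins += 1
    return error_sum / num_ins


pairs = [
    ("the cat sat", "the cat sat"),  # identical transcripts
    ("the cat sat down", "the cat sit"),  # one substitution, one deletion
]
print("Average WER = %f" % average_error_rate(pairs, word_error_rate))
```

Because the loop averages per-utterance rates, a short utterance weighs as much as a long one; this differs from dividing the total number of edits by the total number of reference words.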
......
...@@ -4,15 +4,12 @@ from __future__ import division ...@@ -4,15 +4,12 @@ from __future__ import division
from __future__ import print_function from __future__ import print_function
import argparse import argparse
import gzip
import distutils.util import distutils.util
import multiprocessing import multiprocessing
import paddle.v2 as paddle import paddle.v2 as paddle
from data_utils.data import DataGenerator from data_utils.data import DataGenerator
from model import deep_speech2 from model import DeepSpeech2Model
from decoder import * from error_rate import wer, cer
from lm.lm_scorer import LmScorer
from error_rate import wer
import utils import utils
parser = argparse.ArgumentParser(description=__doc__) parser = argparse.ArgumentParser(description=__doc__)
...@@ -43,12 +40,12 @@ parser.add_argument( ...@@ -43,12 +40,12 @@ parser.add_argument(
help="Use gpu or not. (default: %(default)s)") help="Use gpu or not. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--num_threads_data", "--num_threads_data",
default=multiprocessing.cpu_count(), default=1,
type=int, type=int,
help="Number of cpu threads for preprocessing data. (default: %(default)s)") help="Number of cpu threads for preprocessing data. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--num_processes_beam_search", "--num_processes_beam_search",
default=multiprocessing.cpu_count(), default=multiprocessing.cpu_count() // 2,
type=int, type=int,
help="Number of cpu processes for beam search. (default: %(default)s)") help="Number of cpu processes for beam search. (default: %(default)s)")
parser.add_argument( parser.add_argument(
...@@ -57,6 +54,11 @@ parser.add_argument( ...@@ -57,6 +54,11 @@ parser.add_argument(
type=str, type=str,
help="Feature type of audio data: 'linear' (power spectrum)" help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)") " or 'mfcc'. (default: %(default)s)")
parser.add_argument(
"--trainer_count",
default=8,
type=int,
help="Trainer number. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--mean_std_filepath", "--mean_std_filepath",
default='mean_std.npz', default='mean_std.npz',
...@@ -81,18 +83,13 @@ parser.add_argument( ...@@ -81,18 +83,13 @@ parser.add_argument(
"--decode_method", "--decode_method",
default='beam_search', default='beam_search',
type=str, type=str,
help="Method for ctc decoding: best_path or beam_search. (default: %(default)s)" help="Method for ctc decoding: best_path or beam_search. "
) "(default: %(default)s)")
parser.add_argument( parser.add_argument(
"--beam_size", "--beam_size",
default=500, default=500,
type=int, type=int,
help="Width for beam search decoding. (default: %(default)d)") help="Width for beam search decoding. (default: %(default)d)")
parser.add_argument(
"--num_results_per_sample",
default=1,
type=int,
help="Number of output per sample in beam search. (default: %(default)d)")
parser.add_argument( parser.add_argument(
"--language_model_path", "--language_model_path",
default="lm/data/common_crawl_00.prune01111.trie.klm", default="lm/data/common_crawl_00.prune01111.trie.klm",
...@@ -100,12 +97,12 @@ parser.add_argument( ...@@ -100,12 +97,12 @@ parser.add_argument(
help="Path for language model. (default: %(default)s)") help="Path for language model. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--alpha", "--alpha",
default=0.26, default=0.36,
type=float, type=float,
help="Parameter associated with language model. (default: %(default)f)") help="Parameter associated with language model. (default: %(default)f)")
parser.add_argument( parser.add_argument(
"--beta", "--beta",
default=0.1, default=0.25,
type=float, type=float,
help="Parameter associated with word count. (default: %(default)f)") help="Parameter associated with word count. (default: %(default)f)")
parser.add_argument( parser.add_argument(
...@@ -114,42 +111,25 @@ parser.add_argument( ...@@ -114,42 +111,25 @@ parser.add_argument(
type=float, type=float,
help="The cutoff probability of pruning " help="The cutoff probability of pruning "
"in beam search. (default: %(default)f)") "in beam search. (default: %(default)f)")
parser.add_argument(
"--error_rate_type",
default='wer',
choices=['wer', 'cer'],
type=str,
help="Error rate type for evaluation. 'wer' for word error rate and 'cer' "
"for character error rate. "
"(default: %(default)s)")
args = parser.parse_args() args = parser.parse_args()
def infer(): def infer():
"""Inference for DeepSpeech2.""" """Inference for DeepSpeech2."""
# initialize data generator
data_generator = DataGenerator( data_generator = DataGenerator(
vocab_filepath=args.vocab_filepath, vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath, mean_std_filepath=args.mean_std_filepath,
augmentation_config='{}', augmentation_config='{}',
specgram_type=args.specgram_type, specgram_type=args.specgram_type,
num_threads=args.num_threads_data) num_threads=args.num_threads_data)
# create network config
# paddle.data_type.dense_array is used for variable batch input.
# The size 161 * 161 is only a placeholder value and the real shape
# of input batch data will be inferred during training.
audio_data = paddle.layer.data(
name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161))
text_data = paddle.layer.data(
name="transcript_text",
type=paddle.data_type.integer_value_sequence(data_generator.vocab_size))
output_probs = deep_speech2(
audio_data=audio_data,
text_data=text_data,
dict_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_size=args.rnn_layer_size,
is_inference=True)
# load parameters
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(args.model_filepath))
# prepare infer data
batch_reader = data_generator.batch_reader_creator( batch_reader = data_generator.batch_reader_creator(
manifest_path=args.decode_manifest_path, manifest_path=args.decode_manifest_path,
batch_size=args.num_samples, batch_size=args.num_samples,
...@@ -158,66 +138,38 @@ def infer(): ...@@ -158,66 +138,38 @@ def infer():
shuffle_method=None) shuffle_method=None)
infer_data = batch_reader().next() infer_data = batch_reader().next()
# run inference ds2_model = DeepSpeech2Model(
infer_results = paddle.infer( vocab_size=data_generator.vocab_size,
output_layer=output_probs, parameters=parameters, input=infer_data) num_conv_layers=args.num_conv_layers,
num_steps = len(infer_results) // len(infer_data) num_rnn_layers=args.num_rnn_layers,
probs_split = [ rnn_layer_size=args.rnn_layer_size,
infer_results[i * num_steps:(i + 1) * num_steps] pretrained_model_path=args.model_filepath)
for i in xrange(len(infer_data)) result_transcripts = ds2_model.infer_batch(
] infer_data=infer_data,
decode_method=args.decode_method,
beam_alpha=args.alpha,
beam_beta=args.beta,
beam_size=args.beam_size,
cutoff_prob=args.cutoff_prob,
vocab_list=data_generator.vocab_list,
language_model_path=args.language_model_path,
num_processes=args.num_processes_beam_search)
# target transcription error_rate_func = cer if args.error_rate_type == 'cer' else wer
target_transcription = [ target_transcripts = [
''.join( ''.join([data_generator.vocab_list[token] for token in transcript])
[data_generator.vocab_list[index] for index in infer_data[i][1]]) for _, transcript in infer_data
for i, probs in enumerate(probs_split)
] ]
for target, result in zip(target_transcripts, result_transcripts):
## decode and print
# best path decode
wer_sum, wer_counter = 0, 0
if args.decode_method == "best_path":
for i, probs in enumerate(probs_split):
best_path_transcription = ctc_best_path_decoder(
probs_seq=probs, vocabulary=data_generator.vocab_list)
print("\nTarget Transcription: %s\nOutput Transcription: %s" % print("\nTarget Transcription: %s\nOutput Transcription: %s" %
(target_transcription[i], best_path_transcription)) (target, result))
wer_cur = wer(target_transcription[i], best_path_transcription) print("Current error rate [%s] = %f" %
wer_sum += wer_cur (args.error_rate_type, error_rate_func(target, result)))
wer_counter += 1
print("cur wer = %f, average wer = %f" %
(wer_cur, wer_sum / wer_counter))
# beam search decode
elif args.decode_method == "beam_search":
ext_scorer = LmScorer(args.alpha, args.beta, args.language_model_path)
beam_search_batch_results = ctc_beam_search_decoder_batch(
probs_split=probs_split,
vocabulary=data_generator.vocab_list,
beam_size=args.beam_size,
blank_id=len(data_generator.vocab_list),
num_processes=args.num_processes_beam_search,
cutoff_prob=args.cutoff_prob,
ext_scoring_func=ext_scorer, )
for i, beam_search_result in enumerate(beam_search_batch_results):
print("\nTarget Transcription:\t%s" % target_transcription[i])
for index in xrange(args.num_results_per_sample):
result = beam_search_result[index]
#output: index, log prob, beam result
print("Beam %d: %f \t%s" % (index, result[0], result[1]))
wer_cur = wer(target_transcription[i], beam_search_result[0][1])
wer_sum += wer_cur
wer_counter += 1
print("cur wer = %f , average wer = %f" %
(wer_cur, wer_sum / wer_counter))
else:
raise ValueError("Decoding method [%s] is not supported." %
decode_method)
def main(): def main():
utils.print_arguments(args) utils.print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=1) paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)
infer() infer()
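The option changes in this file can be tried in isolation. The sketch below reproduces two of them with the standard `argparse` module: the new `--error_rate_type` option restricted by `choices`, and the halved default for `--num_processes_beam_search`. The `max(1, ...)` clamp is an assumption added for this sketch and is not part of the commit; it only guards against `cpu_count() // 2` evaluating to 0 on a single-core machine.

```python
from __future__ import print_function

import argparse
import multiprocessing

parser = argparse.ArgumentParser(description="Option parsing sketch.")
parser.add_argument(
    "--error_rate_type",
    default='wer',
    choices=['wer', 'cer'],  # any other value makes argparse exit with an error
    type=str,
    help="Error rate type for evaluation. (default: %(default)s)")
parser.add_argument(
    "--num_processes_beam_search",
    # Assumption for this sketch: clamp to at least one process, since
    # cpu_count() // 2 evaluates to 0 on a single-core machine.
    default=max(1, multiprocessing.cpu_count() // 2),
    type=int,
    help="Number of cpu processes for beam search. (default: %(default)s)")

# Parse an explicit argument list instead of sys.argv for demonstration.
args = parser.parse_args(["--error_rate_type", "cer"])
print(args.error_rate_type, args.num_processes_beam_search)
```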
......
"""Contains DeepSpeech2 layers."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import paddle.v2 as paddle
def conv_bn_layer(input, filter_size, num_channels_in, num_channels_out, stride,
padding, act):
"""Convolution layer with batch normalization.
:param input: Input layer.
:type input: LayerOutput
:param filter_size: The x dimension of a filter kernel, or a tuple for
both image dimensions.
:type filter_size: int|tuple|list
:param num_channels_in: Number of input channels.
:type num_channels_in: int
:param num_channels_out: Number of output channels.
:type num_channels_out: int
:param stride: The x dimension of the stride, or a tuple for both
image dimensions.
:type stride: int|tuple|list
:param padding: The x dimension of the padding, or a tuple for both
image dimensions.
:type padding: int|tuple|list
:param act: Activation type.
:type act: BaseActivation
:return: Batch norm layer after convolution layer.
:rtype: LayerOutput
"""
conv_layer = paddle.layer.img_conv(
input=input,
filter_size=filter_size,
num_channels=num_channels_in,
num_filters=num_channels_out,
stride=stride,
padding=padding,
act=paddle.activation.Linear(),
bias_attr=False)
return paddle.layer.batch_norm(input=conv_layer, act=act)
def bidirectional_simple_rnn_bn_layer(name, input, size, act):
"""Bidirectional simple rnn layer with sequence-wise batch normalization.
The batch normalization is only performed on input-state weights.
:param name: Name of the layer.
:type name: string
:param input: Input layer.
:type input: LayerOutput
:param size: Number of RNN cells.
:type size: int
:param act: Activation type.
:type act: BaseActivation
:return: Bidirectional simple rnn layer.
:rtype: LayerOutput
"""
# input-hidden weights shared across bidirectional rnn.
input_proj = paddle.layer.fc(
input=input, size=size, act=paddle.activation.Linear(), bias_attr=False)
# batch norm is only performed on input-state projection
input_proj_bn = paddle.layer.batch_norm(
input=input_proj, act=paddle.activation.Linear())
# forward and backward in time
forward_simple_rnn = paddle.layer.recurrent(
input=input_proj_bn, act=act, reverse=False)
backward_simple_rnn = paddle.layer.recurrent(
input=input_proj_bn, act=act, reverse=True)
return paddle.layer.concat(input=[forward_simple_rnn, backward_simple_rnn])
def conv_group(input, num_stacks):
"""Convolution group with stacked convolution layers.
:param input: Input layer.
:type input: LayerOutput
:param num_stacks: Number of stacked convolution layers.
:type num_stacks: int
:return: Output layer of the convolution group.
:rtype: LayerOutput
"""
conv = conv_bn_layer(
input=input,
filter_size=(11, 41),
num_channels_in=1,
num_channels_out=32,
stride=(3, 2),
padding=(5, 20),
act=paddle.activation.BRelu())
for i in xrange(num_stacks - 1):
conv = conv_bn_layer(
input=conv,
filter_size=(11, 21),
num_channels_in=32,
num_channels_out=32,
stride=(1, 2),
padding=(5, 10),
act=paddle.activation.BRelu())
output_num_channels = 32
output_height = 160 // pow(2, num_stacks) + 1
return conv, output_num_channels, output_height
def rnn_group(input, size, num_stacks):
"""RNN group with stacked bidirectional simple RNN layers.
:param input: Input layer.
:type input: LayerOutput
:param size: Number of RNN cells in each layer.
:type size: int
:param num_stacks: Number of stacked rnn layers.
:type num_stacks: int
:return: Output layer of the RNN group.
:rtype: LayerOutput
"""
output = input
for i in xrange(num_stacks):
output = bidirectional_simple_rnn_bn_layer(
name=str(i), input=output, size=size, act=paddle.activation.BRelu())
return output
def deep_speech2(audio_data,
text_data,
dict_size,
num_conv_layers=2,
num_rnn_layers=3,
rnn_size=256):
"""
The whole DeepSpeech2 model structure (a simplified version).
:param audio_data: Audio spectrogram data layer.
:type audio_data: LayerOutput
:param text_data: Transcription text data layer.
:type text_data: LayerOutput
:param dict_size: Dictionary size for tokenized transcription.
:type dict_size: int
:param num_conv_layers: Number of stacked convolution layers.
:type num_conv_layers: int
:param num_rnn_layers: Number of stacked RNN layers.
:type num_rnn_layers: int
:param rnn_size: RNN layer size (number of RNN cells).
:type rnn_size: int
:return: A tuple of an output probability distribution layer (after
softmax) and a ctc cost layer.
:rtype: tuple of LayerOutput
"""
# convolution group
conv_group_output, conv_group_num_channels, conv_group_height = conv_group(
input=audio_data, num_stacks=num_conv_layers)
# convert data from convolution feature map to sequence of vectors
conv2seq = paddle.layer.block_expand(
input=conv_group_output,
num_channels=conv_group_num_channels,
stride_x=1,
stride_y=1,
block_x=1,
block_y=conv_group_height)
# rnn group
rnn_group_output = rnn_group(
input=conv2seq, size=rnn_size, num_stacks=num_rnn_layers)
fc = paddle.layer.fc(
input=rnn_group_output,
size=dict_size + 1,
act=paddle.activation.Linear(),
bias_attr=True)
# probability distribution with softmax
log_probs = paddle.layer.mixed(
input=paddle.layer.identity_projection(input=fc),
act=paddle.activation.Softmax())
# ctc cost
ctc_loss = paddle.layer.warp_ctc(
input=fc,
label=text_data,
size=dict_size + 1,
blank=dict_size,
norm_by_times=True)
return log_probs, ctc_loss
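The closed form `160 // pow(2, num_stacks) + 1` used in `conv_group` can be cross-checked against the per-layer settings. The sketch below assumes that the second element of each `filter_size`, `stride` and `padding` tuple applies to the 161-bin frequency axis, and that the output size follows the usual floor formula `(input + 2 * padding - filter) // stride + 1`. Both helper functions are hypothetical names introduced for this check.

```python
from __future__ import print_function


def conv_output_size(input_size, filter_size, padding, stride):
    """Output length of a convolution along one axis (floor mode)."""
    return (input_size + 2 * padding - filter_size) // stride + 1


def conv_group_output_height(num_stacks, input_height=161):
    """Apply the assumed frequency-axis settings of each stacked layer."""
    # First layer: filter 41, padding 20, stride 2.
    height = conv_output_size(input_height, 41, 20, 2)
    # Remaining layers: filter 21, padding 10, stride 2.
    for _ in range(num_stacks - 1):
        height = conv_output_size(height, 21, 10, 2)
    return height


for num_stacks in (1, 2, 3):
    closed_form = 160 // pow(2, num_stacks) + 1
    print(num_stacks, conv_group_output_height(num_stacks), closed_form)
```

Under these assumptions both computations agree (81, 41 and 21 for one, two and three stacked layers), which is the value passed as `block_y` to `block_expand`.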
...@@ -3,141 +3,240 @@ from __future__ import absolute_import ...@@ -3,141 +3,240 @@ from __future__ import absolute_import
from __future__ import division from __future__ import division
from __future__ import print_function from __future__ import print_function
import sys
import os
import time
import gzip
from decoder import *
from lm.lm_scorer import LmScorer
import paddle.v2 as paddle import paddle.v2 as paddle
from layer import *
def conv_bn_layer(input, filter_size, num_channels_in, num_channels_out, stride, class DeepSpeech2Model(object):
padding, act): """DeepSpeech2Model class.
"""
Convolution layer with batch normalization. :param vocab_size: Decoding vocabulary size.
""" :type vocab_size: int
conv_layer = paddle.layer.img_conv(
input=input,
filter_size=filter_size,
num_channels=num_channels_in,
num_filters=num_channels_out,
stride=stride,
padding=padding,
act=paddle.activation.Linear(),
bias_attr=False)
return paddle.layer.batch_norm(input=conv_layer, act=act)
def bidirectional_simple_rnn_bn_layer(name, input, size, act):
"""
Bidirectional simple rnn layer with sequence-wise batch normalization.
The batch normalization is only performed on input-state weights.
"""
# input-hidden weights shared across bidirectional rnn.
input_proj = paddle.layer.fc(
input=input, size=size, act=paddle.activation.Linear(), bias_attr=False)
# batch norm is only performed on input-state projection
input_proj_bn = paddle.layer.batch_norm(
input=input_proj, act=paddle.activation.Linear())
# forward and backward in time
forward_simple_rnn = paddle.layer.recurrent(
input=input_proj_bn, act=act, reverse=False)
backward_simple_rnn = paddle.layer.recurrent(
input=input_proj_bn, act=act, reverse=True)
return paddle.layer.concat(input=[forward_simple_rnn, backward_simple_rnn])
def conv_group(input, num_stacks):
"""
Convolution group with several stacking convolution layers.
"""
conv = conv_bn_layer(
input=input,
filter_size=(11, 41),
num_channels_in=1,
num_channels_out=32,
stride=(3, 2),
padding=(5, 20),
act=paddle.activation.BRelu())
for i in xrange(num_stacks - 1):
conv = conv_bn_layer(
input=conv,
filter_size=(11, 21),
num_channels_in=32,
num_channels_out=32,
stride=(1, 2),
padding=(5, 10),
act=paddle.activation.BRelu())
output_num_channels = 32
output_height = 160 // pow(2, num_stacks) + 1
return conv, output_num_channels, output_height
def rnn_group(input, size, num_stacks):
"""
RNN group with several stacking RNN layers.
"""
output = input
for i in xrange(num_stacks):
output = bidirectional_simple_rnn_bn_layer(
name=str(i), input=output, size=size, act=paddle.activation.BRelu())
return output
def deep_speech2(audio_data,
text_data,
dict_size,
num_conv_layers=2,
num_rnn_layers=3,
rnn_size=256,
is_inference=False):
"""
The whole DeepSpeech2 model structure (a simplified version).
:param audio_data: Audio spectrogram data layer.
:type audio_data: LayerOutput
:param text_data: Transcription text data layer.
:type text_data: LayerOutput
:param dict_size: Dictionary size for tokenized transcription.
:type dict_size: int
:param num_conv_layers: Number of stacking convolution layers. :param num_conv_layers: Number of stacking convolution layers.
:type num_conv_layers: int :type num_conv_layers: int
:param num_rnn_layers: Number of stacking RNN layers. :param num_rnn_layers: Number of stacking RNN layers.
:type num_rnn_layers: int :type num_rnn_layers: int
:param rnn_size: RNN layer size (number of RNN cells). :param rnn_layer_size: RNN layer size (number of RNN cells).
:type rnn_size: int :type rnn_layer_size: int
:param is_inference: False in the training mode, and True in the :param pretrained_model_path: Pretrained model path. If None, will train
inference mode. from scratch.
:type is_inference: bool :type pretrained_model_path: basestring|None
:return: If is_inference set False, return a ctc cost layer; """
if is_inference set True, return a sequence layer of output
probability distribution. def __init__(self, vocab_size, num_conv_layers, num_rnn_layers,
:rtype: tuple of LayerOutput rnn_layer_size, pretrained_model_path):
self._create_network(vocab_size, num_conv_layers, num_rnn_layers,
rnn_layer_size)
self._create_parameters(pretrained_model_path)
self._inferer = None
self._loss_inferer = None
self._ext_scorer = None
def train(self,
train_batch_reader,
dev_batch_reader,
feeding_dict,
learning_rate,
gradient_clipping,
num_passes,
output_model_dir,
num_iterations_print=100):
"""Train the model.
:param train_batch_reader: Train data reader.
:type train_batch_reader: callable
:param dev_batch_reader: Validation data reader.
:type dev_batch_reader: callable
:param feeding_dict: Feeding is a map of field name and tuple index
of the data that reader returns.
:type feeding_dict: dict|list
:param learning_rate: Learning rate for ADAM optimizer.
:type learning_rate: float
:param gradient_clipping: Gradient clipping threshold.
:type gradient_clipping: float
:param num_passes: Number of training epochs.
:type num_passes: int
:param num_iterations_print: Number of training iterations for printing
a training loss.
:type num_iterations_print: int
:param output_model_dir: Directory for saving the model (every pass).
:type output_model_dir: basestring
"""
# prepare model output directory
if not os.path.exists(output_model_dir):
os.mkdir(output_model_dir)
# prepare optimizer and trainer
optimizer = paddle.optimizer.Adam(
learning_rate=learning_rate,
gradient_clipping_threshold=gradient_clipping)
trainer = paddle.trainer.SGD(
cost=self._loss,
parameters=self._parameters,
update_equation=optimizer)
# create event handler
def event_handler(event):
global start_time, cost_sum, cost_counter
if isinstance(event, paddle.event.EndIteration):
cost_sum += event.cost
cost_counter += 1
if (event.batch_id + 1) % num_iterations_print == 0:
output_model_path = os.path.join(output_model_dir,
"params.latest.tar.gz")
with gzip.open(output_model_path, 'w') as f:
self._parameters.to_tar(f)
print("\nPass: %d, Batch: %d, TrainCost: %f" %
(event.pass_id, event.batch_id + 1,
cost_sum / cost_counter))
cost_sum, cost_counter = 0.0, 0
else:
sys.stdout.write('.')
sys.stdout.flush()
if isinstance(event, paddle.event.BeginPass):
start_time = time.time()
cost_sum, cost_counter = 0.0, 0
if isinstance(event, paddle.event.EndPass):
result = trainer.test(
reader=dev_batch_reader, feeding=feeding_dict)
output_model_path = os.path.join(
output_model_dir, "params.pass-%d.tar.gz" % event.pass_id)
with gzip.open(output_model_path, 'w') as f:
self._parameters.to_tar(f)
print("\n------- Time: %d sec, Pass: %d, ValidationCost: %s" %
(time.time() - start_time, event.pass_id, result.cost))
# run train
trainer.train(
reader=train_batch_reader,
event_handler=event_handler,
num_passes=num_passes,
feeding=feeding_dict)
def infer_loss_batch(self, infer_data):
"""Model inference. Infer the ctc loss for a batch of speech
utterances.
:param infer_data: List of utterances to infer, with each utterance a
tuple of audio features and transcription text (empty
string).
:type infer_data: list
:return: List of ctc loss.
:rtype: List of float
""" """
# convolution group # define inferer
conv_group_output, conv_group_num_channels, conv_group_height = conv_group( if self._loss_inferer is None:
input=audio_data, num_stacks=num_conv_layers) self._loss_inferer = paddle.inference.Inference(
# convert data from convolution feature map to sequence of vectors output_layer=self._loss, parameters=self._parameters)
conv2seq = paddle.layer.block_expand( # run inference
input=conv_group_output, return self._loss_inferer.infer(input=infer_data)
num_channels=conv_group_num_channels,
stride_x=1, def infer_batch(self, infer_data, decode_method, beam_alpha, beam_beta,
stride_y=1, beam_size, cutoff_prob, vocab_list, language_model_path,
block_x=1, num_processes):
block_y=conv_group_height) """Model inference. Infer the transcription for a batch of speech
# rnn group utterances.
rnn_group_output = rnn_group(
input=conv2seq, size=rnn_size, num_stacks=num_rnn_layers) :param infer_data: List of utterances to infer, with each utterance
fc = paddle.layer.fc( consisting of a tuple of audio features and
input=rnn_group_output, transcription text (empty string).
size=dict_size + 1, :type infer_data: list
act=paddle.activation.Linear(), :param decode_method: Decoding method name, 'best_path' or
bias_attr=True) 'beam_search'.
if is_inference: :type decode_method: string
# probability distribution with softmax :param beam_alpha: Parameter associated with language model.
return paddle.layer.mixed( :type beam_alpha: float
input=paddle.layer.identity_projection(input=fc), :param beam_beta: Parameter associated with word count.
act=paddle.activation.Softmax()) :type beam_beta: float
:param beam_size: Width for beam search.
:type beam_size: int
:param cutoff_prob: Cutoff probability for pruning; 1.0 means no
pruning.
:type cutoff_prob: float
:param vocab_list: List of tokens in the vocabulary, for decoding.
:type vocab_list: list
:param language_model_path: Filepath for language model.
:type language_model_path: basestring|None
:param num_processes: Number of processes (CPU) for decoder.
:type num_processes: int
:return: List of transcription texts.
:rtype: List of basestring
"""
# define inferer
if self._inferer is None:
self._inferer = paddle.inference.Inference(
output_layer=self._log_probs, parameters=self._parameters)
# run inference
infer_results = self._inferer.infer(input=infer_data)
num_steps = len(infer_results) // len(infer_data)
probs_split = [
infer_results[i * num_steps:(i + 1) * num_steps]
for i in xrange(0, len(infer_data))
]
# run decoder
results = []
if decode_method == "best_path":
# best path decode
for i, probs in enumerate(probs_split):
output_transcription = ctc_best_path_decoder(
probs_seq=probs, vocabulary=vocab_list)
results.append(output_transcription)
elif decode_method == "beam_search":
# initialize external scorer
if self._ext_scorer is None:
self._ext_scorer = LmScorer(beam_alpha, beam_beta,
language_model_path)
self._loaded_lm_path = language_model_path
else:
self._ext_scorer.reset_params(beam_alpha, beam_beta)
assert self._loaded_lm_path == language_model_path
# beam search decode
beam_search_results = ctc_beam_search_decoder_batch(
probs_split=probs_split,
vocabulary=vocab_list,
beam_size=beam_size,
blank_id=len(vocab_list),
num_processes=num_processes,
ext_scoring_func=self._ext_scorer,
cutoff_prob=cutoff_prob)
results = [result[0][1] for result in beam_search_results]
else: else:
# ctc cost raise ValueError("Decoding method [%s] is not supported." %
return paddle.layer.warp_ctc( decode_method)
input=fc, return results
label=text_data,
size=dict_size + 1, def _create_parameters(self, model_path=None):
blank=dict_size, """Load or create model parameters."""
norm_by_times=True) if model_path is None:
self._parameters = paddle.parameters.create(self._loss)
else:
self._parameters = paddle.parameters.Parameters.from_tar(
gzip.open(model_path))
def _create_network(self, vocab_size, num_conv_layers, num_rnn_layers,
rnn_layer_size):
"""Create data layers and model network."""
# paddle.data_type.dense_array is used for variable batch input.
# The size 161 * 161 is only a placeholder value and the real shape
# of input batch data will be inferred during training.
audio_data = paddle.layer.data(
name="audio_spectrogram",
type=paddle.data_type.dense_array(161 * 161))
text_data = paddle.layer.data(
name="transcript_text",
type=paddle.data_type.integer_value_sequence(vocab_size))
self._log_probs, self._loss = deep_speech2(
audio_data=audio_data,
text_data=text_data,
dict_size=vocab_size,
num_conv_layers=num_conv_layers,
num_rnn_layers=num_rnn_layers,
rnn_size=rnn_layer_size)
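`infer_batch` first cuts the flat inference output into one chunk per utterance and then decodes each chunk. The sketch below mirrors those two steps on toy data. `split_by_utterance` and `best_path_decode` are hypothetical helpers written for this example; the second is a plain greedy CTC decoder (argmax per step, merge repeats, drop blanks) and is not the repository's `ctc_best_path_decoder`. As in the diff, the blank index is taken to be `len(vocab_list)`, and the integer division assumes that every utterance contributes the same number of rows.

```python
from __future__ import print_function


def split_by_utterance(infer_results, num_utterances):
    """Cut a flat list of per-step rows into equal-length chunks."""
    num_steps = len(infer_results) // num_utterances
    return [
        infer_results[i * num_steps:(i + 1) * num_steps]
        for i in range(num_utterances)
    ]


def best_path_decode(probs_seq, vocab_list):
    """Hypothetical greedy CTC decoder; blank index is len(vocab_list)."""
    blank_id = len(vocab_list)
    # Most likely token index at every time step.
    best_ids = [max(range(len(row)), key=row.__getitem__) for row in probs_seq]
    # Collapse consecutive repeats, then drop blanks.
    collapsed = [
        idx for pos, idx in enumerate(best_ids)
        if pos == 0 or idx != best_ids[pos - 1]
    ]
    return ''.join(vocab_list[idx] for idx in collapsed if idx != blank_id)


vocab_list = ['a', 'b']  # toy vocabulary; index 2 is the blank
flat_results = [
    # utterance 0: a, a, blank, b
    [0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.7, 0.1],
    # utterance 1: b, blank, b, b
    [0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.1, 0.6, 0.3], [0.2, 0.7, 0.1],
]
probs_split = split_by_utterance(flat_results, num_utterances=2)
print([best_path_decode(probs, vocab_list) for probs in probs_split])
```

The blank between the two `b` steps in the second utterance is what allows a doubled letter to survive the repeat-merging step.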
DATA_PATH=$1
MODEL_PATH=$2
# set by user
TRAIN_MANI=${DATA_PATH}/cloud.train.manifest
# set by user
DEV_MANI=${DATA_PATH}/cloud.test.manifest
# set by user
TRAIN_TAR=${DATA_PATH}/cloud.train.tar
# set by user
DEV_TAR=${DATA_PATH}/cloud.test.tar
# set by user
VOCAB_PATH=${DATA_PATH}/eng_vocab.txt
# set by user
MEAN_STD_FILE=${DATA_PATH}/mean_std.npz
# split train data for each pcloud node
python ./cloud/split_data.py \
--in_manifest_path=$TRAIN_MANI \
--data_tar_path=$TRAIN_TAR \
--out_manifest_path='./local.train.manifest'
# split dev data for each pcloud node
python ./cloud/split_data.py \
--in_manifest_path=$DEV_MANI \
--data_tar_path=$DEV_TAR \
--out_manifest_path='./local.test.manifest'
python train.py \
--use_gpu=1 \
--trainer_count=4 \
--batch_size=256 \
--mean_std_filepath=$MEAN_STD_FILE \
--train_manifest_path='./local.train.manifest' \
--dev_manifest_path='./local.test.manifest' \
--vocab_filepath=$VOCAB_PATH \
...@@ -9,25 +9,21 @@ if [ $? != 0 ]; then ...@@ -9,25 +9,21 @@ if [ $? != 0 ]; then
exit 1 exit 1
fi fi
# install package Soundfile # install package libsndfile
curl -O "http://www.mega-nerd.com/libsndfile/files/libsndfile-1.0.28.tar.gz" python -c "import soundfile"
if [ $? != 0 ]; then if [ $? != 0 ]; then
echo "Install package libsndfile into default system path."
curl -O "http://www.mega-nerd.com/libsndfile/files/libsndfile-1.0.28.tar.gz"
if [ $? != 0 ]; then
echo "Download libsndfile-1.0.28.tar.gz failed !!!" echo "Download libsndfile-1.0.28.tar.gz failed !!!"
exit 1 exit 1
fi
tar -zxvf libsndfile-1.0.28.tar.gz
cd libsndfile-1.0.28
./configure && make && make install
cd ..
rm -rf libsndfile-1.0.28
rm libsndfile-1.0.28.tar.gz
fi fi
tar -zxvf libsndfile-1.0.28.tar.gz
cd libsndfile-1.0.28
./configure && make && make install
cd -
rm -rf libsndfile-1.0.28
rm libsndfile-1.0.28.tar.gz
pip install SoundFile==0.9.0.post1
if [ $? != 0 ]; then
echo "Install SoundFile failed !!!"
exit 1
fi
# prepare ./checkpoints
mkdir checkpoints
echo "Install all dependencies successfully." echo "Install all dependencies successfully."
...@@ -11,16 +11,54 @@ import error_rate ...@@ -11,16 +11,54 @@ import error_rate
class TestParse(unittest.TestCase): class TestParse(unittest.TestCase):
def test_wer_1(self): def test_wer_1(self):
ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night' ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night'
hyp = 'i GOT IT TO the FULLEST i LOVE TO portable FROM OF STORES last night' hyp = 'i GOT IT TO the FULLEST i LOVE TO portable FROM OF STORES last '\
'night'
word_error_rate = error_rate.wer(ref, hyp) word_error_rate = error_rate.wer(ref, hyp)
self.assertTrue(abs(word_error_rate - 0.769230769231) < 1e-6) self.assertTrue(abs(word_error_rate - 0.769230769231) < 1e-6)
def test_wer_2(self): def test_wer_2(self):
ref = 'as any in england i would say said gamewell proudly that is '\
'in his day'
hyp = 'as any in england i would say said came well proudly that is '\
'in his day'
word_error_rate = error_rate.wer(ref, hyp)
self.assertTrue(abs(word_error_rate - 0.1333333) < 1e-6)
def test_wer_3(self):
ref = 'the lieutenant governor lilburn w boggs afterward governor '\
'was a pronounced mormon hater and throughout the period of '\
'the troubles he manifested sympathy with the persecutors'
hyp = 'the lieutenant governor little bit how bags afterward '\
'governor was a pronounced warman hater and throughout the '\
'period of th troubles he manifests sympathy with the '\
'persecutors'
word_error_rate = error_rate.wer(ref, hyp)
self.assertTrue(abs(word_error_rate - 0.2692307692) < 1e-6)
def test_wer_4(self):
ref = 'the wood flamed up splendidly under the large brewing copper '\
'and it sighed so deeply'
hyp = 'the wood flame do splendidly under the large brewing copper '\
'and its side so deeply'
word_error_rate = error_rate.wer(ref, hyp)
self.assertTrue(abs(word_error_rate - 0.2666666667) < 1e-6)
def test_wer_5(self):
ref = 'all the morning they trudged up the mountain path and at noon '\
'unc and ojo sat on a fallen tree trunk and ate the last of '\
'the bread which the old munchkin had placed in his pocket'
hyp = 'all the morning they trudged up the mountain path and at noon '\
'unc in ojo sat on a fallen tree trunk and ate the last of '\
'the bread which the old munchkin had placed in his pocket'
word_error_rate = error_rate.wer(ref, hyp)
self.assertTrue(abs(word_error_rate - 0.027027027) < 1e-6)
def test_wer_6(self):
ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night' ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night'
word_error_rate = error_rate.wer(ref, ref) word_error_rate = error_rate.wer(ref, ref)
self.assertEqual(word_error_rate, 0.0) self.assertEqual(word_error_rate, 0.0)
def test_wer_3(self): def test_wer_7(self):
ref = ' ' ref = ' '
hyp = 'Hypothesis sentence' hyp = 'Hypothesis sentence'
with self.assertRaises(ValueError): with self.assertRaises(ValueError):
...@@ -33,22 +71,40 @@ class TestParse(unittest.TestCase): ...@@ -33,22 +71,40 @@ class TestParse(unittest.TestCase):
self.assertTrue(abs(char_error_rate - 0.25) < 1e-6) self.assertTrue(abs(char_error_rate - 0.25) < 1e-6)
def test_cer_2(self): def test_cer_2(self):
ref = 'werewolf'
hyp = 'weae wolf'
char_error_rate = error_rate.cer(ref, hyp, remove_space=True)
self.assertTrue(abs(char_error_rate - 0.125) < 1e-6)
def test_cer_3(self):
ref = 'were wolf'
hyp = 'were wolf'
char_error_rate = error_rate.cer(ref, hyp)
self.assertTrue(abs(char_error_rate - 0.0) < 1e-6)
def test_cer_4(self):
ref = 'werewolf' ref = 'werewolf'
char_error_rate = error_rate.cer(ref, ref) char_error_rate = error_rate.cer(ref, ref)
self.assertEqual(char_error_rate, 0.0) self.assertEqual(char_error_rate, 0.0)
def test_cer_3(self): def test_cer_5(self):
ref = u'我是中国人' ref = u'我是中国人'
hyp = u'我是 美洲人' hyp = u'我是 美洲人'
char_error_rate = error_rate.cer(ref, hyp) char_error_rate = error_rate.cer(ref, hyp)
self.assertTrue(abs(char_error_rate - 0.6) < 1e-6) self.assertTrue(abs(char_error_rate - 0.6) < 1e-6)
def test_cer_4(self): def test_cer_6(self):
ref = u'我 是 中 国 人'
hyp = u'我 是 美 洲 人'
char_error_rate = error_rate.cer(ref, hyp, remove_space=True)
self.assertTrue(abs(char_error_rate - 0.4) < 1e-6)
def test_cer_7(self):
ref = u'我是中国人' ref = u'我是中国人'
char_error_rate = error_rate.cer(ref, ref) char_error_rate = error_rate.cer(ref, ref)
self.assertEqual(char_error_rate, 0.0) self.assertEqual(char_error_rate, 0.0)
def test_cer_5(self): def test_cer_8(self):
ref = '' ref = ''
hyp = 'Hypothesis' hyp = 'Hypothesis'
with self.assertRaises(ValueError): with self.assertRaises(ValueError):
......
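The expected values asserted in the WER tests above are consistent with a word-level Levenshtein distance divided by the number of words in the reference. The following is a minimal sketch under that assumption. The names `edit_distance` and `word_error_rate` are illustrative and are not the functions defined in `error_rate.py`, whose implementation is not shown in this diff.

```python
def edit_distance(ref_tokens, hyp_tokens):
    """Levenshtein distance between two token sequences (illustrative helper)."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp_tokens) + 1))
    for i, ref_tok in enumerate(ref_tokens, 1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp_tokens, 1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance normalized by the reference length."""
    ref_words = reference.split()
    if not ref_words:
        # Assumption: an empty reference is rejected, as the tests expect.
        raise ValueError("Reference must contain at least one word.")
    return edit_distance(ref_words, hypothesis.split()) / float(len(ref_words))
```

With the strings from `test_wer_2`, the hypothesis differs from the 15-word reference by one substitution and one insertion ("gamewell" becomes "came well"), so the sketch gives 2 / 15, which matches the asserted value of 0.1333333.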
"""Test Setup."""
import unittest
import numpy as np
import os
class TestSetup(unittest.TestCase):
def test_soundfile(self):
import soundfile as sf
# floating point data is typically limited to the interval [-1.0, 1.0],
# but smaller/larger values are supported as well
data = np.array([[1.75, -1.75], [1.0, -1.0], [0.5, -0.5],
[0.25, -0.25]])
file = 'test.wav'
sf.write(file, data, 44100, format='WAV', subtype='FLOAT')
read, fs = sf.read(file)
self.assertTrue(np.all(read == data))
self.assertEqual(fs, 44100)
os.remove(file)
if __name__ == '__main__':
unittest.main()
"""Set up paths for DS2"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os.path
import sys
def add_path(path):
if path not in sys.path:
sys.path.insert(0, path)
this_dir = os.path.dirname(__file__)
# Add project path to PYTHONPATH
proj_path = os.path.join(this_dir, '..')
add_path(proj_path)
"""Build vocabulary from manifest files.
Each item in vocabulary file is a character.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import codecs
import json
from collections import Counter
import os.path
import _init_paths
from data_utils import utils
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--manifest_paths",
type=str,
help="Manifest paths for building vocabulary. "
"You can provide multiple manifest files.",
nargs='+',
required=True)
parser.add_argument(
"--count_threshold",
default=0,
type=int,
help="Characters whose counts are below the threshold will be truncated. "
"(default: %(default)i)")
parser.add_argument(
"--vocab_path",
default='datasets/vocab/zh_vocab.txt',
type=str,
help="File path to write the vocabulary. (default: %(default)s)")
args = parser.parse_args()
def count_manifest(counter, manifest_path):
manifest_jsons = utils.read_manifest(manifest_path)
for line_json in manifest_jsons:
for char in line_json['text']:
counter.update(char)
def main():
counter = Counter()
for manifest_path in args.manifest_paths:
count_manifest(counter, manifest_path)
count_sorted = sorted(counter.items(), key=lambda x: x[1], reverse=True)
with codecs.open(args.vocab_path, 'w', 'utf-8') as fout:
for char, count in count_sorted:
if count < args.count_threshold: break
fout.write(char + '\n')
if __name__ == '__main__':
main()
...@@ -4,6 +4,7 @@ from __future__ import division ...@@ -4,6 +4,7 @@ from __future__ import division
from __future__ import print_function from __future__ import print_function
import argparse import argparse
import _init_paths
from data_utils.normalizer import FeatureNormalizer from data_utils.normalizer import FeatureNormalizer
from data_utils.augmentor.augmentation import AugmentationPipeline from data_utils.augmentor.augmentation import AugmentationPipeline
from data_utils.featurizer.audio_featurizer import AudioFeaturizer from data_utils.featurizer.audio_featurizer import AudioFeaturizer
......
...@@ -3,15 +3,11 @@ from __future__ import absolute_import ...@@ -3,15 +3,11 @@ from __future__ import absolute_import
from __future__ import division from __future__ import division
from __future__ import print_function from __future__ import print_function
import sys
import os
import argparse import argparse
import gzip
import time
import distutils.util import distutils.util
import multiprocessing import multiprocessing
import paddle.v2 as paddle import paddle.v2 as paddle
from model import deep_speech2 from model import DeepSpeech2Model
from data_utils.data import DataGenerator from data_utils.data import DataGenerator
import utils import utils
...@@ -23,6 +19,12 @@ parser.add_argument( ...@@ -23,6 +19,12 @@ parser.add_argument(
default=200, default=200,
type=int, type=int,
help="Training pass number. (default: %(default)s)") help="Training pass number. (default: %(default)s)")
parser.add_argument(
"--num_iterations_print",
default=100,
type=int,
help="Number of iterations for every train cost printing. "
"(default: %(default)s)")
parser.add_argument( parser.add_argument(
"--num_conv_layers", "--num_conv_layers",
default=2, default=2,
...@@ -84,7 +86,7 @@ parser.add_argument( ...@@ -84,7 +86,7 @@ parser.add_argument(
help="Trainer number. (default: %(default)s)") help="Trainer number. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--num_threads_data", "--num_threads_data",
default=multiprocessing.cpu_count(), default=multiprocessing.cpu_count() // 2,
type=int, type=int,
help="Number of cpu threads for preprocessing data. (default: %(default)s)") help="Number of cpu threads for preprocessing data. (default: %(default)s)")
parser.add_argument( parser.add_argument(
...@@ -114,11 +116,14 @@ parser.add_argument( ...@@ -114,11 +116,14 @@ parser.add_argument(
help="If set None, the training will start from scratch. " help="If set None, the training will start from scratch. "
"Otherwise, the training will resume from " "Otherwise, the training will resume from "
"the existing model of this path. (default: %(default)s)") "the existing model of this path. (default: %(default)s)")
parser.add_argument(
"--output_model_dir",
default="./checkpoints",
type=str,
help="Directory for saving models. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--augmentation_config", "--augmentation_config",
default='[{"type": "shift", ' default=open('conf/augmentation.config', 'r').read(),
'"params": {"min_shift_ms": -5, "max_shift_ms": 5},'
'"prob": 1.0}]',
type=str, type=str,
help="Augmentation configuration in json-format. " help="Augmentation configuration in json-format. "
"(default: %(default)s)") "(default: %(default)s)")
...@@ -127,10 +132,7 @@ args = parser.parse_args() ...@@ -127,10 +132,7 @@ args = parser.parse_args()
def train(): def train():
"""DeepSpeech2 training.""" """DeepSpeech2 training."""
train_generator = DataGenerator(
# initialize data generator
def data_generator():
return DataGenerator(
vocab_filepath=args.vocab_filepath, vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath, mean_std_filepath=args.mean_std_filepath,
augmentation_config=args.augmentation_config, augmentation_config=args.augmentation_config,
...@@ -138,89 +140,40 @@ def train(): ...@@ -138,89 +140,40 @@ def train():
min_duration=args.min_duration, min_duration=args.min_duration,
specgram_type=args.specgram_type, specgram_type=args.specgram_type,
num_threads=args.num_threads_data) num_threads=args.num_threads_data)
dev_generator = DataGenerator(
train_generator = data_generator() vocab_filepath=args.vocab_filepath,
test_generator = data_generator() mean_std_filepath=args.mean_std_filepath,
augmentation_config="{}",
# create network config specgram_type=args.specgram_type,
# paddle.data_type.dense_array is used for variable batch input. num_threads=args.num_threads_data)
# The size 161 * 161 is only an placeholder value and the real shape
# of input batch data will be induced during training.
audio_data = paddle.layer.data(
name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161))
text_data = paddle.layer.data(
name="transcript_text",
type=paddle.data_type.integer_value_sequence(
train_generator.vocab_size))
cost = deep_speech2(
audio_data=audio_data,
text_data=text_data,
dict_size=train_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_size=args.rnn_layer_size,
is_inference=False)
# create/load parameters and optimizer
if args.init_model_path is None:
parameters = paddle.parameters.create(cost)
else:
if not os.path.isfile(args.init_model_path):
raise IOError("Invalid model!")
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(args.init_model_path))
optimizer = paddle.optimizer.Adam(
learning_rate=args.adam_learning_rate, gradient_clipping_threshold=400)
trainer = paddle.trainer.SGD(
cost=cost, parameters=parameters, update_equation=optimizer)
# prepare data reader
train_batch_reader = train_generator.batch_reader_creator( train_batch_reader = train_generator.batch_reader_creator(
manifest_path=args.train_manifest_path, manifest_path=args.train_manifest_path,
batch_size=args.batch_size, batch_size=args.batch_size,
min_batch_size=args.trainer_count, min_batch_size=args.trainer_count,
sortagrad=args.use_sortagrad if args.init_model_path is None else False, sortagrad=args.use_sortagrad if args.init_model_path is None else False,
shuffle_method=args.shuffle_method) shuffle_method=args.shuffle_method)
test_batch_reader = test_generator.batch_reader_creator( dev_batch_reader = dev_generator.batch_reader_creator(
manifest_path=args.dev_manifest_path, manifest_path=args.dev_manifest_path,
batch_size=args.batch_size, batch_size=args.batch_size,
min_batch_size=1, # must be 1, but will have errors. min_batch_size=1, # must be 1, but will have errors.
sortagrad=False, sortagrad=False,
shuffle_method=None) shuffle_method=None)
# create event handler ds2_model = DeepSpeech2Model(
def event_handler(event): vocab_size=train_generator.vocab_size,
global start_time, cost_sum, cost_counter num_conv_layers=args.num_conv_layers,
if isinstance(event, paddle.event.EndIteration): num_rnn_layers=args.num_rnn_layers,
cost_sum += event.cost rnn_layer_size=args.rnn_layer_size,
cost_counter += 1 pretrained_model_path=args.init_model_path)
if (event.batch_id + 1) % 100 == 0: ds2_model.train(
print("\nPass: %d, Batch: %d, TrainCost: %f" % ( train_batch_reader=train_batch_reader,
event.pass_id, event.batch_id + 1, cost_sum / cost_counter)) dev_batch_reader=dev_batch_reader,
cost_sum, cost_counter = 0.0, 0 feeding_dict=train_generator.feeding,
with gzip.open("checkpoints/params.latest.tar.gz", 'w') as f: learning_rate=args.adam_learning_rate,
parameters.to_tar(f) gradient_clipping=400,
else:
sys.stdout.write('.')
sys.stdout.flush()
if isinstance(event, paddle.event.BeginPass):
start_time = time.time()
cost_sum, cost_counter = 0.0, 0
if isinstance(event, paddle.event.EndPass):
result = trainer.test(
reader=test_batch_reader, feeding=test_generator.feeding)
print("\n------- Time: %d sec, Pass: %d, ValidationCost: %s" %
(time.time() - start_time, event.pass_id, result.cost))
with gzip.open("checkpoints/params.pass-%d.tar.gz" % event.pass_id,
'w') as f:
parameters.to_tar(f)
# run train
trainer.train(
reader=train_batch_reader,
event_handler=event_handler,
num_passes=args.num_passes, num_passes=args.num_passes,
feeding=train_generator.feeding) num_iterations_print=args.num_iterations_print,
output_model_dir=args.output_model_dir)
def main(): def main():
......
...@@ -3,14 +3,13 @@ from __future__ import absolute_import ...@@ -3,14 +3,13 @@ from __future__ import absolute_import
from __future__ import division from __future__ import division
from __future__ import print_function from __future__ import print_function
import numpy as np
import distutils.util import distutils.util
import argparse import argparse
import gzip import multiprocessing
import paddle.v2 as paddle import paddle.v2 as paddle
from data_utils.data import DataGenerator from data_utils.data import DataGenerator
from model import deep_speech2 from model import DeepSpeech2Model
from decoder import *
from lm.lm_scorer import LmScorer
from error_rate import wer from error_rate import wer
import utils import utils
...@@ -40,14 +39,19 @@ parser.add_argument( ...@@ -40,14 +39,19 @@ parser.add_argument(
default=True, default=True,
type=distutils.util.strtobool, type=distutils.util.strtobool,
help="Use gpu or not. (default: %(default)s)") help="Use gpu or not. (default: %(default)s)")
parser.add_argument(
"--trainer_count",
default=8,
type=int,
help="Trainer number. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--num_threads_data", "--num_threads_data",
default=multiprocessing.cpu_count(), default=1,
type=int, type=int,
help="Number of cpu threads for preprocessing data. (default: %(default)s)") help="Number of cpu threads for preprocessing data. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--num_processes_beam_search", "--num_processes_beam_search",
default=multiprocessing.cpu_count(), default=multiprocessing.cpu_count() // 2,
type=int, type=int,
help="Number of cpu processes for beam search. (default: %(default)s)") help="Number of cpu processes for beam search. (default: %(default)s)")
parser.add_argument( parser.add_argument(
...@@ -62,10 +66,10 @@ parser.add_argument( ...@@ -62,10 +66,10 @@ parser.add_argument(
type=str, type=str,
help="Manifest path for normalizer. (default: %(default)s)") help="Manifest path for normalizer. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--decode_manifest_path", "--tune_manifest_path",
default='datasets/manifest.test', default='datasets/manifest.dev',
type=str, type=str,
help="Manifest path for decoding. (default: %(default)s)") help="Manifest path for tuning. (default: %(default)s)")
parser.add_argument( parser.add_argument(
"--model_filepath", "--model_filepath",
default='checkpoints/params.latest.tar.gz', default='checkpoints/params.latest.tar.gz',
...@@ -127,96 +131,64 @@ args = parser.parse_args() ...@@ -127,96 +131,64 @@ args = parser.parse_args()
def tune(): def tune():
"""Tune parameters alpha and beta on one minibatch.""" """Tune parameters alpha and beta on one minibatch."""
if not args.num_alphas >= 0: if not args.num_alphas >= 0:
raise ValueError("num_alphas must be non-negative!") raise ValueError("num_alphas must be non-negative!")
if not args.num_betas >= 0: if not args.num_betas >= 0:
raise ValueError("num_betas must be non-negative!") raise ValueError("num_betas must be non-negative!")
# initialize data generator
data_generator = DataGenerator( data_generator = DataGenerator(
vocab_filepath=args.vocab_filepath, vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath, mean_std_filepath=args.mean_std_filepath,
augmentation_config='{}', augmentation_config='{}',
specgram_type=args.specgram_type, specgram_type=args.specgram_type,
num_threads=args.num_threads_data) num_threads=args.num_threads_data)
# create network config
# paddle.data_type.dense_array is used for variable batch input.
# The size 161 * 161 is only an placeholder value and the real shape
# of input batch data will be induced during training.
audio_data = paddle.layer.data(
name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161))
text_data = paddle.layer.data(
name="transcript_text",
type=paddle.data_type.integer_value_sequence(data_generator.vocab_size))
output_probs = deep_speech2(
audio_data=audio_data,
text_data=text_data,
dict_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_size=args.rnn_layer_size,
is_inference=True)
# load parameters
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(args.model_filepath))
# prepare infer data
batch_reader = data_generator.batch_reader_creator( batch_reader = data_generator.batch_reader_creator(
manifest_path=args.decode_manifest_path, manifest_path=args.tune_manifest_path,
batch_size=args.num_samples, batch_size=args.num_samples,
sortagrad=False, sortagrad=False,
shuffle_method=None) shuffle_method=None)
# get one batch data for tuning tune_data = batch_reader().next()
infer_data = batch_reader().next() target_transcripts = [
''.join([data_generator.vocab_list[token] for token in transcript])
# run inference for _, transcript in tune_data
infer_results = paddle.infer(
output_layer=output_probs, parameters=parameters, input=infer_data)
num_steps = len(infer_results) // len(infer_data)
probs_split = [
infer_results[i * num_steps:(i + 1) * num_steps]
for i in xrange(0, len(infer_data))
] ]
ds2_model = DeepSpeech2Model(
vocab_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
pretrained_model_path=args.model_filepath)
# create grid for search # create grid for search
cand_alphas = np.linspace(args.alpha_from, args.alpha_to, args.num_alphas) cand_alphas = np.linspace(args.alpha_from, args.alpha_to, args.num_alphas)
cand_betas = np.linspace(args.beta_from, args.beta_to, args.num_betas) cand_betas = np.linspace(args.beta_from, args.beta_to, args.num_betas)
params_grid = [(alpha, beta) for alpha in cand_alphas params_grid = [(alpha, beta) for alpha in cand_alphas
for beta in cand_betas] for beta in cand_betas]
ext_scorer = LmScorer(args.alpha_from, args.beta_from,
args.language_model_path)
## tune parameters in loop ## tune parameters in loop
for alpha, beta in params_grid: for alpha, beta in params_grid:
wer_sum, wer_counter = 0, 0 result_transcripts = ds2_model.infer_batch(
# reset scorer infer_data=tune_data,
ext_scorer.reset_params(alpha, beta) decode_method='beam_search',
# beam search using multiple processes beam_alpha=alpha,
beam_search_results = ctc_beam_search_decoder_batch( beam_beta=beta,
probs_split=probs_split,
vocabulary=data_generator.vocab_list,
beam_size=args.beam_size, beam_size=args.beam_size,
cutoff_prob=args.cutoff_prob, cutoff_prob=args.cutoff_prob,
blank_id=len(data_generator.vocab_list), vocab_list=data_generator.vocab_list,
num_processes=args.num_processes_beam_search, language_model_path=args.language_model_path,
ext_scoring_func=ext_scorer, ) num_processes=args.num_processes_beam_search)
for i, beam_search_result in enumerate(beam_search_results): wer_sum, num_ins = 0.0, 0
target_transcription = ''.join([ for target, result in zip(target_transcripts, result_transcripts):
data_generator.vocab_list[index] for index in infer_data[i][1] wer_sum += wer(target, result)
]) num_ins += 1
wer_sum += wer(target_transcription, beam_search_result[0][1])
wer_counter += 1
print("alpha = %f\tbeta = %f\tWER = %f" % print("alpha = %f\tbeta = %f\tWER = %f" %
(alpha, beta, wer_sum / wer_counter)) (alpha, beta, wer_sum / num_ins))
def main(): def main():
paddle.init(use_gpu=args.use_gpu, trainer_count=1) utils.print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)
tune() tune()
......