Commit d74f4ff3 authored by lfchener

update deepspeech to fluid api

Parent d2bdd254
# DeepSpeech2 on PaddlePaddle
*DeepSpeech2 on PaddlePaddle* is an open-source implementation of an end-to-end Automatic Speech Recognition (ASR) engine, based on [Baidu's Deep Speech 2 paper](http://proceedings.mlr.press/v48/amodei16.pdf) and built on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform. Our vision is to empower both industrial applications and academic research on speech recognition via an easy-to-use, efficient and scalable implementation, including training, inference & testing modules, distributed [PaddleCloud](https://github.com/PaddlePaddle/cloud) training, and demo deployment. Besides, several pre-trained models for both English and Mandarin are also released.
*DeepSpeech2 on PaddlePaddle* is an open-source implementation of an end-to-end Automatic Speech Recognition (ASR) engine, based on [Baidu's Deep Speech 2 paper](http://proceedings.mlr.press/v48/amodei16.pdf) and built on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform. Our vision is to empower both industrial applications and academic research on speech recognition via an easy-to-use, efficient and scalable implementation, including training, inference & testing modules, and demo deployment. Besides, several pre-trained models for both English and Mandarin are also released.
## Table of Contents
- [Installation](#installation)
......@@ -10,7 +10,6 @@
- [Data Augmentation Pipeline](#data-augmentation-pipeline)
- [Inference and Evaluation](#inference-and-evaluation)
- [Running in Docker Container](#running-in-docker-container)
- [Distributed Cloud Training](#distributed-cloud-training)
- [Hyper-parameters Tuning](#hyper-parameters-tuning)
- [Training for Mandarin Language](#training-for-mandarin-language)
- [Trying Live Demo with Your Own Voice](#trying-live-demo-with-your-own-voice)
......@@ -22,13 +21,45 @@
## Installation
Since this project was developed with the PaddlePaddle V2 API, which is no longer officially maintained, we only support [running it in Docker container](#running-in-docker-container) rather than building the environment from source code. We are going to release an update to the latest Paddle Fluid API very soon; please keep an eye on this project.
To avoid the trouble of environment setup, [running in Docker container](#running-in-docker-container) is highly recommended. Otherwise, follow the guidelines below to install the dependencies manually.
### Prerequisites
- Only Python 2.7 is supported
- PaddlePaddle the latest version (please refer to the [Installation Guide](https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/beginners_guide/install/index_en.html))
### Setup
- Make sure these libraries or tools are installed: `pkg-config`, `flac`, `ogg`, `vorbis`, `boost` and `swig`, e.g. by installing them via `apt-get`:
```bash
sudo apt-get install -y pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig
```
or by installing them via `yum`:
```bash
sudo yum install pkgconfig libogg-devel libvorbis-devel boost-devel
wget https://ftp.osuosl.org/pub/xiph/releases/flac/flac-1.3.1.tar.xz
xz -d flac-1.3.1.tar.xz
tar -xvf flac-1.3.1.tar
cd flac-1.3.1
./configure
make
make install
```
- Run the setup script for the remaining dependencies
```bash
git clone https://github.com/PaddlePaddle/DeepSpeech.git
cd DeepSpeech
sh setup.sh
```
## Getting Started
Several shell scripts provided in `./examples` will help you quickly try out most major modules, including data preparation, model training, case inference and model evaluation, with a few public datasets (e.g. [LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)). Reading these examples will also help you understand how to make the modules work with your own data.
Some of the scripts in `./examples` are configured with 8 GPUs. If you don't have 8 GPUs available, please modify `CUDA_VISIBLE_DEVICES` and `--trainer_count`. If you don't have any GPU available, please set `--use_gpu` to False to use CPUs instead. Besides, if an out-of-memory problem occurs, just reduce `--batch_size` to fit.
Some of the scripts in `./examples` are configured with 8 GPUs. If you don't have 8 GPUs available, please modify `CUDA_VISIBLE_DEVICES`. If you don't have any GPU available, please set `--use_gpu` to False to use CPUs instead. Besides, if an out-of-memory problem occurs, just reduce `--batch_size` to fit.
Let's take a tiny sampled subset of [LibriSpeech dataset](http://www.openslr.org/12/) for instance.
......@@ -45,7 +76,7 @@ Let's take a tiny sampled subset of [LibriSpeech dataset](http://www.openslr.org
sh run_data.sh
```
`run_data.sh` will download the dataset, generate manifests, collect the normalizer's statistics and build the vocabulary. Once the data preparation is done, you will find the data (only part of LibriSpeech) downloaded in `~/.cache/paddle/dataset/speech/libri` and the corresponding manifest files generated in `./data/tiny`, as well as a mean-stddev file and a vocabulary file. It has to be run only the very first time you use this dataset, and the results are reusable for all further experiments.
`run_data.sh` will download the dataset, generate manifests, collect the normalizer's statistics and build the vocabulary. Once the data preparation is done, you will find the data (only part of LibriSpeech) downloaded in `./dataset/librispeech` and the corresponding manifest files generated in `./data/tiny`, as well as a mean-stddev file and a vocabulary file. It has to be run only the very first time you use this dataset, and the results are reusable for all further experiments.
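Each generated manifest is a plain-text file of JSON lines, one utterance per line. A minimal sketch of reading one (field names as produced by the data preparation scripts; the path assumes the tiny example above):
```python
import io
import json

# Each manifest line is a JSON record describing one utterance.
with io.open('data/tiny/manifest.tiny', encoding='utf8') as f:
    for line in f:
        entry = json.loads(line)
        print('%s\t%s\t%s' % (entry['audio_filepath'], entry['duration'],
                              entry['text']))
```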
- Train your own ASR model
```bash
......@@ -139,20 +170,20 @@ python tools/build_vocab.py --help
- Start training from scratch with 8 GPUs:
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --trainer_count 8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py
```
- Start training from scratch with 16 CPUs:
- Start training from scratch with CPUs:
```
python train.py --use_gpu False --trainer_count 16
python train.py --use_gpu False
```
- Resume training from a checkpoint:
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python train.py \
--init_model_path CHECKPOINT_PATH_TO_RESUME_FROM
--init_from_pretrain_model CHECKPOINT_PATH_TO_RESUME_FROM
```
For more help on arguments:
......@@ -162,6 +193,7 @@ python train.py --help
```
or refer to `example/librispeech/run_train.sh`.
## Data Augmentation Pipeline
Data augmentation has often been a highly effective technique to boost deep learning performance. We augment our speech data by synthesizing new audio with small random perturbations (label-invariant transformations) applied to the raw audio. You don't have to do the synthesis yourself, as it is already embedded in the data provider and is done on the fly, randomly for each epoch during training.
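The pipeline is driven by a JSON config that lists the augmentors to apply. A minimal sketch of writing such a config, assuming the `{type, params, prob}` schema used by `conf/augmentation.config` (the speed-perturbation values here are illustrative):
```python
import json

# One augmentor: randomly resample speed within [0.95, 1.05]; `prob` is the
# per-sample probability of applying the perturbation during training.
config = [{
    "type": "speed",
    "params": {"min_speed_rate": 0.95, "max_speed_rate": 1.05},
    "prob": 0.6
}]
with open('conf/augmentation.config', 'w') as f:
    json.dump(config, f, indent=4)
```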
......@@ -206,8 +238,8 @@ A language model is required to improve the decoder's performance. We have prepa
```bash
cd models/lm
sh download_lm_en.sh
sh download_lm_ch.sh
bash download_lm_en.sh
bash download_lm_ch.sh
```
If you wish to train your own better language model, please refer to [KenLM](https://github.com/kpu/kenlm) for tutorials. Here we provide some tips to show how we prepared our English and Mandarin language models. You can take them as a reference when you train your own.
......@@ -216,7 +248,7 @@ If you wish to train your own better language model, please refer to [KenLM](htt
The English corpus is from the [Common Crawl Repository](http://commoncrawl.org) and you can download it from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our English language model. There are some preprocessing steps before training:
* Characters not in \[A-Za-z0-9\s'\] (\s represents whitespace characters) are removed and Arabic numbers are converted to English numbers like 1000 to one thousand.
* Characters not in \['A-Za-z0-9\s'\] (\s represents whitespace characters) are removed and Arabic numbers are converted to English numbers like 1000 to one thousand.
* Repeated whitespace characters are squeezed to one, and leading whitespace characters are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercase.
* The top 400,000 most frequent words are selected to build the vocabulary, and the rest are replaced with 'UNKNOWNWORD'.
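A short sketch of the character-level cleanup described above (the helper name is ours, and the number-to-words conversion is omitted since it requires a separate tool):
```python
import re

def normalize_line(line):
    """Lowercase, keep only letters, digits, whitespace and apostrophes,
    and squeeze repeated whitespace."""
    line = line.lower()
    line = re.sub(r"[^a-z0-9\s']", '', line)   # drop disallowed characters
    return re.sub(r'\s+', ' ', line).strip()   # squeeze repeated whitespace

print(normalize_line("Hello,  World! It's fine."))  # -> "hello world it's fine"
```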
......@@ -239,13 +271,13 @@ An inference module called `infer.py` is provided to infer, decode and visualize
- Inference with GPU:
```bash
CUDA_VISIBLE_DEVICES=0 python infer.py --trainer_count 1
CUDA_VISIBLE_DEVICES=0 python infer.py
```
- Inference with CPUs:
```bash
python infer.py --use_gpu False --trainer_count 12
python infer.py --use_gpu False
```
We provide two types of CTC decoders: *CTC greedy decoder* and *CTC beam search decoder*. The *CTC greedy decoder* is an implementation of the simple best-path decoding algorithm, selecting at each timestep the most likely token, thus being greedy and locally optimal. The [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873), in contrast, utilizes a heuristic breadth-first graph search to reach near-global optimality; it also requires a pre-trained KenLM language model for better scoring and ranking. The decoder type can be set with the argument `--decoding_method`.
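To illustrate the greedy strategy, here is a toy best-path decoder (a sketch, not the repo's implementation): pick the argmax token at each timestep, collapse consecutive repeats, then drop blanks.
```python
import numpy as np

def ctc_greedy_decode(probs_seq, vocab_list, blank_id):
    """probs_seq: array of shape (num_timesteps, num_classes)."""
    best_path = np.argmax(np.asarray(probs_seq), axis=1)
    result, prev = [], None
    for idx in best_path:
        if idx != prev and idx != blank_id:  # collapse repeats, drop blanks
            result.append(vocab_list[idx])
        prev = idx
    return ''.join(result)

# 'a', 'a', blank, 'b'  ->  "ab"
probs = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.1, 0.8, 0.1]]
print(ctc_greedy_decode(probs, ['a', 'b'], blank_id=2))
```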
......@@ -264,13 +296,13 @@ To evaluate a model's performance quantitatively, please run:
- Evaluation with GPUs:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python test.py --trainer_count 8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python test.py
```
- Evaluation with CPUs:
```bash
python test.py --use_gpu False --trainer_count 12
python test.py --use_gpu False
```
The error rate (default: word error rate; can be set with `--error_rate_type`) will be printed.
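For reference, the word error rate is the word-level Levenshtein distance between hypothesis and reference, normalized by the reference length; a toy sketch (not the repo's implementation):
```python
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return float(dp[-1][-1]) / len(ref)

print(wer("i like deep speech", "i like speech"))  # 1 edit / 4 words = 0.25
```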
......@@ -293,7 +325,6 @@ The hyper-parameters $\alpha$ (language model weight) and $\beta$ (word insertio
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python tools/tune.py \
--trainer_count 8 \
--alpha_from 1.0 \
--alpha_to 3.2 \
--num_alphas 45 \
......@@ -332,7 +363,7 @@ Take several steps to launch the Docker image:
- Download the Docker image
```bash
nvidia-docker pull paddlepaddle/deep_speech:latest-gpu
nvidia-docker pull hub.baidubce.com/paddlepaddle/deep_speech_fluid:latest-gpu
```
- Clone this repository
......@@ -344,72 +375,10 @@ git clone https://github.com/PaddlePaddle/DeepSpeech.git
- Run the Docker image
```bash
sudo nvidia-docker run -it -v $(pwd)/DeepSpeech:/DeepSpeech paddlepaddle/deep_speech:latest-gpu /bin/bash
sudo nvidia-docker run -it -v $(pwd)/DeepSpeech:/DeepSpeech hub.baidubce.com/paddlepaddle/deep_speech_fluid:latest-gpu /bin/bash
```
Now go back and start from the [Getting Started](#getting-started) section; you can execute training, inference and hyper-parameters tuning in the same way inside the Docker container.
## Distributed Cloud Training
We also provide a cloud training module for users to do distributed cluster training on [PaddleCloud](https://github.com/PaddlePaddle/cloud), to achieve a much faster training speed with multiple machines. To get started, please first install the PaddleCloud client and register a PaddleCloud account, as described in [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud).
Please take the following steps to submit a training job:
- Go to directory:
```bash
cd cloud
```
- Upload data:
Data must be uploaded to PaddleCloud filesystem to be accessed within a cloud job. `pcloud_upload_data.sh` helps do the data packing and uploading:
```bash
sh pcloud_upload_data.sh
```
Given input manifests, `pcloud_upload_data.sh` will:
- Extract the audio files listed in the input manifests.
- Pack them into a specified number of tar files.
- Upload these tar files to PaddleCloud filesystem.
- Create cloud manifests by replacing local filesystem paths with PaddleCloud filesystem paths. New manifests will be used to inform the cloud jobs of audio files' location and their meta information.
It should be done only once, the very first time you do cloud training. Later, the data is kept persistent on the cloud filesystem and is reusable for further job submissions.
For argument details please refer to [Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/cloud).
- Configure training arguments:
Configure the cloud job parameters in `pcloud_submit.sh` (e.g. `NUM_NODES`, `NUM_GPUS`, `CLOUD_TRAIN_DIR`, `JOB_NAME` etc.) and then configure other hyper-parameters for training in `pcloud_train.sh` (just as you do for local training).
For argument details please refer to [Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/cloud).
- Submit the job:
By running:
```bash
sh pcloud_submit.sh
```
a training job will be submitted to PaddleCloud, with the job name printed to the console.
- Get training logs
Run this to list all the jobs you have submitted, as well as their running status:
```bash
paddlecloud get jobs
```
Run this to print the corresponding job's logs:
```bash
paddlecloud logs -n 10000 $REPLACED_WITH_YOUR_ACTUAL_JOB_NAME
```
For more information about the usage of PaddleCloud, please refer to [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务).
For more information about the DeepSpeech2 training on PaddleCloud, please refer to
[Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/cloud).
## Training for Mandarin Language
......@@ -417,14 +386,13 @@ The key steps of training for Mandarin language are the same as those of English lang
## Trying Live Demo with Your Own Voice
Until now, an ASR model has been trained and tested qualitatively (`infer.py`) and quantitatively (`test.py`) with existing audio files, but not yet with your own speech. `deploy/demo_server.py` and `deploy/demo_client.py` help you quickly build a real-time demo ASR engine with the trained model, enabling you to test and play around with the demo using your own voice.
Until now, an ASR model has been trained and tested qualitatively (`infer.py`) and quantitatively (`test.py`) with existing audio files, but not yet with your own speech. `deploy/demo_english_server.py` and `deploy/demo_client.py` help you quickly build a real-time demo ASR engine with the trained model, enabling you to test and play around with the demo using your own voice.
To start the demo's server, please run this in one console:
```bash
CUDA_VISIBLE_DEVICES=0 \
python deploy/demo_server.py \
--trainer_count 1 \
--host_ip localhost \
--host_port 8086
```
......@@ -436,7 +404,7 @@ For example, on MAC OS X:
```bash
brew install portaudio
pip install pyaudio
pip install pynput
pip install keyboard
```
Then to start the client, please run this in another console:
......@@ -452,7 +420,7 @@ Now, in the client console, press the `whitespace` key, hold, and start speaking
Notice that `deploy/demo_client.py` must be run on a machine with a microphone device, while `deploy/demo_server.py` can run on one without any audio recording hardware, e.g. any remote server machine. Just be careful to set the `host_ip` and `host_port` arguments to an actually accessible IP address and port if the server and client are running on two separate machines; nothing needs to be done if they are running on a single machine.
Please also refer to `examples/mandarin/run_demo_server.sh`, which will first download a pre-trained Mandarin model (trained with 3000 hours of internal speech data) and then start the demo server with that model. By running `examples/mandarin/run_demo_client.sh`, you can speak Mandarin to test it. If you would like to try some other models, just update the `--model_path` argument in the script.
Please also refer to `examples/deploy_demo/run_english_demo_server.sh`, which will first download a pre-trained English model (trained with 3000 hours of internal speech data) and then start the demo server with that model. By running `examples/mandarin/run_demo_client.sh`, you can speak English to test it. If you would like to try some other models, just update the `--model_path` argument in the script.
For more help on arguments:
......@@ -467,10 +435,10 @@ python deploy/demo_client.py --help
Language | Model Name | Training Data | Hours of Speech
:-----------: | :------------: | :----------: | -------:
English | [LibriSpeech Model](https://deepspeech.bj.bcebos.com/eng_models/librispeech_model.tar.gz) | [LibriSpeech Dataset](http://www.openslr.org/12/) | 960 h
English | [BaiduEN8k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model.tar.gz) | Baidu Internal English Dataset | 8628 h
Mandarin | [Aishell Model](https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model.tar.gz) | [Aishell Dataset](http://www.openslr.org/33/) | 151 h
Mandarin | [BaiduCN1.2k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_cn1.2k_model.tar.gz) | Baidu Internal Mandarin Dataset | 1204 h
English | [LibriSpeech Model](https://deepspeech.bj.bcebos.com/eng_models/librispeech_model_fluid.tar.gz) | [LibriSpeech Dataset](http://www.openslr.org/12/) | 960 h
English | [BaiduEN8k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model_fluid.tar.gz) | Baidu Internal English Dataset | 8628 h
Mandarin | [Aishell Model](https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model_fluid.tar.gz) | [Aishell Dataset](http://www.openslr.org/33/) | 151 h
Mandarin | [BaiduCN1.2k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_cn1.2k_model_fluid.tar.gz) | Baidu Internal Mandarin Dataset | 1204 h
#### Language Model Released
......@@ -504,17 +472,16 @@ Baidu Internal Testset | 12.64
#### Acceleration with Multi-GPUs
We compare the training time with 1, 2, 4, 8 and 16 Tesla K40m GPUs (with a subset of LibriSpeech samples whose audio durations are between 6.0 and 7.0 seconds). The results show that a **near-linear** acceleration with multiple GPUs has been achieved. In the following figure, the time (in seconds) cost for training is printed on the blue bars.
We compare the training time with 1, 2, 4 and 8 Tesla V100 GPUs (with a subset of LibriSpeech samples whose audio durations are between 6.0 and 7.0 seconds). The results show that a **near-linear** acceleration with multiple GPUs has been achieved. In the following figure, the time (in seconds) cost for training is printed on the blue bars.
<img src="docs/images/multi_gpu_speedup.png" width=450><br/>
| # of GPU | Acceleration Rate |
| -------- | --------------: |
| 1 | 1.00 X |
| 2 | 1.97 X |
| 4 | 3.74 X |
| 8 | 6.21 X |
|16 | 10.70 X |
| 2 | 1.98 X |
| 4 | 3.73 X |
| 8 | 6.95 X |
`tools/profile.sh` provides such a profiling tool.
......
This diff is collapsed.
# Train DeepSpeech2 on PaddleCloud
>Note:
>Please make sure the [PaddleCloud Client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud) has been installed and the current directory is `deep_speech_2/cloud/`
## Step 1: Upload Data
Provided with several input manifests, `pcloud_upload_data.sh` will pack and upload all the audio files they contain to the PaddleCloud filesystem, and also generate corresponding manifest files with updated cloud paths.
Please modify the following arguments in `pcloud_upload_data.sh`:
- `IN_MANIFESTS`: Paths (in local filesystem) of manifest files containing the audio files to be uploaded. Multiple paths can be concatenated with a whitespace delimiter.
- `OUT_MANIFESTS`: Paths (in local filesystem) to write the updated output manifest files to. Multiple paths can be concatenated with a whitespace delimiter. The values of `audio_filepath` in the output manifests are updated with cloud filesystem paths.
- `CLOUD_DATA_DIR`: Directory (in PaddleCloud filesystem) to upload the data to. Don't forget to replace `USERNAME` in the default directory and make sure that you have the permission to write it.
- `NUM_SHARDS`: Number of data shards / parts (in tar files) to be generated when packing and uploading data. A smaller `NUM_SHARDS` requires more temporary local disk space for packing data.
By running:
```
sh pcloud_upload_data.sh
```
all the audio files will be uploaded to the PaddleCloud filesystem, and you will get the modified manifest files in `OUT_MANIFESTS`.
You have to take this step only once, the very first time you do cloud training. Later on, the data is persistent on the cloud filesystem and reusable for further job submissions.
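For reference, the rewritten manifests point into the uploaded tar shards with a `tar:<tar_path>#<member>` scheme. A sketch of one output line (the duration, text and member name are illustrative; the shard naming follows the upload tool's `part-XXXXX-of-XXXXX.tar` pattern):
```python
import json

# Illustrative cloud manifest entry after pcloud_upload_data.sh has run.
entry = {
    "audio_filepath": "tar:/pfs/dlnel/home/USERNAME/deepspeech2/data/"
                      "librispeech/part-00000-of-00050.tar#1-1-0001.flac",
    "duration": 3.9,
    "text": "an example transcription",
}
print(json.dumps(entry))
```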
## Step 2: Configure Training
Configure cloud training arguments in `pcloud_submit.sh`, with the following arguments:
- `TRAIN_MANIFEST`: Manifest filepath (in local filesystem) for training. Notice that the `audio_filepath` entries should be in the cloud filesystem, like those generated by `pcloud_upload_data.sh`.
- `DEV_MANIFEST`: Manifest filepath (in local filesystem) for validation.
- `CLOUD_MODEL_DIR`: Directory (in PaddleCloud filesystem) to save the model parameters (checkpoints). Don't forget to replace `USERNAME` in the default directory and make sure that you have the permission to write it.
- `BATCH_SIZE`: Training batch size for a single node.
- `NUM_GPU`: Number of GPUs allocated for a single node.
- `NUM_NODE`: Number of nodes (machines) allocated for this job.
- `IS_LOCAL`: Set to False to enable parameter server, if using multiple nodes.
Configure other training hyper-parameters in `pcloud_train.sh` as you wish, just as you would for local training.
By running:
```
sh pcloud_submit.sh
```
you submit a training job to PaddleCloud, and you will see the job name when the submission is done.
## Step 3: Get Job Logs
Run this to list all the jobs you have submitted, as well as their running status:
```
paddlecloud get jobs
```
Run this to print the corresponding job's logs:
```
paddlecloud logs -n 10000 $REPLACED_WITH_YOUR_ACTUAL_JOB_NAME
```
## More Help
For more information about the usage of PaddleCloud, please refer to [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务).
"""Set up paths for DS2"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os.path
import sys
def add_path(path):
if path not in sys.path:
sys.path.insert(0, path)
this_dir = os.path.dirname(__file__)
proj_path = os.path.join(this_dir, '..')
add_path(proj_path)
#! /usr/bin/env bash
TRAIN_MANIFEST="cloud/cloud_manifests/cloud.manifest.train"
DEV_MANIFEST="cloud/cloud_manifests/cloud.manifest.dev"
CLOUD_MODEL_DIR="./checkpoints"
BATCH_SIZE=512
NUM_GPU=8
NUM_NODE=1
IS_LOCAL="True"
JOB_NAME=deepspeech-`date +%Y%m%d%H%M%S`
DS2_PATH=${PWD%/*}
cp -f pcloud_train.sh ${DS2_PATH}
paddlecloud submit \
-image bootstrapper:5000/paddlepaddle/pcloud_ds2:latest \
-jobname ${JOB_NAME} \
-cpu ${NUM_GPU} \
-gpu ${NUM_GPU} \
-memory 64Gi \
-parallelism ${NUM_NODE} \
-pscpu 1 \
-pservers 1 \
-psmemory 64Gi \
-passes 1 \
-entry "sh pcloud_train.sh ${TRAIN_MANIFEST} ${DEV_MANIFEST} ${CLOUD_MODEL_DIR} ${NUM_GPU} ${BATCH_SIZE} ${IS_LOCAL}" \
${DS2_PATH}
rm ${DS2_PATH}/pcloud_train.sh
#! /usr/bin/env bash
TRAIN_MANIFEST=$1
DEV_MANIFEST=$2
MODEL_PATH=$3
NUM_GPU=$4
BATCH_SIZE=$5
IS_LOCAL=$6
python ./cloud/split_data.py \
--in_manifest_path=${TRAIN_MANIFEST} \
--out_manifest_path='/local.manifest.train'
python ./cloud/split_data.py \
--in_manifest_path=${DEV_MANIFEST} \
--out_manifest_path='/local.manifest.dev'
mkdir ./logs
python -u train.py \
--batch_size=${BATCH_SIZE} \
--trainer_count=${NUM_GPU} \
--num_passes=200 \
--num_proc_data=${NUM_GPU} \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
--num_iter_print=100 \
--learning_rate=5e-4 \
--max_duration=27.0 \
--min_duration=0.0 \
--use_sortagrad=True \
--use_gru=False \
--use_gpu=True \
--is_local=${IS_LOCAL} \
--share_rnn_weights=True \
--train_manifest='/local.manifest.train' \
--dev_manifest='/local.manifest.dev' \
--mean_std_path='data/librispeech/mean_std.npz' \
--vocab_path='data/librispeech/vocab.txt' \
--output_model_dir='./checkpoints' \
--output_model_dir=${MODEL_PATH} \
--augment_conf_path='conf/augmentation.config' \
--specgram_type='linear' \
--shuffle_method='batch_shuffle_clipped' \
2>&1 | tee ./logs/train.log
#! /usr/bin/env bash
mkdir cloud_manifests
IN_MANIFESTS="../data/librispeech/manifest.train ../data/librispeech/manifest.dev-clean ../data/librispeech/manifest.test-clean"
OUT_MANIFESTS="cloud_manifests/cloud.manifest.train cloud_manifests/cloud.manifest.dev cloud_manifests/cloud.manifest.test"
CLOUD_DATA_DIR="/pfs/dlnel/home/USERNAME/deepspeech2/data/librispeech"
NUM_SHARDS=50
python upload_data.py \
--in_manifest_paths ${IN_MANIFESTS} \
--out_manifest_paths ${OUT_MANIFESTS} \
--cloud_data_dir ${CLOUD_DATA_DIR} \
--num_shards ${NUM_SHARDS}
if [ $? -ne 0 ]
then
echo "Upload Data Failed!"
exit 1
fi
echo "All Done."
"""This tool is used for splitting data into each node of
paddlecloud. This script should be called in paddlecloud.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import json
import argparse
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--in_manifest_path",
type=str,
required=True,
help="Input manifest path for all nodes.")
parser.add_argument(
"--out_manifest_path",
type=str,
required=True,
help="Output manifest file path for current node.")
args = parser.parse_args()
def split_data(in_manifest_path, out_manifest_path):
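    # PaddleCloud writes this node's id and the total node count to the
    # files /trainer_id and /trainer_count; each node then keeps every
    # trainer_count-th manifest line (round-robin sharding by line index).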
with open("/trainer_id", "r") as f:
trainer_id = int(f.readline()[:-1])
with open("/trainer_count", "r") as f:
trainer_count = int(f.readline()[:-1])
out_manifest = []
for index, json_line in enumerate(open(in_manifest_path, 'r')):
if (index % trainer_count) == trainer_id:
out_manifest.append("%s\n" % json_line.strip())
with open(out_manifest_path, 'w') as f:
f.writelines(out_manifest)
if __name__ == '__main__':
split_data(args.in_manifest_path, args.out_manifest_path)
"""This script is for uploading data for DeepSpeech2 training on paddlecloud.
Steps:
1. Read original manifests and extract local sound files.
2. Tar all local sound files into multiple tar files and upload them.
3. Modify original manifests with updated paths in cloud filesystem.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import json
import os
import tarfile
import sys
import argparse
import shutil
from subprocess import call
import _init_paths
from data_utils.utils import read_manifest
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--in_manifest_paths",
default=[
"../datasets/manifest.train", "../datasets/manifest.dev",
"../datasets/manifest.test"
],
type=str,
nargs='+',
help="Local filepaths of input manifests to load, pack and upload."
"(default: %(default)s)")
parser.add_argument(
"--out_manifest_paths",
default=[
"./cloud.manifest.train", "./cloud.manifest.dev",
"./cloud.manifest.test"
],
type=str,
nargs='+',
help="Local filepaths of modified manifests to write to. "
"(default: %(default)s)")
parser.add_argument(
"--cloud_data_dir",
required=True,
type=str,
help="Destination directory on paddlecloud to upload data to.")
parser.add_argument(
"--num_shards",
default=10,
type=int,
help="Number of parts to split data to. (default: %(default)s)")
parser.add_argument(
"--local_tmp_dir",
default="./tmp/",
type=str,
help="Local directory for storing temporary data. (default: %(default)s)")
args = parser.parse_args()
def upload_data(in_manifest_path_list, out_manifest_path_list, local_tmp_dir,
upload_tar_dir, num_shards):
"""Extract and pack sound files listed in the manifest files into multple
tar files and upload them to padldecloud. Besides, generate new manifest
files with updated paths in paddlecloud.
"""
# compute total audio number
total_line = 0
for manifest_path in in_manifest_path_list:
with open(manifest_path, 'r') as f:
total_line += len(f.readlines())
line_per_tar = (total_line // num_shards) + 1
# pack and upload shard by shard
line_count, tar_file = 0, None
for manifest_path, out_manifest_path in zip(in_manifest_path_list,
out_manifest_path_list):
manifest = read_manifest(manifest_path)
out_manifest = []
for json_data in manifest:
sound_filepath = json_data['audio_filepath']
sound_filename = os.path.basename(sound_filepath)
if line_count % line_per_tar == 0:
                if tar_file is not None:
tar_file.close()
pcloud_cp(tar_path, upload_tar_dir)
os.remove(tar_path)
tar_name = 'part-%s-of-%s.tar' % (
str(line_count // line_per_tar).zfill(5),
str(num_shards).zfill(5))
tar_path = os.path.join(local_tmp_dir, tar_name)
tar_file = tarfile.open(tar_path, 'w')
tar_file.add(sound_filepath, arcname=sound_filename)
line_count += 1
json_data['audio_filepath'] = "tar:%s#%s" % (
os.path.join(upload_tar_dir, tar_name), sound_filename)
out_manifest.append("%s\n" % json.dumps(json_data))
with open(out_manifest_path, 'w') as f:
f.writelines(out_manifest)
pcloud_cp(out_manifest_path, upload_tar_dir)
tar_file.close()
pcloud_cp(tar_path, upload_tar_dir)
os.remove(tar_path)
def pcloud_mkdir(dir):
"""Make directory in PaddleCloud filesystem.
"""
if call(['paddlecloud', 'mkdir', dir]) != 0:
raise IOError("PaddleCloud mkdir failed: %s." % dir)
def pcloud_cp(src, dst):
"""Copy src from local filesytem to dst in PaddleCloud filesystem,
or downlowd src from PaddleCloud filesystem to dst in local filesystem.
"""
if call(['paddlecloud', 'cp', src, dst]) != 0:
raise IOError("PaddleCloud cp failed: from [%s] to [%s]." % (src, dst))
if __name__ == '__main__':
if not os.path.exists(args.local_tmp_dir):
os.makedirs(args.local_tmp_dir)
pcloud_mkdir(args.cloud_data_dir)
upload_data(args.in_manifest_paths, args.out_manifest_paths,
args.local_tmp_dir, args.cloud_data_dir, args.num_shards)
shutil.rmtree(args.local_tmp_dir)
......@@ -16,6 +16,7 @@ import argparse
import soundfile
import json
import codecs
import io
from data_utils.utility import download, unpack
URL_ROOT = "http://www.openslr.org/resources/12"
......@@ -68,12 +69,11 @@ def create_manifest(data_dir, manifest_path):
filename for filename in filelist if filename.endswith('trans.txt')
]
if len(text_filelist) > 0:
text_filepath = os.path.join(data_dir, subfolder, text_filelist[0])
for line in open(text_filepath):
text_filepath = os.path.join(subfolder, text_filelist[0])
for line in io.open(text_filepath, encoding="utf8"):
segments = line.strip().split()
text = ' '.join(segments[1:]).lower()
audio_filepath = os.path.join(data_dir, subfolder,
segments[0] + '.flac')
audio_filepath = os.path.join(subfolder, segments[0] + '.flac')
audio_data, samplerate = soundfile.read(audio_filepath)
duration = float(len(audio_data)) / samplerate
json_lines.append(
......
......@@ -16,6 +16,7 @@ import zipfile
import argparse
import soundfile
import json
import io
from paddle.v2.dataset.common import md5file
DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
......@@ -88,7 +89,7 @@ def create_manifest(data_dir, manifest_path):
'duration': duration,
'text': ''
}))
with open(manifest_path, 'w') as out_file:
with io.open(manifest_path, mode='w', encoding='utf8') as out_file:
for line in json_lines:
out_file.write(line + '\n')
......
......@@ -3,7 +3,7 @@
# download data, generate manifests
PYTHONPATH=../../:$PYTHONPATH python voxforge.py \
--manifest_prefix='./manifest' \
--target_dir='~/.cache/paddle/dataset/speech/VoxForge' \
--target_dir='./dataset/VoxForge' \
--is_merge_dialect=True \
--dialects 'american' 'british' 'australian' 'european' 'irish' 'canadian' 'indian'
......
......@@ -18,7 +18,7 @@ import shutil
import subprocess
from data_utils.utility import download_multi, unpack, getfile_insensitive
DATA_HOME = '~/.cache/paddle/dataset/speech'
DATA_HOME = './dataset'
DATA_URL = 'http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/' \
'Audio/Main/16kHz_16bit'
......
......@@ -12,6 +12,7 @@ import resampy
from scipy import signal
import random
import copy
import io
class AudioSegment(object):
......@@ -154,7 +155,7 @@ class AudioSegment(object):
fileno = int(matches.group(2))
# read headers
f = open(filename, 'rb')
        f = io.open(filename, mode='rb')  # binary mode takes no encoding argument
version = f.read(4)
num_utterances = struct.unpack("i", f.read(4))[0]
bytes_per_header = struct.unpack("i", f.read(4))[0]
......
......@@ -9,10 +9,9 @@ import random
import tarfile
import multiprocessing
import numpy as np
import paddle.v2 as paddle
import paddle.fluid as fluid
from threading import local
from data_utils.utility import read_manifest
from data_utils.utility import xmap_readers_mp
from data_utils.augmentor.augmentation import AugmentationPipeline
from data_utils.featurizer.speech_featurizer import SpeechFeaturizer
from data_utils.speech import SpeechSegment
......@@ -51,14 +50,17 @@ class DataGenerator(object):
:param use_dB_normalization: Whether to normalize the audio to -20 dB
before extracting the features.
:type use_dB_normalization: bool
:param num_threads: Number of CPU threads for processing data.
:type num_threads: int
:param random_seed: Random seed.
:type random_seed: int
:param keep_transcription_text: If set to True, transcription text will
be passed forward directly without
converting to index sequence.
:type keep_transcription_text: bool
:param place: The place to run the program.
:type place: CPU or GPU
:param is_training: If set to True, generate text data for training,
                        otherwise, generate text data for inference.
:type is_training: bool
"""
def __init__(self,
......@@ -72,9 +74,10 @@ class DataGenerator(object):
max_freq=None,
specgram_type='linear',
use_dB_normalization=True,
num_threads=multiprocessing.cpu_count() // 2,
random_seed=0,
keep_transcription_text=False):
keep_transcription_text=False,
place=fluid.CPUPlace(),
is_training=True):
self._max_duration = max_duration
self._min_duration = min_duration
self._normalizer = FeatureNormalizer(mean_std_filepath)
......@@ -87,14 +90,15 @@ class DataGenerator(object):
window_ms=window_ms,
max_freq=max_freq,
use_dB_normalization=use_dB_normalization)
self._num_threads = num_threads
self._rng = random.Random(random_seed)
self._keep_transcription_text = keep_transcription_text
self._epoch = 0
self._is_training = is_training
# for caching tar files info
self._local_data = local()
self._local_data.tar2info = {}
self._local_data.tar2object = {}
self._place = place
def process_utterance(self, audio_file, transcript):
"""Load, augment, featurize and normalize for speech data.
......@@ -121,7 +125,6 @@ class DataGenerator(object):
def batch_reader_creator(self,
manifest_path,
batch_size,
min_batch_size=1,
padding_to=-1,
flatten=False,
sortagrad=False,
......@@ -137,9 +140,6 @@ class DataGenerator(object):
:type manifest_path: basestring
:param batch_size: Number of instances in a batch.
:type batch_size: int
:param min_batch_size: Any batch with batch size smaller than this will
be discarded. (To be deprecated in the future.)
:type min_batch_size: int
        :param padding_to: If set to -1, the maximum shape in the batch
will be used as the target shape for padding.
Otherwise, `padding_to` will be the target shape.
......@@ -178,6 +178,7 @@ class DataGenerator(object):
# sort (by duration) or batch-wise shuffle the manifest
if self._epoch == 0 and sortagrad:
manifest.sort(key=lambda x: x["duration"])
else:
if shuffle_method == "batch_shuffle":
manifest = self._batch_shuffle(
......@@ -193,18 +194,16 @@ class DataGenerator(object):
raise ValueError("Unknown shuffle method %s." %
shuffle_method)
# prepare batches
instance_reader, cleanup = self._instance_reader_creator(manifest)
batch = []
try:
for instance in instance_reader():
batch.append(instance)
if len(batch) == batch_size:
yield self._padding_batch(batch, padding_to, flatten)
batch = []
if len(batch) >= min_batch_size:
instance_reader = self._instance_reader_creator(manifest)
for instance in instance_reader():
batch.append(instance)
if len(batch) == batch_size:
yield self._padding_batch(batch, padding_to, flatten)
finally:
cleanup()
batch = []
if len(batch) >= 1:
yield self._padding_batch(batch, padding_to, flatten)
self._epoch += 1
return batch_reader
......@@ -276,13 +275,11 @@ class DataGenerator(object):
def reader():
for instance in manifest:
yield instance
inst = self.process_utterance(instance["audio_filepath"],
instance["text"]),
yield inst[0]
reader, cleanup_callback = xmap_readers_mp(
lambda instance: self.process_utterance(instance["audio_filepath"], instance["text"]),
reader, self._num_threads, 4096)
return reader, cleanup_callback
return reader
def _padding_batch(self, batch, padding_to=-1, flatten=False):
"""
......@@ -304,14 +301,43 @@ class DataGenerator(object):
"than any instance's shape in the batch")
max_length = padding_to
# padding
padded_audios = []
texts, text_lens = [], []
audio_lens = []
masks = []
for audio, text in batch:
padded_audio = np.zeros([audio.shape[0], max_length])
padded_audio[:, :audio.shape[1]] = audio
if flatten:
padded_audio = padded_audio.flatten()
padded_instance = [padded_audio, text, audio.shape[1]]
new_batch.append(padded_instance)
return new_batch
padded_audios.append(padded_audio)
if self._is_training:
texts += text
else:
texts.append(text)
text_lens.append(len(text))
audio_lens.append(audio.shape[1])
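            # Build a 0/1 mask matching the conv front-end's output shape:
            # the frequency dim is downsampled as (freq - 1) // 2 + 1, the
            # time dim as (time - 1) // 3 + 1, tiled over 32 channels; time
            # steps beyond the true audio length are zeroed.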
mask_shape0 = (audio.shape[0] - 1) // 2 + 1
mask_shape1 = (audio.shape[1] - 1) // 3 + 1
mask_max_len = (max_length - 1) // 3 + 1
mask_ones = np.ones((mask_shape0, mask_shape1))
mask_zeros = np.zeros((mask_shape0, mask_max_len - mask_shape1))
mask = np.repeat(
np.reshape(
np.concatenate((mask_ones, mask_zeros), axis=1),
(1, mask_shape0, mask_max_len)),
32,
axis=0)
masks.append(mask)
padded_audios = np.array(padded_audios).astype('float32')
if self._is_training:
texts = fluid.create_lod_tensor(
np.array(texts).astype('int32'),
recursive_seq_lens=[text_lens],
place=self._place)
audio_lens = np.array(audio_lens).astype('int64').reshape([-1, 1])
masks = np.array(masks).astype('float32')
return padded_audios, texts, audio_lens, masks
def _batch_shuffle(self, manifest, batch_size, clipped=False):
"""Put similarly-sized instances into minibatches for better efficiency
......
......@@ -11,7 +11,7 @@ import time
from Queue import Queue
from threading import Thread
from multiprocessing import Process, Manager, Value
from paddle.v2.dataset.common import md5file
from paddle.dataset.common import md5file
def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0):
......@@ -88,127 +88,3 @@ def unpack(filepath, target_dir, rm_tar=False):
class XmapEndSignal():
pass
def xmap_readers_mp(mapper, reader, process_num, buffer_size, order=False):
"""A multiprocessing pipeline wrapper for the data reader.
:param mapper: Function to map sample.
:type mapper: callable
:param reader: Given data reader.
:type reader: callable
:param process_num: Number of processes in the pipeline
:type process_num: int
:param buffer_size: Maximal buffer size.
:type buffer_size: int
    :return: The wrapped reader and cleanup callback
:rtype: tuple
"""
end_flag = XmapEndSignal()
read_workers = []
handle_workers = []
flush_workers = []
read_exit_flag = Value('i', 0)
handle_exit_flag = Value('i', 0)
flush_exit_flag = Value('i', 0)
# define a worker to read samples from reader to in_queue with order flag
def order_read_worker(reader, in_queue):
for order_id, sample in enumerate(reader()):
if read_exit_flag.value == 1: break
in_queue.put((order_id, sample))
in_queue.put(end_flag)
# the reading worker should not exit until all handling work exited
while handle_exit_flag.value == 0 or read_exit_flag.value == 0:
time.sleep(0.001)
# define a worker to handle samples from in_queue by mapper and put results
# to out_queue with order
def order_handle_worker(in_queue, out_queue, mapper, out_order):
ins = in_queue.get()
while not isinstance(ins, XmapEndSignal):
if handle_exit_flag.value == 1: break
order_id, sample = ins
result = mapper(sample)
while order_id != out_order[0]:
time.sleep(0.001)
out_queue.put(result)
out_order[0] += 1
ins = in_queue.get()
in_queue.put(end_flag)
out_queue.put(end_flag)
# wait for exit of flushing worker
while flush_exit_flag.value == 0 or handle_exit_flag.value == 0:
time.sleep(0.001)
read_exit_flag.value = 1
handle_exit_flag.value = 1
# define a thread worker to flush samples from Manager.Queue to Queue
# for acceleration
def flush_worker(in_queue, out_queue):
finish = 0
while finish < process_num and flush_exit_flag.value == 0:
sample = in_queue.get()
if isinstance(sample, XmapEndSignal):
finish += 1
else:
out_queue.put(sample)
out_queue.put(end_flag)
handle_exit_flag.value = 1
flush_exit_flag.value = 1
def cleanup():
# first exit flushing workers
flush_exit_flag.value = 1
for w in flush_workers:
w.join()
# next exit handling workers
handle_exit_flag.value = 1
for w in handle_workers:
w.join()
# last exit reading workers
read_exit_flag.value = 1
for w in read_workers:
w.join()
def xreader():
# prepare shared memory
manager = Manager()
in_queue = manager.Queue(buffer_size)
out_queue = manager.Queue(buffer_size)
out_order = manager.list([0])
# start a read worker in a process
target = order_read_worker
p = Process(target=target, args=(reader, in_queue))
p.daemon = True
p.start()
read_workers.append(p)
# start handle_workers with multiple processes
target = order_handle_worker
args = (in_queue, out_queue, mapper, out_order)
workers = [
Process(target=target, args=args) for _ in xrange(process_num)
]
for w in workers:
w.daemon = True
w.start()
handle_workers.append(w)
# start a thread to read data from slow Manager.Queue
flush_queue = Queue(buffer_size)
t = Thread(target=flush_worker, args=(out_queue, flush_queue))
t.daemon = True
t.start()
flush_workers.append(t)
# get results
sample = flush_queue.get()
while not isinstance(sample, XmapEndSignal):
yield sample
sample = flush_queue.get()
return xreader, cleanup
......@@ -102,7 +102,7 @@ def ctc_beam_search_decoder(probs_seq,
probs_b_prev, probs_nb_prev = {'\t': 1.0}, {'\t': 0.0}
## extend prefix in loop
for time_step in xrange(len(probs_seq)):
for time_step in range(len(probs_seq)):
# prefix_set_next: the set containing candidate prefixes
# probs_b_cur: prefixes' probability ending with blank in current step
# probs_nb_cur: prefixes' probability ending with non-blank in current step
......@@ -114,7 +114,7 @@ def ctc_beam_search_decoder(probs_seq,
if cutoff_prob < 1.0 or cutoff_top_n < cutoff_len:
prob_idx = sorted(prob_idx, key=lambda asd: asd[1], reverse=True)
cutoff_len, cum_prob = 0, 0.0
for i in xrange(len(prob_idx)):
for i in range(len(prob_idx)):
cum_prob += prob_idx[i][1]
cutoff_len += 1
if cum_prob >= cutoff_prob:
......@@ -127,7 +127,7 @@ def ctc_beam_search_decoder(probs_seq,
probs_b_cur[l], probs_nb_cur[l] = 0.0, 0.0
        # extend prefix by traversing prob_idx
for index in xrange(cutoff_len):
for index in range(cutoff_len):
c, prob_c = prob_idx[index][0], prob_idx[index][1]
if c == blank_id:
......
"""Client-end for the ASR demo."""
from pynput import keyboard
import keyboard
import struct
import socket
import sys
......@@ -23,22 +23,17 @@ is_recording = False
enable_trigger_record = True
def on_press(key):
"""On-press keyboard callback function."""
def on_press_release(x):
"""Keyboard callback function."""
global is_recording, enable_trigger_record
if key == keyboard.Key.space:
press = keyboard.KeyboardEvent('down', 28, 'space')
release = keyboard.KeyboardEvent('up', 28, 'space')
if x.event_type == 'down' and x.name == press.name:
if (not is_recording) and enable_trigger_record:
sys.stdout.write("Start Recording ... ")
sys.stdout.flush()
is_recording = True
def on_release(key):
"""On-release keyboard callback function."""
global is_recording, enable_trigger_record
if key == keyboard.Key.esc:
return False
elif key == keyboard.Key.space:
if x.event_type == 'up' and x.name == release.name:
if is_recording == True:
is_recording = False
......@@ -80,9 +75,10 @@ def main():
stream.start_stream()
# prepare keyboard listener
with keyboard.Listener(
on_press=on_press, on_release=on_release) as listener:
listener.join()
while (1):
keyboard.hook(on_press_release)
if keyboard.record('esc'):
break
# close up
stream.stop_stream()
......
......@@ -8,7 +8,8 @@ from time import gmtime, strftime
import SocketServer
import struct
import wave
import paddle.v2 as paddle
import paddle.fluid as fluid
import numpy as np
import _init_paths
from data_utils.data import DataGenerator
from model_utils.model import DeepSpeech2Model
......@@ -141,13 +142,19 @@ def warm_up_test(audio_process_handler,
def start_server():
"""Start the ASR server"""
# prepare data generator
if args.use_gpu:
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
data_generator = DataGenerator(
vocab_filepath=args.vocab_path,
mean_std_filepath=args.mean_std_path,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=1,
keep_transcription_text=True)
keep_transcription_text=True,
place = place,
is_training = False)
# prepare ASR model
ds2_model = DeepSpeech2Model(
vocab_size=data_generator.vocab_size,
......@@ -155,7 +162,8 @@ def start_server():
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
use_gru=args.use_gru,
pretrained_model_path=args.model_path,
init_from_pretrain_model=args.model_path,
place=place,
share_rnn_weights=args.share_rnn_weights)
vocab_list = [chars.encode("utf-8") for chars in data_generator.vocab_list]
......@@ -166,8 +174,24 @@ def start_server():
# prepare ASR inference handler
def file_to_transcript(filename):
feature = data_generator.process_utterance(filename, "")
audio_len = feature[0].shape[1]
mask_shape0 = (feature[0].shape[0] - 1) // 2 + 1
mask_shape1 = (feature[0].shape[1] - 1) // 3 + 1
mask_max_len = (audio_len - 1) // 3 + 1
mask_ones = np.ones((mask_shape0, mask_shape1))
mask_zeros = np.zeros((mask_shape0, mask_max_len - mask_shape1))
mask = np.repeat(
np.reshape(
np.concatenate((mask_ones, mask_zeros), axis=1),
(1, mask_shape0, mask_max_len)),
32,
axis=0)
feature = (np.array([feature[0]]).astype('float32'),
None,
np.array([audio_len]).astype('int64').reshape([-1,1]),
np.array([mask]).astype('float32'))
probs_split = ds2_model.infer_batch_probs(
infer_data=[feature],
infer_data=feature,
feeding_dict=data_generator.feeding)
if args.decoding_method == "ctc_greedy":
......@@ -207,7 +231,6 @@ def start_server():
def main():
print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=1)
start_server()
......
docs/images/multi_gpu_speedup.png (image updated: 153.1 KB → 206.5 KB)
......@@ -5,7 +5,7 @@ cd ../.. > /dev/null
# download data, generate manifests
PYTHONPATH=.:$PYTHONPATH python data/aishell/aishell.py \
--manifest_prefix='data/aishell/manifest' \
--target_dir='~/.cache/paddle/dataset/speech/Aishell'
--target_dir='./dataset/aishell'
if [ $? -ne 0 ]; then
echo "Prepare Aishell failed. Terminated."
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_ch.sh
bash download_lm_ch.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -15,7 +15,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=300 \
--num_proc_bsearch=8 \
--num_conv_layers=2 \
......@@ -31,7 +30,7 @@ python -u infer.py \
--infer_manifest='data/aishell/manifest.test' \
--mean_std_path='data/aishell/mean_std.npz' \
--vocab_path='data/aishell/vocab.txt' \
--model_path='checkpoints/aishell/params.latest.tar.gz' \
--model_path='checkpoints/aishell/step_final' \
--lang_model_path='models/lm/zh_giga.no_cna_cmn.prune01244.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='cer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_ch.sh
bash download_lm_ch.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/aishell > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -24,7 +24,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=300 \
--num_proc_bsearch=8 \
--num_conv_layers=2 \
......@@ -35,12 +34,12 @@ python -u infer.py \
--cutoff_prob=0.99 \
--cutoff_top_n=40 \
--use_gru=True \
--use_gpu=True \
--use_gpu=False \
--share_rnn_weights=False \
--infer_manifest='data/aishell/manifest.test' \
--mean_std_path='models/aishell/mean_std.npz' \
--vocab_path='models/aishell/vocab.txt' \
--model_path='models/aishell/params.tar.gz' \
--model_path='models/aishell' \
--lang_model_path='models/lm/zh_giga.no_cna_cmn.prune01244.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='cer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_ch.sh
bash download_lm_ch.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -15,10 +15,8 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u test.py \
--batch_size=128 \
--trainer_count=8 \
--beam_size=300 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=1024 \
......@@ -32,7 +30,7 @@ python -u test.py \
--test_manifest='data/aishell/manifest.test' \
--mean_std_path='data/aishell/mean_std.npz' \
--vocab_path='data/aishell/vocab.txt' \
--model_path='checkpoints/aishell/params.latest.tar.gz' \
--model_path='checkpoints/aishell/step_final' \
--lang_model_path='models/lm/zh_giga.no_cna_cmn.prune01244.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='cer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_ch.sh
bash download_lm_ch.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/aishell > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -24,10 +24,8 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u test.py \
--batch_size=128 \
--trainer_count=8 \
--beam_size=300 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=1024 \
......@@ -41,7 +39,7 @@ python -u test.py \
--test_manifest='data/aishell/manifest.test' \
--mean_std_path='models/aishell/mean_std.npz' \
--vocab_path='models/aishell/vocab.txt' \
--model_path='models/aishell/params.tar.gz' \
--model_path='models/aishell' \
--lang_model_path='models/lm/zh_giga.no_cna_cmn.prune01244.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='cer' \
......
......@@ -3,17 +3,18 @@
cd ../.. > /dev/null
# train model
# if you wish to resume from an existing model, uncomment --init_model_path
# if you wish to resume from an existing model, uncomment --init_from_pretrain_model
export FLAGS_sync_nccl_allreduce=0
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u train.py \
--batch_size=64 \
--trainer_count=8 \
--num_passes=50 \
--num_proc_data=16 \
--num_epoch=50 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=1024 \
--num_iter_print=100 \
--save_epoch=1 \
--num_samples=120000 \
--learning_rate=5e-4 \
--max_duration=27.0 \
--min_duration=0.0 \
......@@ -30,7 +31,7 @@ python -u train.py \
--output_model_dir='./checkpoints/aishell' \
--augment_conf_path='conf/augmentation.config' \
--specgram_type='linear' \
--shuffle_method='batch_shuffle_clipped'
--shuffle_method='batch_shuffle_clipped' \
if [ $? -ne 0 ]; then
echo "Failed in training!"
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/baidu_en8k > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -24,7 +24,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=500 \
--num_proc_bsearch=5 \
--num_conv_layers=2 \
......@@ -35,12 +34,12 @@ python -u infer.py \
--cutoff_prob=1.0 \
--cutoff_top_n=40 \
--use_gru=True \
--use_gpu=True \
--use_gpu=False \
--share_rnn_weights=False \
--infer_manifest='data/librispeech/manifest.test-clean' \
--mean_std_path='models/baidu_en8k/mean_std.npz' \
--vocab_path='models/baidu_en8k/vocab.txt' \
--model_path='models/baidu_en8k/params.tar.gz' \
--model_path='models/baidu_en8k' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/baidu_en8k > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -24,7 +24,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -u test.py \
--batch_size=128 \
--trainer_count=4 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
......@@ -36,12 +35,12 @@ python -u test.py \
--cutoff_prob=1.0 \
--cutoff_top_n=40 \
--use_gru=True \
--use_gpu=True \
--use_gpu=False \
--share_rnn_weights=False \
--test_manifest='data/librispeech/manifest.test-clean' \
--mean_std_path='models/baidu_en8k/mean_std.npz' \
--vocab_path='models/baidu_en8k/vocab.txt' \
--model_path='models/baidu_en8k/params.tar.gz' \
--model_path='models/baidu_en8k' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -5,7 +5,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -14,7 +14,7 @@ cd - > /dev/null
# download well-trained model
cd models/baidu_en8k > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -40,7 +40,7 @@ python -u deploy/demo_server.py \
--warmup_manifest='data/tiny/manifest.test-clean' \
--mean_std_path='models/baidu_en8k/mean_std.npz' \
--vocab_path='models/baidu_en8k/vocab.txt' \
--model_path='models/baidu_en8k/params.tar.gz' \
--model_path='models/baidu_en8k' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--specgram_type='linear'
......
......@@ -5,7 +5,7 @@ cd ../.. > /dev/null
# download data, generate manifests
PYTHONPATH=.:$PYTHONPATH python data/librispeech/librispeech.py \
--manifest_prefix='data/librispeech/manifest' \
--target_dir='~/.cache/paddle/dataset/speech/libri' \
--target_dir='./dataset/librispeech' \
--full_download='True'
if [ $? -ne 0 ]; then
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -15,7 +15,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_conv_layers=2 \
......@@ -31,7 +30,7 @@ python -u infer.py \
--infer_manifest='data/librispeech/manifest.test-clean' \
--mean_std_path='data/librispeech/mean_std.npz' \
--vocab_path='data/librispeech/vocab.txt' \
--model_path='checkpoints/libri/params.latest.tar.gz' \
--model_path='checkpoints/libri/step_final' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/librispeech > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -24,7 +24,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_conv_layers=2 \
......@@ -40,7 +39,7 @@ python -u infer.py \
--infer_manifest='data/librispeech/manifest.test-clean' \
--mean_std_path='models/librispeech/mean_std.npz' \
--vocab_path='models/librispeech/vocab.txt' \
--model_path='models/librispeech/params.tar.gz' \
--model_path='models/librispeech' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -15,10 +15,8 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u test.py \
--batch_size=128 \
--trainer_count=8 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
......@@ -32,7 +30,7 @@ python -u test.py \
--test_manifest='data/librispeech/manifest.test-clean' \
--mean_std_path='data/librispeech/mean_std.npz' \
--vocab_path='data/librispeech/vocab.txt' \
--model_path='checkpoints/libri/params.latest.tar.gz' \
--model_path='checkpoints/libri/step_final' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/librispeech > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -24,10 +24,8 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u test.py \
--batch_size=128 \
--trainer_count=8 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
......@@ -41,7 +39,7 @@ python -u test.py \
--test_manifest='data/librispeech/manifest.test-clean' \
--mean_std_path='models/librispeech/mean_std.npz' \
--vocab_path='models/librispeech/vocab.txt' \
--model_path='models/librispeech/params.tar.gz' \
--model_path='models/librispeech' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -3,17 +3,19 @@
cd ../.. > /dev/null
# train model
# if you wish to resume from an existing model, uncomment --init_model_path
# if you wish to resume from an existing model, uncomment --init_from_pretrain_model
export FLAGS_sync_nccl_allreduce=0
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u train.py \
--batch_size=160 \
--trainer_count=8 \
--num_passes=50 \
--num_proc_data=16 \
--batch_size=20 \
--num_epoch=50 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
--num_iter_print=100 \
--save_epoch=1 \
--num_samples=280000 \
--learning_rate=5e-4 \
--max_duration=27.0 \
--min_duration=0.0 \
......@@ -30,7 +32,7 @@ python -u train.py \
--output_model_dir='./checkpoints/libri' \
--augment_conf_path='conf/augmentation.config' \
--specgram_type='linear' \
--shuffle_method='batch_shuffle_clipped'
if [ $? -ne 0 ]; then
echo "Failed in training!"
......
......@@ -7,7 +7,6 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -u tools/tune.py \
--num_batches=-1 \
--batch_size=128 \
--trainer_count=4 \
--beam_size=500 \
--num_proc_bsearch=12 \
--num_conv_layers=2 \
......@@ -27,7 +26,7 @@ python -u tools/tune.py \
--tune_manifest='data/librispeech/manifest.dev-clean' \
--mean_std_path='data/librispeech/mean_std.npz' \
--vocab_path='models/librispeech/vocab.txt' \
--model_path='models/librispeech/params.tar.gz' \
--model_path='models/librispeech' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--error_rate_type='wer' \
--specgram_type='linear'
......
......@@ -7,11 +7,10 @@ if [ ! -e data/tiny ]; then
mkdir data/tiny
fi
# download data, generate manifests
PYTHONPATH=.:$PYTHONPATH python data/librispeech/librispeech.py \
--manifest_prefix='data/tiny/manifest' \
--target_dir='~/.cache/paddle/dataset/speech/libri' \
--target_dir='./dataset/librispeech' \
--full_download='False'
if [ $? -ne 0 ]; then
......@@ -21,12 +20,11 @@ fi
head -n 64 data/tiny/manifest.dev-clean > data/tiny/manifest.tiny
# build vocabulary
python tools/build_vocab.py \
--count_threshold=0 \
--vocab_path='data/tiny/vocab.txt' \
--manifest_paths='data/tiny/manifest.dev-clean'
--manifest_paths='data/tiny/manifest.tiny'
if [ $? -ne 0 ]; then
echo "Build vocabulary failed. Terminated."
......@@ -47,5 +45,5 @@ if [ $? -ne 0 ]; then
fi
echo "Tiny data preparation done."
echo "LibriSpeech Data preparation done."
exit 0
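
For context: with the flags above, `tools/build_vocab.py` counts characters across the listed manifests and writes one vocabulary entry per line, keeping characters whose count exceeds `count_threshold` (0 keeps everything). A hedged sketch of that logic, assuming the JSON-lines manifest format used elsewhere in this repo; `build_vocab` below is illustrative, not the script's actual code:

```python
import json
from collections import Counter

def build_vocab(manifest_paths, vocab_path, count_threshold=0):
    # count every character in every transcript across all manifests
    counter = Counter()
    for path in manifest_paths:
        with open(path) as f:
            for line in f:
                counter.update(json.loads(line)["text"])
    # one character per line, rare characters filtered out
    with open(vocab_path, 'w') as f:
        for char, count in sorted(counter.items()):
            if count > count_threshold:
                f.write(char + '\n')
```
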
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -15,7 +15,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_conv_layers=2 \
......@@ -28,10 +27,10 @@ python -u infer.py \
--use_gru=False \
--use_gpu=True \
--share_rnn_weights=True \
--infer_manifest='data/tiny/manifest.tiny' \
--infer_manifest='data/tiny/manifest.test-clean' \
--mean_std_path='data/tiny/mean_std.npz' \
--vocab_path='data/tiny/vocab.txt' \
--model_path='checkpoints/tiny/params.pass-19.tar.gz' \
--model_path='./checkpoints/tiny/step_final' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/librispeech > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -24,7 +24,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_conv_layers=2 \
......@@ -40,7 +39,7 @@ python -u infer.py \
--infer_manifest='data/tiny/manifest.test-clean' \
--mean_std_path='models/librispeech/mean_std.npz' \
--vocab_path='models/librispeech/vocab.txt' \
--model_path='models/librispeech/params.tar.gz' \
--model_path='models/librispeech' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -14,11 +14,9 @@ cd - > /dev/null
# evaluate model
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u test.py \
--batch_size=16 \
--trainer_count=8 \
--batch_size=128 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
......@@ -29,10 +27,10 @@ python -u test.py \
--use_gru=False \
--use_gpu=True \
--share_rnn_weights=True \
--test_manifest='data/tiny/manifest.tiny' \
--test_manifest='data/tiny/manifest.test-clean' \
--mean_std_path='data/tiny/mean_std.npz' \
--vocab_path='data/tiny/vocab.txt' \
--model_path='checkpoints/tiny/params.pass-19.tar.gz' \
--model_path='checkpoints/tiny/step_final' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/librispeech > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -24,10 +24,8 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u test.py \
--batch_size=128 \
--trainer_count=8 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
......@@ -41,7 +39,7 @@ python -u test.py \
--test_manifest='data/tiny/manifest.test-clean' \
--mean_std_path='models/librispeech/mean_std.npz' \
--vocab_path='models/librispeech/vocab.txt' \
--model_path='models/librispeech/params.tar.gz' \
--model_path='models/librispeech' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -3,17 +3,18 @@
cd ../.. > /dev/null
# train model
# if you wish to resume from an existing model, uncomment --init_model_path
# if you wish to resume from an existing model, uncomment --init_from_pretrain_model
export FLAGS_sync_nccl_allreduce=0
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -u train.py \
--batch_size=16 \
--trainer_count=4 \
--num_passes=20 \
--num_proc_data=1 \
--batch_size=4 \
--num_epoch=20 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
--num_iter_print=100 \
--num_iter_print=1 \
--save_epoch=1 \
--num_samples=64 \
--learning_rate=1e-5 \
--max_duration=27.0 \
--min_duration=0.0 \
......@@ -30,10 +31,10 @@ python -u train.py \
--output_model_dir='./checkpoints/tiny' \
--augment_conf_path='conf/augmentation.config' \
--specgram_type='linear' \
--shuffle_method='batch_shuffle_clipped'
if [ $? -ne 0 ]; then
echo "Fail in training!"
echo "Failed in training!"
exit 1
fi
......
......@@ -3,11 +3,10 @@
cd ../.. > /dev/null
# grid-search for hyper-parameters in language model
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -u tools/tune.py \
--num_batches=1 \
--batch_size=24 \
--trainer_count=8 \
--num_batches=-1 \
--batch_size=128 \
--beam_size=500 \
--num_proc_bsearch=12 \
--num_conv_layers=2 \
......@@ -24,10 +23,10 @@ python -u tools/tune.py \
--use_gru=False \
--use_gpu=True \
--share_rnn_weights=True \
--tune_manifest='data/tiny/manifest.tiny' \
--tune_manifest='data/tiny/manifest.dev-clean' \
--mean_std_path='data/tiny/mean_std.npz' \
--vocab_path='data/tiny/vocab.txt' \
--model_path='checkpoints/params.pass-9.tar.gz' \
--model_path='models/librispeech' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--error_rate_type='wer' \
--specgram_type='linear'
......
......@@ -3,9 +3,13 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import argparse
import functools
import paddle.v2 as paddle
import paddle.fluid as fluid
from data_utils.data import DataGenerator
from model_utils.model import DeepSpeech2Model
from utils.error_rate import wer, cer
......@@ -15,7 +19,6 @@ parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('num_samples', int, 10, "# of samples to infer.")
add_arg('trainer_count', int, 8, "# of Trainers (CPUs or GPUs).")
add_arg('beam_size', int, 500, "Beam search width.")
add_arg('num_proc_bsearch', int, 8, "# of CPUs for beam search.")
add_arg('num_conv_layers', int, 2, "# of convolution layers.")
......@@ -63,20 +66,25 @@ args = parser.parse_args()
def infer():
"""Inference for DeepSpeech2."""
if args.use_gpu:
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
data_generator = DataGenerator(
vocab_filepath=args.vocab_path,
mean_std_filepath=args.mean_std_path,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=1,
keep_transcription_text=True)
keep_transcription_text=True,
place=place,
is_training=False)
batch_reader = data_generator.batch_reader_creator(
manifest_path=args.infer_manifest,
batch_size=args.num_samples,
min_batch_size=1,
sortagrad=False,
shuffle_method=None)
infer_data = batch_reader().next()
infer_data = next(batch_reader())
ds2_model = DeepSpeech2Model(
vocab_size=data_generator.vocab_size,
......@@ -84,16 +92,19 @@ def infer():
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
use_gru=args.use_gru,
pretrained_model_path=args.model_path,
share_rnn_weights=args.share_rnn_weights)
share_rnn_weights=args.share_rnn_weights,
place=place,
init_from_pretrain_model=args.model_path)
# decoders only accept string encoded in utf-8
vocab_list = [chars.encode("utf-8") for chars in data_generator.vocab_list]
if args.decoding_method == "ctc_greedy":
ds2_model.logger.info("start inference ...")
probs_split = ds2_model.infer_batch_probs(infer_data=infer_data,
probs_split = ds2_model.infer_batch_probs(
infer_data=infer_data,
feeding_dict=data_generator.feeding)
result_transcripts = ds2_model.decode_batch_greedy(
probs_split=probs_split,
vocab_list=vocab_list)
......@@ -101,9 +112,11 @@ def infer():
ds2_model.init_ext_scorer(args.alpha, args.beta, args.lang_model_path,
vocab_list)
ds2_model.logger.info("start inference ...")
probs_split = ds2_model.infer_batch_probs(infer_data=infer_data,
probs_split = ds2_model.infer_batch_probs(
infer_data=infer_data,
feeding_dict=data_generator.feeding)
result_transcripts = ds2_model.decode_batch_beam_search(
result_transcripts = ds2_model.decode_batch_beam_search(
probs_split=probs_split,
beam_alpha=args.alpha,
beam_beta=args.beta,
......@@ -114,7 +127,7 @@ def infer():
num_processes=args.num_proc_bsearch)
error_rate_func = cer if args.error_rate_type == 'cer' else wer
target_transcripts = [data[1] for data in infer_data]
target_transcripts = infer_data[1]
for target, result in zip(target_transcripts, result_transcripts):
print("\nTarget Transcription: %s\nOutput Transcription: %s" %
(target, result))
......@@ -125,9 +138,6 @@ def infer():
def main():
print_arguments(args)
paddle.init(use_gpu=args.use_gpu,
rnn_use_batch=True,
trainer_count=args.trainer_count)
infer()
......
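
Across `infer.py`, `test.py`, `tools/tune.py` and `train.py`, this commit replaces the global `paddle.init(use_gpu=..., trainer_count=...)` setup with an explicit fluid `place` that is handed to `DataGenerator` and `DeepSpeech2Model`. A minimal runnable sketch of that device-selection pattern, assuming PaddlePaddle 1.x with the fluid API (`make_place` is an illustrative helper, not part of the repo):

```python
import paddle.fluid as fluid

def make_place(use_gpu):
    # fluid binds executors and feeders to an explicit device "place",
    # so per-process device selection replaces the old trainer_count knob
    return fluid.CUDAPlace(0) if use_gpu else fluid.CPUPlace()

place = make_place(use_gpu=False)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
```
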
This diff is collapsed.
This diff is collapsed.
......@@ -2,9 +2,9 @@
. ../../utils/utility.sh
URL='https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model.tar.gz'
MD5=0ee83aa15fba421e5de8fc66c8feb350
TARGET=./aishell_model.tar.gz
URL='https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model_fluid.tar.gz'
MD5=2bf0cc8b6d5da2a2a787b5cc36a496b5
TARGET=./aishell_model_fluid.tar.gz
echo "Download Aishell model ..."
......
......@@ -2,9 +2,9 @@
. ../../utils/utility.sh
URL='https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model.tar.gz'
MD5=5fe7639e720d51b3c3bdf7a1470c6272
TARGET=./baidu_en8k_model.tar.gz
URL='https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model_fluid.tar.gz'
MD5=7e58fbf64aa4ecf639b049792ddcf788
TARGET=./baidu_en8k_model_fluid.tar.gz
echo "Download BaiduEn8k model ..."
......
......@@ -2,9 +2,9 @@
. ../../utils/utility.sh
URL='https://deepspeech.bj.bcebos.com/eng_models/librispeech_model.tar.gz'
MD5=1f72d0c5591f453362f0caa09dd57618
TARGET=./librispeech_model.tar.gz
URL='https://deepspeech.bj.bcebos.com/eng_models/librispeech_model_fluid.tar.gz'
MD5=fafb11fe57c3ecd107147056453f5348
TARGET=./librispeech_model_fluid.tar.gz
echo "Download LibriSpeech model ..."
......
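
These three download scripts only swap `URL`, `MD5` and `TARGET` to the new fluid model archives; the download-and-verify logic itself lives in `utils/utility.sh`, which this diff does not show. A hedged Python sketch of the checksum step (`md5_matches` is illustrative, not the repo's API):

```python
import hashlib

def md5_matches(path, expected_md5):
    # stream the archive in 1 MiB chunks so large models don't fill RAM
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest() == expected_md5

# e.g. md5_matches('librispeech_model_fluid.tar.gz',
#                  'fafb11fe57c3ecd107147056453f5348')
```
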
......@@ -5,7 +5,7 @@ from __future__ import print_function
import argparse
import functools
import paddle.v2 as paddle
import paddle.fluid as fluid
from data_utils.data import DataGenerator
from model_utils.model import DeepSpeech2Model
from utils.error_rate import char_errors, word_errors
......@@ -15,10 +15,8 @@ parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('batch_size', int, 128, "Minibatch size.")
add_arg('trainer_count', int, 8, "# of Trainers (CPUs or GPUs).")
add_arg('beam_size', int, 500, "Beam search width.")
add_arg('num_proc_bsearch', int, 8, "# of CPUs for beam search.")
add_arg('num_proc_data', int, 8, "# of CPUs for data preprocessing.")
add_arg('num_conv_layers', int, 2, "# of convolution layers.")
add_arg('num_rnn_layers', int, 3, "# of recurrent layers.")
add_arg('rnn_layer_size', int, 2048, "# of recurrent cells per layer.")
......@@ -64,17 +62,22 @@ args = parser.parse_args()
def evaluate():
"""Evaluate on whole test data for DeepSpeech2."""
if args.use_gpu:
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
data_generator = DataGenerator(
vocab_filepath=args.vocab_path,
mean_std_filepath=args.mean_std_path,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=args.num_proc_data,
keep_transcription_text=True)
keep_transcription_text=True,
place=place,
is_training=False)
batch_reader = data_generator.batch_reader_creator(
manifest_path=args.test_manifest,
batch_size=args.batch_size,
min_batch_size=1,
sortagrad=False,
shuffle_method=None)
......@@ -84,8 +87,9 @@ def evaluate():
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
use_gru=args.use_gru,
pretrained_model_path=args.model_path,
share_rnn_weights=args.share_rnn_weights)
share_rnn_weights=args.share_rnn_weights,
place=place,
init_from_pretrain_model=args.model_path)
# decoders only accept string encoded in utf-8
vocab_list = [chars.encode("utf-8") for chars in data_generator.vocab_list]
......@@ -115,7 +119,7 @@ def evaluate():
cutoff_top_n=args.cutoff_top_n,
vocab_list=vocab_list,
num_processes=args.num_proc_bsearch)
target_transcripts = [data[1] for data in infer_data]
target_transcripts = infer_data[1]
for target, result in zip(target_transcripts, result_transcripts):
errors, len_ref = errors_func(target, result)
......@@ -131,9 +135,6 @@ def evaluate():
def main():
print_arguments(args)
paddle.init(use_gpu=args.use_gpu,
rnn_use_batch=True,
trainer_count=args.trainer_count)
evaluate()
......
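
`test.py` accumulates `errors_sum` and `len_refs` over the whole test set and reports their ratio as the final WER or CER. A toy sketch of that aggregation; whitespace tokenization and `_toy_distance` are illustrative stand-ins, and the real edit distance lives in `utils/error_rate.py`:

```python
def corpus_error_rate(pairs, distance):
    # sum edit errors over all utterances, then divide by the total
    # number of reference tokens -- the aggregation test.py performs
    errors_sum, len_refs = 0.0, 0
    for ref, hyp in pairs:
        ref_toks, hyp_toks = ref.split(), hyp.split()
        errors_sum += distance(ref_toks, hyp_toks)
        len_refs += len(ref_toks)
    return errors_sum / len_refs

def _toy_distance(a, b):
    # positional mismatches plus length gap; stand-in for Levenshtein
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

print(corpus_error_rate([("a b c", "a x c")], _toy_distance))  # 0.333...
```
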
......@@ -9,14 +9,13 @@ function join_by { local IFS="$1"; shift; echo "$*"; }
for NUM_GPUS in 16 8 4 2 1
do
DEVICES=$(join_by , $(seq 0 $(($NUM_GPUS-1))))
BATCH_SIZE=$(($BATCH_SIZE_PER_GPU * $NUM_GPUS))
BATCH_SIZE=$(($BATCH_SIZE_PER_GPU))
CUDA_VISIBLE_DEVICES=$DEVICES \
python train.py \
--batch_size=$BATCH_SIZE \
--num_passes=1 \
--num_epoch=1 \
--test_off=True \
--trainer_count=$NUM_GPUS \
--min_duration=$MIN_DURATION \
--max_duration=$MAX_DURATION > tmp.log 2>&1
......@@ -24,7 +23,7 @@ do
exit 1
fi
cat tmp.log | grep "Time" | awk '{print "GPU Num: " "'"$NUM_GPUS"'" " Time: "$3}'
cat tmp.log | grep "Time" | awk '{print "GPU Num: " "'"$NUM_GPUS"'" " Time: "$2}'
rm tmp.log
done
......@@ -10,7 +10,7 @@ import argparse
import functools
import gzip
import logging
import paddle.v2 as paddle
import paddle.fluid as fluid
import _init_paths
from data_utils.data import DataGenerator
from model_utils.model import DeepSpeech2Model
......@@ -26,7 +26,6 @@ add_arg('batch_size', int, 256, "# of samples per batch.")
add_arg('trainer_count', int, 8, "# of Trainers (CPUs or GPUs).")
add_arg('beam_size', int, 500, "Beam search width.")
add_arg('num_proc_bsearch', int, 8, "# of CPUs for beam search.")
add_arg('num_proc_data', int, 8, "# of CPUs for data preprocessing.")
add_arg('num_conv_layers', int, 2, "# of convolution layers.")
add_arg('num_rnn_layers', int, 3, "# of recurrent layers.")
add_arg('rnn_layer_size', int, 2048, "# of recurrent cells per layer.")
......@@ -77,13 +76,19 @@ def tune():
if not args.num_betas >= 0:
raise ValueError("num_betas must be non-negative!")
if args.use_gpu:
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
data_generator = DataGenerator(
vocab_filepath=args.vocab_path,
mean_std_filepath=args.mean_std_path,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=args.num_proc_data,
keep_transcription_text=True)
keep_transcription_text=True,
place=place,
is_training=False)
batch_reader = data_generator.batch_reader_creator(
manifest_path=args.tune_manifest,
......@@ -97,7 +102,8 @@ def tune():
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
use_gru=args.use_gru,
pretrained_model_path=args.model_path,
place=place,
init_from_pretrain_model=args.model_path,
share_rnn_weights=args.share_rnn_weights)
# decoders only accept string encoded in utf-8
......@@ -109,8 +115,8 @@ def tune():
params_grid = [(alpha, beta) for alpha in cand_alphas
for beta in cand_betas]
err_sum = [0.0 for i in xrange(len(params_grid))]
err_ave = [0.0 for i in xrange(len(params_grid))]
err_sum = [0.0 for i in range(len(params_grid))]
err_ave = [0.0 for i in range(len(params_grid))]
num_ins, len_refs, cur_batch = 0, 0, 0
# initialize external scorer
ds2_model.init_ext_scorer(args.alpha_from, args.beta_from,
......@@ -123,7 +129,7 @@ def tune():
probs_split = ds2_model.infer_batch_probs(
infer_data=infer_data,
feeding_dict=data_generator.feeding)
target_transcripts = [ data[1] for data in infer_data ]
target_transcripts = infer_data[1]
num_ins += len(target_transcripts)
# grid search
......@@ -137,7 +143,6 @@ def tune():
cutoff_top_n=args.cutoff_top_n,
vocab_list=vocab_list,
num_processes=args.num_proc_bsearch)
for target, result in zip(target_transcripts, result_transcripts):
errors, len_ref = errors_func(target, result)
err_sum[index] += errors
......@@ -163,7 +168,7 @@ def tune():
# output WER/CER at every (alpha, beta)
print("\nFinal %s:\n" % args.error_rate_type)
for index in xrange(len(params_grid)):
for index in range(len(params_grid)):
print("(alpha, beta) = (%s, %s), [%s] = %f"
% ("%.3f" % params_grid[index][0], "%.3f" % params_grid[index][1],
args.error_rate_type, err_ave[index]))
......@@ -179,9 +184,6 @@ def tune():
def main():
print_arguments(args)
paddle.init(use_gpu=args.use_gpu,
rnn_use_batch=True,
trainer_count=args.trainer_count)
tune()
......
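
The tuning loop grid-searches `(alpha, beta)` pairs, accumulating one error sum per pair and averaging over the number of instances at the end. A condensed runnable sketch of that bookkeeping; the candidate ranges and the `score` function are illustrative stand-ins for the decoder's per-sample error:

```python
import numpy as np

def grid_search(cand_alphas, cand_betas, batches, score):
    params_grid = [(a, b) for a in cand_alphas for b in cand_betas]
    err_sum = [0.0 for _ in range(len(params_grid))]  # py3 range, not xrange
    num_ins = 0
    for batch in batches:
        num_ins += len(batch)
        for idx, (alpha, beta) in enumerate(params_grid):
            err_sum[idx] += sum(score(s, alpha, beta) for s in batch)
    err_ave = [s / num_ins for s in err_sum]
    return params_grid[int(np.argmin(err_ave))], err_ave

best, _ = grid_search(np.linspace(0.0, 2.0, 3), np.linspace(0.0, 1.0, 2),
                      batches=[[1, 2], [3]],
                      score=lambda s, a, b: abs(s - 2 * a - b))
print(best)
```
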
......@@ -5,23 +5,25 @@ from __future__ import print_function
import argparse
import functools
import paddle.v2 as paddle
import io
from model_utils.model import DeepSpeech2Model
from data_utils.data import DataGenerator
from utils.utility import add_arguments, print_arguments
import paddle.fluid as fluid
parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('batch_size', int, 256, "Minibatch size.")
add_arg('trainer_count', int, 8, "# of Trainers (CPUs or GPUs).")
add_arg('num_passes', int, 200, "# of training epochs.")
add_arg('num_proc_data', int, 16, "# of CPUs for data preprocessing.")
add_arg('num_epoch', int, 200, "# of training epochs.")
add_arg('num_conv_layers', int, 2, "# of convolution layers.")
add_arg('num_rnn_layers', int, 3, "# of recurrent layers.")
add_arg('rnn_layer_size', int, 2048, "# of recurrent cells per layer.")
add_arg('num_iter_print', int, 100, "Every # iterations for printing "
add_arg('num_iter_print', int, 100, "Every # batch for printing "
"train cost.")
add_arg('save_epoch', int, 10, "Every # epochs for saving checkpoint and model params.")
add_arg('num_samples', int, 10000, "Number of training samples.")
add_arg('learning_rate', float, 5e-4, "Learning rate.")
add_arg('max_duration', float, 27.0, "Longest audio duration allowed.")
add_arg('min_duration', float, 0.0, "Shortest audio duration allowed.")
......@@ -31,7 +33,12 @@ add_arg('use_gpu', bool, True, "Use GPU or not.")
add_arg('use_gru', bool, False, "Use GRUs instead of simple RNNs.")
add_arg('is_local', bool, True, "Use pserver or not.")
add_arg('share_rnn_weights', bool, True, "Share input-hidden weights across "
"bi-directional RNNs. Not for GRU.")
add_arg('init_from_pretrain_model', str,
None,
"If None, the training starts from scratch, "
"otherwise, it resumes from the pre-trained model.")
add_arg('train_manifest', str,
'data/librispeech/manifest.train',
"Filepath of train manifest.")
......@@ -44,10 +51,6 @@ add_arg('mean_std_path', str,
add_arg('vocab_path', str,
'data/librispeech/vocab.txt',
"Filepath of vocabulary.")
add_arg('init_model_path', str,
None,
"If None, the training starts from scratch, "
"otherwise, it resumes from the pre-trained model.")
add_arg('output_model_dir', str,
"./checkpoints/libri",
"Directory for saving checkpoints.")
......@@ -68,30 +71,33 @@ args = parser.parse_args()
def train():
"""DeepSpeech2 training."""
if args.use_gpu:
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
train_generator = DataGenerator(
vocab_filepath=args.vocab_path,
mean_std_filepath=args.mean_std_path,
augmentation_config=open(args.augment_conf_path, 'r').read(),
augmentation_config=io.open(args.augment_conf_path, mode='r', encoding='utf8').read(),
max_duration=args.max_duration,
min_duration=args.min_duration,
specgram_type=args.specgram_type,
num_threads=args.num_proc_data)
place=place)
dev_generator = DataGenerator(
vocab_filepath=args.vocab_path,
mean_std_filepath=args.mean_std_path,
augmentation_config="{}",
specgram_type=args.specgram_type,
num_threads=args.num_proc_data)
place=place)
train_batch_reader = train_generator.batch_reader_creator(
manifest_path=args.train_manifest,
batch_size=args.batch_size,
min_batch_size=args.trainer_count,
sortagrad=args.use_sortagrad if args.init_model_path is None else False,
sortagrad=args.use_sortagrad if args.init_from_pretrain_model is None else False,
shuffle_method=args.shuffle_method)
dev_batch_reader = dev_generator.batch_reader_creator(
manifest_path=args.dev_manifest,
batch_size=args.batch_size,
min_batch_size=1, # must be 1, otherwise errors occur.
sortagrad=False,
shuffle_method=None)
......@@ -101,27 +107,27 @@ def train():
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
use_gru=args.use_gru,
pretrained_model_path=args.init_model_path,
share_rnn_weights=args.share_rnn_weights)
share_rnn_weights=args.share_rnn_weights,
place=place,
init_from_pretrain_model=args.init_from_pretrain_model,
output_model_dir=args.output_model_dir)
ds2_model.train(
train_batch_reader=train_batch_reader,
dev_batch_reader=dev_batch_reader,
feeding_dict=train_generator.feeding,
learning_rate=args.learning_rate,
gradient_clipping=400,
num_passes=args.num_passes,
batch_size=args.batch_size,
num_samples=args.num_samples,
num_epoch=args.num_epoch,
save_epoch=args.save_epoch,
num_iterations_print=args.num_iter_print,
output_model_dir=args.output_model_dir,
is_local=args.is_local,
test_off=args.test_off)
def main():
print_arguments(args)
paddle.init(use_gpu=args.use_gpu,
rnn_use_batch=True,
trainer_count=args.trainer_count,
log_clipping=True)
train()
......
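
The renamed knobs map onto an epoch-based loop: `num_epoch` outer passes, a cost print every `num_iter_print` batches, and a checkpoint every `save_epoch` epochs. A hedged sketch of that control flow; the real loop lives in `model_utils/model.py` (not expanded on this page), and `run_training` below is an illustrative stand-in:

```python
def run_training(train_batches, num_epoch=20, save_epoch=1, num_iter_print=1):
    for epoch in range(num_epoch):
        for it, batch in enumerate(train_batches):
            cost = sum(batch) / len(batch)  # stand-in for one optimizer step
            if it % num_iter_print == 0:
                print("epoch %d, iter %d, train cost %.4f" % (epoch, it, cost))
        if (epoch + 1) % save_epoch == 0:
            print("saved checkpoint for epoch %d" % epoch)

run_training([[1.0, 2.0], [0.5, 1.5]], num_epoch=2)
```
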
......@@ -36,15 +36,15 @@ def _levenshtein_distance(ref, hyp):
distance = np.zeros((2, n + 1), dtype=np.int32)
# initialize distance matrix
for j in xrange(n + 1):
for j in range(n + 1):
distance[0][j] = j
# calculate levenshtein distance
for i in xrange(1, m + 1):
for i in range(1, m + 1):
prev_row_idx = (i - 1) % 2
cur_row_idx = i % 2
distance[cur_row_idx][0] = i
for j in xrange(1, n + 1):
for j in range(1, n + 1):
if ref[i - 1] == hyp[j - 1]:
distance[cur_row_idx][j] = distance[prev_row_idx][j - 1]
else:
......
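
The hunk above stops inside the inner loop. For reference, a self-contained Python 3 version of the same two-row Levenshtein recurrence, with the edit-cost branch after `else:` reconstructed (it may differ cosmetically from the repo's code):

```python
import numpy as np

def levenshtein_distance(ref, hyp):
    m, n = len(ref), len(hyp)
    distance = np.zeros((2, n + 1), dtype=np.int32)
    # initialize distance matrix
    for j in range(n + 1):
        distance[0][j] = j
    # calculate levenshtein distance, keeping only two rows in memory
    for i in range(1, m + 1):
        prev_row_idx = (i - 1) % 2
        cur_row_idx = i % 2
        distance[cur_row_idx][0] = i
        for j in range(1, n + 1):
            if ref[i - 1] == hyp[j - 1]:
                distance[cur_row_idx][j] = distance[prev_row_idx][j - 1]
            else:
                s_num = distance[prev_row_idx][j - 1] + 1  # substitution
                i_num = distance[cur_row_idx][j - 1] + 1   # insertion
                d_num = distance[prev_row_idx][j] + 1      # deletion
                distance[cur_row_idx][j] = min(s_num, i_num, d_num)
    return int(distance[m % 2][n])

assert levenshtein_distance("kitten", "sitting") == 3
```
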