Commit d74f4ff3 authored by lfchener

update deepspeech to fluid api

Parent d2bdd254
# DeepSpeech2 on PaddlePaddle
*DeepSpeech2 on PaddlePaddle* is an open-source implementation of an end-to-end Automatic Speech Recognition (ASR) engine, based on [Baidu's Deep Speech 2 paper](http://proceedings.mlr.press/v48/amodei16.pdf) and built on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform. Our vision is to empower both industrial applications and academic research on speech recognition via an easy-to-use, efficient and scalable implementation, including training, inference & testing modules, distributed [PaddleCloud](https://github.com/PaddlePaddle/cloud) training, and demo deployment. Besides, several pre-trained models for both English and Mandarin are also released.
*DeepSpeech2 on PaddlePaddle* is an open-source implementation of an end-to-end Automatic Speech Recognition (ASR) engine, based on [Baidu's Deep Speech 2 paper](http://proceedings.mlr.press/v48/amodei16.pdf) and built on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform. Our vision is to empower both industrial applications and academic research on speech recognition via an easy-to-use, efficient and scalable implementation, including training, inference & testing modules, and demo deployment. Besides, several pre-trained models for both English and Mandarin are also released.
## Table of Contents
- [Installation](#installation)
......@@ -10,7 +10,6 @@
- [Data Augmentation Pipeline](#data-augmentation-pipeline)
- [Inference and Evaluation](#inference-and-evaluation)
- [Running in Docker Container](#running-in-docker-container)
- [Distributed Cloud Training](#distributed-cloud-training)
- [Hyper-parameters Tuning](#hyper-parameters-tuning)
- [Training for Mandarin Language](#training-for-mandarin-language)
- [Trying Live Demo with Your Own Voice](#trying-live-demo-with-your-own-voice)
......@@ -22,13 +21,45 @@
## Installation
Since this project was developed with the PaddlePaddle V2 API, which is no longer officially maintained, we only support [running it in Docker container](#running-in-docker-container) rather than building the environment from source code. We are going to release an update to the latest Paddle Fluid API very soon; please keep an eye on this project.
To avoid the trouble of environment setup, [running in Docker container](#running-in-docker-container) is highly recommended. Otherwise, follow the guidelines below to install the dependencies manually.
### Prerequisites
- Only Python 2.7 is supported
- PaddlePaddle the latest version (please refer to the [Installation Guide](https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/beginners_guide/install/index_en.html))
### Setup
- Make sure these libraries or tools are installed: `pkg-config`, `flac`, `ogg`, `vorbis`, `boost` and `swig`, e.g. by installing them via `apt-get`:
```bash
sudo apt-get install -y pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig
```
or by installing them via `yum`:
```bash
sudo yum install pkgconfig libogg-devel libvorbis-devel boost-devel
wget https://ftp.osuosl.org/pub/xiph/releases/flac/flac-1.3.1.tar.xz
xz -d flac-1.3.1.tar.xz
tar -xvf flac-1.3.1.tar
cd flac-1.3.1
./configure
make
make install
```
- Run the setup script for the remaining dependencies
```bash
git clone https://github.com/PaddlePaddle/DeepSpeech.git
cd DeepSpeech
sh setup.sh
```
## Getting Started
Several shell scripts provided in `./examples` will help you quickly try out most major modules, including data preparation, model training, case inference and model evaluation, with a few public datasets (e.g. [LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)). Reading these examples will also help you understand how to make the modules work with your own data.
Some of the scripts in `./examples` are configured with 8 GPUs. If you don't have 8 GPUs available, please modify `CUDA_VISIBLE_DEVICES` and `--trainer_count`. If you don't have any GPU available, please set `--use_gpu` to False to use CPUs instead. Besides, if an out-of-memory problem occurs, just reduce `--batch_size` to fit.
Some of the scripts in `./examples` are configured with 8 GPUs. If you don't have 8 GPUs available, please modify `CUDA_VISIBLE_DEVICES`. If you don't have any GPU available, please set `--use_gpu` to False to use CPUs instead. Besides, if an out-of-memory problem occurs, just reduce `--batch_size` to fit.
Let's take a tiny sampled subset of [LibriSpeech dataset](http://www.openslr.org/12/) for instance.
......@@ -45,7 +76,7 @@ Let's take a tiny sampled subset of [LibriSpeech dataset](http://www.openslr.org
sh run_data.sh
```
`run_data.sh` will download the dataset, generate manifests, collect the normalizer's statistics and build the vocabulary. Once the data preparation is done, you will find the data (only part of LibriSpeech) downloaded in `~/.cache/paddle/dataset/speech/libri` and the corresponding manifest files generated in `./data/tiny`, as well as a mean-stddev file and a vocabulary file. It has to be run only the very first time you use this dataset, and the results are reusable for all further experiments.
`run_data.sh` will download the dataset, generate manifests, collect the normalizer's statistics and build the vocabulary. Once the data preparation is done, you will find the data (only part of LibriSpeech) downloaded in `./dataset/librispeech` and the corresponding manifest files generated in `./data/tiny`, as well as a mean-stddev file and a vocabulary file. It has to be run only the very first time you use this dataset, and the results are reusable for all further experiments.
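Each generated manifest is a plain-text file of JSON lines, one utterance per line. A minimal sketch of reading one (field names as produced by the data preparation scripts; the path assumes the tiny example above):
```python
import io
import json

# Each manifest line is a JSON record describing one utterance.
with io.open('data/tiny/manifest.tiny', encoding='utf8') as f:
    for line in f:
        entry = json.loads(line)
        print('%s\t%s\t%s' % (entry['audio_filepath'], entry['duration'],
                              entry['text']))
```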
- Train your own ASR model
```bash
......@@ -139,20 +170,20 @@ python tools/build_vocab.py --help
- Start training from scratch with 8 GPUs:
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --trainer_count 8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py
```
- Start training from scratch with 16 CPUs:
- Start training from scratch with CPUs:
```
python train.py --use_gpu False --trainer_count 16
python train.py --use_gpu False
```
- Resume training from a checkpoint:
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python train.py \
--init_model_path CHECKPOINT_PATH_TO_RESUME_FROM
--init_from_pretrain_model CHECKPOINT_PATH_TO_RESUME_FROM
```
For more help on arguments:
......@@ -162,6 +193,7 @@ python train.py --help
```
or refer to `example/librispeech/run_train.sh`.
## Data Augmentation Pipeline
Data augmentation has often been a highly effective technique to boost deep learning performance. We augment our speech data by synthesizing new audio with small random perturbations (label-invariant transformations) applied to the raw audio. You don't have to do the synthesis yourself, as it is already embedded in the data provider and is done on the fly, randomly for each epoch during training.
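The pipeline is driven by a JSON config that lists the augmentors to apply. A minimal sketch of writing such a config, assuming the `{type, params, prob}` schema used by `conf/augmentation.config` (the speed-perturbation values here are illustrative):
```python
import json

# One augmentor: randomly resample speed within [0.95, 1.05]; `prob` is the
# per-sample probability of applying the perturbation during training.
config = [{
    "type": "speed",
    "params": {"min_speed_rate": 0.95, "max_speed_rate": 1.05},
    "prob": 0.6
}]
with open('conf/augmentation.config', 'w') as f:
    json.dump(config, f, indent=4)
```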
......@@ -206,8 +238,8 @@ A language model is required to improve the decoder's performance. We have prepa
```bash
cd models/lm
sh download_lm_en.sh
sh download_lm_ch.sh
bash download_lm_en.sh
bash download_lm_ch.sh
```
If you wish to train your own better language model, please refer to [KenLM](https://github.com/kpu/kenlm) for tutorials. Here we provide some tips to show how we prepared our English and Mandarin language models. You can take them as a reference when you train your own.
......@@ -216,7 +248,7 @@ If you wish to train your own better language model, please refer to [KenLM](htt
The English corpus is from the [Common Crawl Repository](http://commoncrawl.org) and you can download it from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our English language model. There are some preprocessing steps before training:
* Characters not in \[A-Za-z0-9\s'\] (\s represents whitespace characters) are removed and Arabic numbers are converted to English numbers like 1000 to one thousand.
* Characters not in \['A-Za-z0-9\s'\] (\s represents whitespace characters) are removed and Arabic numbers are converted to English numbers like 1000 to one thousand.
* Repeated whitespace characters are squeezed to one, and leading whitespace characters are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercase.
* The top 400,000 most frequent words are selected to build the vocabulary, and the rest are replaced with 'UNKNOWNWORD'.
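A short sketch of the character-level cleanup described above (the helper name is ours, and the number-to-words conversion is omitted since it requires a separate tool):
```python
import re

def normalize_line(line):
    """Lowercase, keep only letters, digits, whitespace and apostrophes,
    and squeeze repeated whitespace."""
    line = line.lower()
    line = re.sub(r"[^a-z0-9\s']", '', line)   # drop disallowed characters
    return re.sub(r'\s+', ' ', line).strip()   # squeeze repeated whitespace

print(normalize_line("Hello,  World! It's fine."))  # -> "hello world it's fine"
```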
......@@ -239,13 +271,13 @@ An inference module called `infer.py` is provided to infer, decode and visualize
- Inference with GPU:
```bash
CUDA_VISIBLE_DEVICES=0 python infer.py --trainer_count 1
CUDA_VISIBLE_DEVICES=0 python infer.py
```
- Inference with CPUs:
```bash
python infer.py --use_gpu False --trainer_count 12
python infer.py --use_gpu False
```
We provide two types of CTC decoders: *CTC greedy decoder* and *CTC beam search decoder*. The *CTC greedy decoder* is an implementation of the simple best-path decoding algorithm, selecting at each timestep the most likely token, thus being greedy and locally optimal. The [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873), in contrast, utilizes a heuristic breadth-first graph search to reach near-global optimality; it also requires a pre-trained KenLM language model for better scoring and ranking. The decoder type can be set with the argument `--decoding_method`.
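To illustrate the greedy strategy, here is a toy best-path decoder (a sketch, not the repo's implementation): pick the argmax token at each timestep, collapse consecutive repeats, then drop blanks.
```python
import numpy as np

def ctc_greedy_decode(probs_seq, vocab_list, blank_id):
    """probs_seq: array of shape (num_timesteps, num_classes)."""
    best_path = np.argmax(np.asarray(probs_seq), axis=1)
    result, prev = [], None
    for idx in best_path:
        if idx != prev and idx != blank_id:  # collapse repeats, drop blanks
            result.append(vocab_list[idx])
        prev = idx
    return ''.join(result)

# 'a', 'a', blank, 'b'  ->  "ab"
probs = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.1, 0.8, 0.1]]
print(ctc_greedy_decode(probs, ['a', 'b'], blank_id=2))
```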
......@@ -264,13 +296,13 @@ To evaluate a model's performance quantitatively, please run:
- Evaluation with GPUs:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python test.py --trainer_count 8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python test.py
```
- Evaluation with CPUs:
```bash
python test.py --use_gpu False --trainer_count 12
python test.py --use_gpu False
```
The error rate (default: word error rate; can be set with `--error_rate_type`) will be printed.
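For reference, the word error rate is the word-level Levenshtein distance between hypothesis and reference, normalized by the reference length; a toy sketch (not the repo's implementation):
```python
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return float(dp[-1][-1]) / len(ref)

print(wer("i like deep speech", "i like speech"))  # 1 edit / 4 words = 0.25
```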
......@@ -293,7 +325,6 @@ The hyper-parameters $\alpha$ (language model weight) and $\beta$ (word insertio
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python tools/tune.py \
--trainer_count 8 \
--alpha_from 1.0 \
--alpha_to 3.2 \
--num_alphas 45 \
......@@ -332,7 +363,7 @@ Take several steps to launch the Docker image:
- Download the Docker image
```bash
nvidia-docker pull paddlepaddle/deep_speech:latest-gpu
nvidia-docker pull hub.baidubce.com/paddlepaddle/deep_speech_fluid:latest-gpu
```
- Clone this repository
......@@ -344,72 +375,10 @@ git clone https://github.com/PaddlePaddle/DeepSpeech.git
- Run the Docker image
```bash
sudo nvidia-docker run -it -v $(pwd)/DeepSpeech:/DeepSpeech paddlepaddle/deep_speech:latest-gpu /bin/bash
sudo nvidia-docker run -it -v $(pwd)/DeepSpeech:/DeepSpeech hub.baidubce.com/paddlepaddle/deep_speech_fluid:latest-gpu /bin/bash
```
Now go back and start from the [Getting Started](#getting-started) section; you can execute training, inference and hyper-parameters tuning in the same way inside the Docker container.
## Distributed Cloud Training
We also provide a cloud training module for users to do distributed cluster training on [PaddleCloud](https://github.com/PaddlePaddle/cloud), to achieve a much faster training speed with multiple machines. To get started, please first install the PaddleCloud client and register a PaddleCloud account, as described in [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud).
Please take the following steps to submit a training job:
- Go to directory:
```bash
cd cloud
```
- Upload data:
Data must be uploaded to PaddleCloud filesystem to be accessed within a cloud job. `pcloud_upload_data.sh` helps do the data packing and uploading:
```bash
sh pcloud_upload_data.sh
```
Given input manifests, `pcloud_upload_data.sh` will:
- Extract the audio files listed in the input manifests.
- Pack them into a specified number of tar files.
- Upload these tar files to PaddleCloud filesystem.
- Create cloud manifests by replacing local filesystem paths with PaddleCloud filesystem paths. New manifests will be used to inform the cloud jobs of audio files' location and their meta information.
It should be done only once, the very first time you do cloud training. Later, the data is kept persistent on the cloud filesystem and is reusable for further job submissions.
For argument details please refer to [Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/cloud).
- Configure training arguments:
Configure the cloud job parameters in `pcloud_submit.sh` (e.g. `NUM_NODES`, `NUM_GPUS`, `CLOUD_TRAIN_DIR`, `JOB_NAME` etc.) and then configure other hyper-parameters for training in `pcloud_train.sh` (just as you do for local training).
For argument details please refer to [Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/cloud).
- Submit the job:
By running:
```bash
sh pcloud_submit.sh
```
a training job will be submitted to PaddleCloud, with the job name printed to the console.
- Get training logs
Run this to list all the jobs you have submitted, as well as their running status:
```bash
paddlecloud get jobs
```
Run this to print the corresponding job's logs:
```bash
paddlecloud logs -n 10000 $REPLACED_WITH_YOUR_ACTUAL_JOB_NAME
```
For more information about the usage of PaddleCloud, please refer to [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务).
For more information about the DeepSpeech2 training on PaddleCloud, please refer to
[Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/cloud).
## Training for Mandarin Language
......@@ -417,14 +386,13 @@ The key steps of training for Mandarin language are the same as those of English lang
## Trying Live Demo with Your Own Voice
Until now, an ASR model has been trained and tested qualitatively (`infer.py`) and quantitatively (`test.py`) with existing audio files, but not yet with your own speech. `deploy/demo_server.py` and `deploy/demo_client.py` help you quickly build a real-time demo ASR engine with the trained model, enabling you to test and play around with the demo using your own voice.
Until now, an ASR model has been trained and tested qualitatively (`infer.py`) and quantitatively (`test.py`) with existing audio files, but not yet with your own speech. `deploy/demo_english_server.py` and `deploy/demo_client.py` help you quickly build a real-time demo ASR engine with the trained model, enabling you to test and play around with the demo using your own voice.
To start the demo's server, please run this in one console:
```bash
CUDA_VISIBLE_DEVICES=0 \
python deploy/demo_server.py \
--trainer_count 1 \
--host_ip localhost \
--host_port 8086
```
......@@ -436,7 +404,7 @@ For example, on MAC OS X:
```bash
brew install portaudio
pip install pyaudio
pip install pynput
pip install keyboard
```
Then to start the client, please run this in another console:
......@@ -452,7 +420,7 @@ Now, in the client console, press the `whitespace` key, hold, and start speaking
Notice that `deploy/demo_client.py` must be run on a machine with a microphone device, while `deploy/demo_server.py` can run on one without any audio recording hardware, e.g. any remote server machine. Just be careful to set the `host_ip` and `host_port` arguments to an actually accessible IP address and port if the server and client are running on two separate machines; nothing needs to be done if they are running on a single machine.
Please also refer to `examples/mandarin/run_demo_server.sh`, which will first download a pre-trained Mandarin model (trained with 3000 hours of internal speech data) and then start the demo server with that model. By running `examples/mandarin/run_demo_client.sh`, you can speak Mandarin to test it. If you would like to try some other models, just update the `--model_path` argument in the script.
Please also refer to `examples/deploy_demo/run_english_demo_server.sh`, which will first download a pre-trained English model (trained with 3000 hours of internal speech data) and then start the demo server with that model. By running `examples/mandarin/run_demo_client.sh`, you can speak English to test it. If you would like to try some other models, just update the `--model_path` argument in the script.
For more help on arguments:
......@@ -467,10 +435,10 @@ python deploy/demo_client.py --help
Language | Model Name | Training Data | Hours of Speech
:-----------: | :------------: | :----------: | -------:
English | [LibriSpeech Model](https://deepspeech.bj.bcebos.com/eng_models/librispeech_model.tar.gz) | [LibriSpeech Dataset](http://www.openslr.org/12/) | 960 h
English | [BaiduEN8k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model.tar.gz) | Baidu Internal English Dataset | 8628 h
Mandarin | [Aishell Model](https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model.tar.gz) | [Aishell Dataset](http://www.openslr.org/33/) | 151 h
Mandarin | [BaiduCN1.2k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_cn1.2k_model.tar.gz) | Baidu Internal Mandarin Dataset | 1204 h
English | [LibriSpeech Model](https://deepspeech.bj.bcebos.com/eng_models/librispeech_model_fluid.tar.gz) | [LibriSpeech Dataset](http://www.openslr.org/12/) | 960 h
English | [BaiduEN8k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model_fluid.tar.gz) | Baidu Internal English Dataset | 8628 h
Mandarin | [Aishell Model](https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model_fluid.tar.gz) | [Aishell Dataset](http://www.openslr.org/33/) | 151 h
Mandarin | [BaiduCN1.2k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_cn1.2k_model_fluid.tar.gz) | Baidu Internal Mandarin Dataset | 1204 h
#### Language Model Released
......@@ -504,17 +472,16 @@ Baidu Internal Testset | 12.64
#### Acceleration with Multi-GPUs
We compare the training time with 1, 2, 4, 8 and 16 Tesla K40m GPUs (with a subset of LibriSpeech samples whose audio durations are between 6.0 and 7.0 seconds). The results show that a **near-linear** acceleration with multiple GPUs has been achieved. In the following figure, the time (in seconds) cost for training is printed on the blue bars.
We compare the training time with 1, 2, 4 and 8 Tesla V100 GPUs (with a subset of LibriSpeech samples whose audio durations are between 6.0 and 7.0 seconds). The results show that a **near-linear** acceleration with multiple GPUs has been achieved. In the following figure, the time (in seconds) cost for training is printed on the blue bars.
<img src="docs/images/multi_gpu_speedup.png" width=450><br/>
| # of GPU | Acceleration Rate |
| -------- | --------------: |
| 1 | 1.00 X |
| 2 | 1.97 X |
| 4 | 3.74 X |
| 8 | 6.21 X |
|16 | 10.70 X |
| 2 | 1.98 X |
| 4 | 3.73 X |
| 8 | 6.95 X |
`tools/profile.sh` provides such a profiling tool.
......
This diff is collapsed.
# Train DeepSpeech2 on PaddleCloud
>Note:
>Please make sure the [PaddleCloud Client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud) has been installed and the current directory is `deep_speech_2/cloud/`
## Step 1: Upload Data
Provided with several input manifests, `pcloud_upload_data.sh` will pack and upload all the audio files they contain to the PaddleCloud filesystem, and also generate corresponding manifest files with updated cloud paths.
Please modify the following arguments in `pcloud_upload_data.sh`:
- `IN_MANIFESTS`: Paths (in local filesystem) of manifest files containing the audio files to be uploaded. Multiple paths can be concatenated with a whitespace delimiter.
- `OUT_MANIFESTS`: Paths (in local filesystem) to write the updated output manifest files to. Multiple paths can be concatenated with a whitespace delimiter. The values of `audio_filepath` in the output manifests are updated with cloud filesystem paths.
- `CLOUD_DATA_DIR`: Directory (in PaddleCloud filesystem) to upload the data to. Don't forget to replace `USERNAME` in the default directory and make sure that you have the permission to write it.
- `NUM_SHARDS`: Number of data shards / parts (in tar files) to be generated when packing and uploading data. A smaller `NUM_SHARDS` requires more temporary local disk space for packing data.
By running:
```
sh pcloud_upload_data.sh
```
all the audio files will be uploaded to the PaddleCloud filesystem, and you will get the modified manifest files in `OUT_MANIFESTS`.
You have to take this step only once, the very first time you do cloud training. Later on, the data is persistent on the cloud filesystem and reusable for further job submissions.
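For reference, the rewritten manifests point into the uploaded tar shards with a `tar:<tar_path>#<member>` scheme. A sketch of one output line (the duration, text and member name are illustrative; the shard naming follows the upload tool's `part-XXXXX-of-XXXXX.tar` pattern):
```python
import json

# Illustrative cloud manifest entry after pcloud_upload_data.sh has run.
entry = {
    "audio_filepath": "tar:/pfs/dlnel/home/USERNAME/deepspeech2/data/"
                      "librispeech/part-00000-of-00050.tar#1-1-0001.flac",
    "duration": 3.9,
    "text": "an example transcription",
}
print(json.dumps(entry))
```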
## Step 2: Configure Training
Configure cloud training arguments in `pcloud_submit.sh`, with the following arguments:
- `TRAIN_MANIFEST`: Manifest filepath (in local filesystem) for training. Notice that the `audio_filepath` entries should be in the cloud filesystem, like those generated by `pcloud_upload_data.sh`.
- `DEV_MANIFEST`: Manifest filepath (in local filesystem) for validation.
- `CLOUD_MODEL_DIR`: Directory (in PaddleCloud filesystem) to save the model parameters (checkpoints). Don't forget to replace `USERNAME` in the default directory and make sure that you have the permission to write it.
- `BATCH_SIZE`: Training batch size for a single node.
- `NUM_GPU`: Number of GPUs allocated for a single node.
- `NUM_NODE`: Number of nodes (machines) allocated for this job.
- `IS_LOCAL`: Set to False to enable parameter server, if using multiple nodes.
Configure other training hyper-parameters in `pcloud_train.sh` as you wish, just as you would for local training.
By running:
```
sh pcloud_submit.sh
```
you submit a training job to PaddleCloud, and you will see the job name when the submission is done.
## Step 3: Get Job Logs
Run this to list all the jobs you have submitted, as well as their running status:
```
paddlecloud get jobs
```
Run this to print the corresponding job's logs:
```
paddlecloud logs -n 10000 $REPLACED_WITH_YOUR_ACTUAL_JOB_NAME
```
## More Help
For more information about the usage of PaddleCloud, please refer to [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务).
"""Set up paths for DS2"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os.path
import sys
def add_path(path):
if path not in sys.path:
sys.path.insert(0, path)
this_dir = os.path.dirname(__file__)
proj_path = os.path.join(this_dir, '..')
add_path(proj_path)
#! /usr/bin/env bash
TRAIN_MANIFEST="cloud/cloud_manifests/cloud.manifest.train"
DEV_MANIFEST="cloud/cloud_manifests/cloud.manifest.dev"
CLOUD_MODEL_DIR="./checkpoints"
BATCH_SIZE=512
NUM_GPU=8
NUM_NODE=1
IS_LOCAL="True"
JOB_NAME=deepspeech-`date +%Y%m%d%H%M%S`
DS2_PATH=${PWD%/*}
cp -f pcloud_train.sh ${DS2_PATH}
paddlecloud submit \
-image bootstrapper:5000/paddlepaddle/pcloud_ds2:latest \
-jobname ${JOB_NAME} \
-cpu ${NUM_GPU} \
-gpu ${NUM_GPU} \
-memory 64Gi \
-parallelism ${NUM_NODE} \
-pscpu 1 \
-pservers 1 \
-psmemory 64Gi \
-passes 1 \
-entry "sh pcloud_train.sh ${TRAIN_MANIFEST} ${DEV_MANIFEST} ${CLOUD_MODEL_DIR} ${NUM_GPU} ${BATCH_SIZE} ${IS_LOCAL}" \
${DS2_PATH}
rm ${DS2_PATH}/pcloud_train.sh
#! /usr/bin/env bash
TRAIN_MANIFEST=$1
DEV_MANIFEST=$2
MODEL_PATH=$3
NUM_GPU=$4
BATCH_SIZE=$5
IS_LOCAL=$6
python ./cloud/split_data.py \
--in_manifest_path=${TRAIN_MANIFEST} \
--out_manifest_path='/local.manifest.train'
python ./cloud/split_data.py \
--in_manifest_path=${DEV_MANIFEST} \
--out_manifest_path='/local.manifest.dev'
mkdir ./logs
python -u train.py \
--batch_size=${BATCH_SIZE} \
--trainer_count=${NUM_GPU} \
--num_passes=200 \
--num_proc_data=${NUM_GPU} \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
--num_iter_print=100 \
--learning_rate=5e-4 \
--max_duration=27.0 \
--min_duration=0.0 \
--use_sortagrad=True \
--use_gru=False \
--use_gpu=True \
--is_local=${IS_LOCAL} \
--share_rnn_weights=True \
--train_manifest='/local.manifest.train' \
--dev_manifest='/local.manifest.dev' \
--mean_std_path='data/librispeech/mean_std.npz' \
--vocab_path='data/librispeech/vocab.txt' \
--output_model_dir='./checkpoints' \
--output_model_dir=${MODEL_PATH} \
--augment_conf_path='conf/augmentation.config' \
--specgram_type='linear' \
--shuffle_method='batch_shuffle_clipped' \
2>&1 | tee ./logs/train.log
#! /usr/bin/env bash
mkdir cloud_manifests
IN_MANIFESTS="../data/librispeech/manifest.train ../data/librispeech/manifest.dev-clean ../data/librispeech/manifest.test-clean"
OUT_MANIFESTS="cloud_manifests/cloud.manifest.train cloud_manifests/cloud.manifest.dev cloud_manifests/cloud.manifest.test"
CLOUD_DATA_DIR="/pfs/dlnel/home/USERNAME/deepspeech2/data/librispeech"
NUM_SHARDS=50
python upload_data.py \
--in_manifest_paths ${IN_MANIFESTS} \
--out_manifest_paths ${OUT_MANIFESTS} \
--cloud_data_dir ${CLOUD_DATA_DIR} \
--num_shards ${NUM_SHARDS}
if [ $? -ne 0 ]
then
echo "Upload Data Failed!"
exit 1
fi
echo "All Done."
"""This tool is used for splitting data into each node of
paddlecloud. This script should be called in paddlecloud.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import json
import argparse
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--in_manifest_path",
type=str,
required=True,
help="Input manifest path for all nodes.")
parser.add_argument(
"--out_manifest_path",
type=str,
required=True,
help="Output manifest file path for current node.")
args = parser.parse_args()
def split_data(in_manifest_path, out_manifest_path):
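    # PaddleCloud writes this node's id and the total node count to the
    # files /trainer_id and /trainer_count; each node then keeps every
    # trainer_count-th manifest line (round-robin sharding by line index).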
with open("/trainer_id", "r") as f:
trainer_id = int(f.readline()[:-1])
with open("/trainer_count", "r") as f:
trainer_count = int(f.readline()[:-1])
out_manifest = []
for index, json_line in enumerate(open(in_manifest_path, 'r')):
if (index % trainer_count) == trainer_id:
out_manifest.append("%s\n" % json_line.strip())
with open(out_manifest_path, 'w') as f:
f.writelines(out_manifest)
if __name__ == '__main__':
split_data(args.in_manifest_path, args.out_manifest_path)
"""This script is for uploading data for DeepSpeech2 training on paddlecloud.
Steps:
1. Read original manifests and extract local sound files.
2. Tar all local sound files into multiple tar files and upload them.
3. Modify original manifests with updated paths in cloud filesystem.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import json
import os
import tarfile
import sys
import argparse
import shutil
from subprocess import call
import _init_paths
from data_utils.utils import read_manifest
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--in_manifest_paths",
default=[
"../datasets/manifest.train", "../datasets/manifest.dev",
"../datasets/manifest.test"
],
type=str,
nargs='+',
help="Local filepaths of input manifests to load, pack and upload."
"(default: %(default)s)")
parser.add_argument(
"--out_manifest_paths",
default=[
"./cloud.manifest.train", "./cloud.manifest.dev",
"./cloud.manifest.test"
],
type=str,
nargs='+',
help="Local filepaths of modified manifests to write to. "
"(default: %(default)s)")
parser.add_argument(
"--cloud_data_dir",
required=True,
type=str,
help="Destination directory on paddlecloud to upload data to.")
parser.add_argument(
"--num_shards",
default=10,
type=int,
help="Number of parts to split data to. (default: %(default)s)")
parser.add_argument(
"--local_tmp_dir",
default="./tmp/",
type=str,
help="Local directory for storing temporary data. (default: %(default)s)")
args = parser.parse_args()
def upload_data(in_manifest_path_list, out_manifest_path_list, local_tmp_dir,
upload_tar_dir, num_shards):
"""Extract and pack sound files listed in the manifest files into multple
tar files and upload them to padldecloud. Besides, generate new manifest
files with updated paths in paddlecloud.
"""
# compute total audio number
total_line = 0
for manifest_path in in_manifest_path_list:
with open(manifest_path, 'r') as f:
total_line += len(f.readlines())
line_per_tar = (total_line // num_shards) + 1
# pack and upload shard by shard
line_count, tar_file = 0, None
for manifest_path, out_manifest_path in zip(in_manifest_path_list,
out_manifest_path_list):
manifest = read_manifest(manifest_path)
out_manifest = []
for json_data in manifest:
sound_filepath = json_data['audio_filepath']
sound_filename = os.path.basename(sound_filepath)
if line_count % line_per_tar == 0:
                if tar_file is not None:
tar_file.close()
pcloud_cp(tar_path, upload_tar_dir)
os.remove(tar_path)
tar_name = 'part-%s-of-%s.tar' % (
str(line_count // line_per_tar).zfill(5),
str(num_shards).zfill(5))
tar_path = os.path.join(local_tmp_dir, tar_name)
tar_file = tarfile.open(tar_path, 'w')
tar_file.add(sound_filepath, arcname=sound_filename)
line_count += 1
json_data['audio_filepath'] = "tar:%s#%s" % (
os.path.join(upload_tar_dir, tar_name), sound_filename)
out_manifest.append("%s\n" % json.dumps(json_data))
with open(out_manifest_path, 'w') as f:
f.writelines(out_manifest)
pcloud_cp(out_manifest_path, upload_tar_dir)
tar_file.close()
pcloud_cp(tar_path, upload_tar_dir)
os.remove(tar_path)
def pcloud_mkdir(dir):
"""Make directory in PaddleCloud filesystem.
"""
if call(['paddlecloud', 'mkdir', dir]) != 0:
raise IOError("PaddleCloud mkdir failed: %s." % dir)
def pcloud_cp(src, dst):
"""Copy src from local filesytem to dst in PaddleCloud filesystem,
or downlowd src from PaddleCloud filesystem to dst in local filesystem.
"""
if call(['paddlecloud', 'cp', src, dst]) != 0:
raise IOError("PaddleCloud cp failed: from [%s] to [%s]." % (src, dst))
if __name__ == '__main__':
if not os.path.exists(args.local_tmp_dir):
os.makedirs(args.local_tmp_dir)
pcloud_mkdir(args.cloud_data_dir)
upload_data(args.in_manifest_paths, args.out_manifest_paths,
args.local_tmp_dir, args.cloud_data_dir, args.num_shards)
shutil.rmtree(args.local_tmp_dir)
......@@ -16,6 +16,7 @@ import argparse
import soundfile
import json
import codecs
import io
from data_utils.utility import download, unpack
URL_ROOT = "http://www.openslr.org/resources/12"
......@@ -68,12 +69,11 @@ def create_manifest(data_dir, manifest_path):
filename for filename in filelist if filename.endswith('trans.txt')
]
if len(text_filelist) > 0:
text_filepath = os.path.join(data_dir, subfolder, text_filelist[0])
for line in open(text_filepath):
text_filepath = os.path.join(subfolder, text_filelist[0])
for line in io.open(text_filepath, encoding="utf8"):
segments = line.strip().split()
text = ' '.join(segments[1:]).lower()
audio_filepath = os.path.join(data_dir, subfolder,
segments[0] + '.flac')
audio_filepath = os.path.join(subfolder, segments[0] + '.flac')
audio_data, samplerate = soundfile.read(audio_filepath)
duration = float(len(audio_data)) / samplerate
json_lines.append(
......
......@@ -16,6 +16,7 @@ import zipfile
import argparse
import soundfile
import json
import io
from paddle.v2.dataset.common import md5file
DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
......@@ -88,7 +89,7 @@ def create_manifest(data_dir, manifest_path):
'duration': duration,
'text': ''
}))
with open(manifest_path, 'w') as out_file:
with io.open(manifest_path, mode='w', encoding='utf8') as out_file:
for line in json_lines:
out_file.write(line + '\n')
......
......@@ -3,7 +3,7 @@
# download data, generate manifests
PYTHONPATH=../../:$PYTHONPATH python voxforge.py \
--manifest_prefix='./manifest' \
--target_dir='~/.cache/paddle/dataset/speech/VoxForge' \
--target_dir='./dataset/VoxForge' \
--is_merge_dialect=True \
--dialects 'american' 'british' 'australian' 'european' 'irish' 'canadian' 'indian'
......
......@@ -18,7 +18,7 @@ import shutil
import subprocess
from data_utils.utility import download_multi, unpack, getfile_insensitive
DATA_HOME = '~/.cache/paddle/dataset/speech'
DATA_HOME = './dataset'
DATA_URL = 'http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/' \
'Audio/Main/16kHz_16bit'
......
......@@ -12,6 +12,7 @@ import resampy
from scipy import signal
import random
import copy
import io
class AudioSegment(object):
......@@ -154,7 +155,7 @@ class AudioSegment(object):
fileno = int(matches.group(2))
# read headers
f = open(filename, 'rb')
        f = io.open(filename, mode='rb')  # binary mode takes no encoding argument
version = f.read(4)
num_utterances = struct.unpack("i", f.read(4))[0]
bytes_per_header = struct.unpack("i", f.read(4))[0]
......
......@@ -9,10 +9,9 @@ import random
import tarfile
import multiprocessing
import numpy as np
import paddle.v2 as paddle
import paddle.fluid as fluid
from threading import local
from data_utils.utility import read_manifest
from data_utils.utility import xmap_readers_mp
from data_utils.augmentor.augmentation import AugmentationPipeline
from data_utils.featurizer.speech_featurizer import SpeechFeaturizer
from data_utils.speech import SpeechSegment
......@@ -51,14 +50,17 @@ class DataGenerator(object):
:param use_dB_normalization: Whether to normalize the audio to -20 dB
before extracting the features.
:type use_dB_normalization: bool
:param num_threads: Number of CPU threads for processing data.
:type num_threads: int
:param random_seed: Random seed.
:type random_seed: int
:param keep_transcription_text: If set to True, transcription text will
be passed forward directly without
converting to index sequence.
:type keep_transcription_text: bool
:param place: The place to run the program.
:type place: CPU or GPU
:param is_training: If set to True, generate text data for training,
                        otherwise, generate text data for inference.
:type is_training: bool
"""
def __init__(self,
......@@ -72,9 +74,10 @@ class DataGenerator(object):
max_freq=None,
specgram_type='linear',
use_dB_normalization=True,
num_threads=multiprocessing.cpu_count() // 2,
random_seed=0,
keep_transcription_text=False):
keep_transcription_text=False,
place=fluid.CPUPlace(),
is_training=True):
self._max_duration = max_duration
self._min_duration = min_duration
self._normalizer = FeatureNormalizer(mean_std_filepath)
......@@ -87,14 +90,15 @@ class DataGenerator(object):
window_ms=window_ms,
max_freq=max_freq,
use_dB_normalization=use_dB_normalization)
self._num_threads = num_threads
self._rng = random.Random(random_seed)
self._keep_transcription_text = keep_transcription_text
self._epoch = 0
self._is_training = is_training
# for caching tar files info
self._local_data = local()
self._local_data.tar2info = {}
self._local_data.tar2object = {}
self._place = place
def process_utterance(self, audio_file, transcript):
"""Load, augment, featurize and normalize for speech data.
......@@ -121,7 +125,6 @@ class DataGenerator(object):
def batch_reader_creator(self,
manifest_path,
batch_size,
min_batch_size=1,
padding_to=-1,
flatten=False,
sortagrad=False,
......@@ -137,9 +140,6 @@ class DataGenerator(object):
:type manifest_path: basestring
:param batch_size: Number of instances in a batch.
:type batch_size: int
:param min_batch_size: Any batch with batch size smaller than this will
be discarded. (To be deprecated in the future.)
:type min_batch_size: int
        :param padding_to: If set to -1, the maximum shape in the batch
will be used as the target shape for padding.
Otherwise, `padding_to` will be the target shape.
......@@ -178,6 +178,7 @@ class DataGenerator(object):
# sort (by duration) or batch-wise shuffle the manifest
if self._epoch == 0 and sortagrad:
manifest.sort(key=lambda x: x["duration"])
else:
if shuffle_method == "batch_shuffle":
manifest = self._batch_shuffle(
......@@ -193,18 +194,16 @@ class DataGenerator(object):
raise ValueError("Unknown shuffle method %s." %
shuffle_method)
# prepare batches
instance_reader, cleanup = self._instance_reader_creator(manifest)
batch = []
try:
for instance in instance_reader():
batch.append(instance)
if len(batch) == batch_size:
yield self._padding_batch(batch, padding_to, flatten)
batch = []
if len(batch) >= min_batch_size:
instance_reader = self._instance_reader_creator(manifest)
for instance in instance_reader():
batch.append(instance)
if len(batch) == batch_size:
yield self._padding_batch(batch, padding_to, flatten)
finally:
cleanup()
batch = []
if len(batch) >= 1:
yield self._padding_batch(batch, padding_to, flatten)
self._epoch += 1
return batch_reader
......@@ -276,13 +275,11 @@ class DataGenerator(object):
def reader():
for instance in manifest:
yield instance
inst = self.process_utterance(instance["audio_filepath"],
instance["text"]),
yield inst[0]
reader, cleanup_callback = xmap_readers_mp(
lambda instance: self.process_utterance(instance["audio_filepath"], instance["text"]),
reader, self._num_threads, 4096)
return reader, cleanup_callback
return reader
def _padding_batch(self, batch, padding_to=-1, flatten=False):
"""
......@@ -304,14 +301,43 @@ class DataGenerator(object):
"than any instance's shape in the batch")
max_length = padding_to
# padding
padded_audios = []
texts, text_lens = [], []
audio_lens = []
masks = []
for audio, text in batch:
padded_audio = np.zeros([audio.shape[0], max_length])
padded_audio[:, :audio.shape[1]] = audio
if flatten:
padded_audio = padded_audio.flatten()
padded_instance = [padded_audio, text, audio.shape[1]]
new_batch.append(padded_instance)
return new_batch
padded_audios.append(padded_audio)
if self._is_training:
texts += text
else:
texts.append(text)
text_lens.append(len(text))
audio_lens.append(audio.shape[1])
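            # Build a 0/1 mask matching the conv front-end's output shape:
            # the frequency dim is downsampled as (freq - 1) // 2 + 1, the
            # time dim as (time - 1) // 3 + 1, tiled over 32 channels; time
            # steps beyond the true audio length are zeroed.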
mask_shape0 = (audio.shape[0] - 1) // 2 + 1
mask_shape1 = (audio.shape[1] - 1) // 3 + 1
mask_max_len = (max_length - 1) // 3 + 1
mask_ones = np.ones((mask_shape0, mask_shape1))
mask_zeros = np.zeros((mask_shape0, mask_max_len - mask_shape1))
mask = np.repeat(
np.reshape(
np.concatenate((mask_ones, mask_zeros), axis=1),
(1, mask_shape0, mask_max_len)),
32,
axis=0)
masks.append(mask)
padded_audios = np.array(padded_audios).astype('float32')
if self._is_training:
texts = fluid.create_lod_tensor(
np.array(texts).astype('int32'),
recursive_seq_lens=[text_lens],
place=self._place)
audio_lens = np.array(audio_lens).astype('int64').reshape([-1, 1])
masks = np.array(masks).astype('float32')
return padded_audios, texts, audio_lens, masks
def _batch_shuffle(self, manifest, batch_size, clipped=False):
"""Put similarly-sized instances into minibatches for better efficiency
......
......@@ -11,7 +11,7 @@ import time
from Queue import Queue
from threading import Thread
from multiprocessing import Process, Manager, Value
from paddle.v2.dataset.common import md5file
from paddle.dataset.common import md5file
def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0):
......@@ -88,127 +88,3 @@ def unpack(filepath, target_dir, rm_tar=False):
class XmapEndSignal():
pass
def xmap_readers_mp(mapper, reader, process_num, buffer_size, order=False):
"""A multiprocessing pipeline wrapper for the data reader.
:param mapper: Function to map sample.
:type mapper: callable
:param reader: Given data reader.
:type reader: callable
:param process_num: Number of processes in the pipeline
:type process_num: int
:param buffer_size: Maximal buffer size.
:type buffer_size: int
    :return: The wrapped reader and cleanup callback
:rtype: tuple
"""
end_flag = XmapEndSignal()
read_workers = []
handle_workers = []
flush_workers = []
read_exit_flag = Value('i', 0)
handle_exit_flag = Value('i', 0)
flush_exit_flag = Value('i', 0)
# define a worker to read samples from reader to in_queue with order flag
def order_read_worker(reader, in_queue):
for order_id, sample in enumerate(reader()):
if read_exit_flag.value == 1: break
in_queue.put((order_id, sample))
in_queue.put(end_flag)
# the reading worker should not exit until all handling work exited
while handle_exit_flag.value == 0 or read_exit_flag.value == 0:
time.sleep(0.001)
# define a worker to handle samples from in_queue by mapper and put results
# to out_queue with order
def order_handle_worker(in_queue, out_queue, mapper, out_order):
ins = in_queue.get()
while not isinstance(ins, XmapEndSignal):
if handle_exit_flag.value == 1: break
order_id, sample = ins
result = mapper(sample)
while order_id != out_order[0]:
time.sleep(0.001)
out_queue.put(result)
out_order[0] += 1
ins = in_queue.get()
in_queue.put(end_flag)
out_queue.put(end_flag)
# wait for exit of flushing worker
while flush_exit_flag.value == 0 or handle_exit_flag.value == 0:
time.sleep(0.001)
read_exit_flag.value = 1
handle_exit_flag.value = 1
# define a thread worker to flush samples from Manager.Queue to Queue
# for acceleration
def flush_worker(in_queue, out_queue):
finish = 0
while finish < process_num and flush_exit_flag.value == 0:
sample = in_queue.get()
if isinstance(sample, XmapEndSignal):
finish += 1
else:
out_queue.put(sample)
out_queue.put(end_flag)
handle_exit_flag.value = 1
flush_exit_flag.value = 1
def cleanup():
# first exit flushing workers
flush_exit_flag.value = 1
for w in flush_workers:
w.join()
# next exit handling workers
handle_exit_flag.value = 1
for w in handle_workers:
w.join()
# last exit reading workers
read_exit_flag.value = 1
for w in read_workers:
w.join()
def xreader():
# prepare shared memory
manager = Manager()
in_queue = manager.Queue(buffer_size)
out_queue = manager.Queue(buffer_size)
out_order = manager.list([0])
# start a read worker in a process
target = order_read_worker
p = Process(target=target, args=(reader, in_queue))
p.daemon = True
p.start()
read_workers.append(p)
# start handle_workers with multiple processes
target = order_handle_worker
args = (in_queue, out_queue, mapper, out_order)
workers = [
Process(target=target, args=args) for _ in xrange(process_num)
]
for w in workers:
w.daemon = True
w.start()
handle_workers.append(w)
# start a thread to read data from slow Manager.Queue
flush_queue = Queue(buffer_size)
t = Thread(target=flush_worker, args=(out_queue, flush_queue))
t.daemon = True
t.start()
flush_workers.append(t)
# get results
sample = flush_queue.get()
while not isinstance(sample, XmapEndSignal):
yield sample
sample = flush_queue.get()
return xreader, cleanup
......@@ -102,7 +102,7 @@ def ctc_beam_search_decoder(probs_seq,
probs_b_prev, probs_nb_prev = {'\t': 1.0}, {'\t': 0.0}
## extend prefix in loop
for time_step in xrange(len(probs_seq)):
for time_step in range(len(probs_seq)):
# prefix_set_next: the set containing candidate prefixes
# probs_b_cur: prefixes' probability ending with blank in current step
# probs_nb_cur: prefixes' probability ending with non-blank in current step
......@@ -114,7 +114,7 @@ def ctc_beam_search_decoder(probs_seq,
if cutoff_prob < 1.0 or cutoff_top_n < cutoff_len:
prob_idx = sorted(prob_idx, key=lambda asd: asd[1], reverse=True)
cutoff_len, cum_prob = 0, 0.0
for i in xrange(len(prob_idx)):
for i in range(len(prob_idx)):
cum_prob += prob_idx[i][1]
cutoff_len += 1
if cum_prob >= cutoff_prob:
......@@ -127,7 +127,7 @@ def ctc_beam_search_decoder(probs_seq,
probs_b_cur[l], probs_nb_cur[l] = 0.0, 0.0
        # extend prefix by traversing prob_idx
for index in xrange(cutoff_len):
for index in range(cutoff_len):
c, prob_c = prob_idx[index][0], prob_idx[index][1]
if c == blank_id:
......
"""Client-end for the ASR demo."""
from pynput import keyboard
import keyboard
import struct
import socket
import sys
......@@ -23,22 +23,17 @@ is_recording = False
enable_trigger_record = True
def on_press(key):
"""On-press keyboard callback function."""
def on_press_release(x):
"""Keyboard callback function."""
global is_recording, enable_trigger_record
if key == keyboard.Key.space:
press = keyboard.KeyboardEvent('down', 28, 'space')
release = keyboard.KeyboardEvent('up', 28, 'space')
if x.event_type == 'down' and x.name == press.name:
if (not is_recording) and enable_trigger_record:
sys.stdout.write("Start Recording ... ")
sys.stdout.flush()
is_recording = True
def on_release(key):
"""On-release keyboard callback function."""
global is_recording, enable_trigger_record
if key == keyboard.Key.esc:
return False
elif key == keyboard.Key.space:
if x.event_type == 'up' and x.name == release.name:
if is_recording == True:
is_recording = False
......@@ -80,9 +75,10 @@ def main():
stream.start_stream()
# prepare keyboard listener
with keyboard.Listener(
on_press=on_press, on_release=on_release) as listener:
listener.join()
while (1):
keyboard.hook(on_press_release)
if keyboard.record('esc'):
break
# close up
stream.stop_stream()
......
......@@ -8,7 +8,8 @@ from time import gmtime, strftime
import SocketServer
import struct
import wave
import paddle.v2 as paddle
import paddle.fluid as fluid
import numpy as np
import _init_paths
from data_utils.data import DataGenerator
from model_utils.model import DeepSpeech2Model
......@@ -141,13 +142,19 @@ def warm_up_test(audio_process_handler,
def start_server():
"""Start the ASR server"""
# prepare data generator
if args.use_gpu:
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
data_generator = DataGenerator(
vocab_filepath=args.vocab_path,
mean_std_filepath=args.mean_std_path,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=1,
keep_transcription_text=True)
keep_transcription_text=True,
place = place,
is_training = False)
# prepare ASR model
ds2_model = DeepSpeech2Model(
vocab_size=data_generator.vocab_size,
......@@ -155,7 +162,8 @@ def start_server():
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
use_gru=args.use_gru,
pretrained_model_path=args.model_path,
init_from_pretrain_model=args.model_path,
place=place,
share_rnn_weights=args.share_rnn_weights)
vocab_list = [chars.encode("utf-8") for chars in data_generator.vocab_list]
......@@ -166,8 +174,24 @@ def start_server():
# prepare ASR inference handler
def file_to_transcript(filename):
feature = data_generator.process_utterance(filename, "")
audio_len = feature[0].shape[1]
mask_shape0 = (feature[0].shape[0] - 1) // 2 + 1
mask_shape1 = (feature[0].shape[1] - 1) // 3 + 1
mask_max_len = (audio_len - 1) // 3 + 1
mask_ones = np.ones((mask_shape0, mask_shape1))
mask_zeros = np.zeros((mask_shape0, mask_max_len - mask_shape1))
mask = np.repeat(
np.reshape(
np.concatenate((mask_ones, mask_zeros), axis=1),
(1, mask_shape0, mask_max_len)),
32,
axis=0)
feature = (np.array([feature[0]]).astype('float32'),
None,
np.array([audio_len]).astype('int64').reshape([-1,1]),
np.array([mask]).astype('float32'))
probs_split = ds2_model.infer_batch_probs(
infer_data=[feature],
infer_data=feature,
feeding_dict=data_generator.feeding)
if args.decoding_method == "ctc_greedy":
......@@ -207,7 +231,6 @@ def start_server():
def main():
print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=1)
start_server()
......
docs/images/multi_gpu_speedup.png (image updated: 153.1 KB → 206.5 KB)
......@@ -5,7 +5,7 @@ cd ../.. > /dev/null
# download data, generate manifests
PYTHONPATH=.:$PYTHONPATH python data/aishell/aishell.py \
--manifest_prefix='data/aishell/manifest' \
--target_dir='~/.cache/paddle/dataset/speech/Aishell'
--target_dir='./dataset/aishell'
if [ $? -ne 0 ]; then
echo "Prepare Aishell failed. Terminated."
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_ch.sh
bash download_lm_ch.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -15,7 +15,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=300 \
--num_proc_bsearch=8 \
--num_conv_layers=2 \
......@@ -31,7 +30,7 @@ python -u infer.py \
--infer_manifest='data/aishell/manifest.test' \
--mean_std_path='data/aishell/mean_std.npz' \
--vocab_path='data/aishell/vocab.txt' \
--model_path='checkpoints/aishell/params.latest.tar.gz' \
--model_path='checkpoints/aishell/step_final' \
--lang_model_path='models/lm/zh_giga.no_cna_cmn.prune01244.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='cer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_ch.sh
bash download_lm_ch.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/aishell > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -24,7 +24,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=300 \
--num_proc_bsearch=8 \
--num_conv_layers=2 \
......@@ -35,12 +34,12 @@ python -u infer.py \
--cutoff_prob=0.99 \
--cutoff_top_n=40 \
--use_gru=True \
--use_gpu=True \
--use_gpu=False \
--share_rnn_weights=False \
--infer_manifest='data/aishell/manifest.test' \
--mean_std_path='models/aishell/mean_std.npz' \
--vocab_path='models/aishell/vocab.txt' \
--model_path='models/aishell/params.tar.gz' \
--model_path='models/aishell' \
--lang_model_path='models/lm/zh_giga.no_cna_cmn.prune01244.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='cer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_ch.sh
bash download_lm_ch.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -15,10 +15,8 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u test.py \
--batch_size=128 \
--trainer_count=8 \
--beam_size=300 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=1024 \
......@@ -32,7 +30,7 @@ python -u test.py \
--test_manifest='data/aishell/manifest.test' \
--mean_std_path='data/aishell/mean_std.npz' \
--vocab_path='data/aishell/vocab.txt' \
--model_path='checkpoints/aishell/params.latest.tar.gz' \
--model_path='checkpoints/aishell/step_final' \
--lang_model_path='models/lm/zh_giga.no_cna_cmn.prune01244.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='cer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_ch.sh
bash download_lm_ch.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/aishell > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -24,10 +24,8 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u test.py \
--batch_size=128 \
--trainer_count=8 \
--beam_size=300 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=1024 \
......@@ -41,7 +39,7 @@ python -u test.py \
--test_manifest='data/aishell/manifest.test' \
--mean_std_path='models/aishell/mean_std.npz' \
--vocab_path='models/aishell/vocab.txt' \
--model_path='models/aishell/params.tar.gz' \
--model_path='models/aishell' \
--lang_model_path='models/lm/zh_giga.no_cna_cmn.prune01244.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='cer' \
......
......@@ -3,17 +3,18 @@
cd ../.. > /dev/null
# train model
# if you wish to resume from an existing model, uncomment --init_model_path
# if you wish to resume from an existing model, uncomment --init_from_pretrain_model
export FLAGS_sync_nccl_allreduce=0
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u train.py \
--batch_size=64 \
--trainer_count=8 \
--num_passes=50 \
--num_proc_data=16 \
--num_epoch=50 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=1024 \
--num_iter_print=100 \
--save_epoch=1 \
--num_samples=120000 \
--learning_rate=5e-4 \
--max_duration=27.0 \
--min_duration=0.0 \
......@@ -30,7 +31,7 @@ python -u train.py \
--output_model_dir='./checkpoints/aishell' \
--augment_conf_path='conf/augmentation.config' \
--specgram_type='linear' \
--shuffle_method='batch_shuffle_clipped'
--shuffle_method='batch_shuffle_clipped' \
if [ $? -ne 0 ]; then
echo "Failed in training!"
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/baidu_en8k > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -24,7 +24,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=500 \
--num_proc_bsearch=5 \
--num_conv_layers=2 \
......@@ -35,12 +34,12 @@ python -u infer.py \
--cutoff_prob=1.0 \
--cutoff_top_n=40 \
--use_gru=True \
--use_gpu=True \
--use_gpu=False \
--share_rnn_weights=False \
--infer_manifest='data/librispeech/manifest.test-clean' \
--mean_std_path='models/baidu_en8k/mean_std.npz' \
--vocab_path='models/baidu_en8k/vocab.txt' \
--model_path='models/baidu_en8k/params.tar.gz' \
--model_path='models/baidu_en8k' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/baidu_en8k > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -24,7 +24,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -u test.py \
--batch_size=128 \
--trainer_count=4 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
......@@ -36,12 +35,12 @@ python -u test.py \
--cutoff_prob=1.0 \
--cutoff_top_n=40 \
--use_gru=True \
--use_gpu=True \
--use_gpu=False \
--share_rnn_weights=False \
--test_manifest='data/librispeech/manifest.test-clean' \
--mean_std_path='models/baidu_en8k/mean_std.npz' \
--vocab_path='models/baidu_en8k/vocab.txt' \
--model_path='models/baidu_en8k/params.tar.gz' \
--model_path='models/baidu_en8k' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -5,7 +5,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -14,7 +14,7 @@ cd - > /dev/null
# download well-trained model
cd models/baidu_en8k > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -40,7 +40,7 @@ python -u deploy/demo_server.py \
--warmup_manifest='data/tiny/manifest.test-clean' \
--mean_std_path='models/baidu_en8k/mean_std.npz' \
--vocab_path='models/baidu_en8k/vocab.txt' \
--model_path='models/baidu_en8k/params.tar.gz' \
--model_path='models/baidu_en8k' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--specgram_type='linear'
......
......@@ -5,7 +5,7 @@ cd ../.. > /dev/null
# download data, generate manifests
PYTHONPATH=.:$PYTHONPATH python data/librispeech/librispeech.py \
--manifest_prefix='data/librispeech/manifest' \
--target_dir='~/.cache/paddle/dataset/speech/libri' \
--target_dir='./dataset/librispeech' \
--full_download='True'
if [ $? -ne 0 ]; then
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -15,7 +15,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_conv_layers=2 \
......@@ -31,7 +30,7 @@ python -u infer.py \
--infer_manifest='data/librispeech/manifest.test-clean' \
--mean_std_path='data/librispeech/mean_std.npz' \
--vocab_path='data/librispeech/vocab.txt' \
--model_path='checkpoints/libri/params.latest.tar.gz' \
--model_path='checkpoints/libri/step_final' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/librispeech > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -24,7 +24,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_conv_layers=2 \
......@@ -40,7 +39,7 @@ python -u infer.py \
--infer_manifest='data/librispeech/manifest.test-clean' \
--mean_std_path='models/librispeech/mean_std.npz' \
--vocab_path='models/librispeech/vocab.txt' \
--model_path='models/librispeech/params.tar.gz' \
--model_path='models/librispeech' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -15,10 +15,8 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u test.py \
--batch_size=128 \
--trainer_count=8 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
......@@ -32,7 +30,7 @@ python -u test.py \
--test_manifest='data/librispeech/manifest.test-clean' \
--mean_std_path='data/librispeech/mean_std.npz' \
--vocab_path='data/librispeech/vocab.txt' \
--model_path='checkpoints/libri/params.latest.tar.gz' \
--model_path='checkpoints/libri/step_final' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/librispeech > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -24,10 +24,8 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u test.py \
--batch_size=128 \
--trainer_count=8 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
......@@ -41,7 +39,7 @@ python -u test.py \
--test_manifest='data/librispeech/manifest.test-clean' \
--mean_std_path='models/librispeech/mean_std.npz' \
--vocab_path='models/librispeech/vocab.txt' \
--model_path='models/librispeech/params.tar.gz' \
--model_path='models/librispeech' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -3,17 +3,19 @@
cd ../.. > /dev/null
# train model
# if you wish to resume from an existing model, uncomment --init_model_path
# if you wish to resume from an existing model, uncomment --init_from_pretrain_model
export FLAGS_sync_nccl_allreduce=0
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u train.py \
--batch_size=160 \
--trainer_count=8 \
--num_passes=50 \
--num_proc_data=16 \
--batch_size=20 \
--num_epoch=50 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
--num_iter_print=100 \
--save_epoch=1 \
--num_samples=280000 \
--learning_rate=5e-4 \
--max_duration=27.0 \
--min_duration=0.0 \
......@@ -30,7 +32,7 @@ python -u train.py \
--output_model_dir='./checkpoints/libri' \
--augment_conf_path='conf/augmentation.config' \
--specgram_type='linear' \
--shuffle_method='batch_shuffle_clipped'
if [ $? -ne 0 ]; then
echo "Failed in training!"
......
......@@ -7,7 +7,6 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -u tools/tune.py \
--num_batches=-1 \
--batch_size=128 \
--trainer_count=4 \
--beam_size=500 \
--num_proc_bsearch=12 \
--num_conv_layers=2 \
......@@ -27,7 +26,7 @@ python -u tools/tune.py \
--tune_manifest='data/librispeech/manifest.dev-clean' \
--mean_std_path='data/librispeech/mean_std.npz' \
--vocab_path='models/librispeech/vocab.txt' \
--model_path='models/librispeech/params.tar.gz' \
--model_path='models/librispeech' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--error_rate_type='wer' \
--specgram_type='linear'
......
......@@ -7,11 +7,10 @@ if [ ! -e data/tiny ]; then
mkdir data/tiny
fi
# download data, generate manifests
PYTHONPATH=.:$PYTHONPATH python data/librispeech/librispeech.py \
--manifest_prefix='data/tiny/manifest' \
--target_dir='~/.cache/paddle/dataset/speech/libri' \
--target_dir='./dataset/librispeech' \
--full_download='False'
if [ $? -ne 0 ]; then
......@@ -21,12 +20,11 @@ fi
head -n 64 data/tiny/manifest.dev-clean > data/tiny/manifest.tiny
# build vocabulary
python tools/build_vocab.py \
--count_threshold=0 \
--vocab_path='data/tiny/vocab.txt' \
--manifest_paths='data/tiny/manifest.dev-clean'
--manifest_paths='data/tiny/manifest.tiny'
if [ $? -ne 0 ]; then
echo "Build vocabulary failed. Terminated."
......@@ -47,5 +45,5 @@ if [ $? -ne 0 ]; then
fi
echo "Tiny data preparation done."
echo "LibriSpeech Data preparation done."
exit 0
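
For context: with the flags above, `tools/build_vocab.py` counts characters across the listed manifests and writes one vocabulary entry per line, keeping characters whose count exceeds `count_threshold` (0 keeps everything). A hedged sketch of that logic, assuming the JSON-lines manifest format used elsewhere in this repo; `build_vocab` below is illustrative, not the script's actual code:

```python
import json
from collections import Counter

def build_vocab(manifest_paths, vocab_path, count_threshold=0):
    # count every character in every transcript across all manifests
    counter = Counter()
    for path in manifest_paths:
        with open(path) as f:
            for line in f:
                counter.update(json.loads(line)["text"])
    # one character per line, rare characters filtered out
    with open(vocab_path, 'w') as f:
        for char, count in sorted(counter.items()):
            if count > count_threshold:
                f.write(char + '\n')
```
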
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -15,7 +15,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_conv_layers=2 \
......@@ -28,10 +27,10 @@ python -u infer.py \
--use_gru=False \
--use_gpu=True \
--share_rnn_weights=True \
--infer_manifest='data/tiny/manifest.tiny' \
--infer_manifest='data/tiny/manifest.test-clean' \
--mean_std_path='data/tiny/mean_std.npz' \
--vocab_path='data/tiny/vocab.txt' \
--model_path='checkpoints/tiny/params.pass-19.tar.gz' \
--model_path='./checkpoints/tiny/step_final' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/librispeech > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -24,7 +24,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_conv_layers=2 \
......@@ -40,7 +39,7 @@ python -u infer.py \
--infer_manifest='data/tiny/manifest.test-clean' \
--mean_std_path='models/librispeech/mean_std.npz' \
--vocab_path='models/librispeech/vocab.txt' \
--model_path='models/librispeech/params.tar.gz' \
--model_path='models/librispeech' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -14,11 +14,9 @@ cd - > /dev/null
# evaluate model
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u test.py \
--batch_size=16 \
--trainer_count=8 \
--batch_size=128 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
......@@ -29,10 +27,10 @@ python -u test.py \
--use_gru=False \
--use_gpu=True \
--share_rnn_weights=True \
--test_manifest='data/tiny/manifest.tiny' \
--test_manifest='data/tiny/manifest.test-clean' \
--mean_std_path='data/tiny/mean_std.npz' \
--vocab_path='data/tiny/vocab.txt' \
--model_path='checkpoints/tiny/params.pass-19.tar.gz' \
--model_path='checkpoints/tiny/step_final' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/librispeech > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
......@@ -24,10 +24,8 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u test.py \
--batch_size=128 \
--trainer_count=8 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
......@@ -41,7 +39,7 @@ python -u test.py \
--test_manifest='data/tiny/manifest.test-clean' \
--mean_std_path='models/librispeech/mean_std.npz' \
--vocab_path='models/librispeech/vocab.txt' \
--model_path='models/librispeech/params.tar.gz' \
--model_path='models/librispeech' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \
......
......@@ -3,17 +3,18 @@
cd ../.. > /dev/null
# train model
# if you wish to resume from an existing model, uncomment --init_model_path
# if you wish to resume from an existing model, uncomment --init_from_pretrain_model
export FLAGS_sync_nccl_allreduce=0
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -u train.py \
--batch_size=16 \
--trainer_count=4 \
--num_passes=20 \
--num_proc_data=1 \
--batch_size=4 \
--num_epoch=20 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
--num_iter_print=100 \
--num_iter_print=1 \
--save_epoch=1 \
--num_samples=64 \
--learning_rate=1e-5 \
--max_duration=27.0 \
--min_duration=0.0 \
......@@ -30,10 +31,10 @@ python -u train.py \
--output_model_dir='./checkpoints/tiny' \
--augment_conf_path='conf/augmentation.config' \
--specgram_type='linear' \
--shuffle_method='batch_shuffle_clipped'
if [ $? -ne 0 ]; then
echo "Fail in training!"
echo "Failed in training!"
exit 1
fi
......
......@@ -3,11 +3,10 @@
cd ../.. > /dev/null
# grid-search for hyper-parameters in language model
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -u tools/tune.py \
--num_batches=1 \
--batch_size=24 \
--trainer_count=8 \
--num_batches=-1 \
--batch_size=128 \
--beam_size=500 \
--num_proc_bsearch=12 \
--num_conv_layers=2 \
......@@ -24,10 +23,10 @@ python -u tools/tune.py \
--use_gru=False \
--use_gpu=True \
--share_rnn_weights=True \
--tune_manifest='data/tiny/manifest.tiny' \
--tune_manifest='data/tiny/manifest.dev-clean' \
--mean_std_path='data/tiny/mean_std.npz' \
--vocab_path='data/tiny/vocab.txt' \
--model_path='checkpoints/params.pass-9.tar.gz' \
--model_path='models/librispeech' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--error_rate_type='wer' \
--specgram_type='linear'
......
......@@ -3,9 +3,13 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import argparse
import functools
import paddle.v2 as paddle
import paddle.fluid as fluid
from data_utils.data import DataGenerator
from model_utils.model import DeepSpeech2Model
from utils.error_rate import wer, cer
......@@ -15,7 +19,6 @@ parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('num_samples', int, 10, "# of samples to infer.")
add_arg('trainer_count', int, 8, "# of Trainers (CPUs or GPUs).")
add_arg('beam_size', int, 500, "Beam search width.")
add_arg('num_proc_bsearch', int, 8, "# of CPUs for beam search.")
add_arg('num_conv_layers', int, 2, "# of convolution layers.")
......@@ -63,20 +66,25 @@ args = parser.parse_args()
def infer():
"""Inference for DeepSpeech2."""
if args.use_gpu:
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
data_generator = DataGenerator(
vocab_filepath=args.vocab_path,
mean_std_filepath=args.mean_std_path,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=1,
keep_transcription_text=True)
keep_transcription_text=True,
place=place,
is_training=False)
batch_reader = data_generator.batch_reader_creator(
manifest_path=args.infer_manifest,
batch_size=args.num_samples,
min_batch_size=1,
sortagrad=False,
shuffle_method=None)
infer_data = batch_reader().next()
infer_data = next(batch_reader())
ds2_model = DeepSpeech2Model(
vocab_size=data_generator.vocab_size,
......@@ -84,16 +92,19 @@ def infer():
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
use_gru=args.use_gru,
pretrained_model_path=args.model_path,
share_rnn_weights=args.share_rnn_weights)
share_rnn_weights=args.share_rnn_weights,
place=place,
init_from_pretrain_model=args.model_path)
# decoders only accept string encoded in utf-8
vocab_list = [chars.encode("utf-8") for chars in data_generator.vocab_list]
if args.decoding_method == "ctc_greedy":
ds2_model.logger.info("start inference ...")
probs_split = ds2_model.infer_batch_probs(infer_data=infer_data,
probs_split = ds2_model.infer_batch_probs(
infer_data=infer_data,
feeding_dict=data_generator.feeding)
result_transcripts = ds2_model.decode_batch_greedy(
probs_split=probs_split,
vocab_list=vocab_list)
......@@ -101,9 +112,11 @@ def infer():
ds2_model.init_ext_scorer(args.alpha, args.beta, args.lang_model_path,
vocab_list)
ds2_model.logger.info("start inference ...")
probs_split = ds2_model.infer_batch_probs(infer_data=infer_data,
probs_split = ds2_model.infer_batch_probs(
infer_data=infer_data,
feeding_dict=data_generator.feeding)
result_transcripts = ds2_model.decode_batch_beam_search(
result_transcripts = ds2_model.decode_batch_beam_search(
probs_split=probs_split,
beam_alpha=args.alpha,
beam_beta=args.beta,
......@@ -114,7 +127,7 @@ def infer():
num_processes=args.num_proc_bsearch)
error_rate_func = cer if args.error_rate_type == 'cer' else wer
target_transcripts = [data[1] for data in infer_data]
target_transcripts = infer_data[1]
for target, result in zip(target_transcripts, result_transcripts):
print("\nTarget Transcription: %s\nOutput Transcription: %s" %
(target, result))
......@@ -125,9 +138,6 @@ def infer():
def main():
print_arguments(args)
paddle.init(use_gpu=args.use_gpu,
rnn_use_batch=True,
trainer_count=args.trainer_count)
infer()
......
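
Across `infer.py`, `test.py`, `tools/tune.py` and `train.py`, this commit replaces the global `paddle.init(use_gpu=..., trainer_count=...)` setup with an explicit fluid `place` that is handed to `DataGenerator` and `DeepSpeech2Model`. A minimal runnable sketch of that device-selection pattern, assuming PaddlePaddle 1.x with the fluid API (`make_place` is an illustrative helper, not part of the repo):

```python
import paddle.fluid as fluid

def make_place(use_gpu):
    # fluid binds executors and feeders to an explicit device "place",
    # so per-process device selection replaces the old trainer_count knob
    return fluid.CUDAPlace(0) if use_gpu else fluid.CPUPlace()

place = make_place(use_gpu=False)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
```
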
This diff is collapsed.
This diff is collapsed.
......@@ -2,9 +2,9 @@
. ../../utils/utility.sh
URL='https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model.tar.gz'
MD5=0ee83aa15fba421e5de8fc66c8feb350
TARGET=./aishell_model.tar.gz
URL='https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model_fluid.tar.gz'
MD5=2bf0cc8b6d5da2a2a787b5cc36a496b5
TARGET=./aishell_model_fluid.tar.gz
echo "Download Aishell model ..."
......
......@@ -2,9 +2,9 @@
. ../../utils/utility.sh
URL='https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model.tar.gz'
MD5=5fe7639e720d51b3c3bdf7a1470c6272
TARGET=./baidu_en8k_model.tar.gz
URL='https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model_fluid.tar.gz'
MD5=7e58fbf64aa4ecf639b049792ddcf788
TARGET=./baidu_en8k_model_fluid.tar.gz
echo "Download BaiduEn8k model ..."
......
......@@ -2,9 +2,9 @@
. ../../utils/utility.sh
URL='https://deepspeech.bj.bcebos.com/eng_models/librispeech_model.tar.gz'
MD5=1f72d0c5591f453362f0caa09dd57618
TARGET=./librispeech_model.tar.gz
URL='https://deepspeech.bj.bcebos.com/eng_models/librispeech_model_fluid.tar.gz'
MD5=fafb11fe57c3ecd107147056453f5348
TARGET=./librispeech_model_fluid.tar.gz
echo "Download LibriSpeech model ..."
......
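
These three download scripts only swap `URL`, `MD5` and `TARGET` to the new fluid model archives; the download-and-verify logic itself lives in `utils/utility.sh`, which this diff does not show. A hedged Python sketch of the checksum step (`md5_matches` is illustrative, not the repo's API):

```python
import hashlib

def md5_matches(path, expected_md5):
    # stream the archive in 1 MiB chunks so large models don't fill RAM
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest() == expected_md5

# e.g. md5_matches('librispeech_model_fluid.tar.gz',
#                  'fafb11fe57c3ecd107147056453f5348')
```
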
......@@ -5,7 +5,7 @@ from __future__ import print_function
import argparse
import functools
import paddle.v2 as paddle
import paddle.fluid as fluid
from data_utils.data import DataGenerator
from model_utils.model import DeepSpeech2Model
from utils.error_rate import char_errors, word_errors
......@@ -15,10 +15,8 @@ parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('batch_size', int, 128, "Minibatch size.")
add_arg('trainer_count', int, 8, "# of Trainers (CPUs or GPUs).")
add_arg('beam_size', int, 500, "Beam search width.")
add_arg('num_proc_bsearch', int, 8, "# of CPUs for beam search.")
add_arg('num_proc_data', int, 8, "# of CPUs for data preprocessing.")
add_arg('num_conv_layers', int, 2, "# of convolution layers.")
add_arg('num_rnn_layers', int, 3, "# of recurrent layers.")
add_arg('rnn_layer_size', int, 2048, "# of recurrent cells per layer.")
......@@ -64,17 +62,22 @@ args = parser.parse_args()
def evaluate():
"""Evaluate on whole test data for DeepSpeech2."""
if args.use_gpu:
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
data_generator = DataGenerator(
vocab_filepath=args.vocab_path,
mean_std_filepath=args.mean_std_path,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=args.num_proc_data,
keep_transcription_text=True)
keep_transcription_text=True,
place=place,
is_training=False)
batch_reader = data_generator.batch_reader_creator(
manifest_path=args.test_manifest,
batch_size=args.batch_size,
min_batch_size=1,
sortagrad=False,
shuffle_method=None)
......@@ -84,8 +87,9 @@ def evaluate():
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
use_gru=args.use_gru,
pretrained_model_path=args.model_path,
share_rnn_weights=args.share_rnn_weights)
share_rnn_weights=args.share_rnn_weights,
place=place,
init_from_pretrain_model=args.model_path)
# decoders only accept string encoded in utf-8
vocab_list = [chars.encode("utf-8") for chars in data_generator.vocab_list]
......@@ -115,7 +119,7 @@ def evaluate():
cutoff_top_n=args.cutoff_top_n,
vocab_list=vocab_list,
num_processes=args.num_proc_bsearch)
target_transcripts = [data[1] for data in infer_data]
target_transcripts = infer_data[1]
for target, result in zip(target_transcripts, result_transcripts):
errors, len_ref = errors_func(target, result)
......@@ -131,9 +135,6 @@ def evaluate():
def main():
print_arguments(args)
paddle.init(use_gpu=args.use_gpu,
rnn_use_batch=True,
trainer_count=args.trainer_count)
evaluate()
......
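
`test.py` accumulates `errors_sum` and `len_refs` over the whole test set and reports their ratio as the final WER or CER. A toy sketch of that aggregation; whitespace tokenization and `_toy_distance` are illustrative stand-ins, and the real edit distance lives in `utils/error_rate.py`:

```python
def corpus_error_rate(pairs, distance):
    # sum edit errors over all utterances, then divide by the total
    # number of reference tokens -- the aggregation test.py performs
    errors_sum, len_refs = 0.0, 0
    for ref, hyp in pairs:
        ref_toks, hyp_toks = ref.split(), hyp.split()
        errors_sum += distance(ref_toks, hyp_toks)
        len_refs += len(ref_toks)
    return errors_sum / len_refs

def _toy_distance(a, b):
    # positional mismatches plus length gap; stand-in for Levenshtein
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

print(corpus_error_rate([("a b c", "a x c")], _toy_distance))  # 0.333...
```
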
......@@ -9,14 +9,13 @@ function join_by { local IFS="$1"; shift; echo "$*"; }
for NUM_GPUS in 16 8 4 2 1
do
DEVICES=$(join_by , $(seq 0 $(($NUM_GPUS-1))))
BATCH_SIZE=$(($BATCH_SIZE_PER_GPU * $NUM_GPUS))
BATCH_SIZE=$(($BATCH_SIZE_PER_GPU))
CUDA_VISIBLE_DEVICES=$DEVICES \
python train.py \
--batch_size=$BATCH_SIZE \
--num_passes=1 \
--num_epoch=1 \
--test_off=True \
--trainer_count=$NUM_GPUS \
--min_duration=$MIN_DURATION \
--max_duration=$MAX_DURATION > tmp.log 2>&1
......@@ -24,7 +23,7 @@ do
exit 1
fi
cat tmp.log | grep "Time" | awk '{print "GPU Num: " "'"$NUM_GPUS"'" " Time: "$3}'
cat tmp.log | grep "Time" | awk '{print "GPU Num: " "'"$NUM_GPUS"'" " Time: "$2}'
rm tmp.log
done
......@@ -10,7 +10,7 @@ import argparse
import functools
import gzip
import logging
import paddle.v2 as paddle
import paddle.fluid as fluid
import _init_paths
from data_utils.data import DataGenerator
from model_utils.model import DeepSpeech2Model
......@@ -26,7 +26,6 @@ add_arg('batch_size', int, 256, "# of samples per batch.")
add_arg('trainer_count', int, 8, "# of Trainers (CPUs or GPUs).")
add_arg('beam_size', int, 500, "Beam search width.")
add_arg('num_proc_bsearch', int, 8, "# of CPUs for beam search.")
add_arg('num_proc_data', int, 8, "# of CPUs for data preprocessing.")
add_arg('num_conv_layers', int, 2, "# of convolution layers.")
add_arg('num_rnn_layers', int, 3, "# of recurrent layers.")
add_arg('rnn_layer_size', int, 2048, "# of recurrent cells per layer.")
......@@ -77,13 +76,19 @@ def tune():
if not args.num_betas >= 0:
raise ValueError("num_betas must be non-negative!")
if args.use_gpu:
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
data_generator = DataGenerator(
vocab_filepath=args.vocab_path,
mean_std_filepath=args.mean_std_path,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=args.num_proc_data,
keep_transcription_text=True)
keep_transcription_text=True,
place=place,
is_training=False)
batch_reader = data_generator.batch_reader_creator(
manifest_path=args.tune_manifest,
......@@ -97,7 +102,8 @@ def tune():
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
use_gru=args.use_gru,
pretrained_model_path=args.model_path,
place=place,
init_from_pretrain_model=args.model_path,
share_rnn_weights=args.share_rnn_weights)
# decoders only accept string encoded in utf-8
......@@ -109,8 +115,8 @@ def tune():
params_grid = [(alpha, beta) for alpha in cand_alphas
for beta in cand_betas]
err_sum = [0.0 for i in xrange(len(params_grid))]
err_ave = [0.0 for i in xrange(len(params_grid))]
err_sum = [0.0 for i in range(len(params_grid))]
err_ave = [0.0 for i in range(len(params_grid))]
num_ins, len_refs, cur_batch = 0, 0, 0
# initialize external scorer
ds2_model.init_ext_scorer(args.alpha_from, args.beta_from,
......@@ -123,7 +129,7 @@ def tune():
probs_split = ds2_model.infer_batch_probs(
infer_data=infer_data,
feeding_dict=data_generator.feeding)
target_transcripts = [ data[1] for data in infer_data ]
target_transcripts = infer_data[1]
num_ins += len(target_transcripts)
# grid search
......@@ -137,7 +143,6 @@ def tune():
cutoff_top_n=args.cutoff_top_n,
vocab_list=vocab_list,
num_processes=args.num_proc_bsearch)
for target, result in zip(target_transcripts, result_transcripts):
errors, len_ref = errors_func(target, result)
err_sum[index] += errors
......@@ -163,7 +168,7 @@ def tune():
# output WER/CER at every (alpha, beta)
print("\nFinal %s:\n" % args.error_rate_type)
for index in xrange(len(params_grid)):
for index in range(len(params_grid)):
print("(alpha, beta) = (%s, %s), [%s] = %f"
% ("%.3f" % params_grid[index][0], "%.3f" % params_grid[index][1],
args.error_rate_type, err_ave[index]))
......@@ -179,9 +184,6 @@ def tune():
def main():
print_arguments(args)
paddle.init(use_gpu=args.use_gpu,
rnn_use_batch=True,
trainer_count=args.trainer_count)
tune()
......
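
The tuning loop grid-searches `(alpha, beta)` pairs, accumulating one error sum per pair and averaging over the number of instances at the end. A condensed runnable sketch of that bookkeeping; the candidate ranges and the `score` function are illustrative stand-ins for the decoder's per-sample error:

```python
import numpy as np

def grid_search(cand_alphas, cand_betas, batches, score):
    params_grid = [(a, b) for a in cand_alphas for b in cand_betas]
    err_sum = [0.0 for _ in range(len(params_grid))]  # py3 range, not xrange
    num_ins = 0
    for batch in batches:
        num_ins += len(batch)
        for idx, (alpha, beta) in enumerate(params_grid):
            err_sum[idx] += sum(score(s, alpha, beta) for s in batch)
    err_ave = [s / num_ins for s in err_sum]
    return params_grid[int(np.argmin(err_ave))], err_ave

best, _ = grid_search(np.linspace(0.0, 2.0, 3), np.linspace(0.0, 1.0, 2),
                      batches=[[1, 2], [3]],
                      score=lambda s, a, b: abs(s - 2 * a - b))
print(best)
```
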
......@@ -5,23 +5,25 @@ from __future__ import print_function
import argparse
import functools
import paddle.v2 as paddle
import io
from model_utils.model import DeepSpeech2Model
from data_utils.data import DataGenerator
from utils.utility import add_arguments, print_arguments
import paddle.fluid as fluid
parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('batch_size', int, 256, "Minibatch size.")
add_arg('trainer_count', int, 8, "# of Trainers (CPUs or GPUs).")
add_arg('num_passes', int, 200, "# of training epochs.")
add_arg('num_proc_data', int, 16, "# of CPUs for data preprocessing.")
add_arg('num_epoch', int, 200, "# of training epochs.")
add_arg('num_conv_layers', int, 2, "# of convolution layers.")
add_arg('num_rnn_layers', int, 3, "# of recurrent layers.")
add_arg('rnn_layer_size', int, 2048, "# of recurrent cells per layer.")
add_arg('num_iter_print', int, 100, "Every # iterations for printing "
add_arg('num_iter_print', int, 100, "Every # batch for printing "
"train cost.")
add_arg('save_epoch', int, 10, "Every # epochs for saving checkpoint and model params.")
add_arg('num_samples', int, 10000, "Number of training samples.")
add_arg('learning_rate', float, 5e-4, "Learning rate.")
add_arg('max_duration', float, 27.0, "Longest audio duration allowed.")
add_arg('min_duration', float, 0.0, "Shortest audio duration allowed.")
......@@ -31,7 +33,12 @@ add_arg('use_gpu', bool, True, "Use GPU or not.")
add_arg('use_gru', bool, False, "Use GRUs instead of simple RNNs.")
add_arg('is_local', bool, True, "Use pserver or not.")
add_arg('share_rnn_weights', bool, True, "Share input-hidden weights across "
"bi-directional RNNs. Not for GRU.")
add_arg('init_from_pretrain_model', str,
None,
"If None, the training starts from scratch, "
"otherwise, it resumes from the pre-trained model.")
add_arg('train_manifest', str,
'data/librispeech/manifest.train',
"Filepath of train manifest.")
......@@ -44,10 +51,6 @@ add_arg('mean_std_path', str,
add_arg('vocab_path', str,
'data/librispeech/vocab.txt',
"Filepath of vocabulary.")
add_arg('init_model_path', str,
None,
"If None, the training starts from scratch, "
"otherwise, it resumes from the pre-trained model.")
add_arg('output_model_dir', str,
"./checkpoints/libri",
"Directory for saving checkpoints.")
......@@ -68,30 +71,33 @@ args = parser.parse_args()
def train():
"""DeepSpeech2 training."""
if args.use_gpu:
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
train_generator = DataGenerator(
vocab_filepath=args.vocab_path,
mean_std_filepath=args.mean_std_path,
augmentation_config=open(args.augment_conf_path, 'r').read(),
augmentation_config=io.open(args.augment_conf_path, mode='r', encoding='utf8').read(),
max_duration=args.max_duration,
min_duration=args.min_duration,
specgram_type=args.specgram_type,
num_threads=args.num_proc_data)
place=place)
dev_generator = DataGenerator(
vocab_filepath=args.vocab_path,
mean_std_filepath=args.mean_std_path,
augmentation_config="{}",
specgram_type=args.specgram_type,
num_threads=args.num_proc_data)
place=place)
train_batch_reader = train_generator.batch_reader_creator(
manifest_path=args.train_manifest,
batch_size=args.batch_size,
min_batch_size=args.trainer_count,
sortagrad=args.use_sortagrad if args.init_model_path is None else False,
sortagrad=args.use_sortagrad if args.init_from_pretrain_model is None else False,
shuffle_method=args.shuffle_method)
dev_batch_reader = dev_generator.batch_reader_creator(
manifest_path=args.dev_manifest,
batch_size=args.batch_size,
min_batch_size=1, # must be 1, otherwise errors occur.
sortagrad=False,
shuffle_method=None)
......@@ -101,27 +107,27 @@ def train():
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
use_gru=args.use_gru,
pretrained_model_path=args.init_model_path,
share_rnn_weights=args.share_rnn_weights)
share_rnn_weights=args.share_rnn_weights,
place=place,
init_from_pretrain_model=args.init_from_pretrain_model,
output_model_dir=args.output_model_dir)
ds2_model.train(
train_batch_reader=train_batch_reader,
dev_batch_reader=dev_batch_reader,
feeding_dict=train_generator.feeding,
learning_rate=args.learning_rate,
gradient_clipping=400,
num_passes=args.num_passes,
batch_size=args.batch_size,
num_samples=args.num_samples,
num_epoch=args.num_epoch,
save_epoch=args.save_epoch,
num_iterations_print=args.num_iter_print,
output_model_dir=args.output_model_dir,
is_local=args.is_local,
test_off=args.test_off)
def main():
print_arguments(args)
paddle.init(use_gpu=args.use_gpu,
rnn_use_batch=True,
trainer_count=args.trainer_count,
log_clipping=True)
train()
......
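
The renamed knobs map onto an epoch-based loop: `num_epoch` outer passes, a cost print every `num_iter_print` batches, and a checkpoint every `save_epoch` epochs. A hedged sketch of that control flow; the real loop lives in `model_utils/model.py` (not expanded on this page), and `run_training` below is an illustrative stand-in:

```python
def run_training(train_batches, num_epoch=20, save_epoch=1, num_iter_print=1):
    for epoch in range(num_epoch):
        for it, batch in enumerate(train_batches):
            cost = sum(batch) / len(batch)  # stand-in for one optimizer step
            if it % num_iter_print == 0:
                print("epoch %d, iter %d, train cost %.4f" % (epoch, it, cost))
        if (epoch + 1) % save_epoch == 0:
            print("saved checkpoint for epoch %d" % epoch)

run_training([[1.0, 2.0], [0.5, 1.5]], num_epoch=2)
```
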
......@@ -36,15 +36,15 @@ def _levenshtein_distance(ref, hyp):
distance = np.zeros((2, n + 1), dtype=np.int32)
# initialize distance matrix
for j in xrange(n + 1):
for j in range(n + 1):
distance[0][j] = j
# calculate levenshtein distance
for i in xrange(1, m + 1):
for i in range(1, m + 1):
prev_row_idx = (i - 1) % 2
cur_row_idx = i % 2
distance[cur_row_idx][0] = i
for j in xrange(1, n + 1):
for j in range(1, n + 1):
if ref[i - 1] == hyp[j - 1]:
distance[cur_row_idx][j] = distance[prev_row_idx][j - 1]
else:
......
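
The hunk above stops inside the inner loop. For reference, a self-contained Python 3 version of the same two-row Levenshtein recurrence, with the edit-cost branch after `else:` reconstructed (it may differ cosmetically from the repo's code):

```python
import numpy as np

def levenshtein_distance(ref, hyp):
    m, n = len(ref), len(hyp)
    distance = np.zeros((2, n + 1), dtype=np.int32)
    # initialize distance matrix
    for j in range(n + 1):
        distance[0][j] = j
    # calculate levenshtein distance, keeping only two rows in memory
    for i in range(1, m + 1):
        prev_row_idx = (i - 1) % 2
        cur_row_idx = i % 2
        distance[cur_row_idx][0] = i
        for j in range(1, n + 1):
            if ref[i - 1] == hyp[j - 1]:
                distance[cur_row_idx][j] = distance[prev_row_idx][j - 1]
            else:
                s_num = distance[prev_row_idx][j - 1] + 1  # substitution
                i_num = distance[cur_row_idx][j - 1] + 1   # insertion
                d_num = distance[prev_row_idx][j] + 1      # deletion
                distance[cur_row_idx][j] = min(s_num, i_num, d_num)
    return int(distance[m % 2][n])

assert levenshtein_distance("kitten", "sitting") == 3
```
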