# Deep Speech 2 on PaddlePaddle

## Installation

### Prerequisites

 - Only **Python 2.7** is supported;
 - **cuDNN >= 6.0** is required to utilize the NVIDIA GPU platform when installing PaddlePaddle, together with a **CUDA toolkit** version compatible with that cuDNN release. cuDNN versions below 6.0 have been found to raise a fatal error in batch normalization when handling long utterances during inference.

### Setup

```
sh setup.sh
export LD_LIBRARY_PATH=$PADDLE_INSTALL_DIR/Paddle/third_party/install/warpctc/lib:$LD_LIBRARY_PATH
```

Please replace `$PADDLE_INSTALL_DIR` with your own PaddlePaddle installation directory.
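
For example, if PaddlePaddle were installed under `/opt/paddle` (a placeholder path, substitute your own), the export would read:

```
export LD_LIBRARY_PATH=/opt/paddle/Paddle/third_party/install/warpctc/lib:$LD_LIBRARY_PATH
```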

## Usage

### Preparing Data

```
cd datasets
sh run_all.sh
cd ..
```

`sh run_all.sh` prepares all ASR datasets (currently, only LibriSpeech is available). After it finishes, several summarizing manifest files in JSON format are generated.

A manifest file summarizes a speech dataset: each line holds the metadata (i.e. audio file path, transcript text and audio duration) of one audio file in the dataset, in JSON format. The manifest file serves as the interface that tells our system where and what to read for each speech sample.
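
For illustration, a single manifest line might look like the one below; the key names and sample values are only a sketch of what the preparation scripts emit, not the definitive schema:

```
{"audio_filepath": "/path/to/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac", "duration": 5.86, "text": "mister quilter is the apostle of the middle classes"}
```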


More help for arguments:

```
python datasets/librispeech/librispeech.py --help
```

### Preparing for Training

```
python compute_mean_std.py
```

It will compute the mean and standard deviation of the audio features and save them to a file, by default `./mean_std.npz`. This file is used in both training and inference. The default audio feature is the power spectrum; the MFCC feature is also supported. To train and infer based on MFCC features, please generate this file by

```
python compute_mean_std.py --specgram_type mfcc
```

and specify `--specgram_type mfcc` when running train.py, infer.py, evaluate.py or tune.py.
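
For example, training on MFCC features is then started with:

```
python train.py --specgram_type mfcc
```

To quickly check which arrays the generated `mean_std.npz` actually contains (the array names are written by the script itself, none are assumed here), one can run:

```
python -c "import numpy; print(numpy.load('mean_std.npz').files)"
```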

More help for arguments:

```
python compute_mean_std.py --help
```

### Training

For GPU Training:

```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py
```

For CPU Training:

```
python train.py --use_gpu False
```

More help for arguments:

```
python train.py --help
```

### Preparing Language Model

The following steps (inference, parameter tuning and evaluation) require a language model during decoding.
A compressed language model is provided and can be obtained by running

```
cd ./lm
sh run.sh
cd ..
```

### Inference

For GPU inference

```
CUDA_VISIBLE_DEVICES=0 python infer.py
```

For CPU inference

```
python infer.py --use_gpu=False
```

More help for arguments:

```
python infer.py --help
```

### Evaluating

```
CUDA_VISIBLE_DEVICES=0 python evaluate.py
```

More help for arguments:

```
python evaluate.py --help
```

### Parameter Tuning

Usually, the parameters $\alpha$ and $\beta$ for the CTC [prefix beam search](https://arxiv.org/abs/1408.2873) decoder need to be tuned after retraining the acoustic model.
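
For reference, in this decoding scheme (following the linked paper) a candidate transcription $y$ for an utterance $x$ is typically scored as

$Q(y) = \log P_{\rm ctc}(y \mid x) + \alpha \log P_{\rm lm}(y) + \beta \, {\rm wc}(y)$

where $P_{\rm lm}$ is the language model and ${\rm wc}(y)$ is the number of words in $y$: $\alpha$ weights the language model against the acoustic model, and $\beta$ compensates for the per-word penalty the language model introduces. The exact objective optimized by `tune.py` may differ in detail, so treat this as a sketch.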

For GPU tuning

```
CUDA_VISIBLE_DEVICES=0 python tune.py
```

For CPU tuning

```
python tune.py --use_gpu=False
```

More help for arguments:

```
python tune.py --help
```

Then reset the parameters to the tuned values before running inference or evaluation.
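
For instance, assuming `infer.py` and `evaluate.py` accept `--alpha` and `--beta` arguments for these two values (please confirm the actual argument names with `--help`), the tuned result could be applied like this, with purely illustrative numbers:

```
CUDA_VISIBLE_DEVICES=0 python infer.py --alpha 2.15 --beta 0.35
```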

### Playing with the ASR Demo

A real-time ASR demo is built for users to try out the ASR model with their own voice. Please perform the following installation on the machine on which you'd like to run the demo's client (it is not needed on the machine running the demo's server).

For example, on macOS:

```
brew install portaudio
pip install pyaudio
pip install pynput
```
After an acoustic model and a language model are prepared, we can first start the demo's server:

```
CUDA_VISIBLE_DEVICES=0 python demo_server.py
```
Then, in another console, start the demo's client:

```
python demo_client.py
```
On the client console, press and hold the space key to start talking; release it once your speech is finished. The decoding results (the inferred transcription) will then be displayed.

It is also possible to run the server and the client on two separate machines, e.g. `demo_client.py` is usually started on a machine with microphone hardware, while `demo_server.py` is usually started on a remote server with powerful GPUs. Please first make sure that the two machines can reach each other over the network, and then use `--host_ip` and `--host_port` in both `demo_server.py` and `demo_client.py` to indicate the server machine's actual IP address (instead of the default `localhost`) and TCP port.
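
For example, with a hypothetical server address of `192.168.1.10` and port `8086` (both values are placeholders):

```
CUDA_VISIBLE_DEVICES=0 python demo_server.py --host_ip 192.168.1.10 --host_port 8086
python demo_client.py --host_ip 192.168.1.10 --host_port 8086
```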