<p align="center">
  <img src="./docs/images/PaddleSpeech_logo.png" />
</p>
<div align="center">  

  <h3>
  <a href="#quick-start"> Quick Start </a>
  | <a href="#tutorials"> Tutorials </a>
  | <a href="#model-list"> Model List </a>
</div>

------------------------------------------------------------------------------------

![License](https://img.shields.io/badge/license-Apache%202-red.svg)
![python version](https://img.shields.io/badge/python-3.7+-orange.svg)
![support os](https://img.shields.io/badge/os-linux-yellow.svg)

<!---
from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readmes-readable.md
1.What is this repo or project? (You can reuse the repo description you used earlier because this section doesn’t have to be long.)
2.How does it work?
3.Who will use this repo or project?
4.What is the goal of this project?
-->

**PaddleSpeech** is an open-source toolkit on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech, with state-of-the-art and influential models.

##### Speech-to-Text

<div align = "center">
<table style="width:100%">
  <thead>
    <tr>
      <th> Input Audio  </th>
      <th width="550"> Recognition Result  </th>
    </tr>
  </thead>
  <tbody>
   <tr>
      <td align = "center">
      <a href="https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav" rel="nofollow">
            <img align="center" src="./docs/images/audio_icon.png" width="200" style="max-width: 100%;"></a><br>
      </td>
      <td >I knocked at the door on the ancient side of the building.</td>
    </tr>
    <tr>
      <td align = "center">
      <a href="https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav" rel="nofollow">
            <img align="center" src="./docs/images/audio_icon.png" width="200" style="max-width: 100%;"></a><br>
      </td>
      <td>我认为跑步最重要的就是给我带来了身体健康。</td>
    </tr>
  </tbody>
</table>

</div>

##### Text-to-Speech
<div align = "center">
<table style="width:100%">
  <thead>
    <tr>
      <th><img width="200" height="1"> Input Text <img width="200" height="1"> </th>
      <th>Synthetic Audio</th>
    </tr>
  </thead>
  <tbody>
   <tr>
      <td >Life was like a box of chocolates, you never know what you're gonna get.</td>
      <td align = "center">
      <a href="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/001.wav" rel="nofollow">
            <img align="center" src="./docs/images/audio_icon.png" width="200" style="max-width: 100%;"></a><br>
      </td>
    </tr>
    <tr>
      <td >早上好,今天是2020/10/29,最低温度是-3°C。</td>
      <td align = "center">
      <a href="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/parakeet_espnet_fs2_pwg_demo/tn_g2p/parakeet/001.wav" rel="nofollow">
            <img align="center" src="./docs/images/audio_icon.png" width="200" style="max-width: 100%;"></a><br>
      </td>
    </tr>
  </tbody>
</table>

</div>

For more synthesized audio samples, please refer to [PaddleSpeech Text-to-Speech samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html).

Via its easy-to-use, efficient, flexible and scalable implementation, our vision is to empower both industrial application and academic research, covering training, inference & testing modules, and the deployment process. More specifically, this toolkit features:
- **Fast and Lightweight**: we provide high-speed and ultra-lightweight models that are convenient for industrial deployment.
- **Rule-based Chinese frontend**: our frontend covers Text Normalization and Grapheme-to-Phoneme conversion (G2P, including Polyphone and Tone Sandhi). Moreover, we use self-defined linguistic rules to adapt to Chinese contexts.
- **A Variety of Functions that Vitalize both Industry and Academia**:
  - *Implementation of critical audio tasks*: this toolkit contains audio functions like Speech Translation, Automatic Speech Recognition, Text-to-Speech Synthesis, Voice Cloning, etc.
  - *Integration of mainstream models and datasets*: the toolkit implements modules that participate in the whole pipeline of the speech tasks, and uses mainstream datasets like LibriSpeech, LJSpeech, AIShell, CSMSC, etc. See also [model list](#model-list) for more details.
  - *Cascaded model application*: as an extension of traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields such as Natural Language Processing (NLP), e.g. Punctuation Restoration.

## Installation

The base environment for this page is:
- Ubuntu 16.04
- python>=3.7
- paddlepaddle>=2.2.0

If you want to set up PaddleSpeech in another environment, please see the [installation](./docs/source/install.md) documents for all the alternatives.
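
As a rough sketch (the authoritative steps and version pins live in the installation documents above), a pip-based setup typically looks like this:

```shell
# install PaddlePaddle first; GPU users would install paddlepaddle-gpu instead
pip install paddlepaddle==2.2.0
# fetch the source and install PaddleSpeech itself
git clone https://github.com/PaddlePaddle/PaddleSpeech.git
cd PaddleSpeech
pip install -e .
```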

## Quick Start

Developers can try our models with only a few lines of code.

Train a tiny DeepSpeech2 **Speech-to-Text** model on a toy subset of LibriSpeech:

```shell
cd examples/tiny/s0/
# source the environment
source path.sh
source ../../../utils/parse_options.sh
# prepare data
bash ./local/data.sh
# train the model; all checkpoints (`ckpt`) are saved under the `exp` dir
# if you use paddlepaddle-gpu, you can set CUDA_VISIBLE_DEVICES before the train script
./local/train.sh conf/deepspeech2.yaml deepspeech2 offline
# average the n best models to get the test model; in this case, n = 1
avg.sh best exp/deepspeech2/checkpoints 1
# evaluate the test model
./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1 offline
```
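
As the comment in the script above notes, GPU users can restrict training to particular cards by setting `CUDA_VISIBLE_DEVICES` before the train script, for example:

```shell
# train on GPU 0 only (requires paddlepaddle-gpu)
CUDA_VISIBLE_DEVICES=0 ./local/train.sh conf/deepspeech2.yaml deepspeech2 offline
```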

For **Text-to-Speech**, try the pretrained FastSpeech2 + Parallel WaveGAN models on CSMSC:
```shell
cd examples/csmsc/tts3
# download the pretrained models and unzip them
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip
unzip pwg_baker_ckpt_0.4.zip
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip
unzip fastspeech2_nosil_baker_ckpt_0.4.zip
# source the environment
source path.sh
# run end-to-end synthesis
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize_e2e.py \
  --fastspeech2-config=fastspeech2_nosil_baker_ckpt_0.4/default.yaml \
  --fastspeech2-checkpoint=fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \
  --fastspeech2-stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \
  --pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \
  --pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
  --pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
  --text=${BIN_DIR}/../sentences.txt \
  --output-dir=exp/default/test_e2e \
  --inference-dir=exp/default/inference \
  --phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
```
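
If the run succeeds, the synthesized wave files and the exported inference model should appear under the paths passed via `--output-dir` and `--inference-dir` above, so a quick sanity check is:

```shell
# list the synthesized audio and the exported inference model
ls exp/default/test_e2e
ls exp/default/inference
```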

If you want to try more functions like training and tuning, please see [Speech-to-Text Quick Start](./docs/source/asr/quick_start.md) and [Text-to-Speech Quick Start](./docs/source/tts/quick_start.md).

## Model List

PaddleSpeech supports a series of the most popular models; they are summarized in [released models](./docs/source/released_model.md), with pretrained models available.

The Speech-to-Text module contains an *Acoustic Model* and a *Language Model*, with the following details:

<!---
The current hyperlinks redirect to [Previous Parakeet](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples).
-->

<table style="width:100%">
  <thead>
    <tr>
      <th>Speech-to-Text Module Type</th>
      <th>Dataset</th>
      <th>Model Type</th>
      <th>Link</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="3">Acoustic Model</td>
      <td rowspan="2" >Aishell</td>
      <td >DeepSpeech2 RNN + Conv based Models</td>
      <td>
      <a href = "./examples/aishell/s0">deepspeech2-aishell</a>
      </td>
    </tr>
    <tr>
      <td>Transformer based Attention Models </td>
      <td>
      <a href = "./examples/aishell/s1">u2.transformer.conformer-aishell</a>
      </td>
    </tr>
      <tr>
      <td>LibriSpeech</td>
      <td>Transformer based Attention Models </td>
      <td>
      <a href = "./examples/librispeech/s0">deepspeech2-librispeech</a> / <a href = "./examples/librispeech/s1">transformer.conformer.u2-librispeech</a>  / <a href = "./examples/librispeech/s2">transformer.conformer.u2-kaldi-librispeech</a>
      </td>
    </tr>
  <tr>
  <td>Alignment</td>
  <td>THCHS30</td>
  <td>MFA</td>
  <td>
  <a href = ".examples/thchs30/a0">mfa-thchs30</a>
  </td>
  </tr>
   <tr>
      <td>Language Model</td>
      <td colspan = "2">Ngram Language Model</td>
      <td>
      <a href = "./examples/other/ngram_lm">kenlm</a>
      </td>
    </tr>
    <tr>
      <td>Acoustic Model</td>
      <td>TIMIT</td>
      <td>Unified Streaming & Non-streaming Two-pass</td>
      <td>
    <a href = "./examples/timit/s1"> u2-timit</a>
      </td>
    </tr>
  </tbody>
</table>

PaddleSpeech Text-to-Speech mainly contains three modules: *Text Frontend*, *Acoustic Model* and *Vocoder*. Acoustic Model and Vocoder models are listed as follows:

<table>
  <thead>
    <tr>
      <th> Text-to-Speech Module Type <img width="110" height="1"> </th>
      <th>  Model Type  </th>
      <th> <img width="50" height="1"> Dataset  <img width="50" height="1"> </th>
      <th> <img width="101" height="1"> Link <img width="105" height="1"> </th>
    </tr>
  </thead>
  <tbody>
    <tr>
    <td> Text Frontend</td>
    <td colspan="2"> &emsp; </td>
    <td>
    <a href = "./examples/other/tn">tn</a> / <a href = "./examples/other/g2p">g2p</a>
    </td>
    </tr>
    <tr>
      <td rowspan="4">Acoustic Model</td>
      <td >Tacotron2</td>
      <td rowspan="2" >LJSpeech</td>
      <td>
      <a href = "./examples/ljspeech/tts0">tacotron2-ljspeech</a>
      </td>
    </tr>
    <tr>
      <td>TransformerTTS</td>
      <td>
      <a href = "./examples/ljspeech/tts1">transformer-ljspeech</a>
      </td>
    </tr>
    <tr>
      <td>SpeedySpeech</td>
      <td>CSMSC</td>
      <td >
      <a href = "./examples/csmsc/tts2">speedyspeech-csmsc</a>
      </td>
    </tr>
    <tr>
      <td>FastSpeech2</td>
      <td>AISHELL-3 / VCTK / LJSpeech / CSMSC</td>
      <td>
      <a href = "./examples/aishell3/tts3">fastspeech2-aishell3</a> / <a href = "./examples/vctk/tts3">fastspeech2-vctk</a> / <a href = "./examples/ljspeech/tts3">fastspeech2-ljspeech</a> / <a href = "./examples/csmsc/tts3">fastspeech2-csmsc</a>
      </td>
    </tr>
   <tr>
      <td rowspan="3">Vocoder</td>
      <td >WaveFlow</td>
      <td >LJSpeech</td>
      <td>
      <a href = "./examples/ljspeech/voc0">waveflow-ljspeech</a>
      </td>
    </tr>
    <tr>
      <td >Parallel WaveGAN</td>
      <td >LJSpeech / VCTK / CSMSC</td>
      <td>
      <a href = "./examples/ljspeech/voc1">PWGAN-ljspeech</a> / <a href = "./examples/vctk/voc1">PWGAN-vctk</a> / <a href = "./examples/csmsc/voc1">PWGAN-csmsc</a>
      </td>
    </tr>
    <tr>
      <td >Multi Band MelGAN</td>
      <td >CSMSC</td>
      <td>
      <a href = "./examples/csmsc/voc3">Multi Band MelGAN-csmsc</a> 
      </td>
    </tr>                                                                                                                                           
    <tr>
    <td rowspan="2">Voice Cloning</td>
    <td>GE2E</td>
    <td >AISHELL-3, etc.</td>
    <td>
    <a href = "./examples/other/ge2e">ge2e</a>
    </td>
    </tr>
    <tr>
    <td>GE2E + Tacotron2</td>
    <td>AISHELL-3</td>
    <td>
    <a href = "./examples/aishell3/vc0">ge2e-tacotron2-aishell3</a>
    </td>
    </tr>
  </tbody>
</table>

## Tutorials

In general, [Speech SoTA](https://paperswithcode.com/area/speech) gives you an overview of the hot academic topics in speech. To focus on the tasks in PaddleSpeech, you will find the following guidelines helpful for grasping the core ideas.

- [Overview](./docs/source/introduction.md)
- Quick Start
  - [Dependencies](./docs/source/dependencies.md) and [Installation](./docs/source/install.md)
  - [Quick Start of Speech-to-Text](./docs/source/asr/quick_start.md)
  - [Quick Start of Text-to-Speech](./docs/source/tts/quick_start.md)
- Speech-to-Text
  - [Models Introduction](./docs/source/asr/models_introduction.md)
  - [Data Preparation](./docs/source/asr/data_preparation.md)
  - [Data Augmentation Pipeline](./docs/source/asr/augmentation.md)
  - [Features](./docs/source/asr/feature_list.md)
  - [Ngram LM](./docs/source/asr/ngram_lm.md)
- Text-to-Speech
  - [Introduction](./docs/source/tts/models_introduction.md)
  - [Advanced Usage](./docs/source/tts/advanced_usage.md)
  - [Chinese Rule Based Text Frontend](./docs/source/tts/zh_text_frontend.md)
  - [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html) and [PaddleSpeech VS. Espnet](https://paddlespeech.readthedocs.io/en/latest/tts/demo_2.html)
- [Released Models](./docs/source/released_model.md)

The TTS module was originally called [Parakeet](https://github.com/PaddlePaddle/Parakeet) and has now been merged with DeepSpeech. If you are interested in academic research on this function, please see the [TTS research overview](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/docs/source/tts#overview). Also, [this document](https://paddleparakeet.readthedocs.io/en/latest/released_models.html) is a good guideline for the pipeline components.

## FAQ and Contributing

You are warmly welcome to submit questions in [discussions](https://github.com/PaddlePaddle/PaddleSpeech/discussions) and bug reports in [issues](https://github.com/PaddlePaddle/PaddleSpeech/issues)! We would also highly appreciate your contributions to this project!

## Citation

To cite PaddleSpeech for research, please use the following format.
```tex
@misc{ppspeech2021,
  title={PaddleSpeech, a toolkit for audio processing based on PaddlePaddle.},
  author={PaddlePaddle Authors},
  howpublished={\url{https://github.com/PaddlePaddle/PaddleSpeech}},
  year={2021}
}
```

## License and Acknowledgement

PaddleSpeech is provided under the [Apache-2.0 License](./LICENSE).

PaddleSpeech depends on many open-source repositories. See [references](./docs/source/reference.md) for more information.