English | [简体中文](README_ch.md)

# PaddleSpeech



<p align="center">
  <img src="./docs/images/PaddleSpeech_log.png" />
</p>
<div align="center">  

  <h3>
  <a href="#quick-start"> Quick Start </a>
  | <a href="#tutorials"> Tutorials </a>
  | <a href="#model-list"> Models List </a>
</div>

------------------------------------------------------------------------------------
![License](https://img.shields.io/badge/license-Apache%202-red.svg)
![python version](https://img.shields.io/badge/python-3.7+-orange.svg)
![support os](https://img.shields.io/badge/os-linux-yellow.svg)

<!---
why they should use your module,
how they can install it,
how they can use it
-->

**PaddleSpeech** is an open-source toolkit on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech, with state-of-the-art and influential models.

Via an easy-to-use, efficient, flexible and scalable implementation, our vision is to empower both industrial applications and academic research, covering training, inference & testing modules, and the deployment process. More specifically, this toolkit features:
- **Fast and Light-weight**: we provide high-speed and ultra-lightweight models that are convenient for industrial deployment.
- **Rule-based Chinese frontend**: our frontend contains Text Normalization (TN) and Grapheme-to-Phoneme (G2P, including Polyphone and Tone Sandhi). Moreover, we use self-defined linguistic rules to adapt to the Chinese context.
- **Varieties of Functions that Vitalize both Industry and Academia**:
  - *Implementation of critical audio tasks*: this toolkit contains audio functions like Speech Translation (ST), Automatic Speech Recognition (ASR), Text-To-Speech Synthesis (TTS), Voice Cloning (VC), Punctuation Restoration, etc.
  - *Integration of mainstream models and datasets*: the toolkit implements modules that participate in the whole pipeline of speech tasks, and uses mainstream datasets like LibriSpeech, LJSpeech, AIShell, CSMSC, etc. See also the [Models List](#models-list) for more details.
  - *Cross-domain application*: as an extension of traditional audio tasks, we combine the aforementioned tasks with other fields like NLP.

Let's install PaddleSpeech with only a few lines of code!

> Note: The official name is still `deepspeech`. (2021/10/26)

If you are using Ubuntu, PaddleSpeech can be set up via pip (root privilege may be required).
```shell
git clone https://github.com/PaddlePaddle/DeepSpeech.git
cd DeepSpeech
pip install -e .
```
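
As a quick sanity check (not part of the official setup), you can verify that the underlying PaddlePaddle framework was installed correctly:
```shell
# verify that PaddlePaddle imports and can run a tiny computation
python -c "import paddle; paddle.utils.run_check()"
```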

## Table of Contents

The contents of this README are as follows:
- [Alternative Installation](#alternative-installation)
- [Quick Start](#quick-start)
- [Models List](#models-list)
- [Tutorials](#tutorials)
- [FAQ and Contributing](#faq-and-contributing)
- [License](#license)
- [Acknowledgement](#acknowledgement)

## Alternative Installation

The base environment for this page is:
- Ubuntu 16.04
- python>=3.7
- paddlepaddle==2.1.2
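
If you want to reproduce this base environment from scratch, a minimal sketch (assuming conda is available; any virtual-environment tool works equally well) could look like:
```shell
# create and activate an isolated Python 3.7 environment (conda is an assumption here)
conda create -y -n paddlespeech python=3.7
conda activate paddlespeech
# install the PaddlePaddle version listed above
pip install paddlepaddle==2.1.2
```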

If you want to set up PaddleSpeech in another environment, please see the [ASR installation](docs/source/asr/install.md) and [TTS installation](docs/source/tts/install.md) documents for all the alternatives.

## Quick Start
> Note: the current links to `English ASR` and `English TTS` are not valid.

For a quick test of our functions, try [English ASR](link/hubdetail?name=deepspeech2_aishell&en_category=AutomaticSpeechRecognition) and [English TTS](link/hubdetail?name=fastspeech2_baker&en_category=TextToSpeech) by typing a message or uploading your own audio file.

Developers can try our models with only a few lines of code.

A tiny **ASR** DeepSpeech2 example on a toy subset of LibriSpeech:

```bash
cd examples/tiny/s0/
# source the environment
source path.sh
# prepare the LibriSpeech dataset
bash local/data.sh
# evaluate a trained model; replace `ckptfile` with the path to your checkpoint
bash local/test.sh conf/deepspeech2.yaml ckptfile offline
```
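
To produce a checkpoint for `ckptfile` yourself, the example also contains a training script; the arguments below are an assumption based on the config/checkpoint convention of `local/test.sh`, so check `run.sh` in the same directory for the authoritative usage:
```bash
# train a tiny DeepSpeech2 model first (argument order is an assumption; see run.sh)
bash local/train.sh conf/deepspeech2.yaml ckptfile
```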

For **TTS**, try the pretrained FastSpeech2 + Parallel WaveGAN models on CSMSC:

```bash
cd examples/csmsc/tts3
# download the pretrained models and unzip them
wget https://paddlespeech.bj.bcebos.com/Parakeet/pwg_baker_ckpt_0.4.zip
unzip pwg_baker_ckpt_0.4.zip
wget https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_baker_ckpt_0.4.zip
unzip fastspeech2_nosil_baker_ckpt_0.4.zip
# source the environment
source path.sh
# run end-to-end synthesis
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize_e2e.py \
  --fastspeech2-config=fastspeech2_nosil_baker_ckpt_0.4/default.yaml \
  --fastspeech2-checkpoint=fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \
  --fastspeech2-stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \
  --pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \
  --pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
  --pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
  --text=${BIN_DIR}/../sentences.txt \
  --output-dir=exp/default/test_e2e \
  --inference-dir=exp/default/inference \
  --device="gpu" \
  --phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt

```
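
The synthesized waveforms are written to the `--output-dir` passed above, so you can inspect them directly; pointing `--text` at your own file lets you synthesize different sentences:
```bash
# list the generated wav files (path taken from --output-dir above)
ls exp/default/test_e2e
```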

If you want to try more functions like training and tuning, please see [ASR getting started](docs/source/asr/getting_started.md) and [TTS Basic Use](/docs/source/tts/basic_usage.md).

## Models List

PaddleSpeech supports a series of the most popular models, summarized in [released models](./docs/source/released_model.md) together with the available pretrained models.

The ASR module contains an *Acoustic Model* and a *Language Model*, with the following details:

<!---
The current hyperlinks redirect to [Previous Parakeet](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples).
-->

> Note: The `Link` should be a code path rather than a download link.


<table>
  <thead>
    <tr>
      <th>ASR Module Type</th>
      <th>Dataset</th>
      <th>Model Type</th>
      <th>Link</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="6">Acoustic Model</td>
      <td rowspan="4" >Aishell</td>
      <td >2 Conv + 5 LSTM layers with only forward direction</td>
      <td>
      <a href = "https://deepspeech.bj.bcebos.com/release2.1/aishell/s0/aishell.s0.ds_online.5rnn.debug.tar.gz">Ds2 Online Aishell Model</a>
      </td>
    </tr>
    <tr>
      <td>2 Conv + 3 bidirectional GRU layers</td>
      <td>
      <a href = "https://deepspeech.bj.bcebos.com/release2.1/aishell/s0/aishell.s0.ds2.offline.cer6p65.release.tar.gz">Ds2 Offline Aishell Model</a>
      </td>
    </tr>
    <tr>
      <td>Encoder: Conformer, Decoder: Transformer, Decoding method: Attention + CTC</td>
      <td>
      <a href = "https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.release.tar.gz">Conformer Offline Aishell Model</a>
      </td>
    </tr>
    <tr>
      <td >Encoder: Conformer, Decoder: Transformer, Decoding method: Attention</td>
      <td>
      <a href = "https://deepspeech.bj.bcebos.com/release2.1/librispeech/s1/conformer.release.tar.gz">Conformer Librispeech Model</a>
      </td>
    </tr>
      <tr>
      <td rowspan="2"> Librispeech</td>
      <td>Encoder: Conformer, Decoder: Transformer, Decoding method: Attention</td>
      <td> <a href = "https://deepspeech.bj.bcebos.com/release2.1/librispeech/s1/conformer.release.tar.gz">Conformer Librispeech Model</a> </td>
    </tr>
    <tr>
      <td>Encoder: Transformer, Decoder: Transformer, Decoding method: Attention</td>
      <td>
      <a href = "https://deepspeech.bj.bcebos.com/release2.1/librispeech/s1/transformer.release.tar.gz">Transformer Librispeech Model</a>
      </td>
    </tr>
   <tr>
      <td rowspan="3">Language Model</td>
      <td >CommonCrawl(en.00)</td>
      <td >English Language Model</td>
      <td>
      <a href = "https://deepspeech.bj.bcebos.com/en_lm/common_crawl_00.prune01111.trie.klm">English Language Model</a>
      </td>
    </tr>
    <tr>
      <td rowspan="2">Baidu Internal Corpus</td>
      <td>Mandarin Language Model Small</td>
      <td>
      <a href = "https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm">Mandarin Language Model Small</a>
      </td>
    </tr>
    <tr>
      <td >Mandarin Language Model Large</td>
      <td>
      <a href = "https://deepspeech.bj.bcebos.com/zh_lm/zhidao_giga.klm">Mandarin Language Model Large</a>
      </td>
    </tr>
  </tbody>
</table>
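
The pretrained checkpoints in the table above are plain archives, so they can be fetched and unpacked directly; for example, for the Ds2 Offline Aishell model (URL copied from the table):
```bash
# download and unpack one of the pretrained ASR models listed above
wget https://deepspeech.bj.bcebos.com/release2.1/aishell/s0/aishell.s0.ds2.offline.cer6p65.release.tar.gz
tar -xzvf aishell.s0.ds2.offline.cer6p65.release.tar.gz
```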


PaddleSpeech TTS mainly contains three modules: *Text Frontend*, *Acoustic Model* and *Vocoder*. Acoustic Model and Vocoder models are listed as follows:

<table>
  <thead>
    <tr>
      <th>TTS Module Type</th>
      <th>Model Type</th>
      <th>Dataset</th>
      <th>Link</th>
    </tr>
  </thead>
  <tbody>
    <tr>
    <td> Text Frontend</td>
    <td colspan="2"> &emsp; </td>
    <td>
    <a href = "./examples/other/text_frontend">chinese-fronted</a>
    </td>
    </tr>
    <tr>
      <td rowspan="7">Acoustic Model</td>
      <td >Tacotron2</td>
      <td rowspan="2" >LJSpeech</td>
      <td>
      <a href = "./examples/ljspeech/tts0">tacotron2-vctk</a>
      </td>
    </tr>
    <tr>
      <td>TransformerTTS</td>
      <td>
      <a href = "./examples/ljspeech/tts1">transformer-ljspeech</a>
      </td>
    </tr>
    <tr>
      <td>SpeedySpeech</td>
      <td>CSMSC</td>
      <td >
      <a href = "./examples/csmsc/tts2">speedyspeech-csmsc</a>
      </td>
    </tr>
    <tr>
      <td rowspan="4">FastSpeech2</td>
      <td>AISHELL-3</td>
      <td>
      <a href = "./examples/aishell3/tts3">fastspeech2-aishell3</a>
      </td>
    </tr>
    <tr>
      <td>VCTK</td>
      <td> <a href = "./examples/vctk/tts3">fastspeech2-vctk</a> </td>
    </tr>
    <tr>
      <td>LJSpeech</td>
      <td> <a href = "./examples/ljspeech/tts3">fastspeech2-ljspeech</a> </td>
    </tr>
    <tr>
      <td>CSMSC</td>
      <td>
      <a href = "./examples/csmsc/tts3">fastspeech2-csmsc</a>
      </td>
    </tr>
   <tr>
      <td rowspan="4">Vocoder</td>
      <td >WaveFlow</td>
      <td >LJSpeech</td>
      <td>
      <a href = "./examples/ljspeech/voc0">waveflow-ljspeech</a>
      </td>
    </tr>
    <tr>
      <td rowspan="3">Parallel WaveGAN</td>
      <td >LJSpeech</td>
      <td>
      <a href = "./examples/ljspeech/voc1">PWGAN-ljspeech</a>
      </td>
    </tr>
    <tr>
      <td >VCTK</td>
      <td>
      <a href = "./examples/vctk/voc1">PWGAN-vctk</a>
      </td>
    </tr>
    <tr>
      <td >CSMSC</td>
      <td>
      <a href = "./examples/csmsc/voc1">PWGAN-csmsc</a>
      </td>
    </tr>
    <tr>
    <td rowspan="2">Voice Cloning</td>
    <td>GE2E</td>
    <td >AISHELL-3, etc.</td>
    <td>
    <a href = "./examples/other/ge2e">ge2e</a>
    </td>
    </tr>
    <tr>
    <td>GE2E + Tacotron2</td>
    <td>AISHELL-3</td>
    <td>
    <a href = "./examples/aishell3/vc0">ge2e-tactron2-aishell3</a>
    </td>
    </tr>
  </tbody>
</table>


## Tutorials

Normally, [Speech SoTA](https://paperswithcode.com/area/speech) gives you an overview of the hot academic topics in speech. If you want to focus on the two main tasks in PaddleSpeech, you will find the following guidelines helpful for grasping the core ideas.

The original ASR module is based on [Baidu's DeepSpeech](https://arxiv.org/abs/1412.5567), which was an independent product named [DeepSpeech](https://deepspeech.readthedocs.io). However, the toolkit now aligns with almost all of the SoTA modules in the pipeline. Specifically, these modules are:

* [Data Preparation](docs/source/asr/data_preparation.md)  
* [Data Augmentation](docs/source/asr/augmentation.md)  
* [Ngram LM](docs/source/asr/ngram_lm.md)  
* [Benchmark](docs/source/asr/benchmark.md)  
* [Released Model](docs/source/asr/released_model.md)  

The TTS module was originally called [Parakeet](https://github.com/PaddlePaddle/Parakeet) and is now merged with DeepSpeech. If you are interested in academic research about this module, please see the [TTS research overview](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/docs/source/tts#overview). Also, [this document](https://paddleparakeet.readthedocs.io/en/latest/released_models.html) is a good guideline for the pipeline components.


## FAQ and Contributing

You are warmly welcomed to submit questions in [discussions](https://github.com/PaddlePaddle/DeepSpeech/discussions) and bug reports in [issues](https://github.com/PaddlePaddle/DeepSpeech/issues)! Also, we highly appreciate it if you would like to contribute to this project!

## License

PaddleSpeech is provided under the [Apache-2.0 License](./LICENSE).

## Acknowledgement

PaddleSpeech depends on a lot of open-source repositories. See [references](docs/source/reference.md) for more information.