English | [简体中文](README_ch.md)

<p align="center">
  <img src="./docs/images/PaddleSpeech_logo.png" />
</p>
<div align="center">

  <h3>
  <a href="#quick-start"> Quick Start </a>
  | <a href="#tutorials"> Tutorials </a>
  | <a href="#models-list"> Models List </a>
  </h3>
</div>

------------------------------------------------------------------------------------
![License](https://img.shields.io/badge/license-Apache%202-red.svg)
![python version](https://img.shields.io/badge/python-3.7+-orange.svg)
![support os](https://img.shields.io/badge/os-linux-yellow.svg)

<!---
from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readmes-readable.md
1.What is this repo or project? (You can reuse the repo description you used earlier because this section doesn’t have to be long.)
2.How does it work?
3.Who will use this repo or project?
4.What is the goal of this project?
-->

**PaddleSpeech** is an open-source toolkit on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech, with state-of-the-art and influential models.

##### Speech-To-Text

<div align = "center">
<table style="width:100%">
  <thead>
    <tr>
      <th> Input Audio  </th>
      <th> Recognition Result  </th>
    </tr>
  </thead>
  <tbody>
   <tr>
      <td align = "center">
      <a href="https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav" rel="nofollow">
            <img align="center" src="./docs/images/audio_icon.png" width="320" style="max-width: 100%;"></a><br>
      </td>
      <td >I knocked at the door on the ancient side of the building.</td>
    </tr>
    <tr>
      <td align = "center">
      <a href="https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav" rel="nofollow">
            <img align="center" src="./docs/images/audio_icon.png" width="320" style="max-width: 100%;"></a><br>
      </td>
      <td>我认为跑步最重要的就是给我带来了身体健康。</td>
    </tr>
  </tbody>
</table>

</div>
  
##### Text-To-Speech
<div align = "center">
<table style="width:100%">
  <thead>
    <tr>
      <th><img width="200" height="1"> Input Text <img width="200" height="1"> </th>
      <th>Synthetic Audio</th>
    </tr>
  </thead>
  <tbody>
   <tr>
      <td >Life was like a box of chocolates, you never know what you're gonna get.</td>
      <td align = "center">
      <a href="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/001.wav" rel="nofollow">
            <img align="center" src="./docs/images/audio_icon.png" width="200" style="max-width: 100%;"></a><br>
      </td>
    </tr>
    <tr>
      <td >早上好,今天是2020/10/29,最低温度是-3°C。</td>
      <td align = "center">
      <a href="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/parakeet_espnet_fs2_pwg_demo/tn_g2p/parakeet/001.wav" rel="nofollow">
            <img align="center" src="./docs/images/audio_icon.png" width="200" style="max-width: 100%;"></a><br>
      </td>
    </tr>
  </tbody>
</table>

</div>

Via its easy-to-use, efficient, flexible, and scalable implementation, our vision is to empower both industrial applications and academic research, covering training, inference and testing modules, and the deployment process. More specifically, this toolkit features:
- **Fast and Lightweight**: we provide high-speed and ultra-lightweight models that are convenient for industrial deployment.
- **Rule-based Chinese frontend**: our frontend contains Text Normalization and Grapheme-to-Phoneme (G2P, including Polyphone and Tone Sandhi). Moreover, we use self-defined linguistic rules to adapt to the Chinese context.
- **Varieties of Functions that Vitalize both Industry and Academia**:
  - *Implementation of critical audio tasks*: this toolkit contains audio functions like Speech Translation, Automatic Speech Recognition, Text-To-Speech Synthesis, Voice Cloning, etc.
  - *Integration of mainstream models and datasets*: the toolkit implements modules that cover the whole pipeline of the speech tasks, and uses mainstream datasets like LibriSpeech, LJSpeech, AIShell, CSMSC, etc. See also the [model list](#models-list) for more details.
  - *Cascaded model application*: as an extension of the traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields such as Natural Language Processing (NLP), e.g. Punctuation Restoration.

# Community

You are warmly welcomed to submit questions in [discussions](https://github.com/PaddlePaddle/DeepSpeech/discussions) and bug reports in [issues](https://github.com/PaddlePaddle/DeepSpeech/issues)! We would also highly appreciate your contributions to this project!

If you are from China, we strongly recommend joining our PaddleSpeech WeChat group. Scan the WeChat QR code below to get in touch with the other developers in this community!

<div align="center">
<img src="./docs/images/wechat-code-speech.png"  width = "200">

</div>

# Alternative Installation

The base environment for this page is:
- Ubuntu 16.04
- python>=3.7
- paddlepaddle==2.1.2

If you want to set up PaddleSpeech in another environment, please see the [installation](./docs/source/install.md) documents for all the alternatives.
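
For a quick pip-based setup, here is a minimal sketch, assuming a conda environment and an editable source install (the `pip install -e .` entry point is an assumption; see the installation docs for the authoritative steps):

```shell
# create and activate an isolated environment (optional but recommended)
conda create -y -n paddlespeech python=3.7
conda activate paddlespeech
# install the PaddlePaddle release pinned above
pip install paddlepaddle==2.1.2
# fetch the source and install PaddleSpeech (assumed editable install)
git clone https://github.com/PaddlePaddle/DeepSpeech.git
cd DeepSpeech
pip install -e .
```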

# Quick Start

Developers can try our models with only a few lines of code.

A tiny DeepSpeech2 *Speech-To-Text* example on a toy subset of LibriSpeech:

```shell
cd examples/tiny/s0/
# source the environment
source path.sh
# prepare librispeech dataset
bash local/data.sh
# evaluate a checkpoint (replace `ckptfile` with the path to your checkpoint)
bash local/test.sh conf/deepspeech2.yaml ckptfile offline
```
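
The snippet above prepares the data and evaluates an existing checkpoint. To train the tiny model yourself first, the example directory also provides a training script; a sketch, assuming `local/train.sh` takes the config file and an experiment name:

```shell
# train the tiny DeepSpeech2 model, writing checkpoints under the experiment name
# (assumed arguments: <config> <experiment/checkpoint name>)
bash local/train.sh conf/deepspeech2.yaml ckptfile
```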

For *Text-To-Speech*, try FastSpeech2 on LJSpeech:
- Download LJSpeech-1.1 from the [ljspeech official website](https://keithito.com/LJ-Speech-Dataset/) and our prepared durations for FastSpeech2, [ljspeech_alignment](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz).
- The pretrained models are separated into two parts: [fastspeech2_nosil_ljspeech_ckpt](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_ljspeech_ckpt_0.5.zip) and [pwg_ljspeech_ckpt](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_ljspeech_ckpt_0.5.zip). Please download and unzip them to `./model/fastspeech2` and `./model/pwg` respectively.
- Assuming your paths to the dataset and the alignments are `~/datasets/LJSpeech-1.1` and `./ljspeech_alignment` respectively, preprocess your data and then use our pretrained models to synthesize. The snippet below shows the analogous end-to-end synthesis on CSMSC with pretrained checkpoints:
```shell
cd examples/csmsc/tts3
# download the pretrained models and unzip them
wget https://paddlespeech.bj.bcebos.com/Parakeet/pwg_baker_ckpt_0.4.zip
unzip pwg_baker_ckpt_0.4.zip
wget https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_baker_ckpt_0.4.zip
unzip fastspeech2_nosil_baker_ckpt_0.4.zip
# source the environment
source path.sh
# run end-to-end synthesize
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize_e2e.py \
  --fastspeech2-config=fastspeech2_nosil_baker_ckpt_0.4/default.yaml \
  --fastspeech2-checkpoint=fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \
  --fastspeech2-stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \
  --pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \
  --pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
  --pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
  --text=${BIN_DIR}/../sentences.txt \
  --output-dir=exp/default/test_e2e \
  --inference-dir=exp/default/inference \
  --device="gpu" \
  --phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
```
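
The `--text` file holds one utterance per line. Below is a sketch of a custom input, assuming each line is an utterance id followed by the text to synthesize (the assumed format of the shipped `sentences.txt`); the generated wav files are written to `--output-dir`:

```shell
# write a custom input file (assumed line format: <utterance id> <text>)
cat > my_sentences.txt << 'EOF'
001 早上好,今天是一个好天气。
EOF
# then rerun the command above with --text=my_sentences.txt
```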

If you want to try more functions like training and tuning, please see [Speech-To-Text getting started](./docs/source/asr/getting_started.md) and [Text-To-Speech Basic Use](./docs/source/tts/basic_usage.md).

# Models List

PaddleSpeech supports a series of the most popular models, summarized in [released models](./docs/source/released_model.md) together with the available pretrained models.

The Speech-To-Text module contains an *Acoustic Model* and a *Language Model*, with the following details:

<!---
The current hyperlinks redirect to [Previous Parakeet](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples).
-->

<table style="width:100%">
  <thead>
    <tr>
      <th>Speech-To-Text Module Type</th>
      <th>Dataset</th>
      <th>Model Type</th>
      <th>Link</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="3">Acoustic Model</td>
      <td rowspan="2" >Aishell</td>
      <td >DeepSpeech2 RNN + Conv based Models</td>
      <td>
      <a href = "https://deepspeech.bj.bcebos.com/release2.1/aishell/s0/aishell.s0.ds_online.5rnn.debug.tar.gz">Online</a> / <a href = "https://deepspeech.bj.bcebos.com/release2.1/aishell/s0/aishell.s0.ds2.offline.cer6p65.release.tar.gz">Offline </a>
      </td>
    </tr>
    <tr>
      <td>Transformer based Attention Models </td>
      <td>
      <a href = "https://deepspeech.bj.bcebos.com/release2.1/librispeech/s1/conformer.release.tar.gz">Non-CTC Loss</a> / <a href = "https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.release.tar.gz">CTC Loss </a>
      </td>
    </tr>
      <tr>
      <td> Librispeech</td>
      <td>Transformer based Attention Models </td>
      <td>
      <a href = "https://deepspeech.bj.bcebos.com/release2.1/librispeech/s1/conformer.release.tar.gz">Conformer</a> / <a href = "https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.release.tar.gz"> Transformer </a> Decoder
      </td>
    </tr>
   <tr>
      <td rowspan="2">Language Model</td>
      <td >CommonCrawl(en.00)</td>
      <td >English Language Model</td>
      <td>
      <a href = "https://deepspeech.bj.bcebos.com/en_lm/common_crawl_00.prune01111.trie.klm">English Language Model</a>
      </td>
    </tr>
    <tr>
      <td>Baidu Internal Corpus</td>
      <td>Mandarin Language Model Small</td>
      <td>
      Mandarin Language Model <a href = "https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm"> Small</a> / <a href = "https://deepspeech.bj.bcebos.com/zh_lm/zhidao_giga.klm">Large</a>
      </td>
    </tr>
  </tbody>
</table>
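
The links in the table point directly to tarballs of pretrained checkpoints. As a minimal sketch, the offline Aishell DeepSpeech2 model from the first row can be fetched and unpacked like this:

```shell
# download and unpack a pretrained DeepSpeech2 (offline) checkpoint for Aishell
wget https://deepspeech.bj.bcebos.com/release2.1/aishell/s0/aishell.s0.ds2.offline.cer6p65.release.tar.gz
tar -xzf aishell.s0.ds2.offline.cer6p65.release.tar.gz
```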


PaddleSpeech Text-To-Speech mainly contains three modules: *Text Frontend*, *Acoustic Model* and *Vocoder*. The Acoustic Model and Vocoder models are listed as follows:

<table>
  <thead>
    <tr>
      <th> Text-To-Speech Module Type <img width="110" height="1"> </th>
      <th>  Model Type  </th>
      <th> <img width="50" height="1"> Dataset  <img width="50" height="1"> </th>
      <th> <img width="101" height="1"> Link <img width="105" height="1"> </th>
    </tr>
  </thead>
  <tbody>
    <tr>
    <td> Text Frontend</td>
    <td colspan="2"> &emsp; </td>
    <td>
    <a href = "./examples/other/text_frontend">chinese-frontend</a>
    </td>
    </tr>
    <tr>
      <td rowspan="4">Acoustic Model</td>
      <td >Tacotron2</td>
      <td rowspan="2" >LJSpeech</td>
      <td>
      <a href = "./examples/ljspeech/tts0">tacotron2-ljspeech</a>
      </td>
    </tr>
    <tr>
      <td>TransformerTTS</td>
      <td>
      <a href = "./examples/ljspeech/tts1">transformer-ljspeech</a>
      </td>
    </tr>
    <tr>
      <td>SpeedySpeech</td>
      <td>CSMSC</td>
      <td >
      <a href = "./examples/csmsc/tts2">speedyspeech-csmsc</a>
      </td>
    </tr>
    <tr>
      <td>FastSpeech2</td>
      <td>AISHELL-3 / VCTK / LJSpeech / CSMSC</td>
      <td>
      <a href = "./examples/aishell3/tts3">fastspeech2-aishell3</a> / <a href = "./examples/vctk/tts3">fastspeech2-vctk</a> / <a href = "./examples/ljspeech/tts3">fastspeech2-ljspeech</a> / <a href = "./examples/csmsc/tts3">fastspeech2-csmsc</a>
      </td>
    </tr>
   <tr>
      <td rowspan="2">Vocoder</td>
      <td >WaveFlow</td>
      <td >LJSpeech</td>
      <td>
      <a href = "./examples/ljspeech/voc0">waveflow-ljspeech</a>
      </td>
    </tr>
    <tr>
      <td >Parallel WaveGAN</td>
      <td >LJSpeech / VCTK / CSMSC</td>
      <td>
      <a href = "./examples/ljspeech/voc1">PWGAN-ljspeech</a> / <a href = "./examples/vctk/voc1">PWGAN-vctk</a> / <a href = "./examples/csmsc/voc1">PWGAN-csmsc</a> 
      </td>
    </tr>
    <tr>
    <td rowspan="2">Voice Cloning</td>
    <td>GE2E</td>
    <td >AISHELL-3, etc.</td>
    <td>
    <a href = "./examples/other/ge2e">ge2e</a>
    </td>
    </tr>
    <tr>
    <td>GE2E + Tacotron2</td>
    <td>AISHELL-3</td>
    <td>
    <a href = "./examples/aishell3/vc0">ge2e-tacotron2-aishell3</a>
    </td>
    </tr>
  </tbody>
</table>


# Tutorials

Normally, [Speech SoTA](https://paperswithcode.com/area/speech) gives you an overview of the hot academic topics in speech. To focus on the tasks in PaddleSpeech, you will find the following guides helpful for grasping the core ideas.
- [Overview](./docs/source/introduction.md) 
- Quick Start
  - [Dependencies](./docs/source/dependencies.md) and [Installation](./docs/source/install.md)
  - [Quick Start of Speech-To-Text](./docs/source/asr/quick_start.md)
  - [Quick Start of Text-To-Speech](./docs/source/tts/quick_start.md)
- Speech-To-Text
  - [Models Introduction](./docs/source/asr/models_introduction.md)
  - [Data Preparation](./docs/source/asr/data_preparation.md)
  - [Data Augmentation Pipeline](./docs/source/asr/augmentation.md)
  - [Features](./docs/source/asr/feature_list.md)
  - [Ngram LM](./docs/source/asr/ngram_lm.md)
- Text-To-Speech
  - [Introduction](./docs/source/tts/models_introduction.md)
  - [Advanced Usage](./docs/source/tts/advanced_usage.md)
  - [Chinese Rule Based Text Frontend](./docs/source/tts/zh_text_frontend.md)
  - [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html) and [PaddleSpeech VS. Espnet](https://paddlespeech.readthedocs.io/en/latest/tts/demo_2.html)
- [Released Models](./docs/source/released_model.md)

# License and Acknowledgement

PaddleSpeech is provided under the [Apache-2.0 License](./LICENSE).

PaddleSpeech depends on many open-source repositories. See [references](./docs/source/asr/reference.md) for more information.

# Citation
To cite PaddleSpeech for research, please use the following format.
```tex
@misc{ppspeech2021,
  title={PaddleSpeech, a toolkit for audio processing based on PaddlePaddle.},
  author={PaddlePaddle Authors},
  howpublished={\url{https://github.com/PaddlePaddle/DeepSpeech}},
  year={2021}
}
```