**PaddleSpeech** is an open-source toolkit on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech and audio, with state-of-the-art and influential models.
...
</div>
### Features
...
- 🔬 *Integration of mainstream models and datasets*: the toolkit implements modules covering the whole pipeline of the speech tasks, and uses mainstream datasets such as LibriSpeech, LJSpeech, AIShell, and CSMSC. See the [model list](#model-list) for more details.
- 🧩 *Cascaded models application*: as an extension of typical traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields like Natural Language Processing (NLP) and Computer Vision (CV).
### 🔥 Hot Activities
<!---
2021.12.14: We would like to hold an online course to introduce the basics and research of speech, as well as code practice with `paddlespeech`. Please pay attention to our [Calendar](https://www.paddlepaddle.org.cn/live).
--->
- 2021.12.21~12.24: 4-Day Live Courses, an in-depth interpretation of PaddleSpeech!

  **Course videos and related materials: https://aistudio.baidu.com/aistudio/education/group/info/25130**
### Recent Update
- 👏🏻 2022.04.28: PaddleSpeech Streaming Server is available for Automatic Speech Recognition and Text-to-Speech.
- 👏🏻 2022.03.28: PaddleSpeech Server is available for Audio Classification, Automatic Speech Recognition and Text-to-Speech.
- 👏🏻 2022.03.28: PaddleSpeech CLI is available for Speaker Verification.
- 🤗 2021.12.14: Our PaddleSpeech [ASR](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR) and [TTS](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS) demos on Hugging Face Spaces are available!
...
We strongly recommend that our users install PaddleSpeech on **Linux** with *python>=3.7*.
Up to now, **Linux** supports the CLI for all our tasks, while **Mac OSX** and **Windows** only support the PaddleSpeech CLI for Audio Classification, Speech-to-Text and Text-to-Speech. To install `PaddleSpeech`, please see [installation](./docs/source/install.md).
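For the common case, installation reduces to a couple of pip commands (a minimal sketch of the easy route; version pins and CUDA variants are covered in the installation doc):

```bash
# A minimal sketch, assuming a clean Python >= 3.7 environment;
# pick a CUDA build of paddlepaddle from the Paddle site for GPU use.
pip install paddlepaddle
pip install paddlespeech
```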
For more information about server command lines, please see the [speech server demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/speech_server).
<a name="quickstartstreamingserver"></a>
## Quick Start Streaming Server
Developers can try the [streaming ASR](./demos/streaming_asr_server/README.md) and [streaming TTS](./demos/streaming_tts_server/README.md) servers.
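Each server starts from a single command; for example (a sketch, assuming the streaming ASR demo's bundled default config, whose filename may differ across versions):

```bash
# Run inside demos/streaming_asr_server; the config filename is an
# assumption based on the demo's defaults.
paddlespeech_server start --config_file ./conf/ws_conformer_application.yaml
```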
The Text-to-Speech module was originally called [Parakeet](https://github.com/PaddlePaddle/Parakeet) and has now been merged into this repository. If you are interested in academic research on this task, please see the [TTS research overview](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/docs/source/tts#overview). Also, [this document](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/models_introduction.md) is a good guideline for the pipeline components.
## ⭐ Examples
- **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): Use PaddleSpeech TTS to generate virtual human voice.**
To cite PaddleSpeech for research, please use the following format.
...
## Acknowledgement
- Many thanks to [yeyupiaoling](https://github.com/yeyupiaoling)/[PPASR](https://github.com/yeyupiaoling/PPASR)/[PaddlePaddle-DeepSpeech](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech)/[VoiceprintRecognition-PaddlePaddle](https://github.com/yeyupiaoling/VoiceprintRecognition-PaddlePaddle)/[AudioClassification-PaddlePaddle](https://github.com/yeyupiaoling/AudioClassification-PaddlePaddle) for years of attention, constructive advice and great help.
- Many thanks to [mymagicpower](https://github.com/mymagicpower) for the Java implementation of ASR for [short](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_sdk) and [long](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_long_audio_sdk) audio files.
- Many thanks to [JiehangXie](https://github.com/JiehangXie)/[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo) for developing a Virtual Uploader (VUP)/Virtual YouTuber (VTuber) with the PaddleSpeech TTS function.
paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav -v
# Chinese ASR + Punctuation Restoration
paddlespeech asr --input ./zh.wav -v | paddlespeech text --task punc -v
```
(If you don't want to see the log information, you can remove `-v`. Besides, it doesn't matter if the package `paddlespeech-ctcdecoders` is not found; this package is optional.)
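For example, the same English recognition with logging kept on and the program's prompts (such as sample-rate conversion) accepted automatically, via the `yes` option described in the parameter list below (a sketch; the flag spelling follows that list):

```bash
# --yes auto-accepts prompts such as audio resampling; -v keeps the logs.
paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav --yes -v
```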
- `ckpt_path`: Model checkpoint. Use a pretrained model when it is None. Default: `None`.
- `yes`: No additional argument required. Once this flag is set, the program's requests are accepted by default, which includes transforming the audio sample rate. Default: `False`.
- `device`: Choose the device to execute model inference. Default: default device of paddlepaddle in the current environment.
- `verbose`: Show the log information.
Output:
```bash
...
This demo is an implementation of starting the streaming speech service and accessing the service. It can be achieved with a single command using `paddlespeech_server` and `paddlespeech_client`, or with a few lines of code in Python.
The streaming ASR server only supports the `websocket` protocol; it does not support `http`.
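Accessing the running service therefore goes through the websocket client; for example (a sketch; the IP and port are assumptions matching the demo's defaults):

```bash
# Send an audio file to the streaming ASR server over websocket.
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
```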
...
### 2. Prepare Config File
The configuration file can be found in `conf/tts_online_application.yaml`.
- `protocol` indicates the network protocol used by the streaming TTS service. Currently, both **http and websocket** are supported.
- `engine_list` indicates the speech engine that will be included in the service to be started, in the format of `<speech task>_<engine type>`.
  - This demo mainly introduces the streaming speech synthesis service, so the speech task should be set to `tts`.
  - The engine type supports two forms: **online** and **online-onnx**. `online` indicates an engine that uses Python for dynamic graph inference; `online-onnx` indicates an engine that uses onnxruntime for inference. The inference speed of `online-onnx` is faster.
- The streaming TTS engine supports **fastspeech2** and **fastspeech2_cnndecoder** as AM models, and **hifigan** and **mb_melgan** as Voc models.
- In streaming AM inference, one chunk of data is inferred at a time to achieve the streaming effect. `am_block` indicates the number of valid frames in a chunk, and `am_pad` indicates the number of frames added before and after `am_block` in a chunk. `am_pad` exists to eliminate errors caused by streaming inference, so that streaming does not affect the quality of the synthesized audio.
  - fastspeech2 does not support streaming AM inference, so `am_pad` and `am_block` have no effect on it.
  - fastspeech2_cnndecoder supports streaming inference. When `am_pad=12`, the audio synthesized by streaming inference is consistent with non-streaming synthesis.
- In streaming Voc inference, one chunk of data is inferred at a time to achieve the streaming effect. `voc_block` indicates the number of valid frames in a chunk, and `voc_pad` indicates the number of frames added before and after `voc_block` in a chunk. `voc_pad` exists to eliminate errors caused by streaming inference, so that streaming does not affect the quality of the synthesized audio.
  - Both hifigan and mb_melgan support streaming Voc inference.
  - When the Voc model is mb_melgan, setting `voc_pad=14` makes the streaming output consistent with the non-streaming output; `voc_pad` can be reduced to 7 with no audible artifacts, but below 7 the synthesized audio sounds abnormal.
  - When the Voc model is hifigan, setting `voc_pad=20` makes the streaming output consistent with the non-streaming output; with `voc_pad=14` the synthesized audio has no audible artifacts.
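With this configuration in place, starting the service is a single command (a sketch; the invocation mirrors the other server demos and assumes you run from the demo directory):

```bash
# Start the streaming TTS server with the config described above.
paddlespeech_server start --config_file ./conf/tts_online_application.yaml
```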
...
- `sample_rate`: Sampling rate, choices: [0, 8000, 16000], where 0 means the same as the model. Default: 0.
- `output`: Output wave filepath. Default: None, which means the audio is not saved locally.
- `play`: Whether to play the audio while synthesizing. Default: False, which means not playing. **Playing audio requires the pyaudio library**.
Output:
```bash
...
- `protocol`: Service protocol, choices: [http, websocket]. Default: http.
- `input` (required): Input text to generate.
- `spk_id`: Speaker id for multi-speaker text-to-speech. Default: 0.
- `speed`: Audio speed; the value should be set between 0 and 3. Default: 1.0.
- `volume`: Audio volume; the value should be set between 0 and 3. Default: 1.0.
- `sample_rate`: Sampling rate, choices: [0, 8000, 16000], where 0 means the same as the model. Default: 0.
- `output`: Output wave filepath. Default: None, which means the audio is not saved locally.
- `play`: Whether to play the audio while synthesizing. Default: False, which means not playing. **Playing audio requires the pyaudio library**.
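Putting these arguments together, a client call might look like this (a sketch; the server address, port and input text are placeholders matching the demo's defaults):

```bash
# Request streaming synthesis over http and save the result locally.
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 \
    --protocol http --input "您好，欢迎使用语音合成服务。" --output output.wav
```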
# Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference
voc: 'mb_melgan_csmsc'
voc_config:
voc_ckpt:
...
# others
lang: 'zh'
device: 'cpu'  # set 'gpu:id' or 'cpu'
# am_block and am_pad are only used by the fastspeech2_cnndecoder model for streaming am inference;
# when am_pad is set to 12, streaming synthetic audio is the same as non-streaming synthetic audio
am_block: 42
am_pad: 12
# voc_block and voc_pad are used by the voc model for streaming voc inference;
# when the voc model is mb_melgan_csmsc and voc_pad is set to 14, streaming synthetic audio is the same as non-streaming synthetic audio; the minimum pad can be set to 7, and the streaming synthetic audio still sounds normal
# when the voc model is hifigan_csmsc and voc_pad is set to 20, streaming synthetic audio is the same as non-streaming synthetic audio; with voc_pad set to 14, the streaming synthetic audio sounds normal
# Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference
voc: 'hifigan_csmsc_onnx'
voc_ckpt:
voc_sample_rate: 24000
...
# others
lang: 'zh'
# am_block and am_pad are only used by the fastspeech2_cnndecoder_onnx model for streaming am inference;
# when am_pad is set to 12, streaming synthetic audio is the same as non-streaming synthetic audio
am_block: 42
am_pad: 12
# voc_block and voc_pad are used by the voc model for streaming voc inference;
# when the voc model is mb_melgan_csmsc_onnx and voc_pad is set to 14, streaming synthetic audio is the same as non-streaming synthetic audio; the minimum pad can be set to 7, and the streaming synthetic audio still sounds normal
# when the voc model is hifigan_csmsc_onnx and voc_pad is set to 20, streaming synthetic audio is the same as non-streaming synthetic audio; with voc_pad set to 14, the streaming synthetic audio sounds normal
voc_block: 14
voc_pad: 14
# voc_upsample should be the same as n_shift in the voc config.
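To switch the service between the two engine types, only `engine_list` needs to change (a sketch; the exact list syntax in the shipped YAML is an assumption):

```bash
# Point engine_list at the onnx engine, then restart the server.
sed -i "s/engine_list: \['tts_online'\]/engine_list: ['tts_online-onnx']/" conf/tts_online_application.yaml
paddlespeech_server start --config_file ./conf/tts_online_application.yaml
```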
The script performs diarization using x-vectors (TDNN, ECAPA-TDNN) on the AMI mix-headset data. We demonstrate the use of different clustering methods: AHC and spectral clustering.
## How to Run
### Prepare annotations and audios
Download the AMI corpus. You need around 10 GB of free space to get the whole dataset.
The signals are too large to package in one piece, so you need to use the chooser to indicate which ones you wish to download.
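Once the corpus and the manual annotations are on disk, the recipe can be launched roughly as follows (a sketch; the flag names assume the example's `run.sh` forwards `data_folder` and `manual_annot_folder`, and both paths are placeholders):

```bash
# Placeholders: point these at your downloaded AMI audio and the
# ami_public_manual annotation release.
bash run.sh --data_folder ./amicorpus --manual_annot_folder ./ami_public_manual_1.6.2
```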
* More `Paddle` Docker images can be found [here](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/docker/linux-docker.html).
* If you only want to work on CPU, please download the corresponding [image](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/docker/linux-docker.html) and use `docker` instead of `nvidia-docker`.
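For example, a CPU-only session could look like this (a sketch; the image tag is illustrative, pick the actual tag from the linked page):

```bash
# Pull a CPU-only Paddle image and open a shell in it, mounting the
# current directory into the container.
docker pull paddlepaddle/paddle:2.3.0
docker run --name paddle_cpu -v $PWD:/paddle -it paddlepaddle/paddle:2.3.0 /bin/bash
```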