Commit 43d1dff1, authored by: Q qingen

Merge branch 'database-search' of github.com:qingen/PaddleSpeech into database-search

...@@ -50,13 +50,13 @@ repos:
entry: bash .pre-commit-hooks/clang-format.hook -i
language: system
files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|cuh|proto)$
exclude: (?=speechx/speechx/kaldi|speechx/patch).*(\.cpp|\.cc|\.h|\.py)$
- id: copyright_checker
name: copyright_checker
entry: python .pre-commit-hooks/copyright-check.hook
language: system
files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|proto|py)$
exclude: (?=third_party|pypinyin|speechx/speechx/kaldi|speechx/patch).*(\.cpp|\.cc|\.h|\.py)$
- repo: https://github.com/asottile/reorder_python_imports
rev: v2.4.0
hooks:
......
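The updated `exclude` patterns above use a lookahead so that sources under `speechx/speechx/kaldi` and the new `speechx/patch` directory are skipped by the formatting and copyright hooks. A standalone sketch of how such a pattern filters paths (illustration only, with hypothetical file paths; it is not part of the pre-commit config):
```python
# Illustration of the exclude regex added above: paths under
# speechx/speechx/kaldi or speechx/patch are skipped, other sources are kept.
import re

exclude = re.compile(r"(?=speechx/speechx/kaldi|speechx/patch).*(\.cpp|\.cc|\.h|\.py)$")

for path in [
    "speechx/speechx/kaldi/base/io.cc",        # excluded (hypothetical path)
    "speechx/patch/openfst/src/lib.cc",        # excluded (hypothetical path)
    "speechx/speechx/decoder/ctc_decoder.cc",  # still checked (hypothetical path)
]:
    print(path, "->", "excluded" if exclude.search(path) else "checked")
```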
# Changelog
Date: 2022-3-08, Author: yt605155624.
Add features to: T2S:
- Add aishell3 hifigan egs.
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1545
Date: 2022-3-08, Author: yt605155624.
Add features to: T2S:
- Add vctk hifigan egs.
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1544
Date: 2022-1-29, Author: yt605155624.
Add features to: T2S:
......
...@@ -178,7 +178,7 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
<!---
2021.12.14: We would like to offer online courses introducing the basics and research of speech, as well as code practice with `paddlespeech`. Please pay attention to our [Calendar](https://www.paddlepaddle.org.cn/live).
--->
- 🤗 2021.12.14: Our PaddleSpeech [ASR](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR) and [TTS](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS) Demos on Hugging Face Spaces are available!
- 👏🏻 2021.12.10: PaddleSpeech CLI is available for Audio Classification, Automatic Speech Recognition, Speech Translation (English to Chinese) and Text-to-Speech.
### Community
...@@ -207,6 +207,7 @@ paddlespeech cls --input input.wav
```shell
paddlespeech asr --lang zh --input input_16k.wav
```
- Web demo for Automatic Speech Recognition is integrated into [Huggingface Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See Demo: [ASR Demo](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR)
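The same transcription can also be scripted from Python; a minimal sketch, assuming the `ASRExecutor` interface exported by `paddlespeech.cli` in this release (the exact import path and keyword names may differ between versions, so treat this as an illustration rather than the canonical API):
```python
# Hedged sketch of a Python counterpart of the `paddlespeech asr` command above.
from paddlespeech.cli import ASRExecutor  # assumption: exported at this path

asr = ASRExecutor()
text = asr(audio_file="input_16k.wav")  # 16 kHz mono wav, as in the CLI example
print(text)
```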
**Speech Translation** (English to Chinese)
(not supported on Mac and Windows yet)
...@@ -218,7 +219,7 @@ paddlespeech st --input input_16k.wav
```shell
paddlespeech tts --input "你好,欢迎使用飞桨深度学习框架!" --output output.wav
```
- Web demo for Text to Speech is integrated into [Huggingface Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See Demo: [TTS Demo](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS)
**Text Postprocessing**
- Punctuation Restoration
...@@ -397,9 +398,9 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tr>
<tr>
<td >HiFiGAN</td>
<td >LJSpeech / VCTK / CSMSC / AISHELL-3</td>
<td>
<a href = "./examples/ljspeech/voc5">HiFiGAN-ljspeech</a> / <a href = "./examples/vctk/voc5">HiFiGAN-vctk</a> / <a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a> / <a href = "./examples/aishell3/voc5">HiFiGAN-aishell3</a>
</td>
</tr>
<tr>
...@@ -573,7 +574,6 @@ You are warmly welcome to submit questions in [discussions](https://github.com/P
- Many thanks to [yeyupiaoling](https://github.com/yeyupiaoling)/[PPASR](https://github.com/yeyupiaoling/PPASR)/[PaddlePaddle-DeepSpeech](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech)/[VoiceprintRecognition-PaddlePaddle](https://github.com/yeyupiaoling/VoiceprintRecognition-PaddlePaddle)/[AudioClassification-PaddlePaddle](https://github.com/yeyupiaoling/AudioClassification-PaddlePaddle) for years of attention, constructive advice and great help.
- Many thanks to [AK391](https://github.com/AK391) for TTS web demo on Huggingface Spaces using Gradio.
- Many thanks to [mymagicpower](https://github.com/mymagicpower) for the Java implementation of ASR upon [short](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_sdk) and [long](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_long_audio_sdk) audio files.
- Many thanks to [JiehangXie](https://github.com/JiehangXie)/[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo) for developing Virtual Uploader(VUP)/Virtual YouTuber(VTuber) with PaddleSpeech TTS function.
- Many thanks to [745165806](https://github.com/745165806)/[PaddleSpeechTask](https://github.com/745165806/PaddleSpeechTask) for contributing Punctuation Restoration model.
......
...@@ -392,9 +392,9 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</tr>
<tr>
<td >HiFiGAN</td>
<td >LJSpeech / VCTK / CSMSC / AISHELL-3</td>
<td>
<a href = "./examples/ljspeech/voc5">HiFiGAN-ljspeech</a> / <a href = "./examples/vctk/voc5">HiFiGAN-vctk</a> / <a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a> / <a href = "./examples/aishell3/voc5">HiFiGAN-aishell3</a>
</td>
</tr>
<tr>
......
...@@ -84,5 +84,8 @@ Here is a list of pretrained models released by PaddleSpeech that can be used by
| Model | Language | Sample Rate |
| :--- | :---: | :---: |
| conformer_wenetspeech | zh | 16k |
| transformer_librispeech | en | 16k |
| deepspeech2offline_aishell | zh | 16k |
| deepspeech2online_aishell | zh | 16k |
| deepspeech2offline_librispeech | en | 16k |
...@@ -81,5 +81,8 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
| Model | Language | Sample Rate |
| :--- | :---: | :---: |
| conformer_wenetspeech | zh | 16k |
| transformer_librispeech | en | 16k |
| deepspeech2offline_aishell | zh | 16k |
| deepspeech2online_aishell | zh | 16k |
| deepspeech2offline_librispeech | en | 16k |
...@@ -110,21 +110,22 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import ASRClientExecutor
import json
asrclient_executor = ASRClientExecutor()
res = asrclient_executor(
input="./zh.wav",
server_ip="127.0.0.1",
port=8090,
sample_rate=16000,
lang="zh_cn",
audio_format="wav")
print(res.json())
```
Output:
```bash
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'transcription': '我认为跑步最重要的就是给我带来了身体健康'}}
time cost 0.604353 s.
```
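If only the recognized text is needed, it can be read straight out of the JSON structure shown above:
```python
# Continues the ASR client example: pick the transcription out of the
# response dict whose layout is shown in the Output block.
response = res.json()
if response["success"]:
    print(response["result"]["transcription"])
```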
### 5. TTS Client Usage
...@@ -146,7 +147,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- `speed`: Audio speed, the value should be set between 0 and 3. Default: 1.0
- `volume`: Audio volume, the value should be set between 0 and 3. Default: 1.0
- `sample_rate`: Sampling rate, choice: [0, 8000, 16000], the default is the same as the model. Default: 0
- `output`: Output wave filepath. Default: None, which means the audio is not saved locally.
Output:
```bash
...@@ -160,9 +161,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import TTSClientExecutor
import json
ttsclient_executor = TTSClientExecutor()
res = ttsclient_executor(
input="您好,欢迎使用百度飞桨语音合成服务。",
server_ip="127.0.0.1",
port=8090,
...@@ -171,6 +173,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
volume=1.0,
sample_rate=0,
output="./output.wav")
response_dict = res.json()
print(response_dict["message"])
print("Save synthesized audio successfully on %s." % (response_dict['result']['save_path']))
print("Audio duration: %f s." %(response_dict['result']['duration']))
```
Output:
...@@ -178,7 +185,52 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
{'description': 'success.'}
Save synthesized audio successfully on ./output.wav.
Audio duration: 3.612500 s.
Response time: 0.388317 s.
```
### 6. CLS Client Usage
**Note:** The response time will be slightly longer when using the client for the first time
- Command Line (Recommended)
```
paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
```
Usage:
```bash
paddlespeech_client cls --help
```
Arguments:
- `server_ip`: server ip. Default: 127.0.0.1
- `port`: server port. Default: 8090
- `input`(required): Audio file to be classified.
- `topk`: topk scores of classification result.
Output:
```bash
[2022-03-09 20:44:39,974] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}}
[2022-03-09 20:44:39,975] [ INFO] - Response time 0.104360 s.
```
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import CLSClientExecutor
import json
clsclient_executor = CLSClientExecutor()
res = clsclient_executor(
input="./zh.wav",
server_ip="127.0.0.1",
port=8090,
topk=1)
print(res.json())
```
Output:
```bash
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}}
``` ```
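The top-scoring label can likewise be read out of the returned JSON:
```python
# Continues the CLS client example: read the best class and its probability
# from the response layout shown in the Output block above.
response = res.json()
best = response["result"]["results"][0]
print(best["class_name"], best["prob"])
```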
...@@ -189,3 +241,6 @@ Get all models supported by the ASR service via `paddlespeech_server stats --tas
### TTS model
Get all models supported by the TTS service via `paddlespeech_server stats --task tts`, where static models can be used for Paddle Inference.
### CLS model
Get all models supported by the CLS service via `paddlespeech_server stats --task cls`, where static models can be used for Paddle Inference.
...@@ -80,7 +80,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
```
### 4. ASR Client Usage
**Note:** The response time will be slightly longer when using the client for the first time.
- Command Line (Recommended)
```
...@@ -111,25 +111,26 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import ASRClientExecutor
import json
asrclient_executor = ASRClientExecutor()
res = asrclient_executor(
input="./zh.wav",
server_ip="127.0.0.1",
port=8090,
sample_rate=16000,
lang="zh_cn",
audio_format="wav")
print(res.json())
```
Output:
```bash
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'transcription': '我认为跑步最重要的就是给我带来了身体健康'}}
time cost 0.604353 s.
```
### 5. TTS Client Usage
**Note:** The response time will be slightly longer when using the client for the first time.
- Command Line (Recommended)
...@@ -150,7 +151,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- `speed`: Audio speed, the value should be set between 0 and 3. Default: 1.0
- `volume`: Audio volume, the value should be set between 0 and 3. Default: 1.0
- `sample_rate`: Sampling rate, choices: [0, 8000, 16000], the default is the same as the model. Default: 0
- `output`: Path of the output audio. Default: None, which means the audio is not saved locally.
Output:
```bash
...@@ -163,9 +164,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import TTSClientExecutor
import json
ttsclient_executor = TTSClientExecutor()
res = ttsclient_executor(
input="您好,欢迎使用百度飞桨语音合成服务。",
server_ip="127.0.0.1",
port=8090,
...@@ -174,6 +176,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
volume=1.0,
sample_rate=0,
output="./output.wav")
response_dict = res.json()
print(response_dict["message"])
print("Save synthesized audio successfully on %s." % (response_dict['result']['save_path']))
print("Audio duration: %f s." %(response_dict['result']['duration']))
```
Output:
...@@ -181,13 +188,63 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
{'description': 'success.'}
Save synthesized audio successfully on ./output.wav.
Audio duration: 3.612500 s.
Response time: 0.388317 s.
```
### 6. CLS Client Usage
**Note:** The response time will be slightly longer when using the client for the first time.
- Command Line (Recommended)
```
paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
```
Usage:
```bash
paddlespeech_client cls --help
```
Arguments:
- `server_ip`: Server IP address. Default: 127.0.0.1
- `port`: Server port. Default: 8090
- `input` (required): Audio file to be classified.
- `topk`: Top-k scores of the classification result.
Output:
```bash
[2022-03-09 20:44:39,974] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}}
[2022-03-09 20:44:39,975] [ INFO] - Response time 0.104360 s.
```
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import CLSClientExecutor
import json
clsclient_executor = CLSClientExecutor()
res = clsclient_executor(
input="./zh.wav",
server_ip="127.0.0.1",
port=8090,
topk=1)
print(res.json())
```
Output:
```bash
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}}
```
## Models supported by the service
### ASR model
Get all models supported by the ASR service via `paddlespeech_server stats --task asr`, where static models can be used for Paddle Inference.
### TTS model
Get all models supported by the TTS service via `paddlespeech_server stats --task tts`, where static models can be used for Paddle Inference.
### CLS model
Get all models supported by the CLS service via `paddlespeech_server stats --task cls`, where static models can be used for Paddle Inference.
#!/bin/bash
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input ./zh.wav --topk 1
...@@ -9,12 +9,14 @@ port: 8090
# The task format in the engine_list is: <speech task>_<engine type>
# task choices = ['asr_python', 'asr_inference', 'tts_python', 'tts_inference']
engine_list: ['asr_python', 'tts_python', 'cls_python']
#################################################################################
# ENGINE CONFIG #
#################################################################################
################################### ASR #########################################
################### speech task: asr; engine_type: python #######################
asr_python:
model: 'conformer_wenetspeech'
...@@ -46,6 +48,7 @@ asr_inference:
summary: True # False -> do not show predictor config
################################### TTS #########################################
################### speech task: tts; engine_type: python #######################
tts_python:
# am (acoustic model) choices=['speedyspeech_csmsc', 'fastspeech2_csmsc',
...@@ -105,3 +108,30 @@ tts_inference:
# others
lang: 'zh'
################################### CLS #########################################
################### speech task: cls; engine_type: python #######################
cls_python:
# model choices=['panns_cnn14', 'panns_cnn10', 'panns_cnn6']
model: 'panns_cnn14'
cfg_path: # [optional] Config of cls task.
ckpt_path: # [optional] Checkpoint file of model.
label_file: # [optional] Label file of cls task.
device: # set 'gpu:id' or 'cpu'
################### speech task: cls; engine_type: inference #######################
cls_inference:
# model_type choices=['panns_cnn14', 'panns_cnn10', 'panns_cnn6']
model_type: 'panns_cnn14'
cfg_path:
model_path: # the pdmodel file of am static model [optional]
params_path: # the pdiparams file of am static model [optional]
label_file: # [optional] Label file of cls task.
predictor_conf:
device: # set 'gpu:id' or 'cpu'
switch_ir_optim: True
glog_info: False # True -> print glog
summary: True # False -> do not show predictor config
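As the comment at the top of this file notes, every entry in `engine_list` follows the `<speech task>_<engine type>` convention and is configured by the section of the same name below. A small sketch of reading that mapping from the YAML (assuming the file above is saved as `conf/application.yaml`):
```python
# Hedged sketch: list which task/engine pairs the server config enables.
import yaml

with open("conf/application.yaml") as f:  # assumed location of the config above
    conf = yaml.safe_load(f)

for entry in conf["engine_list"]:  # e.g. ['asr_python', 'tts_python', 'cls_python']
    task, engine_type = entry.rsplit("_", 1)
    print(task, engine_type, sorted(conf[entry]))  # keys of the matching section
```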
...@@ -35,3 +35,7 @@ We borrowed a lot of code from these repos to build `model` and `engine`, thanks
* [librosa](https://github.com/librosa/librosa/blob/main/LICENSE.md)
- ISC License
- Audio feature
* [ThreadPool](https://github.com/progschj/ThreadPool/blob/master/COPYING)
- zlib License
- ThreadPool
...@@ -49,17 +49,19 @@ Model Type | Dataset| Example Link | Pretrained Models| Static Models|Size (stat
WaveFlow| LJSpeech |[waveflow-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0)|[waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/waveflow/waveflow_ljspeech_ckpt_0.3.zip)|||
Parallel WaveGAN| CSMSC |[PWGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc1)|[pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip)|[pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip)|5.1MB|
Parallel WaveGAN| LJSpeech |[PWGAN-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc1)|[pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip)|||
Parallel WaveGAN| AISHELL-3 |[PWGAN-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1)|[pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip)|||
Parallel WaveGAN| VCTK |[PWGAN-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc1)|[pwg_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.5.zip)|||
|Multi Band MelGAN | CSMSC |[MB MelGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc3) | [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip) <br>[mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip)|[mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip) |8.2MB|
Style MelGAN | CSMSC |[Style MelGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc4)|[style_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/style_melgan/style_melgan_csmsc_ckpt_0.1.1.zip)| | |
HiFiGAN | CSMSC |[HiFiGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc5)|[hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip)|[hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip)|50MB|
HiFiGAN | AISHELL-3 |[HiFiGAN-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc5)|[hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip)|||
HiFiGAN | VCTK |[HiFiGAN-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc5)|[hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip)|||
WaveRNN | CSMSC |[WaveRNN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc6)|[wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip)|[wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip)|18MB|
### Voice Cloning
Model Type | Dataset| Example Link | Pretrained Models
:-------------:| :------------:| :-----: | :-----: |
GE2E| AISHELL-3, etc. |[ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e)|[ge2e_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ge2e/ge2e_ckpt_0.3.zip)
GE2E + Tacotron2| AISHELL-3 |[ge2e-tactron2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/vc0)|[tacotron2_aishell3_ckpt_vc0_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_vc0_0.2.0.zip)
GE2E + FastSpeech2 | AISHELL-3 |[ge2e-fastspeech2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/vc1)|[fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip)
...@@ -67,9 +69,9 @@ GE2E + FastSpeech2 | AISHELL-3 |[ge2e-fastspeech2-aishell3](https://github.com/
## Audio Classification Models
Model Type | Dataset| Example Link | Pretrained Models | Static Models
:-------------:| :------------:| :-----: | :-----: | :-----:
PANN | Audioset| [audioset_tagging_cnn](https://github.com/qiuqiangkong/audioset_tagging_cnn) | [panns_cnn6.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn6.pdparams), [panns_cnn10.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn10.pdparams), [panns_cnn14.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn14.pdparams) | [panns_cnn6_static.tar.gz](https://paddlespeech.bj.bcebos.com/cls/inference_model/panns_cnn6_static.tar.gz)(18M), [panns_cnn10_static.tar.gz](https://paddlespeech.bj.bcebos.com/cls/inference_model/panns_cnn10_static.tar.gz)(19M), [panns_cnn14_static.tar.gz](https://paddlespeech.bj.bcebos.com/cls/inference_model/panns_cnn14_static.tar.gz)(289M)
PANN | ESC-50 |[pann-esc50](../../examples/esc50/cls0)|[esc50_cnn6.tar.gz](https://paddlespeech.bj.bcebos.com/cls/esc50/esc50_cnn6.tar.gz), [esc50_cnn10.tar.gz](https://paddlespeech.bj.bcebos.com/cls/esc50/esc50_cnn10.tar.gz), [esc50_cnn14.tar.gz](https://paddlespeech.bj.bcebos.com/cls/esc50/esc50_cnn14.tar.gz)
## Punctuation Restoration Models
......
...@@ -4,18 +4,44 @@ config_path=$1
train_output_path=$2
ckpt_name=$3
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_aishell3 \
--voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_aishell3 \
--voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt
fi
...@@ -4,21 +4,50 @@ config_path=$1
train_output_path=$2
ckpt_name=$3
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_aishell3 \
--voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0 \
--inference_dir=${train_output_path}/inference
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "in hifigan syn_e2e"
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=fastspeech2_nosil_aishell3_ckpt_0.4/speech_stats.npy \
--voc=hifigan_aishell3 \
--voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=fastspeech2_nosil_aishell3_ckpt_0.4/phone_id_map.txt \
--speaker_dict=fastspeech2_nosil_aishell3_ckpt_0.4/speaker_id_map.txt \
--spk_id=0 \
--inference_dir=${train_output_path}/inference
fi
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
......
...@@ -3,7 +3,7 @@
set -e
source path.sh
gpus=0,1
stage=0
stop_stage=100
......
# HiFiGAN with AISHELL-3
This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.05646) model with [AISHELL-3](http://www.aishelltech.com/aishell_3).
AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus that could be used to train multi-speaker Text-to-Speech (TTS) systems.
## Dataset
### Download and Extract
Download AISHELL-3.
```bash
wget https://www.openslr.org/resources/93/data_aishell3.tgz
```
Extract AISHELL-3.
```bash
mkdir data_aishell3
tar zxvf data_aishell3.tgz -C data_aishell3
```
### Get MFA Result and Extract
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
You can download it from here: [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (which uses MFA1.x for now) in our repo.
## Get Started
Assume the path to the dataset is `~/datasets/data_aishell3`.
Assume the path to the MFA result of AISHELL-3 is `./aishell3_alignment_tone`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage; for example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── test
│ ├── norm
│ └── raw
└── train
├── norm
├── raw
└── feats_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the `norm` folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the utterance id and the path to the spectrogram of each utterance.
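A quick way to sanity-check the preprocessed features is to read one record from `metadata.jsonl` and load the feature file it points to. The sketch below assumes the field names used by this recipe (`utt_id`, `feats`); check your own metadata file for the exact keys:
```python
# Minimal inspection of the preprocessed dump folder (field names assumed,
# verify against your own dump/*/norm/metadata.jsonl).
import json
import numpy as np

with open("dump/train/norm/metadata.jsonl") as f:
    records = [json.loads(line) for line in f]

print("number of utterances:", len(records))
first = records[0]
print("available fields:", list(first.keys()))

# Each record is expected to point to a .npy file holding the mel spectrogram.
mel = np.load(first["feats"])
print("mel shape:", mel.shape)
```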
### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--batch-size BATCH_SIZE] [--max-iter MAX_ITER]
[--run-benchmark RUN_BENCHMARK]
[--profiler_options PROFILER_OPTIONS]
Train a ParallelWaveGAN model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file to overwrite default config.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
benchmark:
arguments related to benchmark.
--batch-size BATCH_SIZE
batch size.
--max-iter MAX_ITER train max steps.
--run-benchmark RUN_BENCHMARK
runing benchmark or not, if True, use the --batch-size
and --max-iter.
--profiler_options PROFILER_OPTIONS
The option of profiler, which should be in format
"key1=value1;key2=value2;key3=value3".
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
[--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
[--output-dir OUTPUT_DIR] [--ngpu NGPU]
Synthesize with GANVocoder.
optional arguments:
-h, --help show this help message and exit
--generator-type GENERATOR_TYPE
type of GANVocoder, should in {pwgan, mb_melgan,
style_melgan, } now
--config CONFIG GANVocoder config file.
--checkpoint CHECKPOINT
snapshot to load.
--test-metadata TEST_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
```
1. `--config` config file. You should use the same config with which the model is trained.
2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory.
3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `dev/norm` subfolder from the processed directory.
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models
The pretrained model can be downloaded here [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip).
Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
:-------------:| :------------:| :-----: | :-----: | :--------:
default| 1(gpu) x 2500000|24.060|0.1068|7.499
HiFiGAN checkpoint contains files listed below.
```text
hifigan_aishell3_ckpt_0.2.0
├── default.yaml # default config used to train hifigan
├── feats_stats.npy # statistics used to normalize spectrogram when training hifigan
└── snapshot_iter_2500000.pdz # generator parameters of hifigan
```
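The config and normalization statistics in the checkpoint directory can be inspected directly; below is a small sketch that reads only the YAML config and the stats file (the `.pdz` generator weights are consumed by the synthesize scripts of this example):
```python
# Inspect the pretrained checkpoint directory listed above.
import numpy as np
import yaml

ckpt_dir = "hifigan_aishell3_ckpt_0.2.0"

with open(f"{ckpt_dir}/default.yaml") as f:
    config = yaml.safe_load(f)
print("sampling rate:", config["fs"], "| hop size:", config["n_shift"])

# Assumption: feats_stats.npy stacks the per-dimension mean and scale used to
# normalize the mel spectrogram during training.
stats = np.load(f"{ckpt_dir}/feats_stats.npy")
print("stats shape:", stats.shape)
```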
## Acknowledgement
We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.
# This is the configuration file for AISHELL-3 dataset.
# This configuration is based on HiFiGAN V1, which is
# an official configuration. But I found that the optimizer
# setting does not work well with my implementation.
# So I changed optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales
# is also modified from the original 256 shift setting.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # Sampling rate.
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 7600 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 80 # Number of input channels.
out_channels: 1 # Number of output channels.
channels: 512 # Number of initial channels.
kernel_size: 7 # Kernel size of initial and final conv layers.
upsample_scales: [5, 5, 4, 3] # Upsampling scales.
upsample_kernel_sizes: [10, 10, 8, 6] # Kernel size for upsampling layers.
resblock_kernel_sizes: [3, 7, 11] # Kernel size for residual blocks.
resblock_dilations: # Dilations for residual blocks.
- [1, 3, 5]
- [1, 3, 5]
- [1, 3, 5]
use_additional_convs: True # Whether to use additional conv layer in residual blocks.
bias: True # Whether to use bias parameter in conv.
nonlinear_activation: "leakyrelu" # Nonlinear activation type.
nonlinear_activation_params: # Nonlinear activation paramters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
scales: 3 # Number of multi-scale discriminator.
scale_downsample_pooling: "AvgPool1D" # Pooling operation for scale discriminator.
scale_downsample_pooling_params:
kernel_size: 4 # Pooling kernel size.
stride: 2 # Pooling stride.
padding: 2 # Padding size.
scale_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [15, 41, 5, 3] # List of kernel sizes.
channels: 128 # Initial number of channels.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
max_groups: 16 # Maximum number of groups in downsampling conv layers.
bias: True
downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params:
negative_slope: 0.1
follow_official_norm: True # Whether to follow the official norm setting.
periods: [2, 3, 5, 7, 11] # List of period for multi-period discriminator.
period_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [5, 3] # List of kernel sizes.
channels: 32 # Initial number of channels.
downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
bias: True # Whether to use bias parameter in conv layer."
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params: # Nonlinear activation paramters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
use_spectral_norm: False # Whether to apply spectral normalization.
###########################################################
# STFT LOSS SETTING #
###########################################################
use_stft_loss: False # Whether to use multi-resolution STFT loss.
use_mel_loss: True # Whether to use Mel-spectrogram loss.
mel_loss_params:
fs: 24000
fft_size: 2048
hop_size: 300
win_length: 1200
window: "hann"
num_mels: 80
fmin: 0
fmax: 12000
log_base: null
generator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
use_feat_match_loss: True
feat_match_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
average_by_layers: False # Whether to average loss by #layers in each discriminator.
include_final_outputs: False # Whether to include final outputs in feat match loss calculation.
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_aux: 45.0 # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0 # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss..
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_max_steps: 8400 # Length of each audio in batch. Make sure it is divisible by hop_size.
num_workers: 2 # Number of workers in DataLoader.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
learning_rate: 2.0e-4 # Generator's learning rate.
gamma: 0.5 # Generator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
generator_grad_norm: -1 # Generator's gradient norm.
discriminator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
learning_rate: 2.0e-4 # Discriminator's learning rate.
gamma: 0.5 # Discriminator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.
###########################################################
# INTERVAL SETTING #
###########################################################
generator_train_start_steps: 1 # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2500000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random
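For orientation, the FEATURE EXTRACTION SETTING above corresponds roughly to the following log-mel computation (illustration only, using librosa and a hypothetical input file; the recipe itself extracts features through `${BIN_DIR}/../preprocess.py`, so numerical details may differ):
```python
# Rough librosa equivalent of the feature extraction settings in this config.
import librosa
import numpy as np

wav, sr = librosa.load("sample.wav", sr=24000)  # fs: 24000 (hypothetical input file)

mel = librosa.feature.melspectrogram(
    y=wav,
    sr=sr,
    n_fft=2048,       # n_fft
    hop_length=300,   # n_shift (12.5 ms)
    win_length=1200,  # win_length (50 ms)
    window="hann",    # window
    n_mels=80,        # n_mels
    fmin=80,          # fmin
    fmax=7600,        # fmax
)
log_mel = np.log(np.maximum(mel, 1e-10))  # log magnitude, clipped to avoid log(0)
print(log_mel.shape)  # (80, num_frames)
```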
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./aishell3_alignment_tone \
--output=durations.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/../preprocess.py \
--rootdir=~/datasets/data_aishell3/ \
--dataset=aishell3 \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--cut-sil=True \
--num-cpu=20
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats(mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="feats"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize, dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--stats=dump/train/feats_stats.npy
fi
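Conceptually, the "Normalize" stage above is a per-dimension z-score using the training-set statistics. A rough sketch of that idea follows, assuming `feats_stats.npy` stacks the mean and scale vectors (verify the layout against your own dump folder):
```python
# Conceptual sketch of the normalize stage; dev and test reuse the train stats.
import numpy as np

stats = np.load("dump/train/feats_stats.npy")
mean, scale = stats[0], stats[1]  # assumption: row 0 = mean, row 1 = scale (std)

feats = np.load("some_utt_feats.npy")  # hypothetical raw feature file, shape (frames, n_mels)
normalized = (feats - mean) / scale
print(normalized.mean(axis=0)[:3], normalized.std(axis=0)[:3])
```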
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--config=${config_path} \
--checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
--test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=${train_output_path}/test \
--generator-type=hifigan
#!/bin/bash
config_path=$1
train_output_path=$2
FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=1
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=hifigan
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/gan_vocoder/${MODEL}
#!/bin/bash
set -e
source path.sh
gpus=0
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_5000.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
...@@ -17,11 +17,8 @@ Data preparation.
Download: http://groups.inf.ed.ac.uk/ami/download/
Prepares metadata files (JSON) from manual annotations "segments/" using RTTM format (Oracle VAD).
Authors
* qingenz123@126.com (Qingen ZHAO) 2022
""" """
import argparse import argparse
import glob import glob
import json import json
......
...@@ -15,10 +15,6 @@ ...@@ -15,10 +15,6 @@
AMI corpus contained 100 hours of meeting recording. AMI corpus contained 100 hours of meeting recording.
This script returns the standard train, dev and eval split for AMI corpus. This script returns the standard train, dev and eval split for AMI corpus.
For more information on dataset please refer to http://groups.inf.ed.ac.uk/ami/corpus/datasets.shtml For more information on dataset please refer to http://groups.inf.ed.ac.uk/ami/corpus/datasets.shtml
Authors
* qingenz123@126.com (Qingen ZHAO) 2022
""" """
ALLOWED_OPTIONS = ["scenario_only", "full_corpus", "full_corpus_asr"] ALLOWED_OPTIONS = ["scenario_only", "full_corpus", "full_corpus_asr"]
......
...@@ -13,10 +13,6 @@ ...@@ -13,10 +13,6 @@
# limitations under the License. # limitations under the License.
""" """
Data reading and writing. Data reading and writing.
Authors
* qingenz123@126.com (Qingen ZHAO) 2022
""" """
import os import os
import pickle import pickle
......
...@@ -7,7 +7,7 @@ ckpt_name=$3 ...@@ -7,7 +7,7 @@ ckpt_name=$3
stage=0 stage=0
stop_stage=0 stop_stage=0
# TODO: the static-graph (dy2st) result of tacotron2 is not as loud as the dynamic-graph one; some function in decode may not be aligned between dynamic and static graphs # pwgan
# pwgan # pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \ FLAGS_allocator_strategy=naive_best_fit \
......
...@@ -14,7 +14,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then ...@@ -14,7 +14,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
--am=speedyspeech_csmsc \ --am=speedyspeech_csmsc \
--am_config=${config_path} \ --am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \ --am_stat=dump/train/feats_stats.npy \
--voc=pwgan_csmsc \ --voc=pwgan_csmsc \
--voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \ --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
...@@ -34,7 +34,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then ...@@ -34,7 +34,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
--am=speedyspeech_csmsc \ --am=speedyspeech_csmsc \
--am_config=${config_path} \ --am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \ --am_stat=dump/train/feats_stats.npy \
--voc=mb_melgan_csmsc \ --voc=mb_melgan_csmsc \
--voc_config=mb_melgan_csmsc_ckpt_0.1.1/default.yaml \ --voc_config=mb_melgan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=mb_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz\ --voc_ckpt=mb_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz\
...@@ -53,7 +53,7 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then ...@@ -53,7 +53,7 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
--am=speedyspeech_csmsc \ --am=speedyspeech_csmsc \
--am_config=${config_path} \ --am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \ --am_stat=dump/train/feats_stats.npy \
--voc=style_melgan_csmsc \ --voc=style_melgan_csmsc \
--voc_config=style_melgan_csmsc_ckpt_0.1.1/default.yaml \ --voc_config=style_melgan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=style_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1500000.pdz \ --voc_ckpt=style_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1500000.pdz \
...@@ -73,7 +73,7 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then ...@@ -73,7 +73,7 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
--am=speedyspeech_csmsc \ --am=speedyspeech_csmsc \
--am_config=${config_path} \ --am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \ --am_stat=dump/train/feats_stats.npy \
--voc=hifigan_csmsc \ --voc=hifigan_csmsc \
--voc_config=hifigan_csmsc_ckpt_0.1.1/default.yaml \ --voc_config=hifigan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=hifigan_csmsc_ckpt_0.1.1/snapshot_iter_2500000.pdz \ --voc_ckpt=hifigan_csmsc_ckpt_0.1.1/snapshot_iter_2500000.pdz \
...@@ -93,7 +93,7 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then ...@@ -93,7 +93,7 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
--am=speedyspeech_csmsc \ --am=speedyspeech_csmsc \
--am_config=${config_path} \ --am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \ --am_stat=dump/train/feats_stats.npy \
--voc=wavernn_csmsc \ --voc=wavernn_csmsc \
--voc_config=wavernn_csmsc_ckpt_0.2.0/default.yaml \ --voc_config=wavernn_csmsc_ckpt_0.2.0/default.yaml \
--voc_ckpt=wavernn_csmsc_ckpt_0.2.0/snapshot_iter_400000.pdz \ --voc_ckpt=wavernn_csmsc_ckpt_0.2.0/snapshot_iter_400000.pdz \
......
# HiFiGAN with LJSpeech-1.1
This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.05646) model with [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/).
## Dataset
### Download and Extract
Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech-Dataset/) and extract it to `~/datasets`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of audio.
You can download the results from [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
## Get Started
Assume the path to the dataset is `~/datasets/LJSpeech-1.1`.
Assume the path to the MFA result of LJSpeech-1.1 is `./ljspeech_alignment`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the `dump` folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── test
│ ├── norm
│ └── raw
└── train
├── norm
├── raw
└── feats_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and a `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the `norm` folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set and stored in `dump/train/feats_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the id and the path to the spectrogram of each utterance.
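Each line of `metadata.jsonl` is a standalone JSON object, so it can be inspected with the standard library alone. The sketch below is only a quick sanity check; the exact field names depend on the preprocessing script, so print them rather than assume them.
```python
import json
from pathlib import Path

# A minimal sketch for inspecting the preprocessed metadata.
# Each line of metadata.jsonl is assumed to describe one utterance.
metadata_path = Path("dump/train/norm/metadata.jsonl")

with metadata_path.open("r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"number of utterances: {len(records)}")
if records:
    print("fields:", sorted(records[0].keys()))
    print("first record:", records[0])
```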
### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--batch-size BATCH_SIZE] [--max-iter MAX_ITER]
[--run-benchmark RUN_BENCHMARK]
[--profiler_options PROFILER_OPTIONS]
Train a ParallelWaveGAN model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file to overwrite default config.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
benchmark:
arguments related to benchmark.
--batch-size BATCH_SIZE
batch size.
--max-iter MAX_ITER train max steps.
--run-benchmark RUN_BENCHMARK
runing benchmark or not, if True, use the --batch-size
and --max-iter.
--profiler_options PROFILER_OPTIONS
The option of profiler, which should be in format
"key1=value1;key2=value2;key3=value3".
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml` (see the sketch after this list for a quick way to inspect it).
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use. If ngpu == 0, use cpu.
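The file passed to `--config` is plain YAML, so it can be inspected or tweaked outside the training pipeline. Below is a minimal sketch using PyYAML (an assumption; `train.py` uses its own config loader) that reads `conf/default.yaml`, peeks at one value, and writes a modified copy for a quick experiment. The keys `batch_size`, `train_max_steps`, and `generator_params` come from the default config shown later in this example.
```python
import yaml  # assumes PyYAML is installed

# A minimal sketch: read conf/default.yaml, peek at a value, and write a
# modified copy. This is only for quick inspection; train.py parses the
# config with its own machinery.
with open("conf/default.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

print(config["generator_params"]["upsample_scales"])

# Override a couple of hyperparameters and save them to a new config,
# which can then be passed via --config.
config["batch_size"] = 8
config["train_max_steps"] = 10000
with open("conf/quick_test.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(config, f)
```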
### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
[--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
[--output-dir OUTPUT_DIR] [--ngpu NGPU]
Synthesize with GANVocoder.
optional arguments:
-h, --help show this help message and exit
--generator-type GENERATOR_TYPE
type of GANVocoder, should in {pwgan, mb_melgan,
style_melgan, } now
--config CONFIG GANVocoder config file.
--checkpoint CHECKPOINT
snapshot to load.
--test-metadata TEST_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
```
1. `--config` is the config file of the vocoder. You should use the same config with which the model is trained.
2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory.
3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `dev/norm` subfolder from the processed directory.
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use. If ngpu == 0, use cpu.
## Pretrained Model
## Acknowledgement
We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.
# This is the configuration file for LJSpeech dataset.
# This configuration is based on HiFiGAN V1, which is an official configuration.
# But I found that the optimizer setting does not work well with my implementation.
# So I changed optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales are also modified from the original 256 shift setting.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 22050 # Sampling rate.
n_fft: 1024 # FFT size (samples).
n_shift: 256 # Hop size (samples). 11.6ms
win_length: null # Window length (samples).
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 7600 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 80 # Number of input channels.
out_channels: 1 # Number of output channels.
channels: 512 # Number of initial channels.
kernel_size: 7 # Kernel size of initial and final conv layers.
upsample_scales: [8, 8, 2, 2] # Upsampling scales.
upsample_kernel_sizes: [16, 16, 4, 4] # Kernel size for upsampling layers.
resblock_kernel_sizes: [3, 7, 11] # Kernel size for residual blocks.
resblock_dilations: # Dilations for residual blocks.
- [1, 3, 5]
- [1, 3, 5]
- [1, 3, 5]
use_additional_convs: True # Whether to use additional conv layer in residual blocks.
bias: True # Whether to use bias parameter in conv.
nonlinear_activation: "leakyrelu" # Nonlinear activation type.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
scales: 3 # Number of multi-scale discriminator.
scale_downsample_pooling: "AvgPool1D" # Pooling operation for scale discriminator.
scale_downsample_pooling_params:
kernel_size: 4 # Pooling kernel size.
stride: 2 # Pooling stride.
padding: 2 # Padding size.
scale_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [15, 41, 5, 3] # List of kernel sizes.
channels: 128 # Initial number of channels.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
max_groups: 16 # Maximum number of groups in downsampling conv layers.
bias: True
downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params:
negative_slope: 0.1
follow_official_norm: True # Whether to follow the official norm setting.
periods: [2, 3, 5, 7, 11] # List of period for multi-period discriminator.
period_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [5, 3] # List of kernel sizes.
channels: 32 # Initial number of channels.
downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
bias: True # Whether to use bias parameter in conv layer.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
use_spectral_norm: False # Whether to apply spectral normalization.
###########################################################
# STFT LOSS SETTING #
###########################################################
use_stft_loss: False # Whether to use multi-resolution STFT loss.
use_mel_loss: True # Whether to use Mel-spectrogram loss.
mel_loss_params:
fs: 22050
fft_size: 1024
hop_size: 256
win_length: null
window: "hann"
num_mels: 80
fmin: 0
fmax: 11025
log_base: null
generator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
use_feat_match_loss: True
feat_match_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
average_by_layers: False # Whether to average loss by #layers in each discriminator.
include_final_outputs: False # Whether to include final outputs in feat match loss calculation.
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_aux: 45.0 # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0 # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_max_steps: 8192 # Length of each audio in batch. Make sure it is divisible by hop_size.
num_workers: 2 # Number of workers in DataLoader.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
learning_rate: 2.0e-4 # Generator's learning rate.
gamma: 0.5 # Generator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
generator_grad_norm: -1 # Generator's gradient norm.
discriminator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
learning_rate: 2.0e-4 # Discriminator's learning rate.
gamma: 0.5 # Discriminator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.
###########################################################
# INTERVAL SETTING #
###########################################################
generator_train_start_steps: 1 # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2500000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./ljspeech_alignment \
--output=durations.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/../preprocess.py \
--rootdir=~/datasets/LJSpeech-1.1/ \
--dataset=ljspeech \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--cut-sil=True \
--num-cpu=20
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats (mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="feats"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize, dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--stats=dump/train/feats_stats.npy
fi
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--config=${config_path} \
--checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
--test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=${train_output_path}/test \
--generator-type=hifigan
#!/bin/bash
config_path=$1
train_output_path=$2
FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=1
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=hifigan
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/gan_vocoder/${MODEL}
#!/bin/bash
set -e
source path.sh
gpus=0,1
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_5000.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with positional arguments such as `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model; all `ckpt` files are saved under the `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
...@@ -4,18 +4,43 @@ config_path=$1 ...@@ -4,18 +4,43 @@ config_path=$1
train_output_path=$2 train_output_path=$2
ckpt_name=$3 ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \ stage=0
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ stop_stage=0
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_vctk \ # pwgan
--am_config=${config_path} \ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ FLAGS_allocator_strategy=naive_best_fit \
--am_stat=dump/train/speech_stats.npy \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
--voc=pwgan_vctk \ python3 ${BIN_DIR}/../synthesize.py \
--voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \ --am=fastspeech2_vctk \
--voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \ --am_config=${config_path} \
--voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \ --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--test_metadata=dump/test/norm/metadata.jsonl \ --am_stat=dump/train/speech_stats.npy \
--output_dir=${train_output_path}/test \ --voc=pwgan_vctk \
--phones_dict=dump/phone_id_map.txt \ --voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \
--speaker_dict=dump/speaker_id_map.txt --voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_vctk \
--voc_config=hifigan_vctk_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_vctk_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_vctk_ckpt_0.2.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt
fi
...@@ -4,21 +4,49 @@ config_path=$1 ...@@ -4,21 +4,49 @@ config_path=$1
train_output_path=$2 train_output_path=$2
ckpt_name=$3 ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \ stage=0
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ stop_stage=0
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_vctk \ # pwgan
--am_config=${config_path} \ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ FLAGS_allocator_strategy=naive_best_fit \
--am_stat=dump/train/speech_stats.npy \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
--voc=pwgan_vctk \ python3 ${BIN_DIR}/../synthesize_e2e.py \
--voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \ --am=fastspeech2_vctk \
--voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \ --am_config=${config_path} \
--voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \ --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--lang=en \ --am_stat=dump/train/speech_stats.npy \
--text=${BIN_DIR}/../sentences_en.txt \ --voc=pwgan_vctk \
--output_dir=${train_output_path}/test_e2e \ --voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \
--phones_dict=dump/phone_id_map.txt \ --voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--speaker_dict=dump/speaker_id_map.txt \ --voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
--spk_id=0 \ --lang=en \
--inference_dir=${train_output_path}/inference --text=${BIN_DIR}/../sentences_en.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0 \
--inference_dir=${train_output_path}/inference
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_vctk \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_vctk \
--voc_config=hifigan_vctk_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_vctk_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_vctk_ckpt_0.2.0/feats_stats.npy \
--lang=en \
--text=${BIN_DIR}/../sentences_en.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0 \
--inference_dir=${train_output_path}/inference
fi
# HiFiGAN with VCTK
This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.05646) model with [VCTK](https://datashare.ed.ac.uk/handle/10283/3443).
## Dataset
### Download and Extract
Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/VCTK-Corpus-0.92`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of audio.
You can download the results from [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
P.S. we removed three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/mfa/local/reorganize_vctk.py)):
1. `p315`, because there is no text for it.
2. `p280` and `p362`, because there is no `*_mic2.flac` (which is better than `*_mic1.flac`) for them.
## Get Started
Assume the path to the dataset is `~/datasets/VCTK-Corpus-0.92`.
Assume the path to the MFA result of VCTK is `./vctk_alignment`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the `dump` folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── test
│ ├── norm
│ └── raw
└── train
├── norm
├── raw
└── feats_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and a `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the `norm` folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set and stored in `dump/train/feats_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the id and the path to the spectrogram of each utterance.
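To make the relation between the `raw` and `norm` folders concrete, the sketch below applies the usual mean/variance normalization with NumPy. It assumes `feats_stats.npy` stacks the per-dimension mean and standard deviation along the first axis (shape `(2, n_mels)`); verify this against `utils/compute_statistics.py` and `normalize.py` before relying on it.
```python
import numpy as np

# A rough sketch of feature normalization with the training statistics.
# Assumption: feats_stats.npy holds [mean, std] stacked on axis 0,
# i.e. shape (2, n_mels); check compute_statistics.py for the real layout.
stats = np.load("dump/train/feats_stats.npy")
mean, std = stats[0], stats[1]

# mel: log-mel spectrogram of one utterance, shape (n_frames, n_mels).
mel = np.random.randn(100, 80).astype(np.float32)  # placeholder features

mel_norm = (mel - mean) / std          # roughly what dump/*/norm stores
mel_restored = mel_norm * std + mean   # inverse transform

print(mel_norm.shape, np.allclose(mel, mel_restored, atol=1e-5))
```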
### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--batch-size BATCH_SIZE] [--max-iter MAX_ITER]
[--run-benchmark RUN_BENCHMARK]
[--profiler_options PROFILER_OPTIONS]
Train a ParallelWaveGAN model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file to overwrite default config.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
benchmark:
arguments related to benchmark.
--batch-size BATCH_SIZE
batch size.
--max-iter MAX_ITER train max steps.
--run-benchmark RUN_BENCHMARK
runing benchmark or not, if True, use the --batch-size
and --max-iter.
--profiler_options PROFILER_OPTIONS
The option of profiler, which should be in format
"key1=value1;key2=value2;key3=value3".
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use. If ngpu == 0, use cpu.
### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
[--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
[--output-dir OUTPUT_DIR] [--ngpu NGPU]
Synthesize with GANVocoder.
optional arguments:
-h, --help show this help message and exit
--generator-type GENERATOR_TYPE
type of GANVocoder, should in {pwgan, mb_melgan,
style_melgan, } now
--config CONFIG GANVocoder config file.
--checkpoint CHECKPOINT
snapshot to load.
--test-metadata TEST_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
```
1. `--config` is the config file of the vocoder. You should use the same config with which the model is trained.
2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory.
3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `dev/norm` subfolder from the processed directory.
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use. If ngpu == 0, use cpu.
## Pretrained Model
The pretrained model can be downloaded here [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip).
Model | Step | eval/generator_loss | eval/mel_loss | eval/feature_matching_loss
:-------------:| :------------:| :-----: | :-----: | :--------:
default| 1(gpu) x 2500000|58.092|0.1234|24.384
The HiFiGAN checkpoint contains the files listed below.
```text
hifigan_vctk_ckpt_0.2.0
├── default.yaml # default config used to train hifigan
├── feats_stats.npy # statistics used to normalize spectrogram when training hifigan
└── snapshot_iter_2500000.pdz # generator parameters of hifigan
```
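To confirm what the downloaded snapshot actually contains, a minimal sketch like the one below can help. It assumes the `.pdz` file was written with `paddle.save` and deserializes into a (possibly nested) dict of parameter tensors; adapt the traversal to whatever `paddle.load` returns for this release.
```python
import paddle

# A minimal sketch for peeking into the pretrained HiFiGAN snapshot.
# Assumption: the .pdz file deserializes to a (possibly nested) dict whose
# leaves are parameter tensors; adjust if the real layout differs.
state = paddle.load("hifigan_vctk_ckpt_0.2.0/snapshot_iter_2500000.pdz")

def walk(obj, prefix=""):
    """Recursively print parameter names and shapes found in the snapshot."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            walk(value, f"{prefix}{key}.")
    elif hasattr(obj, "shape"):
        print(prefix.rstrip("."), tuple(obj.shape))
    else:
        print(prefix.rstrip("."), type(obj).__name__)

walk(state)
```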
## Acknowledgement
We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.
# This is the configuration file for VCTK dataset.
# This configuration is based on HiFiGAN V1, which is
# an official configuration. But I found that the optimizer
# setting does not work well with my implementation.
# So I changed optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales
# are also modified from the original 256 shift setting.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # Sampling rate.
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 7600 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 80 # Number of input channels.
out_channels: 1 # Number of output channels.
channels: 512 # Number of initial channels.
kernel_size: 7 # Kernel size of initial and final conv layers.
upsample_scales: [5, 5, 4, 3] # Upsampling scales.
upsample_kernel_sizes: [10, 10, 8, 6] # Kernel size for upsampling layers.
resblock_kernel_sizes: [3, 7, 11] # Kernel size for residual blocks.
resblock_dilations: # Dilations for residual blocks.
- [1, 3, 5]
- [1, 3, 5]
- [1, 3, 5]
use_additional_convs: True # Whether to use additional conv layer in residual blocks.
bias: True # Whether to use bias parameter in conv.
nonlinear_activation: "leakyrelu" # Nonlinear activation type.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
scales: 3 # Number of multi-scale discriminator.
scale_downsample_pooling: "AvgPool1D" # Pooling operation for scale discriminator.
scale_downsample_pooling_params:
kernel_size: 4 # Pooling kernel size.
stride: 2 # Pooling stride.
padding: 2 # Padding size.
scale_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [15, 41, 5, 3] # List of kernel sizes.
channels: 128 # Initial number of channels.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
max_groups: 16 # Maximum number of groups in downsampling conv layers.
bias: True
downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params:
negative_slope: 0.1
follow_official_norm: True # Whether to follow the official norm setting.
periods: [2, 3, 5, 7, 11] # List of period for multi-period discriminator.
period_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [5, 3] # List of kernel sizes.
channels: 32 # Initial number of channels.
downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
bias: True # Whether to use bias parameter in conv layer.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
use_spectral_norm: False # Whether to apply spectral normalization.
###########################################################
# STFT LOSS SETTING #
###########################################################
use_stft_loss: False # Whether to use multi-resolution STFT loss.
use_mel_loss: True # Whether to use Mel-spectrogram loss.
mel_loss_params:
fs: 24000
fft_size: 2048
hop_size: 300
win_length: 1200
window: "hann"
num_mels: 80
fmin: 0
fmax: 12000
log_base: null
generator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
use_feat_match_loss: True
feat_match_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
average_by_layers: False # Whether to average loss by #layers in each discriminator.
include_final_outputs: False # Whether to include final outputs in feat match loss calculation.
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_aux: 45.0 # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0 # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_max_steps: 8400 # Length of each audio in batch. Make sure it is divisible by hop_size.
num_workers: 2 # Number of workers in DataLoader.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
learning_rate: 2.0e-4 # Generator's learning rate.
gamma: 0.5 # Generator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
generator_grad_norm: -1 # Generator's gradient norm.
discriminator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
learning_rate: 2.0e-4 # Discriminator's learning rate.
gamma: 0.5 # Discriminator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.
###########################################################
# INTERVAL SETTING #
###########################################################
generator_train_start_steps: 1 # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2500000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./vctk_alignment \
--output=durations.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/../preprocess.py \
--rootdir=~/datasets/VCTK-Corpus-0.92/ \
--dataset=vctk \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--cut-sil=True \
--num-cpu=20
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats (mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="feats"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize, dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--stats=dump/train/feats_stats.npy
fi
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--config=${config_path} \
--checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
--test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=${train_output_path}/test \
--generator-type=hifigan
#!/bin/bash
config_path=$1
train_output_path=$2
FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=1
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=hifigan
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/gan_vocoder/${MODEL}
#!/bin/bash
set -e
source path.sh
gpus=0
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_5000.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with positional arguments such as `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model; all `ckpt` files are saved under the `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
# Changelog # Changelog
Date: 2022-3-15, Author: Xiaojie Chen.
- Add kaldi- and librosa-style mfcc, fbank, and spectrogram features.
- Add unit tests and benchmarks.
Date: 2022-2-25, Author: Hui Zhang. Date: 2022-2-25, Author: Hui Zhang.
- Refactor architecture. - Refactor architecture.
- dtw distance and mcd style dtw - dtw distance and mcd style dtw.
...@@ -11,6 +11,7 @@ ...@@ -11,6 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import os
import warnings import warnings
from typing import Optional from typing import Optional
from typing import Tuple from typing import Tuple
...@@ -19,7 +20,6 @@ from typing import Union ...@@ -19,7 +20,6 @@ from typing import Union
import numpy as np import numpy as np
import resampy import resampy
import soundfile as sf import soundfile as sf
from numpy import ndarray as array
from scipy.io import wavfile from scipy.io import wavfile
from ..utils import ParameterError from ..utils import ParameterError
...@@ -38,13 +38,21 @@ RESAMPLE_MODES = ['kaiser_best', 'kaiser_fast'] ...@@ -38,13 +38,21 @@ RESAMPLE_MODES = ['kaiser_best', 'kaiser_fast']
EPS = 1e-8 EPS = 1e-8
def resample(y: array, src_sr: int, target_sr: int, def resample(y: np.ndarray,
mode: str='kaiser_fast') -> array: src_sr: int,
""" Audio resampling target_sr: int,
This function is the same as using resampy.resample(). mode: str='kaiser_fast') -> np.ndarray:
Notes: """Audio resampling.
The default mode is kaiser_fast. For better audio quality, use mode = 'kaiser_fast'
""" Args:
y (np.ndarray): Input waveform array in 1D or 2D.
src_sr (int): Source sample rate.
target_sr (int): Target sample rate.
mode (str, optional): The resampling filter to use. Defaults to 'kaiser_fast'.
Returns:
np.ndarray: `y` resampled to `target_sr`
"""
if mode == 'kaiser_best': if mode == 'kaiser_best':
warnings.warn( warnings.warn(
...@@ -53,7 +61,7 @@ def resample(y: array, src_sr: int, target_sr: int, ...@@ -53,7 +61,7 @@ def resample(y: array, src_sr: int, target_sr: int,
if not isinstance(y, np.ndarray): if not isinstance(y, np.ndarray):
raise ParameterError( raise ParameterError(
'Only support numpy array, but received y in {type(y)}') f'Only support np.ndarray, but received y of type {type(y)}')
if mode not in RESAMPLE_MODES: if mode not in RESAMPLE_MODES:
raise ParameterError(f'resample mode must in {RESAMPLE_MODES}') raise ParameterError(f'resample mode must in {RESAMPLE_MODES}')
...@@ -61,9 +69,17 @@ def resample(y: array, src_sr: int, target_sr: int, ...@@ -61,9 +69,17 @@ def resample(y: array, src_sr: int, target_sr: int,
return resampy.resample(y, src_sr, target_sr, filter=mode) return resampy.resample(y, src_sr, target_sr, filter=mode)
def to_mono(y: array, merge_type: str='average') -> array: def to_mono(y: np.ndarray, merge_type: str='average') -> np.ndarray:
""" convert sterior audio to mono """Convert sterior audio to mono.
Args:
y (np.ndarray): Input waveform array in 1D or 2D.
merge_type (str, optional): Merge type to generate mono waveform. Defaults to 'average'.
Returns:
np.ndarray: `y` with mono channel.
""" """
if merge_type not in MERGE_TYPES: if merge_type not in MERGE_TYPES:
raise ParameterError( raise ParameterError(
f'Unsupported merge type {merge_type}, available types are {MERGE_TYPES}' f'Unsupported merge type {merge_type}, available types are {MERGE_TYPES}'
...@@ -101,18 +117,34 @@ def to_mono(y: array, merge_type: str='average') -> array: ...@@ -101,18 +117,34 @@ def to_mono(y: array, merge_type: str='average') -> array:
return y_out return y_out
def _safe_cast(y: array, dtype: Union[type, str]) -> array: def _safe_cast(y: np.ndarray, dtype: Union[type, str]) -> np.ndarray:
""" data type casting in a safe way, i.e., prevent overflow or underflow """Data type casting in a safe way, i.e., prevent overflow or underflow.
This function is used internally.
Args:
y (np.ndarray): Input waveform array in 1D or 2D.
dtype (Union[type, str]): Data type of waveform.
Returns:
np.ndarray: `y` after safe casting.
""" """
return np.clip(y, np.iinfo(dtype).min, np.iinfo(dtype).max).astype(dtype) if 'float' in str(y.dtype):
return np.clip(y, np.finfo(dtype).min,
np.finfo(dtype).max).astype(dtype)
else:
return np.clip(y, np.iinfo(dtype).min,
np.iinfo(dtype).max).astype(dtype)
def depth_convert(y: array, dtype: Union[type, str], def depth_convert(y: np.ndarray, dtype: Union[type, str]) -> np.ndarray:
dithering: bool=True) -> array: """Convert audio array to target dtype safely. This function converts the audio waveform to a target dtype, with additional steps of
"""Convert audio array to target dtype safely
This function convert audio waveform to a target dtype, with addition steps of
preventing overflow/underflow and preserving audio range. preventing overflow/underflow and preserving audio range.
Args:
y (np.ndarray): Input waveform array in 1D or 2D.
dtype (Union[type, str]): Data type of waveform.
Returns:
np.ndarray: `y` after safe casting.
""" """
SUPPORT_DTYPE = ['int16', 'int8', 'float32', 'float64'] SUPPORT_DTYPE = ['int16', 'int8', 'float32', 'float64']
...@@ -157,14 +189,20 @@ def depth_convert(y: array, dtype: Union[type, str], ...@@ -157,14 +189,20 @@ def depth_convert(y: array, dtype: Union[type, str],
return y return y
def sound_file_load(file: str, def sound_file_load(file: os.PathLike,
offset: Optional[float]=None, offset: Optional[float]=None,
dtype: str='int16', dtype: str='int16',
duration: Optional[int]=None) -> Tuple[array, int]: duration: Optional[int]=None) -> Tuple[np.ndarray, int]:
"""Load audio using soundfile library """Load audio using soundfile library. This function load audio file using libsndfile.
This function load audio file using libsndfile.
Reference: Args:
http://www.mega-nerd.com/libsndfile/#Features file (os.PathLike): File of waveform.
offset (Optional[float], optional): Offset to the start of waveform. Defaults to None.
dtype (str, optional): Data type of waveform. Defaults to 'int16'.
duration (Optional[int], optional): Duration of waveform to read. Defaults to None.
Returns:
Tuple[np.ndarray, int]: Waveform in ndarray and its samplerate.
""" """
with sf.SoundFile(file) as sf_desc: with sf.SoundFile(file) as sf_desc:
sr_native = sf_desc.samplerate sr_native = sf_desc.samplerate
...@@ -179,9 +217,17 @@ def sound_file_load(file: str, ...@@ -179,9 +217,17 @@ def sound_file_load(file: str,
return y, sf_desc.samplerate return y, sf_desc.samplerate
def normalize(y: array, norm_type: str='linear', def normalize(y: np.ndarray, norm_type: str='linear',
mul_factor: float=1.0) -> array: mul_factor: float=1.0) -> np.ndarray:
""" normalize an input audio with additional multiplier. """Normalize an input audio with additional multiplier.
Args:
y (np.ndarray): Input waveform array in 1D or 2D.
norm_type (str, optional): Type of normalization. Defaults to 'linear'.
mul_factor (float, optional): Scaling factor. Defaults to 1.0.
Returns:
np.ndarray: `y` after normalization.
""" """
if norm_type == 'linear': if norm_type == 'linear':
...@@ -199,12 +245,13 @@ def normalize(y: array, norm_type: str='linear', ...@@ -199,12 +245,13 @@ def normalize(y: array, norm_type: str='linear',
return y return y
def save(y: array, sr: int, file: str) -> None: def save(y: np.ndarray, sr: int, file: os.PathLike) -> None:
"""Save audio file to disk. """Save audio file to disk. This function saves audio to disk using scipy.io.wavfile, with additional step to convert input waveform to int16.
This function saves audio to disk using scipy.io.wavfile, with additional step
to convert input waveform to int16 unless it already is int16 Args:
Notes: y (np.ndarray): Input waveform array in 1D or 2D.
It only support raw wav format. sr (int): Sample rate.
file (os.PathLike): Path of audio file to save.
""" """
if not file.endswith('.wav'): if not file.endswith('.wav'):
raise ParameterError( raise ParameterError(
...@@ -226,7 +273,7 @@ def save(y: array, sr: int, file: str) -> None: ...@@ -226,7 +273,7 @@ def save(y: array, sr: int, file: str) -> None:
def load( def load(
file: str, file: os.PathLike,
sr: Optional[int]=None, sr: Optional[int]=None,
mono: bool=True, mono: bool=True,
merge_type: str='average', # ch0,ch1,random,average merge_type: str='average', # ch0,ch1,random,average
...@@ -236,11 +283,24 @@ def load( ...@@ -236,11 +283,24 @@ def load(
offset: float=0.0, offset: float=0.0,
duration: Optional[int]=None, duration: Optional[int]=None,
dtype: str='float32', dtype: str='float32',
resample_mode: str='kaiser_fast') -> Tuple[array, int]: resample_mode: str='kaiser_fast') -> Tuple[np.ndarray, int]:
"""Load audio file from disk. """Load audio file from disk. This function loads audio from disk using using audio beackend.
This function loads audio from disk using using audio beackend.
Parameters: Args:
Notes: file (os.PathLike): Path of auido file to load.
sr (Optional[int], optional): Sample rate of loaded waveform. Defaults to None.
mono (bool, optional): Return waveform with mono channel. Defaults to True.
merge_type (str, optional): Merge type of multi-channels waveform. Defaults to 'average'.
normal (bool, optional): Waveform normalization. Defaults to True.
norm_type (str, optional): Type of normalization. Defaults to 'linear'.
norm_mul_factor (float, optional): Scaling factor. Defaults to 1.0.
offset (float, optional): Offset to the start of waveform. Defaults to 0.0.
duration (Optional[int], optional): Duration of waveform to read. Defaults to None.
dtype (str, optional): Data type of waveform. Defaults to 'float32'.
resample_mode (str, optional): The resampling filter to use. Defaults to 'kaiser_fast'.
Returns:
Tuple[np.ndarray, int]: Waveform in ndarray and its samplerate.
""" """
y, r = sound_file_load(file, offset=offset, dtype=dtype, duration=duration) y, r = sound_file_load(file, offset=offset, dtype=dtype, duration=duration)
......
...@@ -220,7 +220,7 @@ def spectrogram(waveform: Tensor, ...@@ -220,7 +220,7 @@ def spectrogram(waveform: Tensor,
"""Compute and return a spectrogram from a waveform. The output is identical to Kaldi's. """Compute and return a spectrogram from a waveform. The output is identical to Kaldi's.
Args: Args:
waveform (Tensor): A waveform tensor with shape [C, T]. waveform (Tensor): A waveform tensor with shape `(C, T)`.
blackman_coeff (float, optional): Coefficient for Blackman window.. Defaults to 0.42. blackman_coeff (float, optional): Coefficient for Blackman window.. Defaults to 0.42.
channel (int, optional): Select the channel of waveform. Defaults to -1. channel (int, optional): Select the channel of waveform. Defaults to -1.
dither (float, optional): Dithering constant . Defaults to 0.0. dither (float, optional): Dithering constant . Defaults to 0.0.
...@@ -239,7 +239,7 @@ def spectrogram(waveform: Tensor, ...@@ -239,7 +239,7 @@ def spectrogram(waveform: Tensor,
window_type (str, optional): Choose type of window for FFT computation. Defaults to POVEY. window_type (str, optional): Choose type of window for FFT computation. Defaults to POVEY.
Returns: Returns:
Tensor: A spectrogram tensor with shape (m, padded_window_size // 2 + 1) where m is the number of frames Tensor: A spectrogram tensor with shape `(m, padded_window_size // 2 + 1)` where m is the number of frames
depends on frame_length and frame_shift. depends on frame_length and frame_shift.
""" """
dtype = waveform.dtype dtype = waveform.dtype
...@@ -422,7 +422,7 @@ def fbank(waveform: Tensor, ...@@ -422,7 +422,7 @@ def fbank(waveform: Tensor,
"""Compute and return filter banks from a waveform. The output is identical to Kaldi's. """Compute and return filter banks from a waveform. The output is identical to Kaldi's.
Args: Args:
waveform (Tensor): A waveform tensor with shape [C, T]. waveform (Tensor): A waveform tensor with shape `(C, T)`.
blackman_coeff (float, optional): Coefficient for Blackman window.. Defaults to 0.42. blackman_coeff (float, optional): Coefficient for Blackman window.. Defaults to 0.42.
channel (int, optional): Select the channel of waveform. Defaults to -1. channel (int, optional): Select the channel of waveform. Defaults to -1.
dither (float, optional): Dithering constant . Defaults to 0.0. dither (float, optional): Dithering constant . Defaults to 0.0.
...@@ -451,7 +451,7 @@ def fbank(waveform: Tensor, ...@@ -451,7 +451,7 @@ def fbank(waveform: Tensor,
window_type (str, optional): Choose type of window for FFT computation. Defaults to POVEY. window_type (str, optional): Choose type of window for FFT computation. Defaults to POVEY.
Returns: Returns:
Tensor: A filter banks tensor with shape (m, n_mels). Tensor: A filter banks tensor with shape `(m, n_mels)`.
""" """
dtype = waveform.dtype dtype = waveform.dtype
...@@ -542,7 +542,7 @@ def mfcc(waveform: Tensor, ...@@ -542,7 +542,7 @@ def mfcc(waveform: Tensor,
identical to Kaldi's. identical to Kaldi's.
Args: Args:
waveform (Tensor): A waveform tensor with shape [C, T]. waveform (Tensor): A waveform tensor with shape `(C, T)`.
blackman_coeff (float, optional): Coefficient for Blackman window.. Defaults to 0.42. blackman_coeff (float, optional): Coefficient for Blackman window.. Defaults to 0.42.
cepstral_lifter (float, optional): Scaling of output mfccs. Defaults to 22.0. cepstral_lifter (float, optional): Scaling of output mfccs. Defaults to 22.0.
channel (int, optional): Select the channel of waveform. Defaults to -1. channel (int, optional): Select the channel of waveform. Defaults to -1.
...@@ -571,7 +571,7 @@ def mfcc(waveform: Tensor, ...@@ -571,7 +571,7 @@ def mfcc(waveform: Tensor,
window_type (str, optional): Choose type of window for FFT computation. Defaults to POVEY. window_type (str, optional): Choose type of window for FFT computation. Defaults to POVEY.
Returns: Returns:
Tensor: A mel frequency cepstral coefficients tensor with shape (m, n_mfcc). Tensor: A mel frequency cepstral coefficients tensor with shape `(m, n_mfcc)`.
""" """
assert n_mfcc <= n_mels, 'n_mfcc cannot be larger than n_mels: %d vs %d' % ( assert n_mfcc <= n_mels, 'n_mfcc cannot be larger than n_mels: %d vs %d' % (
n_mfcc, n_mels) n_mfcc, n_mels)
......
...@@ -17,6 +17,7 @@ from typing import Optional ...@@ -17,6 +17,7 @@ from typing import Optional
from typing import Union from typing import Union
import paddle import paddle
from paddle import Tensor
__all__ = [ __all__ = [
'hz_to_mel', 'hz_to_mel',
...@@ -29,19 +30,20 @@ __all__ = [ ...@@ -29,19 +30,20 @@ __all__ = [
] ]
def hz_to_mel(freq: Union[paddle.Tensor, float], def hz_to_mel(freq: Union[Tensor, float],
htk: bool=False) -> Union[paddle.Tensor, float]: htk: bool=False) -> Union[Tensor, float]:
"""Convert Hz to Mels. """Convert Hz to Mels.
Parameters:
freq: the input tensor of arbitrary shape, or a single floating point number. Args:
htk: use HTK formula to do the conversion. freq (Union[Tensor, float]): The input tensor with arbitrary shape.
The default value is False. htk (bool, optional): Use htk scaling. Defaults to False.
Returns: Returns:
The frequencies represented in Mel-scale. Union[Tensor, float]: Frequency in mels.
""" """
if htk: if htk:
if isinstance(freq, paddle.Tensor): if isinstance(freq, Tensor):
return 2595.0 * paddle.log10(1.0 + freq / 700.0) return 2595.0 * paddle.log10(1.0 + freq / 700.0)
else: else:
return 2595.0 * math.log10(1.0 + freq / 700.0) return 2595.0 * math.log10(1.0 + freq / 700.0)
...@@ -58,7 +60,7 @@ def hz_to_mel(freq: Union[paddle.Tensor, float], ...@@ -58,7 +60,7 @@ def hz_to_mel(freq: Union[paddle.Tensor, float],
min_log_mel = (min_log_hz - f_min) / f_sp # same (Mels) min_log_mel = (min_log_hz - f_min) / f_sp # same (Mels)
logstep = math.log(6.4) / 27.0 # step size for log region logstep = math.log(6.4) / 27.0 # step size for log region
if isinstance(freq, paddle.Tensor): if isinstance(freq, Tensor):
target = min_log_mel + paddle.log( target = min_log_mel + paddle.log(
freq / min_log_hz + 1e-10) / logstep # prevent nan with 1e-10 freq / min_log_hz + 1e-10) / logstep # prevent nan with 1e-10
mask = (freq > min_log_hz).astype(freq.dtype) mask = (freq > min_log_hz).astype(freq.dtype)
...@@ -71,14 +73,16 @@ def hz_to_mel(freq: Union[paddle.Tensor, float], ...@@ -71,14 +73,16 @@ def hz_to_mel(freq: Union[paddle.Tensor, float],
return mels return mels
def mel_to_hz(mel: Union[float, paddle.Tensor], def mel_to_hz(mel: Union[float, Tensor],
htk: bool=False) -> Union[float, paddle.Tensor]: htk: bool=False) -> Union[float, Tensor]:
"""Convert mel bin numbers to frequencies. """Convert mel bin numbers to frequencies.
Parameters:
mel: the mel frequency represented as a tensor of arbitrary shape, or a floating point number. Args:
htk: use HTK formula to do the conversion. mel (Union[float, Tensor]): The mel frequency represented as a tensor with arbitrary shape.
htk (bool, optional): Use htk scaling. Defaults to False.
Returns: Returns:
The frequencies represented in hz. Union[float, Tensor]: Frequencies in Hz.
""" """
if htk: if htk:
return 700.0 * (10.0**(mel / 2595.0) - 1.0) return 700.0 * (10.0**(mel / 2595.0) - 1.0)
...@@ -90,7 +94,7 @@ def mel_to_hz(mel: Union[float, paddle.Tensor], ...@@ -90,7 +94,7 @@ def mel_to_hz(mel: Union[float, paddle.Tensor],
min_log_hz = 1000.0 # beginning of log region (Hz) min_log_hz = 1000.0 # beginning of log region (Hz)
min_log_mel = (min_log_hz - f_min) / f_sp # same (Mels) min_log_mel = (min_log_hz - f_min) / f_sp # same (Mels)
logstep = math.log(6.4) / 27.0 # step size for log region logstep = math.log(6.4) / 27.0 # step size for log region
if isinstance(mel, paddle.Tensor): if isinstance(mel, Tensor):
target = min_log_hz * paddle.exp(logstep * (mel - min_log_mel)) target = min_log_hz * paddle.exp(logstep * (mel - min_log_mel))
mask = (mel > min_log_mel).astype(mel.dtype) mask = (mel > min_log_mel).astype(mel.dtype)
freqs = target * mask + freqs * ( freqs = target * mask + freqs * (
...@@ -106,16 +110,18 @@ def mel_frequencies(n_mels: int=64, ...@@ -106,16 +110,18 @@ def mel_frequencies(n_mels: int=64,
f_min: float=0.0, f_min: float=0.0,
f_max: float=11025.0, f_max: float=11025.0,
htk: bool=False, htk: bool=False,
dtype: str=paddle.float32): dtype: str='float32') -> Tensor:
"""Compute mel frequencies. """Compute mel frequencies.
Parameters:
n_mels(int): number of Mel bins. Args:
f_min(float): the lower cut-off frequency, below which the filter response is zero. n_mels (int, optional): Number of mel bins. Defaults to 64.
f_max(float): the upper cut-off frequency, above which the filter response is zero. f_min (float, optional): Minimum frequency in Hz. Defaults to 0.0.
htk(bool): whether to use htk formula. fmax (float, optional): Maximum frequency in Hz. Defaults to 11025.0.
dtype(str): the datatype of the return frequencies. htk (bool, optional): Use htk scaling. Defaults to False.
dtype (str, optional): The data type of the return frequencies. Defaults to 'float32'.
Returns: Returns:
The frequencies represented in Mel-scale Tensor: Tensor of n_mels frequencies in Hz with shape `(n_mels,)`.
""" """
# 'Center freqs' of mel bands - uniformly spaced between limits # 'Center freqs' of mel bands - uniformly spaced between limits
min_mel = hz_to_mel(f_min, htk=htk) min_mel = hz_to_mel(f_min, htk=htk)
...@@ -125,14 +131,16 @@ def mel_frequencies(n_mels: int=64, ...@@ -125,14 +131,16 @@ def mel_frequencies(n_mels: int=64,
return freqs return freqs
def fft_frequencies(sr: int, n_fft: int, dtype: str=paddle.float32): def fft_frequencies(sr: int, n_fft: int, dtype: str='float32') -> Tensor:
"""Compute fourier frequencies. """Compute fourier frequencies.
Parameters:
sr(int): the audio sample rate. Args:
n_fft(float): the number of fft bins. sr (int): Sample rate.
dtype(str): the datatype of the return frequencies. n_fft (int): Number of fft bins.
dtype (str, optional): The data type of the return frequencies. Defaults to 'float32'.
Returns: Returns:
The frequencies represented in hz. Tensor: FFT frequencies in Hz with shape `(n_fft//2 + 1,)`.
""" """
return paddle.linspace(0, float(sr) / 2, int(1 + n_fft // 2), dtype=dtype) return paddle.linspace(0, float(sr) / 2, int(1 + n_fft // 2), dtype=dtype)
...@@ -144,23 +152,21 @@ def compute_fbank_matrix(sr: int, ...@@ -144,23 +152,21 @@ def compute_fbank_matrix(sr: int,
f_max: Optional[float]=None, f_max: Optional[float]=None,
htk: bool=False, htk: bool=False,
norm: Union[str, float]='slaney', norm: Union[str, float]='slaney',
dtype: str=paddle.float32): dtype: str='float32') -> Tensor:
"""Compute fbank matrix. """Compute fbank matrix.
Parameters:
sr(int): the audio sample rate. Args:
n_fft(int): the number of fft bins. sr (int): Sample rate.
n_mels(int): the number of Mel bins. n_fft (int): Number of fft bins.
f_min(float): the lower cut-off frequency, below which the filter response is zero. n_mels (int, optional): Number of mel bins. Defaults to 64.
f_max(float): the upper cut-off frequency, above which the filter response is zero. f_min (float, optional): Minimum frequency in Hz. Defaults to 0.0.
htk: whether to use htk formula. f_max (Optional[float], optional): Maximum frequency in Hz. Defaults to None.
return_complex(bool): whether to return complex matrix. If True, the matrix will htk (bool, optional): Use htk scaling. Defaults to False.
be complex type. Otherwise, the real and image part will be stored in the last norm (Union[str, float], optional): Type of normalization. Defaults to 'slaney'.
axis of returned tensor. dtype (str, optional): The data type of the return matrix. Defaults to 'float32'.
dtype(str): the datatype of the returned fbank matrix.
Returns: Returns:
The fbank matrix of shape (n_mels, int(1+n_fft//2)). Tensor: Mel transform matrix with shape `(n_mels, n_fft//2 + 1)`.
Shape:
output: (n_mels, int(1+n_fft//2))
""" """
if f_max is None: if f_max is None:
...@@ -199,27 +205,20 @@ def compute_fbank_matrix(sr: int, ...@@ -199,27 +205,20 @@ def compute_fbank_matrix(sr: int,
return weights return weights
def power_to_db(magnitude: paddle.Tensor, def power_to_db(spect: Tensor,
ref_value: float=1.0, ref_value: float=1.0,
amin: float=1e-10, amin: float=1e-10,
top_db: Optional[float]=None) -> paddle.Tensor: top_db: Optional[float]=None) -> Tensor:
"""Convert a power spectrogram (amplitude squared) to decibel (dB) units. """Convert a power spectrogram (amplitude squared) to decibel (dB) units. The function computes the scaling `10 * log10(x / ref)` in a numerically stable way.
The function computes the scaling ``10 * log10(x / ref)`` in a numerically
stable way. Args:
Parameters: spect (Tensor): STFT power spectrogram.
magnitude(Tensor): the input magnitude tensor of any shape. ref_value (float, optional): The reference value. If smaller than 1.0, the db level of the signal will be pulled up accordingly. Otherwise, the db level is pushed down. Defaults to 1.0.
ref_value(float): the reference value. If smaller than 1.0, the db level amin (float, optional): Minimum threshold. Defaults to 1e-10.
of the signal will be pulled up accordingly. Otherwise, the db level top_db (Optional[float], optional): Threshold the output at `top_db` below the peak. Defaults to None.
is pushed down.
amin(float): the minimum value of input magnitude, below which the input
magnitude is clipped(to amin).
top_db(float): the maximum db value of resulting spectrum, above which the
spectrum is clipped(to top_db).
Returns: Returns:
The spectrogram in log-scale. Tensor: Power spectrogram in db scale.
shape:
input: any shape
output: same as input
""" """
if amin <= 0: if amin <= 0:
raise Exception("amin must be strictly positive") raise Exception("amin must be strictly positive")
...@@ -227,8 +226,8 @@ def power_to_db(magnitude: paddle.Tensor, ...@@ -227,8 +226,8 @@ def power_to_db(magnitude: paddle.Tensor,
if ref_value <= 0: if ref_value <= 0:
raise Exception("ref_value must be strictly positive") raise Exception("ref_value must be strictly positive")
ones = paddle.ones_like(magnitude) ones = paddle.ones_like(spect)
log_spec = 10.0 * paddle.log10(paddle.maximum(ones * amin, magnitude)) log_spec = 10.0 * paddle.log10(paddle.maximum(ones * amin, spect))
log_spec -= 10.0 * math.log10(max(ref_value, amin)) log_spec -= 10.0 * math.log10(max(ref_value, amin))
if top_db is not None: if top_db is not None:
...@@ -242,15 +241,17 @@ def power_to_db(magnitude: paddle.Tensor, ...@@ -242,15 +241,17 @@ def power_to_db(magnitude: paddle.Tensor,
def create_dct(n_mfcc: int, def create_dct(n_mfcc: int,
n_mels: int, n_mels: int,
norm: Optional[str]='ortho', norm: Optional[str]='ortho',
dtype: Optional[str]=paddle.float32) -> paddle.Tensor: dtype: str='float32') -> Tensor:
"""Create a discrete cosine transform(DCT) matrix. """Create a discrete cosine transform(DCT) matrix.
Parameters: Args:
n_mfcc (int): Number of mel frequency cepstral coefficients. n_mfcc (int): Number of mel frequency cepstral coefficients.
n_mels (int): Number of mel filterbanks. n_mels (int): Number of mel filterbanks.
norm (str, optional): Normalization type. Defaults to 'ortho'. norm (Optional[str], optional): Normalization type. Defaults to 'ortho'.
dtype (str, optional): The data type of the return matrix. Defaults to 'float32'.
Returns: Returns:
Tensor: The DCT matrix with shape (n_mels, n_mfcc). Tensor: The DCT matrix with shape `(n_mels, n_mfcc)`.
""" """
n = paddle.arange(n_mels, dtype=dtype) n = paddle.arange(n_mels, dtype=dtype)
k = paddle.arange(n_mfcc, dtype=dtype).unsqueeze(1) k = paddle.arange(n_mfcc, dtype=dtype).unsqueeze(1)
......
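The helpers above mirror librosa's mel utilities in Paddle. A short sketch of how they compose into a log-mel pipeline, assuming these functions are importable from `paddleaudio.functional` (the import path and the random spectrogram are only for illustration):

```python
# Sketch: build a mel filter bank and convert a power spectrogram to dB.
import paddle
from paddleaudio.functional import compute_fbank_matrix, power_to_db

sr, n_fft, n_mels = 16000, 512, 64
weights = compute_fbank_matrix(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, n_fft//2 + 1)

spect = paddle.rand([n_fft // 2 + 1, 100])     # stand-in power spectrogram, (n_fft//2 + 1, n_frames)
mel_spect = paddle.matmul(weights, spect)      # (n_mels, n_frames)
log_mel = power_to_db(mel_spect, top_db=80.0)  # dB scale, clipped 80 dB below the peak
```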
...@@ -20,24 +20,11 @@ from paddle import Tensor ...@@ -20,24 +20,11 @@ from paddle import Tensor
__all__ = [ __all__ = [
'get_window', 'get_window',
# windows
'taylor',
'hamming',
'hann',
'tukey',
'kaiser',
'gaussian',
'exponential',
'triang',
'bohman',
'blackman',
'cosine',
] ]
def _cat(a: List[Tensor], data_type: str) -> Tensor: def _cat(x: List[Tensor], data_type: str) -> Tensor:
l = [paddle.to_tensor(_a, data_type) for _a in a] l = [paddle.to_tensor(_, data_type) for _ in x]
return paddle.concat(l) return paddle.concat(l)
...@@ -48,7 +35,7 @@ def _acosh(x: Union[Tensor, float]) -> Tensor: ...@@ -48,7 +35,7 @@ def _acosh(x: Union[Tensor, float]) -> Tensor:
def _extend(M: int, sym: bool) -> bool: def _extend(M: int, sym: bool) -> bool:
"""Extend window by 1 sample if needed for DFT-even symmetry""" """Extend window by 1 sample if needed for DFT-even symmetry. """
if not sym: if not sym:
return M + 1, True return M + 1, True
else: else:
...@@ -56,7 +43,7 @@ def _extend(M: int, sym: bool) -> bool: ...@@ -56,7 +43,7 @@ def _extend(M: int, sym: bool) -> bool:
def _len_guards(M: int) -> bool: def _len_guards(M: int) -> bool:
"""Handle small or incorrect window lengths""" """Handle small or incorrect window lengths. """
if int(M) != M or M < 0: if int(M) != M or M < 0:
raise ValueError('Window length M must be a non-negative integer') raise ValueError('Window length M must be a non-negative integer')
...@@ -64,15 +51,15 @@ def _len_guards(M: int) -> bool: ...@@ -64,15 +51,15 @@ def _len_guards(M: int) -> bool:
def _truncate(w: Tensor, needed: bool) -> Tensor: def _truncate(w: Tensor, needed: bool) -> Tensor:
"""Truncate window by 1 sample if needed for DFT-even symmetry""" """Truncate window by 1 sample if needed for DFT-even symmetry. """
if needed: if needed:
return w[:-1] return w[:-1]
else: else:
return w return w
def general_gaussian(M: int, p, sig, sym: bool=True, def _general_gaussian(M: int, p, sig, sym: bool=True,
dtype: str='float64') -> Tensor: dtype: str='float64') -> Tensor:
"""Compute a window with a generalized Gaussian shape. """Compute a window with a generalized Gaussian shape.
This function is consistent with scipy.signal.windows.general_gaussian(). This function is consistent with scipy.signal.windows.general_gaussian().
""" """
...@@ -86,8 +73,8 @@ def general_gaussian(M: int, p, sig, sym: bool=True, ...@@ -86,8 +73,8 @@ def general_gaussian(M: int, p, sig, sym: bool=True,
return _truncate(w, needs_trunc) return _truncate(w, needs_trunc)
def general_cosine(M: int, a: float, sym: bool=True, def _general_cosine(M: int, a: float, sym: bool=True,
dtype: str='float64') -> Tensor: dtype: str='float64') -> Tensor:
"""Compute a generic weighted sum of cosine terms window. """Compute a generic weighted sum of cosine terms window.
This function is consistent with scipy.signal.windows.general_cosine(). This function is consistent with scipy.signal.windows.general_cosine().
""" """
...@@ -101,31 +88,23 @@ def general_cosine(M: int, a: float, sym: bool=True, ...@@ -101,31 +88,23 @@ def general_cosine(M: int, a: float, sym: bool=True,
return _truncate(w, needs_trunc) return _truncate(w, needs_trunc)
def general_hamming(M: int, alpha: float, sym: bool=True, def _general_hamming(M: int, alpha: float, sym: bool=True,
dtype: str='float64') -> Tensor: dtype: str='float64') -> Tensor:
"""Compute a generalized Hamming window. """Compute a generalized Hamming window.
This function is consistent with scipy.signal.windows.general_hamming() This function is consistent with scipy.signal.windows.general_hamming()
""" """
return general_cosine(M, [alpha, 1. - alpha], sym, dtype=dtype) return _general_cosine(M, [alpha, 1. - alpha], sym, dtype=dtype)
def taylor(M: int, def _taylor(M: int,
nbar=4, nbar=4,
sll=30, sll=30,
norm=True, norm=True,
sym: bool=True, sym: bool=True,
dtype: str='float64') -> Tensor: dtype: str='float64') -> Tensor:
"""Compute a Taylor window. """Compute a Taylor window.
The Taylor window taper function approximates the Dolph-Chebyshev window's The Taylor window taper function approximates the Dolph-Chebyshev window's
constant sidelobe level for a parameterized number of near-in sidelobes. constant sidelobe level for a parameterized number of near-in sidelobes.
Parameters:
M(int): window size
nbar, sil, norm: the window-specific parameter.
sym(bool):whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
""" """
if _len_guards(M): if _len_guards(M):
return paddle.ones((M, ), dtype=dtype) return paddle.ones((M, ), dtype=dtype)
...@@ -171,46 +150,25 @@ def taylor(M: int, ...@@ -171,46 +150,25 @@ def taylor(M: int,
return _truncate(w, needs_trunc) return _truncate(w, needs_trunc)
def hamming(M: int, sym: bool=True, dtype: str='float64') -> Tensor: def _hamming(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
"""Compute a Hamming window. """Compute a Hamming window.
The Hamming window is a taper formed by using a raised cosine with The Hamming window is a taper formed by using a raised cosine with
non-zero endpoints, optimized to minimize the nearest side lobe. non-zero endpoints, optimized to minimize the nearest side lobe.
Parameters:
M(int): window size
sym(bool):whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
""" """
return general_hamming(M, 0.54, sym, dtype=dtype) return _general_hamming(M, 0.54, sym, dtype=dtype)
def hann(M: int, sym: bool=True, dtype: str='float64') -> Tensor: def _hann(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
"""Compute a Hann window. """Compute a Hann window.
The Hann window is a taper formed by using a raised cosine or sine-squared The Hann window is a taper formed by using a raised cosine or sine-squared
with ends that touch zero. with ends that touch zero.
Parameters:
M(int): window size
sym(bool):whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
""" """
return general_hamming(M, 0.5, sym, dtype=dtype) return _general_hamming(M, 0.5, sym, dtype=dtype)
def tukey(M: int, alpha=0.5, sym: bool=True, dtype: str='float64') -> Tensor: def _tukey(M: int, alpha=0.5, sym: bool=True, dtype: str='float64') -> Tensor:
"""Compute a Tukey window. """Compute a Tukey window.
The Tukey window is also known as a tapered cosine window. The Tukey window is also known as a tapered cosine window.
Parameters:
M(int): window size
sym(bool):whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
""" """
if _len_guards(M): if _len_guards(M):
return paddle.ones((M, ), dtype=dtype) return paddle.ones((M, ), dtype=dtype)
...@@ -237,32 +195,18 @@ def tukey(M: int, alpha=0.5, sym: bool=True, dtype: str='float64') -> Tensor: ...@@ -237,32 +195,18 @@ def tukey(M: int, alpha=0.5, sym: bool=True, dtype: str='float64') -> Tensor:
return _truncate(w, needs_trunc) return _truncate(w, needs_trunc)
def kaiser(M: int, beta: float, sym: bool=True, dtype: str='float64') -> Tensor: def _kaiser(M: int, beta: float, sym: bool=True,
dtype: str='float64') -> Tensor:
"""Compute a Kaiser window. """Compute a Kaiser window.
The Kaiser window is a taper formed by using a Bessel function. The Kaiser window is a taper formed by using a Bessel function.
Parameters:
M(int): window size.
beta(float): the window-specific parameter.
sym(bool):whether to return symmetric window.
The default value is True
Returns:
Tensor: the window tensor
""" """
raise NotImplementedError() raise NotImplementedError()
def gaussian(M: int, std: float, sym: bool=True, def _gaussian(M: int, std: float, sym: bool=True,
dtype: str='float64') -> Tensor: dtype: str='float64') -> Tensor:
"""Compute a Gaussian window. """Compute a Gaussian window.
The Gaussian window has a Gaussian shape defined by the standard deviation (std). The Gaussian window has a Gaussian shape defined by the standard deviation (std).
Parameters:
M(int): window size.
std(float): the window-specific parameter.
sym(bool):whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
""" """
if _len_guards(M): if _len_guards(M):
return paddle.ones((M, ), dtype=dtype) return paddle.ones((M, ), dtype=dtype)
...@@ -275,21 +219,12 @@ def gaussian(M: int, std: float, sym: bool=True, ...@@ -275,21 +219,12 @@ def gaussian(M: int, std: float, sym: bool=True,
return _truncate(w, needs_trunc) return _truncate(w, needs_trunc)
def exponential(M: int, def _exponential(M: int,
center=None, center=None,
tau=1., tau=1.,
sym: bool=True, sym: bool=True,
dtype: str='float64') -> Tensor: dtype: str='float64') -> Tensor:
"""Compute an exponential (or Poisson) window. """Compute an exponential (or Poisson) window. """
Parameters:
M(int): window size.
tau(float): the window-specific parameter.
sym(bool):whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
"""
if sym and center is not None: if sym and center is not None:
raise ValueError("If sym==True, center must be None.") raise ValueError("If sym==True, center must be None.")
if _len_guards(M): if _len_guards(M):
...@@ -305,15 +240,8 @@ def exponential(M: int, ...@@ -305,15 +240,8 @@ def exponential(M: int,
return _truncate(w, needs_trunc) return _truncate(w, needs_trunc)
def triang(M: int, sym: bool=True, dtype: str='float64') -> Tensor: def _triang(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
"""Compute a triangular window. """Compute a triangular window.
Parameters:
M(int): window size.
sym(bool):whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
""" """
if _len_guards(M): if _len_guards(M):
return paddle.ones((M, ), dtype=dtype) return paddle.ones((M, ), dtype=dtype)
...@@ -330,16 +258,9 @@ def triang(M: int, sym: bool=True, dtype: str='float64') -> Tensor: ...@@ -330,16 +258,9 @@ def triang(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
return _truncate(w, needs_trunc) return _truncate(w, needs_trunc)
def bohman(M: int, sym: bool=True, dtype: str='float64') -> Tensor: def _bohman(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
"""Compute a Bohman window. """Compute a Bohman window.
The Bohman window is the autocorrelation of a cosine window. The Bohman window is the autocorrelation of a cosine window.
Parameters:
M(int): window size.
sym(bool):whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
""" """
if _len_guards(M): if _len_guards(M):
return paddle.ones((M, ), dtype=dtype) return paddle.ones((M, ), dtype=dtype)
...@@ -353,32 +274,18 @@ def bohman(M: int, sym: bool=True, dtype: str='float64') -> Tensor: ...@@ -353,32 +274,18 @@ def bohman(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
return _truncate(w, needs_trunc) return _truncate(w, needs_trunc)
def blackman(M: int, sym: bool=True, dtype: str='float64') -> Tensor: def _blackman(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
"""Compute a Blackman window. """Compute a Blackman window.
The Blackman window is a taper formed by using the first three terms of The Blackman window is a taper formed by using the first three terms of
a summation of cosines. It was designed to have close to the minimal a summation of cosines. It was designed to have close to the minimal
leakage possible. It is close to optimal, only slightly worse than a leakage possible. It is close to optimal, only slightly worse than a
Kaiser window. Kaiser window.
Parameters:
M(int): window size.
sym(bool):whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
""" """
return general_cosine(M, [0.42, 0.50, 0.08], sym, dtype=dtype) return _general_cosine(M, [0.42, 0.50, 0.08], sym, dtype=dtype)
def cosine(M: int, sym: bool=True, dtype: str='float64') -> Tensor: def _cosine(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
"""Compute a window with a simple cosine shape. """Compute a window with a simple cosine shape.
Parameters:
M(int): window size.
sym(bool):whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
""" """
if _len_guards(M): if _len_guards(M):
return paddle.ones((M, ), dtype=dtype) return paddle.ones((M, ), dtype=dtype)
...@@ -388,19 +295,20 @@ def cosine(M: int, sym: bool=True, dtype: str='float64') -> Tensor: ...@@ -388,19 +295,20 @@ def cosine(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
return _truncate(w, needs_trunc) return _truncate(w, needs_trunc)
## factory function
def get_window(window: Union[str, Tuple[str, float]], def get_window(window: Union[str, Tuple[str, float]],
win_length: int, win_length: int,
fftbins: bool=True, fftbins: bool=True,
dtype: str='float64') -> Tensor: dtype: str='float64') -> Tensor:
"""Return a window of a given length and type. """Return a window of a given length and type.
Parameters:
window(str|(str,float)): the type of window to create. Args:
win_length(int): the number of samples in the window. window (Union[str, Tuple[str, float]]): The window function applied to the signal before the Fourier transform. Supported window functions: 'hamming', 'hann', 'kaiser', 'gaussian', 'exponential', 'triang', 'bohman', 'blackman', 'cosine', 'tukey', 'taylor'.
fftbins(bool): If True, create a "periodic" window. Otherwise, win_length (int): Number of samples.
create a "symmetric" window, for use in filter design. fftbins (bool, optional): If True, create a "periodic" window. Otherwise, create a "symmetric" window, for use in filter design. Defaults to True.
dtype (str, optional): The data type of the return window. Defaults to 'float64'.
Returns: Returns:
The window represented as a tensor. Tensor: The window represented as a tensor.
""" """
sym = not fftbins sym = not fftbins
...@@ -420,7 +328,7 @@ def get_window(window: Union[str, Tuple[str, float]], ...@@ -420,7 +328,7 @@ def get_window(window: Union[str, Tuple[str, float]],
str(type(window))) str(type(window)))
try: try:
winfunc = eval(winstr) winfunc = eval('_' + winstr)
except KeyError as e: except KeyError as e:
raise ValueError("Unknown window type.") from e raise ValueError("Unknown window type.") from e
......
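With this change the individual window generators become private helpers and `get_window` is the single public entry point. A usage sketch, mirroring the calls in the tests further below (window names and sizes are illustrative):

```python
# Sketch: obtain windows through the factory function instead of the private helpers.
import paddle
from paddleaudio.functional.window import get_window

periodic_hann = get_window('hann', 512)                        # fftbins=True -> "periodic" window
symmetric_hamming = get_window('hamming', 512, fftbins=False)  # "symmetric" window for filter design
gaussian_win = get_window(('gaussian', 7.0), 512)              # parameterized window, per the Tuple[str, float] hint

# The window can then be passed to paddle.signal.stft, as the STFT test below does.
```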
...@@ -20,9 +20,7 @@ __all__ = [ ...@@ -20,9 +20,7 @@ __all__ = [
def dtw_distance(xs: np.ndarray, ys: np.ndarray) -> float: def dtw_distance(xs: np.ndarray, ys: np.ndarray) -> float:
"""dtw distance """Dynamic Time Warping.
Dynamic Time Warping.
This function keeps a compact matrix, not the full warping paths matrix. This function keeps a compact matrix, not the full warping paths matrix.
Uses dynamic programming to compute: Uses dynamic programming to compute:
......
...@@ -11,19 +11,46 @@ ...@@ -11,19 +11,46 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import glob
import os
import setuptools import setuptools
from setuptools.command.install import install
from setuptools.command.test import test
# set the version here # set the version here
VERSION = '0.2.0' VERSION = '0.2.0'
# Inspired by the example at https://pytest.org/latest/goodpractises.html
class TestCommand(test):
def finalize_options(self):
test.finalize_options(self)
self.test_args = []
self.test_suite = True
def run(self):
self.run_benchmark()
super(TestCommand, self).run()
def run_tests(self):
# Run nose ensuring that argv simulates running nosetests directly
import nose
nose.run_exit(argv=['nosetests', '-w', 'tests'])
def run_benchmark(self):
for benchmark_item in glob.glob('tests/benchmark/*py'):
os.system(f'pytest {benchmark_item}')
class InstallCommand(install):
def run(self):
install.run(self)
def write_version_py(filename='paddleaudio/__init__.py'): def write_version_py(filename='paddleaudio/__init__.py'):
import paddleaudio
if hasattr(paddleaudio,
"__version__") and paddleaudio.__version__ == VERSION:
return
with open(filename, "a") as f: with open(filename, "a") as f:
f.write(f"\n__version__ = '{VERSION}'\n") f.write(f"__version__ = '{VERSION}'")
def remove_version_py(filename='paddleaudio/__init__.py'): def remove_version_py(filename='paddleaudio/__init__.py'):
...@@ -35,6 +62,7 @@ def remove_version_py(filename='paddleaudio/__init__.py'): ...@@ -35,6 +62,7 @@ def remove_version_py(filename='paddleaudio/__init__.py'):
f.write(line) f.write(line)
remove_version_py()
write_version_py() write_version_py()
setuptools.setup( setuptools.setup(
...@@ -61,6 +89,16 @@ setuptools.setup( ...@@ -61,6 +89,16 @@ setuptools.setup(
'colorlog', 'colorlog',
'dtaidistance >= 2.3.6', 'dtaidistance >= 2.3.6',
'mcd >= 0.4', 'mcd >= 0.4',
], ) ],
extras_require={
'test': [
'nose', 'librosa==0.8.1', 'soundfile==0.10.3.post1',
'torchaudio==0.10.2', 'pytest-benchmark'
],
},
cmdclass={
'install': InstallCommand,
'test': TestCommand,
}, )
remove_version_py() remove_version_py()
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import unittest
import urllib.request
mono_channel_wav = 'https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav'
multi_channels_wav = 'https://paddlespeech.bj.bcebos.com/PaddleAudio/cat.wav'
class BackendTest(unittest.TestCase):
def setUp(self):
self.initWavInput()
def initWavInput(self):
self.files = []
for url in [mono_channel_wav, multi_channels_wav]:
if not os.path.isfile(os.path.basename(url)):
urllib.request.urlretrieve(url, os.path.basename(url))
self.files.append(os.path.basename(url))
def initParmas(self):
raise NotImplementedError
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import filecmp
import os
import unittest
import numpy as np
import soundfile as sf
import paddleaudio
from ..base import BackendTest
class TestIO(BackendTest):
def test_load_mono_channel(self):
sf_data, sf_sr = sf.read(self.files[0])
pa_data, pa_sr = paddleaudio.load(
self.files[0], normal=False, dtype='float64')
self.assertEqual(sf_data.dtype, pa_data.dtype)
self.assertEqual(sf_sr, pa_sr)
np.testing.assert_array_almost_equal(sf_data, pa_data)
def test_load_multi_channels(self):
sf_data, sf_sr = sf.read(self.files[1])
sf_data = sf_data.T # Channel dim first
pa_data, pa_sr = paddleaudio.load(
self.files[1], mono=False, normal=False, dtype='float64')
self.assertEqual(sf_data.dtype, pa_data.dtype)
self.assertEqual(sf_sr, pa_sr)
np.testing.assert_array_almost_equal(sf_data, pa_data)
def test_save_mono_channel(self):
waveform, sr = np.random.randint(
low=-32768, high=32768, size=(48000), dtype=np.int16), 16000
sf_tmp_file = 'sf_tmp.wav'
pa_tmp_file = 'pa_tmp.wav'
sf.write(sf_tmp_file, waveform, sr)
paddleaudio.save(waveform, sr, pa_tmp_file)
self.assertTrue(filecmp.cmp(sf_tmp_file, pa_tmp_file))
for file in [sf_tmp_file, pa_tmp_file]:
os.remove(file)
def test_save_multi_channels(self):
waveform, sr = np.random.randint(
low=-32768, high=32768, size=(2, 48000), dtype=np.int16), 16000
sf_tmp_file = 'sf_tmp.wav'
pa_tmp_file = 'pa_tmp.wav'
sf.write(sf_tmp_file, waveform.T, sr)
paddleaudio.save(waveform.T, sr, pa_tmp_file)
self.assertTrue(filecmp.cmp(sf_tmp_file, pa_tmp_file))
for file in [sf_tmp_file, pa_tmp_file]:
os.remove(file)
if __name__ == '__main__':
unittest.main()
# 1. Prepare
First, install `pytest-benchmark` via pip.
```sh
pip install pytest-benchmark
```
# 2. Run
Run the specific script for profiling.
```sh
pytest melspectrogram.py
```
Result:
```sh
========================================================================== test session starts ==========================================================================
platform linux -- Python 3.7.7, pytest-7.0.1, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /ssd3/chenxiaojie06/PaddleSpeech/DeepSpeech/paddleaudio
plugins: typeguard-2.12.1, benchmark-3.4.1, anyio-3.5.0
collected 4 items
melspectrogram.py .... [100%]
-------------------------------------------------------------------------------------------------- benchmark: 4 tests -------------------------------------------------------------------------------------------------
Name (time in us) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_melspect_gpu_torchaudio 202.0765 (1.0) 360.6230 (1.0) 218.1168 (1.0) 16.3022 (1.0) 214.2871 (1.0) 21.8451 (1.0) 40;3 4,584.7001 (1.0) 286 1
test_melspect_gpu 657.8509 (3.26) 908.0470 (2.52) 724.2545 (3.32) 106.5771 (6.54) 669.9096 (3.13) 113.4719 (5.19) 1;0 1,380.7300 (0.30) 5 1
test_melspect_cpu_torchaudio 1,247.6053 (6.17) 2,892.5799 (8.02) 1,443.2853 (6.62) 345.3732 (21.19) 1,262.7263 (5.89) 221.6385 (10.15) 56;53 692.8637 (0.15) 399 1
test_melspect_cpu 20,326.2549 (100.59) 20,607.8682 (57.15) 20,473.4125 (93.86) 63.8654 (3.92) 20,467.0429 (95.51) 68.4294 (3.13) 8;1 48.8438 (0.01) 29 1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Legend:
Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
OPS: Operations Per Second, computed as 1 / Mean
========================================================================== 4 passed in 21.12s ===========================================================================
```
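Each script below follows the same skeleton: a module-level feature extractor, a zero-argument callable that the `benchmark` fixture times, and a numerical check against librosa. A stripped-down sketch of that pattern (feature settings and the wav file name are illustrative):

```python
# Sketch of the benchmark pattern used by melspectrogram.py and friends.
import librosa
import numpy as np
import paddle
import paddleaudio

waveform, sr = paddleaudio.load('zh.wav')                 # placeholder input file
waveform_tensor = paddle.to_tensor(waveform).unsqueeze(0)

mel_extractor = paddleaudio.features.MelSpectrogram(
    sr=sr, n_fft=512, hop_length=128, n_mels=40, f_min=0.0,
    dtype=waveform_tensor.dtype)

def melspectrogram():
    return mel_extractor(waveform_tensor).squeeze(0)

def test_melspect_cpu(benchmark):
    paddle.set_device('cpu')
    feature_paddleaudio = benchmark(melspectrogram)       # timed by pytest-benchmark
    feature_librosa = librosa.feature.melspectrogram(
        waveform, sr=sr, n_fft=512, hop_length=128, n_mels=40)
    np.testing.assert_array_almost_equal(
        feature_librosa, feature_paddleaudio, decimal=3)
```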
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import urllib.request
import librosa
import numpy as np
import paddle
import torch
import torchaudio
import paddleaudio
wav_url = 'https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav'
if not os.path.isfile(os.path.basename(wav_url)):
urllib.request.urlretrieve(wav_url, os.path.basename(wav_url))
waveform, sr = paddleaudio.load(os.path.abspath(os.path.basename(wav_url)))
waveform_tensor = paddle.to_tensor(waveform).unsqueeze(0)
waveform_tensor_torch = torch.from_numpy(waveform).unsqueeze(0)
# Feature conf
mel_conf = {
'sr': sr,
'n_fft': 512,
'hop_length': 128,
'n_mels': 40,
}
mel_conf_torchaudio = {
'sample_rate': sr,
'n_fft': 512,
'hop_length': 128,
'n_mels': 40,
'norm': 'slaney',
'mel_scale': 'slaney',
}
def enable_cpu_device():
paddle.set_device('cpu')
def enable_gpu_device():
paddle.set_device('gpu')
log_mel_extractor = paddleaudio.features.LogMelSpectrogram(
**mel_conf, f_min=0.0, top_db=80.0, dtype=waveform_tensor.dtype)
def log_melspectrogram():
return log_mel_extractor(waveform_tensor).squeeze(0)
def test_log_melspect_cpu(benchmark):
enable_cpu_device()
feature_paddleaudio = benchmark(log_melspectrogram)
feature_librosa = librosa.feature.melspectrogram(waveform, **mel_conf)
feature_librosa = librosa.power_to_db(feature_librosa, top_db=80.0)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddleaudio, decimal=3)
def test_log_melspect_gpu(benchmark):
enable_gpu_device()
feature_paddleaudio = benchmark(log_melspectrogram)
feature_librosa = librosa.feature.melspectrogram(waveform, **mel_conf)
feature_librosa = librosa.power_to_db(feature_librosa, top_db=80.0)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddleaudio, decimal=2)
mel_extractor_torchaudio = torchaudio.transforms.MelSpectrogram(
**mel_conf_torchaudio, f_min=0.0)
amplitude_to_DB = torchaudio.transforms.AmplitudeToDB('power', top_db=80.0)
def melspectrogram_torchaudio():
return mel_extractor_torchaudio(waveform_tensor_torch).squeeze(0)
def log_melspectrogram_torchaudio():
mel_specgram = mel_extractor_torchaudio(waveform_tensor_torch)
return amplitude_to_DB(mel_specgram).squeeze(0)
def test_log_melspect_cpu_torchaudio(benchmark):
global waveform_tensor_torch, mel_extractor_torchaudio, amplitude_to_DB
mel_extractor_torchaudio = mel_extractor_torchaudio.to('cpu')
waveform_tensor_torch = waveform_tensor_torch.to('cpu')
amplitude_to_DB = amplitude_to_DB.to('cpu')
feature_paddleaudio = benchmark(log_melspectrogram_torchaudio)
feature_librosa = librosa.feature.melspectrogram(waveform, **mel_conf)
feature_librosa = librosa.power_to_db(feature_librosa, top_db=80.0)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddleaudio, decimal=3)
def test_log_melspect_gpu_torchaudio(benchmark):
global waveform_tensor_torch, mel_extractor_torchaudio, amplitude_to_DB
mel_extractor_torchaudio = mel_extractor_torchaudio.to('cuda')
waveform_tensor_torch = waveform_tensor_torch.to('cuda')
amplitude_to_DB = amplitude_to_DB.to('cuda')
feature_torchaudio = benchmark(log_melspectrogram_torchaudio)
feature_librosa = librosa.feature.melspectrogram(waveform, **mel_conf)
feature_librosa = librosa.power_to_db(feature_librosa, top_db=80.0)
np.testing.assert_array_almost_equal(
feature_librosa, feature_torchaudio.cpu(), decimal=2)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import urllib.request
import librosa
import numpy as np
import paddle
import torch
import torchaudio
import paddleaudio
wav_url = 'https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav'
if not os.path.isfile(os.path.basename(wav_url)):
urllib.request.urlretrieve(wav_url, os.path.basename(wav_url))
waveform, sr = paddleaudio.load(os.path.abspath(os.path.basename(wav_url)))
waveform_tensor = paddle.to_tensor(waveform).unsqueeze(0)
waveform_tensor_torch = torch.from_numpy(waveform).unsqueeze(0)
# Feature conf
mel_conf = {
'sr': sr,
'n_fft': 512,
'hop_length': 128,
'n_mels': 40,
}
mel_conf_torchaudio = {
'sample_rate': sr,
'n_fft': 512,
'hop_length': 128,
'n_mels': 40,
'norm': 'slaney',
'mel_scale': 'slaney',
}
def enable_cpu_device():
paddle.set_device('cpu')
def enable_gpu_device():
paddle.set_device('gpu')
mel_extractor = paddleaudio.features.MelSpectrogram(
**mel_conf, f_min=0.0, dtype=waveform_tensor.dtype)
def melspectrogram():
return mel_extractor(waveform_tensor).squeeze(0)
def test_melspect_cpu(benchmark):
enable_cpu_device()
feature_paddleaudio = benchmark(melspectrogram)
feature_librosa = librosa.feature.melspectrogram(waveform, **mel_conf)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddleaudio, decimal=3)
def test_melspect_gpu(benchmark):
enable_gpu_device()
feature_paddleaudio = benchmark(melspectrogram)
feature_librosa = librosa.feature.melspectrogram(waveform, **mel_conf)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddleaudio, decimal=3)
mel_extractor_torchaudio = torchaudio.transforms.MelSpectrogram(
**mel_conf_torchaudio, f_min=0.0)
def melspectrogram_torchaudio():
return mel_extractor_torchaudio(waveform_tensor_torch).squeeze(0)
def test_melspect_cpu_torchaudio(benchmark):
global waveform_tensor_torch, mel_extractor_torchaudio
mel_extractor_torchaudio = mel_extractor_torchaudio.to('cpu')
waveform_tensor_torch = waveform_tensor_torch.to('cpu')
feature_paddleaudio = benchmark(melspectrogram_torchaudio)
feature_librosa = librosa.feature.melspectrogram(waveform, **mel_conf)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddleaudio, decimal=3)
def test_melspect_gpu_torchaudio(benchmark):
global waveform_tensor_torch, mel_extractor_torchaudio
mel_extractor_torchaudio = mel_extractor_torchaudio.to('cuda')
waveform_tensor_torch = waveform_tensor_torch.to('cuda')
feature_torchaudio = benchmark(melspectrogram_torchaudio)
feature_librosa = librosa.feature.melspectrogram(waveform, **mel_conf)
np.testing.assert_array_almost_equal(
feature_librosa, feature_torchaudio.cpu(), decimal=3)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import urllib.request
import librosa
import numpy as np
import paddle
import torch
import torchaudio
import paddleaudio
wav_url = 'https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav'
if not os.path.isfile(os.path.basename(wav_url)):
urllib.request.urlretrieve(wav_url, os.path.basename(wav_url))
waveform, sr = paddleaudio.load(os.path.abspath(os.path.basename(wav_url)))
waveform_tensor = paddle.to_tensor(waveform).unsqueeze(0)
waveform_tensor_torch = torch.from_numpy(waveform).unsqueeze(0)
# Feature conf
mel_conf = {
'sr': sr,
'n_fft': 512,
'hop_length': 128,
'n_mels': 40,
}
mfcc_conf = {
'n_mfcc': 20,
'top_db': 80.0,
}
mfcc_conf.update(mel_conf)
mel_conf_torchaudio = {
'sample_rate': sr,
'n_fft': 512,
'hop_length': 128,
'n_mels': 40,
'norm': 'slaney',
'mel_scale': 'slaney',
}
mfcc_conf_torchaudio = {
'sample_rate': sr,
'n_mfcc': 20,
}
def enable_cpu_device():
paddle.set_device('cpu')
def enable_gpu_device():
paddle.set_device('gpu')
mfcc_extractor = paddleaudio.features.MFCC(
**mfcc_conf, f_min=0.0, dtype=waveform_tensor.dtype)
def mfcc():
return mfcc_extractor(waveform_tensor).squeeze(0)
def test_mfcc_cpu(benchmark):
enable_cpu_device()
feature_paddleaudio = benchmark(mfcc)
feature_librosa = librosa.feature.mfcc(waveform, **mel_conf)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddleaudio, decimal=3)
def test_mfcc_gpu(benchmark):
enable_gpu_device()
feature_paddleaudio = benchmark(mfcc)
feature_librosa = librosa.feature.mfcc(waveform, **mel_conf)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddleaudio, decimal=3)
del mel_conf_torchaudio['sample_rate']
mfcc_extractor_torchaudio = torchaudio.transforms.MFCC(
**mfcc_conf_torchaudio, melkwargs=mel_conf_torchaudio)
def mfcc_torchaudio():
return mfcc_extractor_torchaudio(waveform_tensor_torch).squeeze(0)
def test_mfcc_cpu_torchaudio(benchmark):
global waveform_tensor_torch, mfcc_extractor_torchaudio
mel_extractor_torchaudio = mfcc_extractor_torchaudio.to('cpu')
waveform_tensor_torch = waveform_tensor_torch.to('cpu')
feature_paddleaudio = benchmark(mfcc_torchaudio)
feature_librosa = librosa.feature.mfcc(waveform, **mel_conf)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddleaudio, decimal=3)
def test_mfcc_gpu_torchaudio(benchmark):
global waveform_tensor_torch, mfcc_extractor_torchaudio
mel_extractor_torchaudio = mfcc_extractor_torchaudio.to('cuda')
waveform_tensor_torch = waveform_tensor_torch.to('cuda')
feature_torchaudio = benchmark(mfcc_torchaudio)
feature_librosa = librosa.feature.mfcc(waveform, **mel_conf)
np.testing.assert_array_almost_equal(
feature_librosa, feature_torchaudio.cpu(), decimal=3)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import unittest
import urllib.request
import numpy as np
import paddle
from paddleaudio import load
wav_url = 'https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav'
class FeatTest(unittest.TestCase):
def setUp(self):
self.initParmas()
self.initWavInput()
self.setUpDevice()
def setUpDevice(self, device='cpu'):
paddle.set_device(device)
def initWavInput(self, url=wav_url):
if not os.path.isfile(os.path.basename(url)):
urllib.request.urlretrieve(url, os.path.basename(url))
self.waveform, self.sr = load(os.path.abspath(os.path.basename(url)))
self.waveform = self.waveform.astype(
np.float32
) # paddlespeech.s2t.transform.spectrogram only supports float32
dim = len(self.waveform.shape)
assert dim in [1, 2]
if dim == 1:
self.waveform = np.expand_dims(self.waveform, 0)
def initParmas(self):
raise NotImplementedError
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import numpy as np
import paddle
from .base import FeatTest
from paddleaudio.functional.window import get_window
from paddlespeech.s2t.transform.spectrogram import IStft
from paddlespeech.s2t.transform.spectrogram import Stft
class TestIstft(FeatTest):
def initParmas(self):
self.n_fft = 512
self.hop_length = 128
self.window_str = 'hann'
def test_istft(self):
ps_stft = Stft(self.n_fft, self.hop_length)
ps_res = ps_stft(
self.waveform.T).squeeze(1).T  # (n_fft//2 + 1, n_frames)
x = paddle.to_tensor(ps_res)
ps_istft = IStft(self.hop_length)
ps_res = ps_istft(ps_res.T)
window = get_window(
self.window_str, self.n_fft, dtype=self.waveform.dtype)
pd_res = paddle.signal.istft(
x, self.n_fft, self.hop_length, window=window)
np.testing.assert_array_almost_equal(ps_res, pd_res, decimal=5)
if __name__ == '__main__':
unittest.main()
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import numpy as np
import paddle
import torch
import torchaudio
import paddleaudio
from .base import FeatTest
class TestKaldi(FeatTest):
def initParmas(self):
self.window_size = 1024
self.dtype = 'float32'
def test_window(self):
t_hann_window = torch.hann_window(
self.window_size, periodic=False, dtype=eval(f'torch.{self.dtype}'))
t_hamm_window = torch.hamming_window(
self.window_size,
periodic=False,
alpha=0.54,
beta=0.46,
dtype=eval(f'torch.{self.dtype}'))
t_povey_window = torch.hann_window(
self.window_size, periodic=False,
dtype=eval(f'torch.{self.dtype}')).pow(0.85)
p_hann_window = paddleaudio.functional.window.get_window(
'hann',
self.window_size,
fftbins=False,
dtype=eval(f'paddle.{self.dtype}'))
p_hamm_window = paddleaudio.functional.window.get_window(
'hamming',
self.window_size,
fftbins=False,
dtype=eval(f'paddle.{self.dtype}'))
p_povey_window = paddleaudio.functional.window.get_window(
'hann',
self.window_size,
fftbins=False,
dtype=eval(f'paddle.{self.dtype}')).pow(0.85)
np.testing.assert_array_almost_equal(t_hann_window, p_hann_window)
np.testing.assert_array_almost_equal(t_hamm_window, p_hamm_window)
np.testing.assert_array_almost_equal(t_povey_window, p_povey_window)
def test_fbank(self):
ta_features = torchaudio.compliance.kaldi.fbank(
torch.from_numpy(self.waveform.astype(self.dtype)))
pa_features = paddleaudio.compliance.kaldi.fbank(
paddle.to_tensor(self.waveform.astype(self.dtype)))
np.testing.assert_array_almost_equal(
ta_features, pa_features, decimal=4)
def test_mfcc(self):
ta_features = torchaudio.compliance.kaldi.mfcc(
torch.from_numpy(self.waveform.astype(self.dtype)))
pa_features = paddleaudio.compliance.kaldi.mfcc(
paddle.to_tensor(self.waveform.astype(self.dtype)))
np.testing.assert_array_almost_equal(
ta_features, pa_features, decimal=4)
if __name__ == '__main__':
unittest.main()
(This file's diff has been collapsed.)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import numpy as np
import paddle
import paddleaudio
from .base import FeatTest
from paddlespeech.s2t.transform.spectrogram import LogMelSpectrogram
class TestLogMelSpectrogram(FeatTest):
def initParmas(self):
self.n_fft = 512
self.hop_length = 128
self.n_mels = 40
def test_log_melspect(self):
ps_melspect = LogMelSpectrogram(self.sr, self.n_mels, self.n_fft,
self.hop_length)
ps_res = ps_melspect(self.waveform.T).squeeze(1).T
x = paddle.to_tensor(self.waveform)
# The features in paddlespeech.s2t mix up the magnitude spectrum and the power spectrum.
ps_melspect = paddleaudio.features.LogMelSpectrogram(
self.sr,
self.n_fft,
self.hop_length,
power=1.0,
n_mels=self.n_mels,
f_min=0.0)
pa_res = (ps_melspect(x) / 10.0).squeeze(0).numpy()
np.testing.assert_array_almost_equal(ps_res, pa_res, decimal=5)
if __name__ == '__main__':
unittest.main()
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import numpy as np
import paddle
import paddleaudio
from .base import FeatTest
from paddlespeech.s2t.transform.spectrogram import Spectrogram
class TestSpectrogram(FeatTest):
def initParmas(self):
self.n_fft = 512
self.hop_length = 128
def test_spectrogram(self):
ps_spect = Spectrogram(self.n_fft, self.hop_length)
ps_res = ps_spect(self.waveform.T).squeeze(1).T # Magnitude
x = paddle.to_tensor(self.waveform)
pa_spect = paddleaudio.features.Spectrogram(
self.n_fft, self.hop_length, power=1.0)
pa_res = pa_spect(x).squeeze(0).numpy()
np.testing.assert_array_almost_equal(ps_res, pa_res, decimal=5)
if __name__ == '__main__':
unittest.main()
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import numpy as np
import paddle
from .base import FeatTest
from paddleaudio.functional.window import get_window
from paddlespeech.s2t.transform.spectrogram import Stft
class TestStft(FeatTest):
def initParmas(self):
self.n_fft = 512
self.hop_length = 128
self.window_str = 'hann'
def test_stft(self):
ps_stft = Stft(self.n_fft, self.hop_length)
ps_res = ps_stft(
self.waveform.T).squeeze(1).T  # (n_fft//2 + 1, n_frames)
x = paddle.to_tensor(self.waveform)
window = get_window(self.window_str, self.n_fft, dtype=x.dtype)
pd_res = paddle.signal.stft(
x, self.n_fft, self.hop_length, window=window).squeeze(0).numpy()
np.testing.assert_array_almost_equal(ps_res, pd_res, decimal=5)
if __name__ == '__main__':
unittest.main()
...@@ -193,7 +193,8 @@ class CLSExecutor(BaseExecutor): ...@@ -193,7 +193,8 @@ class CLSExecutor(BaseExecutor):
sr=feat_conf['sample_rate'], sr=feat_conf['sample_rate'],
mono=True, mono=True,
dtype='float32') dtype='float32')
logger.info("Preprocessing audio_file:" + audio_file) if isinstance(audio_file, (str, os.PathLike)):
logger.info("Preprocessing audio_file:" + audio_file)
# Feature extraction # Feature extraction
feature_extractor = LogMelSpectrogram( feature_extractor = LogMelSpectrogram(
......
...@@ -178,7 +178,8 @@ class BaseExecutor(ABC): ...@@ -178,7 +178,8 @@ class BaseExecutor(ABC):
Returns: Returns:
bool: return `True` for job input, `False` otherwise. bool: return `True` for job input, `False` otherwise.
""" """
return input_ and os.path.isfile(input_) and input_.endswith('.job') return input_ and os.path.isfile(input_) and (input_.endswith('.job') or
input_.endswith('.txt'))
def _get_job_contents( def _get_job_contents(
self, job_input: os.PathLike) -> Dict[str, Union[str, os.PathLike]]: self, job_input: os.PathLike) -> Dict[str, Union[str, os.PathLike]]:
......
...
@@ -237,6 +237,30 @@ pretrained_models = {
         'speech_stats':
         'feats_stats.npy',
     },
+    "hifigan_aishell3-zh": {
+        'url':
+        'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip',
+        'md5':
+        '3bb49bc75032ed12f79c00c8cc79a09a',
+        'config':
+        'default.yaml',
+        'ckpt':
+        'snapshot_iter_2500000.pdz',
+        'speech_stats':
+        'feats_stats.npy',
+    },
+    "hifigan_vctk-en": {
+        'url':
+        'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip',
+        'md5':
+        '7da8f88359bca2457e705d924cf27bd4',
+        'config':
+        'default.yaml',
+        'ckpt':
+        'snapshot_iter_2500000.pdz',
+        'speech_stats':
+        'feats_stats.npy',
+    },
     # wavernn
     "wavernn_csmsc-zh": {
@@ -365,6 +389,8 @@ class TTSExecutor(BaseExecutor):
             'mb_melgan_csmsc',
             'style_melgan_csmsc',
             'hifigan_csmsc',
+            'hifigan_aishell3',
+            'hifigan_vctk',
             'wavernn_csmsc',
         ],
         help='Choose vocoder type of tts task.')
...
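
With these checkpoints registered and the two new `--voc` choices exposed, the vocoders should be selectable through the TTS executor. A hedged usage sketch follows; the import path, the pairing with `fastspeech2_aishell3`, and the keyword names (`am`, `voc`, `spk_id`, `lang`, `output`) are assumptions based on the existing executor interface, not taken from this diff.

```python
# Sketch only: argument names and the acoustic-model pairing are assumed.
from paddlespeech.cli.tts.infer import TTSExecutor

tts = TTSExecutor()
tts(
    text="你好,欢迎使用 PaddleSpeech。",
    am='fastspeech2_aishell3',  # assumed multi-speaker acoustic model pairing
    voc='hifigan_aishell3',     # newly registered vocoder
    spk_id=0,                   # one of the AISHELL-3 speaker ids
    lang='zh',
    output='output.wav')
```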
...
@@ -192,7 +192,7 @@ class ConfigCache:
         try:
             cfg = yaml.load(file, Loader=yaml.FullLoader)
             self._data.update(cfg)
-        except:
+        except Exception as e:
             self.flush()

     @property
...
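
The point of replacing the bare `except:` is that it would also trap `KeyboardInterrupt` and `SystemExit` and hide the parse error entirely. Below is a minimal sketch of the intended pattern, using hypothetical names (`load_config`, `fallback`) rather than the actual `ConfigCache` members.

```python
import yaml

def load_config(path: str, fallback: dict) -> dict:
    """Load a YAML config, falling back to defaults on ordinary errors only."""
    try:
        with open(path) as f:
            return yaml.load(f, Loader=yaml.FullLoader) or dict(fallback)
    except Exception as e:  # a bare `except:` would also swallow SystemExit
        print(f"config load failed ({e}); falling back to defaults")
        return dict(fallback)
```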
...
@@ -18,6 +18,7 @@ from .base_commands import ClientHelpCommand
 from .base_commands import ServerBaseCommand
 from .base_commands import ServerHelpCommand
 from .bin.paddlespeech_client import ASRClientExecutor
+from .bin.paddlespeech_client import CLSClientExecutor
 from .bin.paddlespeech_client import TTSClientExecutor
 from .bin.paddlespeech_server import ServerExecutor
...
...
@@ -9,12 +9,14 @@ port: 8090
 # The task format in the engin_list is: <speech task>_<engine type>
 # task choices = ['asr_python', 'asr_inference', 'tts_python', 'tts_inference']
-engine_list: ['asr_python', 'tts_python']
+engine_list: ['asr_python', 'tts_python', 'cls_python']

 #################################################################################
 #                                ENGINE CONFIG                                  #
 #################################################################################

+################################### ASR #########################################
 ################### speech task: asr; engine_type: python #######################
 asr_python:
     model: 'conformer_wenetspeech'
@@ -46,6 +48,7 @@ asr_inference:
         summary: True # False -> do not show predictor config

+################################### TTS #########################################
 ################### speech task: tts; engine_type: python #######################
 tts_python:
     # am (acoustic model) choices=['speedyspeech_csmsc', 'fastspeech2_csmsc',
@@ -105,3 +108,30 @@ tts_inference:
     # others
     lang: 'zh'

+################################### CLS #########################################
+################### speech task: cls; engine_type: python #######################
+cls_python:
+    # model choices=['panns_cnn14', 'panns_cnn10', 'panns_cnn6']
+    model: 'panns_cnn14'
+    cfg_path:  # [optional] Config of cls task.
+    ckpt_path:  # [optional] Checkpoint file of model.
+    label_file:  # [optional] Label file of cls task.
+    device:  # set 'gpu:id' or 'cpu'
+
+################### speech task: cls; engine_type: inference #######################
+cls_inference:
+    # model_type choices=['panns_cnn14', 'panns_cnn10', 'panns_cnn6']
+    model_type: 'panns_cnn14'
+    cfg_path:
+    model_path:  # the pdmodel file of am static model [optional]
+    params_path:  # the pdiparams file of am static model [optional]
+    label_file:  # [optional] Label file of cls task.
+    predictor_conf:
+        device:  # set 'gpu:id' or 'cpu'
+        switch_ir_optim: True
+        glog_info: False  # True -> print glog
+        summary: True  # False -> do not show predictor config
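
With `cls_python` added to `engine_list`, a running server should be able to answer audio-classification requests from the newly exported `CLSClientExecutor`. The sketch below is an assumption-heavy illustration: the absolute import path (taking the package root as `paddlespeech.server`), the keyword names (`input`, `server_ip`, `port`, `topk`), and the local wav path are not taken from this diff.

```python
from paddlespeech.server.bin.paddlespeech_client import CLSClientExecutor

# Assumes a server already started with the application.yaml above
# (port 8090, engine_list including 'cls_python').
cls_client = CLSClientExecutor()
res = cls_client(
    input='./cat.wav',       # hypothetical local audio file
    server_ip='127.0.0.1',
    port=8090,
    topk=3)
print(res)
```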
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
(Diffs for the remaining files in this commit are collapsed in the source view.)