Commit f6d7016d authored by chenfeiyu

Merge branch 'master' of upstream

@@ -13,6 +13,4 @@
limitations under the License.
Part of the code was copied or adapted from https://github.com/r9y9/deepvoice3_pytorch/
Copyright (c) 2017: Ryuichi Yamamoto, whose license applies.
# Parakeet
Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It is built on PaddlePaddle Fluid dynamic graph and includes many influential TTS models proposed by [Baidu Research](http://research.baidu.com) and other research groups.
<div align="center">
  <img src="images/logo.png" width=450 /> <br>
</div>
In particular, it features the latest [WaveFlow](https://arxiv.org/abs/1912.01219) model proposed by Baidu Research.
- WaveFlow can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on an Nvidia V100 GPU without engineered inference kernels, which is faster than WaveGlow and several orders of magnitude faster than WaveNet.
- WaveFlow is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M) and comparable to WaveNet (4.6M).
- WaveFlow is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in Parallel WaveNet and ClariNet, which simplifies the training pipeline and reduces the cost of development.
### Setup
Make sure the library `libsndfile1` is installed, e.g., on Ubuntu.
```bash
sudo apt-get install libsndfile1
@@ -16,7 +21,7 @@ sudo apt-get install libsndfile1
### Install PaddlePaddle
See [install](https://www.paddlepaddle.org.cn/install/quick) for more details. This repo requires paddlepaddle 1.7 or above.
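As a quick sanity check (not part of the official install instructions), you can confirm from Python that the installed version meets this requirement:
```python
# Optional check: the import should succeed and the version should be 1.7 or newer.
import paddle
print(paddle.__version__)
```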
### Install Parakeet
@@ -49,3 +54,7 @@ nltk.download("cmudict")
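The elided "Install Parakeet" hunk above ends with downloading NLTK data for the English text frontend. A minimal sketch of that step, assuming the `nltk` package is already installed (other corpora such as `punkt` may also be needed):
```python
import nltk

# Fetch the CMU pronouncing dictionary used by the English text frontend.
nltk.download("cmudict")
```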
- [Train a TransformerTTS model with ljspeech dataset](./examples/transformer_tts)
- [Train a FastSpeech model with ljspeech dataset](./examples/fastspeech)
- [Train a WaveFlow model with ljspeech dataset](./examples/waveflow)
## Copyright and License
Parakeet is provided under the [Apache-2.0 license](LICENSE).
# ClariNet
PaddlePaddle dynamic graph implementation of ClariNet, a convolutional network based vocoder. The implementation is based on the paper [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281).
## Dataset
@@ -30,7 +30,7 @@ Train the model using train.py, follow the usage displayed by `python train.py -
usage: train.py [-h] [--config CONFIG] [--device DEVICE] [--output OUTPUT]
[--data DATA] [--resume RESUME] [--wavenet WAVENET]
train a ClariNet model with LJspeech and a trained WaveNet model.
optional arguments:
-h, --help show this help message and exit
@@ -54,12 +54,12 @@ optional arguments:
```
5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
6. `--wavenet` is the path of the wavenet checkpoint to load. If you do not specify `--resume`, then this must be provided.
Before you start training a ClariNet model, you should have trained a WaveNet model with single Gaussian output distribution. Make sure the config of the teacher model matches that of the trained model.
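Here "single Gaussian output distribution" means the teacher WaveNet predicts a mean and log standard deviation per audio sample and is trained with the Gaussian negative log-likelihood. A self-contained sketch of that loss (the names and random arrays below are placeholders, not Parakeet's actual API):
```python
# Illustrative only: Gaussian negative log-likelihood for a WaveNet that
# predicts a single Gaussian (mu, log_sigma) per audio sample.
import numpy as np

def gaussian_nll(x, mu, log_sigma):
    # -log N(x; mu, sigma^2), averaged over samples
    return np.mean(0.5 * np.log(2 * np.pi) + log_sigma
                   + 0.5 * ((x - mu) / np.exp(log_sigma)) ** 2)

x = np.random.randn(16000)      # one second of (fake) audio samples
mu = np.zeros_like(x)           # network-predicted mean per sample
log_sigma = np.zeros_like(x)    # network-predicted log std per sample
print(gaussian_nll(x, mu, log_sigma))
```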
Example script:
```bash
python train.py --config=./configs/clarinet_ljspeech.yaml --data=./LJSpeech-1.1/ --output=experiment --device=0 --conditioner=wavenet_checkpoint/conditioner --wavenet=wavenet_checkpoint/teacher
@@ -77,7 +77,7 @@ tensorboard --logdir=.
usage: synthesis.py [-h] [--config CONFIG] [--device DEVICE] [--data DATA]
checkpoint output
train a ClariNet model with LJspeech and a trained WaveNet model.
positional arguments:
checkpoint checkpoint to load from.
@@ -96,7 +96,7 @@ optional arguments:
4. `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`).
5. `--device` is the device (gpu id) to use. `-1` means CPU.
Example script:
```bash
python synthesis.py --config=./configs/wavenet_single_gaussian.yaml --data=./LJSpeech-1.1/ --device=0 experiment/checkpoints/step_500000 generated
......
# Deep Voice 3
PaddlePaddle dynamic graph implementation of Deep Voice 3, a convolutional network based text-to-speech generative model. The implementation is based on [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654).
We implement Deep Voice 3 using Paddle Fluid with dynamic graph, which is convenient for building flexible network architectures.
## Dataset
@@ -15,9 +15,9 @@ tar xjvf LJSpeech-1.1.tar.bz2
## Model Architecture
![Deep Voice 3 model architecture](./images/model_architecture.png)
The model consists of an encoder, a decoder and a converter (and a speaker embedding for multispeaker models). The encoder and the decoder together form the seq2seq part of the model, and the converter forms the postnet part.
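To make that split concrete, here is a minimal, purely illustrative data-flow sketch; the function names, shapes and random arrays are placeholders and do not reflect Parakeet's actual Deep Voice 3 API:
```python
# Conceptual data flow of Deep Voice 3: encoder + decoder form the seq2seq
# part, the converter forms the postnet part. All values are dummies.
import numpy as np

def encoder(text_ids):
    # Text encoder: returns (keys, values) consumed by the attention decoder.
    h = np.random.randn(len(text_ids), 256)
    return h, h

def decoder(keys, values, n_frames=100, n_mels=80):
    # Attention-based autoregressive decoder: predicts mel-spectrogram frames.
    return np.random.randn(n_frames, n_mels)

def converter(mel, n_linear_bins=513):
    # Postnet/converter: maps the decoder output to a linear spectrogram.
    return np.random.randn(mel.shape[0], n_linear_bins)

keys, values = encoder([12, 7, 42])   # seq2seq part
mel = decoder(keys, values)
linear = converter(mel)               # postnet part
print(mel.shape, linear.shape)
```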
## Project Structure
@@ -37,7 +37,7 @@ Train the model using train.py, follow the usage displayed by `python train.py -
```text
usage: train.py [-h] [-c CONFIG] [-s DATA] [-r RESUME] [-o OUTPUT] [-g DEVICE]
Train a Deep Voice 3 model with LJSpeech dataset.
optional arguments:
-h, --help show this help message and exit
@@ -55,7 +55,7 @@ optional arguments:
1. `--config` is the configuration file to use. The provided `ljspeech.yaml` can be used directly. You can also change some values in the configuration file and train the model with a different config.
2. `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.txt).
3. `--resume` is the path of the checkpoint. If it is provided, the model will load the checkpoint before training.
4. `--output` is the directory to save results; all results are saved in this directory. The structure of the output directory is shown below.
```text
├── checkpoints # checkpoint
@@ -69,7 +69,7 @@ optional arguments:
5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
Example script:
```bash
python train.py --config=configs/ljspeech.yaml --data=./LJSpeech-1.1/ --output=experiment --device=0
@@ -86,7 +86,7 @@ tensorboard --logdir=.
```text
usage: synthesis.py [-h] [-c CONFIG] [-g DEVICE] checkpoint text output_path
Synthsize waveform from a checkpoint.
positional arguments:
checkpoint checkpoint to load.
@@ -107,7 +107,7 @@ optional arguments:
4. `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`) and attention plots (`*.png`) for each sentence.
5. `--device` is the device (gpu id) to use. `-1` means CPU.
Example script:
```bash
python synthesis.py --config=configs/ljspeech.yaml --device=0 experiment/checkpoints/model_step_005000000 sentences.txt generated
......
@@ -220,9 +220,10 @@ class Flow(dg.Layer):
# Pad width dim (time): dilated non-causal convolution
pad_top, pad_bottom = (self.kernel_h - 1) * dilation_h, 0
pad_left = pad_right = int((self.kernel_w - 1) * dilation_w / 2)
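# Configure padding on the wrapped Conv2D directly, so the input no longer needs an explicit pad2d call.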
self.in_layers[i].layer._padding = [
    pad_top, pad_bottom, pad_left, pad_right
]
hidden = self.in_layers[i](audio)
cond_hidden = self.cond_layers[i](mel)
in_acts = hidden + cond_hidden
out_acts = fluid.layers.tanh(in_acts[:, :self.n_channels, :]) * \
@@ -267,8 +268,9 @@ class Flow(dg.Layer):
pad_top, pad_bottom = 0, 0
pad_left = int((self.kernel_w - 1) * dilation_w / 2)
pad_right = int((self.kernel_w - 1) * dilation_w / 2)
self.in_layers[i].layer._padding = [
    pad_top, pad_bottom, pad_left, pad_right
]
hidden = self.in_layers[i](state)
cond_hidden = self.cond_layers[i](mel)
in_acts = hidden + cond_hidden
......