@@ -6,8 +6,8 @@ Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-spee
<img src="images/logo.png" width=450/><br>
</div>
In particular, it features the latest [WaveFlow](https://arxiv.org/abs/1912.01219) model proposed by Baidu Research.
- WaveFlow can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on an NVIDIA V100 GPU without engineered inference kernels, which is faster than WaveGlow and several orders of magnitude faster than WaveNet.
In particular, it features the latest [WaveFlow](https://arxiv.org/abs/1912.01219) model proposed by Baidu Research.
- WaveFlow can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on an NVIDIA V100 GPU without engineered inference kernels, which is faster than [WaveGlow](https://github.com/NVIDIA/waveglow) and several orders of magnitude faster than WaveNet.
- WaveFlow is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M) and comparable to WaveNet (4.6M).
- WaveFlow is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in Parallel WaveNet and ClariNet, which simplifies the training pipeline and reduces the cost of development.
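To make the figures quoted above concrete, the following is a small back-of-the-envelope calculation. It uses only the numbers stated in the bullets (22.05 kHz, roughly 40x real-time, and the parameter counts); the variable names are illustrative, and the results are approximate because the speed-up itself is quoted as "around 40x".

```python
# Figures taken from the bullets above; names are illustrative only.
SAMPLE_RATE_HZ = 22_050      # 22.05 kHz audio
REALTIME_FACTOR = 40         # "around 40x faster than real-time"

WAVEFLOW_PARAMS_M = 5.9
WAVEGLOW_PARAMS_M = 87.9
WAVENET_PARAMS_M = 4.6

# Audio samples generated per wall-clock second at ~40x real-time.
samples_per_second = SAMPLE_RATE_HZ * REALTIME_FACTOR

# Wall-clock time needed to synthesize 10 seconds of speech.
seconds_for_ten = 10.0 / REALTIME_FACTOR

# Model size ratios behind "15x smaller" and "comparable".
waveglow_ratio = WAVEGLOW_PARAMS_M / WAVEFLOW_PARAMS_M
wavenet_ratio = WAVEFLOW_PARAMS_M / WAVENET_PARAMS_M

print(samples_per_second)        # 882000
print(seconds_for_ten)           # 0.25
print(round(waveglow_ratio, 1))  # 14.9
print(round(wavenet_ratio, 1))   # 1.3
```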
1. `--config` is the configuration file to use. The provided configurations can be used directly. You can also change some values in the configuration file and train the model with a different config.
2. `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.csv).
3. `--resume` is the path of the checkpoint. If it is provided, the model loads the checkpoint before training.
4. `--output` is the directory where all results are saved. The structure of the output directory is shown below.
- `--config` is the configuration file to use. The provided configurations can be used directly. You can also change some values in the configuration file and train the model with a different config.
- `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.csv).
- `--resume` is the path of the checkpoint. If it is provided, the model loads the checkpoint before training.
- `--output` is the directory where all results are saved. The structure of the output directory is shown below.
```text
├── checkpoints # checkpoint
...
...
@@ -53,8 +53,8 @@ optional arguments:
└── log # tensorboard log
```
5. `--device` is the device (GPU id) to use for training. `-1` means CPU.
6. `--wavenet` is the path of the WaveNet checkpoint to load. If you do not specify `--resume`, then this must be provided.
- `--device` is the device (GPU id) to use for training. `-1` means CPU.
- `--wavenet` is the path of the WaveNet checkpoint to load. If you do not specify `--resume`, then this must be provided.
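As a rough illustration of how the options described above fit together, here is a minimal `argparse` sketch. It is not the project's actual `train.py`: the function names, help strings, and the default for `--output` are assumptions made for this example. It only shows the documented rule that `--wavenet` is required when `--resume` is absent, and that `--device` uses `-1` for CPU.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the options documented above.
    parser = argparse.ArgumentParser(
        description="Illustrative ClariNet-style training CLI (sketch only).")
    parser.add_argument("--config", type=str, help="configuration file to use")
    parser.add_argument("--data", type=str, help="path of the LJSpeech dataset")
    parser.add_argument("--resume", type=str, default=None,
                        help="checkpoint to load before training")
    parser.add_argument("--wavenet", type=str, default=None,
                        help="teacher WaveNet checkpoint to load")
    # The default output directory below is an assumption for this sketch.
    parser.add_argument("--output", type=str, default="experiment",
                        help="directory to save results")
    parser.add_argument("--device", type=int, default=-1,
                        help="GPU id to use; -1 means CPU")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Documented rule: without --resume, a WaveNet checkpoint must be given.
    if args.resume is None and args.wavenet is None:
        parser.error("--wavenet is required when --resume is not specified")
    return args


# Example invocation with placeholder paths.
args = parse_args(["--config", "config.yaml", "--data", "LJSpeech-1.1",
                   "--wavenet", "wavenet_checkpoint"])
print(args.device)  # -1
```

Calling `parse_args` with neither `--resume` nor `--wavenet` exits with a usage error, which is the behaviour the description of `--wavenet` implies.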
Before you start training a ClariNet model, you should have trained a WaveNet model with single Gaussian output distribution. Make sure the config of the teacher model matches that of the trained model.
...
...
@@ -90,11 +90,11 @@ optional arguments:
--data DATA path of LJspeech dataset.
```
1. `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
2. `--data` is the path of the LJSpeech dataset. A dataset is not strictly needed for synthesis, but since the input is a mel spectrogram, we need to compute mel spectrograms from audio files.
3. `checkpoint` is the checkpoint to load.
4. `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`).
5. `--device` is the device (GPU id) to use. `-1` means CPU.
- `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
- `--data` is the path of the LJSpeech dataset. A dataset is not strictly needed for synthesis, but since the input is a mel spectrogram, we need to compute mel spectrograms from audio files.
- `checkpoint` is the checkpoint to load.
- `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`).
- `--device` is the device (GPU id) to use. `-1` means CPU.
1. `--config` is the configuration file to use. The provided `ljspeech.yaml` can be used directly. You can also change some values in the configuration file and train the model with a different config.
2. `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.csv).
3. `--resume` is the path of the checkpoint. If it is provided, the model loads the checkpoint before training.
4. `--output` is the directory where all results are saved. The structure of the output directory is shown below.
- `--config` is the configuration file to use. The provided `ljspeech.yaml` can be used directly. You can also change some values in the configuration file and train the model with a different config.
- `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.csv).
- `--resume` is the path of the checkpoint. If it is provided, the model loads the checkpoint before training.
- `--output` is the directory where all results are saved. The structure of the output directory is shown below.
```text
├── checkpoints # checkpoint
...
...
@@ -67,7 +67,7 @@ optional arguments:
└── waveform # waveform (.wav files)
```
5. `--device` is the device (GPU id) to use for training. `-1` means CPU.
- `--device` is the device (GPU id) to use for training. `-1` means CPU.
Example script:
...
...
@@ -101,11 +101,11 @@ optional arguments:
device to use
```
1. `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
2. `checkpoint` is the checkpoint to load.
3. `text` is the text file to synthesize.
4. `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`) and attention plots (`*.png`) for each sentence.
5. `--device` is the device (GPU id) to use. `-1` means CPU.
- `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
- `checkpoint` is the checkpoint to load.
- `text` is the text file to synthesize.
- `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`) and attention plots (`*.png`) for each sentence.
- `--device` is the device (GPU id) to use. `-1` means CPU.
Paddle fluid implementation of FastSpeech, a feed-forward network based on Transformer. The implementation is based on [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263).
PaddlePaddle dynamic graph implementation of FastSpeech, a feed-forward network based on Transformer. The implementation is based on [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263).
## Dataset
...
...
@@ -14,7 +14,7 @@ tar xjvf LJSpeech-1.1.tar.bz2
![FastSpeech model architecture](./images/model_architecture.png)
FastSpeech is a feed-forward structure based on Transformer, instead of using the encoder-attention-decoder based architecture. This model extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length
FastSpeech is a feed-forward structure based on Transformer, instead of using the encoder-attention-decoder based architecture. This model extracts attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length
regulator to expand the source phoneme sequence to match the length of the target
mel-spectrogram sequence for parallel mel-spectrogram generation. We use TransformerTTS as the teacher model.
The model consists of three parts: encoder, decoder and length regulator.
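The length regulator described above can be sketched in a few lines. This is a simplified illustration rather than the code in this repository: the function name is hypothetical, phoneme labels stand in for the encoder's hidden vectors, and the durations are made-up integers rather than values predicted from teacher attention alignments.

```python
def length_regulate(phoneme_states, durations):
    """Repeat each phoneme state by its duration (in mel frames).

    The output has one entry per mel-spectrogram frame, so its length
    equals the sum of the durations.
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    expanded = []
    for state, n_frames in zip(phoneme_states, durations):
        # A duration of 0 drops the phoneme from the expanded sequence.
        expanded.extend([state] * n_frames)
    return expanded


# Stand-ins for encoder outputs; real inputs would be hidden vectors.
phonemes = ["HH", "AH", "L", "OW"]
durations = [2, 3, 1, 4]  # made-up frame counts for illustration

frames = length_regulate(phonemes, durations)
print(len(frames))  # 10
print(frames[:5])   # ['HH', 'HH', 'AH', 'AH', 'AH']
```

Because every phoneme is expanded independently, the decoder can then generate all mel-spectrogram frames in parallel instead of one step at a time.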
...
...
@@ -28,7 +28,7 @@ The model consists of encoder, decoder and length regulator three parts.
## Train Transformer
FastSpeech model can train with ``train.py``.
FastSpeech model can be trained with ``train.py``.
```bash
python train.py \
--use_gpu=1 \
...
...
@@ -38,11 +38,11 @@ python train.py \
--transformer_step=160000 \
--config_path='config/fastspeech.yaml'\
```
or you can run the script file directly.
Or you can run the script file directly.
```bash
sh train.sh
```
If you want to train on multiple GPUs, you must set ``--use_data_parallel=1``, and then start training as follow:
If you want to train on multiple GPUs, you must set ``--use_data_parallel=1``, and then start training as follows:
Paddle fluid implementation of TransformerTTS, a neural TTS with Transformer. The implementation is based on [Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895).
PaddlePaddle dynamic graph implementation of TransformerTTS, a neural TTS with Transformer. The implementation is based on [Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895).
## Dataset
...
...
@@ -12,7 +12,7 @@ tar xjvf LJSpeech-1.1.tar.bz2
## Model Architecture
![TransformerTTS model architecture](./images/model_architecture.jpg)
The model adapt the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in [Tacotron2](https://arxiv.org/abs/1712.05884). The model consists of two main parts, encoder and decoder. We also implemented CBHG model of tacotron as a vocoder part and converted the spectrogram into raw wave using griffin-lim algorithm.
The model adopts the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in [Tacotron2](https://arxiv.org/abs/1712.05884). The model consists of two main parts, encoder and decoder. We also implement the CBHG model of Tacotron as the vocoder part and convert the spectrogram into raw wave using the Griffin-Lim algorithm.
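For readers unfamiliar with multi-head attention, the following is a minimal pure-Python sketch of the idea. It is an assumption-laden simplification and not this repository's implementation: it omits the learned query, key, value and output projections, and all function names are hypothetical. It only shows the feature dimension being split into heads, scaled dot-product attention applied per head, and the results concatenated.

```python
import math


def softmax(scores):
    # Subtract the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        width = len(values[0])
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(width)])
    return outputs


def multi_head_attention(queries, keys, values, num_heads):
    """Split features into heads, attend per head, then concatenate.

    Learned linear projections are deliberately left out of this sketch.
    """
    d_model = len(queries[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        heads.append(attention([q[lo:hi] for q in queries],
                               [k[lo:hi] for k in keys],
                               [v[lo:hi] for v in values]))
    # Concatenate the per-head outputs for every query position.
    return [sum((head[t] for head in heads), []) for t in range(len(queries))]


# Two query positions attending over three key/value positions.
q = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
k = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
v = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]
out = multi_head_attention(q, k, v, num_heads=2)
print(len(out), len(out[0]))  # 2 4
```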
## Project Structure
```text
...
...
@@ -25,7 +25,7 @@ The model adapt the multi-head attention mechanism to replace the RNN structures
## Train Transformer
TransformerTTS model can train with ``train_transformer.py``.
TransformerTTS model can be trained with ``train_transformer.py``.
```bash
python train_transformer.py \
--use_gpu=1 \
...
...
@@ -33,11 +33,11 @@ python train_trasformer.py \
--data_path=${DATAPATH}\
--config_path='config/train_transformer.yaml'\
```
or you can run the script file directly.
Or you can run the script file directly.
```bash
sh train_transformer.sh
```
If you want to train on multiple GPUs, you must set ``--use_data_parallel=1``, and then start training as follow:
If you want to train on multiple GPUs, you must set ``--use_data_parallel=1``, and then start training as follows:
Paddle fluid implementation of [WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219).
PaddlePaddle dynamic graph implementation of [WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219).
- WaveFlow can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on an NVIDIA V100 GPU without engineered inference kernels, which is faster than [WaveGlow](https://github.com/NVIDIA/waveglow) and several orders of magnitude faster than WaveNet.
- WaveFlow is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M) and comparable to WaveNet (4.6M).
- WaveFlow is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in Parallel WaveNet and ClariNet, which simplifies the training pipeline and reduces the cost of development.
## Project Structure
```text
...
...
@@ -72,7 +76,7 @@ Use `export CUDA_VISIBLE_DEVICES=0,1,2,3` to set the GPUs that you want to use t
### Monitor with Tensorboard
By default, the logs are saved in `./runs/waveflow/${ModelName}/logs/`. You can monitor logs by tensorboard.
By default, the logs are saved in `./runs/waveflow/${ModelName}/logs/`. You can monitor logs using TensorBoard.
```bash
tensorboard --logdir=${log_dir} --port=8888
...
...
@@ -112,7 +116,7 @@ python -u benchmark.py \
### Low-precision inference
This model supports the float16 low-precsion inference. By appending the argument
This model supports the float16 low-precision inference. By appending the argument
Paddle implementation of wavenet in dynamic graph, a convolutional network based vocoder. Wavenet is proposed in [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499), but in this experiment, the implementation follows the teacher model in [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281).
PaddlePaddle dynamic graph implementation of WaveNet, a convolutional network based vocoder. WaveNet was originally proposed in [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499). However, in this experiment, the implementation follows the teacher model in [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281).
## Dataset
...
...
@@ -24,13 +24,13 @@ tar xjvf LJSpeech-1.1.tar.bz2
## Train
Train the model using train.py, follow the usage displayed by `python train.py --help`.
Train the model using train.py. For help on usage, try `python train.py --help`.
1. `--config` is the configuration file to use. The provided configurations can be used directly. You can also change some values in the configuration file and train the model with a different config.
2. `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.csv).
3. `--resume` is the path of the checkpoint. If it is provided, the model loads the checkpoint before training.
4. `--output` is the directory where all results are saved. The structure of the output directory is shown below.
- `--config` is the configuration file to use. The provided configurations can be used directly. You can also change some values in the configuration file and train the model with a different config.
- `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.csv).
- `--resume` is the path of the checkpoint. If it is provided, the model loads the checkpoint before training.
- `--output` is the directory where all results are saved. The structure of the output directory is shown below.
```text
├── checkpoints # checkpoint
└── log # tensorboard log
```
5. `--device` is the device (GPU id) to use for training. `-1` means CPU.
- `--device` is the device (GPU id) to use for training. `-1` means CPU.
Synthesize valid data from LJspeech with a wavenet model.
Synthesize valid data from LJspeech with a WaveNet model.
positional arguments:
checkpoint checkpoint to load.
...
...
@@ -84,13 +84,13 @@ optional arguments:
--device DEVICE device to use.
```
1. `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
2. `--data` is the path of the LJSpeech dataset. A dataset is not strictly needed for synthesis, but since the input is a mel spectrogram, we need to compute mel spectrograms from audio files.
3. `checkpoint` is the checkpoint to load.
4. `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`).
5. `--device` is the device (GPU id) to use. `-1` means CPU.
- `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
- `--data` is the path of the LJSpeech dataset. A dataset is not strictly needed for synthesis, but since the input is a mel spectrogram, we need to compute mel spectrograms from audio files.
- `checkpoint` is the checkpoint to load.
- `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`).
- `--device` is the device (GPU id) to use. `-1` means CPU.