# WaveFlow

PaddlePaddle dynamic graph implementation of [WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219).

- WaveFlow can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on an Nvidia V100 GPU without engineered inference kernels, which is faster than [WaveGlow](https://github.com/NVIDIA/waveglow) and several orders of magnitude faster than WaveNet.
- WaveFlow is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M) and comparable to WaveNet (4.6M).
- WaveFlow is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in Parallel WaveNet and ClariNet, which simplifies the training pipeline and reduces the cost of development.

## Project Structure
```text
├── configs                 # yaml configuration files of preset model hyperparameters
├── benchmark.py            # benchmark code to test the speed of batched speech synthesis
├── data.py                 # dataset and dataloader settings for LJSpeech
├── synthesis.py            # script for speech synthesis
├── train.py                # script for model training
├── utils.py                # helper functions for e.g., model checkpointing
├── waveflow.py             # WaveFlow model high level APIs
└── waveflow_modules.py     # WaveFlow model implementation
```

## Usage

There are many hyperparameters to tune, depending on the model specification and the dataset you are working with.
We provide `waveflow_ljspeech.yaml` as a hyperparameter set that works well on the LJSpeech dataset.

Note that `train.py`, `synthesis.py`, and `benchmark.py` all accept a `--config` parameter. To ensure consistency, you should use the same config yaml file for training, synthesis, and benchmarking. You can also override these preset hyperparameters from the command line by appending parameters after `--config`.
For example, `--config=${yaml} --batch_size=8` overrides the corresponding hyperparameter in the `${yaml}` config file. For more details about these hyperparameters, check `utils.add_config_options_to_parser`.

Note that you also need to specify some additional parameters for `train.py`, `synthesis.py`, and `benchmark.py`; the details can be found in `train.add_options_to_parser`, `synthesis.add_options_to_parser`, and `benchmark.add_options_to_parser`, respectively.
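
As a concrete sketch (the flags besides `--config` and `--batch_size` follow the single-GPU training example below), overriding the batch size from the command line could look like:

```bash
# Preset hyperparameters come from the yaml file; --batch_size overrides it.
python -u train.py \
    --config=./configs/waveflow_ljspeech.yaml \
    --batch_size=8 \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --use_gpu=true
```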

### Dataset

Download and extract [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).

```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```

In this example, assume that the path to the extracted LJSpeech dataset is `./data/LJSpeech-1.1`.
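
For instance, one way to place the extracted dataset at that path:

```bash
# Move the extracted dataset to the location assumed by the examples below.
mkdir -p ./data
mv LJSpeech-1.1 ./data/
```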

### Train on single GPU

```bash
export PYTHONPATH="${PYTHONPATH}:${PWD}/../../.."
export CUDA_VISIBLE_DEVICES=0
python -u train.py \
    --config=./configs/waveflow_ljspeech.yaml \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --batch_size=4 \
    --parallel=false --use_gpu=true
```

#### Save and Load checkpoints

Our model saves parameters as checkpoints in `./runs/waveflow/${ModelName}/checkpoint/` every 10000 iterations by default.
A saved checkpoint consists of `step-${iteration_number}.pdparams` for the model parameters and `step-${iteration_number}.pdopt` for the optimizer state.
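
For example, after 20000 iterations the checkpoint directory would contain files like the following (illustrative listing):

```bash
ls ./runs/waveflow/${ModelName}/checkpoint/
# step-10000.pdopt  step-10000.pdparams  step-20000.pdopt  step-20000.pdparams
```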

There are three ways to load a checkpoint and resume training (suppose you want to load a checkpoint from iteration 500000):
1. Use `--checkpoint=./runs/waveflow/${ModelName}/checkpoint/step-500000` to provide a specific path to load. Note that you only need to provide the base name of the parameter file, which is `step-500000`; no `.pdparams` or `.pdopt` extension is needed (see the example after this list).
2. Use `--iteration=500000`.
3. If you don't specify either `--checkpoint` or `--iteration`, the model will automatically load the latest checkpoint in `./runs/waveflow/${ModelName}/checkpoint`.
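
For example, resuming single-GPU training from the 500000-iteration checkpoint (option 1) could look like this sketch, reusing the flags from the single-GPU example above:

```bash
python -u train.py \
    --config=./configs/waveflow_ljspeech.yaml \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --batch_size=4 \
    --parallel=false --use_gpu=true \
    --checkpoint=./runs/waveflow/${ModelName}/checkpoint/step-500000
```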

### Train on multiple GPUs

```bash
export PYTHONPATH="${PYTHONPATH}:${PWD}/../../.."
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -u -m paddle.distributed.launch train.py \
    --config=./configs/waveflow_ljspeech.yaml \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --parallel=true --use_gpu=true
```

Use `export CUDA_VISIBLE_DEVICES=0,1,2,3` to make the GPUs you want to use visible. The `paddle.distributed.launch` module will then use these visible GPUs for data-parallel training in multiprocessing mode.

### Monitor with TensorBoard

By default, the logs are saved in `./runs/waveflow/${ModelName}/logs/`. You can monitor logs using TensorBoard.

```bash
tensorboard --logdir=${log_dir} --port=8888
```
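
With the default layout above, `${log_dir}` would be `./runs/waveflow/${ModelName}/logs/`.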

### Synthesize from a checkpoint

See the [Save and Load checkpoints](#save-and-load-checkpoints) section for how to load a specific checkpoint.
The following example will automatically load the latest checkpoint:

```bash
export PYTHONPATH="${PYTHONPATH}:${PWD}/../../.."
export CUDA_VISIBLE_DEVICES=0
python -u synthesis.py \
    --config=./configs/waveflow_ljspeech.yaml \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --use_gpu=true \
    --output=./syn_audios \
    --sample=${SAMPLE} \
    --sigma=1.0
```

In this example, `--output` specifies where to save the synthesized audio files, and `--sample` (< 16) specifies which sample in the validation set (a split from the whole LJSpeech dataset, which by default contains the first 16 audio samples) to synthesize, based on the mel-spectrogram computed from the corresponding ground-truth audio. For example, `--sample=0` synthesizes the first audio sample in the validation set.
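
As in other flow-based vocoders such as WaveGlow, `--sigma` presumably sets the standard deviation of the Gaussian prior used for sampling, with `1.0` matching the prior assumed during training.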

### Benchmarking

Use the following example to benchmark the speed of batched speech synthesis; it reports how many times faster than real-time the synthesis runs:

```bash
export PYTHONPATH="${PYTHONPATH}:${PWD}/../../.."
export CUDA_VISIBLE_DEVICES=0
python -u benchmark.py \
    --config=./configs/waveflow_ljspeech.yaml \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --use_gpu=true
```

### Low-precision inference

This model supports float16 low-precision inference. By appending the argument

```bash
    --use_fp16=true
```

to the synthesis or benchmarking command, you can experience the speedup of low-precision inference.
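
For example, a float16 benchmarking run could look like this sketch (same flags as in the benchmarking example above):

```bash
python -u benchmark.py \
    --config=./configs/waveflow_ljspeech.yaml \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --use_gpu=true \
    --use_fp16=true
```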