# Deep Voice 3

A Paddle implementation of Deep Voice 3, a convolutional-network-based text-to-speech synthesis model, in dynamic graph mode. The implementation is based on [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654).

We implement Deep Voice 3 in Paddle Fluid with dynamic graph, which makes it convenient to build flexible network architectures.

## Dataset

We experiment with the LJSpeech dataset. Download and extract [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).

```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
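
After extraction, the dataset folder should contain the transcript file `metadata.csv` and a `wavs/` directory holding the audio clips (13,100 utterances in LJSpeech-1.1). A quick sanity check:

```bash
ls LJSpeech-1.1
# metadata.csv  README  wavs
wc -l LJSpeech-1.1/metadata.csv
# 13100 LJSpeech-1.1/metadata.csv
```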

## Model Architecture

![DeepVoice3 model architecture](./images/model_architecture.png)

The model consists of an encoder, a decoder, and a converter (plus a speaker embedding for multi-speaker models). The encoder and the decoder together form the seq2seq part of the model, and the converter forms the postnet part.

## Project Structure

```text
├── data.py          data processing
├── ljspeech.yaml    (example) configuration file
├── sentences.txt    sample sentences
├── synthesis.py     script to synthesize waveform from text
├── train.py         script to train a model
└── utils.py         utility functions
```

## Train

Train the model using `train.py`. For usage, run `python train.py --help`.

```text
usage: train.py [-h] [-c CONFIG] [-s DATA] [-r RESUME] [-o OUTPUT] [-g DEVICE]

Train a deepvoice 3 model with LJSpeech dataset.

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        experiment config
  -s DATA, --data DATA  The path of the LJSpeech dataset.
  -r RESUME, --resume RESUME
                        checkpoint to load
  -o OUTPUT, --output OUTPUT
                        The directory to save result.
  -g DEVICE, --device DEVICE
                        device to use
```

1. `--config` is the configuration file to use. The provided `ljspeech.yaml` can be used directly. You can also change values in the configuration file to train the model with a different config.
2. `--data` is the path of the LJSpeech dataset, that is, the folder extracted from the downloaded archive (the folder which contains `metadata.csv`).
3. `--resume` is the path of a checkpoint. If it is provided, the model loads the checkpoint before training.
4. `--output` is the directory to save results; all results are saved in this directory. The structure of the output directory is shown below.

```text
├── checkpoints      # checkpoint
├── log              # tensorboard log
└── states           # train and evaluation results
    ├── alignments   # attention
    ├── lin_spec     # linear spectrogram
    ├── mel_spec     # mel spectrogram
    └── waveform     # waveform (.wav files)
```

5. `--device` is the device (GPU id) to use for training. `-1` means CPU.

Example script:

```bash
python train.py --config=./ljspeech.yaml --data=./LJSpeech-1.1/ --output=experiment --device=0
```
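
To continue training from a saved checkpoint, pass it via `--resume`. A sketch reusing the flags above (the step number in the checkpoint path is only an illustration; use one that actually exists under your output directory):

```bash
python train.py --config=./ljspeech.yaml --data=./LJSpeech-1.1/ \
    --output=experiment --device=0 \
    --resume=experiment/checkpoints/model_step_000010000
```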

You can monitor the training log via TensorBoard, using the commands below.

```bash
cd experiment/log
tensorboard --logdir=.
```

## Synthesis

Synthesize waveforms from text using `synthesis.py`. For usage, run `python synthesis.py --help`.

```text
usage: synthesis.py [-h] [-c CONFIG] [-g DEVICE] checkpoint text output_path

Synthesize waveform with a checkpoint.

positional arguments:
  checkpoint            checkpoint to load.
  text                  text file to synthesize
  output_path           path to save results

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        experiment config.
  -g DEVICE, --device DEVICE
                        device to use
```

1. `--config` is the configuration file to use. You should use the same configuration file with which you trained your model.
2. `checkpoint` is the checkpoint to load.
3. `text` is the text file to synthesize.
4. `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`) and attention plots (`*.png`) for each sentence.
5. `--device` is the device (GPU id) to use for synthesis. `-1` means CPU.

Example script:

```bash
python synthesis.py --config=./ljspeech.yaml --device=0 experiment/checkpoints/model_step_005000000 sentences.txt generated
```
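
To synthesize your own text, prepare a plain text file for the `text` argument. A minimal sketch, assuming the same one-sentence-per-line format as the provided `sentences.txt` (the file name and sentence below are only illustrations):

```bash
cat > my_sentences.txt << EOF
A quick brown fox jumps over the lazy dog.
EOF
python synthesis.py --config=./ljspeech.yaml --device=0 \
    experiment/checkpoints/model_step_005000000 my_sentences.txt generated
```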