Commit d08779d6, authored by lifuchen

Modified data.py to generate masks as model inputs

@@ -25,3 +25,11 @@
files: \.md$
- id: remove-tabs
files: \.md$
- repo: local
hooks:
- id: copyright_checker
name: copyright_checker
entry: python ./tools/copyright.hook
language: system
files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|proto|py)$
exclude: (?!.*third_party)^.*$ | (?!.*book)^.*$
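This adds a local pre-commit hook that runs the copyright checker over C/C++/CUDA/proto/Python sources. As a rough sketch of how the hooks are exercised locally (assuming the `pre-commit` tool is installed from PyPI):

```bash
# install pre-commit, register the hooks from .pre-commit-config.yaml,
# then run them once over the whole tree
pip install pre-commit
pre-commit install
pre-commit run --all-files
```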
@@ -13,6 +13,4 @@
limitations under the License.
Part of code was copied or adapted from https://github.com/r9y9/deepvoice3_pytorch/
Copyright (c) 2017: Ryuichi Yamamoto, whose license applies.
# Parakeet
Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It is built on PaddlePaddle Fluid dynamic graph and includes many influential TTS models proposed by [Baidu Research](http://research.baidu.com) and other research groups.
<div align="center">
<img src="images/logo.png" width=450 /> <br>
</div>
In particular, it features the latest [WaveFlow](https://arxiv.org/abs/1912.01219) model proposed by Baidu Research.
- WaveFlow can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on a Nvidia V100 GPU without engineered inference kernels, which is faster than WaveGlow and several orders of magnitude faster than WaveNet.
- WaveFlow is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M) and comparable to WaveNet (4.6M).
- WaveFlow is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in Parallel WaveNet and ClariNet, which simplifies the training pipeline and reduces the cost of development.
### Setup
Make sure the library `libsndfile1` is installed, e.g., on Ubuntu.
```bash
sudo apt-get install libsndfile1
```
### Install PaddlePaddle
See [install](https://www.paddlepaddle.org.cn/install/quick) for more details. This repo requires PaddlePaddle 1.7 or above.
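For example, a CPU-only build that meets this requirement can be installed from PyPI (the exact version pin below is only illustrative):

```bash
# install PaddlePaddle 1.7+ (CPU build); use the paddlepaddle-gpu package on CUDA machines
pip install "paddlepaddle>=1.7.0"
```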
### Install Parakeet
@@ -20,12 +31,6 @@ cd Parakeet
pip install -e .
```
### Install CMUdict for nltk
CMUdict from nltk is used to transform text into phonemes.
@@ -36,14 +41,24 @@ nltk.download("cmudict")
```
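A quick check that the dictionary is available (a minimal sketch, not part of the original instructions):

```python
import nltk
from nltk.corpus import cmudict

nltk.download("cmudict")         # fetch the CMU pronouncing dictionary once
print(cmudict.dict()["speech"])  # phoneme sequence(s) for a word, e.g. [['S', 'P', 'IY1', 'CH']]
```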
## Related Research
- [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654)
- [Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895)
- [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263)
- [WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219)
- [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499)
- [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281)
## Examples
- [Train a DeepVoice3 model with ljspeech dataset](./examples/deepvoice3)
- [Train a TransformerTTS model with ljspeech dataset](./examples/transformer_tts)
- [Train a FastSpeech model with ljspeech dataset](./examples/fastspeech)
- [Train a WaveFlow model with ljspeech dataset](./examples/waveflow)
- [Train a WaveNet model with ljspeech dataset](./examples/wavenet)
- [Train a Clarinet model with ljspeech dataset](./examples/clarinet)
## Copyright and License
Parakeet is provided under the [Apache-2.0 license](LICENSE).
# ClariNet
PaddlePaddle dynamic graph implementation of ClariNet, a convolutional network based vocoder. The implementation is based on the paper [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281).
## Dataset
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
## Project Structure
```text
├── data.py        data processing
├── configs/       (example) configuration file
├── synthesis.py   script to synthesize waveform from mel spectrogram
├── train.py       script to train a model
└── utils.py       utility functions
```
## Train
Train the model using train.py; refer to the usage displayed by `python train.py --help`.
```text
usage: train.py [-h] [--config CONFIG] [--device DEVICE] [--output OUTPUT]
[--data DATA] [--resume RESUME] [--wavenet WAVENET]
train a ClariNet model with LJspeech and a trained WaveNet model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG path of the config file.
--device DEVICE device to use.
--output OUTPUT path to save student.
--data DATA path of LJspeech dataset.
--resume RESUME checkpoint to load from.
--wavenet WAVENET wavenet checkpoint to use.
```
1. `--config` is the configuration file to use. The provided configurations can be used directly. You can also change some values in the configuration file and train the model with a different config.
2. `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.csv).
3. `--resume` is the path of the checkpoint. If it is provided, the model would load the checkpoint before training.
4. `--output` is the directory to save results, all results are saved in this directory. The structure of the output directory is shown below.
```text
├── checkpoints # checkpoint
├── states # audio files generated at validation
└── log # tensorboard log
```
5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
6. `--wavenet` is the path of the wavenet checkpoint to load. If you do not specify `--resume`, then this must be provided.
Before you start training a ClariNet model, you should have a trained WaveNet model with a single Gaussian output distribution. Make sure the config of the teacher model matches that of the trained model.
Example script:
```bash
python train.py --config=./configs/clarinet_ljspeech.yaml --data=./LJSpeech-1.1/ --output=experiment --device=0 --wavenet=<path of the trained wavenet checkpoint>
```
You can monitor training log via tensorboard, using the script below.
```bash
cd experiment/log
tensorboard --logdir=.
```
## Synthesis
```text
usage: synthesis.py [-h] [--config CONFIG] [--device DEVICE] [--data DATA]
checkpoint output
synthesize audio files from mel spectrogram in the validation set.
positional arguments:
checkpoint checkpoint to load from.
output path to save student.
optional arguments:
-h, --help show this help message and exit
--config CONFIG path of the config file.
--device DEVICE device to use.
--data DATA path of LJspeech dataset.
```
1. `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
2. `--data` is the path of the LJSpeech dataset. A dataset is not strictly needed for synthesis, but since the input is mel spectrogram, we use the dataset to extract mel spectrograms from audio files.
3. `checkpoint` is the checkpoint to load.
4. `output` is the directory to save results. The output directory contains the generated audio files (`*.wav`).
5. `--device` is the device (gpu id) to use for synthesis. `-1` means CPU.
Example script:
```bash
python synthesis.py --config=./configs/clarinet_ljspeech.yaml --data=./LJSpeech-1.1/ --device=0 experiment/checkpoints/step_500000 generated
```
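For reference, an example configuration (presumably the `configs/clarinet_ljspeech.yaml` used above) is reproduced below.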
data:
batch_size: 8
train_clip_seconds: 0.5
sample_rate: 22050
hop_length: 256
win_length: 1024
n_fft: 2048
n_mels: 80
valid_size: 16
conditioner:
upsampling_factors: [16, 16]
teacher:
n_loop: 10
n_layer: 3
filter_size: 2
residual_channels: 128
loss_type: "mog"
output_dim: 3
log_scale_min: -9
student:
n_loops: [10, 10, 10, 10, 10, 10]
n_layers: [1, 1, 1, 1, 1, 1]
filter_size: 3
residual_channels: 64
log_scale_min: -7
stft:
n_fft: 2048
win_length: 1024
hop_length: 256
loss:
lmd: 4
train:
learning_rate: 0.0005
anneal_rate: 0.5
anneal_interval: 200000
gradient_max_norm: 100.0
checkpoint_interval: 1000
eval_interval: 1000
max_iterations: 2000000
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
import argparse
import ruamel.yaml
import random
from tqdm import tqdm
import pickle
import numpy as np
from tensorboardX import SummaryWriter
import paddle.fluid.dygraph as dg
from paddle import fluid
from parakeet.models.wavenet import WaveNet, UpsampleNet
from parakeet.models.clarinet import STFT, Clarinet, ParallelWaveNet
from parakeet.data import TransformDataset, SliceDataset, RandomSampler, SequentialSampler, DataCargo
from parakeet.utils.layer_tools import summary, freeze
from utils import valid_model, eval_model, save_checkpoint, load_checkpoint, load_model
sys.path.append("../wavenet")
from data import LJSpeechMetaData, Transform, DataCollector
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="synthesize audio files from mel spectrogram in the validation set."
)
parser.add_argument("--config", type=str, help="path of the config file.")
parser.add_argument(
"--device", type=int, default=-1, help="device to use.")
parser.add_argument("--data", type=str, help="path of LJspeech dataset.")
parser.add_argument(
"checkpoint", type=str, help="checkpoint to load from.")
parser.add_argument(
"output", type=str, default="experiment", help="path to save student.")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = ruamel.yaml.safe_load(f)
ljspeech_meta = LJSpeechMetaData(args.data)
data_config = config["data"]
sample_rate = data_config["sample_rate"]
n_fft = data_config["n_fft"]
win_length = data_config["win_length"]
hop_length = data_config["hop_length"]
n_mels = data_config["n_mels"]
train_clip_seconds = data_config["train_clip_seconds"]
transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
ljspeech = TransformDataset(ljspeech_meta, transform)
valid_size = data_config["valid_size"]
ljspeech_valid = SliceDataset(ljspeech, 0, valid_size)
ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech))
teacher_config = config["teacher"]
n_loop = teacher_config["n_loop"]
n_layer = teacher_config["n_layer"]
filter_size = teacher_config["filter_size"]
context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
print("context size is {} samples".format(context_size))
train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
train_clip_seconds)
valid_batch_fn = DataCollector(
context_size, sample_rate, hop_length, train_clip_seconds, valid=True)
batch_size = data_config["batch_size"]
train_cargo = DataCargo(
ljspeech_train,
train_batch_fn,
batch_size,
sampler=RandomSampler(ljspeech_train))
# only batch=1 for validation is enabled
valid_cargo = DataCargo(
ljspeech_valid,
valid_batch_fn,
batch_size=1,
sampler=SequentialSampler(ljspeech_valid))
if args.device == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(args.device)
with dg.guard(place):
# conditioner(upsampling net)
conditioner_config = config["conditioner"]
upsampling_factors = conditioner_config["upsampling_factors"]
upsample_net = UpsampleNet(upscale_factors=upsampling_factors)
freeze(upsample_net)
residual_channels = teacher_config["residual_channels"]
loss_type = teacher_config["loss_type"]
output_dim = teacher_config["output_dim"]
log_scale_min = teacher_config["log_scale_min"]
assert loss_type == "mog" and output_dim == 3, \
"the teacher wavenet should be a wavenet with single gaussian output"
teacher = WaveNet(n_loop, n_layer, residual_channels, output_dim,
n_mels, filter_size, loss_type, log_scale_min)
# load & freeze upsample_net & teacher
freeze(teacher)
student_config = config["student"]
n_loops = student_config["n_loops"]
n_layers = student_config["n_layers"]
student_residual_channels = student_config["residual_channels"]
student_filter_size = student_config["filter_size"]
student_log_scale_min = student_config["log_scale_min"]
student = ParallelWaveNet(n_loops, n_layers, student_residual_channels,
n_mels, student_filter_size)
stft_config = config["stft"]
stft = STFT(
n_fft=stft_config["n_fft"],
hop_length=stft_config["hop_length"],
win_length=stft_config["win_length"])
lmd = config["loss"]["lmd"]
model = Clarinet(upsample_net, teacher, student, stft,
student_log_scale_min, lmd)
summary(model)
load_model(model, args.checkpoint)
# loader
train_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
train_loader.set_batch_generator(train_cargo, place)
valid_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
valid_loader.set_batch_generator(valid_cargo, place)
if not os.path.exists(args.output):
os.makedirs(args.output)
eval_model(model, valid_loader, args.output, sample_rate)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
import argparse
import ruamel.yaml
import random
from tqdm import tqdm
import pickle
import numpy as np
from tensorboardX import SummaryWriter
import paddle.fluid.dygraph as dg
from paddle import fluid
from parakeet.models.wavenet import WaveNet, UpsampleNet
from parakeet.models.clarinet import STFT, Clarinet, ParallelWaveNet
from parakeet.data import TransformDataset, SliceDataset, RandomSampler, SequentialSampler, DataCargo
from parakeet.utils.layer_tools import summary, freeze
from utils import make_output_tree, valid_model, save_checkpoint, load_checkpoint, load_wavenet
sys.path.append("../wavenet")
from data import LJSpeechMetaData, Transform, DataCollector
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="train a clarinet model with LJspeech and a trained wavenet model."
)
parser.add_argument("--config", type=str, help="path of the config file.")
parser.add_argument(
"--device", type=int, default=-1, help="device to use.")
parser.add_argument(
"--output",
type=str,
default="experiment",
help="path to save student.")
parser.add_argument("--data", type=str, help="path of LJspeech dataset.")
parser.add_argument("--resume", type=str, help="checkpoint to load from.")
parser.add_argument(
"--wavenet", type=str, help="wavenet checkpoint to use.")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = ruamel.yaml.safe_load(f)
ljspeech_meta = LJSpeechMetaData(args.data)
data_config = config["data"]
sample_rate = data_config["sample_rate"]
n_fft = data_config["n_fft"]
win_length = data_config["win_length"]
hop_length = data_config["hop_length"]
n_mels = data_config["n_mels"]
train_clip_seconds = data_config["train_clip_seconds"]
transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
ljspeech = TransformDataset(ljspeech_meta, transform)
valid_size = data_config["valid_size"]
ljspeech_valid = SliceDataset(ljspeech, 0, valid_size)
ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech))
teacher_config = config["teacher"]
n_loop = teacher_config["n_loop"]
n_layer = teacher_config["n_layer"]
filter_size = teacher_config["filter_size"]
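# context_size is the receptive field of the teacher WaveNet in samples;
# with the example config above (n_loop=10, n_layer=3, filter_size=2) it is
# 1 + 3 * (2**10 - 1) = 3070 samples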
context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
print("context size is {} samples".format(context_size))
train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
train_clip_seconds)
valid_batch_fn = DataCollector(
context_size, sample_rate, hop_length, train_clip_seconds, valid=True)
batch_size = data_config["batch_size"]
train_cargo = DataCargo(
ljspeech_train,
train_batch_fn,
batch_size,
sampler=RandomSampler(ljspeech_train))
# only batch=1 for validation is enabled
valid_cargo = DataCargo(
ljspeech_valid,
valid_batch_fn,
batch_size=1,
sampler=SequentialSampler(ljspeech_valid))
make_output_tree(args.output)
if args.device == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(args.device)
with dg.guard(place):
# conditioner(upsampling net)
conditioner_config = config["conditioner"]
upsampling_factors = conditioner_config["upsampling_factors"]
upsample_net = UpsampleNet(upscale_factors=upsampling_factors)
freeze(upsample_net)
residual_channels = teacher_config["residual_channels"]
loss_type = teacher_config["loss_type"]
output_dim = teacher_config["output_dim"]
log_scale_min = teacher_config["log_scale_min"]
assert loss_type == "mog" and output_dim == 3, \
"the teacher wavenet should be a wavenet with single gaussian output"
teacher = WaveNet(n_loop, n_layer, residual_channels, output_dim,
n_mels, filter_size, loss_type, log_scale_min)
freeze(teacher)
student_config = config["student"]
n_loops = student_config["n_loops"]
n_layers = student_config["n_layers"]
student_residual_channels = student_config["residual_channels"]
student_filter_size = student_config["filter_size"]
student_log_scale_min = student_config["log_scale_min"]
student = ParallelWaveNet(n_loops, n_layers, student_residual_channels,
n_mels, student_filter_size)
stft_config = config["stft"]
stft = STFT(
n_fft=stft_config["n_fft"],
hop_length=stft_config["hop_length"],
win_length=stft_config["win_length"])
lmd = config["loss"]["lmd"]
model = Clarinet(upsample_net, teacher, student, stft,
student_log_scale_min, lmd)
summary(model)
# optim
train_config = config["train"]
learning_rate = train_config["learning_rate"]
anneal_rate = train_config["anneal_rate"]
anneal_interval = train_config["anneal_interval"]
lr_scheduler = dg.ExponentialDecay(
learning_rate, anneal_interval, anneal_rate, staircase=True)
optim = fluid.optimizer.Adam(
lr_scheduler, parameter_list=model.parameters())
gradient_max_norm = train_config["gradient_max_norm"]
clipper = fluid.dygraph_grad_clip.GradClipByGlobalNorm(
gradient_max_norm)
assert args.wavenet or args.resume, "you should load from a trained wavenet or resume training; training without a trained wavenet is not recommended."
if args.wavenet:
load_wavenet(model, args.wavenet)
if args.resume:
load_checkpoint(model, optim, args.resume)
# loader
train_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
train_loader.set_batch_generator(train_cargo, place)
valid_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
valid_loader.set_batch_generator(valid_cargo, place)
# train
max_iterations = train_config["max_iterations"]
checkpoint_interval = train_config["checkpoint_interval"]
eval_interval = train_config["eval_interval"]
checkpoint_dir = os.path.join(args.output, "checkpoints")
state_dir = os.path.join(args.output, "states")
log_dir = os.path.join(args.output, "log")
writer = SummaryWriter(log_dir)
# training loop
global_step = 1
global_epoch = 1
while global_step < max_iterations:
epoch_loss = 0.
for j, batch in tqdm(enumerate(train_loader), desc="[train]"):
audios, mels, audio_starts = batch
model.train()
loss_dict = model(
audios, mels, audio_starts, clip_kl=global_step > 500)
writer.add_scalar("learning_rate",
optim._learning_rate.step().numpy()[0],
global_step)
for k, v in loss_dict.items():
writer.add_scalar("loss/{}".format(k),
v.numpy()[0], global_step)
l = loss_dict["loss"]
step_loss = l.numpy()[0]
print("[train] loss: {:<8.6f}".format(step_loss))
epoch_loss += step_loss
l.backward()
optim.minimize(l, grad_clip=clipper)
optim.clear_gradients()
if global_step % eval_interval == 0:
# evaluate on valid dataset
valid_model(model, valid_loader, state_dir, global_step,
sample_rate)
if global_step % checkpoint_interval == 0:
save_checkpoint(model, optim, checkpoint_dir, global_step)
global_step += 1
# epoch loss
average_loss = epoch_loss / (j + 1)
writer.add_scalar("average_loss", average_loss, global_epoch)
global_epoch += 1
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import soundfile as sf
from tensorboardX import SummaryWriter
from collections import OrderedDict
from paddle import fluid
import paddle.fluid.dygraph as dg
def make_output_tree(output_dir):
checkpoint_dir = os.path.join(output_dir, "checkpoints")
if not os.path.exists(checkpoint_dir):
os.makedirs(checkpoint_dir)
state_dir = os.path.join(output_dir, "states")
if not os.path.exists(state_dir):
os.makedirs(state_dir)
def valid_model(model, valid_loader, output_dir, global_step, sample_rate):
model.eval()
for i, batch in enumerate(valid_loader):
# print("sentence {}".format(i))
path = os.path.join(output_dir,
"step_{}_sentence_{}.wav".format(global_step, i))
audio_clips, mel_specs, audio_starts = batch
wav_var = model.synthesis(mel_specs)
wav_np = wav_var.numpy()[0]
sf.write(path, wav_np, samplerate=sample_rate)
print("generated {}".format(path))
def eval_model(model, valid_loader, output_dir, sample_rate):
model.eval()
for i, batch in enumerate(valid_loader):
# print("sentence {}".format(i))
path = os.path.join(output_dir, "sentence_{}.wav".format(i))
audio_clips, mel_specs, audio_starts = batch
wav_var = model.synthesis(mel_specs)
wav_np = wav_var.numpy()[0]
sf.write(path, wav_np, samplerate=sample_rate)
print("generated {}".format(path))
def save_checkpoint(model, optim, checkpoint_dir, global_step):
path = os.path.join(checkpoint_dir, "step_{}".format(global_step))
dg.save_dygraph(model.state_dict(), path)
print("saving model to {}".format(path + ".pdparams"))
if optim:
dg.save_dygraph(optim.state_dict(), path)
print("saving optimizer to {}".format(path + ".pdopt"))
def load_model(model, path):
model_dict, _ = dg.load_dygraph(path)
model.set_dict(model_dict)
print("loaded model from {}.pdparams".format(path))
def load_checkpoint(model, optim, path):
model_dict, optim_dict = dg.load_dygraph(path)
model.set_dict(model_dict)
print("loaded model from {}.pdparams".format(path))
if optim_dict:
optim.set_dict(optim_dict)
print("loaded optimizer from {}.pdparams".format(path))
def load_wavenet(model, path):
wavenet_dict, _ = dg.load_dygraph(path)
encoder_dict = OrderedDict()
teacher_dict = OrderedDict()
for k, v in wavenet_dict.items():
if k.startswith("encoder."):
encoder_dict[k.split('.', 1)[1]] = v
else:
# k starts with "decoder."
teacher_dict[k.split('.', 1)[1]] = v
model.encoder.set_dict(encoder_dict)
model.teacher.set_dict(teacher_dict)
print("loaded the encoder part and teacher part from wavenet model.")
# Deep Voice 3
PaddlePaddle dynamic graph implementation of Deep Voice 3, a convolutional network based text-to-speech generative model. The implementation is based on [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654).
We implement Deep Voice 3 using Paddle Fluid with dynamic graph, which is convenient for building flexible network architectures.
## Dataset
@@ -15,15 +15,15 @@ tar xjvf LJSpeech-1.1.tar.bz2
## Model Architecture
![Deep Voice 3 model architecture](./images/model_architecture.png)
The model consists of an encoder, a decoder and a converter (and a speaker embedding for multispeaker models). The encoder and the decoder together form the seq2seq part of the model, and the converter forms the postnet part.
## Project Structure
```text
├── data.py          data processing
├── configs/         (example) configuration files
├── sentences.txt    sample sentences
├── synthesis.py     script to synthesize waveform from text
├── train.py         script to train a model
@@ -37,7 +37,7 @@ Train the model using train.py, follow the usage displayed by `python train.py -
```text
usage: train.py [-h] [-c CONFIG] [-s DATA] [-r RESUME] [-o OUTPUT] [-g DEVICE]
Train a Deep Voice 3 model with LJSpeech dataset.
optional arguments:
-h, --help show this help message and exit
@@ -50,18 +50,18 @@ optional arguments:
The directory to save result.
-g DEVICE, --device DEVICE
device to use
```
1. `--config` is the configuration file to use. The provided `ljspeech.yaml` can be used directly. You can also change some values in the configuration file and train the model with a different config.
2. `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.csv).
3. `--resume` is the path of the checkpoint. If it is provided, the model would load the checkpoint before training.
4. `--output` is the directory to save results, all results are saved in this directory. The structure of the output directory is shown below.
```text
├── checkpoints      # checkpoint
├── log              # tensorboard log
└── states           # train and evaluation results
    ├── alignments   # attention
    ├── lin_spec     # linear spectrogram
    ├── mel_spec     # mel spectrogram
    └── waveform     # waveform (.wav files)
@@ -69,10 +69,10 @@ optional arguments:
5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
Example script:
```bash
python train.py --config=configs/ljspeech.yaml --data=./LJSpeech-1.1/ --output=experiment --device=0
```
You can monitor training log via tensorboard, using the script below.
@@ -86,7 +86,7 @@ tensorboard --logdir=.
```text
usage: synthesis.py [-h] [-c CONFIG] [-g DEVICE] checkpoint text output_path
Synthesize waveform from a checkpoint.
positional arguments:
checkpoint checkpoint to load.
@@ -107,9 +107,8 @@ optional arguments:
4. `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`) and attention plots (`*.png`) for each sentence.
5. `--device` is the device (gpu id) to use for synthesis. `-1` means CPU.
Example script:
```bash
python synthesis.py --config=configs/ljspeech.yaml --device=0 experiment/checkpoints/model_step_005000000 sentences.txt generated
```
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import csv
from pathlib import Path
@@ -79,10 +93,11 @@ class Transform(object):
y = signal.lfilter([1., -self.preemphasis], [1.], wav)
# STFT
D = librosa.stft(
y=y,
n_fft=self.n_fft,
win_length=self.win_length,
hop_length=self.hop_length)
S = np.abs(D)
# to db and normalize to 0-1
@@ -96,11 +111,8 @@ class Transform(object):
# mel scale and to db and normalize to 0-1,
# CAUTION: pass linear scale S, not dbscaled S
S_mel = librosa.feature.melspectrogram(
S=S, n_mels=self.n_mels, fmin=self.fmin, fmax=self.fmax, power=1.)
S_mel = 20 * np.log10(np.maximum(amplitude_min,
S_mel)) - self.ref_level_db
S_mel_norm = (S_mel - self.min_level_db) / (-self.min_level_db)
@@ -148,20 +160,18 @@ class DataCollector(object):
(mix_grapheme_phonemes, text_length, speaker_id, S_norm,
S_mel_norm, num_frames) = example
text_sequences.append(
np.pad(mix_grapheme_phonemes, (0, max_text_length - text_length)))
lin_specs.append(
np.pad(S_norm, ((0, 0), (self._pad_begin, max_frames -
self._pad_begin - num_frames))))
mel_specs.append(
np.pad(S_mel_norm, ((0, 0), (self._pad_begin, max_frames -
self._pad_begin - num_frames))))
done_flags.append(
np.pad(np.zeros((int(np.ceil(num_frames // self._factor)), )),
(0, max_decoder_length - int(
np.ceil(num_frames // self._factor))),
constant_values=1))
text_sequences = np.array(text_sequences).astype(np.int64)
lin_specs = np.transpose(np.array(lin_specs),
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import ruamel.yaml
@@ -22,11 +36,8 @@ if __name__ == "__main__":
parser.add_argument("checkpoint", type=str, help="checkpoint to load.")
parser.add_argument("text", type=str, help="text file to synthesize")
parser.add_argument("output_path", type=str, help="path to save results")
parser.add_argument(
"-g", "--device", type=int, default=-1, help="device to use")
args = parser.parse_args()
with open(args.config, 'rt') as f:
@@ -76,15 +87,14 @@ if __name__ == "__main__":
window_ahead = model_config["window_ahead"]
key_projection = model_config["key_projection"]
value_projection = model_config["value_projection"]
dv3 = make_model(
n_speakers, speaker_dim, speaker_embed_std, embed_dim, padding_idx,
embedding_std, max_positions, n_vocab, freeze_embedding,
filter_size, encoder_channels, n_mels, decoder_channels, r,
trainable_positional_encodings, use_memory_mask,
query_position_rate, key_position_rate, window_backward,
window_ahead, key_projection, value_projection, downsample_factor,
linear_dim, use_decoder_states, converter_channels, dropout)
summary(dv3)
state, _ = dg.load_dygraph(args.checkpoint)
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import ruamel.yaml
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import numpy as np
from matplotlib import cm
@@ -28,8 +42,9 @@ def make_model(n_speakers, speaker_dim, speaker_embed_std, embed_dim,
converter_channels, dropout):
"""just a simple function to create a deepvoice 3 model"""
if n_speakers > 1:
spe = dg.Embedding(
(n_speakers, speaker_dim),
param_attr=I.Normal(scale=speaker_embed_std))
else:
spe = None
@@ -45,17 +60,17 @@ def make_model(n_speakers, speaker_dim, speaker_embed_std, embed_dim,
ConvSpec(h, k, 9),
ConvSpec(h, k, 27),
ConvSpec(h, k, 1),
ConvSpec(h, k, 3), )
enc = Encoder(
n_vocab,
embed_dim,
n_speakers,
speaker_dim,
padding_idx=None,
embedding_weight_std=embedding_std,
convolutions=encoder_convolutions,
max_positions=max_positions,
dropout=dropout)
if freeze_embedding:
freeze(enc.embed)
@@ -66,28 +81,28 @@ def make_model(n_speakers, speaker_dim, speaker_embed_std, embed_dim,
ConvSpec(h, k, 3),
ConvSpec(h, k, 9),
ConvSpec(h, k, 27),
ConvSpec(h, k, 1), )
attention = [True, False, False, False, True]
force_monotonic_attention = [True, False, False, False, True]
dec = Decoder(
n_speakers,
speaker_dim,
embed_dim,
mel_dim,
r=r,
max_positions=max_positions,
padding_idx=padding_idx,
preattention=prenet_convolutions,
convolutions=attentive_convolutions,
attention=attention,
dropout=dropout,
use_memory_mask=use_memory_mask,
force_monotonic_attention=force_monotonic_attention,
query_position_rate=query_position_rate,
key_position_rate=key_position_rate,
window_range=WindowRange(window_behind, window_ahead),
key_projection=key_projection,
value_projection=value_projection)
if not trainable_positional_encodings:
freeze(dec.embed_keys_positions)
freeze(dec.embed_query_positions)
@@ -97,15 +112,15 @@ def make_model(n_speakers, speaker_dim, speaker_embed_std, embed_dim,
ConvSpec(h, k, 1),
ConvSpec(h, k, 3),
ConvSpec(2 * h, k, 1),
ConvSpec(2 * h, k, 3), )
cvt = Converter(
n_speakers,
speaker_dim,
dec.state_dim if use_decoder_states else mel_dim,
linear_dim,
time_upsampling=downsample_factor,
convolutions=postnet_convolutions,
dropout=dropout)
dv3 = DeepVoice3(enc, dec, cvt, spe, use_decoder_states)
return dv3
@@ -115,8 +130,10 @@ def eval_model(model, text, replace_pronounciation_prob, min_level_db,
ref_level_db, power, n_iter, win_length, hop_length,
preemphasis):
"""generate waveform from text using a deepvoice 3 model"""
text = np.array(
en.text_to_sequence(text, p=replace_pronounciation_prob),
dtype=np.int64)
length = len(text)
print("text sequence's length: {}".format(length))
text_positions = np.arange(1, 1 + length)
@@ -145,10 +162,11 @@ def spec_to_waveform(spec, min_level_db, ref_level_db, power, n_iter,
"""
denoramlized = np.clip(spec, 0, 1) * (-min_level_db) + min_level_db
lin_scaled = np.exp((denoramlized + ref_level_db) / 20 * np.log(10))
wav = librosa.griffinlim(
lin_scaled**power,
n_iter=n_iter,
hop_length=hop_length,
win_length=win_length)
if preemphasis > 0:
wav = signal.lfilter([1.], [1., -preemphasis], wav)
return wav
@@ -225,28 +243,30 @@ def save_state(save_dir,
plt.colorbar()
plt.title("mel_input")
plt.savefig(
os.path.join(path, "target_mel_spec_step{:09d}.png".format(
global_step)))
plt.close()
writer.add_image(
"target/mel_spec",
cm.viridis(mel_input),
global_step,
dataformats="HWC")
plt.figure(figsize=(10, 3))
display.specshow(mel_output)
plt.colorbar()
plt.title("mel_output")
plt.savefig(
os.path.join(path, "predicted_mel_spec_step{:09d}.png".format(
global_step)))
plt.close()
writer.add_image(
"predicted/mel_spec",
cm.viridis(mel_output),
global_step,
dataformats="HWC")
if lin_input is not None and lin_output is not None:
lin_input = lin_input[0].numpy().T
@@ -258,28 +278,30 @@ def save_state(save_dir,
plt.colorbar()
plt.title("lin_input")
plt.savefig(
os.path.join(path, "target_lin_spec_step{:09d}.png".format(
global_step)))
plt.close()
writer.add_image(
"target/lin_spec",
cm.viridis(lin_input),
global_step,
dataformats="HWC")
plt.figure(figsize=(10, 3))
display.specshow(lin_output)
plt.colorbar()
plt.title("lin_output")
plt.savefig(
os.path.join(path, "predicted_lin_spec_step{:09d}.png".format(
global_step)))
plt.close()
writer.add_image(
"predicted/lin_spec",
cm.viridis(lin_output),
global_step,
dataformats="HWC")
if alignments is not None and len(alignments.shape) == 4:
path = os.path.join(save_dir, "alignments")
@@ -290,10 +312,11 @@ def save_state(save_dir,
"train_attn_layer_{}_step_{}.png".format(idx, global_step))
plot_alignment(attn_layer, save_path)
writer.add_image(
"train_attn/layer_{}".format(idx),
cm.viridis(attn_layer),
global_step,
dataformats="HWC")
if lin_output is not None:
wav = spec_to_waveform(lin_output, min_level_db, ref_level_db, power,
@@ -302,7 +325,5 @@ def save_state(save_dir,
save_path = os.path.join(
path, "train_sample_step_{:09d}.wav".format(global_step))
sf.write(save_path, wav, sample_rate)
writer.add_audio(
"train_sample", wav, global_step, sample_rate=sample_rate)
@@ -57,7 +57,7 @@ python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog tr
If you wish to resume from an existing model, please set ``--checkpoint_path`` and ``--fastspeech_step``.
For more help on arguments:
``python train.py --help``.
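For instance, a single-GPU resume run might look like the following (paths and the step value are hypothetical; the flags are the ones defined in `parse.py`):

```bash
python train.py --use_gpu=1 --use_data_parallel=0 \
    --data_path=./dataset/LJSpeech-1.1 \
    --checkpoint_path=./checkpoint --fastspeech_step=120000
```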
## Synthesis
@@ -75,5 +75,5 @@ or you can run the script file directly.
sh synthesis.sh
```
For more help on arguments:
``python synthesis.py --help``.
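As a sketch, synthesis from a trained checkpoint could be invoked like this (the step value is hypothetical; `--alpha` controls the voice speed as documented in `parse.py`):

```bash
python synthesis.py --use_gpu=1 --alpha=1.0 \
    --checkpoint_path=./checkpoint --fastspeech_step=160000
```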
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
def add_config_options_to_parser(parser):
parser.add_argument(
'--config_path',
type=str,
default='config/fastspeech.yaml',
help="the yaml config file path.")
parser.add_argument(
'--batch_size', type=int, default=32, help="batch size for training.")
parser.add_argument(
'--epochs',
type=int,
default=10000,
help="the number of epoch for training.")
parser.add_argument(
'--lr',
type=float,
default=0.001,
help="the learning rate for training.")
parser.add_argument(
'--save_step',
type=int,
default=500,
help="checkpointing interval during training.")
parser.add_argument(
'--fastspeech_step',
type=int,
default=70000,
help="Global step to restore checkpoint of fastspeech.")
parser.add_argument(
'--use_gpu',
type=int,
default=1,
help="use gpu or not during training.")
parser.add_argument(
'--use_data_parallel',
type=int,
default=0,
help="use data parallel or not during training.")
parser.add_argument(
'--alpha',
type=float,
default=1.0,
help="The hyperparameter to determine the length of the expanded sequence \
mel, thereby controlling the voice speed.")
parser.add_argument(
'--data_path',
type=str,
default='./dataset/LJSpeech-1.1',
help="the path of dataset.")
parser.add_argument(
'--checkpoint_path',
type=str,
default=None,
help="the path to load checkpoint or pretrain model.")
parser.add_argument(
'--save_path',
type=str,
default='./checkpoint',
help="the path to save checkpoint.")
parser.add_argument(
'--log_dir',
type=str,
default='./log',
help="the directory to save tensorboard log.")
parser.add_argument(
'--sample_path',
type=str,
default='./sample',
help="the directory to save audio sample in synthesis.")
parser.add_argument(
'--transtts_path',
type=str,
default='./log',
help="the directory to load pretrain transformerTTS model.")
parser.add_argument(
'--transformer_step',
type=int,
default=160000,
help="the step to load transformerTTS model.")
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from tensorboardX import SummaryWriter
from collections import OrderedDict
@@ -13,6 +26,7 @@ from parakeet import audio
from parakeet.models.fastspeech.fastspeech import FastSpeech
from parakeet.models.transformer_tts.utils import *
def load_checkpoint(step, model_path):
model_dict, _ = fluid.dygraph.load_dygraph(os.path.join(model_path, step))
new_state_dict = OrderedDict()
@@ -23,13 +37,14 @@ def load_checkpoint(step, model_path):
new_state_dict[param] = model_dict[param]
return new_state_dict
def synthesis(text_input, args):
place = (fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace())
# tensorboard
if not os.path.exists(args.log_dir):
os.mkdir(args.log_dir)
path = os.path.join(args.log_dir, 'synthesis')
with open(args.config_path) as f:
cfg = yaml.load(f, Loader=yaml.Loader)
@@ -38,35 +53,42 @@ def synthesis(text_input, args):
with dg.guard(place):
model = FastSpeech(cfg)
model.set_dict(
load_checkpoint(
str(args.fastspeech_step),
os.path.join(args.checkpoint_path, "fastspeech")))
model.eval()
text = np.asarray(text_to_sequence(text_input))
text = np.expand_dims(text, axis=0)
pos_text = np.arange(1, text.shape[1] + 1)
pos_text = np.expand_dims(pos_text, axis=0)
enc_non_pad_mask = get_non_pad_mask(pos_text).astype(np.float32)
enc_slf_attn_mask = get_attn_key_pad_mask(pos_text, text).astype(np.float32)
text = dg.to_variable(text)
pos_text = dg.to_variable(pos_text)
enc_non_pad_mask = dg.to_variable(enc_non_pad_mask)
enc_slf_attn_mask = dg.to_variable(enc_slf_attn_mask)
mel_output, mel_output_postnet = model(
text,
pos_text,
alpha=args.alpha,
enc_non_pad_mask=enc_non_pad_mask,
enc_slf_attn_mask=enc_slf_attn_mask,
dec_non_pad_mask=None,
dec_slf_attn_mask=None)
_ljspeech_processor = audio.AudioProcessor(
sample_rate=cfg['audio']['sr'],
num_mels=cfg['audio']['num_mels'],
min_level_db=cfg['audio']['min_level_db'],
ref_level_db=cfg['audio']['ref_level_db'],
n_fft=cfg['audio']['n_fft'],
win_length=cfg['audio']['win_length'],
hop_length=cfg['audio']['hop_length'],
power=cfg['audio']['power'],
preemphasis=cfg['audio']['preemphasis'],
signal_norm=True,
@@ -79,14 +101,17 @@ def synthesis(text_input, args):
do_trim_silence=False,
sound_norm=False)
mel_output_postnet = fluid.layers.transpose(
fluid.layers.squeeze(mel_output_postnet, [0]), [1, 0])
wav = _ljspeech_processor.inv_melspectrogram(mel_output_postnet.numpy())
writer.add_audio(text_input, wav, 0, cfg['audio']['sr'])
print("Synthesis completed !!!")
writer.close()
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="Train Fastspeech model")
add_config_options_to_parser(parser)
args = parser.parse_args()
synthesis("Transformer model is so fast!", args)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import argparse
import os
...@@ -21,8 +34,10 @@ import sys
sys.path.append("../transformer_tts")
from data import LJSpeechLoader


def load_checkpoint(step, model_path):
    model_dict, opti_dict = fluid.dygraph.load_dygraph(
        os.path.join(model_path, step))
    new_state_dict = OrderedDict()
    for param in model_dict:
        if param.startswith('_layers.'):
...@@ -31,6 +46,7 @@ def load_checkpoint(step, model_path):
            new_state_dict[param] = model_dict[param]
    return new_state_dict, opti_dict


def main(args):
    local_rank = dg.parallel.Env().local_rank if args.use_data_parallel else 0
    nranks = dg.parallel.Env().nranks if args.use_data_parallel else 1
...@@ -44,26 +60,33 @@ def main(args):
             if args.use_gpu else fluid.CPUPlace())

    if not os.path.exists(args.log_dir):
        os.mkdir(args.log_dir)
    path = os.path.join(args.log_dir, 'fastspeech')

    writer = SummaryWriter(path) if local_rank == 0 else None

    with dg.guard(place):
        with fluid.unique_name.guard():
            transformerTTS = TransformerTTS(cfg)
            model_dict, _ = load_checkpoint(
                str(args.transformer_step),
                os.path.join(args.transtts_path, "transformer"))

            transformerTTS.set_dict(model_dict)
            transformerTTS.eval()

        model = FastSpeech(cfg)
        model.train()
        optimizer = fluid.optimizer.AdamOptimizer(
            learning_rate=dg.NoamDecay(1 / (
                cfg['warm_up_step'] * (args.lr**2)), cfg['warm_up_step']),
            parameter_list=model.parameters())
        reader = LJSpeechLoader(
            cfg, args, nranks, local_rank, shuffle=True).reader()

        if args.checkpoint_path is not None:
            model_dict, opti_dict = load_checkpoint(
                str(args.fastspeech_step),
                os.path.join(args.checkpoint_path, "fastspeech"))
            model.set_dict(model_dict)
            optimizer.set_dict(opti_dict)
            global_step = args.fastspeech_step
...@@ -77,45 +100,66 @@ def main(args):
            pbar = tqdm(reader)

            for i, data in enumerate(pbar):
                pbar.set_description('Processing at epoch %d' % epoch)
                (character, mel, mel_input, pos_text, pos_mel, text_length,
                 mel_lens, enc_slf_mask, enc_query_mask, dec_slf_mask,
                 enc_dec_mask, dec_query_slf_mask, dec_query_mask) = data

                _, _, attn_probs, _, _, _ = transformerTTS(
                    character,
                    mel_input,
                    pos_text,
                    pos_mel,
                    dec_slf_mask=dec_slf_mask,
                    enc_slf_mask=enc_slf_mask,
                    enc_query_mask=enc_query_mask,
                    enc_dec_mask=enc_dec_mask,
                    dec_query_slf_mask=dec_query_slf_mask,
                    dec_query_mask=dec_query_mask)
                alignment, max_attn = get_alignment(attn_probs, mel_lens,
                                                    cfg['transformer_head'])
                alignment = dg.to_variable(alignment).astype(np.float32)

                if local_rank == 0 and global_step % 5 == 1:
                    x = np.uint8(
                        cm.viridis(max_attn[8, :mel_lens.numpy()[8]]) * 255)
                    writer.add_image(
                        'Attention_%d_0' % global_step,
                        x,
                        0,
                        dataformats="HWC")
                global_step += 1

                # Forward
                result = model(
                    character,
                    pos_text,
                    mel_pos=pos_mel,
                    length_target=alignment,
                    enc_non_pad_mask=enc_query_mask,
                    enc_slf_attn_mask=enc_slf_mask,
                    dec_non_pad_mask=dec_query_slf_mask,
                    dec_slf_attn_mask=dec_slf_mask)
                mel_output, mel_output_postnet, duration_predictor_output, _, _ = result
                mel_loss = layers.mse_loss(mel_output, mel)
                mel_postnet_loss = layers.mse_loss(mel_output_postnet, mel)
                duration_loss = layers.mean(
                    layers.abs(
                        layers.elementwise_sub(duration_predictor_output,
                                               alignment)))
                total_loss = mel_loss + mel_postnet_loss + duration_loss

                if local_rank == 0:
                    writer.add_scalar('mel_loss',
                                      mel_loss.numpy(), global_step)
                    writer.add_scalar('post_mel_loss',
                                      mel_postnet_loss.numpy(), global_step)
                    writer.add_scalar('duration_loss',
                                      duration_loss.numpy(), global_step)
                    writer.add_scalar('learning_rate',
                                      optimizer._learning_rate.step().numpy(),
                                      global_step)

                if args.use_data_parallel:
                    total_loss = model.scale_loss(total_loss)
...@@ -123,21 +167,25 @@ def main(args):
                    model.apply_collective_grads()
                else:
                    total_loss.backward()
                optimizer.minimize(
                    total_loss,
                    grad_clip=fluid.dygraph_grad_clip.GradClipByGlobalNorm(cfg[
                        'grad_clip_thresh']))
                model.clear_gradients()

                # save checkpoint
                if local_rank == 0 and global_step % args.save_step == 0:
                    if not os.path.exists(args.save_path):
                        os.mkdir(args.save_path)
                    save_path = os.path.join(args.save_path,
                                             'fastspeech/%d' % global_step)
                    dg.save_dygraph(model.state_dict(), save_path)
                    dg.save_dygraph(optimizer.state_dict(), save_path)
        if local_rank == 0:
            writer.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Train Fastspeech model")
    add_config_options_to_parser(parser)
    args = parser.parse_args()
...
...@@ -50,7 +50,7 @@ python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog tr
If you wish to resume from an existing model, please set ``--checkpoint_path`` and ``--transformer_step``.
For more help on arguments:
``python train_transformer.py --help``.
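For example, to pick up training from a checkpoint saved at step 120000 (the step number and paths below are illustrative):

```bash
# Resume TransformerTTS training from an existing checkpoint (illustrative values).
python train_transformer.py \
    --use_gpu=1 \
    --data_path=./dataset/LJSpeech-1.1 \
    --checkpoint_path=./checkpoint \
    --transformer_step=120000
```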
## Train Vocoder
...@@ -78,7 +78,7 @@ python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog tr
```
If you wish to resume from an existing model, please set ``--checkpoint_path`` and ``--vocoder_step``.
For more help on arguments:
``python train_vocoder.py --help``.
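For example (again with illustrative paths and step number):

```bash
# Resume vocoder training from an existing checkpoint (illustrative values).
python train_vocoder.py \
    --use_gpu=1 \
    --data_path=./dataset/LJSpeech-1.1 \
    --checkpoint_path=./checkpoint \
    --vocoder_step=90000
```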
## Synthesis
...@@ -101,5 +101,5 @@ sh synthesis.sh
And the audio file will be saved in ``--sample_path``.
For more help on arguments:
``python synthesis.py --help``.
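For example, a synthesis run that loads both checkpoints from ``./checkpoint`` might look like this (step numbers are illustrative):

```bash
# Synthesize with a trained TransformerTTS and vocoder (illustrative values).
python synthesis.py \
    --use_gpu=1 \
    --checkpoint_path=./checkpoint \
    --transformer_step=160000 \
    --vocoder_step=90000 \
    --max_len=400 \
    --sample_path=./sample
```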
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
import numpy as np
import pandas as pd
...@@ -13,8 +26,15 @@ from parakeet.data.batch import TextIDBatcher, SpecBatcher
from parakeet.data.dataset import DatasetMixin, TransformDataset, CacheDataset
from parakeet.models.transformer_tts.utils import *


class LJSpeechLoader:
    def __init__(self,
                 config,
                 args,
                 nranks,
                 rank,
                 is_vocoder=False,
                 shuffle=True):
        place = fluid.CUDAPlace(rank) if args.use_gpu else fluid.CPUPlace()

        LJSPEECH_ROOT = Path(args.data_path)
...@@ -23,15 +43,28 @@ class LJSpeechLoader:
        dataset = TransformDataset(metadata, transformer)
        dataset = CacheDataset(dataset)

        sampler = DistributedSampler(
            len(metadata), nranks, rank, shuffle=shuffle)

        assert args.batch_size % nranks == 0
        each_bs = args.batch_size // nranks
        if is_vocoder:
            dataloader = DataCargo(
                dataset,
                sampler=sampler,
                batch_size=each_bs,
                shuffle=shuffle,
                batch_fn=batch_examples_vocoder,
                drop_last=True)
        else:
            dataloader = DataCargo(
                dataset,
                sampler=sampler,
                batch_size=each_bs,
                shuffle=shuffle,
                batch_fn=batch_examples,
                drop_last=True)

        self.reader = fluid.io.DataLoader.from_generator(
            capacity=32,
            iterable=True,
...@@ -66,13 +99,13 @@ class LJSpeech(object):
        super(LJSpeech, self).__init__()
        self.config = config
        self._ljspeech_processor = audio.AudioProcessor(
            sample_rate=config['audio']['sr'],
            num_mels=config['audio']['num_mels'],
            min_level_db=config['audio']['min_level_db'],
            ref_level_db=config['audio']['ref_level_db'],
            n_fft=config['audio']['n_fft'],
            win_length=config['audio']['win_length'],
            hop_length=config['audio']['hop_length'],
            power=config['audio']['power'],
            preemphasis=config['audio']['preemphasis'],
            signal_norm=True,
...@@ -84,7 +117,7 @@ class LJSpeech(object):
            griffin_lim_iters=60,
            do_trim_silence=False,
            sound_norm=False)

    def __call__(self, metadatum):
        """All the code for generating an Example from a metadatum. If you want a
        different preprocessing pipeline, you can override this method.
...@@ -93,13 +126,15 @@ class LJSpeech(object):
        method.
        """
        fname, raw_text, normalized_text = metadatum

        # load -> trim -> preemphasis -> stft -> magnitude -> mel_scale -> logscale -> normalize
        wav = self._ljspeech_processor.load_wav(str(fname))
        mag = self._ljspeech_processor.spectrogram(wav).astype(np.float32)
        mel = self._ljspeech_processor.melspectrogram(wav).astype(np.float32)
        phonemes = np.array(
            g2p.en.text_to_sequence(normalized_text), dtype=np.int64)
        return (mag, mel, phonemes
                )  # maybe we need to implement it as a map in the future


def batch_examples(batch):
...@@ -112,52 +147,81 @@ def batch_examples(batch):
    pos_mels = []
    for data in batch:
        _, mel, text = data
        mel_inputs.append(
            np.concatenate(
                [np.zeros([mel.shape[0], 1], np.float32), mel[:, :-1]],
                axis=-1))
        mel_lens.append(mel.shape[1])
        text_lens.append(len(text))
        pos_texts.append(np.arange(1, len(text) + 1))
        pos_mels.append(np.arange(1, mel.shape[1] + 1))
        mels.append(mel)
        texts.append(text)

    # Sort by text_len in descending order
    texts = [
        i
        for i, _ in sorted(
            zip(texts, text_lens), key=lambda x: x[1], reverse=True)
    ]
    mels = [
        i
        for i, _ in sorted(
            zip(mels, text_lens), key=lambda x: x[1], reverse=True)
    ]
    mel_inputs = [
        i
        for i, _ in sorted(
            zip(mel_inputs, text_lens), key=lambda x: x[1], reverse=True)
    ]
    mel_lens = [
        i
        for i, _ in sorted(
            zip(mel_lens, text_lens), key=lambda x: x[1], reverse=True)
    ]
    pos_texts = [
        i
        for i, _ in sorted(
            zip(pos_texts, text_lens), key=lambda x: x[1], reverse=True)
    ]
    pos_mels = [
        i
        for i, _ in sorted(
            zip(pos_mels, text_lens), key=lambda x: x[1], reverse=True)
    ]
    text_lens = sorted(text_lens, reverse=True)

    # Pad sequence with largest len of the batch
    texts = TextIDBatcher(pad_id=0)(texts)  #(B, T)
    pos_texts = TextIDBatcher(pad_id=0)(pos_texts)  #(B,T)
    pos_mels = TextIDBatcher(pad_id=0)(pos_mels)  #(B,T)
    mels = np.transpose(
        SpecBatcher(pad_value=0.)(mels), axes=(0, 2, 1))  #(B,T,num_mels)
    mel_inputs = np.transpose(
        SpecBatcher(pad_value=0.)(mel_inputs), axes=(0, 2, 1))  #(B,T,num_mels)
    enc_slf_mask = get_attn_key_pad_mask(pos_texts, texts).astype(np.float32)
    enc_query_mask = get_non_pad_mask(pos_texts).astype(np.float32)
    dec_slf_mask = get_dec_attn_key_pad_mask(pos_mels,
                                             mel_inputs).astype(np.float32)
    enc_dec_mask = get_attn_key_pad_mask(enc_query_mask[:, :, 0],
                                         mel_inputs).astype(np.float32)
    dec_query_slf_mask = get_non_pad_mask(pos_mels).astype(np.float32)
    dec_query_mask = get_non_pad_mask(pos_mels).astype(np.float32)

    return (texts, mels, mel_inputs, pos_texts, pos_mels, np.array(text_lens),
            np.array(mel_lens), enc_slf_mask, enc_query_mask, dec_slf_mask,
            enc_dec_mask, dec_query_slf_mask, dec_query_mask)


def batch_examples_vocoder(batch):
    mels = []
    mags = []
    for data in batch:
        mag, mel, _ = data
        mels.append(mel)
        mags.append(mag)
    mels = np.transpose(SpecBatcher(pad_value=0.)(mels), axes=(0, 2, 1))
    mags = np.transpose(SpecBatcher(pad_value=0.)(mags), axes=(0, 2, 1))

    return (mels, mags)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse


def add_config_options_to_parser(parser):
    parser.add_argument(
        '--config_path',
        type=str,
        default='config/train_transformer.yaml',
        help="the yaml config file path.")
    parser.add_argument(
        '--batch_size', type=int, default=32, help="batch size for training.")
    parser.add_argument(
        '--epochs',
        type=int,
        default=10000,
        help="the number of epochs for training.")
    parser.add_argument(
        '--lr',
        type=float,
        default=0.001,
        help="the learning rate for training.")
    parser.add_argument(
        '--save_step',
        type=int,
        default=500,
        help="checkpointing interval during training.")
    parser.add_argument(
        '--image_step',
        type=int,
        default=2000,
        help="attention image interval during training.")
    parser.add_argument(
        '--max_len',
        type=int,
        default=400,
        help="The max length of audio when synthesis.")
    parser.add_argument(
        '--transformer_step',
        type=int,
        default=160000,
        help="Global step to restore checkpoint of transformer.")
    parser.add_argument(
        '--vocoder_step',
        type=int,
        default=90000,
        help="Global step to restore checkpoint of postnet.")
    parser.add_argument(
        '--use_gpu',
        type=int,
        default=1,
        help="use gpu or not during training.")
    parser.add_argument(
        '--use_data_parallel',
        type=int,
        default=0,
        help="use data parallel or not during training.")
    parser.add_argument(
        '--stop_token',
        type=int,
        default=0,
        help="use stop token loss in network or not.")
    parser.add_argument(
        '--data_path',
        type=str,
        default='./dataset/LJSpeech-1.1',
        help="the path of dataset.")
    parser.add_argument(
        '--checkpoint_path',
        type=str,
        default=None,
        help="the path to load checkpoint or pretrained model.")
    parser.add_argument(
        '--save_path',
        type=str,
        default='./checkpoint',
        help="the path to save checkpoint.")
    parser.add_argument(
        '--log_dir',
        type=str,
        default='./log',
        help="the directory to save tensorboard log.")
    parser.add_argument(
        '--sample_path',
        type=str,
        default='./sample',
        help="the directory to save audio sample in synthesis.")
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from scipy.io.wavfile import write
from parakeet.g2p.en import text_to_sequence
...@@ -18,6 +31,7 @@ from parakeet import audio
from parakeet.models.transformer_tts.vocoder import Vocoder
from parakeet.models.transformer_tts.transformer_tts import TransformerTTS


def load_checkpoint(step, model_path):
    model_dict, _ = fluid.dygraph.load_dygraph(os.path.join(model_path, step))
    new_state_dict = OrderedDict()
...@@ -28,6 +42,7 @@ def load_checkpoint(step, model_path):
            new_state_dict[param] = model_dict[param]
    return new_state_dict


def synthesis(text_input, args):
    place = (fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace())
...@@ -36,48 +51,57 @@ def synthesis(text_input, args):
    # tensorboard
    if not os.path.exists(args.log_dir):
        os.mkdir(args.log_dir)
    path = os.path.join(args.log_dir, 'synthesis')

    writer = SummaryWriter(path)

    with dg.guard(place):
        with fluid.unique_name.guard():
            model = TransformerTTS(cfg)
            model.set_dict(
                load_checkpoint(
                    str(args.transformer_step),
                    os.path.join(args.checkpoint_path, "transformer")))
            model.eval()

        with fluid.unique_name.guard():
            model_vocoder = Vocoder(cfg, args.batch_size)
            model_vocoder.set_dict(
                load_checkpoint(
                    str(args.vocoder_step),
                    os.path.join(args.checkpoint_path, "vocoder")))
            model_vocoder.eval()

        # init input
        text = np.asarray(text_to_sequence(text_input))
        text = fluid.layers.unsqueeze(dg.to_variable(text), [0])
        mel_input = dg.to_variable(np.zeros([1, 1, 80])).astype(np.float32)
        pos_text = np.arange(1, text.shape[1] + 1)
        pos_text = fluid.layers.unsqueeze(dg.to_variable(pos_text), [0])

        pbar = tqdm(range(args.max_len))
        for i in pbar:
            dec_slf_mask = get_triu_tensor(
                mel_input.numpy(), mel_input.numpy()).astype(np.float32)
            dec_slf_mask = fluid.layers.cast(
                dg.to_variable(dec_slf_mask == 0), np.float32)
            pos_mel = np.arange(1, mel_input.shape[1] + 1)
            pos_mel = fluid.layers.unsqueeze(dg.to_variable(pos_mel), [0])
            mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model(
                text, mel_input, pos_text, pos_mel, dec_slf_mask)
            mel_input = fluid.layers.concat(
                [mel_input, postnet_pred[:, -1:, :]], axis=1)

        mag_pred = model_vocoder(postnet_pred)

        _ljspeech_processor = audio.AudioProcessor(
            sample_rate=cfg['audio']['sr'],
            num_mels=cfg['audio']['num_mels'],
            min_level_db=cfg['audio']['min_level_db'],
            ref_level_db=cfg['audio']['ref_level_db'],
            n_fft=cfg['audio']['n_fft'],
            win_length=cfg['audio']['win_length'],
            hop_length=cfg['audio']['hop_length'],
            power=cfg['audio']['power'],
            preemphasis=cfg['audio']['preemphasis'],
            signal_norm=True,
...@@ -90,30 +114,49 @@ def synthesis(text_input, args):
            do_trim_silence=False,
            sound_norm=False)

        wav = _ljspeech_processor.inv_spectrogram(
            fluid.layers.transpose(
                fluid.layers.squeeze(mag_pred, [0]), [1, 0]).numpy())
        global_step = 0
        for i, prob in enumerate(attn_probs):
            for j in range(4):
                x = np.uint8(cm.viridis(prob.numpy()[j]) * 255)
                writer.add_image(
                    'Attention_%d_0' % global_step,
                    x,
                    i * 4 + j,
                    dataformats="HWC")

        for i, prob in enumerate(attn_enc):
            for j in range(4):
                x = np.uint8(cm.viridis(prob.numpy()[j]) * 255)
                writer.add_image(
                    'Attention_enc_%d_0' % global_step,
                    x,
                    i * 4 + j,
                    dataformats="HWC")

        for i, prob in enumerate(attn_dec):
            for j in range(4):
                x = np.uint8(cm.viridis(prob.numpy()[j]) * 255)
                writer.add_image(
                    'Attention_dec_%d_0' % global_step,
                    x,
                    i * 4 + j,
                    dataformats="HWC")

        writer.add_audio(text_input, wav, 0, cfg['audio']['sr'])
        if not os.path.exists(args.sample_path):
            os.mkdir(args.sample_path)
        write(
            os.path.join(args.sample_path, 'test.wav'), cfg['audio']['sr'],
            wav)
        writer.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Synthesis model")
    add_config_options_to_parser(parser)
    args = parser.parse_args()
    synthesis(
        "They emphasized the necessity that the information now being furnished be handled with judgment and care.",
        args)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from tqdm import tqdm
from tensorboardX import SummaryWriter
...@@ -16,8 +29,10 @@ from parakeet.models.transformer_tts.utils import cross_entropy
from data import LJSpeechLoader
from parakeet.models.transformer_tts.transformer_tts import TransformerTTS


def load_checkpoint(step, model_path):
    model_dict, opti_dict = fluid.dygraph.load_dygraph(
        os.path.join(model_path, step))
    new_state_dict = OrderedDict()
    for param in model_dict:
        if param.startswith('_layers.'):
...@@ -40,22 +55,27 @@ def main(args):
             if args.use_gpu else fluid.CPUPlace())

    if not os.path.exists(args.log_dir):
        os.mkdir(args.log_dir)
    path = os.path.join(args.log_dir, 'transformer')

    writer = SummaryWriter(path) if local_rank == 0 else None

    with dg.guard(place):
        model = TransformerTTS(cfg)

        model.train()
        optimizer = fluid.optimizer.AdamOptimizer(
            learning_rate=dg.NoamDecay(1 / (
                cfg['warm_up_step'] * (args.lr**2)), cfg['warm_up_step']),
            parameter_list=model.parameters())

        reader = LJSpeechLoader(
            cfg, args, nranks, local_rank, shuffle=True).reader()

        if args.checkpoint_path is not None:
            model_dict, opti_dict = load_checkpoint(
                str(args.transformer_step),
                os.path.join(args.checkpoint_path, "transformer"))
            model.set_dict(model_dict)
            optimizer.set_dict(opti_dict)
            global_step = args.transformer_step
...@@ -64,93 +84,122 @@ def main(args):
        if args.use_data_parallel:
            strategy = dg.parallel.prepare_context()
            model = fluid.dygraph.parallel.DataParallel(model, strategy)

        for epoch in range(args.epochs):
            pbar = tqdm(reader)
            for i, data in enumerate(pbar):
                pbar.set_description('Processing at epoch %d' % epoch)
                character, mel, mel_input, pos_text, pos_mel, text_length, _, enc_slf_mask, enc_query_mask, dec_slf_mask, enc_dec_mask, dec_query_slf_mask, dec_query_mask = data

                global_step += 1

                mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model(
                    character,
                    mel_input,
                    pos_text,
                    pos_mel,
                    dec_slf_mask=dec_slf_mask,
                    enc_slf_mask=enc_slf_mask,
                    enc_query_mask=enc_query_mask,
                    enc_dec_mask=enc_dec_mask,
                    dec_query_slf_mask=dec_query_slf_mask,
                    dec_query_mask=dec_query_mask)

                mel_loss = layers.mean(
                    layers.abs(layers.elementwise_sub(mel_pred, mel)))
                post_mel_loss = layers.mean(
                    layers.abs(layers.elementwise_sub(postnet_pred, mel)))
                loss = mel_loss + post_mel_loss

                # Note: When used stop token loss the learning did not work.
                if args.stop_token:
                    label = (pos_mel == 0).astype(np.float32)
                    stop_loss = cross_entropy(stop_preds, label)
                    loss = loss + stop_loss

                if local_rank == 0:
                    writer.add_scalars('training_loss', {
                        'mel_loss': mel_loss.numpy(),
                        'post_mel_loss': post_mel_loss.numpy()
                    }, global_step)

                    if args.stop_token:
                        writer.add_scalar('stop_loss',
                                          stop_loss.numpy(), global_step)

                    if args.use_data_parallel:
                        writer.add_scalars('alphas', {
                            'encoder_alpha':
                            model._layers.encoder.alpha.numpy(),
                            'decoder_alpha':
                            model._layers.decoder.alpha.numpy(),
                        }, global_step)
                    else:
                        writer.add_scalars('alphas', {
                            'encoder_alpha': model.encoder.alpha.numpy(),
                            'decoder_alpha': model.decoder.alpha.numpy(),
                        }, global_step)

                    writer.add_scalar('learning_rate',
                                      optimizer._learning_rate.step().numpy(),
                                      global_step)

                    if global_step % args.image_step == 1:
                        for i, prob in enumerate(attn_probs):
                            for j in range(4):
                                x = np.uint8(
                                    cm.viridis(prob.numpy()[j * 16]) * 255)
                                writer.add_image(
                                    'Attention_%d_0' % global_step,
                                    x,
                                    i * 4 + j,
                                    dataformats="HWC")

                        for i, prob in enumerate(attn_enc):
                            for j in range(4):
                                x = np.uint8(
                                    cm.viridis(prob.numpy()[j * 16]) * 255)
                                writer.add_image(
                                    'Attention_enc_%d_0' % global_step,
                                    x,
                                    i * 4 + j,
                                    dataformats="HWC")

                        for i, prob in enumerate(attn_dec):
                            for j in range(4):
                                x = np.uint8(
                                    cm.viridis(prob.numpy()[j * 16]) * 255)
                                writer.add_image(
                                    'Attention_dec_%d_0' % global_step,
                                    x,
                                    i * 4 + j,
                                    dataformats="HWC")

                if args.use_data_parallel:
                    loss = model.scale_loss(loss)
                    loss.backward()
                    model.apply_collective_grads()
                else:
                    loss.backward()
                optimizer.minimize(
                    loss,
                    grad_clip=fluid.dygraph_grad_clip.GradClipByGlobalNorm(cfg[
                        'grad_clip_thresh']))
                model.clear_gradients()

                # save checkpoint
                if local_rank == 0 and global_step % args.save_step == 0:
                    if not os.path.exists(args.save_path):
                        os.mkdir(args.save_path)
                    save_path = os.path.join(args.save_path,
                                             'transformer/%d' % global_step)
                    dg.save_dygraph(model.state_dict(), save_path)
                    dg.save_dygraph(optimizer.state_dict(), save_path)
        if local_rank == 0:
            writer.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Train TransformerTTS model")
    add_config_options_to_parser(parser)
...
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from tensorboardX import SummaryWriter
import os
from tqdm import tqdm
...@@ -13,6 +26,7 @@ import paddle.fluid.layers as layers
from data import LJSpeechLoader
from parakeet.models.transformer_tts.vocoder import Vocoder


def load_checkpoint(step, model_path):
    model_dict, opti_dict = dg.load_dygraph(os.path.join(model_path, step))
    new_state_dict = OrderedDict()
...@@ -23,8 +37,9 @@ def load_checkpoint(step, model_path):
            new_state_dict[param] = model_dict[param]
    return new_state_dict, opti_dict


def main(args):
    local_rank = dg.parallel.Env().local_rank if args.use_data_parallel else 0
    nranks = dg.parallel.Env().nranks if args.use_data_parallel else 1
...@@ -35,23 +50,26 @@ def main(args):
    place = (fluid.CUDAPlace(dg.parallel.Env().dev_id)
             if args.use_data_parallel else fluid.CUDAPlace(0)
             if args.use_gpu else fluid.CPUPlace())

    if not os.path.exists(args.log_dir):
        os.mkdir(args.log_dir)
    path = os.path.join(args.log_dir, 'vocoder')

    writer = SummaryWriter(path) if local_rank == 0 else None

    with dg.guard(place):
        model = Vocoder(cfg, args.batch_size)

        model.train()
        optimizer = fluid.optimizer.AdamOptimizer(
            learning_rate=dg.NoamDecay(1 / (
                cfg['warm_up_step'] * (args.lr**2)), cfg['warm_up_step']),
            parameter_list=model.parameters())

        if args.checkpoint_path is not None:
            model_dict, opti_dict = load_checkpoint(
                str(args.vocoder_step),
                os.path.join(args.checkpoint_path, "vocoder"))
            model.set_dict(model_dict)
            optimizer.set_dict(opti_dict)
            global_step = args.vocoder_step
...@@ -61,48 +79,55 @@ def main(args):
            strategy = dg.parallel.prepare_context()
            model = fluid.dygraph.parallel.DataParallel(model, strategy)

        reader = LJSpeechLoader(
            cfg, args, nranks, local_rank, is_vocoder=True).reader()

        for epoch in range(args.epochs):
            pbar = tqdm(reader)
            for i, data in enumerate(pbar):
                pbar.set_description('Processing at epoch %d' % epoch)
                mel, mag = data
                mag = dg.to_variable(mag.numpy())
                mel = dg.to_variable(mel.numpy())
                global_step += 1

                mag_pred = model(mel)
                loss = layers.mean(
                    layers.abs(layers.elementwise_sub(mag_pred, mag)))

                if args.use_data_parallel:
                    loss = model.scale_loss(loss)
                    loss.backward()
                    model.apply_collective_grads()
                else:
                    loss.backward()
                optimizer.minimize(
                    loss,
                    grad_clip=fluid.dygraph_grad_clip.GradClipByGlobalNorm(cfg[
                        'grad_clip_thresh']))
                model.clear_gradients()

                if local_rank == 0:
                    writer.add_scalars('training_loss', {
                        'loss': loss.numpy(),
                    }, global_step)

                    if global_step % args.save_step == 0:
                        if not os.path.exists(args.save_path):
                            os.mkdir(args.save_path)
                        save_path = os.path.join(args.save_path,
                                                 'vocoder/%d' % global_step)
                        dg.save_dygraph(model.state_dict(), save_path)
                        dg.save_dygraph(optimizer.state_dict(), save_path)

        if local_rank == 0:
            writer.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Train vocoder model")
    add_config_options_to_parser(parser)
    args = parser.parse_args()
    # Print the whole config setting.
    pprint(args)
    main(args)
...@@ -109,3 +109,13 @@ python -u benchmark.py \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --use_gpu=true
```
### Low-precision inference
This model supports float16 low-precision inference. Appending the argument
```bash
--use_fp16=true
```
to the synthesis or benchmark command enables low-precision inference and the speed-up that comes with it.
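For example, the benchmark command above becomes the following when run in float16 (same illustrative paths as before):

```bash
# Benchmark WaveFlow inference in float16 (illustrative paths and model name).
python -u benchmark.py \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --use_gpu=true --use_fp16=true
```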
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import random
from pprint import pprint
...@@ -24,9 +38,14 @@ def add_options_to_parser(parser):
    parser.add_argument(
        '--use_gpu',
        type=utils.str2bool,
        default=True,
        help="option to use gpu training")
    parser.add_argument(
        '--use_fp16',
        type=utils.str2bool,
        default=True,
        help="option to use fp16 for inference")

    parser.add_argument(
        '--iteration',
...
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import random
from pprint import pprint
...@@ -24,9 +38,14 @@ def add_options_to_parser(parser):
    parser.add_argument(
        '--use_gpu',
        type=utils.str2bool,
        default=True,
        help="option to use gpu training")
    parser.add_argument(
        '--use_fp16',
        type=utils.str2bool,
        default=True,
        help="option to use fp16 for inference")

    parser.add_argument(
        '--iteration',
...@@ -74,7 +93,6 @@ def synthesize(config):
    # Build model.
    model = WaveFlow(config, checkpoint_dir)
    model.build(training=False)
    # Obtain the current iteration.
    if config.checkpoint is None:
        if config.iteration is None:
...
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import random
import subprocess
...@@ -127,4 +141,6 @@ if __name__ == "__main__":
    # the preceding update will be overwritten by the following one.
    config = parser.parse_args()
    config = utils.add_yaml_config(config)
    # Force to use fp32 in model training
    vars(config)["use_fp16"] = False
    train(config)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import itertools import itertools
import os import os
import time import time
...@@ -126,7 +140,8 @@ def load_parameters(checkpoint_dir, ...@@ -126,7 +140,8 @@ def load_parameters(checkpoint_dir,
model, model,
optimizer=None, optimizer=None,
iteration=None, iteration=None,
file_path=None): file_path=None,
dtype="float32"):
if file_path is None: if file_path is None:
if iteration is None: if iteration is None:
iteration = load_latest_checkpoint(checkpoint_dir, rank) iteration = load_latest_checkpoint(checkpoint_dir, rank)
...@@ -135,6 +150,12 @@ def load_parameters(checkpoint_dir, ...@@ -135,6 +150,12 @@ def load_parameters(checkpoint_dir,
file_path = "{}/step-{}".format(checkpoint_dir, iteration) file_path = "{}/step-{}".format(checkpoint_dir, iteration)
model_dict, optimizer_dict = dg.load_dygraph(file_path) model_dict, optimizer_dict = dg.load_dygraph(file_path)
if dtype == "float16":
for k, v in model_dict.items():
if "conv2d_transpose" in k:
model_dict[k] = v.astype("float32")
else:
model_dict[k] = v.astype(dtype)
model.set_dict(model_dict) model.set_dict(model_dict)
print("[checkpoint] Rank {}: loaded model from {}".format(rank, file_path)) print("[checkpoint] Rank {}: loaded model from {}".format(rank, file_path))
if optimizer and optimizer_dict: if optimizer and optimizer_dict:
......
# WaveNet
Paddle implementation of WaveNet in dynamic graph, a convolutional-network-based vocoder. WaveNet is proposed in [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499), but in this experiment the implementation follows the teacher model in [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281).
## Dataset
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
## Project Structure
```text
├── data.py          data processing
├── configs/         (example) configuration files
├── synthesis.py     script to synthesize waveform from mel spectrogram
├── train.py         script to train a model
└── utils.py         utility functions
```
## Train
Train the model using `train.py`. Follow the usage displayed by `python train.py --help`.
```text
usage: train.py [-h] [--data DATA] [--config CONFIG] [--output OUTPUT]
[--device DEVICE] [--resume RESUME]
Train a wavenet model with LJSpeech.
optional arguments:
-h, --help show this help message and exit
--data DATA path of the LJspeech dataset.
--config CONFIG path of the config file.
--output OUTPUT path to save results.
--device DEVICE device to use.
--resume RESUME checkpoint to resume from.
```
1. `--config` is the configuration file to use. The provided configurations can be used directly, and you can also change some values in the configuration file to train the model with a different config.
2. `--data` is the path of the LJSpeech dataset, i.e., the folder extracted from the downloaded archive (the one that contains `metadata.csv`).
3. `--resume` is the path of the checkpoint to resume from (the saved path prefix, without the `.pdparams`/`.pdopt` extension). If it is provided, the model loads the checkpoint before training.
4. `--output` is the directory to save results; all results are saved in this directory. The structure of the output directory is shown below.
```text
├── checkpoints # checkpoint
└── log # tensorboard log
```
5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
example script:
```bash
python train.py --config=./configs/wavenet_single_gaussian.yaml --data=./LJSpeech-1.1/ --output=experiment --device=0
```
You can monitor the training log with TensorBoard, using the script below.
```bash
cd experiment/log
tensorboard --logdir=.
```
## Synthesis
```text
usage: synthesis.py [-h] [--data DATA] [--config CONFIG] [--device DEVICE]
checkpoint output
Synthesize valid data from LJspeech with a wavenet model.
positional arguments:
checkpoint checkpoint to load.
output path to save results.
optional arguments:
-h, --help show this help message and exit
--data DATA path of the LJspeech dataset.
--config CONFIG path of the config file.
--device DEVICE device to use.
```
1. `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
2. `--data` is the path of the LJSpeech dataset. A dataset is not strictly needed for synthesis, but since the input is a mel spectrogram, we compute mel spectrograms from the audio files of the dataset.
3. `checkpoint` is the checkpoint to load.
4. `output` is the directory to save results. It contains the generated audio files (`sentence_*.wav`), one per validation sentence.
5. `--device` is the device (GPU id) to use for synthesis. `-1` means CPU.
example script:
```bash
python synthesis.py --config=./configs/wavenet_single_gaussian.yaml --data=./LJSpeech-1.1/ --device=0 experiment/checkpoints/step_500000 generated
```
data:
batch_size: 16
train_clip_seconds: 0.5
sample_rate: 22050
hop_length: 256
win_length: 1024
n_fft: 2048
n_mels: 80
valid_size: 16
model:
upsampling_factors: [16, 16]
n_loop: 10
n_layer: 3
filter_size: 2
residual_channels: 128
loss_type: "mog"
output_dim: 30
log_scale_min: -9
train:
learning_rate: 0.001
anneal_rate: 0.5
anneal_interval: 200000
gradient_max_norm: 100.0
checkpoint_interval: 10000
snap_interval: 10000
eval_interval: 10000
max_iterations: 2000000
data:
batch_size: 16
train_clip_seconds: 0.5
sample_rate: 22050
hop_length: 256
win_length: 1024
n_fft: 2048
n_mels: 80
valid_size: 16
model:
upsampling_factors: [16, 16]
n_loop: 10
n_layer: 3
filter_size: 2
residual_channels: 128
loss_type: "mog"
output_dim: 3
log_scale_min: -9
train:
learning_rate: 0.001
anneal_rate: 0.5
anneal_interval: 200000
gradient_max_norm: 100.0
checkpoint_interval: 10000
snap_interval: 10000
eval_interval: 10000
max_iterations: 2000000
data:
batch_size: 16
train_clip_seconds: 0.5
sample_rate: 22050
hop_length: 256
win_length: 1024
n_fft: 2048
n_mels: 80
valid_size: 16
model:
upsampling_factors: [16, 16]
n_loop: 10
n_layer: 3
filter_size: 2
residual_channels: 128
loss_type: "softmax"
output_dim: 2048
log_scale_min: -9
train:
learning_rate: 0.001
anneal_rate: 0.5
anneal_interval: 200000
gradient_max_norm: 100.0
checkpoint_interval: 10000
snap_interval: 10000
eval_interval: 10000
max_iterations: 2000000
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import csv
import numpy as np
import librosa
from pathlib import Path
import pandas as pd
from parakeet.data import batch_spec, batch_wav
from parakeet.data import DatasetMixin
class LJSpeechMetaData(DatasetMixin):
def __init__(self, root):
self.root = Path(root)
self._wav_dir = self.root.joinpath("wavs")
csv_path = self.root.joinpath("metadata.csv")
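        # Each line of metadata.csv has three '|'-separated fields: wav file name (without extension), raw text, normalized text.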
self._table = pd.read_csv(
csv_path,
sep="|",
header=None,
quoting=csv.QUOTE_NONE,
names=["fname", "raw_text", "normalized_text"])
def get_example(self, i):
fname, raw_text, normalized_text = self._table.iloc[i]
fname = str(self._wav_dir.joinpath(fname + ".wav"))
return fname, raw_text, normalized_text
def __len__(self):
return len(self._table)
class Transform(object):
def __init__(self, sample_rate, n_fft, win_length, hop_length, n_mels):
self.sample_rate = sample_rate
self.n_fft = n_fft
self.win_length = win_length
self.hop_length = hop_length
self.n_mels = n_mels
def __call__(self, example):
wav_path, _, _ = example
sr = self.sample_rate
n_fft = self.n_fft
win_length = self.win_length
hop_length = self.hop_length
n_mels = self.n_mels
wav, loaded_sr = librosa.load(wav_path, sr=None)
        assert loaded_sr == sr, "sample rate of the audio file does not match the config"
# Pad audio to the right size.
frames = int(np.ceil(float(wav.size) / hop_length))
        fft_padding = (n_fft - hop_length) // 2  # extra samples on each side so the STFT (center=False) yields exactly `frames` frames
desired_length = frames * hop_length + fft_padding * 2
pad_amount = (desired_length - wav.size) // 2
if wav.size % 2 == 0:
wav = np.pad(wav, (pad_amount, pad_amount), mode='reflect')
else:
wav = np.pad(wav, (pad_amount, pad_amount + 1), mode='reflect')
# Normalize audio.
wav = wav / np.abs(wav).max() * 0.999
        # Compute the linear spectrogram first.
        # Set center=False to prevent librosa's internal padding.
spectrogram = librosa.core.stft(
wav,
hop_length=hop_length,
win_length=win_length,
n_fft=n_fft,
center=False)
spectrogram_magnitude = np.abs(spectrogram)
# Compute mel-spectrograms.
mel_filter_bank = librosa.filters.mel(sr=sr,
n_fft=n_fft,
n_mels=n_mels)
mel_spectrogram = np.dot(mel_filter_bank, spectrogram_magnitude)
# Rescale mel_spectrogram.
        min_level, ref_level = 1e-5, 20  # hard-coded amplitude floor and reference level (dB)
mel_spectrogram = 20 * np.log10(np.maximum(min_level, mel_spectrogram))
mel_spectrogram = mel_spectrogram - ref_level
mel_spectrogram = np.clip((mel_spectrogram + 100) / 100, 0, 1)
# Extract the center of audio that corresponds to mel spectrograms.
audio = wav[fft_padding:-fft_padding]
assert mel_spectrogram.shape[1] * hop_length == audio.size
# there is no clipping here
return audio, mel_spectrogram
class DataCollector(object):
def __init__(self,
context_size,
sample_rate,
hop_length,
train_clip_seconds,
valid=False):
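        # Each training clip covers train_clip_seconds of audio plus roughly context_size extra samples of context.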
frames_per_second = sample_rate // hop_length
train_clip_frames = int(
np.ceil(train_clip_seconds * frames_per_second))
context_frames = context_size // hop_length
self.num_frames = train_clip_frames + context_frames
self.sample_rate = sample_rate
self.hop_length = hop_length
self.valid = valid
def random_crop(self, sample):
audio, mel_spectrogram = sample
audio_frames = int(audio.size) // self.hop_length
max_start_frame = audio_frames - self.num_frames
assert max_start_frame >= 0, "audio is too short to be cropped"
        frame_start = np.random.randint(0, max_start_frame + 1)  # include the last valid start; avoids an error when max_start_frame == 0
# frame_start = 0 # norandom
frame_end = frame_start + self.num_frames
audio_start = frame_start * self.hop_length
audio_end = frame_end * self.hop_length
audio = audio[audio_start:audio_end]
return audio, mel_spectrogram, audio_start
def __call__(self, samples):
# transform them first
if self.valid:
samples = [(audio, mel_spectrogram, 0)
for audio, mel_spectrogram in samples]
else:
samples = [self.random_crop(sample) for sample in samples]
# batch them
audios = [sample[0] for sample in samples]
audio_starts = [sample[2] for sample in samples]
mels = [sample[1] for sample in samples]
mels = batch_spec(mels)
if self.valid:
audios = batch_wav(audios, dtype=np.float32)
else:
audios = np.array(audios, dtype=np.float32)
audio_starts = np.array(audio_starts, dtype=np.int64)
return audios, mels, audio_starts
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import ruamel.yaml
import argparse
from tqdm import tqdm
from tensorboardX import SummaryWriter
from paddle import fluid
import paddle.fluid.dygraph as dg
from parakeet.data import SliceDataset, TransformDataset, DataCargo, SequentialSampler, RandomSampler
from parakeet.models.wavenet import UpsampleNet, WaveNet, ConditionalWavenet
from parakeet.utils.layer_tools import summary
from data import LJSpeechMetaData, Transform, DataCollector
from utils import make_output_tree, valid_model, eval_model, save_checkpoint
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Synthesize valid data from LJspeech with a wavenet model.")
parser.add_argument(
"--data", type=str, help="path of the LJspeech dataset.")
parser.add_argument("--config", type=str, help="path of the config file.")
parser.add_argument(
"--device", type=int, default=-1, help="device to use.")
parser.add_argument("checkpoint", type=str, help="checkpoint to load.")
parser.add_argument(
"output", type=str, default="experiment", help="path to save results.")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = ruamel.yaml.safe_load(f)
ljspeech_meta = LJSpeechMetaData(args.data)
data_config = config["data"]
sample_rate = data_config["sample_rate"]
n_fft = data_config["n_fft"]
win_length = data_config["win_length"]
hop_length = data_config["hop_length"]
n_mels = data_config["n_mels"]
train_clip_seconds = data_config["train_clip_seconds"]
transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
ljspeech = TransformDataset(ljspeech_meta, transform)
valid_size = data_config["valid_size"]
ljspeech_valid = SliceDataset(ljspeech, 0, valid_size)
ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech))
model_config = config["model"]
n_loop = model_config["n_loop"]
n_layer = model_config["n_layer"]
filter_size = model_config["filter_size"]
context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
print("context size is {} samples".format(context_size))
train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
train_clip_seconds)
valid_batch_fn = DataCollector(
context_size, sample_rate, hop_length, train_clip_seconds, valid=True)
batch_size = data_config["batch_size"]
train_cargo = DataCargo(
ljspeech_train,
train_batch_fn,
batch_size,
sampler=RandomSampler(ljspeech_train))
# only batch=1 for validation is enabled
valid_cargo = DataCargo(
ljspeech_valid,
valid_batch_fn,
batch_size=1,
sampler=SequentialSampler(ljspeech_valid))
make_output_tree(args.output)
if args.device == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(args.device)
with dg.guard(place):
model_config = config["model"]
upsampling_factors = model_config["upsampling_factors"]
encoder = UpsampleNet(upsampling_factors)
n_loop = model_config["n_loop"]
n_layer = model_config["n_layer"]
residual_channels = model_config["residual_channels"]
output_dim = model_config["output_dim"]
loss_type = model_config["loss_type"]
log_scale_min = model_config["log_scale_min"]
decoder = WaveNet(n_loop, n_layer, residual_channels, output_dim,
n_mels, filter_size, loss_type, log_scale_min)
model = ConditionalWavenet(encoder, decoder)
summary(model)
model_dict, _ = dg.load_dygraph(args.checkpoint)
print("Loading from {}.pdparams".format(args.checkpoint))
model.set_dict(model_dict)
train_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
train_loader.set_batch_generator(train_cargo, place)
valid_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
valid_loader.set_batch_generator(valid_cargo, place)
eval_model(model, valid_loader, args.output, sample_rate)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import ruamel.yaml
import argparse
from tqdm import tqdm
from tensorboardX import SummaryWriter
from paddle import fluid
import paddle.fluid.dygraph as dg
from parakeet.data import SliceDataset, TransformDataset, DataCargo, SequentialSampler, RandomSampler
from parakeet.models.wavenet import UpsampleNet, WaveNet, ConditionalWavenet
from parakeet.utils.layer_tools import summary
from data import LJSpeechMetaData, Transform, DataCollector
from utils import make_output_tree, valid_model, save_checkpoint
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Train a wavenet model with LJSpeech.")
parser.add_argument(
"--data", type=str, help="path of the LJspeech dataset.")
parser.add_argument("--config", type=str, help="path of the config file.")
parser.add_argument(
"--output",
type=str,
default="experiment",
help="path to save results.")
parser.add_argument(
"--device", type=int, default=-1, help="device to use.")
parser.add_argument(
"--resume", type=str, help="checkpoint to resume from.")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = ruamel.yaml.safe_load(f)
ljspeech_meta = LJSpeechMetaData(args.data)
data_config = config["data"]
sample_rate = data_config["sample_rate"]
n_fft = data_config["n_fft"]
win_length = data_config["win_length"]
hop_length = data_config["hop_length"]
n_mels = data_config["n_mels"]
train_clip_seconds = data_config["train_clip_seconds"]
transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
ljspeech = TransformDataset(ljspeech_meta, transform)
valid_size = data_config["valid_size"]
ljspeech_valid = SliceDataset(ljspeech, 0, valid_size)
ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech))
model_config = config["model"]
n_loop = model_config["n_loop"]
n_layer = model_config["n_layer"]
filter_size = model_config["filter_size"]
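    # context_size is the receptive field (in samples) of the stacked dilated convolutions in the WaveNet decoder.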
context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
print("context size is {} samples".format(context_size))
train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
train_clip_seconds)
valid_batch_fn = DataCollector(
context_size, sample_rate, hop_length, train_clip_seconds, valid=True)
batch_size = data_config["batch_size"]
train_cargo = DataCargo(
ljspeech_train,
train_batch_fn,
batch_size,
sampler=RandomSampler(ljspeech_train))
# only batch=1 for validation is enabled
valid_cargo = DataCargo(
ljspeech_valid,
valid_batch_fn,
batch_size=1,
sampler=SequentialSampler(ljspeech_valid))
make_output_tree(args.output)
if args.device == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(args.device)
with dg.guard(place):
model_config = config["model"]
upsampling_factors = model_config["upsampling_factors"]
encoder = UpsampleNet(upsampling_factors)
n_loop = model_config["n_loop"]
n_layer = model_config["n_layer"]
residual_channels = model_config["residual_channels"]
output_dim = model_config["output_dim"]
loss_type = model_config["loss_type"]
log_scale_min = model_config["log_scale_min"]
decoder = WaveNet(n_loop, n_layer, residual_channels, output_dim,
n_mels, filter_size, loss_type, log_scale_min)
model = ConditionalWavenet(encoder, decoder)
summary(model)
train_config = config["train"]
learning_rate = train_config["learning_rate"]
anneal_rate = train_config["anneal_rate"]
anneal_interval = train_config["anneal_interval"]
lr_scheduler = dg.ExponentialDecay(
learning_rate, anneal_interval, anneal_rate, staircase=True)
optim = fluid.optimizer.Adam(
lr_scheduler, parameter_list=model.parameters())
gradiant_max_norm = train_config["gradient_max_norm"]
clipper = fluid.dygraph_grad_clip.GradClipByGlobalNorm(
gradiant_max_norm)
if args.resume:
model_dict, optim_dict = dg.load_dygraph(args.resume)
print("Loading from {}.pdparams".format(args.resume))
model.set_dict(model_dict)
if optim_dict:
optim.set_dict(optim_dict)
print("Loading from {}.pdopt".format(args.resume))
train_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
train_loader.set_batch_generator(train_cargo, place)
valid_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
valid_loader.set_batch_generator(valid_cargo, place)
max_iterations = train_config["max_iterations"]
checkpoint_interval = train_config["checkpoint_interval"]
snap_interval = train_config["snap_interval"]
eval_interval = train_config["eval_interval"]
checkpoint_dir = os.path.join(args.output, "checkpoints")
log_dir = os.path.join(args.output, "log")
writer = SummaryWriter(log_dir)
global_step = 1
while global_step <= max_iterations:
epoch_loss = 0.
for i, batch in tqdm(enumerate(train_loader)):
audio_clips, mel_specs, audio_starts = batch
model.train()
y_var = model(audio_clips, mel_specs, audio_starts)
loss_var = model.loss(y_var, audio_clips)
loss_var.backward()
loss_np = loss_var.numpy()
epoch_loss += loss_np[0]
writer.add_scalar("loss", loss_np[0], global_step)
writer.add_scalar("learning_rate",
optim._learning_rate.step().numpy()[0],
global_step)
optim.minimize(loss_var, grad_clip=clipper)
optim.clear_gradients()
print("loss: {:<8.6f}".format(loss_np[0]))
if global_step % snap_interval == 0:
valid_model(model, valid_loader, writer, global_step,
sample_rate)
if global_step % checkpoint_interval == 0:
save_checkpoint(model, optim, checkpoint_dir, global_step)
global_step += 1
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import numpy as np
import soundfile as sf
import paddle.fluid.dygraph as dg
def make_output_tree(output_dir):
checkpoint_dir = os.path.join(output_dir, "checkpoints")
if not os.path.exists(checkpoint_dir):
os.makedirs(checkpoint_dir)
state_dir = os.path.join(output_dir, "states")
if not os.path.exists(state_dir):
os.makedirs(state_dir)
def valid_model(model, valid_loader, writer, global_step, sample_rate):
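    # Teacher-forced validation: compute the loss on held-out clips and log sampled audio to TensorBoard.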
loss = []
wavs = []
model.eval()
for i, batch in enumerate(valid_loader):
# print("sentence {}".format(i))
audio_clips, mel_specs, audio_starts = batch
y_var = model(audio_clips, mel_specs, audio_starts)
wav_var = model.sample(y_var)
loss_var = model.loss(y_var, audio_clips)
loss.append(loss_var.numpy()[0])
wavs.append(wav_var.numpy()[0])
average_loss = np.mean(loss)
writer.add_scalar("valid_loss", average_loss, global_step)
for i, wav in enumerate(wavs):
writer.add_audio("valid/sample_{}".format(i), wav, global_step,
sample_rate)
def eval_model(model, valid_loader, output_dir, sample_rate):
model.eval()
for i, batch in enumerate(valid_loader):
# print("sentence {}".format(i))
path = os.path.join(output_dir, "sentence_{}.wav".format(i))
audio_clips, mel_specs, audio_starts = batch
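        # Autoregressive synthesis conditioned only on the mel spectrogram; the ground-truth audio is not used.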
wav_var = model.synthesis(mel_specs)
wav_np = wav_var.numpy()[0]
sf.write(path, wav_np, samplerate=sample_rate)
print("generated {}".format(path))
def save_checkpoint(model, optim, checkpoint_dir, global_step):
checkpoint_path = os.path.join(checkpoint_dir,
"step_{:09d}".format(global_step))
dg.save_dygraph(model.state_dict(), checkpoint_path)
dg.save_dygraph(optim.state_dict(), checkpoint_path)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
__version__ = "0.0.0" __version__ = "0.0.0"
from . import data, g2p, models, modules from . import data, g2p, models, modules
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .audio import AudioProcessor from .audio import AudioProcessor
\ No newline at end of file
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import librosa import librosa
import soundfile as sf import soundfile as sf
import numpy as np import numpy as np
import scipy.io import scipy.io
import scipy.signal import scipy.signal
class AudioProcessor(object): class AudioProcessor(object):
def __init__(self, def __init__(
sample_rate=None, # int, sampling rate self,
num_mels=None, # int, bands of mel spectrogram sample_rate=None, # int, sampling rate
min_level_db=None, # float, minimum level db num_mels=None, # int, bands of mel spectrogram
ref_level_db=None, # float, reference level db min_level_db=None, # float, minimum level db
n_fft=None, # int: number of samples in a frame for stft ref_level_db=None, # float, reference level db
win_length=None, # int: the same meaning with n_fft n_fft=None, # int: number of samples in a frame for stft
hop_length=None, # int: number of samples between neighboring frame win_length=None, # int: the same meaning with n_fft
power=None, # float:power to raise before griffin-lim hop_length=None, # int: number of samples between neighboring frame
preemphasis=None, # float: preemphasis coefficident power=None, # float:power to raise before griffin-lim
signal_norm=None, # preemphasis=None, # float: preemphasis coefficident
symmetric_norm=False, # bool, apply clip norm in [-max_norm, max_form] signal_norm=None, #
max_norm=None, # float, max norm symmetric_norm=False, # bool, apply clip norm in [-max_norm, max_form]
mel_fmin=None, # int: mel spectrogram's minimum frequency max_norm=None, # float, max norm
mel_fmax=None, # int: mel spectrogram's maximum frequency mel_fmin=None, # int: mel spectrogram's minimum frequency
clip_norm=True, # bool: clip spectrogram's norm mel_fmax=None, # int: mel spectrogram's maximum frequency
griffin_lim_iters=None, # int: clip_norm=True, # bool: clip spectrogram's norm
do_trim_silence=False, # bool: trim silence griffin_lim_iters=None, # int:
sound_norm=False, do_trim_silence=False, # bool: trim silence
**kwargs): sound_norm=False,
**kwargs):
self.sample_rate = sample_rate self.sample_rate = sample_rate
self.num_mels = num_mels self.num_mels = num_mels
self.min_level_db = min_level_db self.min_level_db = min_level_db
...@@ -34,8 +50,8 @@ class AudioProcessor(object): ...@@ -34,8 +50,8 @@ class AudioProcessor(object):
self.n_fft = n_fft self.n_fft = n_fft
self.win_length = win_length or n_fft self.win_length = win_length or n_fft
# hop length defaults to 1/4 window_length # hop length defaults to 1/4 window_length
self.hop_length = hop_length or 0.25 * self.win_length self.hop_length = hop_length or 0.25 * self.win_length
self.power = power self.power = power
self.preemphasis = float(preemphasis) self.preemphasis = float(preemphasis)
...@@ -52,7 +68,8 @@ class AudioProcessor(object): ...@@ -52,7 +68,8 @@ class AudioProcessor(object):
self.do_trim_silence = do_trim_silence self.do_trim_silence = do_trim_silence
self.sound_norm = sound_norm self.sound_norm = sound_norm
self.num_freq, self.frame_length_ms, self.frame_shift_ms = self._stft_parameters() self.num_freq, self.frame_length_ms, self.frame_shift_ms = self._stft_parameters(
)
def _stft_parameters(self): def _stft_parameters(self):
"""compute frame length and hop length in ms""" """compute frame length and hop length in ms"""
...@@ -65,44 +82,54 @@ class AudioProcessor(object): ...@@ -65,44 +82,54 @@ class AudioProcessor(object):
"""object repr""" """object repr"""
cls_name_str = self.__class__.__name__ cls_name_str = self.__class__.__name__
members = vars(self) members = vars(self)
dict_str = "\n".join([" {}: {},".format(k, v) for k, v in members.items()]) dict_str = "\n".join(
[" {}: {},".format(k, v) for k, v in members.items()])
repr_str = "{}(\n{})\n".format(cls_name_str, dict_str) repr_str = "{}(\n{})\n".format(cls_name_str, dict_str)
return repr_str return repr_str
def save_wav(self, path, wav): def save_wav(self, path, wav):
"""save audio with scipy.io.wavfile in 16bit integers""" """save audio with scipy.io.wavfile in 16bit integers"""
wav_norm = wav * (32767 / max(0.01, np.max(np.abs(wav)))) wav_norm = wav * (32767 / max(0.01, np.max(np.abs(wav))))
scipy.io.wavfile.write(path, self.sample_rate, wav_norm.as_type(np.int16)) scipy.io.wavfile.write(path, self.sample_rate,
wav_norm.as_type(np.int16))
def load_wav(self, path, sr=None): def load_wav(self, path, sr=None):
"""load wav -> trim_silence -> rescale""" """load wav -> trim_silence -> rescale"""
x, sr = librosa.load(path, sr=None) x, sr = librosa.load(path, sr=None)
assert self.sample_rate == sr, "audio sample rate: {}Hz != processor sample rate: {}Hz".format(sr, self.sample_rate) assert self.sample_rate == sr, "audio sample rate: {}Hz != processor sample rate: {}Hz".format(
sr, self.sample_rate)
if self.do_trim_silence: if self.do_trim_silence:
try: try:
x = self.trim_silence(x) x = self.trim_silence(x)
except ValueError: except ValueError:
print(" [!] File cannot be trimmed for silence - {}".format(path)) print(" [!] File cannot be trimmed for silence - {}".format(
path))
if self.sound_norm: if self.sound_norm:
x = x / x.max() * 0.9 # why 0.9 ? x = x / x.max() * 0.9 # why 0.9 ?
return x return x
def trim_silence(self, wav): def trim_silence(self, wav):
"""Trim soilent parts with a threshold and 0.01s margin""" """Trim soilent parts with a threshold and 0.01s margin"""
margin = int(self.sample_rate * 0.01) margin = int(self.sample_rate * 0.01)
wav = wav[margin: -margin] wav = wav[margin:-margin]
trimed_wav = librosa.effects.trim(wav, top_db=60, frame_length=self.win_length, hop_length=self.hop_length)[0] trimed_wav = librosa.effects.trim(
wav,
top_db=60,
frame_length=self.win_length,
hop_length=self.hop_length)[0]
return trimed_wav return trimed_wav
def apply_preemphasis(self, x): def apply_preemphasis(self, x):
if self.preemphasis == 0.: if self.preemphasis == 0.:
raise RuntimeError(" !! Preemphasis coefficient should be positive. ") raise RuntimeError(
" !! Preemphasis coefficient should be positive. ")
return scipy.signal.lfilter([1., -self.preemphasis], [1.], x) return scipy.signal.lfilter([1., -self.preemphasis], [1.], x)
def apply_inv_preemphasis(self, x): def apply_inv_preemphasis(self, x):
if self.preemphasis == 0.: if self.preemphasis == 0.:
raise RuntimeError(" !! Preemphasis coefficient should be positive. ") raise RuntimeError(
" !! Preemphasis coefficient should be positive. ")
return scipy.signal.lfilter([1.], [1., -self.preemphasis], x) return scipy.signal.lfilter([1.], [1., -self.preemphasis], x)
def _amplitude_to_db(self, x): def _amplitude_to_db(self, x):
...@@ -125,12 +152,11 @@ class AudioProcessor(object): ...@@ -125,12 +152,11 @@ class AudioProcessor(object):
"""return mel basis for mel scale""" """return mel basis for mel scale"""
if self.mel_fmax is not None: if self.mel_fmax is not None:
assert self.mel_fmax <= self.sample_rate // 2 assert self.mel_fmax <= self.sample_rate // 2
return librosa.filters.mel( return librosa.filters.mel(self.sample_rate,
self.sample_rate, self.n_fft,
self.n_fft, n_mels=self.num_mels,
n_mels=self.num_mels, fmin=self.mel_fmin,
fmin=self.mel_fmin, fmax=self.mel_fmax)
fmax=self.mel_fmax)
def _normalize(self, S): def _normalize(self, S):
"""put values in [0, self.max_norm] or [-self.max_norm, self,max_norm]""" """put values in [0, self.max_norm] or [-self.max_norm, self,max_norm]"""
...@@ -156,25 +182,29 @@ class AudioProcessor(object): ...@@ -156,25 +182,29 @@ class AudioProcessor(object):
if self.symmetric_norm: if self.symmetric_norm:
if self.clip_norm: if self.clip_norm:
S_denorm = np.clip(S_denorm, -self.max_norm, self.max_norm) S_denorm = np.clip(S_denorm, -self.max_norm, self.max_norm)
S_denorm = (S_denorm + self.max_norm) * (-self.min_level_db) / (2 * self.max_norm) + self.min_level_db S_denorm = (S_denorm + self.max_norm) * (
-self.min_level_db) / (2 * self.max_norm
) + self.min_level_db
return S_denorm return S_denorm
else: else:
if self.clip_norm: if self.clip_norm:
S_denorm = np.clip(S_denorm, 0, self.max_norm) S_denorm = np.clip(S_denorm, 0, self.max_norm)
S_denorm = S_denorm * (-self.min_level_db)/ self.max_norm + self.min_level_db S_denorm = S_denorm * (-self.min_level_db
) / self.max_norm + self.min_level_db
return S_denorm return S_denorm
else: else:
return S return S
def _stft(self, y): def _stft(self, y):
return librosa.stft( return librosa.stft(
y=y, y=y,
n_fft=self.n_fft, n_fft=self.n_fft,
win_length=self.win_length, win_length=self.win_length,
hop_length=self.hop_length) hop_length=self.hop_length)
def _istft(self, S): def _istft(self, S):
return librosa.istft(S, hop_length=self.hop_length, win_length=self.win_length) return librosa.istft(
S, hop_length=self.hop_length, win_length=self.win_length)
def spectrogram(self, y): def spectrogram(self, y):
"""compute linear spectrogram(amplitude) """compute linear spectrogram(amplitude)
...@@ -195,7 +225,8 @@ class AudioProcessor(object): ...@@ -195,7 +225,8 @@ class AudioProcessor(object):
D = self._stft(self.apply_preemphasis(y)) D = self._stft(self.apply_preemphasis(y))
else: else:
D = self._stft(y) D = self._stft(y)
S = self._amplitude_to_db(self._linear_to_mel(np.abs(D))) - self.ref_level_db S = self._amplitude_to_db(self._linear_to_mel(np.abs(
D))) - self.ref_level_db
return self._normalize(S) return self._normalize(S)
def inv_spectrogram(self, spectrogram): def inv_spectrogram(self, spectrogram):
...@@ -203,16 +234,16 @@ class AudioProcessor(object): ...@@ -203,16 +234,16 @@ class AudioProcessor(object):
S = self._denormalize(spectrogram) S = self._denormalize(spectrogram)
S = self._db_to_amplitude(S + self.ref_level_db) S = self._db_to_amplitude(S + self.ref_level_db)
if self.preemphasis: if self.preemphasis:
return self.apply_inv_preemphasis(self._griffin_lim(S ** self.power)) return self.apply_inv_preemphasis(self._griffin_lim(S**self.power))
return self._griffin_lim(S ** self.power) return self._griffin_lim(S**self.power)
def inv_melspectrogram(self, mel_spectrogram): def inv_melspectrogram(self, mel_spectrogram):
S = self._denormalize(mel_spectrogram) S = self._denormalize(mel_spectrogram)
S = self._db_to_amplitude(S + self.ref_level_db) S = self._db_to_amplitude(S + self.ref_level_db)
S = self._mel_to_linear(np.abs(S)) S = self._mel_to_linear(np.abs(S))
if self.preemphasis: if self.preemphasis:
return self.apply_inv_preemphasis(self._griffin_lim(S ** self.power)) return self.apply_inv_preemphasis(self._griffin_lim(S**self.power))
return self._griffin_lim(S ** self.power) return self._griffin_lim(S**self.power)
def out_linear_to_mel(self, linear_spec): def out_linear_to_mel(self, linear_spec):
"""convert output linear spec to mel spec""" """convert output linear spec to mel spec"""
...@@ -222,7 +253,7 @@ class AudioProcessor(object): ...@@ -222,7 +253,7 @@ class AudioProcessor(object):
S = self._amplitude_to_db(S) - self.ref_level_db S = self._amplitude_to_db(S) - self.ref_level_db
mel = self._normalize(S) mel = self._normalize(S)
return mel return mel
def _griffin_lim(self, S): def _griffin_lim(self, S):
angles = np.exp(2j * np.pi * np.random.rand(*S.shape)) angles = np.exp(2j * np.pi * np.random.rand(*S.shape))
S_complex = np.abs(S).astype(np.complex) S_complex = np.abs(S).astype(np.complex)
...@@ -234,18 +265,18 @@ class AudioProcessor(object): ...@@ -234,18 +265,18 @@ class AudioProcessor(object):
@staticmethod @staticmethod
def mulaw_encode(wav, qc): def mulaw_encode(wav, qc):
mu = 2 ** qc - 1 mu = 2**qc - 1
# wav_abs = np.minimum(np.abs(wav), 1.0) # wav_abs = np.minimum(np.abs(wav), 1.0)
signal = np.sign(wav) * np.log(1 + mu * np.abs(wav)) / np.log(1. + mu) signal = np.sign(wav) * np.log(1 + mu * np.abs(wav)) / np.log(1. + mu)
# Quantize signal to the specified number of levels. # Quantize signal to the specified number of levels.
signal = (signal + 1) / 2 * mu + 0.5 signal = (signal + 1) / 2 * mu + 0.5
return np.floor(signal,) return np.floor(signal, )
@staticmethod @staticmethod
def mulaw_decode(wav, qc): def mulaw_decode(wav, qc):
"""Recovers waveform from quantized values.""" """Recovers waveform from quantized values."""
mu = 2 ** qc - 1 mu = 2**qc - 1
x = np.sign(wav) / mu * ((1 + mu) ** np.abs(wav) - 1) x = np.sign(wav) / mu * ((1 + mu)**np.abs(wav) - 1)
return x return x
@staticmethod @staticmethod
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .dataset import * from .dataset import *
from .datacargo import * from .datacargo import *
from .sampler import * from .sampler import *
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" """
functions to make batch for arrays which satisfy some conditions. functions to make batch for arrays which satisfy some conditions.
""" """
import numpy as np import numpy as np
class TextIDBatcher(object): class TextIDBatcher(object):
"""A wrapper class for a function to build a functor, which holds the configs to pass to the function.""" """A wrapper class for a function to build a functor, which holds the configs to pass to the function."""
def __init__(self, pad_id=0, dtype=np.int64): def __init__(self, pad_id=0, dtype=np.int64):
self.pad_id = pad_id self.pad_id = pad_id
self.dtype = dtype self.dtype = dtype
def __call__(self, minibatch): def __call__(self, minibatch):
out = batch_text_id(minibatch, pad_id=self.pad_id, dtype=self.dtype) out = batch_text_id(minibatch, pad_id=self.pad_id, dtype=self.dtype)
return out return out
def batch_text_id(minibatch, pad_id=0, dtype=np.int64): def batch_text_id(minibatch, pad_id=0, dtype=np.int64):
""" """
minibatch: List[Example] minibatch: List[Example]
...@@ -20,26 +36,32 @@ def batch_text_id(minibatch, pad_id=0, dtype=np.int64): ...@@ -20,26 +36,32 @@ def batch_text_id(minibatch, pad_id=0, dtype=np.int64):
""" """
peek_example = minibatch[0] peek_example = minibatch[0]
assert len(peek_example.shape) == 1, "text example is an 1D tensor" assert len(peek_example.shape) == 1, "text example is an 1D tensor"
lengths = [example.shape[0] for example in minibatch] # assume (channel, n_samples) or (n_samples, ) lengths = [example.shape[0] for example in minibatch
] # assume (channel, n_samples) or (n_samples, )
max_len = np.max(lengths) max_len = np.max(lengths)
batch = [] batch = []
for example in minibatch: for example in minibatch:
pad_len = max_len - example.shape[0] pad_len = max_len - example.shape[0]
batch.append(np.pad(example, [(0, pad_len)], mode='constant', constant_values=pad_id)) batch.append(
np.pad(example, [(0, pad_len)],
mode='constant',
constant_values=pad_id))
return np.array(batch, dtype=dtype) return np.array(batch, dtype=dtype)
class WavBatcher(object): class WavBatcher(object):
def __init__(self, pad_value=0., dtype=np.float32): def __init__(self, pad_value=0., dtype=np.float32):
self.pad_value = pad_value self.pad_value = pad_value
self.dtype = dtype self.dtype = dtype
def __call__(self, minibatch): def __call__(self, minibatch):
out = batch_wav(minibatch, pad_value=self.pad_value, dtype=self.dtype) out = batch_wav(minibatch, pad_value=self.pad_value, dtype=self.dtype)
return out return out
def batch_wav(minibatch, pad_value=0., dtype=np.float32): def batch_wav(minibatch, pad_value=0., dtype=np.float32):
""" """
minibatch: List[Example] minibatch: List[Example]
...@@ -51,18 +73,25 @@ def batch_wav(minibatch, pad_value=0., dtype=np.float32): ...@@ -51,18 +73,25 @@ def batch_wav(minibatch, pad_value=0., dtype=np.float32):
mono_channel = True mono_channel = True
elif len(peek_example.shape) == 2: elif len(peek_example.shape) == 2:
mono_channel = False mono_channel = False
lengths = [example.shape[-1] for example in minibatch] # assume (channel, n_samples) or (n_samples, ) lengths = [example.shape[-1] for example in minibatch
] # assume (channel, n_samples) or (n_samples, )
max_len = np.max(lengths) max_len = np.max(lengths)
batch = [] batch = []
for example in minibatch: for example in minibatch:
pad_len = max_len - example.shape[-1] pad_len = max_len - example.shape[-1]
if mono_channel: if mono_channel:
batch.append(np.pad(example, [(0, pad_len)], mode='constant', constant_values=pad_value)) batch.append(
np.pad(example, [(0, pad_len)],
mode='constant',
constant_values=pad_value))
else: else:
batch.append(np.pad(example, [(0, 0), (0, pad_len)], mode='constant', constant_values=pad_value)) # what about PCM, no batch.append(
np.pad(example, [(0, 0), (0, pad_len)],
mode='constant',
constant_values=pad_value)) # what about PCM, no
return np.array(batch, dtype=dtype) return np.array(batch, dtype=dtype)
...@@ -75,6 +104,7 @@ class SpecBatcher(object): ...@@ -75,6 +104,7 @@ class SpecBatcher(object):
out = batch_spec(minibatch, pad_value=self.pad_value, dtype=self.dtype) out = batch_spec(minibatch, pad_value=self.pad_value, dtype=self.dtype)
return out return out
def batch_spec(minibatch, pad_value=0., dtype=np.float32): def batch_spec(minibatch, pad_value=0., dtype=np.float32):
""" """
minibatch: List[Example] minibatch: List[Example]
...@@ -86,16 +116,23 @@ def batch_spec(minibatch, pad_value=0., dtype=np.float32): ...@@ -86,16 +116,23 @@ def batch_spec(minibatch, pad_value=0., dtype=np.float32):
mono_channel = True mono_channel = True
elif len(peek_example.shape) == 3: elif len(peek_example.shape) == 3:
mono_channel = False mono_channel = False
lengths = [example.shape[-1] for example in minibatch] # assume (channel, F, n_frame) or (F, n_frame) lengths = [example.shape[-1] for example in minibatch
max_len = np.max(lengths) ] # assume (channel, F, n_frame) or (F, n_frame)
max_len = np.max(lengths)
batch = [] batch = []
for example in minibatch: for example in minibatch:
pad_len = max_len - example.shape[-1] pad_len = max_len - example.shape[-1]
if mono_channel: if mono_channel:
batch.append(np.pad(example, [(0, 0), (0, pad_len)], mode='constant', constant_values=pad_value)) batch.append(
np.pad(example, [(0, 0), (0, pad_len)],
mode='constant',
constant_values=pad_value))
else: else:
batch.append(np.pad(example, [(0, 0), (0, 0), (0, pad_len)], mode='constant', constant_values=pad_value)) # what about PCM, no batch.append(
np.pad(example, [(0, 0), (0, 0), (0, pad_len)],
return np.array(batch, dtype=dtype) mode='constant',
\ No newline at end of file constant_values=pad_value)) # what about PCM, no
return np.array(batch, dtype=dtype)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import six
from .sampler import SequentialSampler, RandomSampler, BatchSampler from .sampler import SequentialSampler, RandomSampler, BatchSampler
...@@ -84,7 +99,11 @@ class DataIterator(object): ...@@ -84,7 +99,11 @@ class DataIterator(object):
return minibatch return minibatch
def _next_index(self): def _next_index(self):
return next(self._sampler_iter) if six.PY3:
return next(self._sampler_iter)
else:
# six.PY2
return self._sampler_iter.next()
def __len__(self): def __len__(self):
return len(self._index_sampler) return len(self._index_sampler)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import six import six
import numpy as np import numpy as np
from tqdm import tqdm from tqdm import tqdm
...@@ -10,8 +24,7 @@ class DatasetMixin(object): ...@@ -10,8 +24,7 @@ class DatasetMixin(object):
if isinstance(index, slice): if isinstance(index, slice):
start, stop, step = index.indices(len(self)) start, stop, step = index.indices(len(self))
return [ return [
self.get_example(i) self.get_example(i) for i in six.moves.range(start, stop, step)
for i in six.moves.range(start, stop, step)
] ]
elif isinstance(index, (list, np.ndarray)): elif isinstance(index, (list, np.ndarray)):
return [self.get_example(i) for i in index] return [self.get_example(i) for i in index]
...@@ -46,6 +59,7 @@ class TransformDataset(DatasetMixin): ...@@ -46,6 +59,7 @@ class TransformDataset(DatasetMixin):
in_data = self._dataset[i] in_data = self._dataset[i]
return self._transform(in_data) return self._transform(in_data)
class CacheDataset(DatasetMixin): class CacheDataset(DatasetMixin):
def __init__(self, dataset): def __init__(self, dataset):
self._dataset = dataset self._dataset = dataset
...@@ -58,6 +72,7 @@ class CacheDataset(DatasetMixin): ...@@ -58,6 +72,7 @@ class CacheDataset(DatasetMixin):
def get_example(self, i): def get_example(self, i):
return self._cache[i] return self._cache[i]
class TupleDataset(object): class TupleDataset(object):
def __init__(self, *datasets): def __init__(self, *datasets):
if not datasets: if not datasets:
...@@ -133,7 +148,7 @@ class SliceDataset(DatasetMixin): ...@@ -133,7 +148,7 @@ class SliceDataset(DatasetMixin):
format(len(order), len(dataset))) format(len(order), len(dataset)))
self._order = order self._order = order
def len(self): def __len__(self):
return self._size return self._size
def get_example(self, i): def get_example(self, i):
...@@ -192,8 +207,7 @@ class ChainDataset(DatasetMixin): ...@@ -192,8 +207,7 @@ class ChainDataset(DatasetMixin):
def get_example(self, i): def get_example(self, i):
if i < 0: if i < 0:
raise IndexError( raise IndexError("ChainDataset doesnot support negative indexing.")
"ChainDataset doesnot support negative indexing.")
for dataset in self._datasets: for dataset in self._datasets:
if i < len(dataset): if i < len(dataset):
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" """
At most cases, we have non-stream dataset, which means we can random access it with __getitem__, and we can get the length of the dataset with __len__. At most cases, we have non-stream dataset, which means we can random access it with __getitem__, and we can get the length of the dataset with __len__.
...@@ -6,10 +19,10 @@ This suffices for a sampler. We implemente sampler as iterable of valid indices. ...@@ -6,10 +19,10 @@ This suffices for a sampler. We implemente sampler as iterable of valid indices.
So the sampler is only responsible for generating valid indices. So the sampler is only responsible for generating valid indices.
""" """
import numpy as np import numpy as np
import random import random
class Sampler(object): class Sampler(object):
def __init__(self, data_source): def __init__(self, data_source):
pass pass
...@@ -23,7 +36,7 @@ class Sampler(object): ...@@ -23,7 +36,7 @@ class Sampler(object):
class SequentialSampler(Sampler): class SequentialSampler(Sampler):
def __init__(self, data_source): def __init__(self, data_source):
self.data_source = data_source self.data_source = data_source
def __iter__(self): def __iter__(self):
return iter(range(len(self.data_source))) return iter(range(len(self.data_source)))
...@@ -42,12 +55,14 @@ class RandomSampler(Sampler): ...@@ -42,12 +55,14 @@ class RandomSampler(Sampler):
"replacement={}".format(self.replacement)) "replacement={}".format(self.replacement))
if self._num_samples is not None and not replacement: if self._num_samples is not None and not replacement:
raise ValueError("With replacement=False, num_samples should not be specified, " raise ValueError(
"since a random permutation will be performed.") "With replacement=False, num_samples should not be specified, "
"since a random permutation will be performed.")
if not isinstance(self.num_samples, int) or self.num_samples <= 0: if not isinstance(self.num_samples, int) or self.num_samples <= 0:
raise ValueError("num_samples should be a positive integer " raise ValueError("num_samples should be a positive integer "
"value, but got num_samples={}".format(self.num_samples)) "value, but got num_samples={}".format(
self.num_samples))
@property @property
def num_samples(self): def num_samples(self):
...@@ -59,7 +74,9 @@ class RandomSampler(Sampler): ...@@ -59,7 +74,9 @@ class RandomSampler(Sampler):
def __iter__(self): def __iter__(self):
n = len(self.data_source) n = len(self.data_source)
if self.replacement: if self.replacement:
return iter(np.random.randint(0, n, size=(self.num_samples,), dtype=np.int64).tolist()) return iter(
np.random.randint(
0, n, size=(self.num_samples, ), dtype=np.int64).tolist())
return iter(np.random.permutation(n).tolist()) return iter(np.random.permutation(n).tolist())
def __len__(self): def __len__(self):
...@@ -76,7 +93,8 @@ class SubsetRandomSampler(Sampler): ...@@ -76,7 +93,8 @@ class SubsetRandomSampler(Sampler):
self.indices = indices self.indices = indices
def __iter__(self): def __iter__(self):
return (self.indices[i] for i in np.random.permutation(len(self.indices))) return (self.indices[i]
for i in np.random.permutation(len(self.indices)))
def __len__(self): def __len__(self):
return len(self.indices) return len(self.indices)
...@@ -89,9 +107,14 @@ class PartialyRandomizedSimilarTimeLengthSampler(Sampler): ...@@ -89,9 +107,14 @@ class PartialyRandomizedSimilarTimeLengthSampler(Sampler):
3. Permutate mini-batchs 3. Permutate mini-batchs
""" """
def __init__(self, lengths, batch_size=4, batch_group_size=None, def __init__(self,
lengths,
batch_size=4,
batch_group_size=None,
permutate=True): permutate=True):
_lengths = np.array(lengths, dtype=np.int64) # maybe better implement length as a sort key _lengths = np.array(
lengths,
dtype=np.int64) # maybe better implement length as a sort key
self.lengths = np.sort(_lengths) self.lengths = np.sort(_lengths)
self.sorted_indices = np.argsort(_lengths) self.sorted_indices = np.argsort(_lengths)
...@@ -112,20 +135,21 @@ class PartialyRandomizedSimilarTimeLengthSampler(Sampler): ...@@ -112,20 +135,21 @@ class PartialyRandomizedSimilarTimeLengthSampler(Sampler):
for i in range(len(indices) // batch_group_size): for i in range(len(indices) // batch_group_size):
s = i * batch_group_size s = i * batch_group_size
e = s + batch_group_size e = s + batch_group_size
random.shuffle(indices[s: e]) # inplace random.shuffle(indices[s:e]) # inplace
# Permutate batches # Permutate batches
if self.permutate: if self.permutate:
perm = np.arange(len(indices[:e]) // self.batch_size) perm = np.arange(len(indices[:e]) // self.batch_size)
random.shuffle(perm) random.shuffle(perm)
indices[:e] = indices[:e].reshape(-1, self.batch_size)[perm, :].reshape(-1) indices[:e] = indices[:e].reshape(
-1, self.batch_size)[perm, :].reshape(-1)
# Handle last elements # Handle last elements
s += batch_group_size s += batch_group_size
#print(indices) #print(indices)
if s < len(indices): if s < len(indices):
random.shuffle(indices[s:]) random.shuffle(indices[s:])
return iter(indices) return iter(indices)
def __len__(self): def __len__(self):
...@@ -150,14 +174,19 @@ class WeightedRandomSampler(Sampler): ...@@ -150,14 +174,19 @@ class WeightedRandomSampler(Sampler):
def __init__(self, weights, num_samples, replacement): def __init__(self, weights, num_samples, replacement):
if not isinstance(num_samples, int) or num_samples <= 0: if not isinstance(num_samples, int) or num_samples <= 0:
raise ValueError("num_samples should be a positive integer " raise ValueError("num_samples should be a positive integer "
"value, but got num_samples={}".format(num_samples)) "value, but got num_samples={}".format(
num_samples))
self.weights = np.array(weights, dtype=np.float64) self.weights = np.array(weights, dtype=np.float64)
self.num_samples = num_samples self.num_samples = num_samples
self.replacement = replacement self.replacement = replacement
def __iter__(self): def __iter__(self):
return iter(np.random.choice(len(self.weights), size=(self.num_samples, ), return iter(
replace=self.replacement, p=self.weights).tolist()) np.random.choice(
len(self.weights),
size=(self.num_samples, ),
replace=self.replacement,
p=self.weights).tolist())
def __len__(self): def __len__(self):
return self.num_samples return self.num_samples
...@@ -184,7 +213,7 @@ class DistributedSampler(Sampler): ...@@ -184,7 +213,7 @@ class DistributedSampler(Sampler):
# Subset samples for each trainer. # Subset samples for each trainer.
indices = indices[self.rank:self.total_size:self.num_trainers] indices = indices[self.rank:self.total_size:self.num_trainers]
assert len(indices) == self.num_samples assert len(indices) == self.num_samples
return iter(indices) return iter(indices)
...@@ -209,8 +238,7 @@ class BatchSampler(Sampler): ...@@ -209,8 +238,7 @@ class BatchSampler(Sampler):
def __init__(self, sampler, batch_size, drop_last): def __init__(self, sampler, batch_size, drop_last):
if not isinstance(sampler, Sampler): if not isinstance(sampler, Sampler):
raise ValueError("sampler should be an instance of " raise ValueError("sampler should be an instance of "
"Sampler, but got sampler={}" "Sampler, but got sampler={}".format(sampler))
.format(sampler))
if not isinstance(batch_size, int) or batch_size <= 0: if not isinstance(batch_size, int) or batch_size <= 0:
raise ValueError("batch_size should be a positive integer value, " raise ValueError("batch_size should be a positive integer value, "
"but got batch_size={}".format(batch_size)) "but got batch_size={}".format(batch_size))
......
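To make the samplers above concrete, here is a rough usage sketch. The constructor arguments follow the signatures shown in this diff; the iteration behaviour of `BatchSampler` is assumed to follow the usual batch-sampler protocol, and the surrounding loop is hypothetical.

```python
import numpy as np

# lengths (e.g. number of frames) of the examples in a hypothetical dataset
lengths = np.random.randint(100, 800, size=256).tolist()

# draw indices so that examples of similar length end up close together
sampler = PartialyRandomizedSimilarTimeLengthSampler(
    lengths, batch_size=4, batch_group_size=None, permutate=True)

# group the sampled indices into mini-batches
batch_sampler = BatchSampler(sampler, batch_size=4, drop_last=True)

for batch_indices in batch_sampler:
    # each batch holds indices of similarly sized examples,
    # which keeps padding overhead small
    pass
```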
...@@ -14,9 +14,4 @@ One of the reasons we choose to load data lazily (only load metadata before hand ...@@ -14,9 +14,4 @@ One of the reasons we choose to load data lazily (only load metadata before hand
In deep learning practice we typically batch examples, so a dataset should come with a method to batch them. Assume each record is a tuple of several items. When an item is a fixed-size array, batching is trivial: `np.stack` suffices. For arrays of dynamic size, padding is needed first. We therefore implement a batching method per item, and batching a record is composed from these methods. Each dataset should implement a `_batch_examples` method, but in most cases you can pick a ready-made batcher from `batching.py`.
That is it!
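As a rough illustration of the per-item batching idea described above (a minimal sketch; `pad_and_stack` is a hypothetical helper, not the actual API of `batching.py`):

```python
import numpy as np

def pad_and_stack(arrays, pad_value=0):
    """Pad 1-D arrays of different lengths to a common length, then stack.

    A hypothetical helper illustrating the idea behind the batchers in
    batching.py; the real module may differ in names and details.
    """
    max_len = max(a.shape[0] for a in arrays)
    padded = [
        np.pad(a, (0, max_len - a.shape[0]), mode="constant",
               constant_values=pad_value) for a in arrays
    ]
    return np.stack(padded)  # (batch_size, max_len)

# Example: batching records of (fixed-size item, variable-length item).
examples = [(np.zeros((80, )), np.arange(n)) for n in (3, 5, 4)]
mels, texts = zip(*examples)
mel_batch = np.stack(mels)         # fixed-size items: just stack
text_batch = pad_and_stack(texts)  # dynamic-size items: pad, then stack
```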
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path from pathlib import Path
import numpy as np import numpy as np
import pandas as pd import pandas as pd
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path from pathlib import Path
import pandas as pd import pandas as pd
from ruamel.yaml import YAML from ruamel.yaml import YAML
...@@ -11,23 +25,25 @@ from parakeet.data.dataset import Dataset ...@@ -11,23 +25,25 @@ from parakeet.data.dataset import Dataset
from parakeet.data.datacargo import DataCargo from parakeet.data.datacargo import DataCargo
from parakeet.data.batch import TextIDBatcher, WavBatcher from parakeet.data.batch import TextIDBatcher, WavBatcher
class VCTK(Dataset): class VCTK(Dataset):
def __init__(self, root): def __init__(self, root):
assert isinstance(root, (str, Path)), "root should be a string or Path object" assert isinstance(root, (
str, Path)), "root should be a string or Path object"
self.root = root if isinstance(root, Path) else Path(root) self.root = root if isinstance(root, Path) else Path(root)
self.text_root = self.root.joinpath("txt") self.text_root = self.root.joinpath("txt")
self.wav_root = self.root.joinpath("wav48") self.wav_root = self.root.joinpath("wav48")
if not (self.root.joinpath("metadata.csv").exists() and if not (self.root.joinpath("metadata.csv").exists() and
self.root.joinpath("speaker_indices.yaml").exists()): self.root.joinpath("speaker_indices.yaml").exists()):
self._prepare_metadata() self._prepare_metadata()
self.speaker_indices, self.metadata = self._load_metadata() self.speaker_indices, self.metadata = self._load_metadata()
def _load_metadata(self): def _load_metadata(self):
yaml=YAML(typ='safe') yaml = YAML(typ='safe')
speaker_indices = yaml.load(self.root.joinpath("speaker_indices.yaml")) speaker_indices = yaml.load(self.root.joinpath("speaker_indices.yaml"))
metadata = pd.read_csv(self.root.joinpath("metadata.csv"), metadata = pd.read_csv(
sep="|", quoting=3, header=1) self.root.joinpath("metadata.csv"), sep="|", quoting=3, header=1)
return speaker_indices, metadata return speaker_indices, metadata
def _prepare_metadata(self): def _prepare_metadata(self):
...@@ -41,15 +57,19 @@ class VCTK(Dataset): ...@@ -41,15 +57,19 @@ class VCTK(Dataset):
with io.open(str(text_file)) as f: with io.open(str(text_file)) as f:
transcription = f.read().strip() transcription = f.read().strip()
wav_file = text_file.with_suffix(".wav") wav_file = text_file.with_suffix(".wav")
metadata.append((wav_file.name, speaker_folder.name, transcription)) metadata.append(
metadata = pd.DataFrame.from_records(metadata, (wav_file.name, speaker_folder.name, transcription))
columns=["wave_file", "speaker", "text"]) metadata = pd.DataFrame.from_records(
metadata, columns=["wave_file", "speaker", "text"])
# save them # save them
yaml=YAML(typ='safe') yaml = YAML(typ='safe')
yaml.dump(speaker_to_index, self.root.joinpath("speaker_indices.yaml")) yaml.dump(speaker_to_index, self.root.joinpath("speaker_indices.yaml"))
metadata.to_csv(self.root.joinpath("metadata.csv"), metadata.to_csv(
sep="|", quoting=3, index=False) self.root.joinpath("metadata.csv"),
sep="|",
quoting=3,
index=False)
def _get_example(self, metadatum): def _get_example(self, metadatum):
wave_file, speaker, text = metadatum wave_file, speaker, text = metadatum
...@@ -77,5 +97,3 @@ class VCTK(Dataset): ...@@ -77,5 +97,3 @@ class VCTK(Dataset):
speaker_batch = np.array(speaker_batch) speaker_batch = np.array(speaker_batch)
phoneme_batch = TextIDBatcher(pad_id=0)(phoneme_batch) phoneme_batch = TextIDBatcher(pad_id=0)(phoneme_batch)
return wav_batch, speaker_batch, phoneme_batch return wav_batch, speaker_batch, phoneme_batch
\ No newline at end of file
# coding: utf-8 # coding: utf-8
"""Text processing frontend """Text processing frontend
All frontend modules should have the following functions: All frontend modules should have the following functions:
......
...@@ -32,6 +32,3 @@ def text_to_sequence(text, p=0.0): ...@@ -32,6 +32,3 @@ def text_to_sequence(text, p=0.0):
from ..text import text_to_sequence from ..text import text_to_sequence
text = text_to_sequence(text, ["english_cleaners"]) text = text_to_sequence(text, ["english_cleaners"])
return text return text
...@@ -12,6 +12,3 @@ def text_to_sequence(text, p=0.0): ...@@ -12,6 +12,3 @@ def text_to_sequence(text, p=0.0):
from ..text import text_to_sequence from ..text import text_to_sequence
text = text_to_sequence(text, ["basic_cleaners"]) text = text_to_sequence(text, ["basic_cleaners"])
return text return text
# coding: utf-8 # coding: utf-8
import MeCab import MeCab
import jaconv import jaconv
from random import random from random import random
...@@ -30,9 +29,9 @@ def _yomi(mecab_result): ...@@ -30,9 +29,9 @@ def _yomi(mecab_result):
def _mix_pronunciation(tokens, yomis, p): def _mix_pronunciation(tokens, yomis, p):
return "".join( return "".join(yomis[idx]
yomis[idx] if yomis[idx] is not None and random() < p else tokens[idx] if yomis[idx] is not None and random() < p else tokens[idx]
for idx in range(len(tokens))) for idx in range(len(tokens)))
def mix_pronunciation(text, p): def mix_pronunciation(text, p):
...@@ -59,8 +58,7 @@ def normalize_delimitor(text): ...@@ -59,8 +58,7 @@ def normalize_delimitor(text):
def text_to_sequence(text, p=0.0): def text_to_sequence(text, p=0.0):
for c in [" ", " ", "「", "」", "『", "』", "・", "【", "】", for c in [" ", " ", "「", "」", "『", "』", "・", "【", "】", "(", ")", "(", ")"]:
"(", ")", "(", ")"]:
text = text.replace(c, "") text = text.replace(c, "")
text = text.replace("!", "!") text = text.replace("!", "!")
text = text.replace("?", "?") text = text.replace("?", "?")
......
# coding: utf-8 # coding: utf-8
from random import random from random import random
n_vocab = 0xffff n_vocab = 0xffff
...@@ -13,5 +12,6 @@ _tagger = None ...@@ -13,5 +12,6 @@ _tagger = None
def text_to_sequence(text, p=0.0): def text_to_sequence(text, p=0.0):
return [ord(c) for c in text] + [_eos] # EOS return [ord(c) for c in text] + [_eos] # EOS
def sequence_to_text(seq): def sequence_to_text(seq):
return "".join(chr(n) for n in seq) return "".join(chr(n) for n in seq)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import re import re
from . import cleaners from . import cleaners
from .symbols import symbols from .symbols import symbols
# Mappings from symbol to numeric ID and vice versa: # Mappings from symbol to numeric ID and vice versa:
_symbol_to_id = {s: i for i, s in enumerate(symbols)} _symbol_to_id = {s: i for i, s in enumerate(symbols)}
_id_to_symbol = {i: s for i, s in enumerate(symbols)} _id_to_symbol = {i: s for i, s in enumerate(symbols)}
...@@ -32,7 +45,8 @@ def text_to_sequence(text, cleaner_names): ...@@ -32,7 +45,8 @@ def text_to_sequence(text, cleaner_names):
if not m: if not m:
sequence += _symbols_to_sequence(_clean_text(text, cleaner_names)) sequence += _symbols_to_sequence(_clean_text(text, cleaner_names))
break break
sequence += _symbols_to_sequence(_clean_text(m.group(1), cleaner_names)) sequence += _symbols_to_sequence(
_clean_text(m.group(1), cleaner_names))
sequence += _arpabet_to_sequence(m.group(2)) sequence += _arpabet_to_sequence(m.group(2))
text = m.group(3) text = m.group(3)
......
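For orientation, a hedged usage sketch of this function: the curly-brace markup for embedded ARPAbet is an assumption based on the `_arpabet_to_sequence` branch above, and the example string is hypothetical.

```python
# hypothetical input: segments wrapped in {...} are treated as ARPAbet
# phonemes (handled by _arpabet_to_sequence); the rest is cleaned and
# converted symbol by symbol.
seq = text_to_sequence("Turn left on {HH AW1 S S T AH0 N} Street.",
                       ["english_cleaners"])
# seq is a list of integer symbol ids; _id_to_symbol maps them back.
```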
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
''' '''
Cleaners are transformations that run over the input text at both training and eval time. Cleaners are transformations that run over the input text at both training and eval time.
...@@ -14,31 +27,31 @@ import re ...@@ -14,31 +27,31 @@ import re
from unidecode import unidecode from unidecode import unidecode
from .numbers import normalize_numbers from .numbers import normalize_numbers
# Regular expression matching whitespace: # Regular expression matching whitespace:
_whitespace_re = re.compile(r'\s+') _whitespace_re = re.compile(r'\s+')
# List of (regular expression, replacement) pairs for abbreviations: # List of (regular expression, replacement) pairs for abbreviations:
_abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [ _abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1])
('mrs', 'misess'), for x in [
('mr', 'mister'), ('mrs', 'misess'),
('dr', 'doctor'), ('mr', 'mister'),
('st', 'saint'), ('dr', 'doctor'),
('co', 'company'), ('st', 'saint'),
('jr', 'junior'), ('co', 'company'),
('maj', 'major'), ('jr', 'junior'),
('gen', 'general'), ('maj', 'major'),
('drs', 'doctors'), ('gen', 'general'),
('rev', 'reverend'), ('drs', 'doctors'),
('lt', 'lieutenant'), ('rev', 'reverend'),
('hon', 'honorable'), ('lt', 'lieutenant'),
('sgt', 'sergeant'), ('hon', 'honorable'),
('capt', 'captain'), ('sgt', 'sergeant'),
('esq', 'esquire'), ('capt', 'captain'),
('ltd', 'limited'), ('esq', 'esquire'),
('col', 'colonel'), ('ltd', 'limited'),
('ft', 'fort'), ('col', 'colonel'),
]] ('ft', 'fort'),
]]
def expand_abbreviations(text): def expand_abbreviations(text):
......
import re # Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import re
valid_symbols = [ valid_symbols = [
'AA', 'AA0', 'AA1', 'AA2', 'AE', 'AE0', 'AE1', 'AE2', 'AH', 'AH0', 'AH1', 'AH2', 'AA', 'AA0', 'AA1', 'AA2', 'AE', 'AE0', 'AE1', 'AE2', 'AH', 'AH0', 'AH1',
'AO', 'AO0', 'AO1', 'AO2', 'AW', 'AW0', 'AW1', 'AW2', 'AY', 'AY0', 'AY1', 'AY2', 'AH2', 'AO', 'AO0', 'AO1', 'AO2', 'AW', 'AW0', 'AW1', 'AW2', 'AY', 'AY0',
'B', 'CH', 'D', 'DH', 'EH', 'EH0', 'EH1', 'EH2', 'ER', 'ER0', 'ER1', 'ER2', 'EY', 'AY1', 'AY2', 'B', 'CH', 'D', 'DH', 'EH', 'EH0', 'EH1', 'EH2', 'ER', 'ER0',
'EY0', 'EY1', 'EY2', 'F', 'G', 'HH', 'IH', 'IH0', 'IH1', 'IH2', 'IY', 'IY0', 'IY1', 'ER1', 'ER2', 'EY', 'EY0', 'EY1', 'EY2', 'F', 'G', 'HH', 'IH', 'IH0',
'IY2', 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW', 'OW0', 'OW1', 'OW2', 'OY', 'OY0', 'IH1', 'IH2', 'IY', 'IY0', 'IY1', 'IY2', 'JH', 'K', 'L', 'M', 'N', 'NG',
'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH', 'UH0', 'UH1', 'UH2', 'UW', 'OW', 'OW0', 'OW1', 'OW2', 'OY', 'OY0', 'OY1', 'OY2', 'P', 'R', 'S', 'SH',
'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH' 'T', 'TH', 'UH', 'UH0', 'UH1', 'UH2', 'UW', 'UW0', 'UW1', 'UW2', 'V', 'W',
'Y', 'Z', 'ZH'
] ]
_valid_symbol_set = set(valid_symbols) _valid_symbol_set = set(valid_symbols)
...@@ -24,7 +38,10 @@ class CMUDict: ...@@ -24,7 +38,10 @@ class CMUDict:
else: else:
entries = _parse_cmudict(file_or_path) entries = _parse_cmudict(file_or_path)
if not keep_ambiguous: if not keep_ambiguous:
entries = {word: pron for word, pron in entries.items() if len(pron) == 1} entries = {
word: pron
for word, pron in entries.items() if len(pron) == 1
}
self._entries = entries self._entries = entries
def __len__(self): def __len__(self):
......
...@@ -3,7 +3,6 @@ ...@@ -3,7 +3,6 @@
import inflect import inflect
import re import re
_inflect = inflect.engine() _inflect = inflect.engine()
_comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])') _comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])')
_decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)') _decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)')
...@@ -56,7 +55,8 @@ def _expand_number(m): ...@@ -56,7 +55,8 @@ def _expand_number(m):
elif num % 100 == 0: elif num % 100 == 0:
return _inflect.number_to_words(num // 100) + ' hundred' return _inflect.number_to_words(num // 100) + ' hundred'
else: else:
return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ') return _inflect.number_to_words(
num, andword='', zero='oh', group=2).replace(', ', ' ')
else: else:
return _inflect.number_to_words(num, andword='') return _inflect.number_to_words(num, andword='')
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
''' '''
Defines the set of symbols used in text input to the model. Defines the set of symbols used in text input to the model.
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .net import *
from .parallel_wavenet import *
\ No newline at end of file
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import itertools
import numpy as np
from scipy import signal
from tqdm import trange
import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg
import paddle.fluid.initializer as I
import paddle.fluid.layers.distributions as D
from parakeet.modules.weight_norm import Conv2DTranspose
from parakeet.models.wavenet import crop, WaveNet, UpsampleNet
from parakeet.models.clarinet.parallel_wavenet import ParallelWaveNet
from parakeet.models.clarinet.utils import conv2d
# Gaussian IAF model
class Clarinet(dg.Layer):
def __init__(self,
encoder,
teacher,
student,
stft,
min_log_scale=-6.0,
lmd=4.0):
super(Clarinet, self).__init__()
self.lmd = lmd
self.encoder = encoder
self.teacher = teacher
self.student = student
self.min_log_scale = min_log_scale
self.stft = stft
def forward(self, audio, mel, audio_start, clip_kl=True):
"""Compute loss for a distill model
Arguments:
audio {Variable} -- shape(batch_size, time_steps), target waveform.
mel {Variable} -- shape(batch_size, condition_dim, time_steps // hop_length), original mel spectrogram, not upsampled yet.
audio_start {Variable} -- shape(batch_size, ), the index of the start sample.
clip_kl (bool) -- whether to clip kl divergence if it is greater than 10.0.
Returns:
Dict[str, Variable] -- the total loss ("loss") and its components ("kl_divergence", "regularization", "stft_loss").
"""
batch_size, audio_length = audio.shape # audio clip's length
z = F.gaussian_random(audio.shape)
condition = self.encoder(mel) # (B, C, T)
condition_slice = crop(condition, audio_start, audio_length)
x, s_means, s_scales = self.student(z, condition_slice) # all [0: T]
s_means = s_means[:, 1:] # (B, T-1), time steps [1: T]
s_scales = s_scales[:, 1:] # (B, T-1), time steps [1: T]
s_clipped_scales = F.clip(s_scales, self.min_log_scale, 100.)
# teacher outputs single gaussian
y = self.teacher(x[:, :-1], condition_slice[:, :, 1:])
_, t_means, t_scales = F.split(y, 3, -1) # time steps [1: T]
t_means = F.squeeze(t_means, [-1]) # (B, T-1), time steps [1: T]
t_scales = F.squeeze(t_scales, [-1]) # (B, T-1), time steps [1: T]
t_clipped_scales = F.clip(t_scales, self.min_log_scale, 100.)
s_distribution = D.Normal(s_means, F.exp(s_clipped_scales))
t_distribution = D.Normal(t_means, F.exp(t_clipped_scales))
# kl divergence loss, so we only need to sample once? no MC
kl = s_distribution.kl_divergence(t_distribution)
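# Added note: for two univariate Gaussians this KL term has the closed form
#   KL(N(mu_s, sigma_s) || N(mu_t, sigma_t))
#     = log(sigma_t / sigma_s) + (sigma_s^2 + (mu_s - mu_t)^2) / (2 * sigma_t^2) - 1/2,
# so no Monte-Carlo sampling over the student output is needed here.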
if clip_kl:
kl = F.clip(kl, -100., 10.)
# context size dropped
kl = F.reduce_mean(kl[:, self.teacher.context_size:])
# major diff here
regularization = F.mse_loss(t_scales[:, self.teacher.context_size:],
s_scales[:, self.teacher.context_size:])
# introduce information from real target
spectrogram_frame_loss = F.mse_loss(
self.stft.magnitude(audio), self.stft.magnitude(x))
loss = kl + self.lmd * regularization + spectrogram_frame_loss
loss_dict = {
"loss": loss,
"kl_divergence": kl,
"regularization": regularization,
"stft_loss": spectrogram_frame_loss
}
return loss_dict
@dg.no_grad
def synthesis(self, mel):
"""Synthesize waveform conditioned on the mel spectrogram.
Arguments:
mel {Variable} -- shape(batch_size, frequency_bands, frames)
Returns:
Variable -- shape(batch_size, frames * upsample_factor)
"""
condition = self.encoder(mel)
samples_shape = (condition.shape[0], condition.shape[-1])
z = F.gaussian_random(samples_shape)
x, s_means, s_scales = self.student(z, condition)
return x
class STFT(dg.Layer):
def __init__(self, n_fft, hop_length, win_length, window="hanning"):
super(STFT, self).__init__()
self.hop_length = hop_length
self.n_bin = 1 + n_fft // 2
self.n_fft = n_fft
# calculate window
window = signal.get_window(window, win_length)
if n_fft != win_length:
pad = (n_fft - win_length) // 2
window = np.pad(window, ((pad, pad), ), 'constant')
# calculate weights
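# Note: M[k, n] = k * n below, so row k of w_real / w_imag is the windowed
# Fourier basis  window[n] * cos(2*pi*k*n / n_fft)  and
# -window[n] * sin(2*pi*k*n / n_fft); the strided conv2d in forward() then
# yields the real and imaginary parts of the STFT, one hop per output step.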
r = np.arange(0, n_fft)
M = np.expand_dims(r, -1) * np.expand_dims(r, 0)
w_real = np.reshape(window *
np.cos(2 * np.pi * M / n_fft)[:self.n_bin],
(self.n_bin, 1, 1, self.n_fft)).astype("float32")
w_imag = np.reshape(window *
np.sin(-2 * np.pi * M / n_fft)[:self.n_bin],
(self.n_bin, 1, 1, self.n_fft)).astype("float32")
w = np.concatenate([w_real, w_imag], axis=0)
self.weight = dg.to_variable(w)
def forward(self, x):
# x(batch_size, time_steps)
# pad it first with reflect mode
pad_start = F.reverse(x[:, 1:1 + self.n_fft // 2], axis=1)
pad_stop = F.reverse(x[:, -(1 + self.n_fft // 2):-1], axis=1)
x = F.concat([pad_start, x, pad_stop], axis=-1)
# to BC1T, C=1
x = F.unsqueeze(x, axes=[1, 2])
out = conv2d(x, self.weight, stride=(1, self.hop_length))
real, imag = F.split(out, 2, dim=1) # BC1T
return real, imag
def power(self, x):
real, imag = self(x)
power = real**2 + imag**2
return power
def magnitude(self, x):
power = self.power(x)
magnitude = F.sqrt(power)
return magnitude
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import time
import itertools
import numpy as np
import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg
import paddle.fluid.initializer as I
import paddle.fluid.layers.distributions as D
from parakeet.modules.weight_norm import Linear, Conv1D, Conv1DCell, Conv2DTranspose
from parakeet.models.wavenet import WaveNet
class ParallelWaveNet(dg.Layer):
def __init__(self, n_loops, n_layers, residual_channels, condition_dim,
filter_size):
super(ParallelWaveNet, self).__init__()
self.flows = dg.LayerList()
for n_loop, n_layer in zip(n_loops, n_layers):
# the teacher's log_scale_min does not matter here; -100.0 is a dummy value
self.flows.append(
WaveNet(n_loop, n_layer, residual_channels, 3, condition_dim,
filter_size, "mog", -100.0))
def forward(self, z, condition=None):
"""Inverse Autoregressive Flow. Several wavenets.
Arguments:
z {Variable} -- shape(batch_size, time_steps), hidden variable, sampled from a standard normal distribution.
Keyword Arguments:
condition {Variable} -- shape(batch_size, condition_dim, time_steps), condition, basically upsampled mel spectrogram. (default: {None})
Returns:
Variable -- shape(batch_size, time_steps), transformed z.
Variable -- shape(batch_size, time_steps), output distribution's mu.
Variable -- shape(batch_size, time_steps), output distribution's log_std.
"""
for i, flow in enumerate(self.flows):
theta = flow(z, condition) # w, mu, log_std [0: T]
w, mu, log_std = F.split(theta, 3, dim=-1) # (B, T, 1) for each
mu = F.squeeze(mu, [-1]) #[0: T]
log_std = F.squeeze(log_std, [-1]) #[0: T]
z = z * F.exp(log_std) + mu #[0: T]
if i == 0:
out_mu = mu
out_log_std = log_std
else:
out_mu = out_mu * F.exp(log_std) + mu
out_log_std += log_std
return z, out_mu, out_log_std
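As a standalone sanity check (plain numpy, not part of the model): composing the per-flow affine maps z ← z·exp(log_std) + mu is itself a single affine map, and the accumulation of `out_mu` / `out_log_std` in the loop above yields exactly its parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
z0 = rng.standard_normal(8)
mus = [rng.standard_normal(8) for _ in range(3)]
log_stds = [0.1 * rng.standard_normal(8) for _ in range(3)]

# apply the flows one after another, as the loop above does with z
z = z0.copy()
for mu, s in zip(mus, log_stds):
    z = z * np.exp(s) + mu

# accumulate the overall affine parameters the same way forward() does
out_mu, out_log_std = mus[0], log_stds[0]
for mu, s in zip(mus[1:], log_stds[1:]):
    out_mu = out_mu * np.exp(s) + mu
    out_log_std = out_log_std + s

# the composed transform is one affine map with these parameters
assert np.allclose(z, z0 * np.exp(out_log_std) + out_mu)
```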
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from paddle import fluid
from paddle.fluid.core import ops
@fluid.framework.dygraph_only
def conv2d(input,
weight,
stride=(1, 1),
padding=((0, 0), (0, 0)),
dilation=(1, 1),
groups=1,
use_cudnn=True,
data_format="NCHW"):
padding = tuple(pad for pad_dim in padding for pad in pad_dim)
inputs = {
'Input': [input],
'Filter': [weight],
}
attrs = {
'strides': stride,
'paddings': padding,
'dilations': dilation,
'groups': groups,
'use_cudnn': use_cudnn,
'use_mkldnn': False,
'fuse_relu_before_depthwise_conv': False,
"padding_algorithm": "EXPLICIT",
"data_format": data_format,
}
outputs = ops.conv2d(inputs, attrs)
out = outputs["Output"][0]
return out
\ No newline at end of file
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from parakeet.models.deepvoice3.encoder import Encoder, ConvSpec from parakeet.models.deepvoice3.encoder import Encoder, ConvSpec
from parakeet.models.deepvoice3.decoder import Decoder, WindowRange from parakeet.models.deepvoice3.decoder import Decoder, WindowRange
from parakeet.models.deepvoice3.converter import Converter from parakeet.models.deepvoice3.converter import Converter
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np import numpy as np
from collections import namedtuple from collections import namedtuple
from paddle import fluid from paddle import fluid
...@@ -19,23 +33,19 @@ class Attention(dg.Layer): ...@@ -19,23 +33,19 @@ class Attention(dg.Layer):
value_projection=True): value_projection=True):
super(Attention, self).__init__() super(Attention, self).__init__()
std = np.sqrt(1 / query_dim) std = np.sqrt(1 / query_dim)
self.query_proj = Linear(query_dim, self.query_proj = Linear(
embed_dim, query_dim, embed_dim, param_attr=I.Normal(scale=std))
param_attr=I.Normal(scale=std))
if key_projection: if key_projection:
std = np.sqrt(1 / embed_dim) std = np.sqrt(1 / embed_dim)
self.key_proj = Linear(embed_dim, self.key_proj = Linear(
embed_dim, embed_dim, embed_dim, param_attr=I.Normal(scale=std))
param_attr=I.Normal(scale=std))
if value_projection: if value_projection:
std = np.sqrt(1 / embed_dim) std = np.sqrt(1 / embed_dim)
self.value_proj = Linear(embed_dim, self.value_proj = Linear(
embed_dim, embed_dim, embed_dim, param_attr=I.Normal(scale=std))
param_attr=I.Normal(scale=std))
std = np.sqrt(1 / embed_dim) std = np.sqrt(1 / embed_dim)
self.out_proj = Linear(embed_dim, self.out_proj = Linear(
query_dim, embed_dim, query_dim, param_attr=I.Normal(scale=std))
param_attr=I.Normal(scale=std))
self.key_projection = key_projection self.key_projection = key_projection
self.value_projection = value_projection self.value_projection = value_projection
...@@ -102,9 +112,8 @@ class Attention(dg.Layer): ...@@ -102,9 +112,8 @@ class Attention(dg.Layer):
x = F.softmax(x) x = F.softmax(x)
attn_scores = x attn_scores = x
x = F.dropout(x, x = F.dropout(
self.dropout, x, self.dropout, dropout_implementation="upscale_in_train")
dropout_implementation="upscale_in_train")
x = F.matmul(x, values) x = F.matmul(x, values)
encoder_length = keys.shape[1] encoder_length = keys.shape[1]
# CAUTION: is it wrong? let it be now # CAUTION: is it wrong? let it be now
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np import numpy as np
from paddle import fluid from paddle import fluid
...@@ -15,6 +29,7 @@ class Conv1DGLU(dg.Layer): ...@@ -15,6 +29,7 @@ class Conv1DGLU(dg.Layer):
has a residual connection from the input x, and scales the output by has a residual connection from the input x, and scales the output by
np.sqrt(0.5). np.sqrt(0.5).
""" """
def __init__(self, def __init__(self,
n_speakers, n_speakers,
speaker_dim, speaker_dim,
...@@ -50,20 +65,20 @@ class Conv1DGLU(dg.Layer): ...@@ -50,20 +65,20 @@ class Conv1DGLU(dg.Layer):
), "this block uses residual connection"\ ), "this block uses residual connection"\
"the input_channes should equals num_filters" "the input_channes should equals num_filters"
std = np.sqrt(std_mul * (1 - dropout) / (filter_size * in_channels)) std = np.sqrt(std_mul * (1 - dropout) / (filter_size * in_channels))
self.conv = Conv1DCell(in_channels, self.conv = Conv1DCell(
2 * num_filters, in_channels,
filter_size, 2 * num_filters,
dilation, filter_size,
causal, dilation,
param_attr=I.Normal(scale=std)) causal,
param_attr=I.Normal(scale=std))
if n_speakers > 1: if n_speakers > 1:
assert (speaker_dim is not None assert (speaker_dim is not None
), "speaker embed should not be null in multi-speaker case" ), "speaker embed should not be null in multi-speaker case"
std = np.sqrt(1 / speaker_dim) std = np.sqrt(1 / speaker_dim)
self.fc = Linear(speaker_dim, self.fc = Linear(
num_filters, speaker_dim, num_filters, param_attr=I.Normal(scale=std))
param_attr=I.Normal(scale=std))
def forward(self, x, speaker_embed=None): def forward(self, x, speaker_embed=None):
""" """
...@@ -82,9 +97,8 @@ class Conv1DGLU(dg.Layer): ...@@ -82,9 +97,8 @@ class Conv1DGLU(dg.Layer):
C_out means the output channels of Conv1DGLU. C_out means the output channels of Conv1DGLU.
""" """
residual = x residual = x
x = F.dropout(x, x = F.dropout(
self.dropout, x, self.dropout, dropout_implementation="upscale_in_train")
dropout_implementation="upscale_in_train")
x = self.conv(x) x = self.conv(x)
content, gate = F.split(x, num_or_sections=2, dim=1) content, gate = F.split(x, num_or_sections=2, dim=1)
...@@ -118,9 +132,8 @@ class Conv1DGLU(dg.Layer): ...@@ -118,9 +132,8 @@ class Conv1DGLU(dg.Layer):
C_out means the output channels of Conv1DGLU. C_out means the output channels of Conv1DGLU.
""" """
residual = x_t residual = x_t
x_t = F.dropout(x_t, x_t = F.dropout(
self.dropout, x_t, self.dropout, dropout_implementation="upscale_in_train")
dropout_implementation="upscale_in_train")
x_t = self.conv.add_input(x_t) x_t = self.conv.add_input(x_t)
content_t, gate_t = F.split(x_t, num_or_sections=2, dim=1) content_t, gate_t = F.split(x_t, num_or_sections=2, dim=1)
......
(This file's diff has been collapsed.)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np import numpy as np
import paddle.fluid.layers as F import paddle.fluid.layers as F
...@@ -29,9 +43,9 @@ class DeepVoice3(dg.Layer): ...@@ -29,9 +43,9 @@ class DeepVoice3(dg.Layer):
mel_outputs, alignments, done, decoder_states = self.decoder( mel_outputs, alignments, done, decoder_states = self.decoder(
(keys, values), valid_lengths, mel_inputs, text_positions, (keys, values), valid_lengths, mel_inputs, text_positions,
frame_positions, speaker_embed) frame_positions, speaker_embed)
linear_outputs = self.converter( linear_outputs = self.converter(decoder_states
decoder_states if self.use_decoder_states else mel_outputs, if self.use_decoder_states else
speaker_embed) mel_outputs, speaker_embed)
return mel_outputs, linear_outputs, alignments, done return mel_outputs, linear_outputs, alignments, done
def transduce(self, text_sequences, text_positions, speaker_indices=None): def transduce(self, text_sequences, text_positions, speaker_indices=None):
...@@ -43,7 +57,7 @@ class DeepVoice3(dg.Layer): ...@@ -43,7 +57,7 @@ class DeepVoice3(dg.Layer):
keys, values = self.encoder(text_sequences, speaker_embed) keys, values = self.encoder(text_sequences, speaker_embed)
mel_outputs, alignments, done, decoder_states = self.decoder.decode( mel_outputs, alignments, done, decoder_states = self.decoder.decode(
(keys, values), text_positions, speaker_embed) (keys, values), text_positions, speaker_embed)
linear_outputs = self.converter( linear_outputs = self.converter(decoder_states
decoder_states if self.use_decoder_states else mel_outputs, if self.use_decoder_states else
speaker_embed) mel_outputs, speaker_embed)
return mel_outputs, linear_outputs, alignments, done return mel_outputs, linear_outputs, alignments, done
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np import numpy as np
from paddle import fluid from paddle import fluid
import paddle.fluid.layers as F import paddle.fluid.layers as F
...@@ -95,10 +109,11 @@ class PositionEmbedding(dg.Layer): ...@@ -95,10 +109,11 @@ class PositionEmbedding(dg.Layer):
speaker_position_rate) # (B, V, C) speaker_position_rate) # (B, V, C)
# make indices for gather_nd # make indices for gather_nd
batch_id = F.expand( batch_id = F.expand(
F.unsqueeze(F.range(0, batch_size, 1, dtype="int64"), [1]), F.unsqueeze(
[1, time_steps]) F.range(
0, batch_size, 1, dtype="int64"), [1]), [1, time_steps])
# (B, T, 2) # (B, T, 2)
gather_nd_id = F.stack([batch_id, indices], -1) gather_nd_id = F.stack([batch_id, indices], -1)
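# Note: stacking batch ids with the position indices gives (B, T, 2)
# coordinates, so gather_nd below reads weight[b, indices[b, t]] for every
# position -- a batched embedding lookup with per-speaker position rates.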
out = F.gather_nd(weight, gather_nd_id) out = F.gather_nd(weight, gather_nd_id)
return out return out
\ No newline at end of file
(The remaining file diffs in this commit have been collapsed.)