Commit 282c36c2, author: chenfeiyu

dv3 reloaded, back to the origin

Parent 24eb14a7
......@@ -22,151 +22,118 @@ The model consists of an encoder, a decoder and a converter (and a speaker embed
## Project Structure
```text
├── data.py data_processing
├── model.py function to create model, criterion and optimizer
├── configs/ (example) configuration files
├── sentences.txt sample sentences
├── synthesis.py script to synthesize waveform from text
├── train.py script to train a model
└── utils.py utility functions
├── config/
├── synthesize.py
├── data.py
├── preprocess.py
├── clip.py
├── train.py
└── vocoder.py
```
## Saving & Loading
`train.py` and `synthesis.py` have 3 arguments in common: `--checkpoint`, `--iteration` and `output`.
# Preprocess
1. `output` is the directory for saving results.
During training, checkpoints are saved in `checkpoints/` in `output` and the tensorboard log is saved in `log/` in `output`. Training states, including alignment plots, spectrogram plots and generated audio files, are saved in `states/` in `output`. In addition, we periodically evaluate the model with several given sentences; the resulting alignment plots and generated audio files are saved in `eval/` in `output`.
During synthesis, audio files and alignment plots are saved in `synthesis/` in `output`.
So after training and synthesizing with the same output directory, the file structure of the output directory looks like this.
Preprocess the dataset with `preprocess.py`.
```text
├── checkpoints/ # checkpoint directory (including *.pdparams, *.pdopt and a text file `checkpoint` that records the latest checkpoint)
├── states/ # alignment plots, spectrogram plots and generated wavs at training
├── log/ # tensorboard log
├── eval/ # audio files and alignment plots generated at evaluation during training
└── synthesis/ # synthesized audio files and alignment plots
usage: preprocess.py [-h] --config CONFIG --input INPUT --output OUTPUT
preprocess ljspeech dataset and save it.
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file
--input INPUT data path of the original data
--output OUTPUT path to save the preprocessed dataset
```
2. `--checkpoint` and `--iteration` are used for loading from an existing checkpoint. Loading follows this rule:
If `--checkpoint` is provided, the checkpoint at the path specified by `--checkpoint` is loaded.
If `--checkpoint` is not provided, we try to load the model specified by `--iteration` from the checkpoint directory. If `--iteration` is not provided, we try to load the latest checkpoint from the checkpoint directory.
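The loading rule above can be sketched as a small resolver. This is a minimal sketch, not the code in `train.py`: the function name `resolve_checkpoint` is hypothetical, the `model_step_{:09d}` naming is borrowed from the synthesis example in this README, and the layout of the `checkpoint` record file (assumed here to hold only the latest iteration number) is an assumption.

```python
import os


def resolve_checkpoint(checkpoint_dir, checkpoint=None, iteration=None):
    """Return the checkpoint path to load, or None if there is nothing to load."""
    # Rule 1: an explicit --checkpoint path always wins.
    if checkpoint is not None:
        return checkpoint
    # Rule 2: otherwise use --iteration, falling back to the latest record.
    if iteration is None:
        record = os.path.join(checkpoint_dir, "checkpoint")
        if not os.path.isfile(record):
            return None  # no checkpoint has been saved yet
        with open(record, "rt") as f:
            # Assumption: the record file contains only the iteration number.
            iteration = int(f.read().strip())
    name = "model_step_{:09d}".format(int(iteration))
    return os.path.join(checkpoint_dir, name)
```

Returning `None` when nothing is found lets the caller decide whether to start from scratch or raise an error.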
example code:
```bash
python preprocess.py --config=configs/ljspeech.yaml --input=LJSpeech-1.1/ --output=data/ljspeech
```
## Train
Train the model using `train.py`, following the usage displayed by `python train.py --help`.
```text
usage: train.py [-h] [--config CONFIG] [--data DATA] [--device DEVICE]
[--checkpoint CHECKPOINT | --iteration ITERATION]
output
usage: train.py [-h] --config CONFIG --input INPUT
Train a Deep Voice 3 model with LJSpeech dataset.
positional arguments:
output path to save results
train a Deep Voice 3 model with LJSpeech
optional arguments:
-h, --help show this help message and exit
--config CONFIG experimrnt config
--data DATA The path of the LJSpeech dataset.
--device DEVICE device to use
--checkpoint CHECKPOINT checkpoint to resume from.
--iteration ITERATION the iteration of the checkpoint to load from output directory
-h, --help show this help message and exit
--config CONFIG config file
--input INPUT data path of the original data
```
example code:
```bash
CUDA_VISIBLE_DEVICES=0 python train.py --config=configs/ljspeech.yaml --input=data/ljspeech
```
- `--config` is the configuration file to use. The provided `ljspeech.yaml` can be used directly. You can also change some values in the configuration file and train the model with a different config.
- `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains `metadata.csv`).
- `--device` is the device (gpu id) to use for training. `-1` means CPU.
- `--checkpoint` is the path of the checkpoint.
- `--iteration` is the iteration of the checkpoint to load from output directory.
See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
- `output` is the directory to save results; all results are saved in this directory. The structure of the output directory is shown below.
Training creates a `runs` folder. The outputs of each run are saved in a separate folder in `runs`, whose name is the time joined with the hostname. Inside this folder, the tensorboard log, parameters and optimizer states are saved. Parameters (`*.pdparams`) and optimizer states (`*.pdopt`) are named by the step at which they are saved.
```text
├── checkpoints # checkpoint
├── log # tensorboard log
└── states # train and evaluation results
├── alignments # attention
├── lin_spec # linear spectrogram
├── mel_spec # mel spectrogram
└── waveform # waveform (.wav files)
runs/Jul07_09-39-34_instance-mqcyj27y-4/
├── checkpoint
├── events.out.tfevents.1594085974.instance-mqcyj27y-4
├── step-1000000.pdopt
├── step-1000000.pdparams
├── step-100000.pdopt
├── step-100000.pdparams
...
```
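Because files in a run folder are named by step, the most recent parameters can be located by parsing the file names. The helper below is a hypothetical sketch (`latest_step` is not part of the repository); it compares steps numerically, since a plain string sort would order `step-100000` and `step-1000000` incorrectly.

```python
import os
import re


def latest_step(run_dir):
    """Return the largest step that has a saved `.pdparams` file, or None."""
    pattern = re.compile(r"^step-(\d+)\.pdparams$")
    steps = []
    for name in os.listdir(run_dir):
        match = pattern.match(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None
```

The prefix `"step-{}".format(latest_step(run_dir))` joined to the run folder has the same shape as the `--checkpoint` value used in the synthesis example.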
Example script:
Since we use waveflow to synthesize audio while training, download the trained waveflow model and extract it in the current directory before training.
```bash
python train.py \
--config=configs/ljspeech.yaml \
--data=./LJSpeech-1.1/ \
--device=0 \
experiment
wget https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_ckpt_1.0.zip
unzip waveflow_res128_ljspeech_ckpt_1.0.zip
```
To train the model in parallel on multiple gpus, you can launch the training script with `paddle.distributed.launch`. For example, to train with gpu `0,1,2,3`, you can use the example script below. Note that for parallel training, devices are specified with `--selected_gpus` passed to `paddle.distributed.launch`. In this case, `--device` passed to `train.py`, if specified, is ignored.
Example script:
```bash
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 \
train.py \
--config=configs/ljspeech.yaml \
--data=./LJSpeech-1.1/ \
experiment
```
## Visualization
You can visualize training losses, check the attention and listen to the synthesized audio when training with teacher forcing.
You can monitor training log via tensorboard, using the script below.
example code:
```bash
cd experiment/log
tensorboard --logdir=.
tensorboard --logdir=runs/ --host=$HOSTNAME --port=8000
```
## Synthesis
```text
usage: synthesis.py [-h] [--config CONFIG] [--device DEVICE]
[--checkpoint CHECKPOINT | --iteration ITERATION]
text output
Synthsize waveform with a checkpoint.
positional arguments:
text text file to synthesize
output path to save synthesized audio
```text
usage: synthesize from a checkpoint [-h] --config CONFIG --input INPUT
--output OUTPUT --checkpoint CHECKPOINT
--monotonic_layers MONOTONIC_LAYERS
optional arguments:
-h, --help show this help message and exit
--config CONFIG experiment config
--device DEVICE device to use
--checkpoint CHECKPOINT checkpoint to resume from
--iteration ITERATION the iteration of the checkpoint to load from output directory
-h, --help show this help message and exit
--config CONFIG config file
--input INPUT text file to synthesize
--output OUTPUT path to save audio
--checkpoint CHECKPOINT
data path of the checkpoint
--monotonic_layers MONOTONIC_LAYERS
monotonic decoder layer, index starts friom 1
```
- `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
- `--device` is the device (gpu id) to use for synthesis. `-1` means CPU.
- `--checkpoint` is the path of the checkpoint.
- `--iteration` is the iteration of the checkpoint to load from output directory.
See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
- `text` is the text file to synthesize.
- `output` is the directory to save results. The generated audio files (`*.wav`) and attention plots (`*.png`) are saved in `synthesis/` in the output directory.
Example script:
```bash
python synthesis.py \
--config=configs/ljspeech.yaml \
--device=0 \
--checkpoint="experiment/checkpoints/model_step_005000000" \
sentences.txt experiment
```
`synthesize.py` is used to synthesize several sentences in a text file.
`--monotonic_layers` gives the indices of the decoder layers that manifest monotonic diagonal attention. You can find the monotonic layers by inspecting the tensorboard logs. Mind that the index starts from 1. The layers that manifest monotonic diagonal attention are stable for a model during training and synthesis, but differ among different runs. So once you get the indices of the monotonic layers by inspecting the tensorboard log, you can use them at synthesis. Note that only decoder layers that show strong diagonal attention should be considered.
or
example code:
```bash
python synthesis.py \
--config=configs/ljspeech.yaml \
--device=0 \
--iteration=005000000 \
sentences.txt experiment
CUDA_VISIBLE_DEVICES=2 python synthesize.py \
--config configs/ljspeech.yaml \
--input sentences.txt \
--output outputs/ \
--checkpoint runs/Jul07_09-39-34_instance-mqcyj27y-4/step-1320000 \
--monotonic_layers "5,6"
```
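Since `--monotonic_layers` is passed as a comma-separated string with 1-based indices (for example `"5,6"`), it has to be converted before it can index a Python list of decoder layers. The function below is a hypothetical sketch of that conversion; how `synthesize.py` actually parses the flag is not shown in this diff.

```python
def parse_monotonic_layers(spec, n_layers):
    """Turn a string such as "5,6" (1-based) into sorted 0-based indices."""
    indices = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas
        layer = int(item)
        if not 1 <= layer <= n_layers:
            raise ValueError(
                "layer {} is outside 1..{}".format(layer, n_layers))
        indices.append(layer - 1)  # convert to 0-based
    return sorted(set(indices))
```

With the `decoder_layers: 8` setting from the new config, `parse_monotonic_layers("5,6", 8)` selects the fifth and sixth decoder layers.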
from __future__ import print_function
import copy
import six
import warnings
import functools
from paddle.fluid import layers
from paddle.fluid import framework
from paddle.fluid import core
from paddle.fluid import name_scope
from paddle.fluid.dygraph import base as imperative_base
from paddle.fluid.clip import GradientClipBase, _correct_clip_op_role_var
class DoubleClip(GradientClipBase):
"""
:alias_main: paddle.nn.GradientClipByGlobalNorm
:alias: paddle.nn.GradientClipByGlobalNorm,paddle.nn.clip.GradientClipByGlobalNorm
:old_api: paddle.fluid.clip.GradientClipByGlobalNorm
Given a list of Tensor :math:`t\_list` , calculate the global norm for the elements of all tensors in
:math:`t\_list` , and limit it to ``clip_norm`` .
- If the global norm is greater than ``clip_norm`` , all elements of :math:`t\_list` will be compressed by a ratio.
- If the global norm is less than or equal to ``clip_norm`` , nothing will be done.
The list of Tensor :math:`t\_list` is not passed from this class, but the gradients of all parameters in ``Program`` . If ``need_clip``
is not None, then only part of gradients can be selected for gradient clipping.
Gradient clip will takes effect after being set in ``optimizer`` , see the document ``optimizer``
(for example: :ref:`api_fluid_optimizer_SGDOptimizer`).
The clipping formula is:
.. math::
t\_list[i] = t\_list[i] * \\frac{clip\_norm}{\max(global\_norm, clip\_norm)}
where:
.. math::
global\_norm = \sqrt{\sum_{i=0}^{N-1}(l2norm(t\_list[i]))^2}
Args:
clip_norm (float): The maximum norm value.
group_name (str, optional): The group name for this clip. Default value is ``default_group``
need_clip (function, optional): Type: function. This function accepts a ``Parameter`` and returns ``bool``
(True: the gradient of this ``Parameter`` need to be clipped, False: not need). Default: None,
and gradients of all parameters in the network will be clipped.
Examples:
.. code-block:: python
# use for Static mode
import paddle
import paddle.fluid as fluid
import numpy as np
main_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(
main_program=main_prog, startup_program=startup_prog):
image = fluid.data(
name='x', shape=[-1, 2], dtype='float32')
predict = fluid.layers.fc(input=image, size=3, act='relu') # Trainable parameters: fc_0.w.0, fc_0.b.0
loss = fluid.layers.mean(predict)
# Clip all parameters in network:
clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0)
# Clip a part of parameters in network: (e.g. fc_0.w_0)
# pass a function(fileter_func) to need_clip, and fileter_func receive a ParamBase, and return bool
# def fileter_func(Parameter):
# # It can be easily filtered by Parameter.name (name can be set in fluid.ParamAttr, and the default name is fc_0.w_0, fc_0.b_0)
# return Parameter.name=="fc_0.w_0"
# clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0, need_clip=fileter_func)
sgd_optimizer = fluid.optimizer.SGDOptimizer(learning_rate=0.1, grad_clip=clip)
sgd_optimizer.minimize(loss)
place = fluid.CPUPlace()
exe = fluid.Executor(place)
x = np.random.uniform(-100, 100, (10, 2)).astype('float32')
exe.run(startup_prog)
out = exe.run(main_prog, feed={'x': x}, fetch_list=loss)
# use for Dygraph mode
import paddle
import paddle.fluid as fluid
with fluid.dygraph.guard():
linear = fluid.dygraph.Linear(10, 10) # Trainable: linear_0.w.0, linear_0.b.0
inputs = fluid.layers.uniform_random([32, 10]).astype('float32')
out = linear(fluid.dygraph.to_variable(inputs))
loss = fluid.layers.reduce_mean(out)
loss.backward()
# Clip all parameters in network:
clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0)
# Clip a part of parameters in network: (e.g. linear_0.w_0)
# pass a function(fileter_func) to need_clip, and fileter_func receive a ParamBase, and return bool
# def fileter_func(ParamBase):
# # It can be easily filtered by ParamBase.name(name can be set in fluid.ParamAttr, and the default name is linear_0.w_0, linear_0.b_0)
# return ParamBase.name == "linear_0.w_0"
# # Note: linear.weight and linear.bias can return the weight and bias of dygraph.Linear, respectively, and can be used to filter
# return ParamBase.name == linear.weight.name
# clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0, need_clip=fileter_func)
sgd_optimizer = fluid.optimizer.SGD(
learning_rate=0.1, parameter_list=linear.parameters(), grad_clip=clip)
sgd_optimizer.minimize(loss)
"""
    def __init__(self, clip_value, clip_norm, group_name="default_group", need_clip=None):
        super(DoubleClip, self).__init__(need_clip)
        self.clip_value = float(clip_value)
        self.clip_norm = float(clip_norm)
        self.group_name = group_name

    def __str__(self):
        return "Gradient Clip By Value and GlobalNorm, value={}, global_norm={}".format(
            self.clip_value, self.clip_norm)

    @imperative_base.no_grad
    def _dygraph_clip(self, params_grads):
        params_and_grads = []
        # clip by value first
        for p, g in params_grads:
            if g is None:
                continue
            if self._need_clip_func is not None and not self._need_clip_func(p):
                params_and_grads.append((p, g))
                continue
            new_grad = layers.clip(x=g, min=-self.clip_value, max=self.clip_value)
            params_and_grads.append((p, new_grad))
        params_grads = params_and_grads
        # clip by global norm
        params_and_grads = []
        sum_square_list = []
        for p, g in params_grads:
            if g is None:
                continue
            if self._need_clip_func is not None and not self._need_clip_func(p):
                continue
            merge_grad = g
            if g.type == core.VarDesc.VarType.SELECTED_ROWS:
                merge_grad = layers.merge_selected_rows(g)
                merge_grad = layers.get_tensor_from_selected_rows(merge_grad)
            square = layers.square(merge_grad)
            sum_square = layers.reduce_sum(square)
            sum_square_list.append(sum_square)
        # all parameters have been filtered out
        if len(sum_square_list) == 0:
            return params_grads
        global_norm_var = layers.concat(sum_square_list)
        global_norm_var = layers.reduce_sum(global_norm_var)
        global_norm_var = layers.sqrt(global_norm_var)
        max_global_norm = layers.fill_constant(
            shape=[1], dtype='float32', value=self.clip_norm)
        clip_var = layers.elementwise_div(
            x=max_global_norm,
            y=layers.elementwise_max(
                x=global_norm_var, y=max_global_norm))
        for p, g in params_grads:
            if g is None:
                continue
            if self._need_clip_func is not None and not self._need_clip_func(p):
                params_and_grads.append((p, g))
                continue
            new_grad = layers.elementwise_mul(x=g, y=clip_var)
            params_and_grads.append((p, new_grad))
        return params_and_grads
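The two-stage clipping performed by `DoubleClip._dygraph_clip` (element-wise clipping by value, then rescaling by the global norm) can be illustrated with plain NumPy. This is a simplified sketch and not the Paddle implementation: `double_clip` is a hypothetical name, and the `need_clip` filter and the `SELECTED_ROWS` handling are left out.

```python
import numpy as np


def double_clip(grads, clip_value, clip_norm):
    """Clip each gradient to [-clip_value, clip_value], then by global norm."""
    # Stage 1: element-wise clipping by value.
    clipped = [np.clip(g, -clip_value, clip_value) for g in grads]
    # Stage 2: global norm over all (already value-clipped) gradients.
    global_norm = np.sqrt(sum(np.sum(np.square(g)) for g in clipped))
    # The scale is 1 when global_norm <= clip_norm, so nothing changes then.
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in clipped]
```

Clipping by value first bounds the contribution of any single outlier element before the global norm is computed, so one exploding entry cannot dominate the rescaling of every other gradient.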
meta_data:
  min_text_length: 20
transform:
  # text
  replace_pronunciation_prob: 0.5
  # spectrogram
  sample_rate: 22050
  max_norm: 0.999
  preemphasis: 0.97
  n_fft: 1024
  win_length: 1024
  hop_length: 256
  # mel
  fmin: 125
  fmax: 7600
  n_mels: 80
  # db scale
  min_level_db: -100
  ref_level_db: 20
  clip_norm: true
loss:
  masked_loss_weight: 0.5
  priority_freq: 3000
  priority_freq_weight: 0.0
  binary_divergence_weight: 0.1
  guided_attention_sigma: 0.2
synthesis:
  max_steps: 512
  power: 1.4
  n_iter: 32
model:
  # speaker_embedding
  n_speakers: 1
  speaker_embed_dim: 16
  speaker_embedding_weight_std: 0.01
  max_positions: 512
  dropout: 0.050000000000000044
  # encoder
  text_embed_dim: 256
  embedding_weight_std: 0.1
  freeze_embedding: false
  padding_idx: 0
  encoder_channels: 512
  # decoder
  query_position_rate: 1.0
  key_position_rate: 1.29
  trainable_positional_encodings: false
  kernel_size: 3
  decoder_channels: 256
  downsample_factor: 4
  outputs_per_step: 1
  # attention
  key_projection: true
  value_projection: true
  force_monotonic_attention: true
  window_backward: -1
  window_ahead: 3
  use_memory_mask: true
  # converter
  use_decoder_state_for_postnet_input: true
  converter_channels: 256
optimizer:
  beta1: 0.5
  beta2: 0.9
  epsilon: 1e-6
lr_scheduler:
  warmup_steps: 4000
  peak_learning_rate: 5e-4
train:
  batch_size: 16
  max_iteration: 2000000
  snap_interval: 1000
  eval_interval: 10000
  save_interval: 10000
# data processing
p_pronunciation: 0.99
sample_rate: 22050 # Hz
n_fft: 1024
win_length: 1024
hop_length: 256
n_mels: 80
reduction_factor: 4
# model-s2s
n_speakers: 1
speaker_dim: 16
char_dim: 256
encoder_dim: 64
kernel_size: 5
encoder_layers: 7
decoder_layers: 8
prenet_sizes: [128]
attention_dim: 128
# model-postnet
postnet_layers: 5
postnet_dim: 256
# position embedding
position_weight: 1.0
position_rate: 5.54
forward_step: 4
backward_step: 0
dropout: 0.05
# output-griffinlim
sharpening_factor: 1.4
# optimizer:
learning_rate: 0.001
clip_value: 5.0
clip_norm: 100.0
# training:
batch_size: 16
report_interval: 10000
save_interval: 10000
valid_size: 5
\ No newline at end of file
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
import os
import csv
from pathlib import Path
import numpy as np
from paddle import fluid
import pandas as pd
import librosa
from scipy import signal
import paddle.fluid.dygraph as dg
from parakeet.g2p.en import text_to_sequence, sequence_to_text
from parakeet.data import DatasetMixin, TransformDataset, FilterDataset, CacheDataset
from parakeet.data import DataCargo, PartialyRandomizedSimilarTimeLengthSampler, SequentialSampler, BucketSampler
import paddle
from paddle import fluid
from paddle.fluid import dygraph as dg
from paddle.fluid.dataloader import Dataset, BatchSampler
from paddle.fluid.io import DataLoader
from parakeet.data import DatasetMixin, DataCargo, PartialyRandomizedSimilarTimeLengthSampler
from parakeet.g2p import en
class LJSpeechMetaData(DatasetMixin):
class LJSpeech(DatasetMixin):
def __init__(self, root):
self.root = Path(root)
self._wav_dir = self.root.joinpath("wavs")
csv_path = self.root.joinpath("metadata.csv")
self._root = root
self._table = pd.read_csv(
csv_path,
sep="|",
encoding="utf-8",
header=None,
quoting=csv.QUOTE_NONE,
names=["fname", "raw_text", "normalized_text"])
os.path.join(root, "metadata.csv"),
sep="|",
encoding="utf-8",
quoting=csv.QUOTE_NONE,
header=None,
names=["num_frames", "spec_name", "mel_name", "text"],
dtype={"num_frames": np.int64, "spec_name": str, "mel_name":str, "text":str})
def num_frames(self):
return self._table["num_frames"].to_list()
def get_example(self, i):
fname, raw_text, normalized_text = self._table.iloc[i]
fname = str(self._wav_dir.joinpath(fname + ".wav"))
return fname, raw_text, normalized_text
"""
spec (T_frame, C_spec)
mel (T_frame, C_mel)
"""
num_frames, spec_name, mel_name, text = self._table.iloc[i]
spec = np.load(os.path.join(self._root, spec_name))
mel = np.load(os.path.join(self._root, mel_name))
return (text, spec, mel, num_frames)
def __len__(self):
return len(self._table)
class Transform(object):
def __init__(self,
replace_pronunciation_prob=0.,
sample_rate=22050,
preemphasis=.97,
n_fft=1024,
win_length=1024,
hop_length=256,
fmin=125,
fmax=7600,
n_mels=80,
min_level_db=-100,
ref_level_db=20,
max_norm=0.999,
clip_norm=True):
self.replace_pronunciation_prob = replace_pronunciation_prob
self.sample_rate = sample_rate
self.preemphasis = preemphasis
self.n_fft = n_fft
self.win_length = win_length
self.hop_length = hop_length
self.fmin = fmin
self.fmax = fmax
self.n_mels = n_mels
self.min_level_db = min_level_db
self.ref_level_db = ref_level_db
self.max_norm = max_norm
self.clip_norm = clip_norm
def __call__(self, in_data):
fname, _, normalized_text = in_data
# text processing
mix_grapheme_phonemes = text_to_sequence(
normalized_text, self.replace_pronunciation_prob)
text_length = len(mix_grapheme_phonemes)
# CAUTION: positions start from 1
speaker_id = None
# wave processing
wav, _ = librosa.load(fname, sr=self.sample_rate)
# preemphasis
y = signal.lfilter([1., -self.preemphasis], [1.], wav)
# STFT
D = librosa.stft(
y=y,
n_fft=self.n_fft,
win_length=self.win_length,
hop_length=self.hop_length)
S = np.abs(D)
# to db and normalize to 0-1
amplitude_min = np.exp(self.min_level_db / 20 * np.log(10)) # 1e-5
S_norm = 20 * np.log10(np.maximum(amplitude_min,
S)) - self.ref_level_db
S_norm = (S_norm - self.min_level_db) / (-self.min_level_db)
S_norm = self.max_norm * S_norm
if self.clip_norm:
S_norm = np.clip(S_norm, 0, self.max_norm)
# mel scale and to db and normalize to 0-1,
# CAUTION: pass linear scale S, not dbscaled S
S_mel = librosa.feature.melspectrogram(
S=S, n_mels=self.n_mels, fmin=self.fmin, fmax=self.fmax, power=1.)
S_mel = 20 * np.log10(np.maximum(amplitude_min,
S_mel)) - self.ref_level_db
S_mel_norm = (S_mel - self.min_level_db) / (-self.min_level_db)
S_mel_norm = self.max_norm * S_mel_norm
if self.clip_norm:
S_mel_norm = np.clip(S_mel_norm, 0, self.max_norm)
# num_frames
n_frames = S_mel_norm.shape[-1] # CAUTION: original number of frames
return (mix_grapheme_phonemes, text_length, speaker_id, S_norm.T,
S_mel_norm.T, n_frames)
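The dB scaling and normalization that `Transform.__call__` applies to both the linear and the mel spectrogram can be isolated into one function. The sketch below restates those steps with NumPy only; the name `normalize_db` is hypothetical and the defaults are the values from the `transform` section of the old config.

```python
import numpy as np


def normalize_db(magnitude, min_level_db=-100, ref_level_db=20,
                 max_norm=0.999, clip_norm=True):
    """Map a linear-magnitude spectrogram to dB, normalized to [0, max_norm]."""
    # Floor the magnitude so log10 never sees zero (1e-5 for -100 dB).
    amplitude_min = 10.0 ** (min_level_db / 20.0)
    db = 20 * np.log10(np.maximum(amplitude_min, magnitude)) - ref_level_db
    # Shift and scale so that min_level_db maps to 0 and 0 dB maps to max_norm.
    norm = max_norm * (db - min_level_db) / (-min_level_db)
    if clip_norm:
        norm = np.clip(norm, 0, max_norm)
    return norm
```

For instance, a magnitude of 1.0 is -20 dB after subtracting `ref_level_db`, which normalizes to 0.999 * 0.8 = 0.7992.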
class DataCollector(object):
def __init__(self, downsample_factor=4, r=1):
self.downsample_factor = int(downsample_factor)
self.frames_per_step = int(r)
self._factor = int(downsample_factor * r)
# CAUTION: small diff here
self._pad_begin = int(downsample_factor * r)
def __init__(self, p_pronunciation):
self.p_pronunciation = p_pronunciation
def __call__(self, examples):
batch_size = len(examples)
# lengths
text_lengths = np.array([example[1]
for example in examples]).astype(np.int64)
frames = np.array([example[5]
for example in examples]).astype(np.int64)
"""
output shape and dtype
(B, T_text) int64
(B,) int64
(B, T_frame, C_spec) float32
(B, T_frame, C_mel) float32
(B,) int64
"""
text_seqs = []
specs = []
mels = []
num_frames = np.array([example[3] for example in examples], dtype=np.int64)
max_frames = np.max(num_frames)
max_text_length = int(np.max(text_lengths))
max_frames = int(np.max(frames))
if max_frames % self._factor != 0:
max_frames += (self._factor - max_frames % self._factor)
max_frames += self._pad_begin
max_decoder_length = max_frames // self._factor
# pad time sequence
text_sequences = []
lin_specs = []
mel_specs = []
done_flags = []
for example in examples:
(mix_grapheme_phonemes, text_length, speaker_id, S_norm,
S_mel_norm, num_frames) = example
text_sequences.append(
np.pad(mix_grapheme_phonemes, (0, max_text_length - text_length
),
mode="constant"))
lin_specs.append(
np.pad(S_norm, ((self._pad_begin, max_frames - self._pad_begin
- num_frames), (0, 0)),
mode="constant"))
mel_specs.append(
np.pad(S_mel_norm, ((self._pad_begin, max_frames -
self._pad_begin - num_frames), (0, 0)),
mode="constant"))
done_flags.append(
np.pad(np.zeros((int(np.ceil(num_frames // self._factor)), )),
(0, max_decoder_length - int(
np.ceil(num_frames // self._factor))),
mode="constant",
constant_values=1))
text_sequences = np.array(text_sequences).astype(np.int64)
lin_specs = np.array(lin_specs).astype(np.float32)
mel_specs = np.array(mel_specs).astype(np.float32)
# downsample here
done_flags = np.array(done_flags).astype(np.float32)
# text positions
text_mask = (np.arange(1, 1 + max_text_length) <= np.expand_dims(
text_lengths, -1)).astype(np.int64)
text_positions = np.arange(
1, 1 + max_text_length, dtype=np.int64) * text_mask
# decoder_positions
decoder_positions = np.tile(
np.expand_dims(
np.arange(
1, 1 + max_decoder_length, dtype=np.int64), 0),
(batch_size, 1))
return (text_sequences, text_lengths, text_positions, mel_specs,
lin_specs, frames, decoder_positions, done_flags)
def make_data_loader(data_root, config):
# construct meta data
meta = LJSpeechMetaData(data_root)
# filter it!
min_text_length = config["meta_data"]["min_text_length"]
meta = FilterDataset(meta, lambda x: len(x[2]) >= min_text_length)
# transform meta data into meta data
c = config["transform"]
transform = Transform(
replace_pronunciation_prob=c["replace_pronunciation_prob"],
sample_rate=c["sample_rate"],
preemphasis=c["preemphasis"],
n_fft=c["n_fft"],
win_length=c["win_length"],
hop_length=c["hop_length"],
fmin=c["fmin"],
fmax=c["fmax"],
n_mels=c["n_mels"],
min_level_db=c["min_level_db"],
ref_level_db=c["ref_level_db"],
max_norm=c["max_norm"],
clip_norm=c["clip_norm"])
ljspeech = TransformDataset(meta, transform)
# use meta data's text length as a sort key for the sampler
batch_size = config["train"]["batch_size"]
text_lengths = [len(example[2]) for example in meta]
sampler = PartialyRandomizedSimilarTimeLengthSampler(text_lengths,
batch_size)
env = dg.parallel.ParallelEnv()
num_trainers = env.nranks
local_rank = env.local_rank
sampler = BucketSampler(
text_lengths, batch_size, num_trainers=num_trainers, rank=local_rank)
# some model hyperparameters affect how we process data
model_config = config["model"]
collector = DataCollector(
downsample_factor=model_config["downsample_factor"],
r=model_config["outputs_per_step"])
ljspeech_loader = DataCargo(
ljspeech, batch_fn=collector, batch_size=batch_size, sampler=sampler)
loader = fluid.io.DataLoader.from_generator(capacity=10, return_list=True)
loader.set_batch_generator(
ljspeech_loader, places=fluid.framework._current_expected_place())
return loader
text, spec, mel, _ = example
text_seqs.append(en.text_to_sequence(text, self.p_pronunciation))
# if max_frames - mel.shape[0] < 0:
# import pdb; pdb.set_trace()
specs.append(np.pad(spec, [(0, max_frames - spec.shape[0]), (0, 0)]))
mels.append(np.pad(mel, [(0, max_frames - mel.shape[0]), (0, 0)]))
specs = np.stack(specs)
mels = np.stack(mels)
text_lengths = np.array([len(seq) for seq in text_seqs], dtype=np.int64)
max_length = np.max(text_lengths)
text_seqs = np.array([seq + [0] * (max_length - len(seq)) for seq in text_seqs], dtype=np.int64)
return text_seqs, text_lengths, specs, mels, num_frames
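The batching step in the new `DataCollector.__call__` comes down to zero-padding every `(T_frame, C)` array along the time axis to the longest example and stacking the results. A standalone sketch of that padding, with the hypothetical helper name `pad_and_stack`:

```python
import numpy as np


def pad_and_stack(features):
    """Zero-pad (T_i, C) arrays along time and stack them to (B, T_max, C)."""
    max_frames = max(f.shape[0] for f in features)
    padded = [
        # Pad only at the end of the time axis; leave the channel axis alone.
        np.pad(f, [(0, max_frames - f.shape[0]), (0, 0)], mode="constant")
        for f in features
    ]
    return np.stack(padded)
```

The true lengths (`num_frames` in the collector) are returned alongside the padded batch so that the padded frames can be masked out later.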
if __name__ == "__main__":
import argparse
import tqdm
import time
from ruamel import yaml
parser = argparse.ArgumentParser(description="load the preprocessed ljspeech dataset")
parser.add_argument("--config", type=str, required=True, help="config file")
parser.add_argument("--input", type=str, required=True, help="data path of the original data")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = yaml.safe_load(f)
print("========= Command Line Arguments ========")
for k, v in vars(args).items():
print("{}: {}".format(k, v))
print("=========== Configurations ==============")
for k in ["p_pronunciation", "batch_size"]:
print("{}: {}".format(k, config[k]))
ljspeech = LJSpeech(args.input)
collate_fn = DataCollector(config["p_pronunciation"])
dg.enable_dygraph(fluid.CPUPlace())
sampler = PartialyRandomizedSimilarTimeLengthSampler(ljspeech.num_frames())
cargo = DataCargo(ljspeech, collate_fn,
batch_size=config["batch_size"], sampler=sampler)
loader = DataLoader\
.from_generator(capacity=5, return_list=True)\
.set_batch_generator(cargo)
for i, batch in tqdm.tqdm(enumerate(loader)):
continue
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from paddle import fluid
import paddle.fluid.initializer as I
import paddle.fluid.dygraph as dg
from parakeet.g2p import en
from parakeet.models.deepvoice3 import Encoder, Decoder, Converter, DeepVoice3, TTSLoss, ConvSpec, WindowRange
from parakeet.utils.layer_tools import summary, freeze
def make_model(config):
c = config["model"]
# speaker embedding
n_speakers = c["n_speakers"]
speaker_dim = c["speaker_embed_dim"]
if n_speakers > 1:
speaker_embed = dg.Embedding(
(n_speakers, speaker_dim),
param_attr=I.Normal(scale=c["speaker_embedding_weight_std"]))
else:
speaker_embed = None
# encoder
h = c["encoder_channels"]
k = c["kernel_size"]
encoder_convolutions = (
ConvSpec(h, k, 1),
ConvSpec(h, k, 3),
ConvSpec(h, k, 9),
ConvSpec(h, k, 27),
ConvSpec(h, k, 1),
ConvSpec(h, k, 3),
ConvSpec(h, k, 9),
ConvSpec(h, k, 27),
ConvSpec(h, k, 1),
ConvSpec(h, k, 3), )
encoder = Encoder(
n_vocab=en.n_vocab,
embed_dim=c["text_embed_dim"],
n_speakers=n_speakers,
speaker_dim=speaker_dim,
embedding_weight_std=c["embedding_weight_std"],
convolutions=encoder_convolutions,
dropout=c["dropout"])
if c["freeze_embedding"]:
freeze(encoder.embed)
# decoder
h = c["decoder_channels"]
k = c["kernel_size"]
prenet_convolutions = (ConvSpec(h, k, 1), ConvSpec(h, k, 3))
attentive_convolutions = (
ConvSpec(h, k, 1),
ConvSpec(h, k, 3),
ConvSpec(h, k, 9),
ConvSpec(h, k, 27),
ConvSpec(h, k, 1), )
attention = [True, False, False, False, True]
force_monotonic_attention = [True, False, False, False, True]
window = WindowRange(c["window_backward"], c["window_ahead"])
decoder = Decoder(
n_speakers,
speaker_dim,
embed_dim=c["text_embed_dim"],
mel_dim=config["transform"]["n_mels"],
r=c["outputs_per_step"],
max_positions=c["max_positions"],
preattention=prenet_convolutions,
convolutions=attentive_convolutions,
attention=attention,
dropout=c["dropout"],
use_memory_mask=c["use_memory_mask"],
force_monotonic_attention=force_monotonic_attention,
query_position_rate=c["query_position_rate"],
key_position_rate=c["key_position_rate"],
window_range=window,
key_projection=c["key_projection"],
value_projection=c["value_projection"])
if not c["trainable_positional_encodings"]:
freeze(decoder.embed_keys_positions)
freeze(decoder.embed_query_positions)
# converter(postnet)
linear_dim = 1 + config["transform"]["n_fft"] // 2
h = c["converter_channels"]
k = c["kernel_size"]
postnet_convolutions = (
ConvSpec(h, k, 1),
ConvSpec(h, k, 3),
ConvSpec(2 * h, k, 1),
ConvSpec(2 * h, k, 3), )
use_decoder_states = c["use_decoder_state_for_postnet_input"]
converter = Converter(
n_speakers,
speaker_dim,
in_channels=decoder.state_dim
if use_decoder_states else config["transform"]["n_mels"],
linear_dim=linear_dim,
time_upsampling=c["downsample_factor"],
convolutions=postnet_convolutions,
dropout=c["dropout"])
model = DeepVoice3(
encoder,
decoder,
converter,
speaker_embed,
use_decoder_states=use_decoder_states)
return model
def make_criterion(config):
# =========================loss=========================
loss_config = config["loss"]
transform_config = config["transform"]
model_config = config["model"]
priority_freq = loss_config["priority_freq"] # Hz
sample_rate = transform_config["sample_rate"]
linear_dim = 1 + transform_config["n_fft"] // 2
priority_bin = int(priority_freq / (0.5 * sample_rate) * linear_dim)
criterion = TTSLoss(
masked_weight=loss_config["masked_loss_weight"],
priority_bin=priority_bin,
priority_weight=loss_config["priority_freq_weight"],
binary_divergence_weight=loss_config["binary_divergence_weight"],
guided_attention_sigma=loss_config["guided_attention_sigma"],
downsample_factor=model_config["downsample_factor"],
r=model_config["outputs_per_step"])
return criterion
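The `priority_bin` computed in `make_criterion` maps a cutoff frequency in Hz to an index on the frequency axis of the linear spectrogram. A minimal sketch of that arithmetic follows; the helper name and the numeric values (22050 Hz, `n_fft` of 1024, 3000 Hz cutoff) are illustrative assumptions, not values read from the config.

```python
def priority_bin(priority_freq, sample_rate, n_fft):
    """Map a frequency in Hz to a bin index on the linear-spectrogram axis."""
    linear_dim = 1 + n_fft // 2      # number of frequency bins in the STFT
    nyquist = 0.5 * sample_rate      # highest representable frequency
    return int(priority_freq / nyquist * linear_dim)


# Illustrative values only; the real ones come from the YAML config.
example_bin = priority_bin(3000, 22050, 1024)
print(example_bin)
```

Bins below this index are the ones the loss can weight more heavily through `priority_weight`.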
def make_optimizer(model, config):
# =========================lr_scheduler=========================
lr_config = config["lr_scheduler"]
warmup_steps = lr_config["warmup_steps"]
peak_learning_rate = lr_config["peak_learning_rate"]
lr_scheduler = dg.NoamDecay(1 / (warmup_steps * (peak_learning_rate)**2),
warmup_steps)
# =========================optimizer=========================
optim_config = config["optimizer"]
optim = fluid.optimizer.Adam(
lr_scheduler,
beta1=optim_config["beta1"],
beta2=optim_config["beta2"],
epsilon=optim_config["epsilon"],
parameter_list=model.parameters(),
grad_clip=fluid.clip.GradientClipByGlobalNorm(0.1))
return optim
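The first argument passed to `dg.NoamDecay` in `make_optimizer` looks unusual. Assuming `NoamDecay` follows the standard Noam formula `d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)`, choosing `d_model = 1 / (warmup_steps * peak_learning_rate**2)` makes the schedule reach exactly `peak_learning_rate` at the end of warmup. The sketch below checks this in plain Python; `noam_lr` and the numeric values are hypothetical.

```python
def noam_lr(step, d_model, warmup_steps):
    """Standard Noam schedule (assumed to match dg.NoamDecay)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


warmup_steps = 4000          # hypothetical value
peak_learning_rate = 1e-3    # hypothetical value
d_model = 1 / (warmup_steps * peak_learning_rate ** 2)

lr_at_peak = noam_lr(warmup_steps, d_model, warmup_steps)
lr_during_warmup = noam_lr(warmup_steps // 4, d_model, warmup_steps)
lr_after_warmup = noam_lr(warmup_steps * 4, d_model, warmup_steps)
print(lr_at_peak, lr_during_warmup, lr_after_warmup)
```

The rate rises linearly during warmup and decays with the inverse square root of the step afterwards.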
from __future__ import division
import os
import argparse
from ruamel import yaml
import tqdm
from os.path import join
import csv
import numpy as np
import pandas as pd
import librosa
import logging
from parakeet.data import DatasetMixin
class LJSpeechMetaData(DatasetMixin):
def __init__(self, root):
self.root = root
self._wav_dir = join(root, "wavs")
csv_path = join(root, "metadata.csv")
self._table = pd.read_csv(
csv_path,
sep="|",
encoding="utf-8",
header=None,
quoting=csv.QUOTE_NONE,
names=["fname", "raw_text", "normalized_text"])
def get_example(self, i):
fname, raw_text, normalized_text = self._table.iloc[i]
abs_fname = join(self._wav_dir, fname + ".wav")
return fname, abs_fname, raw_text, normalized_text
def __len__(self):
return len(self._table)
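`LJSpeechMetaData` reads `metadata.csv` as a pipe-separated table with quoting disabled, so quotation marks inside the transcripts are kept literally. The same parsing can be sketched with only the standard library; the two rows below are made up for illustration and `read_metadata` is a hypothetical helper, not part of the repository.

```python
import csv
import io

# Two made-up rows in the "id|raw text|normalized text" layout.
sample = (
    'LJ000-0001|He said "hi" in 1999.|He said "hi" in nineteen ninety-nine.\n'
    "LJ000-0002|Dr. Smith left.|Doctor Smith left.\n"
)


def read_metadata(stream):
    """Yield (fname, raw_text, normalized_text) from a pipe-separated file."""
    reader = csv.reader(stream, delimiter="|", quoting=csv.QUOTE_NONE)
    for fname, raw_text, normalized_text in reader:
        yield fname, raw_text, normalized_text


rows = list(read_metadata(io.StringIO(sample)))
for fname, _, normalized_text in rows:
    print(fname, normalized_text)
```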
class Transform(object):
def __init__(self, sample_rate, n_fft, hop_length, win_length, n_mels, reduction_factor):
self.sample_rate = sample_rate
self.n_fft = n_fft
self.win_length = win_length
self.hop_length = hop_length
self.n_mels = n_mels
self.reduction_factor = reduction_factor
def __call__(self, fname):
# wave processing
audio, _ = librosa.load(fname, sr=self.sample_rate)
# Pad the data to the right size to have a whole number of timesteps,
# accounting properly for the model reduction factor.
frames = audio.size // (self.reduction_factor * self.hop_length) + 1
# librosa's stft extracts frames of n_fft size, so we pad n_fft // 2 on both sides
desired_length = (frames * self.reduction_factor - 1) * self.hop_length + self.n_fft
pad_amount = (desired_length - audio.size) // 2
# we pad manually to control the number of generated frames
if audio.size % 2 == 0:
audio = np.pad(audio, (pad_amount, pad_amount), mode='reflect')
else:
audio = np.pad(audio, (pad_amount, pad_amount + 1), mode='reflect')
# STFT
D = librosa.stft(audio, self.n_fft, self.hop_length, self.win_length, center=False)
S = np.abs(D)
S_mel = librosa.feature.melspectrogram(sr=self.sample_rate, S=S, n_mels=self.n_mels, fmax=8000.0)
# log magnitude
log_spectrogram = np.log(np.clip(S, a_min=1e-5, a_max=None))
log_mel_spectrogram = np.log(np.clip(S_mel, a_min=1e-5, a_max=None))
num_frames = log_spectrogram.shape[-1]
assert num_frames % self.reduction_factor == 0, "num_frames is wrong"
return (log_spectrogram.T, log_mel_spectrogram.T, num_frames)
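The padding arithmetic in `Transform.__call__` is what guarantees that the frame count is a multiple of `reduction_factor`. The sketch below reproduces only the integer arithmetic, without loading audio, assuming an even `n_fft` and `hop_length`; `padded_frame_count` is a hypothetical helper and the numbers are illustrative.

```python
def padded_frame_count(num_samples, n_fft, hop_length, reduction_factor):
    """Number of STFT frames (center=False) after the manual padding."""
    frames = num_samples // (reduction_factor * hop_length) + 1
    desired_length = (frames * reduction_factor - 1) * hop_length + n_fft
    pad_amount = (desired_length - num_samples) // 2
    # An odd sample count gets one extra sample on the right, as in Transform.
    extra = 0 if num_samples % 2 == 0 else 1
    padded_length = num_samples + 2 * pad_amount + extra
    return 1 + (padded_length - n_fft) // hop_length


# One second of audio at 22050 Hz, illustrative STFT settings.
n_frames = padded_frame_count(22050, n_fft=1024, hop_length=256, reduction_factor=4)
print(n_frames)
```

With these settings the padded signal has exactly `desired_length` samples, so the frame count equals `frames * reduction_factor`.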
def save(output_path, dataset, transform):
if not os.path.exists(output_path):
os.makedirs(output_path)
records = []
for example in tqdm.tqdm(dataset):
fname, abs_fname, _, normalized_text = example
log_spec, log_mel_spec, num_frames = transform(abs_fname)
records.append((num_frames,
fname + "_spec.npy",
fname + "_mel.npy",
normalized_text))
np.save(join(output_path, fname + "_spec"), log_spec)
np.save(join(output_path, fname + "_mel"), log_mel_spec)
meta_data = pd.DataFrame.from_records(records)
meta_data.to_csv(join(output_path, "metadata.csv"),
quoting=csv.QUOTE_NONE, sep="|", encoding="utf-8",
header=False, index=False)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="preprocess ljspeech dataset and save it.")
parser.add_argument("--config", type=str, required=True, help="config file")
parser.add_argument("--input", type=str, required=True, help="data path of the original data")
parser.add_argument("--output", type=str, required=True, help="path to save the preprocessed dataset")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = yaml.safe_load(f)
print("========= Command Line Arguments ========")
for k, v in vars(args).items():
print("{}: {}".format(k, v))
print("=========== Configurations ==============")
for k in ["sample_rate", "n_fft", "win_length",
"hop_length", "n_mels", "reduction_factor"]:
print("{}: {}".format(k, config[k]))
ljspeech_meta = LJSpeechMetaData(args.input)
transform = Transform(config["sample_rate"],
config["n_fft"],
config["hop_length"],
config["win_length"],
config["n_mels"],
config["reduction_factor"])
save(args.output, ljspeech_meta, transform)
Scientists at the CERN laboratory say they have discovered a new particle.
There's a way to measure the acute emotional intelligence that has never gone out of style.
President Trump met with other leaders at the Group of 20 conference.
Generative adversarial network or variational auto-encoder.
Please call Stella.
Some have accepted this as a miracle without any physical explanation.
\ No newline at end of file
Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition
in being comparatively modern.
For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process
produced the block books, which were the immediate predecessors of the true printed book,
the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import os
import argparse
import ruamel.yaml
import numpy as np
import soundfile as sf
from paddle import fluid
fluid.require_version('1.8.0')
import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg
from tensorboardX import SummaryWriter
from parakeet.g2p import en
from parakeet.modules.weight_norm import WeightNormWrapper
from parakeet.utils.layer_tools import summary
from parakeet.utils import io
from model import make_model
from utils import make_evaluator
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Synthesize waveform with a checkpoint.")
parser.add_argument("--config", type=str, help="experiment config")
parser.add_argument("--device", type=int, default=-1, help="device to use")
g = parser.add_mutually_exclusive_group()
g.add_argument("--checkpoint", type=str, help="checkpoint to resume from")
g.add_argument(
"--iteration",
type=int,
help="the iteration of the checkpoint to load from output directory")
parser.add_argument("text", type=str, help="text file to synthesize")
parser.add_argument(
"output", type=str, help="path to save synthesized audio")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = ruamel.yaml.safe_load(f)
print("Command Line Args: ")
for k, v in vars(args).items():
print("{}: {}".format(k, v))
if args.device == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(args.device)
dg.enable_dygraph(place)
model = make_model(config)
checkpoint_dir = os.path.join(args.output, "checkpoints")
if args.checkpoint is not None:
iteration = io.load_parameters(model, checkpoint_path=args.checkpoint)
else:
iteration = io.load_parameters(
model, checkpoint_dir=checkpoint_dir, iteration=args.iteration)
# WARNING: don't forget to remove weight norm to re-compute each wrapped layer's weight
# removing weight norm also speeds up computation
for layer in model.sublayers():
if isinstance(layer, WeightNormWrapper):
layer.remove_weight_norm()
synthesis_dir = os.path.join(args.output, "synthesis")
if not os.path.exists(synthesis_dir):
os.makedirs(synthesis_dir)
with open(args.text, "rt", encoding="utf-8") as f:
lines = f.readlines()
sentences = [line.rstrip("\n") for line in lines]
evaluator = make_evaluator(config, sentences, synthesis_dir)
evaluator(model, iteration)
import numpy as np
from matplotlib import cm
import librosa
import os
import time
import tqdm
import argparse
from ruamel import yaml
import paddle
from paddle import fluid
from paddle.fluid import layers as F
from paddle.fluid import dygraph as dg
from paddle.fluid.io import DataLoader
from tensorboardX import SummaryWriter
import soundfile as sf
from parakeet.data import SliceDataset, DataCargo, PartialyRandomizedSimilarTimeLengthSampler, SequentialSampler
from parakeet.utils.io import save_parameters, load_parameters, add_yaml_config_to_args
from parakeet.g2p import en
from vocoder import WaveflowVocoder
from train import create_model
def main(args, config):
model = create_model(config)
loaded_step = load_parameters(model, checkpoint_path=args.checkpoint)
model.eval()
vocoder = WaveflowVocoder()
vocoder.model.eval()
if not os.path.exists(args.output):
os.makedirs(args.output)
monotonic_layers = [int(item.strip()) - 1 for item in args.monotonic_layers.split(',')]
with open(args.input, 'rt') as f:
sentences = [line.strip() for line in f.readlines()]
for i, sentence in enumerate(sentences):
wav = synthesize(config, model, vocoder, sentence, monotonic_layers)
sf.write(os.path.join(args.output, "sentence{}.wav".format(i)),
wav, samplerate=config["sample_rate"])
def synthesize(config, model, vocoder, sentence, monotonic_layers):
print("[synthesize] {}".format(sentence))
text = en.text_to_sequence(sentence, p=1.0)
text = np.expand_dims(np.array(text, dtype="int64"), 0)
lengths = np.array([text.size], dtype=np.int64)
text_seqs = dg.to_variable(text)
text_lengths = dg.to_variable(lengths)
decoder_layers = config["decoder_layers"]
force_monotonic_attention = [False] * decoder_layers
for i in monotonic_layers:
force_monotonic_attention[i] = True
with dg.no_grad():
outputs = model(text_seqs, text_lengths, speakers=None,
force_monotonic_attention=force_monotonic_attention,
window=(config["backward_step"], config["forward_step"]))
decoded, refined, attentions = outputs
wav = vocoder(F.transpose(decoded, (0, 2, 1)))
wav_np = wav.numpy()[0]
return wav_np
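The `--monotonic_layers` option is a comma-separated list of 1-based decoder layer indices, which `main` converts to 0-based indices and `synthesize` turns into a per-layer boolean list. A small sketch of that conversion follows; `monotonic_mask` is a hypothetical helper, and the range check is an addition that the original code does not perform (there, an index of `0` would silently select the last layer).

```python
def monotonic_mask(spec, decoder_layers):
    """Turn a 1-based, comma-separated layer list into a per-layer mask."""
    mask = [False] * decoder_layers
    for item in spec.split(","):
        index = int(item.strip())
        # Added safety check: reject indices outside 1..decoder_layers.
        if not 1 <= index <= decoder_layers:
            raise ValueError("layer index out of range: {}".format(index))
        mask[index - 1] = True
    return mask


mask = monotonic_mask("1, 5", decoder_layers=5)
print(mask)
```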
if __name__ == "__main__":
import argparse
from ruamel import yaml
parser = argparse.ArgumentParser("synthesize from a checkpoint")
parser.add_argument("--config", type=str, required=True, help="config file")
parser.add_argument("--input", type=str, required=True, help="text file to synthesize")
parser.add_argument("--output", type=str, required=True, help="path to save audio")
parser.add_argument("--checkpoint", type=str, required=True, help="data path of the checkpoint")
parser.add_argument("--monotonic_layers", type=str, required=True, help="comma-separated monotonic decoder layers, index starts from 1")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = yaml.safe_load(f)
dg.enable_dygraph(fluid.CUDAPlace(0))
main(args, config)
\ No newline at end of file
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import time
import numpy as np
from matplotlib import cm
import librosa
import os
import argparse
import ruamel.yaml
import time
import tqdm
from tensorboardX import SummaryWriter
import paddle
from paddle import fluid
fluid.require_version('1.8.0')
import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg
from parakeet.utils.io import load_parameters, save_parameters
from data import make_data_loader
from model import make_model, make_criterion, make_optimizer
from utils import make_output_tree, add_options, get_place, Evaluator, StateSaver, make_evaluator, make_state_saver
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Train a Deep Voice 3 model with LJSpeech dataset.")
add_options(parser)
args, _ = parser.parse_known_args()
# only use args.device when training in single process
# when training with distributed.launch, devices are provided by
# `--selected_gpus` for distributed.launch
env = dg.parallel.ParallelEnv()
device_id = env.dev_id if env.nranks > 1 else args.device
place = get_place(device_id)
# start dygraph
dg.enable_dygraph(place)
from paddle.fluid import layers as F
from paddle.fluid import dygraph as dg
from paddle.fluid.io import DataLoader
from tensorboardX import SummaryWriter
with open(args.config, 'rt') as f:
config = ruamel.yaml.safe_load(f)
print("Command Line Args: ")
for k, v in vars(args).items():
print("{}: {}".format(k, v))
data_loader = make_data_loader(args.data, config)
model = make_model(config)
if env.nranks > 1:
strategy = dg.parallel.prepare_context()
model = dg.DataParallel(model, strategy)
criterion = make_criterion(config)
optim = make_optimizer(model, config)
# generation
synthesis_config = config["synthesis"]
power = synthesis_config["power"]
n_iter = synthesis_config["n_iter"]
# tensorboard & checkpoint preparation
output_dir = args.output
ckpt_dir = os.path.join(output_dir, "checkpoints")
log_dir = os.path.join(output_dir, "log")
state_dir = os.path.join(output_dir, "states")
eval_dir = os.path.join(output_dir, "eval")
if env.local_rank == 0:
make_output_tree(output_dir)
writer = SummaryWriter(logdir=log_dir)
else:
writer = None
sentences = [
"Scientists at the CERN laboratory say they have discovered a new particle.",
"There's a way to measure the acute emotional intelligence that has never gone out of style.",
"President Trump met with other leaders at the Group of 20 conference.",
"Generative adversarial network or variational auto-encoder.",
"Please call Stella.",
"Some have accepted this as a miracle without any physical explanation.",
]
evaluator = make_evaluator(config, sentences, eval_dir, writer)
state_saver = make_state_saver(config, state_dir, writer)
# load parameters and optimizer, and update the number of iterations done so far
if args.checkpoint is not None:
iteration = load_parameters(
model, optim, checkpoint_path=args.checkpoint)
else:
iteration = load_parameters(
model, optim, checkpoint_dir=ckpt_dir, iteration=args.iteration)
# =========================train=========================
train_config = config["train"]
max_iter = train_config["max_iteration"]
snap_interval = train_config["snap_interval"]
save_interval = train_config["save_interval"]
eval_interval = train_config["eval_interval"]
global_step = iteration + 1
iterator = iter(tqdm.tqdm(data_loader))
downsample_factor = config["model"]["downsample_factor"]
while global_step <= max_iter:
from parakeet.models.deepvoice3 import Encoder, Decoder, PostNet, SpectraNet
from parakeet.data import SliceDataset, DataCargo, PartialyRandomizedSimilarTimeLengthSampler, SequentialSampler
from parakeet.utils.io import save_parameters, load_parameters
from parakeet.g2p import en
from data import LJSpeech, DataCollector
from vocoder import WaveflowVocoder, GriffinLimVocoder
from clip import DoubleClip
def create_model(config):
char_embedding = dg.Embedding((en.n_vocab, config["char_dim"]))
multi_speaker = config["n_speakers"] > 1
speaker_embedding = dg.Embedding((config["n_speakers"], config["speaker_dim"])) \
if multi_speaker else None
encoder = Encoder(config["encoder_layers"], config["char_dim"],
config["encoder_dim"], config["kernel_size"],
has_bias=multi_speaker, bias_dim=config["speaker_dim"],
keep_prob=1.0 - config["dropout"])
decoder = Decoder(config["n_mels"], config["reduction_factor"],
list(config["prenet_sizes"]) + [config["char_dim"]],
config["decoder_layers"], config["kernel_size"],
config["attention_dim"],
position_encoding_weight=config["position_weight"],
omega=config["position_rate"],
has_bias=multi_speaker, bias_dim=config["speaker_dim"],
keep_prob=1.0 - config["dropout"])
postnet = PostNet(config["postnet_layers"], config["char_dim"],
config["postnet_dim"], config["kernel_size"],
config["n_mels"], config["reduction_factor"],
has_bias=multi_speaker, bias_dim=config["speaker_dim"],
keep_prob=1.0 - config["dropout"])
spectranet = SpectraNet(char_embedding, speaker_embedding, encoder, decoder, postnet)
return spectranet
def create_data(config, data_path):
dataset = LJSpeech(data_path)
train_dataset = SliceDataset(dataset, config["valid_size"], len(dataset))
train_collator = DataCollector(config["p_pronunciation"])
train_sampler = PartialyRandomizedSimilarTimeLengthSampler(
dataset.num_frames()[config["valid_size"]:])
train_cargo = DataCargo(train_dataset, train_collator,
batch_size=config["batch_size"], sampler=train_sampler)
train_loader = DataLoader\
.from_generator(capacity=10, return_list=True)\
.set_batch_generator(train_cargo)
valid_dataset = SliceDataset(dataset, 0, config["valid_size"])
valid_collector = DataCollector(1.)
valid_sampler = SequentialSampler(valid_dataset)
valid_cargo = DataCargo(valid_dataset, valid_collector,
batch_size=1, sampler=valid_sampler)
valid_loader = DataLoader\
.from_generator(capacity=2, return_list=True)\
.set_batch_generator(valid_cargo)
return train_loader, valid_loader
def create_optimizer(model, config):
optim = fluid.optimizer.Adam(config["learning_rate"],
parameter_list=model.parameters(),
grad_clip=DoubleClip(config["clip_value"], config["clip_norm"]))
return optim
def train(args, config):
model = create_model(config)
train_loader, valid_loader = create_data(config, args.input)
optim = create_optimizer(model, config)
global global_step
max_iteration = 2000000
iterator = iter(tqdm.tqdm(train_loader))
while global_step <= max_iteration:
# get inputs
try:
batch = next(iterator)
except StopIteration as e:
iterator = iter(tqdm.tqdm(data_loader))
except StopIteration:
iterator = iter(tqdm.tqdm(train_loader))
batch = next(iterator)
# unzip it
text_seqs, text_lengths, specs, mels, num_frames = batch
# forward & backward
model.train()
(text_sequences, text_lengths, text_positions, mel_specs, lin_specs,
frames, decoder_positions, done_flags) = batch
downsampled_mel_specs = F.strided_slice(
mel_specs,
axes=[1],
starts=[0],
ends=[mel_specs.shape[1]],
strides=[downsample_factor])
outputs = model(
text_sequences,
text_positions,
text_lengths,
None,
downsampled_mel_specs,
decoder_positions, )
# mel_outputs, linear_outputs, alignments, done
inputs = (downsampled_mel_specs, lin_specs, done_flags, text_lengths,
frames)
losses = criterion(outputs, inputs)
l = losses["loss"]
if env.nranks > 1:
l = model.scale_loss(l)
l.backward()
model.apply_collective_grads()
else:
l.backward()
# record learning rate before updating
if env.local_rank == 0:
writer.add_scalar("learning_rate",
optim._learning_rate.step().numpy(), global_step)
optim.minimize(l)
optim.clear_gradients()
# record step losses
step_loss = {k: v.numpy()[0] for k, v in losses.items()}
if env.local_rank == 0:
tqdm.tqdm.write("[Train] global_step: {}\tloss: {}".format(
global_step, step_loss["loss"]))
for k, v in step_loss.items():
writer.add_scalar(k, v, global_step)
# train state saving, the first sentence in the batch
if env.local_rank == 0 and global_step % snap_interval == 0:
input_specs = (mel_specs, lin_specs)
state_saver(outputs, input_specs, global_step)
# evaluation
if env.local_rank == 0 and global_step % eval_interval == 0:
evaluator(model, global_step)
# save checkpoint
if env.local_rank == 0 and global_step % save_interval == 0:
save_parameters(ckpt_dir, global_step, model, optim)
outputs = model(text_seqs, text_lengths, speakers=None, mel=mels)
decoded, refined, attentions, final_state = outputs
causal_mel_loss = model.spec_loss(decoded, mels, num_frames)
non_causal_mel_loss = model.spec_loss(refined, mels, num_frames)
loss = causal_mel_loss + non_causal_mel_loss
loss.backward()
# update
optim.minimize(loss)
# logging
tqdm.tqdm.write("[train] step: {}\tloss: {:.6f}\tcausal:{:.6f}\tnon_causal:{:.6f}".format(
global_step,
loss.numpy()[0],
causal_mel_loss.numpy()[0],
non_causal_mel_loss.numpy()[0]))
writer.add_scalar("loss/causal_mel_loss", causal_mel_loss.numpy()[0], global_step=global_step)
writer.add_scalar("loss/non_causal_mel_loss", non_causal_mel_loss.numpy()[0], global_step=global_step)
writer.add_scalar("loss/loss", loss.numpy()[0], global_step=global_step)
if global_step % config["report_interval"] == 0:
text_length = int(text_lengths.numpy()[0])
num_frame = int(num_frames.numpy()[0])
tag = "train_mel/ground-truth"
img = cm.viridis(normalize(mels.numpy()[0, :num_frame].T))
writer.add_image(tag, img, global_step=global_step, dataformats="HWC")
tag = "train_mel/decoded"
img = cm.viridis(normalize(decoded.numpy()[0, :num_frame].T))
writer.add_image(tag, img, global_step=global_step, dataformats="HWC")
tag = "train_mel/refined"
img = cm.viridis(normalize(refined.numpy()[0, :num_frame].T))
writer.add_image(tag, img, global_step=global_step, dataformats="HWC")
vocoder = WaveflowVocoder()
vocoder.model.eval()
tag = "train_audio/ground-truth-waveflow"
wav = vocoder(F.transpose(mels[0:1, :num_frame, :], (0, 2, 1)))
writer.add_audio(tag, wav.numpy()[0], global_step=global_step, sample_rate=22050)
tag = "train_audio/decoded-waveflow"
wav = vocoder(F.transpose(decoded[0:1, :num_frame, :], (0, 2, 1)))
writer.add_audio(tag, wav.numpy()[0], global_step=global_step, sample_rate=22050)
tag = "train_audio/refined-waveflow"
wav = vocoder(F.transpose(refined[0:1, :num_frame, :], (0, 2, 1)))
writer.add_audio(tag, wav.numpy()[0], global_step=global_step, sample_rate=22050)
attentions_np = attentions.numpy()
attentions_np = attentions_np[:, 0, :num_frame // 4 , :text_length]
for i, attention_layer in enumerate(np.rot90(attentions_np, axes=(1,2))):
tag = "train_attention/layer_{}".format(i)
img = cm.viridis(normalize(attention_layer))
writer.add_image(tag, img, global_step=global_step, dataformats="HWC")
if global_step % config["save_interval"] == 0:
save_parameters(writer.logdir, global_step, model, optim)
# global step +1
global_step += 1
def normalize(arr):
return (arr - arr.min()) / (arr.max() - arr.min())
if __name__ == "__main__":
import argparse
from ruamel import yaml
parser = argparse.ArgumentParser(description="train a Deep Voice 3 model with LJSpeech")
parser.add_argument("--config", type=str, required=True, help="config file")
parser.add_argument("--input", type=str, required=True, help="data path of the original data")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = yaml.safe_load(f)
dg.enable_dygraph(fluid.CUDAPlace(0))
global global_step
global_step = 1
global writer
writer = SummaryWriter()
print("[Training] tensorboard log and checkpoints are saved in {}".format(
writer.logdir))
train(args, config)
\ No newline at end of file
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import os
import numpy as np
import matplotlib
matplotlib.use("agg")
from matplotlib import cm
import matplotlib.pyplot as plt
import librosa
from scipy import signal
from librosa import display
import soundfile as sf
from paddle import fluid
import paddle.fluid.dygraph as dg
from parakeet.g2p import en
def get_place(device_id):
"""get place from device_id, -1 stands for CPU"""
if device_id == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(device_id)
return place
def add_options(parser):
parser.add_argument("--config", type=str, help="experiment config")
parser.add_argument(
"--data",
type=str,
default="/workspace/datasets/LJSpeech-1.1/",
help="The path of the LJSpeech dataset.")
parser.add_argument("--device", type=int, default=-1, help="device to use")
g = parser.add_mutually_exclusive_group()
g.add_argument("--checkpoint", type=str, help="checkpoint to resume from.")
g.add_argument(
"--iteration",
type=int,
help="the iteration of the checkpoint to load from output directory")
parser.add_argument(
"output", type=str, default="experiment", help="path to save results")
def make_evaluator(config, text_sequences, output_dir, writer=None):
c = config["transform"]
p_replace = 0.0
sample_rate = c["sample_rate"]
preemphasis = c["preemphasis"]
win_length = c["win_length"]
hop_length = c["hop_length"]
min_level_db = c["min_level_db"]
ref_level_db = c["ref_level_db"]
synthesis_config = config["synthesis"]
power = synthesis_config["power"]
n_iter = synthesis_config["n_iter"]
return Evaluator(
text_sequences,
p_replace,
sample_rate,
preemphasis,
win_length,
hop_length,
min_level_db,
ref_level_db,
power,
n_iter,
output_dir=output_dir,
writer=writer)
class Evaluator(object):
def __init__(self,
text_sequences,
p_replace,
sample_rate,
preemphasis,
win_length,
hop_length,
min_level_db,
ref_level_db,
power,
n_iter,
output_dir,
writer=None):
self.text_sequences = text_sequences
self.output_dir = output_dir
self.writer = writer
self.p_replace = p_replace
self.sample_rate = sample_rate
self.preemphasis = preemphasis
self.win_length = win_length
self.hop_length = hop_length
self.min_level_db = min_level_db
self.ref_level_db = ref_level_db
self.power = power
self.n_iter = n_iter
def process_a_sentence(self, model, text):
text = np.array(
en.text_to_sequence(
text, p=self.p_replace), dtype=np.int64)
length = len(text)
text_positions = np.arange(1, 1 + length, dtype=np.int64)
text = np.expand_dims(text, 0)
text_positions = np.expand_dims(text_positions, 0)
model.eval()
if isinstance(model, dg.DataParallel):
_model = model._layers
else:
_model = model
mel_outputs, linear_outputs, alignments, done = _model.transduce(
dg.to_variable(text), dg.to_variable(text_positions))
linear_outputs_np = linear_outputs.numpy()[0].T # (C, T)
wav = spec_to_waveform(linear_outputs_np, self.min_level_db,
self.ref_level_db, self.power, self.n_iter,
self.win_length, self.hop_length,
self.preemphasis)
alignments_np = alignments.numpy()[0] # batch_size = 1
return wav, alignments_np
def __call__(self, model, iteration):
writer = self.writer
for i, seq in enumerate(self.text_sequences):
print("[Eval] synthesizing sentence {}".format(i))
wav, alignments_np = self.process_a_sentence(model, seq)
wav_path = os.path.join(
self.output_dir,
"eval_sample_{}_step_{:09d}.wav".format(i, iteration))
sf.write(wav_path, wav, self.sample_rate)
if writer is not None:
writer.add_audio(
"eval_sample_{}".format(i),
wav,
iteration,
sample_rate=self.sample_rate)
attn_path = os.path.join(
self.output_dir,
"eval_sample_{}_step_{:09d}.png".format(i, iteration))
plot_alignment(alignments_np, attn_path)
if writer is not None:
writer.add_image(
"eval_sample_attn_{}".format(i),
cm.viridis(alignments_np),
iteration,
dataformats="HWC")
def make_state_saver(config, output_dir, writer=None):
c = config["transform"]
p_replace = c["replace_pronunciation_prob"]
sample_rate = c["sample_rate"]
preemphasis = c["preemphasis"]
win_length = c["win_length"]
hop_length = c["hop_length"]
min_level_db = c["min_level_db"]
ref_level_db = c["ref_level_db"]
synthesis_config = config["synthesis"]
power = synthesis_config["power"]
n_iter = synthesis_config["n_iter"]
return StateSaver(p_replace, sample_rate, preemphasis, win_length,
hop_length, min_level_db, ref_level_db, power, n_iter,
output_dir, writer)
class StateSaver(object):
def __init__(self,
p_replace,
sample_rate,
preemphasis,
win_length,
hop_length,
min_level_db,
ref_level_db,
power,
n_iter,
output_dir,
writer=None):
self.output_dir = output_dir
self.writer = writer
self.p_replace = p_replace
self.sample_rate = sample_rate
self.preemphasis = preemphasis
self.win_length = win_length
self.hop_length = hop_length
self.min_level_db = min_level_db
self.ref_level_db = ref_level_db
self.power = power
self.n_iter = n_iter
def __call__(self, outputs, inputs, iteration):
mel_output, lin_output, alignments, done_output = outputs
mel_input, lin_input = inputs
writer = self.writer
# mel spectrogram
mel_input = mel_input[0].numpy().T
mel_output = mel_output[0].numpy().T
path = os.path.join(self.output_dir, "mel_spec")
plt.figure(figsize=(10, 3))
display.specshow(mel_input)
plt.colorbar()
plt.title("mel_input")
plt.savefig(
os.path.join(path, "target_mel_spec_step_{:09d}.png".format(
iteration)))
plt.close()
if writer is not None:
writer.add_image(
"target/mel_spec",
cm.viridis(mel_input),
iteration,
dataformats="HWC")
plt.figure(figsize=(10, 3))
display.specshow(mel_output)
plt.colorbar()
plt.title("mel_output")
plt.savefig(
os.path.join(path, "predicted_mel_spec_step_{:09d}.png".format(
iteration)))
plt.close()
if writer is not None:
writer.add_image(
"predicted/mel_spec",
cm.viridis(mel_output),
iteration,
dataformats="HWC")
# linear spectrogram
lin_input = lin_input[0].numpy().T
lin_output = lin_output[0].numpy().T
path = os.path.join(self.output_dir, "lin_spec")
plt.figure(figsize=(10, 3))
display.specshow(lin_input)
plt.colorbar()
plt.title("lin_input")
plt.savefig(
os.path.join(path, "target_lin_spec_step_{:09d}.png".format(
iteration)))
plt.close()
if writer is not None:
writer.add_image(
"target/lin_spec",
cm.viridis(lin_input),
iteration,
dataformats="HWC")
plt.figure(figsize=(10, 3))
display.specshow(lin_output)
plt.colorbar()
plt.title("lin_output")
plt.savefig(
os.path.join(path, "predicted_lin_spec_step_{:09d}.png".format(
iteration)))
plt.close()
if writer is not None:
writer.add_image(
"predicted/lin_spec",
cm.viridis(lin_output),
iteration,
dataformats="HWC")
# alignment
path = os.path.join(self.output_dir, "alignments")
alignments = alignments[:, 0, :, :].numpy()
for idx, attn_layer in enumerate(alignments):
save_path = os.path.join(
path, "train_attn_layer_{}_step_{}.png".format(idx, iteration))
plot_alignment(attn_layer, save_path)
if writer is not None:
writer.add_image(
"train_attn/layer_{}".format(idx),
cm.viridis(attn_layer),
iteration,
dataformats="HWC")
# synthesize waveform
wav = spec_to_waveform(
lin_output, self.min_level_db, self.ref_level_db, self.power,
self.n_iter, self.win_length, self.hop_length, self.preemphasis)
path = os.path.join(self.output_dir, "waveform")
save_path = os.path.join(
path, "train_sample_step_{:09d}.wav".format(iteration))
sf.write(save_path, wav, self.sample_rate)
if writer is not None:
writer.add_audio(
"train_sample", wav, iteration, sample_rate=self.sample_rate)
def spec_to_waveform(spec, min_level_db, ref_level_db, power, n_iter,
win_length, hop_length, preemphasis):
"""Convert output linear spec to waveform using griffin-lim vocoder.
Args:
spec (ndarray): the output linear spectrogram, shape(C, T), where C is the number of frequency bins (1 + n_fft // 2) and T is the number of frames.
"""
denormalized = np.clip(spec, 0, 1) * (-min_level_db) + min_level_db
lin_scaled = np.exp((denormalized + ref_level_db) / 20 * np.log(10))
wav = librosa.griffinlim(
lin_scaled**power,
n_iter=n_iter,
hop_length=hop_length,
win_length=win_length)
if preemphasis > 0:
wav = signal.lfilter([1.], [1., -preemphasis], wav)
wav = np.clip(wav, -1.0, 1.0)
return wav
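The denormalization arithmetic in `spec_to_waveform` can be checked in isolation. The following is a minimal NumPy sketch rather than the repository's code: `denormalize_to_amplitude` is a hypothetical helper name, and the `min_level_db` and `ref_level_db` values are illustrative assumptions. It covers only the two lines that precede the Griffin-Lim call.

```python
import numpy as np


def denormalize_to_amplitude(spec, min_level_db, ref_level_db):
    """Map a [0, 1]-normalized spectrogram back to linear amplitude."""
    # [0, 1] -> [min_level_db, 0] decibels
    db = np.clip(spec, 0, 1) * (-min_level_db) + min_level_db
    # add the reference level back, then dB -> amplitude: 10 ** (dB / 20)
    return np.power(10.0, (db + ref_level_db) / 20)


# illustrative values (assumptions, not taken from any config in this repo)
spec = np.array([[0.0, 0.5, 1.0]])
amplitude = denormalize_to_amplitude(spec, min_level_db=-100, ref_level_db=20)
```

`np.power(10.0, d / 20)` is the same quantity as the original `np.exp(d / 20 * np.log(10))`, written in the more common decibel form.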
def make_output_tree(output_dir):
print("creating output tree: {}".format(output_dir))
ckpt_dir = os.path.join(output_dir, "checkpoints")
state_dir = os.path.join(output_dir, "states")
eval_dir = os.path.join(output_dir, "eval")
for x in [ckpt_dir, state_dir, eval_dir]:
if not os.path.exists(x):
os.makedirs(x)
for x in ["alignments", "waveform", "lin_spec", "mel_spec"]:
p = os.path.join(state_dir, x)
if not os.path.exists(p):
os.makedirs(p)
def plot_alignment(alignment, path):
"""
Plot an attention layer's alignment for a sentence.
alignment: shape(T_dec, T_enc).
"""
plt.figure()
plt.imshow(alignment)
plt.colorbar()
plt.xlabel('Encoder timestep')
plt.ylabel('Decoder timestep')
plt.savefig(path)
plt.close()
import argparse
from ruamel import yaml
import numpy as np
import librosa
import paddle
from paddle import fluid
from paddle.fluid import layers as F
from paddle.fluid import dygraph as dg
from parakeet.utils.io import load_parameters
from parakeet.models.waveflow.waveflow_modules import WaveFlowModule
class WaveflowVocoder(object):
def __init__(self):
config_path = "waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml"
with open(config_path, 'rt') as f:
config = yaml.safe_load(f)
ns = argparse.Namespace()
for k, v in config.items():
setattr(ns, k, v)
ns.use_fp16 = False
self.model = WaveFlowModule(ns)
checkpoint_path = "waveflow_res128_ljspeech_ckpt_1.0/step-2000000"
load_parameters(self.model, checkpoint_path=checkpoint_path)
def __call__(self, mel):
with dg.no_grad():
self.model.eval()
audio = self.model.synthesize(mel)
self.model.train()
return audio
class GriffinLimVocoder(object):
def __init__(self, sharpening_factor=1.4, win_length=1024, hop_length=256):
self.sharpening_factor = sharpening_factor
self.win_length = win_length
self.hop_length = hop_length
def __call__(self, spec):
audio = librosa.core.griffinlim(np.exp(spec * self.sharpening_factor),
win_length=self.win_length, hop_length=self.hop_length)
return audio
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from parakeet.models.deepvoice3.encoder import Encoder, ConvSpec
from parakeet.models.deepvoice3.decoder import Decoder, WindowRange
from parakeet.models.deepvoice3.converter import Converter
from parakeet.models.deepvoice3.loss import TTSLoss
from parakeet.models.deepvoice3.model import DeepVoice3
from .model import *
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
from collections import namedtuple
from paddle import fluid
import paddle.fluid.dygraph as dg
import paddle.fluid.layers as F
import paddle.fluid.initializer as I
from parakeet.modules.weight_norm import Linear
WindowRange = namedtuple("WindowRange", ["backward", "ahead"])
class Attention(dg.Layer):
def __init__(self,
query_dim,
embed_dim,
dropout=0.0,
window_range=WindowRange(-1, 3),
key_projection=True,
value_projection=True):
"""Attention Layer for Deep Voice 3.
Args:
query_dim (int): the dimension of query vectors. (The size of a single vector of query.)
embed_dim (int): the dimension of keys and values.
dropout (float, optional): dropout probability of attention. Defaults to 0.0.
window_range (WindowRange, optional): range of attention, this is only used at inference. Defaults to WindowRange(-1, 3).
key_projection (bool, optional): whether the `Attention` Layer has a Linear Layer for the keys to pass through before computing attention. Defaults to True.
value_projection (bool, optional): whether the `Attention` Layer has a Linear Layer for the values to pass through before computing attention. Defaults to True.
"""
super(Attention, self).__init__()
std = np.sqrt(1 / query_dim)
self.query_proj = Linear(
query_dim, embed_dim, param_attr=I.Normal(scale=std))
if key_projection:
std = np.sqrt(1 / embed_dim)
self.key_proj = Linear(
embed_dim, embed_dim, param_attr=I.Normal(scale=std))
if value_projection:
std = np.sqrt(1 / embed_dim)
self.value_proj = Linear(
embed_dim, embed_dim, param_attr=I.Normal(scale=std))
std = np.sqrt(1 / embed_dim)
self.out_proj = Linear(
embed_dim, query_dim, param_attr=I.Normal(scale=std))
self.key_projection = key_projection
self.value_projection = value_projection
self.dropout = dropout
self.window_range = window_range
def forward(self, query, encoder_out, mask=None, last_attended=None):
"""
Compute contextualized representation and alignment scores.
Args:
query (Variable): shape(B, T_dec, C_q), dtype float32, the query tensor, where C_q means the query dim.
encoder_out (keys, values):
keys (Variable): shape(B, T_enc, C_emb), dtype float32, the key representation from an encoder, where C_emb means embed dim.
values (Variable): shape(B, T_enc, C_emb), dtype float32, the value representation from an encoder, where C_emb means embed dim.
mask (Variable, optional): shape(B, T_enc), dtype float32, mask generated with valid text lengths. Pad tokens correspond to 1, and valid tokens correspond to 0.
last_attended (int, optional): The position that received the most attention at the last time step. This is only used at inference.
Returns:
x (Variable): shape(B, T_dec, C_q), dtype float32, the contextualized representation from the attention mechanism.
attn_scores (Variable): shape(B, T_dec, T_enc), dtype float32, the alignment tensor, where T_dec means the number of decoder time steps and T_enc means the number of encoder time steps.
"""
keys, values = encoder_out
residual = query
if self.value_projection:
values = self.value_proj(values)
if self.key_projection:
keys = self.key_proj(keys)
x = self.query_proj(query)
x = F.matmul(x, keys, transpose_y=True)
# mask generated by sentence length
neg_inf = -1.e30
if mask is not None:
neg_inf_mask = F.scale(F.unsqueeze(mask, [1]), neg_inf)
x += neg_inf_mask
# if last_attended is provided, focus only on a window range around it
# to enforce monotonic attention.
if last_attended is not None:
locality_mask = np.ones(shape=x.shape, dtype=np.float32)
backward, ahead = self.window_range
backward = last_attended + backward
ahead = last_attended + ahead
backward = max(backward, 0)
ahead = min(ahead, x.shape[-1])
locality_mask[:, :, backward:ahead] = 0.
locality_mask = dg.to_variable(locality_mask)
neg_inf_mask = F.scale(locality_mask, neg_inf)
x += neg_inf_mask
x = F.softmax(x)
attn_scores = x
x = F.dropout(
x, self.dropout, dropout_implementation="upscale_in_train")
x = F.matmul(x, values)
encoder_length = keys.shape[1]
x = F.scale(x, encoder_length * np.sqrt(1.0 / encoder_length))
x = self.out_proj(x)
x = F.scale((x + residual), np.sqrt(0.5))
return x, attn_scores
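The locality mask used at inference can be illustrated without Paddle. This is a hedged NumPy sketch of the same idea, not the layer itself: `windowed_softmax` is a hypothetical name, projections and dropout are omitted, and the window default mirrors `WindowRange(-1, 3)` from the constructor above.

```python
import numpy as np


def windowed_softmax(scores, last_attended, window=(-1, 3), neg_inf=-1.0e30):
    """scores: (B, T_dec, T_enc). Mask positions outside the window, then softmax."""
    t_enc = scores.shape[-1]
    start = max(last_attended + window[0], 0)
    end = min(last_attended + window[1], t_enc)
    # 1 marks positions to suppress, 0 marks positions inside the window
    mask = np.ones_like(scores)
    mask[:, :, start:end] = 0.0
    masked = scores + mask * neg_inf
    # numerically stable softmax over the encoder axis
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)


scores = np.zeros((1, 1, 10))
weights = windowed_softmax(scores, last_attended=4)
```

With uniform scores and `last_attended=4`, only encoder positions 3 to 6 keep non-zero weight, which is what keeps the alignment close to monotonic during step-by-step decoding.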
import numpy as np
from paddle.fluid import layers as F
from paddle.fluid.framework import Variable, in_dygraph_mode
from paddle.fluid import core, dygraph_utils
from paddle.fluid.layers import nn, utils
from paddle.fluid.data_feeder import check_variable_and_dtype
from paddle.fluid.param_attr import ParamAttr
from paddle.fluid.layer_helper import LayerHelper
from paddle.fluid.dygraph import layers
from paddle.fluid.initializer import Normal
def _is_list_or_tuple(input):
return isinstance(input, (list, tuple))
def _zero_padding_in_batch_and_channel(padding, channel_last):
if channel_last:
return list(padding[0]) == [0, 0] and list(padding[-1]) == [0, 0]
else:
return list(padding[0]) == [0, 0] and list(padding[1]) == [0, 0]
def _exclude_padding_in_batch_and_channel(padding, channel_last):
padding_ = padding[1:-1] if channel_last else padding[2:]
padding_ = [elem for pad_a_dim in padding_ for elem in pad_a_dim]
return padding_
def _update_padding_nd(padding, channel_last, num_dims):
if isinstance(padding, str):
padding = padding.upper()
if padding not in ["SAME", "VALID"]:
raise ValueError(
"Unknown padding: '{}'. It can only be 'SAME' or 'VALID'.".
format(padding))
if padding == "VALID":
padding_algorithm = "VALID"
padding = [0] * num_dims
else:
padding_algorithm = "SAME"
padding = [0] * num_dims
elif _is_list_or_tuple(padding):
# for padding like
# [(pad_before, pad_after), (pad_before, pad_after), ...]
# padding for batch_dim and channel_dim included
if len(padding) == 2 + num_dims and _is_list_or_tuple(padding[0]):
if not _zero_padding_in_batch_and_channel(padding, channel_last):
raise ValueError(
"Non-zero padding({}) in the batch or channel dimensions "
"is not supported.".format(padding))
padding_algorithm = "EXPLICIT"
padding = _exclude_padding_in_batch_and_channel(padding,
channel_last)
if utils._is_symmetric_padding(padding, num_dims):
padding = padding[0::2]
# for padding like [pad_before, pad_after, pad_before, pad_after, ...]
elif len(padding) == 2 * num_dims and isinstance(padding[0], int):
padding_algorithm = "EXPLICIT"
padding = utils.convert_to_list(padding, 2 * num_dims, 'padding')
if utils._is_symmetric_padding(padding, num_dims):
padding = padding[0::2]
# for padding like [pad_d1, pad_d2, ...]
elif len(padding) == num_dims and isinstance(padding[0], int):
padding_algorithm = "EXPLICIT"
padding = utils.convert_to_list(padding, num_dims, 'padding')
else:
raise ValueError("Invalid padding: {}".format(padding))
# for integer padding
else:
padding_algorithm = "EXPLICIT"
padding = utils.convert_to_list(padding, num_dims, 'padding')
return padding, padding_algorithm
def _get_default_param_initializer(num_channels, filter_size):
filter_elem_num = num_channels * np.prod(filter_size)
std = (2.0 / filter_elem_num)**0.5
return Normal(0.0, std, 0)
def conv1d(input,
weight,
bias=None,
padding=0,
stride=1,
dilation=1,
groups=1,
use_cudnn=True,
act=None,
data_format="NCT",
name=None):
# entry checks
if not isinstance(use_cudnn, bool):
raise ValueError("Attr(use_cudnn) should be True or False. "
"Received Attr(use_cudnn): {}.".format(use_cudnn))
if data_format not in ["NCT", "NTC"]:
raise ValueError("Attr(data_format) should be 'NCT' or 'NTC'. "
"Received Attr(data_format): {}.".format(data_format))
channel_last = (data_format == "NTC")
channel_dim = -1 if channel_last else 1
num_channels = input.shape[channel_dim]
num_filters = weight.shape[0]
if num_channels < 0:
raise ValueError("The channel dimension of the input({}) "
"should be defined. Received: {}.".format(
input.shape, num_channels))
if num_channels % groups != 0:
raise ValueError(
"the channel of input must be divisible by groups, "
"received: the channel of input is {}, the shape of input is {}"
", the groups is {}".format(num_channels, input.shape, groups))
if num_filters % groups != 0:
raise ValueError(
"the number of filters must be divisible by groups, "
"received: the number of filters is {}, the shape of weight is {}"
", the groups is {}".format(num_filters, weight.shape, groups))
# update attrs
padding, padding_algorithm = _update_padding_nd(padding, channel_last, 1)
if len(padding) == 1: # symmetric padding
padding = [0,] + padding
else:
# len(padding) == 2
padding = [0, 0] + padding
stride = [1,] + utils.convert_to_list(stride, 1, 'stride')
dilation = [1,] + utils.convert_to_list(dilation, 1, 'dilation')
data_format = "NHWC" if channel_last else "NCHW"
l_type = "conv2d"
if (num_channels == groups and num_filters % num_channels == 0 and
not use_cudnn):
l_type = 'depthwise_conv2d'
weight = F.unsqueeze(weight, [2])
input = F.unsqueeze(input, [1]) if channel_last else F.unsqueeze(input, [2])
if in_dygraph_mode():
attrs = ('strides', stride, 'paddings', padding, 'dilations', dilation,
'groups', groups, 'use_cudnn', use_cudnn, 'use_mkldnn', False,
'fuse_relu_before_depthwise_conv', False, "padding_algorithm",
padding_algorithm, "data_format", data_format)
pre_bias = getattr(core.ops, l_type)(input, weight, *attrs)
if bias is not None:
pre_act = nn.elementwise_add(pre_bias, bias, axis=channel_dim)
else:
pre_act = pre_bias
out = dygraph_utils._append_activation_in_dygraph(
pre_act, act, use_cudnn=use_cudnn)
else:
inputs = {'Input': [input], 'Filter': [weight]}
attrs = {
'strides': stride,
'paddings': padding,
'dilations': dilation,
'groups': groups,
'use_cudnn': use_cudnn,
'use_mkldnn': False,
'fuse_relu_before_depthwise_conv': False,
"padding_algorithm": padding_algorithm,
"data_format": data_format
}
check_variable_and_dtype(input, 'input',
['float16', 'float32', 'float64'], 'conv2d')
helper = LayerHelper(l_type, **locals())
dtype = helper.input_dtype()
pre_bias = helper.create_variable_for_type_inference(dtype)
outputs = {"Output": [pre_bias]}
helper.append_op(
type=l_type, inputs=inputs, outputs=outputs, attrs=attrs)
if bias is not None:
pre_act = nn.elementwise_add(pre_bias, bias, axis=channel_dim)
else:
pre_act = pre_bias
out = helper.append_activation(pre_act)
out = F.squeeze(out, [1]) if channel_last else F.squeeze(out, [2])
return out
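The `conv1d` function above reuses the 2-D convolution op by inserting a height axis of size 1 into both input and weight, then squeezing it out again. The sketch below checks that equivalence with naive NumPy loops; `naive_conv1d` and `naive_conv2d` are hypothetical helpers written for this illustration (stride 1, no padding, no groups), not part of Paddle or Parakeet.

```python
import numpy as np


def naive_conv1d(x, w):
    """x: (C_in, T); w: (C_out, C_in, K). Stride 1, no padding."""
    k = w.shape[2]
    t_out = x.shape[1] - k + 1
    out = np.zeros((w.shape[0], t_out))
    for t in range(t_out):
        out[:, t] = np.tensordot(w, x[:, t:t + k], axes=([1, 2], [0, 1]))
    return out


def naive_conv2d(x, w):
    """x: (C_in, H, W); w: (C_out, C_in, KH, KW). Stride 1, no padding."""
    kh, kw = w.shape[2], w.shape[3]
    h_out = x.shape[1] - kh + 1
    w_out = x.shape[2] - kw + 1
    out = np.zeros((w.shape[0], h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))      # (C_in, T)
w = rng.standard_normal((4, 3, 3))   # (C_out, C_in, K)

y_1d = naive_conv1d(x, w)
# insert a height axis of size 1, run the 2-D op, then drop the axis again
y_2d = naive_conv2d(x[:, None, :], w[:, :, None, :])[:, 0, :]
```

The two results agree elementwise, which is why a leading `1` for stride and dilation and a leading `0` for padding are enough to express the 1-D case.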
class Conv1D(layers.Layer):
def __init__(self,
num_channels,
num_filters,
filter_size,
padding=0,
stride=1,
dilation=1,
groups=1,
param_attr=None,
bias_attr=None,
use_cudnn=True,
act=None,
data_format="NCT",
dtype='float32'):
super(Conv1D, self).__init__()
assert param_attr is not False, "param_attr should not be False here."
self._num_channels = num_channels
self._num_filters = num_filters
self._groups = groups
if num_channels % groups != 0:
raise ValueError("num_channels must be divisible by groups.")
self._act = act
self._data_format = data_format
self._dtype = dtype
if not isinstance(use_cudnn, bool):
raise ValueError("use_cudnn should be True or False")
self._use_cudnn = use_cudnn
self._filter_size = utils.convert_to_list(filter_size, 1, 'filter_size')
self._stride = utils.convert_to_list(stride, 1, 'stride')
self._dilation = utils.convert_to_list(dilation, 1, 'dilation')
channel_last = (data_format == "NTC")
self._padding = padding # leave it to F.conv1d
self._param_attr = param_attr
self._bias_attr = bias_attr
num_filter_channels = num_channels // groups
filter_shape = [self._num_filters, num_filter_channels
] + self._filter_size
self.weight = self.create_parameter(
attr=self._param_attr,
shape=filter_shape,
dtype=self._dtype,
default_initializer=_get_default_param_initializer(
self._num_channels, self._filter_size))
self.bias = self.create_parameter(
attr=self._bias_attr,
shape=[self._num_filters],
dtype=self._dtype,
is_bias=True)
def forward(self, input):
out = conv1d(
input,
self.weight,
bias=self.bias,
padding=self._padding,
stride=self._stride,
dilation=self._dilation,
groups=self._groups,
use_cudnn=self._use_cudnn,
act=self._act,
data_format=self._data_format)
return out
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
from paddle import fluid
import paddle.fluid.dygraph as dg
import paddle.fluid.layers as F
import paddle.fluid.initializer as I
from parakeet.modules.weight_norm import Conv1D, Conv1DCell, Conv2D, Linear
class Conv1DGLU(dg.Layer):
"""
A Convolution 1D block with GLU activation. It also applies dropout to the input x. It integrates speaker embeddings through a Linear layer activated by softsign. It has a residual connection from the input x and scales the output by np.sqrt(0.5).
"""
def __init__(self,
n_speakers,
speaker_dim,
in_channels,
num_filters,
filter_size=1,
dilation=1,
std_mul=4.0,
dropout=0.0,
causal=False,
residual=True):
"""Conv1DGLU block: a gated 1D convolution with optional speaker conditioning and a residual connection.
Args:
n_speakers (int): number of speakers.
speaker_dim (int): speaker embedding's size.
in_channels (int): channels of the input.
num_filters (int): channels of the output.
filter_size (int, optional): filter size of the internal Conv1DCell. Defaults to 1.
dilation (int, optional): dilation of the internal Conv1DCell. Defaults to 1.
std_mul (float, optional): multiplier applied to the variance of the convolution's weight initialization, i.e. std = sqrt(std_mul * (1 - dropout) / (filter_size * in_channels)). Defaults to 4.0.
dropout (float, optional): dropout probability. Defaults to 0.0.
causal (bool, optional): padding of the Conv1DCell. It should be True if the `add_input` method of `Conv1DCell` is ever used. Defaults to False.
residual (bool, optional): whether to use a residual connection. If True, in_channels should equal num_filters. Defaults to True.
"""
super(Conv1DGLU, self).__init__()
# conv spec
self.in_channels = in_channels
self.n_speakers = n_speakers
self.speaker_dim = speaker_dim
self.num_filters = num_filters
self.filter_size = filter_size
self.dilation = dilation
# padding
self.causal = causal
# weight init and dropout
self.std_mul = std_mul
self.dropout = dropout
self.residual = residual
if residual:
assert (
in_channels == num_filters
), "this block uses residual connection, "\
"so in_channels should equal num_filters"
std = np.sqrt(std_mul * (1 - dropout) / (filter_size * in_channels))
self.conv = Conv1DCell(
in_channels,
2 * num_filters,
filter_size,
dilation,
causal,
param_attr=I.Normal(scale=std))
if n_speakers > 1:
assert (speaker_dim is not None
), "speaker embed should not be null in multi-speaker case"
std = np.sqrt(1 / speaker_dim)
self.fc = Linear(
speaker_dim, num_filters, param_attr=I.Normal(scale=std))
def forward(self, x, speaker_embed=None):
"""
Args:
x (Variable): shape(B, C_in, T), dtype float32, the input of the Conv1DGLU layer, where B means batch size, C_in means the input channels and T means the input time steps.
speaker_embed (Variable): shape(B, C_sp), dtype float32, speaker embed, where C_sp means speaker embedding size.
Returns:
x (Variable): shape(B, C_out, T), the output of Conv1DGLU, where
C_out means the `num_filters`.
"""
residual = x
x = F.dropout(
x, self.dropout, dropout_implementation="upscale_in_train")
x = self.conv(x)
content, gate = F.split(x, num_or_sections=2, dim=1)
if speaker_embed is not None:
sp = F.softsign(self.fc(speaker_embed))
content = F.elementwise_add(content, sp, axis=0)
# glu
x = F.sigmoid(gate) * content
if self.residual:
x = F.scale(x + residual, np.sqrt(0.5))
return x
def start_sequence(self):
"""Prepare the Conv1DGLU to generate a new sequence. This method should be called before the first of a series of `add_input` calls.
"""
self.conv.start_sequence()
def add_input(self, x_t, speaker_embed=None):
"""
Take one step of input and return one step of output. It works similarly to the `forward` method, but in a step-in-step-out fashion.
Args:
x_t (Variable): shape(B, C_in, T=1), dtype float32, the input of Conv1DGLU layer, where B means batch_size, C_in means the input channels.
speaker_embed (Variable): Shape(B, C_sp), dtype float32, speaker embed, where C_sp means speaker embedding size.
Returns:
x (Variable): shape(B, C_out), the output of Conv1DGLU, where C_out means the `num_filters`.
"""
residual = x_t
x_t = F.dropout(
x_t, self.dropout, dropout_implementation="upscale_in_train")
x_t = self.conv.add_input(x_t)
content_t, gate_t = F.split(x_t, num_or_sections=2, dim=1)
if speaker_embed is not None:
sp = F.softsign(self.fc(speaker_embed))
content_t = F.elementwise_add(content_t, sp, axis=0)
# glu
x_t = F.sigmoid(gate_t) * content_t
if self.residual:
x_t = F.scale(x_t + residual, np.sqrt(0.5))
return x_t
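The gating and residual arithmetic shared by `forward` and `add_input` can be sketched in NumPy. This is an illustration under simplifying assumptions, not the layer: the convolution output is supplied directly, dropout and speaker conditioning are left out, and `glu_with_residual` is a hypothetical name.

```python
import numpy as np


def glu_with_residual(conv_out, residual):
    """conv_out: (B, 2 * C, T); residual: (B, C, T)."""
    # first half of the channels is the content, second half is the gate
    content, gate = np.split(conv_out, 2, axis=1)
    gated = content / (1.0 + np.exp(-gate))  # sigmoid(gate) * content
    # scaling by sqrt(0.5) keeps the variance of the sum close to that of
    # each summand when the two are roughly independent
    return (gated + residual) * np.sqrt(0.5)


conv_out = np.concatenate(
    [np.full((1, 2, 3), 2.0),   # content half
     np.zeros((1, 2, 3))],      # gate half, sigmoid(0) = 0.5
    axis=1)
residual = np.ones((1, 2, 3))
y = glu_with_residual(conv_out, residual)
```

Because the convolution emits `2 * num_filters` channels and the split halves them, the residual addition only type-checks when `in_channels == num_filters`, which is what the assertion in `__init__` enforces.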
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
from itertools import chain
import paddle.fluid.layers as F
import paddle.fluid.initializer as I
import paddle.fluid.dygraph as dg
from parakeet.modules.weight_norm import Conv1D, Conv1DTranspose, Conv2D, Conv2DTranspose, Linear
from parakeet.models.deepvoice3.conv1dglu import Conv1DGLU
from parakeet.models.deepvoice3.encoder import ConvSpec
def upsampling_4x_blocks(n_speakers, speaker_dim, target_channels, dropout):
"""Return a list of Layers that upsamples the input by 4 times in time dimension.
Args:
n_speakers (int): number of speakers of the Conv1DGLU layers used.
speaker_dim (int): speaker embedding size of the Conv1DGLU layers used.
target_channels (int): channels of the input and the output.(the list of layers does not change the number of channels.)
dropout (float): dropout probability.
Returns:
List[Layer]: upsampling layers.
"""
# upsampling convolutions
upsampling_convolutions = [
Conv1DTranspose(
target_channels,
target_channels,
2,
stride=2,
param_attr=I.Normal(scale=np.sqrt(1 / (2 * target_channels)))),
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=1,
std_mul=1.,
dropout=dropout),
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout),
Conv1DTranspose(
target_channels,
target_channels,
2,
stride=2,
param_attr=I.Normal(scale=np.sqrt(4. / (2 * target_channels)))),
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=1,
std_mul=1.,
dropout=dropout),
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout),
]
return upsampling_convolutions
def upsampling_2x_blocks(n_speakers, speaker_dim, target_channels, dropout):
"""Return a list of Layers that upsamples the input by 2 times in time dimension.
Args:
n_speakers (int): number of speakers of the Conv1DGLU layers used.
speaker_dim (int): speaker embedding size of the Conv1DGLU layers used.
target_channels (int): channels of the input and the output.(the list of layers does not change the number of channels.)
dropout (float): dropout probability.
Returns:
List[Layer]: upsampling layers.
"""
upsampling_convolutions = [
Conv1DTranspose(
target_channels,
target_channels,
2,
stride=2,
param_attr=I.Normal(scale=np.sqrt(1. / (2 * target_channels)))),
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=1,
std_mul=1.,
dropout=dropout), Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout)
]
return upsampling_convolutions
def upsampling_1x_blocks(n_speakers, speaker_dim, target_channels, dropout):
"""Return a list of Layers that keeps the time dimension of the input unchanged (upsampling factor 1).
Args:
n_speakers (int): number of speakers of the Conv1DGLU layers used.
speaker_dim (int): speaker embedding size of the Conv1DGLU layers used.
target_channels (int): channels of the input and the output.(the list of layers does not change the number of channels.)
dropout (float): dropout probability.
Returns:
List[Layer]: upsampling layers.
"""
upsampling_convolutions = [
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout)
]
return upsampling_convolutions
class Converter(dg.Layer):
def __init__(self,
n_speakers,
speaker_dim,
in_channels,
linear_dim,
convolutions=(ConvSpec(256, 5, 1), ) * 4,
time_upsampling=1,
dropout=0.0):
"""Converter that transforms a mel spectrogram (or decoder hidden states) into a linear spectrogram.
Args:
n_speakers (int): number of speakers.
speaker_dim (int): speaker embedding size.
in_channels (int): channels of the input.
linear_dim (int): channels of the linear spectrogram.
convolutions (Iterable[ConvSpec], optional): specifications of the internal convolutional layers. ConvSpec is a namedtuple of (out_channels, filter_size, dilation). Defaults to (ConvSpec(256, 5, 1), )*4.
time_upsampling (int, optional): time upsampling factor of the converter, possible options are {1, 2, 4}. Note that this should equal the downsampling factor of the mel spectrogram. Defaults to 1.
dropout (float, optional): dropout probability. Defaults to 0.0.
"""
super(Converter, self).__init__()
self.n_speakers = n_speakers
self.speaker_dim = speaker_dim
self.in_channels = in_channels
self.linear_dim = linear_dim
# CAUTION: this should equal the downsampling factor of the mel spectrogram
self.time_upsampling = time_upsampling
self.dropout = dropout
target_channels = convolutions[0].out_channels
# conv proj to target channels
self.first_conv_proj = Conv1D(
in_channels,
target_channels,
1,
param_attr=I.Normal(scale=np.sqrt(1 / in_channels)))
# Idea from nyanko
if time_upsampling == 4:
self.upsampling_convolutions = dg.LayerList(
upsampling_4x_blocks(n_speakers, speaker_dim, target_channels,
dropout))
elif time_upsampling == 2:
self.upsampling_convolutions = dg.LayerList(
upsampling_2x_blocks(n_speakers, speaker_dim, target_channels,
dropout))
elif time_upsampling == 1:
self.upsampling_convolutions = dg.LayerList(
upsampling_1x_blocks(n_speakers, speaker_dim, target_channels,
dropout))
else:
raise ValueError(
"Upsampling factors other than {1, 2, 4} are not supported.")
# post conv layers
std_mul = 4.0
in_channels = target_channels
self.convolutions = dg.LayerList()
for (out_channels, filter_size, dilation) in convolutions:
if in_channels != out_channels:
std = np.sqrt(std_mul / in_channels)
# CAUTION: relu
self.convolutions.append(
Conv1D(
in_channels,
out_channels,
1,
act="relu",
param_attr=I.Normal(scale=std)))
in_channels = out_channels
std_mul = 2.0
self.convolutions.append(
Conv1DGLU(
n_speakers,
speaker_dim,
in_channels,
out_channels,
filter_size,
dilation=dilation,
std_mul=std_mul,
dropout=dropout))
in_channels = out_channels
std_mul = 4.0
# final conv proj, channel transformed to linear dim
std = np.sqrt(std_mul * (1 - dropout) / in_channels)
# CAUTION: sigmoid
self.last_conv_proj = Conv1D(
in_channels,
linear_dim,
1,
act="sigmoid",
param_attr=I.Normal(scale=std))
def forward(self, x, speaker_embed=None):
"""
Convert mel spectrogram or decoder hidden states to linear spectrogram.
Args:
x (Variable): shape(B, T_mel, C_in), dtype float32, converter inputs, where C_in means the input channels of the converter. When the mel spectrogram is used as the input of the converter, C_in = C_mel (channels of the mel spectrogram); when decoder states are used, C_in = C_dec // r.
speaker_embed (Variable, optional): shape(B, C_sp), dtype float32, speaker embedding, where C_sp means the speaker embedding size.
Returns:
out (Variable): shape(B, T_lin, C_lin), the output linear spectrogram, where C_lin means the channels of the linear spectrogram and T_lin means the length (time steps) of the linear spectrogram. T_lin = time_upsampling * T_mel, which depends on the time_upsampling of the converter.
"""
x = F.transpose(x, [0, 2, 1])
x = self.first_conv_proj(x)
if speaker_embed is not None:
speaker_embed = F.dropout(
speaker_embed,
self.dropout,
dropout_implementation="upscale_in_train")
for layer in chain(self.upsampling_convolutions, self.convolutions):
if isinstance(layer, Conv1DGLU):
x = layer(x, speaker_embed)
else:
x = layer(x)
out = self.last_conv_proj(x)
out = F.transpose(out, [0, 2, 1])
return out
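The relation T_lin = time_upsampling * T_mel follows from the transposed convolutions with filter size 2 and stride 2 in the upsampling blocks. The sketch below works through the length arithmetic only. Both function names are hypothetical, and it assumes that the `Conv1DGLU` layers in between preserve the sequence length, which depends on padding logic not shown in this part of the diff.

```python
def transposed_conv1d_length(t_in, filter_size, stride, padding=0):
    """Output length of a 1-D transposed convolution (dilation 1, no output padding)."""
    return (t_in - 1) * stride - 2 * padding + filter_size


def upsampled_length(t_mel, time_upsampling):
    """Length after the converter's upsampling stack for factors 1, 2 or 4."""
    # number of stride-2 transposed convolutions in the 1x, 2x and 4x stacks
    n_transposed = {1: 0, 2: 1, 4: 2}[time_upsampling]
    t = t_mel
    for _ in range(n_transposed):
        t = transposed_conv1d_length(t, filter_size=2, stride=2)
    return t


lengths = [upsampled_length(50, f) for f in (1, 2, 4)]
```

Each stride-2 transposed convolution maps a length `T` to `(T - 1) * 2 + 2 = 2 * T`, so one of them doubles the time axis and two of them quadruple it.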
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
from collections import namedtuple
import paddle.fluid.layers as F
import paddle.fluid.initializer as I
import paddle.fluid.dygraph as dg
from parakeet.modules.weight_norm import Conv1D, Linear
from parakeet.models.deepvoice3.conv1dglu import Conv1DGLU
ConvSpec = namedtuple("ConvSpec", ["out_channels", "filter_size", "dilation"])
class Encoder(dg.Layer):
def __init__(self,
n_vocab,
embed_dim,
n_speakers,
speaker_dim,
padding_idx=None,
embedding_weight_std=0.1,
convolutions=(ConvSpec(64, 5, 1), ) * 7,
dropout=0.):
"""Encoder of Deep Voice 3.
Args:
n_vocab (int): vocabulary size of the text embedding.
embed_dim (int): embedding size of the text embedding.
n_speakers (int): number of speakers.
speaker_dim (int): speaker embedding size.
padding_idx (int, optional): padding index of text embedding. Defaults to None.
embedding_weight_std (float, optional): standard deviation of the embedding weights when initialized. Defaults to 0.1.
convolutions (Iterable[ConvSpec], optional): specifications of the convolutional layers. ConvSpec is a namedtuple of (out_channels, filter_size, dilation). Defaults to (ConvSpec(64, 5, 1), )*7.
dropout (float, optional): dropout probability. Defaults to 0.
"""
super(Encoder, self).__init__()
self.embedding_weight_std = embedding_weight_std
self.embed = dg.Embedding(
(n_vocab, embed_dim),
padding_idx=padding_idx,
param_attr=I.Normal(scale=embedding_weight_std))
self.dropout = dropout
if n_speakers > 1:
std = np.sqrt((1 - dropout) / speaker_dim)
self.sp_proj1 = Linear(
speaker_dim,
embed_dim,
act="softsign",
param_attr=I.Normal(scale=std))
self.sp_proj2 = Linear(
speaker_dim,
embed_dim,
act="softsign",
param_attr=I.Normal(scale=std))
self.n_speakers = n_speakers
self.convolutions = dg.LayerList()
in_channels = embed_dim
std_mul = 1.0
for (out_channels, filter_size, dilation) in convolutions:
# 1 * 1 convolution & relu
if in_channels != out_channels:
std = np.sqrt(std_mul / in_channels)
self.convolutions.append(
Conv1D(
in_channels,
out_channels,
1,
act="relu",
param_attr=I.Normal(scale=std)))
in_channels = out_channels
std_mul = 2.0
self.convolutions.append(
Conv1DGLU(
n_speakers,
speaker_dim,
in_channels,
out_channels,
filter_size,
dilation,
std_mul,
dropout,
causal=False,
residual=True))
in_channels = out_channels
std_mul = 4.0
std = np.sqrt(std_mul * (1 - dropout) / in_channels)
self.convolutions.append(
Conv1D(
in_channels, embed_dim, 1, param_attr=I.Normal(scale=std)))
def forward(self, x, speaker_embed=None):
"""
Encode text sequence.
Args:
x (Variable): shape(B, T_enc), dtype: int64. Ihe input text indices. T_enc means the timesteps of decoder input x.
speaker_embed (Variable, optional): shape(B, C_sp), dtype float32, speaker embeddings. This arg is not None only when the model is a multispeaker model.
Returns:
keys (Variable), Shape(B, T_enc, C_emb), dtype float32, the encoded representation for keys, where C_emb means the text embedding size.
values (Variable), Shape(B, T_enc, C_emb), dtype float32, the encoded representation for values.
"""
x = self.embed(x)
x = F.dropout(
x, self.dropout, dropout_implementation="upscale_in_train")
x = F.transpose(x, [0, 2, 1])
if self.n_speakers > 1 and speaker_embed is not None:
speaker_embed = F.dropout(
speaker_embed,
self.dropout,
dropout_implementation="upscale_in_train")
x = F.elementwise_add(x, self.sp_proj1(speaker_embed), axis=0)
input_embed = x
for layer in self.convolutions:
if isinstance(layer, Conv1DGLU):
x = layer(x, speaker_embed)
else:
# layer is a Conv1D with (1,) filter wrapped by WeightNormWrapper
x = layer(x)
if self.n_speakers > 1 and speaker_embed is not None:
x = F.elementwise_add(x, self.sp_proj2(speaker_embed), axis=0)
keys = x # (B, C, T)
values = F.scale(input_embed + x, scale=np.sqrt(0.5))
keys = F.transpose(keys, [0, 2, 1])
values = F.transpose(values, [0, 2, 1])
return keys, values
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
from numba import jit
from paddle import fluid
import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg
def masked_mean(inputs, mask):
"""
Args:
inputs (Variable): shape(B, T, C), dtype float32, the input.
mask (Variable): shape(B, T), dtype float32, a mask.
Returns:
loss (Variable): shape(1, ), dtype float32, masked mean.
"""
channels = inputs.shape[-1]
masked_inputs = F.elementwise_mul(inputs, mask, axis=0)
loss = F.reduce_sum(masked_inputs) / (channels * F.reduce_sum(mask))
return loss
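The masked mean above can be checked with a small NumPy sketch. This is only an illustration of the arithmetic, assuming NumPy broadcasting as a stand-in for `F.elementwise_mul(..., axis=0)`; the name `masked_mean_np` and the toy data are hypothetical and not part of the repository.

```python
import numpy as np


def masked_mean_np(inputs, mask):
    """NumPy stand-in for masked_mean: average over valid frames only."""
    channels = inputs.shape[-1]
    # Broadcast the (B, T) mask over the channel axis of the (B, T, C) input.
    masked = inputs * mask[:, :, None]
    return masked.sum() / (channels * mask.sum())


# Two utterances of three frames; the last frame of the second one is padding.
inputs = np.ones((2, 3, 4), dtype=np.float32)
inputs[1, 2, :] = 100.0  # garbage values in the padded frame
mask = np.array([[1, 1, 1], [1, 1, 0]], dtype=np.float32)

result = masked_mean_np(inputs, mask)  # ignores the padded frame
plain = inputs.mean()                  # is pulled up by the padded frame
print(result, plain)
```

The padded frame contributes nothing to the masked mean, while the plain mean is dominated by it. The `masked_weight` in `TTSLoss` blends these two quantities.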
@jit(nopython=True)
def guided_attention(N, max_N, T, max_T, g):
"""Generate a diagonal attention guide.
Args:
N (int): valid length of encoder.
max_N (int): max length of encoder.
T (int): valid length of decoder.
max_T (int): max length of decoder.
g (float): sigma to adjust the degree of diagonal guide.
Returns:
np.ndarray: shape(max_N, max_T), dtype float32, the diagonal guide.
"""
W = np.zeros((max_N, max_T), dtype=np.float32)
for n in range(N):
for t in range(T):
W[n, t] = 1 - np.exp(-(n / N - t / T)**2 / (2 * g * g))
return W
def guided_attentions(encoder_lengths, decoder_lengths, max_decoder_len,
g=0.2):
"""Generate a diagonal attention guide for a batch.
Args:
encoder_lengths (np.ndarray): shape(B, ), dtype: int64, encoder valid lengths.
decoder_lengths (np.ndarray): shape(B, ), dtype: int64, decoder valid lengths.
max_decoder_len (int): max length of decoder.
g (float, optional): sigma to adjust the degree of diagonal guide. Defaults to 0.2.
Returns:
np.ndarray: shape(B, max_T, max_N), dtype float32, the diagonal guide. (max_N: max encoder length, max_T: max decoder length.)
"""
B = len(encoder_lengths)
max_input_len = encoder_lengths.max()
W = np.zeros((B, max_decoder_len, max_input_len), dtype=np.float32)
for b in range(B):
W[b] = guided_attention(encoder_lengths[b], max_input_len,
decoder_lengths[b], max_decoder_len, g).T
return W
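As a rough illustration of what the guide looks like, the following sketch recomputes a single guide with vectorized NumPy instead of the numba loop. The function name `guided_attention_np` and the example lengths are assumptions made for this example; the formula is the one used in `guided_attention`.

```python
import numpy as np


def guided_attention_np(N, max_N, T, max_T, g):
    """Vectorized sketch of the diagonal guide; padded positions stay 0."""
    W = np.zeros((max_N, max_T), dtype=np.float32)
    n = np.arange(N, dtype=np.float64)[:, None] / N  # (N, 1) relative encoder position
    t = np.arange(T, dtype=np.float64)[None, :] / T  # (1, T) relative decoder position
    # Penalty is 0 on the diagonal (n/N == t/T) and approaches 1 far from it.
    W[:N, :T] = 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g * g))
    return W


guide = guided_attention_np(N=3, max_N=4, T=6, max_T=8, g=0.2)
print(guide.shape)
print(guide[0, 0], guide[2, 0])
```

Because the guide is multiplied with the predicted attention and averaged, attention mass far from the diagonal is penalized, while padded rows and columns contribute nothing.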
class TTSLoss(object):
def __init__(self,
masked_weight=0.0,
priority_bin=None,
priority_weight=0.0,
binary_divergence_weight=0.0,
guided_attention_sigma=0.2,
downsample_factor=4,
r=1):
"""Compute loss for Deep Voice 3 model.
Args:
masked_weight (float, optional): the weight of masked loss. Defaults to 0.0.
priority_bin (int, optional): frequency bands for linear spectrogram loss to be prioritized. Defaults to None.
priority_weight (float, optional): weight for the prioritized frequency bands. Defaults to 0.0.
binary_divergence_weight (float, optional): weight for binary cross entropy (used for spectrogram loss). Defaults to 0.0.
guided_attention_sigma (float, optional): `sigma` for attention guide. Defaults to 0.2.
downsample_factor (int, optional): the downsample factor for mel spectrogram. Defaults to 4.
r (int, optional): frames per decoder step. Defaults to 1.
"""
self.masked_weight = masked_weight
self.priority_bin = priority_bin # only used for lin-spec loss
self.priority_weight = priority_weight # only used for lin-spec loss
self.binary_divergence_weight = binary_divergence_weight
self.guided_attention_sigma = guided_attention_sigma
self.time_shift = r
self.r = r
self.downsample_factor = downsample_factor
def l1_loss(self, prediction, target, mask, priority_bin=None):
"""L1 loss for spectrogram.
Args:
prediction (Variable): shape(B, T, C), dtype float32, predicted spectrogram.
target (Variable): shape(B, T, C), dtype float32, target spectrogram.
mask (Variable): shape(B, T), mask.
priority_bin (int, optional): frequency bands for linear spectrogram loss to be prioritized. Defaults to None.
Returns:
Variable: shape(1,), dtype float32, l1 loss(with mask and possibly priority bin applied.)
"""
abs_diff = F.abs(prediction - target)
# basic mask-weighted l1 loss
w = self.masked_weight
if w > 0 and mask is not None:
base_l1_loss = w * masked_mean(abs_diff, mask) \
+ (1 - w) * F.reduce_mean(abs_diff)
else:
base_l1_loss = F.reduce_mean(abs_diff)
if self.priority_weight > 0 and priority_bin is not None:
# mask-weighted priority channels' l1-loss
priority_abs_diff = abs_diff[:, :, :priority_bin]
if w > 0 and mask is not None:
priority_loss = w * masked_mean(priority_abs_diff, mask) \
+ (1 - w) * F.reduce_mean(priority_abs_diff)
else:
priority_loss = F.reduce_mean(priority_abs_diff)
# priority weighted sum
p = self.priority_weight
loss = p * priority_loss + (1 - p) * base_l1_loss
else:
loss = base_l1_loss
return loss
def binary_divergence(self, prediction, target, mask):
"""Binary cross entropy loss for spectrogram. All the values in the spectrogram are treated as logits in a logistic regression.
Args:
prediction (Variable): shape(B, T, C), dtype float32, predicted spectrogram.
target (Variable): shape(B, T, C), dtype float32, target spectrogram.
mask (Variable): shape(B, T), mask.
Returns:
Variable: shape(1,), dtype float32, binary cross entropy loss.
"""
flattened_prediction = F.reshape(prediction, [-1, 1])
flattened_target = F.reshape(target, [-1, 1])
flattened_loss = F.log_loss(
flattened_prediction, flattened_target, epsilon=1e-8)
bin_div = fluid.layers.reshape(flattened_loss, prediction.shape)
w = self.masked_weight
if w > 0 and mask is not None:
loss = w * masked_mean(bin_div, mask) \
+ (1 - w) * F.reduce_mean(bin_div)
else:
loss = F.reduce_mean(bin_div)
return loss
@staticmethod
def done_loss(done_hat, done):
"""Compute done loss
Args:
done_hat (Variable): shape(B, T), dtype float32, predicted done probability (the probability that the final frame has been generated).
done (Variable): shape(B, T), dtype float32, ground truth done probability (the probability that the final frame has been generated).
Returns:
Variable: shape(1, ), dtype float32, done loss.
"""
flat_done_hat = F.reshape(done_hat, [-1, 1])
flat_done = F.reshape(done, [-1, 1])
loss = F.log_loss(flat_done_hat, flat_done, epsilon=1e-8)
loss = F.reduce_mean(loss)
return loss
def attention_loss(self, predicted_attention, input_lengths,
target_lengths):
"""
Given valid encoder_lengths and decoder_lengths, compute a diagonal guide, and compute loss from the predicted attention and the guide.
Args:
predicted_attention (Variable): shape(*, B, T_dec, T_enc), dtype float32, the alignment tensor, where B means batch size, T_dec means number of time steps of the decoder, T_enc means the number of time steps of the encoder, * means other possible dimensions.
input_lengths (numpy.ndarray): shape(B,), dtype:int64, valid lengths (time steps) of encoder outputs.
target_lengths (numpy.ndarray): shape(batch_size,), dtype:int64, valid lengths (time steps) of decoder outputs.
Returns:
loss (Variable): shape(1, ), dtype float32, attention loss.
"""
n_attention, batch_size, max_target_len, max_input_len = (
predicted_attention.shape)
soft_mask = guided_attentions(input_lengths, target_lengths,
max_target_len,
self.guided_attention_sigma)
soft_mask_ = dg.to_variable(soft_mask)
loss = fluid.layers.reduce_mean(predicted_attention * soft_mask_)
return loss
def __call__(self, outputs, inputs):
"""Total loss
Args:
outputs is a tuple of (mel_hyp, lin_hyp, attn_hyp, done_hyp).
mel_hyp (Variable): shape(B, T, C_mel), dtype float32, predicted mel spectrogram.
lin_hyp (Variable): shape(B, T, C_lin), dtype float32, predicted linear spectrogram.
done_hyp (Variable): shape(B, T), dtype float32, predicted done probability.
attn_hyp (Variable): shape(N, B, T_dec, T_enc), dtype float32, predicted attention.
inputs is a tuple of (mel_ref, lin_ref, done_ref, input_lengths, n_frames)
mel_ref (Variable): shape(B, T, C_mel), dtype float32, ground truth mel spectrogram.
lin_ref (Variable): shape(B, T, C_lin), dtype float32, ground truth linear spectrogram.
done_ref (Variable): shape(B, T), dtype float32, ground truth done flag.
input_lengths (Variable): shape(B, ), dtype: int, encoder valid lengths.
n_frames (Variable): shape(B, ), dtype: int, decoder valid lengths.
Returns:
Dict(str, Variable): details of loss.
"""
total_loss = 0.
mel_hyp, lin_hyp, attn_hyp, done_hyp = outputs
mel_ref, lin_ref, done_ref, input_lengths, n_frames = inputs
# n_frames # mel_lengths # decoder_lengths
max_frames = lin_hyp.shape[1]
max_mel_steps = max_frames // self.downsample_factor
# max_decoder_steps = max_mel_steps // self.r
# decoder_mask = F.sequence_mask(n_frames // self.downsample_factor //
# self.r,
# max_decoder_steps,
# dtype="float32")
mel_mask = F.sequence_mask(
n_frames // self.downsample_factor, max_mel_steps, dtype="float32")
lin_mask = F.sequence_mask(n_frames, max_frames, dtype="float32")
lin_hyp = lin_hyp[:, :-self.time_shift, :]
lin_ref = lin_ref[:, self.time_shift:, :]
lin_mask = lin_mask[:, self.time_shift:]
lin_l1_loss = self.l1_loss(
lin_hyp, lin_ref, lin_mask, priority_bin=self.priority_bin)
lin_bce_loss = self.binary_divergence(lin_hyp, lin_ref, lin_mask)
lin_loss = self.binary_divergence_weight * lin_bce_loss \
+ (1 - self.binary_divergence_weight) * lin_l1_loss
total_loss += lin_loss
mel_hyp = mel_hyp[:, :-self.time_shift, :]
mel_ref = mel_ref[:, self.time_shift:, :]
mel_mask = mel_mask[:, self.time_shift:]
mel_l1_loss = self.l1_loss(mel_hyp, mel_ref, mel_mask)
mel_bce_loss = self.binary_divergence(mel_hyp, mel_ref, mel_mask)
# print("=====>", mel_l1_loss.numpy()[0], mel_bce_loss.numpy()[0])
mel_loss = self.binary_divergence_weight * mel_bce_loss \
+ (1 - self.binary_divergence_weight) * mel_l1_loss
total_loss += mel_loss
attn_loss = self.attention_loss(attn_hyp,
input_lengths.numpy(),
n_frames.numpy() //
(self.downsample_factor * self.r))
total_loss += attn_loss
done_loss = self.done_loss(done_hyp, done_ref)
total_loss += done_loss
losses = {
"loss": total_loss,
"mel/mel_loss": mel_loss,
"mel/l1_loss": mel_l1_loss,
"mel/bce_loss": mel_bce_loss,
"lin/lin_loss": lin_loss,
"lin/l1_loss": lin_l1_loss,
"lin/bce_loss": lin_bce_loss,
"done": done_loss,
"attn": attn_loss,
}
return losses
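The `time_shift` slicing in `__call__` compares the prediction at step `t` with the target at step `t + r`. A minimal NumPy sketch of that alignment follows, using made-up one-channel data; the variable names are hypothetical and the "perfect" predictor is only there to show which frames are paired.

```python
import numpy as np

r = 1  # frames per decoder step, used as the time shift

# Toy target spectrogram with shape (B=1, T=5, C=1) and frame values 0..4.
ref = np.arange(5, dtype=np.float32).reshape(1, 5, 1)
# A hypothetical predictor that outputs, at step t, the target of step t + r.
hyp = ref + float(r)
mask = np.ones((1, 5), dtype=np.float32)

# Same slicing as in TTSLoss.__call__: drop the last r predictions
# and the first r targets (and mask entries) so the pairs line up.
hyp_shifted = hyp[:, :-r, :]
ref_shifted = ref[:, r:, :]
mask_shifted = mask[:, r:]

l1 = np.abs(hyp_shifted - ref_shifted).mean()
print(hyp_shifted.shape, ref_shifted.shape, mask_shifted.shape, l1)
```

After the shift all three tensors have `T - r` time steps, so the mask can be applied to the shifted loss directly.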
This diff is collapsed.
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
from paddle import fluid
import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg
def lookup(weight, indices, padding_idx):
out = fluid.core.ops.lookup_table_v2(
weight, indices, 'is_sparse', False, 'is_distributed', False,
'remote_prefetch', False, 'padding_idx', padding_idx)
return out
def compute_position_embedding_single_speaker(radians, speaker_position_rate):
"""Compute sin/cos interleaved matrix from the radians.
Args:
radians (Variable): shape(n_vocab, embed_dim), dtype float32, the radians matrix.
speaker_position_rate (float or Variable): float or Variable of shape(1, ), speaker positioning rate.
Returns:
Variable: shape(n_vocab, embed_dim), the sin, cos interleaved matrix.
"""
_, embed_dim = radians.shape
scaled_radians = radians * speaker_position_rate
odd_mask = (np.arange(embed_dim) % 2).astype(np.float32)
odd_mask = dg.to_variable(odd_mask)
out = odd_mask * F.cos(scaled_radians) \
+ (1 - odd_mask) * F.sin(scaled_radians)
return out
def compute_position_embedding(radians, speaker_position_rate):
"""Compute sin/cos interleaved matrix from the radians.
Args:
radians (Variable): shape(n_vocab, embed_dim), dtype float32, the radians matrix.
speaker_position_rate (Variable): shape(B, ), speaker positioning rate.
Returns:
Variable: shape(B, n_vocab, embed_dim), the sin, cos interleaved matrix.
"""
_, embed_dim = radians.shape
batch_size = speaker_position_rate.shape[0]
scaled_radians = F.elementwise_mul(
F.expand(F.unsqueeze(radians, [0]), [batch_size, 1, 1]),
speaker_position_rate,
axis=0)
odd_mask = (np.arange(embed_dim) % 2).astype(np.float32)
odd_mask = dg.to_variable(odd_mask)
out = odd_mask * F.cos(scaled_radians) \
+ (1 - odd_mask) * F.sin(scaled_radians)
out = F.concat(
[F.zeros((batch_size, 1, embed_dim), radians.dtype), out[:, 1:, :]],
axis=1)
return out
def position_encoding_init(n_position,
d_pos_vec,
position_rate=1.0,
padding_idx=None):
"""Init the position encoding.
Args:
n_position (int): max position, vocab size for position embedding.
d_pos_vec (int): position embedding size.
position_rate (float, optional): position rate (this should only be used when all the utterances are from one speaker.). Defaults to 1.0.
padding_idx (int, optional): padding index for the position embedding(it is set as 0 internally if not provided.). Defaults to None.
Returns:
np.ndarray: shape(n_position, d_pos_vec), the radians matrix (sin and cos are not applied yet).
"""
# init the position encoding table
# keep idx 0 for padding token position encoding zero vector
# CAUTION: it is radians here, sin and cos are not applied
indices_range = np.expand_dims(np.arange(n_position), -1)
embed_range = 2 * (np.arange(d_pos_vec) // 2)
radians = position_rate \
* indices_range \
/ np.power(1.e4, embed_range / d_pos_vec)
if padding_idx is not None:
radians[padding_idx] = 0.
return radians
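The relation between the radians table and the final sin/cos embedding can be sketched in plain NumPy. This is an illustrative sketch under the assumption that NumPy ops mirror the Paddle ones; `position_radians` and `interleave_sin_cos` are hypothetical names, not functions from this file.

```python
import numpy as np


def position_radians(n_position, d_pos_vec, position_rate=1.0):
    """Radians table: row = position, column = embedding channel."""
    positions = np.expand_dims(np.arange(n_position), -1)  # (n_position, 1)
    embed_range = 2 * (np.arange(d_pos_vec) // 2)          # 0, 0, 2, 2, 4, 4, ...
    return position_rate * positions / np.power(1.0e4, embed_range / d_pos_vec)


def interleave_sin_cos(radians, rate=1.0):
    """sin on even channels, cos on odd channels, after scaling by the rate."""
    scaled = radians * rate
    odd_mask = (np.arange(radians.shape[-1]) % 2).astype(np.float32)
    return odd_mask * np.cos(scaled) + (1 - odd_mask) * np.sin(scaled)


radians = position_radians(n_position=4, d_pos_vec=6)
table = interleave_sin_cos(radians)
print(table.shape)
print(table[0])  # position 0: sin(0) = 0 on even channels, cos(0) = 1 on odd ones
```

Note that position 0 does not come out as an all-zero row from the interleaving alone. In the code above, the padding position is zeroed later, either by the lookup with `padding_idx` 0 or by the explicit `F.zeros` concat in `compute_position_embedding`.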
class PositionEmbedding(dg.Layer):
def __init__(self, n_position, d_pos_vec, position_rate=1.0):
"""Position Embedding for Deep Voice 3.
Args:
n_position (int): max position, vocab size for position embedding.
d_pos_vec (int): position embedding size.
position_rate (float, optional): position rate (this should only be used when all the utterances are from one speaker.). Defaults to 1.0.
"""
super(PositionEmbedding, self).__init__()
self.weight = self.create_parameter((n_position, d_pos_vec))
self.weight.set_value(
position_encoding_init(n_position, d_pos_vec, position_rate)
.astype("float32"))
def forward(self, indices, speaker_position_rate=None):
"""
Args:
indices (Variable): shape (B, T), dtype: int64, position
indices, where B means the batch size, T means the time steps.
speaker_position_rate (Variable | float, optional): position
rate. It can be a floating-point number or a Variable with
shape (1,), in which case it is used for every
example. It can also be a Variable with shape (B, ), which
contains a speaker position rate for each utterance.
Returns:
out (Variable): shape(B, T, C_pos), dtype float32, position embedding, where C_pos
means position embedding size.
"""
batch_size, time_steps = indices.shape
if isinstance(speaker_position_rate, float) or \
(isinstance(speaker_position_rate, fluid.framework.Variable)
and list(speaker_position_rate.shape) == [1]):
temp_weight = compute_position_embedding_single_speaker(
self.weight, speaker_position_rate)
out = lookup(temp_weight, indices, 0)
return out
assert len(speaker_position_rate.shape) == 1 and \
list(speaker_position_rate.shape) == [batch_size]
weight = compute_position_embedding(self.weight,
speaker_position_rate) # (B, V, C)
# make indices for gather_nd
batch_id = F.expand(
F.unsqueeze(
F.range(
0, batch_size, 1, dtype="int64"), [1]), [1, time_steps])
# (B, T, 2)
gather_nd_id = F.stack([batch_id, indices], -1)
out = F.gather_nd(weight, gather_nd_id)
return out
import paddle
import numpy as np
from paddle import fluid
import paddle.fluid.dygraph as dg
import paddle.fluid.layers as F
from paddle.fluid.layer_helper import LayerHelper
from paddle.fluid.data_feeder import check_variable_and_dtype
def l2_norm(x, axis, epsilon=1e-12, name=None):
if len(x.shape) == 1:
axis = 0
check_variable_and_dtype(x, "X", ("float32", "float64"), "norm")
helper = LayerHelper("l2_normalize", **locals())
out = helper.create_variable_for_type_inference(dtype=x.dtype)
norm = helper.create_variable_for_type_inference(dtype=x.dtype)
helper.append_op(
type="norm",
inputs={"X": x},
outputs={"Out": out,
"Norm": norm},
attrs={
"axis": 1 if axis is None else axis,
"epsilon": epsilon,
})
return F.squeeze(norm, axes=[axis])
def norm_except_dim(p, dim):
shape = p.shape
ndims = len(shape)
if dim is None:
return F.sqrt(F.reduce_sum(F.square(p)))
elif dim == 0:
p_matrix = F.reshape(p, (shape[0], -1))
return l2_norm(p_matrix, axis=1)
elif dim == -1 or dim == ndims - 1:
p_matrix = F.reshape(p, (-1, shape[-1]))
return l2_norm(p_matrix, axis=0)
else:
perm = list(range(ndims))
perm[0] = dim
perm[dim] = 0
p_transposed = F.transpose(p, perm)
return norm_except_dim(p_transposed, 0)
def _weight_norm(v, g, dim):
shape = v.shape
ndims = len(shape)
if dim is None:
v_normalized = v / (F.sqrt(F.reduce_sum(F.square(v))) + 1e-12)
elif dim == 0:
p_matrix = F.reshape(v, (shape[0], -1))
v_normalized = F.l2_normalize(p_matrix, axis=1)
v_normalized = F.reshape(v_normalized, shape)
elif dim == -1 or dim == ndims - 1:
p_matrix = F.reshape(v, (-1, shape[-1]))
v_normalized = F.l2_normalize(p_matrix, axis=0)
v_normalized = F.reshape(v_normalized, shape)
else:
perm = list(range(ndims))
perm[0] = dim
perm[dim] = 0
p_transposed = F.transpose(v, perm)
transposed_shape = p_transposed.shape
p_matrix = F.reshape(p_transposed, (p_transposed.shape[0], -1))
v_normalized = F.l2_normalize(p_matrix, axis=1)
v_normalized = F.reshape(v_normalized, transposed_shape)
v_normalized = F.transpose(v_normalized, perm)
weight = F.elementwise_mul(v_normalized, g, axis=dim if dim is not None else -1)
return weight
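The reparameterization `w = g * v / ||v||` used here can be illustrated with NumPy. The sketch below is an assumption-laden stand-in (non-negative `dim` only, hypothetical names `norm_except_dim_np` and `weight_norm_np`) and shows that initializing `g` from the norm of `w` and `v` from `w` itself reproduces the original weight, which is what `WeightNorm.apply` relies on.

```python
import numpy as np


def norm_except_dim_np(p, dim=0):
    """L2 norm over every axis except `dim` (non-negative `dim` only)."""
    axes = tuple(i for i in range(p.ndim) if i != dim)
    return np.sqrt(np.sum(np.square(p), axis=axes))


def weight_norm_np(v, g, dim=0):
    """Rebuild the weight as g * v / ||v||, with the norm taken per slice of `dim`."""
    shape = [1] * v.ndim
    shape[dim] = -1  # broadcast g and the norm along `dim`
    norm = norm_except_dim_np(v, dim).reshape(shape)
    return v / (norm + 1e-12) * g.reshape(shape)


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3, 5))

g = norm_except_dim_np(w, dim=0)          # magnitude per output slice
v = w.copy()                              # direction, initialized from w
w_rebuilt = weight_norm_np(v, g, dim=0)
w_scaled = weight_norm_np(10.0 * v, g, dim=0)  # rescaling v leaves w unchanged
print(g.shape, np.allclose(w_rebuilt, w), np.allclose(w_scaled, w))
```

Since only the direction of `v` matters, the magnitude of each slice is controlled entirely by `g`.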
class WeightNorm(object):
def __init__(self, name, dim):
if dim is None:
dim = -1
self.name = name
self.dim = dim
def compute_weight(self, module):
g = getattr(module, self.name + '_g')
v = getattr(module, self.name + '_v')
w = _weight_norm(v, g, self.dim)
return w
@staticmethod
def apply(module: dg.Layer, name, dim):
for k, hook in module._forward_pre_hooks.items():
if isinstance(hook, WeightNorm) and hook.name == name:
raise RuntimeError("Cannot register two weight_norm hooks on "
"the same parameter {}".format(name))
if dim is None:
dim = -1
fn = WeightNorm(name, dim)
# remove w from parameter list
w = getattr(module, name)
del module._parameters[name]
# add g and v as new parameters and express w as g/||v|| * v
g_var = norm_except_dim(w, dim)
v = module.create_parameter(w.shape, dtype=w.dtype)
module.add_parameter(name + "_v", v)
g = module.create_parameter(g_var.shape, dtype=g_var.dtype)
module.add_parameter(name + "_g", g)
with dg.no_grad():
F.assign(w, v)
F.assign(g_var, g)
setattr(module, name, fn.compute_weight(module))
# recompute weight before every forward()
module.register_forward_pre_hook(fn)
return fn
def remove(self, module):
w_var = self.compute_weight(module)
delattr(module, self.name)
del module._parameters[self.name + '_g']
del module._parameters[self.name + '_v']
w = module.create_parameter(w_var.shape, dtype=w_var.dtype)
module.add_parameter(self.name, w)
with dg.no_grad():
F.assign(w_var, w)
def __call__(self, module, inputs):
setattr(module, self.name, self.compute_weight(module))
def weight_norm(module, name='weight', dim=0):
WeightNorm.apply(module, name, dim)
return module
def remove_weight_norm(module, name='weight'):
for k, hook in module._forward_pre_hooks.items():
if isinstance(hook, WeightNorm) and hook.name == name:
hook.remove(module)
del module._forward_pre_hooks[k]
return module
raise ValueError("weight_norm of '{}' not found in {}"
.format(name, module))
\ No newline at end of file