@@ -4,7 +4,7 @@ PaddlePaddle dynamic graph implementation of [WaveFlow: A Compact Flow-based Mod
- WaveFlow can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on an Nvidia V100 GPU without engineered inference kernels, which is faster than [WaveGlow](https://github.com/NVIDIA/waveglow) and several orders of magnitude faster than WaveNet.
- WaveFlow is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M) and comparable to WaveNet (4.6M).
- WaveFlow is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in Parallel WaveNet and ClariNet, which simplifies the training pipeline and reduces the cost of development.
## Project Structure
```text
...
...
@@ -99,7 +99,7 @@ python -u synthesis.py \
--sigma=1.0
```
In this example, `--output` specifies where to save the synthesized audios and `--sample` (<16) specifies which sample in the valid dataset (a split from the whole LJSpeech dataset, by default containing the first 16 audio samples) to synthesize based on the mel-spectrograms computed from the ground truth sample audio, e.g., `--sample=0` means to synthesize the first audio in the valid dataset.
"""pad audios to the largest length and batch them.
Args:
minibatch (List[np.ndarray]): list of rank-1 float arrays (mono-channel audio, shape(T,)) or list of rank-2 float arrays (multi-channel audio, shape(C, T), C stands for the number of channels, T stands for length), dtype: float.
pad_value (float, optional): the pad value. Defaults to 0..
dtype (np.dtype, optional): the data type of the output. Defaults to np.float32.
Returns:
np.ndarray: the output batch. It is a rank-2 float array of shape(B, T) if the minibatch is a list of mono-channel audios, or a rank-3 float array of shape(B, C, T) if the minibatch is a list of multi-channel audios.
"""
minibatch: List[Example]
Example: ndarray, shape(C, T) for multi-channel wav, shape(T,) for mono-channel wav, dtype: float32
"""
# detect data format, maybe better to specify it in __init__
Example: ndarray, shape(C, F, T) for multi-channel spectrogram, shape(F, T) for mono-channel spectrogram, dtype: float32
"""Pad spectra to the largest length and batch them.
Args:
minibatch (List[np.ndarray]): list of rank-2 arrays of shape(F, T) for mono-channel spectrograms, or list of rank-3 arrays of shape(C, F, T) for multi-channel spectrograms (F stands for frequency bands), dtype: float.
pad_value (float, optional): the pad value. Defaults to 0..
dtype (np.dtype, optional): data type of the output. Defaults to np.float32.
Returns:
np.ndarray: a rank-3 array of shape(B, F, T) when the minibatch is a list of mono-channel spectrograms, or a rank-4 array of shape(B, C, F, T) when the minibatch is a list of multi-channel spectrograms.
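The spectrogram case follows the same pattern, padding only the time axis; again a hedged sketch, not the repo's implementation:
```python
import numpy as np

def batch_spec_sketch(minibatch, pad_value=0., dtype=np.float32):
    # Pad the time axis of each (F, T) spectrogram to the longest T in
    # the minibatch, then stack into a (B, F, T) batch.
    max_len = max(example.shape[-1] for example in minibatch)
    batch = [
        np.pad(example, ((0, 0), (0, max_len - example.shape[-1])),
               mode="constant", constant_values=pad_value)
        for example in minibatch
    ]
    return np.array(batch, dtype=dtype)
```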
"""An Iterable object of batches. It requires a dataset, a batch function and a sampler. The sampler yields the example ids, then the corresponding examples in the dataset are collected and transformed into a batch with the batch function.
Args:
dataset (Dataset): the dataset used to build a data cargo.
batch_fn (callable, optional): a callable that takes a list of examples of `dataset` and returns a batch. It can be None if the dataset has a `_batch_examples` method which satisfies the requirement. Defaults to None.
batch_size (int, optional): number of examples in a batch. Defaults to 1.
sampler (Sampler, optional): an iterable of example ids (integers), the example ids are used to pick examples. Defaults to None.
shuffle (bool, optional): when sampler is not provided, shuffle = True creates a RandomSampler and shuffle=False creates a SequentialSampler internally. Defaults to False.
batch_sampler (BatchSampler, optional): an iterable of lists of example ids (integers), the list is used to pick examples. The `batch_sampler` option is mutually exclusive with `batch_size`, `shuffle`, `sampler`, and `drop_last`. Defaults to None.
drop_last (bool, optional): whether to drop the last minibatch. Defaults to False.
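A minimal sketch of the iteration protocol this docstring describes (hypothetical helper, assuming the dataset supports `__getitem__`):
```python
def iter_batches_sketch(dataset, batch_fn, batch_sampler):
    # The protocol described above: the (batch) sampler yields example
    # ids, the ids pick examples from the dataset, and batch_fn turns
    # the picked examples into a single batch.
    for example_ids in batch_sampler:
        examples = [dataset[i] for i in example_ids]
        yield batch_fn(examples)
```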
"""Standard indexing interface for dataset. Inherit this class to
get the indexing interface. Since it is a mixin class which does
not have an `__init__` class, the subclass not need to call
`super().__init__()`.
"""
def __getitem__(self, index):
"""Standard indexing interface for dataset.
Args:
index (slice, list[int], np.array or int): the index. It can be an int, a slice, a list of integers, or an ndarray of integers. It calls `get_example` to pick an example.
Returns:
Example, or List[Example]: If `index` is an integer, it returns an
example. If `index` is a slice, a list of integers or an array of integers,
it returns a list of examples.
"""
if isinstance(index, slice):
start, stop, step = index.indices(len(self))
return [
...
...
@@ -32,6 +47,12 @@ class DatasetMixin(object):
return self.get_example(index)
def get_example(self, i):
"""Get an example from the dataset. Custom datasets should have
this method implemented.
Args:
i (int): example index.
"""
raise NotImplementedError
def __len__(self):
...
...
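For illustration, a toy subclass showing the contract (a hypothetical `SquareDataset`, assuming `DatasetMixin` from this module is in scope):
```python
class SquareDataset(DatasetMixin):
    # A toy dataset: only get_example and __len__ are implemented; the
    # mixin provides int / slice / list indexing through __getitem__.
    def __init__(self, n):
        self.n = n

    def get_example(self, i):
        return i * i

    def __len__(self):
        return self.n

# SquareDataset(10)[2] -> 4; SquareDataset(10)[1:4] -> [1, 4, 9]
```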
@@ -43,9 +64,13 @@ class DatasetMixin(object):
class TransformDataset(DatasetMixin):
"""Transform a dataset to another with a transform."""
def __init__(self, dataset, transform):
"""Dataset which is transformed from another with a transform.
Args:
dataset (DatasetMixin): the base dataset.
transform (callable): the transform which takes an example of the base dataset as its parameter and returns a new example.
"""
self._dataset = dataset
self._transform = transform
...
...
@@ -53,14 +78,17 @@ class TransformDataset(DatasetMixin):
return len(self._dataset)
def get_example(self, i):
# CAUTION: only int is supported?
# CAUTION: the dataset should support __getitem__ and __len__
in_data = self._dataset[i]
return self._transform(in_data)
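Hypothetical usage, reusing the toy `SquareDataset` from above:
```python
# Derive a new dataset by doubling every example of a base dataset.
base = SquareDataset(4)                      # examples: 0, 1, 4, 9
doubled = TransformDataset(base, lambda x: 2 * x)
assert doubled.get_example(2) == 8
```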
class CacheDataset(DatasetMixin):
def __init__(self, dataset):
"""A lazy cache of the base dataset.
Args:
dataset (DatasetMixin): the base dataset to cache.
"""
self._dataset = dataset
self._cache = dict()
...
...
@@ -75,6 +103,11 @@ class CacheDataset(DatasetMixin):
class TupleDataset(object):
def __init__(self, *datasets):
"""A compound dataset made from several datasets of the same length. An example of the `TupleDataset` is a tuple of examples from the constituent datasets.
Args:
datasets: tuple[DatasetMixin], the constituent datasets.
"""
if not datasets:
raise ValueError("no datasets are given")
length = len(datasets[0])
...
...
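Hypothetical usage, based on the behavior the docstring describes:
```python
# Pair two equal-length datasets; per the docstring, an example of the
# TupleDataset is a tuple of the constituents' examples.
features = SquareDataset(3)
labels = SquareDataset(3)
pairs = TupleDataset(features, labels)
# expected per the docstring: pairs[1] == (1, 1)
```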
@@ -105,6 +138,11 @@ class TupleDataset(object):
class DictDataset(object):
def __init__(self, **datasets):
"""A compound dataset made from several datasets of the same length. An example of the `DictDataset` is a dict of examples from the constituent datasets.
Args:
datasets: Dict[DatasetMixin], the constituent datasets.
"""
"""A Dataset which is a slice of the base dataset.
Args:
dataset (DatasetMixin): the base dataset.
start (int): the start of the slice.
finish (int): the end of the slice, not inclusive.
order (List[int], optional): the order, it is a permutation of the valid example ids of the base dataset. If `order` is provided, the slice is taken in `order`. Defaults to None.
"""
if start < 0 or finish > len(dataset):
raise ValueError("subset overruns the dataset.")
self._dataset = dataset
...
...
@@ -168,6 +214,12 @@ class SliceDataset(DatasetMixin):
class SubsetDataset(DatasetMixin):
def __init__(self, dataset, indices):
"""A Dataset which is a subset of the base dataset.
Args:
dataset (DatasetMixin): the base dataset.
indices (Iterable[int]): the indices of the examples to pick.
"""
self._dataset = dataset
if len(indices) > len(dataset):
raise ValueError("subset's size larger than dataset's size!")
...
...
@@ -184,6 +236,12 @@ class SubsetDataset(DatasetMixin):
class FilterDataset(DatasetMixin):
def __init__(self, dataset, filter_fn):
"""A filtered dataset.
Args:
dataset (DatasetMixin): the base dataset.
filter_fn (callable): a callable which takes an example of the base dataset and returns a boolean.
"""
self._dataset = dataset
self._indices = [
i for i in range(len(dataset)) if filter_fn(dataset[i])
...
...
@@ -200,6 +258,11 @@ class FilterDataset(DatasetMixin):
class ChainDataset(DatasetMixin):
def __init__(self, *datasets):
"""A concatenation of several datasets which have the same structure.
Args:
datasets (Iterable[DatasetMixin]): datasets to concat.
In most cases, we have a non-stream dataset, which means we can randomly access it with __getitem__, and we can get the length of the dataset with __len__.
This suffices for a sampler. We implement a sampler as an iterable of valid indices. By valid, we mean 0 <= index < N, where N is the length of the dataset. We then collect several indices within a batch and use them to collect examples from the dataset with __getitem__. Then we transform these examples into a batch.
So the sampler is only responsible for generating valid indices.
"""
...
...
@@ -24,9 +24,6 @@ import random
class Sampler(object):
def __init__(self, data_source):
pass
def __iter__(self):
# return an iterator of indices
# or an iterator of list[int], for BatchSampler
...
...
@@ -35,6 +32,11 @@ class Sampler(object):
class SequentialSampler(Sampler):
def __init__(self, data_source):
"""Sequential sampler, the simplest sampler that samples indices from 0 to N - 1, where N is the dataset's length.
Args:
data_source (DatasetMixin): the dataset. This is used to get the dataset's length.
"""
self.data_source = data_source
def __iter__(self):
...
...
@@ -46,6 +48,13 @@ class SequentialSampler(Sampler):
data_source (DatasetMixin): the dataset. This is used to get the dataset's length.
replacement (bool, optional): whether replacement is enabled in sampling. When `replacement` is True, `num_samples` must be provided. Defaults to False.
num_samples (int, optional): numbers of indices to draw. This option should only be provided when replacement is True. Defaults to None.
"""
self.data_source = data_source
self.replacement = replacement
self._num_samples = num_samples
...
...
@@ -66,7 +75,6 @@ class RandomSampler(Sampler):
@property
def num_samples(self):
# dataset size might change at runtime
if self._num_samples is None:
return len(self.data_source)
return self._num_samples
...
...
@@ -84,12 +92,16 @@ class RandomSampler(Sampler):
class SubsetRandomSampler(Sampler):
"""Samples elements randomly from a given list of indices, without replacement.
"""
def __init__(self, indices):
"""
Args:
indices (List[int]): indices to sample from.
"""
self.indices = indices
def __iter__(self):
...
...
@@ -112,6 +124,14 @@ class PartialyRandomizedSimilarTimeLengthSampler(Sampler):
batch_size=4,
batch_group_size=None,
permutate=True):
"""[summary]
Args:
lengths (List[int]): The length of the examples of the dataset. This is the key to be considered as 'time length'.
batch_size (int, optional): batch size. Defaults to 4.
batch_group_size (int, optional): the size of a small batch. Random shuffling is applied within such patches. If `batch_group_size` is not provided, it is set to min(batch_size * 32, len(self.lengths)). Batch_group_size should be perfectly divided by batch_size. Defaults to None.
permutate (bool, optional): permutate batches. Defaults to True.
"""
_lengths = np.array(
lengths,
dtype=np.int64)  # maybe better to implement length as a sort key
...
...
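The bucketing idea, as a hedged numpy sketch (hypothetical helper, ignoring the `permutate` option):
```python
import numpy as np

def similar_length_batches_sketch(lengths, batch_size, batch_group_size):
    # Sort indices by length so each batch holds similar-length examples,
    # but shuffle inside small groups so batching is not deterministic.
    indices = np.argsort(lengths).astype(np.int64)
    for start in range(0, len(indices), batch_group_size):
        np.random.shuffle(indices[start:start + batch_group_size])
    return [indices[i:i + batch_size]
            for i in range(0, len(indices), batch_size)]
```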
@@ -157,13 +177,11 @@ class PartialyRandomizedSimilarTimeLengthSampler(Sampler):
class WeightedRandomSampler(Sampler):
"""Samples elements from ``[0,..,len(weights)-1]`` with given probabilities (weights).
Args:
weights (List[float]): a sequence of weights, not necessarily summing up to 1.
num_samples (int): number of samples to draw.
replacement (bool): whether samples are drawn with replacement. When replacement is False, num_samples should not be larger than len(weights).
"""Sampler used for data parallel training. Indices are divided into num_trainers parts. Each trainer gets a subset and iter that subset. If the dataset has 16 examples, and there are 4 trainers.
Trainer 0 gets [0, 4, 8, 12];
Trainer 1 gets [1, 5, 9, 13];
Trainer 2 gets [2, 6, 10, 14];
trainer 3 gets [3, 7, 11, 15].
It ensures that trainer get different parts of the dataset. If dataset's length cannot be perfectly devidef by num_trainers, some examples appended to the dataset, to ensures that every trainer gets the same amounts of examples.
Args:
dataset_size (int): the length of the dataset.
num_trainers (int): number of trainers(training processes).
rank (int): local rank of the trainer.
shuffle (bool, optional): whether to shuffle the indices before iteration. Defaults to True.
"""
self.dataset_size = dataset_size
self.num_trainers = num_trainers
self.rank = rank
...
...
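A sketch of the partition rule described above (hypothetical helper; the real class also handles shuffling):
```python
def distributed_indices_sketch(dataset_size, num_trainers, rank):
    # Pad the index list so it splits evenly, then deal the indices
    # out round-robin, one per trainer.
    indices = list(range(dataset_size))
    padding = (num_trainers - dataset_size % num_trainers) % num_trainers
    indices += indices[:padding]          # repeat a few examples to pad
    return indices[rank::num_trainers]

print(distributed_indices_sketch(16, 4, 1))  # [1, 5, 9, 13]
```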
@@ -222,20 +259,20 @@ class DistributedSampler(Sampler):
class BatchSampler(Sampler):
r"""Wraps another sampler to yield a mini-batch of indices.
Args:
sampler (Sampler): Base sampler.
batch_size (int): Size of mini-batch.
drop_last (bool): If ``True``, the sampler will drop the last batch if its size would be less than ``batch_size``.
encoder (UpsampleNet): an UpsampleNet to upsample mel spectrogram.
teacher (WaveNet): a WaveNet, the teacher.
student (ParallelWaveNet): a ParallelWaveNet model, the student.
stft (STFT): a STFT model to perform differentiable stft transform.
min_log_scale (float, optional): used only for computing loss, the minimal value of the log standard deviation of the output distribution of both the teacher and the student. Defaults to -6.0.
lmd (float, optional): weight for stft loss. Defaults to 4.0.
clip_kl (bool, optional): whether to clip kl_loss by maximum=100. Defaults to True.
Returns:
Dict(str, Variable)
loss (Variable): shape(1, ), dtype: float, total loss.
kl (Variable): shape(1, ), dtype: float, kl divergence between the teacher's output distribution and the student's output distribution.
regularization (Variable): shape(1, ), dtype: float, a regularization term of the KL divergence.
spectrogram_frame_loss (Variable): shape(1, ), dtype: float, stft loss, the L1-distance of the magnitudes of the spectrograms of the ground truth waveform and the synthesized waveform.
Variable: shape(B, T_audio), the synthesized waveform. (T_audio = T_mel * upscale_factor, where upscale_factor is the `upscale_factor` of the encoder.)
"""Transform a random noise sampled from a standard Gaussian distribution into sample from the target distribution. And output the mean and log standard deviation of the output distribution.
Args:
z (Variable): shape(B, T), random noise sampled from a standard gaussian distribution.
condition (Variable, optional): shape(B, F, T), dtype: float, the upsampled condition. Defaults to None.
Returns:
Variable: shape(B, T), the transformed z.
query_dim (int): the dimension of query vectors. (The size of a single vector of query.)
embed_dim (int): the dimension of keys and values.
dropout (float, optional): dropout probability of attention. Defaults to 0.0.
window_range (WindowRange, optional): range of attention, this is only used at inference. Defaults to WindowRange(-1, 3).
key_projection (bool, optional): whether the `Attention` Layer has a Linear Layer for the keys to pass through before computing attention. Defaults to True.
value_projection (bool, optional): whether the `Attention` Layer has a Linear Layer for the values to pass through before computing attention. Defaults to True.
Compute contextualized representation and alignment scores.
Args:
query (Variable): shape(B, T_dec, C_q), dtype: float, the query tensor, where C_q means the query dim.
encoder_out (keys, values):
keys (Variable): shape(B, T_enc, C_emb), dtype: float, the key representation from an encoder, where C_emb means embed dim.
values (Variable): shape(B, T_enc, C_emb), dtype: float, the value representation from an encoder, where C_emb means embed dim.
mask (Variable, optional): shape(B, T_enc), dtype: float, mask generated with valid text lengths. Pad tokens correspond to 1, and valid tokens correspond to 0.
last_attended (int, optional): the position that received the most attention at the last time step. This is only used at inference.
Outputs:
x (Variable): shape(B, T_dec, C_q), dtype: float, the contextualized representation from the attention mechanism.
attn_scores (Variable): shape(B, T_dec, T_enc), dtype: float, the alignment tensor, where T_dec means the number of decoder time steps and T_enc means the number of encoder time steps.
"""
keys, values = encoder_out
residual = query
...
...
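For reference, a numpy sketch of the underlying dot-product attention computation (projections, masking, and the inference window omitted; not the repo's code):
```python
import numpy as np

def dot_attention_sketch(query, keys, values):
    # Plain dot-product attention over (B, T_dec, C) queries and
    # (B, T_enc, C) keys/values.
    scores = np.matmul(query, keys.transpose(0, 2, 1))   # (B, T_dec, T_enc)
    scores -= scores.max(axis=-1, keepdims=True)         # for stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)             # softmax
    context = np.matmul(attn, values)                    # (B, T_dec, C)
    return context, attn
```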
@@ -85,7 +85,6 @@ class Attention(dg.Layer):
if self.key_projection:
keys = self.key_proj(keys)
x = self.query_proj(query)
# TODO: check the code
x = F.matmul(x, keys, transpose_y=True)
...
...
@@ -97,7 +96,6 @@ class Attention(dg.Layer):
# if last_attended is provided, focus only on a window range around it
@@ -24,10 +24,7 @@ from parakeet.modules.weight_norm import Conv1D, Conv1DCell, Conv2D, Linear
class Conv1DGLU(dg.Layer):
"""
A Convolution 1D block with GLU activation. It also applys dropout for the
input x. It fuses speaker embeddings through a FC activated by softsign. It
has residual connection from the input x, and scale the output by
np.sqrt(0.5).
A Convolution 1D block with GLU activation. It also applys dropout for the input x. It integrates speaker embeddings through a Linear activated by softsign. It has residual connection from the input x, and scale the output by np.sqrt(0.5).
"""
def __init__(self,
...
...
@@ -41,8 +38,21 @@ class Conv1DGLU(dg.Layer):
dropout=0.0,
causal=False,
residual=True):
"""
Args:
n_speakers (int): number of speakers.
speaker_dim (int): speaker embedding's size.
in_channels (int): channels of the input.
num_filters (int): channels of the output.
filter_size (int, optional): filter size of the internal Conv1DCell. Defaults to 1.
dilation (int, optional): dilation of the internal Conv1DCell. Defaults to 1.
std_mul (float, optional): used to compute the standard deviation of the convolution weights at initialization. Defaults to 4.0.
dropout (float, optional): dropout probability. Defaults to 0.0.
causal (bool, optional): whether the internal Conv1DCell uses causal padding. It should be True if the `add_input` method of `Conv1DCell` is ever used. Defaults to False.
residual (bool, optional): whether to use residual connection. If True, in_channels should equal num_filters. Defaults to True.
"""
super(Conv1DGLU, self).__init__()
# conv spec
self.in_channels = in_channels
self.n_speakers = n_speakers
...
...
@@ -83,18 +93,12 @@ class Conv1DGLU(dg.Layer):
def forward(self, x, speaker_embed=None):
"""
Args:
x (Variable): shape(B, C_in, T), dtype: float, the input of the Conv1DGLU layer, where B means batch_size, C_in means the input channels and T means the input time steps.
speaker_embed (Variable): shape(B, C_sp), dtype: float, speaker embedding, where C_sp means the speaker embedding size. Note that when using residual connection, the Conv1DGLU does not change the number of channels, so the output channels equal the input channels.
Returns:
x (Variable): shape(B, C_out, T), the output of Conv1DGLU, where C_out means the `num_filters`.
"""
residual = x
x = F.dropout(
...
...
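A numpy sketch of the GLU-plus-residual pattern the docstring describes (hypothetical helper):
```python
import numpy as np

def glu_residual_sketch(h, residual):
    # Split the conv output h (B, 2 * C_out, T) into content and gate
    # halves, gate with a sigmoid, add the input back, and rescale by
    # sqrt(0.5) to keep the variance roughly constant.
    content, gate = np.split(h, 2, axis=1)
    out = content * (1.0 / (1.0 + np.exp(-gate)))   # GLU
    return np.sqrt(0.5) * (out + residual)
```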
@@ -114,22 +118,20 @@ class Conv1DGLU(dg.Layer):
return x
def start_sequence(self):
"""Prepare the Conv1DGLU to generate a new sequence. This method should be called before calling `add_input` multiple times.
"""
self.conv.start_sequence()
def add_input(self, x_t, speaker_embed=None):
"""
Takes a step of inputs and returns a step of outputs. It works similarly to the `forward` method, but in a `step-in-step-out` fashion.
Args:
x_t (Variable): shape(B, C_in), dtype: float, a step of the input, where B means batch_size, C_in means the input channels.
speaker_embed (Variable): shape(B, C_sp), dtype: float, speaker embedding, where C_sp means the speaker embedding size.
"""
"""Vocoder that transforms mel spectrogram (or ecoder hidden states) to waveform."""
def __init__(self,
n_speakers,
...
...
@@ -134,6 +167,17 @@ class Converter(dg.Layer):
convolutions=(ConvSpec(256,5,1),)*4,
time_upsampling=1,
dropout=0.0):
"""[summary]
Args:
n_speakers (int): number of speakers.
speaker_dim (int): speaker embedding size.
in_channels (int): channels of the input.
linear_dim (int): channels of the linear spectrogram.
convolutions (Iterable[ConvSpec], optional): specifications of the internal convolutional layers. ConvSpec is a namedtuple of (output_channels, filter_size, dilation). Defaults to (ConvSpec(256, 5, 1), ) * 4.
time_upsampling (int, optional): time upsampling factor of the converter, possible options are {1, 2, 4}. Note that this should equal the downsample factor of the mel spectrogram. Defaults to 1.
dropout (float, optional): dropout probability. Defaults to 0.0.
"""
super(Converter, self).__init__()
self.n_speakers = n_speakers
...
...
@@ -215,23 +259,12 @@ class Converter(dg.Layer):
Convert mel spectrogram or decoder hidden states to linear spectrogram.
Args:
x (Variable): shape(B, T_mel, C_in), dtype: float, converter inputs, where C_in means the input channel for the converter. Note that it can be either C_mel (channel of mel spectrogram) or C_dec // r.
When using mel spectrogram as the input of the converter, C_in = C_mel; and when using decoder states as the input of the converter, C_in = C_dec // r.
speaker_embed (Variable, optional): shape(B, C_sp), dtype: float, speaker embedding, where C_sp means the speaker embedding size.
Returns:
out (Variable): shape(B, T_lin, C_lin), the output linear spectrogram, where C_lin means the channel of the linear spectrogram and T_lin means the length (time steps) of the linear spectrogram. T_lin = time_upsampling * T_mel, which depends on the time_upsampling of the converter.
valid_lengths (Variable): shape(B, ), dtype: int64, a rank-1 Tensor containing the valid lengths (time steps) of each example, where B means batch_size.
max_len (int): the length (number of time steps) of the mask.
dtype (str, optional): a string that specifies the data type of the returned mask. Defaults to 'float32'.
Returns:
mask (Variable): shape(B, max_len), dtype: float, a mask computed from valid lengths.
r (int, optional): number of frames generated per decoder step. Defaults to 1.
max_positions (int, optional): max position for text and decoder steps. Defaults to 512.
convolutions (Iterable[ConvSpec], optional): specification of causal convolutional layers inside the decoder. ConvSpec is a namedtuple of output_channels, filter_size and dilation. Defaults to (ConvSpec(128, 5, 1), )*4.
attention (bool or List[bool], optional): whether to use attention. It should have the same length as `convolutions` if it is a list of bool, indicating whether to have an Attention layer coupled with the corresponding convolutional layer. If it is a bool, it is repeated len(convolutions) times internally. Defaults to True.
dropout (float, optional): dropout probability. Defaults to 0.0.
use_memory_mask (bool, optional): whether to use memory mask at the Attention layer. It should have the same length as `attention` if it is a list of bool, indicating whether to use memory mask at the corresponding Attention layer. If it is a bool, it is repeated len(attention) times internally. Defaults to False.
force_monotonic_attention (bool, optional): whether to use monotonic attention at the Attention layer at inference. It should have the same length as `attention` if it is a list of bool, indicating whether to use monotonic attention at the corresponding Attention layer. If it is a bool, it is repeated len(attention) times internally. Defaults to False.
query_position_rate (float, optional): position_rate of the PositionEmbedding for query. Defaults to 1.0.
key_position_rate (float, optional): position_rate of the PositionEmbedding for key. Defaults to 1.0.
window_range (WindowRange, optional): window range of monotonic attention. Defaults to WindowRange(-1, 3).
key_projection (bool, optional): `key_projection` of Attention layers. Defaults to True.
value_projection (bool, optional): `value_projection` of Attention layers. Defaults to True.
"""
super(Decoder, self).__init__()
self.dropout = dropout
...
...
@@ -125,10 +138,9 @@ class Decoder(dg.Layer):
conv_channels = convolutions[0].out_channels
# only when padding idx is 0 can we easily handle it
hidden states, where C_dec means the channels of decoder states.
outputs (Variable): shape(B, T_mel, C_mel), dtype: float, decoder outputs, where C_mel means the channels of mel-spectrogram, T_mel means the length(time steps) of mel spectrogram.
alignments (Variable): shape(N, B, T_mel // r, T_enc), dtype: float, the alignment tensor between the decoder and the encoder, where N means number of Attention Layers, T_mel means the length of mel spectrogram, r means the outputs per decoder step, T_enc means the encoder time steps.
done (Variable): shape(B, T_mel // r), dtype: float, probability that the last frame has been generated.
decoder_states (Variable): shape(B, T_mel, C_dec // r), dtype: float, decoder hidden states, where C_dec means the channels of decoder states (the output channels of the last `convolutions`). Note that it should be perfectly divided by `r`.
"""
if speaker_embed is not None:
speaker_embed = F.dropout(
...
...
@@ -366,6 +357,8 @@ class Decoder(dg.Layer):
return r
def start_sequence(self):
"""Prepare the Decoder to decode. This method is called by `decode`.
"""
for layer in self.prenet:
if isinstance(layer, Conv1DGLU):
layer.start_sequence()
...
...
@@ -379,6 +372,25 @@ class Decoder(dg.Layer):
text_positions,
speaker_embed=None,
test_inputs=None):
"""Decode from the encoder's output and other conditions.
Args:
encoder_out (keys, values):
keys (Variable): shape(B, T_enc, C_emb), dtype: float, the key representation from an encoder, where C_emb means text embedding size.
values (Variable): shape(B, T_enc, C_emb), dtype: float, the value representation from an encoder, where C_emb means text embedding size.
text_positions (Variable): shape(B, T_enc), dtype: int64, position indices for the text inputs for the encoder, where T_enc means the encoder time steps.
speaker_embed (Variable, optional): shape(B, C_sp), speaker embedding, only used for multispeaker model.
test_inputs (Variable, optional): shape(B, T_test, C_mel). test input, it is only used for debugging. Defaults to None.
Returns:
outputs (Variable): shape(B, T_mel, C_mel), dtype: float, decoder outputs, where C_mel means the channels of mel-spectrogram, T_mel means the length(time steps) of mel spectrogram.
alignments (Variable): shape(N, B, T_mel // r, T_enc), dtype: float, the alignment tensor between the decoder and the encoder, where N means number of Attention Layers, T_mel means the length of mel spectrogram, r means the outputs per decoder step, T_enc means the encoder time steps.
done (Variable): shape(B, T_mel // r), dtype: float, probability that the last frame has been generated. If the probability is larger than 0.5 at a step, the generation stops.
decoder_states (Variable): shape(B, T_mel, C_dec // r), dtype: float, decoder hidden states, where C_dec means the channels of decoder states (the output channels of the last `convolutions`). Note that it should be perfectly divided by `r`.
Note:
Only single instance inference is supported now, so B = 1.
n_vocab (int): vocabulary size of the text embedding.
embed_dim (int): embedding size of the text embedding.
n_speakers (int): number of speakers.
speaker_dim (int): speaker embedding size.
padding_idx (int, optional): padding index of text embedding. Defaults to None.
embedding_weight_std (float, optional): standard deviation of the embedding weights when initialized. Defaults to 0.1.
convolutions (Iterable[ConvSpec], optional): specifications of the convolutional layers. ConvSpec is a namedtuple of output channels, filter_size and dilation. Defaults to (ConvSpec(64, 5, 1), )*7.
dropout (float, optional): dropout probability. Defaults to 0..
"""
super(Encoder, self).__init__()
self.embedding_weight_std = embedding_weight_std
self.embed = dg.Embedding(
(n_vocab, embed_dim),
...
...
@@ -101,18 +111,12 @@ class Encoder(dg.Layer):
Encode text sequence.
Args:
x (Variable): shape(B, T_enc), dtype: int64, the input text indices. T_enc means the time steps of the input x.
speaker_embed (Variable, optional): shape(B, C_sp), dtype: float, speaker embeddings. This arg is not None only when the model is a multispeaker model.
Returns:
keys (Variable), shape(B, T_enc, C_emb), dtype: float, the encoded representation for keys, where C_emb means the text embedding size.
values (Variable), shape(B, T_enc, C_emb), dtype: float, the encoded representation for values.
predicted_attention (Variable): shape(*, B, T_dec, T_enc), dtype: float, the alignment tensor, where B means batch size, T_dec means number of time steps of the decoder, T_enc means the number of time steps of the encoder, * means other possible dimensions.
@@ -27,17 +27,16 @@ from parakeet.models.wavenet.wavenet import WaveNet
def crop(x, audio_start, audio_length):
"""Crop the upsampled condition to match audio_length. The upsampled condition has the same time steps as the whole audio does. But since audios are randomly sliced to 0.5 seconds while conditions are not, upsampled conditions should also be sliced to exactly match the time steps of the audio slice.
Args:
x (Variable): shape(B, C, T), dtype: float, the upsampled condition.
audio_start (Variable): shape(B, ), dtype: int64, the index of the starting point.
audio_length (int): the length of the audio (number of samples it contains).
"""A upsampling net (bridge net) in clarinet to upsample spectrograms from frame level to sample level.
It consists of several(2) layers of transposed_conv2d. in time and frequency.
The time dim is dilated hop_length times. The frequency bands retains.
"""
def __init__(self, upscale_factors=[16, 16]):
"""UpsampleNet.
It consists of several layers of Conv2DTranspose. Each Conv2DTranspose layer upsamples the time dimension by a factor of its `stride`. And each Conv2DTranspose's filter_size at the frequency dimension is 3.
Args:
upscale_factors (list[int], optional): time upsampling factors for each Conv2DTranspose layer. The `UpsampleNet` contains len(upscale_factors) Conv2DTranspose layers. Each upscale_factor is used as the `stride` for the corresponding Conv2DTranspose. Defaults to [16, 16].
Note:
np.prod(upscale_factors) should equal the `hop_length` of the stft transformation used to extract spectrogram features from audio. For example, 16 * 16 = 256, so the spectrogram should be extracted with an stft transformation whose `hop_length` is 256. See `librosa.stft` for more details.
"""
super(UpsampleNet, self).__init__()
self.upscale_factors = list(upscale_factors)
self.upsample_convs = dg.LayerList()
...
...
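A quick check of the relation the Note describes, under the default upscale_factors:
```python
import numpy as np

# Why np.prod(upscale_factors) must equal the STFT hop_length: every
# spectrogram frame has to be stretched to exactly hop_length samples so
# that the upsampled condition aligns one-to-one with the waveform.
upscale_factors = [16, 16]
hop_length = 256
assert np.prod(upscale_factors) == hop_length
# a (B, F, T) spectrogram becomes (B, F, T * 256) after both layers
```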
@@ -74,13 +76,13 @@ class UpsampleNet(dg.Layer):
return np.prod(self.upscale_factors)
def forward(self, x):
"""upsample local condition to match time steps of input signals. i.e. upsample mel spectrogram to match time steps for waveform, for each layer of a wavenet.
Arguments:
x {Variable} -- shape(batch_size, frequency, time_steps), local condition
"""Compute the upsampled condition.
Args:
x (Variable): shape(B, F, T), dtype: float, the condition (mel spectrogram here.) (F means the frequency bands). In the internal Conv2DTransposes, the frequency dimension is treated as `height` dimension instead of `in_channels`.
Returns:
Variable -- shape(batch_size, frequency, time_steps * np.prod(upscale_factors)), upsampled condition for each layer.
Variable: shape(B, F, T * upscale_factor), dtype: float, the upsampled condition.
"""A Residual block in wavenet. It does not have parametric residual or skip connection. It consists of a Conv1DCell and an Conv1D(filter_size = 1) to integrate the condition.
Args:
residual_channels (int): the channels of the input, residual and skip.
condition_dim (int): the channels of the condition.
filter_size (int): filter size of the internal convolution cell.
dilation (int): dilation of the internal convolution cell.
"""
super(ResidualBlock, self).__init__()
dilated_channels = 2 * residual_channels
# following clarinet's implementation, we do not have parametric residual
...
...
@@ -64,17 +90,16 @@ class ResidualBlock(dg.Layer):
self.condition_dim = condition_dim
def forward(self, x, condition=None):
"""Conv1D gated tanh Block
Arguments:
x {Variable} -- shape(batch_size, residual_channels, time_steps), the input.
Keyword Arguments:
condition {Variable} -- shape(batch_size, condition_dim, time_steps), upsampled local condition, it has the shape time steps as the input x. (default: {None})
"""Conv1D gated-tanh Block.
Args:
x (Variable): shape(B, C_res, T), the input. (B stands for batch_size, C_res stands for residual channels, T stands for time steps.) dtype: float.
condition (Variable, optional): shape(B, C_cond, T), the condition, it has been upsampled in time steps, so it has the same time steps as the input does.(C_cond stands for the condition's channels). Defaults to None.
Returns:
Variable -- shape(batch_size, residual_channels, time_steps), the output which is used as the input of the next layer.
Variable -- shape(batch_size, residual_channels, time_steps), the output which is stacked alongside with other layers' as the output of wavenet.
(residual, skip_connection)
residual (Variable): shape(B, C_res, T), the residual, which is used as the input to the next layer of ResidualBlock.
skip_connection (Variable): shape(B, C_res, T), the skip connection. This output is accumulated with that of other ResidualBlocks.
"""
time_steps = x.shape[-1]
h = x
...
...
@@ -98,20 +123,21 @@ class ResidualBlock(dg.Layer):
return residual, skip_connection
def start_sequence(self):
"""Prepare the ResidualBlock to generate a new sequence. This method should be called before calling `add_input` multiple times.
"""
self.conv.start_sequence()
def add_input(self, x, condition=None):
"""Add a step input and return a step output.
Args:
x (Variable): shape(B, C_res, T=1), dtype: float, a step of the input.
"""The residual network in wavenet. It consists of `n_layer` stacks, each of which consists of `n_loop` ResidualBlocks.
Args:
n_loop (int): number of ResidualBlocks in a stack.
n_layer (int): number of stacks in the `ResidualNet`.
residual_channels (int): channels of each `ResidualBlock`'s input.
condition_dim (int): channels of the condition.
filter_size (int): filter size of the internal Conv1DCell of each `ResidualBlock`.
"""
super(ResidualNet, self).__init__()
# double the dilation at each layer in a loop(n_loop layers)
dilations = [2**i for i in range(n_loop)] * n_layer
...
...
@@ -145,19 +180,14 @@ class ResidualNet(dg.Layer):
])
def forward(self, x, condition=None):
"""n_layer layers of n_loop Residual Blocks.
Arguments:
x {Variable} -- shape(batch_size, residual_channels, time_steps), input of the residual net.
Keyword Arguments:
condition {Variable} -- shape(batch_size, condition_dim, time_steps), upsampled conditions, which has the same time steps as the input. (default: {None})
Returns:
Variable -- shape(batch_size, skip_channels, time_steps), output of the residual net.
"""
Args:
x (Variable): shape(B, C_res, T), dtype: float, the input. (B stands for batch_size, C_res stands for residual channels, T stands for time steps.)
condition (Variable, optional): shape(B, C_cond, T), dtype: float, the condition, it has been upsampled in time steps, so it has the same time steps as the input does.(C_cond stands for the condition's channels) Defaults to None.
#before_resnet = time.time()
Returns:
skip_connection (Variable): shape(B, C_res, T), dtype: float, the output.
"""
for i, func in enumerate(self.residual_blocks):
x, skip = func(x, condition)
if i == 0:
...
...
@@ -165,24 +195,23 @@ class ResidualNet(dg.Layer):
else:
skip_connections = F.scale(skip_connections + skip,
np.sqrt(0.5))
#print("resnet: ", time.time() - before_resnet)
return skip_connections
def start_sequence(self):
"""Prepare the ResidualNet to generate a new sequence. This method should be called before calling `add_input` multiple times.
"""
for block in self.residual_blocks:
block.start_sequence()
def add_input(self, x, condition=None):
"""add step input and return step output.
Arguments:
x {Variable} -- shape(batch_size, residual_channels, time_steps=1), step input.
"""Wavenet that transform upsampled mel spectrogram into waveform.
Args:
n_loop (int): n_loop for the internal ResidualNet.
n_layer (int): n_layer for the internal ResidualNet.
residual_channels (int): the channel of the input.
output_dim (int): the channel of the output distribution.
condition_dim (int): the channel of the condition.
filter_size (int): the filter size of the internal ResidualNet.
loss_type (str): loss type of the wavenet. Possible values are 'softmax' and 'mog'. If `loss_type` is 'softmax', the output is the logits of the categorical (multinomial) distribution, and `output_dim` means the number of classes of the categorical distribution. If `loss_type` is 'mog' (mixture of gaussians), the output is the parameters of a mixture of gaussians, which consists of the weight (in the form of a logit) of each gaussian distribution and its mean and log standard deviation. So when `loss_type` is 'mog', `output_dim` should be perfectly divided by 3.
log_scale_min (int): the minimum value of the log standard deviation of the output gaussian distributions. Note that this value is only used for computing loss if `loss_type` is 'mog'; values less than `log_scale_min` are clipped when computing loss.
"""
super(WaveNet, self).__init__()
if loss_type not in ["softmax", "mog"]:
raise ValueError("loss_type {} is not supported".format(loss_type))
...
...
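A sketch of how the 'mog' output is laid out, which is why `output_dim` must be divisible by 3 (hypothetical helper):
```python
import numpy as np

def split_mog_params_sketch(y, n_mixture):
    # For loss_type 'mog', the network output concatenates, per time
    # step, a logit weight, a mean and a log std for each Gaussian.
    assert y.shape[-1] == 3 * n_mixture
    logits, mean, log_std = np.split(y, 3, axis=-1)
    return logits, mean, log_std
```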
@@ -225,19 +266,16 @@ class WaveNet(dg.Layer):
self.log_scale_min = log_scale_min
def forward(self, x, condition=None):
"""(Possibly) Conditonal Wavenet.
Arguments:
x {Variable} -- shape(batch_size, time_steps), the input signal of wavenet. The waveform in 0.5 seconds.
Keyword Arguments:
conditions {Variable} -- shape(batch_size, condition_dim, 1, time_steps), the upsampled local condition. (default: {None})
"""compute the output distribution (represented by its parameters).
Args:
x (Variable): shape(B, T), dtype: float, the input waveform.
condition (Variable, optional): shape(B, C_cond, T), dtype: float, the upsampled condition. Defaults to None.
Returns:
Variable -- shape(batch_size, time_steps, output_dim), output distributions at each time_steps.
Variable: shape(B, T, C_output), dtype: float, the parameter of the output distributions.
"""
# CAUTION: rank-4 condition here
# Causal Conv
ifself.loss_type=="softmax":
x=F.clip(x,min=-1.,max=0.99999)
...
...
@@ -258,21 +296,20 @@ class WaveNet(dg.Layer):
return y
def start_sequence(self):
"""Prepare the WaveNet to generate a new sequence. This method should be called before calling `add_input` multiple times.
"""
self.resnet.start_sequence()
def add_input(self, x, condition=None):
"""add step input
Arguments:
x {Variable} -- shape(batch_size, time_steps=1), step input.
"""compute the output distribution (represented by its parameters) for a step. It works similarily with the `forward` method but in a `step-in-step-out` fashion.
Args:
x (Variable): shape(B, T=1), dtype: float, a step of the input waveform.
condition (Variable, optional): shape(B, C_cond, T=1), dtype: float, a step of the upsampled condition. Defaults to None.
Returns:
Variable -- ouput parameter for the distribution.
Variable: shape(B, T=1, C_output), dtype: float, the parameter of the output distributions.
"""
# Causal Conv
ifself.loss_type=="softmax":
x=quantize(x,self.output_dim)
...
...
@@ -292,16 +329,15 @@ class WaveNet(dg.Layer):
return y
def compute_softmax_loss(self, y, t):
"""compute loss, it is basically a language_model-like loss.
Arguments:
y {Variable} -- shape(batch_size, time_steps - 1, output_dim), output distribution of multinomial distribution.
t {Variable} -- shape(batch_size, time_steps - 1), target waveform.
"""compute the loss where output distribution is a categorial distribution.
Args:
y (Variable): shape(B, T, C_output), dtype: float, the logits of the output distribution.
t (Variable): shape(B, T), dtype: float, the target audio. Note that the target's corresponding time index is one step ahead of the output distribution. And output distribution whose input contains padding is neglected in loss computation.
Returns:
Variable -- shape(1,), loss
Variable: shape(1, ), dtype: float, the loss.
"""
# context size is not taken into account
y = y[:, self.context_size:, :]
t = t[:, self.context_size:]
...
...
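A numpy sketch of the loss described above, under the assumption that targets in [-1, 1) are quantized into `n_classes` bins (hypothetical helper, not the repo's implementation):
```python
import numpy as np

def softmax_nll_sketch(y, t, n_classes):
    # Quantize the target waveform into n_classes bins, then take the
    # negative log likelihood under the predicted categorical
    # distribution (logits y of shape (B, T, n_classes)).
    labels = np.clip(((t + 1.0) / 2.0 * n_classes).astype(np.int64),
                     0, n_classes - 1)
    y = y - y.max(axis=-1, keepdims=True)              # stable log-softmax
    log_probs = y - np.log(np.exp(y).sum(axis=-1, keepdims=True))
    nll = -np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return nll.mean()
```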
@@ -314,15 +350,14 @@ class WaveNet(dg.Layer):
return reduced_loss
def sample_from_softmax(self, y):
"""sample from output distribution.
Arguments:
y {Variable} -- shape(batch_size, time_steps - 1, output_dim), output distribution.
"""Sample from the output distribution where the output distribution is a categorical distriobution.
Args:
y (Variable): shape(B, T, C_output), the logits of the output distribution
Variable: shape(B, T), waveform sampled from the output distribution.
"""
# dequantize
batch_size, time_steps, output_dim = y.shape
y = F.reshape(y, (batch_size * time_steps, output_dim))
...
...
@@ -333,17 +368,15 @@ class WaveNet(dg.Layer):
return samples
def compute_mog_loss(self, y, t):
"""compute the loss with an mog output distribution.
WARNING: this is not a legal probability, but a density. so it might be greater than 1.
Arguments:
y {Variable} -- shape(batch_size, time_steps, output_dim), output distribution's parameter. To represent a mixture of Gaussians. The output for each example at each time_step consists of 3 parts. The mean, the stddev, and a weight for that gaussian.
t {Variable} -- shape(batch_size, time_steps), target waveform.
"""compute the loss where output distribution is a mixture of Gaussians.
Args:
y (Variable): shape(B, T, C_output), dtype: float, the parameterd of the output distribution. It is the concatenation of 3 parts, the logits of every distribution, the mean of each distribution and the log standard deviation of each distribution. Each part's shape is (B, T, n_mixture), where `n_mixture` means the number of Gaussians in the mixture.
t (Variable): shape(B, T), dtype: float, the target audio. Note that the target's corresponding time index is one step ahead of the output distribution. And output distribution whose input contains padding is neglected in loss computation.
Returns:
Variable -- loss, note that it is computed with the pdf of the MoG distribution.
Variable: shape(1, ), dtype: float, the loss.
"""
n_mixture = self.output_dim // 3
# context size is not taken into account
...
...
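A numpy sketch of the MoG negative log likelihood described above (hypothetical helper, ignoring the context-size cropping):
```python
import numpy as np

def mog_nll_sketch(y, t):
    # Log-sum-exp over per-Gaussian log densities of the target t,
    # weighted by the softmax of the mixture logits.
    logits, mean, log_std = np.split(y, 3, axis=-1)  # each (B, T, n_mixture)
    log_w = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    t = t[..., None]
    log_prob = (-0.5 * np.log(2.0 * np.pi) - log_std
                - 0.5 * ((t - mean) / np.exp(log_std)) ** 2)
    joint = log_w + log_prob
    m = joint.max(axis=-1, keepdims=True)
    log_pdf = m[..., 0] + np.log(np.exp(joint - m).sum(axis=-1))
    return -log_pdf.mean()
```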
@@ -373,15 +406,13 @@ class WaveNet(dg.Layer):
return loss
def sample_from_mog(self, y):
"""sample from output distribution.
Arguments:
y {Variable} -- shape(batch_size, time_steps - 1, output_dim), output distribution.
"""Sample from the output distribution where the output distribution is a mixture of Gaussians.
Args:
y (Variable): shape(B, T, C_output), dtype: float, the parameterd of the output distribution. It is the concatenation of 3 parts, the logits of every distribution, the mean of each distribution and the log standard deviation of each distribution. Each part's shape is (B, T, n_mixture), where `n_mixture` means the number of Gaussians in the mixture.
Variable: shape(B, T), waveform sampled from the output distribution.
"""
ifself.loss_type=="softmax":
returnself.sample_from_softmax(y)
else:
returnself.sample_from_mog(y)
defloss(self,y,t):
"""compute loss.
Arguments:
y {Variable} -- shape(batch_size, time_steps - 1, output_dim), output distribution of multinomial distribution.
t {Variable} -- shape(batch_size, time_steps - 1), target waveform.
"""compute the loss where output distribution is a mixture of Gaussians.
Args:
y (Variable): shape(B, T, C_output), dtype: float, the parameterd of the output distribution.
t (Variable): shape(B, T), dtype: float, the target audio. Note that the target's corresponding time index is one step ahead of the output distribution. And output distribution whose input contains padding is neglected in loss computation.