dv3 reloaded, back to the origin

282c36c2 · chenfeiyu · 24eb14a7
24 changed files
--- a/examples/deepvoice3/README.md
+++ b/examples/deepvoice3/README.md
@@ -22,151 +22,118 @@ The model consists of an encoder, a decoder and a converter (and a speaker embed
 ## Project Structure

 ```text
-├── data.py          data_processing
-├── model.py         function to create model, criterion and optimizer
-├── configs/         (example) configuration files
-├── sentences.txt    sample sentences
-├── synthesis.py     script to synthesize waveform from text
-├── train.py         script to train a model
-└── utils.py         utility functions
+├── configs/
+├── synthesize.py
+├── data.py
+├── preprocess.py
+├── clip.py
+├── train.py
+└── vocoder.py
 ```

-## Saving & Loading
-`train.py` and `synthesis.py` have 3 arguments in common, `--checkpooint`, `iteration` and `output`.
+## Preprocess

-1. `output` is the directory for saving results.
-During training, checkpoints are saved in `checkpoints/` in `output` and tensorboard log is save in `log/` in `output`. States for training including alignment plots, spectrogram plots and generated audio files are saved in `states/` in `outuput`. In addition, we periodically evaluate the model with several given sentences, the alignment plots and generated audio files are save in `eval/` in `output`.
-During synthesizing, audio files and the alignment plots are save in `synthesis/` in `output`.
-So after training and synthesizing with the same output directory, the file structure of the output directory looks like this.
+Preprocess the dataset with `preprocess.py`.

 ```text
-├── checkpoints/      # checkpoint directory (including *.pdparams, *.pdopt and a text file `checkpoint` that records the latest checkpoint)
-├── states/           # alignment plots, spectrogram plots and generated wavs at training
-├── log/              # tensorboard log
-├── eval/             # audio files an alignment plots generated at evaluation during training
-└── synthesis/        # synthesized audio files and alignment plots
+usage: preprocess.py [-h] --config CONFIG --input INPUT --output OUTPUT
+
+preprocess ljspeech dataset and save it.
+
+optional arguments:
+  -h, --help       show this help message and exit
+  --config CONFIG  config file
+  --input INPUT    data path of the original data
+  --output OUTPUT  path to save the preprocessed dataset
 ```

-2. `--checkpoint` and `--iteration` for loading from existing checkpoint. Loading existing checkpoiont follows the following rule:
-If `--checkpoint` is provided, the path of the checkpoint specified by `--checkpoint` is loaded.
-If `--checkpoint` is not provided, we try to load the model specified by `--iteration` from the checkpoint directory. If `--iteration` is not provided, we try to load the latested checkpoint from checkpoint directory.
+example code:
+
+```bash
+python preprocess.py --config=configs/ljspeech.yaml --input=LJSpeech-1.1/ --output=data/ljspeech
+```

 ## Train

 Train the model using train.py, follow the usage displayed by `python train.py --help`.

 ```text
-usage: train.py [-h] [--config CONFIG] [--data DATA] [--device DEVICE]
-                [--checkpoint CHECKPOINT | --iteration ITERATION]
-                output
+usage: train.py [-h] --config CONFIG --input INPUT

-Train a Deep Voice 3 model with LJSpeech dataset.
-
-positional arguments:
-  output                        path to save results
+train a Deep Voice 3 model with LJSpeech

 optional arguments:
-  -h, --help                    show this help message and exit
-  --config CONFIG               experimrnt config
-  --data DATA                   The path of the LJSpeech dataset.
-  --device DEVICE               device to use
-  --checkpoint CHECKPOINT       checkpoint to resume from.
-  --iteration ITERATION         the iteration of the checkpoint to load from output directory
+  -h, --help       show this help message and exit
+  --config CONFIG  config file
+  --input INPUT    data path of the original data
+```
+
+example code:
+
+```bash
+CUDA_VISIBLE_DEVICES=0 python train.py --config=configs/ljspeech.yaml --input=data/ljspeech
 ```

- `--config` is the configuration file to use. The provided `ljspeech.yaml` can be used directly. And you can change some values in the configuration file and train the model with a different config.
- `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.txt).
- `--device` is the device (gpu id) to use for training. `-1` means CPU.
- `--checkpoint` is the path of the checkpoint.
- `--iteration` is the iteration of the checkpoint to load from output directory.
-See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
- `output` is the directory to save results, all results are saved in this directory. The structure of the output directory is shown below.
+Training creates a `runs` folder. The outputs of each run are saved in a separate folder in `runs`, whose name is the start time joined with the hostname. Inside this folder, the tensorboard log, parameters and optimizer states are saved. Parameters (`*.pdparams`) and optimizer states (`*.pdopt`) are named by the step at which they are saved.

 ```text
-├── checkpoints      # checkpoint
-├── log              # tensorboard log
-└── states           # train and evaluation results
-    ├── alignments   # attention
-    ├── lin_spec     # linear spectrogram
-    ├── mel_spec     # mel spectrogram
-    └── waveform     # waveform (.wav files)
+runs/Jul07_09-39-34_instance-mqcyj27y-4/
+├── checkpoint
+├── events.out.tfevents.1594085974.instance-mqcyj27y-4
+├── step-1000000.pdopt
+├── step-1000000.pdparams
+├── step-100000.pdopt
+├── step-100000.pdparams
+...
 ```

-Example script:
+Since we use waveflow to synthesize audio while training, download the trained waveflow model and extract it in the current directory before training.

 ```bash
-python train.py \
-    --config=configs/ljspeech.yaml \
-    --data=./LJSpeech-1.1/ \
-    --device=0 \
-    experiment
+wget https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_ckpt_1.0.zip
+unzip waveflow_res128_ljspeech_ckpt_1.0.zip
 ```

-To train the model in a paralle in multiple gpus, you can launch the training script with `paddle.distributed.launch`. For example, to train with gpu `0,1,2,3`, you can use the example script below. Note that for parallel training, devices are specified with `--selected_gpus` passed to `paddle.distributed.launch`. In this case, `--device` passed to `train.py`, if specified, is ignored.

-Example script:

-```bash
-python -m paddle.distributed.launch --selected_gpus=0,1,2,3 \
-    train.py \
-    --config=configs/ljspeech.yaml \
-    --data=./LJSpeech-1.1/ \
-    experiment
-```
+## Visualization
+
+You can visualize training losses, check the attention and listen to the synthesized audio when training with teacher forcing.

-You can monitor training log via tensorboard, using the script below.
+example code:

 ```bash
-cd experiment/log
-tensorboard --logdir=.
+tensorboard --logdir=runs/ --host=$HOSTNAME --port=8000
 ```

 ## Synthesis
-```text
-usage: synthesis.py [-h] [--config CONFIG] [--device DEVICE]
-                    [--checkpoint CHECKPOINT | --iteration ITERATION]
-                    text output

-Synthsize waveform with a checkpoint.
-
-positional arguments:
-  text                          text file to synthesize
-  output                        path to save synthesized audio
+```text
+usage: synthesize from a checkpoint [-h] --config CONFIG --input INPUT
+                                    --output OUTPUT --checkpoint CHECKPOINT
+                                    --monotonic_layers MONOTONIC_LAYERS

 optional arguments:
-  -h, --help                    show this help message and exit
-  --config CONFIG               experiment config
-  --device DEVICE               device to use
-  --checkpoint CHECKPOINT       checkpoint to resume from
-  --iteration ITERATION         the iteration of the checkpoint to load from output directory
+  -h, --help            show this help message and exit
+  --config CONFIG       config file
+  --input INPUT         text file to synthesize
+  --output OUTPUT       path to save audio
+  --checkpoint CHECKPOINT
+                        data path of the checkpoint
+  --monotonic_layers MONOTONIC_LAYERS
+                        monotonic decoder layer, index starts from 1
 ```

- `--config` is the configuration file to use. You should use the same configuration with which you train you model.
- `--device` is the device (gpu id) to use for training. `-1` means CPU.
-
- `--checkpoint` is the path of the checkpoint.
- `--iteration` is the iteration of the checkpoint to load from output directory.
-See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
-
- `text`is the text file to synthesize.
- `output` is the directory to save results. The generated audio files (`*.wav`) and attention plots (*.png) for are save in `synthesis/` in ouput directory.
-
-Example script:
-
-```bash
-python synthesis.py \
-    --config=configs/ljspeech.yaml \
-    --device=0 \
-    --checkpoint="experiment/checkpoints/model_step_005000000" \
-    sentences.txt experiment
-```
+`synthesize.py` is used to synthesize several sentences in a text file.
+`--monotonic_layers` gives the indices of the decoder layers that manifest monotonic diagonal attention. You can find the monotonic layers by inspecting the tensorboard logs. Mind that the index starts from 1. The layers that manifest monotonic diagonal attention are stable for a model during training and synthesizing, but differ among different runs. So once you get the indices of the monotonic layers by inspecting the tensorboard log, you can use them at synthesis. Note that only decoder layers that show strong diagonal attention should be considered.

-or
+example code:

 ```bash
-python synthesis.py \
-    --config=configs/ljspeech.yaml \
-    --device=0 \
-    --iteration=005000000 \
-    sentences.txt experiment
+CUDA_VISIBLE_DEVICES=2 python synthesize.py \
+    --config configs/ljspeech.yaml \
+    --input sentences.txt \
+    --output outputs/ \
+    --checkpoint runs/Jul07_09-39-34_instance-mqcyj27y-4/step-1320000 \
+    --monotonic_layers "5,6"
 ```
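
Review note on `--monotonic_layers`: the README says the value is a comma-separated list whose indices start from 1, but this part of the diff does not show how `synthesize.py` parses it. The snippet below is only a sketch of one way to turn such a string into 0-based indices. The helper name `parse_monotonic_layers` is hypothetical and is not taken from the commit.

```python
def parse_monotonic_layers(spec):
    """Convert a 1-based, comma-separated layer list such as "5,6"
    into sorted 0-based indices.

    Hypothetical helper for illustration; not part of synthesize.py.
    """
    indices = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate trailing commas or empty input
        layer = int(token)
        if layer < 1:
            raise ValueError("layer indices start from 1, got {}".format(layer))
        indices.append(layer - 1)
    # De-duplicate and sort so the result is stable.
    return sorted(set(indices))


print(parse_monotonic_layers("5,6"))  # [4, 5]
```

Rejecting values below 1 makes an accidental 0-based argument fail loudly instead of silently selecting the wrong layer.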
--- a/examples/deepvoice3/clip.py
+++ b/examples/deepvoice3/clip.py
+from __future__ import print_function
+
+import copy
+import six
+import warnings
+
+import functools
+from paddle.fluid import layers
+from paddle.fluid import framework
+from paddle.fluid import core
+from paddle.fluid import name_scope
+from paddle.fluid.dygraph import base as imperative_base
+from paddle.fluid.clip import GradientClipBase, _correct_clip_op_role_var
+
+class DoubleClip(GradientClipBase):
+    """
+    :alias_main: paddle.nn.GradientClipByGlobalNorm
+    :alias: paddle.nn.GradientClipByGlobalNorm,paddle.nn.clip.GradientClipByGlobalNorm
+    :old_api: paddle.fluid.clip.GradientClipByGlobalNorm
+
+    Given a list of Tensor :math:`t\_list` , calculate the global norm for the elements of all tensors in 
+    :math:`t\_list` , and limit it to ``clip_norm`` .
+    
+    - If the global norm is greater than ``clip_norm`` , all elements of :math:`t\_list` will be compressed by a ratio.
+    
+    - If the global norm is less than or equal to ``clip_norm`` , nothing will be done.
+    
+    The list of Tensor :math:`t\_list` is not passed from this class, but the gradients of all parameters in ``Program`` . If ``need_clip``
+    is not None, then only part of gradients can be selected for gradient clipping.
+    
+    Gradient clip takes effect after being set in ``optimizer`` , see the document ``optimizer``
+    (for example: :ref:`api_fluid_optimizer_SGDOptimizer`).
+
+    The clipping formula is:
+
+    .. math::
+
+        t\_list[i] = t\_list[i] * \\frac{clip\_norm}{\max(global\_norm, clip\_norm)}
+
+    where:
+
+    .. math::
+
+        global\_norm = \sqrt{\sum_{i=0}^{N-1}(l2norm(t\_list[i]))^2}
+
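+    Args:
+        clip_value (float): The maximum absolute value of each gradient element. Gradients are clipped
+            to ``[-clip_value, clip_value]`` before the global norm is computed.
+        clip_norm (float): The maximum norm value.
+        group_name (str, optional): The group name for this clip. Default value is ``default_group``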
+        need_clip (function, optional): Type: function. This function accepts a ``Parameter`` and returns ``bool`` 
+            (True: the gradient of this ``Parameter`` need to be clipped, False: not need). Default: None, 
+            and gradients of all parameters in the network will be clipped.
+
+    Examples:
+        .. code-block:: python
+        
+            # use for Static mode
+            import paddle
+            import paddle.fluid as fluid
+            import numpy as np
+                        
+            main_prog = fluid.Program()
+            startup_prog = fluid.Program()
+            with fluid.program_guard(
+                    main_program=main_prog, startup_program=startup_prog):
+                image = fluid.data(
+                    name='x', shape=[-1, 2], dtype='float32')
+                predict = fluid.layers.fc(input=image, size=3, act='relu') # Trainable parameters: fc_0.w.0, fc_0.b.0
+                loss = fluid.layers.mean(predict)
+                
+                # Clip all parameters in network:
+                clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0)
+                
+                # Clip a part of parameters in network: (e.g. fc_0.w_0)
+                # pass a function(fileter_func) to need_clip, and fileter_func receive a ParamBase, and return bool
+                # def fileter_func(Parameter):
+                # # It can be easily filtered by Parameter.name (name can be set in fluid.ParamAttr, and the default name is fc_0.w_0, fc_0.b_0)
+                #   return Parameter.name=="fc_0.w_0"
+                # clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0, need_clip=fileter_func)
+
+                sgd_optimizer = fluid.optimizer.SGDOptimizer(learning_rate=0.1, grad_clip=clip)
+                sgd_optimizer.minimize(loss)
+
+            place = fluid.CPUPlace()
+            exe = fluid.Executor(place)
+            x = np.random.uniform(-100, 100, (10, 2)).astype('float32')
+            exe.run(startup_prog)
+            out = exe.run(main_prog, feed={'x': x}, fetch_list=loss)
+
+
+            # use for Dygraph mode
+            import paddle
+            import paddle.fluid as fluid
+
+            with fluid.dygraph.guard():
+                linear = fluid.dygraph.Linear(10, 10)  # Trainable: linear_0.w.0, linear_0.b.0
+                inputs = fluid.layers.uniform_random([32, 10]).astype('float32')
+                out = linear(fluid.dygraph.to_variable(inputs))
+                loss = fluid.layers.reduce_mean(out)
+                loss.backward()
+
+                # Clip all parameters in network:
+                clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0)
+
+                # Clip a part of parameters in network: (e.g. linear_0.w_0)
+                # pass a function(fileter_func) to need_clip, and fileter_func receive a ParamBase, and return bool
+                # def fileter_func(ParamBase):
+                # # It can be easily filtered by ParamBase.name(name can be set in fluid.ParamAttr, and the default name is linear_0.w_0, linear_0.b_0)
+                #   return ParamBase.name == "linear_0.w_0"
+                # # Note: linear.weight and linear.bias can return the weight and bias of dygraph.Linear, respectively, and can be used to filter
+                #   return ParamBase.name == linear.weight.name
+                # clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0, need_clip=fileter_func)
+
+                sgd_optimizer = fluid.optimizer.SGD(
+                    learning_rate=0.1, parameter_list=linear.parameters(), grad_clip=clip)
+                sgd_optimizer.minimize(loss)
+
+    """
+
+    def __init__(self, clip_value, clip_norm, group_name="default_group", need_clip=None):
+        super(DoubleClip, self).__init__(need_clip)
+        self.clip_value = float(clip_value)
+        self.clip_norm = float(clip_norm)
+        self.group_name = group_name
+
+    def __str__(self):
+        return "Gradient Clip By Value and GlobalNorm, value={}, global_norm={}".format(
+            self.clip_value, self.clip_norm)
+
+    @imperative_base.no_grad
+    def _dygraph_clip(self, params_grads):
+        params_and_grads = []
+        # clip by value first
+        for p, g in params_grads:
+            if g is None:
+                continue
+            if self._need_clip_func is not None and not self._need_clip_func(p):
+                params_and_grads.append((p, g))
+                continue
+            new_grad = layers.clip(x=g, min=-self.clip_value, max=self.clip_value)
+            params_and_grads.append((p, new_grad))
+        params_grads = params_and_grads
+        
+        # clip by global norm
+        params_and_grads = []
+        sum_square_list = []
+        for p, g in params_grads:
+            if g is None:
+                continue
+            if self._need_clip_func is not None and not self._need_clip_func(p):
+                continue
+            merge_grad = g
+            if g.type == core.VarDesc.VarType.SELECTED_ROWS:
+                merge_grad = layers.merge_selected_rows(g)
+                merge_grad = layers.get_tensor_from_selected_rows(merge_grad)
+            square = layers.square(merge_grad)
+            sum_square = layers.reduce_sum(square)
+            sum_square_list.append(sum_square)
+
+        # all parameters have been filtered out
+        if len(sum_square_list) == 0:
+            return params_grads
+
+        global_norm_var = layers.concat(sum_square_list)
+        global_norm_var = layers.reduce_sum(global_norm_var)
+        global_norm_var = layers.sqrt(global_norm_var)
+        max_global_norm = layers.fill_constant(
+            shape=[1], dtype='float32', value=self.clip_norm)
+        clip_var = layers.elementwise_div(
+            x=max_global_norm,
+            y=layers.elementwise_max(
+                x=global_norm_var, y=max_global_norm))
+        for p, g in params_grads:
+            if g is None:
+                continue
+            if self._need_clip_func is not None and not self._need_clip_func(p):
+                params_and_grads.append((p, g))
+                continue
+            new_grad = layers.elementwise_mul(x=g, y=clip_var)
+            params_and_grads.append((p, new_grad))
+
+        return params_and_grads
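
Review note on `DoubleClip`: the docstring is inherited from `GradientClipByGlobalNorm`, so the two-stage behaviour of `_dygraph_clip` is easy to miss. The sketch below restates the arithmetic in NumPy, assuming plain dense arrays and no `need_clip` filter. The function name `double_clip` is made up for illustration, and this is not a replacement for the Paddle implementation.

```python
import numpy as np


def double_clip(grads, clip_value, clip_norm):
    """Clip each gradient element-wise, then rescale all of them by the global norm."""
    # Step 1: clip by value, element-wise, to [-clip_value, clip_value].
    clipped = [np.clip(g, -clip_value, clip_value) for g in grads]
    # Step 2: global L2 norm over all value-clipped gradients.
    global_norm = np.sqrt(sum(np.sum(np.square(g)) for g in clipped))
    # The scale is 1 when global_norm <= clip_norm, else clip_norm / global_norm.
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in clipped]


grads = [np.array([3.0, -10.0]), np.array([4.0])]
out = double_clip(grads, clip_value=5.0, clip_norm=5.0)
total_norm = np.sqrt(sum(np.sum(np.square(g)) for g in out))
print(total_norm)  # close to 5.0, the requested clip_norm
```

With the values in `configs/ljspeech.yaml` (`clip_value: 5.0`, `clip_norm: 100.0`), the second stage only takes effect when many elements are near the value limit at once.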
--- a/examples/deepvoice3/configs/ljspeech.yaml
+++ b/examples/deepvoice3/configs/ljspeech.yaml
-meta_data:
-  min_text_length: 20
-
-transform:
-  # text
-  replace_pronunciation_prob: 0.5
-
-  # spectrogram
-  sample_rate: 22050
-  max_norm: 0.999
-  preemphasis: 0.97
-  n_fft: 1024
-  win_length: 1024
-  hop_length: 256
-
-  # mel
-  fmin: 125
-  fmax: 7600
-  n_mels: 80
-
-  # db scale
-  min_level_db: -100
-  ref_level_db: 20
-  clip_norm: true
-
-
-loss:
-  masked_loss_weight: 0.5
-  priority_freq: 3000
-  priority_freq_weight: 0.0
-  binary_divergence_weight: 0.1
-  guided_attention_sigma: 0.2
-
-synthesis:
-  max_steps: 512
-  power: 1.4
-  n_iter: 32
-
-model:
-  # speaker_embedding
-  n_speakers: 1
-  speaker_embed_dim: 16
-  speaker_embedding_weight_std: 0.01
-  
-  max_positions: 512
-  dropout: 0.050000000000000044
-  # encoder
-  text_embed_dim: 256
-  embedding_weight_std: 0.1
-  freeze_embedding: false
-  padding_idx: 0
-  encoder_channels: 512
-
-  # decoder
-  query_position_rate: 1.0
-  key_position_rate: 1.29
-  trainable_positional_encodings: false
-  kernel_size: 3
-  decoder_channels: 256
-  downsample_factor: 4
-  outputs_per_step: 1
-  
-  # attention
-  key_projection: true
-  value_projection: true
-  force_monotonic_attention: true
-  window_backward: -1
-  window_ahead: 3
-  use_memory_mask: true
-
-  # converter
-  use_decoder_state_for_postnet_input: true
-  converter_channels: 256
-
-optimizer:
-  beta1: 0.5
-  beta2: 0.9
-  epsilon: 1e-6
-
-lr_scheduler:
-  warmup_steps: 4000
-  peak_learning_rate: 5e-4
-  
-train:
-  batch_size: 16
-  max_iteration: 2000000
-  
-  snap_interval: 1000
-  eval_interval: 10000
-  save_interval: 10000
+# data processing
+p_pronunciation: 0.99
+sample_rate: 22050 # Hz
+n_fft: 1024
+win_length: 1024
+hop_length: 256
+n_mels: 80
+reduction_factor: 4
+
+# model-s2s
+n_speakers: 1
+speaker_dim: 16
+char_dim: 256
+encoder_dim: 64
+kernel_size: 5
+encoder_layers: 7
+decoder_layers: 8
+prenet_sizes: [128]
+attention_dim: 128
+
+# model-postnet
+postnet_layers: 5
+postnet_dim: 256
+
+# position embedding
+position_weight: 1.0
+position_rate: 5.54
+forward_step: 4
+backward_step: 0
+
+dropout: 0.05
+
+# output-griffinlim
+sharpening_factor: 1.4
+
+# optimizer:
+learning_rate: 0.001
+clip_value: 5.0
+clip_norm: 100.0
+
+# training:
+batch_size: 16
+report_interval: 10000
+save_interval: 10000
+valid_size: 5
\ No newline at end of file
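
Review note on the flattened config: a few quantities follow directly from the data-processing values and are handy when reading tensorboard plots. The sketch below copies those values into a plain dict instead of loading the YAML file. It assumes that `preprocess.py` uses a one-sided STFT and that `reduction_factor` is the number of frames per decoder step; neither is visible in this part of the diff.

```python
# Values copied from the new configs/ljspeech.yaml.
config = {
    "sample_rate": 22050,  # Hz
    "n_fft": 1024,
    "hop_length": 256,
    "n_mels": 80,
    "reduction_factor": 4,
}

# One frame per hop; a one-sided STFT has n_fft // 2 + 1 frequency bins.
frame_shift_ms = 1000.0 * config["hop_length"] / config["sample_rate"]
linear_bins = config["n_fft"] // 2 + 1

# Assumption: reduction_factor frames are produced per decoder step.
decoder_step_ms = frame_shift_ms * config["reduction_factor"]

print("frame shift: {:.2f} ms".format(frame_shift_ms))
print("linear spectrogram bins: {}".format(linear_bins))
print("decoder step: {:.2f} ms".format(decoder_step_ms))
```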
--- a/examples/deepvoice3/data.py
+++ b/examples/deepvoice3/data.py
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import division
+import numpy as np
 import os
 import csv
-from pathlib import Path
-import numpy as np
-from paddle import fluid
 import pandas as pd
-import librosa
-from scipy import signal
-
-import paddle.fluid.dygraph as dg

-from parakeet.g2p.en import text_to_sequence, sequence_to_text
-from parakeet.data import DatasetMixin, TransformDataset, FilterDataset, CacheDataset
-from parakeet.data import DataCargo, PartialyRandomizedSimilarTimeLengthSampler, SequentialSampler, BucketSampler
+import paddle
+from paddle import fluid
+from paddle.fluid import dygraph as dg
+from paddle.fluid.dataloader import Dataset, BatchSampler
+from paddle.fluid.io import DataLoader

+from parakeet.data import DatasetMixin, DataCargo, PartialyRandomizedSimilarTimeLengthSampler
+from parakeet.g2p import en

-class LJSpeechMetaData(DatasetMixin):
+class LJSpeech(DatasetMixin):
    def __init__(self, root):
-        self.root = Path(root)
-        self._wav_dir = self.root.joinpath("wavs")
-        csv_path = self.root.joinpath("metadata.csv")
+        self._root = root
        self._table = pd.read_csv(
-            csv_path,
-            sep="|",
-            encoding="utf-8",
-            header=None,
-            quoting=csv.QUOTE_NONE,
-            names=["fname", "raw_text", "normalized_text"])
+            os.path.join(root, "metadata.csv"), 
+            sep="|", 
+            encoding="utf-8", 
+            quoting=csv.QUOTE_NONE, 
+            header=None, 
+            names=["num_frames", "spec_name", "mel_name", "text"],
+            dtype={"num_frames": np.int64, "spec_name": str, "mel_name":str, "text":str})
+    
+    def num_frames(self):
+        return self._table["num_frames"].to_list()

    def get_example(self, i):
-        fname, raw_text, normalized_text = self._table.iloc[i]
-        fname = str(self._wav_dir.joinpath(fname + ".wav"))
-        return fname, raw_text, normalized_text
-
+        """
+        spec (T_frame, C_spec)
+        mel (T_frame, C_mel)
+        """
+        num_frames, spec_name, mel_name, text = self._table.iloc[i]
+        spec = np.load(os.path.join(self._root, spec_name))
+        mel = np.load(os.path.join(self._root, mel_name))
+        return (text, spec, mel, num_frames)
+    
    def __len__(self):
        return len(self._table)

-
-class Transform(object):
-    def __init__(self,
-                 replace_pronunciation_prob=0.,
-                 sample_rate=22050,
-                 preemphasis=.97,
-                 n_fft=1024,
-                 win_length=1024,
-                 hop_length=256,
-                 fmin=125,
-                 fmax=7600,
-                 n_mels=80,
-                 min_level_db=-100,
-                 ref_level_db=20,
-                 max_norm=0.999,
-                 clip_norm=True):
-        self.replace_pronunciation_prob = replace_pronunciation_prob
-
-        self.sample_rate = sample_rate
-        self.preemphasis = preemphasis
-        self.n_fft = n_fft
-        self.win_length = win_length
-        self.hop_length = hop_length
-
-        self.fmin = fmin
-        self.fmax = fmax
-        self.n_mels = n_mels
-
-        self.min_level_db = min_level_db
-        self.ref_level_db = ref_level_db
-        self.max_norm = max_norm
-        self.clip_norm = clip_norm
-
-    def __call__(self, in_data):
-        fname, _, normalized_text = in_data
-
-        # text processing
-        mix_grapheme_phonemes = text_to_sequence(
-            normalized_text, self.replace_pronunciation_prob)
-        text_length = len(mix_grapheme_phonemes)
-        # CAUTION: positions start from 1
-        speaker_id = None
-
-        # wave processing
-        wav, _ = librosa.load(fname, sr=self.sample_rate)
-        # preemphasis
-        y = signal.lfilter([1., -self.preemphasis], [1.], wav)
-
-        # STFT
-        D = librosa.stft(
-            y=y,
-            n_fft=self.n_fft,
-            win_length=self.win_length,
-            hop_length=self.hop_length)
-        S = np.abs(D)
-
-        # to db and normalize to 0-1
-        amplitude_min = np.exp(self.min_level_db / 20 * np.log(10))  # 1e-5
-        S_norm = 20 * np.log10(np.maximum(amplitude_min,
-                                          S)) - self.ref_level_db
-        S_norm = (S_norm - self.min_level_db) / (-self.min_level_db)
-        S_norm = self.max_norm * S_norm
-        if self.clip_norm:
-            S_norm = np.clip(S_norm, 0, self.max_norm)
-
-        # mel scale and to db and normalize to 0-1,
-        # CAUTION: pass linear scale S, not dbscaled S
-        S_mel = librosa.feature.melspectrogram(
-            S=S, n_mels=self.n_mels, fmin=self.fmin, fmax=self.fmax, power=1.)
-        S_mel = 20 * np.log10(np.maximum(amplitude_min,
-                                         S_mel)) - self.ref_level_db
-        S_mel_norm = (S_mel - self.min_level_db) / (-self.min_level_db)
-        S_mel_norm = self.max_norm * S_mel_norm
-        if self.clip_norm:
-            S_mel_norm = np.clip(S_mel_norm, 0, self.max_norm)
-
-        # num_frames
-        n_frames = S_mel_norm.shape[-1]  # CAUTION: original number of frames
-        return (mix_grapheme_phonemes, text_length, speaker_id, S_norm.T,
-                S_mel_norm.T, n_frames)
-
-
 class DataCollector(object):
-    def __init__(self, downsample_factor=4, r=1):
-        self.downsample_factor = int(downsample_factor)
-        self.frames_per_step = int(r)
-        self._factor = int(downsample_factor * r)
-        # CAUTION: small diff here
-        self._pad_begin = int(downsample_factor * r)
-
+    def __init__(self, p_pronunciation):
+        self.p_pronunciation = p_pronunciation
+        
    def __call__(self, examples):
-        batch_size = len(examples)
-
-        # lengths
-        text_lengths = np.array([example[1]
-                                 for example in examples]).astype(np.int64)
-        frames = np.array([example[5]
-                           for example in examples]).astype(np.int64)
+        """
+        output shape and dtype
+        (B, T_text) int64
+        (B,) int64
+        (B, T_frame, C_spec) float32
+        (B, T_frame, C_mel) float32
+        (B,) int64
+        """
+        text_seqs = []
+        specs = []
+        mels = []
+        num_frames = np.array([example[3] for example in examples], dtype=np.int64)
+        max_frames = np.max(num_frames)

-        max_text_length = int(np.max(text_lengths))
-        max_frames = int(np.max(frames))
-        if max_frames % self._factor != 0:
-            max_frames += (self._factor - max_frames % self._factor)
-        max_frames += self._pad_begin
-        max_decoder_length = max_frames // self._factor
-
-        # pad time sequence
-        text_sequences = []
-        lin_specs = []
-        mel_specs = []
-        done_flags = []
        for example in examples:
-            (mix_grapheme_phonemes, text_length, speaker_id, S_norm,
-             S_mel_norm, num_frames) = example
-            text_sequences.append(
-                np.pad(mix_grapheme_phonemes, (0, max_text_length - text_length
-                                               ),
-                       mode="constant"))
-            lin_specs.append(
-                np.pad(S_norm, ((self._pad_begin, max_frames - self._pad_begin
-                                 - num_frames), (0, 0)),
-                       mode="constant"))
-            mel_specs.append(
-                np.pad(S_mel_norm, ((self._pad_begin, max_frames -
-                                     self._pad_begin - num_frames), (0, 0)),
-                       mode="constant"))
-            done_flags.append(
-                np.pad(np.zeros((int(np.ceil(num_frames // self._factor)), )),
-                       (0, max_decoder_length - int(
-                           np.ceil(num_frames // self._factor))),
-                       mode="constant",
-                       constant_values=1))
-        text_sequences = np.array(text_sequences).astype(np.int64)
-        lin_specs = np.array(lin_specs).astype(np.float32)
-        mel_specs = np.array(mel_specs).astype(np.float32)
-
-        # downsample here
-        done_flags = np.array(done_flags).astype(np.float32)
-
-        # text positions
-        text_mask = (np.arange(1, 1 + max_text_length) <= np.expand_dims(
-            text_lengths, -1)).astype(np.int64)
-        text_positions = np.arange(
-            1, 1 + max_text_length, dtype=np.int64) * text_mask
-
-        # decoder_positions
-        decoder_positions = np.tile(
-            np.expand_dims(
-                np.arange(
-                    1, 1 + max_decoder_length, dtype=np.int64), 0),
-            (batch_size, 1))
-
-        return (text_sequences, text_lengths, text_positions, mel_specs,
-                lin_specs, frames, decoder_positions, done_flags)
-
-
-def make_data_loader(data_root, config):
-    # construct meta data
-    meta = LJSpeechMetaData(data_root)
-
-    # filter it!
-    min_text_length = config["meta_data"]["min_text_length"]
-    meta = FilterDataset(meta, lambda x: len(x[2]) >= min_text_length)
-
-    # transform meta data into meta data
-    c = config["transform"]
-    transform = Transform(
-        replace_pronunciation_prob=c["replace_pronunciation_prob"],
-        sample_rate=c["sample_rate"],
-        preemphasis=c["preemphasis"],
-        n_fft=c["n_fft"],
-        win_length=c["win_length"],
-        hop_length=c["hop_length"],
-        fmin=c["fmin"],
-        fmax=c["fmax"],
-        n_mels=c["n_mels"],
-        min_level_db=c["min_level_db"],
-        ref_level_db=c["ref_level_db"],
-        max_norm=c["max_norm"],
-        clip_norm=c["clip_norm"])
-    ljspeech = TransformDataset(meta, transform)
-
-    # use meta data's text length as a sort key for the sampler
-    batch_size = config["train"]["batch_size"]
-    text_lengths = [len(example[2]) for example in meta]
-    sampler = PartialyRandomizedSimilarTimeLengthSampler(text_lengths,
-                                                         batch_size)
-
-    env = dg.parallel.ParallelEnv()
-    num_trainers = env.nranks
-    local_rank = env.local_rank
-    sampler = BucketSampler(
-        text_lengths, batch_size, num_trainers=num_trainers, rank=local_rank)
-
-    # some model hyperparameters affect how we process data
-    model_config = config["model"]
-    collector = DataCollector(
-        downsample_factor=model_config["downsample_factor"],
-        r=model_config["outputs_per_step"])
-    ljspeech_loader = DataCargo(
-        ljspeech, batch_fn=collector, batch_size=batch_size, sampler=sampler)
-    loader = fluid.io.DataLoader.from_generator(capacity=10, return_list=True)
-    loader.set_batch_generator(
-        ljspeech_loader, places=fluid.framework._current_expected_place())
-    return loader
+            text, spec, mel, _ = example
+            text_seqs.append(en.text_to_sequence(text, self.p_pronunciation))
+            # if max_frames - mel.shape[0] < 0:
+            #     import pdb; pdb.set_trace()
+            specs.append(np.pad(spec, [(0, max_frames - spec.shape[0]), (0, 0)]))
+            mels.append(np.pad(mel, [(0, max_frames - mel.shape[0]), (0, 0)]))
+
+        specs = np.stack(specs)
+        mels = np.stack(mels)
+
+        text_lengths = np.array([len(seq) for seq in text_seqs], dtype=np.int64)
+        max_length = np.max(text_lengths)
+        text_seqs = np.array([seq + [0] * (max_length - len(seq)) for seq in text_seqs], dtype=np.int64)
+        return text_seqs, text_lengths, specs, mels, num_frames
+
+if __name__ == "__main__":
+    import argparse
+    import tqdm
+    import time
+    from ruamel import yaml
+
+    parser = argparse.ArgumentParser(description="load the preprocessed ljspeech dataset")
+    parser.add_argument("--config", type=str, required=True, help="config file")
+    parser.add_argument("--input", type=str, required=True, help="data path of the original data")
+    args = parser.parse_args()
+    with open(args.config, 'rt') as f:
+        config = yaml.safe_load(f)
+    
+    print("========= Command Line Arguments ========")
+    for k, v in vars(args).items():
+        print("{}: {}".format(k, v))
+    print("=========== Configurations ==============")
+    for k in ["p_pronunciation", "batch_size"]:
+        print("{}: {}".format(k, config[k]))
+
+    ljspeech = LJSpeech(args.input)
+    collate_fn = DataCollector(config["p_pronunciation"])
+
+    dg.enable_dygraph(fluid.CPUPlace())
+    sampler = PartialyRandomizedSimilarTimeLengthSampler(ljspeech.num_frames())
+    cargo = DataCargo(ljspeech, collate_fn, 
+                      batch_size=config["batch_size"], sampler=sampler)
+    loader = DataLoader\
+           .from_generator(capacity=5, return_list=True)\
+           .set_batch_generator(cargo)
+
+    for i, batch in tqdm.tqdm(enumerate(loader)):
+        continue
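Note on the new `DataCollector`: the batching step above right-pads every spectrogram to the longest one in the batch and pads the token sequences with zeros. The sketch below reproduces that idea with NumPy only. It is an illustration under assumptions, not the code in `data.py`: the function name `collate`, the `pad_id` parameter, and deriving `num_frames` from the mel arrays are all hypothetical, since the part of the collector that computes `max_frames` is not visible in this hunk.

```python
import numpy as np


def collate(examples, pad_id=0):
    """Pad a list of (token_ids, spec, mel) examples into batch arrays.

    spec and mel are assumed to be shaped (frames, bins), matching the
    transposed arrays saved by the preprocessing script.
    """
    # Number of frames per example (assumed to come from the mel arrays).
    num_frames = np.array([mel.shape[0] for _, _, mel in examples], dtype=np.int64)
    max_frames = int(num_frames.max())

    # Right-pad along the time axis only; frequency bins are left untouched.
    specs = np.stack([
        np.pad(spec, [(0, max_frames - spec.shape[0]), (0, 0)])
        for _, spec, _ in examples
    ])
    mels = np.stack([
        np.pad(mel, [(0, max_frames - mel.shape[0]), (0, 0)])
        for _, _, mel in examples
    ])

    # Pad token sequences with pad_id up to the longest sequence.
    text_lengths = np.array([len(t) for t, _, _ in examples], dtype=np.int64)
    max_length = int(text_lengths.max())
    text_seqs = np.array(
        [list(t) + [pad_id] * (max_length - len(t)) for t, _, _ in examples],
        dtype=np.int64)
    return text_seqs, text_lengths, specs, mels, num_frames
```

Returning the unpadded `num_frames` alongside the padded arrays lets a loss function mask out the padded frames later.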
--- a/examples/deepvoice3/images/model_architecture.png
+++ b/examples/deepvoice3/images/model_architecture.png
--- a/examples/deepvoice3/model.py
+++ b/examples/deepvoice3/model.py
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from paddle import fluid
-import paddle.fluid.initializer as I
-import paddle.fluid.dygraph as dg
-
-from parakeet.g2p import en
-from parakeet.models.deepvoice3 import Encoder, Decoder, Converter, DeepVoice3, TTSLoss, ConvSpec, WindowRange
-from parakeet.utils.layer_tools import summary, freeze
-
-
-def make_model(config):
-    c = config["model"]
-    # speaker embedding
-    n_speakers = c["n_speakers"]
-    speaker_dim = c["speaker_embed_dim"]
-    if n_speakers > 1:
-        speaker_embed = dg.Embedding(
-            (n_speakers, speaker_dim),
-            param_attr=I.Normal(scale=c["speaker_embedding_weight_std"]))
-    else:
-        speaker_embed = None
-
-    # encoder
-    h = c["encoder_channels"]
-    k = c["kernel_size"]
-    encoder_convolutions = (
-        ConvSpec(h, k, 1),
-        ConvSpec(h, k, 3),
-        ConvSpec(h, k, 9),
-        ConvSpec(h, k, 27),
-        ConvSpec(h, k, 1),
-        ConvSpec(h, k, 3),
-        ConvSpec(h, k, 9),
-        ConvSpec(h, k, 27),
-        ConvSpec(h, k, 1),
-        ConvSpec(h, k, 3), )
-    encoder = Encoder(
-        n_vocab=en.n_vocab,
-        embed_dim=c["text_embed_dim"],
-        n_speakers=n_speakers,
-        speaker_dim=speaker_dim,
-        embedding_weight_std=c["embedding_weight_std"],
-        convolutions=encoder_convolutions,
-        dropout=c["dropout"])
-    if c["freeze_embedding"]:
-        freeze(encoder.embed)
-
-    # decoder
-    h = c["decoder_channels"]
-    k = c["kernel_size"]
-    prenet_convolutions = (ConvSpec(h, k, 1), ConvSpec(h, k, 3))
-    attentive_convolutions = (
-        ConvSpec(h, k, 1),
-        ConvSpec(h, k, 3),
-        ConvSpec(h, k, 9),
-        ConvSpec(h, k, 27),
-        ConvSpec(h, k, 1), )
-    attention = [True, False, False, False, True]
-    force_monotonic_attention = [True, False, False, False, True]
-    window = WindowRange(c["window_backward"], c["window_ahead"])
-    decoder = Decoder(
-        n_speakers,
-        speaker_dim,
-        embed_dim=c["text_embed_dim"],
-        mel_dim=config["transform"]["n_mels"],
-        r=c["outputs_per_step"],
-        max_positions=c["max_positions"],
-        preattention=prenet_convolutions,
-        convolutions=attentive_convolutions,
-        attention=attention,
-        dropout=c["dropout"],
-        use_memory_mask=c["use_memory_mask"],
-        force_monotonic_attention=force_monotonic_attention,
-        query_position_rate=c["query_position_rate"],
-        key_position_rate=c["key_position_rate"],
-        window_range=window,
-        key_projection=c["key_projection"],
-        value_projection=c["value_projection"])
-    if not c["trainable_positional_encodings"]:
-        freeze(decoder.embed_keys_positions)
-        freeze(decoder.embed_query_positions)
-
-    # converter(postnet)
-    linear_dim = 1 + config["transform"]["n_fft"] // 2
-    h = c["converter_channels"]
-    k = c["kernel_size"]
-    postnet_convolutions = (
-        ConvSpec(h, k, 1),
-        ConvSpec(h, k, 3),
-        ConvSpec(2 * h, k, 1),
-        ConvSpec(2 * h, k, 3), )
-    use_decoder_states = c["use_decoder_state_for_postnet_input"]
-    converter = Converter(
-        n_speakers,
-        speaker_dim,
-        in_channels=decoder.state_dim
-        if use_decoder_states else config["transform"]["n_mels"],
-        linear_dim=linear_dim,
-        time_upsampling=c["downsample_factor"],
-        convolutions=postnet_convolutions,
-        dropout=c["dropout"])
-
-    model = DeepVoice3(
-        encoder,
-        decoder,
-        converter,
-        speaker_embed,
-        use_decoder_states=use_decoder_states)
-    return model
-
-
-def make_criterion(config):
-    # =========================loss=========================
-    loss_config = config["loss"]
-    transform_config = config["transform"]
-    model_config = config["model"]
-
-    priority_freq = loss_config["priority_freq"]  # Hz
-    sample_rate = transform_config["sample_rate"]
-    linear_dim = 1 + transform_config["n_fft"] // 2
-    priority_bin = int(priority_freq / (0.5 * sample_rate) * linear_dim)
-
-    criterion = TTSLoss(
-        masked_weight=loss_config["masked_loss_weight"],
-        priority_bin=priority_bin,
-        priority_weight=loss_config["priority_freq_weight"],
-        binary_divergence_weight=loss_config["binary_divergence_weight"],
-        guided_attention_sigma=loss_config["guided_attention_sigma"],
-        downsample_factor=model_config["downsample_factor"],
-        r=model_config["outputs_per_step"])
-    return criterion
-
-
-def make_optimizer(model, config):
-    # =========================lr_scheduler=========================
-    lr_config = config["lr_scheduler"]
-    warmup_steps = lr_config["warmup_steps"]
-    peak_learning_rate = lr_config["peak_learning_rate"]
-    lr_scheduler = dg.NoamDecay(1 / (warmup_steps * (peak_learning_rate)**2),
-                                warmup_steps)
-
-    # =========================optimizer=========================
-    optim_config = config["optimizer"]
-    optim = fluid.optimizer.Adam(
-        lr_scheduler,
-        beta1=optim_config["beta1"],
-        beta2=optim_config["beta2"],
-        epsilon=optim_config["epsilon"],
-        parameter_list=model.parameters(),
-        grad_clip=fluid.clip.GradientClipByGlobalNorm(0.1))
-    return optim
--- a/examples/deepvoice3/preprocess.py
+++ b/examples/deepvoice3/preprocess.py
+from __future__ import division
+import os
+import argparse
+from ruamel import yaml
+import tqdm
+from os.path import join
+import csv
+import numpy as np
+import pandas as pd
+import librosa
+import logging
+
+from parakeet.data import DatasetMixin
+
+
+class LJSpeechMetaData(DatasetMixin):
+    def __init__(self, root):
+        self.root = root
+        self._wav_dir = join(root, "wavs")
+        csv_path = join(root, "metadata.csv")
+        self._table = pd.read_csv(
+            csv_path,
+            sep="|",
+            encoding="utf-8",
+            header=None,
+            quoting=csv.QUOTE_NONE,
+            names=["fname", "raw_text", "normalized_text"])
+
+    def get_example(self, i):
+        fname, raw_text, normalized_text = self._table.iloc[i]
+        abs_fname = join(self._wav_dir, fname + ".wav")
+        return fname, abs_fname, raw_text, normalized_text
+
+    def __len__(self):
+        return len(self._table)
+
+
+class Transform(object):
+    def __init__(self, sample_rate, n_fft, hop_length, win_length, n_mels, reduction_factor):
+        self.sample_rate = sample_rate
+        self.n_fft = n_fft
+        self.win_length = win_length
+        self.hop_length = hop_length
+        self.n_mels = n_mels
+        self.reduction_factor = reduction_factor
+
+    def __call__(self, fname):
+        # wave processing
+        audio, _ = librosa.load(fname, sr=self.sample_rate)
+
+        # Pad the data to the right size to have a whole number of timesteps,
+        # accounting properly for the model reduction factor.
+        frames = audio.size // (self.reduction_factor * self.hop_length) + 1
+        # librosa's stft extracts frames of n_fft size, so we should pad n_fft // 2 on both sides
+        desired_length = (frames * self.reduction_factor - 1) * self.hop_length + self.n_fft
+        total_pad = desired_length - audio.size
+        pad_left = total_pad // 2
+        pad_right = total_pad - pad_left
+
+        # we pad manually to control the number of generated frames;
+        # any odd remainder goes to the right side so the padded length is exact
+        audio = np.pad(audio, (pad_left, pad_right), mode='reflect')
+
+        # STFT
+        D = librosa.stft(audio, n_fft=self.n_fft, hop_length=self.hop_length,
+                         win_length=self.win_length, center=False)
+        S = np.abs(D)
+        S_mel = librosa.feature.melspectrogram(sr=self.sample_rate, S=S, n_mels=self.n_mels, fmax=8000.0)
+
+        # log magnitude
+        log_spectrogram = np.log(np.clip(S, a_min=1e-5, a_max=None))
+        log_mel_spectrogram = np.log(np.clip(S_mel, a_min=1e-5, a_max=None))
+        num_frames = log_spectrogram.shape[-1]
+        assert num_frames % self.reduction_factor == 0, "num_frames is wrong"
+        return (log_spectrogram.T, log_mel_spectrogram.T, num_frames)
+
+
+def save(output_path, dataset, transform):
+    if not os.path.exists(output_path):
+        os.makedirs(output_path)
+    records = []
+    for example in tqdm.tqdm(dataset):
+        fname, abs_fname, _, normalized_text = example
+        log_spec, log_mel_spec, num_frames = transform(abs_fname)
+        records.append((num_frames,
+                        fname + "_spec.npy", 
+                        fname + "_mel.npy", 
+                        normalized_text))
+        np.save(join(output_path, fname + "_spec"), log_spec)
+        np.save(join(output_path, fname + "_mel"), log_mel_spec)
+    meta_data = pd.DataFrame.from_records(records)
+    meta_data.to_csv(join(output_path, "metadata.csv"), 
+                     quoting=csv.QUOTE_NONE, sep="|", encoding="utf-8",
+                     header=False, index=False)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="preprocess ljspeech dataset and save it.")
+    parser.add_argument("--config", type=str, required=True, help="config file")
+    parser.add_argument("--input", type=str, required=True, help="data path of the original data")
+    parser.add_argument("--output", type=str, required=True, help="path to save the preprocessed dataset")
+
+    args = parser.parse_args()
+    with open(args.config, 'rt') as f:
+        config = yaml.safe_load(f)
+    
+    print("========= Command Line Arguments ========")
+    for k, v in vars(args).items():
+        print("{}: {}".format(k, v))
+    print("=========== Configurations ==============")
+    for k in ["sample_rate", "n_fft", "win_length", 
+              "hop_length", "n_mels", "reduction_factor"]:
+        print("{}: {}".format(k, config[k]))
+
+    ljspeech_meta = LJSpeechMetaData(args.input)
+    transform = Transform(config["sample_rate"],
+                          config["n_fft"],
+                          config["hop_length"],
+                          config["win_length"],
+                          config["n_mels"],
+                          config["reduction_factor"])
+    save(args.output, ljspeech_meta, transform)
+
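Note on the padding arithmetic in `Transform.__call__`: the audio is padded to a length chosen so that a non-centered STFT yields a frame count divisible by `reduction_factor`. The sketch below checks that arithmetic without librosa. The helper names `padded_length` and `stft_num_frames` are hypothetical, and the frame-count formula assumes the usual framing rule for a non-centered STFT, `1 + (n_samples - n_fft) // hop_length`.

```python
def padded_length(n_samples, hop_length, n_fft, reduction_factor):
    """Target length after padding, following the formula in the transform."""
    # One more group of `reduction_factor` frames than fits in the raw audio.
    frames = n_samples // (reduction_factor * hop_length) + 1
    return (frames * reduction_factor - 1) * hop_length + n_fft


def stft_num_frames(n_samples, hop_length, n_fft):
    """Frame count of a non-centered STFT (assumed framing rule)."""
    return 1 + (n_samples - n_fft) // hop_length


if __name__ == "__main__":
    # Example values only; the real ones come from the config file.
    length = padded_length(22050, hop_length=256, n_fft=1024, reduction_factor=4)
    print(length, stft_num_frames(length, hop_length=256, n_fft=1024))
```

Substituting the padded length into the framing rule gives exactly `frames * reduction_factor` frames, which is why the `assert num_frames % self.reduction_factor == 0` check in the transform is expected to hold.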
--- a/examples/deepvoice3/sentences.txt
+++ b/examples/deepvoice3/sentences.txt
-Scientists at the CERN laboratory say they have discovered a new particle.
-There's a way to measure the acute emotional intelligence that has never gone out of style.
-President Trump met with other leaders at the Group of 20 conference.
-Generative adversarial network or variational auto-encoder.
-Please call Stella.
-Some have accepted this as a miracle without any physical explanation.
\ No newline at end of file
+Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition
+in being comparatively modern.
+For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process
+produced the block books, which were the immediate predecessors of the true printed book,
+the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.
--- a/examples/deepvoice3/synthesis.py
+++ b/examples/deepvoice3/synthesis.py
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import division
-import os
-import argparse
-import ruamel.yaml
-import numpy as np
-import soundfile as sf
-
-from paddle import fluid
-fluid.require_version('1.8.0')
-import paddle.fluid.layers as F
-import paddle.fluid.dygraph as dg
-from tensorboardX import SummaryWriter
-
-from parakeet.g2p import en
-from parakeet.modules.weight_norm import WeightNormWrapper
-from parakeet.utils.layer_tools import summary
-from parakeet.utils import io
-
-from model import make_model
-from utils import make_evaluator
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser(
-        description="Synthsize waveform with a checkpoint.")
-    parser.add_argument("--config", type=str, help="experiment config")
-    parser.add_argument("--device", type=int, default=-1, help="device to use")
-
-    g = parser.add_mutually_exclusive_group()
-    g.add_argument("--checkpoint", type=str, help="checkpoint to resume from")
-    g.add_argument(
-        "--iteration",
-        type=int,
-        help="the iteration of the checkpoint to load from output directory")
-
-    parser.add_argument("text", type=str, help="text file to synthesize")
-    parser.add_argument(
-        "output", type=str, help="path to save synthesized audio")
-
-    args = parser.parse_args()
-    with open(args.config, 'rt') as f:
-        config = ruamel.yaml.safe_load(f)
-
-    print("Command Line Args: ")
-    for k, v in vars(args).items():
-        print("{}: {}".format(k, v))
-
-    if args.device == -1:
-        place = fluid.CPUPlace()
-    else:
-        place = fluid.CUDAPlace(args.device)
-
-    dg.enable_dygraph(place)
-
-    model = make_model(config)
-    checkpoint_dir = os.path.join(args.output, "checkpoints")
-    if args.checkpoint is not None:
-        iteration = io.load_parameters(model, checkpoint_path=args.checkpoint)
-    else:
-        iteration = io.load_parameters(
-            model, checkpoint_dir=checkpoint_dir, iteration=args.iteration)
-
-    # WARNING: don't forget to remove weight norm to re-compute each wrapped layer's weight
-    # removing weight norm also speeds up computation
-    for layer in model.sublayers():
-        if isinstance(layer, WeightNormWrapper):
-            layer.remove_weight_norm()
-
-    synthesis_dir = os.path.join(args.output, "synthesis")
-    if not os.path.exists(synthesis_dir):
-        os.makedirs(synthesis_dir)
-
-    with open(args.text, "rt", encoding="utf-8") as f:
-        lines = f.readlines()
-        sentences = [line[:-1] for line in lines]
-
-    evaluator = make_evaluator(config, sentences, synthesis_dir)
-    evaluator(model, iteration)
--- a/examples/deepvoice3/synthesize.py
+++ b/examples/deepvoice3/synthesize.py
+import numpy as np 
+from matplotlib import cm
+import librosa
+import os
+import time
+import tqdm
+import argparse
+from ruamel import yaml
+import paddle
+from paddle import fluid
+from paddle.fluid import layers as F
+from paddle.fluid import dygraph as dg
+from paddle.fluid.io import DataLoader
+from tensorboardX import SummaryWriter
+import soundfile as sf
+
+from parakeet.data import SliceDataset, DataCargo, PartialyRandomizedSimilarTimeLengthSampler, SequentialSampler
+from parakeet.utils.io import save_parameters, load_parameters, add_yaml_config_to_args
+from parakeet.g2p import en
+
+from vocoder import WaveflowVocoder
+from train import create_model
+
+
+def main(args, config):
+    model = create_model(config)
+    loaded_step = load_parameters(model, checkpoint_path=args.checkpoint)
+    model.eval()
+    vocoder = WaveflowVocoder()
+    vocoder.model.eval()
+    
+    if not os.path.exists(args.output):
+        os.makedirs(args.output)
+    monotonic_layers = [int(item.strip()) - 1 for item in args.monotonic_layers.split(',')]
+    with open(args.input, 'rt') as f:
+        sentences = [line.strip() for line in f.readlines()]
+    for i, sentence in enumerate(sentences):
+        wav = synthesize(config, model, vocoder, sentence, monotonic_layers)
+        sf.write(os.path.join(args.output, "sentence{}.wav".format(i)),
+                 wav, samplerate=config["sample_rate"])
+
+
+def synthesize(config, model, vocoder, sentence, monotonic_layers):
+    print("[synthesize] {}".format(sentence))
+    text = en.text_to_sequence(sentence, p=1.0)
+    text = np.expand_dims(np.array(text, dtype="int64"), 0)
+    lengths = np.array([text.size], dtype=np.int64)
+    text_seqs = dg.to_variable(text)
+    text_lengths = dg.to_variable(lengths)
+
+    decoder_layers = config["decoder_layers"]
+    force_monotonic_attention = [False] * decoder_layers
+    for i in monotonic_layers:
+        force_monotonic_attention[i] = True
+    
+    with dg.no_grad():
+        outputs = model(text_seqs, text_lengths, speakers=None,
+            force_monotonic_attention=force_monotonic_attention, 
+            window=(config["backward_step"], config["forward_step"]))
+        decoded, refined, attentions = outputs
+        wav = vocoder(F.transpose(decoded, (0, 2, 1)))
+        wav_np = wav.numpy()[0]
+    return wav_np
+
+
+if __name__ == "__main__":
+    import argparse
+    from ruamel import yaml
+    parser = argparse.ArgumentParser(description="synthesize from a checkpoint")
+    parser.add_argument("--config", type=str, required=True, help="config file")
+    parser.add_argument("--input", type=str, required=True, help="text file to synthesize")
+    parser.add_argument("--output", type=str, required=True, help="path to save audio")
+    parser.add_argument("--checkpoint", type=str, required=True, help="data path of the checkpoint")
+    parser.add_argument("--monotonic_layers", type=str, required=True, help="comma-separated monotonic decoder layers, index starts from 1")
+    args = parser.parse_args()
+    with open(args.config, 'rt') as f:
+        config = yaml.safe_load(f)
+    
+    dg.enable_dygraph(fluid.CUDAPlace(0))
+    main(args, config)
\ No newline at end of file
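Note on `--monotonic_layers`: `synthesize.py` takes a comma-separated list of 1-based decoder layer indices and turns it into one boolean flag per decoder layer. A minimal standalone sketch of that conversion follows. The function name `parse_monotonic_layers` and the range check are additions for illustration and are not part of the script.

```python
def parse_monotonic_layers(spec, decoder_layers):
    """Convert a string like "1,4" into per-layer boolean flags.

    Indices in `spec` are 1-based, as on the command line.
    """
    indices = [int(item.strip()) - 1 for item in spec.split(",")]
    flags = [False] * decoder_layers
    for i in indices:
        # Hypothetical guard: reject indices outside the decoder's layers
        # instead of silently wrapping around with a negative index.
        if not 0 <= i < decoder_layers:
            raise ValueError(
                "layer index {} is out of range 1..{}".format(i + 1, decoder_layers))
        flags[i] = True
    return flags


if __name__ == "__main__":
    print(parse_monotonic_layers("1, 4", decoder_layers=4))
```

Without a guard like this, an input of `0` would map to index `-1` and mark the last layer, which is easy to miss when passing the flag by hand.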
--- a/examples/deepvoice3/train.py
+++ b/examples/deepvoice3/train.py
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import division
-import time
+import numpy as np 
+from matplotlib import cm
+import librosa
 import os
-import argparse
-import ruamel.yaml
+import time
 import tqdm
-from tensorboardX import SummaryWriter
+import paddle
 from paddle import fluid
-fluid.require_version('1.8.0')
-import paddle.fluid.layers as F
-import paddle.fluid.dygraph as dg
-from parakeet.utils.io import load_parameters, save_parameters
-
-from data import make_data_loader
-from model import make_model, make_criterion, make_optimizer
-from utils import make_output_tree, add_options, get_place, Evaluator, StateSaver, make_evaluator, make_state_saver
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser(
-        description="Train a Deep Voice 3 model with LJSpeech dataset.")
-    add_options(parser)
-    args, _ = parser.parse_known_args()
-
-    # only use args.device when training in single process
-    # when training with distributed.launch, devices are provided by
-    # `--selected_gpus` for distributed.launch
-    env = dg.parallel.ParallelEnv()
-    device_id = env.dev_id if env.nranks > 1 else args.device
-    place = get_place(device_id)
-    # start dygraph
-    dg.enable_dygraph(place)
+from paddle.fluid import layers as F
+from paddle.fluid import dygraph as dg
+from paddle.fluid.io import DataLoader
+from tensorboardX import SummaryWriter

-    with open(args.config, 'rt') as f:
-        config = ruamel.yaml.safe_load(f)
-
-    print("Command Line Args: ")
-    for k, v in vars(args).items():
-        print("{}: {}".format(k, v))
-
-    data_loader = make_data_loader(args.data, config)
-    model = make_model(config)
-    if env.nranks > 1:
-        strategy = dg.parallel.prepare_context()
-        model = dg.DataParallel(model, strategy)
-    criterion = make_criterion(config)
-    optim = make_optimizer(model, config)
-
-    # generation
-    synthesis_config = config["synthesis"]
-    power = synthesis_config["power"]
-    n_iter = synthesis_config["n_iter"]
-
-    # tensorboard & checkpoint preparation
-    output_dir = args.output
-    ckpt_dir = os.path.join(output_dir, "checkpoints")
-    log_dir = os.path.join(output_dir, "log")
-    state_dir = os.path.join(output_dir, "states")
-    eval_dir = os.path.join(output_dir, "eval")
-    if env.local_rank == 0:
-        make_output_tree(output_dir)
-        writer = SummaryWriter(logdir=log_dir)
-    else:
-        writer = None
-    sentences = [
-        "Scientists at the CERN laboratory say they have discovered a new particle.",
-        "There's a way to measure the acute emotional intelligence that has never gone out of style.",
-        "President Trump met with other leaders at the Group of 20 conference.",
-        "Generative adversarial network or variational auto-encoder.",
-        "Please call Stella.",
-        "Some have accepted this as a miracle without any physical explanation.",
-    ]
-    evaluator = make_evaluator(config, sentences, eval_dir, writer)
-    state_saver = make_state_saver(config, state_dir, writer)
-
-    # load parameters and optimizer, and opdate iterations done sofar
-    if args.checkpoint is not None:
-        iteration = load_parameters(
-            model, optim, checkpoint_path=args.checkpoint)
-    else:
-        iteration = load_parameters(
-            model, optim, checkpoint_dir=ckpt_dir, iteration=args.iteration)
-
-    # =========================train=========================
-    train_config = config["train"]
-    max_iter = train_config["max_iteration"]
-    snap_interval = train_config["snap_interval"]
-    save_interval = train_config["save_interval"]
-    eval_interval = train_config["eval_interval"]
-
-    global_step = iteration + 1
-    iterator = iter(tqdm.tqdm(data_loader))
-    downsample_factor = config["model"]["downsample_factor"]
-    while global_step <= max_iter:
+from parakeet.models.deepvoice3 import Encoder, Decoder, PostNet, SpectraNet
+from parakeet.data import SliceDataset, DataCargo, PartialyRandomizedSimilarTimeLengthSampler, SequentialSampler
+from parakeet.utils.io import save_parameters, load_parameters
+from parakeet.g2p import en
+
+from data import LJSpeech, DataCollector
+from vocoder import WaveflowVocoder, GriffinLimVocoder
+from clip import DoubleClip
+
+
+def create_model(config):
+    char_embedding = dg.Embedding((en.n_vocab, config["char_dim"]))
+    multi_speaker = config["n_speakers"] > 1
+    speaker_embedding = dg.Embedding((config["n_speakers"], config["speaker_dim"])) \
+        if multi_speaker else None
+    encoder = Encoder(config["encoder_layers"], config["char_dim"], 
+                      config["encoder_dim"], config["kernel_size"], 
+                      has_bias=multi_speaker, bias_dim=config["speaker_dim"], 
+                      keep_prob=1.0 - config["dropout"])
+    decoder = Decoder(config["n_mels"], config["reduction_factor"], 
+                      list(config["prenet_sizes"]) + [config["char_dim"]], 
+                      config["decoder_layers"], config["kernel_size"], 
+                      config["attention_dim"],
+                      position_encoding_weight=config["position_weight"], 
+                      omega=config["position_rate"], 
+                      has_bias=multi_speaker, bias_dim=config["speaker_dim"], 
+                      keep_prob=1.0 - config["dropout"])
+    postnet = PostNet(config["postnet_layers"], config["char_dim"], 
+                      config["postnet_dim"], config["kernel_size"], 
+                      config["n_mels"], config["reduction_factor"], 
+                      has_bias=multi_speaker, bias_dim=config["speaker_dim"], 
+                      keep_prob=1.0 - config["dropout"])
+    spectranet = SpectraNet(char_embedding, speaker_embedding, encoder, decoder, postnet)
+    return spectranet
+
+def create_data(config, data_path):
+    dataset = LJSpeech(data_path)
+
+    train_dataset = SliceDataset(dataset, config["valid_size"], len(dataset))
+    train_collator = DataCollector(config["p_pronunciation"])
+    train_sampler = PartialyRandomizedSimilarTimeLengthSampler(
+        dataset.num_frames()[config["valid_size"]:])
+    train_cargo = DataCargo(train_dataset, train_collator, 
+        batch_size=config["batch_size"], sampler=train_sampler)
+    train_loader = DataLoader\
+                 .from_generator(capacity=10, return_list=True)\
+                 .set_batch_generator(train_cargo)
+
+    valid_dataset = SliceDataset(dataset, 0, config["valid_size"])
+    valid_collector = DataCollector(1.)
+    valid_sampler = SequentialSampler(valid_dataset)
+    valid_cargo = DataCargo(valid_dataset, valid_collector, 
+        batch_size=1, sampler=valid_sampler)
+    valid_loader = DataLoader\
+                 .from_generator(capacity=2, return_list=True)\
+                 .set_batch_generator(valid_cargo)
+    return train_loader, valid_loader
+
+def create_optimizer(model, config):
+    optim = fluid.optimizer.Adam(config["learning_rate"], 
+        parameter_list=model.parameters(), 
+        grad_clip=DoubleClip(config["clip_value"], config["clip_norm"]))
+    return optim
+
+def train(args, config):
+    model = create_model(config)
+    train_loader, valid_loader = create_data(config, args.input)
+    optim = create_optimizer(model, config)
+
+    global global_step
+    max_iteration = 2000000
+    
+    iterator = iter(tqdm.tqdm(train_loader))
+    while global_step <= max_iteration:
+        # get inputs
        try:
            batch = next(iterator)
-        except StopIteration as e:
-            iterator = iter(tqdm.tqdm(data_loader))
+        except StopIteration:
+            iterator = iter(tqdm.tqdm(train_loader))
            batch = next(iterator)
+        
+        # unzip it
+        text_seqs, text_lengths, specs, mels, num_frames = batch

+        # forward & backward
        model.train()
-        (text_sequences, text_lengths, text_positions, mel_specs, lin_specs,
-         frames, decoder_positions, done_flags) = batch
-        downsampled_mel_specs = F.strided_slice(
-            mel_specs,
-            axes=[1],
-            starts=[0],
-            ends=[mel_specs.shape[1]],
-            strides=[downsample_factor])
-        outputs = model(
-            text_sequences,
-            text_positions,
-            text_lengths,
-            None,
-            downsampled_mel_specs,
-            decoder_positions, )
-        # mel_outputs, linear_outputs, alignments, done
-        inputs = (downsampled_mel_specs, lin_specs, done_flags, text_lengths,
-                  frames)
-        losses = criterion(outputs, inputs)
-
-        l = losses["loss"]
-        if env.nranks > 1:
-            l = model.scale_loss(l)
-            l.backward()
-            model.apply_collective_grads()
-        else:
-            l.backward()
-
-        # record learning rate before updating
-        if env.local_rank == 0:
-            writer.add_scalar("learning_rate",
-                              optim._learning_rate.step().numpy(), global_step)
-        optim.minimize(l)
-        optim.clear_gradients()
-
-        # record step losses
-        step_loss = {k: v.numpy()[0] for k, v in losses.items()}
-
-        if env.local_rank == 0:
-            tqdm.tqdm.write("[Train] global_step: {}\tloss: {}".format(
-                global_step, step_loss["loss"]))
-            for k, v in step_loss.items():
-                writer.add_scalar(k, v, global_step)
-
-        # train state saving, the first sentence in the batch
-        if env.local_rank == 0 and global_step % snap_interval == 0:
-            input_specs = (mel_specs, lin_specs)
-            state_saver(outputs, input_specs, global_step)
-
-        # evaluation
-        if env.local_rank == 0 and global_step % eval_interval == 0:
-            evaluator(model, global_step)
-
-        # save checkpoint
-        if env.local_rank == 0 and global_step % save_interval == 0:
-            save_parameters(ckpt_dir, global_step, model, optim)
-
+        outputs = model(text_seqs, text_lengths, speakers=None, mel=mels)
+        decoded, refined, attentions, final_state = outputs
+
+        causal_mel_loss = model.spec_loss(decoded, mels, num_frames)
+        non_causal_mel_loss = model.spec_loss(refined, mels, num_frames)
+        loss = causal_mel_loss + non_causal_mel_loss
+        loss.backward()
+
+        # update
+        optim.minimize(loss)
+
+        # logging
+        tqdm.tqdm.write("[train] step: {}\tloss: {:.6f}\tcausal:{:.6f}\tnon_causal:{:.6f}".format(
+            global_step, 
+            loss.numpy()[0], 
+            causal_mel_loss.numpy()[0], 
+            non_causal_mel_loss.numpy()[0]))
+        writer.add_scalar("loss/causal_mel_loss", causal_mel_loss.numpy()[0], global_step=global_step)
+        writer.add_scalar("loss/non_causal_mel_loss", non_causal_mel_loss.numpy()[0], global_step=global_step)
+        writer.add_scalar("loss/loss", loss.numpy()[0], global_step=global_step)
+        
+        if global_step % config["report_interval"] == 0:
+            text_length = int(text_lengths.numpy()[0])
+            num_frame = int(num_frames.numpy()[0])
+
+            tag = "train_mel/ground-truth"
+            img = cm.viridis(normalize(mels.numpy()[0, :num_frame].T))
+            writer.add_image(tag, img, global_step=global_step, dataformats="HWC")
+
+            tag = "train_mel/decoded"
+            img = cm.viridis(normalize(decoded.numpy()[0, :num_frame].T))
+            writer.add_image(tag, img, global_step=global_step, dataformats="HWC")
+
+            tag = "train_mel/refined"
+            img = cm.viridis(normalize(refined.numpy()[0, :num_frame].T))
+            writer.add_image(tag, img, global_step=global_step, dataformats="HWC")
+
+            vocoder = WaveflowVocoder()
+            vocoder.model.eval()
+
+            tag = "train_audio/ground-truth-waveflow"
+            wav = vocoder(F.transpose(mels[0:1, :num_frame, :], (0, 2, 1)))
+            writer.add_audio(tag, wav.numpy()[0], global_step=global_step, sample_rate=22050)
+
+            tag = "train_audio/decoded-waveflow"
+            wav = vocoder(F.transpose(decoded[0:1, :num_frame, :], (0, 2, 1)))
+            writer.add_audio(tag, wav.numpy()[0], global_step=global_step, sample_rate=22050)
+
+            tag = "train_audio/refined-waveflow"
+            wav = vocoder(F.transpose(refined[0:1, :num_frame, :], (0, 2, 1)))
+            writer.add_audio(tag, wav.numpy()[0], global_step=global_step, sample_rate=22050)
+            
+            attentions_np = attentions.numpy()
+            attentions_np = attentions_np[:, 0, :num_frame // 4, :text_length]
+            for i, attention_layer in enumerate(np.rot90(attentions_np, axes=(1, 2))):
+                tag = "train_attention/layer_{}".format(i)
+                img = cm.viridis(normalize(attention_layer))
+                writer.add_image(tag, img, global_step=global_step, dataformats="HWC")
+
+        if global_step % config["save_interval"] == 0:
+            save_parameters(writer.logdir, global_step, model, optim)
+
+        # global step +1
        global_step += 1
+
+def normalize(arr):
+    return (arr - arr.min()) / (arr.max() - arr.min())
+
+if __name__ == "__main__":
+    import argparse
+    from ruamel import yaml
+
+    parser = argparse.ArgumentParser(description="train a Deep Voice 3 model with LJSpeech")
+    parser.add_argument("--config", type=str, required=True, help="config file")
+    parser.add_argument("--input", type=str, required=True, help="data path of the original data")
+
+    args = parser.parse_args()
+    with open(args.config, 'rt') as f:
+        config = yaml.safe_load(f)
+    
+    dg.enable_dygraph(fluid.CUDAPlace(0))
+    global global_step
+    global_step = 1
+    global writer
+    writer = SummaryWriter()
+    print("[Training] tensorboard log and checkpoints are saved in {}".format(
+        writer.logdir))
+    train(args, config)
\ No newline at end of file
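The `normalize` helper added to `train.py` divides by `arr.max() - arr.min()`, which is zero whenever the input is constant. The sketch below shows one possible guard for that case. It is a plain-Python stand-in that works on a flat list rather than a NumPy array. The `eps` threshold and the all-zeros fallback are assumptions made for illustration, not part of this commit.

```python
def normalize(values, eps=1e-8):
    """Min-max scale a flat sequence of floats to [0, 1].

    Hypothetical variant of the helper in train.py: when the range is
    smaller than `eps` (an assumed threshold), return zeros instead of
    dividing by (nearly) zero.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    if span < eps:
        # constant input: no contrast to display, so map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


scaled = normalize([1.0, 3.0, 5.0])
flat = normalize([2.0, 2.0])
```

With NumPy arrays the same idea would probably use `np.zeros_like(arr)` for the fallback branch, so the result can still be passed to `cm.viridis`.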
--- a/examples/deepvoice3/utils.py
+++ b/examples/deepvoice3/utils.py
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import division
-import os
-import numpy as np
-import matplotlib
-matplotlib.use("agg")
-from matplotlib import cm
-import matplotlib.pyplot as plt
-import librosa
-from scipy import signal
-from librosa import display
-import soundfile as sf
-
-from paddle import fluid
-import paddle.fluid.dygraph as dg
-from parakeet.g2p import en
-
-
-def get_place(device_id):
-    """get place from device_id, -1 stands for CPU"""
-    if device_id == -1:
-        place = fluid.CPUPlace()
-    else:
-        place = fluid.CUDAPlace(device_id)
-    return place
-
-
-def add_options(parser):
-    parser.add_argument("--config", type=str, help="experimrnt config")
-    parser.add_argument(
-        "--data",
-        type=str,
-        default="/workspace/datasets/LJSpeech-1.1/",
-        help="The path of the LJSpeech dataset.")
-    parser.add_argument("--device", type=int, default=-1, help="device to use")
-
-    g = parser.add_mutually_exclusive_group()
-    g.add_argument("--checkpoint", type=str, help="checkpoint to resume from.")
-    g.add_argument(
-        "--iteration",
-        type=int,
-        help="the iteration of the checkpoint to load from output directory")
-
-    parser.add_argument(
-        "output", type=str, default="experiment", help="path to save results")
-
-
-def make_evaluator(config, text_sequences, output_dir, writer=None):
-    c = config["transform"]
-    p_replace = 0.0
-    sample_rate = c["sample_rate"]
-    preemphasis = c["preemphasis"]
-    win_length = c["win_length"]
-    hop_length = c["hop_length"]
-    min_level_db = c["min_level_db"]
-    ref_level_db = c["ref_level_db"]
-
-    synthesis_config = config["synthesis"]
-    power = synthesis_config["power"]
-    n_iter = synthesis_config["n_iter"]
-
-    return Evaluator(
-        text_sequences,
-        p_replace,
-        sample_rate,
-        preemphasis,
-        win_length,
-        hop_length,
-        min_level_db,
-        ref_level_db,
-        power,
-        n_iter,
-        output_dir=output_dir,
-        writer=writer)
-
-
-class Evaluator(object):
-    def __init__(self,
-                 text_sequences,
-                 p_replace,
-                 sample_rate,
-                 preemphasis,
-                 win_length,
-                 hop_length,
-                 min_level_db,
-                 ref_level_db,
-                 power,
-                 n_iter,
-                 output_dir,
-                 writer=None):
-        self.text_sequences = text_sequences
-        self.output_dir = output_dir
-        self.writer = writer
-
-        self.p_replace = p_replace
-        self.sample_rate = sample_rate
-        self.preemphasis = preemphasis
-        self.win_length = win_length
-        self.hop_length = hop_length
-        self.min_level_db = min_level_db
-        self.ref_level_db = ref_level_db
-
-        self.power = power
-        self.n_iter = n_iter
-
-    def process_a_sentence(self, model, text):
-        text = np.array(
-            en.text_to_sequence(
-                text, p=self.p_replace), dtype=np.int64)
-        length = len(text)
-        text_positions = np.arange(1, 1 + length, dtype=np.int64)
-        text = np.expand_dims(text, 0)
-        text_positions = np.expand_dims(text_positions, 0)
-
-        model.eval()
-        if isinstance(model, dg.DataParallel):
-            _model = model._layers
-        else:
-            _model = model
-        mel_outputs, linear_outputs, alignments, done = _model.transduce(
-            dg.to_variable(text), dg.to_variable(text_positions))
-
-        linear_outputs_np = linear_outputs.numpy()[0].T  # (C, T)
-
-        wav = spec_to_waveform(linear_outputs_np, self.min_level_db,
-                               self.ref_level_db, self.power, self.n_iter,
-                               self.win_length, self.hop_length,
-                               self.preemphasis)
-        alignments_np = alignments.numpy()[0]  # batch_size = 1
-        return wav, alignments_np
-
-    def __call__(self, model, iteration):
-        writer = self.writer
-        for i, seq in enumerate(self.text_sequences):
-            print("[Eval] synthesizing sentence {}".format(i))
-            wav, alignments_np = self.process_a_sentence(model, seq)
-
-            wav_path = os.path.join(
-                self.output_dir,
-                "eval_sample_{}_step_{:09d}.wav".format(i, iteration))
-            sf.write(wav_path, wav, self.sample_rate)
-            if writer is not None:
-                writer.add_audio(
-                    "eval_sample_{}".format(i),
-                    wav,
-                    iteration,
-                    sample_rate=self.sample_rate)
-            attn_path = os.path.join(
-                self.output_dir,
-                "eval_sample_{}_step_{:09d}.png".format(i, iteration))
-            plot_alignment(alignments_np, attn_path)
-            if writer is not None:
-                writer.add_image(
-                    "eval_sample_attn_{}".format(i),
-                    cm.viridis(alignments_np),
-                    iteration,
-                    dataformats="HWC")
-
-
-def make_state_saver(config, output_dir, writer=None):
-    c = config["transform"]
-    p_replace = c["replace_pronunciation_prob"]
-    sample_rate = c["sample_rate"]
-    preemphasis = c["preemphasis"]
-    win_length = c["win_length"]
-    hop_length = c["hop_length"]
-    min_level_db = c["min_level_db"]
-    ref_level_db = c["ref_level_db"]
-
-    synthesis_config = config["synthesis"]
-    power = synthesis_config["power"]
-    n_iter = synthesis_config["n_iter"]
-
-    return StateSaver(p_replace, sample_rate, preemphasis, win_length,
-                      hop_length, min_level_db, ref_level_db, power, n_iter,
-                      output_dir, writer)
-
-
-class StateSaver(object):
-    def __init__(self,
-                 p_replace,
-                 sample_rate,
-                 preemphasis,
-                 win_length,
-                 hop_length,
-                 min_level_db,
-                 ref_level_db,
-                 power,
-                 n_iter,
-                 output_dir,
-                 writer=None):
-        self.output_dir = output_dir
-        self.writer = writer
-
-        self.p_replace = p_replace
-        self.sample_rate = sample_rate
-        self.preemphasis = preemphasis
-        self.win_length = win_length
-        self.hop_length = hop_length
-        self.min_level_db = min_level_db
-        self.ref_level_db = ref_level_db
-
-        self.power = power
-        self.n_iter = n_iter
-
-    def __call__(self, outputs, inputs, iteration):
-        mel_output, lin_output, alignments, done_output = outputs
-        mel_input, lin_input = inputs
-        writer = self.writer
-
-        # mel spectrogram
-        mel_input = mel_input[0].numpy().T
-        mel_output = mel_output[0].numpy().T
-
-        path = os.path.join(self.output_dir, "mel_spec")
-        plt.figure(figsize=(10, 3))
-        display.specshow(mel_input)
-        plt.colorbar()
-        plt.title("mel_input")
-        plt.savefig(
-            os.path.join(path, "target_mel_spec_step_{:09d}.png".format(
-                iteration)))
-        plt.close()
-
-        if writer is not None:
-            writer.add_image(
-                "target/mel_spec",
-                cm.viridis(mel_input),
-                iteration,
-                dataformats="HWC")
-
-        plt.figure(figsize=(10, 3))
-        display.specshow(mel_output)
-        plt.colorbar()
-        plt.title("mel_output")
-        plt.savefig(
-            os.path.join(path, "predicted_mel_spec_step_{:09d}.png".format(
-                iteration)))
-        plt.close()
-
-        if writer is not None:
-            writer.add_image(
-                "predicted/mel_spec",
-                cm.viridis(mel_output),
-                iteration,
-                dataformats="HWC")
-
-        # linear spectrogram
-        lin_input = lin_input[0].numpy().T
-        lin_output = lin_output[0].numpy().T
-        path = os.path.join(self.output_dir, "lin_spec")
-
-        plt.figure(figsize=(10, 3))
-        display.specshow(lin_input)
-        plt.colorbar()
-        plt.title("mel_input")
-        plt.savefig(
-            os.path.join(path, "target_lin_spec_step_{:09d}.png".format(
-                iteration)))
-        plt.close()
-
-        if writer is not None:
-            writer.add_image(
-                "target/lin_spec",
-                cm.viridis(lin_input),
-                iteration,
-                dataformats="HWC")
-
-        plt.figure(figsize=(10, 3))
-        display.specshow(lin_output)
-        plt.colorbar()
-        plt.title("mel_input")
-        plt.savefig(
-            os.path.join(path, "predicted_lin_spec_step_{:09d}.png".format(
-                iteration)))
-        plt.close()
-
-        if writer is not None:
-            writer.add_image(
-                "predicted/lin_spec",
-                cm.viridis(lin_output),
-                iteration,
-                dataformats="HWC")
-
-        # alignment
-        path = os.path.join(self.output_dir, "alignments")
-        alignments = alignments[:, 0, :, :].numpy()
-        for idx, attn_layer in enumerate(alignments):
-            save_path = os.path.join(
-                path, "train_attn_layer_{}_step_{}.png".format(idx, iteration))
-            plot_alignment(attn_layer, save_path)
-
-            if writer is not None:
-                writer.add_image(
-                    "train_attn/layer_{}".format(idx),
-                    cm.viridis(attn_layer),
-                    iteration,
-                    dataformats="HWC")
-
-        # synthesize waveform
-        wav = spec_to_waveform(
-            lin_output, self.min_level_db, self.ref_level_db, self.power,
-            self.n_iter, self.win_length, self.hop_length, self.preemphasis)
-        path = os.path.join(self.output_dir, "waveform")
-        save_path = os.path.join(
-            path, "train_sample_step_{:09d}.wav".format(iteration))
-        sf.write(save_path, wav, self.sample_rate)
-
-        if writer is not None:
-            writer.add_audio(
-                "train_sample", wav, iteration, sample_rate=self.sample_rate)
-
-
-def spec_to_waveform(spec, min_level_db, ref_level_db, power, n_iter,
-                     win_length, hop_length, preemphasis):
-    """Convert output linear spec to waveform using griffin-lim vocoder.
-    
-    Args:
-        spec (ndarray): the output linear spectrogram, shape(C, T), where C means n_fft, T means frames.
-    """
-    denoramlized = np.clip(spec, 0, 1) * (-min_level_db) + min_level_db
-    lin_scaled = np.exp((denoramlized + ref_level_db) / 20 * np.log(10))
-    wav = librosa.griffinlim(
-        lin_scaled**power,
-        n_iter=n_iter,
-        hop_length=hop_length,
-        win_length=win_length)
-    if preemphasis > 0:
-        wav = signal.lfilter([1.], [1., -preemphasis], wav)
-    wav = np.clip(wav, -1.0, 1.0)
-    return wav
-
-
-def make_output_tree(output_dir):
-    print("creating output tree: {}".format(output_dir))
-    ckpt_dir = os.path.join(output_dir, "checkpoints")
-    state_dir = os.path.join(output_dir, "states")
-    eval_dir = os.path.join(output_dir, "eval")
-
-    for x in [ckpt_dir, state_dir, eval_dir]:
-        if not os.path.exists(x):
-            os.makedirs(x)
-    for x in ["alignments", "waveform", "lin_spec", "mel_spec"]:
-        p = os.path.join(state_dir, x)
-        if not os.path.exists(p):
-            os.makedirs(p)
-
-
-def plot_alignment(alignment, path):
-    """
-    Plot an attention layer's alignment for a sentence.
-    alignment: shape(T_dec, T_enc).
-    """
-
-    plt.figure()
-    plt.imshow(alignment)
-    plt.colorbar()
-    plt.xlabel('Encoder timestep')
-    plt.ylabel('Decoder timestep')
-    plt.savefig(path)
-    plt.close()
--- a/examples/deepvoice3/vocoder.py
+++ b/examples/deepvoice3/vocoder.py
+import argparse
+from ruamel import yaml
+import numpy as np
+import librosa
+import paddle
+from paddle import fluid
+from paddle.fluid import layers as F
+from paddle.fluid import dygraph as dg
+from parakeet.utils.io import load_parameters
+from parakeet.models.waveflow.waveflow_modules import WaveFlowModule
+
+class WaveflowVocoder(object):
+    def __init__(self):
+        config_path = "waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml"
+        with open(config_path, 'rt') as f:
+            config = yaml.safe_load(f)
+        ns = argparse.Namespace()
+        for k, v in config.items():
+            setattr(ns, k, v)
+        ns.use_fp16 = False
+        
+        self.model = WaveFlowModule(ns)
+        checkpoint_path = "waveflow_res128_ljspeech_ckpt_1.0/step-2000000"
+        load_parameters(self.model, checkpoint_path=checkpoint_path)
+
+    def __call__(self, mel):
+        with dg.no_grad():
+            self.model.eval()
+            audio = self.model.synthesize(mel)
+        self.model.train()
+        return audio
+
+class GriffinLimVocoder(object):
+    def __init__(self, sharpening_factor=1.4, win_length=1024, hop_length=256):
+        self.sharpening_factor = sharpening_factor
+        self.win_length = win_length
+        self.hop_length = hop_length
+
+    def __call__(self, spec):
+        audio = librosa.core.griffinlim(np.exp(spec * self.sharpening_factor), 
+            win_length=self.win_length, hop_length=self.hop_length)
+        return audio
+
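`WaveflowVocoder.__init__` copies every key of the loaded YAML config onto an `argparse.Namespace` and then forces `use_fp16` to `False`. The sketch below shows what appears to be an equivalent, more compact form using only the standard library. The helper name `config_to_namespace` and the sample keys are hypothetical and are not taken from `waveflow_ljspeech.yaml`.

```python
import argparse


def config_to_namespace(config, **overrides):
    """Expose the keys of a flat config dict as attributes.

    Hypothetical helper: builds the Namespace in one call, then applies
    keyword overrides (for example use_fp16=False) on top.
    """
    ns = argparse.Namespace(**config)
    for key, value in overrides.items():
        setattr(ns, key, value)
    return ns


# made-up config values, for illustration only
config = {"n_flows": 8, "use_fp16": True}
ns = config_to_namespace(config, use_fp16=False)
```

The input dict is left unchanged, so the same config can be reused to build another model with different overrides.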
--- a/parakeet/models/deepvoice3/__init__.py
+++ b/parakeet/models/deepvoice3/__init__.py
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from parakeet.models.deepvoice3.encoder import Encoder, ConvSpec
-from parakeet.models.deepvoice3.decoder import Decoder, WindowRange
-from parakeet.models.deepvoice3.converter import Converter
-from parakeet.models.deepvoice3.loss import TTSLoss
-from parakeet.models.deepvoice3.model import DeepVoice3
+from .model import *
\ No newline at end of file
--- a/parakeet/models/deepvoice3/attention.py
+++ b/parakeet/models/deepvoice3/attention.py
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import division
-import numpy as np
-from collections import namedtuple
-from paddle import fluid
-import paddle.fluid.dygraph as dg
-import paddle.fluid.layers as F
-import paddle.fluid.initializer as I
-
-from parakeet.modules.weight_norm import Linear
-WindowRange = namedtuple("WindowRange", ["backward", "ahead"])
-
-
-class Attention(dg.Layer):
-    def __init__(self,
-                 query_dim,
-                 embed_dim,
-                 dropout=0.0,
-                 window_range=WindowRange(-1, 3),
-                 key_projection=True,
-                 value_projection=True):
-        """Attention Layer for Deep Voice 3.
-
-        Args:
-            query_dim (int): the dimension of query vectors. (The size of a single vector of query.)
-            embed_dim (int): the dimension of keys and values.
-            dropout (float, optional): dropout probability of attention. Defaults to 0.0.
-            window_range (WindowRange, optional): range of attention, this is only used at inference. Defaults to WindowRange(-1, 3).
-            key_projection (bool, optional): whether the `Attention` Layer has a Linear Layer for the keys to pass through before computing attention. Defaults to True.
-            value_projection (bool, optional): whether the `Attention` Layer has a Linear Layer for the values to pass through before computing attention. Defaults to True.
-        """
-        super(Attention, self).__init__()
-        std = np.sqrt(1 / query_dim)
-        self.query_proj = Linear(
-            query_dim, embed_dim, param_attr=I.Normal(scale=std))
-        if key_projection:
-            std = np.sqrt(1 / embed_dim)
-            self.key_proj = Linear(
-                embed_dim, embed_dim, param_attr=I.Normal(scale=std))
-        if value_projection:
-            std = np.sqrt(1 / embed_dim)
-            self.value_proj = Linear(
-                embed_dim, embed_dim, param_attr=I.Normal(scale=std))
-        std = np.sqrt(1 / embed_dim)
-        self.out_proj = Linear(
-            embed_dim, query_dim, param_attr=I.Normal(scale=std))
-
-        self.key_projection = key_projection
-        self.value_projection = value_projection
-        self.dropout = dropout
-        self.window_range = window_range
-
-    def forward(self, query, encoder_out, mask=None, last_attended=None):
-        """
-        Compute contextualized representation and alignment scores.
-        
-        Args:
-            query (Variable): shape(B, T_dec, C_q), dtype float32, the query tensor, where C_q means the query dim.
-            encoder_out (keys, values): 
-                keys (Variable): shape(B, T_enc, C_emb), dtype float32, the key representation from an encoder, where C_emb means embed dim.
-                values (Variable): shape(B, T_enc, C_emb), dtype float32, the value representation from an encoder, where C_emb means embed dim.
-            mask (Variable, optional): shape(B, T_enc), dtype float32, mask generated with valid text lengths. Pad tokens corresponds to 1, and valid tokens correspond to 0.
-            last_attended (int, optional): The position that received the most attention at last time step. This is only used at inference.
-
-        Outpus:
-            x (Variable): shape(B, T_dec, C_q), dtype float32, the contextualized representation from attention mechanism.
-            attn_scores (Variable): shape(B, T_dec, T_enc), dtype float32, the alignment tensor, where T_dec means the number of decoder time steps and T_enc means number the number of decoder time steps.
-        """
-        keys, values = encoder_out
-        residual = query
-        if self.value_projection:
-            values = self.value_proj(values)
-        if self.key_projection:
-            keys = self.key_proj(keys)
-        x = self.query_proj(query)
-
-        x = F.matmul(x, keys, transpose_y=True)
-
-        # mask generated by sentence length
-        neg_inf = -1.e30
-        if mask is not None:
-            neg_inf_mask = F.scale(F.unsqueeze(mask, [1]), neg_inf)
-            x += neg_inf_mask
-
-        # if last_attended is provided, focus only on a window range around it
-        # to enforce monotonic attention.
-        if last_attended is not None:
-            locality_mask = np.ones(shape=x.shape, dtype=np.float32)
-            backward, ahead = self.window_range
-            backward = last_attended + backward
-            ahead = last_attended + ahead
-            backward = max(backward, 0)
-            ahead = min(ahead, x.shape[-1])
-            locality_mask[:, :, backward:ahead] = 0.
-            locality_mask = dg.to_variable(locality_mask)
-            neg_inf_mask = F.scale(locality_mask, neg_inf)
-            x += neg_inf_mask
-
-        x = F.softmax(x)
-        attn_scores = x
-        x = F.dropout(
-            x, self.dropout, dropout_implementation="upscale_in_train")
-        x = F.matmul(x, values)
-        encoder_length = keys.shape[1]
-
-        x = F.scale(x, encoder_length * np.sqrt(1.0 / encoder_length))
-        x = self.out_proj(x)
-        x = F.scale((x + residual), np.sqrt(0.5))
-        return x, attn_scores
--- a/parakeet/models/deepvoice3/conv.py
+++ b/parakeet/models/deepvoice3/conv.py
+import numpy as np
+from paddle.fluid import layers as F
+from paddle.fluid.framework import Variable, in_dygraph_mode
+from paddle.fluid import core, dygraph_utils
+from paddle.fluid.layers import nn, utils
+from paddle.fluid.data_feeder import check_variable_and_dtype
+from paddle.fluid.param_attr import ParamAttr
+from paddle.fluid.layer_helper import LayerHelper
+from paddle.fluid.dygraph import layers
+from paddle.fluid.initializer import Normal
+
+
+def _is_list_or_tuple(input):
+    return isinstance(input, (list, tuple))
+
+
+def _zero_padding_in_batch_and_channel(padding, channel_last):
+    if channel_last:
+        return list(padding[0]) == [0, 0] and list(padding[-1]) == [0, 0]
+    else:
+        return list(padding[0]) == [0, 0] and list(padding[1]) == [0, 0]
+
+
+def _exclude_padding_in_batch_and_channel(padding, channel_last):
+    padding_ = padding[1:-1] if channel_last else padding[2:]
+    padding_ = [elem for pad_a_dim in padding_ for elem in pad_a_dim]
+    return padding_
+
+
+def _update_padding_nd(padding, channel_last, num_dims):
+    if isinstance(padding, str):
+        padding = padding.upper()
+        if padding not in ["SAME", "VALID"]:
+            raise ValueError(
+                "Unknown padding: '{}'. It can only be 'SAME' or 'VALID'.".
+                format(padding))
+        if padding == "VALID":
+            padding_algorithm = "VALID"
+            padding = [0] * num_dims
+        else:
+            padding_algorithm = "SAME"
+            padding = [0] * num_dims
+    elif _is_list_or_tuple(padding):
+        # for padding like
+        # [(pad_before, pad_after), (pad_before, pad_after), ...]
+        # padding for batch_dim and channel_dim included
+        if len(padding) == 2 + num_dims and _is_list_or_tuple(padding[0]):
+            if not _zero_padding_in_batch_and_channel(padding, channel_last):
+                raise ValueError(
+                    "Non-zero padding({}) in the batch or channel dimensions "
+                    "is not supported.".format(padding))
+            padding_algorithm = "EXPLICIT"
+            padding = _exclude_padding_in_batch_and_channel(padding,
+                                                            channel_last)
+            if utils._is_symmetric_padding(padding, num_dims):
+                padding = padding[0::2]
+        # for padding like [pad_before, pad_after, pad_before, pad_after, ...]
+        elif len(padding) == 2 * num_dims and isinstance(padding[0], int):
+            padding_algorithm = "EXPLICIT"
+            padding = utils.convert_to_list(padding, 2 * num_dims, 'padding')
+            if utils._is_symmetric_padding(padding, num_dims):
+                padding = padding[0::2]
+        # for padding like [pad_d1, pad_d2, ...]
+        elif len(padding) == num_dims and isinstance(padding[0], int):
+            padding_algorithm = "EXPLICIT"
+            padding = utils.convert_to_list(padding, num_dims, 'padding')
+        else:
+            raise ValueError("Invalid padding: {}".format(padding))
+    # for integer padding
+    else:
+        padding_algorithm = "EXPLICIT"
+        padding = utils.convert_to_list(padding, num_dims, 'padding')
+    return padding, padding_algorithm
+
+def _get_default_param_initializer(num_channels, filter_size):
+    filter_elem_num = num_channels * np.prod(filter_size)
+    std = (2.0 / filter_elem_num)**0.5
+    return Normal(0.0, std, 0)
+
+def conv1d(input,
+           weight,
+           bias=None,
+           padding=0,
+           stride=1,
+           dilation=1,
+           groups=1,
+           use_cudnn=True,
+           act=None,
+           data_format="NCT",
+           name=None):
+    # entry checks
+    if not isinstance(use_cudnn, bool):
+        raise ValueError("Attr(use_cudnn) should be True or False. "
+                         "Received Attr(use_cudnn): {}.".format(use_cudnn))
+    if data_format not in ["NCT", "NTC"]:
+        raise ValueError("Attr(data_format) should be 'NCT' or 'NTC'. "
+                         "Received Attr(data_format): {}.".format(data_format))
+
+    channel_last = (data_format == "NTC")
+    channel_dim = -1 if channel_last else 1
+    num_channels = input.shape[channel_dim]
+    num_filters = weight.shape[0]
+    if num_channels < 0:
+        raise ValueError("The channel dimension of the input({}) "
+                         "should be defined. Received: {}.".format(
+                             input.shape, num_channels))
+    if num_channels % groups != 0:
+        raise ValueError(
+            "the channel of input must be divisible by groups,"
+            "received: the channel of input is {}, the shape of input is {}"
+            ", the groups is {}".format(num_channels, input.shape, groups))
+    if num_filters % groups != 0:
+        raise ValueError(
+            "the number of filters must be divisible by groups,"
+            "received: the number of filters is {}, the shape of weight is {}"
+            ", the groups is {}".format(num_filters, weight.shape, groups))
+
+    # update attrs
+    padding, padding_algorithm = _update_padding_nd(padding, channel_last, 1)
+    if len(padding) == 1:  # symmetric padding
+        padding = [0,] + padding
+    else:
+        # len(padding) == 2
+        padding = [0, 0] + padding
+    stride = [1,] + utils.convert_to_list(stride, 1, 'stride')
+    dilation = [1,] + utils.convert_to_list(dilation, 1, 'dilation')
+    data_format = "NHWC" if channel_last else "NCHW"
+
+    l_type = "conv2d"
+
+    if (num_channels == groups and num_filters % num_channels == 0 and
+            not use_cudnn):
+        l_type = 'depthwise_conv2d'
+    weight = F.unsqueeze(weight, [2])
+    input = F.unsqueeze(input, [1]) if channel_last else F.unsqueeze(input, [2])
+
+    if in_dygraph_mode():
+        attrs = ('strides', stride, 'paddings', padding, 'dilations', dilation,
+                 'groups', groups, 'use_cudnn', use_cudnn, 'use_mkldnn', False,
+                 'fuse_relu_before_depthwise_conv', False, "padding_algorithm",
+                 padding_algorithm, "data_format", data_format)
+        pre_bias = getattr(core.ops, l_type)(input, weight, *attrs)
+        if bias is not None:
+            pre_act = nn.elementwise_add(pre_bias, bias, axis=channel_dim)
+        else:
+            pre_act = pre_bias
+        out = dygraph_utils._append_activation_in_dygraph(
+            pre_act, act, use_cudnn=use_cudnn)
+    else:
+        inputs = {'Input': [input], 'Filter': [weight]}
+        attrs = {
+            'strides': stride,
+            'paddings': padding,
+            'dilations': dilation,
+            'groups': groups,
+            'use_cudnn': use_cudnn,
+            'use_mkldnn': False,
+            'fuse_relu_before_depthwise_conv': False,
+            "padding_algorithm": padding_algorithm,
+            "data_format": data_format
+        }
+        check_variable_and_dtype(input, 'input',
+                                 ['float16', 'float32', 'float64'], 'conv2d')
+        helper = LayerHelper(l_type, **locals())
+        dtype = helper.input_dtype()
+        pre_bias = helper.create_variable_for_type_inference(dtype)
+        outputs = {"Output": [pre_bias]}
+        helper.append_op(
+            type=l_type, inputs=inputs, outputs=outputs, attrs=attrs)
+        if bias is not None:
+            pre_act = nn.elementwise_add(pre_bias, bias, axis=channel_dim)
+        else:
+            pre_act = pre_bias
+        out = helper.append_activation(pre_act)
+    out = F.squeeze(out, [1]) if channel_last else F.squeeze(out, [2])
+    return out
+
+class Conv1D(layers.Layer):
+    def __init__(self,
+                 num_channels,
+                 num_filters,
+                 filter_size,
+                 padding=0,
+                 stride=1,
+                 dilation=1,
+                 groups=1,
+                 param_attr=None,
+                 bias_attr=None,
+                 use_cudnn=True,
+                 act=None,
+                 data_format="NCT",
+                 dtype='float32'):
+        super(Conv1D, self).__init__()
+        assert param_attr is not False, "param_attr should not be False here."
+        self._num_channels = num_channels
+        self._num_filters = num_filters
+        self._groups = groups
+        if num_channels % groups != 0:
+            raise ValueError("num_channels must be divisible by groups.")
+        self._act = act
+        self._data_format = data_format
+        self._dtype = dtype
+        if not isinstance(use_cudnn, bool):
+            raise ValueError("use_cudnn should be True or False")
+        self._use_cudnn = use_cudnn
+
+        self._filter_size = utils.convert_to_list(filter_size, 1, 'filter_size')
+        self._stride = utils.convert_to_list(stride, 1, 'stride')
+        self._dilation = utils.convert_to_list(dilation, 1, 'dilation')
+        channel_last = (data_format == "NTC")
+        self._padding = padding  # leave it to F.conv1d
+
+        self._param_attr = param_attr
+        self._bias_attr = bias_attr
+
+        num_filter_channels = num_channels // groups
+        filter_shape = [self._num_filters, num_filter_channels
+                        ] + self._filter_size
+
+        self.weight = self.create_parameter(
+            attr=self._param_attr,
+            shape=filter_shape,
+            dtype=self._dtype,
+            default_initializer=_get_default_param_initializer(
+                self._num_channels, filter_shape))
+        self.bias = self.create_parameter(
+            attr=self._bias_attr,
+            shape=[self._num_filters],
+            dtype=self._dtype,
+            is_bias=True)
+
+    def forward(self, input):
+        out = conv1d(
+            input,
+            self.weight,
+            bias=self.bias,
+            padding=self._padding,
+            stride=self._stride,
+            dilation=self._dilation,
+            groups=self._groups,
+            use_cudnn=self._use_cudnn,
+            act=self._act,
+            data_format=self._data_format)
+        return out
+
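The `conv1d` function added above lifts a 1D convolution to a 2D one by inserting a singleton height axis into both the input and the weight, prepending `1` to the stride and dilation, and squeezing the axis back out at the end. The following is a minimal NumPy sketch of that idea for the `NCT` layout only, assuming no padding and `groups=1`. The helper names (`conv1d_naive`, `conv2d_naive`, `conv1d_via_conv2d`) are hypothetical and are not part of Parakeet or Paddle.

```python
import numpy as np


def conv1d_naive(x, w, stride=1, dilation=1):
    """Reference 1D convolution. x: (B, C_in, T), w: (C_out, C_in, K)."""
    B, C_in, T = x.shape
    C_out, _, K = w.shape
    T_out = (T - dilation * (K - 1) - 1) // stride + 1
    out = np.zeros((B, C_out, T_out), dtype=x.dtype)
    for t in range(T_out):
        taps = t * stride + dilation * np.arange(K)  # time indices under the filter
        window = x[:, :, taps]  # (B, C_in, K)
        out[:, :, t] = np.einsum("bck,ock->bo", window, w)
    return out


def conv2d_naive(x, w, stride=(1, 1), dilation=(1, 1)):
    """Reference 2D convolution. x: (B, C_in, H, W), w: (C_out, C_in, KH, KW)."""
    B, C_in, H, W = x.shape
    C_out, _, KH, KW = w.shape
    H_out = (H - dilation[0] * (KH - 1) - 1) // stride[0] + 1
    W_out = (W - dilation[1] * (KW - 1) - 1) // stride[1] + 1
    out = np.zeros((B, C_out, H_out, W_out), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            rows = i * stride[0] + dilation[0] * np.arange(KH)
            cols = j * stride[1] + dilation[1] * np.arange(KW)
            window = x[:, :, rows[:, None], cols[None, :]]  # (B, C_in, KH, KW)
            out[:, :, i, j] = np.einsum("bchw,ochw->bo", window, w)
    return out


def conv1d_via_conv2d(x, w, stride=1, dilation=1):
    """Same trick as the diff: unsqueeze axis 2, run conv2d, squeeze axis 2."""
    x4 = x[:, :, None, :]  # (B, C_in, 1, T)
    w4 = w[:, :, None, :]  # (C_out, C_in, 1, K)
    out4 = conv2d_naive(x4, w4, stride=(1, stride), dilation=(1, dilation))
    return out4[:, :, 0, :]


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 10))
w = rng.standard_normal((4, 3, 3))
direct = conv1d_naive(x, w, stride=2, dilation=2)
lifted = conv1d_via_conv2d(x, w, stride=2, dilation=2)
```

Both paths should agree up to floating point error, which is why the layer can reuse the existing `conv2d` / `depthwise_conv2d` operators instead of a dedicated 1D kernel.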
--- a/parakeet/models/deepvoice3/conv1dglu.py
+++ b/parakeet/models/deepvoice3/conv1dglu.py
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import division
-import numpy as np
-
-from paddle import fluid
-import paddle.fluid.dygraph as dg
-import paddle.fluid.layers as F
-import paddle.fluid.initializer as I
-
-from parakeet.modules.weight_norm import Conv1D, Conv1DCell, Conv2D, Linear
-
-
-class Conv1DGLU(dg.Layer):
-    """
-    A Convolution 1D block with GLU activation. It also applies dropout to the input x. It integrates speaker embeddings through a Linear layer activated by softsign. It has a residual connection from the input x, and scales the output by np.sqrt(0.5).
-    """
-
-    def __init__(self,
-                 n_speakers,
-                 speaker_dim,
-                 in_channels,
-                 num_filters,
-                 filter_size=1,
-                 dilation=1,
-                 std_mul=4.0,
-                 dropout=0.0,
-                 causal=False,
-                 residual=True):
-        """Create a Conv1DGLU block.
-
-        Args:
-            n_speakers (int): number of speakers.
-            speaker_dim (int): speaker embedding's size.
-            in_channels (int): channels of the input.
-            num_filters (int): channels of the output.
-            filter_size (int, optional): filter size of the internal Conv1DCell. Defaults to 1.
-            dilation (int, optional): dilation of the internal Conv1DCell. Defaults to 1.
-            std_mul (float, optional): multiplier used when initializing the convolution weights; the standard deviation is np.sqrt(std_mul * (1 - dropout) / (filter_size * in_channels)). Defaults to 4.0.
-            dropout (float, optional): dropout probability. Defaults to 0.0.
-            causal (bool, optional): padding of the Conv1DCell. It should be True if the `add_input` method of `Conv1DCell` is ever used. Defaults to False.
-            residual (bool, optional): whether to use residual connection. If True, in_channels should equal num_filters. Defaults to True.
-        """
-        super(Conv1DGLU, self).__init__()
-        # conv spec
-        self.in_channels = in_channels
-        self.n_speakers = n_speakers
-        self.speaker_dim = speaker_dim
-        self.num_filters = num_filters
-        self.filter_size = filter_size
-        self.dilation = dilation
-
-        # padding
-        self.causal = causal
-
-        # weight init and dropout
-        self.std_mul = std_mul
-        self.dropout = dropout
-
-        self.residual = residual
-        if residual:
-            assert (
-                in_channels == num_filters
-            ), "this block uses residual connection, "\
-                "the in_channels should equal num_filters"
-        std = np.sqrt(std_mul * (1 - dropout) / (filter_size * in_channels))
-        self.conv = Conv1DCell(
-            in_channels,
-            2 * num_filters,
-            filter_size,
-            dilation,
-            causal,
-            param_attr=I.Normal(scale=std))
-
-        if n_speakers > 1:
-            assert (speaker_dim is not None
-                    ), "speaker embed should not be null in multi-speaker case"
-            std = np.sqrt(1 / speaker_dim)
-            self.fc = Linear(
-                speaker_dim, num_filters, param_attr=I.Normal(scale=std))
-
-    def forward(self, x, speaker_embed=None):
-        """
-        Args:
-            x (Variable): shape(B, C_in, T), dtype float32, the input of Conv1DGLU layer, where B means batch_size, C_in means the input channels, and T means input time steps.
-            speaker_embed (Variable): shape(B, C_sp), dtype float32, speaker embed, where C_sp means speaker embedding size.
-
-        Returns:
-            x (Variable): shape(B, C_out, T), the output of Conv1DGLU, where
-                C_out means the `num_filters`.
-        """
-        residual = x
-        x = F.dropout(
-            x, self.dropout, dropout_implementation="upscale_in_train")
-        x = self.conv(x)
-        content, gate = F.split(x, num_or_sections=2, dim=1)
-
-        if speaker_embed is not None:
-            sp = F.softsign(self.fc(speaker_embed))
-            content = F.elementwise_add(content, sp, axis=0)
-
-        # glu
-        x = F.sigmoid(gate) * content
-
-        if self.residual:
-            x = F.scale(x + residual, np.sqrt(0.5))
-        return x
-
-    def start_sequence(self):
-        """Prepare the Conv1DGLU to generate a new sequence. This method should be called before starting calling `add_input` multiple times.
-        """
-        self.conv.start_sequence()
-
-    def add_input(self, x_t, speaker_embed=None):
-        """
-        Takes a step of inputs and returns a step of outputs. It works similarly to the `forward` method, but in a `step-in-step-out` fashion.
-
-        Args:
-            x_t (Variable): shape(B, C_in, T=1), dtype float32, the input of Conv1DGLU layer, where B means batch_size, C_in means the input channels.
-            speaker_embed (Variable): Shape(B, C_sp), dtype float32, speaker embed, where C_sp means speaker embedding size. 
-
-        Returns:
-            x (Variable): shape(B, C_out), the output of Conv1DGLU, where C_out means the `num_filters`.
-        """
-        residual = x_t
-        x_t = F.dropout(
-            x_t, self.dropout, dropout_implementation="upscale_in_train")
-        x_t = self.conv.add_input(x_t)
-        content_t, gate_t = F.split(x_t, num_or_sections=2, dim=1)
-
-        if speaker_embed is not None:
-            sp = F.softsign(self.fc(speaker_embed))
-            content_t = F.elementwise_add(content_t, sp, axis=0)
-
-        # glu
-        x_t = F.sigmoid(gate_t) * content_t
-
-        if self.residual:
-            x_t = F.scale(x_t + residual, np.sqrt(0.5))
-        return x_t
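The gating and residual arithmetic in the removed `Conv1DGLU.forward` can be summarized with a small NumPy sketch. It assumes the convolution output is already computed (with `2 * num_filters` channels) and that the speaker projection has already gone through the `Linear` layer. The function name `glu_block` and its arguments are hypothetical, chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def glu_block(conv_out, residual=None, speaker_proj=None):
    """conv_out: (B, 2*C, T); residual: (B, C, T); speaker_proj: (B, C)."""
    # split the channels into a content half and a gate half
    content, gate = np.split(conv_out, 2, axis=1)
    if speaker_proj is not None:
        # softsign-activated speaker bias, broadcast over the time axis
        sp = speaker_proj / (1.0 + np.abs(speaker_proj))
        content = content + sp[:, :, None]
    out = sigmoid(gate) * content  # gated linear unit
    if residual is not None:
        # scale by sqrt(0.5) after adding the residual, as in the removed code
        out = (out + residual) * np.sqrt(0.5)
    return out


content = np.ones((2, 4, 5))
gate = np.zeros((2, 4, 5))
conv_out = np.concatenate([content, gate], axis=1)
y = glu_block(conv_out, residual=content)
```

With a zero gate the sigmoid is 0.5, so each output element here is `(0.5 + 1.0) * sqrt(0.5)`. The `sqrt(0.5)` factor is the same scaling the removed code applies via `F.scale`.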
--- a/parakeet/models/deepvoice3/converter.py
+++ b/parakeet/models/deepvoice3/converter.py
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import division
-import numpy as np
-from itertools import chain
-
-import paddle.fluid.layers as F
-import paddle.fluid.initializer as I
-import paddle.fluid.dygraph as dg
-
-from parakeet.modules.weight_norm import Conv1D, Conv1DTranspose, Conv2D, Conv2DTranspose, Linear
-from parakeet.models.deepvoice3.conv1dglu import Conv1DGLU
-from parakeet.models.deepvoice3.encoder import ConvSpec
-
-
-def upsampling_4x_blocks(n_speakers, speaker_dim, target_channels, dropout):
-    """Return a list of Layers that upsamples the input by 4 times in time dimension.
-
-    Args:
-        n_speakers (int): number of speakers of the Conv1DGLU layers used.
-        speaker_dim (int): speaker embedding size of the Conv1DGLU layers used.
-        target_channels (int): channels of the input and the output.(the list of layers does not change the number of channels.)
-        dropout (float): dropout probability.
-
-    Returns:
-        List[Layer]: upsampling layers.
-    """
-    # upsampling convolutions
-    upsampling_convolutions = [
-        Conv1DTranspose(
-            target_channels,
-            target_channels,
-            2,
-            stride=2,
-            param_attr=I.Normal(scale=np.sqrt(1 / (2 * target_channels)))),
-        Conv1DGLU(
-            n_speakers,
-            speaker_dim,
-            target_channels,
-            target_channels,
-            3,
-            dilation=1,
-            std_mul=1.,
-            dropout=dropout),
-        Conv1DGLU(
-            n_speakers,
-            speaker_dim,
-            target_channels,
-            target_channels,
-            3,
-            dilation=3,
-            std_mul=4.,
-            dropout=dropout),
-        Conv1DTranspose(
-            target_channels,
-            target_channels,
-            2,
-            stride=2,
-            param_attr=I.Normal(scale=np.sqrt(4. / (2 * target_channels)))),
-        Conv1DGLU(
-            n_speakers,
-            speaker_dim,
-            target_channels,
-            target_channels,
-            3,
-            dilation=1,
-            std_mul=1.,
-            dropout=dropout),
-        Conv1DGLU(
-            n_speakers,
-            speaker_dim,
-            target_channels,
-            target_channels,
-            3,
-            dilation=3,
-            std_mul=4.,
-            dropout=dropout),
-    ]
-    return upsampling_convolutions
-
-
-def upsampling_2x_blocks(n_speakers, speaker_dim, target_channels, dropout):
-    """Return a list of Layers that upsamples the input by 2 times in time dimension.
-
-    Args:
-        n_speakers (int): number of speakers of the Conv1DGLU layers used.
-        speaker_dim (int): speaker embedding size of the Conv1DGLU layers used.
-        target_channels (int): channels of the input and the output.(the list of layers does not change the number of channels.)
-        dropout (float): dropout probability.
-
-    Returns:
-        List[Layer]: upsampling layers.
-    """
-    upsampling_convolutions = [
-        Conv1DTranspose(
-            target_channels,
-            target_channels,
-            2,
-            stride=2,
-            param_attr=I.Normal(scale=np.sqrt(1. / (2 * target_channels)))),
-        Conv1DGLU(
-            n_speakers,
-            speaker_dim,
-            target_channels,
-            target_channels,
-            3,
-            dilation=1,
-            std_mul=1.,
-            dropout=dropout), Conv1DGLU(
-                n_speakers,
-                speaker_dim,
-                target_channels,
-                target_channels,
-                3,
-                dilation=3,
-                std_mul=4.,
-                dropout=dropout)
-    ]
-    return upsampling_convolutions
-
-
-def upsampling_1x_blocks(n_speakers, speaker_dim, target_channels, dropout):
-    """Return a list of Layers that keeps the time dimension of the input unchanged (1x upsampling).
-
-    Args:
-        n_speakers (int): number of speakers of the Conv1DGLU layers used.
-        speaker_dim (int): speaker embedding size of the Conv1DGLU layers used.
-        target_channels (int): channels of the input and the output.(the list of layers does not change the number of channels.)
-        dropout (float): dropout probability.
-
-    Returns:
-        List[Layer]: upsampling layers.
-    """
-    upsampling_convolutions = [
-        Conv1DGLU(
-            n_speakers,
-            speaker_dim,
-            target_channels,
-            target_channels,
-            3,
-            dilation=3,
-            std_mul=4.,
-            dropout=dropout)
-    ]
-    return upsampling_convolutions
-
-
-class Converter(dg.Layer):
-    def __init__(self,
-                 n_speakers,
-                 speaker_dim,
-                 in_channels,
-                 linear_dim,
-                 convolutions=(ConvSpec(256, 5, 1), ) * 4,
-                 time_upsampling=1,
-                 dropout=0.0):
-        """Converter that transforms mel spectrogram (or decoder hidden states) to linear spectrogram.
-
-        Args:
-            n_speakers (int): number of speakers.
-            speaker_dim (int): speaker embedding size.
-            in_channels (int): channels of the input.
-            linear_dim (int): channels of the linear spectrogram.
-            convolutions (Iterable[ConvSpec], optional): specifications of the internal convolutional layers. ConvSpec is a namedtuple of (output_channels, filter_size, dilation) Defaults to (ConvSpec(256, 5, 1), )*4.
-            time_upsampling (int, optional): time upsampling factor of the converter, possible options are {1, 2, 4}. Note that this should equal the downsample factor of the mel spectrogram. Defaults to 1.
-            dropout (float, optional): dropout probability. Defaults to 0.0.
-        """
-        super(Converter, self).__init__()
-
-        self.n_speakers = n_speakers
-        self.speaker_dim = speaker_dim
-        self.in_channels = in_channels
-        self.linear_dim = linear_dim
-        # CAUTION: this should equal the downsampling factor of the mel spectrogram
-        self.time_upsampling = time_upsampling
-        self.dropout = dropout
-
-        target_channels = convolutions[0].out_channels
-
-        # conv proj to target channels
-        self.first_conv_proj = Conv1D(
-            in_channels,
-            target_channels,
-            1,
-            param_attr=I.Normal(scale=np.sqrt(1 / in_channels)))
-
-        # Idea from nyanko
-        if time_upsampling == 4:
-            self.upsampling_convolutions = dg.LayerList(
-                upsampling_4x_blocks(n_speakers, speaker_dim, target_channels,
-                                     dropout))
-        elif time_upsampling == 2:
-            self.upsampling_convolutions = dg.LayerList(
-                upsampling_2x_blocks(n_speakers, speaker_dim, target_channels,
-                                     dropout))
-        elif time_upsampling == 1:
-            self.upsampling_convolutions = dg.LayerList(
-                upsampling_1x_blocks(n_speakers, speaker_dim, target_channels,
-                                     dropout))
-        else:
-            raise ValueError(
-                "Upsampling factors other than {1, 2, 4} are not supported.")
-
-        # post conv layers
-        std_mul = 4.0
-        in_channels = target_channels
-        self.convolutions = dg.LayerList()
-        for (out_channels, filter_size, dilation) in convolutions:
-            if in_channels != out_channels:
-                std = np.sqrt(std_mul / in_channels)
-                # CAUTION: relu
-                self.convolutions.append(
-                    Conv1D(
-                        in_channels,
-                        out_channels,
-                        1,
-                        act="relu",
-                        param_attr=I.Normal(scale=std)))
-                in_channels = out_channels
-                std_mul = 2.0
-            self.convolutions.append(
-                Conv1DGLU(
-                    n_speakers,
-                    speaker_dim,
-                    in_channels,
-                    out_channels,
-                    filter_size,
-                    dilation=dilation,
-                    std_mul=std_mul,
-                    dropout=dropout))
-            in_channels = out_channels
-            std_mul = 4.0
-
-        # final conv proj, channel transformed to linear dim
-        std = np.sqrt(std_mul * (1 - dropout) / in_channels)
-        # CAUTION: sigmoid
-        self.last_conv_proj = Conv1D(
-            in_channels,
-            linear_dim,
-            1,
-            act="sigmoid",
-            param_attr=I.Normal(scale=std))
-
-    def forward(self, x, speaker_embed=None):
-        """
-        Convert mel spectrogram or decoder hidden states to linear spectrogram.
-        
-        Args:
-            x (Variable): Shape(B, T_mel, C_in), dtype float32, converter inputs, where C_in means the input channels of the converter.
-                When the mel spectrogram is used as the input of the converter, C_in = C_mel; when decoder states are used as the input, C_in = C_dec // r.
-            speaker_embed (Variable, optional): shape(B, C_sp), dtype float32, speaker embedding, where C_sp means the speaker embedding size.
-
-        Returns:
-            out (Variable): Shape(B, T_lin, C_lin), the output linear spectrogram, where C_lin means the channels of the linear spectrogram and T_lin means the length (time steps) of the linear spectrogram. T_lin = time_upsampling * T_mel, which depends on the time_upsampling of the converter.
-        """
-        x = F.transpose(x, [0, 2, 1])
-        x = self.first_conv_proj(x)
-
-        if speaker_embed is not None:
-            speaker_embed = F.dropout(
-                speaker_embed,
-                self.dropout,
-                dropout_implementation="upscale_in_train")
-
-        for layer in chain(self.upsampling_convolutions, self.convolutions):
-            if isinstance(layer, Conv1DGLU):
-                x = layer(x, speaker_embed)
-            else:
-                x = layer(x)
-
-        out = self.last_conv_proj(x)
-        out = F.transpose(out, [0, 2, 1])
-        return out
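The removed upsampling helpers rely on `Conv1DTranspose` layers with filter size 2 and stride 2, each of which doubles the number of time steps; the 4x variant stacks two of them. A minimal NumPy sketch of why the length doubles is given below. It assumes no padding and a weight layout of `(C_in, C_out, K)`, and the function name `conv1d_transpose_naive` is hypothetical rather than the Parakeet implementation. The `Conv1DGLU` layers between the transposed convolutions are assumed to preserve length and are omitted.

```python
import numpy as np


def conv1d_transpose_naive(x, w, stride):
    """x: (B, C_in, T), w: (C_in, C_out, K) -> (B, C_out, (T - 1) * stride + K)."""
    B, C_in, T = x.shape
    _, C_out, K = w.shape
    out = np.zeros((B, C_out, (T - 1) * stride + K), dtype=x.dtype)
    for t in range(T):
        # each input step scatters a K-wide patch into the output
        patch = np.einsum("bc,cok->bok", x[:, :, t], w)
        out[:, :, t * stride:t * stride + K] += patch
    return out


x = np.ones((1, 2, 5))
w = np.ones((2, 2, 2))
y2 = conv1d_transpose_naive(x, w, stride=2)   # one transposed conv: 2x
y4 = conv1d_transpose_naive(y2, w, stride=2)  # two stacked: 4x
```

Because the filter size equals the stride, the scattered patches do not overlap, so the output length is exactly `stride * T`. This is consistent with the docstring's requirement that `time_upsampling` match the downsample factor of the mel spectrogram.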
--- a/parakeet/models/deepvoice3/decoder.py
+++ b/parakeet/models/deepvoice3/decoder.py
--- a/parakeet/models/deepvoice3/encoder.py
+++ b/parakeet/models/deepvoice3/encoder.py
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import division
-import numpy as np
-from collections import namedtuple
-
-import paddle.fluid.layers as F
-import paddle.fluid.initializer as I
-import paddle.fluid.dygraph as dg
-
-from parakeet.modules.weight_norm import Conv1D, Linear
-from parakeet.models.deepvoice3.conv1dglu import Conv1DGLU
-
-ConvSpec = namedtuple("ConvSpec", ["out_channels", "filter_size", "dilation"])
-
-
-class Encoder(dg.Layer):
-    def __init__(self,
-                 n_vocab,
-                 embed_dim,
-                 n_speakers,
-                 speaker_dim,
-                 padding_idx=None,
-                 embedding_weight_std=0.1,
-                 convolutions=(ConvSpec(64, 5, 1), ) * 7,
-                 dropout=0.):
-        """Encoder of Deep Voice 3.
-
-        Args:
-            n_vocab (int): vocabulary size of the text embedding.
-            embed_dim (int): embedding size of the text embedding.
-            n_speakers (int): number of speakers.
-            speaker_dim (int): speaker embedding size.
-            padding_idx (int, optional): padding index of text embedding. Defaults to None.
-            embedding_weight_std (float, optional): standard deviation of the embedding weights when initialized. Defaults to 0.1.
-            convolutions (Iterable[ConvSpec], optional): specifications of the convolutional layers. ConvSpec is a namedtuple of output channels, filter_size and dilation. Defaults to (ConvSpec(64, 5, 1), )*7.
-            dropout (float, optional): dropout probability. Defaults to 0.0.
-        """
-        super(Encoder, self).__init__()
-        self.embedding_weight_std = embedding_weight_std
-        self.embed = dg.Embedding(
-            (n_vocab, embed_dim),
-            padding_idx=padding_idx,
-            param_attr=I.Normal(scale=embedding_weight_std))
-
-        self.dropout = dropout
-        if n_speakers > 1:
-            std = np.sqrt((1 - dropout) / speaker_dim)
-            self.sp_proj1 = Linear(
-                speaker_dim,
-                embed_dim,
-                act="softsign",
-                param_attr=I.Normal(scale=std))
-            self.sp_proj2 = Linear(
-                speaker_dim,
-                embed_dim,
-                act="softsign",
-                param_attr=I.Normal(scale=std))
-        self.n_speakers = n_speakers
-
-        self.convolutions = dg.LayerList()
-        in_channels = embed_dim
-        std_mul = 1.0
-        for (out_channels, filter_size, dilation) in convolutions:
-            # 1 * 1 convolution & relu
-            if in_channels != out_channels:
-                std = np.sqrt(std_mul / in_channels)
-                self.convolutions.append(
-                    Conv1D(
-                        in_channels,
-                        out_channels,
-                        1,
-                        act="relu",
-                        param_attr=I.Normal(scale=std)))
-                in_channels = out_channels
-                std_mul = 2.0
-
-            self.convolutions.append(
-                Conv1DGLU(
-                    n_speakers,
-                    speaker_dim,
-                    in_channels,
-                    out_channels,
-                    filter_size,
-                    dilation,
-                    std_mul,
-                    dropout,
-                    causal=False,
-                    residual=True))
-            in_channels = out_channels
-            std_mul = 4.0
-
-        std = np.sqrt(std_mul * (1 - dropout) / in_channels)
-        self.convolutions.append(
-            Conv1D(
-                in_channels, embed_dim, 1, param_attr=I.Normal(scale=std)))
-
-    def forward(self, x, speaker_embed=None):
-        """
-        Encode text sequence.
-        
-        Args:
-            x (Variable): shape(B, T_enc), dtype: int64. The input text indices. T_enc means the timesteps of the encoder input x.
-            speaker_embed (Variable, optional): shape(B, C_sp), dtype float32, speaker embeddings. This arg is not None only when the model is a multispeaker model.
-
-        Returns:
-            keys (Variable), Shape(B, T_enc, C_emb), dtype float32, the encoded representation for keys, where C_emb means the text embedding size.
-            values (Variable), Shape(B, T_enc, C_emb), dtype float32, the encoded representation for values.
-        """
-        x = self.embed(x)
-        x = F.dropout(
-            x, self.dropout, dropout_implementation="upscale_in_train")
-        x = F.transpose(x, [0, 2, 1])
-
-        if self.n_speakers > 1 and speaker_embed is not None:
-            speaker_embed = F.dropout(
-                speaker_embed,
-                self.dropout,
-                dropout_implementation="upscale_in_train")
-            x = F.elementwise_add(x, self.sp_proj1(speaker_embed), axis=0)
-
-        input_embed = x
-        for layer in self.convolutions:
-            if isinstance(layer, Conv1DGLU):
-                x = layer(x, speaker_embed)
-            else:
-                # layer is a Conv1D with (1,) filter wrapped by WeightNormWrapper
-                x = layer(x)
-
-        if self.n_speakers > 1 and speaker_embed is not None:
-            x = F.elementwise_add(x, self.sp_proj2(speaker_embed), axis=0)
-
-        keys = x  # (B, C, T)
-        values = F.scale(input_embed + x, scale=np.sqrt(0.5))
-        keys = F.transpose(keys, [0, 2, 1])
-        values = F.transpose(values, [0, 2, 1])
-        return keys, values
--- a/parakeet/models/deepvoice3/loss.py
+++ b/parakeet/models/deepvoice3/loss.py
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import division
-import numpy as np
-from numba import jit
-
-from paddle import fluid
-import paddle.fluid.layers as F
-import paddle.fluid.dygraph as dg
-
-
-def masked_mean(inputs, mask):
-    """
-    Args:
-        inputs (Variable): shape(B, T, C), dtype float32, the input.
-        mask (Variable): shape(B, T), dtype float32, a mask. 
-    Returns:
-        loss (Variable): shape(1, ), dtype float32, masked mean.
-    """
-    channels = inputs.shape[-1]
-    masked_inputs = F.elementwise_mul(inputs, mask, axis=0)
-    loss = F.reduce_sum(masked_inputs) / (channels * F.reduce_sum(mask))
-    return loss
-
-
-@jit(nopython=True)
-def guided_attention(N, max_N, T, max_T, g):
-    """Generate a diagonal attention guide.
-    
-    Args:
-        N (int): valid length of encoder.
-        max_N (int): max length of encoder.
-        T (int): valid length of decoder.
-        max_T (int): max length of decoder.
-        g (float): sigma to adjust the degree of diagonal guide.
-
-    Returns:
-        np.ndarray: shape(max_N, max_T), dtype float32, the diagonal guide.
-    """
-    W = np.zeros((max_N, max_T), dtype=np.float32)
-    for n in range(N):
-        for t in range(T):
-            W[n, t] = 1 - np.exp(-(n / N - t / T)**2 / (2 * g * g))
-    return W
-
-
-def guided_attentions(encoder_lengths, decoder_lengths, max_decoder_len,
-                      g=0.2):
-    """Generate a diagonal attention guide for a batch.
-
-    Args:
-        encoder_lengths (np.ndarray): shape(B, ), dtype: int64, encoder valid lengths.
-        decoder_lengths (np.ndarray): shape(B, ), dtype: int64, decoder valid lengths.
-        max_decoder_len (int): max length of decoder.
-        g (float, optional): sigma to adjust the degree of diagonal guide. Defaults to 0.2.
-
-    Returns:
-        np.ndarray: shape(B, max_T, max_N), dtype float32, the diagonal guide. (max_N: max encoder length, max_T: max decoder length.)
-    """
-    B = len(encoder_lengths)
-    max_input_len = encoder_lengths.max()
-    W = np.zeros((B, max_decoder_len, max_input_len), dtype=np.float32)
-    for b in range(B):
-        W[b] = guided_attention(encoder_lengths[b], max_input_len,
-                                decoder_lengths[b], max_decoder_len, g).T
-    return W
-
-
-class TTSLoss(object):
-    def __init__(self,
-                 masked_weight=0.0,
-                 priority_bin=None,
-                 priority_weight=0.0,
-                 binary_divergence_weight=0.0,
-                 guided_attention_sigma=0.2,
-                 downsample_factor=4,
-                 r=1):
-        """Compute loss for Deep Voice 3 model.
-
-        Args:
-            masked_weight (float, optional): the weight of masked loss. Defaults to 0.0.
-            priority_bin (int, optional): frequency bands for linear spectrogram loss to be prioritized. Defaults to None.
-            priority_weight (float, optional): weight for the prioritized frequency bands. Defaults to 0.0.
-            binary_divergence_weight (float, optional): weight for binary cross entropy (used for spectrogram loss). Defaults to 0.0.
-            guided_attention_sigma (float, optional): `sigma` for attention guide. Defaults to 0.2.
-            downsample_factor (int, optional): the downsample factor for mel spectrogram. Defaults to 4.
-            r (int, optional): frames per decoder step. Defaults to 1.
-        """
-        self.masked_weight = masked_weight
-        self.priority_bin = priority_bin  # only used for lin-spec loss
-        self.priority_weight = priority_weight  # only used for lin-spec loss
-        self.binary_divergence_weight = binary_divergence_weight
-        self.guided_attention_sigma = guided_attention_sigma
-
-        self.time_shift = r
-        self.r = r
-        self.downsample_factor = downsample_factor
-
-    def l1_loss(self, prediction, target, mask, priority_bin=None):
-        """L1 loss for spectrogram.
-
-        Args:
-            prediction (Variable): shape(B, T, C), dtype float32, predicted spectrogram.
-            target (Variable): shape(B, T, C), dtype float32, target spectrogram.
-            mask (Variable): shape(B, T), mask.
-            priority_bin (int, optional): frequency bands for linear spectrogram loss to be prioritized. Defaults to None.
-
-        Returns:
-            Variable: shape(1,), dtype float32, l1 loss(with mask and possibly priority bin applied.)
-        """
-        abs_diff = F.abs(prediction - target)
-
-        # basic mask-weighted l1 loss
-        w = self.masked_weight
-        if w > 0 and mask is not None:
-            base_l1_loss = w * masked_mean(abs_diff, mask) \
-                         + (1 - w) * F.reduce_mean(abs_diff)
-        else:
-            base_l1_loss = F.reduce_mean(abs_diff)
-
-        if self.priority_weight > 0 and priority_bin is not None:
-            # mask-weighted priority channels' l1-loss
-            priority_abs_diff = abs_diff[:, :, :priority_bin]
-            if w > 0 and mask is not None:
-                priority_loss = w * masked_mean(priority_abs_diff, mask) \
-                              + (1 - w) * F.reduce_mean(priority_abs_diff)
-            else:
-                priority_loss = F.reduce_mean(priority_abs_diff)
-
-            # priority weighted sum
-            p = self.priority_weight
-            loss = p * priority_loss + (1 - p) * base_l1_loss
-        else:
-            loss = base_l1_loss
-        return loss
-
-    def binary_divergence(self, prediction, target, mask):
-        """Binary cross entropy loss for spectrogram. All the values in the spectrogram are treated as probabilities in a logistic regression.
-
-        Args:
-            prediction (Variable): shape(B, T, C), dtype float32, predicted spectrogram.
-            target (Variable): shape(B, T, C), dtype float32, target spectrogram.
-            mask (Variable): shape(B, T), mask.
-
-        Returns:
-            Variable: shape(1,), dtype float32, binary cross entropy loss.
-        """
-        flattened_prediction = F.reshape(prediction, [-1, 1])
-        flattened_target = F.reshape(target, [-1, 1])
-        flattened_loss = F.log_loss(
-            flattened_prediction, flattened_target, epsilon=1e-8)
-        bin_div = fluid.layers.reshape(flattened_loss, prediction.shape)
-
-        w = self.masked_weight
-        if w > 0 and mask is not None:
-            loss = w * masked_mean(bin_div, mask) \
-                 + (1 - w) * F.reduce_mean(bin_div)
-        else:
-            loss = F.reduce_mean(bin_div)
-        return loss
-
-    @staticmethod
-    def done_loss(done_hat, done):
-        """Compute done loss
-
-        Args:
-            done_hat (Variable): shape(B, T), dtype float32, predicted done probability(the probability that the final frame has been generated.)
-            done (Variable): shape(B, T), dtype float32, ground truth done probability(the probability that the final frame has been generated.)
-
-        Returns:
-            Variable: shape(1, ), dtype float32, done loss.
-        """
-        flat_done_hat = F.reshape(done_hat, [-1, 1])
-        flat_done = F.reshape(done, [-1, 1])
-        loss = F.log_loss(flat_done_hat, flat_done, epsilon=1e-8)
-        loss = F.reduce_mean(loss)
-        return loss
-
-    def attention_loss(self, predicted_attention, input_lengths,
-                       target_lengths):
-        """
-        Given valid encoder_lengths and decoder_lengths, compute a diagonal guide, and compute loss from the predicted attention and the guide.
-        
-        Args:
-            predicted_attention (Variable): shape(*, B, T_dec, T_enc), dtype float32, the alignment tensor, where B means batch size, T_dec means number of time steps of the decoder, T_enc means the number of time steps of the encoder, * means other possible dimensions.
-            input_lengths (numpy.ndarray): shape(B,), dtype:int64, valid lengths (time steps) of encoder outputs.
-            target_lengths (numpy.ndarray): shape(batch_size,), dtype:int64, valid lengths (time steps) of decoder outputs.
-        
-        Returns:
-            loss (Variable): shape(1, ), dtype float32, attention loss.
-        """
-        n_attention, batch_size, max_target_len, max_input_len = (
-            predicted_attention.shape)
-        soft_mask = guided_attentions(input_lengths, target_lengths,
-                                      max_target_len,
-                                      self.guided_attention_sigma)
-        soft_mask_ = dg.to_variable(soft_mask)
-        loss = fluid.layers.reduce_mean(predicted_attention * soft_mask_)
-        return loss
-
-    def __call__(self, outputs, inputs):
-        """Total loss
-
-        Args:
-            outputs is a tuple of (mel_hyp, lin_hyp, attn_hyp, done_hyp).
-            mel_hyp (Variable): shape(B, T, C_mel), dtype float32, predicted mel spectrogram.
-            lin_hyp (Variable): shape(B, T, C_lin), dtype float32, predicted linear spectrogram.
-            done_hyp (Variable): shape(B, T), dtype float32, predicted done probability.
-            attn_hyp (Variable): shape(N, B, T_dec, T_enc), dtype float32, predicted attention.
-
-            inputs is a tuple of (mel_ref, lin_ref, done_ref, input_lengths, n_frames)
-            mel_ref (Variable): shape(B, T, C_mel), dtype float32, ground truth mel spectrogram.
-            lin_ref (Variable): shape(B, T, C_lin), dtype float32, ground truth linear spectrogram.
-            done_ref (Variable): shape(B, T), dtype float32, ground truth done flag.
-            input_lengths (Variable): shape(B, ), dtype: int, encoder valid lengths.
-            n_frames (Variable): shape(B, ), dtype: int, decoder valid lengths.
-
-        Returns:
-            Dict(str, Variable): details of loss.
-        """
-        total_loss = 0.
-
-        mel_hyp, lin_hyp, attn_hyp, done_hyp = outputs
-        mel_ref, lin_ref, done_ref, input_lengths, n_frames = inputs
-
-        # n_frames # mel_lengths # decoder_lengths
-        max_frames = lin_hyp.shape[1]
-        max_mel_steps = max_frames // self.downsample_factor
-        # max_decoder_steps = max_mel_steps // self.r
-        # decoder_mask = F.sequence_mask(n_frames // self.downsample_factor //
-        #                                self.r,
-        #                                max_decoder_steps,
-        #                                dtype="float32")
-        mel_mask = F.sequence_mask(
-            n_frames // self.downsample_factor, max_mel_steps, dtype="float32")
-        lin_mask = F.sequence_mask(n_frames, max_frames, dtype="float32")
-
-        lin_hyp = lin_hyp[:, :-self.time_shift, :]
-        lin_ref = lin_ref[:, self.time_shift:, :]
-        lin_mask = lin_mask[:, self.time_shift:]
-        lin_l1_loss = self.l1_loss(
-            lin_hyp, lin_ref, lin_mask, priority_bin=self.priority_bin)
-        lin_bce_loss = self.binary_divergence(lin_hyp, lin_ref, lin_mask)
-        lin_loss = self.binary_divergence_weight * lin_bce_loss \
-                    + (1 - self.binary_divergence_weight) * lin_l1_loss
-        total_loss += lin_loss
-
-        mel_hyp = mel_hyp[:, :-self.time_shift, :]
-        mel_ref = mel_ref[:, self.time_shift:, :]
-        mel_mask = mel_mask[:, self.time_shift:]
-        mel_l1_loss = self.l1_loss(mel_hyp, mel_ref, mel_mask)
-        mel_bce_loss = self.binary_divergence(mel_hyp, mel_ref, mel_mask)
-        # print("=====>", mel_l1_loss.numpy()[0], mel_bce_loss.numpy()[0])
-        mel_loss = self.binary_divergence_weight * mel_bce_loss \
-                    + (1 - self.binary_divergence_weight) * mel_l1_loss
-        total_loss += mel_loss
-
-        attn_loss = self.attention_loss(attn_hyp,
-                                        input_lengths.numpy(),
-                                        n_frames.numpy() //
-                                        (self.downsample_factor * self.r))
-        total_loss += attn_loss
-
-        done_loss = self.done_loss(done_hyp, done_ref)
-        total_loss += done_loss
-
-        losses = {
-            "loss": total_loss,
-            "mel/mel_loss": mel_loss,
-            "mel/l1_loss": mel_l1_loss,
-            "mel/bce_loss": mel_bce_loss,
-            "lin/lin_loss": lin_loss,
-            "lin/l1_loss": lin_l1_loss,
-            "lin/bce_loss": lin_bce_loss,
-            "done": done_loss,
-            "attn": attn_loss,
-        }
-
-        return losses
--- a/parakeet/models/deepvoice3/model.py
+++ b/parakeet/models/deepvoice3/model.py
--- a/parakeet/models/deepvoice3/position_embedding.py
+++ b/parakeet/models/deepvoice3/position_embedding.py
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import division
-import numpy as np
-from paddle import fluid
-import paddle.fluid.layers as F
-import paddle.fluid.dygraph as dg
-
-
-def lookup(weight, indices, padding_idx):
-    out = fluid.core.ops.lookup_table_v2(
-        weight, indices, 'is_sparse', False, 'is_distributed', False,
-        'remote_prefetch', False, 'padding_idx', padding_idx)
-    return out
-
-
-def compute_position_embedding_single_speaker(radians, speaker_position_rate):
-    """Compute sin/cos interleaved matrix from the radians.
-    
-    Args:
-        radians (Variable): shape(n_vocab, embed_dim), dtype float32, the radians matrix.
-        speaker_position_rate (float or Variable): float or Variable of shape(1, ), speaker positioning rate.
-    
-    Returns:
-        Variable: shape(n_vocab, embed_dim), the sin, cos interleaved matrix.
-    """
-    _, embed_dim = radians.shape
-    scaled_radians = radians * speaker_position_rate
-
-    odd_mask = (np.arange(embed_dim) % 2).astype(np.float32)
-    odd_mask = dg.to_variable(odd_mask)
-
-    out = odd_mask * F.cos(scaled_radians) \
-        + (1 - odd_mask) * F.sin(scaled_radians)
-    return out
-
-
-def compute_position_embedding(radians, speaker_position_rate):
-    """Compute sin/cos interleaved matrix from the radians.
-    
-    Args:
-        radians (Variable): shape(n_vocab, embed_dim), dtype float32, the radians matrix.
-        speaker_position_rate (Variable): shape(B, ), speaker positioning rate.
-    
-    Returns:
-        Variable: shape(B, n_vocab, embed_dim), the sin, cos interleaved matrix.
-    """
-    _, embed_dim = radians.shape
-    batch_size = speaker_position_rate.shape[0]
-    scaled_radians = F.elementwise_mul(
-        F.expand(F.unsqueeze(radians, [0]), [batch_size, 1, 1]),
-        speaker_position_rate,
-        axis=0)
-
-    odd_mask = (np.arange(embed_dim) % 2).astype(np.float32)
-    odd_mask = dg.to_variable(odd_mask)
-
-    out = odd_mask * F.cos(scaled_radians) \
-        + (1 - odd_mask) * F.sin(scaled_radians)
-    out = F.concat(
-        [F.zeros((batch_size, 1, embed_dim), radians.dtype), out[:, 1:, :]],
-        axis=1)
-    return out
-
-
-def position_encoding_init(n_position,
-                           d_pos_vec,
-                           position_rate=1.0,
-                           padding_idx=None):
-    """Init the position encoding.
-
-    Args:
-        n_position (int): max position, vocab size for position embedding.
-        d_pos_vec (int): position embedding size.
-        position_rate (float, optional): position rate (this should only be used when all the utterances are from one speaker.). Defaults to 1.0.
-        padding_idx (int, optional): padding index for the position embedding(it is set as 0 internally if not provided.). Defaults to None.
-
-    Returns:
-        numpy.ndarray: shape(n_position, d_pos_vec), the radians matrix (sin and cos are not applied yet).
-    """
-    # init the position encoding table
-    # keep idx 0 for padding token position encoding zero vector
-    # CAUTION: it is radians here, sin and cos are not applied
-    indices_range = np.expand_dims(np.arange(n_position), -1)
-    embed_range = 2 * (np.arange(d_pos_vec) // 2)
-    radians = position_rate \
-            * indices_range \
-            / np.power(1.e4, embed_range / d_pos_vec)
-    if padding_idx is not None:
-        radians[padding_idx] = 0.
-    return radians
-
-
-class PositionEmbedding(dg.Layer):
-    def __init__(self, n_position, d_pos_vec, position_rate=1.0):
-        """Position Embedding for Deep Voice 3.
-
-        Args:
-            n_position (int): max position, vocab size for position embedding.
-            d_pos_vec (int): position embedding size.
-            position_rate (float, optional): position rate (this should only be used when all the utterances are from one speaker.). Defaults to 1.0.
-        """
-        super(PositionEmbedding, self).__init__()
-        self.weight = self.create_parameter((n_position, d_pos_vec))
-        self.weight.set_value(
-            position_encoding_init(n_position, d_pos_vec, position_rate)
-            .astype("float32"))
-
-    def forward(self, indices, speaker_position_rate=None):
-        """
-        Args:
-            indices (Variable): shape (B, T), dtype: int64, position
-                indices, where B means the batch size, T means the time steps.
-            speaker_position_rate (Variable | float, optional), position
-                rate. It can be a float point number or a Variable with 
-                shape (1,), then this speaker_position_rate is used for every 
-                example. It can also be a Variable with shape (B, ), which 
-                contains a speaker position rate for each utterance.
-        Returns:
-            out (Variable): shape(B, T, C_pos), dtype float32, position embedding, where C_pos 
-                means position embedding size.
-        """
-        batch_size, time_steps = indices.shape
-
-        if isinstance(speaker_position_rate, float) or \
-            (isinstance(speaker_position_rate, fluid.framework.Variable)
-            and list(speaker_position_rate.shape) == [1]):
-            temp_weight = compute_position_embedding_single_speaker(
-                self.weight, speaker_position_rate)
-            out = lookup(temp_weight, indices, 0)
-            return out
-
-        assert len(speaker_position_rate.shape) == 1 and \
-            list(speaker_position_rate.shape) == [batch_size]
-
-        weight = compute_position_embedding(self.weight,
-                                            speaker_position_rate)  # (B, V, C)
-        # make indices for gather_nd
-        batch_id = F.expand(
-            F.unsqueeze(
-                F.range(
-                    0, batch_size, 1, dtype="int64"), [1]), [1, time_steps])
-        # (B, T, 2)
-        gather_nd_id = F.stack([batch_id, indices], -1)
-        out = F.gather_nd(weight, gather_nd_id)
-        return out
--- a/parakeet/models/deepvoice3/weight_norm_hook.py
+++ b/parakeet/models/deepvoice3/weight_norm_hook.py
+import numpy as np
+import paddle
+from paddle import fluid
+import paddle.fluid.dygraph as dg
+import paddle.fluid.layers as F
+from paddle.fluid.layer_helper import LayerHelper
+from paddle.fluid.data_feeder import check_variable_and_dtype
+
+
+def l2_norm(x, axis, epsilon=1e-12, name=None):
+    """Compute the L2 norm of `x` along `axis` and drop that axis."""
+    if len(x.shape) == 1:
+        axis = 0
+    elif axis is None:
+        # fall back to axis 1 so that the op and the squeeze below agree
+        axis = 1
+    check_variable_and_dtype(x, "X", ("float32", "float64"), "norm")
+
+    helper = LayerHelper("l2_normalize", **locals())
+    out = helper.create_variable_for_type_inference(dtype=x.dtype)
+    norm = helper.create_variable_for_type_inference(dtype=x.dtype)
+    helper.append_op(
+        type="norm",
+        inputs={"X": x},
+        outputs={"Out": out,
+                 "Norm": norm},
+        attrs={
+            "axis": axis,
+            "epsilon": epsilon,
+        })
+    return F.squeeze(norm, axes=[axis])
+
+
+def norm_except_dim(p, dim):
+    shape = p.shape
+    ndims = len(shape)
+    if dim is None:
+        return F.sqrt(F.reduce_sum(F.square(p)))
+    elif dim == 0:
+        p_matrix = F.reshape(p, (shape[0], -1))
+        return l2_norm(p_matrix, axis=1)
+    elif dim == -1 or dim == ndims - 1:
+        p_matrix = F.reshape(p, (-1, shape[-1]))
+        return l2_norm(p_matrix, axis=0)
+    else:
+        perm = list(range(ndims))
+        perm[0] = dim
+        perm[dim] = 0
+        p_transposed = F.transpose(p, perm)
+        return norm_except_dim(p_transposed, 0)
+
+def _weight_norm(v, g, dim):
+    """Rebuild the weight as g * v / ||v||, with the norm taken over all axes except `dim`."""
+    shape = v.shape
+    ndims = len(shape)
+
+    if dim is None:
+        v_normalized = v / (F.sqrt(F.reduce_sum(F.square(v))) + 1e-12)
+    elif dim == 0:
+        p_matrix = F.reshape(v, (shape[0], -1))
+        v_normalized = F.l2_normalize(p_matrix, axis=1)
+        v_normalized = F.reshape(v_normalized, shape)
+    elif dim == -1 or dim == ndims - 1:
+        p_matrix = F.reshape(v, (-1, shape[-1]))
+        v_normalized = F.l2_normalize(p_matrix, axis=0)
+        v_normalized = F.reshape(v_normalized, shape)
+    else:
+        perm = list(range(ndims))
+        perm[0] = dim
+        perm[dim] = 0
+        p_transposed = F.transpose(v, perm)
+        transposed_shape = p_transposed.shape
+        p_matrix = F.reshape(p_transposed, (p_transposed.shape[0], -1))
+        v_normalized = F.l2_normalize(p_matrix, axis=1)
+        v_normalized = F.reshape(v_normalized, transposed_shape)
+        v_normalized = F.transpose(v_normalized, perm)
+    weight = F.elementwise_mul(v_normalized, g, axis=dim if dim is not None else -1)
+    return weight
+
+
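+class WeightNorm(object):
+    def __init__(self, name, dim):
+        # NOTE: dim=None is mapped to -1, which norm_except_dim and
+        # _weight_norm treat as the last axis (not as "the whole tensor").
+        if dim is None:
+            dim = -1
+        self.name = name
+        self.dim = dim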
+
+    def compute_weight(self, module):
+        g = getattr(module, self.name + '_g')
+        v = getattr(module, self.name + '_v')
+        w = _weight_norm(v, g, self.dim)
+        return w
+
+    @staticmethod
+    def apply(module: dg.Layer, name, dim):
+        for k, hook in module._forward_pre_hooks.items():
+            if isinstance(hook, WeightNorm) and hook.name == name:
+                raise RuntimeError("Cannot register two weight_norm hooks on "
+                                   "the same parameter {}".format(name))
+
+        if dim is None:
+            dim = -1
+
+        fn = WeightNorm(name, dim)
+
+        # remove w from parameter list
+        w = getattr(module, name)
+        del module._parameters[name]
+
+        # add g and v as new parameters and express w as g/||v|| * v
+        g_var = norm_except_dim(w, dim)
+        v = module.create_parameter(w.shape, dtype=w.dtype)
+        module.add_parameter(name + "_v", v)
+        g = module.create_parameter(g_var.shape, dtype=g_var.dtype)
+        module.add_parameter(name + "_g", g)
+        with dg.no_grad():
+            F.assign(w, v)
+            F.assign(g_var, g)
+        setattr(module, name, fn.compute_weight(module))
+
+        # recompute weight before every forward()
+        module.register_forward_pre_hook(fn)
+        return fn
+
+    def remove(self, module):
+        w_var = self.compute_weight(module)
+        delattr(module, self.name)
+        del module._parameters[self.name + '_g']
+        del module._parameters[self.name + '_v']
+        w = module.create_parameter(w_var.shape, dtype=w_var.dtype)
+        module.add_parameter(self.name, w)
+        with dg.no_grad():
+            F.assign(w_var, w)
+
+    def __call__(self, module, inputs):
+        setattr(module, self.name, self.compute_weight(module))
+
+
+def weight_norm(module, name='weight', dim=0):
+    """Reparameterize the parameter `name` of `module` as g * v / ||v|| and return the module."""
+    WeightNorm.apply(module, name, dim)
+    return module
+
+
+def remove_weight_norm(module, name='weight'):
+    for k, hook in module._forward_pre_hooks.items():
+        if isinstance(hook, WeightNorm) and hook.name == name:
+            hook.remove(module)
+            del module._forward_pre_hooks[k]
+            return module
+
+    raise ValueError("weight_norm of '{}' not found in {}"
+                     .format(name, module))
\ No newline at end of file
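
Note on `weight_norm_hook.py`: the hook stores a direction `v` and a magnitude `g`, and rebuilds the weight as `g * v / ||v||` before each forward pass. The following is a minimal NumPy sketch of that reparameterization, not the Paddle implementation. The helper names `np_norm_except_dim` and `np_weight_norm` are hypothetical stand-ins introduced here for illustration, and the way `eps` enters the division is an assumption.

```python
import numpy as np


def np_norm_except_dim(p, dim):
    """L2 norm of p over every axis except `dim`; result has shape (p.shape[dim],)."""
    dim = dim % p.ndim
    axes = tuple(i for i in range(p.ndim) if i != dim)
    return np.sqrt(np.sum(np.square(p), axis=axes))


def np_weight_norm(v, g, dim, eps=1e-12):
    """Rebuild w = g * v / ||v||, broadcasting g along axis `dim`."""
    dim = dim % v.ndim
    scale = g / (np_norm_except_dim(v, dim) + eps)
    shape = [1] * v.ndim
    shape[dim] = -1  # put the per-slice scale on axis `dim`
    return v * scale.reshape(shape)


# Initialise g and v from an existing weight, as WeightNorm.apply does:
# g starts as the per-slice norm of w, v starts as a copy of w.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3, 5))
g = np_norm_except_dim(w, 0)
v = w.copy()
w_rebuilt = np_weight_norm(v, g, 0)
```

With this initialisation the rebuilt weight matches the original one up to floating-point error, so registering the hook should leave the layer's output unchanged until `g` and `v` are trained.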