Commit f5109761 authored by 小湉湉

Merge branch 'develop' of https://github.com/PaddlePaddle/PaddleSpeech into fix_ci_waveflow

@@ -16,12 +16,15 @@
<p align="center">
<a href="./LICENSE"><img src="https://img.shields.io/badge/license-Apache%202-red.svg"></a>
<a href="https://github.com/PaddlePaddle/PaddleSpeech/releases"><img src="https://img.shields.io/github/v/release/PaddlePaddle/PaddleSpeech?color=ffa"></a>
<a href="support os"><img src="https://img.shields.io/badge/os-linux%2C%20win%2C%20mac-pink.svg"></a>
<a href=""><img src="https://img.shields.io/badge/python-3.7+-aff.svg"></a>
<a href="https://github.com/PaddlePaddle/PaddleSpeech/graphs/contributors"><img src="https://img.shields.io/github/contributors/PaddlePaddle/PaddleSpeech?color=9ea"></a>
<a href="https://github.com/PaddlePaddle/PaddleSpeech/commits"><img src="https://img.shields.io/github/commit-activity/m/PaddlePaddle/PaddleSpeech?color=3af"></a>
<a href="https://github.com/PaddlePaddle/PaddleSpeech/issues"><img src="https://img.shields.io/github/issues/PaddlePaddle/PaddleSpeech?color=9cc"></a>
<a href="https://github.com/PaddlePaddle/PaddleSpeech/stargazers"><img src="https://img.shields.io/github/stars/PaddlePaddle/PaddleSpeech?color=ccf"></a>
<a href="https://pypi.org/project/paddlespeech/"><img src="https://img.shields.io/pypi/dm/PaddleSpeech"></a>
<a href="https://pypi.org/project/paddlespeech/"><img src="https://static.pepy.tech/badge/paddlespeech"></a>
<a href="https://huggingface.co/spaces"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"></a>
</p>
@@ -143,6 +146,8 @@ For more synthesized audios, please refer to [PaddleSpeech Text-to-Speech sample
<div align="center"><a href="https://www.bilibili.com/video/BV1cL411V71o?share_source=copy_web"><img src="https://ai-studio-static-online.cdn.bcebos.com/06fd746ab32042f398fb6f33f873e6869e846fe63c214596ae37860fe8103720" width="500px"></a></div>

- [PaddleSpeech Demo Video](https://paddlespeech.readthedocs.io/en/latest/demo_video.html)

### 🔥 Hot Activities

- 2021.12.21~12.24
@@ -494,7 +499,17 @@ author={PaddlePaddle Authors},
howpublished = {\url{https://github.com/PaddlePaddle/PaddleSpeech}},
year={2021}
}
@inproceedings{zheng2021fused,
title={Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation},
author={Zheng, Renjie and Chen, Junkun and Ma, Mingbo and Huang, Liang},
booktitle={International Conference on Machine Learning},
pages={12736--12746},
year={2021},
organization={PMLR}
}
```
<a name="contribution"></a>
## Contribute to PaddleSpeech
...
@@ -147,6 +147,8 @@ from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readme
<div align="center"><a href="https://www.bilibili.com/video/BV1cL411V71o?share_source=copy_web"><img src="https://ai-studio-static-online.cdn.bcebos.com/06fd746ab32042f398fb6f33f873e6869e846fe63c214596ae37860fe8103720" width="500px"></a></div>

- [PaddleSpeech Demo Video](https://paddlespeech.readthedocs.io/en/latest/demo_video.html)

### 🔥 Hot Activities
...
Demo Video
==================

.. raw:: html

   <video controls width="1024">
     <source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/PaddleSpeech_Demo.mp4"
             type="video/mp4">
     Sorry, your browser doesn't support embedded videos.
   </video>
@@ -41,6 +41,7 @@ Contents
   tts/gan_vocoder
   tts/demo
   tts/demo_2

.. toctree::
   :maxdepth: 1
@@ -50,12 +51,14 @@ Contents
.. toctree::
   :maxdepth: 1
   :caption: Demos

   demo_video
   tts_demo_video

.. toctree::
   :maxdepth: 1
   :caption: Acknowledgement

   asr/reference
@@ -41,7 +41,7 @@ FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSp
FastSpeech2-Conformer| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_conformer_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip)|||
FastSpeech2| AISHELL-3 |[fastspeech2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3)|[fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_ckpt_0.4.zip)|||
FastSpeech2| LJSpeech |[fastspeech2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts3)|[fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)|||
FastSpeech2| VCTK |[fastspeech2-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/tts3)|[fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip)|||

### Vocoders
Model Type | Dataset| Example Link | Pretrained Models| Static Models|Size (static)
...
TTS Demo Video
==================

.. raw:: html

   <video controls width="1024">
     <source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/paddle2021_with_me.mp4"
             type="video/mp4">
     Sorry, your browser doesn't support embedded videos.
   </video>
# Callcenter 8k sample rate
Data distribution:
```
676048 utts
491.4004722221223 h
4357792.0 text
2.4633630739178654 text/sec
2.6167397877068495 sec/utt
```
train/dev/test partition:
```
33802 manifest.dev
67606 manifest.test
574640 manifest.train
676048 total
```
# WaveRNN with CSMSC
This example contains code used to train a [WaveRNN](https://arxiv.org/abs/1802.08435) model with the [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).
## Dataset
### Download and Extract
Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of audio.
You can download the alignment results from [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`.
Assume the path to the MFA result of CSMSC is `./baker_alignment_tone`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the `dump` folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── test
│ ├── norm
│ └── raw
└── train
├── norm
├── raw
└── feats_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and a `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the `norm` folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set and stored in `dump/train/feats_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the utterance id and the path to the spectrogram of each utterance.
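A `metadata.jsonl` file like this can be consumed with a few lines of stdlib Python. A minimal sketch — the field names (`utt_id`, `feats`) are assumptions for illustration; the real schema is whatever the preprocessing script wrote:

```python
import io
import json

# Stand-in for dump/train/norm/metadata.jsonl: one JSON object per line
# (field names here are assumptions, not the guaranteed schema).
sample = io.StringIO(
    '{"utt_id": "009901", "feats": "dump/train/norm/009901.npy"}\n'
    '{"utt_id": "009902", "feats": "dump/train/norm/009902.npy"}\n')

# JSON Lines means one record per line, so a plain loop is enough.
records = [json.loads(line) for line in sample]
for rec in records:
    print(rec["utt_id"], "->", rec["feats"])
```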
### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU]
Train a WaveRNN model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file to overwrite default config.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h] [--config CONFIG] [--checkpoint CHECKPOINT]
[--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU]
Synthesize with WaveRNN.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Vocoder config file.
--checkpoint CHECKPOINT
snapshot to load.
--test-metadata TEST_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
```
1. `--config` is the WaveRNN config file. You should use the same config file that the model was trained with.
2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory.
3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `dev/norm` subfolder from the processed directory.
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models
The pretrained model can be downloaded here [wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip).
The static model can be downloaded here [wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip).
Model | Step | eval/loss
:-------------:|:------------:| :------------:
default| 1(gpu) x 400000|2.602768
WaveRNN checkpoint contains files listed below.
```text
wavernn_csmsc_ckpt_0.2.0
├── default.yaml # default config used to train wavernn
├── feats_stats.npy # statistics used to normalize spectrogram when training wavernn
└── snapshot_iter_400000.pdz # parameters of wavernn
```
@@ -114,8 +114,9 @@ class CLSExecutor(BaseExecutor):
"""
Download and returns pretrained resources path of current task.
"""
support_models = list(pretrained_models.keys())
assert tag in pretrained_models, 'The model "{}" you want to use has not been supported, please choose other models.\nThe support models includes:\n\t\t{}\n'.format(
tag, '\n\t\t'.join(support_models))
res_path = os.path.join(MODEL_HOME, tag)
decompressed_path = download_and_decompress(pretrained_models[tag],
...
@@ -112,8 +112,9 @@ class STExecutor(BaseExecutor):
"""
Download and returns pretrained resources path of current task.
"""
support_models = list(pretrained_models.keys())
assert tag in pretrained_models, 'The model "{}" you want to use has not been supported, please choose other models.\nThe support models includes:\n\t\t{}\n'.format(
tag, '\n\t\t'.join(support_models))
res_path = os.path.join(MODEL_HOME, tag)
decompressed_path = download_and_decompress(pretrained_models[tag],
...
@@ -124,8 +124,9 @@ class TextExecutor(BaseExecutor):
"""
Download and returns pretrained resources path of current task.
"""
support_models = list(pretrained_models.keys())
assert tag in pretrained_models, 'The model "{}" you want to use has not been supported, please choose other models.\nThe support models includes:\n\t\t{}\n'.format(
tag, '\n\t\t'.join(support_models))
res_path = os.path.join(MODEL_HOME, tag)
decompressed_path = download_and_decompress(pretrained_models[tag],
...
@@ -117,6 +117,36 @@ pretrained_models = {
'speaker_dict':
'speaker_id_map.txt',
},
# tacotron2
"tacotron2_csmsc-zh": {
'url':
'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip',
'md5':
'0df4b6f0bcbe0d73c5ed6df8867ab91a',
'config':
'default.yaml',
'ckpt':
'snapshot_iter_30600.pdz',
'speech_stats':
'speech_stats.npy',
'phones_dict':
'phone_id_map.txt',
},
"tacotron2_ljspeech-en": {
'url':
'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.2.0.zip',
'md5':
'6a5eddd81ae0e81d16959b97481135f3',
'config':
'default.yaml',
'ckpt':
'snapshot_iter_60300.pdz',
'speech_stats':
'speech_stats.npy',
'phones_dict':
'phone_id_map.txt',
},
# pwgan
"pwgan_csmsc-zh": {
'url':
@@ -205,6 +235,20 @@ pretrained_models = {
'speech_stats':
'feats_stats.npy',
},
# wavernn
"wavernn_csmsc-zh": {
'url':
'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip',
'md5':
'ee37b752f09bcba8f2af3b777ca38e13',
'config':
'default.yaml',
'ckpt':
'snapshot_iter_400000.pdz',
'speech_stats':
'feats_stats.npy',
}
}

model_alias = {
@@ -217,6 +261,10 @@ model_alias = {
"paddlespeech.t2s.models.fastspeech2:FastSpeech2",
"fastspeech2_inference":
"paddlespeech.t2s.models.fastspeech2:FastSpeech2Inference",
"tacotron2":
"paddlespeech.t2s.models.tacotron2:Tacotron2",
"tacotron2_inference":
"paddlespeech.t2s.models.tacotron2:Tacotron2Inference",
# voc
"pwgan":
"paddlespeech.t2s.models.parallel_wavegan:PWGGenerator",
@@ -234,6 +282,10 @@ model_alias = {
"paddlespeech.t2s.models.hifigan:HiFiGANGenerator",
"hifigan_inference":
"paddlespeech.t2s.models.hifigan:HiFiGANInference",
"wavernn":
"paddlespeech.t2s.models.wavernn:WaveRNN",
"wavernn_inference":
"paddlespeech.t2s.models.wavernn:WaveRNNInference",
}
@@ -253,9 +305,13 @@ class TTSExecutor(BaseExecutor):
type=str,
default='fastspeech2_csmsc',
choices=[
'speedyspeech_csmsc',
'fastspeech2_csmsc',
'fastspeech2_ljspeech',
'fastspeech2_aishell3',
'fastspeech2_vctk',
'tacotron2_csmsc',
'tacotron2_ljspeech',
],
help='Choose acoustic model type of tts task.')
self.parser.add_argument(
@@ -300,8 +356,14 @@ class TTSExecutor(BaseExecutor):
type=str,
default='pwgan_csmsc',
choices=[
'pwgan_csmsc',
'pwgan_ljspeech',
'pwgan_aishell3',
'pwgan_vctk',
'mb_melgan_csmsc',
'style_melgan_csmsc',
'hifigan_csmsc',
'wavernn_csmsc',
],
help='Choose vocoder type of tts task.')
@@ -340,8 +402,9 @@ class TTSExecutor(BaseExecutor):
"""
Download and returns pretrained resources path of current task.
"""
support_models = list(pretrained_models.keys())
assert tag in pretrained_models, 'The model "{}" you want to use has not been supported, please choose other models.\nThe support models includes:\n\t\t{}\n'.format(
tag, '\n\t\t'.join(support_models))
res_path = os.path.join(MODEL_HOME, tag)
decompressed_path = download_and_decompress(pretrained_models[tag],
@@ -488,6 +551,8 @@ class TTSExecutor(BaseExecutor):
vocab_size=vocab_size,
tone_size=tone_size,
**self.am_config["model"])
elif am_name == 'tacotron2':
am = am_class(idim=vocab_size, odim=odim, **self.am_config["model"])
am.set_state_dict(paddle.load(self.am_ckpt)["main_params"])
am.eval()
@@ -505,10 +570,15 @@ class TTSExecutor(BaseExecutor):
voc_class = dynamic_import(voc_name, model_alias)
voc_inference_class = dynamic_import(voc_name + '_inference',
model_alias)
if voc_name != 'wavernn':
voc = voc_class(**self.voc_config["generator_params"])
voc.set_state_dict(paddle.load(self.voc_ckpt)["generator_params"])
voc.remove_weight_norm()
voc.eval()
else:
voc = voc_class(**self.voc_config["model"])
voc.set_state_dict(paddle.load(self.voc_ckpt)["main_params"])
voc.eval()
voc_mu, voc_std = np.load(self.voc_stat)
voc_mu = paddle.to_tensor(voc_mu)
voc_std = paddle.to_tensor(voc_std)
...
@@ -175,7 +175,7 @@ class U2Trainer(Trainer):
observation['batch_cost'] = observation[
'reader_cost'] + observation['step_cost']
observation['samples'] = observation['batch_size']
observation['ips,samples/s'] = observation[
'batch_size'] / observation['batch_cost']
for k, v in observation.items():
msg += f" {k.split(',')[0]}: "
...
@@ -419,7 +419,7 @@ def make_batchset(
# sort it by input lengths (long to short)
sorted_data = sorted(
d.items(),
key=lambda data: float(data[1][batch_sort_key][batch_sort_axis]["shape"][0]),
reverse=not shortest_first, )
logger.info("# utts: " + str(len(sorted_data)))
...
@@ -61,7 +61,7 @@ class BatchDataLoader():
def __init__(self,
json_file: str,
train_mode: bool,
sortagrad: int=0,
batch_size: int=0,
maxlen_in: float=float('inf'),
maxlen_out: float=float('inf'),
...
@@ -252,8 +252,7 @@ class Trainer():
if self.args.benchmark_max_step and self.iteration > self.args.benchmark_max_step:
logger.info(
f"Reach benchmark-max-step: {self.args.benchmark_max_step}")
sys.exit(0)

def do_train(self):
"""The training process control by epoch."""
@@ -282,7 +281,7 @@ class Trainer():
observation['batch_cost'] = observation[
'reader_cost'] + observation['step_cost']
observation['samples'] = observation['batch_size']
observation['ips samples/s'] = observation[
'batch_size'] / observation['batch_cost']
for k, v in observation.items():
msg += f" {k}: "
...
@@ -13,7 +13,6 @@
# limitations under the License.
import logging

from . import datasets
from . import exps
from . import frontend
...
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""t2s's infrastructure for data processing.
"""
from .batch import *
from .dataset import *
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import six
from paddle.io import Dataset
__all__ = [
"split",
"TransformDataset",
"CacheDataset",
"TupleDataset",
"DictDataset",
"SliceDataset",
"SubsetDataset",
"FilterDataset",
"ChainDataset",
]
def split(dataset, first_size):
"""A utility function to split a dataset into two datasets."""
first = SliceDataset(dataset, 0, first_size)
second = SliceDataset(dataset, first_size, len(dataset))
return first, second
class TransformDataset(Dataset):
def __init__(self, dataset, transform):
"""Dataset which is transformed from another with a transform.
Args:
dataset (Dataset): the base dataset.
transform (callable): the transform which takes an example of the base dataset as parameter and return a new example.
"""
self._dataset = dataset
self._transform = transform
def __len__(self):
return len(self._dataset)
def __getitem__(self, i):
in_data = self._dataset[i]
return self._transform(in_data)
class CacheDataset(Dataset):
def __init__(self, dataset):
"""A lazy cache of the base dataset.
Args:
dataset (Dataset): the base dataset to cache.
"""
self._dataset = dataset
self._cache = dict()
def __len__(self):
return len(self._dataset)
def __getitem__(self, i):
if i not in self._cache:
self._cache[i] = self._dataset[i]
return self._cache[i]
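The class above can be exercised without paddle; a minimal sketch of the same lazy-memoisation idea, with a plain list standing in for `paddle.io.Dataset`:

```python
# Count how often the "expensive" base dataset is really read.
reads = []
base = [x * x for x in range(5)]
cache = {}

def cached_getitem(i):
    # Same logic as CacheDataset.__getitem__: fill the cache on first access.
    if i not in cache:
        reads.append(i)
        cache[i] = base[i]
    return cache[i]

print(cached_getitem(3), cached_getitem(3))  # 9 9
print(reads)  # [3] -- the base dataset was only read once
```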
class TupleDataset(Dataset):
def __init__(self, *datasets):
"""A compound dataset made from several datasets of the same length. An example of the `TupleDataset` is a tuple of examples from the constituent datasets.
Args:
datasets: tuple[Dataset], the constituent datasets.
"""
if not datasets:
raise ValueError("no datasets are given")
length = len(datasets[0])
for i, dataset in enumerate(datasets):
if len(dataset) != length:
raise ValueError("all the datasets should have the same length. "
"dataset {} has a different length".format(i))
self._datasets = datasets
self._length = length
def __getitem__(self, index):
# SOA
batches = [dataset[index] for dataset in self._datasets]
if isinstance(index, slice):
length = len(batches[0])
# AOS
return [
tuple([batch[i] for batch in batches])
for i in six.moves.range(length)
]
else:
return tuple(batches)
def __len__(self):
return self._length
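A paddle-free sketch of the structure-of-arrays to array-of-structures conversion that `TupleDataset.__getitem__` performs for slice indices (plain lists stand in for the constituent datasets):

```python
# Two "datasets" of the same length, e.g. texts and mel lengths.
texts = ["a", "b", "c"]
mels = [10, 20, 30]

def tuple_getitem(datasets, index):
    batches = [d[index] for d in datasets]       # structure of arrays (SOA)
    if isinstance(index, slice):
        return [tuple(b[i] for b in batches)     # -> array of structures (AOS)
                for i in range(len(batches[0]))]
    return tuple(batches)

print(tuple_getitem([texts, mels], 1))            # ('b', 20)
print(tuple_getitem([texts, mels], slice(0, 2)))  # [('a', 10), ('b', 20)]
```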
class DictDataset(Dataset):
def __init__(self, **datasets):
"""
A compound dataset made from several datasets of the same length. An
example of the `DictDataset` is a dict of examples from the constituent
datasets.
WARNING: paddle does not have a good support for DictDataset, because
every batch yield from a DataLoader is a list, but it cannot be a dict.
So you have to provide a collate function because you cannot use the
default one.
Args:
datasets: Dict[Dataset], the constituent datasets.
"""
if not datasets:
raise ValueError("no datasets are given")
length = None
for key, dataset in six.iteritems(datasets):
if length is None:
length = len(dataset)
elif len(dataset) != length:
raise ValueError(
"all the datasets should have the same length."
"dataset {} has a different length".format(key))
self._datasets = datasets
self._length = length
def __getitem__(self, index):
batches = {
key: dataset[index]
for key, dataset in six.iteritems(self._datasets)
}
if isinstance(index, slice):
length = len(six.next(six.itervalues(batches)))
return [{key: batch[i]
for key, batch in six.iteritems(batches)}
for i in six.moves.range(length)]
else:
return batches
def __len__(self):
return self._length
class SliceDataset(Dataset):
def __init__(self, dataset, start, finish, order=None):
"""A Dataset which is a slice of the base dataset.
Args:
dataset (Dataset): the base dataset.
start (int): the start of the slice.
finish (int): the end of the slice, not inclusive.
order (List[int], optional): the order, it is a permutation of the valid example ids of the base dataset. If `order` is provided, the slice is taken in `order`. Defaults to None.
"""
if start < 0 or finish > len(dataset):
raise ValueError("subset overruns the dataset.")
self._dataset = dataset
self._start = start
self._finish = finish
self._size = finish - start
if order is not None and len(order) != len(dataset):
raise ValueError(
"order should have the same length as the dataset. "
"len(order) = {} does not equal len(dataset) = {}.".
format(len(order), len(dataset)))
self._order = order
def __len__(self):
return self._size
def __getitem__(self, i):
if i >= 0:
if i >= self._size:
raise IndexError('dataset index out of range')
index = self._start + i
else:
if i < -self._size:
raise IndexError('dataset index out of range')
index = self._finish + i
if self._order is not None:
index = self._order[index]
return self._dataset[index]
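`__getitem__` above maps both positive and negative indices into the `[start, finish)` window; a paddle-free sketch of that arithmetic over a plain list:

```python
# A slice covering base[2:6] -> 'c', 'd', 'e', 'f'.
base = list("abcdefgh")
start, finish = 2, 6
size = finish - start

def slice_getitem(i):
    if i >= 0:
        if i >= size:
            raise IndexError('dataset index out of range')
        return base[start + i]
    if i < -size:
        raise IndexError('dataset index out of range')
    return base[finish + i]  # negative indices count back from `finish`

print(slice_getitem(0), slice_getitem(-1))  # c f
```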
class SubsetDataset(Dataset):
def __init__(self, dataset, indices):
"""A Dataset which is a subset of the base dataset.
Args:
dataset (Dataset): the base dataset.
indices (Iterable[int]): the indices of the examples to pick.
"""
self._dataset = dataset
if len(indices) > len(dataset):
            raise ValueError("subset's size is larger than the dataset's size!")
self._indices = indices
self._size = len(indices)
def __len__(self):
return self._size
def __getitem__(self, i):
index = self._indices[i]
return self._dataset[index]
class FilterDataset(Dataset):
def __init__(self, dataset, filter_fn):
"""A filtered dataset.
Args:
dataset (Dataset): the base dataset.
filter_fn (callable): a callable which takes an example of the base dataset and return a boolean.
"""
self._dataset = dataset
self._indices = [
i for i in range(len(dataset)) if filter_fn(dataset[i])
]
self._size = len(self._indices)
def __len__(self):
return self._size
def __getitem__(self, i):
index = self._indices[i]
return self._dataset[index]
class ChainDataset(Dataset):
def __init__(self, *datasets):
        """A concatenation of several datasets which have the same structure.
Args:
datasets (Iterable[Dataset]): datasets to concat.
"""
self._datasets = datasets
def __len__(self):
return sum(len(dataset) for dataset in self._datasets)
def __getitem__(self, i):
if i < 0:
            raise IndexError("ChainDataset does not support negative indexing.")
for dataset in self._datasets:
if i < len(dataset):
return dataset[i]
i -= len(dataset)
raise IndexError("dataset index out of range")
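The ChainDataset lookup above walks the member datasets in order, subtracting each length from the global index until it lands inside one of them. A minimal sketch of that logic with plain lists, independent of paddle (the function name here is ours, for illustration only):

```python
def chain_getitem(datasets, i):
    # Walk the datasets in order until the global index
    # falls inside one of them.
    if i < 0:
        raise IndexError("negative indexing is not supported")
    for dataset in datasets:
        if i < len(dataset):
            return dataset[i]
        i -= len(dataset)
    raise IndexError("dataset index out of range")

print(chain_getitem([[10, 11], [12]], 2))  # -> 12
```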
...@@ -11,5 +11,4 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .ljspeech import *
...@@ -14,7 +14,7 @@
import numpy as np
import paddle
from paddlespeech.t2s.datasets.batch import batch_sequences
def tacotron2_single_spk_batch_fn(examples):
...
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
from typing import List
import librosa
import numpy as np
from paddle.io import Dataset
__all__ = ["AudioSegmentDataset", "AudioDataset", "AudioFolderDataset"]
class AudioSegmentDataset(Dataset):
"""A simple dataset adaptor for audio files to train vocoders.
Read -> trim silence -> normalize -> extract a segment
"""
def __init__(self,
file_paths: List[Path],
sample_rate: int,
length: int,
top_db: float):
self.file_paths = file_paths
self.sr = sample_rate
self.top_db = top_db
self.length = length # samples in the clip
def __getitem__(self, i):
fpath = self.file_paths[i]
y, sr = librosa.load(fpath, sr=self.sr)
y, _ = librosa.effects.trim(y, top_db=self.top_db)
y = librosa.util.normalize(y)
y = y.astype(np.float32)
# pad or trim
if y.size <= self.length:
y = np.pad(y, [0, self.length - len(y)], mode='constant')
else:
start = np.random.randint(0, 1 + len(y) - self.length)
y = y[start:start + self.length]
return y
def __len__(self):
return len(self.file_paths)
class AudioDataset(Dataset):
"""A simple dataset adaptor for the audio files.
Read -> trim silence -> normalize
"""
def __init__(self,
file_paths: List[Path],
sample_rate: int,
top_db: float=60):
self.file_paths = file_paths
self.sr = sample_rate
self.top_db = top_db
def __getitem__(self, i):
fpath = self.file_paths[i]
y, sr = librosa.load(fpath, sr=self.sr)
y, _ = librosa.effects.trim(y, top_db=self.top_db)
y = librosa.util.normalize(y)
y = y.astype(np.float32)
return y
def __len__(self):
return len(self.file_paths)
class AudioFolderDataset(AudioDataset):
def __init__(
self,
root,
sample_rate,
top_db=60,
extension=".wav", ):
root = Path(root).expanduser()
file_paths = sorted(list(root.rglob("*{}".format(extension))))
super().__init__(file_paths, sample_rate, top_db)
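The pad-or-crop step in `AudioSegmentDataset.__getitem__` (zero-pad short clips, take a random window from long ones) can be exercised on its own. A hedged stdlib sketch of the same logic; the function name is ours, not part of the PaddleSpeech API:

```python
import random

def fix_length(y, length, rng=None):
    # Zero-pad short clips to `length` samples; otherwise take a
    # random window, mirroring AudioSegmentDataset.__getitem__.
    if len(y) <= length:
        return y + [0.0] * (length - len(y))
    rng = rng or random.Random()
    start = rng.randrange(0, 1 + len(y) - length)
    return y[start:start + length]

print(fix_length([0.5, -0.5], 4))  # -> [0.5, -0.5, 0.0, 0.0]
```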
...@@ -22,26 +22,17 @@ from paddle.io import Dataset
class DataTable(Dataset):
    """Dataset to load and convert data for general purpose.

    Args:
        data (List[Dict[str, Any]]): Metadata, a list of meta datum, each of which is composed of several fields
        fields (List[str], optional): Fields to use, if not specified, all the fields in the data are used, by default None
        converters (Dict[str, Callable], optional): Converters used to process each field, by default None
        use_cache (bool, optional): Whether to use cache, by default False

    Raises:
        ValueError: If there is some field that does not exist in data.
        ValueError: If there is some field in converters that does not exist in fields.
    """

    def __init__(self,
...@@ -95,15 +86,11 @@ class DataTable(Dataset):
        """Convert a meta datum to an example by applying the corresponding
        converters to each field requested.

        Args:
            meta_datum (Dict[str, Any]): Meta datum

        Returns:
            Dict[str, Any]: Converted example
        """
        example = {}
        for field in self.fields:
...@@ -118,16 +105,11 @@ class DataTable(Dataset):
    def __getitem__(self, idx: int) -> Dict[str, Any]:
        """Get an example given an index.

        Args:
            idx (int): Index of the example to get

        Returns:
            Dict[str, Any]: A converted example
        """
        if self.use_cache and self.caches[idx] is not None:
            return self.caches[idx]
...
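The field/converter contract in the DataTable docstring boils down to: pick the requested fields from each metadata dict and run the matching converter on them, if one is registered. A rough stdlib illustration of that conversion step (not the class itself; names are ours):

```python
def to_example(meta_datum, fields, converters=None):
    # Apply the per-field converter when one is registered,
    # otherwise pass the value through unchanged.
    converters = converters or {}
    return {
        field: converters[field](meta_datum[field])
        if field in converters else meta_datum[field]
        for field in fields
    }

meta = {"utt_id": "a1", "wav_path": "a1.wav", "text": "hello"}
print(to_example(meta, ["utt_id", "text"], {"text": str.upper}))
# -> {'utt_id': 'a1', 'text': 'HELLO'}
```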
...@@ -18,14 +18,10 @@ import re
def get_phn_dur(file_name):
    '''
    read MFA duration.txt

    Args:
        file_name (str or Path): path of gen_duration_from_textgrid.py's result

    Returns:
        Dict: sentence: {'utt': ([char], [int])}
    '''
    f = open(file_name, 'r')
    sentence = {}
...@@ -48,10 +44,8 @@ def get_phn_dur(file_name):
def merge_silence(sentence):
    '''
    merge silences

    Args:
        sentence (Dict): sentence: {'utt': (([char], [int]), str)}
    '''
    for utt in sentence:
        cur_phn, cur_dur, speaker = sentence[utt]
...@@ -81,12 +75,9 @@ def merge_silence(sentence):
def get_input_token(sentence, output_path, dataset="baker"):
    '''
    get phone set from training data and save it

    Args:
        sentence (Dict): sentence: {'utt': ([char], [int])}
        output_path (str or path): path to save phone_id_map
    '''
    phn_token = set()
    for utt in sentence:
...@@ -112,14 +103,10 @@ def get_phones_tones(sentence,
                     dataset="baker"):
    '''
    get phone set and tone set from training data and save it

    Args:
        sentence (Dict): sentence: {'utt': ([char], [int])}
        phones_output_path (str or path): path to save phone_id_map
        tones_output_path (str or path): path to save tone_id_map
    '''
    phn_token = set()
    tone_token = set()
...@@ -162,14 +149,10 @@ def get_spk_id_map(speaker_set, output_path):
def compare_duration_and_mel_length(sentences, utt, mel):
    '''
    check duration error, correct sentences[utt] if possible, else pop sentences[utt]

    Args:
        sentences (Dict): sentences[utt] = [phones_list, durations_list]
        utt (str): utt_id
        mel (np.ndarray): features (num_frames, n_mels)
    '''
    if utt in sentences:
...
...@@ -29,15 +29,11 @@ class Clip(object):
                 hop_size=256,
                 aux_context_window=0, ):
        """Initialize customized collater for DataLoader.

        Args:
            batch_max_steps (int): The maximum length of input signal in batch.
            hop_size (int): Hop size of auxiliary features.
            aux_context_window (int): Context window size for auxiliary feature conv.
        """
        if batch_max_steps % hop_size != 0:
...@@ -56,18 +52,15 @@ class Clip(object):
    def __call__(self, batch):
        """Convert into batch tensors.

        Args:
            batch (list): list of tuple of the pair of audio and features. Audio shape (T, ), features shape (T', C).

        Returns:
            Tensor: Auxiliary feature batch (B, C, T'), where
                T = (T' - 2 * aux_context_window) * hop_size.
            Tensor: Target signal batch (B, 1, T).
        """
        # check length
...@@ -104,11 +97,10 @@ class Clip(object):
    def _adjust_length(self, x, c):
        """Adjust the audio and feature lengths.

        Note:
            Basically we assume that the lengths of x and c are already
            adjusted in the preprocessing stage; but if we use features
            processed by another library, this step will be needed.
        """
        if len(x) < c.shape[0] * self.hop_size:
...@@ -162,22 +154,14 @@ class WaveRNNClip(Clip):
        # voc_pad = 2 this will pad the input so that the resnet can 'see' wider than input length
        # max_offsets = n_frames - 2 - (mel_win + 2 * hp.voc_pad) = n_frames - 15
        """Convert into batch tensors.

        Args:
            batch (list): list of tuple of the pair of audio and features. Audio shape (T, ), features shape (T', C).

        Returns:
            Tensor: Input signal batch (B, 1, T).
            Tensor: Target signal batch (B, 1, T).
            Tensor: Auxiliary feature batch (B, C, T'),
                where T = (T' - 2 * aux_context_window) * hop_size.
        """
        # check length
...
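The shape relation in the Clip docstrings, T = (T' - 2 * aux_context_window) * hop_size, is easy to sanity-check numerically. A small sketch (helper name is ours) using the default hop_size of 256 from the collater above:

```python
def target_steps(n_frames, hop_size=256, aux_context_window=0):
    # Frames usable as targets after reserving the context window
    # on both sides, converted back to waveform samples.
    return (n_frames - 2 * aux_context_window) * hop_size

print(target_steps(34, hop_size=256, aux_context_window=2))  # -> 7680
```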
...@@ -27,9 +27,9 @@ import tqdm
import yaml
from yacs.config import CfgNode
from paddlespeech.t2s.datasets.get_feats import Energy
from paddlespeech.t2s.datasets.get_feats import LogMelFBank
from paddlespeech.t2s.datasets.get_feats import Pitch
from paddlespeech.t2s.datasets.preprocess_utils import compare_duration_and_mel_length
from paddlespeech.t2s.datasets.preprocess_utils import get_input_token
from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur
...
...@@ -160,9 +160,8 @@ def train_sp(args, config):
    if dist.get_rank() == 0:
        trainer.extend(evaluator, trigger=(1, "epoch"))
        trainer.extend(VisualDL(output_dir), trigger=(1, "iteration"))
        trainer.extend(
            Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch'))
    trainer.run()
...
...@@ -231,9 +231,9 @@ def train_sp(args, config):
        trainer.extend(
            evaluator, trigger=(config.eval_interval_steps, 'iteration'))
        trainer.extend(VisualDL(output_dir), trigger=(1, 'iteration'))
        trainer.extend(
            Snapshot(max_size=config.num_snapshots),
            trigger=(config.save_interval_steps, 'iteration'))
    print("Trainer Done!")
    trainer.run()
...
...@@ -219,9 +219,9 @@ def train_sp(args, config):
        trainer.extend(
            evaluator, trigger=(config.eval_interval_steps, 'iteration'))
        trainer.extend(VisualDL(output_dir), trigger=(1, 'iteration'))
        trainer.extend(
            Snapshot(max_size=config.num_snapshots),
            trigger=(config.save_interval_steps, 'iteration'))
    print("Trainer Done!")
    trainer.run()
...
...@@ -23,7 +23,7 @@ import soundfile as sf
import yaml
from yacs.config import CfgNode
from paddlespeech.t2s.datasets.get_feats import LogMelFBank
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore
...
...@@ -194,11 +194,10 @@ def train_sp(args, config):
        trainer.extend(
            evaluator, trigger=(config.eval_interval_steps, 'iteration'))
        trainer.extend(VisualDL(output_dir), trigger=(1, 'iteration'))
        trainer.extend(
            Snapshot(max_size=config.num_snapshots),
            trigger=(config.save_interval_steps, 'iteration'))
    print("Trainer Done!")
    trainer.run()
...
...@@ -27,7 +27,7 @@ import tqdm
import yaml
from yacs.config import CfgNode
from paddlespeech.t2s.datasets.get_feats import LogMelFBank
from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur
from paddlespeech.t2s.datasets.preprocess_utils import merge_silence
from paddlespeech.t2s.utils import str2bool
...
...@@ -212,9 +212,9 @@ def train_sp(args, config):
        trainer.extend(
            evaluator, trigger=(config.eval_interval_steps, 'iteration'))
        trainer.extend(VisualDL(output_dir), trigger=(1, 'iteration'))
        trainer.extend(
            Snapshot(max_size=config.num_snapshots),
            trigger=(config.save_interval_steps, 'iteration'))
    print("Trainer Done!")
    trainer.run()
...
...@@ -27,7 +27,7 @@ import tqdm
import yaml
from yacs.config import CfgNode
from paddlespeech.t2s.datasets.get_feats import LogMelFBank
from paddlespeech.t2s.datasets.preprocess_utils import compare_duration_and_mel_length
from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur
from paddlespeech.t2s.datasets.preprocess_utils import get_phones_tones
...
...@@ -171,8 +171,8 @@ def train_sp(args, config):
    if dist.get_rank() == 0:
        trainer.extend(evaluator, trigger=(1, "epoch"))
        trainer.extend(VisualDL(output_dir), trigger=(1, "iteration"))
        trainer.extend(
            Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch'))
    trainer.run()
...
...@@ -27,7 +27,7 @@ import tqdm
import yaml
from yacs.config import CfgNode
from paddlespeech.t2s.datasets.get_feats import LogMelFBank
from paddlespeech.t2s.datasets.preprocess_utils import compare_duration_and_mel_length
from paddlespeech.t2s.datasets.preprocess_utils import get_input_token
from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur
...
...@@ -155,9 +155,8 @@ def train_sp(args, config):
    if dist.get_rank() == 0:
        trainer.extend(evaluator, trigger=(1, "epoch"))
        trainer.extend(VisualDL(output_dir), trigger=(1, "iteration"))
        trainer.extend(
            Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch'))
    trainer.run()
...
...@@ -26,20 +26,17 @@ import tqdm
import yaml
from yacs.config import CfgNode as Configuration
from paddlespeech.t2s.datasets.get_feats import LogMelFBank
from paddlespeech.t2s.frontend import English


def get_lj_sentences(file_name, frontend):
    '''read MFA duration.txt

    Args:
        file_name (str or Path)

    Returns:
        Dict: sentence: {'utt': ([char], [int])}
    '''
    f = open(file_name, 'r')
    sentence = {}
...@@ -59,14 +56,11 @@ def get_lj_sentences(file_name, frontend):
def get_input_token(sentence, output_path):
    '''get phone set from training data and save it

    Args:
        sentence (Dict): sentence: {'utt': ([char], str)}
        output_path (str or path): path to save phone_id_map
    '''
    phn_token = set()
    for utt in sentence:
...
...@@ -148,9 +148,8 @@ def train_sp(args, config):
    if dist.get_rank() == 0:
        trainer.extend(evaluator, trigger=(1, "epoch"))
        trainer.extend(VisualDL(output_dir), trigger=(1, "iteration"))
        trainer.extend(
            Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch'))
    trainer.run()
...
...@@ -17,8 +17,8 @@ import numpy as np
import pandas
from paddle.io import Dataset
from paddlespeech.t2s.datasets.batch import batch_spec
from paddlespeech.t2s.datasets.batch import batch_wav


class LJSpeech(Dataset):
...
...@@ -168,9 +168,9 @@ def train_sp(args, config):
        trainer.extend(
            evaluator, trigger=(config.eval_interval_steps, 'iteration'))
        trainer.extend(VisualDL(output_dir), trigger=(1, 'iteration'))
        trainer.extend(
            Snapshot(max_size=config.num_snapshots),
            trigger=(config.save_interval_steps, 'iteration'))
    print("Trainer Done!")
    trainer.run()
...
...@@ -133,16 +133,11 @@ class ARPABET(Phonetics):
    def phoneticize(self, sentence, add_start_end=False):
        """ Normalize the input text sequence and convert it into pronunciation sequence.

        Args:
            sentence (str): The input text sequence.

        Returns:
            List[str]: The list of pronunciation sequence.
        """
        phonemes = [
            self._remove_vowels(item) for item in self.backend(sentence)
...@@ -156,16 +151,12 @@ class ARPABET(Phonetics):
    def numericalize(self, phonemes):
        """ Convert pronunciation sequence into pronunciation id sequence.

        Args:
            phonemes (List[str]): The list of pronunciation sequence.

        Returns:
            List[int]: The list of pronunciation id sequence.
        """
        ids = [self.vocab.lookup(item) for item in phonemes]
        return ids
...@@ -173,30 +164,23 @@ class ARPABET(Phonetics):
    def reverse(self, ids):
        """ Reverse the list of pronunciation id sequence to a list of pronunciation sequence.

        Args:
            ids (List[int]): The list of pronunciation id sequence.

        Returns:
            List[str]: The list of pronunciation sequence.
        """
        return [self.vocab.reverse(i) for i in ids]

    def __call__(self, sentence, add_start_end=False):
        """ Convert the input text sequence into pronunciation id sequence.

        Args:
            sentence (str): The input text sequence.

        Returns:
            List[str]: The list of pronunciation id sequence.
        """
        return self.numericalize(
            self.phoneticize(sentence, add_start_end=add_start_end))
...@@ -229,15 +213,11 @@ class ARPABETWithStress(Phonetics):
    def phoneticize(self, sentence, add_start_end=False):
        """ Normalize the input text sequence and convert it into pronunciation sequence.

        Args:
            sentence (str): The input text sequence.

        Returns:
            List[str]: The list of pronunciation sequence.
        """
        phonemes = self.backend(sentence)
        if add_start_end:
...@@ -249,47 +229,33 @@ class ARPABETWithStress(Phonetics):
    def numericalize(self, phonemes):
        """ Convert pronunciation sequence into pronunciation id sequence.

        Args:
            phonemes (List[str]): The list of pronunciation sequence.

        Returns:
            List[int]: The list of pronunciation id sequence.
        """
        ids = [self.vocab.lookup(item) for item in phonemes]
        return ids

    def reverse(self, ids):
        """ Reverse the list of pronunciation id sequence to a list of pronunciation sequence.

        Args:
            ids (List[int]): The list of pronunciation id sequence.

        Returns:
            List[str]: The list of pronunciation sequence.
        """
        return [self.vocab.reverse(i) for i in ids]

    def __call__(self, sentence, add_start_end=False):
        """ Convert the input text sequence into pronunciation id sequence.

        Args:
            sentence (str): The input text sequence.

        Returns:
            List[str]: The list of pronunciation id sequence.
        """
        return self.numericalize(
            self.phoneticize(sentence, add_start_end=add_start_end))
...
...@@ -65,14 +65,10 @@ class English(Phonetics):
    def phoneticize(self, sentence):
        """ Normalize the input text sequence and convert it into pronunciation sequence.

        Args:
            sentence (str): The input text sequence.

        Returns:
            List[str]: The list of pronunciation sequence.
        """
        start = self.vocab.start_symbol
        end = self.vocab.end_symbol
...@@ -123,14 +119,10 @@ class English(Phonetics):
    def numericalize(self, phonemes):
        """ Convert pronunciation sequence into pronunciation id sequence.

        Args:
            phonemes (List[str]): The list of pronunciation sequence.

        Returns:
            List[int]: The list of pronunciation id sequence.
        """
        ids = [
            self.vocab.lookup(item) for item in phonemes
...@@ -140,27 +132,19 @@ class English(Phonetics):
def reverse(self, ids): def reverse(self, ids):
""" Reverse the list of pronunciation id sequence to a list of pronunciation sequence. """ Reverse the list of pronunciation id sequence to a list of pronunciation sequence.
Parameters Args:
----------- ids (List[int]): The list of pronunciation id sequence.
ids: List[int] Returns:
The list of pronunciation id sequence. List[str]: The list of pronunciation sequence.
Returns
----------
List[str]
The list of pronunciation sequence.
""" """
return [self.vocab.reverse(i) for i in ids] return [self.vocab.reverse(i) for i in ids]
def __call__(self, sentence): def __call__(self, sentence):
""" Convert the input text sequence into pronunciation id sequence. """ Convert the input text sequence into pronunciation id sequence.
Parameters Args:
----------- sentence(str): The input text sequence.
sentence: str Returns:
The input text sequence. List[str]: The list of pronunciation id sequence.
Returns
----------
List[str]
The list of pronunciation id sequence.
""" """
return self.numericalize(self.phoneticize(sentence)) return self.numericalize(self.phoneticize(sentence))
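The `phoneticize` → `numericalize` → `reverse` round trip documented above can be sketched with toy stand-ins. `ToyVocab` and the phone list here are hypothetical, not the actual PaddleSpeech `Vocab` or English frontend:

```python
# Minimal sketch of the phoneticize -> numericalize -> reverse round trip.
class ToyVocab:
    def __init__(self, symbols):
        self._stoi = {s: i for i, s in enumerate(symbols)}
        self._itos = {i: s for s, i in self._stoi.items()}

    def lookup(self, symbol):
        return self._stoi[symbol]

    def reverse(self, idx):
        return self._itos[idx]

vocab = ToyVocab(["<s>", "</s>", "HH", "AH", "L", "OW"])

def phoneticize(phones):
    # a real frontend would run G2P here; we just wrap with start/end symbols
    return ["<s>"] + phones + ["</s>"]

def numericalize(phonemes):
    return [vocab.lookup(p) for p in phonemes]

def reverse(ids):
    return [vocab.reverse(i) for i in ids]

ids = numericalize(phoneticize(["HH", "AH", "L", "OW"]))
assert reverse(ids) == ["<s>", "HH", "AH", "L", "OW", "</s>"]
```

The round trip only recovers symbols, not the original text, since G2P is lossy.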
@@ -183,28 +167,21 @@ class EnglishCharacter(Phonetics):
    def phoneticize(self, sentence):
        """ Normalize the input text sequence.
        Args:
            sentence (str): The input text sequence.
        Returns:
            str: The normalized text sequence.
        """
        words = normalize(sentence)
        return words

    def numericalize(self, sentence):
        """ Convert a text sequence into ids.
        Args:
            sentence (str): The input text sequence.
        Returns:
            List[int]: The list of character ids.
        """
        ids = [
            self.vocab.lookup(item) for item in sentence

@@ -214,27 +191,19 @@ class EnglishCharacter(Phonetics):
    def reverse(self, ids):
        """ Convert a character id sequence into text.
        Args:
            ids (List[int]): The list of character ids.
        Returns:
            str: The input text sequence.
        """
        return [self.vocab.reverse(i) for i in ids]

    def __call__(self, sentence):
        """ Normalize the input text sequence and convert it into character id sequence.
        Args:
            sentence (str): The input text sequence.
        Returns:
            List[int]: The list of character ids.
        """
        return self.numericalize(self.phoneticize(sentence))
@@ -264,14 +233,10 @@ class Chinese(Phonetics):
    def phoneticize(self, sentence):
        """ Normalize the input text sequence and convert it into pronunciation sequence.
        Args:
            sentence (str): The input text sequence.
        Returns:
            List[str]: The list of pronunciation sequence.
        """
        # simplified = self.opencc_backend.convert(sentence)
        simplified = sentence
@@ -296,28 +261,20 @@ class Chinese(Phonetics):
    def numericalize(self, phonemes):
        """ Convert pronunciation sequence into pronunciation id sequence.
        Args:
            phonemes (List[str]): The list of pronunciation sequence.
        Returns:
            List[int]: The list of pronunciation id sequence.
        """
        ids = [self.vocab.lookup(item) for item in phonemes]
        return ids

    def __call__(self, sentence):
        """ Convert the input text sequence into pronunciation id sequence.
        Args:
            sentence (str): The input text sequence.
        Returns:
            List[str]: The list of pronunciation id sequence.
        """
        return self.numericalize(self.phoneticize(sentence))
@@ -329,13 +286,9 @@ class Chinese(Phonetics):
    def reverse(self, ids):
        """ Reverse the list of pronunciation id sequence to a list of pronunciation sequence.
        Args:
            ids (List[int]): The list of pronunciation id sequence.
        Returns:
            List[str]: The list of pronunciation sequence.
        """
        return [self.vocab.reverse(i) for i in ids]
@@ -20,22 +20,12 @@ __all__ = ["Vocab"]
class Vocab(object):
    """ Vocabulary.
    Args:
        symbols (Iterable[str]): Common symbols.
        padding_symbol (str, optional): Symbol for padding. Defaults to "<pad>".
        unk_symbol (str, optional): Symbol for unknown. Defaults to "<unk>".
        start_symbol (str, optional): Symbol for start. Defaults to "<s>".
        end_symbol (str, optional): Symbol for end. Defaults to "</s>".
    """
    def __init__(self,
...
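A vocabulary like this typically reserves ids for its special symbols and maps unseen input to `unk_symbol`. A minimal dict-backed sketch, hypothetical rather than the PaddleSpeech implementation:

```python
# Hypothetical sketch of a Vocab with special symbols; the storage scheme
# and id ordering are assumptions, not PaddleSpeech's actual layout.
class Vocab:
    def __init__(self, symbols, padding_symbol="<pad>", unk_symbol="<unk>",
                 start_symbol="<s>", end_symbol="</s>"):
        specials = [padding_symbol, unk_symbol, start_symbol, end_symbol]
        self.unk_symbol = unk_symbol
        self._stoi = {}
        for s in specials + list(symbols):
            self._stoi.setdefault(s, len(self._stoi))
        self._itos = {i: s for s, i in self._stoi.items()}

    def lookup(self, symbol):
        # unseen symbols fall back to the <unk> id
        return self._stoi.get(symbol, self._stoi[self.unk_symbol])

    def reverse(self, idx):
        return self._itos[idx]

v = Vocab(["a", "b"])
assert v.lookup("a") == 4               # the four specials occupy ids 0..3
assert v.lookup("zzz") == v.lookup("<unk>")
assert v.reverse(v.lookup("b")) == "b"
```

Reserving `<pad>` as id 0 is a common convention because padded positions then map to the zero embedding row.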
@@ -44,12 +44,10 @@ RE_TIME_RANGE = re.compile(r'([0-1]?[0-9]|2[0-3])'
def replace_time(match) -> str:
    """Verbalize a matched time expression.
    Args:
        match (re.Match)
    Returns:
        str
    """
    is_range = len(match.groups()) > 5
@@ -87,12 +85,10 @@ RE_DATE = re.compile(r'(\d{4}|\d{2})年'
def replace_date(match) -> str:
    """Verbalize a matched date expression.
    Args:
        match (re.Match)
    Returns:
        str
    """
    year = match.group(1)
    month = match.group(3)
@@ -114,12 +110,10 @@ RE_DATE2 = re.compile(
def replace_date2(match) -> str:
    """Verbalize a matched date expression of the second form.
    Args:
        match (re.Match)
    Returns:
        str
    """
    year = match.group(1)
    month = match.group(3)
...
@@ -36,12 +36,10 @@ RE_FRAC = re.compile(r'(-?)(\d+)/(\d+)')
def replace_frac(match) -> str:
    """Verbalize a matched fraction.
    Args:
        match (re.Match)
    Returns:
        str
    """
    sign = match.group(1)
    nominator = match.group(2)
@@ -59,12 +57,10 @@ RE_PERCENTAGE = re.compile(r'(-?)(\d+(\.\d+)?)%')
def replace_percentage(match) -> str:
    """Verbalize a matched percentage.
    Args:
        match (re.Match)
    Returns:
        str
    """
    sign = match.group(1)
    percent = match.group(2)
@@ -81,12 +77,10 @@ RE_INTEGER = re.compile(r'(-)' r'(\d+)')
def replace_negative_num(match) -> str:
    """Verbalize a matched negative number.
    Args:
        match (re.Match)
    Returns:
        str
    """
    sign = match.group(1)
    number = match.group(2)
@@ -103,12 +97,10 @@ RE_DEFAULT_NUM = re.compile(r'\d{3}\d*')
def replace_default_num(match):
    """Verbalize a matched digit string digit by digit.
    Args:
        match (re.Match)
    Returns:
        str
    """
    number = match.group(0)
    return verbalize_digit(number)
@@ -124,12 +116,10 @@ RE_NUMBER = re.compile(r'(-?)((\d+)(\.\d+)?)' r'|(\.(\d+))')
def replace_positive_quantifier(match) -> str:
    """Verbalize a matched number followed by a quantifier.
    Args:
        match (re.Match)
    Returns:
        str
    """
    number = match.group(1)
    match_2 = match.group(2)
@@ -142,12 +132,10 @@ def replace_positive_quantifier(match) -> str:
def replace_number(match) -> str:
    """Verbalize a matched number.
    Args:
        match (re.Match)
    Returns:
        str
    """
    sign = match.group(1)
    number = match.group(2)
@@ -169,12 +157,10 @@ RE_RANGE = re.compile(
def replace_range(match) -> str:
    """Verbalize a matched numeric range.
    Args:
        match (re.Match)
    Returns:
        str
    """
    first, second = match.group(1), match.group(8)
    first = RE_NUMBER.sub(replace_number, first)
...
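All of these `replace_*` functions follow the same pattern: a compiled regex plus a replacer function handed to `re.sub`. A toy sketch of that pattern — the digit-by-digit verbalizer here is a stand-in, not PaddleSpeech's real number verbalization, which reads multi-digit numbers as proper numerals:

```python
import re

# Toy digit verbalizer (reads digits one by one; a stand-in only).
DIGITS = dict(zip("0123456789", "零一二三四五六七八九"))

RE_FRAC = re.compile(r'(-?)(\d+)/(\d+)')

def verbalize_digit(s):
    return "".join(DIGITS[c] for c in s)

def replace_frac(match):
    sign, nominator, denominator = match.groups()
    sign = "负" if sign else ""
    # Chinese reads fractions denominator-first: "b分之a"
    return f"{sign}{verbalize_digit(denominator)}分之{verbalize_digit(nominator)}"

assert RE_FRAC.sub(replace_frac, "共1/3的人") == "共三分之一的人"
```

Because `re.sub` accepts a callable, each rule stays a small pure function that can be unit-tested on its own `re.Match`.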
@@ -45,23 +45,19 @@ def phone2str(phone_string: str, mobile=True) -> str:
def replace_phone(match) -> str:
    """Verbalize a matched landline phone number.
    Args:
        match (re.Match)
    Returns:
        str
    """
    return phone2str(match.group(0), mobile=False)


def replace_mobile(match) -> str:
    """Verbalize a matched mobile phone number.
    Args:
        match (re.Match)
    Returns:
        str
    """
    return phone2str(match.group(0))
@@ -22,12 +22,10 @@ RE_TEMPERATURE = re.compile(r'(-?)(\d+(\.\d+)?)(°C|℃|度|摄氏度)')
def replace_temperature(match) -> str:
    """Verbalize a matched temperature expression.
    Args:
        match (re.Match)
    Returns:
        str
    """
    sign = match.group(1)
    temperature = match.group(2)
...
@@ -55,14 +55,10 @@ class TextNormalizer():
    def _split(self, text: str, lang="zh") -> List[str]:
        """Split long text into sentences with sentence-splitting punctuations.
        Args:
            text (str): The input text.
        Returns:
            List[str]: Sentences.
        """
        # Only for pure Chinese here
        if lang == "zh":
...
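The sentence splitting in `_split` can be sketched as a single regex pass that appends a newline after each sentence-final punctuation mark and then splits on it. The delimiter set below is an assumption for illustration, not the exact set `TextNormalizer` uses:

```python
import re

# Assumed delimiter set: Chinese/ASCII sentence punctuation, optionally
# followed by a closing quote that should stay with its sentence.
SENTENCE_SPLITOR = re.compile(r'([：、，；。？！,;?!][”’]?)')

def split(text):
    text = text.strip()
    # keep each delimiter attached to the sentence it ends
    text = SENTENCE_SPLITOR.sub(r'\1\n', text)
    return [s.strip() for s in text.split('\n') if s.strip()]

assert split("今天天气很好。我们出去玩吧!") == ["今天天气很好。", "我们出去玩吧!"]
```

Splitting before normalization keeps each regex rule working on short, bounded strings, which also limits pathological backtracking.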
@@ -37,35 +37,21 @@ class HiFiGANGenerator(nn.Layer):
        use_weight_norm: bool=True,
        init_type: str="xavier_uniform", ):
        """Initialize HiFiGANGenerator module.
        Args:
            in_channels (int): Number of input channels.
            out_channels (int): Number of output channels.
            channels (int): Number of hidden representation channels.
            kernel_size (int): Kernel size of initial and final conv layer.
            upsample_scales (list): List of upsampling scales.
            upsample_kernel_sizes (list): List of kernel sizes for upsampling layers.
            resblock_kernel_sizes (list): List of kernel sizes for residual blocks.
            resblock_dilations (list): List of dilation lists for residual blocks.
            use_additional_convs (bool): Whether to use additional conv layers in residual blocks.
            bias (bool): Whether to add bias parameter in convolution layers.
            nonlinear_activation (str): Activation function module name.
            nonlinear_activation_params (dict): Hyperparameters for activation function.
            use_weight_norm (bool): Whether to use weight norm.
                If set to true, it will be applied to all of the conv layers.
        """
        super().__init__()
@@ -134,14 +120,11 @@ class HiFiGANGenerator(nn.Layer):
    def forward(self, c):
        """Calculate forward propagation.

        Args:
            c (Tensor): Input tensor (B, in_channels, T).
        Returns:
            Tensor: Output tensor (B, out_channels, T).
        """
        c = self.input_conv(c)
        for i in range(self.num_upsamples):
@@ -196,15 +179,12 @@ class HiFiGANGenerator(nn.Layer):
    def inference(self, c):
        """Perform inference.
        Args:
            c (Tensor): Input tensor (T, in_channels).
        Returns:
            Tensor:
                Output tensor (T * prod(upsample_scales), out_channels).
        """
        c = self.forward(c.transpose([1, 0]).unsqueeze(0))
        return c.squeeze(0).transpose([1, 0])
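The pseudo-batch dance in `inference` — time-major (T, C) in, time-major (T', C') out, with a channel-major batch in between — is easy to get wrong. A NumPy shape sketch, with a dummy forward standing in for the generator (the factor 256 is illustrative, not a PaddleSpeech default):

```python
import numpy as np

T, in_channels, upsample_factor = 5, 80, 256

def fake_forward(c):
    # stand-in for the generator's forward: (B, C, T) -> (B, out_channels, T')
    b, ch, t = c.shape
    return np.zeros((b, 1, t * upsample_factor))

c = np.random.randn(T, in_channels)
out = fake_forward(c.T[np.newaxis])    # (T, C) -> (1, C, T) -> (1, 1, T')
out = out.squeeze(0).T                 # (1, 1, T') -> (T', 1)
assert out.shape == (T * upsample_factor, 1)
```

The transpose/unsqueeze on entry and squeeze/transpose on exit mirror the two lines of `inference` exactly.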
@@ -229,36 +209,23 @@ class HiFiGANPeriodDiscriminator(nn.Layer):
        use_spectral_norm: bool=False,
        init_type: str="xavier_uniform", ):
        """Initialize HiFiGANPeriodDiscriminator module.

        Args:
            in_channels (int): Number of input channels.
            out_channels (int): Number of output channels.
            period (int): Period.
            kernel_sizes (list): Kernel sizes of initial conv layers and the final conv layer.
            channels (int): Number of initial channels.
            downsample_scales (list): List of downsampling scales.
            max_downsample_channels (int): Number of maximum downsampling channels.
            use_additional_convs (bool): Whether to use additional conv layers in residual blocks.
            bias (bool): Whether to add bias parameter in convolution layers.
            nonlinear_activation (str): Activation function module name.
            nonlinear_activation_params (dict): Hyperparameters for activation function.
            use_weight_norm (bool): Whether to use weight norm.
                If set to true, it will be applied to all of the conv layers.
            use_spectral_norm (bool): Whether to use spectral norm.
                If set to true, it will be applied to all of the conv layers.
        """
        super().__init__()
@@ -307,14 +274,11 @@ class HiFiGANPeriodDiscriminator(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.

        Args:
            x (Tensor): Input tensor (B, in_channels, T).
        Returns:
            list: List of each layer's tensors.
        """
        # transform 1d to 2d -> (B, C, T/P, P)
        b, c, t = paddle.shape(x)
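The "transform 1d to 2d" comment hides a padding detail: T must first be padded to a multiple of the period (reflect padding is common here) before folding into (B, C, T/P, P). A NumPy sketch with illustrative numbers:

```python
import numpy as np

b, c, t, period = 1, 1, 10, 4

x = np.arange(b * c * t, dtype=np.float32).reshape(b, c, t)
n_pad = (period - t % period) % period          # 2 extra frames needed here
x = np.pad(x, ((0, 0), (0, 0), (0, n_pad)), mode="reflect")
x = x.reshape(b, c, (t + n_pad) // period, period)
assert x.shape == (1, 1, 3, 4)
```

After the fold, each column of the last axis holds samples that are exactly one period apart, which is what lets 2D convolutions pick up periodic structure.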
@@ -379,13 +343,11 @@ class HiFiGANMultiPeriodDiscriminator(nn.Layer):
        },
        init_type: str="xavier_uniform", ):
        """Initialize HiFiGANMultiPeriodDiscriminator module.

        Args:
            periods (list): List of periods.
            discriminator_params (dict): Parameters for hifi-gan period discriminator module.
                The period parameter will be overwritten.
        """
        super().__init__()
        # initialize parameters
@@ -399,14 +361,11 @@ class HiFiGANMultiPeriodDiscriminator(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.

        Args:
            x (Tensor): Input noise signal (B, 1, T).
        Returns:
            List: List of list of each discriminator outputs, which consists of each layer output tensors.
        """
        outs = []
        for f in self.discriminators:
use_spectral_norm: bool=False, use_spectral_norm: bool=False,
init_type: str="xavier_uniform", ): init_type: str="xavier_uniform", ):
"""Initilize HiFiGAN scale discriminator module. """Initilize HiFiGAN scale discriminator module.
Parameters
---------- Args:
in_channels : int in_channels (int): Number of input channels.
Number of input channels. out_channels (int): Number of output channels.
out_channels : int kernel_sizes (list): List of four kernel sizes. The first will be used for the first conv layer,
Number of output channels. and the second is for downsampling part, and the remaining two are for output layers.
kernel_sizes : list channels (int): Initial number of channels for conv layer.
List of four kernel sizes. The first will be used for the first conv layer, max_downsample_channels (int): Maximum number of channels for downsampling layers.
and the second is for downsampling part, and the remaining two are for output layers. bias (bool): Whether to add bias parameter in convolution layers.
channels : int downsample_scales (list): List of downsampling scales.
Initial number of channels for conv layer. nonlinear_activation (str): Activation function module name.
max_downsample_channels : int nonlinear_activation_params (dict): Hyperparameters for activation function.
Maximum number of channels for downsampling layers. use_weight_norm (bool): Whether to use weight norm.
bias : bool If set to true, it will be applied to all of the conv layers.
Whether to add bias parameter in convolution layers. use_spectral_norm (bool): Whether to use spectral norm.
downsample_scales : list If set to true, it will be applied to all of the conv layers.
List of downsampling scales.
nonlinear_activation : str
Activation function module name.
nonlinear_activation_params : dict
Hyperparameters for activation function.
use_weight_norm : bool
Whether to use weight norm.
If set to true, it will be applied to all of the conv layers.
use_spectral_norm : bool
Whether to use spectral norm.
If set to true, it will be applied to all of the conv layers.
""" """
super().__init__() super().__init__()
...@@ -546,14 +494,11 @@ class HiFiGANScaleDiscriminator(nn.Layer): ...@@ -546,14 +494,11 @@ class HiFiGANScaleDiscriminator(nn.Layer):
def forward(self, x): def forward(self, x):
"""Calculate forward propagation. """Calculate forward propagation.
Parameters
---------- Args:
x : Tensor x (Tensor): Input noise signal (B, 1, T).
Input noise signal (B, 1, T). Returns:
Returns List: List of output tensors of each layer.
----------
List
List of output tensors of each layer.
""" """
outs = [] outs = []
for f in self.layers: for f in self.layers:
...@@ -613,20 +558,14 @@ class HiFiGANMultiScaleDiscriminator(nn.Layer): ...@@ -613,20 +558,14 @@ class HiFiGANMultiScaleDiscriminator(nn.Layer):
follow_official_norm: bool=False, follow_official_norm: bool=False,
init_type: str="xavier_uniform", ): init_type: str="xavier_uniform", ):
"""Initilize HiFiGAN multi-scale discriminator module. """Initilize HiFiGAN multi-scale discriminator module.
Parameters
---------- Args:
scales : int scales (int): Number of multi-scales.
Number of multi-scales. downsample_pooling (str): Pooling module name for downsampling of the inputs.
downsample_pooling : str downsample_pooling_params (dict): Parameters for the above pooling module.
Pooling module name for downsampling of the inputs. discriminator_params (dict): Parameters for hifi-gan scale discriminator module.
downsample_pooling_params : dict follow_official_norm (bool): Whether to follow the norm setting of the official
Parameters for the above pooling module. implementaion. The first discriminator uses spectral norm and the other discriminators use weight norm.
discriminator_params : dict
Parameters for hifi-gan scale discriminator module.
follow_official_norm : bool
Whether to follow the norm setting of the official
implementaion. The first discriminator uses spectral norm and the other
discriminators use weight norm.
""" """
super().__init__() super().__init__()
...@@ -651,14 +590,11 @@ class HiFiGANMultiScaleDiscriminator(nn.Layer): ...@@ -651,14 +590,11 @@ class HiFiGANMultiScaleDiscriminator(nn.Layer):
def forward(self, x): def forward(self, x):
"""Calculate forward propagation. """Calculate forward propagation.
Parameters
---------- Args:
x : Tensor x (Tensor): Input noise signal (B, 1, T).
Input noise signal (B, 1, T). Returns:
Returns List: List of list of each discriminator outputs, which consists of each layer output tensors.
----------
List
List of list of each discriminator outputs, which consists of each layer output tensors.
""" """
outs = [] outs = []
for f in self.discriminators: for f in self.discriminators:
...@@ -715,24 +651,17 @@ class HiFiGANMultiScaleMultiPeriodDiscriminator(nn.Layer): ...@@ -715,24 +651,17 @@ class HiFiGANMultiScaleMultiPeriodDiscriminator(nn.Layer):
}, },
init_type: str="xavier_uniform", ): init_type: str="xavier_uniform", ):
"""Initilize HiFiGAN multi-scale + multi-period discriminator module. """Initilize HiFiGAN multi-scale + multi-period discriminator module.
Parameters
---------- Args:
scales : int scales (int): Number of multi-scales.
Number of multi-scales. scale_downsample_pooling (str): Pooling module name for downsampling of the inputs.
scale_downsample_pooling : str scale_downsample_pooling_params (dict): Parameters for the above pooling module.
Pooling module name for downsampling of the inputs. scale_discriminator_params (dict): Parameters for hifi-gan scale discriminator module.
scale_downsample_pooling_params : dict follow_official_norm (bool): Whether to follow the norm setting of the official implementaion.
Parameters for the above pooling module. The first discriminator uses spectral norm and the other discriminators use weight norm.
scale_discriminator_params : dict periods (list): List of periods.
Parameters for hifi-gan scale discriminator module. period_discriminator_params (dict): Parameters for hifi-gan period discriminator module.
follow_official_norm : bool): Whether to follow the norm setting of the official The period parameter will be overwritten.
implementaion. The first discriminator uses spectral norm and the other
discriminators use weight norm.
periods : list
List of periods.
period_discriminator_params : dict
Parameters for hifi-gan period discriminator module.
The period parameter will be overwritten.
""" """
super().__init__() super().__init__()
...@@ -751,16 +680,14 @@ class HiFiGANMultiScaleMultiPeriodDiscriminator(nn.Layer): ...@@ -751,16 +680,14 @@ class HiFiGANMultiScaleMultiPeriodDiscriminator(nn.Layer):
def forward(self, x): def forward(self, x):
"""Calculate forward propagation. """Calculate forward propagation.
Parameters
---------- Args:
x : Tensor x (Tensor): Input noise signal (B, 1, T).
Input noise signal (B, 1, T). Returns:
Returns List:
---------- List of list of each discriminator outputs,
List: which consists of each layer output tensors.
List of list of each discriminator outputs, Multi scale and multi period ones are concatenated.
which consists of each layer output tensors.
Multi scale and multi period ones are concatenated.
""" """
msd_outs = self.msd(x) msd_outs = self.msd(x)
mpd_outs = self.mpd(x) mpd_outs = self.mpd(x)
......
@@ -51,41 +51,26 @@ class MelGANGenerator(nn.Layer):
        use_causal_conv: bool=False,
        init_type: str="xavier_uniform", ):
        """Initialize MelGANGenerator module.

        Args:
            in_channels (int): Number of input channels.
            out_channels (int): Number of output channels,
                the number of sub-band is out_channels in multi-band melgan.
            kernel_size (int): Kernel size of initial and final conv layer.
            channels (int): Initial number of channels for conv layer.
            bias (bool): Whether to add bias parameter in convolution layers.
            upsample_scales (List[int]): List of upsampling scales.
            stack_kernel_size (int): Kernel size of dilated conv layers in residual stack.
            stacks (int): Number of stacks in a single residual stack.
            nonlinear_activation (Optional[str], optional): Nonlinear activation in upsample network, by default None.
            nonlinear_activation_params (Dict[str, Any], optional): Parameters passed to the nonlinear activation
                in the upsample network, by default {}.
            pad (str): Padding function module name before dilated convolution layer.
            pad_params (dict): Hyperparameters for padding function.
            use_final_nonlinear_activation (nn.Layer): Activation function for the final layer.
            use_weight_norm (bool): Whether to use weight norm.
                If set to true, it will be applied to all of the conv layers.
            use_causal_conv (bool): Whether to use causal convolution.
        """
        super().__init__()
def forward(self, c): def forward(self, c):
"""Calculate forward propagation. """Calculate forward propagation.
Parameters
---------- Args:
c : Tensor c (Tensor): Input tensor (B, in_channels, T).
Input tensor (B, in_channels, T). Returns:
Returns Tensor: Output tensor (B, out_channels, T ** prod(upsample_scales)).
----------
Tensor
Output tensor (B, out_channels, T ** prod(upsample_scales)).
""" """
out = self.melgan(c) out = self.melgan(c)
return out return out
@@ -260,14 +242,11 @@ class MelGANGenerator(nn.Layer):
    def inference(self, c):
        """Perform inference.

        Args:
            c (Union[Tensor, ndarray]): Input tensor (T, in_channels).
        Returns:
            Tensor: Output tensor (out_channels * T * prod(upsample_scales), 1).
        """
        # pseudo batch
        c = c.transpose([1, 0]).unsqueeze(0)
...
@@ -298,33 +277,22 @@ class MelGANDiscriminator(nn.Layer):
            pad_params: Dict[str, Any]={"mode": "reflect"},
            init_type: str="xavier_uniform", ):
        """Initialize MelGAN discriminator module.

        Args:
            in_channels (int): Number of input channels.
            out_channels (int): Number of output channels.
            kernel_sizes (List[int]): List of two kernel sizes. The prod will be used for the first conv layer,
                and the first and the second kernel sizes will be used for the last two layers.
                For example if kernel_sizes = [5, 3], the first layer kernel size will be 5 * 3 = 15,
                and the last two layers' kernel sizes will be 5 and 3, respectively.
            channels (int): Initial number of channels for conv layer.
            max_downsample_channels (int): Maximum number of channels for downsampling layers.
            bias (bool): Whether to add bias parameter in convolution layers.
            downsample_scales (List[int]): List of downsampling scales.
            nonlinear_activation (str): Activation function module name.
            nonlinear_activation_params (dict): Hyperparameters for activation function.
            pad (str): Padding function module name before dilated convolution layer.
            pad_params (dict): Hyperparameters for padding function.
        """
        super().__init__()
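The `kernel_sizes` convention described in the docstring (product for the first conv layer, the individual sizes for the last two layers) reduces to simple arithmetic; `[5, 3]` is the example given in the docstring itself:

```python
kernel_sizes = [5, 3]  # the docstring's own example

# First conv layer uses the product of the two sizes.
first_layer_kernel = kernel_sizes[0] * kernel_sizes[1]
# The last two layers use the sizes individually.
last_two_kernels = (kernel_sizes[0], kernel_sizes[1])
print(first_layer_kernel, last_two_kernels)  # 15 (5, 3)
```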
@@ -395,14 +363,10 @@ class MelGANDiscriminator(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.

        Args:
            x (Tensor): Input noise signal (B, 1, T).
        Returns:
            List: List of output tensors of each layer (for feat_match_loss).
        """
        outs = []
        for f in self.layers:
@@ -440,39 +404,24 @@ class MelGANMultiScaleDiscriminator(nn.Layer):
            use_weight_norm: bool=True,
            init_type: str="xavier_uniform", ):
        """Initialize MelGAN multi-scale discriminator module.

        Args:
            in_channels (int): Number of input channels.
            out_channels (int): Number of output channels.
            scales (int): Number of multi-scales.
            downsample_pooling (str): Pooling module name for downsampling of the inputs.
            downsample_pooling_params (dict): Parameters for the above pooling module.
            kernel_sizes (List[int]): List of two kernel sizes. The sum will be used for the first conv layer,
                and the first and the second kernel sizes will be used for the last two layers.
            channels (int): Initial number of channels for conv layer.
            max_downsample_channels (int): Maximum number of channels for downsampling layers.
            bias (bool): Whether to add bias parameter in convolution layers.
            downsample_scales (List[int]): List of downsampling scales.
            nonlinear_activation (str): Activation function module name.
            nonlinear_activation_params (dict): Hyperparameters for activation function.
            pad (str): Padding function module name before dilated convolution layer.
            pad_params (dict): Hyperparameters for padding function.
            use_causal_conv (bool): Whether to use causal convolution.
        """
        super().__init__()
@@ -514,14 +463,10 @@ class MelGANMultiScaleDiscriminator(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.

        Args:
            x (Tensor): Input noise signal (B, 1, T).
        Returns:
            List: List of list of each discriminator outputs, which consists of each layer output tensors.
        """
        outs = []
        for f in self.discriminators:
...
@@ -52,37 +52,23 @@ class StyleMelGANGenerator(nn.Layer):
            use_weight_norm: bool=True,
            init_type: str="xavier_uniform", ):
        """Initialize Style MelGAN generator.

        Args:
            in_channels (int): Number of input noise channels.
            aux_channels (int): Number of auxiliary input channels.
            channels (int): Number of channels for conv layer.
            out_channels (int): Number of output channels.
            kernel_size (int): Kernel size of conv layers.
            dilation (int): Dilation factor for conv layers.
            bias (bool): Whether to add bias parameter in convolution layers.
            noise_upsample_scales (list): List of noise upsampling scales.
            noise_upsample_activation (str): Activation function module name for noise upsampling.
            noise_upsample_activation_params (dict): Hyperparameters for the above activation function.
            upsample_scales (list): List of upsampling scales.
            upsample_mode (str): Upsampling mode in TADE layer.
            gated_function (str): Gated function in TADEResBlock ("softmax" or "sigmoid").
            use_weight_norm (bool): Whether to use weight norm.
                If set to true, it will be applied to all of the conv layers.
        """
        super().__init__()
@@ -147,16 +133,12 @@ class StyleMelGANGenerator(nn.Layer):
    def forward(self, c, z=None):
        """Calculate forward propagation.

        Args:
            c (Tensor): Auxiliary input tensor (B, channels, T).
            z (Tensor): Input noise tensor (B, in_channels, 1).
        Returns:
            Tensor: Output tensor (B, out_channels, T * prod(upsample_scales)).
        """
        # batch_max_steps(24000) == noise_upsample_factor(80) * upsample_factor(300)
        if z is None:
@@ -211,14 +193,10 @@ class StyleMelGANGenerator(nn.Layer):
    def inference(self, c):
        """Perform inference.

        Args:
            c (Tensor): Input tensor (T, in_channels).
        Returns:
            Tensor: Output tensor (T * prod(upsample_scales), out_channels).
        """
        # (1, in_channels, T)
        c = c.transpose([1, 0]).unsqueeze(0)
@@ -278,18 +256,13 @@ class StyleMelGANDiscriminator(nn.Layer):
            use_weight_norm: bool=True,
            init_type: str="xavier_uniform", ):
        """Initialize Style MelGAN discriminator.

        Args:
            repeats (int): Number of repetitions to apply RWD.
            window_sizes (list): List of random window sizes.
            pqmf_params (list): List of lists of parameters for PQMF modules.
            discriminator_params (dict): Parameters for base discriminator module.
            use_weight_norm (bool): Whether to apply weight normalization.
        """
        super().__init__()
@@ -325,15 +298,11 @@ class StyleMelGANDiscriminator(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.

        Args:
            x (Tensor): Input tensor (B, 1, T).
        Returns:
            List: List of discriminator outputs, #items in the list will be
                equal to repeats * #discriminators.
        """
        outs = []
        for _ in range(self.repeats):
...
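Per the docstring, the random-window discriminator returns `repeats * #discriminators` outputs. A minimal sketch of that bookkeeping; the `repeats` and `window_sizes` values are assumed example defaults, not taken from this diff:

```python
repeats = 2                              # assumed number of RWD repetitions
window_sizes = [512, 1024, 2048, 4096]   # assumed random window sizes (one discriminator each)

# The output list from forward() holds one entry per discriminator per repeat.
num_outputs = repeats * len(window_sizes)
print(num_outputs)  # 8
```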
@@ -31,51 +31,30 @@ from paddlespeech.t2s.modules.upsample import ConvInUpsampleNet
class PWGGenerator(nn.Layer):
    """Wave Generator for Parallel WaveGAN

    Args:
        in_channels (int, optional): Number of channels of the input waveform, by default 1
        out_channels (int, optional): Number of channels of the output waveform, by default 1
        kernel_size (int, optional): Kernel size of the residual blocks inside, by default 3
        layers (int, optional): Number of residual blocks inside, by default 30
        stacks (int, optional): The number of groups to split the residual blocks into, by default 3.
            Within each group, the dilation of the residual block grows exponentially.
        residual_channels (int, optional): Residual channel of the residual blocks, by default 64
        gate_channels (int, optional): Gate channel of the residual blocks, by default 128
        skip_channels (int, optional): Skip channel of the residual blocks, by default 64
        aux_channels (int, optional): Auxiliary channel of the residual blocks, by default 80
        aux_context_window (int, optional): The context window size of the first convolution applied to the
            auxiliary input, by default 2
        dropout (float, optional): Dropout of the residual blocks, by default 0.
        bias (bool, optional): Whether to use bias in residual blocks, by default True
        use_weight_norm (bool, optional): Whether to use weight norm in all convolutions, by default True
        use_causal_conv (bool, optional): Whether to use causal padding in the upsample network and residual
            blocks, by default False
        upsample_scales (List[int], optional): Upsample scales of the upsample network, by default [4, 4, 4, 4]
        nonlinear_activation (Optional[str], optional): Nonlinear activation in the upsample network, by default None
        nonlinear_activation_params (Dict[str, Any], optional): Parameters passed to the nonlinear activation
            in the upsample network, by default {}
        interpolate_mode (str, optional): Interpolation mode of the upsample network, by default "nearest"
        freq_axis_kernel_size (int, optional): Kernel size along the frequency axis of the upsample network, by default 1
    """

    def __init__(
@@ -167,18 +146,13 @@ class PWGGenerator(nn.Layer):
    def forward(self, x, c):
        """Generate waveform.

        Args:
            x (Tensor): Shape (N, C_in, T), the input waveform.
            c (Tensor): Shape (N, C_aux, T'). The auxiliary input (e.g. spectrogram). It
                is upsampled to match the time resolution of the input.
        Returns:
            Tensor: Shape (N, C_out, T), the generated waveform.
        """
        c = self.upsample_net(c)
        assert c.shape[-1] == x.shape[-1]
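The `assert c.shape[-1] == x.shape[-1]` above encodes the requirement that the upsample network stretches the auxiliary features to the waveform length. A standalone sketch of that length relation, using the `[4, 4, 4, 4]` default scales stated in the docstring:

```python
import numpy as np

upsample_scales = [4, 4, 4, 4]             # default per the class docstring
hop_size = int(np.prod(upsample_scales))   # 256 waveform samples per aux frame

T_aux = 100                                # auxiliary (spectrogram) frames
T_wav = T_aux * hop_size                   # waveform length the generator pairs with
print(T_wav)  # 25600
```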
@@ -218,19 +192,14 @@ class PWGGenerator(nn.Layer):
        self.apply(_remove_weight_norm)

    def inference(self, c=None):
        """Waveform generation. This function is used for single instance inference.

        Args:
            c (Tensor, optional): Shape (T', C_aux), the auxiliary input, by default None
            x (Tensor, optional): Shape (T, C_in), the noise waveform, by default None.
                If not provided, a sample is drawn from a gaussian distribution.
        Returns:
            Tensor: Shape (T, C_out), the generated waveform
        """
        # when to static, can not input x, see https://github.com/PaddlePaddle/Parakeet/pull/132/files
        x = paddle.randn(
@@ -244,32 +213,21 @@ class PWGGenerator(nn.Layer):
class PWGDiscriminator(nn.Layer):
    """A convolutional discriminator for audio.

    Args:
        in_channels (int, optional): Number of channels of the input audio, by default 1
        out_channels (int, optional): Output feature size, by default 1
        kernel_size (int, optional): Kernel size of convolutional sublayers, by default 3
        layers (int, optional): Number of layers, by default 10
        conv_channels (int, optional): Feature size of the convolutional sublayers, by default 64
        dilation_factor (int, optional): The factor with which the dilation of each convolutional sublayer grows
            exponentially if it is greater than 1, else the dilation of each convolutional sublayer grows linearly,
            by default 1
        nonlinear_activation (str, optional): The activation after each convolutional sublayer, by default "leakyrelu"
        nonlinear_activation_params (Dict[str, Any], optional): The parameters passed to the activation's initializer,
            by default {"negative_slope": 0.2}
        bias (bool, optional): Whether to use bias in convolutional sublayers, by default True
        use_weight_norm (bool, optional): Whether to use weight normalization at all convolutional sublayers,
            by default True
    """

    def __init__(
@@ -330,15 +288,12 @@ class PWGDiscriminator(nn.Layer):
    def forward(self, x):
        """
        Args:
            x (Tensor): Shape (N, in_channels, num_samples), the input audio.

        Returns:
            Tensor: Shape (N, out_channels, num_samples), the predicted logits.
        """
        return self.conv_layers(x)
@@ -362,39 +317,25 @@ class PWGDiscriminator(nn.Layer):
class ResidualPWGDiscriminator(nn.Layer):
    """A wavenet-style discriminator for audio.

    Args:
        in_channels (int, optional): Number of channels of the input audio, by default 1
        out_channels (int, optional): Output feature size, by default 1
        kernel_size (int, optional): Kernel size of residual blocks, by default 3
        layers (int, optional): Number of residual blocks, by default 30
        stacks (int, optional): Number of groups of residual blocks, within which the dilation
            of each residual block grows exponentially, by default 3
        residual_channels (int, optional): Residual channels of residual blocks, by default 64
        gate_channels (int, optional): Gate channels of residual blocks, by default 128
        skip_channels (int, optional): Skip channels of residual blocks, by default 64
        dropout (float, optional): Dropout probability of residual blocks, by default 0.
        bias (bool, optional): Whether to use bias in residual blocks, by default True
        use_weight_norm (bool, optional): Whether to use weight normalization in all convolutional layers,
            by default True
        use_causal_conv (bool, optional): Whether to use causal convolution in residual blocks, by default False
        nonlinear_activation (str, optional): Activation after convolutions other than those in residual blocks,
            by default "leakyrelu"
        nonlinear_activation_params (Dict[str, Any], optional): Parameters to pass to the activation,
            by default {"negative_slope": 0.2}
    """

    def __init__(
@@ -463,15 +404,11 @@ class ResidualPWGDiscriminator(nn.Layer):
    def forward(self, x):
        """
        Args:
            x (Tensor): Shape (N, in_channels, num_samples), the input audio.

        Returns:
            Tensor: Shape (N, out_channels, num_samples), the predicted logits.
        """
        x = self.first_conv(x)
        skip = 0
...
@@ -81,69 +81,39 @@ class Tacotron2(nn.Layer):
            # training related
            init_type: str="xavier_uniform", ):
        """Initialize Tacotron2 module.

        Args:
            idim (int): Dimension of the inputs.
            odim (int): Dimension of the outputs.
            embed_dim (int): Dimension of the token embedding.
            elayers (int): Number of encoder blstm layers.
            eunits (int): Number of encoder blstm units.
            econv_layers (int): Number of encoder conv layers.
            econv_filts (int): Number of encoder conv filter size.
            econv_chans (int): Number of encoder conv filter channels.
            dlayers (int): Number of decoder lstm layers.
            dunits (int): Number of decoder lstm units.
            prenet_layers (int): Number of prenet layers.
            prenet_units (int): Number of prenet units.
            postnet_layers (int): Number of postnet layers.
            postnet_filts (int): Number of postnet filter size.
            postnet_chans (int): Number of postnet filter channels.
            output_activation (str): Name of activation function for outputs.
            adim (int): Number of dimension of mlp in attention.
            aconv_chans (int): Number of attention conv filter channels.
            aconv_filts (int): Number of attention conv filter size.
            cumulate_att_w (bool): Whether to cumulate previous attention weight.
            use_batch_norm (bool): Whether to use batch normalization.
            use_concate (bool): Whether to concat enc outputs w/ dec lstm outputs.
            reduction_factor (int): Reduction factor.
            spk_num (Optional[int]): Number of speakers. If set to > 1, assume that the
                sids will be provided as the input and use sid embedding layer.
            lang_num (Optional[int]): Number of languages. If set to > 1, assume that the
                lids will be provided as the input and use lid embedding layer.
            spk_embed_dim (Optional[int]): Speaker embedding dimension. If set to > 0,
                assume that spk_emb will be provided as the input.
            spk_embed_integration_type (str): How to integrate speaker embedding.
            dropout_rate (float): Dropout rate.
            zoneout_rate (float): Zoneout rate.
        """
        assert check_argument_types()
        super().__init__()
@@ -258,31 +228,19 @@ class Tacotron2(nn.Layer):
    ) -> Tuple[paddle.Tensor, Dict[str, paddle.Tensor], paddle.Tensor]:
        """Calculate forward propagation.

        Args:
            text (Tensor(int64)): Batch of padded character ids (B, T_text).
            text_lengths (Tensor(int64)): Batch of lengths of each input batch (B,).
            speech (Tensor): Batch of padded target features (B, T_feats, odim).
            speech_lengths (Tensor(int64)): Batch of the lengths of each target (B,).
            spk_emb (Optional[Tensor]): Batch of speaker embeddings (B, spk_embed_dim).
            spk_id (Optional[Tensor]): Batch of speaker IDs (B, 1).
            lang_id (Optional[Tensor]): Batch of language IDs (B, 1).

        Returns:
            Tensor: Loss scalar value.
            Dict: Statistics to be monitored.
            Tensor: Weight value if not joint training else model outputs.
        """
        text = text[:, :text_lengths.max()]
@@ -369,40 +327,26 @@ class Tacotron2(nn.Layer):
            use_teacher_forcing: bool=False, ) -> Dict[str, paddle.Tensor]:
        """Generate the sequence of features given the sequences of characters.

        Args:
            text (Tensor(int64)): Input sequence of characters (T_text,).
            speech (Optional[Tensor]): Feature sequence to extract style (N, idim).
            spk_emb (Optional[Tensor]): Speaker embedding (spk_embed_dim,).
            spk_id (Optional[Tensor]): Speaker ID (1,).
            lang_id (Optional[Tensor]): Language ID (1,).
            threshold (float): Threshold in inference.
            minlenratio (float): Minimum length ratio in inference.
            maxlenratio (float): Maximum length ratio in inference.
            use_att_constraint (bool): Whether to apply attention constraint.
            backward_window (int): Backward window in attention constraint.
            forward_window (int): Forward window in attention constraint.
            use_teacher_forcing (bool): Whether to use teacher forcing.

        Returns:
            Dict[str, Tensor]: Output dict including the following items:
                * feat_gen (Tensor): Output sequence of features (T_feats, odim).
                * prob (Tensor): Output sequence of stop probabilities (T_feats,).
                * att_w (Tensor): Attention weights (T_feats, T).
        """
        x = text
@@ -458,18 +402,13 @@ class Tacotron2(nn.Layer):
            spk_emb: paddle.Tensor) -> paddle.Tensor:
        """Integrate speaker embedding with hidden states.

        Args:
            hs (Tensor): Batch of hidden state sequences (B, Tmax, eunits).
            spk_emb (Tensor): Batch of speaker embeddings (B, spk_embed_dim).

        Returns:
            Tensor: Batch of integrated hidden state sequences (B, Tmax, eunits) if
                integration_type is "add" else (B, Tmax, eunits + spk_embed_dim).
        """
        if self.spk_embed_integration_type == "add":
...
@@ -48,127 +48,67 @@ class TransformerTTS(nn.Layer):
    .. _`Neural Speech Synthesis with Transformer Network`:
        https://arxiv.org/pdf/1809.08895.pdf

    Args:
        idim (int): Dimension of the inputs.
        odim (int): Dimension of the outputs.
        embed_dim (int, optional): Dimension of character embedding.
        eprenet_conv_layers (int, optional): Number of encoder prenet convolution layers.
        eprenet_conv_chans (int, optional): Number of encoder prenet convolution channels.
        eprenet_conv_filts (int, optional): Filter size of encoder prenet convolution.
        dprenet_layers (int, optional): Number of decoder prenet layers.
        dprenet_units (int, optional): Number of decoder prenet hidden units.
        elayers (int, optional): Number of encoder layers.
        eunits (int, optional): Number of encoder hidden units.
        adim (int, optional): Number of attention transformation dimensions.
        aheads (int, optional): Number of heads for multi head attention.
        dlayers (int, optional): Number of decoder layers.
        dunits (int, optional): Number of decoder hidden units.
        postnet_layers (int, optional): Number of postnet layers.
        postnet_chans (int, optional): Number of postnet channels.
        postnet_filts (int, optional): Filter size of postnet.
        use_scaled_pos_enc (bool, optional): Whether to use trainable scaled positional encoding.
        use_batch_norm (bool, optional): Whether to use batch normalization in encoder prenet.
        encoder_normalize_before (bool, optional): Whether to perform layer normalization before encoder block.
        decoder_normalize_before (bool, optional): Whether to perform layer normalization before decoder block.
        encoder_concat_after (bool, optional): Whether to concatenate attention layer's input and output in encoder.
        decoder_concat_after (bool, optional): Whether to concatenate attention layer's input and output in decoder.
        positionwise_layer_type (str, optional): Position-wise operation type.
        positionwise_conv_kernel_size (int, optional): Kernel size in position wise conv 1d.
        reduction_factor (int, optional): Reduction factor.
        spk_embed_dim (int, optional): Number of speaker embedding dimensions.
        spk_embed_integration_type (str, optional): How to integrate speaker embedding.
        use_gst (bool, optional): Whether to use global style token.
        gst_tokens (int, optional): The number of GST embeddings.
        gst_heads (int, optional): The number of heads in GST multihead attention.
        gst_conv_layers (int, optional): The number of conv layers in GST.
        gst_conv_chans_list (Sequence[int], optional): List of the number of channels of conv layers in GST.
        gst_conv_kernel_size (int, optional): Kernel size of conv layers in GST.
        gst_conv_stride (int, optional): Stride size of conv layers in GST.
        gst_gru_layers (int, optional): The number of GRU layers in GST.
        gst_gru_units (int, optional): The number of GRU units in GST.
        transformer_lr (float, optional): Initial value of learning rate.
        transformer_warmup_steps (int, optional): Optimizer warmup steps.
        transformer_enc_dropout_rate (float, optional): Dropout rate in encoder except attention and positional encoding.
        transformer_enc_positional_dropout_rate (float, optional): Dropout rate after encoder positional encoding.
        transformer_enc_attn_dropout_rate (float, optional): Dropout rate in encoder self-attention module.
        transformer_dec_dropout_rate (float, optional): Dropout rate in decoder except attention & positional encoding.
        transformer_dec_positional_dropout_rate (float, optional): Dropout rate after decoder positional encoding.
        transformer_dec_attn_dropout_rate (float, optional): Dropout rate in decoder self-attention module.
        transformer_enc_dec_attn_dropout_rate (float, optional): Dropout rate in encoder-decoder attention module.
        init_type (str, optional): How to initialize transformer parameters.
        init_enc_alpha (float, optional): Initial value of alpha in scaled pos encoding of the encoder.
        init_dec_alpha (float, optional): Initial value of alpha in scaled pos encoding of the decoder.
        eprenet_dropout_rate (float, optional): Dropout rate in encoder prenet.
        dprenet_dropout_rate (float, optional): Dropout rate in decoder prenet.
        postnet_dropout_rate (float, optional): Dropout rate in postnet.
        use_masking (bool, optional): Whether to apply masking for padded part in loss calculation.
        use_weighted_masking (bool, optional): Whether to apply weighted masking in loss calculation.
        bce_pos_weight (float, optional): Positive sample weight in bce calculation (only for use_masking=true).
        loss_type (str, optional): How to calculate loss.
        use_guided_attn_loss (bool, optional): Whether to use guided attention loss.
        num_heads_applied_guided_attn (int, optional): Number of heads in each layer to apply guided attention loss.
        num_layers_applied_guided_attn (int, optional): Number of layers to apply guided attention loss.
            List of module names to apply guided attention loss.
    """
    def __init__(
@@ -398,25 +338,16 @@ class TransformerTTS(nn.Layer):
    ) -> Tuple[paddle.Tensor, Dict[str, paddle.Tensor], paddle.Tensor]:
        """Calculate forward propagation.

        Args:
            text (Tensor(int64)): Batch of padded character ids (B, Tmax).
            text_lengths (Tensor(int64)): Batch of lengths of each input batch (B,).
            speech (Tensor): Batch of padded target features (B, Lmax, odim).
            speech_lengths (Tensor(int64)): Batch of the lengths of each target (B,).
            spk_emb (Tensor, optional): Batch of speaker embeddings (B, spk_embed_dim).

        Returns:
            Tensor: Loss scalar value.
            Dict: Statistics to be monitored.
        """
        # input of embedding must be int64
@@ -525,31 +456,19 @@ class TransformerTTS(nn.Layer):
    ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
        """Generate the sequence of features given the sequences of characters.

        Args:
            text (Tensor(int64)): Input sequence of characters (T,).
            speech (Tensor, optional): Feature sequence to extract style (N, idim).
            spk_emb (Tensor, optional): Speaker embedding vector (spk_embed_dim,).
            threshold (float, optional): Threshold in inference.
            minlenratio (float, optional): Minimum length ratio in inference.
            maxlenratio (float, optional): Maximum length ratio in inference.
            use_teacher_forcing (bool, optional): Whether to use teacher forcing.

        Returns:
            Tensor: Output sequence of features (L, odim).
            Tensor: Output sequence of stop probabilities (L,).
            Tensor: Encoder-decoder (source) attention weights (#layers, #heads, L, T).
        """
        # input of embedding must be int64
@@ -671,23 +590,17 @@ class TransformerTTS(nn.Layer):
    def _source_mask(self, ilens: paddle.Tensor) -> paddle.Tensor:
        """Make masks for self-attention.

        Args:
            ilens (Tensor): Batch of lengths (B,).

        Returns:
            Tensor: Mask tensor for self-attention. dtype=paddle.bool

        Examples:
            >>> ilens = [5, 3]
            >>> self._source_mask(ilens)
            tensor([[[1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0]]]) bool
        """
        x_masks = make_non_pad_mask(ilens)
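The non-padding mask in the example above can be illustrated with a small numpy sketch. `make_non_pad_mask` here is a hypothetical stand-in for the paddle helper of the same name (which returns a `paddle.Tensor`), kept only to show the mask shape and values from the docstring example:

```python
import numpy as np

def make_non_pad_mask(ilens, max_len=None):
    # True at valid (non-padded) positions, False at padding.
    max_len = max_len or int(max(ilens))
    pos = np.arange(max_len)                           # (T,)
    return pos[None, :] < np.asarray(ilens)[:, None]   # (B, T)

# _source_mask adds a broadcast axis: (B, 1, T), usable over attention scores.
masks = make_non_pad_mask([5, 3])[:, None, :]
```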
@@ -696,30 +609,25 @@ class TransformerTTS(nn.Layer):
    def _target_mask(self, olens: paddle.Tensor) -> paddle.Tensor:
        """Make masks for masked self-attention.

        Args:
            olens (Tensor(int64)): Batch of lengths (B,).

        Returns:
            Tensor: Mask tensor for masked self-attention.

        Examples:
            >>> olens = [5, 3]
            >>> self._target_mask(olens)
            tensor([[[1, 0, 0, 0, 0],
                     [1, 1, 0, 0, 0],
                     [1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 0],
                     [1, 1, 1, 1, 1]],
                    [[1, 0, 0, 0, 0],
                     [1, 1, 0, 0, 0],
                     [1, 1, 1, 0, 0],
                     [1, 1, 1, 0, 0],
                     [1, 1, 1, 0, 0]]], dtype=paddle.uint8)
        """
        y_masks = make_non_pad_mask(olens)
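The matrix in that example is a causal (lower-triangular) mask combined with the non-padding mask. A minimal numpy sketch of the combination, assuming the same semantics as the paddle helpers:

```python
import numpy as np

def target_mask(olens):
    # Combine a causal mask with a non-padding mask, as _target_mask does.
    T = int(max(olens))
    pos = np.arange(T)
    nonpad = pos[None, :] < np.asarray(olens)[:, None]   # (B, T)
    causal = np.tril(np.ones((T, T), dtype=bool))        # (T, T)
    return nonpad[:, None, :] & causal[None, :, :]       # (B, T, T)

m = target_mask([5, 3]).astype(np.uint8)
```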
@@ -731,17 +639,12 @@ class TransformerTTS(nn.Layer):
                                 spk_emb: paddle.Tensor) -> paddle.Tensor:
        """Integrate speaker embedding with hidden states.

        Args:
            hs (Tensor): Batch of hidden state sequences (B, Tmax, adim).
            spk_emb (Tensor): Batch of speaker embeddings (B, spk_embed_dim).

        Returns:
            Tensor: Batch of integrated hidden state sequences (B, Tmax, adim).
        """
        if self.spk_embed_integration_type == "add":
...
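The two integration strategies ("add" vs. "concat") can be sketched in numpy. This is a toy illustration, not the paddle module: `proj` is a hypothetical projection matrix standing in for the module's learned linear layer, and the real method additionally normalizes the speaker embedding:

```python
import numpy as np

def integrate_spk_embed(hs, spk_emb, integration_type="add", proj=None):
    # "add": project the speaker embedding to the hidden size, add per frame.
    # "concat": tile it over time and concatenate on the feature axis.
    B, Tmax, H = hs.shape
    if integration_type == "add":
        projected = spk_emb @ proj                     # (B, H), proj: (D, H)
        return hs + projected[:, None, :]              # (B, Tmax, H)
    if integration_type == "concat":
        tiled = np.repeat(spk_emb[:, None, :], Tmax, axis=1)
        return np.concatenate([hs, tiled], axis=-1)    # (B, Tmax, H + D)
    raise NotImplementedError("support only add or concat.")

hs = np.zeros((2, 4, 8))
spk = np.ones((2, 3))
out_add = integrate_spk_embed(hs, spk, "add", proj=np.ones((3, 8)))
out_cat = integrate_spk_embed(hs, spk, "concat")
```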
@@ -41,14 +41,10 @@ class CausalConv1D(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.

        Args:
            x (Tensor): Input tensor (B, in_channels, T).

        Returns:
            Tensor: Output tensor (B, out_channels, T).
        """
        return self.conv(self.pad(x))[:, :, :x.shape[2]]
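What the `[:, :, :x.shape[2]]` trim achieves — an output of the same length as the input, where each output step depends only on current and past inputs — can be sketched for a single 1-D channel in numpy (a toy stand-in, not the paddle layer):

```python
import numpy as np

def causal_conv1d(x, w):
    # Left-pad with (kernel_size - 1) zeros, then keep the first len(x)
    # outputs, so output[t] depends only on x[:t + 1].
    k = len(w)
    xp = np.concatenate([np.zeros(k - 1), x])
    return np.array([np.dot(xp[t:t + k], w[::-1]) for t in range(len(x))])

# Moving-sum kernel: each output is x[t] + x[t - 1].
out = causal_conv1d(np.array([1.0, 2.0, 3.0]), np.array([1.0, 1.0]))
```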
@@ -70,13 +66,9 @@ class CausalConv1DTranspose(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.

        Args:
            x (Tensor): Input tensor (B, in_channels, T_in).

        Returns:
            Tensor: Output tensor (B, out_channels, T_out).
        """
        return self.deconv(x)[:, :, :-self.stride]
@@ -18,12 +18,10 @@ from paddle import nn
class ConvolutionModule(nn.Layer):
    """ConvolutionModule in Conformer model.

    Args:
        channels (int): The number of channels of conv layers.
        kernel_size (int): Kernel size of conv layers.
    """
    def __init__(self, channels, kernel_size, activation=nn.ReLU(), bias=True):
@@ -59,14 +57,11 @@ class ConvolutionModule(nn.Layer):
    def forward(self, x):
        """Compute convolution module.

        Args:
            x (Tensor): Input tensor (#batch, time, channels).

        Returns:
            Tensor: Output tensor (#batch, time, channels).
        """
        # exchange the temporal dimension and the feature dimension
        x = x.transpose([0, 2, 1])
...
@@ -21,38 +21,29 @@ from paddlespeech.t2s.modules.layer_norm import LayerNorm
class EncoderLayer(nn.Layer):
    """Encoder layer module.

    Args:
        size (int): Input dimension.
        self_attn (nn.Layer): Self-attention module instance.
            `MultiHeadedAttention` or `RelPositionMultiHeadedAttention` instance
            can be used as the argument.
        feed_forward (nn.Layer): Feed-forward module instance.
            `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance
            can be used as the argument.
        feed_forward_macaron (nn.Layer): Additional feed-forward module instance.
            `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance
            can be used as the argument.
        conv_module (nn.Layer): Convolution module instance.
            `ConvolutionModule` instance can be used as the argument.
        dropout_rate (float): Dropout rate.
        normalize_before (bool): Whether to use layer_norm before the first block.
        concat_after (bool): Whether to concat attention layer's input and output.
            If True, an additional linear layer will be applied,
            i.e. x -> x + linear(concat(x, att(x)));
            if False, no additional linear layer will be applied, i.e. x -> x + att(x).
        stochastic_depth_rate (float): Probability to skip this layer.
            During training, the layer may skip residual computation and return input
            as-is with given probability.
    """
    def __init__(
@@ -93,22 +84,17 @@ class EncoderLayer(nn.Layer):
    def forward(self, x_input, mask, cache=None):
        """Compute encoded features.

        Args:
            x_input (Union[Tuple, Tensor]): Input tensor w/ or w/o pos emb.
                - w/ pos emb: Tuple of tensors [(#batch, time, size), (1, time, size)].
                - w/o pos emb: Tensor (#batch, time, size).
            mask (Tensor): Mask tensor for the input (#batch, time).
            cache (Tensor): Cache tensor of the input (#batch, time - 1, size).

        Returns:
            Tensor: Output tensor (#batch, time, size).
            Tensor: Mask tensor (#batch, time).
        """
        if isinstance(x_input, tuple):
            x, pos_emb = x_input[0], x_input[1]
...
@@ -17,24 +17,18 @@ import paddle
def shuffle_dim(x, axis, perm=None):
    """Permute input tensor along axis given the permutation or randomly.

Args:
x (Tensor): The input tensor.
axis (int): The axis to shuffle.
perm (List[int], ndarray, optional):
The order to reorder the tensor along the ``axis``-th dimension.
It is a permutation of ``[0, d)``, where d is the size of the
``axis``-th dimension of the input tensor. If not provided,
a random permutation is used. Defaults to None.
    Returns:
        Tensor: The shuffled tensor, which has the same shape as x does.
    """
    size = x.shape[axis]
    if perm is not None and len(perm) != size:
...
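A numpy sketch of the documented behavior (the real helper operates on paddle Tensors):

```python
import numpy as np

def shuffle_dim(x, axis, perm=None):
    # Reorder x along `axis` by `perm`; draw a random permutation if absent.
    size = x.shape[axis]
    if perm is None:
        perm = np.random.permutation(size)
    elif len(perm) != size:
        raise ValueError("perm should be a permutation of [0, size)")
    return np.take(x, perm, axis=axis)

x = np.arange(6).reshape(2, 3)
y = shuffle_dim(x, axis=1, perm=[2, 0, 1])
```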
@@ -18,13 +18,9 @@ from paddle import nn
class LayerNorm(nn.LayerNorm):
    """Layer normalization module.

    Args:
        nout (int): Output dim size.
        dim (int): Dimension to be normalized.
    """
    def __init__(self, nout, dim=-1):
@@ -35,15 +31,11 @@ class LayerNorm(nn.LayerNorm):
    def forward(self, x):
        """Apply layer normalization.

        Args:
            x (Tensor): Input tensor.

        Returns:
            Tensor: Normalized tensor.
        """
        if self.dim == -1:
...
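Without the learnable scale and bias, the computation amounts to zero-mean, unit-variance normalization along the chosen dimension; a minimal numpy sketch:

```python
import numpy as np

def layer_norm(x, dim=-1, eps=1e-12):
    # Normalize x along `dim`; the nn.LayerNorm module additionally applies
    # a learned elementwise affine transform.
    mean = x.mean(axis=dim, keepdims=True)
    var = x.var(axis=dim, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

out = layer_norm(np.array([[1.0, 2.0, 3.0]]))
```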
@@ -43,10 +43,8 @@ class Snapshot(extension.Extension):
    parameters and optimizer states. If the updater inside the trainer
    subclasses StandardUpdater, everything is good to go.

    Args:
        checkpoint_dir (Union[str, Path]): The directory to save checkpoints into.
    """
    trigger = (1, 'epoch')
...