Unverified commit 7eef3bfd, authored by jm_12138, committed by GitHub

Add Diffsinger Module (#2120)

* add diffsinger

* update README

* update README
Parent c9211e2a
# diffsinger
|Module Name|diffsinger|
| :--- | :---: |
|Category|Audio - Singing Voice Synthesis|
|Network|DiffSinger|
|Dataset|-|
|Fine-tuning Supported|No|
|Module Size|256.1MB|
|Metrics|-|
|Latest Update Date|2022-10-25|
## I. Basic Information
- ### Application Effect Display
- Network architecture:
<p align="center">
<img src="https://neuralsvb.github.io/resources/model_all7.png"/>
</p>
- Sample results:
|Text|Audio|
|:-:|:-:|
|让 梦 恒 久 比 天 长|<audio controls="controls"><source src="https://diffsinger.github.io/audio/singing_demo/diffsinger-base/000000007.wav" autoplay=""></audio>|
|我 终 于 翱 翔|<audio controls="controls"><source src="https://diffsinger.github.io/audio/singing_demo/diffsinger-base/000000005.wav" autoplay=""></audio>|
- ### Module Introduction
- DiffSinger is an acoustic model for singing voice synthesis (SVS) built on a diffusion probabilistic model. It is a parameterized Markov chain that, conditioned on the music score, iteratively converts noise into a mel-spectrogram. By implicitly optimizing a variational bound, DiffSinger can be trained stably and produces realistic outputs. A sketch of the standard diffusion formulation it builds on is given below.
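- As a rough sketch of that standard formulation (illustrative only, not notation taken from this repository; M_t is the noised mel-spectrogram at step t, beta_t the noise schedule, and x the music-score condition):
```latex
% forward (diffusion) process: gradually add Gaussian noise to the mel-spectrogram
q(M_t \mid M_{t-1}) = \mathcal{N}\big(M_t;\ \sqrt{1-\beta_t}\, M_{t-1},\ \beta_t \mathbf{I}\big)

% learned reverse (denoising) process, conditioned on the music score x
p_\theta(M_{t-1} \mid M_t, x) = \mathcal{N}\big(M_{t-1};\ \mu_\theta(M_t, t, x),\ \sigma_t^2 \mathbf{I}\big)
```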
## II. Installation
- ### 1. Dependencies
- onnxruntime >= 1.12.0
```shell
# CPU
$ pip install onnxruntime
# GPU
$ pip install onnxruntime-gpu
```
- paddlehub >= 2.0.0
- ### 2. Installation
- ```shell
$ hub install diffsinger
```
- If you have trouble installing, please refer to: [Windows Quickstart](../../../../docs/docs_ch/get_start/windows_quickstart.md)
| [Linux Quickstart](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS Quickstart](../../../../docs/docs_ch/get_start/mac_quickstart.md)
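- Optionally, you can check which execution providers your onnxruntime build exposes (the module also prints the providers it ends up using when it loads); a minimal check:
```python
import onnxruntime as rt

# With onnxruntime-gpu this list typically includes 'CUDAExecutionProvider';
# the CPU-only wheel reports ['CPUExecutionProvider'].
print(rt.get_available_providers())
```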
## III. Module API Prediction
- ### 1. Command line Prediction
```shell
$ hub run diffsinger \
--input_type "word" \
--text "小酒窝长睫毛AP是你最美的记号" \
--notes "C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4" \
--notes_duration "0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340" \
--sample_num 1 \
--save_dir "outputs"
$ hub run diffsinger \
--input_type "phoneme" \
--text "小酒窝长睫毛AP是你最美的记号" \
--ph_seq "x iao j iu w o ch ang ang j ie ie m ao AP sh i n i z ui m ei d e j i h ao" \
--note_seq "C#4/Db4 C#4/Db4 F#4/Gb4 F#4/Gb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 F#4/Gb4 F#4/Gb4 F#4/Gb4 C#4/Db4 C#4/Db4 C#4/Db4 rest C#4/Db4 C#4/Db4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 F4 F4 C#4/Db4 C#4/Db4" \
--note_dur_seq "0.407140 0.407140 0.376190 0.376190 0.242180 0.242180 0.509550 0.509550 0.183420 0.315400 0.315400 0.235020 0.361660 0.361660 0.223070 0.377270 0.377270 0.340550 0.340550 0.299620 0.299620 0.344510 0.344510 0.283770 0.283770 0.323390 0.323390 0.360340 0.360340" \
--is_slur_seq "0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0" \
--sample_num 1 \
--save_dir "outputs"
```
- ### 2. Prediction Code Example
```python
import paddlehub as hub
module = hub.Module(name="diffsinger")
results = module.singing_voice_synthesis(
inputs={
'text': '小酒窝长睫毛AP是你最美的记号',
'notes': 'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4',
'notes_duration': '0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340',
'input_type': 'word'
},
sample_num=1,
save_audio=True,
save_dir='outputs'
)
```
- ### 3. API
```python
def singing_voice_synthesis(
inputs: Dict[str, str],
sample_num: int = 1,
save_audio: bool = True,
save_dir: str = 'outputs'
) -> Dict[str, Union[List[List[int]], int]]:
```
- Singing voice synthesis API.
- **Parameters**
* inputs (Dict\[str, str\]): input data; the following two formats are supported:
```python
{
'text': '小酒窝长睫毛AP是你最美的记号',
'notes': 'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4',
'notes_duration': '0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340',
'input_type': 'word'
}
{
'text': '小酒窝长睫毛AP是你最美的记号',
'ph_seq': 'x iao j iu w o ch ang ang j ie ie m ao AP sh i n i z ui m ei d e j i h ao',
'note_seq': 'C#4/Db4 C#4/Db4 F#4/Gb4 F#4/Gb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 F#4/Gb4 F#4/Gb4 F#4/Gb4 C#4/Db4 C#4/Db4 C#4/Db4 rest C#4/Db4 C#4/Db4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 F4 F4 C#4/Db4 C#4/Db4',
'note_dur_seq': '0.407140 0.407140 0.376190 0.376190 0.242180 0.242180 0.509550 0.509550 0.183420 0.315400 0.315400 0.235020 0.361660 0.361660 0.223070 0.377270 0.377270 0.340550 0.340550 0.299620 0.299620 0.344510 0.344510 0.283770 0.283770 0.323390 0.323390 0.360340 0.360340',
'is_slur_seq': '0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0',
'input_type': 'phoneme'
}
```
* sample_num (int): number of audio samples to generate;
* save_audio (bool): whether to save the audio files;
* save\_dir (str): directory in which the results are saved.
- **Return**
* res (Dict\[str, Union\[List\[List\[int\]\], int\]\]): singing voice synthesis result, a dict containing:
* wavs: the synthesized audio data
* sample_rate: the audio sample rate
## IV. Server Deployment
- PaddleHub Serving can deploy an online singing voice synthesis service.
- ### Step 1: Start PaddleHub Serving
- Run the startup command:
```shell
$ hub serving start -m diffsinger
```
- This sets up the singing voice synthesis serving API; the default port is 8866.
- ### Step 2: Send a prediction request
- With the server configured, the following few lines of code send a prediction request and retrieve the result:
```python
import requests
import json
data = {
'inputs': {
'text': '小酒窝长睫毛AP是你最美的记号',
'notes': 'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4',
'notes_duration': '0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340',
'input_type': 'word'
},
'save_audio': False,
}
headers = {"Content-type": "application/json"}
url = "http://127.0.0.1:8866/predict/diffsinger"
r = requests.post(url=url, headers=headers, data=json.dumps(data))
results = r.json()['results']
```
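- Since `save_audio` is set to `False` in the request above, the response only carries the raw samples and the sample rate; a minimal sketch for writing the first returned sample to disk (using `soundfile`, which is listed in the module requirements — adjust as needed):
```python
import numpy as np
import soundfile as sf

wav = np.asarray(results['wavs'][0], dtype=np.float32)
sf.write('serving_output.wav', wav, results['sample_rate'])
```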
## V. References
* Paper: [DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism](https://arxiv.org/abs/2105.02446)
* Official implementation: [MoonInTheRiver/DiffSinger](https://github.com/MoonInTheRiver/DiffSinger)
## VI. Release Note
* 1.0.0
First release
```shell
$ hub install diffsinger==1.0.0
```
# task
binary_data_dir: ''
work_dir: '' # experiment directory.
infer: false # infer
seed: 1234
debug: false
save_codes:
- configs
- modules
- tasks
- utils
- usr
#############
# dataset
#############
ds_workers: 1
test_num: 100
valid_num: 100
endless_ds: false
sort_by_len: true
#########
# train and eval
#########
load_ckpt: ''
save_ckpt: true
save_best: false
num_ckpt_keep: 3
clip_grad_norm: 0
accumulate_grad_batches: 1
log_interval: 100
num_sanity_val_steps: 5 # steps of validation at the beginning
check_val_every_n_epoch: 10
val_check_interval: 2000
max_epochs: 1000
max_updates: 160000
max_tokens: 31250
max_sentences: 100000
max_eval_tokens: -1
max_eval_sentences: -1
test_input_dir: ''
base_config:
- configs/tts/base.yaml
- configs/tts/base_zh.yaml
datasets: []
test_prefixes: []
test_num: 0
valid_num: 0
pre_align_cls: data_gen.singing.pre_align.SingingPreAlign
binarizer_cls: data_gen.singing.binarize.SingingBinarizer
pre_align_args:
use_tone: false # for ZH
forced_align: mfa
use_sox: true
hop_size: 128 # Hop size.
fft_size: 512 # FFT size.
win_size: 512 # FFT size.
max_frames: 8000
fmin: 50 # Minimum freq in mel basis calculation.
fmax: 11025 # Maximum frequency in mel basis calculation.
pitch_type: frame
hidden_size: 256
mel_loss: "ssim:0.5|l1:0.5"
lambda_f0: 0.0
lambda_uv: 0.0
lambda_energy: 0.0
lambda_ph_dur: 0.0
lambda_sent_dur: 0.0
lambda_word_dur: 0.0
predictor_grad: 0.0
use_spk_embed: true
use_spk_id: false
max_tokens: 20000
max_updates: 400000
num_spk: 100
save_f0: true
use_gt_dur: true
use_gt_f0: true
base_config:
- configs/tts/fs2.yaml
- configs/singing/base.yaml
# task
base_config: configs/config_base.yaml
task_cls: ''
#############
# dataset
#############
raw_data_dir: ''
processed_data_dir: ''
binary_data_dir: ''
dict_dir: ''
pre_align_cls: ''
binarizer_cls: data_gen.tts.base_binarizer.BaseBinarizer
pre_align_args:
use_tone: true # for ZH
forced_align: mfa
use_sox: false
txt_processor: en
allow_no_txt: false
denoise: false
binarization_args:
shuffle: false
with_txt: true
with_wav: false
with_align: true
with_spk_embed: true
with_f0: true
with_f0cwt: true
loud_norm: false
endless_ds: true
reset_phone_dict: true
test_num: 100
valid_num: 100
max_frames: 1550
max_input_tokens: 1550
audio_num_mel_bins: 80
audio_sample_rate: 22050
hop_size: 256 # For 22050Hz, 275 ~= 12.5 ms (0.0125 * sample_rate)
win_size: 1024 # For 22050Hz, 1100 ~= 50 ms (If None, win_size: fft_size) (0.05 * sample_rate)
fmin: 80 # Set this to 55 if your speaker is male; if female, 95 should help remove noise. (Test depending on dataset. Pitch info: male ~ [65, 260], female ~ [100, 525])
fmax: 7600 # To be increased/reduced depending on data.
fft_size: 1024 # Extra window size is filled with 0 paddings to match this parameter
min_level_db: -100
num_spk: 1
mel_vmin: -6
mel_vmax: 1.5
ds_workers: 4
#########
# model
#########
dropout: 0.1
enc_layers: 4
dec_layers: 4
hidden_size: 384
num_heads: 2
prenet_dropout: 0.5
prenet_hidden_size: 256
stop_token_weight: 5.0
enc_ffn_kernel_size: 9
dec_ffn_kernel_size: 9
ffn_act: gelu
ffn_padding: 'SAME'
###########
# optimization
###########
lr: 2.0
warmup_updates: 8000
optimizer_adam_beta1: 0.9
optimizer_adam_beta2: 0.98
weight_decay: 0
clip_grad_norm: 1
###########
# train and eval
###########
max_tokens: 30000
max_sentences: 100000
max_eval_sentences: 1
max_eval_tokens: 60000
train_set_name: 'train'
valid_set_name: 'valid'
test_set_name: 'test'
vocoder: pwg
vocoder_ckpt: ''
profile_infer: false
out_wav_norm: false
save_gt: false
save_f0: false
gen_dir_name: ''
use_denoise: false
pre_align_args:
txt_processor: zh_g2pM
binarizer_cls: data_gen.tts.binarizer_zh.ZhBinarizer
base_config: configs/tts/base.yaml
task_cls: tasks.tts.fs2.FastSpeech2Task
# model
hidden_size: 256
dropout: 0.1
encoder_type: fft # fft|tacotron|tacotron2|conformer
encoder_K: 8 # for tacotron encoder
decoder_type: fft # fft|rnn|conv|conformer
use_pos_embed: true
# duration
predictor_hidden: -1
predictor_kernel: 5
predictor_layers: 2
dur_predictor_kernel: 3
dur_predictor_layers: 2
predictor_dropout: 0.5
# pitch and energy
use_pitch_embed: true
pitch_type: ph # frame|ph|cwt
use_uv: true
cwt_hidden_size: 128
cwt_layers: 2
cwt_loss: l1
cwt_add_f0_loss: false
cwt_std_scale: 0.8
pitch_ar: false
#pitch_embed_type: 0q
pitch_loss: 'l1' # l1|l2|ssim
pitch_norm: log
use_energy_embed: false
# reference encoder and speaker embedding
use_spk_id: false
use_split_spk_id: false
use_spk_embed: false
use_var_enc: false
lambda_commit: 0.25
ref_norm_layer: bn
pitch_enc_hidden_stride_kernel:
- 0,2,5 # conv_hidden_size, conv_stride, conv_kernel_size. conv_hidden_size=0: use hidden_size
- 0,2,5
- 0,2,5
dur_enc_hidden_stride_kernel:
- 0,2,3 # conv_hidden_size, conv_stride, conv_kernel_size. conv_hidden_size=0: use hidden_size
- 0,2,3
- 0,1,3
# mel
mel_loss: l1:0.5|ssim:0.5 # l1|l2|gdl|ssim or l1:0.5|ssim:0.5
# loss lambda
lambda_f0: 1.0
lambda_uv: 1.0
lambda_energy: 0.1
lambda_ph_dur: 1.0
lambda_sent_dur: 1.0
lambda_word_dur: 1.0
predictor_grad: 0.1
# train and eval
pretrain_fs_ckpt: ''
warmup_updates: 2000
max_tokens: 32000
max_sentences: 100000
max_eval_sentences: 1
max_updates: 120000
num_valid_plots: 5
num_test_samples: 0
test_ids: []
use_gt_dur: false
use_gt_f0: false
# exp
dur_loss: mse # huber|mol
norm_type: gn
base_config: configs/tts/pwg.yaml
task_cls: tasks.vocoder.hifigan.HifiGanTask
resblock: "1"
adam_b1: 0.8
adam_b2: 0.99
upsample_rates: [ 8,8,2,2 ]
upsample_kernel_sizes: [ 16,16,4,4 ]
upsample_initial_channel: 128
resblock_kernel_sizes: [ 3,7,11 ]
resblock_dilation_sizes: [ [ 1,3,5 ], [ 1,3,5 ], [ 1,3,5 ] ]
lambda_mel: 45.0
max_samples: 8192
max_sentences: 16
generator_params:
lr: 0.0002 # Generator's learning rate.
aux_context_window: 0 # Context window size for auxiliary feature.
discriminator_optimizer_params:
lr: 0.0002 # Discriminator's learning rate.
raw_data_dir: 'data/raw/LJSpeech-1.1'
processed_data_dir: 'data/processed/ljspeech'
binary_data_dir: 'data/binary/ljspeech_wav'
raw_data_dir: 'data/raw/LJSpeech-1.1'
processed_data_dir: 'data/processed/ljspeech'
binary_data_dir: 'data/binary/ljspeech'
pre_align_cls: data_gen.tts.lj.pre_align.LJPreAlign
pitch_type: cwt
mel_loss: l1
num_test_samples: 20
test_ids: [ 68, 70, 74, 87, 110, 172, 190, 215, 231, 294,
316, 324, 402, 422, 485, 500, 505, 508, 509, 519 ]
use_energy_embed: false
test_num: 523
valid_num: 348
base_config:
- configs/tts/fs2.yaml
- configs/tts/lj/base_text2mel.yaml
base_config:
- configs/tts/hifigan.yaml
- configs/tts/lj/base_mel2wav.yaml
base_config:
- configs/tts/pwg.yaml
- configs/tts/lj/base_mel2wav.yaml
base_config: configs/tts/base.yaml
task_cls: tasks.vocoder.pwg.PwgTask
binarization_args:
with_wav: true
with_spk_embed: false
with_align: false
test_input_dir: ''
###########
# train and eval
###########
max_samples: 25600
max_sentences: 5
max_eval_sentences: 1
max_updates: 1000000
val_check_interval: 2000
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
sampling_rate: 22050 # Sampling rate.
fft_size: 1024 # FFT size.
hop_size: 256 # Hop size.
win_length: null # Window length.
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
num_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation.
fmax: 7600 # Maximum frequency in mel basis calculation.
format: "hdf5" # Feature file format. "npy" or "hdf5" is supported.
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_size: 3 # Kernel size of dilated convolution.
layers: 30 # Number of residual block layers.
stacks: 3 # Number of stacks i.e., dilation cycles.
residual_channels: 64 # Number of channels in residual conv.
gate_channels: 128 # Number of channels in gated conv.
skip_channels: 64 # Number of channels in skip conv.
aux_channels: 80 # Number of channels for auxiliary feature conv.
# Must be the same as num_mels.
aux_context_window: 2 # Context window size for auxiliary feature.
# If set to 2, previous 2 and future 2 frames will be considered.
dropout: 0.0 # Dropout rate. 0.0 means no dropout applied.
use_weight_norm: true # Whether to use weight norm.
# If set to true, it will be applied to all of the conv layers.
upsample_net: "ConvInUpsampleNetwork" # Upsampling network architecture.
upsample_params: # Upsampling network parameters.
upsample_scales: [4, 4, 4, 4] # Upsampling scales. Product of these must be the same as hop size.
use_pitch_embed: false
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_size: 3 # Kernel size of conv layers.
layers: 10 # Number of conv layers.
conv_channels: 64 # Number of conv channels.
bias: true # Whether to use bias parameter in conv.
use_weight_norm: true # Whether to use weight norm.
# If set to true, it will be applied to all of the conv layers.
nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv.
nonlinear_activation_params: # Nonlinear function parameters
negative_slope: 0.2 # Alpha in LeakyReLU.
###########################################################
# STFT LOSS SETTING #
###########################################################
stft_loss_params:
fft_sizes: [1024, 2048, 512] # List of FFT size for STFT-based loss.
hop_sizes: [120, 240, 50] # List of hop size for STFT-based loss
win_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
window: "hann_window" # Window function for STFT-based loss
use_mel_loss: false
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_adv: 4.0 # Loss balancing coefficient.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
lr: 0.0001 # Generator's learning rate.
eps: 1.0e-6 # Generator's epsilon.
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
step_size: 200000 # Generator's scheduler step size.
gamma: 0.5 # Generator's scheduler gamma.
# At each step size, lr will be multiplied by this parameter.
generator_grad_norm: 10 # Generator's gradient norm.
discriminator_optimizer_params:
lr: 0.00005 # Discriminator's learning rate.
eps: 1.0e-6 # Discriminator's epsilon.
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
step_size: 200000 # Discriminator's scheduler step size.
gamma: 0.5 # Discriminator's scheduler gamma.
# At each step size, lr will be multiplied by this parameter.
discriminator_grad_norm: 1 # Discriminator's gradient norm.
disc_start_steps: 40000 # Number of steps to start to train discriminator.
import os
from collections import deque
import librosa
import numpy as np
import onnxruntime as rt
from pypinyin import lazy_pinyin
from tqdm import tqdm
from .inference.svs.opencpop.map import cpop_pinyin2ph_func
from .utils.hparams import hparams
from .utils.text_encoder import TokenTextEncoder
class Infer:
def __init__(self, root='.', providers=None):
model_dir = os.path.join(root, 'model')
if providers is None:
providers = rt.get_available_providers()
print('Using these as onnxruntime providers:', providers)
phone_list = [
"AP", "SP", "a", "ai", "an", "ang", "ao", "b", "c", "ch", "d", "e", "ei", "en", "eng", "er", "f", "g", "h",
"i", "ia", "ian", "iang", "iao", "ie", "in", "ing", "iong", "iu", "j", "k", "l", "m", "n", "o", "ong", "ou",
"p", "q", "r", "s", "sh", "t", "u", "ua", "uai", "uan", "uang", "ui", "un", "uo", "v", "van", "ve", "vn",
"w", "x", "y", "z", "zh"
]
self.ph_encoder = TokenTextEncoder(None, vocab_list=phone_list, replace_oov=',')
self.pinyin2phs = cpop_pinyin2ph_func(path=os.path.join(root, 'inference/svs/opencpop/cpop_pinyin2ph.txt'))
self.spk_map = {'opencpop': 0}
options = rt.SessionOptions()
for provider in providers:
if 'dml' in provider.lower():
options.enable_mem_pattern = False
options.execution_mode = rt.ExecutionMode.ORT_SEQUENTIAL
fs2_path = os.path.join(model_dir, 'fs2.onnx')
q_sample_path = os.path.join(model_dir, 'q_sample.onnx')
p_sample_path = os.path.join(model_dir, 'p_sample.onnx')
pe_path = os.path.join(model_dir, 'pe.onnx')
vocoder_path = os.path.join(model_dir, 'vocoder.onnx')
self.fs2 = rt.InferenceSession(fs2_path, options, providers=providers)
self.q_sample = rt.InferenceSession(q_sample_path, options, providers=providers)
self.p_sample = rt.InferenceSession(p_sample_path, options, providers=providers)
self.pe = rt.InferenceSession(pe_path, options, providers=providers)
self.vocoder = rt.InferenceSession(vocoder_path, options, providers=providers)
self.K_step = hparams['K_step']
self.spec_min = np.asarray(hparams['spec_min'], np.float32)[None, None, :hparams['keep_bins']]
self.spec_max = np.asarray(hparams['spec_max'], np.float32)[None, None, :hparams['keep_bins']]
self.mel_bins = hparams['audio_num_mel_bins']
self.use_pe = hparams.get('pe_enable') is not None and hparams['pe_enable']
def model(self, txt_tokens, **kwargs):
fs_input_names = [node.name for node in self.fs2.get_inputs()]
inputs = {'txt_tokens': txt_tokens}
inputs.update({k: v for k, v in kwargs.items() if isinstance(v, np.ndarray) and k in fs_input_names})
io_binding = self.fs2.io_binding()
for k, v in inputs.items():
io_binding.bind_cpu_input(k, v)
io_binding.bind_output('decoder_inp')
io_binding.bind_output('mel_out')
if not self.use_pe:
io_binding.bind_output('f0_denorm')
self.fs2.run_with_iobinding(io_binding)
decoder_inp, mel_out = io_binding.get_outputs()[:2]
self.device_name = mel_out.device_name()
ret = {'decoder_inp': decoder_inp, 'mel_out': mel_out}
if not self.use_pe:
ret.update({'f0_denorm': io_binding.get_outputs()[-1]})
cond = decoder_inp.numpy().transpose([0, 2, 1])
ret['fs2_mel'] = ret['mel_out']
# Shallow diffusion: normalize the FastSpeech2 mel prediction, diffuse it forward to step K_step
# with q_sample, then run the reverse (denoising) process from there instead of from pure noise.
fs2_mels = mel_out.numpy()
t = self.K_step
fs2_mels = self.norm_spec(fs2_mels)
fs2_mels = fs2_mels.transpose([0, 2, 1])[:, None, :, :]
io_binding = self.q_sample.io_binding()
io_binding.bind_cpu_input('x_start', fs2_mels)
io_binding.bind_cpu_input('noise', np.random.randn(*fs2_mels.shape).astype(fs2_mels.dtype))
io_binding.bind_cpu_input('t', np.asarray([t - 1], dtype=np.int64))
io_binding.bind_output('x_next')
self.q_sample.run_with_iobinding(io_binding)
x = io_binding.get_outputs()[0].numpy()
if hparams.get('gaussian_start') is not None and hparams['gaussian_start']:
print('===> gaussian start.')
shape = (cond.shape[0], 1, self.mel_bins, cond.shape[2])
x = np.random.randn(*shape).astype(fs2_mels.dtype)
cond = rt.OrtValue.ortvalue_from_numpy(cond, mel_out.device_name(), 0)
x = rt.OrtValue.ortvalue_from_numpy(x, mel_out.device_name(), 0)
if hparams.get('pndm_speedup'):
self.noise_list = deque(maxlen=4)
iteration_interval = hparams['pndm_speedup']
interval = rt.OrtValue.ortvalue_from_numpy(np.asarray([iteration_interval], np.int64),
mel_out.device_name(), 0)
for i in tqdm(reversed(range(0, t, iteration_interval)),
desc='sample time step',
total=t // iteration_interval):
io_binding = self.p_sample_plms.io_binding()
io_binding.bind_ortvalue_input('x', x)
io_binding.bind_cpu_input('noise', np.random.randn(*x.shape).astype(x.dtype))
io_binding.bind_ortvalue_input('cond', cond)
io_binding.bind_cpu_input('t', np.asarray([i], dtype=np.int64)) # torch i-1 but here i
io_binding.bind_ortvalue_input('interval', interval)
io_binding.bind_output('x_next')
self.p_sample_plms.run_with_iobinding(io_binding)
x = io_binding.get_outputs()[0]
else:
for i in tqdm(reversed(range(0, t)), desc='sample time step', total=t):
io_binding = self.p_sample.io_binding()
io_binding.bind_ortvalue_input('x', x)
io_binding.bind_cpu_input('noise', np.random.randn(*x.shape()).astype(np.float32))
io_binding.bind_ortvalue_input('cond', cond)
io_binding.bind_cpu_input('t', np.asarray([i], dtype=np.int64)) # torch i-1 but here i
io_binding.bind_output('x_next')
self.p_sample.run_with_iobinding(io_binding)
x = io_binding.get_outputs()[0]
x = x.numpy()[:, 0].transpose([0, 2, 1])
mel2ph = kwargs.get('mel2ph', None)
if mel2ph is not None: # for singing
ret['mel_out'] = self.denorm_spec(x) * ((mel2ph > 0).astype(np.float32)[:, :, None])
else:
ret['mel_out'] = self.denorm_spec(x)
return ret
def norm_spec(self, x):
# Normalize a mel-spectrogram to [-1, 1] using the dataset statistics spec_min / spec_max.
return (x - self.spec_min) / (self.spec_max - self.spec_min) * 2 - 1
def denorm_spec(self, x):
# Inverse of norm_spec: map values from [-1, 1] back to the original mel range.
return (x + 1) / 2 * (self.spec_max - self.spec_min) + self.spec_min
def forward_model(self, inp):
sample = self.input_to_batch(inp)
txt_tokens = sample['txt_tokens'] # [B, T_t]
spk_id = sample.get('spk_ids')
output = self.model(txt_tokens,
spk_id=spk_id,
ref_mels=None,
infer=True,
pitch_midi=sample['pitch_midi'],
midi_dur=sample['midi_dur'],
is_slur=sample['is_slur'])
mel_out = output['mel_out'] # [B, T,80]
mel_out = rt.OrtValue.ortvalue_from_numpy(mel_out, self.device_name, 0)
if hparams.get('pe_enable') is not None and hparams['pe_enable']:
# pe predict from Pred mel
io_binding = self.pe.io_binding()
io_binding.bind_ortvalue_input('mel_input', mel_out)
io_binding.bind_output('f0_denorm_pred')
self.pe.run_with_iobinding(io_binding)
f0_pred = io_binding.get_outputs()[0]
else:
f0_pred = output['f0_denorm']
wav_out = self.run_vocoder(mel_out, f0=f0_pred.numpy())
return wav_out[0]
def run_vocoder(self, c, **kwargs):
# c = c.transpose([0, 2, 1]) # [B, 80, T]
f0 = kwargs.get('f0') # [B, T]
if f0 is not None and hparams.get('use_nsf'):
y = self.vocoder.run(['wav_out'], {
'mel_out': c,
'f0': f0,
})[0] # .reshape([-1])
else:
y = self.vocoder.run(['wav_out'], {
'mel_out': c,
})[0] # .reshape([-1])
# [T]
return y # [None]
def preprocess_word_level_input(self, inp):
# Pypinyin can't resolve polyphonic characters, so a few common cases are patched by hand below.
text_raw = inp['text'].replace('最长', '最常').replace('长睫毛', '常睫毛') \
.replace('那么长', '那么常').replace('多长', '多常') \
.replace('很长', '很常') # We hope someone could provide a better g2p module for us by opening pull requests.
# lyric
pinyins = lazy_pinyin(text_raw, strict=False)
ph_per_word_lst = [self.pinyin2phs[pinyin.strip()] for pinyin in pinyins if pinyin.strip() in self.pinyin2phs]
# Note
note_per_word_lst = [x.strip() for x in inp['notes'].split('|') if x.strip() != '']
mididur_per_word_lst = [x.strip() for x in inp['notes_duration'].split('|') if x.strip() != '']
if len(note_per_word_lst) == len(ph_per_word_lst) == len(mididur_per_word_lst):
print('Pass word-notes check.')
else:
print('The number of words doesn\'t match the number of note windows. ',
'You should split the note(s) for each word with the | mark.')
print(ph_per_word_lst, note_per_word_lst, mididur_per_word_lst)
print(len(ph_per_word_lst), len(note_per_word_lst), len(mididur_per_word_lst))
return None
note_lst = []
ph_lst = []
midi_dur_lst = []
is_slur = []
for idx, ph_per_word in enumerate(ph_per_word_lst):
# for phs in one word:
# single ph like ['ai'] or multiple phs like ['n', 'i']
ph_in_this_word = ph_per_word.split()
# for notes in one word:
# single note like ['D4'] or multiple notes like ['D4', 'E4'] which means a 'slur' here.
note_in_this_word = note_per_word_lst[idx].split()
midi_dur_in_this_word = mididur_per_word_lst[idx].split()
# process for the model input
# Step 1.
# Deal with note of 'not slur' case or the first note of 'slur' case
# j ie
# F#4/Gb4 F#4/Gb4
# 0 0
for ph in ph_in_this_word:
ph_lst.append(ph)
note_lst.append(note_in_this_word[0])
midi_dur_lst.append(midi_dur_in_this_word[0])
is_slur.append(0)
# step 2.
# Deal with the 2nd, 3rd... notes of 'slur' case
# j ie ie
# F#4/Gb4 F#4/Gb4 C#4/Db4
# 0 0 1
# When is_slur is True, repeat the yunmu (the syllable final) to match the 2nd, 3rd... notes.
if len(note_in_this_word) > 1:
for idx in range(1, len(note_in_this_word)):
ph_lst.append(ph_in_this_word[-1])
note_lst.append(note_in_this_word[idx])
midi_dur_lst.append(midi_dur_in_this_word[idx])
is_slur.append(1)
ph_seq = ' '.join(ph_lst)
if len(ph_lst) == len(note_lst) == len(midi_dur_lst):
print(len(ph_lst), len(note_lst), len(midi_dur_lst))
print('Pass word-notes check.')
else:
print('The number of words doesn\'t match the number of note windows. ',
'You should split the note(s) for each word with the | mark.')
return None
return ph_seq, note_lst, midi_dur_lst, is_slur
def preprocess_phoneme_level_input(self, inp):
ph_seq = inp['ph_seq']
note_lst = inp['note_seq'].split()
midi_dur_lst = inp['note_dur_seq'].split()
is_slur = [float(x) for x in inp['is_slur_seq'].split()]
print(len(note_lst), len(ph_seq.split()), len(midi_dur_lst))
if len(note_lst) == len(ph_seq.split()) == len(midi_dur_lst):
print('Pass word-notes check.')
else:
print('The number of words doesn\'t match the number of note windows. ',
'You should split the note(s) for each word with the | mark.')
return None
return ph_seq, note_lst, midi_dur_lst, is_slur
def preprocess_input(self, inp, input_type='word'):
"""
:param inp: {'text': str, 'item_name': (str, optional), 'spk_name': (str, optional)}
:return: a dict of model inputs (ph, ph_token, pitch_midi, midi_dur, is_slur, ...), or None if preprocessing fails.
"""
item_name = inp.get('item_name', '<ITEM_NAME>')
spk_name = inp.get('spk_name', 'opencpop')
# single spk
spk_id = self.spk_map[spk_name]
# get ph seq, note lst, midi dur lst, is slur lst.
if input_type == 'word':
ret = self.preprocess_word_level_input(inp)
# like transcriptions.txt in Opencpop dataset.
elif input_type == 'phoneme':
ret = self.preprocess_phoneme_level_input(inp)
else:
print('Invalid input type.')
return None
if ret:
ph_seq, note_lst, midi_dur_lst, is_slur = ret
else:
print('==========> Word-level or phoneme-level input preprocessing failed.')
return None
# convert note lst to midi id; convert note dur lst to midi duration
try:
midis = [librosa.note_to_midi(x.split("/")[0]) if x != 'rest' else 0 for x in note_lst]
midi_dur_lst = [float(x) for x in midi_dur_lst]
except Exception as e:
print(e)
print('Invalid note or note duration input.')
return None
ph_token = self.ph_encoder.encode(ph_seq)
item = {
'item_name': item_name,
'text': inp['text'],
'ph': ph_seq,
'spk_id': spk_id,
'ph_token': ph_token,
'pitch_midi': np.asarray(midis),
'midi_dur': np.asarray(midi_dur_lst),
'is_slur': np.asarray(is_slur),
}
item['ph_len'] = len(item['ph_token'])
return item
def input_to_batch(self, item):
item_names = [item['item_name']]
text = [item['text']]
ph = [item['ph']]
txt_tokens = np.int64(item['ph_token'])[None, :]
txt_lengths = np.int64([txt_tokens.shape[1]])
spk_ids = np.asarray(item['spk_id'], np.int64)[None]
pitch_midi = np.int64(item['pitch_midi'])[None, :hparams['max_frames']]
midi_dur = np.float32(item['midi_dur'])[None, :hparams['max_frames']]
is_slur = np.int64(item['is_slur'])[None, :hparams['max_frames']]
batch = {
'item_name': item_names,
'text': text,
'ph': ph,
'txt_tokens': txt_tokens,
'txt_lengths': txt_lengths,
'spk_ids': spk_ids,
'pitch_midi': pitch_midi,
'midi_dur': midi_dur,
'is_slur': is_slur
}
return batch
def infer_once(self, inp):
inp = self.preprocess_input(inp, input_type=inp['input_type'] if inp.get('input_type') else 'word')
output = self.forward_model(inp)
return output
| a | a |
| ai | ai |
| an | an |
| ang | ang |
| ao | ao |
| ba | b a |
| bai | b ai |
| ban | b an |
| bang | b ang |
| bao | b ao |
| bei | b ei |
| ben | b en |
| beng | b eng |
| bi | b i |
| bian | b ian |
| biao | b iao |
| bie | b ie |
| bin | b in |
| bing | b ing |
| bo | b o |
| bu | b u |
| ca | c a |
| cai | c ai |
| can | c an |
| cang | c ang |
| cao | c ao |
| ce | c e |
| cei | c ei |
| cen | c en |
| ceng | c eng |
| cha | ch a |
| chai | ch ai |
| chan | ch an |
| chang | ch ang |
| chao | ch ao |
| che | ch e |
| chen | ch en |
| cheng | ch eng |
| chi | ch i |
| chong | ch ong |
| chou | ch ou |
| chu | ch u |
| chua | ch ua |
| chuai | ch uai |
| chuan | ch uan |
| chuang | ch uang |
| chui | ch ui |
| chun | ch un |
| chuo | ch uo |
| ci | c i |
| cong | c ong |
| cou | c ou |
| cu | c u |
| cuan | c uan |
| cui | c ui |
| cun | c un |
| cuo | c uo |
| da | d a |
| dai | d ai |
| dan | d an |
| dang | d ang |
| dao | d ao |
| de | d e |
| dei | d ei |
| den | d en |
| deng | d eng |
| di | d i |
| dia | d ia |
| dian | d ian |
| diao | d iao |
| die | d ie |
| ding | d ing |
| diu | d iu |
| dong | d ong |
| dou | d ou |
| du | d u |
| duan | d uan |
| dui | d ui |
| dun | d un |
| duo | d uo |
| e | e |
| ei | ei |
| en | en |
| eng | eng |
| er | er |
| fa | f a |
| fan | f an |
| fang | f ang |
| fei | f ei |
| fen | f en |
| feng | f eng |
| fo | f o |
| fou | f ou |
| fu | f u |
| ga | g a |
| gai | g ai |
| gan | g an |
| gang | g ang |
| gao | g ao |
| ge | g e |
| gei | g ei |
| gen | g en |
| geng | g eng |
| gong | g ong |
| gou | g ou |
| gu | g u |
| gua | g ua |
| guai | g uai |
| guan | g uan |
| guang | g uang |
| gui | g ui |
| gun | g un |
| guo | g uo |
| ha | h a |
| hai | h ai |
| han | h an |
| hang | h ang |
| hao | h ao |
| he | h e |
| hei | h ei |
| hen | h en |
| heng | h eng |
| hm | h m |
| hng | h ng |
| hong | h ong |
| hou | h ou |
| hu | h u |
| hua | h ua |
| huai | h uai |
| huan | h uan |
| huang | h uang |
| hui | h ui |
| hun | h un |
| huo | h uo |
| ji | j i |
| jia | j ia |
| jian | j ian |
| jiang | j iang |
| jiao | j iao |
| jie | j ie |
| jin | j in |
| jing | j ing |
| jiong | j iong |
| jiu | j iu |
| ju | j v |
| juan | j van |
| jue | j ve |
| jun | j vn |
| ka | k a |
| kai | k ai |
| kan | k an |
| kang | k ang |
| kao | k ao |
| ke | k e |
| kei | k ei |
| ken | k en |
| keng | k eng |
| kong | k ong |
| kou | k ou |
| ku | k u |
| kua | k ua |
| kuai | k uai |
| kuan | k uan |
| kuang | k uang |
| kui | k ui |
| kun | k un |
| kuo | k uo |
| la | l a |
| lai | l ai |
| lan | l an |
| lang | l ang |
| lao | l ao |
| le | l e |
| lei | l ei |
| leng | l eng |
| li | l i |
| lia | l ia |
| lian | l ian |
| liang | l iang |
| liao | l iao |
| lie | l ie |
| lin | l in |
| ling | l ing |
| liu | l iu |
| lo | l o |
| long | l ong |
| lou | l ou |
| lu | l u |
| luan | l uan |
| lun | l un |
| luo | l uo |
| lv | l v |
| lve | l ve |
| m | m |
| ma | m a |
| mai | m ai |
| man | m an |
| mang | m ang |
| mao | m ao |
| me | m e |
| mei | m ei |
| men | m en |
| meng | m eng |
| mi | m i |
| mian | m ian |
| miao | m iao |
| mie | m ie |
| min | m in |
| ming | m ing |
| miu | m iu |
| mo | m o |
| mou | m ou |
| mu | m u |
| n | n |
| na | n a |
| nai | n ai |
| nan | n an |
| nang | n ang |
| nao | n ao |
| ne | n e |
| nei | n ei |
| nen | n en |
| neng | n eng |
| ng | n g |
| ni | n i |
| nian | n ian |
| niang | n iang |
| niao | n iao |
| nie | n ie |
| nin | n in |
| ning | n ing |
| niu | n iu |
| nong | n ong |
| nou | n ou |
| nu | n u |
| nuan | n uan |
| nun | n un |
| nuo | n uo |
| nv | n v |
| nve | n ve |
| o | o |
| ou | ou |
| pa | p a |
| pai | p ai |
| pan | p an |
| pang | p ang |
| pao | p ao |
| pei | p ei |
| pen | p en |
| peng | p eng |
| pi | p i |
| pian | p ian |
| piao | p iao |
| pie | p ie |
| pin | p in |
| ping | p ing |
| po | p o |
| pou | p ou |
| pu | p u |
| qi | q i |
| qia | q ia |
| qian | q ian |
| qiang | q iang |
| qiao | q iao |
| qie | q ie |
| qin | q in |
| qing | q ing |
| qiong | q iong |
| qiu | q iu |
| qu | q v |
| quan | q van |
| que | q ve |
| qun | q vn |
| ran | r an |
| rang | r ang |
| rao | r ao |
| re | r e |
| ren | r en |
| reng | r eng |
| ri | r i |
| rong | r ong |
| rou | r ou |
| ru | r u |
| rua | r ua |
| ruan | r uan |
| rui | r ui |
| run | r un |
| ruo | r uo |
| sa | s a |
| sai | s ai |
| san | s an |
| sang | s ang |
| sao | s ao |
| se | s e |
| sen | s en |
| seng | s eng |
| sha | sh a |
| shai | sh ai |
| shan | sh an |
| shang | sh ang |
| shao | sh ao |
| she | sh e |
| shei | sh ei |
| shen | sh en |
| sheng | sh eng |
| shi | sh i |
| shou | sh ou |
| shu | sh u |
| shua | sh ua |
| shuai | sh uai |
| shuan | sh uan |
| shuang | sh uang |
| shui | sh ui |
| shun | sh un |
| shuo | sh uo |
| si | s i |
| song | s ong |
| sou | s ou |
| su | s u |
| suan | s uan |
| sui | s ui |
| sun | s un |
| suo | s uo |
| ta | t a |
| tai | t ai |
| tan | t an |
| tang | t ang |
| tao | t ao |
| te | t e |
| tei | t ei |
| teng | t eng |
| ti | t i |
| tian | t ian |
| tiao | t iao |
| tie | t ie |
| ting | t ing |
| tong | t ong |
| tou | t ou |
| tu | t u |
| tuan | t uan |
| tui | t ui |
| tun | t un |
| tuo | t uo |
| wa | w a |
| wai | w ai |
| wan | w an |
| wang | w ang |
| wei | w ei |
| wen | w en |
| weng | w eng |
| wo | w o |
| wu | w u |
| xi | x i |
| xia | x ia |
| xian | x ian |
| xiang | x iang |
| xiao | x iao |
| xie | x ie |
| xin | x in |
| xing | x ing |
| xiong | x iong |
| xiu | x iu |
| xu | x v |
| xuan | x van |
| xue | x ve |
| xun | x vn |
| ya | y a |
| yan | y an |
| yang | y ang |
| yao | y ao |
| ye | y e |
| yi | y i |
| yin | y in |
| ying | y ing |
| yo | y o |
| yong | y ong |
| you | y ou |
| yu | y v |
| yuan | y van |
| yue | y ve |
| yun | y vn |
| za | z a |
| zai | z ai |
| zan | z an |
| zang | z ang |
| zao | z ao |
| ze | z e |
| zei | z ei |
| zen | z en |
| zeng | z eng |
| zha | zh a |
| zhai | zh ai |
| zhan | zh an |
| zhang | zh ang |
| zhao | zh ao |
| zhe | zh e |
| zhei | zh ei |
| zhen | zh en |
| zheng | zh eng |
| zhi | zh i |
| zhong | zh ong |
| zhou | zh ou |
| zhu | zh u |
| zhua | zh ua |
| zhuai | zh uai |
| zhuan | zh uan |
| zhuang | zh uang |
| zhui | zh ui |
| zhun | zh un |
| zhuo | zh uo |
| zi | z i |
| zong | z ong |
| zou | z ou |
| zu | z u |
| zuan | z uan |
| zui | z ui |
| zun | z un |
| zuo | z uo |
def cpop_pinyin2ph_func(path):
# The README of the Opencpop dataset defines a "pinyin to phoneme mapping table"; parse it into a dict.
pinyin2phs = {'AP': 'AP', 'SP': 'SP'}
with open(path) as rf:
for line in rf.readlines():
elements = [x.strip() for x in line.split('|') if x.strip() != '']
pinyin2phs[elements[0]] = elements[1]
return pinyin2phs
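# A minimal usage sketch (illustrative only, assuming the mapping table shown above is saved as
# cpop_pinyin2ph.txt next to this file): the returned dict maps pinyin syllables to phoneme strings.
if __name__ == '__main__':
    mapping = cpop_pinyin2ph_func('cpop_pinyin2ph.txt')
    print(mapping['xiao'])  # 'x iao'
    print(mapping['AP'])    # 'AP' (breath marker, present by default)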
import argparse
import os
import time
from typing import Dict
from typing import List
from typing import Union
from .infer import Infer
from .utils.audio import save_wav
from .utils.hparams import hparams
from .utils.hparams import set_hparams
from paddlehub.module.module import moduleinfo
from paddlehub.module.module import runnable
from paddlehub.module.module import serving
@moduleinfo(name="diffsinger",
type="Audio/svs",
author="",
author_email="",
summary="DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism",
version="1.0.0")
class DiffSinger:
def __init__(self, providers: List[str] = None) -> None:
root = self.directory
config = os.path.join('model', 'config.yaml')
set_hparams(config, root=root)
self.infer = Infer(root, providers=providers)
@serving
def singing_voice_synthesis(self,
inputs: Dict[str, str],
sample_num: int = 1,
save_audio: bool = True,
save_dir: str = 'outputs') -> Dict[str, Union[List[List[int]], int]]:
'''
inputs = {
'text': '小酒窝长睫毛AP是你最美的记号',
'notes': 'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4',
'notes_duration': '0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340',
'input_type': 'word'
} # user input: Chinese characters
or,
inputs = {
'text': '小酒窝长睫毛AP是你最美的记号',
'ph_seq': 'x iao j iu w o ch ang ang j ie ie m ao AP sh i n i z ui m ei d e j i h ao',
'note_seq': 'C#4/Db4 C#4/Db4 F#4/Gb4 F#4/Gb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 F#4/Gb4 F#4/Gb4 F#4/Gb4 C#4/Db4 C#4/Db4 C#4/Db4 rest C#4/Db4 C#4/Db4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 F4 F4 C#4/Db4 C#4/Db4',
'note_dur_seq': '0.407140 0.407140 0.376190 0.376190 0.242180 0.242180 0.509550 0.509550 0.183420 0.315400 0.315400 0.235020 0.361660 0.361660 0.223070 0.377270 0.377270 0.340550 0.340550 0.299620 0.299620 0.344510 0.344510 0.283770 0.283770 0.323390 0.323390 0.360340 0.360340',
'is_slur_seq': '0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0',
'input_type': 'phoneme'
} # input like Opencpop dataset.
'''
outputs = []
for i in range(sample_num):
output = self.infer.infer_once(inputs)
os.makedirs(save_dir, exist_ok=True)
if save_audio:
save_wav(output, os.path.join(save_dir, '%d_%d.wav' % (i, int(time.time()))),
hparams['audio_sample_rate'])
outputs.append(output.tolist())
return {'wavs': outputs, 'sample_rate': hparams['audio_sample_rate']}
@runnable
def run_cmd(self, argvs: List[str]) -> str:
self.parser = argparse.ArgumentParser(description="Run the {} module.".format(self.name),
prog='hub run {}'.format(self.name),
usage='%(prog)s',
add_help=True)
self.parser.add_argument('--input_type',
type=str,
choices=['word', 'phoneme'],
required=True,
help='input type in ["word", "phoneme"].')
args = self.parser.parse_args(argvs[:2])
if args.input_type == 'word':
self.arg_input_group = self.parser.add_argument_group(title="Input options (type: word).",
description="Input options (type: word).")
self.arg_input_group.add_argument('--text', type=str, required=True, help='input text.')
self.arg_input_group.add_argument('--notes', type=str, required=True, help='input notes.')
self.arg_input_group.add_argument('--notes_duration', type=str, required=True, help='input notes duration.')
elif args.input_type == 'phoneme':
self.arg_input_group = self.parser.add_argument_group(title="Input options (type: phoneme).",
description="Input options (type: phoneme).")
self.arg_input_group.add_argument('--text', type=str, required=True, help='input text.')
self.arg_input_group.add_argument('--ph_seq', type=str, required=True, help='input phoneme seq.')
self.arg_input_group.add_argument('--note_seq', type=str, required=True, help='input note seq.')
self.arg_input_group.add_argument('--note_dur_seq',
type=str,
required=True,
help='input note duration seq.')
self.arg_input_group.add_argument('--is_slur_seq',
type=str,
required=True,
help='input if note is slur seq.')
else:
raise ValueError('Input type (--input_type) should be in ["word", "phoneme"]')
self.parser.add_argument('--sample_num', type=int, default=1, help='sample audios num, default=1')
self.parser.add_argument('--save_dir',
type=str,
default='outputs',
help='sample audios save_dir, default="outputs"')
args = self.parser.parse_args(argvs)
kwargs = vars(args).copy()
kwargs.pop('sample_num')
kwargs.pop('save_dir')
self.singing_voice_synthesis(kwargs, sample_num=args.sample_num, save_dir=args.save_dir, save_audio=True)
return "Audios are saved in %s" % args.save_dir
librosa>=0.9.2
matplotlib==3.5.3
numpy>=1.21.6
pycwt>=0.3.0a22
pypinyin>=0.47.1
PyYAML>=6.0
scipy>=1.7.3
six>=1.16.0
soundfile>=0.11.0
tqdm>=4.64.1
import shutil
import unittest
import paddlehub as hub
class TestHubModule(unittest.TestCase):
@classmethod
def setUpClass(cls) -> None:
cls.module = hub.Module(name="diffsinger")
@classmethod
def tearDownClass(cls) -> None:
shutil.rmtree('outputs')
def test_singing_voice_synthesis1(self):
results = self.module.singing_voice_synthesis(inputs={
'text': '小酒窝长睫毛AP是你最美的记号',
'notes':
'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4',
'notes_duration':
'0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340',
'input_type': 'word'
},
sample_num=1,
save_audio=True,
save_dir='outputs')
self.assertIsInstance(results, dict)
self.assertIsInstance(results['wavs'], list)
self.assertIsInstance(results['wavs'][0], list)
self.assertEqual(len(results['wavs'][0]), 123776)
self.assertEqual(results['sample_rate'], 24000)
def test_singing_voice_synthesis2(self):
results = self.module.singing_voice_synthesis(inputs={
'text': '小酒窝长睫毛AP是你最美的记号',
'ph_seq': 'x iao j iu w o ch ang ang j ie ie m ao AP sh i n i z ui m ei d e j i h ao',
'note_seq':
'C#4/Db4 C#4/Db4 F#4/Gb4 F#4/Gb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 F#4/Gb4 F#4/Gb4 F#4/Gb4 C#4/Db4 C#4/Db4 C#4/Db4 rest C#4/Db4 C#4/Db4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 F4 F4 C#4/Db4 C#4/Db4',
'note_dur_seq':
'0.407140 0.407140 0.376190 0.376190 0.242180 0.242180 0.509550 0.509550 0.183420 0.315400 0.315400 0.235020 0.361660 0.361660 0.223070 0.377270 0.377270 0.340550 0.340550 0.299620 0.299620 0.344510 0.344510 0.283770 0.283770 0.323390 0.323390 0.360340 0.360340',
'is_slur_seq': '0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0',
'input_type': 'phoneme'
},
sample_num=1,
save_audio=True,
save_dir='outputs')
self.assertIsInstance(results, dict)
self.assertIsInstance(results['wavs'], list)
self.assertIsInstance(results['wavs'][0], list)
self.assertEqual(len(results['wavs'][0]), 123776)
self.assertEqual(results['sample_rate'], 24000)
if __name__ == "__main__":
unittest.main()
task_cls: usr.task.DiffFsTask
pitch_type: frame
timesteps: 100
dilation_cycle_length: 1
residual_layers: 20
residual_channels: 256
lr: 0.001
decay_steps: 50000
keep_bins: 80
spec_min: [ ]
spec_max: [ ]
content_cond_steps: [ ] # [ 0, 10000 ]
spk_cond_steps: [ ] # [ 0, 10000 ]
# train and eval
fs2_ckpt: ''
max_updates: 400000
# max_updates: 200000
use_gt_dur: true
use_gt_f0: true
gen_tgt_spk_id: -1
max_sentences: 48
num_sanity_val_steps: 1
num_valid_plots: 1
base_config:
- configs/tts/lj/fs2.yaml
- ./base.yaml
# spec_min and spec_max are calculated on the training set.
spec_min: [ -4.7574, -4.6783, -4.6431, -4.5832, -4.5390, -4.6771, -4.8089, -4.7672,
-4.5784, -4.7755, -4.7150, -4.8919, -4.8271, -4.7389, -4.6047, -4.7759,
-4.6799, -4.8201, -4.7823, -4.8262, -4.7857, -4.7545, -4.9358, -4.9733,
-5.1134, -5.1395, -4.9016, -4.8434, -5.0189, -4.8460, -5.0529, -4.9510,
-5.0217, -5.0049, -5.1831, -5.1445, -5.1015, -5.0281, -4.9887, -4.9916,
-4.9785, -4.9071, -4.9488, -5.0342, -4.9332, -5.0650, -4.8924, -5.0875,
-5.0483, -5.0848, -5.1809, -5.0677, -5.0015, -5.0792, -5.0636, -5.2413,
-5.1421, -5.1710, -5.3256, -5.0511, -5.1186, -5.0057, -5.0446, -5.1173,
-5.0325, -5.1085, -5.0053, -5.0755, -5.1176, -5.1004, -5.2153, -5.2757,
-5.3025, -5.2867, -5.2918, -5.3328, -5.2731, -5.2985, -5.2400, -5.2211 ]
spec_max: [ -0.5982, -0.0778, 0.1205, 0.2747, 0.4657, 0.5123, 0.5684, 0.7093,
0.6461, 0.6420, 0.7316, 0.7715, 0.7681, 0.8349, 0.7815, 0.7591,
0.7910, 0.7433, 0.7352, 0.6869, 0.6854, 0.6623, 0.5353, 0.6492,
0.6909, 0.6106, 0.5761, 0.5936, 0.5638, 0.4054, 0.4545, 0.3589,
0.3037, 0.3380, 0.1599, 0.2433, 0.2741, 0.2130, 0.1569, 0.1911,
0.2324, 0.1586, 0.1221, 0.0341, -0.0558, 0.0553, -0.1153, -0.0933,
-0.1171, -0.0050, -0.1519, -0.1629, -0.0522, -0.0739, -0.2069, -0.2405,
-0.1244, -0.2116, -0.1361, -0.1575, -0.1442, 0.0513, -0.1567, -0.2000,
0.0086, -0.0698, 0.1385, 0.0941, 0.1864, 0.1225, 0.2176, 0.2566,
0.1670, 0.1007, 0.1444, 0.0888, 0.1998, 0.2414, 0.2932, 0.3047 ]
task_cls: usr.diffspeech_task.DiffSpeechTask
vocoder: vocoders.hifigan.HifiGAN
vocoder_ckpt: checkpoints/0414_hifi_lj_1
num_valid_plots: 10
use_gt_dur: false
use_gt_f0: false
pitch_type: cwt
pitch_extractor: 'parselmouth'
max_updates: 160000
lr: 0.001
timesteps: 100
K_step: 71
diff_loss_type: l1
diff_decoder_type: 'wavenet'
schedule_type: 'linear'
max_beta: 0.06
fs2_ckpt: checkpoints/fs2_lj_1/model_ckpt_steps_150000.ckpt
save_gt: true
base_config:
- configs/singing/fs2.yaml
- usr/configs/midi/cascade/opencs/opencpop_statis.yaml
audio_sample_rate: 24000
hop_size: 128 # Hop size.
fft_size: 512 # FFT size.
win_size: 512 # FFT size.
fmin: 30
fmax: 12000
min_level_db: -120
binarization_args:
with_wav: true
with_spk_embed: false
with_align: true
raw_data_dir: 'data/raw/opencpop/segments'
processed_data_dir: 'xxx'
binarizer_cls: data_gen.singing.binarize.OpencpopBinarizer
binary_data_dir: 'data/binary/opencpop-midi-dp'
use_midi: true # for midi exp
use_gt_f0: false # for midi exp
use_gt_dur: false # for further midi exp
lambda_f0: 1.0
lambda_uv: 1.0
#lambda_energy: 0.1
lambda_ph_dur: 1.0
lambda_sent_dur: 1.0
lambda_word_dur: 1.0
predictor_grad: 0.1
pe_enable: false
pe_ckpt: ''
num_spk: 1
test_prefixes: [
'2044',
'2086',
'2092',
'2093',
'2100',
]
task_cls: usr.diffsinger_task.AuxDecoderMIDITask
#vocoder: usr.singingvocoder.highgan.HighGAN
#vocoder_ckpt: checkpoints/h_2_model/checkpoint-530000steps.pkl
vocoder: vocoders.hifigan.HifiGAN
vocoder_ckpt: checkpoints/0109_hifigan_bigpopcs_hop128
use_nsf: true
# config for experiments
max_frames: 5000
max_tokens: 40000
predictor_layers: 5
rel_pos: true
dur_predictor_layers: 5 # *
use_spk_embed: false
num_valid_plots: 10
max_updates: 160000
save_gt: true
base_config:
- usr/configs/popcs_ds_beta6.yaml
- usr/configs/midi/cascade/opencs/opencpop_statis.yaml
binarizer_cls: data_gen.singing.binarize.OpencpopBinarizer
binary_data_dir: 'data/binary/opencpop-midi-dp'
#switch_midi2f0_step: 174000
use_midi: true # for midi exp
use_gt_f0: false # for midi exp
use_gt_dur: false # for further midi exp
lambda_f0: 1.0
lambda_uv: 1.0
#lambda_energy: 0.1
lambda_ph_dur: 1.0
lambda_sent_dur: 1.0
lambda_word_dur: 1.0
predictor_grad: 0.1
pe_enable: false
pe_ckpt: ''
fs2_ckpt: 'checkpoints/0302_opencpop_fs_midi/model_ckpt_steps_160000.ckpt' #
#num_valid_plots: 0
task_cls: usr.diffsinger_task.DiffSingerMIDITask
K_step: 60
max_tokens: 40000
predictor_layers: 5
dilation_cycle_length: 4 # *
rel_pos: true
dur_predictor_layers: 5 # *
max_updates: 160000
gaussian_start: false
spec_min: [-6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6.,
-6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6.,
-6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6.,
-6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6.,
-6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6.,
-6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6.,
-6., -6., -6., -6., -6., -6., -6., -6.]
spec_max: [-7.9453e-01, -8.1116e-01, -6.1631e-01, -3.0679e-01, -1.3863e-01,
-5.0652e-02, -1.1563e-01, -1.0679e-01, -9.1068e-02, -6.2174e-02,
-7.5302e-02, -7.2217e-02, -6.3815e-02, -7.3299e-02, 7.3610e-03,
-7.2508e-02, -5.0234e-02, -1.6534e-01, -2.6928e-01, -2.0782e-01,
-2.0823e-01, -1.1702e-01, -7.0128e-02, -6.5868e-02, -1.2675e-02,
1.5121e-03, -8.9902e-02, -2.1392e-01, -2.3789e-01, -2.8922e-01,
-3.0405e-01, -2.3029e-01, -2.2088e-01, -2.1542e-01, -2.9367e-01,
-3.0137e-01, -3.8281e-01, -4.3590e-01, -2.8681e-01, -4.6855e-01,
-5.7485e-01, -4.7022e-01, -5.4266e-01, -4.4848e-01, -6.4120e-01,
-6.8700e-01, -6.4860e-01, -7.6436e-01, -4.9971e-01, -7.1068e-01,
-6.9724e-01, -6.1487e-01, -5.5843e-01, -6.9773e-01, -5.7502e-01,
-7.0919e-01, -8.2431e-01, -8.4213e-01, -9.0431e-01, -8.2840e-01,
-7.7945e-01, -8.2758e-01, -8.7699e-01, -1.0532e+00, -1.0766e+00,
-1.1198e+00, -1.0185e+00, -9.8983e-01, -1.0001e+00, -1.0756e+00,
-1.0024e+00, -1.0304e+00, -1.0579e+00, -1.0188e+00, -1.0500e+00,
-1.0842e+00, -1.0923e+00, -1.1223e+00, -1.2381e+00, -1.6467e+00]
mel_vmin: -6. #-6.
mel_vmax: 1.5
wav2spec_eps: 1e-6
raw_data_dir: 'data/raw/opencpop/segments'
processed_data_dir: 'xxx'
binary_data_dir: 'data/binary/opencpop-midi-dp'
datasets: [
'opencpop',
]
test_prefixes: [
'2044',
'2086',
'2092',
'2093',
'2100',
]
base_config:
- usr/configs/popcs_ds_beta6.yaml
- usr/configs/midi/cascade/opencs/opencpop_statis.yaml
binarizer_cls: data_gen.singing.binarize.OpencpopBinarizer
binary_data_dir: 'data/binary/opencpop-midi-dp'
#switch_midi2f0_step: 174000
use_midi: true # for midi exp
use_gt_dur: false # for further midi exp
lambda_ph_dur: 1.0
lambda_sent_dur: 1.0
lambda_word_dur: 1.0
predictor_grad: 0.1
dur_predictor_layers: 5 # *
fs2_ckpt: '' #
#num_valid_plots: 0
task_cls: usr.diffsinger_task.DiffSingerMIDITask
# for diffusion schedule
timesteps: 1000
K_step: 1000
max_beta: 0.02
max_tokens: 36000
max_updates: 320000
gaussian_start: True
pndm_speedup: 40
use_pitch_embed: false
use_gt_f0: false # for midi exp
lambda_f0: 0.
lambda_uv: 0.
dilation_cycle_length: 4 # *
rel_pos: true
predictor_layers: 5
pe_enable: true
pe_ckpt: 'checkpoints/0102_xiaoma_pe'
base_config:
- usr/configs/popcs_ds_beta6.yaml
- usr/configs/midi/cascade/opencs/opencpop_statis.yaml
binarizer_cls: data_gen.singing.binarize.OpencpopBinarizer
binary_data_dir: 'data/binary/opencpop-midi-dp'
#switch_midi2f0_step: 174000
use_midi: true # for midi exp
use_gt_dur: false # for further midi exp
lambda_ph_dur: 1.0
lambda_sent_dur: 1.0
lambda_word_dur: 1.0
predictor_grad: 0.1
dur_predictor_layers: 5 # *
fs2_ckpt: '' #
#num_valid_plots: 0
task_cls: usr.diffsinger_task.DiffSingerMIDITask
K_step: 100
max_tokens: 40000
max_updates: 160000
gaussian_start: True
use_pitch_embed: false
use_gt_f0: false # for midi exp
lambda_f0: 0.
lambda_uv: 0.
dilation_cycle_length: 4 # *
rel_pos: true
predictor_layers: 5
pe_enable: true
pe_ckpt: 'checkpoints/0102_xiaoma_pe'
base_config:
- usr/configs/popcs_ds_beta6.yaml
- usr/configs/midi/cascade/popcs/popcs_statis.yaml
binarizer_cls: data_gen.singing.binarize.MidiSingingBinarizer
binary_data_dir: 'data/binary/popcs-midi-dp'
#switch_midi2f0_step: 174000
use_midi: true # for midi exp
use_gt_dur: false # for further midi exp
lambda_ph_dur: 1.0
lambda_sent_dur: 1.0
lambda_word_dur: 1.0
predictor_grad: 0.1
dur_predictor_layers: 5 # *
fs2_ckpt: '' #
#num_valid_plots: 0
task_cls: usr.diffsinger_task.DiffSingerMIDITask
K_step: 100
max_tokens: 40000
max_updates: 160000
gaussian_start: True
use_pitch_embed: false
use_gt_f0: false # for midi exp
lambda_f0: 0.
lambda_uv: 0.
dilation_cycle_length: 4 # *
rel_pos: true
predictor_layers: 5
pe_enable: true
pe_ckpt: 'checkpoints/0102_xiaoma_pe'
base_config:
- configs/tts/lj/fs2.yaml
max_frames: 8000
audio_sample_rate: 24000
hop_size: 128 # Hop size.
fft_size: 512 # FFT size.
win_size: 512 # FFT size.
fmin: 30
fmax: 12000
min_level_db: -120
binary_data_dir: 'xxx'
pitch_type: frame
task_cls: tasks.tts.pe.PitchExtractionTask
pitch_extractor_conv_layers: 2
# config for experiments
max_tokens: 20000
use_spk_embed: false
num_valid_plots: 10
max_updates: 60000
base_config:
- configs/tts/fs2.yaml
- configs/singing/base.yaml
- ./base.yaml
audio_sample_rate: 24000
hop_size: 128 # Hop size.
fft_size: 512 # FFT size.
win_size: 512 # FFT size.
fmin: 30
fmax: 12000
min_level_db: -120
binarization_args:
with_wav: true
with_spk_embed: false
with_align: true
raw_data_dir: 'data/raw/popcs'
processed_data_dir: 'data/processed/popcs'
binary_data_dir: 'data/binary/popcs-pmf0'
num_spk: 1
datasets: [
'popcs',
]
test_prefixes: [
'popcs-说散就散',
'popcs-隐形的翅膀',
]
spec_min: [-6.8276, -7.0270, -6.8142, -7.1429, -7.6669, -7.6000, -7.1148, -6.9640,
-6.8414, -6.6596, -6.6880, -6.7439, -6.7986, -7.4940, -7.7845, -7.6586,
-6.9288, -6.7639, -6.9118, -6.8246, -6.7183, -7.1769, -6.9794, -7.4513,
-7.3422, -7.5623, -6.9610, -6.8158, -6.9595, -6.8403, -6.5688, -6.6356,
-7.0209, -6.5002, -6.7819, -6.5232, -6.6927, -6.5701, -6.5531, -6.7069,
-6.6462, -6.4523, -6.5954, -6.4264, -6.4487, -6.7070, -6.4025, -6.3042,
-6.4008, -6.3857, -6.3903, -6.3094, -6.2491, -6.3518, -6.3566, -6.4168,
-6.2481, -6.3624, -6.2858, -6.2575, -6.3638, -6.4520, -6.1835, -6.2754,
-6.1253, -6.1645, -6.0638, -6.1262, -6.0710, -6.1039, -6.4428, -6.1363,
-6.1054, -6.1252, -6.1797, -6.0235, -6.0758, -5.9453, -6.0213, -6.0446]
spec_max: [ 0.2645, 0.0583, -0.2344, -0.0184, 0.1227, 0.1533, 0.1103, 0.1212,
0.2421, 0.1809, 0.2134, 0.3161, 0.3301, 0.3289, 0.2667, 0.2421,
0.2581, 0.2600, 0.1394, 0.1907, 0.1082, 0.1474, 0.1680, 0.2550,
0.1057, 0.0826, 0.0423, 0.1203, -0.0701, -0.0056, 0.0477, -0.0639,
-0.0272, -0.0728, -0.1648, -0.0855, -0.2652, -0.1998, -0.1547, -0.2167,
-0.4181, -0.5463, -0.4161, -0.4733, -0.6518, -0.5387, -0.4290, -0.4191,
-0.4151, -0.3042, -0.3810, -0.4160, -0.4496, -0.2847, -0.4676, -0.4658,
-0.4931, -0.4885, -0.5547, -0.5481, -0.6948, -0.7968, -0.8455, -0.8392,
-0.8770, -0.9520, -0.8749, -0.7297, -0.8374, -0.8667, -0.7157, -0.9035,
-0.9219, -0.8801, -0.9298, -0.9009, -0.9604, -1.0537, -1.0781, -1.3766]
task_cls: usr.diffsinger_task.DiffSingerTask
#vocoder: usr.singingvocoder.highgan.HighGAN
#vocoder_ckpt: checkpoints/h_2_model/checkpoint-530000steps.pkl
vocoder: vocoders.hifigan.HifiGAN
vocoder_ckpt: checkpoints/0109_hifigan_bigpopcs_hop128
pitch_extractor: 'parselmouth'
# config for experiments
use_spk_embed: false
num_valid_plots: 10
max_updates: 160000
lr: 0.001
timesteps: 100
K_step: 51
diff_loss_type: l1
diff_decoder_type: 'wavenet'
schedule_type: 'linear'
max_beta: 0.06
fs2_ckpt: ''
use_nsf: true
base_config:
- ./popcs_ds_beta6.yaml
fs2_ckpt: checkpoints/popcs_fs2_pmf0_1230/model_ckpt_steps_160000.ckpt # to be infer
num_valid_plots: 0
task_cls: usr.diffsinger_task.DiffSingerOfflineTask
# tmp:
#pe_enable: true
#pe_ckpt: ''
vocoder: vocoders.hifigan.HifiGAN
vocoder_ckpt: checkpoints/0109_hifigan_bigpopcs_hop128
base_config:
- configs/singing/fs2.yaml
audio_sample_rate: 24000
hop_size: 128 # Hop size.
fft_size: 512 # FFT size.
win_size: 512 # FFT size.
fmin: 30
fmax: 12000
min_level_db: -120
binarization_args:
with_wav: true
with_spk_embed: false
with_align: true
raw_data_dir: 'data/raw/popcs'
processed_data_dir: 'data/processed/popcs'
binary_data_dir: 'data/binary/popcs-pmf0'
num_spk: 1
datasets: [
'popcs',
]
test_prefixes: [
'popcs-说散就散',
'popcs-隐形的翅膀',
]
task_cls: tasks.tts.fs2.FastSpeech2Task
#vocoder: usr.singingvocoder.highgan.HighGAN
#vocoder_ckpt: checkpoints/h_2_model/checkpoint-530000steps.pkl
vocoder: vocoders.hifigan.HifiGAN
vocoder_ckpt: checkpoints/0109_hifigan_bigpopcs_hop128
use_nsf: true
# config for experiments
max_tokens: 18000
use_spk_embed: false
num_valid_plots: 10
max_updates: 160000
save_gt: true
# tmp:
#pe_enable: true
#pe_ckpt: ''
import sys
import time
import types
import numpy as np
class AvgrageMeter(object):
def __init__(self):
self.reset()
def reset(self):
self.avg = 0
self.sum = 0
self.cnt = 0
def update(self, val, n=1):
self.sum += val * n
self.cnt += n
self.avg = self.sum / self.cnt
def collate_1d(values, pad_idx=0, left_pad=False, shift_right=False, max_len=None, shift_id=1):
"""Convert a list of 1d tensors into a padded 2d tensor."""
size = max(v.size(0) for v in values) if max_len is None else max_len
res = values[0].new(len(values), size).fill_(pad_idx)
def copy_tensor(src, dst):
assert dst.numel() == src.numel()
if shift_right:
dst[1:] = src[:-1]
dst[0] = shift_id
else:
dst.copy_(src)
for i, v in enumerate(values):
copy_tensor(v, res[i][size - len(v):] if left_pad else res[i][:len(v)])
return res
def collate_2d(values, pad_idx=0, left_pad=False, shift_right=False, max_len=None):
"""Convert a list of 2d tensors into a padded 3d tensor."""
size = max(v.size(0) for v in values) if max_len is None else max_len
res = values[0].new(len(values), size, values[0].shape[1]).fill_(pad_idx)
def copy_tensor(src, dst):
assert dst.numel() == src.numel()
if shift_right:
dst[1:] = src[:-1]
else:
dst.copy_(src)
for i, v in enumerate(values):
copy_tensor(v, res[i][size - len(v):] if left_pad else res[i][:len(v)])
return res
def _is_batch_full(batch, num_tokens, max_tokens, max_sentences):
if len(batch) == 0:
return 0
if len(batch) == max_sentences:
return 1
if num_tokens > max_tokens:
return 1
return 0
def batch_by_size(indices,
num_tokens_fn,
max_tokens=None,
max_sentences=None,
required_batch_size_multiple=1,
distributed=False):
"""
Return mini-batches of indices bucketed by size. Batches may contain
sequences of different lengths.
Args:
indices (List[int]): ordered list of dataset indices
num_tokens_fn (callable): function that returns the number of tokens at
a given index
max_tokens (int, optional): max number of tokens in each batch
(default: None).
max_sentences (int, optional): max number of sentences in each
batch (default: None).
required_batch_size_multiple (int, optional): require batch size to
be a multiple of N (default: 1).
distributed (bool, optional): unused in this implementation; kept for
API compatibility (default: False).
"""
max_tokens = max_tokens if max_tokens is not None else sys.maxsize
max_sentences = max_sentences if max_sentences is not None else sys.maxsize
bsz_mult = required_batch_size_multiple
if isinstance(indices, types.GeneratorType):
indices = np.fromiter(indices, dtype=np.int64, count=-1)
sample_len = 0
sample_lens = []
batch = []
batches = []
for i in range(len(indices)):
idx = indices[i]
num_tokens = num_tokens_fn(idx)
sample_lens.append(num_tokens)
sample_len = max(sample_len, num_tokens)
assert sample_len <= max_tokens, ("sentence at index {} of size {} exceeds max_tokens "
"limit of {}!".format(idx, sample_len, max_tokens))
num_tokens = (len(batch) + 1) * sample_len
if _is_batch_full(batch, num_tokens, max_tokens, max_sentences):
mod_len = max(
bsz_mult * (len(batch) // bsz_mult),
len(batch) % bsz_mult,
)
batches.append(batch[:mod_len])
batch = batch[mod_len:]
sample_lens = sample_lens[mod_len:]
sample_len = max(sample_lens) if len(sample_lens) > 0 else 0
batch.append(idx)
if len(batch) > 0:
batches.append(batch)
return batches
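A hypothetical usage sketch of `batch_by_size`, with a toy length table standing in for a real `num_tokens_fn`:

```python
# Group dataset indices so that (batch size) * (longest sample) stays under max_tokens.
lengths = [5, 7, 3, 9, 4, 6]          # assumed per-sample token counts
indices = list(range(len(lengths)))
batches = batch_by_size(indices, num_tokens_fn=lambda i: lengths[i], max_tokens=16)
print(batches)   # [[0, 1], [2], [3], [4, 5]] for this toy length table
```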
def unpack_dict_to_list(samples):
samples_ = []
bsz = samples.get('outputs').size(0)
for i in range(bsz):
res = {}
for k, v in samples.items():
try:
res[k] = v[i]
except:
# value cannot be indexed per-sample (e.g. a scalar); skip it
pass
samples_.append(res)
return samples_
def remove_padding(x, padding_idx=0):
if x is None:
return None
assert len(x.shape) in [1, 2]
if len(x.shape) == 2: # [T, H]
return x[np.abs(x).sum(-1) != padding_idx]
elif len(x.shape) == 1: # [T]
return x[x != padding_idx]
class Timer:
timer_map = {}
def __init__(self, name, print_time=False):
if name not in Timer.timer_map:
Timer.timer_map[name] = 0
self.name = name
self.print_time = print_time
def __enter__(self):
self.t = time.time()
def __exit__(self, exc_type, exc_val, exc_tb):
Timer.timer_map[self.name] += time.time() - self.t
if self.print_time:
print(self.name, Timer.timer_map[self.name])
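A small usage sketch of `Timer`; the scope name and the `time.sleep` stand-in for real work are illustrative:

```python
# Accumulate wall-clock time under a named key and print it on exit.
with Timer('vocoder', print_time=True):
    time.sleep(0.1)   # stand-in for e.g. a vocoder forward pass
# Repeated scopes with the same name keep adding to Timer.timer_map['vocoder'].
```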
import subprocess
import matplotlib
matplotlib.use('Agg')
import librosa
import librosa.filters
import numpy as np
from scipy import signal
from scipy.io import wavfile
def save_wav(wav, path, sr, norm=False):
if norm:
wav = wav / np.abs(wav).max()
wav *= 32767
# proposed by @dsmiller
wavfile.write(path, sr, wav.astype(np.int16))
def get_hop_size(hparams):
hop_size = hparams['hop_size']
if hop_size is None:
assert hparams['frame_shift_ms'] is not None
hop_size = int(hparams['frame_shift_ms'] / 1000 * hparams['audio_sample_rate'])
return hop_size
###########################################################################################
def _stft(y, hparams):
return librosa.stft(y=y,
n_fft=hparams['fft_size'],
hop_length=get_hop_size(hparams),
win_length=hparams['win_size'],
pad_mode='constant')
def _istft(y, hparams):
return librosa.istft(y, hop_length=get_hop_size(hparams), win_length=hparams['win_size'])
def librosa_pad_lr(x, fsize, fshift, pad_sides=1):
'''compute right padding (final frame) or both sides padding (first and final frames)
'''
assert pad_sides in (1, 2)
# return int(fsize // 2)
pad = (x.shape[0] // fshift + 1) * fshift - x.shape[0]
if pad_sides == 1:
return 0, pad
else:
return pad // 2, pad // 2 + pad % 2
# Conversions
def amp_to_db(x):
return 20 * np.log10(np.maximum(1e-5, x))
def normalize(S, hparams):
return (S - hparams['min_level_db']) / -hparams['min_level_db']
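A minimal sketch chaining `_stft`, `amp_to_db` and `normalize` on a synthetic waveform; `hparams_demo` mirrors the fs2 audio settings listed earlier and is not the module's real hparams object:

```python
# Turn one second of noise into a normalized log-magnitude spectrogram.
hparams_demo = {
    'fft_size': 512,
    'hop_size': 128,
    'win_size': 512,
    'audio_sample_rate': 24000,
    'frame_shift_ms': None,
    'min_level_db': -120,
}
wav = np.random.randn(24000).astype(np.float32)
spec = np.abs(_stft(wav, hparams_demo))        # [freq_bins, frames] magnitude
spec_db = amp_to_db(spec)                      # to dB, floored at -100 dB
spec_norm = normalize(spec_db, hparams_demo)   # roughly rescaled towards [0, 1]
print(spec_norm.shape)
```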
import librosa
import numpy as np
from pycwt import wavelet
from scipy.interpolate import interp1d
def load_wav(wav_file, sr):
wav, _ = librosa.load(wav_file, sr=sr, mono=True)
return wav
def convert_continuos_f0(f0):
'''Convert an f0 sequence into a continuous f0 contour.
Args:
f0 (ndarray): original f0 sequence with shape (T); unvoiced frames are 0
Return:
(ndarray, ndarray): binary voiced/unvoiced mask and continuous f0, both with shape (T)
'''
# get uv information as binary
f0 = np.copy(f0)
uv = np.float32(f0 != 0)
# get start and end of f0
if (f0 == 0).all():
print("| all of the f0 values are 0.")
return uv, f0
start_f0 = f0[f0 != 0][0]
end_f0 = f0[f0 != 0][-1]
# padding start and end of f0 sequence
start_idx = np.where(f0 == start_f0)[0][0]
end_idx = np.where(f0 == end_f0)[0][-1]
f0[:start_idx] = start_f0
f0[end_idx:] = end_f0
# get non-zero frame index
nz_frames = np.where(f0 != 0)[0]
# perform linear interpolation
f = interp1d(nz_frames, f0[nz_frames])
cont_f0 = f(np.arange(0, f0.shape[0]))
return uv, cont_f0
def get_cont_lf0(f0, frame_period=5.0):
uv, cont_f0_lpf = convert_continuos_f0(f0)
# cont_f0_lpf = low_pass_filter(cont_f0_lpf, int(1.0 / (frame_period * 0.001)), cutoff=20)
cont_lf0_lpf = np.log(cont_f0_lpf)
return uv, cont_lf0_lpf
def get_lf0_cwt(lf0):
'''
input:
signal of shape (N)
output:
Wavelet_lf0 of shape (N, 10), scales of shape (10)
'''
mother = wavelet.MexicanHat()
dt = 0.005
dj = 1
s0 = dt * 2
J = 9
Wavelet_lf0, scales, _, _, _, _ = wavelet.cwt(np.squeeze(lf0), dt, dj, s0, J, mother)
# Wavelet.shape => (J + 1, len(lf0))
Wavelet_lf0 = np.real(Wavelet_lf0).T
return Wavelet_lf0, scales
def norm_scale(Wavelet_lf0):
Wavelet_lf0_norm = np.zeros((Wavelet_lf0.shape[0], Wavelet_lf0.shape[1]))
mean = Wavelet_lf0.mean(0)[None, :]
std = Wavelet_lf0.std(0)[None, :]
Wavelet_lf0_norm = (Wavelet_lf0 - mean) / std
return Wavelet_lf0_norm, mean, std
def normalize_cwt_lf0(f0, mean, std):
uv, cont_lf0_lpf = get_cont_lf0(f0)
cont_lf0_norm = (cont_lf0_lpf - mean) / std
Wavelet_lf0, scales = get_lf0_cwt(cont_lf0_norm)
Wavelet_lf0_norm, _, _ = norm_scale(Wavelet_lf0)
return Wavelet_lf0_norm
def get_lf0_cwt_norm(f0s, mean, std):
uvs = list()
cont_lf0_lpfs = list()
cont_lf0_lpf_norms = list()
Wavelet_lf0s = list()
Wavelet_lf0s_norm = list()
scaless = list()
means = list()
stds = list()
for f0 in f0s:
uv, cont_lf0_lpf = get_cont_lf0(f0)
cont_lf0_lpf_norm = (cont_lf0_lpf - mean) / std
Wavelet_lf0, scales = get_lf0_cwt(cont_lf0_lpf_norm) # [560,10]
Wavelet_lf0_norm, mean_scale, std_scale = norm_scale(Wavelet_lf0) # [560,10],[1,10],[1,10]
Wavelet_lf0s_norm.append(Wavelet_lf0_norm)
uvs.append(uv)
cont_lf0_lpfs.append(cont_lf0_lpf)
cont_lf0_lpf_norms.append(cont_lf0_lpf_norm)
Wavelet_lf0s.append(Wavelet_lf0)
scaless.append(scales)
means.append(mean_scale)
stds.append(std_scale)
return Wavelet_lf0s_norm, scaless, means, stds
def inverse_cwt(Wavelet_lf0, scales):
b = ((np.arange(0, len(scales))[None, None, :] + 1 + 2.5)**(-2.5))
lf0_rec = Wavelet_lf0 * b
lf0_rec_sum = lf0_rec.sum(-1)
lf0_rec_sum = (lf0_rec_sum - lf0_rec_sum.mean(-1, keepdims=True)) / lf0_rec_sum.std(-1, keepdims=True)
return lf0_rec_sum
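A hypothetical end-to-end sketch of the f0 CWT helpers above: fill the unvoiced gaps of a toy contour, decompose it into its 10-band wavelet representation, and reconstruct a normalized contour from it:

```python
# Toy f0 contour: 200 frames, voiced only in the middle section.
f0 = np.zeros(200)
f0[20:180] = 220.0 + 20.0 * np.sin(np.linspace(0, 6.28, 160))

uv, cont_lf0 = get_cont_lf0(f0)                    # voiced/unvoiced mask + continuous log-f0
wavelet_lf0, scales = get_lf0_cwt(cont_lf0)        # shape (200, 10)
wavelet_norm, _, _ = norm_scale(wavelet_lf0)       # per-band normalization
lf0_rec = inverse_cwt(wavelet_norm[None], scales)  # zero-mean, unit-variance reconstruction
print(wavelet_norm.shape, lf0_rec.shape)           # (200, 10) (1, 200)
```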
import argparse
import os
import yaml
global_print_hparams = True
hparams = {}
class Args:
def __init__(self, **kwargs):
for k, v in kwargs.items():
self.__setattr__(k, v)
def override_config(old_config: dict, new_config: dict):
for k, v in new_config.items():
if isinstance(v, dict) and k in old_config:
override_config(old_config[k], new_config[k])
else:
old_config[k] = v
def set_hparams(config='', exp_name='', hparams_str='', print_hparams=True, global_hparams=True, root='.'):
if config == '' and exp_name == '':
parser = argparse.ArgumentParser(description='neural music')
parser.add_argument('--config', type=str, default='', help='path to the config yaml')
parser.add_argument('--exp_name', type=str, default='', help='exp_name')
parser.add_argument('--hparams', type=str, default='', help='hparams overrides, e.g. "k1=v1,k2=v2"')
parser.add_argument('--infer', action='store_true', help='infer')
parser.add_argument('--validate', action='store_true', help='validate')
parser.add_argument('--reset', action='store_true', help='reset hparams')
parser.add_argument('--debug', action='store_true', help='debug')
args, unknown = parser.parse_known_args()
else:
args = Args(config=config,
exp_name=exp_name,
hparams=hparams_str,
infer=False,
validate=False,
reset=False,
debug=False)
args_work_dir = ''
if args.exp_name != '':
args.work_dir = args.exp_name
args_work_dir = f'checkpoints/{args.work_dir}'
config_chains = []
loaded_config = set()
def load_config(config_fn): # deep first
with open(os.path.join(root, config_fn)) as f:
hparams_ = yaml.safe_load(f)
loaded_config.add(config_fn)
if 'base_config' in hparams_:
ret_hparams = {}
if not isinstance(hparams_['base_config'], list):
hparams_['base_config'] = [hparams_['base_config']]
for c in hparams_['base_config']:
if c not in loaded_config:
if c.startswith('.'):
c = f'{os.path.dirname(config_fn)}/{c}'
c = os.path.normpath(c)
override_config(ret_hparams, load_config(c))
override_config(ret_hparams, hparams_)
else:
ret_hparams = hparams_
config_chains.append(config_fn)
return ret_hparams
global hparams
assert args.config != '' or args_work_dir != ''
saved_hparams = {}
if args_work_dir != 'checkpoints/':
ckpt_config_path = f'{args_work_dir}/config.yaml'
if os.path.exists(ckpt_config_path):
try:
with open(ckpt_config_path) as f:
saved_hparams.update(yaml.safe_load(f))
except:
pass
if args.config == '':
args.config = ckpt_config_path
hparams_ = {}
hparams_.update(load_config(args.config))
if not args.reset:
hparams_.update(saved_hparams)
hparams_['work_dir'] = args_work_dir
if args.hparams != "":
for new_hparam in args.hparams.split(","):
k, v = new_hparam.split("=")
if v in ['True', 'False'] or type(hparams_[k]) == bool:
hparams_[k] = eval(v)
else:
hparams_[k] = type(hparams_[k])(v)
if args_work_dir != '' and (not os.path.exists(ckpt_config_path) or args.reset) and not args.infer:
os.makedirs(hparams_['work_dir'], exist_ok=True)
with open(ckpt_config_path, 'w') as f:
yaml.safe_dump(hparams_, f)
hparams_['infer'] = args.infer
hparams_['debug'] = args.debug
hparams_['validate'] = args.validate
global global_print_hparams
if global_hparams:
hparams.clear()
hparams.update(hparams_)
if print_hparams and global_print_hparams and global_hparams:
print('| Hparams chains: ', config_chains)
print('| Hparams: ')
for i, (k, v) in enumerate(sorted(hparams_.items())):
print(f"\033[;33;m{k}\033[0m: {v}, ", end="\n" if i % 5 == 4 else "")
print("")
global_print_hparams = False
# print(hparams_.keys())
if hparams.get('exp_name') is None:
hparams['exp_name'] = args.exp_name
if hparams_.get('exp_name') is None:
hparams_['exp_name'] = args.exp_name
return hparams_
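A usage sketch of `set_hparams` that bypasses the CLI parser; the config path is illustrative and has to exist relative to `root` for the call to succeed:

```python
# Load a config chain (base_config entries are merged depth-first) without argparse.
hp = set_hparams(config='configs/singing/fs2.yaml', exp_name='',
                 print_hparams=False, global_hparams=False, root='.')
print(hp.get('audio_sample_rate'))   # 24000 according to the config shown above
```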
import os
import traceback
from multiprocessing import Process
from multiprocessing import Queue
def chunked_worker(worker_id, map_func, args, results_queue=None, init_ctx_func=None):
ctx = init_ctx_func(worker_id) if init_ctx_func is not None else None
for job_idx, arg in args:
try:
if ctx is not None:
res = map_func(*arg, ctx=ctx)
else:
res = map_func(*arg)
results_queue.put((job_idx, res))
except:
traceback.print_exc()
results_queue.put((job_idx, None))
def chunked_multiprocess_run(map_func, args, num_workers=None, ordered=True, init_ctx_func=None, q_max_size=1000):
args = zip(range(len(args)), args)
args = list(args)
n_jobs = len(args)
if num_workers is None:
num_workers = int(os.getenv('N_PROC', os.cpu_count()))
results_queues = []
if ordered:
for i in range(num_workers):
results_queues.append(Queue(maxsize=q_max_size // num_workers))
else:
results_queue = Queue(maxsize=q_max_size)
for i in range(num_workers):
results_queues.append(results_queue)
workers = []
for i in range(num_workers):
args_worker = args[i::num_workers]
p = Process(target=chunked_worker,
args=(i, map_func, args_worker, results_queues[i], init_ctx_func),
daemon=True)
workers.append(p)
p.start()
for n_finished in range(n_jobs):
results_queue = results_queues[n_finished % num_workers]
job_idx, res = results_queue.get()
assert job_idx == n_finished or not ordered, (job_idx, n_finished)
yield res
for w in workers:
w.join()
w.close()
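A hypothetical usage sketch of `chunked_multiprocess_run`; the `square` job is illustrative, and the `__main__` guard matters on spawn-based platforms:

```python
def square(x):
    return x * x

if __name__ == '__main__':
    jobs = [(i,) for i in range(8)]    # each entry is one argument tuple for map_func
    for result in chunked_multiprocess_run(square, jobs, num_workers=2):
        print(result)                  # yielded in submission order because ordered=True
```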
import re
import six
PAD = "<pad>"
EOS = "<EOS>"
UNK = "<UNK>"
SEG = "|"
RESERVED_TOKENS = [PAD, EOS, UNK]
NUM_RESERVED_TOKENS = len(RESERVED_TOKENS)
PAD_ID = RESERVED_TOKENS.index(PAD) # Normally 0
EOS_ID = RESERVED_TOKENS.index(EOS) # Normally 1
UNK_ID = RESERVED_TOKENS.index(UNK) # Normally 2
if six.PY2:
RESERVED_TOKENS_BYTES = RESERVED_TOKENS
else:
RESERVED_TOKENS_BYTES = [bytes(PAD, "ascii"), bytes(EOS, "ascii")]
# Regular expression for unescaping token strings.
# '\u' is converted to '_'
# '\\' is converted to '\'
# '\213;' is converted to unichr(213)
_UNESCAPE_REGEX = re.compile(r"\\u|\\\\|\\([0-9]+);")
_ESCAPE_CHARS = set(u"\\_u;0123456789")
def strip_ids(ids, ids_to_strip):
"""Strip ids_to_strip from the end ids."""
ids = list(ids)
while ids and ids[-1] in ids_to_strip:
ids.pop()
return ids
class TextEncoder(object):
"""Base class for converting from ints to/from human readable strings."""
def __init__(self, num_reserved_ids=NUM_RESERVED_TOKENS):
self._num_reserved_ids = num_reserved_ids
@property
def num_reserved_ids(self):
return self._num_reserved_ids
def encode(self, s):
"""Transform a human-readable string into a sequence of int ids.
The ids should be in the range [num_reserved_ids, vocab_size). Ids [0,
num_reserved_ids) are reserved.
EOS is not appended.
Args:
s: human-readable string to be converted.
Returns:
ids: list of integers
"""
return [int(w) + self._num_reserved_ids for w in s.split()]
def decode(self, ids, strip_extraneous=False):
"""Transform a sequence of int ids into a human-readable string.
EOS is not expected in ids.
Args:
ids: list of integers to be converted.
strip_extraneous: bool, whether to strip off extraneous tokens
(EOS and PAD).
Returns:
s: human-readable string.
"""
if strip_extraneous:
ids = strip_ids(ids, list(range(self._num_reserved_ids or 0)))
return " ".join(self.decode_list(ids))
def decode_list(self, ids):
"""Transform a sequence of int ids into a their string versions.
This method supports transforming individual input/output ids to their
string versions so that sequence to/from text conversions can be visualized
in a human readable format.
Args:
ids: list of integers to be converted.
Returns:
strs: list of human-readable strings.
"""
decoded_ids = []
for id_ in ids:
if 0 <= id_ < self._num_reserved_ids:
decoded_ids.append(RESERVED_TOKENS[int(id_)])
else:
decoded_ids.append(id_ - self._num_reserved_ids)
return [str(d) for d in decoded_ids]
@property
def vocab_size(self):
raise NotImplementedError()
class ByteTextEncoder(TextEncoder):
"""Encodes each byte to an id. For 8-bit strings only."""
def encode(self, s):
numres = self._num_reserved_ids
return [c + numres for c in s.encode("utf-8")]
def decode(self, ids, strip_extraneous=False):
if strip_extraneous:
ids = strip_ids(ids, list(range(self._num_reserved_ids or 0)))
numres = self._num_reserved_ids
decoded_ids = []
int2byte = six.int2byte
for id_ in ids:
if 0 <= id_ < numres:
decoded_ids.append(RESERVED_TOKENS_BYTES[int(id_)])
else:
decoded_ids.append(int2byte(id_ - numres))
if six.PY2:
return "".join(decoded_ids)
# Python3: join byte arrays and then decode string
return b"".join(decoded_ids).decode("utf-8", "replace")
def decode_list(self, ids):
numres = self._num_reserved_ids
decoded_ids = []
int2byte = six.int2byte
for id_ in ids:
if 0 <= id_ < numres:
decoded_ids.append(RESERVED_TOKENS_BYTES[int(id_)])
else:
decoded_ids.append(int2byte(id_ - numres))
# Python3: join byte arrays and then decode string
return decoded_ids
@property
def vocab_size(self):
return 2**8 + self._num_reserved_ids
class ByteTextEncoderWithEos(ByteTextEncoder):
"""Encodes each byte to an id and appends the EOS token."""
def encode(self, s):
return super(ByteTextEncoderWithEos, self).encode(s) + [EOS_ID]
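A small round-trip sketch for the byte-level encoders; the input string is arbitrary:

```python
enc = ByteTextEncoder()
ids = enc.encode("la")            # [111, 100]: each UTF-8 byte shifted by 3 reserved ids
print(ids, enc.decode(ids))       # -> [111, 100] la

eos_enc = ByteTextEncoderWithEos()
print(eos_enc.encode("la"))       # same ids with EOS_ID (1) appended
```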
class TokenTextEncoder(TextEncoder):
"""Encoder based on a user-supplied vocabulary (file or list)."""
def __init__(self,
vocab_filename,
reverse=False,
vocab_list=None,
replace_oov=None,
num_reserved_ids=NUM_RESERVED_TOKENS):
"""Initialize from a file or list, one token per line.
Handling of reserved tokens works as follows:
- When initializing from a list, we add reserved tokens to the vocab.
- When initializing from a file, we do not add reserved tokens to the vocab.
- When saving vocab files, we save reserved tokens to the file.
Args:
vocab_filename: If not None, the full filename to read vocab from. If this
is not None, then vocab_list should be None.
reverse: Boolean indicating if tokens should be reversed during encoding
and decoding.
vocab_list: If not None, a list of elements of the vocabulary. If this is
not None, then vocab_filename should be None.
replace_oov: If not None, every out-of-vocabulary token seen when
encoding will be replaced by this string (which must be in vocab).
num_reserved_ids: Number of IDs to save for reserved tokens like <EOS>.
"""
super(TokenTextEncoder, self).__init__(num_reserved_ids=num_reserved_ids)
self._reverse = reverse
self._replace_oov = replace_oov
if vocab_filename:
self._init_vocab_from_file(vocab_filename)
else:
assert vocab_list is not None
self._init_vocab_from_list(vocab_list)
self.pad_index = self._token_to_id[PAD]
self.eos_index = self._token_to_id[EOS]
self.unk_index = self._token_to_id[UNK]
self.seg_index = self._token_to_id[SEG] if SEG in self._token_to_id else self.eos_index
def encode(self, s):
"""Converts a space-separated string of tokens to a list of ids."""
sentence = s
tokens = sentence.strip().split()
if self._replace_oov is not None:
tokens = [t if t in self._token_to_id else self._replace_oov for t in tokens]
ret = [self._token_to_id[tok] for tok in tokens]
return ret[::-1] if self._reverse else ret
def decode(self, ids, strip_eos=False, strip_padding=False):
if strip_padding and self.pad() in list(ids):
pad_pos = list(ids).index(self.pad())
ids = ids[:pad_pos]
if strip_eos and self.eos() in list(ids):
eos_pos = list(ids).index(self.eos())
ids = ids[:eos_pos]
return " ".join(self.decode_list(ids))
def decode_list(self, ids):
seq = reversed(ids) if self._reverse else ids
return [self._safe_id_to_token(i) for i in seq]
@property
def vocab_size(self):
return len(self._id_to_token)
def __len__(self):
return self.vocab_size
def _safe_id_to_token(self, idx):
return self._id_to_token.get(idx, "ID_%d" % idx)
def _init_vocab_from_file(self, filename):
"""Load vocab from a file.
Args:
filename: The file to load vocabulary from.
"""
with open(filename) as f:
tokens = [token.strip() for token in f.readlines()]
def token_gen():
for token in tokens:
yield token
self._init_vocab(token_gen(), add_reserved_tokens=False)
def _init_vocab_from_list(self, vocab_list):
"""Initialize tokens from a list of tokens.
It is ok if reserved tokens appear in the vocab list. They will be
removed. The set of tokens in vocab_list should be unique.
Args:
vocab_list: A list of tokens.
"""
def token_gen():
for token in vocab_list:
if token not in RESERVED_TOKENS:
yield token
self._init_vocab(token_gen())
def _init_vocab(self, token_generator, add_reserved_tokens=True):
"""Initialize vocabulary with tokens from token_generator."""
self._id_to_token = {}
non_reserved_start_index = 0
if add_reserved_tokens:
self._id_to_token.update(enumerate(RESERVED_TOKENS))
non_reserved_start_index = len(RESERVED_TOKENS)
self._id_to_token.update(enumerate(token_generator, start=non_reserved_start_index))
# _token_to_id is the reverse of _id_to_token
self._token_to_id = dict((v, k) for k, v in six.iteritems(self._id_to_token))
def pad(self):
return self.pad_index
def eos(self):
return self.eos_index
def unk(self):
return self.unk_index
def seg(self):
return self.seg_index
def store_to_file(self, filename):
"""Write vocab file to disk.
Vocab files have one token per line. The file ends in a newline. Reserved
tokens are written to the vocab file as well.
Args:
filename: Full path of the file to store the vocab to.
"""
with open(filename, "w") as f:
for i in range(len(self._id_to_token)):
f.write(self._id_to_token[i] + "\n")
def sil_phonemes(self):
return [p for p in self._id_to_token.values() if not p[0].isalpha()]
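A usage sketch of `TokenTextEncoder` built from an in-memory vocabulary; the phoneme list is illustrative, not the module's actual dictionary:

```python
phones = ['AP', 'SP', 'x', 'iao', 'j', 'iu', 'w', 'o']
encoder = TokenTextEncoder(None, vocab_list=phones, replace_oov=UNK)

ids = encoder.encode('x iao j iu')
print(ids)                  # [5, 6, 7, 8]: reserved ids occupy 0-2, so 'x' maps to 5
print(encoder.decode(ids))  # -> 'x iao j iu'
```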