Unverified commit 7eef3bfd, authored by jm_12138, committed by GitHub

Add Diffsinger Module (#2120)

* add diffsinger

* update README

* update README
Parent c9211e2a
# diffsinger
|Module Name|diffsinger|
| :--- | :---: |
|Category|Audio - Singing Voice Synthesis|
|Network|DiffSinger|
|Dataset|-|
|Fine-tuning Supported|No|
|Module Size|256.1MB|
|Metrics|-|
|Latest Update Date|2022-10-25|
## I. Basic Information
- ### Application Effect Display
- Network architecture:
<p align="center">
<img src="https://neuralsvb.github.io/resources/model_all7.png"/>
</p>
- Sample results:
|Text|Audio|
|:-:|:-:|
|让 梦 恒 久 比 天 长|<audio controls="controls"><source src="https://diffsinger.github.io/audio/singing_demo/diffsinger-base/000000007.wav" autoplay=""></audio>|
|我 终 于 翱 翔|<audio controls="controls"><source src="https://diffsinger.github.io/audio/singing_demo/diffsinger-base/000000005.wav" autoplay=""></audio>|
- ### Module Introduction
- DiffSinger is an acoustic model for singing voice synthesis (SVS) built on a diffusion probabilistic model. It is a parameterized Markov chain that, conditioned on the music score, iteratively converts noise into a mel-spectrogram. By implicitly optimizing a variational bound, DiffSinger can be trained stably and produces realistic outputs. A sketch of the standard diffusion formulation it builds on is given below.
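- As a rough sketch of that standard formulation (illustrative only, not notation taken from this repository; M_t is the noised mel-spectrogram at step t, beta_t the noise schedule, and x the music-score condition):
```latex
% forward (diffusion) process: gradually add Gaussian noise to the mel-spectrogram
q(M_t \mid M_{t-1}) = \mathcal{N}\big(M_t;\ \sqrt{1-\beta_t}\, M_{t-1},\ \beta_t \mathbf{I}\big)

% learned reverse (denoising) process, conditioned on the music score x
p_\theta(M_{t-1} \mid M_t, x) = \mathcal{N}\big(M_{t-1};\ \mu_\theta(M_t, t, x),\ \sigma_t^2 \mathbf{I}\big)
```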
## II. Installation
- ### 1. Dependencies
- onnxruntime >= 1.12.0
```shell
# CPU
$ pip install onnxruntime
# GPU
$ pip install onnxruntime-gpu
```
- paddlehub >= 2.0.0
- ### 2. Installation
- ```shell
$ hub install diffsinger
```
- If you have trouble installing, please refer to: [Windows Quickstart](../../../../docs/docs_ch/get_start/windows_quickstart.md)
| [Linux Quickstart](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS Quickstart](../../../../docs/docs_ch/get_start/mac_quickstart.md)
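- Optionally, you can check which execution providers your onnxruntime build exposes (the module also prints the providers it ends up using when it loads); a minimal check:
```python
import onnxruntime as rt

# With onnxruntime-gpu this list typically includes 'CUDAExecutionProvider';
# the CPU-only wheel reports ['CPUExecutionProvider'].
print(rt.get_available_providers())
```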
## III. Module API Prediction
- ### 1. Command line Prediction
```shell
$ hub run diffsinger \
--input_type "word" \
--text "小酒窝长睫毛AP是你最美的记号" \
--notes "C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4" \
--notes_duration "0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340" \
--sample_num 1 \
--save_dir "outputs"
$ hub run diffsinger \
--input_type "phoneme" \
--text "小酒窝长睫毛AP是你最美的记号" \
--ph_seq "x iao j iu w o ch ang ang j ie ie m ao AP sh i n i z ui m ei d e j i h ao" \
--note_seq "C#4/Db4 C#4/Db4 F#4/Gb4 F#4/Gb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 F#4/Gb4 F#4/Gb4 F#4/Gb4 C#4/Db4 C#4/Db4 C#4/Db4 rest C#4/Db4 C#4/Db4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 F4 F4 C#4/Db4 C#4/Db4" \
--note_dur_seq "0.407140 0.407140 0.376190 0.376190 0.242180 0.242180 0.509550 0.509550 0.183420 0.315400 0.315400 0.235020 0.361660 0.361660 0.223070 0.377270 0.377270 0.340550 0.340550 0.299620 0.299620 0.344510 0.344510 0.283770 0.283770 0.323390 0.323390 0.360340 0.360340" \
--is_slur_seq "0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0" \
--sample_num 1 \
--save_dir "outputs"
```
- ### 2. Prediction Code Example
```python
import paddlehub as hub
module = hub.Module(name="diffsinger")
results = module.singing_voice_synthesis(
inputs={
'text': '小酒窝长睫毛AP是你最美的记号',
'notes': 'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4',
'notes_duration': '0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340',
'input_type': 'word'
},
sample_num=1,
save_audio=True,
save_dir='outputs'
)
```
- ### 3. API
```python
def singing_voice_synthesis(
inputs: Dict[str, str],
sample_num: int = 1,
save_audio: bool = True,
save_dir: str = 'outputs'
) -> Dict[str, Union[List[List[int]], int]]:
```
- Singing voice synthesis API.
- **Parameters**
* inputs (Dict\[str, str\]): input data; the following two formats are supported:
```python
{
'text': '小酒窝长睫毛AP是你最美的记号',
'notes': 'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4',
'notes_duration': '0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340',
'input_type': 'word'
}
{
'text': '小酒窝长睫毛AP是你最美的记号',
'ph_seq': 'x iao j iu w o ch ang ang j ie ie m ao AP sh i n i z ui m ei d e j i h ao',
'note_seq': 'C#4/Db4 C#4/Db4 F#4/Gb4 F#4/Gb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 F#4/Gb4 F#4/Gb4 F#4/Gb4 C#4/Db4 C#4/Db4 C#4/Db4 rest C#4/Db4 C#4/Db4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 F4 F4 C#4/Db4 C#4/Db4',
'note_dur_seq': '0.407140 0.407140 0.376190 0.376190 0.242180 0.242180 0.509550 0.509550 0.183420 0.315400 0.315400 0.235020 0.361660 0.361660 0.223070 0.377270 0.377270 0.340550 0.340550 0.299620 0.299620 0.344510 0.344510 0.283770 0.283770 0.323390 0.323390 0.360340 0.360340',
'is_slur_seq': '0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0',
'input_type': 'phoneme'
}
```
* sample_num (int): number of audio samples to generate;
* save_audio (bool): whether to save the audio files;
* save\_dir (str): directory in which the results are saved.
- **Return**
* res (Dict\[str, Union\[List\[List\[int\]\], int\]\]): singing voice synthesis result, a dict containing:
* wavs: the synthesized audio data
* sample_rate: the audio sample rate
## IV. Server Deployment
- PaddleHub Serving can deploy an online singing voice synthesis service.
- ### Step 1: Start PaddleHub Serving
- Run the startup command:
```shell
$ hub serving start -m diffsinger
```
- This sets up the singing voice synthesis serving API; the default port is 8866.
- ### Step 2: Send a prediction request
- With the server configured, the following few lines of code send a prediction request and retrieve the result:
```python
import requests
import json
data = {
'inputs': {
'text': '小酒窝长睫毛AP是你最美的记号',
'notes': 'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4',
'notes_duration': '0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340',
'input_type': 'word'
},
'save_audio': False,
}
headers = {"Content-type": "application/json"}
url = "http://127.0.0.1:8866/predict/diffsinger"
r = requests.post(url=url, headers=headers, data=json.dumps(data))
results = r.json()['results']
```
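- Since `save_audio` is set to `False` in the request above, the response only carries the raw samples and the sample rate; a minimal sketch for writing the first returned sample to disk (using `soundfile`, which is listed in the module requirements — adjust as needed):
```python
import numpy as np
import soundfile as sf

wav = np.asarray(results['wavs'][0], dtype=np.float32)
sf.write('serving_output.wav', wav, results['sample_rate'])
```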
## V. References
* Paper: [DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism](https://arxiv.org/abs/2105.02446)
* Official implementation: [MoonInTheRiver/DiffSinger](https://github.com/MoonInTheRiver/DiffSinger)
## VI. Release Note
* 1.0.0
First release
```shell
$ hub install diffsinger==1.0.0
```
# task
binary_data_dir: ''
work_dir: '' # experiment directory.
infer: false # infer
seed: 1234
debug: false
save_codes:
- configs
- modules
- tasks
- utils
- usr
#############
# dataset
#############
ds_workers: 1
test_num: 100
valid_num: 100
endless_ds: false
sort_by_len: true
#########
# train and eval
#########
load_ckpt: ''
save_ckpt: true
save_best: false
num_ckpt_keep: 3
clip_grad_norm: 0
accumulate_grad_batches: 1
log_interval: 100
num_sanity_val_steps: 5 # steps of validation at the beginning
check_val_every_n_epoch: 10
val_check_interval: 2000
max_epochs: 1000
max_updates: 160000
max_tokens: 31250
max_sentences: 100000
max_eval_tokens: -1
max_eval_sentences: -1
test_input_dir: ''
base_config:
- configs/tts/base.yaml
- configs/tts/base_zh.yaml
datasets: []
test_prefixes: []
test_num: 0
valid_num: 0
pre_align_cls: data_gen.singing.pre_align.SingingPreAlign
binarizer_cls: data_gen.singing.binarize.SingingBinarizer
pre_align_args:
use_tone: false # for ZH
forced_align: mfa
use_sox: true
hop_size: 128 # Hop size.
fft_size: 512 # FFT size.
win_size: 512 # FFT size.
max_frames: 8000
fmin: 50 # Minimum freq in mel basis calculation.
fmax: 11025 # Maximum frequency in mel basis calculation.
pitch_type: frame
hidden_size: 256
mel_loss: "ssim:0.5|l1:0.5"
lambda_f0: 0.0
lambda_uv: 0.0
lambda_energy: 0.0
lambda_ph_dur: 0.0
lambda_sent_dur: 0.0
lambda_word_dur: 0.0
predictor_grad: 0.0
use_spk_embed: true
use_spk_id: false
max_tokens: 20000
max_updates: 400000
num_spk: 100
save_f0: true
use_gt_dur: true
use_gt_f0: true
base_config:
- configs/tts/fs2.yaml
- configs/singing/base.yaml
# task
base_config: configs/config_base.yaml
task_cls: ''
#############
# dataset
#############
raw_data_dir: ''
processed_data_dir: ''
binary_data_dir: ''
dict_dir: ''
pre_align_cls: ''
binarizer_cls: data_gen.tts.base_binarizer.BaseBinarizer
pre_align_args:
use_tone: true # for ZH
forced_align: mfa
use_sox: false
txt_processor: en
allow_no_txt: false
denoise: false
binarization_args:
shuffle: false
with_txt: true
with_wav: false
with_align: true
with_spk_embed: true
with_f0: true
with_f0cwt: true
loud_norm: false
endless_ds: true
reset_phone_dict: true
test_num: 100
valid_num: 100
max_frames: 1550
max_input_tokens: 1550
audio_num_mel_bins: 80
audio_sample_rate: 22050
hop_size: 256 # For 22050Hz, 275 ~= 12.5 ms (0.0125 * sample_rate)
win_size: 1024 # For 22050Hz, 1100 ~= 50 ms (If None, win_size: fft_size) (0.05 * sample_rate)
fmin: 80 # Set this to 55 if your speaker is male; if female, 95 should help remove noise. (Test depending on dataset. Pitch info: male ~ [65, 260], female ~ [100, 525])
fmax: 7600 # To be increased/reduced depending on data.
fft_size: 1024 # Extra window size is filled with 0 paddings to match this parameter
min_level_db: -100
num_spk: 1
mel_vmin: -6
mel_vmax: 1.5
ds_workers: 4
#########
# model
#########
dropout: 0.1
enc_layers: 4
dec_layers: 4
hidden_size: 384
num_heads: 2
prenet_dropout: 0.5
prenet_hidden_size: 256
stop_token_weight: 5.0
enc_ffn_kernel_size: 9
dec_ffn_kernel_size: 9
ffn_act: gelu
ffn_padding: 'SAME'
###########
# optimization
###########
lr: 2.0
warmup_updates: 8000
optimizer_adam_beta1: 0.9
optimizer_adam_beta2: 0.98
weight_decay: 0
clip_grad_norm: 1
###########
# train and eval
###########
max_tokens: 30000
max_sentences: 100000
max_eval_sentences: 1
max_eval_tokens: 60000
train_set_name: 'train'
valid_set_name: 'valid'
test_set_name: 'test'
vocoder: pwg
vocoder_ckpt: ''
profile_infer: false
out_wav_norm: false
save_gt: false
save_f0: false
gen_dir_name: ''
use_denoise: false
pre_align_args:
txt_processor: zh_g2pM
binarizer_cls: data_gen.tts.binarizer_zh.ZhBinarizer
base_config: configs/tts/base.yaml
task_cls: tasks.tts.fs2.FastSpeech2Task
# model
hidden_size: 256
dropout: 0.1
encoder_type: fft # fft|tacotron|tacotron2|conformer
encoder_K: 8 # for tacotron encoder
decoder_type: fft # fft|rnn|conv|conformer
use_pos_embed: true
# duration
predictor_hidden: -1
predictor_kernel: 5
predictor_layers: 2
dur_predictor_kernel: 3
dur_predictor_layers: 2
predictor_dropout: 0.5
# pitch and energy
use_pitch_embed: true
pitch_type: ph # frame|ph|cwt
use_uv: true
cwt_hidden_size: 128
cwt_layers: 2
cwt_loss: l1
cwt_add_f0_loss: false
cwt_std_scale: 0.8
pitch_ar: false
#pitch_embed_type: 0q
pitch_loss: 'l1' # l1|l2|ssim
pitch_norm: log
use_energy_embed: false
# reference encoder and speaker embedding
use_spk_id: false
use_split_spk_id: false
use_spk_embed: false
use_var_enc: false
lambda_commit: 0.25
ref_norm_layer: bn
pitch_enc_hidden_stride_kernel:
- 0,2,5 # conv_hidden_size, conv_stride, conv_kernel_size. conv_hidden_size=0: use hidden_size
- 0,2,5
- 0,2,5
dur_enc_hidden_stride_kernel:
- 0,2,3 # conv_hidden_size, conv_stride, conv_kernel_size. conv_hidden_size=0: use hidden_size
- 0,2,3
- 0,1,3
# mel
mel_loss: l1:0.5|ssim:0.5 # l1|l2|gdl|ssim or l1:0.5|ssim:0.5
# loss lambda
lambda_f0: 1.0
lambda_uv: 1.0
lambda_energy: 0.1
lambda_ph_dur: 1.0
lambda_sent_dur: 1.0
lambda_word_dur: 1.0
predictor_grad: 0.1
# train and eval
pretrain_fs_ckpt: ''
warmup_updates: 2000
max_tokens: 32000
max_sentences: 100000
max_eval_sentences: 1
max_updates: 120000
num_valid_plots: 5
num_test_samples: 0
test_ids: []
use_gt_dur: false
use_gt_f0: false
# exp
dur_loss: mse # huber|mol
norm_type: gn
base_config: configs/tts/pwg.yaml
task_cls: tasks.vocoder.hifigan.HifiGanTask
resblock: "1"
adam_b1: 0.8
adam_b2: 0.99
upsample_rates: [ 8,8,2,2 ]
upsample_kernel_sizes: [ 16,16,4,4 ]
upsample_initial_channel: 128
resblock_kernel_sizes: [ 3,7,11 ]
resblock_dilation_sizes: [ [ 1,3,5 ], [ 1,3,5 ], [ 1,3,5 ] ]
lambda_mel: 45.0
max_samples: 8192
max_sentences: 16
generator_params:
lr: 0.0002 # Generator's learning rate.
aux_context_window: 0 # Context window size for auxiliary feature.
discriminator_optimizer_params:
lr: 0.0002 # Discriminator's learning rate.
raw_data_dir: 'data/raw/LJSpeech-1.1'
processed_data_dir: 'data/processed/ljspeech'
binary_data_dir: 'data/binary/ljspeech_wav'
raw_data_dir: 'data/raw/LJSpeech-1.1'
processed_data_dir: 'data/processed/ljspeech'
binary_data_dir: 'data/binary/ljspeech'
pre_align_cls: data_gen.tts.lj.pre_align.LJPreAlign
pitch_type: cwt
mel_loss: l1
num_test_samples: 20
test_ids: [ 68, 70, 74, 87, 110, 172, 190, 215, 231, 294,
316, 324, 402, 422, 485, 500, 505, 508, 509, 519 ]
use_energy_embed: false
test_num: 523
valid_num: 348
base_config:
- configs/tts/fs2.yaml
- configs/tts/lj/base_text2mel.yaml
base_config:
- configs/tts/hifigan.yaml
- configs/tts/lj/base_mel2wav.yaml
base_config:
- configs/tts/pwg.yaml
- configs/tts/lj/base_mel2wav.yaml
base_config: configs/tts/base.yaml
task_cls: tasks.vocoder.pwg.PwgTask
binarization_args:
with_wav: true
with_spk_embed: false
with_align: false
test_input_dir: ''
###########
# train and eval
###########
max_samples: 25600
max_sentences: 5
max_eval_sentences: 1
max_updates: 1000000
val_check_interval: 2000
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
sampling_rate: 22050 # Sampling rate.
fft_size: 1024 # FFT size.
hop_size: 256 # Hop size.
win_length: null # Window length.
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
num_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation.
fmax: 7600 # Maximum frequency in mel basis calculation.
format: "hdf5" # Feature file format. "npy" or "hdf5" is supported.
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_size: 3 # Kernel size of dilated convolution.
layers: 30 # Number of residual block layers.
stacks: 3 # Number of stacks i.e., dilation cycles.
residual_channels: 64 # Number of channels in residual conv.
gate_channels: 128 # Number of channels in gated conv.
skip_channels: 64 # Number of channels in skip conv.
aux_channels: 80 # Number of channels for auxiliary feature conv.
# Must be the same as num_mels.
aux_context_window: 2 # Context window size for auxiliary feature.
# If set to 2, previous 2 and future 2 frames will be considered.
dropout: 0.0 # Dropout rate. 0.0 means no dropout applied.
use_weight_norm: true # Whether to use weight norm.
# If set to true, it will be applied to all of the conv layers.
upsample_net: "ConvInUpsampleNetwork" # Upsampling network architecture.
upsample_params: # Upsampling network parameters.
upsample_scales: [4, 4, 4, 4] # Upsampling scales. Product of these must be the same as hop size.
use_pitch_embed: false
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_size: 3 # Kernel size of conv layers.
layers: 10 # Number of conv layers.
conv_channels: 64 # Number of conv channels.
bias: true # Whether to use bias parameter in conv.
use_weight_norm: true # Whether to use weight norm.
# If set to true, it will be applied to all of the conv layers.
nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv.
nonlinear_activation_params: # Nonlinear function parameters
negative_slope: 0.2 # Alpha in LeakyReLU.
###########################################################
# STFT LOSS SETTING #
###########################################################
stft_loss_params:
fft_sizes: [1024, 2048, 512] # List of FFT size for STFT-based loss.
hop_sizes: [120, 240, 50] # List of hop size for STFT-based loss
win_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
window: "hann_window" # Window function for STFT-based loss
use_mel_loss: false
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_adv: 4.0 # Loss balancing coefficient.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
lr: 0.0001 # Generator's learning rate.
eps: 1.0e-6 # Generator's epsilon.
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
step_size: 200000 # Generator's scheduler step size.
gamma: 0.5 # Generator's scheduler gamma.
# At each step size, lr will be multiplied by this parameter.
generator_grad_norm: 10 # Generator's gradient norm.
discriminator_optimizer_params:
lr: 0.00005 # Discriminator's learning rate.
eps: 1.0e-6 # Discriminator's epsilon.
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
step_size: 200000 # Discriminator's scheduler step size.
gamma: 0.5 # Discriminator's scheduler gamma.
# At each step size, lr will be multiplied by this parameter.
discriminator_grad_norm: 1 # Discriminator's gradient norm.
disc_start_steps: 40000 # Number of steps to start to train discriminator.
import os
from collections import deque
import librosa
import numpy as np
import onnxruntime as rt
from pypinyin import lazy_pinyin
from tqdm import tqdm
from .inference.svs.opencpop.map import cpop_pinyin2ph_func
from .utils.hparams import hparams
from .utils.text_encoder import TokenTextEncoder
class Infer:
def __init__(self, root='.', providers=None):
model_dir = os.path.join(root, 'model')
if providers is None:
providers = rt.get_available_providers()
print('Using these as onnxruntime providers:', providers)
phone_list = [
"AP", "SP", "a", "ai", "an", "ang", "ao", "b", "c", "ch", "d", "e", "ei", "en", "eng", "er", "f", "g", "h",
"i", "ia", "ian", "iang", "iao", "ie", "in", "ing", "iong", "iu", "j", "k", "l", "m", "n", "o", "ong", "ou",
"p", "q", "r", "s", "sh", "t", "u", "ua", "uai", "uan", "uang", "ui", "un", "uo", "v", "van", "ve", "vn",
"w", "x", "y", "z", "zh"
]
self.ph_encoder = TokenTextEncoder(None, vocab_list=phone_list, replace_oov=',')
self.pinyin2phs = cpop_pinyin2ph_func(path=os.path.join(root, 'inference/svs/opencpop/cpop_pinyin2ph.txt'))
self.spk_map = {'opencpop': 0}
options = rt.SessionOptions()
for provider in providers:
if 'dml' in provider.lower():
options.enable_mem_pattern = False
options.execution_mode = rt.ExecutionMode.ORT_SEQUENTIAL
fs2_path = os.path.join(model_dir, 'fs2.onnx')
q_sample_path = os.path.join(model_dir, 'q_sample.onnx')
p_sample_path = os.path.join(model_dir, 'p_sample.onnx')
pe_path = os.path.join(model_dir, 'pe.onnx')
vocoder_path = os.path.join(model_dir, 'vocoder.onnx')
self.fs2 = rt.InferenceSession(fs2_path, options, providers=providers)
self.q_sample = rt.InferenceSession(q_sample_path, options, providers=providers)
self.p_sample = rt.InferenceSession(p_sample_path, options, providers=providers)
self.pe = rt.InferenceSession(pe_path, options, providers=providers)
self.vocoder = rt.InferenceSession(vocoder_path, options, providers=providers)
self.K_step = hparams['K_step']
self.spec_min = np.asarray(hparams['spec_min'], np.float32)[None, None, :hparams['keep_bins']]
self.spec_max = np.asarray(hparams['spec_max'], np.float32)[None, None, :hparams['keep_bins']]
self.mel_bins = hparams['audio_num_mel_bins']
self.use_pe = hparams.get('pe_enable') is not None and hparams['pe_enable']
def model(self, txt_tokens, **kwargs):
fs_input_names = [node.name for node in self.fs2.get_inputs()]
inputs = {'txt_tokens': txt_tokens}
inputs.update({k: v for k, v in kwargs.items() if isinstance(v, np.ndarray) and k in fs_input_names})
io_binding = self.fs2.io_binding()
for k, v in inputs.items():
io_binding.bind_cpu_input(k, v)
io_binding.bind_output('decoder_inp')
io_binding.bind_output('mel_out')
if not self.use_pe:
io_binding.bind_output('f0_denorm')
self.fs2.run_with_iobinding(io_binding)
decoder_inp, mel_out = io_binding.get_outputs()[:2]
self.device_name = mel_out.device_name()
ret = {'decoder_inp': decoder_inp, 'mel_out': mel_out}
if not self.use_pe:
ret.update({'f0_denorm': io_binding.get_outputs()[-1]})
cond = decoder_inp.numpy().transpose([0, 2, 1])
ret['fs2_mel'] = ret['mel_out']
# Shallow diffusion: normalize the FastSpeech2 mel prediction, diffuse it forward to step K_step
# with q_sample, then run the reverse (denoising) process from there instead of from pure noise.
fs2_mels = mel_out.numpy()
t = self.K_step
fs2_mels = self.norm_spec(fs2_mels)
fs2_mels = fs2_mels.transpose([0, 2, 1])[:, None, :, :]
io_binding = self.q_sample.io_binding()
io_binding.bind_cpu_input('x_start', fs2_mels)
io_binding.bind_cpu_input('noise', np.random.randn(*fs2_mels.shape).astype(fs2_mels.dtype))
io_binding.bind_cpu_input('t', np.asarray([t - 1], dtype=np.int64))
io_binding.bind_output('x_next')
self.q_sample.run_with_iobinding(io_binding)
x = io_binding.get_outputs()[0].numpy()
if hparams.get('gaussian_start') is not None and hparams['gaussian_start']:
print('===> gaussian start.')
shape = (cond.shape[0], 1, self.mel_bins, cond.shape[2])
x = np.random.randn(*shape).astype(fs2_mels.dtype)
cond = rt.OrtValue.ortvalue_from_numpy(cond, mel_out.device_name(), 0)
x = rt.OrtValue.ortvalue_from_numpy(x, mel_out.device_name(), 0)
if hparams.get('pndm_speedup'):
self.noise_list = deque(maxlen=4)
iteration_interval = hparams['pndm_speedup']
interval = rt.OrtValue.ortvalue_from_numpy(np.asarray([iteration_interval], np.int64),
mel_out.device_name(), 0)
for i in tqdm(reversed(range(0, t, iteration_interval)),
desc='sample time step',
total=t // iteration_interval):
io_binding = self.p_sample_plms.io_binding()
io_binding.bind_ortvalue_input('x', x)
io_binding.bind_cpu_input('noise', np.random.randn(*x.shape).astype(x.dtype))
io_binding.bind_ortvalue_input('cond', cond)
io_binding.bind_cpu_input('t', np.asarray([i], dtype=np.int64)) # torch i-1 but here i
io_binding.bind_ortvalue_input('interval', interval)
io_binding.bind_output('x_next')
self.p_sample_plms.run_with_iobinding(io_binding)
x = io_binding.get_outputs()[0]
else:
for i in tqdm(reversed(range(0, t)), desc='sample time step', total=t):
io_binding = self.p_sample.io_binding()
io_binding.bind_ortvalue_input('x', x)
io_binding.bind_cpu_input('noise', np.random.randn(*x.shape()).astype(np.float32))
io_binding.bind_ortvalue_input('cond', cond)
io_binding.bind_cpu_input('t', np.asarray([i], dtype=np.int64)) # torch i-1 but here i
io_binding.bind_output('x_next')
self.p_sample.run_with_iobinding(io_binding)
x = io_binding.get_outputs()[0]
x = x.numpy()[:, 0].transpose([0, 2, 1])
mel2ph = kwargs.get('mel2ph', None)
if mel2ph is not None: # for singing
ret['mel_out'] = self.denorm_spec(x) * ((mel2ph > 0).astype(np.float32)[:, :, None])
else:
ret['mel_out'] = self.denorm_spec(x)
return ret
def norm_spec(self, x):
# Normalize a mel-spectrogram to [-1, 1] using the dataset statistics spec_min / spec_max.
return (x - self.spec_min) / (self.spec_max - self.spec_min) * 2 - 1
def denorm_spec(self, x):
# Inverse of norm_spec: map values from [-1, 1] back to the original mel range.
return (x + 1) / 2 * (self.spec_max - self.spec_min) + self.spec_min
def forward_model(self, inp):
sample = self.input_to_batch(inp)
txt_tokens = sample['txt_tokens'] # [B, T_t]
spk_id = sample.get('spk_ids')
output = self.model(txt_tokens,
spk_id=spk_id,
ref_mels=None,
infer=True,
pitch_midi=sample['pitch_midi'],
midi_dur=sample['midi_dur'],
is_slur=sample['is_slur'])
mel_out = output['mel_out'] # [B, T,80]
mel_out = rt.OrtValue.ortvalue_from_numpy(mel_out, self.device_name, 0)
if hparams.get('pe_enable') is not None and hparams['pe_enable']:
# pe predict from Pred mel
io_binding = self.pe.io_binding()
io_binding.bind_ortvalue_input('mel_input', mel_out)
io_binding.bind_output('f0_denorm_pred')
self.pe.run_with_iobinding(io_binding)
f0_pred = io_binding.get_outputs()[0]
else:
f0_pred = output['f0_denorm']
wav_out = self.run_vocoder(mel_out, f0=f0_pred.numpy())
return wav_out[0]
def run_vocoder(self, c, **kwargs):
# c = c.transpose([0, 2, 1]) # [B, 80, T]
f0 = kwargs.get('f0') # [B, T]
if f0 is not None and hparams.get('use_nsf'):
y = self.vocoder.run(['wav_out'], {
'mel_out': c,
'f0': f0,
})[0] # .reshape([-1])
else:
y = self.vocoder.run(['wav_out'], {
'mel_out': c,
})[0] # .reshape([-1])
# [T]
return y # [None]
def preprocess_word_level_input(self, inp):
# Pypinyin can't resolve polyphonic characters, so a few common cases are patched by hand below.
text_raw = inp['text'].replace('最长', '最常').replace('长睫毛', '常睫毛') \
.replace('那么长', '那么常').replace('多长', '多常') \
.replace('很长', '很常') # We hope someone could provide a better g2p module for us by opening pull requests.
# lyric
pinyins = lazy_pinyin(text_raw, strict=False)
ph_per_word_lst = [self.pinyin2phs[pinyin.strip()] for pinyin in pinyins if pinyin.strip() in self.pinyin2phs]
# Note
note_per_word_lst = [x.strip() for x in inp['notes'].split('|') if x.strip() != '']
mididur_per_word_lst = [x.strip() for x in inp['notes_duration'].split('|') if x.strip() != '']
if len(note_per_word_lst) == len(ph_per_word_lst) == len(mididur_per_word_lst):
print('Pass word-notes check.')
else:
print('The number of words doesn\'t match the number of note windows. ',
'You should split the note(s) for each word with the | mark.')
print(ph_per_word_lst, note_per_word_lst, mididur_per_word_lst)
print(len(ph_per_word_lst), len(note_per_word_lst), len(mididur_per_word_lst))
return None
note_lst = []
ph_lst = []
midi_dur_lst = []
is_slur = []
for idx, ph_per_word in enumerate(ph_per_word_lst):
# for phs in one word:
# single ph like ['ai'] or multiple phs like ['n', 'i']
ph_in_this_word = ph_per_word.split()
# for notes in one word:
# single note like ['D4'] or multiple notes like ['D4', 'E4'] which means a 'slur' here.
note_in_this_word = note_per_word_lst[idx].split()
midi_dur_in_this_word = mididur_per_word_lst[idx].split()
# process for the model input
# Step 1.
# Deal with note of 'not slur' case or the first note of 'slur' case
# j ie
# F#4/Gb4 F#4/Gb4
# 0 0
for ph in ph_in_this_word:
ph_lst.append(ph)
note_lst.append(note_in_this_word[0])
midi_dur_lst.append(midi_dur_in_this_word[0])
is_slur.append(0)
# step 2.
# Deal with the 2nd, 3rd... notes of 'slur' case
# j ie ie
# F#4/Gb4 F#4/Gb4 C#4/Db4
# 0 0 1
# When is_slur is True, repeat the yunmu (the syllable final) to match the 2nd, 3rd... notes.
if len(note_in_this_word) > 1:
for idx in range(1, len(note_in_this_word)):
ph_lst.append(ph_in_this_word[-1])
note_lst.append(note_in_this_word[idx])
midi_dur_lst.append(midi_dur_in_this_word[idx])
is_slur.append(1)
ph_seq = ' '.join(ph_lst)
if len(ph_lst) == len(note_lst) == len(midi_dur_lst):
print(len(ph_lst), len(note_lst), len(midi_dur_lst))
print('Pass word-notes check.')
else:
print('The number of words doesn\'t match the number of note windows. ',
'You should split the note(s) for each word with the | mark.')
return None
return ph_seq, note_lst, midi_dur_lst, is_slur
def preprocess_phoneme_level_input(self, inp):
ph_seq = inp['ph_seq']
note_lst = inp['note_seq'].split()
midi_dur_lst = inp['note_dur_seq'].split()
is_slur = [float(x) for x in inp['is_slur_seq'].split()]
print(len(note_lst), len(ph_seq.split()), len(midi_dur_lst))
if len(note_lst) == len(ph_seq.split()) == len(midi_dur_lst):
print('Pass word-notes check.')
else:
print('The number of words doesn\'t match the number of note windows. ',
'You should split the note(s) for each word with the | mark.')
return None
return ph_seq, note_lst, midi_dur_lst, is_slur
def preprocess_input(self, inp, input_type='word'):
"""
:param inp: {'text': str, 'item_name': (str, optional), 'spk_name': (str, optional)}
:return: a dict of model inputs (ph, ph_token, pitch_midi, midi_dur, is_slur, ...), or None if preprocessing fails.
"""
item_name = inp.get('item_name', '<ITEM_NAME>')
spk_name = inp.get('spk_name', 'opencpop')
# single spk
spk_id = self.spk_map[spk_name]
# get ph seq, note lst, midi dur lst, is slur lst.
if input_type == 'word':
ret = self.preprocess_word_level_input(inp)
# like transcriptions.txt in Opencpop dataset.
elif input_type == 'phoneme':
ret = self.preprocess_phoneme_level_input(inp)
else:
print('Invalid input type.')
return None
if ret:
ph_seq, note_lst, midi_dur_lst, is_slur = ret
else:
print('==========> Word-level or phoneme-level input preprocessing failed.')
return None
# convert note lst to midi id; convert note dur lst to midi duration
try:
midis = [librosa.note_to_midi(x.split("/")[0]) if x != 'rest' else 0 for x in note_lst]
midi_dur_lst = [float(x) for x in midi_dur_lst]
except Exception as e:
print(e)
print('Invalid note or note duration input.')
return None
ph_token = self.ph_encoder.encode(ph_seq)
item = {
'item_name': item_name,
'text': inp['text'],
'ph': ph_seq,
'spk_id': spk_id,
'ph_token': ph_token,
'pitch_midi': np.asarray(midis),
'midi_dur': np.asarray(midi_dur_lst),
'is_slur': np.asarray(is_slur),
}
item['ph_len'] = len(item['ph_token'])
return item
def input_to_batch(self, item):
item_names = [item['item_name']]
text = [item['text']]
ph = [item['ph']]
txt_tokens = np.int64(item['ph_token'])[None, :]
txt_lengths = np.int64([txt_tokens.shape[1]])
spk_ids = np.asarray(item['spk_id'], np.int64)[None]
pitch_midi = np.int64(item['pitch_midi'])[None, :hparams['max_frames']]
midi_dur = np.float32(item['midi_dur'])[None, :hparams['max_frames']]
is_slur = np.int64(item['is_slur'])[None, :hparams['max_frames']]
batch = {
'item_name': item_names,
'text': text,
'ph': ph,
'txt_tokens': txt_tokens,
'txt_lengths': txt_lengths,
'spk_ids': spk_ids,
'pitch_midi': pitch_midi,
'midi_dur': midi_dur,
'is_slur': is_slur
}
return batch
def infer_once(self, inp):
inp = self.preprocess_input(inp, input_type=inp['input_type'] if inp.get('input_type') else 'word')
output = self.forward_model(inp)
return output
| a | a |
| ai | ai |
| an | an |
| ang | ang |
| ao | ao |
| ba | b a |
| bai | b ai |
| ban | b an |
| bang | b ang |
| bao | b ao |
| bei | b ei |
| ben | b en |
| beng | b eng |
| bi | b i |
| bian | b ian |
| biao | b iao |
| bie | b ie |
| bin | b in |
| bing | b ing |
| bo | b o |
| bu | b u |
| ca | c a |
| cai | c ai |
| can | c an |
| cang | c ang |
| cao | c ao |
| ce | c e |
| cei | c ei |
| cen | c en |
| ceng | c eng |
| cha | ch a |
| chai | ch ai |
| chan | ch an |
| chang | ch ang |
| chao | ch ao |
| che | ch e |
| chen | ch en |
| cheng | ch eng |
| chi | ch i |
| chong | ch ong |
| chou | ch ou |
| chu | ch u |
| chua | ch ua |
| chuai | ch uai |
| chuan | ch uan |
| chuang | ch uang |
| chui | ch ui |
| chun | ch un |
| chuo | ch uo |
| ci | c i |
| cong | c ong |
| cou | c ou |
| cu | c u |
| cuan | c uan |
| cui | c ui |
| cun | c un |
| cuo | c uo |
| da | d a |
| dai | d ai |
| dan | d an |
| dang | d ang |
| dao | d ao |
| de | d e |
| dei | d ei |
| den | d en |
| deng | d eng |
| di | d i |
| dia | d ia |
| dian | d ian |
| diao | d iao |
| die | d ie |
| ding | d ing |
| diu | d iu |
| dong | d ong |
| dou | d ou |
| du | d u |
| duan | d uan |
| dui | d ui |
| dun | d un |
| duo | d uo |
| e | e |
| ei | ei |
| en | en |
| eng | eng |
| er | er |
| fa | f a |
| fan | f an |
| fang | f ang |
| fei | f ei |
| fen | f en |
| feng | f eng |
| fo | f o |
| fou | f ou |
| fu | f u |
| ga | g a |
| gai | g ai |
| gan | g an |
| gang | g ang |
| gao | g ao |
| ge | g e |
| gei | g ei |
| gen | g en |
| geng | g eng |
| gong | g ong |
| gou | g ou |
| gu | g u |
| gua | g ua |
| guai | g uai |
| guan | g uan |
| guang | g uang |
| gui | g ui |
| gun | g un |
| guo | g uo |
| ha | h a |
| hai | h ai |
| han | h an |
| hang | h ang |
| hao | h ao |
| he | h e |
| hei | h ei |
| hen | h en |
| heng | h eng |
| hm | h m |
| hng | h ng |
| hong | h ong |
| hou | h ou |
| hu | h u |
| hua | h ua |
| huai | h uai |
| huan | h uan |
| huang | h uang |
| hui | h ui |
| hun | h un |
| huo | h uo |
| ji | j i |
| jia | j ia |
| jian | j ian |
| jiang | j iang |
| jiao | j iao |
| jie | j ie |
| jin | j in |
| jing | j ing |
| jiong | j iong |
| jiu | j iu |
| ju | j v |
| juan | j van |
| jue | j ve |
| jun | j vn |
| ka | k a |
| kai | k ai |
| kan | k an |
| kang | k ang |
| kao | k ao |
| ke | k e |
| kei | k ei |
| ken | k en |
| keng | k eng |
| kong | k ong |
| kou | k ou |
| ku | k u |
| kua | k ua |
| kuai | k uai |
| kuan | k uan |
| kuang | k uang |
| kui | k ui |
| kun | k un |
| kuo | k uo |
| la | l a |
| lai | l ai |
| lan | l an |
| lang | l ang |
| lao | l ao |
| le | l e |
| lei | l ei |
| leng | l eng |
| li | l i |
| lia | l ia |
| lian | l ian |
| liang | l iang |
| liao | l iao |
| lie | l ie |
| lin | l in |
| ling | l ing |
| liu | l iu |
| lo | l o |
| long | l ong |
| lou | l ou |
| lu | l u |
| luan | l uan |
| lun | l un |
| luo | l uo |
| lv | l v |
| lve | l ve |
| m | m |
| ma | m a |
| mai | m ai |
| man | m an |
| mang | m ang |
| mao | m ao |
| me | m e |
| mei | m ei |
| men | m en |
| meng | m eng |
| mi | m i |
| mian | m ian |
| miao | m iao |
| mie | m ie |
| min | m in |
| ming | m ing |
| miu | m iu |
| mo | m o |
| mou | m ou |
| mu | m u |
| n | n |
| na | n a |
| nai | n ai |
| nan | n an |
| nang | n ang |
| nao | n ao |
| ne | n e |
| nei | n ei |
| nen | n en |
| neng | n eng |
| ng | n g |
| ni | n i |
| nian | n ian |
| niang | n iang |
| niao | n iao |
| nie | n ie |
| nin | n in |
| ning | n ing |
| niu | n iu |
| nong | n ong |
| nou | n ou |
| nu | n u |
| nuan | n uan |
| nun | n un |
| nuo | n uo |
| nv | n v |
| nve | n ve |
| o | o |
| ou | ou |
| pa | p a |
| pai | p ai |
| pan | p an |
| pang | p ang |
| pao | p ao |
| pei | p ei |
| pen | p en |
| peng | p eng |
| pi | p i |
| pian | p ian |
| piao | p iao |
| pie | p ie |
| pin | p in |
| ping | p ing |
| po | p o |
| pou | p ou |
| pu | p u |
| qi | q i |
| qia | q ia |
| qian | q ian |
| qiang | q iang |
| qiao | q iao |
| qie | q ie |
| qin | q in |
| qing | q ing |
| qiong | q iong |
| qiu | q iu |
| qu | q v |
| quan | q van |
| que | q ve |
| qun | q vn |
| ran | r an |
| rang | r ang |
| rao | r ao |
| re | r e |
| ren | r en |
| reng | r eng |
| ri | r i |
| rong | r ong |
| rou | r ou |
| ru | r u |
| rua | r ua |
| ruan | r uan |
| rui | r ui |
| run | r un |
| ruo | r uo |
| sa | s a |
| sai | s ai |
| san | s an |
| sang | s ang |
| sao | s ao |
| se | s e |
| sen | s en |
| seng | s eng |
| sha | sh a |
| shai | sh ai |
| shan | sh an |
| shang | sh ang |
| shao | sh ao |
| she | sh e |
| shei | sh ei |
| shen | sh en |
| sheng | sh eng |
| shi | sh i |
| shou | sh ou |
| shu | sh u |
| shua | sh ua |
| shuai | sh uai |
| shuan | sh uan |
| shuang | sh uang |
| shui | sh ui |
| shun | sh un |
| shuo | sh uo |
| si | s i |
| song | s ong |
| sou | s ou |
| su | s u |
| suan | s uan |
| sui | s ui |
| sun | s un |
| suo | s uo |
| ta | t a |
| tai | t ai |
| tan | t an |
| tang | t ang |
| tao | t ao |
| te | t e |
| tei | t ei |
| teng | t eng |
| ti | t i |
| tian | t ian |
| tiao | t iao |
| tie | t ie |
| ting | t ing |
| tong | t ong |
| tou | t ou |
| tu | t u |
| tuan | t uan |
| tui | t ui |
| tun | t un |
| tuo | t uo |
| wa | w a |
| wai | w ai |
| wan | w an |
| wang | w ang |
| wei | w ei |
| wen | w en |
| weng | w eng |
| wo | w o |
| wu | w u |
| xi | x i |
| xia | x ia |
| xian | x ian |
| xiang | x iang |
| xiao | x iao |
| xie | x ie |
| xin | x in |
| xing | x ing |
| xiong | x iong |
| xiu | x iu |
| xu | x v |
| xuan | x van |
| xue | x ve |
| xun | x vn |
| ya | y a |
| yan | y an |
| yang | y ang |
| yao | y ao |
| ye | y e |
| yi | y i |
| yin | y in |
| ying | y ing |
| yo | y o |
| yong | y ong |
| you | y ou |
| yu | y v |
| yuan | y van |
| yue | y ve |
| yun | y vn |
| za | z a |
| zai | z ai |
| zan | z an |
| zang | z ang |
| zao | z ao |
| ze | z e |
| zei | z ei |
| zen | z en |
| zeng | z eng |
| zha | zh a |
| zhai | zh ai |
| zhan | zh an |
| zhang | zh ang |
| zhao | zh ao |
| zhe | zh e |
| zhei | zh ei |
| zhen | zh en |
| zheng | zh eng |
| zhi | zh i |
| zhong | zh ong |
| zhou | zh ou |
| zhu | zh u |
| zhua | zh ua |
| zhuai | zh uai |
| zhuan | zh uan |
| zhuang | zh uang |
| zhui | zh ui |
| zhun | zh un |
| zhuo | zh uo |
| zi | z i |
| zong | z ong |
| zou | z ou |
| zu | z u |
| zuan | z uan |
| zui | z ui |
| zun | z un |
| zuo | z uo |
def cpop_pinyin2ph_func(path):
# The README of the Opencpop dataset defines a "pinyin to phoneme mapping table"; parse it into a dict.
pinyin2phs = {'AP': 'AP', 'SP': 'SP'}
with open(path) as rf:
for line in rf.readlines():
elements = [x.strip() for x in line.split('|') if x.strip() != '']
pinyin2phs[elements[0]] = elements[1]
return pinyin2phs
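# A minimal usage sketch (illustrative only, assuming the mapping table shown above is saved as
# cpop_pinyin2ph.txt next to this file): the returned dict maps pinyin syllables to phoneme strings.
if __name__ == '__main__':
    mapping = cpop_pinyin2ph_func('cpop_pinyin2ph.txt')
    print(mapping['xiao'])  # 'x iao'
    print(mapping['AP'])    # 'AP' (breath marker, present by default)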
import argparse
import os
import time
from typing import Dict
from typing import List
from typing import Union
from .infer import Infer
from .utils.audio import save_wav
from .utils.hparams import hparams
from .utils.hparams import set_hparams
from paddlehub.module.module import moduleinfo
from paddlehub.module.module import runnable
from paddlehub.module.module import serving
@moduleinfo(name="diffsinger",
type="Audio/svs",
author="",
author_email="",
summary="DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism",
version="1.0.0")
class DiffSinger:
def __init__(self, providers: List[str] = None) -> None:
root = self.directory
config = os.path.join('model', 'config.yaml')
set_hparams(config, root=root)
self.infer = Infer(root, providers=providers)
@serving
def singing_voice_synthesis(self,
inputs: Dict[str, str],
sample_num: int = 1,
save_audio: bool = True,
save_dir: str = 'outputs') -> Dict[str, Union[List[List[int]], int]]:
'''
inputs = {
'text': '小酒窝长睫毛AP是你最美的记号',
'notes': 'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4',
'notes_duration': '0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340',
'input_type': 'word'
} # user input: Chinese characters
or,
inputs = {
'text': '小酒窝长睫毛AP是你最美的记号',
'ph_seq': 'x iao j iu w o ch ang ang j ie ie m ao AP sh i n i z ui m ei d e j i h ao',
'note_seq': 'C#4/Db4 C#4/Db4 F#4/Gb4 F#4/Gb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 F#4/Gb4 F#4/Gb4 F#4/Gb4 C#4/Db4 C#4/Db4 C#4/Db4 rest C#4/Db4 C#4/Db4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 F4 F4 C#4/Db4 C#4/Db4',
'note_dur_seq': '0.407140 0.407140 0.376190 0.376190 0.242180 0.242180 0.509550 0.509550 0.183420 0.315400 0.315400 0.235020 0.361660 0.361660 0.223070 0.377270 0.377270 0.340550 0.340550 0.299620 0.299620 0.344510 0.344510 0.283770 0.283770 0.323390 0.323390 0.360340 0.360340',
'is_slur_seq': '0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0',
'input_type': 'phoneme'
} # input like Opencpop dataset.
'''
outputs = []
for i in range(sample_num):
output = self.infer.infer_once(inputs)
os.makedirs(save_dir, exist_ok=True)
if save_audio:
save_wav(output, os.path.join(save_dir, '%d_%d.wav' % (i, int(time.time()))),
hparams['audio_sample_rate'])
outputs.append(output.tolist())
return {'wavs': outputs, 'sample_rate': hparams['audio_sample_rate']}
@runnable
def run_cmd(self, argvs: List[str]) -> str:
self.parser = argparse.ArgumentParser(description="Run the {} module.".format(self.name),
prog='hub run {}'.format(self.name),
usage='%(prog)s',
add_help=True)
self.parser.add_argument('--input_type',
type=str,
choices=['word', 'phoneme'],
required=True,
help='input type in ["word", "phoneme"].')
args = self.parser.parse_args(argvs[:2])
if args.input_type == 'word':
self.arg_input_group = self.parser.add_argument_group(title="Input options (type: word).",
description="Input options (type: word).")
self.arg_input_group.add_argument('--text', type=str, required=True, help='input text.')
self.arg_input_group.add_argument('--notes', type=str, required=True, help='input notes.')
self.arg_input_group.add_argument('--notes_duration', type=str, required=True, help='input notes duration.')
elif args.input_type == 'phoneme':
self.arg_input_group = self.parser.add_argument_group(title="Input options (type: phoneme).",
description="Input options (type: phoneme).")
self.arg_input_group.add_argument('--text', type=str, required=True, help='input text.')
self.arg_input_group.add_argument('--ph_seq', type=str, required=True, help='input phoneme seq.')
self.arg_input_group.add_argument('--note_seq', type=str, required=True, help='input note seq.')
self.arg_input_group.add_argument('--note_dur_seq',
type=str,
required=True,
help='input note duration seq.')
self.arg_input_group.add_argument('--is_slur_seq',
type=str,
required=True,
help='input if note is slur seq.')
else:
raise ValueError('Input type (--input_type) should be in ["word", "phoneme"]')
self.parser.add_argument('--sample_num', type=int, default=1, help='sample audios num, default=1')
self.parser.add_argument('--save_dir',
type=str,
default='outputs',
help='sample audios save_dir, default="outputs"')
args = self.parser.parse_args(argvs)
kwargs = vars(args).copy()
kwargs.pop('sample_num')
kwargs.pop('save_dir')
self.singing_voice_synthesis(kwargs, sample_num=args.sample_num, save_dir=args.save_dir, save_audio=True)
return "Audios are saved in %s" % args.save_dir
librosa>=0.9.2
matplotlib==3.5.3
numpy>=1.21.6
pycwt>=0.3.0a22
pypinyin>=0.47.1
PyYAML>=6.0
scipy>=1.7.3
six>=1.16.0
soundfile>=0.11.0
tqdm>=4.64.1
import shutil
import unittest
import paddlehub as hub
class TestHubModule(unittest.TestCase):
@classmethod
def setUpClass(cls) -> None:
cls.module = hub.Module(name="diffsinger")
@classmethod
def tearDownClass(cls) -> None:
shutil.rmtree('outputs')
def test_singing_voice_synthesis1(self):
results = self.module.singing_voice_synthesis(inputs={
'text': '小酒窝长睫毛AP是你最美的记号',
'notes':
'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4',
'notes_duration':
'0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340',
'input_type': 'word'
},
sample_num=1,
save_audio=True,
save_dir='outputs')
self.assertIsInstance(results, dict)
self.assertIsInstance(results['wavs'], list)
self.assertIsInstance(results['wavs'][0], list)
self.assertEqual(len(results['wavs'][0]), 123776)
self.assertEqual(results['sample_rate'], 24000)
def test_singing_voice_synthesis2(self):
results = self.module.singing_voice_synthesis(inputs={
'text': '小酒窝长睫毛AP是你最美的记号',
'ph_seq': 'x iao j iu w o ch ang ang j ie ie m ao AP sh i n i z ui m ei d e j i h ao',
'note_seq':
'C#4/Db4 C#4/Db4 F#4/Gb4 F#4/Gb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 F#4/Gb4 F#4/Gb4 F#4/Gb4 C#4/Db4 C#4/Db4 C#4/Db4 rest C#4/Db4 C#4/Db4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 F4 F4 C#4/Db4 C#4/Db4',
'note_dur_seq':
'0.407140 0.407140 0.376190 0.376190 0.242180 0.242180 0.509550 0.509550 0.183420 0.315400 0.315400 0.235020 0.361660 0.361660 0.223070 0.377270 0.377270 0.340550 0.340550 0.299620 0.299620 0.344510 0.344510 0.283770 0.283770 0.323390 0.323390 0.360340 0.360340',
'is_slur_seq': '0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0',
'input_type': 'phoneme'
},
sample_num=1,
save_audio=True,
save_dir='outputs')
self.assertIsInstance(results, dict)
self.assertIsInstance(results['wavs'], list)
self.assertIsInstance(results['wavs'][0], list)
self.assertEqual(len(results['wavs'][0]), 123776)
self.assertEqual(results['sample_rate'], 24000)
if __name__ == "__main__":
unittest.main()
task_cls: usr.task.DiffFsTask
pitch_type: frame
timesteps: 100
dilation_cycle_length: 1
residual_layers: 20
residual_channels: 256
lr: 0.001
decay_steps: 50000
keep_bins: 80
spec_min: [ ]
spec_max: [ ]
content_cond_steps: [ ] # [ 0, 10000 ]
spk_cond_steps: [ ] # [ 0, 10000 ]
# train and eval
fs2_ckpt: ''
max_updates: 400000
# max_updates: 200000
use_gt_dur: true
use_gt_f0: true
gen_tgt_spk_id: -1
max_sentences: 48
num_sanity_val_steps: 1
num_valid_plots: 1
base_config:
- configs/tts/lj/fs2.yaml
- ./base.yaml
# spec_min and spec_max are calculated on the training set.
spec_min: [ -4.7574, -4.6783, -4.6431, -4.5832, -4.5390, -4.6771, -4.8089, -4.7672,
-4.5784, -4.7755, -4.7150, -4.8919, -4.8271, -4.7389, -4.6047, -4.7759,
-4.6799, -4.8201, -4.7823, -4.8262, -4.7857, -4.7545, -4.9358, -4.9733,
-5.1134, -5.1395, -4.9016, -4.8434, -5.0189, -4.8460, -5.0529, -4.9510,
-5.0217, -5.0049, -5.1831, -5.1445, -5.1015, -5.0281, -4.9887, -4.9916,
-4.9785, -4.9071, -4.9488, -5.0342, -4.9332, -5.0650, -4.8924, -5.0875,
-5.0483, -5.0848, -5.1809, -5.0677, -5.0015, -5.0792, -5.0636, -5.2413,
-5.1421, -5.1710, -5.3256, -5.0511, -5.1186, -5.0057, -5.0446, -5.1173,
-5.0325, -5.1085, -5.0053, -5.0755, -5.1176, -5.1004, -5.2153, -5.2757,
-5.3025, -5.2867, -5.2918, -5.3328, -5.2731, -5.2985, -5.2400, -5.2211 ]
spec_max: [ -0.5982, -0.0778, 0.1205, 0.2747, 0.4657, 0.5123, 0.5684, 0.7093,
0.6461, 0.6420, 0.7316, 0.7715, 0.7681, 0.8349, 0.7815, 0.7591,
0.7910, 0.7433, 0.7352, 0.6869, 0.6854, 0.6623, 0.5353, 0.6492,
0.6909, 0.6106, 0.5761, 0.5936, 0.5638, 0.4054, 0.4545, 0.3589,
0.3037, 0.3380, 0.1599, 0.2433, 0.2741, 0.2130, 0.1569, 0.1911,
0.2324, 0.1586, 0.1221, 0.0341, -0.0558, 0.0553, -0.1153, -0.0933,
-0.1171, -0.0050, -0.1519, -0.1629, -0.0522, -0.0739, -0.2069, -0.2405,
-0.1244, -0.2116, -0.1361, -0.1575, -0.1442, 0.0513, -0.1567, -0.2000,
0.0086, -0.0698, 0.1385, 0.0941, 0.1864, 0.1225, 0.2176, 0.2566,
0.1670, 0.1007, 0.1444, 0.0888, 0.1998, 0.2414, 0.2932, 0.3047 ]
task_cls: usr.diffspeech_task.DiffSpeechTask
vocoder: vocoders.hifigan.HifiGAN
vocoder_ckpt: checkpoints/0414_hifi_lj_1
num_valid_plots: 10
use_gt_dur: false
use_gt_f0: false
pitch_type: cwt
pitch_extractor: 'parselmouth'
max_updates: 160000
lr: 0.001
timesteps: 100
K_step: 71
diff_loss_type: l1
diff_decoder_type: 'wavenet'
schedule_type: 'linear'
max_beta: 0.06
fs2_ckpt: checkpoints/fs2_lj_1/model_ckpt_steps_150000.ckpt
save_gt: true
base_config:
- configs/singing/fs2.yaml
- usr/configs/midi/cascade/opencs/opencpop_statis.yaml
audio_sample_rate: 24000
hop_size: 128 # Hop size.
fft_size: 512 # FFT size.
win_size: 512 # FFT size.
fmin: 30
fmax: 12000
min_level_db: -120
binarization_args:
with_wav: true
with_spk_embed: false
with_align: true
raw_data_dir: 'data/raw/opencpop/segments'
processed_data_dir: 'xxx'
binarizer_cls: data_gen.singing.binarize.OpencpopBinarizer
binary_data_dir: 'data/binary/opencpop-midi-dp'
use_midi: true # for midi exp
use_gt_f0: false # for midi exp
use_gt_dur: false # for further midi exp
lambda_f0: 1.0
lambda_uv: 1.0
#lambda_energy: 0.1
lambda_ph_dur: 1.0
lambda_sent_dur: 1.0
lambda_word_dur: 1.0
predictor_grad: 0.1
pe_enable: false
pe_ckpt: ''
num_spk: 1
test_prefixes: [
'2044',
'2086',
'2092',
'2093',
'2100',
]
task_cls: usr.diffsinger_task.AuxDecoderMIDITask
#vocoder: usr.singingvocoder.highgan.HighGAN
#vocoder_ckpt: checkpoints/h_2_model/checkpoint-530000steps.pkl
vocoder: vocoders.hifigan.HifiGAN
vocoder_ckpt: checkpoints/0109_hifigan_bigpopcs_hop128
use_nsf: true
# config for experiments
max_frames: 5000
max_tokens: 40000
predictor_layers: 5
rel_pos: true
dur_predictor_layers: 5 # *
use_spk_embed: false
num_valid_plots: 10
max_updates: 160000
save_gt: true
base_config:
- usr/configs/popcs_ds_beta6.yaml
- usr/configs/midi/cascade/opencs/opencpop_statis.yaml
binarizer_cls: data_gen.singing.binarize.OpencpopBinarizer
binary_data_dir: 'data/binary/opencpop-midi-dp'
#switch_midi2f0_step: 174000
use_midi: true # for midi exp
use_gt_f0: false # for midi exp
use_gt_dur: false # for further midi exp
lambda_f0: 1.0
lambda_uv: 1.0
#lambda_energy: 0.1
lambda_ph_dur: 1.0
lambda_sent_dur: 1.0
lambda_word_dur: 1.0
predictor_grad: 0.1
pe_enable: false
pe_ckpt: ''
fs2_ckpt: 'checkpoints/0302_opencpop_fs_midi/model_ckpt_steps_160000.ckpt' #
#num_valid_plots: 0
task_cls: usr.diffsinger_task.DiffSingerMIDITask
K_step: 60
max_tokens: 40000
predictor_layers: 5
dilation_cycle_length: 4 # *
rel_pos: true
dur_predictor_layers: 5 # *
max_updates: 160000
gaussian_start: false
spec_min: [-6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6.,
-6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6.,
-6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6.,
-6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6.,
-6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6.,
-6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6., -6.,
-6., -6., -6., -6., -6., -6., -6., -6.]
spec_max: [-7.9453e-01, -8.1116e-01, -6.1631e-01, -3.0679e-01, -1.3863e-01,
-5.0652e-02, -1.1563e-01, -1.0679e-01, -9.1068e-02, -6.2174e-02,
-7.5302e-02, -7.2217e-02, -6.3815e-02, -7.3299e-02, 7.3610e-03,
-7.2508e-02, -5.0234e-02, -1.6534e-01, -2.6928e-01, -2.0782e-01,
-2.0823e-01, -1.1702e-01, -7.0128e-02, -6.5868e-02, -1.2675e-02,
1.5121e-03, -8.9902e-02, -2.1392e-01, -2.3789e-01, -2.8922e-01,
-3.0405e-01, -2.3029e-01, -2.2088e-01, -2.1542e-01, -2.9367e-01,
-3.0137e-01, -3.8281e-01, -4.3590e-01, -2.8681e-01, -4.6855e-01,
-5.7485e-01, -4.7022e-01, -5.4266e-01, -4.4848e-01, -6.4120e-01,
-6.8700e-01, -6.4860e-01, -7.6436e-01, -4.9971e-01, -7.1068e-01,
-6.9724e-01, -6.1487e-01, -5.5843e-01, -6.9773e-01, -5.7502e-01,
-7.0919e-01, -8.2431e-01, -8.4213e-01, -9.0431e-01, -8.2840e-01,
-7.7945e-01, -8.2758e-01, -8.7699e-01, -1.0532e+00, -1.0766e+00,
-1.1198e+00, -1.0185e+00, -9.8983e-01, -1.0001e+00, -1.0756e+00,
-1.0024e+00, -1.0304e+00, -1.0579e+00, -1.0188e+00, -1.0500e+00,
-1.0842e+00, -1.0923e+00, -1.1223e+00, -1.2381e+00, -1.6467e+00]
mel_vmin: -6. #-6.
mel_vmax: 1.5
wav2spec_eps: 1e-6
raw_data_dir: 'data/raw/opencpop/segments'
processed_data_dir: 'xxx'
binary_data_dir: 'data/binary/opencpop-midi-dp'
datasets: [
'opencpop',
]
test_prefixes: [
'2044',
'2086',
'2092',
'2093',
'2100',
]
base_config:
- usr/configs/popcs_ds_beta6.yaml
- usr/configs/midi/cascade/opencs/opencpop_statis.yaml
binarizer_cls: data_gen.singing.binarize.OpencpopBinarizer
binary_data_dir: 'data/binary/opencpop-midi-dp'
#switch_midi2f0_step: 174000
use_midi: true # for midi exp
use_gt_dur: false # for further midi exp
lambda_ph_dur: 1.0
lambda_sent_dur: 1.0
lambda_word_dur: 1.0
predictor_grad: 0.1
dur_predictor_layers: 5 # *
fs2_ckpt: '' #
#num_valid_plots: 0
task_cls: usr.diffsinger_task.DiffSingerMIDITask
# for diffusion schedule
timesteps: 1000
K_step: 1000
max_beta: 0.02
max_tokens: 36000
max_updates: 320000
gaussian_start: True
pndm_speedup: 40
use_pitch_embed: false
use_gt_f0: false # for midi exp
lambda_f0: 0.
lambda_uv: 0.
dilation_cycle_length: 4 # *
rel_pos: true
predictor_layers: 5
pe_enable: true
pe_ckpt: 'checkpoints/0102_xiaoma_pe'
base_config:
- usr/configs/popcs_ds_beta6.yaml
- usr/configs/midi/cascade/opencs/opencpop_statis.yaml
binarizer_cls: data_gen.singing.binarize.OpencpopBinarizer
binary_data_dir: 'data/binary/opencpop-midi-dp'
#switch_midi2f0_step: 174000
use_midi: true # for midi exp
use_gt_dur: false # for further midi exp
lambda_ph_dur: 1.0
lambda_sent_dur: 1.0
lambda_word_dur: 1.0
predictor_grad: 0.1
dur_predictor_layers: 5 # *
fs2_ckpt: '' #
#num_valid_plots: 0
task_cls: usr.diffsinger_task.DiffSingerMIDITask
K_step: 100
max_tokens: 40000
max_updates: 160000
gaussian_start: True
use_pitch_embed: false
use_gt_f0: false # for midi exp
lambda_f0: 0.
lambda_uv: 0.
dilation_cycle_length: 4 # *
rel_pos: true
predictor_layers: 5
pe_enable: true
pe_ckpt: 'checkpoints/0102_xiaoma_pe'
base_config:
- usr/configs/popcs_ds_beta6.yaml
- usr/configs/midi/cascade/popcs/popcs_statis.yaml
binarizer_cls: data_gen.singing.binarize.MidiSingingBinarizer
binary_data_dir: 'data/binary/popcs-midi-dp'
#switch_midi2f0_step: 174000
use_midi: true # for midi exp
use_gt_dur: false # for further midi exp
lambda_ph_dur: 1.0
lambda_sent_dur: 1.0
lambda_word_dur: 1.0
predictor_grad: 0.1
dur_predictor_layers: 5 # *
fs2_ckpt: '' #
#num_valid_plots: 0
task_cls: usr.diffsinger_task.DiffSingerMIDITask
K_step: 100
max_tokens: 40000
max_updates: 160000
gaussian_start: True
use_pitch_embed: false
use_gt_f0: false # for midi exp
lambda_f0: 0.
lambda_uv: 0.
dilation_cycle_length: 4 # *
rel_pos: true
predictor_layers: 5
pe_enable: true
pe_ckpt: 'checkpoints/0102_xiaoma_pe'
base_config:
- configs/tts/lj/fs2.yaml
max_frames: 8000
audio_sample_rate: 24000
hop_size: 128 # Hop size.
fft_size: 512 # FFT size.
win_size: 512 # FFT size.
fmin: 30
fmax: 12000
min_level_db: -120
binary_data_dir: 'xxx'
pitch_type: frame
task_cls: tasks.tts.pe.PitchExtractionTask
pitch_extractor_conv_layers: 2
# config for experiments
max_tokens: 20000
use_spk_embed: false
num_valid_plots: 10
max_updates: 60000
base_config:
- configs/tts/fs2.yaml
- configs/singing/base.yaml
- ./base.yaml
audio_sample_rate: 24000
hop_size: 128 # Hop size.
fft_size: 512 # FFT size.
win_size: 512 # FFT size.
fmin: 30
fmax: 12000
min_level_db: -120
binarization_args:
with_wav: true
with_spk_embed: false
with_align: true
raw_data_dir: 'data/raw/popcs'
processed_data_dir: 'data/processed/popcs'
binary_data_dir: 'data/binary/popcs-pmf0'
num_spk: 1
datasets: [
'popcs',
]
test_prefixes: [
'popcs-说散就散',
'popcs-隐形的翅膀',
]
spec_min: [-6.8276, -7.0270, -6.8142, -7.1429, -7.6669, -7.6000, -7.1148, -6.9640,
-6.8414, -6.6596, -6.6880, -6.7439, -6.7986, -7.4940, -7.7845, -7.6586,
-6.9288, -6.7639, -6.9118, -6.8246, -6.7183, -7.1769, -6.9794, -7.4513,
-7.3422, -7.5623, -6.9610, -6.8158, -6.9595, -6.8403, -6.5688, -6.6356,
-7.0209, -6.5002, -6.7819, -6.5232, -6.6927, -6.5701, -6.5531, -6.7069,
-6.6462, -6.4523, -6.5954, -6.4264, -6.4487, -6.7070, -6.4025, -6.3042,
-6.4008, -6.3857, -6.3903, -6.3094, -6.2491, -6.3518, -6.3566, -6.4168,
-6.2481, -6.3624, -6.2858, -6.2575, -6.3638, -6.4520, -6.1835, -6.2754,
-6.1253, -6.1645, -6.0638, -6.1262, -6.0710, -6.1039, -6.4428, -6.1363,
-6.1054, -6.1252, -6.1797, -6.0235, -6.0758, -5.9453, -6.0213, -6.0446]
spec_max: [ 0.2645, 0.0583, -0.2344, -0.0184, 0.1227, 0.1533, 0.1103, 0.1212,
0.2421, 0.1809, 0.2134, 0.3161, 0.3301, 0.3289, 0.2667, 0.2421,
0.2581, 0.2600, 0.1394, 0.1907, 0.1082, 0.1474, 0.1680, 0.2550,
0.1057, 0.0826, 0.0423, 0.1203, -0.0701, -0.0056, 0.0477, -0.0639,
-0.0272, -0.0728, -0.1648, -0.0855, -0.2652, -0.1998, -0.1547, -0.2167,
-0.4181, -0.5463, -0.4161, -0.4733, -0.6518, -0.5387, -0.4290, -0.4191,
-0.4151, -0.3042, -0.3810, -0.4160, -0.4496, -0.2847, -0.4676, -0.4658,
-0.4931, -0.4885, -0.5547, -0.5481, -0.6948, -0.7968, -0.8455, -0.8392,
-0.8770, -0.9520, -0.8749, -0.7297, -0.8374, -0.8667, -0.7157, -0.9035,
-0.9219, -0.8801, -0.9298, -0.9009, -0.9604, -1.0537, -1.0781, -1.3766]
task_cls: usr.diffsinger_task.DiffSingerTask
#vocoder: usr.singingvocoder.highgan.HighGAN
#vocoder_ckpt: checkpoints/h_2_model/checkpoint-530000steps.pkl
vocoder: vocoders.hifigan.HifiGAN
vocoder_ckpt: checkpoints/0109_hifigan_bigpopcs_hop128
pitch_extractor: 'parselmouth'
# config for experiments
use_spk_embed: false
num_valid_plots: 10
max_updates: 160000
lr: 0.001
timesteps: 100
K_step: 51
diff_loss_type: l1
diff_decoder_type: 'wavenet'
schedule_type: 'linear'
max_beta: 0.06
fs2_ckpt: ''
use_nsf: true
base_config:
- ./popcs_ds_beta6.yaml
fs2_ckpt: checkpoints/popcs_fs2_pmf0_1230/model_ckpt_steps_160000.ckpt # to be infer
num_valid_plots: 0
task_cls: usr.diffsinger_task.DiffSingerOfflineTask
# tmp:
#pe_enable: true
#pe_ckpt: ''
vocoder: vocoders.hifigan.HifiGAN
vocoder_ckpt: checkpoints/0109_hifigan_bigpopcs_hop128
base_config:
- configs/singing/fs2.yaml
audio_sample_rate: 24000
hop_size: 128 # Hop size.
fft_size: 512 # FFT size.
win_size: 512 # FFT size.
fmin: 30
fmax: 12000
min_level_db: -120
binarization_args:
with_wav: true
with_spk_embed: false
with_align: true
raw_data_dir: 'data/raw/popcs'
processed_data_dir: 'data/processed/popcs'
binary_data_dir: 'data/binary/popcs-pmf0'
num_spk: 1
datasets: [
'popcs',
]
test_prefixes: [
'popcs-说散就散',
'popcs-隐形的翅膀',
]
task_cls: tasks.tts.fs2.FastSpeech2Task
#vocoder: usr.singingvocoder.highgan.HighGAN
#vocoder_ckpt: checkpoints/h_2_model/checkpoint-530000steps.pkl
vocoder: vocoders.hifigan.HifiGAN
vocoder_ckpt: checkpoints/0109_hifigan_bigpopcs_hop128
use_nsf: true
# config for experiments
max_tokens: 18000
use_spk_embed: false
num_valid_plots: 10
max_updates: 160000
save_gt: true
# tmp:
#pe_enable: true
#pe_ckpt: ''
import sys
import time
import types
import numpy as np
class AvgrageMeter(object):
def __init__(self):
self.reset()
def reset(self):
self.avg = 0
self.sum = 0
self.cnt = 0
def update(self, val, n=1):
self.sum += val * n
self.cnt += n
self.avg = self.sum / self.cnt
def collate_1d(values, pad_idx=0, left_pad=False, shift_right=False, max_len=None, shift_id=1):
"""Convert a list of 1d tensors into a padded 2d tensor."""
size = max(v.size(0) for v in values) if max_len is None else max_len
res = values[0].new(len(values), size).fill_(pad_idx)
def copy_tensor(src, dst):
assert dst.numel() == src.numel()
if shift_right:
dst[1:] = src[:-1]
dst[0] = shift_id
else:
dst.copy_(src)
for i, v in enumerate(values):
copy_tensor(v, res[i][size - len(v):] if left_pad else res[i][:len(v)])
return res
def collate_2d(values, pad_idx=0, left_pad=False, shift_right=False, max_len=None):
"""Convert a list of 2d tensors into a padded 3d tensor."""
size = max(v.size(0) for v in values) if max_len is None else max_len
res = values[0].new(len(values), size, values[0].shape[1]).fill_(pad_idx)
def copy_tensor(src, dst):
assert dst.numel() == src.numel()
if shift_right:
dst[1:] = src[:-1]
else:
dst.copy_(src)
for i, v in enumerate(values):
copy_tensor(v, res[i][size - len(v):] if left_pad else res[i][:len(v)])
return res
def _is_batch_full(batch, num_tokens, max_tokens, max_sentences):
if len(batch) == 0:
return 0
if len(batch) == max_sentences:
return 1
if num_tokens > max_tokens:
return 1
return 0
def batch_by_size(indices,
num_tokens_fn,
max_tokens=None,
max_sentences=None,
required_batch_size_multiple=1,
distributed=False):
"""
Return mini-batches of indices bucketed by size. Batches may contain
sequences of different lengths.
Args:
indices (List[int]): ordered list of dataset indices
num_tokens_fn (callable): function that returns the number of tokens at
a given index
max_tokens (int, optional): max number of tokens in each batch
(default: None).
max_sentences (int, optional): max number of sentences in each
batch (default: None).
required_batch_size_multiple (int, optional): require batch size to
be a multiple of N (default: 1).
distributed (bool, optional): unused in this implementation; kept for
API compatibility (default: False).
"""
max_tokens = max_tokens if max_tokens is not None else sys.maxsize
max_sentences = max_sentences if max_sentences is not None else sys.maxsize
bsz_mult = required_batch_size_multiple
if isinstance(indices, types.GeneratorType):
indices = np.fromiter(indices, dtype=np.int64, count=-1)
sample_len = 0
sample_lens = []
batch = []
batches = []
for i in range(len(indices)):
idx = indices[i]
num_tokens = num_tokens_fn(idx)
sample_lens.append(num_tokens)
sample_len = max(sample_len, num_tokens)
assert sample_len <= max_tokens, ("sentence at index {} of size {} exceeds max_tokens "
"limit of {}!".format(idx, sample_len, max_tokens))
num_tokens = (len(batch) + 1) * sample_len
if _is_batch_full(batch, num_tokens, max_tokens, max_sentences):
mod_len = max(
bsz_mult * (len(batch) // bsz_mult),
len(batch) % bsz_mult,
)
batches.append(batch[:mod_len])
batch = batch[mod_len:]
sample_lens = sample_lens[mod_len:]
sample_len = max(sample_lens) if len(sample_lens) > 0 else 0
batch.append(idx)
if len(batch) > 0:
batches.append(batch)
return batches
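A hypothetical usage sketch of `batch_by_size`, with a toy length table standing in for a real `num_tokens_fn`:

```python
# Group dataset indices so that (batch size) * (longest sample) stays under max_tokens.
lengths = [5, 7, 3, 9, 4, 6]          # assumed per-sample token counts
indices = list(range(len(lengths)))
batches = batch_by_size(indices, num_tokens_fn=lambda i: lengths[i], max_tokens=16)
print(batches)   # [[0, 1], [2], [3], [4, 5]] for this toy length table
```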
def unpack_dict_to_list(samples):
samples_ = []
bsz = samples.get('outputs').size(0)
for i in range(bsz):
res = {}
for k, v in samples.items():
try:
res[k] = v[i]
except:
# value cannot be indexed per-sample (e.g. a scalar); skip it
pass
samples_.append(res)
return samples_
def remove_padding(x, padding_idx=0):
if x is None:
return None
assert len(x.shape) in [1, 2]
if len(x.shape) == 2: # [T, H]
return x[np.abs(x).sum(-1) != padding_idx]
elif len(x.shape) == 1: # [T]
return x[x != padding_idx]
class Timer:
timer_map = {}
def __init__(self, name, print_time=False):
if name not in Timer.timer_map:
Timer.timer_map[name] = 0
self.name = name
self.print_time = print_time
def __enter__(self):
self.t = time.time()
def __exit__(self, exc_type, exc_val, exc_tb):
Timer.timer_map[self.name] += time.time() - self.t
if self.print_time:
print(self.name, Timer.timer_map[self.name])
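A small usage sketch of `Timer`; the scope name and the `time.sleep` stand-in for real work are illustrative:

```python
# Accumulate wall-clock time under a named key and print it on exit.
with Timer('vocoder', print_time=True):
    time.sleep(0.1)   # stand-in for e.g. a vocoder forward pass
# Repeated scopes with the same name keep adding to Timer.timer_map['vocoder'].
```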
import subprocess
import matplotlib
matplotlib.use('Agg')
import librosa
import librosa.filters
import numpy as np
from scipy import signal
from scipy.io import wavfile
def save_wav(wav, path, sr, norm=False):
if norm:
wav = wav / np.abs(wav).max()
wav *= 32767
# proposed by @dsmiller
wavfile.write(path, sr, wav.astype(np.int16))
def get_hop_size(hparams):
hop_size = hparams['hop_size']
if hop_size is None:
assert hparams['frame_shift_ms'] is not None
hop_size = int(hparams['frame_shift_ms'] / 1000 * hparams['audio_sample_rate'])
return hop_size
###########################################################################################
def _stft(y, hparams):
return librosa.stft(y=y,
n_fft=hparams['fft_size'],
hop_length=get_hop_size(hparams),
win_length=hparams['win_size'],
pad_mode='constant')
def _istft(y, hparams):
return librosa.istft(y, hop_length=get_hop_size(hparams), win_length=hparams['win_size'])
def librosa_pad_lr(x, fsize, fshift, pad_sides=1):
'''compute right padding (final frame) or both sides padding (first and final frames)
'''
assert pad_sides in (1, 2)
# return int(fsize // 2)
pad = (x.shape[0] // fshift + 1) * fshift - x.shape[0]
if pad_sides == 1:
return 0, pad
else:
return pad // 2, pad // 2 + pad % 2
# Conversions
def amp_to_db(x):
return 20 * np.log10(np.maximum(1e-5, x))
def normalize(S, hparams):
return (S - hparams['min_level_db']) / -hparams['min_level_db']
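A minimal sketch chaining `_stft`, `amp_to_db` and `normalize` on a synthetic waveform; `hparams_demo` mirrors the fs2 audio settings listed earlier and is not the module's real hparams object:

```python
# Turn one second of noise into a normalized log-magnitude spectrogram.
hparams_demo = {
    'fft_size': 512,
    'hop_size': 128,
    'win_size': 512,
    'audio_sample_rate': 24000,
    'frame_shift_ms': None,
    'min_level_db': -120,
}
wav = np.random.randn(24000).astype(np.float32)
spec = np.abs(_stft(wav, hparams_demo))        # [freq_bins, frames] magnitude
spec_db = amp_to_db(spec)                      # to dB, floored at -100 dB
spec_norm = normalize(spec_db, hparams_demo)   # roughly rescaled towards [0, 1]
print(spec_norm.shape)
```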
import librosa
import numpy as np
from pycwt import wavelet
from scipy.interpolate import interp1d
def load_wav(wav_file, sr):
wav, _ = librosa.load(wav_file, sr=sr, mono=True)
return wav
def convert_continuos_f0(f0):
'''Convert an f0 sequence into a continuous f0 contour.
Args:
f0 (ndarray): original f0 sequence with shape (T); unvoiced frames are 0
Return:
(ndarray, ndarray): binary voiced/unvoiced mask and continuous f0, both with shape (T)
'''
# get uv information as binary
f0 = np.copy(f0)
uv = np.float32(f0 != 0)
# get start and end of f0
if (f0 == 0).all():
print("| all of the f0 values are 0.")
return uv, f0
start_f0 = f0[f0 != 0][0]
end_f0 = f0[f0 != 0][-1]
# padding start and end of f0 sequence
start_idx = np.where(f0 == start_f0)[0][0]
end_idx = np.where(f0 == end_f0)[0][-1]
f0[:start_idx] = start_f0
f0[end_idx:] = end_f0
# get non-zero frame index
nz_frames = np.where(f0 != 0)[0]
# perform linear interpolation
f = interp1d(nz_frames, f0[nz_frames])
cont_f0 = f(np.arange(0, f0.shape[0]))
return uv, cont_f0
def get_cont_lf0(f0, frame_period=5.0):
uv, cont_f0_lpf = convert_continuos_f0(f0)
# cont_f0_lpf = low_pass_filter(cont_f0_lpf, int(1.0 / (frame_period * 0.001)), cutoff=20)
cont_lf0_lpf = np.log(cont_f0_lpf)
return uv, cont_lf0_lpf
def get_lf0_cwt(lf0):
'''
input:
signal of shape (N)
output:
Wavelet_lf0 of shape (N, 10), scales of shape (10)
'''
mother = wavelet.MexicanHat()
dt = 0.005
dj = 1
s0 = dt * 2
J = 9
Wavelet_lf0, scales, _, _, _, _ = wavelet.cwt(np.squeeze(lf0), dt, dj, s0, J, mother)
# Wavelet.shape => (J + 1, len(lf0))
Wavelet_lf0 = np.real(Wavelet_lf0).T
return Wavelet_lf0, scales
def norm_scale(Wavelet_lf0):
Wavelet_lf0_norm = np.zeros((Wavelet_lf0.shape[0], Wavelet_lf0.shape[1]))
mean = Wavelet_lf0.mean(0)[None, :]
std = Wavelet_lf0.std(0)[None, :]
Wavelet_lf0_norm = (Wavelet_lf0 - mean) / std
return Wavelet_lf0_norm, mean, std
def normalize_cwt_lf0(f0, mean, std):
uv, cont_lf0_lpf = get_cont_lf0(f0)
cont_lf0_norm = (cont_lf0_lpf - mean) / std
Wavelet_lf0, scales = get_lf0_cwt(cont_lf0_norm)
Wavelet_lf0_norm, _, _ = norm_scale(Wavelet_lf0)
return Wavelet_lf0_norm
def get_lf0_cwt_norm(f0s, mean, std):
uvs = list()
cont_lf0_lpfs = list()
cont_lf0_lpf_norms = list()
Wavelet_lf0s = list()
Wavelet_lf0s_norm = list()
scaless = list()
means = list()
stds = list()
for f0 in f0s:
uv, cont_lf0_lpf = get_cont_lf0(f0)
cont_lf0_lpf_norm = (cont_lf0_lpf - mean) / std
Wavelet_lf0, scales = get_lf0_cwt(cont_lf0_lpf_norm) # [560,10]
Wavelet_lf0_norm, mean_scale, std_scale = norm_scale(Wavelet_lf0) # [560,10],[1,10],[1,10]
Wavelet_lf0s_norm.append(Wavelet_lf0_norm)
uvs.append(uv)
cont_lf0_lpfs.append(cont_lf0_lpf)
cont_lf0_lpf_norms.append(cont_lf0_lpf_norm)
Wavelet_lf0s.append(Wavelet_lf0)
scaless.append(scales)
means.append(mean_scale)
stds.append(std_scale)
return Wavelet_lf0s_norm, scaless, means, stds
def inverse_cwt(Wavelet_lf0, scales):
b = ((np.arange(0, len(scales))[None, None, :] + 1 + 2.5)**(-2.5))
lf0_rec = Wavelet_lf0 * b
lf0_rec_sum = lf0_rec.sum(-1)
lf0_rec_sum = (lf0_rec_sum - lf0_rec_sum.mean(-1, keepdims=True)) / lf0_rec_sum.std(-1, keepdims=True)
return lf0_rec_sum
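A hypothetical end-to-end sketch of the f0 CWT helpers above: fill the unvoiced gaps of a toy contour, decompose it into its 10-band wavelet representation, and reconstruct a normalized contour from it:

```python
# Toy f0 contour: 200 frames, voiced only in the middle section.
f0 = np.zeros(200)
f0[20:180] = 220.0 + 20.0 * np.sin(np.linspace(0, 6.28, 160))

uv, cont_lf0 = get_cont_lf0(f0)                    # voiced/unvoiced mask + continuous log-f0
wavelet_lf0, scales = get_lf0_cwt(cont_lf0)        # shape (200, 10)
wavelet_norm, _, _ = norm_scale(wavelet_lf0)       # per-band normalization
lf0_rec = inverse_cwt(wavelet_norm[None], scales)  # zero-mean, unit-variance reconstruction
print(wavelet_norm.shape, lf0_rec.shape)           # (200, 10) (1, 200)
```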
import argparse
import os
import yaml
global_print_hparams = True
hparams = {}
class Args:
def __init__(self, **kwargs):
for k, v in kwargs.items():
self.__setattr__(k, v)
def override_config(old_config: dict, new_config: dict):
for k, v in new_config.items():
if isinstance(v, dict) and k in old_config:
override_config(old_config[k], new_config[k])
else:
old_config[k] = v
def set_hparams(config='', exp_name='', hparams_str='', print_hparams=True, global_hparams=True, root='.'):
if config == '' and exp_name == '':
parser = argparse.ArgumentParser(description='neural music')
parser.add_argument('--config', type=str, default='', help='path to the config yaml')
parser.add_argument('--exp_name', type=str, default='', help='exp_name')
parser.add_argument('--hparams', type=str, default='', help='hparams overrides, e.g. "k1=v1,k2=v2"')
parser.add_argument('--infer', action='store_true', help='infer')
parser.add_argument('--validate', action='store_true', help='validate')
parser.add_argument('--reset', action='store_true', help='reset hparams')
parser.add_argument('--debug', action='store_true', help='debug')
args, unknown = parser.parse_known_args()
else:
args = Args(config=config,
exp_name=exp_name,
hparams=hparams_str,
infer=False,
validate=False,
reset=False,
debug=False)
args_work_dir = ''
if args.exp_name != '':
args.work_dir = args.exp_name
args_work_dir = f'checkpoints/{args.work_dir}'
config_chains = []
loaded_config = set()
def load_config(config_fn): # deep first
with open(os.path.join(root, config_fn)) as f:
hparams_ = yaml.safe_load(f)
loaded_config.add(config_fn)
if 'base_config' in hparams_:
ret_hparams = {}
if not isinstance(hparams_['base_config'], list):
hparams_['base_config'] = [hparams_['base_config']]
for c in hparams_['base_config']:
if c not in loaded_config:
if c.startswith('.'):
c = f'{os.path.dirname(config_fn)}/{c}'
c = os.path.normpath(c)
override_config(ret_hparams, load_config(c))
override_config(ret_hparams, hparams_)
else:
ret_hparams = hparams_
config_chains.append(config_fn)
return ret_hparams
global hparams
assert args.config != '' or args_work_dir != ''
saved_hparams = {}
if args_work_dir != 'checkpoints/':
ckpt_config_path = f'{args_work_dir}/config.yaml'
if os.path.exists(ckpt_config_path):
try:
with open(ckpt_config_path) as f:
saved_hparams.update(yaml.safe_load(f))
except:
pass
if args.config == '':
args.config = ckpt_config_path
hparams_ = {}
hparams_.update(load_config(args.config))
if not args.reset:
hparams_.update(saved_hparams)
hparams_['work_dir'] = args_work_dir
if args.hparams != "":
for new_hparam in args.hparams.split(","):
k, v = new_hparam.split("=")
if v in ['True', 'False'] or type(hparams_[k]) == bool:
hparams_[k] = eval(v)
else:
hparams_[k] = type(hparams_[k])(v)
if args_work_dir != '' and (not os.path.exists(ckpt_config_path) or args.reset) and not args.infer:
os.makedirs(hparams_['work_dir'], exist_ok=True)
with open(ckpt_config_path, 'w') as f:
yaml.safe_dump(hparams_, f)
hparams_['infer'] = args.infer
hparams_['debug'] = args.debug
hparams_['validate'] = args.validate
global global_print_hparams
if global_hparams:
hparams.clear()
hparams.update(hparams_)
if print_hparams and global_print_hparams and global_hparams:
print('| Hparams chains: ', config_chains)
print('| Hparams: ')
for i, (k, v) in enumerate(sorted(hparams_.items())):
print(f"\033[;33;m{k}\033[0m: {v}, ", end="\n" if i % 5 == 4 else "")
print("")
global_print_hparams = False
# print(hparams_.keys())
if hparams.get('exp_name') is None:
hparams['exp_name'] = args.exp_name
if hparams_.get('exp_name') is None:
hparams_['exp_name'] = args.exp_name
return hparams_
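A usage sketch of `set_hparams` that bypasses the CLI parser; the config path is illustrative and has to exist relative to `root` for the call to succeed:

```python
# Load a config chain (base_config entries are merged depth-first) without argparse.
hp = set_hparams(config='configs/singing/fs2.yaml', exp_name='',
                 print_hparams=False, global_hparams=False, root='.')
print(hp.get('audio_sample_rate'))   # 24000 according to the config shown above
```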
import os
import traceback
from multiprocessing import Process
from multiprocessing import Queue
def chunked_worker(worker_id, map_func, args, results_queue=None, init_ctx_func=None):
ctx = init_ctx_func(worker_id) if init_ctx_func is not None else None
for job_idx, arg in args:
try:
if ctx is not None:
res = map_func(*arg, ctx=ctx)
else:
res = map_func(*arg)
results_queue.put((job_idx, res))
except:
traceback.print_exc()
results_queue.put((job_idx, None))
def chunked_multiprocess_run(map_func, args, num_workers=None, ordered=True, init_ctx_func=None, q_max_size=1000):
args = zip(range(len(args)), args)
args = list(args)
n_jobs = len(args)
if num_workers is None:
num_workers = int(os.getenv('N_PROC', os.cpu_count()))
results_queues = []
if ordered:
for i in range(num_workers):
results_queues.append(Queue(maxsize=q_max_size // num_workers))
else:
results_queue = Queue(maxsize=q_max_size)
for i in range(num_workers):
results_queues.append(results_queue)
workers = []
for i in range(num_workers):
args_worker = args[i::num_workers]
p = Process(target=chunked_worker,
args=(i, map_func, args_worker, results_queues[i], init_ctx_func),
daemon=True)
workers.append(p)
p.start()
for n_finished in range(n_jobs):
results_queue = results_queues[n_finished % num_workers]
job_idx, res = results_queue.get()
assert job_idx == n_finished or not ordered, (job_idx, n_finished)
yield res
for w in workers:
w.join()
w.close()
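A hypothetical usage sketch of `chunked_multiprocess_run`; the `square` job is illustrative, and the `__main__` guard matters on spawn-based platforms:

```python
def square(x):
    return x * x

if __name__ == '__main__':
    jobs = [(i,) for i in range(8)]    # each entry is one argument tuple for map_func
    for result in chunked_multiprocess_run(square, jobs, num_workers=2):
        print(result)                  # yielded in submission order because ordered=True
```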
import re
import six
PAD = "<pad>"
EOS = "<EOS>"
UNK = "<UNK>"
SEG = "|"
RESERVED_TOKENS = [PAD, EOS, UNK]
NUM_RESERVED_TOKENS = len(RESERVED_TOKENS)
PAD_ID = RESERVED_TOKENS.index(PAD) # Normally 0
EOS_ID = RESERVED_TOKENS.index(EOS) # Normally 1
UNK_ID = RESERVED_TOKENS.index(UNK) # Normally 2
if six.PY2:
RESERVED_TOKENS_BYTES = RESERVED_TOKENS
else:
RESERVED_TOKENS_BYTES = [bytes(PAD, "ascii"), bytes(EOS, "ascii")]
# Regular expression for unescaping token strings.
# '\u' is converted to '_'
# '\\' is converted to '\'
# '\213;' is converted to unichr(213)
_UNESCAPE_REGEX = re.compile(r"\\u|\\\\|\\([0-9]+);")
_ESCAPE_CHARS = set(u"\\_u;0123456789")
def strip_ids(ids, ids_to_strip):
"""Strip ids_to_strip from the end ids."""
ids = list(ids)
while ids and ids[-1] in ids_to_strip:
ids.pop()
return ids
class TextEncoder(object):
"""Base class for converting from ints to/from human readable strings."""
def __init__(self, num_reserved_ids=NUM_RESERVED_TOKENS):
self._num_reserved_ids = num_reserved_ids
@property
def num_reserved_ids(self):
return self._num_reserved_ids
def encode(self, s):
"""Transform a human-readable string into a sequence of int ids.
The ids should be in the range [num_reserved_ids, vocab_size). Ids [0,
num_reserved_ids) are reserved.
EOS is not appended.
Args:
s: human-readable string to be converted.
Returns:
ids: list of integers
"""
return [int(w) + self._num_reserved_ids for w in s.split()]
def decode(self, ids, strip_extraneous=False):
"""Transform a sequence of int ids into a human-readable string.
EOS is not expected in ids.
Args:
ids: list of integers to be converted.
strip_extraneous: bool, whether to strip off extraneous tokens
(EOS and PAD).
Returns:
s: human-readable string.
"""
if strip_extraneous:
ids = strip_ids(ids, list(range(self._num_reserved_ids or 0)))
return " ".join(self.decode_list(ids))
def decode_list(self, ids):
"""Transform a sequence of int ids into a their string versions.
This method supports transforming individual input/output ids to their
string versions so that sequence to/from text conversions can be visualized
in a human readable format.
Args:
ids: list of integers to be converted.
Returns:
strs: list of human-readable strings.
"""
decoded_ids = []
for id_ in ids:
if 0 <= id_ < self._num_reserved_ids:
decoded_ids.append(RESERVED_TOKENS[int(id_)])
else:
decoded_ids.append(id_ - self._num_reserved_ids)
return [str(d) for d in decoded_ids]
@property
def vocab_size(self):
raise NotImplementedError()
class ByteTextEncoder(TextEncoder):
"""Encodes each byte to an id. For 8-bit strings only."""
def encode(self, s):
numres = self._num_reserved_ids
return [c + numres for c in s.encode("utf-8")]
def decode(self, ids, strip_extraneous=False):
if strip_extraneous:
ids = strip_ids(ids, list(range(self._num_reserved_ids or 0)))
numres = self._num_reserved_ids
decoded_ids = []
int2byte = six.int2byte
for id_ in ids:
if 0 <= id_ < numres:
decoded_ids.append(RESERVED_TOKENS_BYTES[int(id_)])
else:
decoded_ids.append(int2byte(id_ - numres))
if six.PY2:
return "".join(decoded_ids)
# Python3: join byte arrays and then decode string
return b"".join(decoded_ids).decode("utf-8", "replace")
def decode_list(self, ids):
numres = self._num_reserved_ids
decoded_ids = []
int2byte = six.int2byte
for id_ in ids:
if 0 <= id_ < numres:
decoded_ids.append(RESERVED_TOKENS_BYTES[int(id_)])
else:
decoded_ids.append(int2byte(id_ - numres))
# Python3: join byte arrays and then decode string
return decoded_ids
@property
def vocab_size(self):
return 2**8 + self._num_reserved_ids
class ByteTextEncoderWithEos(ByteTextEncoder):
"""Encodes each byte to an id and appends the EOS token."""
def encode(self, s):
return super(ByteTextEncoderWithEos, self).encode(s) + [EOS_ID]
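A small round-trip sketch for the byte-level encoders; the input string is arbitrary:

```python
enc = ByteTextEncoder()
ids = enc.encode("la")            # [111, 100]: each UTF-8 byte shifted by 3 reserved ids
print(ids, enc.decode(ids))       # -> [111, 100] la

eos_enc = ByteTextEncoderWithEos()
print(eos_enc.encode("la"))       # same ids with EOS_ID (1) appended
```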
class TokenTextEncoder(TextEncoder):
"""Encoder based on a user-supplied vocabulary (file or list)."""
def __init__(self,
vocab_filename,
reverse=False,
vocab_list=None,
replace_oov=None,
num_reserved_ids=NUM_RESERVED_TOKENS):
"""Initialize from a file or list, one token per line.
Handling of reserved tokens works as follows:
- When initializing from a list, we add reserved tokens to the vocab.
- When initializing from a file, we do not add reserved tokens to the vocab.
- When saving vocab files, we save reserved tokens to the file.
Args:
vocab_filename: If not None, the full filename to read vocab from. If this
is not None, then vocab_list should be None.
reverse: Boolean indicating if tokens should be reversed during encoding
and decoding.
vocab_list: If not None, a list of elements of the vocabulary. If this is
not None, then vocab_filename should be None.
replace_oov: If not None, every out-of-vocabulary token seen when
encoding will be replaced by this string (which must be in vocab).
num_reserved_ids: Number of IDs to save for reserved tokens like <EOS>.
"""
super(TokenTextEncoder, self).__init__(num_reserved_ids=num_reserved_ids)
self._reverse = reverse
self._replace_oov = replace_oov
if vocab_filename:
self._init_vocab_from_file(vocab_filename)
else:
assert vocab_list is not None
self._init_vocab_from_list(vocab_list)
self.pad_index = self._token_to_id[PAD]
self.eos_index = self._token_to_id[EOS]
self.unk_index = self._token_to_id[UNK]
self.seg_index = self._token_to_id[SEG] if SEG in self._token_to_id else self.eos_index
def encode(self, s):
"""Converts a space-separated string of tokens to a list of ids."""
sentence = s
tokens = sentence.strip().split()
if self._replace_oov is not None:
tokens = [t if t in self._token_to_id else self._replace_oov for t in tokens]
ret = [self._token_to_id[tok] for tok in tokens]
return ret[::-1] if self._reverse else ret
def decode(self, ids, strip_eos=False, strip_padding=False):
if strip_padding and self.pad() in list(ids):
pad_pos = list(ids).index(self.pad())
ids = ids[:pad_pos]
if strip_eos and self.eos() in list(ids):
eos_pos = list(ids).index(self.eos())
ids = ids[:eos_pos]
return " ".join(self.decode_list(ids))
def decode_list(self, ids):
seq = reversed(ids) if self._reverse else ids
return [self._safe_id_to_token(i) for i in seq]
@property
def vocab_size(self):
return len(self._id_to_token)
def __len__(self):
return self.vocab_size
def _safe_id_to_token(self, idx):
return self._id_to_token.get(idx, "ID_%d" % idx)
def _init_vocab_from_file(self, filename):
"""Load vocab from a file.
Args:
filename: The file to load vocabulary from.
"""
with open(filename) as f:
tokens = [token.strip() for token in f.readlines()]
def token_gen():
for token in tokens:
yield token
self._init_vocab(token_gen(), add_reserved_tokens=False)
def _init_vocab_from_list(self, vocab_list):
"""Initialize tokens from a list of tokens.
It is ok if reserved tokens appear in the vocab list. They will be
removed. The set of tokens in vocab_list should be unique.
Args:
vocab_list: A list of tokens.
"""
def token_gen():
for token in vocab_list:
if token not in RESERVED_TOKENS:
yield token
self._init_vocab(token_gen())
def _init_vocab(self, token_generator, add_reserved_tokens=True):
"""Initialize vocabulary with tokens from token_generator."""
self._id_to_token = {}
non_reserved_start_index = 0
if add_reserved_tokens:
self._id_to_token.update(enumerate(RESERVED_TOKENS))
non_reserved_start_index = len(RESERVED_TOKENS)
self._id_to_token.update(enumerate(token_generator, start=non_reserved_start_index))
# _token_to_id is the reverse of _id_to_token
self._token_to_id = dict((v, k) for k, v in six.iteritems(self._id_to_token))
def pad(self):
return self.pad_index
def eos(self):
return self.eos_index
def unk(self):
return self.unk_index
def seg(self):
return self.seg_index
def store_to_file(self, filename):
"""Write vocab file to disk.
Vocab files have one token per line. The file ends in a newline. Reserved
tokens are written to the vocab file as well.
Args:
filename: Full path of the file to store the vocab to.
"""
with open(filename, "w") as f:
for i in range(len(self._id_to_token)):
f.write(self._id_to_token[i] + "\n")
def sil_phonemes(self):
return [p for p in self._id_to_token.values() if not p[0].isalpha()]
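A usage sketch of `TokenTextEncoder` built from an in-memory vocabulary; the phoneme list is illustrative, not the module's actual dictionary:

```python
phones = ['AP', 'SP', 'x', 'iao', 'j', 'iu', 'w', 'o']
encoder = TokenTextEncoder(None, vocab_list=phones, replace_oov=UNK)

ids = encoder.encode('x iao j iu')
print(ids)                  # [5, 6, 7, 8]: reserved ids occupy 0-2, so 'x' maps to 5
print(encoder.decode(ids))  # -> 'x iao j iu'
```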